MS BI Guy Does Hadoop (Part 1 – Getting Started)

With Big Data and Hadoop getting a lot of attention nowadays, I’ve decided it’s time to take a look, so I am starting a log of my Hadoop adventures. I hope it’ll benefit Windows users, especially BI pros. If not, at least I’ll keep track of my experience so I can recreate it if needed. Before I start, for an excellent introduction to Hadoop from a Microsoft perspective, watch the talk Big Data – What is the Big Deal? by David DeWitt. Previously, I experimented and got my feet wet with Apache Hadoop-based Services for Windows Azure, which is the Microsoft implementation of Hadoop in the cloud, but I was thirsty for more and wanted to dig deeper. Microsoft is currently working on a CTP of Hadoop-based Services for Windows, which will provide a supported environment for installing and running Hadoop on Windows. While waiting, O’Reilly was kind enough to send me a review copy of Hadoop: The Definitive Guide, 3rd Edition, by Tom White. Since Hadoop is an open-source project, I had to rediscover and relearn something I thought I would never have to touch again after my university days – Unix, or to be more precise its CentOS Linux variant, which is installed on the Cloudera VM. So, part 1 is about setting up your environment.

From the book, I discovered that Cloudera provides a virtual machine for VirtualBox. I already had VirtualBox on my Windows 7 laptop so I could run SharePoint 2010 (available in x64 only). VirtualBox is a great piece of software that was originally developed by Sun Microsystems and is currently owned by Oracle. So, I decided to take the VM shortcut since I don’t have much time to mess around with Cygwin, Java, etc. After downloading and double-extracting the Cloudera file, I created a new VirtualBox machine and made the following changes.

[Screenshot: Create New Virtual Machine wizard settings]

On the next step, I increased the memory to 2 GB (recommended by Cloudera). In the Virtual Hard Disk step, I chose the “Use existing hard disk” option and pointed to the vmdk file I extracted from the Cloudera download. Then, in the Settings page for the new VM, I changed the storage to use the IDE controller instead of SATA, because Cloudera warns that the VM might have an issue with the SATA controller.


Once this was done, I was able to start the VM, which automatically logged me into CentOS as user cloudera. The first challenge I had to overcome was installing the VirtualBox Guest Additions for Linux so I could resize the window and move the mouse cursor in and out without having to hold the right Ctrl key. This turned out to be more difficult than expected. The final solution took the following steps:

  1. Once you’ve started the guest OS, in the VM menu toolbar click Install Guest Additions to mount the disk.
  2. Open the File Manager and navigate to the /etc/yum.repos.d folder. Right-click the folder and click Open Terminal Here.

    In the command window, type the following command to elevate your privileges:

    $ su

    Enter the password (cloudera) when prompted.

  3. Open the vi editor to edit Cloudera-cdh3.repo, as mentioned in the Cloudera VM demo notes, by typing this command (you’re still in the elevated shell from the previous step):

    $ vi Cloudera-cdh3.repo

  4. Change the baseurl line so that the file looks like this (the last line is the one you change):

    [Cloudera-cdh3]
    name=Cloudera’s Distribution for Hadoop, Version 3
    enabled=1
    gpgcheck=0
    baseurl=http://archive.cloudera.com/redhat/cdh/3u4/

  5. Press ESC to go to command mode and type :wq to save and exit vi.

    Tip: To edit files in a more civilized way, click the File Manager icon in the menu bar at the bottom of the shell. However, you won’t have access to save files. As a workaround, launch the File Manager with elevated permissions as follows:

    $ su -c Thunar

  6. Enter the following command to install a few utilities and the kernel development packages:

    $ yum install dkms binutils gcc make patch libgomp glibc-headers glibc-devel kernel-headers kernel-devel

  7. Then navigate to the media folder and run the Guest Additions file.
    $ cd /media
    $ cd VBOXADDITIONS_4.1.16_78094
    $ ./VBoxLinuxAdditions.run

    This should install the guest additions successfully. If you see any error messages, install the additional packages that yum requests and rerun the installer.

Next, you can verify the Hadoop installation by following the steps in the Starting Hadoop and Verifying it is Working Properly section of the Hadoop Quick Start Guide.
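
If you prefer a quick smoke test from the terminal, commands along the following lines exercise both HDFS and MapReduce. Treat this as a sketch: the path to the examples jar is an assumption and may differ on your CDH3 VM.

    $ hadoop fs -ls /                                # browse the root of HDFS
    $ hadoop fs -mkdir input                         # create a working folder in your HDFS home
    $ hadoop fs -put /etc/hadoop/conf/*.xml input    # upload a few sample XML files
    $ hadoop jar /usr/lib/hadoop/hadoop-examples.jar grep input output 'dfs[a-z.]+'
    $ hadoop fs -cat 'output/part-*'                 # inspect the MapReduce output

If the grep job completes and the output folder contains results, both the distributed file system and the MapReduce engine are functional.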

Is Excel Calculation Services Needed on a PowerPivot for SharePoint Application Server?

Scenario: A SharePoint 2010 scale-out farm with two application servers:

APP1 – Runs Excel Calculation Services (ECS)

APP2 – Runs PowerPivot for SharePoint (2008 R2 or 2012)

Do you need to start ECS on APP2 in order for PowerPivot for SharePoint to work?

Answer: You don’t, but if you don’t have the updated Analysis Services OLE DB provider on APP1, you’ll get the dreaded “Failed to create new connection…” error when you browse the PowerPivot workbook. SharePoint ships with the SQL Server 2008 ADO MD, MSOLAP, etc. stack, not the R2 or later components. Therefore, you must upgrade the OLE DB provider (2008 R2 or 2012, depending on the PowerPivot version) as explained in the How to: Install the Analysis Services OLE DB Provider on an Excel Services Computer topic in Books Online.

Thanks to Dave Wickert for clarifying this.

Organizational BI vs. Personal BI

This is a blog post that I’ve been planning for a while. With the ever-increasing power of laptops and desktop computers and declining hardware prices, personal BI is on the rise. A new technology usually has a self-propelled upward spiral, up to a point – vendors are talking about it to clients, executives are talking about it on golf courses, and consultancies are talking about it and rushing to fill the void. There is a lot of money to be made, along with a lot of misinformation and sometimes outright lies. I’ll be quick to point out that personal BI alone is not going to fix your BI and data challenges. However, it can complement organizational BI well and open possibilities when organizational BI alone is not enough. You might find the following information interesting when you’re contemplating which way to go.

Organizational BI

Organizational BI is a set of technologies and processes for implementing an end-to-end BI solution where the implementation effort is shifted to BI Pros. An example of a “classic” (and somewhat simplified) organizational BI solution follows.

[Diagram: a classic organizational BI solution]

PROS

  • Pervasive business intelligence – Available to all users across the enterprise, subject to security policies.
  • Single version of the truth with trusted data – Provides accurate and trusted analysis and reporting. Data is clean, validated, and secure.
  • Rich feature set – OLAP, data mining, KPIs, dashboards. For more information about using Analysis Services for organizational BI, read my blog Why an Analytical Layer?
  • Performance – High performance and scalability with massive data volumes

CONS

  • Effort – Significant development effort might be required
  • Skills – Specialized skills required (BI pros)
  • More rigid – Less flexible to react to new business requirements

 

Personal BI

Personal BI provides business users with tools for implementing ad hoc BI models with help, guidance, and supervision from IT (see below). In the Microsoft BI world, the tool for personal BI is PowerPivot, with its two flavors: PowerPivot for Excel and PowerPivot for SharePoint.

[Diagram: a personal BI solution]

PROS

  • Offloads effort from IT – Anyone can implement BI models if they have access to data. However, IT must still provide ongoing guidance and supervision, such as providing access to data, implementing more advanced business calculations, and monitoring the shared environment where the BI models are deployed. Therefore, I believe more in “managed” personal BI than in purely personal BI.
  • Knowledge domain expertise – Business users should know their domain better than IT.
  • Data mashups — Easy to mix data from different data sources.
  • Data exploration – Let business users explore data and tell IT what they really want before BI pros take over.

CONS

  • “Spreadmarts” – Proliferation of models. Which model do you trust?
  • Data integrity and validation issues – If users don’t import data that is already validated, such as data stored in the company’s data warehouse, the resulting reports probably cannot be trusted.
  • Power users – In reality, personal BI requires power users. In my experience, regular users don’t have the desire, skills, and time to create models. A case in point – a major organization decided to embrace a popular tool for personal BI but hired a consultancy to implement the reports! Have you heard from your users that they want operational reports, preferably delivered to them via subscriptions?
  • Security issues – Another burden on IT is to secure data and make sure that it is not compromised when a business user imports it and shares the model with other users.

So, each approach has pros and cons. Instead of exclusivity, consider using them together. For example, implement organizational BI for pervasive BI and single version of the truth, coupled with isolated scenarios for personal BI, such as when the data is not in the data warehouse or when users need to mash up data.

Why an Analytical Layer?

I had a presentation on the BI Semantic Layer and Tabular modeling for the Atlanta BI Group on Monday. Midway through the presentation, a DBA asked why we need an analytical layer on top of the data. I’m sure that those of you who are familiar with traditional reporting and haven’t yet discovered Analysis Services might have the same question, so let’s clarify.

  1. Semantic layer

    In general, semantics relates to discovering the meaning of the message behind the words. In the context of data and BI, semantics represents the user’s perspective of data: how the end user views the data to derive knowledge from it. As a modeler, your job is to translate the machine-friendly database structures and terminology into a user-friendly semantic layer that describes the business problems to be solved. In the world of Microsoft BI, this semantic layer is the Business Intelligence Semantic Model (BISM). The first chapter (you can download it from the book page) of my latest book “Applied Microsoft SQL Server 2012 Analysis Services (Tabular Modeling)” explains this in more detail.

  2. Reducing reporting effort

    Suppose that your boss comes one day and tells you that IT spends too much effort on creating operational reports. Instead, he wants to minimize cost and empower the business users to create their own reports. One of the nice features of Analysis Services is that the entity relationships become a part of the model. So, end users don’t have to know how to relate the Product to Sales entities. They just select which fields they want on the report and the model knows how to relate and aggregate data.

  3. Performance

    Analysis Services is designed to provide excellent performance when aggregating massive amounts of data. For example, in a real-life project we were able to deliver operational reports that aggregate a billion rows within milliseconds. Try to do that with relational reporting, especially when you need more involved calculations, such as YTD, QTD, parallel period, etc. (see the DAX sketch after this list). Having an analytical layer might also save you the millions of dollars you would otherwise spend on MPP systems to overcome (to a point) the performance limitations of relational reporting.

  4. Single version of the truth

    The unfortunate reality that we’re facing quite often is that many important business metrics end up being defined and redefined either in complex SQL code or reports. This presents maintenance, implementation, and testing challenges. Instead, you can encapsulate metrics where they belong – in your analytical model. As an added bonus, you will be able to use an expression language (MDX or DAX) that is specifically designed for business calculations. Moreover, the modeler can define key performance indicators (KPIs).

  5. Additional BI possibilities

    This goes hand in hand with point 2, but the point I want to emphasize here is that many reporting tools are designed to integrate with and support Analysis Services well. For example, Microsoft provides Excel on the desktop and the SharePoint-based Power View tool that allows business users to create their own reports. An analytical layer also opens additional possibilities, such as performance dashboards.

  6. Security

    How much time do you spend implementing custom security frameworks for authorizing users to access only the data they are allowed to see on reports? Moving to Analysis Services, you’ll find that the model can apply security on connect (see the row-filter example in the DAX sketch after this list). I wrote more about this in my article Protect UDM with Dimension Data Security.

  7. Isolation

    Because an analytical layer sits on top of the relational database, it provides a natural separation between reports and data. For example, assuming distributed deployment, a long-running ETL job in the database won’t impact the performance of the reports serviced by the analytical layer.
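
To make points 3 and 6 above more tangible, here is a minimal DAX sketch. The table and column names follow the Adventure Works sample and are assumptions for illustration, not definitions from an actual project.

    -- A year-to-date measure defined once in the model instead of being recoded in every report
    Sales YTD := TOTALYTD ( SUM ( 'Internet Sales'[Sales Amount] ), 'Date'[Date] )

    -- A row filter on the Sales Territory table in a Tabular role, applied automatically on connect
    = 'Sales Territory'[Sales Territory Country] = "United States"

Business users who connect through Excel or Power View simply see the Sales YTD measure on the field list and only the rows their role allows.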

Presenting at Atlanta BI Group

I’ll be presenting “The Analysis Services 2012 Tabular Model” for the Atlanta BI Group on Monday, May 21st. And, Darren Herbold will share some great insights harvested from a real-life project of how to use SSIS to integrate with Salesforce.com. I hope you can make the event, which will be sponsored by Prologika.

“SQL Server 2012 introduces the BI Semantic Model that gives BI pros two paths for implementing analytical layers: Multidimensional and Tabular. The Tabular model builds upon the xVelocity engine that was introduced in PowerPivot. Although not as feature-rich as Multidimensional, Tabular promotes rapid professional development and great out-of-box performance. This session introduces you to Tabular development and shares lessons learned. I’ll also discuss how Tabular and Multidimensional compare.”

Analysis Services Processing and CXPACKET Waits

Performance tuning – my favorite! This blog originated from a quest to reduce the processing time of an SSAS cube which loads some 2.5 billion rows and includes DISTINCT COUNT measure groups. The initial time to fully process the cube was about 50 minutes on a dedicated DELL PowerEdge R810 server with 256 GB RAM and two physical processors (32 cores total). Both the SSAS and database servers were underutilizing the CPU resources, with SSAS at about 60-70% CPU utilization and the database server at about 20-30%. What was the bottleneck?

Using the sys.dm_os_waiting_tasks DMV with a statement like the one below (you can also use the SQL Server Activity Monitor), we saw a high number of CXPACKET wait types.

SELECT
    dm_ws.wait_duration_ms,
    dm_ws.wait_type,
    dm_es.status,
    dm_t.TEXT,
    dm_qp.query_plan,
    dm_ws.session_ID,
    dm_es.cpu_time,
    dm_es.memory_usage,
    dm_es.logical_reads,
    dm_es.total_elapsed_time,
    dm_es.program_name,
    DB_NAME(dm_r.database_id) DatabaseName,
    -- Optional columns
    dm_ws.blocking_session_id,
    dm_r.wait_resource,
    dm_es.login_name,
    dm_r.command,
    dm_r.last_wait_type
FROM sys.dm_os_waiting_tasks dm_ws
INNER JOIN sys.dm_exec_requests dm_r ON dm_ws.session_id = dm_r.session_id
INNER JOIN sys.dm_exec_sessions dm_es ON dm_es.session_id = dm_r.session_id
CROSS APPLY sys.dm_exec_sql_text(dm_r.sql_handle) dm_t
CROSS APPLY sys.dm_exec_query_plan(dm_r.plan_handle) dm_qp
WHERE dm_es.is_user_process = 1
The typical advice given to address CXPACKET waits is to decrease the SQL Server parallelism by using the MAXDOP setting. This might help in some isolated scenarios, such as UPDATE or DELETE queries. However, SQL Sentry Plan Explorer showed that each processing query was highly parallelized to utilize all cores. Notice in the screenshot below that thread 16 fetches only 14,803 rows.

[Screenshot: SQL Sentry Plan Explorer showing the per-thread row counts of a processing query]

Therefore, the CXPACKET waits were simply caused by faster threads waiting for other threads to finish. In other words, the CXPACKET wait is just a coordination mechanism between the threads being parallelized. To confirm this, we set the SQL Server MAXDOP setting to 1. Sure enough, the CXPACKET waits disappeared, but the overall cube processing time went up as well. In our case, the biggest benefit was realized not by decreasing the SQL Server parallelism but by increasing it – specifically, by increasing the maximum number of database connections. This decreased the overall processing time by some 20%.
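
If you want to reproduce the MAXDOP experiment, a T-SQL sketch along these lines does it (remember to revert the value afterwards, since this is a server-wide setting):

    -- Expose the advanced options so MAXDOP is visible to sp_configure
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    -- Force single-threaded query plans server-wide (for testing only)
    EXEC sp_configure 'max degree of parallelism', 1;
    RECONFIGURE;

The database connection increase, by contrast, is configured on the SSAS side by raising the Maximum Number of Connections property of the cube’s data source.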


You need to be careful here, though. While increasing the connections to max out the CPU on the SSAS server will yield the biggest gain, it might also slow down other workloads, such as reports that query the cube while the database is being processed. So, as a rule of thumb, target no more than 80% CPU utilization to leave room for other tasks.

SQL Server 2012 ETL Framework Features in the New SSIS Project Deployment Model

When moving to SQL Server 2012, ETL developers need to rethink their custom ETL frameworks. If you embrace the new project deployment model, SQL Server 2012 offers a lot of plumbing features out of the box. Jen Underwood wrote a great SQL Server 2012 ETL Framework Features in the New SSIS Catalog article about what’s supported and what’s missing. I suspect that in most cases developers will find that they only need to implement the features that are not supported, such as restartability and execution control, on top of the Microsoft-provided toolset, while letting SSIS 2012 handle logging and performance reporting. Another great resource for getting started with the new project deployment model is the SSIS Logging in Denali blog post by Jamie Thomson.
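
As a taste of what the catalog gives you for free, a query along these lines against the built-in SSISDB views (a sketch, not part of Jen’s article) returns the most recent package executions and their outcome:

    SELECT TOP (20)
           e.execution_id,
           e.folder_name,
           e.project_name,
           e.package_name,
           e.status,        -- 1 = created, 2 = running, 4 = failed, 7 = succeeded
           e.start_time,
           e.end_time
    FROM SSISDB.catalog.executions AS e
    ORDER BY e.start_time DESC;

Drill into SSISDB.catalog.event_messages for the detailed log entries behind a given execution.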

We ran into an interesting snag when validating and executing packages using the new project deployment model with Windows integrated security:

  1. Developer initiates the validation and execution in SSMS from his machine.
  2. Packages are deployed to a dedicated SQL Server and use Windows integrated security to connect to the source database.
  3. The source SQL Server database is hosted on yet another server.

As you’ve probably guessed, this scenario requires configuring Kerberos delegation and registering SPNs for both the SQL Server hosting the SSIS packages and the SQL Server hosting the source database. That’s because validation and execution happen under your login, and a double hop is required to delegate your credentials to the source SQL Server. This can be avoided by using standard security (username and password) for the connections from your packages to the source SQL Server, but that wasn’t an option in our case because customer requirements dictated using Windows security for SQL Server connections.
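
For reference, registering the SPNs boils down to a couple of setspn commands run by a domain administrator, plus enabling delegation for the service accounts in Active Directory. The host and account names below are placeholders:

    :: SPN for the SQL Server instance hosting the SSIS packages
    setspn -S MSSQLSvc/ssisserver.corp.local:1433 CORP\sql_ssis_svc
    :: SPN for the SQL Server instance hosting the source database
    setspn -S MSSQLSvc/srcserver.corp.local:1433 CORP\sql_src_svc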

I personally believe that the new Integration Services enhancements alone warrant upgrading to SQL Server 2012.

Reducing Analysis Services Index Processing Time

Scenario: Processing a 100 GB SSAS 2012 cube on a dedicated DELL PowerEdge server (128 GB RAM and 32 logical cores) reveals that 50% of the time is spent on processing the cube indexes and building aggregations, with only 20% average CPU utilization and barely a dent in memory consumption. It was clear that SSAS wasn’t utilizing the server resources and that additional tuning was required to increase the parallelism of this phase.

Solution: Following Greg Galloway’s advice, we changed OLAP\Process\AggregationMemoryLimitMin to 2 and OLAP\Process\AggregationMemoryLimitMax to 3 to increase the parallelism of the index processing. This led to a 25% reduction in the overall processing time and increased the CPU utilization to 70-80%. As to how we arrived at these numbers, James Rogers has a great write-up. For the sake of preserving the precious formulas:

AggregationMemoryLimitMax=Ceiling(100/(<number processing cores>-2))

AggregationMemoryLimitMin=Ceiling(100/((<number processing cores>-2)*1.5))
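
As a quick sanity check of the formulas, a hypothetical server that dedicates 26 cores to processing would yield:

AggregationMemoryLimitMax = Ceiling(100/(26-2)) = Ceiling(4.17) = 5

AggregationMemoryLimitMin = Ceiling(100/((26-2)*1.5)) = Ceiling(2.78) = 3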

In case you’re curious about what these settings do: the amount of memory needed to build an aggregation is unknown up front. The engine has to estimate it based on certain factors (estimated rows, granularity attributes, measures, etc.). It uses the AggregationMemoryLimitMin value to ensure that if its estimate is too low, there will still be a sufficient buffer, and if the estimate errs in the upward direction, it uses the AggregationMemoryLimitMax value to trim the maximum downward.

Presenting at SQL Saturday

I have a talk “The Personal-Team-Organizational BI Conundrum” at SQL Saturday on April 14th at 9 AM.

Confused about the BI alphabet soup? Not sure which approach makes sense for your organization? Join this session to learn how you can use the Microsoft BI platform to address various analytical needs. Discover how to implement the Personal-Team-Organizational continuum on a single platform. I’ll also discuss how Tabular and Multidimensional compare.

I hope you can make it. SQL Saturday is always a good show and this year we have great speakers and great BI content.

Report Builder Connectivity Issues

Issue: A Report Builder report that connects to Analysis Services previews fine in Report Designer, but it fails in Report Builder 3.0 with the following error:

An existing connection was forcibly closed by the remote host

----------------------------

Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.

----------------------------

Further, SQL Server Profiler reveals that an Anonymous connection is attempted to Analysis Services even though you use an embedded report data source (with a shared data source on the server, the report runs on the server, while with an embedded data source the report runs on the client).

Resolution: Check if Report Builder is connected to a report server running in SharePoint mode by examining the status bar. If it is, click the Disconnect link. Report preview should now work.
