Estimating Data Compression

Now that the age of in-memory databases has dawned on us, I’ve been experimenting with different aspects of VertiPaq, and I find its technology fascinating. It’s safe to say that VertiPaq will propel Microsoft Business Intelligence for years to come in all three of its flavors: PowerPivot (self-service BI), the Business Intelligence Semantic Model (corporate BI), and data warehousing (columnstore indexes). Unlike Analysis Services MOLAP, which compresses data by rows, VertiPaq compresses data by columns.

Column-based compression fits business intelligence like a glove because data is typically analyzed by columns. And the lower the data cardinality (that is, the more repeating values a column has), the higher its compression rate will be. Since most columns used for analytics fit this category, you could expect 10-15 times data compression. Consequently, a half-terabyte database could potentially fit into 80 GB of memory – more than the 33-50 GB that raw 10-15x compression would suggest, because when data is brought from disk into memory, additional space is needed to accommodate index tables and other structures. So, how is this not cool? Moving to BISM, all of your data could fit into memory, so you could get a head start in performance just by having a superfast storage medium!

Continuing this line of thought, I decided to take the Contoso Retail DW database and a real-life data warehouse for a spin to see how VertiPaq compresses them. I used the excellent PowerShell script that Vidas Matelis developed. Since fact tables typically account for about 95% of a data warehouse’s storage space, I focused on a single fact table from each database. To measure the VertiPaq compression, I loaded the fact tables into PowerPivot and ran the script to obtain their storage space.

                          MOLAP     VertiPaq
Contoso (FactSales)       129 MB    35 MB
DW (20 mil fact table)    810 MB    792 MB

As you can see, the Contoso VertiPaq results are very impressive. But what happened with the real-life DW? How could VertiPaq be only marginally better than MOLAP? To analyze this further, I used the script’s ability to produce a column-by-column breakdown and noticed that the following two columns occupy the most space:

TRANS_DTLS_KEY: 400 MB (primary key)

LOG_ID: 268 MB (transaction identifier)

Both are high-cardinality columns. I used them in a cube drillthrough action to get the transaction details behind a cell. Subsequently, I added them to PowerPivot so I could compare the same data structures. Currently, PowerPivot doesn’t support drillthrough, so adding these columns to PowerPivot is probably not needed in real life. It will be interesting to see how BISM will implement drillthrough. One thing is certain: it makes sense to leave high-cardinality attributes out of VertiPaq and possibly use report actions (if supported) to report on details.

PowerPivot Stories from the Trenches

Now that a blizzard has paralyzed Atlanta for a week, what better way to start the new year than by sharing a PowerPivot success story. A banking institution approached Prologika to help it implement a solution for reporting a customer’s credit history so the bank can evaluate the risk of granting the customer a loan. The high-level requirements called for:

  • Flexible searching and filtering to let the bank user find a particular customer or search for the customer’s accounts – both those owned by the bank and those reported externally by other banks.
  • A flexible report layout that lets the bank user add or remove fields.
  • The ability to download the report locally so the bank user can run it when there is no connectivity.
  • Refreshing the credit history on a schedule.

Initially, the bank was gravitating toward a home-grown solution that would require tons of custom code to implement a desktop client using a third-party control or ReportViewer for reporting. One cool thing about Microsoft BI, and my favorite tip to share at any public event, is that it reduces or eliminates custom code. This may sound strange coming from someone who has spent most of his career writing code, but less code is indeed in the customer’s best interest. So, we suggested a solution based on PowerPivot with the following implementation highlights:

  • Implement a PowerPivot application and import the customer and account information.
  • Allow the end user to use slicers and PivotTable filters to search and filter the data.
  • Upload the PowerPivot application to SharePoint to let each bank user view and interact with the report without requiring Excel 2010 installed locally.
  • Allow the user to download the application for offline reporting. This requires Excel 2010 with PowerPivot installed locally.

Here is a screenshot of the PowerPivot application rendered in SharePoint (sensitive information replaced to protect the innocent) that demonstrates some of the filtering capabilities built into Excel Services:

[Screenshot: the PowerPivot report rendered in SharePoint, showing the Excel Services filtering capabilities]

Needless to say, this PowerPivot-based solution saved our customer a lot of time and money. True, we had to make some tradeoffs in terms of functionality and flexibility. One interesting implementation detail was that the requirements called for detail-level reporting at the account level rather than aggregated data. Our initial thought was that the in-memory VertiPaq database could handle this very efficiently. However, we ran into performance issues caused by the fact that Excel generates MDX queries against VertiPaq, and MDX is not optimized for detail-level reporting, especially when cross-joining multiple dimensions. To mitigate this issue, we took advantage of the Analysis Services Autoexists behavior, which cross-joins attributes of the same dimension very efficiently. To do so, we consolidated the Customer and Account information into the Accounts table and used the Customer table only for the Excel slicers and filtering by customer.
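
To give an idea of the consolidation, here is a minimal sketch with hypothetical table and column names (Accounts, Customer, CustomerName, CustomerSegment – not the bank’s actual schema). In PowerPivot, you add calculated columns to the Accounts table that pull the customer attributes across the Accounts-Customer relationship:

=RELATED(Customer[CustomerName])
=RELATED(Customer[CustomerSegment])

With the customer attributes materialized as columns of the Accounts table, the PivotTable fields come from a single table, so Excel cross-joins them through Autoexists, while the standalone Customer table keeps serving the slicers.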

Looking ahead, it would be nice if a future release of PowerPivot detects that it is dealing with VertiPaq and generates native DAX queries instead of MDX to take full advantage of the engine. When SQL 11 ships, Crescent will be the first and only Microsoft client that generates native DAX queries, so detail-level reporting will likely be faster with Crescent.

PowerPivot Time Calculations

A recommended practice for implementing time calculations in PowerPivot is to have a Date table with a datetime column. Kasper de Jonge explains this in more detail in his blog. This approach will probably save effort when importing data from normalized schemas and won’t require specifying additional arguments to the PowerPivot time functions. However, it will undoubtedly present an issue when importing data from a star schema. A dimensional modeling best practice is to give the Date dimension table an integer key in the format YYYYMMDD and to use integer foreign keys in the fact tables. Luckily, after the issue with the All filter that Kasper reported a while back got fixed in the PowerPivot RTM, you don’t have to convert the integer keys back to datetime when building a PowerPivot model on top of a star schema.

Let’s consider the AdventureWorksDW schema. Its DimDate table has an integer key (DateKey). Let’s say you import this table into PowerPivot and name it Date. This is what the resulting PowerPivot table may look like. The fact tables join the Date table on the DateKey field, which is an integer.

[Screenshot: the Date table imported in PowerPivot]
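
By the way, if your Date table happens to come without a datetime column at all, you can derive one from the YYYYMMDD key with a calculated column along these lines (a sketch that assumes the integer key column is named DateKey):

=DATE(INT([DateKey]/10000), INT(MOD([DateKey], 10000)/100), MOD([DateKey], 100))

The DATE, INT, and MOD functions split the key into its year, month, and day parts and return a datetime value that the time intelligence functions can work with.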

Now you want to calculate YTD values. Since we don’t have scope assignments in PowerPivot, you need to implement a separate time calculation for each field that needs YTD. For the SalesAmount field, you can use the following DAX expression:

=TotalYTD(Sum([SalesAmount]), 'Date'[Date], All('Date'))

The TotalYTD function is one of the PowerPivot time intelligence functions. The second argument references the Date column in the Date table, which must be of the datetime type. Finally, the third argument overrides the evaluation context by telling PowerPivot to evaluate the expression irrespective of the date selection, that is, across all dates. A nice enhancement for a future PowerPivot (aka BISM) release would be some sort of UI that lets the modeler flag which column in the Date table is the datetime column in order to avoid the All argument.
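
The same pattern extends to the other time intelligence functions; you simply create one calculation per field and per period. For example, sticking with the same Date table and the SalesAmount field, QTD and a prior-year comparison could look like this (a sketch, not tested against your model):

=TotalQTD(Sum([SalesAmount]), 'Date'[Date], All('Date'))
=Calculate(Sum([SalesAmount]), SamePeriodLastYear('Date'[Date]), All('Date'))

The All('Date') argument plays the same role in both expressions – it clears the date selection so the shifted period can be evaluated across the entire Date table.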

Report Actions for Report Server in SharePoint Mode

Report actions are an Analysis Services extensibility mechanism that lets end users run a Reporting Services report as they browse the cube. A customer reported that they had trouble setting up a Reporting Services action for a report server configured in SharePoint integrated mode – a scenario that, it appears, wasn’t tested properly by Microsoft.

First, the customer wanted to pass multiple parameters to the report. This was achieved by setting the report action’s Target Type to Cells and its Target Object to All Cells. This setup allows the user to right-click any cell in the cube to initiate the action. What’s more important is that an action-capable browser, such as Excel, will be able to collect the coordinates of all dimensions used in the browser, such as those added to the filter area, so you can pass them as parameters to the report.

[Screenshot: the report action definition with Target Type set to Cells and Target Object set to All Cells]

A condition is further specified to restrict the action. Because the cube browser will be passing three parameters to the report, the condition enables the action only if the current member of each of the corresponding attribute hierarchies is not the All member.

This is where things get interesting. You need to specify the server name and report path. Instead of pointing to the report server, you should point to the SharePoint library using the syntax that Prash Shirolkar from the Reporting Services team explains in his blog. I haven’t tested this, but I’d expect that if you reference the report server, the report will use the report definition copy on the report server and not the one stored in SharePoint. Recall that the latter is the master copy of the report definition, and the copy on the report server might be outdated. However, when you reference the SharePoint library, you’ll find that Analysis Services doesn’t prefix the parameters with rp:, which in turn makes the report server ignore the parameters. The ultimate hack is to prefix the parameters yourself:

[Screenshot: the report action Server name and Report Path settings, with the parameters prefixed with rp:]

Here, the server name is set to the SharePoint site URL. I’ve found that it actually doesn’t matter where you split the URL between the server name and the report path. SSAS will concatenate them, so I could have set the server name to elite only and moved the rest of the URL to the Report Path line. Also, notice that you don’t use http: as in http://elite/sites/reports. SSAS will add the protocol for you.

Dundas Dashboard and PerformancePoint Comparison Review

Digital dashboards, also known as enterprise dashboards or executive dashboards, are rapidly rising in popularity as the presentation layer for business intelligence. Chances are that you’ve been asked to implement a dashboard to let management quickly ascertain the status (or “health”) of an organization via key performance indicators (KPIs), and this task might seem daunting. This is where Dundas Dashboard can help.

As a leader in data visualization, Dundas has given us great products that power many business intelligence solutions, including Microsoft Reporting Services and .NET charting. Its latest offering, Dundas Dashboard 2.5, lets you implement compelling dashboards quickly and easily. Since my career focus has been Microsoft Business Intelligence, I was curious to evaluate the capabilities of Dundas Dashboard and compare them with Microsoft PerformancePoint 2010.

Read the full review here.

Denali Forums

Microsoft launched the SQL Server 11 (Denali) pre-release forums and is eagerly awaiting your feedback. Judging by the number of BI-related forums, one can easily see that BI will be a big part of Denali, with major enhancements across the entire BI stack. Of course, there aren’t many questions to ask when we don’t yet have the bits to play with (CTP1 doesn’t include the BI features), but you can still start probing Microsoft.

Book Review – Microsoft PowerPivot for Excel 2010

I dare to predict that a few years after SQL 11 ships, there will be two kinds of BI professionals – those who know the Business Intelligence Semantic Model and those who will learn it soon. By the way, the same applies to SharePoint. What can you do to start on that path and prepare while waiting for BISM? Learn PowerPivot, of course, which is one of the three technologies powered by VertiPaq – the new column-oriented in-memory store. This is where the book Microsoft PowerPivot for Excel 2010 can help. It’s written by Marco Russo and Alberto Ferrari, whose names should be familiar to those of you who have been following Microsoft BI for a while. Both authors are respected experts who have contributed a lot to the community. Based in Italy, they run the SQLBI website and share their knowledge via their blog and publications.

This is the second book that I’ve read about PowerPivot – after Professional Microsoft PowerPivot for Excel and SharePoint, which I reviewed in this blog. What I liked about this book is its deep coverage of Data Analysis Expressions (DAX). I specifically enjoyed the following chapters:

Chapter 6: Evaluation Context and CALCULATE – Provides deep coverage of how DAX measures work. Although DAX is meant to be simpler than MDX, expressions can get complex, and this chapter will help you understand how DAX works under the hood.

Chapter 7: Date Calculations in DAX – Time calculations, such as YTD and QTD, are an important requirement for most BI projects. This chapter goes into detail to explain how to implement them and provides workarounds for PowerPivot limitations.

Chapter 9: PowerPivot DAX Patterns – If you wonder whether PowerPivot can do this or that, read this chapter. It demonstrates advanced concepts ranging from ratios and percent of total to standard deviation, ranking over measures, Pareto computations, and more.

Chapter 10: PowerPivot Data Model Patterns – Another gem for addressing popular BI needs, such as banding, courier simulation, and many-to-many relationships.

Although not big in size (370 pages), the book is rich in patterns and solutions. What impressed me is that the authors put great effort into covering not only the PowerPivot basics but also leaving no stone unturned where the tool lacks features. The authors discuss a requirement, approach it from different angles, and provide several implementation approaches. Thus, this book will benefit both beginners and advanced users. An indispensable resource for learning PowerPivot and getting a head start on BISM!

Prologika Training Classes Dec 2010-Jan 2011

Our online Microsoft BI classes for December 2010 and January 2011:

Class                 Mentor       Date                          Price
Applied SSRS 2008     Teo Lachev   12/14-12/16 12:00-5:00 EDT    $799    Register
Applied SSAS 2008     Teo Lachev   1/11-1/13 12:00-5:00 EDT      $799    Register
Applied PowerPivot    Teo Lachev   1/25-1/26 12:00-4:00 EDT      $599    Register

Visit our training page to register and for more details.

VertiPaq Column Store

In SQL 11, the VertiPaq column store that will power the new Business Intelligence Semantic Model (BISM) will be delivered in three ways:

1. PowerPivot – in-process DLL with Excel 2010.

2. A second storage mode of Analysis Services that you will get by installing SSAS in VertiPaq mode.

3. A new columnstore index in the SQL RDBMS.

The third option piqued my interest. I wanted to know whether a custom application would be able to take advantage of these indexes outside VertiPaq. For example, this could be useful for standard reports that query the database directly. It turns out that this will be possible. The following paper discusses the columnstore in more detail. In a nutshell, SQL 11 will run the VertiPaq engine in-process. Here are some highlights of the VertiPaq columnstore:

  • You create an index on a table.
  • You cannot update the table after the index is created. This isn’t a big limitation, since a data warehouse is typically loaded on a schedule, so the index can be dropped and rebuilt at the end of the ETL process.
  • The relational engine scans the entire table and loads it into memory in a VertiPaq store. As you can imagine, the more RAM, the better.
  • Query performance will be great, but probably slower than VertiPaq in SSAS.
  • Creating this index isn’t cheap, but the query performance gains should justify using it in common DW scenarios.
  • From a performance standpoint, it will be best to load data into VertiPaq (the SSAS store) instead of using the column store.

Business Intelligence Semantic Model – The Good, The Bad, and the Ugly

UPDATE 11/14/2010

This will probably be the most updated post I’ve ever written. I have to admit that my initial reaction to BISM was negative. For the most part, this was a result of my disappointment that Microsoft switched focus from the UDM to BISM in Denali, and of my limited knowledge of the BISM vision. After SQL PASS, I exchanged plenty of e-mails, and Microsoft was patient enough to address them and disclose more details. Their answers helped me “get it” and see the BISM big picture through a more optimistic lens.

 

UPDATE 05/21/2011

Having heard the feedback from the community, Microsoft announced at TechEd 2011 that Crescent will support OLAP cubes as data sources. This warrants removing the Ugly part from this blog.

 

After the WOW announcement at SQL PASS about PowerPivot going corporate under a new name, Business Intelligence Semantic Model (BISM), there were a lot of questions from the SSAS community. Your humble correspondent did his own share of nagging. For some obscure reason, this time Microsoft decided to keep MVPs and community leaders in the dark until PASS, so I was as unprepared as the rest of the community for what was to come. Prior to PASS, the SSAS team told me that corporate BI would be a major focus in Denali, which I interpreted as enhancements to the UDM, only to find out that the entire buzz was about BISM… a thunderbolt from a clear sky. To its credit, the SSAS team was quick in its attempt to contain the collateral damage, as we can see in the posts below from Amir Netz and T.K. Anand.

http://cwebbbi.wordpress.com/2010/11/11/pass-summit-day-2/#comment-1498

http://blogs.technet.com/b/dataplatforminsider/archive/2010/11/12/analysis-services-roadmap-for-sql-server-denali-and-beyond.aspx

I decided to put down the gist of what I’ve learned about BISM so I have a ready answer for the questions that I am sure I will be asked over and over. Let’s start with the WHY.

Why BISM?

While SSAS is now an undisputed leader in the OLAP space and the BI platform of choice, Microsoft is facing competitive pressures. As you know, new vendors and technologies are popping up like daisies in the fast-changing BI landscape and attacking on all fronts – QlikView, Tableau, and the BusinessObjects Universe, to name a few – and Microsoft cannot rest on its laurels. The main selling point of these companies is that OLAP is too complex. I’ve seen their marketing campaigns in action and I know how persuasive they can be. One interesting thing I’ve noticed, though, is that the sleek demos I’ve seen all used star schemas, where data is neatly organized and stored in a set of dimension and fact tables. Creating this schema and populating it usually takes about 60-80% of the development effort of implementing a BI solution.

So, if the star schema is not obsolete, what’s so appealing then in the offerings of these vendors? The notion that the OLAP layer is too complex and you don’t need it. This is a clever strategy (and one of the laws of power) – if you can’t have it (the OLAP market share), then despise it :) Instead, these companies build their analytical layer as a thin wrapper directly on top of the star schema, which is exactly what PowerPivot does. This shields IT from having to know all those intimidating OLAP concepts, such as cubes, dimensions, measures, and the most complex of them all – MDX. This is a perceived simplification, though, because it’s unlikely to meet more demanding requirements. For example, I wouldn’t have been successful implementing BI financial solutions if I didn’t have UDM features such as scope assignments, parent-child dimensions, and many-to-many relationships.

While I don’t personally believe that the UDM (the SSAS implementation of the OLAP layer) is too complex, I do think that it carries a lot of baggage, partially due to its deep roots in traditional OLAP and its evolution over the years. Let’s name a few items:

  1. Cubes – The cube is a core OLAP concept that became somewhat outdated after the move to the UDM, which is an attribute-based model. Ever wondered whether you should go for one cube or multiple cubes? This is one of the main dilemmas when starting a new project.
  2. Detail-level reporting – This is where the UDM over-promised but under-delivered. For example, the UDM doesn’t support text-based measures. The clumsy workarounds are actions or degenerate dimensions, with the latter being what “classic” OLAP recommends.
  3. Storage complexities – the need to process the cube, partitioning, aggregations, etc. Moving to BISM, you still need to load the data into VertiPaq (unless you go real-time), so there is still “processing” involved, but there is no need for aggregations or “index building” anymore. Partitions are still there – but they are needed for data management purposes (incremental updates, rolling off data, etc.) and not for performance reasons.
  4. Data latency – the UDM proactive caching attempts to address it but it’s far from ideal.

Again, I don’t list complexity because I don’t think the UDM is very complex. Or, if it is, that’s because business requirements are, and I sure don’t want BISM to have fewer features that would prevent us from meeting such requirements. So, the SSAS team had a dilemma: should they continue enhancing the UDM or go back to the drawing board? Microsoft realized that the UDM’s internal architecture and storage model have become increasingly complex over the years, barring major changes and growth. Instead of continuing to build on the old foundation, they decided to start from scratch with the VertiPaq column-oriented store that debuted in PowerPivot. I have to give credit to the SSAS team for making such a bold and gut-wrenching decision, which I am sure wasn’t made lightly.

The Good

BISM is essentially a fresh start for Microsoft in the BI space. It will attempt to simplify OLAP and make BI even more accessible. As many BI professionals agree, based on what we know and have seen in PowerPivot, BISM has huge potential and will bring welcome enhancements, such as:

  1. Schema simplification – no need to define explicit cubes, dimensions, and measures, which eliminates the perceived complexity of implementing an OLAP solution. In fact, expect the term “OLAP” to be heavily deemphasized in the years to come.
  2. Blazing performance – this will probably make partitions and aggregations unnecessary for performance purposes.
  3. Flexibility – there will be no distinction between measures and dimensions. Every attribute can be used for aggregating and slicing. This is the PowerPivot feature I like the most.
  4. Detail-level reporting without hacks.
  5. DAX – an expression-based language that removes some of the complexity of MDX.
  6. Possibility of real-time data access – BISM running in VertiPaq mode will deliver the best performance, similar to MOLAP. However, data latency will also improve if you work in real time (like ROLAP) against a SQL database that has columnstore indexes, as long as you don’t query massive data volumes. It should be fast, but still slower than working directly against the BISM VertiPaq store.

The Bad

In Denali, BISM will not be as feature-rich as the UDM, and it probably won’t be for years to come. The lack of more advanced features (we don’t have a feature comparison list yet) will probably make organizations and consultants favor the UDM in the interim, which is still the “bread and butter” of MS OLAP. I personally won’t go for a new model knowing that its limitations can bite me in the long run, or for a model that puts me back a decade, no matter how simple and fast it is. As you know, all projects start simple until a requirement comes along that changes everything.

So, why didn’t Microsoft leave BISM in the self-service BI space to marinate for a while as PowerPivot?

  1. BISM lets Microsoft level the playing field against the various enterprise BI semantic models.
  2. It removes the Excel hosting limitations, such as the 2 GB size limit of an Excel workbook. In reality, 2 GB translates to many more gigabytes of uncompressed data, but that is still not enough for corporate data volumes.
  3. Microsoft can now position BISM as a BI platform for both reporting and analytics without abandoning relational concepts in favor of “weird” OLAP terminology. This is something that mainstream IT will find very attractive.
  4. BISM will provide a continuum from self-service BI to corporate BI on a single platform. You can start with self-service BI in PowerPivot and migrate it to a corporate solution.

Unfortunately, since BISM will not get all UDM features, at least not in Denali, in the interim we will have to choose between the UDM and BISM, which essentially will be a compromise between features and ease of use. In time, as BISM catches up with the UDM, I’d expect the choice to become easier. Meanwhile, BISM will not be as advanced as the UDM, and for more demanding requirements the UDM is your only choice.

That said, even in its first release, you may find BISM much more advanced and powerful than competing technologies, such as the BusinessObjects Universe and similar metadata storage models. So, when evaluating BISM’s capabilities, it is important to pay attention to your reference point. The UDM is a very high bar. However, if you compare BISM against the Universe, QlikView, and the like, you will probably find it superior.

The Ugly

Crescent (the new SSRS ad hoc reporting tool) will support BISM only. Frankly, I don’t know what the thinking was here, especially given that the UDM is still the “bread and butter” of MS OLAP (see the links above). The message that everyone will get is that the UDM has fallen out of favor, which is apparently not the case. I strongly encourage Microsoft to support the UDM in Denali, even if it doesn’t make it into the box (not a big surprise there – Report Builder 2.0 did it and hell didn’t freeze over). I think Crescent has a lot of promise, and since we don’t have a WebI-style ad hoc browser, it is long overdue on the BI wish list. There are probably good technical reasons for supporting BISM only, but if Excel can do MDX against PowerPivot, so should Crescent. This would also reinforce the message that the UDM is still the premium model for years to come while BISM is in the works.

So?

BISM has a bright future and will be a winner if it delivers on its promise without throwing the baby out with the bathwater – that is, if it:

  1. Simplifies OLAP
  2. Preserves UDM flexibility and features we have today, especially the ones that came with SSAS 2005, such as flexible dimension relationships and scope assignments. 
  3. Supports both DAX and MDX as query and expression languages to preserve the investment that developers and vendors have made throughout all these years.

Do I feel that my consulting business will be negatively affected by BISM should it become so user-friendly that even my mom can use it? I don’t think so. Here is where the C++/C# analogy could be helpful. The C++ developers who took the time to learn C#, which for them was a walk in the park, found that it opened new opportunities. And again, no matter how simple the tool, the requirements are what make things complex, and I don’t think BISM will read minds any time soon. Sure, sometimes an old dog must learn new tricks, but that’s why we love it…

May you live in interesting times!
A Chinese proverb