Data Warehouse Physical Design Best Practices Session by Carlos Rodrigues at Atlanta BI Tonight

It looks like Carlos will deliver another slam dunk tonight, with a record number of some 63 people registered and counting… Join our Atlanta BI meeting at 6:30 to see his Data Warehouse Physical Design Best Practices presentation.

Query Options for Analytical Models and Transactional Reporting

Requesting both historical and transactional reports is a very common requirement. As they stand, neither Multidimensional nor Tabular is designed to support transactional (detail-level) reporting well with large datasets, as I discussed in my Transactional Reporting with Tabular blog post. There are two query options that might provide some relief for transactional reporting, and both have limitations.

DRILLTHROUGH QUERIES

You can use a DRILLTHROUGH query to request data at the lowest level. The syntax differs somewhat between Multidimensional and Tabular. Here is a sample drillthrough query against the Adventure Works cube:

DRILLTHROUGH MAXROWS 1000

SELECT FROM [Adventure Works]

WHERE ([Measures].[Reseller Sales Amount],[Product].[Category].[Accessories],[Date].[Calendar Year].&[2006])

RETURN

[$Product].[Product],

[$Date].[Date],

[Reseller Sales].[Reseller Sales Amount],

[Reseller Sales].[Reseller Tax Amount] 

And a similar query against the Adventure Works Tabular model:

DRILLTHROUGH MAXROWS 1000

SELECT FROM [Model]

WHERE ([Measures].[Reseller Total Sales],[Product].[Category].[Category].&[Accessories],[Date].[Calendar].[Year].&[2005])

RETURN

[$Product].[Product Name],

[$Date].[Date],

[$Reseller Sales].[Sales Amount],

[$Reseller Sales].[Tax Amount] 

Notice that the WHERE clause specifies the coordinate that is drilled through. The optional RETURN clause enumerates the columns you need in the output, with dimensions prefixed with a dollar sign. Since Tabular doesn’t have measure groups, you treat all tables as dimensions.

Drillthrough queries are very fast. They honor data security (both dimension data security in Multidimensional and Row Filters in Tabular). Members outside the allowed set simply don’t exist as far as DRILLTHROUGH is concerned. For example, attempting to request a dimension member outside the allowed set returns the following error:

Drillthrough failed because the coordinate identified by the SELECT clause is out of range.

This is the same error you will get if you request a non-existent or incorrect dimension member.

DRILLTHROUGH queries have limitations:

  1. You can’t request measures from different measure groups in Multidimensional, such as Internet Sales and Reseller Sales. If you attempt to do so, you’ll get the following error:
    Errors from the SQL query module: Drillthrough columns must belong to exactly one measure group. The columns requested by the drillthrough statement are referencing multiple measure groups.
  2. The WHERE clause must resolve to a single coordinate. In other words, you can’t specify a set, such as {[Product].[Category].[Accessories],[Product].[Category].[Bikes]}. You can avoid the error by using SUBSELECT instead of WHERE, but you’ll find that the SUBSELECT clause doesn’t set the default member and you’ll get all products. This limitation will prevent you from using multi-select parameters in SSRS.
  3. Related to the previous item, you can’t specify FILTER conditions either, e.g. restricting the transaction date to a range of dates. Again, you can only use a tuple in the WHERE clause.
  4. You must use the Analysis Services OLE DB provider in SSRS. If you need to parameterize the query, you need to use an expression-based query statement.

SQL QUERIES

You can also use SQL-like queries as Chris Webb explains in the “Can I run SQL against an Analysis Services cube?” blog. Here is a SQL query against the Adventure Works cube:

select

[Adventure Works].[$Customer].[City],

[Adventure Works].[Internet Sales].[Internet Sales Amount]

from [Adventure Works].[Internet Sales]

natural join [Adventure Works].[$Date]

natural join [Adventure Works].[$Customer]

where [Adventure Works].[$Date].[Calendar Year] = 'CY 2008'

SQL queries have limitations as well:

  1. You must specify SQLQueryMode=DataKeys in the connection string if you connect to Tabular.
  2. SQL queries don’t work with data security (row filters) in Tabular but they appear to work with dimension data security in Multidimensional. Attempting to execute a SQL query against a Tabular model with row filters will throw the following exception:
    Tabular queries are not supported with row level security.
  3. Unlike T-SQL, SSAS SQL queries support only the equals operator; anything else produces the following error. This limitation precludes more complicated filtering, such as returning only transactions within a given date range.
    Errors from the SQL query module: There is an unexpected operator in the WHERE clause. Only the Equals operator ( = ) is supported.
  4. You must use the Analysis Services OLE DB provider in SSRS. Unlike DRILLTHROUGH, you can use AND in the WHERE clause to support multi-valued parameters. However, since the Analysis Services OLE DB provider doesn’t support parameters, you must resort to using an expression-based query statement.
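
For example, building on the query above and assuming the cube exposes the Product dimension the same way (the member values are illustrative), you can combine equality predicates with AND:

select
[Adventure Works].[$Customer].[City],
[Adventure Works].[Internet Sales].[Internet Sales Amount]
from [Adventure Works].[Internet Sales]
natural join [Adventure Works].[$Date]
natural join [Adventure Works].[$Customer]
natural join [Adventure Works].[$Product]
where [Adventure Works].[$Date].[Calendar Year] = 'CY 2008'
and [Adventure Works].[$Product].[Category] = 'Bikes'

An expression-based SSRS query statement can concatenate parameter values into predicates like these.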

Feel free to join me in escalating this issue to Microsoft because historical and transactional reporting are both very common scenarios. Thanks to its in-memory nature and more relaxed schema requirements, Tabular is best positioned to support both in the long run, so I hope that Microsoft will make improvements in this area. For now, when SSAS doesn’t give you the performance you need, consider delegating transactional reporting to the Database Engine.
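
For instance, a date-range filter that neither DRILLTHROUGH nor SSAS SQL queries can express is a trivial T-SQL statement against the underlying data warehouse (the table and column names below are hypothetical):

select f.OrderDate, p.ProductName, f.SalesAmount, f.TaxAmount
from dbo.FactResellerSales f
join dbo.DimProduct p on p.ProductKey = f.ProductKey
where f.OrderDate >= '20060101' and f.OrderDate < '20070101';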

Transactional Reporting with Tabular

Scenario: We had a requirement to replace the existing implementation of transactional reporting over large data volumes. By “transactional reporting” I mean allowing the user to query individual transactions as they’re stored in the fact table. In some cases, this style of reporting requires simply reading data without aggregations. This, of course, is what an RDBMS is designed for, but in our case the requirements call for extending the reporting model with metadata about the item, such as the item barcode, alias, etc., as well as supporting fast aggregation analysis (for trend reports) and data security. In general, when implementing BI solutions, you should always have an analytical layer between the database and the presentation layer for the reasons I discussed in the “Why an Analytical Layer?” blog.

Challenges: Because of the in-memory nature of Tabular and its more relaxed schema requirements, such as no separation of dimension and fact tables and the ability to keep text-based columns in fact tables, we opted for Tabular on top of a data warehouse and star schema. Our initial schema included an Item dimension table (700 million rows) where the item metadata is kept and a Transactions table (1.5 billion rows) that stores the item transactions. However, as we quickly found out, this design presented challenges for Tabular:

  1. Excessive processing time – It took very long to process the initial model. By default, SSMS generates a transactional processing batch that supports processing multiple tables in parallel. However, if there is an error during processing, the entire transaction is rolled back and you need to start from scratch. To avoid this, we tried scripting the batch and changing it to non-transactional (Transaction=’false’). This option preserves the processed partitions, but it doesn’t allow tables to be processed in parallel. This is by design, to avoid unexpected interactions due to the potential cross-object dependencies in the recalculation chain. Since both the Item and Transactions tables were partitioned, the net effect was that processing was serialized by table and partition when two scripts for processing Item and Transactions were executed simultaneously (partition 1 of Transactions is processed, partition 1 of Item is processed, partition 2 of Transactions is processed, partition 2 of Item is processed, and so on). To make things worse, there is currently a bug in Tabular where the server builds relationships even when Process Data is used as a processing option, which further slows down processing.
  2. Excessive memory footprint – For some reason, Tabular used a lot of memory with the two-table approach. Granted, we had to load the dataset twice (once for Item and a second time for Transactions), but still it should have taken less memory (see the memory footprint in the Solution section). The development server had 80 GB of RAM. We ran out of memory halfway through the full load and ended up with the in-memory model taking 65 GB of RAM.
  3. Excessive query time – This was the worst. When we tried Power View as a reporting tool, each time the Item table was involved in the query, the report took more than a minute to finish. The report filter would return only a few rows from the Transactions table, but when the join was made to the Item table (on the one side of the Item-Transactions relationship), the report performance would degrade significantly. As it turned out, Tabular hasn’t been tuned for such scenarios, where queries involve large dimension tables and the filter is applied on the fact table.

Solution: To work around the Tabular limitations, we merged the two datasets by joining the Item and Transactions tables in a SQL view. This approach resolved all issues:

  1. Processing time was reduced 10x because Tabular had to process only one large table.
  2. The memory footprint of the entire database with all the data was reduced to 13.5 GB.
  3. The query times went down to seconds.

The only caveat is that processing would incur a hit because of the join between two large tables, but we’d rather spend more time during processing than sacrifice query performance.
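
For reference, here is a minimal sketch of the consolidating view, with hypothetical column names (the real view returns many more columns):

create view dbo.vItemTransactions
as
select
t.TransactionID,
t.TransactionDate,
t.Quantity,
t.Amount,
i.ItemBarcode, -- item metadata folded into the single wide table
i.ItemAlias
from dbo.Transactions t
join dbo.Item i on i.ItemID = t.ItemID;

Tabular then imports one wide table from this view instead of two related tables.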

At least for now, don’t be confused by the “relational” nature of Tabular. It’s still optimized for queries that request a restricted set of columns and aggregate data. Currently, it’s not optimized for the transactional (detail-level) style of reporting that requires many columns and requests data at the lowest level. Microsoft is aware of this and will probably introduce optimizations in time.

In a future blog, I’ll discuss techniques to optimize the query performance with transactional reporting and Tabular. Many thanks to Marius Dumitru (Principal Architect on the SSAS team) for answering my questions.

Improving Tabular Design Experience

When you develop an organizational Tabular model in SSDT, there is always an implicit processing phase for each action you perform, such as renaming columns, creating hierarchies, changing formatting, and so on. This “data-driven” paradigm can be both a blessing and a curse. A blessing, because you always work with data and you don’t have to explicitly process the model to see the effect of your changes. And a curse, because each time you make a change the UI blocks until the change is committed, and this can get old pretty quickly.

Note: As long as you don’t use the Table Properties dialog or explicitly refresh the data, all changes are done directly in the Tabular database and Tabular doesn’t re-query the data source to refresh the data.

While waiting for Microsoft to make the necessary changes, here are a few tips to improve your Tabular design experience:

  1. In my experience, the size of the dataset doesn’t affect the duration of the “refresh” step. However, I always use a small dataset during development for faster processing. To do this, if you have a large table, partition the table in Tabular and load only one (e.g. the smallest) partition. If you have already loaded all partitions, you can clear all but one with Process Clear. Now, you have a small dataset to work with during development. Once you deploy to the QA or production server, you can process all partitions.
  2. Disable Automatic Calculation – To do this, go to the Model menu and click Calculation Options -> Manual Calculation. Sometimes (depending on model/calculation complexity) disabling automatic calculation in SSDT may help make modeling operations more responsive. To update the calculated columns, once you’re done with the changes, do Model->Calculate Now. Thanks to Marius for this tip.
  3. While the size of the dataset doesn’t affect the refresh duration, the hardware configuration of your development machine does. Suppose you have an underpowered company laptop and a more powerful personal laptop (lots of memory, a solid-state drive, many cores, etc.). If your company policy allows using your personal laptop, follow these steps to temporarily switch development during the change-intensive part of the design process:
    1. Copy the source from the first laptop to your personal laptop.
    2. Back up and restore the workspace database to your local Analysis Services Tabular instance. If you follow my first tip, the workspace database should be fairly small.
    3. Load the project in SSDT and double-click the Model.bim file to initialize your environment. This will create a second, empty workspace database on your local Tabular instance. Close the solution in SSDT. Now, open the Model.bim_<your name>.settings file in Notepad, find the <Database> element, and change it to the name of the original workspace database. Open the solution in SSDT again. Now you should see the data in your original workspace database.
    4. Perform the design changes. As a comparison, it takes about 5 seconds to commit a change on my laptop vs. 15 seconds on an underpowered laptop.
    5. Once you’re done with the changes, replace Model.bim and Model.bim.layout files on your company’s laptop.

When Developers and BI Collide

I’ve been running into this situation quite often, so I thought it would make a good topic for a blog post.

Scenario: Management has asked for some sort of a BI solution, such as a dashboard. BI hasn’t happened to the organization in question yet. But they have smart developers and there is no project that they can’t do. As the story goes, the developers go to work and whip out some code… lots of it. Code for doing ETL, code for the database layer, and code for implementing the presentation layer, such as a custom web application with cool third-party widgets. Everyone is happy… at least for a while.

I don’t have an issue with developers. In fact, I spent most of my career writing code. However, there are several pitfalls with this approach so let’s mention some of them:

  1. Custom code is expensive to write and maintain – Developers are pulled away from their primary tasks to implement the BI solution. When the initial enthusiasm wears off, managers find it difficult to allocate developers to extending and maintaining the BI solution.
  2. You end up with a proprietary, tightly-coupled solution – The solution would probably meet the initial goals but it might be difficult to extend. What if the custom web application is not enough and users prefer to analyze data in another tool, such as Microsoft Excel? Where are business calculations defined? How do you handle security?
  3. Not the right tools – There are many scenarios when writing custom code makes sense, but BI is not one of them. There are specialized tools that are geared specifically toward BI. If you write custom code, more than likely you’ll face performance and maintenance challenges in the long run.
  4. Not a best practice – Most projects start simply but grow in complexity over time. It’s not uncommon for management to ask for more features, such as analyzing data by other subject areas or drilling to details. What started as a dashboard project might evolve into an end-to-end BI solution that requires a data warehouse, an analytical layer, and different presentation options.
  5. Coders are not BI pros – I hate to say this, but even the smartest programmers don’t know much or care about BI. More than likely, you’ll end up with a normalized database and summary tables for improving performance with aggregates. Again, there are tools and methodologies for BI, so there is no point reinventing the wheel. If you don’t know better, hire someone who does. The “when all you have is a hammer, everything looks like a nail” paradigm won’t take you very far.

Solution: I’m sure you can add to the list, but what’s the solution? When you hear about BI or its manifestations, such as dashboards, reporting, analytics, etc., the following architecture should immediately come to mind.

[Figure: Microsoft BI reference architecture diagram]

This architecture is somewhat simplified. For example, it doesn’t show a staging database but it has the main pieces and their corresponding technologies in the Microsoft BI platform:

  1. A data warehouse whose schema is designed with reporting in mind.
  2. Integration Services (SSIS) packages for ETL processes. They still have to be maintained, but I dare say that maintaining SSIS is much easier than maintaining custom code. For example, it doesn’t require coding skills, it scales well, and it has a comprehensive logging infrastructure.
  3. An analytical layer, either as a multi-dimensional cube or a tabular model. This is the most overlooked piece but it’s the most important for the reasons I explained in my “Why an Analytical Layer?” blog.
  4. Finally, just like a car manufacturer, you should strive to assemble your solution with prefabricated parts instead of writing custom code. For example, you can implement very compelling dashboards with Power View that uses your analytical layer as a data source without having to write a single line of code. Not to mention that you can delegate this task to business users.

Now, with custom code you can do anything, including features that you can’t get out of the box with prepackaged BI tools. However, you’ll be surprised how willing your management might be to compromise on features, especially in this economy.

Adding a Set Total

Scenario: You have defined an MDX set in the cube script, such as a set that returns a few months. For the sake of simplicity, we will hardcode the months.

CREATE SET CURRENTCUBE.[12Months] AS

{

[Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]

};

As useful as the set might be, it has an issue. It doesn’t return the total and users might complain about it:

[Figure: Excel PivotTable showing the 12Months set without a grand total]

Solution: As a first try, you might attempt to add the [All] member to the set:

CREATE SET CURRENTCUBE.[12Months] AS

{

[Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]

+

[Date].[Month].[All]

};

You’re on the right path. Now you’ll get the grand total row in Excel, but it shows the unrestricted (non-visual) total instead of the total across the set members only. Instead, use the VisualTotals() function:

CREATE DYNAMIC SET CURRENTCUBE.[12Months] AS

{

[Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]

+

VisualTotals

(

{

[Date].[Month].[All],

[Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]

}

)

};

 

[Figure: Excel PivotTable showing the 12Months set with a visual grand total]

Here, the VisualTotals function returns the All member with the total adjusted to the set members only. Notice also that I’ve changed the set definition to be dynamic (the DYNAMIC keyword) to force re-evaluation of the set.

In a slightly more complicated scenario, what if you want the total to appear as an additional member of the set as opposed to a grand total row? One implementation approach is to add a dimension calculated member to the Month hierarchy that does the aggregation and then add the new member to the set:

CREATE MEMBER CURRENTCUBE.[Date].[Month].Agg AS Aggregate([Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]);

CREATE SET CURRENTCUBE.[12Months] AS

{

[Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]

+

[Date].[Month].Agg

};

 

Optimizing Massive SQL Joins

Scenario: Run ETL to perform a full data warehouse load. One of the steps requires joining four biggish tables in a staging database with 1:M logical relationships (a simplified sketch of the join follows the row counts below). The tables have the following row counts:

VOUCHER: 1,802,743

VOUCHER_LINE: 2,183,469

DISTRIB_LINE: 2,658,726

VCHR_ACCTG_LINE: 10,242,414
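
A simplified sketch of the kind of 1:M join chain involved (the key columns and the amount column are assumptions, not the actual schema):

select v.VOUCHER_ID, vl.VOUCHER_LINE_NUM, dl.DISTRIB_LINE_NUM, al.MONETARY_AMOUNT
from VOUCHER v
join VOUCHER_LINE vl
on vl.VOUCHER_ID = v.VOUCHER_ID
join DISTRIB_LINE dl
on dl.VOUCHER_ID = vl.VOUCHER_ID and dl.VOUCHER_LINE_NUM = vl.VOUCHER_LINE_NUM
join VCHR_ACCTG_LINE al
on al.VOUCHER_ID = dl.VOUCHER_ID and al.VOUCHER_LINE_NUM = dl.VOUCHER_LINE_NUM and al.DISTRIB_LINE_NUM = dl.DISTRIB_LINE_NUM;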

Observations: On the development server, the SELECT query ran for hours. However, on the UAT server it finished within a few minutes. Both servers have the same data and hardware configuration, running SQL Server 2012 SP1.

Solution: Isolating the issue and coming up with a solution wasn’t easy. Once we ruled out resource bottlenecks (both servers have similar configuration and similar I/O throughput), we took a look at the estimated query plan (we couldn’t compare with the actual execution plan because we couldn’t wait for the query to finish on the slow server).

We noticed that the query plans were very different between the two servers. Specifically, the estimated query plan on the fast server included parallelism and hash match join operators. However, the plan on the slow server used many-to-many (M:M) merge joins. A many-to-many merge join requires a tempdb work table for inner-side rewinds, which surely can cause performance degradation.

[Figure: Estimated query plan showing the many-to-many merge joins]

Interestingly, the table cardinality and the estimated number of rows didn’t change much between the two plans. Yet, the query optimizer chose very different plans. At this point, we figured that this could be an issue with statistics, although both servers were configured to auto-update statistics (to auto-update statistics, SQL Server requires modifications to at least 20% of the rows in the table). The statistics on the slow server probably just happened to have a sample distribution that led to a particular path through the optimizer, which ended up choosing a serial plan instead of a parallel plan. Initially, we tried sp_updatestats but didn’t get an improvement. Then, we ran UPDATE STATISTICS <table name> WITH FULLSCAN on the four tables. This resolved the issue, and the query on the slow server executed on par with the query on the fast server.

Note: Updating statistics with full scan is an expensive operation that probably shouldn’t be in your database maintenance plan. Instead, consider:

1. Stick with the default sampled statistics.

2. Try hints for specific queries that exhibit slow performance, such as OPTION (HASH JOIN, LOOP JOIN), to preclude the expensive merge joins (see the sketch below).
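
For reference, the full-scan update we ran and the hint alternative look like this in T-SQL (the schema name and the hinted statement are illustrative):

-- One-off fix: rebuild statistics with a full scan on the affected tables
UPDATE STATISTICS dbo.VOUCHER WITH FULLSCAN;
UPDATE STATISTICS dbo.VOUCHER_LINE WITH FULLSCAN;
UPDATE STATISTICS dbo.DISTRIB_LINE WITH FULLSCAN;
UPDATE STATISTICS dbo.VCHR_ACCTG_LINE WITH FULLSCAN;

-- Per-query alternative: append a join hint to the slow statement only, e.g.
-- SELECT ... FROM VOUCHER v JOIN VOUCHER_LINE vl ON ... OPTION (HASH JOIN, LOOP JOIN);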

Special thanks to fellow SQL Server MVPs, Magi Naumova, Paul White, Hugo Kornelis, and Erland Sommarskog for shedding light in dark places!