
Optimizing Dimension Data Security

Dimension data security is a great feature of Analysis Services and one of the top reasons to strongly consider a semantic layer in your BI implementations, as I mentioned in my "Why Semantic Layer" newsletter. That's because the server automatically secures the cube at connect time, so you don't have to implement your own security framework and secure every report query. If you are new to dimension data security, I covered it in various places, including this article that I wrote for SQL Magazine a while back. One thing to be aware of is that the server applies dimension data security when the user sends the first query after the connection is established, even if this query doesn't specifically request the secured dimension, e.g. SELECT NULL ON 0 FROM [CubeName]. Many things happen at this point (see the steps after "Dimension security is applied to all attributes" in this Mosha blog). This initialization sequence is the price you pay to have your entire cube secured. The initialization duration depends on many factors, including the size of the dimension, the size of the allowed set, and the cube script. However, you do have some control over the initial performance hit, or you can at least offset it.

Let's consider a real-life scenario. A customer has a medium-size cube (about 50 GB), and dimension data security needs to be applied to an Account dimension whose dimension key attribute has about four million members. Specifically, the allowed set had to be applied on an attribute with about 300,000 members. In this case, the security policies are stored in a factless fact table (for more information about how dimension data security works with a factless fact table, see the second part of the above article). The initialization time for an allowed set of only two accounts was 9 seconds, which is not too bad. However, when the allowed set returned 20,000 allowed accounts, the user had to wait 30 seconds to connect, which can surely lead to user complaints.
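
For context, an allowed set over a factless security fact table typically looks something like the following sketch. The User dimension, the Security measure group, and the attribute names are hypothetical placeholders for your own schema:

Exists
(
    [Account].[Level 3].[Level 3].Members,
    StrToMember("[User].[Login].&[" + UserName() + "]"),
    "Security"
)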

Reducing the Autoexists Performance Hit

When the server applies dimension data security, it applies it to all attributes within the secured dimension. This behavior is known as Autoexists. In this case, the Account dimension had about 30 attributes. Although dimension security is applied on a single attribute (Level 3), the server automatically secures the rest of the attributes to allow only the members that exist with the allowed members of the secured attribute (if the allowed set returns 'USA', Autoexists allows only cities within USA). One way to determine how much time the server spends applying Autoexists is to enable and monitor the Query Dimension event in the Profiler. Another way is to remove all the cube script (except the first CALCULATE statement) and track the query execution time before and after.

As I said, the larger the allowed set and the dimension size, the more time will be spent propagating security to the attributes in the secured dimension. To reduce this time, consider introducing a new dimension, e.g. Account Filter, that has as few attributes as possible. Ideally, it should include only the secured attribute, although you might need to include the dimension key attribute so you don't have to make changes to the fact tables in order to join them to the dimension at a higher level. In our case, we tested three scenarios:

Scenario 1: Security applied on the original dimension

An allowed set of two members took 1.7 seconds in Autoexists. An allowed set of 20,000 members took 13 seconds in Autoexists.

Scenario 2: Security applied on a new Account Filter dimension that had three attributes (dimension key, secured attribute, and another attribute required to meet the business requirements for security).

An allowed set of two members took 1.5 seconds in Autoexists. An allowed set of 20,000 members took 3.5 seconds in Autoexists.

Scenario 3: Security applied on a new Account Filter dimension that had two attributes (secured attribute and another attribute required to meet the business requirements for security). This scenario required changes to the fact tables to bring in the secured attribute in order to join them to the Account Filter dimension.

An allowed set of two members took 1.2 seconds in Autoexists. An allowed set of 20,000 members took 1.7 seconds in Autoexists.

Optimizing the Cube Script

The cube script had about 3,000 lines, most of them basic calculated members (a best practice is to have calculations defined in the semantic layer instead of in reports or elsewhere). The items to pay attention to in the script are static sets and default members (default members can be defined in the dimension itself or in the script). A static set is evaluated once per user session. With regard to dimension data security, we found that the server evaluates the static set in the context of the allowed members. You can turn on the Execute MDX Script Current event in the Profiler to see how much time the server spends evaluating each line in the script. In our case, I found that the following static set took some 10 seconds to evaluate with an allowed set of 20,000 members:

CREATE SET [MonthsWithInventory] AS Exists([Date].[Fiscal by Month].[Month].Members, , "Inventory");

When I converted the set to dynamic (CREATE DYNAMIC SET), the overhead disappeared from the initialization time. However, be aware that during the static set evaluation the server warms the cache with the results of the static set. Switching to a dynamic set saves you the cost of evaluating the Exists query, which in this case was expensive, so the first query will execute faster. But it also means that the cache that would have been built up as part of evaluating that set is no longer available for other operations. Therefore, subsequent queries that request data by month might be slower.
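
For reference, the dynamic version of the same set differs only by the DYNAMIC keyword:

CREATE DYNAMIC SET [MonthsWithInventory] AS Exists([Date].[Fiscal by Month].[Month].Members, , "Inventory");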

I also found that overriding the default members contributes to the initialization time. If you have the option, remove explicit attribute defaults. Interestingly, trying to conditionally bypass the defaults didn't eliminate the overhead, such as using the following approach:

ALTER CUBE CURRENTCUBE UPDATE DIMENSION [Account].[Active Flag], Default_Member = iif(IsEmpty(CustomData()), [Account].[Active Flag].&[Active], [Account].[Active Flag].DefaultMember);

 

Special thanks to Akshai Mirchandani on the SSAS team for shedding light in dark places.

The SharePoint 2013 Data Model

Somehow, this PowerPivot for SharePoint gem escaped my attention, but it's a very important deployment enhancement that was more than likely initiated by Power BI and the cloud. Starting with SQL Server 2012 Service Pack 1 and SharePoint 2013, you can set up a dedicated SSAS server in SharePoint mode outside the SharePoint farm and configure the farm to point to it. This was made possible by the integration work done between Excel Calculation Services, SharePoint, and Analysis Services for the 2013 release. Previously, the SSAS instance in SharePoint configuration mode had to be installed on the SharePoint application server.

This brings more flexibility for scaling out your PowerPivot environment outside SharePoint and simplifies the PowerPivot setup. For example, to scale out Analysis Services, you can simply add a new SSAS server configured in SharePoint mode. Excel Calculation Services will balance the load between all available SSAS servers automatically, based on health statistics obtained from these server instances. The same applies if you want to make the backend fault-tolerant. If one server goes down, the workbooks that used to be on the failed server will automatically get redeployed on the other servers by Excel Calculation Services. Specifically, the next time the workbook is accessed by a user, ECS will realize that the workbook is no longer on any active node and the workbook will get loaded once again on a different node.

To register a new server in SharePoint 2013:

  1. Open SharePoint Central Administration and, in the Application Management section, click Manage service applications.
  2. On the Manage Service Applications page, click the Excel Services service application that you want to manage.
  3. On the Manage Excel Services page, click Data Model.


Then, in the Excel Services Application Data Model Settings, click Add Server to register the new server.


The PowerPivot setup is simplified as well because you don't need to install the PowerPivot for SharePoint add-in, as explained here. However, the add-in is still required for the full PowerPivot for SharePoint capabilities, including PowerPivot Gallery, scheduled data refresh, and the PowerPivot management dashboard. Finally, having a dedicated SSAS server might save you money if you need to scale out. That's because you no longer need to license SSAS on each SharePoint application server, assuming you don't need the other SQL Server components, such as SSRS. Further savings can be realized by deploying SSAS in SharePoint mode as another instance on an existing SSAS server (Multidimensional or Tabular). To my understanding, instances of the same version are covered by a single license.

Flavors of Self-service BI

My Fall newsletter entitled "Why a Semantic Layer?" is out and will be e-mailed to subscribers on Monday. The reason I wrote it is to bring the focus back to organizational BI. Based on my experience, decision makers are confused about what their company's BI strategy and roadmap should be. I recently asked a BI manager if he knew that a semantic layer could be a true self-service BI enabler, and he said he didn't. After talking to some well-known BI vendors and listening to their "pure" self-service mantra, he had shifted the focus from a comprehensive BI architecture that truly addresses the company's needs to selecting the tool with the most visualization fluff.

In general, there are two self-service BI paths your organization can take:

  • Semantic layer + ad hoc reporting – If your organization decides to invest in implementing a comprehensive organizational BI framework (see the diagram in my newsletter) that features a semantic layer, the end user's focus will be where it should be: on data analysis and not on building models. For example, end users can use Excel, Power View, or whatever third-party data visualization tool your company fell in love with, to connect to the semantic layer and create insightful ad hoc reports with a few clicks. There is no need to import data and build models before the business users ever get to data analysis.
  • Self-service models – In this scenario, you put the responsibility on end users to create models and their version of the truth. Following this path, an end user would use a tool, such as Power Pivot, to import data, create a model on the desktop, and then derive insights from the data. Don't get me wrong. There are good reasons why this path makes sense, including the ones I mentioned in my blog "Does Self-service Make Sense".

Ideally, your BI roadmap should consider and plan for both paths, although the focus should be on organizational BI first and then on finding gaps that self-service BI could address, such as testing ideas that require external data. If your organization relies only on "pure" self-service BI, it won't be long before you find out that the pendulum has swung the wrong way. Many have taken this road, but the outcome is always the same: "spreadmarts" that might get the work done in the interim but will fail in the long run. As a rule of thumb, I recommend the 80/20 principle, where 80% of the effort is spent on organizational BI (DW, ETL, semantic layer, dashboards, operational reports, data mining, big data, etc.) and 20% is left for self-service BI. But each organization is different, so the ratio might vary.

Let me know your thoughts.

Semantic Layer Flavors

Semantics relates to discovering the meaning of the message behind the words. In the context of data and BI, semantics represents the user's perspective of data: how the end user views the data to derive knowledge from it. As a modeler, your job is to translate the machine-friendly database structures and terminology into a user-friendly semantic layer that describes the business problems to be solved and centralizes business calculations. Different vendors have different ideologies and implementations of semantic layers. Some, such as Oracle and MicroStrategy, implement the semantic layer as a thin, pass-through layer whose main goal is to centralize access to data sources, metadata, and business calculations in a single place.

A Fortune 100 company that had a major investment in Oracle engaged us to recommend a solution for complementing their OBIEE semantic layer with a user-friendly reporting tool that would allow end users to create ad hoc transactional reports from their Oracle Exadata-based enterprise data warehouse. In the OBIEE world, the role of the semantic layer is fulfilled by the Common Enterprise Information Model, which encompasses data sources, metrics, calculations, definitions, and hierarchies. When users submit report queries, the Oracle BI Server compiles the incoming requests and auto-generates native queries against the data sources plugged into the Common Enterprise Information Model.

In theory, this “thin” implementation should work pretty well. In reality, several issues surface:

  1. Performance – Any tool, be it a reporting tool or a semantic layer, that auto-generates queries should be highly suspect when it comes to performance. First, no matter how well the tool is designed, it's unlikely that it would generate efficient queries in all cases. Second, because the tool sends the query to the underlying data source, the overall report performance is determined by how fast the data source crunches data. In this case, the organization was facing performance issues with data aggregation. To circumvent them, their BI developers had implemented summarized fact tables. Once the user reaches the lowest level of the summarized fact table, OBIEE allows the user to drill down to a chained table, such as the actual fact table. Needless to say, business users were complaining that analysis was limited to the dimensionality of the summarized fact table. Further, to avoid severe performance issues with historical reports, the BI developers had taken the "no big aggregations, no problem" approach by forcing a time filter on each request that touches the large fact tables. This effectively precluded historical analysis by year, quarter, etc.
  2. Scalability – When a query involves multiple data sources, OBIEE sends a query to each data source, loads the data in memory, and then joins the datasets together to find matching data. This also presents performance and scalability challenges. To solve performance and scalability issues with "thin" semantic layers and biggish data, organizations ultimately resort to expensive back-end MPP systems, such as Teradata or Netezza, which can parallelize data crunching and compensate for inefficient auto-generated queries.
  3. Common Denominator – You're limited to a subset of the supported features of the underlying data source.
  4. Vendor Dependency – Such an architecture ultimately locks you into the data sources supported by the vendor. What happens when there is a new version of the data source that has some cool new features? Well, you need to wait until the vendor officially supports the new version and its extensions.

In the Microsoft BI world, the role of the semantic layer is fulfilled by the Business Intelligence Semantic Model (BISM). BISM is an umbrella name that encompasses two technologies: Multidimensional for implementing OLAP cubes and Tabular for designing relational-like models. The big difference is that, by default, both Multidimensional and Tabular store data on the server. In other words, from a data perspective, you can view BISM with its default storage mode as a disk-based (MOLAP for Multidimensional) or in-memory (xVelocity for Tabular) cache of the data in your data warehouse. This proprietary storage is designed for fast data crunching even with billions of rows. You pay a price for loading and refreshing the data, but you enjoy great query performance with historical and trend reports. And, you get a great ROI because you may not need an expensive MPP appliance or excessive data maintenance. Queries sent to BISM are served by BISM, and not by the RDBMS. There is no need for summarized tables and time filters.

Analysis Services Relationship Limitations and Workarounds

Relationships in analytical models (Multidimensional and Tabular) are both a blessing and a curse. Once defined by the modeler, relationships power “slicing and dicing” data. The analytical tool doesn’t have to worry about and include the relationships in the query because they are defined in the model and the server “knows” how to use them to aggregate the data. On the other hand, as they stand, all analytical model flavors (Multidimensional, Tabular, and PowerPivot) have inherent limitations. The following table compares relationship limitations between Multidimensional and Tabular (PowerPivot).

Relationship type    Multidimensional    Tabular and PowerPivot
Many-to-Many         Supported           Not supported
Parent-child         Supported           Requires flattening the hierarchy
Role-playing         Supported           Not supported
Multi-grain          Supported           Not supported
Fact-to-fact         Not supported       Supported but rarely useful

 

Let's discuss these limitations in more detail.

Many-to-Many

The classic example of a many-to-many relationship is a joint bank account that is owned by multiple customers. Many-to-many relationships are natively supported in Multidimensional, and the server produces correct results when aggregating the data. Tabular and PowerPivot don't support many-to-many out of the box. When the one-to-many flow of the relationship is reversed through a bridge table, Tabular will let you continue but will produce wrong results. In PowerPivot, this limitation manifests itself with the "Relationships might be needed" warning in the PowerPivot Field List. The workaround is to introduce explicit DAX measures for each column that requires aggregation, using the formula =CALCULATE(SUM(FactTable[Column]), BridgeTable). On the upside, due to Tabular's memory-resident nature, expect many-to-many over a large fact table to perform much better than in Multidimensional.
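
For example, here is a minimal sketch of the workaround for the joint bank account scenario, with hypothetical Transactions (fact) and CustomerAccount (bridge) tables:

// Explicit measure that forces the Customer filter through the bridge table
Total Amount := CALCULATE(SUM(Transactions[Amount]), CustomerAccount)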

Parent-Child

A parent-child relationship defines a recursive join between two columns, such as an employee and a manager. This relationship type is supported natively in Multidimensional, but Tabular requires flattening the hierarchy with DAX calculated columns, one for each level in the hierarchy. Flattening the hierarchy is accomplished by using the DAX PATH-related functions.
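
For example, a minimal sketch that flattens a hypothetical Employee table with EmployeeKey and ParentEmployeeKey columns into level columns:

// Calculated column that stores the key path from the top parent down to the employee
EmployeePath = PATH(Employee[EmployeeKey], Employee[ParentEmployeeKey])
// One calculated column per level translates the key at that position into a name
Level1 = LOOKUPVALUE(Employee[FullName], Employee[EmployeeKey], PATHITEM(Employee[EmployeePath], 1, INTEGER))
Level2 = LOOKUPVALUE(Employee[FullName], Employee[EmployeeKey], PATHITEM(Employee[EmployeePath], 2, INTEGER))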

Role-playing

A role-playing relationship allows a dimension (lookup) table to be joined multiple times to a fact table, such as joining a Date dimension multiple times to a Sales fact table in order to support aggregating the fact data by OrderDate, ShipDate, and DueDate. Multidimensional supports role-playing relationships natively. Tabular doesn't and flags only one of the relationships as active. The first (and recommended for most scenarios) workaround is simply to reimport the Date table as many times as needed. This gives you the flexibility to control the schema and data of each Date table instance, e.g. the ShipDate table includes only valid ship dates, at the expense of storage and increased model complexity. The second, more complicated, workaround is to create calculated measures, such as ShipAmount, DueAmount, etc., that use the DAX USERELATIONSHIP function to traverse the inactive relationships. This approach presents maintenance challenges because chances are that the fact table will have many columns and you'll have to create role-playing variants of each measure that needs to be aggregated.
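
For example, a sketch of the second workaround, assuming a hypothetical Sales table with inactive relationships from ShipDateKey and DueDateKey to the Date table:

// Measures that activate the otherwise inactive role-playing relationships
Ship Amount := CALCULATE(SUM(Sales[SalesAmount]), USERELATIONSHIP(Sales[ShipDateKey], 'Date'[DateKey]))
Due Amount := CALCULATE(SUM(Sales[SalesAmount]), USERELATIONSHIP(Sales[DueDateKey], 'Date'[DateKey]))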

Multi-grain

A multi-grain relationship exists when a dimension joins a fact table at a higher level than the dimension key. For example, you might have an Organization table that goes all the way down to business unit. However, an Expenses fact table might record expenses at a higher level in the organizational hierarchy, such as Division. Consequently, the modeler needs to set up a relationship from the Expenses fact table to the Division column as opposed to the Business Unit column (dimension key). Multidimensional supports this scenario out of the box. Tabular doesn't because the joined column on the one side of the relationship must be the dimension key, which uniquely identifies each row in the dimension table. The workaround is to break the Organization dimension table into two tables, Organization and OrganizationDivision, in order to join the OrganizationDivision table to the Expenses fact table on the Division column, which is now the key column. Then, you need to join the OrganizationDivision table to the Organization table for consolidated reporting purposes, such as to see Sales and Expenses side by side.
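
For example, one possible way to produce the OrganizationDivision table is a SQL view over the existing dimension table (table and column names are hypothetical):

-- One row per division; the Division column becomes the key of the new lookup table
CREATE VIEW OrganizationDivision AS
SELECT DISTINCT Division
FROM dbo.DimOrganization;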

Fact-to-fact

Suppose that your model has two fact tables: OrderHeaders and OrderLineItems. You would like to have a transactional report that shows the line items for each order. Let's say that the OrderHeaders fact table has an OrderID column and OrderLineItems has a matching OrderID column. At first thought, you might attempt to solve the problem by creating a relationship between the OrderHeaders and OrderLineItems tables. However, Multidimensional doesn't support fact-to-fact relationships. That's because the classic OLAP model insists on having relationships between dimension and fact tables only. To solve this predicament in Multidimensional, you need a PO Order dimension that joins both the OrderHeaders and OrderLineItems tables. However, chances are that you might end up with another complication if you take this route. This might not be applicable to the order header-line items scenario, but what if the rows in the two fact tables have different dates? If a Date dimension joins them and you attempt to slice by date, the business question becomes "Show me orders and order line items that have matching dates". This might or might not be correct depending on your situation. So, this is where you'll find that although very useful, relationships in analytical models might not be that flexible for ad hoc analysis.

 

Thanks to its more relaxed schema requirements, Tabular doesn't care about the table type (dimension or fact) and it will let you set up a relationship between OrderHeaders and OrderLineItems as long as they comply with the one-to-many rule. However, you'll face another limitation: Tabular doesn't support closed-loop relationships. More than likely, the OrderHeaders and OrderLineItems tables will already have relationships to other lookup tables, such as Account, Organization, etc. Once you create a fact-to-fact relationship, you'll close the relationship loop. Tabular and PowerPivot will detect this and automatically deactivate the fact-to-fact relationship (its line will be dotted in the diagram view). Consequently, you must resort to the Multidimensional workaround of creating an OrderHeader lookup table. Another workaround is to merge the OrderHeaders and OrderLineItems tables into a single table. Or, create a model that has only the OrderHeaders and OrderLineItems tables so you can create the fact-to-fact relationship.

 

Relationships are a challenging concept to grasp. As a modeler, you need to understand their limitations and workarounds while hoping that a future version of Analysis Services will relax or remove some of the existing constraints.

Performance Tuning of Tabular Models in SQL Server 2012 Analysis Services

A really impressive white paper/book with insights about Tabular performance and inner-workings. Kudos to the authors for producing this colossal resource.

“Tabular models hosted in SQL Server 2012 Analysis Service provide a comparatively lightweight, easy to build and deploy solution for business intelligence. However, as you increase the data load, add more users, or run complex queries, you need to have a strategy for maintaining and tuning performance. This paper describes strategies and specific techniques for getting the best performance from your tabular models, including processing and partitioning strategies, DAX query tuning, and server tuning for specific workloads.”

SSAS Instance Not Starting

Issue: A customer reports an issue with a production SSAS Tabular instance not starting after a reboot. The server also has an SSAS Multidimensional instance, which starts just fine. There are no interesting messages in the Event Viewer.

Resolution: During installation, the customer had configured the LogDir of both instances to point to the same physical folder. However, the Flight Recorder can't start on the Tabular instance because the log file is already locked by the Multidimensional instance. The solution to this horrible problem was to reconfigure the LogDir setting of the Tabular instance (in SSMS, connect to the SSAS instance, right-click the server, and click Properties) to point to a separate folder.

Moral: Always separate DataDir, LogDir and TempDir folders so each instance has its own set of folders.

Power View Connectivity to Multidimensional (DAXMD) Released

Microsoft released Cumulative Update 4 for SQL Server 2012 Service Pack 1. This is more than a regular cumulative update, as it includes enhancements to both Power View and Analysis Services to support connecting Power View to OLAP cubes, also known as DAXMD. For more information, read the official announcement by the Analysis Services product group. Now, business users can easily author ad hoc reports and interactive dashboards by leveraging your OLAP investment. As a proud contributor to the DAXMD TAP program, I'm delighted with the results, as I discussed in my blog "DAXMD Goes Public".

I hate to dampen the spirit but be aware that only the SharePoint version of Power View has been extended to support Multidimensional. We don’t know yet when the same will happen to Power View in Excel 2013. Anyway, this is a very important BI enhancement and it’s time to plan your DAXMD testing and deployment.

Query Options for Analytical Models and Transactional Reporting

Requesting both historical and transactional reports is a very common requirement. As they stand, neither Multidimensional nor Tabular is designed to support transactional (detail-level) reporting with large datasets well, as I discussed in my Transactional Reporting with Tabular blog post. There are two query options that might provide some relief with transactional reporting, and both have limitations.

DRILLTHROUGH QUERIES

You can use a DRILLTHROUGH query to request data at the lowest level. The syntax differs somewhat between Multidimensional and Tabular. Here is a sample drillthrough query against the Adventure Works cube:

DRILLTHROUGH MAXROWS 1000
SELECT FROM [Adventure Works]
WHERE ([Measures].[Reseller Sales Amount],[Product].[Category].[Accessories],[Date].[Calendar Year].&[2006])
RETURN
[$Product].[Product],
[$Date].[Date],
[Reseller Sales].[Reseller Sales Amount],
[Reseller Sales].[Reseller Tax Amount]

And a similar query against the Adventure Works Tabular model:

DRILLTHROUGH MAXROWS 1000
SELECT FROM [Model]
WHERE ([Measures].[Reseller Total Sales],[Product].[Category].[Category].&[Accessories],[Date].[Calendar].[Year].&[2005])
RETURN
[$Product].[Product Name],
[$Date].[Date],
[$Reseller Sales].[Sales Amount],
[$Reseller Sales].[Tax Amount]

Notice that the WHERE clause specifies the coordinate that is drilled through. The optional RETURN statement enumerates the columns you need in the output with dimensions prefixed with a dollar sign. Since Tabular doesn’t have measure groups, you treat all tables as dimensions.

Drillthrough queries are very fast. They honor the data security (both dimension data security in Multidimensional and row filters in Tabular). Members outside the allowed set simply don't exist as far as DRILLTHROUGH is concerned. For example, attempting to request a dimension member outside the allowed set would return the following error:

Drillthrough failed because the coordinate identified by the SELECT clause is out of range.

This is the same error you will get if you request a non-existent or wrong dimension member.

DRILLTHROUGH queries have limitations:

  1. You can’t request measures from different measure groups in Multidimensional, such as Internet Sales and Reseller Sales. If you attempt to do so, you’ll get the following error:
    Errors from the SQL query module: Drillthrough columns must belong to exactly one measure group. The columns requested by the drillthrough statement are referencing multiple measure groups.
  2. The WHERE coordinate must resolve to a single coordinate. In other words, you can't specify a set, such as {[Product].[Category].[Accessories],[Product].[Category].[Bikes]}. You can avoid the error by using SUBSELECT instead of WHERE, but you'll find that the SUBSELECT clause doesn't set the default member and you'll get all products. This limitation will prevent you from using multi-select parameters in SSRS.
  3. Related to the previous item, you can’t specify FILTER conditions either, e.g. where the transaction date is between a range of dates. Again, you can only use a tuple in the WHERE clause.
  4. You must use the Analysis Services OLE DB provider in SSRS. If you need to parameterize the query, you need to use an expression-based query statement, as the sketch after this list shows.
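
For example, a possible expression-based dataset query in SSRS (the parameter names and member keys are hypothetical, and the coordinate is simplified):

="DRILLTHROUGH MAXROWS 1000 SELECT FROM [Adventure Works] WHERE ([Measures].[Reseller Sales Amount],[Product].[Category].&[" & Parameters!CategoryKey.Value & "],[Date].[Calendar Year].&[" & Parameters!YearKey.Value & "]) RETURN [$Product].[Product],[$Date].[Date],[Reseller Sales].[Reseller Sales Amount]"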

SQL QUERIES

You can also use SQL-like queries as Chris Webb explains in the “Can I run SQL against an Analysis Services cube?” blog. Here is a SQL query against the Adventure Works cube:

select
[Adventure Works].[$Customer].[City],
[Adventure Works].[Internet Sales].[Internet Sales Amount]
from [Adventure Works].[Internet Sales]
natural join [Adventure Works].[$Date]
natural join [Adventure Works].[$Customer]
where [Adventure Works].[$Date].[Calendar Year] = 'CY 2008'

SQL queries have limitations as well:

  1. You must specify SQLQueryMode=DataKeys in the connection string if you connect to Tabular.
  2. SQL queries don’t work with data security (row filters) in Tabular but they appear to work with dimension data security in Multidimensional. Attempting to execute a SQL query against a Tabular model with row filters will throw the following exception:
    Tabular queries are not supported with row level security.
  3. Unlike T-SQL, SSAS SQL queries support only the equal operator. Anything else and you’ll get the following error. This limitation precludes more complicated filtering, such as returning only transactions within a given date range.
    Errors from the SQL query module: There is an unexpected operator in the WHERE clause. Only the Equals operator ( = ) is supported.
  4. You must use the Analysis Services OLE DB provider in SSRS. Unlike DRILLTHROUGH, you can use AND in the WHERE clause to support multi-valued parameters. However, since the Analysis Services OLE DB provider doesn’t support parameters, you must resort to using an expression-based query statement.

Feel free to join me in escalating this issue to Microsoft because historical and transactional reporting are both very common scenarios. Thanks to its in-memory nature and more relaxed schema requirements, Tabular is best positioned to support both in the long run, so I hope that Microsoft will make improvements in this area. For now, when SSAS doesn't give you the performance you need, consider delegating transactional reporting to the Database Engine.

Transactional Reporting with Tabular

Scenario: We had a requirement to replace an existing implementation of transactional reporting over large data volumes. By "transactional reporting" I mean allowing the user to query individual transactions as they're stored in the fact table. In some cases, this style of reporting requires simply reading data without aggregations. This, of course, is what an RDBMS is designed for, but in our case the requirements called for extending the reporting model with metadata about the item, such as the item barcode, alias, etc., as well as supporting fast aggregation analysis (for trend reports) and data security. In general, when implementing BI solutions, you should always have an analytical layer between the database and the presentation layer for the reasons I discussed in the "Why an Analytical Layer?" blog.

Challenges: Because of the in-memory nature of Tabular and its more relaxed schema requirements, such as no separation of dimension and fact tables and the ability to keep text-based columns in fact tables, we opted for Tabular on top of a data warehouse and star schema. Our initial schema included an Item dimension table (700 million rows), where the item metadata is kept, and a Transactions table (1.5 billion rows) that stores the item transactions. However, as we quickly found out, this design presented challenges for Tabular:

  1. Excessive processing time – It took a very long time to process the initial model. By default, SSMS generates a transactional processing batch that supports processing multiple tables in parallel. However, if there is an error during processing, the entire transaction is rolled back and you need to start from scratch. To avoid this, we tried scripting the batch and changing it to non-transactional (Transaction='false'), as the sketch after this list shows. This option preserves the processed partitions, but it doesn't allow tables to be processed in parallel. This is by design, to avoid unexpected interactions due to the potential cross-object dependencies in the recalculation chain. Since both the Item and Transactions tables were partitioned, the net effect was that the processing was serialized by table and partition when two scripts for processing Item and Transactions were executed simultaneously (partition 1 of Transactions is processed, partition 1 of Item is processed, partition 2 of Transactions is processed, partition 2 of Item is processed, and so on). To make things worse, there is currently a bug in Tabular where the server builds relationships even when Process Data is used as a processing option, which further slows down processing.
  2. Excessive memory footprint – For some reason, Tabular used a lot of memory with the two-table approach. Granted, we had to load the dataset twice (once for Item and a second time for Transactions), but it still should have taken less memory (see the memory footprint in the Solution section). The development server had 80 GB of RAM. We ran out of memory half-way through the full load and ended up with the in-memory model taking 65 GB of RAM.
  3. Excessive query time – This was the worst. When we tried Power View as a reporting tool, each time the Item table was involved in the query, the report took more than a minute to finish. The report filter would return only a few rows from the Transactions table, but when the join was made to the Item table (on the one side of the Item-Transactions relationship), the report performance would degrade significantly. As it turned out, Tabular hasn't been tuned for such scenarios, where queries involve large dimension tables and the filter is applied on the fact table.
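
For reference, here is a minimal sketch of a non-transactional processing batch (the database, cube, measure group, and partition IDs are hypothetical):

<Batch Transaction="false" xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <!-- Each command commits on its own, so partitions processed before a failure are preserved -->
  <Process>
    <Type>ProcessData</Type>
    <Object>
      <DatabaseID>InventoryDB</DatabaseID>
      <CubeID>Model</CubeID>
      <MeasureGroupID>Transactions</MeasureGroupID>
      <PartitionID>Transactions Partition 1</PartitionID>
    </Object>
  </Process>
  <Process>
    <Type>ProcessData</Type>
    <Object>
      <DatabaseID>InventoryDB</DatabaseID>
      <CubeID>Model</CubeID>
      <MeasureGroupID>Item</MeasureGroupID>
      <PartitionID>Item Partition 1</PartitionID>
    </Object>
  </Process>
</Batch>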

Solution: To work around the Tabular limitations, we merged the two datasets by joining the Item and Transactions tables in a SQL view. This approach resolved all issues:

  1. Processing time was reduced tenfold because Tabular had to process only one large table.
  2. The memory footprint of the entire database with all the data was reduced to 13.5 GB.
  3. The query times went down to seconds.

The only caveat is that processing incurs a hit because of the join between the two large tables, but we'd rather spend more time during processing than sacrifice query performance.
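
For illustration, the merged view looked something like the following sketch (table and column names are hypothetical):

-- Denormalizes the item metadata into the transaction rows so Tabular processes a single large table
CREATE VIEW dbo.vTransactionItem AS
SELECT t.TransactionID, t.TransactionDate, t.Quantity, t.Amount,
       i.ItemBarcode, i.ItemAlias
FROM dbo.Transactions t
JOIN dbo.Item i ON i.ItemKey = t.ItemKey;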

At least for now, don't be confused by the "relational" nature of Tabular. It's still optimized for queries that request a restricted set of columns and aggregate data. Currently, it's not optimized for a transactional (detail-level) style of reporting that requires many columns and requests data at the lowest level. Microsoft is aware of this and will probably introduce optimizations in time.

In a future blog, I’ll discuss techniques to optimize the query performance with transactional reporting and Tabular. Many thanks to Marius Dumitru (Principal Architect on the SSAS team) for answering my questions.