Improving Tabular Design Experience

When you develop an organizational Tabular model in SSDT, there is an implicit processing phase for each action you perform, such as renaming columns, creating hierarchies, changing formatting, and so on. This “data-driven” paradigm can be both a blessing and a curse. A blessing, because you always work with data and you don’t have to explicitly process the model to see the effect of your changes. And a curse, because each time you make a change the UI blocks until the change is committed, and this gets old pretty quickly.

Note: As long as you don’t use the Table Properties dialog or explicitly refresh the data, all changes are done directly in the Tabular database and Tabular doesn’t re-query the data source to refresh the data.

While waiting for Microsoft to make the necessary changes, here are a few tips to improve your Tabular design experience:

  1. In my experience, the size of the dataset doesn’t affect the duration of the “refresh” step. However, I always use a small dataset during development for faster processing. To do this, if you have a large table, partition the table in Tabular and load only one (e.g. the smallest) partition. If you have already loaded all partitions, you can clear all but one with Process Clear. Now, you have a small dataset to work with during development. Once you deploy to the QA or production server, you can process all partitions.
  2. Disable Automatic Calculation – To do this, go to the Model menu and click Calculation Options -> Manual Calculation. Sometimes (depending on model/calculation complexity) disabling automatic calculation in SSDT may help make modeling operations more responsive. To update the calculated columns, once you’re done with the changes, do Model->Calculate Now. Thanks to Marius for this tip.
  3. While the size of the dataset doesn’t affect the refresh duration, the hardware configuration of your development machine does. Suppose you have an underpowered company laptop and a more powerful personal laptop (lots of memory, a solid-state drive, many cores, and so on). If your company policy allows using your personal laptop, follow these steps to temporarily switch development to it during the change-intensive part of the design process:
    1. Copy the project source from the company laptop to your personal laptop.
    2. Back up and restore the workspace database to your local Analysis Services Tabular instance. If you follow my first tip, the workspace database should be fairly small.
    3. Load the project in SSDT and double-click the Model.bim file to initialize your environment. This will create a second, empty workspace database on your local Tabular instance. Close the solution in SSDT. Now, open the Model.bim_<your name>.settings file in Notepad, find the <Database> element, and change it to the name of the original workspace database. Open the solution in SSDT again. You should now see the data in your original workspace database.
    4. Perform the design changes. As a comparison, it takes about 5 seconds to commit a change on my laptop vs. 15 seconds on an underpowered laptop.
    5. Once you’re done with the changes, replace the Model.bim and Model.bim.layout files on your company’s laptop.

When Developers and BI Collide

I’ve been running into this situation quite often, so I thought it would make a good topic for a blog post.

Scenario: Management has asked for some sort of a BI solution, such as a dashboard. BI hasn’t happened to the organization in question yet. But they have smart developers and there is no project they can’t do. As the story goes, the developers go to work and whip out some code… lots of it. Code for doing ETL, code for the database layer, and code for implementing the presentation layer, such as a custom web application with cool third-party widgets. Everyone is happy… at least for a while.

I don’t have an issue with developers. In fact, I spent most of my career writing code. However, there are several pitfalls with this approach so let’s mention some of them:

  1. Custom code is expensive to write and maintain – Developers are pulled away from their primary tasks to implement the BI solution. When the initial enthusiasm wears off, managers find it difficult to allocate developers to extending and maintaining the BI solution.
  2. You end up with a proprietary, tightly-coupled solution – The solution would probably meet the initial goals but it might be difficult to extend. What if the custom web application is not enough and users prefer to analyze data in another tool, such as Microsoft Excel? Where are business calculations defined? How do you handle security?
  3. Not the right tools – There are many scenarios when writing custom code makes sense, but BI is not one of them. There are specialized tools that are geared specifically toward BI. If you write custom code, more than likely you’ll face performance and maintenance challenges in the long run.
  4. Not a best practice – Most projects start simply but grow in complexity over time. It’s not uncommon for management to ask for more features, such as analyzing data by other subject areas or drilling to details. What started as a dashboard project might evolve into an end-to-end BI solution that requires a data warehouse, an analytical layer, and different presentation options.
  5. Coders are not BI pros – I hate to say this, but even the smartest programmers don’t know much or care about BI. More than likely, you’ll end up with a normalized database and summary tables for improving performance with aggregates. Again, there are tools and methodologies for BI, so there is no point reinventing the wheel. If you don’t know better, hire someone who does. The “when all you have is a hammer, everything looks like a nail” paradigm won’t take you very far.

Solution: I’m sure you can add to the list, but what’s the solution? When you hear about BI or its manifestations, such as dashboards, reporting, analytics, etc., the following architecture should immediately come to mind.

[Figure: the Microsoft BI architecture diagram]

This architecture is somewhat simplified. For example, it doesn’t show a staging database but it has the main pieces and their corresponding technologies in the Microsoft BI platform:

  1. A data warehouse whose schema is designed with reporting in mind.
  2. Integration Services packages for ETL processes. They still have to be maintained, but I dare to declare that maintaining SSIS is much easier than maintaining custom code. For example, it doesn’t require coding skills, it scales well, and it has a comprehensive logging infrastructure.
  3. An analytical layer, either as a multi-dimensional cube or a tabular model. This is the most overlooked piece but it’s the most important for the reasons I explained in my “Why an Analytical Layer?” blog.
  4. Finally, just like a car manufacturer, you should strive to assemble your solution with prefabricated parts instead of writing custom code. For example, you can implement very compelling dashboards with Power View that uses your analytical layer as a data source without having to write a single line of code. Not to mention that you can delegate this task to business users.

Now, with custom code you can do anything, including features that you can’t get out of the box with prepackaged BI tools. However, you’ll be surprised how willing your management might be to compromise on features, especially in this economy.

Adding a Set Total

Scenario: You have defined an MDX set in the cube script, such as a set that returns a few months. For the sake of simplicity, we will hardcode the months.

CREATE SET CURRENTCUBE.[12Months] AS
{
    [Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]
};

As useful as the set might be, it has an issue. It doesn’t return the total and users might complain about it:

[Figure: Excel report showing the set without a grand total row]

Solution: As a first try, you might attempt to add the [All] member to the set:

CREATE SET CURRENTCUBE.[12Months] AS
{
    [Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]
    +
    [Date].[Month].[All]
};

You’re on the right path. Now you’ll get the grand total row in Excel, but it shows the unrestricted (non-visual) total instead of the total across the set members only. Instead, use the VisualTotals() function:

CREATE DYNAMIC SET CURRENTCUBE.[12Months] AS
{
    [Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]
    +
    VisualTotals
    (
        {
            [Date].[Month].[All],
            [Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]
        }
    )
};

 

[Figure: Excel report showing the grand total adjusted to the set members]

Here, the VisualTotals function returns the [All] member with its total adjusted to the set members only. Notice also that I’ve changed the set definition to be dynamic (DYNAMIC keyword) to force re-evaluation of the set.

In a slightly more complicated scenario, what if you want the total to appear as an additional member of the set as opposed to a grand total row? One implementation approach would be to add a dimension calculated member to the Month hierarchy that does the aggregation and then add the new member to the set:

CREATE MEMBER CURRENTCUBE.[Date].[Month].Agg AS
    Aggregate([Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]);

CREATE SET CURRENTCUBE.[12Months] AS
{
    [Date].[Month].&[201302].Lag(11):[Date].[Month].&[201302]
    +
    [Date].[Month].Agg
};

 

Optimizing Massive SQL Joins

Scenario: You run ETL to perform a full data warehouse load. One of the steps requires joining four biggish tables in a staging database with 1:M logical relationships. The tables have the following row counts:

VOUCHER: 1,802,743

VOUCHER_LINE: 2,183,469

DISTRIB_LINE: 2,658,726

VCHR_ACCTG_LINE: 10,242,414
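
To make the discussion more concrete, here is a hedged sketch of the kind of join involved. The join columns (VOUCHER_ID, VOUCHER_LINE_NUM) are assumptions for illustration only, not necessarily the actual staging schema:

-- Hypothetical shape of the problem query; the real column list and join keys will differ
SELECT COUNT(*) AS RowCnt
FROM dbo.VOUCHER AS v
JOIN dbo.VOUCHER_LINE AS vl
    ON vl.VOUCHER_ID = v.VOUCHER_ID
JOIN dbo.DISTRIB_LINE AS dl
    ON dl.VOUCHER_ID = vl.VOUCHER_ID
    AND dl.VOUCHER_LINE_NUM = vl.VOUCHER_LINE_NUM
JOIN dbo.VCHR_ACCTG_LINE AS val
    ON val.VOUCHER_ID = dl.VOUCHER_ID
    AND val.VOUCHER_LINE_NUM = dl.VOUCHER_LINE_NUM;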

Observations: On the development server, the SELECT query runs for hours; on the UAT server, it finishes within a few minutes. Both servers have the same data and hardware configuration, running SQL Server 2012 SP1.

Solution: Isolating the issue and coming up with a solution wasn’t easy. Once we ruled out resource bottlenecks (both servers have similar configuration and similar I/O throughput), we took a look at the estimated query plan (we couldn’t compare with the actual execution plan because we couldn’t wait for the query to finish on the slow server).

We noticed that the query plans were very different between the two servers. Specifically, the estimated query plan on the fast server included parallelism and hash match join operators, while the slow server had many-to-many (M:M) merge joins. A many-to-many merge join requires a tempdb worktable for inner-side rewinds, which surely can cause performance degradation.

[Figure: estimated query plan on the slow server showing the many-to-many merge joins]

Interestingly, the cardinality of the tables and the estimated number of rows didn’t change much between the two plans. Yet, the query optimizer chose very different plans. At this point, we figured this could be an issue with statistics, although both servers were configured to auto-update statistics (SQL Server auto-updates statistics only after modifications to roughly 20% of the rows in a table). The statistics on the slow server probably just happened to have a sample distribution that led the optimizer down a path that ended up with a serial plan instead of a parallel plan. Initially, we tried sp_updatestats, but we didn’t get an improvement. Then, we ran UPDATE STATISTICS <table name> WITH FULLSCAN on the four tables. This resolved the issue, and the query on the slow server now executes on par with the query on the fast server.
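
For reference, here is roughly what we ran, written against the dbo schema (the schema name is an assumption on my part):

-- Sampled statistics update (what we tried first; no improvement in our case)
EXEC sp_updatestats;

-- Full-scan statistics update on the four tables in the join (this resolved the issue)
UPDATE STATISTICS dbo.VOUCHER WITH FULLSCAN;
UPDATE STATISTICS dbo.VOUCHER_LINE WITH FULLSCAN;
UPDATE STATISTICS dbo.DISTRIB_LINE WITH FULLSCAN;
UPDATE STATISTICS dbo.VCHR_ACCTG_LINE WITH FULLSCAN;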

Note: Updating statistics with full scan is an expensive operation that probably shouldn’t be in your database maintenance plan. Instead, consider:

1. Stick with default sampled statistics

2. Try hints for specific queries that exhibit slow performance, such as OPTION (HASH JOIN, LOOP JOIN), to preclude the expensive merge joins, as in the sketch below.
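
Applied to the hypothetical join sketched earlier, the hint simply goes at the end of the statement:

-- The hint restricts the optimizer to hash and nested loop physical joins for this query only
SELECT COUNT(*) AS RowCnt
FROM dbo.VOUCHER AS v
JOIN dbo.VOUCHER_LINE AS vl
    ON vl.VOUCHER_ID = v.VOUCHER_ID -- assumed join key
OPTION (HASH JOIN, LOOP JOIN);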

Special thanks to fellow SQL Server MVPs, Magi Naumova, Paul White, Hugo Kornelis, and Erland Sommarskog for shedding light in dark places!

CU2 for SQL Server 2012 SP1

A while back, Microsoft released Cumulative Update 2 for SQL Server 2012 SP1. Among other things, it includes a fix for the “wide” MDX query performance degradation issue with Analysis Services that I previously reported.

  • Knowledge base for the performance issue
  • Knowledge base with SP1 CU2 download link

Nevron Chart for Reporting Services

Nevron was kind enough to show me their chart and gauge for Reporting Services. In certain areas, their products exceed the charting capabilities of Reporting Services. Here are the highlights:

  1. The product provides support for SQL Server Reporting Services (SSRS) 2005, 2008, 2008R2 and SQL Server Data Tools (SSDT) 2012.
  2. Advanced Chart Designer
  3. Complete set of 2D and 3D Charting Types
  4. Support for Code Customization – Nevron Chart for Reporting Services supports customization through C# code, which allows you to use the full Nevron Chart for .NET API in SSRS and achieve a higher degree of customization.


  5. Expressions Everywhere – Nearly all properties of the Chart report item can now be specified by expressions. Appearance styles (fill, stroke, shadow, and text) are also reworked to support both constant values and expression-based values. This allows you to bind nearly all configurable aspects of the report item to user-defined parameters and/or data.
  6. Support for 2D/3D Combo Charts – Combo charts are used to combine certain series types within the same chart area.
  7. XML Formatted Texts – This feature is applicable to all elements that display text. It allows you to mix fonts; mix bold, italic, and underline formatting; and use many other advanced features such as different fill styles, shadows, and image filters – all inside a single label, title, and so on.
  8. Non-overlapping Data Labels Layout – All Cartesian chart types support automatic non-overlapping data label layout, which prevents labels from overlapping in both ordinal and scatter charts and works in 2D and 3D appearance modes.
  9. Advanced Axis Model – A complete set of axis-related features designed to address even the most demanding charting requirements.


 

The online demo is available at: http://examplesssrsvision.nevron.com/

 

SQL Server 2012 Semantic Search

Suppose your company has a web application that allows customers to enter comments, such as to provide feedback about their product experience. How do you derive knowledge from these comments? Perhaps, customers are complaining a lot about a particular product. Or, they are raving about your service. Enter semantic search – a new feature in SQL Server 2012.

Building upon full-text search, semantic search allows you to search not only for words, but also for the meaning (semantics) of those words. Mark Tabladillo, Ph.D., gave us a great presentation at our Atlanta BI January meeting. He demoed how semantic search can be used to find the most common phrases in a document and how to find similar documents for a given search criterion.

Besides external documents, you can apply semantic search to text data already stored in a database, such as a text-based Comments field. Again, this is possible because semantic search builds upon the full-text search capabilities of SQL Server. So, any column that supports full-text search can be enhanced with semantic search.

Going back to my scenario, here is what it takes to configure a text field for semantic search:

  1. When you use semantic search, SQL Server performs statistical analysis of the words in the column contents. This analysis requires base data that is provided as a SQL Server database. Because the database is not installed by default, as a prerequisite for using semantic search you need to run the semantic database installer. The installer can be found on the SQL Server setup disk in the following folders:
    For x86, the setup is \x86\Setup\SemanticLanguageDatabase.msi
    For x64, the setup is \x64\Setup\SemanticLanguageDatabase.msi
  2. The setup simply extracts the MDF and LDF files of the semantic database to a folder that you specify during the installation. Next, attach the semantics database to your SQL Server 2012 instance as you would any other database.
  3. The next step also needs to be performed only once for each SQL Server instance. Register the semantics database using this command:

EXEC sp_fulltext_semantic_register_language_statistics_db @dbname = 'SemanticsDB'
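
To recap steps 2 and 3 in T-SQL, here is a minimal sketch. The folder and file names below are assumptions based on a typical extraction; use whatever location you picked during the semantic database setup:

-- Step 2: attach the extracted semantic language statistics database
CREATE DATABASE SemanticsDB
ON (FILENAME = 'C:\SemanticDB\semanticsdb.mdf'),
   (FILENAME = 'C:\SemanticDB\semanticsdb_log.ldf')
FOR ATTACH;

-- Step 3: after registering the database, verify the registration
SELECT * FROM sys.fulltext_semantic_language_statistics_database;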

4. Next, create a full-text index using the Statistical_Semantics option, which is new in SQL Server 2012.

CREATE FULLTEXT INDEX ON [Production].[ProductReview]
(
    [Comments] LANGUAGE [English] Statistical_Semantics
)
KEY INDEX [PK_ProductReview_ProductReviewID] ON ([AW2008FullTextCatalog], FILEGROUP [PRIMARY])
WITH (CHANGE_TRACKING = AUTO, STOPLIST = SYSTEM)

This command creates a semantic-enabled full-text index on the Comments column in the Production.ProductReview table on the AW2008FullTextCatalog full-text catalog, which is included with the AdventureWorks2012 database. The CHANGE_TRACKING = AUTO clause instructs SQL Server to auto-update the index, and STOPLIST = SYSTEM specifies that the default full-text system stoplist should be used to parse words.
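
Populating the full-text index can take a while on larger tables. One quick way to check whether the catalog is still populating is the PopulateStatus property (0 means idle):

-- Returns 0 when the AW2008FullTextCatalog catalog is idle (population finished)
SELECT FULLTEXTCATALOGPROPERTY('AW2008FullTextCatalog', 'PopulateStatus') AS PopulateStatus;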

5. Now that you’ve done the groundwork, you can perform the semantic search. For example, the following query returns the top key phrases and the documents in which they were found, ranked by score:

SELECT TOP(5) KEYP_TBL.document_key, KEYP_TBL.keyphrase, KEYP_TBL.score
FROM SEMANTICKEYPHRASETABLE
(
    [Production].[ProductReview],
    [Comments]
) AS KEYP_TBL
ORDER BY KEYP_TBL.score DESC

[Figure: key phrase query results]

Note that the document_key matches the ProductReviewID column (primary key) of the [Production].[ProductReview] table allowing you to match the semantic search results to the document. For example, “socks” and “trip” were found in the row with the primary key of 1 whose Comments column contains the following text:

“I can’t believe I’m singing the praises of a pair of socks, but I just came back from a grueling 3-day ride and these socks really helped make the trip a blast. They’re lightweight yet really cushioned my feet all day. The reinforced toe is nearly bullet-proof and I didn’t experience any problems with rubbing or blisters like I have with other brands. I know it sounds silly, but it’s always the little stuff (like comfortable feet) that makes or breaks a long trip. I won’t go on another trip without them!”

You can feed these results to a data mining model if you want to learn which phrases customers tend to use together similar to identifying what products customers tend to purchase together (market basket analysis). “Together” here means within a specific document.

What if you want to get the most popular phrases across all comments? I think the following query should help although there might be a better way. The query groups phrases and sorts them in a descending order by the average score.

SELECT TOP(5) KEYP_TBL.keyphrase, AVG(KEYP_TBL.score) AS Score
FROM SEMANTICKEYPHRASETABLE
(
    [Production].[ProductReview],
    [Comments]
) AS KEYP_TBL
GROUP BY KEYP_TBL.keyphrase
ORDER BY AVG(KEYP_TBL.score) DESC;

 

[Figure: top key phrases across all comments]
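
Mark also demoed finding similar documents. Along the same lines, here is a minimal sketch that uses the SEMANTICSIMILARITYTABLE function to return the reviews most similar to a given review (I picked ProductReviewID 1 as the source document purely for illustration):

-- Top five reviews whose Comments are semantically similar to review 1
SELECT TOP(5) SIM_TBL.matched_document_key, SIM_TBL.score
FROM SEMANTICSIMILARITYTABLE
(
    [Production].[ProductReview],
    [Comments],
    1 -- source document key (ProductReviewID)
) AS SIM_TBL
ORDER BY SIM_TBL.score DESC;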

Gartner’s 2013 Data Warehouse Magic Quadrant

Fellow MVP Andrew Brust discusses Gartner’s newly released data warehouse (not BI, mind you) Magic Quadrant. It looks like Gartner fell in love with Teradata, but Microsoft is moving up as well, thanks to the advances in its Parallel Data Warehouse (PDW) and in-memory offerings.

[Figure: Gartner 2013 Data Warehouse Magic Quadrant]

 

Performance Degradation with the Aggregate Function and Unrelated Dimensions

I ran into a severe performance issue with the Aggregate function and unrelated dimensions, which I reported on Connect. In this context, an unrelated dimension is a dimension that doesn’t join any measure group in the cube. Why have unrelated dimensions? My preferred way to implement time calculations is to use a regular dimension that is added to the cube but not related to any measure group. Then, I use scope assignments to overwrite the time calculations, e.g. YTD, MTD, and so on. For example, the scope assignment for YTD might look like this:

Scope ([Relative Date].[YTD]);

    this = Aggregate
           (
               {[Relative Date].[Current Period]} * PeriodsToDate([Date].[Fiscal].[Fiscal Year], [Date].[Fiscal].CurrentMember)
           );

End Scope;

Notice the use of the Aggregate function, which, when executed, maps to the default aggregation function of the underlying measure. For some reason, with SQL Server 2012 a query that uses the Relative Date dimension experiences a significant performance hit. Replacing Aggregate with Sum fixes the issue, assuming you can sum up the affected measure to produce the time calculations.