Operational BI with Azure Stream Analytics

There is a lot of talk nowadays about Internet of Things (IoT). According to Gartner, there will be nearly 26 billion IoT devices by 2020. Naturally, the data generated by these devices needs to processed and analyzed, very often in real time. Indeed, an increasing number of customers need real-time (operational) analytics performed over a stream of events, such as data coming from sensors, barcode readers, social streams, and all sorts of other devices. Currently, .NET developers could use SQL Server StreamInsight to implement on-premise custom CEP (complex event processing) solutions. However, implementing StreamInsight-based applications is not easy as it requires solid .NET and LINQ skills.

Today, Microsoft announced the public preview of the Azure Stream Analytics service that allows organizations to perform stream analytics in the cloud. What’s interesting is that Microsoft made a significant effort to simplify CEP with the promise that “you can be up and running in minutes”. To that end another cloud service, Azure Event Hubs, simplifies the process of intercepting (sinking) events. And, instead of using .NET LINQ, developers can use Stream Analytics Query Language which has a SQL-like syntax for coding standing queries over the event streams, such as:

SELECT DateAdd(second,-5,System.TimeStamp) as WinStartTime, system.TimeStamp as WinEndTime, DeviceId, Avg(Temperature) as AvgTemperature, Count(*) as EventCount FROM input GROUP BY TumblingWindow(second, 5), DeviceId

At the same time, Azure Stream Analytics preserves the advanced features of StreamInsight, such as windowing. The results of the standing queries can be saved to Azure SQL Database, Azure Blob storage, and Azure Event Hub, for further analysis, such as by using Excel. For more information about Azure Stream Insight and to subscribe for the public preview, visit the service home page.

103014_0141_Operational1

I’m excited and expect to see a lot of interest around the Azure Stream Analytics service. If this sounds interesting and you need help, as a Microsoft Gold Partner and premier BI firm, Prologika can help you get started in a cost-effective way, such as by using your Software Assurance vouchers to deliver consulting services around data analytics, such as to implement a POC.

Atlanta MS BI Group Meeting on Oct 27th

Join us on Monday, October 27th for our next meeting of Atlanta MS BI Group to learn about predictive analytics and how to actually do it.

Presentation: Mine Craft
  Level: Intermediate
Date: Monday, October 27th, 2014
Time 6:30 – 8:30 PM ET
Place: South Terraces Building (Auditorium Room)

115 Perimeter Center Place

Atlanta, GA 30346

Overview: Why you should be mining your data and how to actually do it. Every company needs a rock star. We want it to be you. This session will give real world examples of data mining successes as well as walk you through how to get started down the path of data enlightenment, so that you too can say “I Am A Data Miner℠”.
Speaker: Mark Tabladillo provides enterprise data science analytics advice and solutions. He uses Microsoft Azure Machine Learning, Microsoft SQL Server Data Mining, SAS, SPSS, R, and Hadoop (among other tools). He works with Microsoft Business Intelligence (SSAS, SSIS, SSRS, SharePoint, Power BI, .NET). Mark has a national leader in analytics and data science (data mining and machine learning) through conference speaking and instructional leadership since 1998. He connects with people on LinkedIn and Twitter @marktabnet.

David McFarland is a Senior Manager Business Intelligence with RentPath, Inc. David spends the vast majority of his day trying to get out of useless meetings. He has no certifications whatsoever and is pretty sure Microsoft has no idea who he is, except when it’s time to renew enterprise software agreements.

Sponsor: Tegile
With demand growing for bigger data and faster service, your data storage choices can make or break your business. Accelerate your business with Tegile’s all-flash and hybrid storage solutions.

Sybase Integration

A Major League Baseball team engaged us to implement the foundation of their data analytics platform. They partner with TicketMaster for ticketing and sales. Interestingly, besides the TicketMaster cloud hosting that everyone is familiar with, Ticketmaster also offers a client application called Archtics to allow customers to sell tickets on premise. The client application uses a Sybase database (as you probably know, Sybase was acquired by SAP) that syncs with the host. Fortunately, Sybase and SQL Server has a lot in common but in the process we had to figure out a way to pull data from the Sybase database. To do so, you need to follow these steps:

  1. Install the SQL Anywhere Database Client Download. During development, you will need the 32-bit driver because BIDS/SSDT is a 32-bit app. When running via SQL Server Agent, you’ll need the 64-bit driver.
  2. Create an ODBC data source. Again you need to create two ODBC data sources using the 64-bit ODBC Administrator and then 32-bit ODBC Administrator.
  3. Use the ODBC Source in SSIS to extract data from Sybase. Unfortunately, the ODBC Source doesn’t support parameterized statements, such as to extract data incrementally. As a workaround, you can use an expression-based SQL command text. You can do this by clicking on the Data Flow task in the package control flow and setting up an expression for the SQLCommand property.

    101914_2246_SybaseInteg1

Looking for Talent

We have two contract positions in Alpharetta, GA. Contact us at info@prologika.com if you are interested.

BA ANALYST (4-MONTH CONTRACT)

Reviews, analyzes, and evaluates business systems and user needs. Documents requirements, defines scope and objectives, and formulates systems to parallel overall business strategies. May require a bachelor’s degree in a related area and 4-6 years of experience in the field or in a related area. Familiar with relational database concepts, and client-server concepts. Relies on experience and judgment to plan and accomplish goals. Performs a variety of complicated tasks. May lead and direct the work of others. A wide degree of creativity and latitude is expected. Typically reports to a manager.

CLIENT-SERVER TABLEAU DEVELOPER (1 YEAR CONTRACT)

Reviews, analyzes, and modifies programming systems including encoding, testing, debugging and installing to support an organization’s client/server software applications. May require a bachelor’s degree in a related area and 4-6 years of experience in the field or in a related area. Familiar with relational database concepts, and client-server concepts. Relies on experience and judgment to plan and accomplish goals. Performs a variety of complicated tasks. May lead and direct the work of others. Typically reports to a project leader or manager. A wide degree of creativity and latitude is expected. Specialty in Tableau dashboards/reporting.

Applied Excel and Analysis Services e-Learning Course Available

BI solutions typically include a semantic layer. In Microsoft BI, the role of a semantic layer is fulfilled by Multidimensional cubes or Tabular models, whose virtues I extolled in this newsletter. If you invested in an SSAS model already, congratulations! You’ve done the lion’s share. But as you know, the next step is to train your business users and get them all excited about BI. This is not a simple task. And, if you plan to use Excel to offload reporting effort, the task is even more difficult because Excel is packed with BI features. This is why I put together a recorded e-learning “Applied Excel and Analysis Services” class. Having more than 5 hours of video content, the class sells for only $120 per student. And, if you use coupon PROLOGIKA-EXCEL-1 by end of October, you can get it for only $90.

To BI managers and BI developers: You’ve implemented an organizational Analysis Service model. Now you need to choose a tool for interactive data analytics and train your users. As you know, there is a proliferation of Business Intelligence tools on the market and each claims to solve your challenges. But the chances are that you already have what you need – Microsoft Excel. And, as far as the documentation goes, who has time to document all Excel BI features and demo them to users? I designed this class to help you empower your users and get them excited about BI with Excel and Power Pivot.

To business users: If your organization have Analysis Services Multidimensional cubes or Tabular models and you want to gain valuable insights from them, then this course is for you. Designed as a step-by-step tour, this course teaches you how to become a data analyst and unlock the hidden power of data. You’ll learn how to apply the Excel desktop BI capabilities to create versatile reports and dashboards for historical and trend analysis. You’ll learn also how to share your BI artifacts across the organization by publishing them to SharePoint. “I never knew Excel can do this” is the most common feedback we hear from our students.

The class curriculum and promo video is also available on the Prologika website.

Enjoy!

Optimizing Distinct Count Excel Reports

I wonder how many people believe that Tabular DistinctCount outperforms Multidimensional judging by Excel reports alone. In this case, an insurance company reported a performance degradation with Excel reports connected to a multidimensional cube. One report was taking over three minutes to run and it was requesting multiple fields on rows (insured, insured state, insured city, policy number, policy year, underwriter, and a few more) and about a dozen measures, including several distinct count measures, such as claim count, open claim count, and so on. The report would only need subtotals on three of the fields added to the ROWS zone. The cube had about 20 GB a disk footprint so the data size is not the issue here. The real issue is the crappy MDX queries that Excel auto-generates because they are asking for subtotals for all fields added to ROWS, using the following pattern:

NON EMPTY CrossJoin(CrossJoin(CrossJoin(CrossJoin(CrossJoin(CrossJoin(CrossJoin(CrossJoin(

Hierarchize({DrilldownLevel({[Insured].[Insured Name].[All]},,,INCLUDE_CALC_MEMBERS)}),

Hierarchize({DrilldownLevel({[Insured].[Insured City].[All]},,,INCLUDE_CALC_MEMBERS)})),

Hierarchize({DrilldownLevel({[Insured].[Insured State].[All]},,,INCLUDE_CALC_MEMBERS)})),

Hierarchize({DrilldownLevel({[Policy Effective Date].[Year].[All]},,,INCLUDE_CALC_MEMBERS)})),

Hierarchize({DrilldownLevel({[Policy].[Natural Policy Key].[All]},,,INCLUDE_CALC_MEMBERS)})),…

As you can see, the query requests the ALL member of the hierarchy. By contrast, a smarter MDX query generator would request subtotals on the fields that need subtotals only. For example, a rewritten by hand query executes within milliseconds following this pattern:

Hierarchize({DrilldownLevel({[Insured].[Insured Name].[All]},,,INCLUDE_CALC_MEMBERS)}) *

Hierarchize({DrilldownLevel({[Insured].[Insured City].[Insured City].Members},,,INCLUDE_CALC_MEMBERS)})) *

Hierarchize({DrilldownLevel({[Insured].[Insured State].[Insured State].Members},,,INCLUDE_CALC_MEMBERS)}))…

But we can’t change the queries Excel generates and we are at the mercy of the MDX query generator. And, the more fields the report requests, the slower the query would be. DistinctCount measures aggravate the issue further. The problem is that the DC measures cannot be aggregated from caches at deeper levels. Therefore, increasing the number of granularities in the query increases the number of subcubes that are requested from the storage engine, and they’re not going to hit earlier subcubes unless they match at the exact granularity – which is unlikely when the query results are not cached. And at some point, the doubled subcube count will trigger the query degradation (you will see many “Getting data from partition” events in the Profiler). Many of these subcubes are really needed, but some of them are generated for subtotals that Excel doesn’t really need.

I actually logged this issue more than three years ago but the Office team didn’t bother. The original bug was with Power Pivot but the issue was the same. To Microsoft’s credit, the SSAS team introduced an undocumented and unsupported PreferredQueryPatterns setting for both Multidimensional and Tabular, which can be set in msmdsrv.ini (ConfigurationSettings\OLAP\Query\PreferredQueryPatterns). I don’t think it can be set in the connection string. Excel discovers when PreferredQueryPatterns is set to 1 and generates different (drilldown) query pattern instead of the original (crossjoin) pattern. Unfortunately, it looks like more work and testing were done on the Tabular side of things where PreferredQueryPatterns is actually set by default to 1 (although you won’t see it in msmdsrv.ini). I tried a Tabular version of the customer’s cube (only a subset of tables loaded with the biggest table about 50 mil rows fact snapshot and a few distinct count measures) to test with similar Excel queries. With the default configuration (PreferredQueryPatterns=1), Tabular outperformed MD by far (queries take about 3-5 seconds). Initially, I thought that Tabular fares better because of its in-memory nature. Then, I changed PreferredQueryPatterns to 0 on the Tabular instance and reran the Tabular test to send queries with the crossjoin pattern. Much to my surprise, Tabular performed worse than the original MD queries.

PreferredQueryPatterns is 0 by default with Multidimensional due to concerns over possible performance regressions. Indeed, my tests with setting PreferredQueryPatterns to 1 on MD, caused ever-increasing memory utilization until the server ran out of memory so unfortunately it was unusable for this customer. If customer approves, I plan to log a support case. Ideally, the Office team should fix this by auto-generating more efficient MDX queries. If no help on that end, the SSAS team should make PreferredQueryPatterns work with MD. BTW, I was able to optimize somewhat the MD reports by using member properties instead of attributes (from 3 min query execution time went down to 1 min) but that was pretty much the end of the optimization path.