Notes on Analysis Services Performance and Parallelism
As a part of a semantic model assessment, I got to experiment with different configurations of Azure Analysis Services. In this case, the largest fact table has almost 2 billion rows so it’s a good size model. I was particularly interested the correlation between a higher performance tier and query performance. One thing that I like about PaaS is that it’s easy to change and experiment with different configurations. The thing that I don’t like is that I can’t peek under the hood. This is probably done on purpose to keep people like me out, just like these green nets surrounding construction zones. So, I don’t know what hardware AAS is running on and how it’s configured. And I failed to get clarification from Microsoft. So, I might be out of line here, but I thought my findings are worth sharing albeit not officially vetted.
AAS Basic and Standard tiers are advertised as dedicated tiers running on dedicated virtual machines preconfigured with a certain number of cores that are a 20x multiple of the tier QPUs. For example, S2 tier (200 QPUs) comes with 10 cores while S9 (640 QPUs) comes with 32 cores. If you examine the advanced properties of the AAS instance in SSMS, you’ll see Group Affinity masks that confirms this. For example, the mask for S2 is x3FF (or xFFC) which converted to binary is 1111111111, whereas the mask for S9 is FFFFFFFFFFFF for 32 1’s. I don’t know why masks are used given that the VM has already a predefined number of cores. One, and most probable, explanation is that Microsoft co-hosts multiple Analysis Services instances on the same VM. When you provision a new AAS instance, Azure finds a VM that has enough capacity and spins the new service there. If you upgrade AAS, it could get relocated to a different VM if the old one doesn’t have enough capacity. Or, Microsoft might allocates more cores to the VM and a subset to AAS.
Anyone who’s done parallel programming knows that it’s not easy. Safe access and locks need to be used to synchronize access and protect shared resources. The AS xVelocity engine (aka Vertipaq) is a multi-threaded application that accesses RAM as a shared resource. As each thread competes for a global lock for memory allocation/deallocation, scalability decreases. Because processor cores reserve memory in chunks, called cache lines, additional synchronization is required when threads access memory location that are close to each other. Before SQL Server 2016 SP1, this was a serious issue with Tabular. As we’ve learned, starting with SQL Server 2016 SP1, Microsoft switched to using the Intel Threading Building Clocks (TBB) C++ library. Specifically, Tabular now uses the TBB scalable memory allocator. Mind you that TBB is not a CPU-specific extension so any modern Intel CPU should get these benefits. “Analysis Services SP1 uses an Intel TBB-based scalable allocator that provides separate memory pools for every core. As the number of cores increases, the system can scale almost linearly” and “The Intel TBB-based scalable allocator is also expected to help mitigate performance problems due to heap fragmentation that have been shown to occur with the Windows Heap”. Further, starting with SP1, Tabular is NUMA aware with 4-node NUMA system but I don’t know if the AAS VMs are preconfigured for NUMA.
These changes seem to be very beneficial and I do see the promised linear scalability (or close to it) as the number of cores increase. After all, how would you explain a customer that the query times remain the same when they switch to a higher tier? I used this query as a sample query to measure the memory scan performance (the filter is used to prevent Tabular short-circuiting the count by using internal statistics):
EVALUATE CALCULATETABLE (ROW ( “result”, DISTINCTCOUNT ( ‘Table'[day_id] ) ), ‘Table'[day_id] > 0)
The day_id column has a 2.5 GB of total column size with 875 distinct values. The query execution time decreased 2.5 times when switching from S2 to S9, while the storage engine scan time decreased 3 times. Of course, more complicated queries would carry additional overhead in the formula engine which as we know is single threaded.
AAS Tier | Storage Engine Parallelism | Storage Engine Time (ms) | Total Execution Time (ms) |
S2 (200 QPUs) | 9x | 1,563 | 1,922 |
S9 (640 QPUs) | 29x | 516 | 812 |
Tabular does an excellent job parallelizing the storage engine queries and maxing out all available cores in short bursts of time. Staring with SQL Server 2016 SP1, you should see substantial performance gains as the number of cores increase if the query is storage engine bound (which it will probably be with larger models).
I plan to update this blog when results from official load testing we plan to do with this semantic model are available.