To Partition or Not – This is the Question
As Analysis Services users undoubtedly know, partitioning and aggregations are the two core tenets of a good Analysis Services data design. Or, at least they have been since its first release. A lot has changed since then, though. Disks and memory have become faster and cheaper, and Analysis Services has been getting faster and more intelligent with each new release. You should catch up with these trends, of course, and rethink some old myths, the most prevalent being that you must partition and add aggregations to every cube to get faster queries.
In theory, partitioning reduces the data slice the storage engine needs to cover, and the smaller the data slice, the less time the SE will spend reading data. But there is a great deal of confusion about partitioning throughout the BI land. Folks don't know when to partition or how to partition. They just know that they have to do it because everybody does it. The guidance from the Analysis Services team contributes somewhat to this state of affairs. For example, the Microsoft SQL Server 2005 Analysis Services Performance Guide advocates a partitioning strategy of 15-20 million rows per partition, or 2 GB as a maximum partition size. The Microsoft SQL Server 2008 Analysis Services Performance Guide, however, is largely silent on the subject of performance recommendations for partitioning. The only thing you will find is:
For nondistinct count measure groups, tests with partition sizes in the range of 200 megabytes (MB) to up to 3 gigabytes (GB) indicate that partition size alone does not have a substantial impact on query speeds.
Notice the huge data size range – 200 MB to 3 GB. Does this make you doubtful about partitioning for faster query times? And why did they remove the partition size recommendation? As it turns out, folks with really large cubes (billions of rows) would follow the old recommendation blindly and end up with thousands of partitions! This is an even worse problem than having no partitions at all, because the server wastes time managing all these partitions. Unfortunately (or fortunately), I've never had the chance to deal with such massive cubes. Nevertheless, I would like to suggest the following partitioning strategy for small to mid-size cubes:
- Start with no partitions.
- Use XPerf (see my Using XPerf to Test Analysis Services blog) to test a few queries that aggregate data only (no MDX functions or script involved, and no aggregations) on a cold cache – see the XMLA sketch after this list for clearing the caches first.
- Analyze the I/O time, that is, the time spent reading data from disk. This is the time that partitioning may help you reduce.
- Compare this time with the overall query time. As I explained in that blog, the storage engine is responsible for both retrieving and aggregating data. You may very well find that most of the query time is spent aggregating data at the grain requested by the query and only a small portion is spent reading data from disk.
- If the I/O time is substantial, partition your measure groups and compare times again. Otherwise, forget about partitions and see if aggregations can help you reduce the time spent in aggregating data.
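For the cold-cache runs, you need to clear the Analysis Services caches before each test (and the Windows file system cache as well, which the XPerf blog covers). Here is a minimal XMLA sketch, assuming a hypothetical database named Adventure Works DW – substitute your own DatabaseID:

```xml
<!-- Clears the SSAS formula and storage engine caches so the next
     query has to read data from disk. Run it from an XMLA query
     window in SSMS. The DatabaseID below is hypothetical. -->
<ClearCache xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>Adventure Works DW</DatabaseID>
  </Object>
</ClearCache>
```

After clearing the caches, fire a plain aggregation query (a straight SELECT of a measure by an attribute, no calculations) and let XPerf break down where the time goes.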
If the above sounds too complicated, here are some general recommendations originating from the Analysis Services team that give you rough size guidelines for partitioning:
- For tiny to small cubes (e.g. less than 100 million rows), don't partition at all. I don't have statistics, but I won't be surprised if 80% of the cubes in existence are smaller than that.
- For mid-size cubes (say 100 million to 10 billion rows), partition based on good slices and data load strategies (see the partition definition sketch after this list). This is where you could follow the 20 million rows/2 GB guidelines, which lead to roughly 10 to 500 partitions.
- For large cubes, consider larger but fewer partitions. Perhaps go with larger old partitions and smaller new partitions if possible. Choose an intelligent partitioning strategy: understand the slices that your typical queries apply and optimize for them.
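To make the "good slices" point concrete, here is a hedged XMLA sketch of a by-year partition (the database, cube, measure group, and FactInternetSales query below are all hypothetical). The query binding restricts which rows the partition loads, while the Slice property tells the storage engine which queries can skip the partition altogether:

```xml
<!-- A hypothetical by-year partition definition -->
<Create xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <ParentObject>
    <DatabaseID>Adventure Works DW</DatabaseID>
    <CubeID>Adventure Works</CubeID>
    <MeasureGroupID>Internet Sales</MeasureGroupID>
  </ParentObject>
  <ObjectDefinition>
    <Partition xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <ID>Internet Sales 2008</ID>
      <Name>Internet Sales 2008</Name>
      <!-- Load only the 2008 rows into this partition -->
      <Source xsi:type="QueryBinding">
        <DataSourceID>Adventure Works DW</DataSourceID>
        <QueryDefinition>
          SELECT * FROM FactInternetSales
          WHERE OrderDateKey BETWEEN 20080101 AND 20081231
        </QueryDefinition>
      </Source>
      <!-- Tell the SE that queries outside 2008 can skip this partition -->
      <Slice>[Date].[Calendar Year].&amp;[2008]</Slice>
      <StorageMode>Molap</StorageMode>
    </Partition>
  </ObjectDefinition>
</Create>
```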
One area where partitioning does help is manageability, if that's important to you. For example, partitions can be processed independently and incrementally (e.g. adding only new data), can have different aggregation designs, and can combine fact tables (e.g. actual, budget, and scenario tables in one measure group). Partitioning can also help you reduce the overall cube processing time on machines with multiple CPUs and cores. Processing is a highly parallel operation, and partitions are processed in parallel, as the sketch below illustrates. So, you may still want to add a few partitions, e.g. partition by year, to get the cube to process faster. Just be somewhat skeptical about partitioning for improving query performance.
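To illustrate the parallel processing point, here is a minimal XMLA sketch (again with hypothetical object IDs) that processes two by-year partitions in parallel; swapping the Type to ProcessAdd would append only the new rows instead of rebuilding a partition:

```xml
<!-- Processes two hypothetical partitions in parallel -->
<Batch xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Parallel>
    <Process>
      <Object>
        <DatabaseID>Adventure Works DW</DatabaseID>
        <CubeID>Adventure Works</CubeID>
        <MeasureGroupID>Internet Sales</MeasureGroupID>
        <PartitionID>Internet Sales 2007</PartitionID>
      </Object>
      <Type>ProcessFull</Type>
    </Process>
    <Process>
      <Object>
        <DatabaseID>Adventure Works DW</DatabaseID>
        <CubeID>Adventure Works</CubeID>
        <MeasureGroupID>Internet Sales</MeasureGroupID>
        <PartitionID>Internet Sales 2008</PartitionID>
      </Object>
      <Type>ProcessFull</Type>
    </Process>
  </Parallel>
</Batch>
```

The server spreads the commands inside the Parallel element across the available cores, which is exactly where a by-year scheme pays off at processing time.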