Estimating Data Compression

Now that the age of in-memory databases has dawned on us, I’ve been experimenting with different aspects of VertiPaq, and I find its technology fascinating. It’s safe to say that VertiPaq will propel Microsoft Business Intelligence for years to come in all three of its flavors: PowerPivot (self-service BI), Business Intelligence Semantic Model (corporate BI), and data warehousing (columnstore indexes). Unlike Analysis Services MOLAP, which compresses data by rows, VertiPaq compresses data by columns.

Column-based compression fits business intelligence like a glove because data is typically analyzed by columns. And the lower the data cardinality (that is, the more repeating values a column has), the higher its compression rate will be. Since most columns used for analytics fit this category, you could expect 10-15 times data compression. Consequently, a half-terabyte database could potentially fit into 80 GB of memory. Notice that I budget more than the 33-50 GB the raw ratio suggests, because when data is brought from disk into memory, additional space is needed to accommodate index tables and other structures. So, how is this not cool? Moving to BISM, all of your data could fit into memory, so you get a head start on performance simply because memory is a superfast storage medium!
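To see why cardinality matters so much, here is a minimal Python sketch. It is not VertiPaq’s actual algorithm, just plain dictionary encoding with bit-packed row IDs, and the row count and value size are made-up assumptions, but it shows how the compression ratio collapses as the number of distinct values grows:

import math
import random

def dictionary_encoded_size(values, value_size_bytes):
    """Rough estimate: dictionary of distinct values + bit-packed row IDs."""
    distinct = set(values)
    dict_bytes = len(distinct) * value_size_bytes
    bits_per_row = max(1, math.ceil(math.log2(max(len(distinct), 2))))
    ids_bytes = len(values) * bits_per_row / 8
    return dict_bytes + ids_bytes

rows = 1_000_000
value_size = 20  # assume 20-byte raw values

for cardinality in (10, 1_000, 1_000_000):
    col = [random.randrange(cardinality) for _ in range(rows)]
    raw = rows * value_size
    packed = dictionary_encoded_size(col, value_size)
    print(f"cardinality {cardinality:>9,}: compression ratio ~{raw / packed:.1f}x")

With these assumptions, a 10-value column compresses roughly 40x, a 1,000-value column about 15x, and a nearly unique column barely at all, which is exactly the pattern the real measurements below exhibit.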

Continuing this line of thought, I decided to take the Contoso Retail DW database and a real-life data warehouse for a spin to see how VertiPaq compresses them. I used the excellent PowerShell script that Vidas Matelis developed. Since fact tables typically account for about 95% of the data warehouse storage space, I focused on only one fact table from each database. To measure the VertiPaq compression, I loaded the fact tables into PowerPivot and ran the script to obtain their storage space.

                          MOLAP      VertiPaq
Contoso (FactSales)       129 MB     35 MB
DW (20 mil fact table)    810 MB     792 MB

As you can see, the Contoso VertiPaq results are very impressive. But what happened with the real-life DW? How could VertiPaq be only marginally better than MOLAP? To analyze this further, I used the script's ability to get a column-by-column breakdown and noticed that the following two columns occupy the most space:

TRANS_DTLS_KEY: 400 MB (primary key)

LOG_ID: 268 MB (transaction identifier)

Both of these are high-cardinality columns. I used them in a cube drillthrough action to get the transaction details behind a cell, and subsequently added them to PowerPivot so I could compare the same data structures. Currently, PowerPivot doesn’t support drillthrough, so adding these columns to PowerPivot is probably not needed in real life. It will be interesting to see how BISM will implement drillthrough. One thing is sure: it makes sense to leave high-cardinality attributes out of VertiPaq and possibly use report actions (if supported) to report on details.
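As a rough back-of-envelope (my own assumption of a 20-million-row table with an 8-byte surrogate key, not numbers taken from the actual database), a column where every row is distinct gives dictionary encoding nothing to work with, which is why columns like TRANS_DTLS_KEY stay so large:

import math

rows = 20_000_000
key_bytes = 8                          # assume an 8-byte surrogate key

raw_size = rows * key_bytes            # ~160 MB raw
dict_size = rows * key_bytes           # every value is distinct, so the dictionary is as big as the data
id_bits = math.ceil(math.log2(rows))   # 25 bits per row ID
id_size = rows * id_bits / 8           # ~62.5 MB of row IDs

print(f"raw ~{raw_size / 1e6:.0f} MB, encoded ~{(dict_size + id_size) / 1e6:.0f} MB")
# raw ~160 MB, encoded ~222 MB: the "compressed" column ends up no smaller than the raw data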