Azure Data Catalog Announced

Microsoft just announced the public preview of Azure Data Catalog, previously known as “Project Tokyo”. I had a chance to take an early look at Azure Data Catalog while helping a big organization that was interested in simplifying discoverability of and access to data across the enterprise. As I mentioned in my “QUO VADIS DATA WAREHOUSE?” newsletter, logical data warehousing (LDW) has gained some traction and backing from vendors, including Composite (acquired by Cisco), Denodo, and others.

“Logical data warehousing is at a very early stage of adoption. In my opinion, the companies that will benefit most from it are large organizations with many data repositories, where data availability is a major barrier to enabling self-service BI. If you believe that your organization might benefit from a Logical Data Warehouse, you might not need to make a major investment. If your organization has an Office 365 Power BI subscription, your first step could be leveraging the Power Query capabilities for dataset sharing and discovery. This process can work like this:

  1. Designated users promote virtual views in the form of Power Query queries to Office 365.
  2. A data steward verifies and approves these datasets.
  3. Once signed in to Office 365, other users can search for and preview these queries, and import the associated data into self-service BI models.”
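
To make the promote, approve, and discover steps above more concrete, here is a minimal in-memory sketch in Python. It is not the Power Query or Office 365 API; the `SharedQuery` and `QueryCatalog` names, their fields, and the sample query are all hypothetical and exist only to illustrate the flow.

```python
from dataclasses import dataclass


@dataclass
class SharedQuery:
    """A shared 'virtual view' (hypothetical model, not a real Power Query type)."""
    name: str
    description: str
    owner: str
    approved: bool = False


class QueryCatalog:
    """Toy catalog: users promote queries, a steward approves, others search."""

    def __init__(self):
        self._queries = {}

    def promote(self, query):
        # Step 1: a designated user publishes a query to the catalog.
        self._queries[query.name] = query

    def approve(self, name):
        # Step 2: a data steward verifies and approves the dataset.
        self._queries[name].approved = True

    def search(self, term):
        # Step 3: other users only discover approved queries.
        term = term.lower()
        return [
            q for q in self._queries.values()
            if q.approved
            and (term in q.name.lower() or term in q.description.lower())
        ]


catalog = QueryCatalog()
catalog.promote(SharedQuery("Customers", "Active customer list from CRM", "alice"))
hits_before_approval = catalog.search("customer")
catalog.approve("Customers")
hits_after_approval = catalog.search("customer")
```

In this sketch a promoted query stays invisible to search until the steward approves it, which is the point of having a verification step between publishing and discovery.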

You can view Azure Data Catalog as Microsoft's attempt to enter the logical data warehousing market. It can also be viewed as the next step of Power Query Dataset Sharing, now available as a standalone cloud service with new features. It’s all about providing and discovering the necessary metadata that describes the most valuable asset: your data. I welcome Azure Data Catalog, but to make it more appealing to large organizations, which are the customers most likely to be interested in it, I shared the following feedback with Microsoft:

  1. Support more data sources. As it stands, Azure Data Catalog is limited to SQL Server as an RDBMS. It’s paramount to support at least the most popular ones initially, including Oracle, DB2, Hadoop, Salesforce, etc. In fact, I don’t see why we can’t support all Power Query-supported data sources if we route the data acquisition through Power Query.
  2. Support for OData so that not only end users but also applications can easily consume these shared datasets.
  3. Support for in-memory dataset caching to allow the administrator to configure the datasets to cache results on the server. This could be conceptually similar to SSRS shared dataset caching. Caching would allow us to avoid hitting underlying databases for slow queries.
  4. On-premises support – The cloud is great, but it might be a hassle for a large organization to embrace it if they haven’t done so already. Just extending their AD to the cloud would mean cutting through a lot of red tape.
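
To illustrate the dataset caching idea in item 3, here is a minimal Python sketch of a time-based cache in front of a slow query. It is not a feature of Azure Data Catalog or SSRS; the `CachedDataset` class, the `load` callable, and the expiration setting are assumptions made for illustration only.

```python
import time


class CachedDataset:
    """Caches the rows of a slow query until a time-to-live expires (hypothetical)."""

    def __init__(self, load, ttl_seconds, clock=time.monotonic):
        self._load = load          # callable that hits the underlying database
        self._ttl = ttl_seconds    # how long cached rows stay valid
        self._clock = clock        # injectable so the cache can be tested
        self._rows = None
        self._expires_at = None

    def rows(self):
        now = self._clock()
        # Reload only on first use or after the cached copy has expired.
        if self._expires_at is None or now >= self._expires_at:
            self._rows = self._load()
            self._expires_at = now + self._ttl
        return self._rows


# Usage with a fake clock and a stand-in for a slow database query.
calls = []
fake_now = [0.0]


def slow_query():
    calls.append(1)
    return [("Contoso", 100), ("Fabrikam", 250)]


dataset = CachedDataset(slow_query, ttl_seconds=60, clock=lambda: fake_now[0])
first = dataset.rows()
second = dataset.rows()      # served from cache, no second database hit
fake_now[0] = 61.0
third = dataset.rows()       # cache expired, query runs again
```

An administrator-configured expiration like `ttl_seconds` is the main design choice here: a longer value protects slow source databases, at the cost of serving staler data.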

Except for this video, there isn’t much public information about Azure Data Catalog yet, but that will probably change next week when Microsoft announces it officially.