
First Look at Power BI AutoML

In my previous “First Look at the New Power BI Integration with Azure ML” blog, I showed how Power BI will make it very easy to integrate with Azure Machine Learning and Cognitive Services. Data scientists today have plenty of options for creating predictive models, but all of them have one thing in common: they require expertise in data science and statistics.

The promise of AutoML is to bring predictive analytics to data analysts just like self-service BI brought analytics to business users.

Let’s see how AutoML works based on what’s in the private preview (the usual disclaimer is that things will probably change). To start with, AutoML requires a dataflow (a note to Microsoft here: AutoML will become more pervasive if it’s available in Power BI Desktop and doesn’t require a premium capacity). In the private preview, AutoML requires the following steps. Presumably, the first (and most difficult) step, preparing the dataset and cleansing the data, is already done and available as a dataflow entity.


Choosing a model type

In the private preview, choosing a model type requires the data analyst to select one of four model types:

  • Binary Prediction – a classification task where the model predicts only two (Boolean) states, such as whether a customer is a potential buyer or not. It requires a Boolean outcome field.
  • General Classification – a classification task where the model predicts more than two states, such as whether a customer’s risk falls into Low, Medium, or High.
  • Regression – Similar to classification but with a numeric outcome, such as the customer risk on a scale from 0 to 10.
  • Forecasting – Time series forecasting, such as predicting revenue for future months.

I’d argue here that if we want to make this super easy, we shouldn’t be asking a business user about the type of model. Power BI can simply examine the field that needs to be predicted, which happens in the next step, and suggest the model type, so the current choices can be collapsed into Classification and Forecasting.
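To illustrate the point, here is a minimal sketch in Python (pandas) of how the outcome field alone could drive such a suggestion. This is my own approximation, not how Power BI works internally, and the cardinality cutoff and the is_time_series flag are arbitrary assumptions:

import pandas as pd

def suggest_model_type(outcome: pd.Series, is_time_series: bool = False) -> str:
    """Suggest an AutoML model type by inspecting only the outcome field."""
    if is_time_series:
        return "Forecasting"             # outcome is indexed by a date/time key
    if pd.api.types.is_bool_dtype(outcome) or outcome.nunique(dropna=True) == 2:
        return "Binary Prediction"       # two states, e.g. buyer vs. non-buyer
    if pd.api.types.is_numeric_dtype(outcome) and outcome.nunique(dropna=True) > 20:
        return "Regression"              # continuous numeric outcome
    return "General Classification"      # a handful of discrete classes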

Selecting training data

Next, you select the field that needs to be predicted (the outcome field) and the input columns. This step is the most important and arguably the most difficult even for data scientists, because you need business domain knowledge to guess which input fields are the most significant. Modelers tend to lean on the safe side and select all fields, which dilutes the model’s accuracy. It would be nice if AutoML suggested which fields are the most significant. AutoML has a validation report that shows you the key influencers after the model is ready, but it would be better if a similar feature aided you during field selection.
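As a rough illustration of the kind of hint I have in mind, the following sketch ranks candidate input fields by mutual information with the outcome using scikit-learn. This is my own approximation rather than anything AutoML does internally, and the names in the usage comment are hypothetical:

import pandas as pd
from sklearn.feature_selection import mutual_info_classif

def rank_input_fields(df: pd.DataFrame, outcome: str) -> pd.Series:
    """Rank candidate input fields by mutual information with the outcome."""
    X = pd.get_dummies(df.drop(columns=[outcome])).fillna(0)   # encode categoricals
    scores = mutual_info_classif(X, df[outcome], random_state=0)
    return pd.Series(scores, index=X.columns).sort_values(ascending=False)

# Hypothetical usage: rank_input_fields(customers, outcome="BikeBuyer").head(10)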

The wizard’s next step depends on the model type. For binary classification, it asks you to specify the field names for the two states, such as Buyer and NonBuyer.

Training the model

The model is ready but requires an entity refresh to train it. At this point, AutoML generates two additional entities that provide the training and testing data; for example, 80% of the input dataset is used to train the model (generate the predictive patterns) and 20% is used to test the model’s accuracy against the historical data. The assumption here is probably to let users fine-tune the training and testing data if they want more control. I think, though, that this is too technical and should be handled internally by AutoML to reduce the number of entities and the overall complexity.
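For readers new to the idea, the 80/20 split that AutoML materializes as separate entities boils down to something like the following scikit-learn sketch (my own simplification with a stand-in algorithm, not AutoML’s internals):

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def train_and_test(df: pd.DataFrame, outcome: str):
    """80% of the rows train the model; the held-out 20% estimate its accuracy."""
    X = pd.get_dummies(df.drop(columns=[outcome])).fillna(0)
    y = df[outcome]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    return model, accuracy_score(y_test, model.predict(X_test))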

The actual model exists on the Power BI capacity. All the training happens there, and the resulting patterns are stored there. Data does not leave the capacity and the model is not published as an experiment in Azure ML.

Iterating and retraining

Preparing the model is one thing, and it is easy with AutoML, but making sure that the model is useful and delivers a true gain is another. This is where things get technical, but AutoML seeks to simplify the task with a validation report. The report has three tabs:

  • Model Performance – The most important information here is the key influencers chart, which shows you which input fields are the top predictors. You should review this chart and possibly remove input fields that are insignificant. This page also has the only input you can change to influence the model’s prediction: the probability threshold. For example, for Binary Prediction you can set the probability threshold to 50% to indicate that a customer is considered a buyer if the predicted score is equal to or above 50% (see the sketch after this list).
  • Accuracy Report – The most important information on this page is the so-called lift chart. You want to verify that the solid line is above the diagonal line, which represents a random guess, i.e. what you’d get if you didn’t use predictive analytics but just randomly guessed which customers will become buyers.
  • Technical Details – This page has additional technical information that will be over the heads of data analysts but could be useful to pass on to a data scientist, such as which parameters were used for the model’s algorithm.
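To make the probability threshold and the lift chart more concrete, here is a small sketch of both calculations. It’s my own simplified take, assuming scores on the 0-100 scale used by the report:

import numpy as np

def apply_threshold(scores, threshold=50):
    """Classify as a buyer when the predicted score is equal to or above the threshold."""
    return np.asarray(scores) >= threshold

def lift_by_decile(scores, actuals, bins=10):
    """Lift per decile: the buyer rate among the top-scored rows vs. a random guess."""
    order = np.argsort(scores)[::-1]          # highest scores first
    hits = np.asarray(actuals, dtype=float)[order]
    random_rate = hits.mean()                 # the diagonal (random guess) line
    return [chunk.mean() / random_rate for chunk in np.array_split(hits, bins)]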

Applying the model

Now that you have the model trained, you can operationalize it by applying it to the original entity that was used to train the model, or to another entity, such as a list of new customers, as long as it has the same fields that the model uses as input. Applying the model to an entity adds yet another (computed) entity that has the original fields plus the predicted ones. For example, the Binary Prediction model type adds three new fields (a mock-up of the enriched output follows the list):

  • Outcome – TRUE if the model predicted that the customer will be a buyer and FALSE otherwise.
  • PredictionScore – The likelihood of the prediction from 0 to 100.
  • PredictionExplanation – Additional details that include the weights of each input field to explain the predicted outcome.
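Here is a rough mock-up in pandas of what the added columns might look like for a couple of hypothetical customers; the values and the explanation format are illustrative only, not the actual output of the enriched entity:

import pandas as pd

enriched = pd.DataFrame({
    "CustomerKey": [11000, 11001],                    # original key column
    "PredictionScore": [82, 31],                      # likelihood from 0 to 100
    "Outcome": [True, False],                         # score meets the threshold or not
    "PredictionExplanation": [                        # per-field weights (illustrative)
        '[{"field": "NumberCarsOwned", "weight": 0.42}]',
        '[{"field": "Age", "weight": -0.18}]',
    ],
})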

So, once all is done, you’ll have 7 entities:

  • Original entity
  • AutoML entity – trains the model
  • Training data – entity used for training the model
  • Testing data – entity used for testing the model
  • AutoMLScore – a function used for scoring the input dataset
  • AutoMLPollingQuery – a helper entity which I believe returns the identifier of the model
  • Enriched entity – the computed entity created when you apply the model, containing the original fields plus the predicted ones

I’d personally prefer fewer entities. For example, I’d rather see the predicted fields added to the entity to which the model is applied, as opposed to having another “enriched” entity.

Power BI AutoML brings predictive analytics to data analysts. Powered by ML.NET, AutoML automates the creation of predictive models for common predictive tasks, such as classification and forecasting. I hope Microsoft simplifies the process by reducing the number of entities and refreshes required, and by bringing AutoML to Power BI Desktop.

First Look at the New Power BI Integration with Azure ML

Currently, Power BI doesn’t integrate natively with Azure ML, so you must resort to using the Web connector to call the experiment’s web service. I demonstrate how you can do this in a singleton way (one row at a time) in chapter 11 of my latest Power BI book (you can download the code from the book page). Microsoft announced several forthcoming Machine Learning enhancements, including integration with Cognitive Services and Azure ML, and the new AutoML features (skip to 1:08:30).
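For reference, the singleton call boils down to a single authenticated POST against the experiment’s request-response endpoint. The Python sketch below shows the wire format; the URL, API key, and column names are placeholders, and the payload shape follows the published request-response samples, so treat it as an approximation rather than the exact contract of your service:

import requests

URL = "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>/execute?api-version=2.0&details=true"
API_KEY = "<api-key>"   # copy the real values from the web service's API help page

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Age", "Region", "HouseOwnerFlag", "NumberCarsOwned"],
            "Values": [["45", "North America", "1", "2"]],   # one row = singleton call
        }
    },
    "GlobalParameters": {},
}

response = requests.post(URL, json=payload,
                         headers={"Authorization": f"Bearer {API_KEY}"})
print(response.json())   # the scored label and probability come back under Results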

In this blog, I’ll review the Power BI integration with Azure ML, which is currently in private preview but expected to be available for public preview soon. I’ll use the Adventure Works Bike Buyer Azure ML predictive model (explained in more detail in my book). The model is simple: it’s trained with data from the vTargetMail view in the AdventureWorksDW database and predicts the probability that a customer will purchase a product based on a small subset of features, such as the customer’s age, geographic region, house ownership, number of cars owned, and so on.


When you publish the web service in Azure ML, you must use the Deploy Web Service [New] Preview option. Power BI integration won’t work if you deploy it as a classic web service. There are more prerequisites required to enable the integration with Azure ML and Cognitive Services, such as granting consumers specific rights, which will be documented. Currently, this integration scenario is available only in Power BI Service dataflows. In addition, the dataflow must be created in a premium capacity and the AI Insights workload must be enabled.

Those of you who had to use the Web connector in the past to integrate Power BI with Azure ML will appreciate the simplicity of the new integration:

  1. Create a dataflow with an entity of the dataset you want to score.
  2. Click the AI Insights button in the entity toolbar. If you don’t see this button, the capacity is not a premium capacity, or the AI Insights workload is not enabled.
  3. Power BI enumerates all “new style” web services assigned to price plans in the Power BI tenant (they could be in different subscriptions) and shows them in the Invoke Function window (the screenshot below shows only one web service).

  4. Once you select the Azure ML service, Power BI will ask you to map the columns in your data entity as input columns (features) to the web service.
  5. Next, Power BI adds a new “AzureML.WebServiceName” column to the entity dataset. Specifically, Power BI invokes the Azure ML web service in batch mode, which is more efficient than calling it one row at a time. The underlying invocation and integration details with Azure ML are not exposed to you.
  6. You can expand the new column to get the scored label and probability. That’s it! That’s exactly how I imagined the Azure ML integration to work, and I’m glad to see how easy it is.

The new Power BI-Azure ML integration makes it very easy for a data analyst to score a dataset against a previously published (presumably by a data scientist) Azure ML experiment. In the preview, this integration requires a premium capacity and dataflow. I hope Microsoft brings this integration scenario to Power BI Desktop too.

AzureML – Current State of Affairs

Organizations are showing increased interest in predictive analytics. A large retailer that is using Azure Machine Learning (AzureML) has approached me to help them automate the process. I can’t help but compare AzureML with Analysis Services Data Mining each time I look at it. It looks like both Microsoft and I agree at this point that what used to be difficult has been made simple, and vice versa. On the upside, AzureML provides business users with a cloud environment for creating predictive solutions without involvement from IT, and it has more features than SSAS Data Mining. Now that the focus is on the cloud, we probably won’t see future investments from Microsoft in SSAS Data Mining (or the Excel data mining add-in, for that matter), which is currently limited to nine algorithms. By contrast, not only does AzureML have many more algorithms, but it also allows users to use R and Python, create workflows, and it enjoys a vibrant community that is eager to contribute samples. So, AzureML is to predictive analytics what Power BI is to descriptive analytics. It’s all about democratizing analytics for anyone who needs it.

On the downside, operationalizing AzureML predictive experiments leaves some room for improvement. True, it’s very easy to promote your experiment to a web service that can be called in a singleton (one row at a time) or batch manner. The Excel Azure Machine Learning Add-in makes it easy to call the service as singleton or batch if your data lives in Excel and user interaction is acceptable. The Chapter 10 source code of my Applied Power BI book demonstrates how a Power BI model can call a predictive service as a singleton, and the steps are explained in the book. However, automating this process is much more difficult. Previously, all that was required to periodically retrain an SSAS data mining model was to schedule a job with the SQL Server Agent. By contrast, simply scheduling an AzureML experiment for retraining is currently not an option. You need to integrate with the experiment’s web service either programmatically or by using Azure Data Factory. Similarly, scoring the model in batch mode (instead of singleton execution) requires calling the batch web service endpoint. Consequently, this requires either custom code (see the Microsoft Azure Machine Learning Samples on CodePlex) or an Azure Data Factory pipeline. Either of these scenarios requires intervention from your fellow programmer or an Azure Data Factory expert. By contrast, SSAS data mining models can be queried with DMX, which is easy, but promoting them to the outside world requires custom coding.
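To give a flavor of what the custom-code option involves, the sketch below follows the pattern from the Batch Execution Service samples: submit a job that points to input and output blobs, start it, and poll until it finishes. The URLs, key, and payload fields are placeholders written from memory of those samples, so verify them against your service’s API help page:

import time
import requests

BASE = "https://<region>.services.azureml.net/workspaces/<workspace-id>/services/<service-id>"
HEADERS = {"Authorization": "Bearer <api-key>"}

# 1. Submit the batch job, pointing it to the input CSV and an output location in Azure Storage.
job = requests.post(f"{BASE}/jobs?api-version=2.0", headers=HEADERS, json={
    "Input": {"ConnectionString": "<storage-connection-string>",
              "RelativeLocation": "input/newcustomers.csv"},
    "Outputs": {"output1": {"ConnectionString": "<storage-connection-string>",
                            "RelativeLocation": "output/predictedbuyers.csv"}},
    "GlobalParameters": {},
})
job_id = job.json()

# 2. Start the job and poll its status until it completes.
requests.post(f"{BASE}/jobs/{job_id}/start?api-version=2.0", headers=HEADERS)
while True:
    status = requests.get(f"{BASE}/jobs/{job_id}?api-version=2.0", headers=HEADERS).json()
    if status.get("StatusCode") in ("Finished", "Failed", "Cancelled"):
        break
    time.sleep(10)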

Here is a diagram of an Azure Data Factory pipeline that I implemented to demonstrate scoring an AzureML model in batch execution mode, using data from a stored procedure residing in an on-premises SQL Server. In this case, my AzureML predictive experiment returns the probability of a customer purchasing a product, although what the experiment does is irrelevant as far as automation is concerned.

The high-level steps to automate the process are:

  1. Create a NewCustomers dataset (table) that represents the output of the stored procedure for the new customers that you’ll predict on. Because in this case we’re sourcing data from an on-premises server, we have to install (you guessed it) a Data Management Gateway, which you can download from your Azure Data Factory service in the Azure portal. As of now, mind you, this gateway is not the same as the Power BI Gateway that is also used by PowerApps, Microsoft Flow, and Azure Logic Apps.
  2. Because the predictive batch service requires an input, use the Copy Activity (blessed be the ADF team for adding it!) to save the results of the stored procedure to Azure Storage as a CSV file with headers.
  3. Use the AzureMLBatchExecution activity to pass the file to the batch endpoint of the predictive experiment.
  4. Save the predicted results from the predictive service as a file in Azure Storage and do whatever is needed with the results.

If the experiment gets its input data from Import Data (formerly known as a Reader), such as from an on-premises or Azure SQL table, you can simplify scoring and retraining the model by creating a simple Azure Data Factory pipeline that has only the AzureMLBatchExecution activity. In this case, the activity passes no input to the web service, and if the experiment writes to a database, no output is needed either. However, because the pipeline definition requires an output, I still had to create a dummy Azure Storage dataset (PredictedBuyersSQL), although no files will be sent to it.


Here is what the simplified ADF pipeline may look like when the experiment reads from and writes to a database.

{
    "name": "PredictBuyersPipelineSQL",
    "properties": {
        "description": "Azure ML model with sql azure reader/writer",
        "activities": [
            {
                "type": "AzureMLBatchExecution",
                "typeProperties": {},
                "outputs": [
                    {
                        "name": "PredictedBuyersSQL"
                    }
                ],
                "policy": {
                    "timeout": "00:10:00",
                    "concurrency": 1,
                    "executionPriorityOrder": "NewestFirst",
                    "retry": 1
                },
                "scheduler": {
                    "frequency": "Day",
                    "interval": 1
                },
                "name": "MLSqlReaderSqlWriterActivity",
                "description": "test",
                "linkedServiceName": "BikeBuyersWebServiceSQL"
            }
        ],
        "start": "2016-09-10T00:00:00Z",
        "end": "2016-09-14T00:00:00Z",
        "isPaused": true,
        "hubName": "adfprologikaonpremdf_hub",
        "pipelineMode": "Scheduled"
    }
}

AzureML will play an increasingly important role with predictive analytics on the rise. I hope the next release of AzureML will simplify operationalizing predictive experiments so that it doesn’t require programming effort or involvement from IT. Now that AzureML supports importing data from on-premises SQL Server and other cloud data sources, it should be simple to automate retraining and scoring models.

Microsoft Azure Machine Learning

Yesterday, Microsoft announced Microsoft Azure Machine Learning, previously known as project Passau. Your humble correspondent has been participating in the Preview Program. Basically, Microsoft Azure Machine Learning is a service for self-service, cloud-based predictive analytics. You upload your data to the cloud and define datasets and workflows to create “experiments”. Previously, you could create data mining models using the SQL Server data mining capabilities; Microsoft Azure Machine Learning is to organizational data mining models what Power Pivot is to Analysis Services. Besides its cloud-based nature, Microsoft Azure Machine Learning offers:

  1. Data sources – Allow business users to upload source data as files.
  2. Workflows – Business users can drag and drop tasks to create workflows, such as to perform basic data transformation tasks, remove outliers, and train and score mining models. Users familiar with SSIS will undoubtedly find workflows similar, as will users familiar with SAS data mining.
  3. Scalability – You can use Big Data coming from Azure HDInsight and scale out accordingly.
  4. Endpoints – You can easily publish the predictive results as an OData service.
  5. Algorithms – There are more algorithms than the nine available in SQL Server data mining.
