Azure ML Filter Based Feature Selection vs. Permutation Feature Importance

At first glance, Filter Based Feature Selection and Permutation Feature Importance seem to accomplish similar tasks, in that both assign scores to variables so that we can determine which features are important and which ones are not.  So let’s look at the definition of each from ML Studio:

Filter Based Feature Selection (FBFS) – Identifies the features in a dataset with the greatest predictive power.

Permutation Feature Importance (PFI) – Computes the permutation feature importance scores of feature variables given a trained model and a test dataset.

FBFS essentially helps pick the features with the most value for the model before the data is sent for training, because non-informative features can negatively impact your model.  The module has one input and two outputs.  The input in my example is connected to my dataset, as shown below.  The left output can be consumed by another module or, as in my case, used for analysis while I cycle through the different algorithms offered for the feature scoring method, which is the first parameter of FBFS.  The second parameter is the target column, the column I’m trying to predict.  The third parameter is the number of features, or columns, to be selected, starting with the highest score; the default is 1.  The right output provides scores for all columns, starting with the highest score.
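Conceptually, a filter scorer works like the minimal C# sketch below: score every candidate column against the target, then keep the N highest scorers.  This is only my illustration of the Pearson Correlation option, not ML Studio’s actual implementation, and the method names are my own.

using System;
using System.Collections.Generic;
using System.Linq;

static class FilterScoring
{
    // Score one candidate column against the target with Pearson correlation
    static double PearsonScore(double[] feature, double[] target)
    {
        double meanX = feature.Average(), meanY = target.Average();
        double cov = 0, varX = 0, varY = 0;
        for (int i = 0; i < feature.Length; i++)
        {
            double dx = feature[i] - meanX, dy = target[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        // Magnitude only: for ranking, the direction of the correlation does not matter
        return Math.Abs(cov / Math.Sqrt(varX * varY));
    }

    // Keep the N highest-scoring columns, mirroring the module's "number of desired features" parameter
    static string[] SelectTopFeatures(Dictionary<string, double[]> columns, double[] target, int n)
    {
        return columns.OrderByDescending(c => PearsonScore(c.Value, target))
                      .Take(n)
                      .Select(c => c.Key)
                      .ToArray();
    }
}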

[Screenshot: Filter Based Feature Selection module connected to a dataset, with its parameters]

From PFI’s definition, and the fact that the module in Azure ML has two inputs, it’s clear that it needs to be connected to a trained model and a test dataset, as illustrated below:

[Screenshot: Permutation Feature Importance module connected to a trained model and a test dataset]

My dataset has two highly correlated columns, is_fraud and fraud_reason.  When is_fraud = False, fraud_reason = None; when is_fraud = True, fraud_reason can be different things, but not None.  When running my model with fraud_reason left in, PFI picked up on that condition and automatically selected fraud_reason as the most valuable feature in predicting results.  My model was getting an AUC score of 1; no mystery there.  However, FBFS indicated fraud_reason to be the most valuable feature only when using Mutual Information or Chi Squared as the feature scoring method.  More information about Pearson Correlation, Mutual Information, Kendall Correlation, Spearman Correlation, Chi Squared, Fisher Score and Count Based can be found online or in the following book by Roger Barga, Valentine Fontama and Wee Hyong Tok: Predictive Analytics with Microsoft Azure Machine Learning, 2nd Edition.
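For intuition about what PFI does under the hood, here is a rough C# sketch of the permutation idea, assuming a trained model exposed as a scoring delegate.  This is my simplification of the general algorithm, not Azure ML’s code.

using System;
using System.Linq;

static class Permutation
{
    // Importance of one column = how much the test-set metric drops after that column is shuffled
    static double Importance(Func<double[][], bool[]> predict,
        double[][] testRows, bool[] testLabels, int column, Random rng)
    {
        double baseline = Accuracy(predict(testRows), testLabels);

        // Clone the rows and shuffle only the chosen column (Fisher-Yates)
        double[][] shuffled = testRows.Select(r => (double[])r.Clone()).ToArray();
        for (int i = shuffled.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1);
            double tmp = shuffled[i][column];
            shuffled[i][column] = shuffled[j][column];
            shuffled[j][column] = tmp;
        }

        return baseline - Accuracy(predict(shuffled), testLabels);
    }

    static double Accuracy(bool[] predicted, bool[] actual)
    {
        int correct = 0;
        for (int i = 0; i < actual.Length; i++)
            if (predicted[i] == actual[i]) correct++;
        return (double)correct / actual.Length;
    }
}

A column like fraud_reason, which single-handedly determines the label, shows a large metric drop when shuffled, which is exactly why PFI flagged it.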

When using PFI, you need to make sure that the appropriate metric for measuring performance is selected.  Otherwise you will get: Permutation Feature Importance Error – Unsupported parameter type ‘Metric for measuring performance’ specified (Error 0105).  This error is typical when using a regression algorithm while leaving the PFI property at its default, which is Classification – Accuracy.

[Screenshot: Permutation Feature Importance error (Error 0105)]

Free Statistical Learning Course from Stanford

Here is a great statistical learning course using R, offered by Stanford for free, starting now!

Here is a brief description of the course from the source site:

This is an introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes: linear and polynomial regression, logistic regression and linear discriminant analysis; cross-validation and the bootstrap, model selection and regularization methods (ridge and lasso); nonlinear models, splines and generalized additive models; tree-based methods, random forests and boosting; support-vector machines. Some unsupervised learning methods are discussed: principal components and clustering (k-means and hierarchical).

This is not a math-heavy class, so we try and describe the methods without heavy reliance on formulas and complex mathematics. We focus on what we consider to be the important elements of modern data analysis. Computing is done in R. There are lectures devoted to R, giving tutorials from the ground up, and progressing with more detailed sessions that implement the techniques in each chapter.

The lectures cover all the material in An Introduction to Statistical Learning, with Applications in R by James, Witten, Hastie and Tibshirani (Springer, 2013). The pdf for this book is available for free on the book website.

https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about


Azure Machine Learning Pricing and Billing

From my research on the pricing structure for Azure ML, it appears to have three components: per API call, compute time, and per seat.  As of the writing of this post, a “seat” is an Azure ML workspace tied to an Azure subscription.  So for each Azure ML workspace, the subscription will be billed $9.69/month.  Please note that $9.69 is what I found in our billing statement; other sources mention $9.99.

The time taken to compute is charged at $2/hour, for both BES and RRS.

Every API call is charged at the listed price ($0.0005 per call), for both BES (BATCH EXECUTION) and RRS (REQUEST/RESPONSE).

So, for example, if 1,000 records take 1 hour to compute, the total charge will be $2.0005: $2 for the time taken + $0.0005 for 1 API call.
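As a sanity check on the arithmetic, here is a trivial C# helper using the rates above (rates as of this writing; verify them against your own billing statement):

static class Pricing
{
    const double ComputeRatePerHour = 2.00; // $2 per compute hour, BES and RRS
    const double RatePerApiCall = 0.0005;   // $0.0005 per API call, BES and RRS

    static double EstimateCharge(double computeHours, long apiCalls)
    {
        return computeHours * ComputeRatePerHour + apiCalls * RatePerApiCall;
    }

    // EstimateCharge(1.0, 1) returns 2.0005, matching the example above
}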

Why does this matter? RRS is generally low computation time but high API usage, while BES is usually low API usage but high computational time.

I’ve noticed that processing 10 – 10,000 records took about the same time using batch execution.  As I mentioned in my earlier post, it took about 28 minutes using the batch execution program to process 2.1 million records with 160 columns.  I suspect that most of the time is spent uploading data to blob storage and downloading it to my machine.  Our billing statement shows Machine Learning Production API Compute Hours at 0.1358 consumed units, where the unit of measure is 10 hours, i.e., about 1.36 compute hours.  The Machine Learning Production API Transactions metric indicates a value of 0.0056, where the unit of measure is 10,000 transactions, i.e., about 56 API calls.  These numbers change in real time in the Microsoft Azure Enterprise Portal.

So far I’ve processed over 8 million records with batch execution, and our billing statement shows $0 for both BES and RRS.   The only charge is $9.69 for the seat.

Also, I’ve been told that there is no longer a distinction between Stage and Production.  And the 10 GB limit that applies to free accounts is lifted for enterprise accounts.

High AUC value in Azure Machine Learning Model

AUC, or Area Under the Curve, is one of the metrics that can help with machine learning model evaluation.  If this value is .5, or 50%, then the model is no better than a random guess.  If the value is 1, then the model is 100% correct, and in data science this would be a red flag: it usually indicates a feature in the model that is perfectly correlated with what we are trying to predict.
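A useful way to internalize the metric: AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one.  Here is a brute-force C# sketch of that definition (my own illustration, not how Azure ML computes it):

static double Auc(double[] positiveScores, double[] negativeScores)
{
    // Count score pairs where the positive outranks the negative; ties count as half
    double wins = 0;
    foreach (double p in positiveScores)
        foreach (double n in negativeScores)
            wins += p > n ? 1.0 : (p == n ? 0.5 : 0.0);
    return wins / ((double)positiveScores.Length * negativeScores.Length);
}

// Random scores give roughly 0.5; perfect separation gives 1.0, the red-flag case above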

[Screenshot: model evaluation output showing the AUC metric]

Recently I’ve been trying to figure out why my classification model had an AUC value ranging between 95% and 99%.  An AUC this high is generally too good to be true.  This is an unusual problem, because normally one would want to increase their AUC score rather than decrease it.

So I started with the most obvious suspect: leakage into my model, that is, a variable which would not be available at the time of prediction.  Given that I generated about 160 variables for my model, analyzing each of them did not shed any light on my issue.  No leakage found.

It was recommended that I try different techniques, one of which is SMOTE (Synthetic Minority Oversampling Technique).  Considering that the ratio of my ‘good guys’ to my ‘bad guys’ (I’m looking to predict ‘bad guys’) is 75% to 25%, a suggestion was made that a more even ratio, like 50/50, would be more beneficial for my model.  Using SMOTE artificially generated more ‘bad guys’ for my data set, but my AUC did not really change.

Then I tried randomly limiting my ‘good guys’ to the same number as my ‘bad guys’.  My AUC actually went up a few points.
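For reference, randomly undersampling the majority class takes only a few lines.  Here is a sketch of the idea with a generic row type; this is my own code, not an Azure ML module.

using System;
using System.Collections.Generic;

static List<T> Undersample<T>(List<T> majorityRows, int minorityCount, Random rng)
{
    // Shuffle a copy of the majority class (Fisher-Yates), then keep only as many rows as the minority class has
    var copy = new List<T>(majorityRows);
    for (int i = copy.Count - 1; i > 0; i--)
    {
        int j = rng.Next(i + 1);
        T tmp = copy[i];
        copy[i] = copy[j];
        copy[j] = tmp;
    }
    return copy.GetRange(0, minorityCount);
}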

My next attempt to understand the problem was to gradually eliminate the columns with the highest Permutation Feature Importance values from my model and check the AUC score.  The AUC value decreased anywhere from a few to a hundred basis points with each iteration.

It turns out that I was solving a somewhat different problem.  Here, management used the same data elements that I used in my model to come up with the business rules that categorize someone as a ‘bad guy’.  So the model picked up those rules, hence the high AUC, accuracy and precision.  My model wasn’t really predicting probabilities for the ‘bad guys’ but rather reproducing the view of the bad guys through the lens of our business experts.

The model is still very useful.  There are at least two benefits as I see them.  First, if the expert user is no longer available and his or her rules have not been documented, the model fills that gap.  Second, automation: the model can relieve experts from manual work so that they can concentrate on more value-adding activities.

I would like to extend special thanks to Ilya Lipkovich, Ph.D. for his contribution to the resolution of this issue.

I would also like to thank Microsoft Azure ML Team, specifically Akshaya Annavajhala and Xinwei Xue, Ph.D. for their assistance.

Azure Machine Learning Batch Execution

I prefer to have an end-to-end solution when evaluating a potential product.  This goes well with the “small batch” processing concept mentioned by Eric Ries in his book The Lean Startup.  So in order to truly understand how to use a particular technology, I like to build a prototype solution for a problem I’m trying to solve.  I applaud Microsoft, and the Azure ML Team in particular, for the ease with which Azure ML can be used by a novice user.  The fact that a web service can be deployed with a single click of a button is absolutely awesome.  Another really great feature is the availability of sample code to consume your web service.  Once you click on your newly created web service, click on BATCH EXECUTION, scroll to the bottom, and you’ll see sample code in C#, Python and R.

I created a console app using Visual Studio 2015 and the sample code mentioned above.  Then I opened the Package Manager Console and installed the package listed in the sample code instructions:

Tools -> NuGet Package Manager -> Package Manager Console

Install-Package Microsoft.AspNet.WebApi.Client

I also searched NuGet for WindowsAzure.Storage and installed it, which automatically added all the required references, including Microsoft.WindowsAzure.Storage.dll, as listed in the documentation.

Search for “replace” in the code to find all the places where you need to add your own data, like the Azure Blob storage account, storage key, the key for your newly created web service, etc.

Below are some tips to get the code running.  You’ll see a command prompt window come up when the program runs.

The program moves a file from my local machine to Azure Blob storage, scores it using a classification model and downloads a new file with scored labels and probabilities to my local machine.

A note on file locations: use a verbatim C# string for the file name.  Putting @ in front of the opening quote tells the compiler not to treat the backslashes as escape characters.  Here are the key constants that need to be set:

const string StorageAccountName = "yourstorageaccount"; // Replace this with your Azure Storage Account name
const string StorageAccountKey = "yourkey=="; // Replace this with your Azure Storage Key
const string StorageContainerName = "yourcontainer"; // Replace this with your Azure Storage Container name
const string InputFileLocation = @"C:\Temp\records.csv"; // Replace this with the location of your input file
const string InputBlobName = "scored_records.csv"; // Replace this with the name you would like to use for your Azure blob; this needs to have the same extension as the input file
const string apiKey = "yourAPIKey=="; // Replace this with the API key for the web service
const string OutputFileLocation = @"C:\Temp\Scoring_Output.csv"; // Replace this with the location you would like to use for your output file

Also, make sure BaseUrl is set to the POST URL value.  You can find this when you click on BATCH EXECUTION; make sure not to include anything after ‘jobs’, for example: https://ussouthcentral.services.azureml.net/workspaces/feb9f1db037d499fa3e3081a318eada2/services/1a7q3vfd95cd4791afc06216f354c697/jobs

const string BaseUrl = "https://ussouthcentral.services.azureml.net/workspaces/feb9f1db037d499fa3e3081a318eada2/services/1a7q3vfd95cd4791afc06216f354b697/jobs";

Set the timeout in the code to whatever you need; you might want to increase the value.  I set mine to 30 minutes.

const int TimeOutInMilliseconds = 1800 * 1000; // Set a timeout of 30 minutes
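With the constants in place, the upload step looks roughly like the sketch below using the WindowsAzure.Storage client.  This is a simplified outline; the actual sample code generated by Azure ML also handles job submission, polling and error handling.

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Build the connection string from the storage constants above
string connectionString = string.Format(
    "DefaultEndpointsProtocol=https;AccountName={0};AccountKey={1}",
    StorageAccountName, StorageAccountKey);

CloudStorageAccount account = CloudStorageAccount.Parse(connectionString);
CloudBlobClient client = account.CreateCloudBlobClient();
CloudBlobContainer container = client.GetContainerReference(StorageContainerName);
container.CreateIfNotExists();

// Upload the local input file as a block blob for the batch job to read
CloudBlockBlob blob = container.GetBlockBlobReference(InputBlobName);
using (var stream = System.IO.File.OpenRead(InputFileLocation))
{
    blob.UploadFromStream(stream);
}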

It took 28 minutes to process 2.1 million records or about 1.56 GB of data when I ran my initial test.

Once I got the file with scored probabilities back, I realized that I did not need all the data I send to the model returned to me.  So I changed my web service, using Project Columns to return only my key value along with the Scored Labels and Scored Probabilities, which drastically reduced the size of the file returned to the console app.

This console app can be scheduled as a task with Windows Task Scheduler or integrated into an SSIS package; however, the current version of SSIS (VS 2012 – 2013) at the time of this writing shipped with older versions of the references required by the sample code.

Here are some useful links for consuming Azure Machine Learning web services:

https://azure.microsoft.com/en-us/documentation/articles/machine-learning-consume-web-services/

http://blogs.msdn.com/b/ssis/archive/2015/06/25/data-preparation-for-azure-machine-learning-using-ssis.aspx

Note that Scored Probabilities values come back in scientific notation, e.g. 3.25789456123547E-08, which isn’t easy to manipulate in a SQL Server table.  So I just created another column of type decimal(20, 18) and set it with the following T-SQL (the table and new column names here are placeholders for your own):

UPDATE dbo.ScoredRecords -- placeholder table name
SET [Scored Probabilities Decimal] = -- placeholder decimal(20, 18) column
   CASE
      WHEN [Scored Probabilities] LIKE '%E-%' THEN LTRIM(RTRIM(CAST(CAST([Scored Probabilities] AS FLOAT) AS DECIMAL(20,18))))
      ELSE [Scored Probabilities]
   END;

Also, I made the mistake of renaming one of my columns to ‘Processed’ instead of ‘processed’ when generating the file to be scored, but did not update it in the model.  I kept getting an error that column ‘processed’ could not be found when trying to run the app.  Be careful with case sensitivity; it does matter.

I would like to extend special thanks to Microsoft Azure ML Team and specifically Akshaya Annavajhala for his assistance and patience. 

Welcome to my slice of the web!

This blog is dedicated to my experience with Microsoft Azure Machine Learning and predictive analytics in an attempt to help others leverage the product for their needs.

I first heard about Azure ML during a hackathon/POC we had in Redmond, WA, with Microsoft Azure evangelists, Rob Bagby, Jesus Aguilar and the rest of their team.

This was a great event that covered a lot of Microsoft technologies from inside Building 20 on Microsoft’s campus.  Having access to Microsoft’s engineers during the event was an invaluable experience.