Microsoft Advanced Analytics Boot Camp Review

This was a great event to attend for anyone looking to leverage Cortana Intelligence in their applications.

We went through a business case of collecting IoT data from HVAC systems at different facilities and built a logical data flow, from collecting the data to implementing reporting in Power BI. It was insightful to set up a Hadoop cluster on HDInsight with blob storage and use Hive to query the HVAC data from Visual Studio 2015.

Here are some interesting takeaways:

  • Don’t make the mistake of using HDInsight for storage; Blob storage or Data Lake is for that. HDInsight’s primary purpose is computation.
  • Since Spark has machine learning capabilities, why use R Server? Well, few people want to program in Scala.
  • A good use case for NoSQL technologies is when you don’t know the structure of your data upfront or don’t care about it, but want to query it later.
  • The 10 GB dataset limit in Azure ML will soon be raised to 300 GB.

It was great to connect with some fellow data enthusiasts like Sean Werick and Ginger Grant of Pragmatic Works as well as Ali Zaidi and Beverly Hanson of Microsoft.

Azure Machine Learning SMOTE – Part 3

One aspect to keep in mind when publishing a model that uses the SMOTE module is that SMOTE needs to be removed from the predictive experiment before it is published as a web service. SMOTE is used to improve a model during training and is not intended for actual scoring. In my case, failing to remove SMOTE from the predictive experiment produced the following error when using Excel 2013 or later to score one row of data:

Error! {"error":{"code":"ModuleExecutionError","message":"Module execution encountered an error.","details":[{"code":"8","target":"SMOTE","message":"Error 0008: Parameter \"Number of nearest neighbors\" value should be in the range of [1, 0]."}]}}


The error implies that, since SMOTE’s ‘Number of nearest neighbors’ parameter is set to 1 and we are scoring only 1 record, there is not enough data for the module to synthetically generate more data.

Here is what my predictive model looks like before and after removing SMOTE.  Essentially we connect Apply Transformation to Project Columns.





Azure Machine Learning SMOTE – Part 2

It may be beneficial for your model to use the Clean Missing Data module when using SMOTE. Let’s consider the following example of stock data.


My dataset is missing values in the first row for the columns Long and Short. I defined these two fields myself, and their values depend on the next day’s data. If today were February 4th, we wouldn’t have the next trading day’s data yet, hence the missing values.
Here is what my model looks like with a Cleaner and SMOTE.


Here is the output of our data after the Cleaner.

Since my Cleaner replaced empty strings with 0s and I already used 0s in my dataset to indicate negative outcomes, this is probably not an ideal practice, but for illustrative purposes it should be just fine.

Let’s Visualize the content of Evaluate Model.
Pretty high AUC, accuracy, precision, recall and F1 score. Let’s check what happens when we connect our dataset directly to SMOTE, bypassing the Clean Missing Data module.


Here is the output from SMOTE.
So we can see that rows with empty values were also added to our dataset. Considering that I had 4 records where Long had a value of 1 in my original dataset and I used 200% in SMOTE, I was expecting only 8 additional rows; however, 10 were added. Let’s evaluate our model.
A significant drop across the board: an AUC of 0.5 means our model is no better than a random guess.




Azure ML Feature Engineering – Convert to Indicator Values

Feature engineering is probably one of my favorite aspects of data science. This is the area where domain expertise and creativity can pay high dividends. Essentially, feature engineering allows us to come up with our own features, or columns, to make our models better. We can apply numerous tricks using the variety of tools provided by Azure ML Studio. Here is a screenshot of the different manipulation modules:


Convert to Indicator Values is a module that transforms the values in the rows of a column into separate columns with binary values. For example, if we have a dataset with a single column A with 3 rows and values ‘b’, ‘c’, and ‘d’, applying Convert to Indicator Values produces a dataset with the original column A and 3 new columns b, c and d, with 1s and 0s indicating the appropriate value. Transformation of categorical values into columns has been available in most statistical software, for example Minitab. In one of my M.B.A. courses, when studying regression, we called ‘b’, ‘c’ and ‘d’ “dummy variables”.


When adding Convert to Indicator Values, make sure to use a Metadata Editor to convert the field to categorical in order to avoid the following error: ‘Column with name “xxx” is not in an allowed category. . ( Error 0056 )’.

Make sure to check Overwrite categorical columns; otherwise the original column will stay in the dataset. It has been mentioned that keeping the original column may help a decision tree algorithm, whereas removing it may help a simple linear algorithm.
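
The transformation described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Azure ML implements the module: the function name to_indicator_values and the overwrite flag are hypothetical, and the new columns are simply named after the category values, as in the example with column A.

```python
def to_indicator_values(rows, column, overwrite=False):
    """One-hot encode `column` in a list of dict rows.

    overwrite=True drops the original column, loosely mirroring the
    "Overwrite categorical columns" option (an assumption, not Azure ML code).
    """
    # Collect distinct categories in first-seen order so columns are stable.
    categories = list(dict.fromkeys(row[column] for row in rows))
    encoded = []
    for row in rows:
        # Copy the row, leaving out the original column when overwriting.
        new_row = {k: v for k, v in row.items() if not (overwrite and k == column)}
        for category in categories:
            new_row[category] = 1 if row[column] == category else 0
        encoded.append(new_row)
    return encoded


data = [{"A": "b"}, {"A": "c"}, {"A": "d"}]
for row in to_indicator_values(data, "A"):
    print(row)
```

With overwrite=True the same call returns only the indicator columns b, c and d.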

Azure Machine Learning SMOTE – Part 1

SMOTE, or Synthetic Minority Oversampling Technique, is designed for dealing with class imbalances. Based on a few books and articles that I’ve read on the subject, machine learning algorithms tend to perform better when the number of observations in both classes is about the same. As I mentioned in one of my earlier posts, High AUC value in Azure Machine Learning Model, this technique is useful for dealing with a highly common problem where the number of observations of one class is much greater than the number of observations of the other class.

According to SMOTE: Synthetic Minority Over-sampling Technique, an article from the Journal of Artificial Intelligence Research, the technique is a combination of undersampling of the majority class and oversampling of the minority class. However, the SMOTE definition and the Blood Donation dataset example on MSDN’s website illustrate that the majority class stays intact and only the minority class gets a boost.
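
To make the oversampling half of the technique concrete, here is a minimal sketch in plain Python. It is my own simplification, not the Azure ML module: the function name smote, its parameters and the fixed seed are assumptions, and it only handles percentages that are multiples of 100. Each synthetic point is placed at a random position on the line segment between a minority point and one of its k nearest minority neighbors. The majority class is never touched.

```python
import math
import random


def smote(minority, percentage=100, k=1, seed=0):
    """Return synthetic points for `minority`, a list of numeric tuples."""
    # Each point needs k *other* minority points to pick neighbors from,
    # so with a single row there is no valid value of k at all.
    if not 1 <= k <= len(minority) - 1:
        raise ValueError("k must be between 1 and len(minority) - 1")
    rng = random.Random(seed)
    per_sample = percentage // 100  # e.g. 200% -> 2 synthetic points per original
    synthetic = []
    for i, point in enumerate(minority):
        # Rank the other minority points by Euclidean distance to `point`.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(point, p))[:k]
        for _ in range(per_sample):
            neighbor = rng.choice(neighbors)
            gap = rng.random()  # position along the segment, in [0, 1)
            synthetic.append(
                tuple(a + gap * (b - a) for a, b in zip(point, neighbor))
            )
    return synthetic


minority_rows = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.0), (5.0, 5.0)]
new_rows = smote(minority_rows, percentage=200, k=1)
print(len(new_rows), "synthetic rows generated")
```

Under this reading, 4 minority rows at 200% yield 8 synthetic rows, and every synthetic row lies between two real minority rows.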

In my earlier classification example of fraud detection, I already had a high AUC value, and using SMOTE did not move my AUC score a single percentage point. So, while working on a new model to determine the quality of our panelists, I started with a model without SMOTE, which had an AUC of 0.883.

After adding SMOTE with a SMOTE percentage of 100 and a value of 1 for the number of nearest neighbors, I got a slight improvement in AUC score.
Note that the number of true positives, false negatives and false positives jumped significantly.

Increasing the number of nearest neighbors parameter to 2 decreased AUC, although 3 and 4 brought it up to a similar reading.

Considering that my dataset is highly imbalanced (99% vs. 1%) and that SMOTE provided only a 1% boost, let’s examine what happens when we start to increase our minority set to be roughly 50% of the set. After changing the SMOTE percentage to 7000% and using 1 nearest neighbor, AUC increased even more. With a value of 9000% for SMOTE percentage, my dataset distribution was about 53% to 47% and my AUC score was 0.931.
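
The effect of the SMOTE percentage on class balance is simple arithmetic, sketched below. The row counts of 9,900 and 100 are hypothetical stand-ins for a 99% vs. 1% split, not my actual dataset, so the output will not match my reported distribution exactly. The function name is also just for illustration.

```python
def class_shares_after_smote(majority, minority, smote_percentage):
    """Return (majority_share, minority_share) after oversampling the minority."""
    # A SMOTE percentage of 100 adds as many synthetic rows as there are
    # minority rows, 200 adds twice as many, and so on.
    synthetic = minority * smote_percentage // 100
    new_minority = minority + synthetic
    total = majority + new_minority
    return majority / total, new_minority / total


for pct in (100, 7000, 9000):
    maj, mino = class_shares_after_smote(9900, 100, pct)
    print(f"{pct:>5}%: majority {maj:.1%}, minority {mino:.1%}")
```

For an exact 99-to-1 split, 9000% lands at roughly 52% to 48%, which is in the same neighborhood as the distribution I observed.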


Note the higher Recall and F1 scores; the model appears to be more balanced. More about model evaluation in my later posts.



Azure Machine Learning Pricing and Billing

From my research on the pricing structure for Azure ML, it appears to have three components: per API call, computational time and per seat. As of the writing of this post, a “seat” means an Azure ML workspace tied to an Azure subscription. So for each Azure ML workspace, the subscription will be billed $9.69/month. Please note that $9.69 is what I found in our billing statement; other sources mention $9.99.

The time taken to compute is charged at $2 / hour, for both BES and RRS.

Every API call is charged at the listed price ($0.0005 per call), for both BES (batch execution) and RRS (request/response).

So, for example, if 1000 records take 1 hour to compute, the total charge will be $2.0005: $2 for the time taken + $0.0005 for 1 API call.

Why does this matter? RRS is generally low computation time but high API usage, while BES is usually low API usage but high computational time.
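
The calculation above can be written as a small helper to compare the two usage patterns. The rates are the ones quoted in this post and may be out of date. The request/response workload (1000 calls totalling 0.01 compute hours) is a made-up figure for illustration only, and the monthly seat fee is left out.

```python
COMPUTE_RATE_PER_HOUR = 2.00  # USD per compute hour, as quoted in this post
API_RATE_PER_CALL = 0.0005    # USD per API call, as quoted in this post


def estimate_charge(compute_hours, api_calls):
    """Usage charge in USD, excluding the monthly per-seat fee."""
    return compute_hours * COMPUTE_RATE_PER_HOUR + api_calls * API_RATE_PER_CALL


# The batch example from the text: 1 hour of compute, 1 API call.
batch_charge = estimate_charge(compute_hours=1.0, api_calls=1)

# Hypothetical request/response workload: many calls, very little compute.
rrs_charge = estimate_charge(compute_hours=0.01, api_calls=1000)

print(f"BES: ${batch_charge:.4f}")
print(f"RRS: ${rrs_charge:.4f}")
```

In the batch case nearly all of the charge comes from compute time, while in the request/response case most of it comes from the per-call fee.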

I’ve noticed that processing 10 to 10000 records took about the same time using batch execution. As I mentioned in my earlier post, it took about 28 minutes using the batch execution program to process 2.1 million records with 160 columns. I suspect that most of the time is spent uploading data to blob storage and downloading it to my machine. Our billing statement shows Machine Learning Production API Compute Hours at 0.1358 consumed units, where the unit of measure is 10 hours. The Machine Learning Production API Transactions metric shows a value of 0.0056, where the unit of measure is 10,000s. These numbers do change in real time in the Microsoft Azure Enterprise Portal.
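
To read those consumed-unit figures as actual quantities, multiply by the unit of measure. The sketch below assumes that the transactions unit means 10,000 transactions per unit. That is my interpretation of the statement, not something the portal spells out.

```python
def consumed_to_actual(consumed_units, unit_size):
    """Convert a 'consumed units' figure into the underlying quantity."""
    return consumed_units * unit_size


compute_hours = consumed_to_actual(0.1358, 10)     # unit of measure: 10 hours
transactions = consumed_to_actual(0.0056, 10_000)  # assumed: 10,000 transactions

print(f"Compute time: {compute_hours:.3f} hours ({compute_hours * 60:.0f} minutes)")
print(f"API transactions: {transactions:.0f}")
```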

So far I’ve processed over 8 million records with batch execution, and our billing statement shows $0 for both BES and RRS.   The only charge is $9.69 for the seat.

Also, I’ve been told that there is no longer a distinction between Stage and Production, and that the 10 GB limit for free accounts is lifted for enterprise accounts.

High AUC value in Azure Machine Learning Model

AUC, or Area Under the Curve, is one of the metrics that can help with machine learning model evaluation. If this value is .5, or 50%, then the model is no better than a random guess. If the value is 1, then the model is 100% correct, and in data science this would be a red flag. It would indicate a feature in the model that is perfectly correlated with what we are trying to predict.
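
One way to build intuition for these values is to compute AUC directly from its pairwise interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below is a plain-Python illustration with made-up labels and scores. It is not how Azure ML computes the metric internally.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Made-up scores: perfect separation versus a model that cannot tell classes apart.
print(auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9]))
print(auc([0, 1, 0, 1], [0.5, 0.5, 0.5, 0.5]))
```

The first call shows the perfect-separation case and the second the random-guess case described above.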



Recently I’ve been trying to figure out why my classification model had an AUC value ranging between 95% and 99%. An AUC this high is generally too good to be true. This is a weird problem because generally one would want to increase their AUC score rather than decrease it.

So I started with the most obvious suspect: leakage into my model, that is, a variable which would not be available at the time of prediction. I generated about 160 variables for my model, and analysis of each did not shed any light on my issue. No leakage found.

I was advised to try different techniques on my dataset, one of which is SMOTE (Synthetic Minority Oversampling Technique). Considering that the ratio of my ‘good guys’ to my ‘bad guys’ (I’m looking to predict ‘bad guys’) is 75% to 25%, a suggestion was made that a more even ratio like 50/50 would be more beneficial for my model. Using SMOTE artificially generated more ‘bad guys’ for my dataset, but my AUC did not really change.

Then I tried randomly limiting my ‘good guys’ to the same number as my ‘bad guys’. My AUC actually went up a few points.
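
Random undersampling of this kind can be sketched as follows. This is a plain-Python illustration rather than the Azure ML modules I actually used. The function name, the label key and the 75/25 toy data are all hypothetical.

```python
import random


def undersample_to_smallest_class(rows, label_key, seed=0):
    """Randomly trim every class down to the size of the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(row[label_key], []).append(row)
    smallest = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        # Sample without replacement so no row appears twice.
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)  # avoid leaving the classes grouped together
    return balanced


# Toy data with a 75% / 25% split between 'good' and 'bad'.
toy = [{"id": i, "label": "good"} for i in range(75)]
toy += [{"id": i, "label": "bad"} for i in range(75, 100)]
balanced = undersample_to_smallest_class(toy, "label")
print(len(balanced), "rows after undersampling")
```

The trade-off is that the discarded majority rows are simply lost, which is why oversampling techniques such as SMOTE are often tried first.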

My next attempt to understand the problem was to gradually eliminate the columns with the highest Permutation Feature Importance value from my model and check the AUC score. I found that the AUC value decreased anywhere from a few to a hundred basis points with each iteration.

It turns out that I was solving a bit of a different problem. Here, management used the same data elements that I used in my model to come up with the business rules to categorize someone as a ‘bad guy’. So the model picked up those rules, hence such a high AUC, accuracy and precision. So my model wasn’t really predicting probabilities for the ‘bad guys’ but rather the view of the bad guys through the lens of our business experts.

The model is still very useful. There are at least two benefits as I see them. First, if the expert user is no longer available and his or her rules have not been documented, the model will fill that gap. Second, automation: the model can relieve experts from manual work so that they can concentrate on more value-adding activities.

I would like to extend special thanks to Ilya Lipkovich, Ph.D. for his contribution to the resolution of this issue.

I would also like to thank the Microsoft Azure ML Team, specifically Akshaya Annavajhala and Xinwei Xue, Ph.D., for their assistance.