Azure Machine Learning SMOTE – Part 2

It may be beneficial for your model to use the Clean Missing Data module when using SMOTE. Let’s consider the following example with stock data.

[Figure: visualization of the TEAM stock dataset]

My dataset is missing values in the first row for the Long and Short columns.  These two fields were defined by me, and their values depend on the next day’s data.  If today were February 4th, we wouldn’t have the next trading day’s data yet, hence the missing values.
Here is what my model looks like with a Cleaner and SMOTE.

[Figure: model with the Clean Missing Data and SMOTE modules]

Here is the output of our data after the Cleaner.
[Figure: dataset output after the Clean Missing Data module]

Since my Cleaner replaced the missing values with 0s, and I already used 0s in my dataset to indicate negative outcomes, this is probably not an ideal practice, but for illustrative purposes it should be just fine.
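Outside of Azure ML Studio, the same kind of substitution can be sketched in Python with pandas. This is only an illustration: the Long and Short column names come from the example above, and all the numbers are made up.

```python
import numpy as np
import pandas as pd

# Toy stock dataset mirroring the example above: the first row is
# missing its Long/Short labels (all numbers are made up).
df = pd.DataFrame({
    "Close": [101.5, 99.2, 100.7, 98.4],
    "Long":  [np.nan, 1, 0, 1],
    "Short": [np.nan, 0, 1, 0],
})

# Rough equivalent of the Clean Missing Data module configured to
# substitute missing values with 0.
cleaned = df.fillna({"Long": 0, "Short": 0})
print(cleaned)
```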

Let’s visualize the output of the Evaluate Model module.
[Figure: Evaluate Model results with the Cleaner]
Pretty high AUC, accuracy, precision, recall, and F1 score.  Let’s check what happens when we connect our dataset directly to SMOTE, bypassing the Clean Missing Data module.

[Figure: model with SMOTE but without the Clean Missing Data module]

Here is the output from SMOTE.
[Figure: SMOTE output without the Cleaner]
So we can see that rows with empty values were also added to our dataset.  Considering that I had 4 records where Long had a value of 1 in my original dataset and used a SMOTE percentage of 200, I was expecting only 8 additional rows (4 × 200% = 8); however, 10 were added.  Let’s evaluate our model.
[Figure: Evaluate Model results without the Cleaner]
A significant drop across the board, with an AUC of 0.5, which means our model is no better than a random guess.
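As a minimal sketch of why missing values trip up SMOTE: each synthetic row is interpolated between a minority example and one of its nearest neighbors, so a NaN in the parent row carries straight into the synthetic row. The interpolation formula follows the original SMOTE paper; the arrays below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# A minority-class row with a missing feature, and one of its neighbors.
x = np.array([np.nan, 1.0, 0.5])
neighbor = np.array([2.0, 1.0, 0.7])

# SMOTE-style interpolation: synthetic = x + lam * (neighbor - x), lam in [0, 1].
lam = rng.random()
synthetic = x + lam * (neighbor - x)
print(synthetic)  # first element is nan -- the missing value propagates
```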


Azure Machine Learning SMOTE – Part 1

SMOTE, or Synthetic Minority Oversampling Technique, is designed for dealing with class imbalance.  Based on a few books and articles that I’ve read on the subject, machine learning algorithms tend to perform better when the number of observations in both classes is about the same.  As I mentioned in one of my earlier posts, High AUC value in Azure Machine Learning Model, this technique is useful for dealing with a highly common problem where the number of observations of one class is much greater than the number of observations of the other class.

According to the original paper, SMOTE: Synthetic Minority Over-sampling Technique, from the Journal of Artificial Intelligence Research, the technique is a combination of undersampling the majority class and oversampling the minority class.  However, the SMOTE definition and the Blood Donation dataset example on MSDN’s website illustrate that in Azure ML the majority class stays intact; only the minority class gets a boost.
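For comparison, here is a minimal sketch of that oversampling behavior using scikit-learn and imbalanced-learn rather than the Azure ML module itself; the dataset is synthetic.

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic two-class dataset with roughly a 9:1 imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print(Counter(y))  # majority class dominates

# SMOTE synthesizes new minority examples; the majority class is untouched.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))  # majority count unchanged, minority boosted to match it
```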

In my earlier classification example of fraud detection, I already had a high AUC value, and using SMOTE did not move my AUC score a single percentage point.  While working on a new model to determine the quality of our panelists, my initial model without SMOTE had an AUC of 0.883.
[Figure: evaluation results before SMOTE, AUC 0.883]

After adding SMOTE with a SMOTE percentage of 100 and a value of 1 for the number of nearest neighbors, I got a slight improvement in the AUC score.
[Figure: model with SMOTE added]
[Figure: evaluation results after SMOTE]
Note that the numbers of true positives, false negatives, and false positives jumped significantly.

Increasing the number of nearest neighbors parameter to 2 decreased the AUC, although values of 3 and 4 brought it back up to a similar reading.

Considering that my dataset is highly imbalanced, 99% vs. 1%, and using SMOTE only provided a 1% boost, let’s examine what happens when we increase our minority set to roughly 50% of the total.  When changing the SMOTE percentage to 7000 with 1 nearest neighbor, the AUC increased even more.  With a SMOTE percentage of 9000, my dataset distribution was about 53% to 47%, and my AUC score was 0.931.
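As a sanity check on the arithmetic, assuming a hypothetical 10,000-row dataset at the 99/1 split mentioned above:

```python
majority, minority = 9900, 100                     # hypothetical 99%/1% split
smote_pct = 9000                                   # SMOTE percentage parameter
minority_after = minority * (1 + smote_pct / 100)  # 100 original + 9,000 synthetic
total = majority + minority_after
print(majority / total, minority_after / total)    # ~0.52 vs ~0.48
```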
[Figure: SMOTE module properties]

[Figure: evaluation results with a SMOTE percentage of 9000]

Note the higher recall and F1 scores; the model appears to be more balanced.  More about model evaluation in my later posts.


High AUC value in Azure Machine Learning Model

AUC, or Area Under the Curve, is one of the metrics that can help with machine learning model evaluation.  If this value is 0.5, or 50%, then the model is no better than a random guess.  If the value is 1, then the model is 100% correct, which in data science would be a red flag: it would be an indication of a feature in the model that is perfectly correlated with what we are trying to predict.
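For reference, a minimal sketch of computing AUC in Python with scikit-learn; the labels and scores below are made up.

```python
from sklearn.metrics import roc_auc_score

y_true  = [0, 0, 1, 1, 0, 1]                # actual class labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # predicted probabilities

print(roc_auc_score(y_true, y_score))       # 1.0 is perfect, 0.5 is a random guess
```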


Recently I’ve been trying to figure out why my classification model had an AUC value ranging between 95% and 99%.  An AUC this high is generally too good to be true.  This is a strange problem, because one would usually want to increase an AUC score rather than decrease it.

So I started with the most obvious: leakage into my model, i.e., a variable that would not be available at the time of prediction.  Given that I had generated about 160 variables for my model, analyzing each of them did not shed any light on my issue.  No leakage found.

I was advised to try different techniques, one of which was applying SMOTE (Synthetic Minority Oversampling Technique) to my dataset.  Considering that the ratio of my ‘good guys’ to my ‘bad guys’ (I’m looking to predict the ‘bad guys’) is 75% to 25%, the suggestion was that a more even ratio, like 50/50, would be more beneficial for my model.  Using SMOTE artificially generated more ‘bad guys’ for my dataset, but my AUC did not really change.

Then I tried randomly limiting my ‘good guys’ to the same number as my ‘bad guys’.  My AUC actually went up a few points.
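A minimal sketch of that kind of random undersampling with pandas; the undersample helper and the label column name are hypothetical.

```python
import pandas as pd

def undersample(df: pd.DataFrame, label: str, seed: int = 0) -> pd.DataFrame:
    """Randomly downsample every class to the size of the smallest class."""
    n_min = df[label].value_counts().min()
    parts = [grp.sample(n=n_min, random_state=seed)
             for _, grp in df.groupby(label)]
    # Recombine and shuffle so the classes are interleaved.
    return pd.concat(parts).sample(frac=1, random_state=seed)
```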

My next attempt to understand the problem was to gradually eliminate the columns with the highest Permutation Feature Importance values from my model and check the AUC score.  I found that the AUC value decreased anywhere from a few to a hundred basis points with each iteration.
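Outside of Azure ML Studio, a similar check can be sketched with scikit-learn’s permutation_importance; the model and dataset below are placeholders standing in for the real experiment.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data and model standing in for the real experiment.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)

# Shuffle each feature in turn and measure the drop in AUC; the
# biggest drops mark the features the model leans on the most.
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=10, random_state=0)
top = result.importances_mean.argsort()[::-1][:5]
print("Most influential feature indices:", top)
```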

It turns out that I was solving a somewhat different problem.  Management had used the same data elements that I used in my model to come up with the business rules that categorize someone as a ‘bad guy’.  The model picked up those rules, hence the high AUC, accuracy, and precision.  So my model wasn’t really predicting the probability of someone being a ‘bad guy’, but rather reproducing the view of the ‘bad guys’ through the lens of our business experts.

The model is still very useful.  There are at least two benefits as I see them.  First, if the expert user is no longer available and his or her rules have not been documented, the model will fill that gap.  Second, automation: the model can relieve the experts from manual work so that they can concentrate on more value-adding activities.

I would like to extend special thanks to Ilya Lipkovich, Ph.D. for his contribution to the resolution of this issue.

I would also like to thank Microsoft Azure ML Team, specifically Akshaya Annavajhala and Xinwei Xue, Ph.D. for their assistance.