Azure Machine Learning SMOTE – Part 3

One thing to keep in mind when publishing a model that uses the SMOTE module is that SMOTE needs to be removed from the predictive experiment before it is published as a web service.  SMOTE is used to improve a model during training and is not intended for actual scoring.  In my case, failing to remove SMOTE from the predictive experiment produced the following error when using Excel 2013 or later to score one row of data: Error! {"error":{"code":"ModuleExecutionError","message":"Module execution encountered an error.","details":[{"code":"8","target":"SMOTE","message":"Error 0008: Parameter \"Number of nearest neighbors\" value should be in the range of [1, 0]."}]}}

error2

The error implies that because SMOTE’s ‘Number of nearest neighbors’ parameter is set to 1 and we are scoring only one record, there is not enough data for the module to generate synthetic rows: a single record has no neighbors to interpolate with.
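The odd-looking range [1, 0] in the message can be reproduced with a small sketch. This is only an illustration, under the assumption that the upper bound for the neighbor count is the number of rows minus one; the function names are hypothetical and are not part of Azure Machine Learning.

```python
def valid_neighbor_range(num_rows: int) -> tuple:
    """Return (low, high) bounds for the neighbor count.

    ASSUMPTION: each row can only use the other rows as neighbors,
    so the upper bound is num_rows - 1.
    """
    return (1, num_rows - 1)


def check_neighbors(k: int, num_rows: int) -> None:
    """Raise an error shaped like the one in the post when k is out of range."""
    low, high = valid_neighbor_range(num_rows)
    if not (low <= k <= high):
        raise ValueError(
            'Parameter "Number of nearest neighbors" value should be '
            f"in the range of [{low}, {high}]."
        )


# Scoring a single row: the range collapses to (1, 0), so no k can satisfy it.
print(valid_neighbor_range(1))
```

Under that assumption, even the smallest setting of 1 neighbor fails when only one row is scored, which is consistent with the error above.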

Here is what my predictive model looks like before and after removing SMOTE.  Essentially we connect Apply Transformation to Project Columns.

before_removal.jpg

after_removal.jpg


Azure Machine Learning SMOTE – Part 2

It may be beneficial for your model to use the Clean Missing Data module when using SMOTE. Let’s consider the following example with stock data.

Vizualize_TEAM_Data

My dataset is missing values in the first row for the Long and Short columns.  I defined these two fields myself, and their values depend on the next day’s data.  If today were February 4th, we wouldn’t have the next trading day’s data yet, hence the missing values.
Here is what my model looks like with a Cleaner and SMOTE.

TEAM_Model

Here is the output of our data after the Cleaner.
TEAM_Cleaner_Outputl

My Cleaner replaced the empty values with 0s, and I already use 0s in my dataset to indicate negative outcomes, so this is probably not an ideal practice. For illustrative purposes, though, it should be just fine.
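To make the concern about 0s concrete, here is a minimal sketch in plain Python. The rows are hypothetical and only mirror the shape of the Long column; `fill_missing_with_zero` and `drop_missing` are illustrative helpers, not Azure Machine Learning functions.

```python
# Hypothetical rows; None marks a label that is not known yet.
rows = [
    {"day": "latest", "Long": None},
    {"day": "previous", "Long": 1},
    {"day": "older", "Long": 0},
]


def fill_missing_with_zero(data):
    """Replace a missing Long label with 0, as the Cleaner in the post did."""
    return [{**r, "Long": 0 if r["Long"] is None else r["Long"]} for r in data]


def drop_missing(data):
    """Alternative: discard rows whose Long label is unknown."""
    return [r for r in data if r["Long"] is not None]


filled = fill_missing_with_zero(rows)
dropped = drop_missing(rows)

# After filling, the unknown row is indistinguishable from a real negative.
negatives_after_fill = sum(1 for r in filled if r["Long"] == 0)
negatives_after_drop = sum(1 for r in dropped if r["Long"] == 0)
print(negatives_after_fill, negatives_after_drop)
```

Filling with 0 turns the unknown row into an extra negative example, while dropping it keeps only the labels that were actually observed.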

Let’s Visualize the content of Evaluate Model.
TEAM_Evaluate
Pretty high AUC, accuracy, precision, recall and F1 score.  Let’s check what happens when we connect our dataset directly to SMOTE, bypassing the Clean Missing Data module.

TEAM_No_Cleaner_Model

Here is the output from SMOTE.
TEAM_SMOTE_output_withOUT_cleaner2.jpg
We can see that rows with empty values were also added to our dataset.  My original dataset had 4 records where Long had a value of 1, and I used 200% in SMOTE, so I was expecting only 8 additional rows (4 × 200%); however, 10 were added.  Let’s evaluate our model.
TEAM_No_Cleaner_Evaluate.jpg
A significant drop across the board: an AUC of 0.5 means our model is no better than a random guess.
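To see why an AUC of 0.5 amounts to guessing, here is a small sketch that computes AUC as the fraction of positive/negative pairs the model ranks correctly, counting ties as half. The labels and scores are made up for illustration and are not taken from the experiment above.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. Assumes at least one positive
    and one negative label.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 0]

# A model that gives every row the same score cannot separate the classes.
uninformative = auc(labels, [0.5, 0.5, 0.5, 0.5])

# A model that scores every positive above every negative.
perfect = auc(labels, [0.9, 0.8, 0.2, 0.1])

print(uninformative, perfect)
```

A model that cannot tell the classes apart lands at 0.5, and a perfect ranking reaches 1.0, which is why the drop to 0.5 here is so telling.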