SMOTE, or Synthetic Minority Oversampling Technique, is designed for dealing with class imbalances. Based on a few books and articles I've read on the subject, machine learning algorithms tend to perform better when the number of observations in each class is about the same. As I mentioned in one of my earlier posts, High AUC value in Azure Machine Learning Model, this technique is useful for dealing with a highly common problem where the number of observations of one class is much greater than the number of observations of the other.
While the article SMOTE: Synthetic Minority Over-sampling Technique from the Journal of Artificial Intelligence Research describes the technique as a combination of undersampling the majority class and oversampling the minority class, the SMOTE definition and the Blood Donation dataset example on MSDN's website illustrate that the majority class stays intact and only the minority class gets a boost.
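Azure ML's SMOTE module is a drag-and-drop component, so there is no script to show, but the core of the algorithm from the paper is easy to sketch. The helper below (a hypothetical name, not part of any library) mirrors the procedure: for each synthetic sample, pick a minority-class point, pick one of its k nearest minority-class neighbors, and interpolate a random fraction of the way between them. The majority class is never touched.

```python
import numpy as np

def smote_oversample(X_min, percentage=100, k=1, seed=0):
    """Generate synthetic minority samples, SMOTE-style.

    percentage=100 creates one synthetic sample per original
    minority point (doubling the class); 200 triples it, etc.
    k is the number of nearest minority neighbors to draw from.
    """
    rng = np.random.default_rng(seed)
    n_new = int(len(X_min) * percentage / 100)

    # Pairwise distances within the minority class only
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
    nn = np.argsort(d, axis=1)[:, :k]  # each point's k nearest neighbors

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))        # random minority point
        j = nn[i, rng.integers(k)]          # one of its k neighbors
        gap = rng.random()                  # interpolation factor in [0, 1]
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)
```

Because every synthetic point lies on a segment between two existing minority points, SMOTE fills in the minority region of feature space rather than simply duplicating observations, which is what plain random oversampling would do.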
In my earlier classification example of fraud detection, I already had a high AUC value, and using SMOTE did not move my AUC score a single percentage point. So while working on a new model to determine the quality of our panelists, I started with an initial model without SMOTE, which had an AUC of 0.883.
After adding SMOTE with a SMOTE percentage of 100 and a value of 1 for the number of nearest neighbors, I got a slight improvement in the AUC score.
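As a quick sanity check on what the SMOTE percentage parameter means: 100 generates one synthetic sample per existing minority observation, so the minority class doubles (the count below is made up for illustration):

```python
minority = 500          # hypothetical minority-class count
smote_percentage = 100  # Azure ML's "SMOTE percentage" parameter

# Number of synthetic samples generated
synthetic = minority * smote_percentage // 100
print(minority + synthetic)  # minority class doubles to 1000
```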
Note that the numbers of true positives, false negatives, and false positives jumped significantly.
Increasing the number-of-nearest-neighbors parameter to 2 decreased the AUC, although values of 3 and 4 brought it back up to a similar reading.
Considering that my dataset is highly imbalanced, 99% vs. 1%, and SMOTE only provided a 1% boost, let's examine what happens when we increase our minority set to roughly 50% of the total. With a SMOTE percentage of 7000 and a nearest-neighbors value of 1, the AUC increased even more. With a SMOTE percentage of 9000, my dataset distribution was about 53% to 47%, and my AUC score was 0.931.
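The percentage needed to approach an even split follows directly from the class ratio. Starting from 99:1, a SMOTE percentage of 9000 grows the minority class to 91 units for every 99 majority units, which lands close to the 53/47 split mentioned above (a sketch of the arithmetic; the counts are the ratio, not my actual row counts):

```python
majority, minority = 99, 1  # 99% vs 1% imbalance from the post
pct = 9000                  # SMOTE percentage

# Original minority plus synthetic samples (pct/100 per original point)
new_minority = minority * (1 + pct / 100)
total = majority + new_minority

# Resulting class shares of the resampled dataset, in percent
print(round(100 * majority / total), round(100 * new_minority / total))
```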
Note the higher Recall and F1 scores; the model appears to be more balanced. More about model evaluation in later posts.
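For readers working outside Azure ML Studio, the same evaluation metrics are one call away in scikit-learn. A minimal sketch (the labels and scores below are invented for illustration, not my panelist data):

```python
from sklearn.metrics import recall_score, f1_score, roc_auc_score

# Hypothetical ground-truth labels and model scores
y_true  = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.3, 0.2, 0.6, 0.8, 0.7, 0.4, 0.2, 0.9, 0.1]

# Threshold the scores at 0.5 to get hard class predictions
y_pred = [1 if s >= 0.5 else 0 for s in y_score]

print(round(recall_score(y_true, y_pred), 3))   # recall from hard labels
print(round(f1_score(y_true, y_pred), 3))       # F1 from hard labels
print(round(roc_auc_score(y_true, y_score), 3)) # AUC from raw scores
```

Note that AUC is computed from the raw scores, not the thresholded predictions; recall and F1, by contrast, depend on where you set the threshold.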