Azure ML Feature Engineering – Convert to Indicator Values

Feature engineering is probably one of my favorite aspects of data science.  This is the area where domain expertise and creativity can pay high dividends.  Essentially, feature engineering allows us to come up with our own features or columns to make our models better.  We can apply numerous tricks using the variety of tools provided by Azure ML Studio.  Here is a screenshot of the different manipulation modules:

manipulation

Convert to Indicator Values is a module that transforms the values in the rows of a column into separate columns with binary values.  For example, if we have a dataset with a single column A with 3 rows and values ‘b’, ‘c’, and ‘d’, applying Convert to Indicator Values produces a dataset with the original column A and 3 new columns b, c and d, with 1s and 0s indicating the appropriate value.  Transformation of categorical values into columns has long been available in most statistical software, for example Minitab.  In one of my M.B.A. courses, when studying regression, we called ‘b’, ‘c’ and ‘d’ “dummy variables”.
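
Azure ML Studio does this with a drag-and-drop module, but the same transformation can be sketched in plain Python with pandas (my analogy for illustration, not what the module runs internally); the column name A and the values below mirror the example above:

```python
import pandas as pd

# Toy dataset mirroring the example: a single column A with values 'b', 'c', 'd'
df = pd.DataFrame({"A": ["b", "c", "d"]})

# One indicator ("dummy") column per distinct value; cast booleans to 0/1
indicators = pd.get_dummies(df["A"]).astype(int)

# Keep the original column alongside the new indicator columns
result = pd.concat([df, indicators], axis=1)
print(result)
#    A  b  c  d
# 0  b  1  0  0
# 1  c  0  1  0
# 2  d  0  0  1
```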

 

When adding Convert to Indicator Values, make sure to use a Metadata Editor module to convert the field to categorical in order to avoid the following error: ‘Column with name “xxx” is not in an allowed category. (Error 0056)’.
convert_to_indicator_values_error

Make sure to check Overwrite categorical columns, otherwise the original column will stay in the dataset. It has been mentioned that keeping the original column may help a decision tree algorithm, whereas removing it may help a simple linear algorithm.
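
In the pandas sketch above (again just an analogy, not the module itself), the difference between checking and unchecking Overwrite categorical columns roughly corresponds to:

```python
import pandas as pd

df = pd.DataFrame({"A": ["b", "c", "d"]})

# "Overwrite categorical columns" checked: column A is replaced by its indicators
overwritten = pd.get_dummies(df, columns=["A"]).astype(int)  # columns A_b, A_c, A_d

# Unchecked: original column A stays next to the indicator columns
kept = pd.concat([df, pd.get_dummies(df["A"]).astype(int)], axis=1)
```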

Azure Machine Learning SMOTE – Part 1

SMOTE, or Synthetic Minority Oversampling Technique, is designed for dealing with class imbalances.  Based on a few books and articles that I’ve read on the subject, machine learning algorithms tend to perform better when the numbers of observations in both classes are about the same.  As I mentioned in one of my earlier posts, High AUC value in Azure Machine Learning Model, this technique is useful for dealing with a very common problem where the number of observations of one class is much greater than the number of observations of the other class.

According to SMOTE: Synthetic Minority Over-sampling Technique, the article from the Journal of Artificial Intelligence Research, the technique is a combination of undersampling the majority class and oversampling the minority class. However, the SMOTE definition and the Blood Donation dataset example on MSDN’s website illustrate that the majority class stays intact; only the minority class gets a boost.
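
Outside of Studio, the same idea can be sketched with the imbalanced-learn package (my stand-in for illustration; the Studio module is not built on it). Note that only the minority class is oversampled, matching the behavior described on MSDN:

```python
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

# Synthetic, highly imbalanced binary dataset (~99% vs ~1%)
X, y = make_classification(
    n_samples=10_000,
    n_features=10,
    weights=[0.99, 0.01],
    random_state=42,
)
print("before:", Counter(y))

# k_neighbors plays the role of the module's "number of nearest neighbors";
# sampling_strategy=1.0 oversamples the minority class until both classes match
smote = SMOTE(sampling_strategy=1.0, k_neighbors=1, random_state=42)
X_res, y_res = smote.fit_resample(X, y)
print("after:", Counter(y_res))
```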

In my earlier classification example of fraud detection I already had a high AUC value, and using SMOTE did not move my AUC score a single percentage point.  So while working on a new model to determine the quality of our panelists, my initial model without SMOTE had an AUC of 0.883.
before_SMOTE_1

After adding SMOTE with a SMOTE percentage of 100 and a value of 1 for the number of nearest neighbors, I got a slight improvement in the AUC score.
modle
after_SMOTE_1.jpg
Note that the number of true positives, false negatives and false positives jumped significantly.

Increasing the number of nearest neighbors parameter to 2 decreased the AUC, although 3 and 4 brought it back up to a similar reading.

Considering that my dataset is highly imbalanced, 99% vs. 1%, and that using SMOTE only provided a 1% boost, let’s examine what happens when we start to increase our minority set to roughly 50% of the total.  When changing the SMOTE percentage to 7000% and using 1 nearest neighbor, the AUC increased even more.  With a value of 9000% for the SMOTE percentage, my dataset distribution was about 53% to 47% and my AUC score was 0.931.
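
To see why a SMOTE percentage of 9000 lands near an even split, here is the back-of-the-envelope arithmetic, assuming for illustration 10,000 rows at a 99/1 split (my numbers, not the actual panelist dataset):

```python
# Hypothetical counts for illustration: 99% vs 1% of 10,000 rows
majority, minority = 9_900, 100

# "SMOTE percentage" adds that percentage of the minority class as synthetic rows:
# 9000% of 100 rows = 9,000 new minority rows
smote_percentage = 9_000
minority_after = minority + minority * smote_percentage // 100  # 9,100

total = majority + minority_after  # 19,000
print(f"majority: {majority / total:.1%}")        # 52.1%
print(f"minority: {minority_after / total:.1%}")  # 47.9%
```

This works out to roughly 52% vs. 48%, in line with the 53%/47% distribution observed above.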
SMOTE_Properties

after_SMOTE_2.jpg

Note the higher Recall and F1 scores; the model appears to be more balanced.  More about model evaluation in my later posts.

Azure ML Filter Based Feature Selection vs. Permutation Feature Importance

At first glance both Filter Based Feature Selection and Permutation Feature Importance seem to accomplish similar tasks in that both assign scores to variables so that we can determine which variables or features are important and which ones are not.  So let’s look at the definition of each from ML Studio:

Filter Based Feature Selection (FBFS) – Identifies the features in a dataset with the greatest predictive power.

Permutation Feature Importance (PFI) – Computes the permutation feature importance scores of feature variables given a trained model and a test dataset.

FBFS essentially helps pick the features with the most value for the model before data is sent for training, because non-informative features can negatively impact your model.  It has one input and two outputs. The input in my example is connected to my dataset, as shown below.  The left output can be consumed by another module or, as in my case, used for analysis while I traverse the different algorithms offered for the feature scoring method, which is the first parameter in FBFS.  The second parameter is the target column, which is the column I’m trying to predict.  The third parameter is the number of features or columns to be selected, starting with the highest score; the default is 1.  The right output provides scores for all columns, starting with the highest score.

1-20-2016 3-12-33 PM
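
For readers who prefer code, here is a rough scikit-learn analogue of FBFS (an approximation for illustration, not what the module runs internally): score every column against the target and keep the top k. mutual_info_classif and chi2 correspond loosely to the Mutual Information and Chi Squared scoring methods, and the breast-cancer dataset is just a public stand-in for my data:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif

# Public demo dataset standing in for the features and target column
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

# "Number of desired features" parameter: keep the 5 highest-scoring columns
selector = SelectKBest(score_func=mutual_info_classif, k=5)
selector.fit(X, y)

# Equivalent of the right output: a score for every column, highest first
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores.head(10))

# Equivalent of the left output: the dataset reduced to the selected columns
X_selected = X.loc[:, selector.get_support()]

# Chi Squared scoring requires non-negative features, which holds here
chi2_scores, _ = chi2(X, y)
```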

 

From PFI’s definition and the fact that the module in Azure ML has two inputs, it’s clear that it needs to be connected to a trained model and a test dataset, as illustrated below:

1-20-2016 2-59-35 PM

My dataset has two highly correlated columns, is_fraud and fraud_reason.  When is_fraud = False, fraud_reason = None; when is_fraud = True, fraud_reason can be different things, but not None. When I ran my model with fraud_reason left in, PFI picked up on that condition and selected fraud_reason as the most valuable feature in predicting results.  My model was getting an AUC score of 1, no mystery there.  However, FBFS indicated fraud_reason to be the most valuable feature only when using Mutual Information or Chi Squared as the feature scoring method.   More information about Pearson Correlation, Mutual Information, Kendall Correlation, Spearman Correlation, Chi Squared, Fisher Score and Count Based can be found online or in the following book by Roger Barga, Valentine Fontama and Wee Hyong Tok:  Predictive Analytics with Microsoft Azure Machine Learning, 2nd Edition.
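
The same diagnostic can be reproduced with scikit-learn’s permutation_importance (again an analogue, not the Studio module itself): train a model, then measure how much the chosen metric drops when each column is shuffled on a held-out test set. The dataset and model below are placeholders, not my fraud data:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Public demo dataset in place of the fraud data
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Shuffle each feature on the *test* set and record the drop in AUC;
# scoring corresponds to PFI's "metric for measuring performance" property
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=42
)
importances = pd.Series(result.importances_mean, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```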

When using PFI, make sure the appropriate metric for measuring performance is selected.  Otherwise you will get a Permutation Feature Importance error: Unsupported parameter type ‘Metric for measuring performance’ specified. (Error 0105).  This error is typical when using a regression algorithm while the default property of PFI is left unchanged, since it defaults to Classification – Accuracy.

1-23-2016 10-44-50 AM

Free Statistical Learning Course from Stanford

Here is a great statistical learning course using R, offered by Stanford for free and starting now!

A brief description of the course from the source site:

This is an introductory-level course in supervised learning, with a focus on regression and classification methods. The syllabus includes: linear and polynomial regression, logistic regression and linear discriminant analysis; cross-validation and the bootstrap, model selection and regularization methods (ridge and lasso); nonlinear models, splines and generalized additive models; tree-based methods, random forests and boosting; support-vector machines. Some unsupervised learning methods are discussed: principal components and clustering (k-means and hierarchical).

This is not a math-heavy class, so we try and describe the methods without heavy reliance on formulas and complex mathematics. We focus on what we consider to be the important elements of modern data analysis. Computing is done in R. There are lectures devoted to R, giving tutorials from the ground up, and progressing with more detailed sessions that implement the techniques in each chapter.

The lectures cover all the material in An Introduction to Statistical Learning, with Applications in R by James, Witten, Hastie and Tibshirani (Springer, 2013). The pdf for this book is available for free on the book website.

https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about