Azure ML Filter Based Feature Selection vs. Permutation Feature Importance

At first glance, both Filter Based Feature Selection and Permutation Feature Importance seem to accomplish similar tasks: both assign scores to variables so that we can determine which features are important and which are not. So let's look at the definition of each from ML Studio:

Filter Based Feature Selection (FBFS) – Identifies the features in a dataset with the greatest predictive power.

Permutation Feature Importance (PFI) – Computes the permutation feature importance scores of feature variables given a trained model and a test dataset.

FBFS essentially helps pick the features with the most value for the model before the data is sent for training, because non-informative features can negatively impact your model. The module has one input and two outputs. The input in my example is connected to my dataset, as shown below. The left output can be consumed by another module or, as in my case, used for analysis while I try the different algorithms offered for the feature scoring method, which is the first parameter in FBFS. The second parameter is the target column, which is the column I'm trying to predict. The third parameter is the number of features or columns to be selected, starting with the highest score; the default is 1. The right output provides scores for all columns, sorted from the highest score down.

[Screenshot: the Filter Based Feature Selection module connected in an ML Studio experiment]
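The Studio module itself is configured in the UI, but the same filter-based idea can be sketched in Python with scikit-learn (a rough analog, not the Studio implementation; the synthetic dataset is illustrative only):

```python
# A rough analog of filter-based feature selection, assuming scikit-learn.
# Like FBFS: a scoring method, a target column (y), and a count k of
# top-scoring features to keep.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Synthetic stand-in dataset: 6 features, only 2 informative.
X, y = make_classification(n_samples=200, n_features=6,
                           n_informative=2, random_state=0)

# "Feature scoring method" = Mutual Information;
# "Number of desired features" = 2 (Studio's default is 1).
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_selected = selector.fit_transform(X, y)

# Right-output analog: one score per column, highest = most predictive.
print(sorted(zip(selector.scores_, range(X.shape[1])), reverse=True))
print(X_selected.shape)  # the left-output analog: only the top-k columns
```

Swapping `mutual_info_classif` for a chi-squared scorer corresponds to changing the feature scoring method dropdown in the module.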

From PFI's definition, and the fact that the module in Azure ML has two inputs, it's clear that it needs to be connected to a trained model and a test dataset, as illustrated below:

[Screenshot: the Permutation Feature Importance module connected to a trained model and a test dataset]

My dataset has two highly correlated columns, is_fraud and fraud_reason. When is_fraud = False, fraud_reason = None; when is_fraud = True, fraud_reason can be different things, but never None. When I ran my model with fraud_reason left in, PFI picked up on that condition and selected fraud_reason as the most valuable feature in predicting results. My model was getting an AUC score of 1, so no mystery there. FBFS, however, indicated fraud_reason to be the most valuable feature only when using Mutual Information or Chi Squared as the feature scoring method. More information about Pearson Correlation, Mutual Information, Kendall Correlation, Spearman Correlation, Chi Squared, Fisher Score, and Count Based can be found online or in Predictive Analytics with Microsoft Azure Machine Learning, 2nd Edition, by Roger Barga, Valentine Fontama, and Wee Hyong Tok.
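Under the hood, PFI's score for a feature is essentially the drop in the model's metric when that feature's values are shuffled in the test set. A minimal sketch of the idea, assuming scikit-learn and a synthetic dataset rather than the fraud data above:

```python
# A minimal sketch of permutation feature importance: shuffle one feature
# at a time on the *test* set and measure how much the metric drops.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5,
                           n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
baseline = accuracy_score(y_test, model.predict(X_test))

rng = np.random.default_rng(0)
importances = []
for col in range(X_test.shape[1]):
    X_perm = X_test.copy()
    rng.shuffle(X_perm[:, col])              # break the feature/label link
    permuted = accuracy_score(y_test, model.predict(X_perm))
    importances.append(baseline - permuted)  # big drop = important feature

print(importances)
```

A leaking feature like fraud_reason would show by far the largest drop here, which is exactly why PFI flagged it immediately.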

When using PFI, make sure the appropriate metric for measuring performance is selected. Otherwise you will get: Permutation Feature Importance Error – Unsupported parameter type 'Metric for measuring performance' specified. (Error 0105). This error is typical when using a regression algorithm without changing the default property of PFI, since it defaults to Classification – Accuracy.

 

[Screenshot: the Error 0105 message from Permutation Feature Importance]
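The same pitfall exists outside Studio: the scoring metric must match the task type. With scikit-learn's permutation_importance, for example (an analog to the Studio module, not the module itself), a regression model needs a regression scorer:

```python
# Matching the metric to the task type, assuming scikit-learn.
from sklearn.datasets import make_regression
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LinearRegression

X, y = make_regression(n_samples=200, n_features=4, random_state=0)
model = LinearRegression().fit(X, y)

# scoring="r2" is valid for regression; a classification metric such as
# "accuracy" would fail on continuous targets, much like Error 0105
# when PFI is left on its Classification - Accuracy default.
result = permutation_importance(model, X, y, scoring="r2",
                                n_repeats=5, random_state=0)
print(result.importances_mean)  # one mean importance per feature
```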
