Pure Storage Performance


Here is an update on the performance of our all-flash array, now that it has been running for a few months in our production environment.  Since our systems are read-heavy, the benefits are obvious in the graphs below, which show a clear reduction in wait time since our upgrade in August.  The graphs illustrate total wait time by individual SQL Server data file, with days of wait on the Y axis and weeks on the X axis.  This data is taken from two different servers running on Pure Storage.

performance1

performance2

I want to share some more phenomenal results achieved through the upgrade to the new storage device.  The before-and-after views of some key metrics speak for themselves.  Below are Total I/O, Total Read I/O, Total Write I/O, O/S Disk Queue Length, Page Reads and Writes per Second, SQL Disk Read Latency, and SQL Write Latency.  Again, this data is from two of our servers.
performance3

performance4

Our database restores and backups completed in one-third the time, and for a number of our most common database queries, response times are as much as eight times faster. Routine database maintenance tasks are also completed much sooner, including one that has gone from three hours to 15 minutes.

On a separate note, I want to mention that the availability of snapshots has been a great bonus for us.  Thanks to this feature, we can now restore a copy of production data on the device essentially for free.  Our total data reduction ratio is 3.2 to 1.

One useful tip: if you get signed out of the UI management console, pressing Ctrl+Q signs you right back in with your saved credentials.

consol

Again, special thanks to Dennis Friedley of Pure Storage for his follow-ups and superb customer service.

Flash Storage Upgrade: My Experience With Pure Storage

We’ve recently completed a project to upgrade the storage that runs our database clusters of relational and OLAP data.  The experience has been delightful enough to be worthy of a blog post.

Our previous device, an EMC VNX-5300, which is a hybrid of flash and spinning disk, had been struggling with the load our systems generate.  SolarWinds and New Relic, along with Performance Monitor, have been good friends in helping us identify I/O as our primary bottleneck.

It’s been just a little over three years since our storage upgrade to the VNX-5300, but the playing field has changed significantly since then.  A few days into our research, we realized that for the same money we paid three years ago for our hybrid solution, we could now get more storage and ALL of it flash.  Software is the reason all-SSD arrays are now cheaper than legacy spinning-disk or hybrid systems: some bright minds figured out a way to compress and deduplicate data on the storage device, resulting in a significant reduction in the amount of data actually stored.  More about actual ratios and performance in my next post.

Picking up the latest copy of the Gartner report and running some Google searches helped us zero in on a few leaders in the storage space.  Given that we’d had a successful relationship with EMC in the past, it topped our list.  SolidFire, which has been acquired by NetApp, was our second contender; considering that both SolidFire and our CTO are out of Boulder, CO, we had a personal connection with this company.  Pure Storage got on our list for several reasons: the company has been cited as the first to sell all-flash arrays, dating back to 2009, and has captured significant market share, and customer satisfaction with its products and support appeared to be through the roof.  Given stringent time and budget constraints, we set off to select our next partner.  EMC’s proposal came in very interesting after a significant discount and promotion, but it lacked references and felt like a beta product.  NetApp’s product was the most expensive of the three and lacked flexibility in the amount of storage we could get: the smallest offering we could buy was far more storage than we needed, and gradual expansion of the array was not an option either.  It would have required a significant investment in a sizable chunk of additional storage, much of which would not have been necessary.

Pure was the fastest to provide customer references, in Colorado and Florida.  Within a week, we had a chance to go through a list of our questions with the people in charge of storage upgrades at those companies.  Needless to say, all three companies we surveyed were more than satisfied with their choice of storage vendor.  One customer told us they had done their own research over a six-month period, going through the same storage providers, and also concluded that Pure was the way to go.  Interestingly enough, all three customers we interviewed had switched from EMC to Pure, and some were now repeat customers of Pure, upgrading additional systems.

Once we decided to go with Pure and issued a PO, the device arrived at our office within ten days, and a Pure engineer had it installed in our production data center within three days.

Thanks to Dennis Friedley of Pure Storage and George Orr of Corus360, the execution of this transaction, from engagement to setup and configuration, has been more efficient and straightforward than any other.  As mentioned earlier, more on the performance of the array to follow.

Microsoft Advanced Analytics Boot Camp Review

This was a great event to attend for those of you looking to leverage Cortana Intelligence in your applications.

We went through a business case of collecting IoT data from HVAC systems at different facilities and built a logical data flow, from collecting the data to reporting on it with Power BI.  It was insightful to set up a Hadoop cluster on HDInsight backed by blob storage and use Hive to query the HVAC data from Visual Studio 2015.
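For anyone who would rather run the same kind of query outside of Visual Studio, here is a minimal sketch using the pyhive library from Python.  This is an assumption on my part, not what we used at the event: the host, credentials, the hvac table name, and its columns are all placeholders modeled on the sample HVAC data.

```python
from pyhive import hive  # pip install "pyhive[hive]"

# Placeholder connection details -- substitute your cluster's Hive endpoint.
conn = hive.Connection(host="my-hdinsight-headnode", port=10000,
                       username="admin", database="default")
cursor = conn.cursor()

# Hypothetical query against the sample HVAC table: how far each building
# runs from its target temperature on average.
cursor.execute("""
    SELECT buildingid, AVG(actualtemp - targettemp) AS avg_temp_delta
    FROM hvac
    GROUP BY buildingid
""")
for building_id, avg_delta in cursor.fetchall():
    print(building_id, avg_delta)
```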

Here are some interesting takeaways:

  • Don’t make the mistake of using HDInsight for storage; Blob or Data Lake is for that.  Its primary purpose is computation.
  • Since Spark has machine learning capabilities, why use R Server?  Well, few people want to program in Scala.
  • A good use case for NoSQL technologies is when you don’t know the structure of your data upfront or don’t care about it, but want to query it later.
  • The 10 GB dataset limit in Azure ML will soon increase to 300 GB.

It was great to connect with some fellow data enthusiasts like Sean Werick and Ginger Grant of Pragmatic Works as well as Ali Zaidi and Beverly Hanson of Microsoft.

Azure Machine Learning SMOTE – Part 3

One aspect to keep in mind when publishing a model that uses the SMOTE module is that SMOTE needs to be removed from the predictive experiment before it is published as a web service.  SMOTE is used to improve a model during training and is not intended for actual scoring.  Failure to remove SMOTE from the predictive model in my case produced the following error when using Excel 2013 or later to score one row of data: Error! {"error":{"code":"ModuleExecutionError","message":"Module execution encountered an error.","details":[{"code":"8","target":"SMOTE","message":"Error 0008: Parameter \"Number of nearest neighbors\" value should be in the range of [1, 0]."}]}}

error2

The error implies that since SMOTE’s ‘Number of nearest neighbors’ parameter is set to 1 and we are scoring only one record, there is not enough data for the module to synthetically generate more rows.

Here is what my predictive experiment looks like before and after removing SMOTE.  Essentially, we connect Apply Transformation directly to Project Columns.

before_removal.jpg

after_removal.jpg
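For reference, here is a rough sketch of the kind of single-row request the Excel add-in sends to the published web service.  It follows the standard Azure ML Studio request-response pattern, but the endpoint URL, API key, and column names below are placeholders rather than my actual experiment.

```python
import json
import urllib.request

# Placeholders -- copy the real values from the web service dashboard.
url = "https://ussouthcentral.services.azureml.net/workspaces/<workspace>/services/<service>/execute?api-version=2.0&details=true"
api_key = "<your-api-key>"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["Feature1", "Feature2"],  # hypothetical feature columns
            "Values": [["0.5", "1"]]                  # the single row being scored
        }
    },
    "GlobalParameters": {}
}

headers = {"Content-Type": "application/json", "Authorization": "Bearer " + api_key}
request = urllib.request.Request(url, str.encode(json.dumps(payload)), headers)
print(urllib.request.urlopen(request).read().decode("utf-8"))
```

With SMOTE still in the predictive experiment, a one-row request like this fails with the error above; after removing SMOTE, it scores normally.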


Azure Machine Learning SMOTE – Part 2

It may be beneficial for your model to use the Clean Missing Data module together with SMOTE.  Let’s consider the following example of stock data.

Vizualize_TEAM_Data

My dataset is missing values in the first row for the columns Long and Short.  These two fields have been defined by me, and their values depend on the next day’s data.  If today were February 4th, we wouldn’t have the next trading day’s data yet, hence the missing values.
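To make that dependence concrete, here is a small pandas sketch of how labels like Long and Short could be derived.  The prices are made up and this is only an illustration of why the most recent row ends up with missing values, not my actual feature code.

```python
import pandas as pd

# Made-up prices, most recent day first (as in the dataset screenshot).
# Each day's label depends on the NEXT trading day's close, so the most
# recent row (the first one) cannot be labeled yet.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-02-04", "2016-02-03", "2016-02-02", "2016-02-01"]),
    "Close": [34.80, 34.20, 34.55, 34.10],
})
next_close = df["Close"].shift(1)                        # the row above = next trading day
df["Long"] = (next_close > df["Close"]).astype(float)    # 1 if the next day closes higher
df["Short"] = (next_close < df["Close"]).astype(float)   # 1 if the next day closes lower
df.loc[next_close.isna(), ["Long", "Short"]] = None      # no next day yet -> missing
print(df)
```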
Here is what my model looks like with a Cleaner and SMOTE.

TEAM_Model

Here is the output of our data after the Cleaner.
TEAM_Cleaner_Outputl

Since my Cleaner replaced the empty values with 0s, and I already used 0s in my dataset to indicate negative outcomes, this is probably not an ideal practice, but for illustrative purposes it will be just fine.
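For comparison, here is a tiny sketch, with made-up values, of what the Cleaner’s substitution with 0 amounts to in pandas, alongside the stricter alternative of dropping the unlabeled row.

```python
import pandas as pd

# Hypothetical rows: the first one is the unlabeled, most recent day.
df = pd.DataFrame({"Close": [34.80, 34.20, 34.55],
                   "Long":  [None, 1.0, 0.0],
                   "Short": [None, 0.0, 1.0]})

cleaned = df.fillna({"Long": 0, "Short": 0})   # roughly what the Cleaner with value 0 produces
dropped = df.dropna(subset=["Long", "Short"])  # alternative: drop the incomplete row entirely
print(cleaned)
print(dropped)
```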

Let’s Visualize the content of Evaluate Model.
TEAM_Evaluate
Pretty high AUC, accuracy, precision, recall, and F1 score.  Let’s check what happens when we connect our dataset directly to SMOTE, bypassing the Clean Missing Data module.

TEAM_No_Cleaner_Model

Here is the output from SMOTE:
TEAM_SMOTE_output_withOUT_cleaner2.jpg
So we can see that rows with empty values were also added to our dataset.  Considering that I had 4 records where Long had a value of 1 in my original dataset and I used 200% in SMOTE, I was expecting only 8 additional rows; however, 10 were added.  Let’s evaluate our model.
TEAM_No_Cleaner_Evaluate.jpg
A significant drop across the board, with an AUC of 0.5, which means our model is no better than a random guess.


Azure ML Feature Engineering – Convert to Indicator Values

Feature engineering is probably one of my favorite aspects of data science.  This is the area where domain expertise and creativity can pay high dividends.  Essentially, feature engineering allows us to come up with our own features, or columns, to make our models better.  We can apply numerous tricks using the variety of tools provided by Azure ML Studio.  Here is a screenshot of the different manipulation modules:

manipulation

Convert to Indicator Values is a module that transforms the values in the rows of a column into separate columns with binary values.  For example, if we have a dataset with a single column A with 3 rows and the values ‘b’, ‘c’, and ‘d’, applying Convert to Indicator Values produces a dataset with the original column A and 3 new columns b, c, and d, with 1s and 0s indicating the appropriate value.  Transformation of categorical values into columns has been available in most statistical software, for example Minitab.  In one of my M.B.A. courses, when studying regression, we called ‘b’, ‘c’, and ‘d’ “dummy variables”.
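As a rough illustration outside of Azure ML Studio, here is how the same transformation looks with pandas’ get_dummies; the column name and values simply mirror the toy example above.

```python
import pandas as pd

# Toy dataset: one categorical column A with the values b, c, d.
df = pd.DataFrame({"A": ["b", "c", "d"]})

indicators = pd.get_dummies(df["A"])                   # new columns b, c, d with 1s and 0s
with_original = pd.concat([df, indicators], axis=1)    # original column A kept alongside
without_original = pd.get_dummies(df, columns=["A"])   # original column replaced, similar to
                                                       # checking Overwrite categorical columns
print(with_original)
print(without_original)
```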


When adding Convert to Indicator Values, make sure to use a Metadata Editor to convert the field to categorical in order to avoid the following error: ‘Column with name “xxx” is not in an allowed category. (Error 0056)’.
convert_to_indicator_values_error

Make sure to check Overwrite categorical columns; otherwise the original column will stay in the dataset.  It has been mentioned that keeping the original column may help a decision tree algorithm, whereas removing it may help a simple linear algorithm.

Azure Machine Learning SMOTE – Part 1

SMOTE, or Synthetic Minority Oversampling Technique, is designed for dealing with class imbalances.  Based on a few books and articles that I’ve read on the subject, machine learning algorithms tend to perform better when the number of observations in both classes is about the same.  As I mentioned in one of my earlier posts, High AUC value in Azure Machine Learning Model, this technique is useful for dealing with a highly common problem where the number of observations of one class is much greater than the number of observations of the other class.

As described in SMOTE: Synthetic Minority Over-sampling Technique, the article from the Journal of Artificial Intelligence Research, the technique is a combination of undersampling the majority class and oversampling the minority class.  However, the SMOTE definition and the Blood Donation dataset example on MSDN’s website illustrate that the majority class stays intact; only the minority class gets a boost.
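Outside of Azure ML Studio, the same idea can be sketched with the imbalanced-learn library.  This is not the Azure module itself, just a quick way to see that the majority class is left untouched while the minority class is synthetically boosted; the dataset is synthetic and stands in for a real one.

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic, roughly 99% / 1% dataset standing in for the real one.
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
print(Counter(y))   # e.g. Counter({0: ~4950, 1: ~50})

# Oversample the minority class using 1 nearest neighbor;
# the majority count stays the same, the minority is boosted to match it.
X_res, y_res = SMOTE(k_neighbors=1, random_state=0).fit_resample(X, y)
print(Counter(y_res))
```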

In my earlier classification example of fraud detection, I already had a high AUC value, and using SMOTE did not move my AUC score a single percentage point.  So while working on a new model to determine the quality of our panelists, my initial model without SMOTE had an AUC of .883.
before_SMOTE_1

After adding SMOTE with a SMOTE percentage of 100 and a value of 1 for the number of nearest neighbors, I got a slight improvement in the AUC score.
modle
after_SMOTE_1.jpg
Note that the number of true positives, false negatives, and false positives jumped significantly.

Increasing the number of nearest neighbors parameter to 2 decreased AUC, although values of 3 and 4 brought it back up to a similar reading.

Considering that my dataset is highly imbalanced, 99% vs 1%, and using SMOTE only provided a 1% boost, let’s examine what happens when we increase our minority class to roughly 50% of the set.  With a SMOTE percentage of 7000% and a nearest neighbors value of 1, AUC increased even more.  With a value of 9000% for the SMOTE percentage, my dataset distribution was about 53% to 47%, and my AUC score was 0.931.
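A quick back-of-the-envelope check of those numbers, assuming the module adds (SMOTE percentage / 100) × minority-count synthetic rows on top of the original minority class; the 9,900 vs 100 row counts below are illustrative, not my actual dataset.

```python
# Assumed behavior: SMOTE percentage p adds (p / 100) * minority_count synthetic rows.
def balance_after_smote(majority, minority, smote_pct):
    new_minority = minority * (1 + smote_pct / 100)
    total = majority + new_minority
    return round(new_minority / total, 3), round(majority / total, 3)

# A 99% / 1% split (say 9,900 vs 100 rows) at 9000% lands near 48% / 52%,
# in line with the roughly 53% / 47% distribution observed above.
print(balance_after_smote(9900, 100, 9000))   # (0.479, 0.521)
```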
SMOTE_Properties

after_SMOTE_2.jpg

Note the higher recall and F1 scores; the model appears to be more balanced.  More about model evaluation in my later posts.