Search This Blog

Saturday, August 28, 2010

Data Mining on a small amount of data

SQL Server 2005 | SQL Server 2008 | SQL Server 2008 R2

Data Mining is usually associated with finding previously unknown patterns in a large amount of data. But also a small amount of data may contain patterns, that are difficult to spot. In order to illustrate this, let’s look at the following situation. A customer with 80 stores wants to understand why some stores perform better than others. This is especially important before setting up new stores. It would be great to estimate the performance of the new store before building it up. This would make it much easier to decide on future store locations. Also, it might be useful for optimizing the performance of the existing stores. In order to find out about this, the customer collects some data per store as shown in the following table:

Criteria Description
Sales area Area for sales (in square meters)
Total working hours per week Total working hour for all staff members per week (avg over the last 3 months)
Total opening hours per week Total hours that the store is open per week
Location E.g. City center, City or Outskirts
Location type E.g. Mall, Plaza or separated
Store interior status Condition of the store, e.g. modern, average or old
Store age in months How old is this store
Parking facilities Is it easy to park near the store? E.g. values include good, average, bad
Average parking costs per hour What are the average parking costs per hour? Zero means free parking
Number of competitors within 10 minutes walk distance Number of competitors within 10 minutes walk distance
Number of competitors within 15 minutes driving distance Number of competitors within 15 minutes driving distance
Buying Power Buying power of the people how live near the store ranging from very low to very high.
Sales amount per week Average over the last three months

The data can be retrieved by querying the IT systems (for example HR and ERP), by using a survey (usually the staff in the store knows about the competitors and parking facilities around) or by using external market research data (for the Buying power). In my case, the data is just generated sample data:


After gathering all the data, things got more complicated as expected. How do you “score” each store? Which of the parameters are most relevant? Even if we’re only having 80 rows of data, it is not at all easy to see the dependencies.

For this example I’m using the Microsoft Office Data Mining Add-In so we can do all the data analysis using Excel. Since it’s Excel everything should be very easy. We want to use the Microsoft Decision Tree algorithm (Icon “Classify” from the ribbon bar):


The process is pretty easy. We have to decide which attribute we want to predict (Sales amount per week) and the wizard does all the rest. Now, here is the complete resulting decision tree:


No branches? What did we do wrong? Well, here are a few steps we should have taken, before simply invoking the decision tree. The most important thing is the proper preparation of the data:

1. Create relative measures, not absolute ones

The decision tree is capable of detecting rules like “if A then B” or “if A then not B” or even complicated combinations. However, dealing with continuous values is more difficult. The decision tree does not work quite well with rules like “if A is multiplied by 2, B is multiplied by 1.5” but tries to branch this as “if A is > 20 then B > 15”, “if A is >10, then B >7.5”. This might be especially true with input variables like our sales area or opening hours.

To quickly analyze the relationship we can use Excel’s Scatter chart type. Let’s start with the store size:


From this chart it seems reasonable to calculate sales by square meter instead of taking the absolute sales amount as there seems to be a more or less linear relationship between the sales and the store size.

Now let’s have a look at the influence of the opening hours. Again we’re using Excel’s Scatter chart:


As the trend line shows, the relationship seems to be a little bit logarithmic. However, let’s assume it’s also linear. Therefore we’re going to create an additional column in our spreadsheet computing sales by square meter and opening hours. This is the formula for our new column “Relative Sales”:

=[@[Sales Amount per week]]/[@[Sales Area]]/[@[Total opening hours per week]]

Of course, you would like to also check the influence of other variables, for example the age of the store:


This one looks pretty scattered, so we’re just taking the age of the store as an input variable.

But there is another relative measure we should create: The average number of sales persons in the store. We’re simply using this formula:

=[@[Total working hours per week]]/[@[Total opening hours per week]]

2. Make the input parameters discrete if possible

While it’s always a good idea to use discrete values for our input parameters, it almost becomes a must if you’re not having many rows of data. An example for a good discrete value is our location as it can only take these values: City center, City or Outskirts. The fewer the number of buckets, the better is it. If you’re not getting any results, try making the data more simple by choosing less buckets for your discrete values.

But look at our newly created columns for the relative sales or the average staff:


For our Mining purpose these are too many distinct values and although our decision tree will try to cluster them, we should do this in advance. Therefore we can either use Excel formulas or we could use the functionality of the data mining add-in: The “Explore Data” wizard:


After selecting the table and the column, the wizard analyzes the data and proposes some buckets as shown below:


For my purpose, I reduce the number of buckets to three (be brave!). By clicking the “Add new Column” button the resulting values are added as an additional column to our table:


I’m doing the same for the average staff members (4 buckets), the total opening hours (4 buckets) the store age in months (5 buckets) and the parking costs (4 buckets):


Our number of competitors is here 0-3 and 0-5, so we leave this data unchanged (not too many buckets).

So let’s try again with our prepared data set. Now, our decision tree looks like this:


As you can see there are only three influences identified by the mining algorithm here:


So, if parking is free, then the parking facilities are important while if parking is not free, the opening hours are important. This is a good start for looking for new store locations.

Another nice tool from Excel’s data mining add-in is the prediction calculator which can be found on the “Analyze” ribbon:


This one creates a ready to use input sheet in which you can enter the values of a potential new store and Excel immediately computes the likeliness for high sales:


This calculator is based on a different mining model (Logistic Regression). You can also see the impact of the input values on your sales:


In our case, smaller stores seem to perform better. The high value at the opening time below 41.5 hours may indicate that our computed column from above might not be well designed. And while there are quite a lot of stores with a sales area of less than 57 square meters there is only one single store which is already opened for 52 months: Our first store, which is definitely special (always equipped in the most modern style and only selected staff members are chosen to work there). Maybe you want to take this out of the data before doing the analysis.

So, after the mining you have to review your results properly. And of course you should also verify the other methods of making sure, your model is working fine (lift chart, case support as from my last post etc.).