Wednesday, December 8, 2010

Mining a Star Schema: Telco Churn Case Study (3 of 3)

Now that the "difficult" work is complete - preparing the data - we can move to building a predictive model to help identify and understand churn.

The case study suggests that separate models be built for different customer segments (high, medium, low, and very low value customer groups).  To reduce the data to a single segment, a filter can be applied:

create or replace view churn_data_high as
select * from churn_prep where value_band = 'HIGH';

It is simple to take a quick look at the predictive aspects of the data on a univariate basis.  While this does not capture the more complex multi-variate effects as would occur with the full-blown data mining algorithms, it can give a quick feel as to the predictive aspects of the data as well as validate the data preparation steps.  Oracle Data Mining includes a predictive analytics package which enables quick analysis.

select * from expl_churn_tab where rank <= 5 order by rank;
-------------------- ----------------- ----------------- ----------
LOS_BAND                                      .069167052          1
MINS_PER_TARIFF_MON  PEAK-5                   .034881648          2
REV_PER_MON          REV-5                    .034527798          3
DROPPED_CALLS                                 .028110322          4
MINS_PER_TARIFF_MON  PEAK-4                   .024698149          5

From the above results, it is clear that some predictors do contain information to help identify churn (explanatory value > 0).  The strongest uni-variate predictor of churn appears to be the customer's (binned) length of service.  The second strongest churn indicator appears to be the number of peak minutes used in the most recent month.  The subname column contains the interior piece of the DM_NESTED_NUMERICALS column described in the previous post.  By using the object relational approach, many related predictors are included within a single top-level column.

When building a predictive model, we need to ensure that the model is not over-trained.  One way to do this is to retain a held-aside dataset and test the model on unseen data.  Splitting the data into training and testing datasets is straightforward by using SQL predicates and a hash function:

create or replace view churn_test_high as
select * from churn_data_high where ora_hash(customer_id,99,0) >= 60;
create or replace view churn_train_high as
select * from churn_data_high where ora_hash(customer_id,99,0) < 60;

The above statements will separate the data into a 40% random sample for testing and the remaining 60% for training the model.  We can now pass the training data into an Oracle Data Mining routine to create a mining model.  In this example, we will use the GLM algorithm with automatic data preparation.

create table churn_set (setting_name varchar2(30), setting_value varchar2(4000));
insert into churn_set values ('ALGO_NAME','ALGO_GENERALIZED_LINEAR_MODEL');
insert into churn_set values ('PREP_AUTO','ON');

Now that we have built a model - CHURN_MOD_HIGH - we can test that model against the held-aside data to see how it performs.

select actual, predicted, count(*) cnt from (
select churn_m6 actual, prediction(churn_mod_high using *) predicted
from churn_test_high)
group by actual, predicted;

The above query will show the number of correct and incorrect predictions for all combinations, often referred to as a confusion matrix.

Thus, without having to extract data or jump through hoops to massage star schema data into a flattened form, Oracle Data Mining is able to extract insight directly from the rich database data.

1 comment:

  1. Added another link to a version of this script (to the first post) that works in 10g. Some major features that were introduced in 11g include the Generalized Linear Model algorithm and automatic data preparation. The alternative script relies on older functionality, but is still capable of mining the telco example star schema.