Wednesday, December 8, 2010

Mining a Star Schema: Telco Churn Case Study (3 of 3)

Now that the "difficult" work is complete - preparing the data - we can move to building a predictive model to help identify and understand churn.

The case study suggests that separate models be built for different customer segments (high, medium, low, and very low value customer groups).  To reduce the data to a single segment, a filter can be applied:

create or replace view churn_data_high as
select * from churn_prep where value_band = 'HIGH';
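
Before filtering down to one segment, it can be worth checking how many customers land in each band. A quick sketch, assuming the CHURN_PREP view from the previous post and its VALUE_BAND column:

```sql
-- number of customers in each value band of the prepared data
select value_band, count(*) cnt
from churn_prep
group by value_band
order by cnt desc;
```

A band with very few rows leaves little data for both training and testing once the segment is split.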

It is simple to take a quick look at the predictive aspects of the data on a univariate basis.  While this does not capture the more complex multivariate effects that the full-blown data mining algorithms pick up, it gives a quick feel for the predictive value of each attribute and helps validate the data preparation steps.  Oracle Data Mining includes a predictive analytics package whose EXPLAIN routine enables this kind of quick analysis; the query below lists its five top-ranked attributes.
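
The call that fills the result table queried below is not reproduced in this post. A minimal sketch of what it could look like, assuming the EXPLAIN routine of DBMS_PREDICTIVE_ANALYTICS is pointed at the high-value view, with CHURN_M6 (the target used later in this post) as the column to explain and EXPL_CHURN_TAB as the result table name:

```sql
-- Assumption: run once; EXPLAIN creates the result table itself,
-- so EXPL_CHURN_TAB must not already exist.
begin
  dbms_predictive_analytics.explain(
    data_table_name     => 'churn_data_high',
    explain_column_name => 'churn_m6',
    result_table_name   => 'expl_churn_tab');
end;
/
```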

select * from expl_churn_tab where rank <= 5 order by rank;
ATTRIBUTE_NAME       ATTRIBUTE_SUBNAME EXPLANATORY_VALUE       RANK
-------------------- ----------------- ----------------- ----------
LOS_BAND                                      .069167052          1
MINS_PER_TARIFF_MON  PEAK-5                   .034881648          2
REV_PER_MON          REV-5                    .034527798          3
DROPPED_CALLS                                 .028110322          4
MINS_PER_TARIFF_MON  PEAK-4                   .024698149          5

From the above results, it is clear that some predictors do contain information to help identify churn (explanatory value > 0).  The strongest univariate predictor of churn appears to be the customer's (binned) length of service.  The second strongest churn indicator appears to be the number of peak minutes used in the most recent month.  The subname column contains the interior piece of the DM_NESTED_NUMERICALS column described in the previous post.  By using the object relational approach, many related predictors are included within a single top-level column.

When building a predictive model, we need to ensure that the model is not over-trained.  One way to do this is to retain a held-aside dataset and test the model on unseen data.  Splitting the data into training and testing datasets is straightforward by using SQL predicates and a hash function:

create or replace view churn_test_high as
select * from churn_data_high where ora_hash(customer_id,99,0) >= 60;
create or replace view churn_train_high as
select * from churn_data_high where ora_hash(customer_id,99,0) < 60;

The above statements will separate the data into a 40% random sample for testing and the remaining 60% for training the model.  We can now pass the training data into an Oracle Data Mining routine to create a mining model.  In this example, we will use the GLM algorithm with automatic data preparation.

create table churn_set (setting_name varchar2(30), setting_value varchar2(4000));
insert into churn_set values ('ALGO_NAME','ALGO_GENERALIZED_LINEAR_MODEL');
insert into churn_set values ('PREP_AUTO','ON');
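
The build call itself does not appear above. A sketch of what it could look like with DBMS_DATA_MINING.CREATE_MODEL, assuming CUSTOMER_ID is the case identifier and CHURN_M6 is the target (both taken from the queries in this post), and using the model name CHURN_MOD_HIGH that the scoring query refers to:

```sql
-- Assumption: classification model trained on the 60% training view,
-- reading its algorithm and data preparation settings from CHURN_SET.
begin
  dbms_data_mining.create_model(
    model_name          => 'CHURN_MOD_HIGH',
    mining_function     => dbms_data_mining.classification,
    data_table_name     => 'CHURN_TRAIN_HIGH',
    case_id_column_name => 'CUSTOMER_ID',
    target_column_name  => 'CHURN_M6',
    settings_table_name => 'CHURN_SET');
end;
/
```

If the build needs to be repeated, the existing model has to be dropped first, for example with dbms_data_mining.drop_model('CHURN_MOD_HIGH').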

Now that we have built a model - CHURN_MOD_HIGH - we can test that model against the held-aside data to see how it performs.

select actual, predicted, count(*) cnt from (
select churn_m6 actual, prediction(churn_mod_high using *) predicted
from churn_test_high)
group by actual, predicted;

The above query will show the number of correct and incorrect predictions for all combinations, often referred to as a confusion matrix.
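
The same subquery can be collapsed into a single summary number. A small sketch, assuming the CHURN_MOD_HIGH model and the CHURN_TEST_HIGH view from above, that reports overall accuracy on the held-aside data:

```sql
-- fraction of held-aside customers whose predicted class matches the actual one
select round(sum(case when actual = predicted then 1 else 0 end) / count(*), 4) accuracy
from (select churn_m6 actual, prediction(churn_mod_high using *) predicted
      from churn_test_high);
```

When churners are a small share of the segment, overall accuracy can look good even if few churners are caught, so the full confusion matrix remains the more informative view.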

Thus, without having to extract data or jump through hoops to massage star schema data into a flattened form, Oracle Data Mining is able to extract insight directly from the rich database data.

Mining a Star Schema: Telco Churn Case Study (2 of 3)

This post will follow the transformation steps as described in the case study, but will use Oracle SQL as the means for preparing data.  Please see the previous post for background material, including links to the case study and to scripts that can be used to replicate the stages in these posts.

1) Handling missing values for call data records
The CDR_T table records the number of phone minutes used by a customer per month and per call type (tariff).  For example, the table may contain one record corresponding to the number of peak (call type) minutes in January for a specific customer, and another record associated with international calls in March for the same customer.  This table is likely to be fairly dense (most type-month combinations for a given customer will be present) due to the coarse level of aggregation, but there may be some missing values.  Missing entries may occur for a number of reasons: the customer made no calls of a particular type in a particular month, the customer switched providers during the timeframe, or perhaps there is a data entry problem.  In the first situation, the correct interpretation of a missing entry would be to assume that the number of minutes for the type-month combination is zero.  In the other situations, it is not appropriate to assume zero, but rather derive some representative value to replace the missing entries.  The referenced case study takes the latter approach.  The data is segmented by customer and call type, and within a given customer-call type combination, an average number of minutes is computed and used as a replacement value.
In SQL, we need to generate additional rows for the missing entries and populate those rows with appropriate values.  To generate the missing rows, Oracle's partition outer join feature is a perfect fit.
select cust_id, cdre.tariff, cdre.month, mins
from cdr_t cdr partition by (cust_id) right outer join
     (select distinct tariff, month from cdr_t) cdre
     on (cdr.month = cdre.month and cdr.tariff = cdre.tariff);

I have chosen to use a distinct on the CDR_T table to generate the set of values, but a more rigorous and performant (but less compact) approach would be to explicitly list the tariff-month combinations in the cdre inlined subquery rather than go directly against the CDR_T table itself.

Now that the missing rows are generated, we need to replace the missing value entries with representative values as computed on a per-customer-call type basis.  Oracle's analytic functions are a great match for this step.
select cust_id, tariff, month,
  nvl(mins, round(avg(mins) over (partition by cust_id, tariff))) mins
from (<prev query>);

We can use the avg function, and specify the partition by feature of the over clause to generate an average within each customer-call type group.  The nvl function will replace the missing values with the tailored, computed averages.
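
To see both steps at work, here is a small self-contained sketch on made-up data (the CDR_DEMO rows are hypothetical and stand in for CDR_T). Customer 2 has no PEAK record for month 2, so the partition outer join generates that row and the analytic average fills it in:

```sql
with cdr_demo as (
  -- hypothetical sample rows: cust_id, tariff, month, mins
  select 1 cust_id, 'PEAK' tariff, 1 month, 100 mins from dual union all
  select 1, 'PEAK', 2, 120 from dual union all
  select 2, 'PEAK', 1, 50 from dual
)
select cust_id, tariff, month,
       mins raw_mins,   -- null where the row was generated by the outer join
       nvl(mins, round(avg(mins) over (partition by cust_id, tariff))) mins
from (select cust_id, cdre.tariff, cdre.month, mins
      from cdr_demo cdr partition by (cust_id) right outer join
           (select distinct tariff, month from cdr_demo) cdre
           on (cdr.month = cdre.month and cdr.tariff = cdre.tariff))
order by cust_id, month;
```

The row for customer 2, month 2 comes back with a null RAW_MINS and a MINS value of 50, the average of that customer's only PEAK record.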

2) Transposing call data records
The next transformation step in the case study involves transposing the data in CDR_T from a multiple row per customer format to a single row per customer by generating new columns for all of the tariff-month combinations.  While this is feasible with a small set of combinations, it will be problematic when addressing items with higher cardinality.  Oracle Data Mining does not need to transpose the data.  Instead, the data is combined using Oracle's object-relational technology so that it can remain in its natural, multi-row format.  Oracle Data Mining has introduced two datatypes to capture such data - DM_NESTED_NUMERICALS and DM_NESTED_CATEGORICALS.
In addition, the case study suggests adding an attribute which contains the total number of minutes per call type for a customer (summed across all months).  Oracle's rollup syntax is useful for generating aggregates at different levels of granularity.

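select cust_id,
       -- one (name, value) pair per tariff-month; 'ALL' is an assumed label for the rollup total
       cast(collect(dm_nested_numerical(tariff||'-'||nvl(to_char(month),'ALL'), mins))
       as dm_nested_numericals) mins_per_tariff_mon from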
 (select cust_id, tariff, month, sum(mins) mins
  from (<prev query>)
  group by cust_id, tariff, rollup(month))
 group by cust_id;

The above query will first aggregate the minutes by cust_id-tariff-month combination, but it will also rollup the month column to produce a total for each cust_id-tariff combination.  While the data in the case study was already aggregated at the month level, the above query would also work on data that is at a finer granularity.
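
The effect of rolling up only the month column can be seen on a couple of made-up rows. In this sketch, CDR_DEMO is hypothetical sample data, and the GROUPING function is added to flag the total row:

```sql
with cdr_demo as (
  -- hypothetical sample rows: cust_id, tariff, month, mins
  select 1 cust_id, 'PEAK' tariff, 1 month, 100 mins from dual union all
  select 1, 'PEAK', 2, 120 from dual
)
select cust_id, tariff, month, sum(mins) mins,
       grouping(month) is_total   -- 1 on the row that sums across all months
from cdr_demo
group by cust_id, tariff, rollup(month)
order by cust_id, tariff, month;
```

This returns the two monthly rows (100 and 120) followed by one row with a null month and a total of 220.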

Once the data is generated by the inner query, there is an outer group by on cust_id with the COLLECT operation.  The purpose of this step is to generate an output of one row per customer, but each row contains an entry of type DM_NESTED_NUMERICALS.  This entry is a collection of pairs that capture the number of minutes per tariff-month combination.
While we performed missing value replacement in the previous transformation step, thereby densifying the data, Oracle Data Mining has a natural treatment for missing rows.  When data is presented as a DM_NESTED_NUMERICALS column, it is assumed that any missing entries correspond to a zero in the value - matching the first option for missing value treatment described earlier.  If this is the correct interpretation for missing values, then no missing value treatment step is necessary.  The data can remain in sparse form, yet the algorithms will correctly interpret the missing entries as having an implicit value of zero.
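
If zero is the right reading of a missing entry, the densification step can be skipped and the nested column built straight from CDR_T. A minimal sketch under that assumption, reusing the tariff-month naming seen in the results (for example PEAK-5) and assuming CDR_T holds at most one row per customer, tariff, and month:

```sql
-- sparse form: only the tariff-month combinations that actually exist are collected;
-- absent combinations are treated by the algorithms as zero
select cust_id,
       cast(collect(dm_nested_numerical(tariff||'-'||month, mins))
            as dm_nested_numericals) mins_per_tariff_mon
from cdr_t
group by cust_id;
```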

3) Transposing revenue records
Again, no need to transpose when using Oracle Data Mining.  We add an aggregate to produce the total revenue per customer in addition to the per-month breakout coming from the COLLECT.

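select cust_id, sum(revenue) rev_tot_sum,
       -- one (name, value) pair per month; assumes REVENUES has a MONTH column
       cast(collect(dm_nested_numerical('REV-'||month, revenue))
       as dm_nested_numericals) rev_per_mon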
from revenues
group by cust_id;

4) Creating derived attributes
The final transformation step in the case study is to generate some additional derived attributes and connect everything together, so that each customer is represented by a single entity that includes all of the attributes identified to this point.
A view which comprises all of the above data preparation steps as well as the final pieces is as follows: 

create or replace view churn_prep as
with
-- q322: generate missing tariff-month rows per customer, then fill them
-- with the customer-tariff average
q322 as
(select cust_id, tariff, month,
  nvl(mins, round(avg(mins) over (partition by cust_id, tariff))) mins
 from (
  select cust_id, cdre.tariff, cdre.month, mins
  from
   cdr_t cdr partition by (cust_id)
    right outer join
   (select distinct tariff, month from cdr_t) cdre
   on (cdr.month = cdre.month and cdr.tariff = cdre.tariff))),
-- q323: minutes per tariff-month, plus a per-tariff total, as one nested column
q323 as
(select cust_id,
       -- 'ALL' is an assumed label for the rollup total row
       cast(collect(dm_nested_numerical(tariff||'-'||nvl(to_char(month),'ALL'), mins))
       as dm_nested_numericals) mins_per_tariff_mon from
 (select cust_id, tariff, month, sum(mins) mins
  from q322
  group by cust_id, tariff, rollup(month))
 group by cust_id),
-- q324: total revenue, plus revenue per month as one nested column
q324 as
(select cust_id, sum(revenue) rev_tot_sum,
       -- assumes REVENUES has a MONTH column
       cast(collect(dm_nested_numerical('REV-'||month, revenue))
       as dm_nested_numericals) rev_per_mon
 from revenues
 group by cust_id)
select
 customer_id, age, gender, handset, tariff_type, tariff_plan, dropped_calls,
 churn_m6, all_dv51, all_dv52, all_dv53, all_dv54, all_ang51,
 rev_per_mon, mins_per_tariff_mon,
 case when l_o_s < 24 then 'SHORT' when l_o_s > 84 then 'LONG' else 'MED' end los_band,
 case when rev_tot_sum <= 100 then 'VERY LOW' when rev_tot_sum < 130 then 'LOW'
      when rev_tot_sum < 220 then 'MEDIUM' else 'HIGH' end value_band
from
 customers c,
 services s,
 q323,
 q324,
 -- vm: differences in total monthly minutes between selected months
 (select cust_id, "5_MINS"-"1_MINS" ALL_DV51, "5_MINS"-"2_MINS" ALL_DV52,
         "5_MINS"-"3_MINS" ALL_DV53, "4_MINS"-"2_MINS" ALL_DV54,
         ("5_MINS"-"1_MINS")/4 ALL_ANG51 from
  (select *
   from (select cust_id, month, mins from q322)
   pivot (sum(mins) as mins
   for month in (1,2,3,4,5)))) vm
where customer_id = vm.cust_id(+)
  and customer_id = s.cust_id
  and customer_id = q324.cust_id
  and customer_id = q323.cust_id(+)
  and s.tariff_plan in ('CAT','PLAY');
The PIVOT operation is used to generate named columns that can be easily combined with arithmetic operations.  Binning and filtering steps, as identified in the case study, are included in the above SQL.
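
The column naming that PIVOT produces is easier to see on a tiny example. In this sketch, MINS_DEMO is hypothetical sample data. Each value in the IN list combined with the aggregate alias yields a column such as "1_MINS" or "5_MINS", which can then be used in arithmetic:

```sql
with mins_demo as (
  -- hypothetical sample rows: cust_id, month, mins
  select 1 cust_id, 1 month, 100 mins from dual union all
  select 1, 5, 160 from dual union all
  select 2, 1, 80 from dual union all
  select 2, 5, 60 from dual
)
select cust_id,
       "5_MINS" - "1_MINS" all_dv51,         -- change from month 1 to month 5
       ("5_MINS" - "1_MINS") / 4 all_ang51   -- average change per month
from (select * from mins_demo
      pivot (sum(mins) as mins for month in (1, 2, 3, 4, 5)))
order by cust_id;
```

Customer 1 comes back with 60 and 15, customer 2 with -20 and -5. Months with no rows (2 through 4 here) simply pivot to null columns.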

The query can execute in parallel on SMPs, as well as MPPs using Oracle's RAC technology.  The data can be directly fed to Oracle Data Mining without having to extract it from the database, materialize copies of any parts of the underlying tables, or pivot data that is in a naturally multi-row format.

The final post in this series will show how to mine the prepared data using Oracle Data Mining.


Mining a Star Schema: Telco Churn Case Study (1 of 3)

One of the strengths of Oracle Data Mining is the ability to mine star schemas with minimal effort.  Star schemas are commonly used in relational databases, and they often contain rich data with interesting patterns.  While dimension tables may contain interesting demographics, fact tables will often contain user behavior, such as phone usage or purchase patterns.  Both of these aspects - demographics and usage patterns - can provide insight into behavior.

Churn is a critical problem in the telecommunications industry, and companies go to great lengths to reduce the churn of their customer base.  One case study [1] describes a telecommunications scenario that involves understanding and identifying churn, where the underlying data is present in a star schema.  That case study is a good example for demonstrating just how natural it is for Oracle Data Mining to analyze a star schema, so it will be used as the basis for this series of posts.

The case study schema includes four tables: CUSTOMERS, SERVICES, REVENUES, and CDR_T.  The CUSTOMERS table contains one row per customer, as does the SERVICES table, and both contain a customer id that can be used to join the tables together.  Most data mining tools are capable of handling this type of data, where one row of input corresponds to one case for mining.

The other two tables have multiple rows for each customer.  The CDR_T (call data records) table contains multiple records for each customer that capture calling behavior.  In the case study, this information is already pre-aggregated by type of call (peak, international, etc.) per month, but the information may also be available at a finer level of granularity.  The REVENUES table contains the revenue per customer on a monthly basis for a five-month history, so there are up to five rows per customer.  Capturing the information in the CDR_T and REVENUES tables to help predict churn for a single customer requires collapsing all of this fact table information into a single "case" per customer.  Most tools will require pivoting the data into columns, which has the drawbacks of densifying data as well as pivoting data beyond column count limitations.  The data in a fact table is often stored in sparse form (this case study aggregates it to a denser form, but it need not be this way for other mining activities), and keeping it in sparse form is highly desirable.

For fact table data that has a much larger number of interesting groups (such as per-product sales information of a large retailer), retaining the sparse format becomes critical to avoid densification of such high cardinality information.  Oracle Data Mining algorithms are designed to interpret missing entries in a sparse fact table appropriately, enabling increased performance and simpler transformation processing.

Some steps in the referenced case study are not completely defined (in my opinion), and in those situations I will take my best guess as to the intended objective.  This approximation is sufficient since the intent of this series of posts is to show the power and flexibility of Oracle Data Mining on a real-world scenario rather than to match the case study letter-for-letter.

The following files support reproduction of the results in this series of posts:
telcoddl.sql - SQL which creates the four tables
telcoloadproc.plb - Obfuscated SQL which creates the procedure that can generate data and populate the tables - all data is generated, and patterns are injected to make it interesting and "real-world" like
telcoprep.sql - A SQL create view statement corresponding to the data preparation steps from part 2 of this series
telcomodel.sql - A SQL script corresponding to the steps from part 3 of this series

In order to prepare a schema that can run the above SQL, a user must be created with the following privileges: create table, create view, create mining model, and create procedure (for telcoloadproc), as well as any other privileges needed for the database user (e.g., create session).  Once the schema is prepared, telcoddl.sql and telcoloadproc.plb can be run to create the empty tables and the procedure for loading data.  The procedure that is created is named telco_load, and it takes one optional argument - the number of customers (default 10000).  The results from parts 2 and 3 of this series correspond to loading 10,000 customers.
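
As a rough sketch of that setup from SQL*Plus on 11gR2: the user name, password, and tablespace below are hypothetical placeholders, while the privileges, script names, and the telco_load procedure are the ones listed above.

```sql
-- run as a DBA; TELCO, the password, and the USERS tablespace are placeholders
create user telco identified by "change_me";
grant create session, create table, create view,
      create mining model, create procedure to telco;
alter user telco quota unlimited on users;

-- then connect as the new user and run the supplied scripts
@telcoddl.sql
@telcoloadproc.plb

-- populate the tables; 10000 customers is also the default
exec telco_load(10000)
```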

The sample code in these posts has been tested against an 11gR2 database.  Many new features have been added in each release, so some of the referenced routines and syntax are not available in older releases; however, similar functionality can be achieved with 10g.  The following modified scripts can be used with 10g (tested with 10gR2):
telcoprep_10g.sql - A SQL create view statement corresponding to the data preparation steps from part 2 of this series, including substitution for the 11g PIVOT syntax and inclusion of manual data preparation for nested columns.
telcomodel_10g.sql - A SQL script corresponding to the steps from part 3 of this series, including substitution of the Generalized Linear Model algorithm for 10g Support Vector Machine, manual data preparation leveraging the transformation package, use of dbms_data_mining.apply instead of 10gR2 built-in data mining scoring functions, explicit commit of settings prior to build, and removal of the EXPLAIN routine from the script flow.
In addition, the create mining model privilege is not available in 10g.

The next part in this series will demonstrate how data in a star schema can be prepared for Oracle Data Mining.  The final part in this series will mine the data by building and testing a model.

[1] Richeldi, Marco and Perrucci, Alessandro, Churn Analysis Case Study, December 17, 2002.