Wednesday, December 8, 2010

Mining a Star Schema: Telco Churn Case Study (2 of 3)

This post will follow the transformation steps as described in the case study, but will use Oracle SQL as the means for preparing data.  Please see the previous post for background material, including links to the case study and to scripts that can be used to replicate the stages in these posts.

1) Handling missing values for call data records
The CDR_T table records the number of phone minutes used by a customer per month and per call type (tariff).  For example, the table may contain one record corresponding to the number of peak (call type) minutes in January for a specific customer, and another record associated with international calls in March for the same customer.  This table is likely to be fairly dense (most type-month combinations for a given customer will be present) due to the coarse level of aggregation, but there may be some missing values.  Missing entries may occur for a number of reasons: the customer made no calls of a particular type in a particular month, the customer switched providers during the timeframe, or perhaps there is a data entry problem.  In the first situation, the correct interpretation of a missing entry is that the number of minutes for the type-month combination is zero.  In the other situations, it is not appropriate to assume zero, but rather to derive some representative value to replace the missing entries.  The referenced case study takes the latter approach.  The data is segmented by customer and call type, and within each customer-call type combination, an average number of minutes is computed and used as a replacement value.
In SQL, we need to generate additional rows for the missing entries and populate those rows with appropriate values.  To generate the missing rows, Oracle's partition outer join feature is a perfect fit.
 
select cust_id, cdre.tariff, cdre.month, mins
from cdr_t cdr partition by (cust_id) right outer join
     (select distinct tariff, month from cdr_t) cdre
     on (cdr.month = cdre.month and cdr.tariff = cdre.tariff);

I have chosen to use a distinct on the CDR_T table to generate the set of values, but a more rigorous and performant (though less compact) approach would be to explicitly list the tariff-month combinations in the cdre inline view rather than go against the CDR_T table itself.

Now that the missing rows are generated, we need to replace the missing value entries with representative values as computed on a per-customer-call type basis.  Oracle's analytic functions are a great match for this step.
 
select cust_id, tariff, month,
  nvl(mins, round(avg(mins) over (partition by cust_id, tariff))) mins
from (<prev query>);

We can use the avg function, and specify the partition by feature of the over clause to generate an average within each customer-call type group.  The nvl function will replace the missing values with the tailored, computed averages.
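To illustrate the logic outside the database, here is a small Python sketch of the same two steps; the sample rows and values are made up for illustration and are not from the case study:

```python
from statistics import mean

# Hypothetical CDR_T rows: (cust_id, tariff, month, mins).
# Customer 101 has no row for ('PEAK', 2); customer 102 has rows
# only for ('PEAK', 2).
cdr = [
    (101, 'PEAK', 1, 120),
    (101, 'PEAK', 3, 90),
    (102, 'PEAK', 2, 200),
]

# Step 1 (the partition outer join): for each customer, generate every
# tariff-month combination seen anywhere in the data, with mins = None
# where no source row exists.
combos = sorted({(t, m) for _, t, m, _ in cdr})
custs = sorted({c for c, _, _, _ in cdr})
by_key = {(c, t, m): v for c, t, m, v in cdr}
dense = [(c, t, m, by_key.get((c, t, m))) for c in custs for t, m in combos]

# Step 2 (avg(mins) over (partition by cust_id, tariff)): replace each
# missing value with the rounded average of the values present in the
# same customer-tariff group.
def impute(rows):
    out = []
    for c, t, m, v in rows:
        if v is None:
            grp = [x for cc, tt, _, x in rows
                   if cc == c and tt == t and x is not None]
            v = round(mean(grp)) if grp else None
        out.append((c, t, m, v))
    return out

filled = impute(dense)
# Customer 101's missing ('PEAK', 2) entry becomes round((120+90)/2) = 105.
```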

2) Transposing call data records
The next transformation step in the case study involves transposing the data in CDR_T from a multiple row per customer format to a single row per customer by generating new columns for all of the tariff-month combinations.  While this is feasible with a small set of combinations, it will be problematic when addressing items with higher cardinality.  Oracle Data Mining does not need to transpose the data.  Instead, the data is combined using Oracle's object-relational technology so that it can remain in its natural, multi-row format.  Oracle Data Mining has introduced two datatypes to capture such data - DM_NESTED_NUMERICALS and DM_NESTED_CATEGORICALS.
In addition, the case study suggests adding an attribute which contains the total number of minutes per call type for a customer (summed across all months).  Oracle's rollup syntax is useful for generating aggregates at different levels of granularity.

select cust_id,
  cast(collect(dm_nested_numerical(tariff||'-'||nvl(month,'ALL'),mins))
       as dm_nested_numericals) mins_per_tariff_mon from
 (select cust_id, tariff, month, sum(mins) mins
  from (<prev query>)
  group by cust_id, tariff, rollup(month))
 group by cust_id;


The above query will first aggregate the minutes by cust_id-tariff-month combination, but it will also rollup the month column to produce a total for each cust_id-tariff combination.  While the data in the case study was already aggregated at the month level, the above query would also work on data that is at a finer granularity.

Once the data is generated by the inner query, there is an outer group by on cust_id with the COLLECT operation.  The purpose of this step is to generate an output of one row per customer, but each row contains an entry of type DM_NESTED_NUMERICALS.  This entry is a collection of pairs that capture the number of minutes per tariff-month combination.
While we performed missing value replacement in the previous transformation step, thereby densifying the data, Oracle Data Mining has a natural treatment for missing rows.  When data is presented as a DM_NESTED_NUMERICALS column, it is assumed that any missing entries correspond to a zero in the value - matching the first option for missing value treatment described earlier.  If this is the correct interpretation for missing values, then no missing value treatment step is necessary.  The data can remain in sparse form, yet the algorithms will correctly interpret the missing entries as having an implicit value of zero.
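The shape of the result can be sketched in Python; the rows below are hypothetical, and a plain dict stands in for the DM_NESTED_NUMERICALS collection of attribute-value pairs:

```python
from collections import defaultdict

# Hypothetical densified rows: (cust_id, tariff, month, mins).
rows = [
    (101, 'PEAK', 1, 120),
    (101, 'PEAK', 2, 105),
    (101, 'INTL', 1, 10),
]

# Mimic "group by cust_id, tariff, rollup(month)" followed by COLLECT:
# one entry per tariff-month pair, plus a '<tariff>-ALL' entry holding
# the total across months (the nvl(month,'ALL') in the SQL).
nested = defaultdict(dict)
for cust, tariff, month, mins in rows:
    nested[cust][f"{tariff}-{month}"] = mins
    key = f"{tariff}-ALL"
    nested[cust][key] = nested[cust].get(key, 0) + mins

# Each customer is now a single "row" whose value is a collection of
# attribute-value pairs.  A pair absent from the collection is treated
# as an implicit zero:
intl_feb = nested[101].get('INTL-2', 0)  # no INTL minutes in month 2
```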


3) Transposing revenue records
Again, no need to transpose when using Oracle Data Mining.  We add an aggregate to produce the total revenue per customer in addition to the per-month breakout coming from the COLLECT.


select cust_id, sum(revenue) rev_tot_sum,
  cast(collect(dm_nested_numerical('REV-'||month,revenue))
       as dm_nested_numericals) rev_per_mon
from revenues
group by cust_id;
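As a quick sketch of the output shape (again in Python, with invented revenue figures), each customer collapses to one row carrying a scalar total plus a nested collection of per-month pairs:

```python
# Hypothetical REVENUES rows: (cust_id, month, revenue).
revenues = [(101, 1, 40.0), (101, 2, 55.0), (102, 1, 30.0)]

# One output row per customer: the rev_tot_sum aggregate plus a
# collection of ('REV-<month>', revenue) pairs, as the COLLECT produces.
out = {}
for cust, month, rev in revenues:
    tot, pairs = out.get(cust, (0.0, {}))
    pairs[f"REV-{month}"] = rev
    out[cust] = (tot + rev, pairs)
```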


4) Creating derived attributes
The final transformation step in the case study is to generate some additional derived attributes, and to connect everything together so that each customer is represented by a single entity that includes all of the attributes identified to this point.
A view which comprises all of the above data preparation steps as well as the final pieces is as follows: 

create or replace view churn_prep as
with
q322 as
(select cust_id, tariff, month,
  nvl(mins, round(avg(mins) over (partition by cust_id, tariff))) mins
 from (
  select cust_id, cdre.tariff, cdre.month, mins
  from
   cdr_t cdr partition by (cust_id)
    right outer join
   (select distinct tariff, month from cdr_t) cdre
   on (cdr.month = cdre.month and cdr.tariff = cdre.tariff))),
q323 as
(select cust_id,
  cast(collect(dm_nested_numerical(tariff||'-'||nvl(month,'ALL'),mins))
       as dm_nested_numericals) mins_per_tariff_mon from
 (select cust_id, tariff, month, sum(mins) mins
  from q322
  group by cust_id, tariff, rollup(month))
 group by cust_id),
q324 as
(select cust_id, sum(revenue) rev_tot_sum,
  cast(collect(dm_nested_numerical('REV-'||month,revenue))
       as dm_nested_numericals) rev_per_mon
 from revenues
 group by cust_id)
select
 customer_id, age, gender, handset, tariff_type, tariff_plan, dropped_calls,
 churn_m6, all_dv51, all_dv52, all_dv53, all_dv54, all_ang51,
 rev_per_mon, mins_per_tariff_mon,
 case when l_o_s < 24 then 'SHORT' when l_o_s > 84 then 'LONG' else 'MED' end los_band,
 case when rev_tot_sum <= 100 then 'VERY LOW' when rev_tot_sum < 130 then 'LOW'
      when rev_tot_sum < 220 then 'MEDIUM' else 'HIGH' end value_band
from
 customers c,
 services s,
 q324,
 q323,
 (select cust_id, "5_MINS"-"1_MINS" ALL_DV51, "5_MINS"-"2_MINS" ALL_DV52,
         "5_MINS"-"3_MINS" ALL_DV53, "4_MINS"-"2_MINS" ALL_DV54,
         ("5_MINS"-"1_MINS")/4 ALL_ANG51 from
  (select *
   from (select cust_id, month, mins from q322)
   pivot (sum(mins) as mins
   for month in (1,2,3,4,5)))) vm
where customer_id = vm.cust_id(+)
  and customer_id = s.cust_id
  and customer_id = q324.cust_id
  and customer_id = q323.cust_id(+)
  and s.tariff_plan in ('CAT','PLAY');
 
The PIVOT operation is used to generate named columns that can be easily combined with arithmetic operations.  Binning and filtering steps, as identified in the case study, are included in the above SQL.
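The pivot-then-derive pattern in the vm inline view can be sketched in Python; the monthly minutes below are made up, and the derived attributes mirror the ALL_DV51 and ALL_ANG51 expressions (month-5 minus month-1 delta, and its average slope over four months):

```python
# Hypothetical per-customer minutes by month, mirroring the inner
# "select cust_id, month, mins from q322" feeding the PIVOT.
rows = [
    (101, 1, 100), (101, 2, 110), (101, 3, 120), (101, 4, 130), (101, 5, 140),
]

# Mimic PIVOT (sum(mins) ... FOR month IN (1,2,3,4,5)):
# one named column per month, summing across input rows.
pivoted = {}
for cust, month, mins in rows:
    pivoted.setdefault(cust, {})
    pivoted[cust][month] = pivoted[cust].get(month, 0) + mins

# Derived attributes over the named columns, as in the vm view.
m = pivoted[101]
all_dv51 = m[5] - m[1]         # delta between months 5 and 1
all_ang51 = (m[5] - m[1]) / 4  # average month-over-month slope
```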


The query can execute in parallel on SMPs, as well as MPPs using Oracle's RAC technology.  The data can be directly fed to Oracle Data Mining without having to extract it from the database, materialize copies of any parts of the underlying tables, or pivot data that is in a naturally multi-row format.


The final post in this series will show how to mine the prepared data using Oracle Data Mining.


 

2 comments:

  1. Thanks for sharing this.
    BTW, the PIVOT operation was introduced in Oracle 11g. So I can't run the script. :(

  2. True. Besides the Oracle database PIVOT syntax, there have been quite a few significant changes in Oracle Data Mining over the last few releases. However, ODM is able to mine a star schema in 10g. I have added links to the first post of this series that identify scripts that are supported by 10g (tested in 10gR2). For this post (2 of 3), PIVOT can be replaced by a group by with multiple decode statements. In addition, some manual data preparation for nested columns needs to be performed as ODM automatic data preparation was first introduced in 11g. See the linked script for actual code.
