Wednesday, December 8, 2010

Mining a Star Schema: Telco Churn Case Study (1 of 3)

One of the strengths of Oracle Data Mining is the ability to mine star schemas with minimal effort.  Star schemas are commonly used in relational databases, and they often contain rich data with interesting patterns.  While dimension tables may contain interesting demographics, fact tables will often contain user behavior, such as phone usage or purchase patterns.  Both of these aspects - demographics and usage patterns - can provide insight into behavior.

Churn is a critical problem in the telecommunications industry, and companies go to great lengths to reduce the churn of their customer base.  One case study [1] describes a telecommunications scenario that involves understanding and identifying churn, where the underlying data is present in a star schema.  That case study is a good example for demonstrating just how natural it is for Oracle Data Mining to analyze a star schema, so it will be used as the basis for this series of posts.

The case study schema includes four tables: CUSTOMERS, SERVICES, REVENUES, and CDR_T.  The CUSTOMERS and SERVICES tables each contain one row per customer, and both contain a customer id that can be used to join the tables together.  Most data mining tools are capable of handling this type of data, where one row of input corresponds to one case for mining.

The other two tables have multiple rows for each customer.  The CDR_T (call data records) table contains multiple records per customer that capture calling behavior.  In the case study, this information is already pre-aggregated by type of call (peak, international, etc.) per month, but the information may also be available at a finer level of granularity.  The REVENUES table contains the revenue per customer on a monthly basis for a five-month history, so there are up to five rows per customer.

Capturing the information in the CDR_T and REVENUES tables to help predict churn for a single customer requires collapsing all of this fact table information into a single "case" per customer.  Most tools will require pivoting the data into columns, which has two drawbacks: it densifies the data, and it can push the number of columns past column count limits.  The data in a fact table is often stored in sparse form (this case study aggregates it to a denser form, but it need not be this way for other mining activities), and keeping it in sparse form is highly desirable.
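As a rough sketch of what collapsing fact table rows into one case per customer can look like without pivoting, the view below gathers each customer's REVENUES and CDR_T rows into nested columns of type DM_NESTED_NUMERICALS.  The view name and every column name other than the four table names (cust_id, month_no, revenue, call_type, minutes) are assumptions made for illustration and may not match the actual case study tables; part 2 of this series covers the real data preparation.

```sql
-- Sketch only: telco_cases_sketch, cust_id, month_no, revenue, call_type,
-- and minutes are hypothetical names, not taken from the case study DDL.
CREATE OR REPLACE VIEW telco_cases_sketch AS
SELECT c.cust_id,
       -- demographic and service columns from c and s would be listed here
       r.rev_by_month,
       d.mins_by_call_type
FROM   customers c
       JOIN services s ON s.cust_id = c.cust_id
       -- One nested entry per month that actually has a revenue row;
       -- months with no row are simply absent (the data stays sparse).
       LEFT JOIN (SELECT cust_id,
                         CAST(COLLECT(DM_NESTED_NUMERICAL(
                                'REV_M' || TO_CHAR(month_no), revenue))
                              AS DM_NESTED_NUMERICALS) AS rev_by_month
                  FROM   revenues
                  GROUP  BY cust_id) r
         ON r.cust_id = c.cust_id
       -- Total minutes per call type, then one nested entry per call type.
       LEFT JOIN (SELECT cust_id,
                         CAST(COLLECT(DM_NESTED_NUMERICAL(
                                call_type, total_minutes))
                              AS DM_NESTED_NUMERICALS) AS mins_by_call_type
                  FROM  (SELECT cust_id, call_type,
                                SUM(minutes) AS total_minutes
                         FROM   cdr_t
                         GROUP  BY cust_id, call_type)
                  GROUP  BY cust_id) d
         ON d.cust_id = c.cust_id;
```

Each row of this view is one case.  The nested columns hold only the attribute/value pairs that exist for that customer, so no wide, mostly empty pivoted table is created.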

For fact table data that has a much larger number of interesting groups (such as per-product sales information of a large retailer), retaining the sparse format becomes critical to avoid densification of such high cardinality information.  Oracle Data Mining algorithms are designed to interpret missing entries in a sparse fact table appropriately, enabling increased performance and simpler transformation processing.

Some steps in the referenced case study are not completely defined (in my opinion), and in those situations I will take my best guess as to the intended objective.  This approximation is sufficient since the intent of this series of posts is to show the power and flexibility of Oracle Data Mining on a real-world scenario rather than to match the case study letter-for-letter.

The following files support reproduction of the results in this series of posts:
telcoddl.sql - SQL which creates the four tables
telcoloadproc.plb - Obfuscated SQL which creates the procedure that can generate data and populate the tables - all data is generated, and patterns are injected to make it interesting and "real-world" like
telcoprep.sql - A SQL create view statement corresponding to the data preparation steps from part 2 of this series
telcomodel.sql - A SQL script corresponding to the steps from part 3 of this series

To prepare a schema that can run the above SQL, create a user with the following privileges: create table, create view, create mining model, and create procedure (for telcoloadproc), as well as any other privileges the database user needs (e.g., create session).  Once the schema is prepared, telcoddl.sql and telcoloadproc.plb can be run to create the empty tables and the procedure for loading data.  The procedure that is created is named telco_load, and it takes one optional argument - the number of customers (default 10000).  The results from parts 2 and 3 of this series correspond to loading 10,000 customers.
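The setup might look roughly like the following in SQL*Plus on 11g.  The user name telco, the password, and the USERS tablespace are placeholders chosen for this sketch; only the privileges, the script names, and the telco_load procedure come from the description above.

```sql
-- Run as a DBA account.  User name, password, and tablespace are placeholders.
CREATE USER telco IDENTIFIED BY "ChangeMe_1";
GRANT CREATE SESSION, CREATE TABLE, CREATE VIEW,
      CREATE MINING MODEL, CREATE PROCEDURE TO telco;
-- Assumes a USERS tablespace exists; the user needs quota to store table data.
ALTER USER telco QUOTA UNLIMITED ON users;

-- Then, connected as the new user, from the directory holding the scripts:
@telcoddl.sql
@telcoloadproc.plb

-- Creating the procedure does not load anything; it must be invoked.
EXEC telco_load(10000)

-- Quick sanity check that rows were generated.
SELECT COUNT(*) FROM customers;
```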

The sample code in these posts has been tested against an 11gR2 database.  Many new features have been added in each release, so some of the referenced routines and syntax are not available in older releases; however, similar functionality can be achieved with 10g.  The following modified scripts can be used with 10g (tested with 10gR2):
telcoprep_10g.sql - A SQL create view statement corresponding to the data preparation steps from part 2 of this series, including substitution for the 11g PIVOT syntax and inclusion of manual data preparation for nested columns.
telcomodel_10g.sql - A SQL script corresponding to the steps from part 3 of this series, including substitution of 10g Support Vector Machine for the Generalized Linear Model algorithm, manual data preparation leveraging the transformation package, use of dbms_data_mining.apply instead of 10gR2 built-in data mining scoring functions, explicit commit of settings prior to build, and removal of the EXPLAIN routine from the script flow.
In addition, the create mining model privilege is not available in 10g.
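To make the scoring difference concrete, here is a hedged sketch of the two styles.  The model name churn_model, the view churn_cases_v, the case id cust_id, and the result table name are all hypothetical; the actual scripts may differ in names and in the preparation applied before scoring.

```sql
-- Style 1: score inline with the built-in SQL scoring functions.
-- churn_model and churn_cases_v are hypothetical names.
SELECT cust_id,
       PREDICTION(churn_model USING *)             AS predicted_churn,
       PREDICTION_PROBABILITY(churn_model USING *) AS probability
FROM   churn_cases_v;

-- Style 2: write scores to a result table with DBMS_DATA_MINING.APPLY.
-- With manual data preparation, the input is assumed to be already
-- transformed the same way as the build data.
BEGIN
  DBMS_DATA_MINING.APPLY(
    model_name          => 'CHURN_MODEL',
    data_table_name     => 'CHURN_CASES_V',
    case_id_column_name => 'CUST_ID',
    result_table_name   => 'CHURN_APPLY_RESULT');  -- created by APPLY
END;
/
```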

The next part in this series will demonstrate how data in a star schema can be prepared for Oracle Data Mining.  The final part in this series will mine the data by building and testing a model.

[1] Richeldi, Marco and Perrucci, Alessandro, Churn Analysis Case Study, December 17, 2002.

3 comments:

  1. Updated the post with additional links to versions of the scripts that only leverage functionality available in 10g.

  2. I think telcoloadproc.plb isn't correct.
    Can you update telcoloadproc.plb file in order to populate data into the star schema?

  3. Daniel,
    The procedure is working for me. The file, telcoloadproc.plb, creates the procedure for data generation, but then the procedure needs to be invoked to actually load data. It can be invoked like:
    exec telco_load(10000);
    to load 10000 customers and associated information in all the tables.
