Tuesday, March 1, 2011

Telco Churn Case Study Reloaded: Oracle Data Miner

The previous three-part series on mining star schemas (using a telco churn case study) demonstrated how to leverage SQL and the Oracle Data Mining API to capture complex methodology in a simpler, more performant approach.  I have been informed that there is this concept called a "user interface" which some people find enticing :-).  This post demonstrates the telco churn methodology using the shiny new Oracle Data Miner interface.  Oracle Data Miner is bundled with SQL Developer 3.0 (currently available as an early adopter release) and runs against Oracle Database 11gR2 with the Oracle Data Mining option enabled.

In addition to the four tables created in the previous post, this example includes a fifth table, CHURN_RISK, to demonstrate a deployment scenario.  The ODMtelcosetup.sql script can be invoked from within a clean data mining user schema to create and populate the five tables.  Alternatively, just run the CHURN_RISK table create statement from within the script if the rest of the schema is already set up as described in the earlier post.

With the schema prepared, use Oracle Data Miner to import the pre-defined ODMTelcoChurn.xml workflow that captures the telco churn methodology.  The Oracle Data Miner navigator can be accessed from the Tools->Data Miner menu in SQL Developer 3.0.  From within the Data Miner navigator, right-click on the mining user and select New Project.  The first connection attempt via the Data Miner navigator will kick off some setup work, including installation of a repository to support Oracle Data Miner (the SYS account password is required for this setup).  Once an empty project is created, right-click the project, select Import Workflow, and browse to the ODMTelcoChurn.xml file.  Once imported, the workflow should look like this:


Next, right-click the CHURN_RISK node (far-right) and select Run.  This will start a job in the database that will:
  1. Prepare the customer data
  2. Train and test a classification model named CHURN
  3. Score current, high-valued customers with the CHURN model
  4. Identify those current, high-valued customers who are at risk for churn
  5. Store the at-risk customers in the CHURN_RISK table
The transformation node that deserves the most attention is the COLLECT CDR node at the top.  This node is responsible for capturing usage minutes for each customer, broken out to different levels of granularity.  While this node is very simple to specify in Oracle Data Miner, thanks to the way ODM handles transactional data in a star schema, the same operation requires dozens of nodes and expensive data transformations when attempted in other data mining tools.

After running the workflow, the CHURN node contains a Generalized Linear Model (GLM) built on customer data.  The GLM model can be inspected, and the results of a held-aside test can be viewed, via right-click menu entries from that node.  A right-click on the CHURN_RISK update table node offers the ability to view data. This data - the list of high-valued, at-risk customers - can be integrated within any business process.
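As a trivial example of such integration - assuming the update table node was configured to carry the customer identifier along with the model's prediction and probability (the column names below are placeholders; use whatever was mapped in the node) - a campaign could pull the riskiest customers with a query like:

select customer_id, churn_probability
from churn_risk
order by churn_probability desc;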

More details on the individual workflow nodes are available in the full white paper.

Wednesday, December 8, 2010

Mining a Star Schema: Telco Churn Case Study (3 of 3)

Now that the "difficult" work is complete - preparing the data - we can move to building a predictive model to help identify and understand churn.

The case study suggests that separate models be built for different customer segments (high, medium, low, and very low value customer groups).  To reduce the data to a single segment, a filter can be applied:

create or replace view churn_data_high as
select * from churn_prep where value_band = 'HIGH';

It is simple to take a quick look at the predictive aspects of the data on a univariate basis.  While this does not capture the more complex multivariate effects that the full-blown data mining algorithms pick up, it can give a quick feel for how predictive the data is and help validate the data preparation steps.  Oracle Data Mining includes a predictive analytics package which enables quick analysis.

begin
 dbms_predictive_analytics.explain(
   'churn_data_high','churn_m6','expl_churn_tab');
end;
/
select * from expl_churn_tab where rank <= 5 order by rank;
ATTRIBUTE_NAME       ATTRIBUTE_SUBNAME EXPLANATORY_VALUE RANK
-------------------- ----------------- ----------------- ----------
LOS_BAND                                      .069167052          1
MINS_PER_TARIFF_MON  PEAK-5                   .034881648          2
REV_PER_MON          REV-5                    .034527798          3
DROPPED_CALLS                                 .028110322          4
MINS_PER_TARIFF_MON  PEAK-4                   .024698149          5


From the above results, it is clear that some predictors do contain information to help identify churn (explanatory value > 0).  The strongest univariate predictor of churn appears to be the customer's (binned) length of service.  The second strongest churn indicator appears to be the number of peak minutes used in the most recent month.  The subname column contains the interior piece of the DM_NESTED_NUMERICALS column described in the previous post.  By using the object relational approach, many related predictors are included within a single top-level column.

When building a predictive model, we need to ensure that the model is not over-trained.  One way to do this is to retain a held-aside dataset and test the model on unseen data.  Splitting the data into training and testing datasets is straightforward by using SQL predicates and a hash function:

create or replace view churn_test_high as
select * from churn_data_high where ora_hash(customer_id,99,0) >= 60;
create or replace view churn_train_high as
select * from churn_data_high where ora_hash(customer_id,99,0) < 60;


The above statements will separate the data into a 40% random sample for testing and the remaining 60% for training the model.  We can now pass the training data into an Oracle Data Mining routine to create a mining model.  In this example, we will use the GLM algorithm with automatic data preparation.

create table churn_set (setting_name varchar2(30), setting_value varchar2(4000));
insert into churn_set values ('ALGO_NAME','ALGO_GENERALIZED_LINEAR_MODEL');
insert into churn_set values ('PREP_AUTO','ON');
begin
 dbms_data_mining.create_model(
  'CHURN_MOD_HIGH','CLASSIFICATION','CHURN_TRAIN_HIGH',
  'CUSTOMER_ID','CHURN_M6','CHURN_SET');
end;
/


Now that we have built a model - CHURN_MOD_HIGH - we can test that model against the held-aside data to see how it performs.

select actual, predicted, count(*) cnt from (
select churn_m6 actual, prediction(churn_mod_high using *) predicted
from churn_test_high)
group by actual, predicted;


The above query will show the number of correct and incorrect predictions for all combinations, often referred to as a confusion matrix.
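If a single overall accuracy figure is preferred, the same scoring function can be folded into a ratio - a quick sketch using the model and test view defined above:

select sum(decode(churn_m6,
                  prediction(churn_mod_high using *),
                  1, 0))
           *100/count(*) as accuracy
from churn_test_high;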

Thus, without having to extract data or jump through hoops to massage star schema data into a flattened form, Oracle Data Mining is able to extract insight directly from the rich database data.

Mining a Star Schema: Telco Churn Case Study (2 of 3)

This post will follow the transformation steps as described in the case study, but will use Oracle SQL as the means for preparing data.  Please see the previous post for background material, including links to the case study and to scripts that can be used to replicate the stages in these posts.

1) Handling missing values for call data records
The CDR_T table records the number of phone minutes used by a customer per month and per call type (tariff).  For example, the table may contain one record corresponding to the number of peak (call type) minutes in January for a specific customer, and another record associated with international calls in March for the same customer.  This table is likely to be fairly dense (most type-month combinations for a given customer will be present) due to the coarse level of aggregation, but there may be some missing values.  Missing entries may occur for a number of reasons: the customer made no calls of a particular type in a particular month, the customer switched providers during the timeframe, or perhaps there is a data entry problem.  In the first situation, the correct interpretation of a missing entry would be to assume that the number of minutes for the type-month combination is zero.  In the other situations, it is not appropriate to assume zero, but rather to derive some representative value to replace the missing entries.  The referenced case study takes the latter approach.  The data is segmented by customer and call type, and within a given customer-call type combination, an average number of minutes is computed and used as a replacement value.
In SQL, we need to generate additional rows for the missing entries and populate those rows with appropriate values.  To generate the missing rows, Oracle's partition outer join feature is a perfect fit.
 
select cust_id, cdre.tariff, cdre.month, mins
from cdr_t cdr partition by (cust_id) right outer join
     (select distinct tariff, month from cdr_t) cdre
     on (cdr.month = cdre.month and cdr.tariff = cdre.tariff);

I have chosen to use a distinct on the CDR_T table to generate the set of tariff-month values, but a more rigorous and performant (though less compact) approach would be to explicitly list the tariff-month combinations in the cdre inline subquery rather than go back against the CDR_T table itself.
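As a rough sketch of that alternative - assuming months are numbered 1 through 5 (as in the PIVOT used later) and using placeholder tariff values that should be replaced with the actual domain - the cdre subquery could be written as:

(select t.tariff, m.month
 from (select 'PEAK' tariff from dual union all
       select 'OFFPEAK' from dual union all
       select 'INTERNATIONAL' from dual) t,
      (select level month from dual connect by level <= 5) m) cdre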

Now that the missing rows are generated, we need to replace the missing value entries with representative values as computed on a per-customer-call type basis.  Oracle's analytic functions are a great match for this step.
 
select cust_id, tariff, month,
  nvl(mins, round(avg(mins) over (partition by cust_id, tariff))) mins
from (<prev query>);

We can use the avg function, and specify the partition by feature of the over clause to generate an average within each customer-call type group.  The nvl function will replace the missing values with the tailored, computed averages.

2) Transposing call data records
The next transformation step in the case study involves transposing the data in CDR_T from a multiple row per customer format to a single row per customer by generating new columns for all of the tariff-month combinations.  While this is feasible with a small set of combinations, it will be problematic when addressing items with higher cardinality.  Oracle Data Mining does not need to transpose the data.  Instead, the data is combined using Oracle's object-relational technology so that it can remain in its natural, multi-row format.  Oracle Data Mining has introduced two datatypes to capture such data - DM_NESTED_NUMERICALS and DM_NESTED_CATEGORICALS.
In addition, the case study suggests adding an attribute which contains the total number of minutes per call type for a customer (summed across all months).  Oracle's rollup syntax is useful for generating aggregates at different levels of granularity.

select cust_id,
  cast(collect(dm_nested_numerical(tariff||'-'||nvl(month,'ALL'),mins))
       as dm_nested_numericals) mins_per_tariff_mon from
 (select cust_id, tariff, month, sum(mins) mins
  from (<prev query>)
  group by cust_id, tariff, rollup(month))
 group by cust_id;


The above query will first aggregate the minutes by cust_id-tariff-month combination, but it will also rollup the month column to produce a total for each cust_id-tariff combination.  While the data in the case study was already aggregated at the month level, the above query would also work on data that is at a finer granularity.

Once the data is generated by the inner query, there is an outer group by on cust_id with the COLLECT operation.  The purpose of this step is to generate an output of one row per customer, but each row contains an entry of type DM_NESTED_NUMERICALS.  This entry is a collection of pairs that capture the number of minutes per tariff-month combination.
While we performed missing value replacement in the previous transformation step, thereby densifying the data, Oracle Data Mining has a natural treatment for missing rows.  When data is presented as a DM_NESTED_NUMERICALS column, it is assumed that any missing entries correspond to a zero in the value - matching the first option for missing value treatment described earlier.  If this is the correct interpretation for missing values, then no missing value treatment step is necessary.  The data can remain in sparse form, yet the algorithms will correctly interpret the missing entries as having an implicit value of zero.
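A quick way to sanity-check the collected data is to unnest the DM_NESTED_NUMERICALS column back into rows with the table() operator - a sketch, using the same <prev query> placeholder convention as above:

select t.cust_id, n.attribute_name, n.value
from (<prev query>) t, table(t.mins_per_tariff_mon) n
order by t.cust_id, n.attribute_name;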


3) Transposing revenue records
Again, no need to transpose when using Oracle Data Mining.  We add an aggregate to produce the total revenue per customer in addition to the per-month breakout coming from the COLLECT.


select cust_id, sum(revenue) rev_tot_sum,
  cast(collect(dm_nested_numerical('REV-'||month,revenue))
       as dm_nested_numericals) rev_per_mon
from revenues
group by cust_id;


4) Creating derived attributes
The final transformation step in the case study is to generate some additional derived attributes, and connect everything together so that each customer is represented by a single entity that includes all of the attributes identified to this point.
A view which comprises all of the above data preparation steps as well as the final pieces is as follows: 

create or replace view churn_prep as
with
q322 as
(select cust_id, tariff, month,
  nvl(mins, round(avg(mins) over (partition by cust_id, tariff))) mins
 from (
  select cust_id, cdre.tariff, cdre.month, mins
  from
   cdr_t cdr partition by (cust_id)
    right outer join
   (select distinct tariff, month from cdr_t) cdre
   on (cdr.month = cdre.month and cdr.tariff = cdre.tariff))),
q323 as
(select cust_id,
  cast(collect(dm_nested_numerical(tariff||'-'||nvl(month,'ALL'),mins))
       as dm_nested_numericals) mins_per_tariff_mon from
 (select cust_id, tariff, month, sum(mins) mins
  from q322
  group by cust_id, tariff, rollup(month))
 group by cust_id),
q324 as
(select cust_id, sum(revenue) rev_tot_sum,
  cast(collect(dm_nested_numerical('REV-'||month,revenue))
       as dm_nested_numericals) rev_per_mon
 from revenues
 group by cust_id)
select
 customer_id, age, gender, handset, tariff_type, tariff_plan, dropped_calls,
 churn_m6, all_dv51, all_dv52, all_dv53, all_dv54, all_ang51,
 rev_per_mon, mins_per_tariff_mon,
 case when l_o_s < 24 then 'SHORT' when l_o_s > 84 then 'LONG' else 'MED' end los_band,
 case when rev_tot_sum <= 100 then 'VERY LOW' when rev_tot_sum < 130 then 'LOW'
      when rev_tot_sum < 220 then 'MEDIUM' else 'HIGH' end value_band
from
 customers c,
 services s,
 q324,
 q323,
 (select cust_id, "5_MINS"-"1_MINS" ALL_DV51, "5_MINS"-"2_MINS" ALL_DV52,
         "5_MINS"-"3_MINS" ALL_DV53, "4_MINS"-"2_MINS" ALL_DV54,
         ("5_MINS"-"1_MINS")/4 ALL_ANG51 from
  (select *
   from (select cust_id, month, mins from q322)
   pivot (sum(mins) as mins
   for month in (1,2,3,4,5)))) vm
where customer_id = vm.cust_id(+)
  and customer_id = s.cust_id
  and customer_id = q324.cust_id
  and customer_id = q323.cust_id(+)
  and s.tariff_plan in ('CAT','PLAY');
 
The PIVOT operation is used to generate named columns that can be easily combined with arithmetic operations.  Binning and filtering steps, as identified in the case study, are included in the above SQL.


The query can execute in parallel on SMPs, as well as MPPs using Oracle's RAC technology.  The data can be directly fed to Oracle Data Mining without having to extract it from the database, materialize copies of any parts of the underlying tables, or pivot data that is in a naturally multi-row format.
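For instance, a statement-level parallel hint on a query against the prepared view is enough to spread the preparation work across available resources - a sketch, with the degree of parallelism chosen arbitrarily:

select /*+ parallel(4) */ count(*) from churn_prep;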


The final post in this series will show how to mine the prepared data using Oracle Data Mining.


 

Mining a Star Schema: Telco Churn Case Study (1 of 3)

One of the strengths of Oracle Data Mining is the ability to mine star schemas with minimal effort.  Star schemas are commonly used in relational databases, and they often contain rich data with interesting patterns.  While dimension tables may contain interesting demographics, fact tables will often contain user behavior, such as phone usage or purchase patterns.  Both of these aspects - demographics and usage patterns - can provide insight into behavior.

Churn is a critical problem in the telecommunications industry, and companies go to great lengths to reduce the churn of their customer base.  One case study [1] describes a telecommunications scenario involving understanding and identifying churn, where the underlying data is present in a star schema.  That case study is a good example for demonstrating just how natural it is for Oracle Data Mining to analyze a star schema, so it will be used as the basis for this series of posts.

The case study schema includes four tables: CUSTOMERS, SERVICES, REVENUES, and CDR_T.  The CUSTOMERS table contains one row per customer, as does the SERVICES table, and both contain a customer id that can be used to join the tables together.  Most data mining tools are capable of handling this type of data, where one row of input corresponds to one case for mining.  The other two tables have multiple rows for each customer.  The CDR_T (call data records) table contains multiple records per customer capturing calling behavior.  In the case study, this information is already pre-aggregated by type of call (peak, international, etc.) per month, but the information may also be available at a finer level of granularity.  The REVENUES table contains the revenue per customer on a monthly basis for a five-month history, so there are up to five rows per customer.  Capturing the information in the CDR_T and REVENUES tables to help predict churn for a single customer requires collapsing all of this fact table information into a single "case" per customer.  Most tools require pivoting the data into columns, which has the drawbacks of densifying the data and of potentially exceeding column count limitations.  The data in a fact table is often stored in sparse form (this case study aggregates it to a denser form, but it need not be this way for other mining activities), and keeping it in sparse form is highly desirable.

For fact table data that has a much larger number of interesting groups (such as per-product sales information of a large retailer), retaining the sparse format becomes critical to avoid densification of such high cardinality information.  Oracle Data Mining algorithms are designed to interpret missing entries in a sparse fact table appropriately, enabling increased performance and simpler transformation processing.

Some steps in the referenced case study are not completely defined (in my opinion), and in those situations I will take my best guess as to the intended objective.  This approximation is sufficient since the intent of this series of posts is to show the power and flexibility of Oracle Data Mining on a real-world scenario rather than to match the case study letter-for-letter.

The following files support reproduction of the results in this series of posts:
telcoddl.sql - SQL which creates the four tables
telcoloadproc.plb - Obfuscated SQL which creates the procedure that can generate data and populate the tables - all data is generated, and patterns are injected to make it interesting and "real-world" like
telcoprep.sql - A SQL create view statement corresponding to the data preparation steps from part 2 of this series
telcomodel.sql - A SQL script corresponding to the steps from part 3 of this series

In order to prepare a schema that can run the above SQL, a user must be created with the following privileges: create table, create view, create mining model, and create procedure (for telcoloadproc), as well as any other privileges needed for the database user (e.g., create session).  Once the schema is prepared, telcoddl.sql and telcoloadproc.plb can be run to create the empty tables and the procedure for loading data.  The procedure that is created is named telco_load, and it takes one optional argument - the number of customers (default 10000).  The results from parts 2 and 3 of this series correspond to loading 10,000 customers.
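A minimal setup along these lines (user name and password are placeholders) might look like:

create user telco_miner identified by telco_miner
  quota unlimited on users;  -- or a quota on whichever tablespace is appropriate
grant create session, create table, create view,
      create mining model, create procedure to telco_miner;
-- connected as telco_miner, run telcoddl.sql and telcoloadproc.plb, then:
exec telco_load(10000)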

The sample code in these posts has been tested against an 11gR2 database.  Many new features have been added in each release, so some of the referenced routines and syntax are not available in older releases; however, similar functionality can be achieved with 10g.  The following modified scripts can be used with 10g (tested with 10gR2):
telcoprep_10g.sql - A SQL create view statement corresponding to the data preparation steps from part 2 of this series, including substitution for the 11g PIVOT syntax and inclusion of manual data preparation for nested columns.
telcomodel_10g.sql - A SQL script corresponding to the steps from part 3 of this series, including substitution of the Support Vector Machine algorithm for the Generalized Linear Model (which is not available in 10g), manual data preparation leveraging the transformation package, use of dbms_data_mining.apply instead of the built-in data mining scoring functions, explicit commit of settings prior to build, and removal of the EXPLAIN routine from the script flow.
In addition, the create mining model privilege is not available in 10g.

The next part in this series will demonstrate how data in a star schema can be prepared for Oracle Data Mining.  The final part in this series will mine the data by building and testing a model.

[1] Richeldi, Marco and Perrucci, Alessandro, Churn Analysis Case Study, December 17, 2002.

Thursday, September 16, 2010

Oracle Data Mining Primer (3 of 3): Scoring Models

Part 3 of this Oracle Data Mining Primer series will demonstrate how mining models are scored.

The Oracle SQL language has been extended to include data mining functions that apply persisted models to new data.  Classification, regression, and anomaly detection models can be used to predict likely outcomes.  Clustering models can be used to assign rows to clusters.  Feature extraction models can be used to transform the input data to a new space.

The data mining SQL functions are single row functions and are evaluated as any other built-in SQL function.  Parallel operations are fully supported without restriction as each input row can be evaluated independently.  The model information necessary to produce the scores (predictions, probabilities, etc.) is often loaded into shared memory as part of the shared cursor, enabling caching of the model content across query executions.

ODM scores the classification model that was built in part 1 of this primer by leveraging the PREDICTION* family of functions.  Let's take the following query as a starting point for further understanding:
select sum(decode(INCOME, 
                  PREDICTION(PREDICT_INCOME using *), 
                  1, 0))
           *100/count(*) as accuracy
from adult_test;

The query produces a measure of overall accuracy for the classification model when applied to the ADULT_TEST dataset.  The query reads in all rows from the input table, and for each row, uses a decode expression to compare the actual income level value, INCOME, with the result of applying the PREDICTION data mining function to the input row attributes.  As with a * on the select list, the * represents a wildcard which expands to all columns in the underlying data flow.  The input attributes are matched based on name to the attributes in the model.  The sum will count up how many correct predictions were made, and the *100/count(*) piece will translate this to a percentage of correct predictions.

Overall accuracy provides a very coarse measure.  It is often more interesting to investigate how the model predicts the individual income level values.  The following query will produce the confusion matrix associated with the PREDICT_INCOME model for the ADULT_TEST dataset:
select INCOME, predicted_income, count(*) from
(select INCOME, 
        PREDICTION(PREDICT_INCOME using *) predicted_income
 from adult_test)
group by INCOME, predicted_income;

In addition to producing the most likely income level value for a given row, Oracle Data Mining can also produce a probability associated with a prediction.  Such a probability can differentiate very likely predictions from marginal ones.  The following query retrieves the ten adults that are most likely to earn >50K:
select * from
(select a.*, 
  rank() 
   over (order by 
         PREDICTION_PROBABILITY(PREDICT_INCOME, '>50K' using *)
         desc) rnk
 from adult_test a)
where rnk <= 10;

This query uses the rank analytic function to rank the rows, where the input for ranking is the probability that an individual is likely to have an income level of >50K.  The PREDICTION_PROBABILITY function takes as input the mining model, the input data, and, optionally, the class value of interest.

The above examples just scratch the surface of how data mining results can be embedded within the context of SQL queries.
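For example, the prediction probability can be aggregated alongside ordinary columns in a reporting query - a sketch reusing the model and test data from this primer:

select native_country,
       avg(PREDICTION_PROBABILITY(PREDICT_INCOME, '>50K' using *)) avg_prob,
       count(*) cnt
from adult_test
group by native_country
order by avg_prob desc;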

Friday, September 10, 2010

Oracle Data Mining Primer (2 of 3): Viewing Models

Now that we have a mining model in the database, we can interrogate the database to understand what has been discovered during the training process.

Three catalog views contain the high-level information for mining models: ALL_MINING_MODELS, ALL_MINING_MODEL_ATTRIBUTES, and ALL_MINING_MODEL_SETTINGS.

By querying ALL_MINING_MODELS, we retrieve the list of all mining models available to the current user.

select model_name, mining_function, algorithm
from all_mining_models;
PREDICT_INCOME    CLASSIFICATION    SUPPORT_VECTOR_MACHINES

By querying ALL_MINING_MODEL_ATTRIBUTES, we retrieve the list of column names that were relevant when training the model, as well as other per-attribute information.

select attribute_name, attribute_type, target
from all_mining_model_attributes
where model_name = 'PREDICT_INCOME';
AGE               NUMERICAL      NO
WORKCLASS         CATEGORICAL    NO
FNLWGT            NUMERICAL      NO
EDUCATION         CATEGORICAL    NO
EDUCATION_NUM     NUMERICAL      NO
MARITAL_STATUS    CATEGORICAL    NO
OCCUPATION        CATEGORICAL    NO
RELATIONSHIP      CATEGORICAL    NO
RACE              CATEGORICAL    NO
SEX               CATEGORICAL    NO
CAPITAL_GAIN      NUMERICAL      NO
CAPITAL_LOSS      NUMERICAL      NO
HOURS_PER_WEEK    NUMERICAL      NO
NATIVE_COUNTRY    CATEGORICAL    NO
INCOME            CATEGORICAL    YES

By querying ALL_MINING_MODEL_SETTINGS, we retrieve the list of model settings used during training.  Some of these settings may have been specified by the user, others automatically computed while training.  For the support vector machine algorithm, Oracle supports two kernel functions: gaussian and linear.  In this instance, based on the shape and content of data, Oracle chose to use the linear kernel as evidenced by the setting value SVMS_LINEAR.

select setting_name, setting_value
from all_mining_model_settings
where model_name = 'PREDICT_INCOME';
ALGO_NAME               ALGO_SUPPORT_VECTOR_MACHINES
SVMS_ACTIVE_LEARNING    SVMS_AL_ENABLE
PREP_AUTO               ON
SVMS_COMPLEXITY_FACTOR  0.495891
SVMS_KERNEL_FUNCTION    SVMS_LINEAR
SVMS_CONV_TOLERANCE     .001
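
If a particular kernel is desired rather than the automatic choice, it can be requested explicitly in the settings table before training - a sketch (SVMS_GAUSSIAN requests the non-linear Gaussian kernel):

insert into odm_settings values ('SVMS_KERNEL_FUNCTION','SVMS_GAUSSIAN');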

The above three catalog views provide high-level information that is relevant to most Oracle Data Mining algorithms.  Additional, deeper insight is often available by querying the details of the model.  Each algorithm tends to use a different structure to represent what was learned while training, and therefore the structure of this returned information will vary from one algorithm to the next (possibly from one flavor of a given algorithm to the next).  For SVM with linear kernel, the model will contain a set of coefficients, as shown below for the PREDICT_INCOME model.

select class, attribute_name, attribute_value, coefficient 
from table(dbms_data_mining.get_model_details_svm('PREDICT_INCOME')) a,
     table(a.attribute_set) b
order by abs(coefficient) desc;
 >50K    CAPITAL_GAIN      (null)          8.1179930161383904
 >50K    (null)            (null)         -4.1469740381933802
 >50K    EDUCATION_NUM     (null)          1.85498650687918
 >50K    HOURS_PER_WEEK    (null)          1.80588516494733
 >50K    CAPITAL_LOSS      (null)          1.28361583304225
 >50K    AGE               (null)          1.20889883984869
 >50K    EDUCATION         Doctorate       1.1139153328993401
 >50K    NATIVE_COUNTRY    Nicaragua      -1.0957665557355201
 >50K    WORKCLASS         Without-pay    -0.99178110036931799
 >50K    NATIVE_COUNTRY    Columbia       -0.99178110036931699
 >50K    RELATIONSHIP      Wife            0.99046458006739702
 >50K    NATIVE_COUNTRY    Hungary        -0.973898034330827
 ...

In order to understand the information from the above query, it is necessary to scratch the surface of the Support Vector Machine algorithm.  As a result of training the model, we generate an equation which includes the following snippet:
a1*x1 + a2*x2 + a3*x3 + ... + an*xn
where ai is a coefficient from the above query result and xi is the corresponding value for the attribute in the dataset.  For example, for a record that has HOURS_PER_WEEK of 10, the resulting contribution of HOURS_PER_WEEK in determining income level is (10*1.80588516494733).   For a categorical attribute, we assign 1 for the value of x for the attribute_value that is present in the incoming record, and 0 for all the others.  In this case, if the incoming record represents a person from Nicaragua, then we will include -1.0957665557355201 in the equation (1*-1.0957665557355201), but will not include anything related to people from Columbia, Hungary, etc. as their respective contributions are 0.

So, what does all this mean?
This means that, according to the trained model, larger values of HOURS_PER_WEEK will tend to make it more likely that the resulting income level is >50K (the class value identified in the above query result).  This also means that residents of Nicaragua are slightly less likely to have an income level >50K than residents of Columbia (the contribution away from the high income class is larger for Nicaragua since the magnitude of the coefficient is larger and both are negative).

By taking information from all of the attributes together, the model is able to yield a prediction as to whether or not a particular individual is likely to earn more than 50K.  The final part of this ODM primer will demonstrate how the database provides these resulting predictions.

Thursday, September 9, 2010

Oracle Data Mining Primer (1 of 3): Training Models

This post is the first of three intended to provide a quick demonstration of the Oracle Data Mining API.  Deeper concepts will be glossed over in the interest of keeping this primer material short and to the point.

Training a data mining model involves the process of learning the patterns and relationships in the underlying data.  Identifying a business problem, gathering data, and cleaning data are critical steps that must be performed prior to training a model, and these steps tend to consume large amounts of time and effort, but they are outside the scope of this API-focused post.

This example will use the Census Income dataset (Ref: Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science).  The census income data has two tables: ADULT_DATA (32,561 rows) for training a model and ADULT_TEST (16,281 rows) for testing the accuracy of the trained model.  The data contains a number of attributes, including a field that separates individuals earning >50K per year from those earning <=50K per year.

When training the model, it is important for the algorithm to correctly interpret the data.  For example, an attribute called "AGE" may contain numerical values (e.g., 18, 45, etc.), categorical values (e.g., 'TEEN', 'ELDERLY', etc.), and it may even contain numbers that simply represent categorical values (e.g., 1 represents 'ELDERLY', 2 represents 'TEEN', etc.).  Machine learning algorithms often process numerical data differently from categorical data, so it is important that the algorithms correctly interpret the input.  The Oracle Data Mining API uses the following heuristic: treat columns with a database datatype of NUMBER or FLOAT as numeric in nature and treat columns with a database datatype of CHAR or VARCHAR2 as categorical in nature.  If the database datatype does not match the resulting interpretation, then the type should be coerced before training a model.
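For example, if a NUMBER column really encodes categories, a view can coerce it ahead of training - a sketch with entirely hypothetical table and column names:

create or replace view training_view as
select customer_id,
       age,                          -- genuinely numeric, leave as NUMBER
       to_char(region_code) region   -- coded NUMBER treated as a category
from source_table;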

For this example, the following census attributes are stored as NUMBER: AGE, FNLWGT, EDUCATION_NUM, CAPITAL_GAIN, CAPITAL_LOSS, and HOURS_PER_WEEK.  The remaining attributes, including INCOME level, are all stored as VARCHAR2.


To guide the machine learning process, an optional settings table may be provided as input to the training stage.  The settings table can identify a specific algorithm to use as well as various algorithm and data preparation choices.  For this example, we will use the Support Vector Machine algorithm with automatic data preparation.


create table odm_settings
(setting_name varchar2(30), setting_value varchar2(4000));
insert into odm_settings values ('ALGO_NAME','ALGO_SUPPORT_VECTOR_MACHINES');
insert into odm_settings values ('PREP_AUTO','ON');
commit;  -- commit needed for releases prior to 11gR2

To train the model, we need to provide a few pieces of information (including the input data and the function that is to be performed).  The following API invocation trains a model to learn the patterns in the data that differentiate individuals based on their income level.

begin
 dbms_data_mining.create_model(
   model_name       => 'PREDICT_INCOME',     -- name of resulting object
   mining_function  => 'CLASSIFICATION',     -- mining function of interest
   data_table_name  => 'ADULT_DATA',         -- input data for training
   case_id_column_name   => NULL,            -- unique case/record identifier
   target_column_name    => 'INCOME',        -- attribute of interest
   settings_table_name   => 'ODM_SETTINGS'); -- optional settings
end;
/

Great, now what?
The create_model procedure persists a new schema object in the database.  The next two posts will demonstrate the value of this mining model object.