Thursday, September 16, 2010

Oracle Data Mining Primer (3 of 3): Scoring Models

Part 3 of this Oracle Data Mining Primer series will demonstrate how mining models are scored.

The Oracle SQL language has been extended to include data mining functions that apply persisted models to new data.  Classification, regression, and anomaly detection models can be used to predict likely outcomes.  Clustering models can be used to assign rows to clusters.  Feature extraction models can be used to transform the input data to a new space.

The data mining SQL functions are single row functions and are evaluated as any other built-in SQL function.  Parallel operations are fully supported without restriction as each input row can be evaluated independently.  The model information necessary to produce the scores (predictions, probabilities, etc.) is often loaded into shared memory as part of the shared cursor, enabling caching of the model content across query executions.

ODM scores the classification model that was built in part 1 of this primer by leveraging the PREDICTION* family of functions.  Let's take the following query as a starting point for further understanding:
select sum(decode(INCOME, 
                  PREDICTION(PREDICT_INCOME using *), 
                  1, 0))
           *100/count(*) as accuracy
from adult_test;

The query produces a measure of overall accuracy for the classification model when applied to the ADULT_TEST dataset.  The query reads in all rows from the input table, and for each row, uses a decode expression to compare the actual income level value, INCOME, with the result of applying the PREDICTION data mining function to the input row attributes.  As with a * on the select list, the * represents a wildcard which expands to all columns in the underlying data flow.  The input attributes are matched based on name to the attributes in the model.  The sum will count up how many correct predictions were made, and the *100/count(*) piece will translate this to a percentage of correct predictions.

Overall accuracy provides a very coarse measure.  It is often mode interesting to investigate how the model predicts the individual income level values.  The following query will produce the confusion matrix associated with the PREDICT_MODEL for the ADULT_TEST dataset:
select INCOME, predicted_income, count(*) from
(select INCOME, 
        PREDICTION(PREDICT_INCOME using *) predicted_income
 from adult_test)
group by INCOME, predicted_income;

In addition to producing the most likely income level value for a given row, Oracle Data Mining can also produce a probability associated with a prediction.  Such a probability can differentiate very likely predictions from marginal ones.  The following query retrieves the ten adults that are most likely to earn >50K:
select * from
(select a.*, 
   over (order by 
         desc) rnk
 from adult_test a)
where rnk <= 10;

This query uses the rank analytic function to rank the rows, where the input for ranking is the probability that an individual is likely to have an income level of >50K.  The PREDICTION_PROBABILITY function takes as input the mining model, the input data, and, optionally, the class value of interest.

The above examples just scratch the surface of how data mining results can be embedded within the context of SQL queries.

Friday, September 10, 2010

Oracle Data Mining Primer (2 of 3): Viewing Models

Now that we have a mining model in the database, we can interrogate the database to understand what has been discovered during the training process.

Three catalog views contain the high-level information for mining models: ALL_MINING_MODELS, ALL_MINING_MODEL_ATTRIBUTES, and ALL_MINING_MODEL_SETTINGS.

By querying ALL_MINING_MODELS, we retrieve the list of all mining models available to the current user.

select model_name, mining_function, algorithm
from all_mining_models;

By querying ALL_MINING_MODEL_ATTRIBUTES, we retrieve the list of column names that were relevant when training the model, as well as other per-attribute information.

select attribute_name, attribute_type, target
from all_mining_model_attributes
where model_name = 'PREDICT_INCOME';
AGE               NUMERICAL      NO
FNLWGT            NUMERICAL      NO
RACE              CATEGORICAL    NO
SEX               CATEGORICAL    NO

By querying ALL_MINING_MODEL_SETTINGS, we retrieve the list of model settings used during training.  Some of these settings may have been specified by the user, others automatically computed while training.  For the support vector machine algorithm, Oracle supports two kernel functions: gaussian and linear.  In this instance, based on the shape and content of data, Oracle chose to use the linear kernel as evidenced by the setting value SVMS_LINEAR.

select setting_name, setting_value
from all_mining_model_settings
where model_name = 'PREDICT_INCOME';
PREP_AUTO               ON

The above three catalog views provide high-level information that is relevant to most Oracle Data Mining algorithms.  Additional, deeper insight is often available by querying the details of the model.  Each algorithm tends to use a different structure to represent what was learned while training, and therefore the structure of this returned information will vary from one algorithm to the next (possibly from one flavor of a given algorithm to the next).  For SVM with linear kernel, the model will contain a set of coefficients, as shown below for the PREDICT_INCOME model.

select class, attribute_name, attribute_value, coefficient 
from table(dbms_data_mining.get_model_details_svm('PREDICT_INCOME')) a, table(a.attribute_set) b
order by abs(coefficient) desc;
 >50K    CAPITAL_GAIN      (null)          8.1179930161383904
 >50K    (null)            (null)         -4.1469740381933802
 >50K    EDUCATION_NUM     (null)          1.85498650687918
 >50K    HOURS_PER_WEEK    (null)          1.80588516494733
 >50K    CAPITAL_LOSS      (null)          1.28361583304225
 >50K    AGE               (null)          1.20889883984869
 >50K    EDUCATION         Doctorate       1.1139153328993401
 >50K    NATIVE_COUNTRY    Nicaragua      -1.0957665557355201
 >50K    WORKCLASS         Without-pay    -0.99178110036931799
 >50K    NATIVE_COUNTRY    Columbia       -0.99178110036931699
 >50K    RELATIONSHIP      Wife            0.99046458006739702
 >50K    NATIVE_COUNTRY    Hungary        -0.973898034330827

In order to understand the information from the above query, it is necessary to scratch the surface of the Support Vector Machine algorithm.  As a result of training the model, we generate an equation which includes the following snippet:
a1x1 + a2x2 + a3x3 + ... + anxn
where ai is a coefficient from the above query result and xi is the corresponding value for the attribute in the dataset.  For example, for a record that has HOURS_PER_WEEK of 10, the resulting contribution of HOURS_PER_WEEK in determining income level is (10*1.80588516494733).   For a categorical attribute, we assign 1 for the value of x for the attribute_value that is present in the incoming record, and 0 for all the others.  In this case, if the incoming record represents a person from Nicaragua, then we will include -1.0957665557355201 in the equation (1*-1.0957665557355201), but will not include anything related to people from Columbia, Hungary, etc. as their respective contributions are 0.

So, what does all this mean?
This means that, according to the trained model, larger values of HOURS_PER_WEEK will tend to make it more likely that the resulting income level is >50K (the class value identified in the above query result).  This also means that residents of Nicaragua are slightly less likely to have an income level >50K than residents of Columbia (the contribution away from the high income class is larger for Nicaragua since the magnitude of the coefficient is larger and both are negative).

By taking information from all of the attributes together, the model is able to yield a prediction as to whether or not a particular individual is likely to earn more than 50K.  The final part of this ODM primer will demonstrate how the database provides these resulting predictions.

Thursday, September 9, 2010

Oracle Data Mining Primer (1 of 3): Training Models

This post is the first of three intended to provide a quick demonstration of the Oracle Data Mining API.  Deeper concepts will be glossed over in the interest of keeping this primer material short and to the point.

Training a data mining model involves the process of learning the patterns and relationships in the underlying data.  Identifying a business problem, gathering data, and cleaning data are critical steps that must be performed prior to training a model, and these steps tend to consume large amounts of time and effort, but they are outside the scope of this API-focused post.

This example will use the Census Income dataset (Ref: Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository []. Irvine, CA: University of California, School of Information and Computer Science).  The census income data has two tables: ADULT_DATA (32,561 rows) for training a model and ADULT_TEST (16,281 rows) for testing the accuracy of the trained model.  The data contains a number of attributes, including a field that separates the records into individuals earning >50K per year from those earning <=50K per year.

When training the model, it is important for the algorithm to correctly interpret the data.  For example, an attribute called "AGE" may contain numerical values (e.g., 18, 45, etc.), categorical values (e.g., 'TEEN', 'ELDERLY', etc.), and it may even contain numbers that simply represent categorical values (e.g., 1 represents 'ELDERLY', 2 represents 'TEEN', etc.).  Machine learning algorithms often process numerical data differently from categorical data, so it is important that the algorithms correctly interpret the input.  The Oracle Data Mining API uses the following heuristic: treat columns with a database datatype of NUMBER or FLOAT as numeric in nature and treat columns with a database datatype of CHAR or VARCHAR2 as categorical in nature.  If the database datatype does not match the resulting interpretation, then the type should be coerced before training a model.

For this example, the following census attributes are stored as NUMBER: AGE, FNLWGT, EDUCATION_NUM, CAPITAL_GAIN, CAPITAL_LOSS, and HOURS_PER_WEEK.  The remaining attributes, including INCOME level, are all stored as VARCHAR2.

To guide the machine learning process, an optional settings table may be provided as input to the training stage.  The settings table can identify a specific algorithm to use as well as various algorithm and data preparation choices.  For this example, we will use the Support Vector Machine algorithm with automatic data preparation.

create table odm_settings
(setting_name varchar2(30), setting_value varchar2(4000));
insert into odm_settings values ('ALGO_NAME','ALGO_SUPPORT_VECTOR_MACHINES');
insert into odm_settings values ('PREP_AUTO','ON');
commit;  -- commit needed for releases prior to 11gR2

To train the model, we need to provide a few pieces of information (including the input data and the function that is to be performed).  The following API invocation trains a model to learn the patterns in the data that differentiate individuals based on their income level.

   model_name       => 'PREDICT_INCOME',     -- name of resulting object
   mining_function  => 'CLASSIFICATION',     -- mining function of interest
   data_table_name  => 'ADULT_DATA',         -- input data for training
   case_id_column_name   => NULL,            -- unique case/record identifier
   target_column_name    => 'INCOME',        -- attribute of interest
   settings_table_name   => 'ODM_SETTINGS'); -- optional settings

Great, now what?
The create_model procedure persists a new schema object in the database.  The next two posts will demonstrate the value of this mining model object.