Thursday, September 16, 2010

Oracle Data Mining Primer (3 of 3): Scoring Models

Part 3 of this Oracle Data Mining Primer series will demonstrate how mining models are scored.

The Oracle SQL language has been extended to include data mining functions that apply persisted models to new data.  Classification, regression, and anomaly detection models can be used to predict likely outcomes.  Clustering models can be used to assign rows to clusters.  Feature extraction models can be used to transform the input data to a new space.

The data mining SQL functions are single row functions and are evaluated as any other built-in SQL function.  Parallel operations are fully supported without restriction as each input row can be evaluated independently.  The model information necessary to produce the scores (predictions, probabilities, etc.) is often loaded into shared memory as part of the shared cursor, enabling caching of the model content across query executions.

ODM scores the classification model that was built in part 1 of this primer by leveraging the PREDICTION* family of functions.  Let's take the following query as a starting point for further understanding:
select sum(decode(INCOME, 
                  PREDICTION(PREDICT_INCOME using *), 
                  1, 0))
           *100/count(*) as accuracy
from adult_test;

The query produces a measure of overall accuracy for the classification model when applied to the ADULT_TEST dataset.  The query reads in all rows from the input table, and for each row, uses a decode expression to compare the actual income level value, INCOME, with the result of applying the PREDICTION data mining function to the input row attributes.  As with a * on the select list, the * represents a wildcard which expands to all columns in the underlying data flow.  The input attributes are matched based on name to the attributes in the model.  The sum will count up how many correct predictions were made, and the *100/count(*) piece will translate this to a percentage of correct predictions.

Overall accuracy provides a very coarse measure.  It is often mode interesting to investigate how the model predicts the individual income level values.  The following query will produce the confusion matrix associated with the PREDICT_MODEL for the ADULT_TEST dataset:
select INCOME, predicted_income, count(*) from
(select INCOME, 
        PREDICTION(PREDICT_INCOME using *) predicted_income
 from adult_test)
group by INCOME, predicted_income;

In addition to producing the most likely income level value for a given row, Oracle Data Mining can also produce a probability associated with a prediction.  Such a probability can differentiate very likely predictions from marginal ones.  The following query retrieves the ten adults that are most likely to earn >50K:
select * from
(select a.*, 
  rank() 
   over (order by 
         PREDICTION_PROBABILITY(PREDICT_INCOME, '>50K' using *)
         desc) rnk
 from adult_test a)
where rnk <= 10;

This query uses the rank analytic function to rank the rows, where the input for ranking is the probability that an individual is likely to have an income level of >50K.  The PREDICTION_PROBABILITY function takes as input the mining model, the input data, and, optionally, the class value of interest.

The above examples just scratch the surface of how data mining results can be embedded within the context of SQL queries.

2 comments:

  1. Hi I try to learn ODM by following the doc but I cannot install the example successfully. Is there a zip file somewhere I can get all the examples.

    ReplyDelete
  2. Are you referring to the sample code that we publish? The main ODM page has a lot of information. Near the bottom is a link for Sample Code. It looks like the 11gR1 sample code link is broken at the moment, but the 10gR2 sample code will work.

    ReplyDelete