Thursday, September 9, 2010

Oracle Data Mining Primer (1 of 3): Training Models

This post is the first of three intended to provide a quick demonstration of the Oracle Data Mining API.  Deeper concepts will be glossed over in the interest of keeping this primer material short and to the point.

Training a data mining model involves the process of learning the patterns and relationships in the underlying data.  Identifying a business problem, gathering data, and cleaning data are critical steps that must be performed prior to training a model, and these steps tend to consume large amounts of time and effort, but they are outside the scope of this API-focused post.

This example will use the Census Income dataset (Ref: Frank, A. & Asuncion, A. (2010). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science).  The census income data has two tables: ADULT_DATA (32,561 rows) for training a model and ADULT_TEST (16,281 rows) for testing the accuracy of the trained model.  The data contains a number of attributes, including a field that separates the records into individuals earning >50K per year from those earning <=50K per year.

When training the model, it is important for the algorithm to correctly interpret the data.  For example, an attribute called "AGE" may contain numerical values (e.g., 18, 45, etc.), categorical values (e.g., 'TEEN', 'ELDERLY', etc.), and it may even contain numbers that simply represent categorical values (e.g., 1 represents 'ELDERLY', 2 represents 'TEEN', etc.).  Machine learning algorithms often process numerical data differently from categorical data, so it is important that the algorithms correctly interpret the input.  The Oracle Data Mining API uses the following heuristic: treat columns with a database datatype of NUMBER or FLOAT as numeric in nature and treat columns with a database datatype of CHAR or VARCHAR2 as categorical in nature.  If the database datatype does not match the resulting interpretation, then the type should be coerced before training a model.

For this example, the following census attributes are stored as NUMBER: AGE, FNLWGT, EDUCATION_NUM, CAPITAL_GAIN, CAPITAL_LOSS, and HOURS_PER_WEEK.  The remaining attributes, including INCOME level, are all stored as VARCHAR2.


To guide the machine learning process, an optional settings table may be provided as input to the training stage.  The settings table can identify a specific algorithm to use as well as various algorithm and data preparation choices.  For this example, we will use the Support Vector Machine algorithm with automatic data preparation.


create table odm_settings
(setting_name varchar2(30), setting_value varchar2(4000));
insert into odm_settings values ('ALGO_NAME','ALGO_SUPPORT_VECTOR_MACHINES');
insert into odm_settings values ('PREP_AUTO','ON');
commit;  -- commit needed for releases prior to 11gR2

To train the model, we need to provide a few pieces of information (including the input data and the function that is to be performed).  The following API invocation trains a model to learn the patterns in the data that differentiate individuals based on their income level.

begin
 dbms_data_mining.create_model(
   model_name       => 'PREDICT_INCOME',     -- name of resulting object
   mining_function  => 'CLASSIFICATION',     -- mining function of interest
   data_table_name  => 'ADULT_DATA',         -- input data for training
   case_id_column_name   => NULL,            -- unique case/record identifier
   target_column_name    => 'INCOME',        -- attribute of interest
   settings_table_name   => 'ODM_SETTINGS'); -- optional settings
end;
/

Great, now what?
The create_model procedure persists a new schema object in the database.  The next two posts will demonstrate the value of this mining model object.

No comments:

Post a Comment