Now that we have a mining model in the database, we can interrogate the database to understand what has been discovered during the training process.
Three catalog views contain the high-level information for mining models: ALL_MINING_MODELS, ALL_MINING_MODEL_ATTRIBUTES, and ALL_MINING_MODEL_SETTINGS.
By querying ALL_MINING_MODELS, we retrieve the list of all mining models available to the current user.
select model_name, mining_function, algorithm
from all_mining_models;
PREDICT_INCOME CLASSIFICATION SUPPORT_VECTOR_MACHINES
By querying ALL_MINING_MODEL_ATTRIBUTES, we retrieve the list of column names that were relevant when training the model, as well as other per-attribute information.
select attribute_name, attribute_type, target
from all_mining_model_attributes
where model_name = 'PREDICT_INCOME';
AGE NUMERICAL NO
WORKCLASS CATEGORICAL NO
FNLWGT NUMERICAL NO
EDUCATION CATEGORICAL NO
EDUCATION_NUM NUMERICAL NO
MARITAL_STATUS CATEGORICAL NO
OCCUPATION CATEGORICAL NO
RELATIONSHIP CATEGORICAL NO
RACE CATEGORICAL NO
SEX CATEGORICAL NO
CAPITAL_GAIN NUMERICAL NO
CAPITAL_LOSS NUMERICAL NO
HOURS_PER_WEEK NUMERICAL NO
NATIVE_COUNTRY CATEGORICAL NO
INCOME CATEGORICAL YES
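As a quick follow-up, the same view can summarize the predictors instead of listing them. The query below is only a sketch that reuses the columns already shown above (attribute_type and target); it counts the non-target attributes by type, which for the listing above should come to six numerical and eight categorical predictors.

```sql
-- Count the predictor (non-target) attributes of the model by type.
-- Uses only columns that appear in the query above.
select attribute_type, count(*) as attribute_count
from all_mining_model_attributes
where model_name = 'PREDICT_INCOME'
  and target = 'NO'
group by attribute_type
order by attribute_type;
```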
By querying ALL_MINING_MODEL_SETTINGS, we retrieve the list of model settings used during training. Some of these settings may have been specified by the user; others are computed automatically while training. For the support vector machine algorithm, Oracle supports two kernel functions: Gaussian and linear. In this instance, based on the shape and content of the data, Oracle chose the linear kernel, as evidenced by the setting value SVMS_LINEAR.
select setting_name, setting_value
from all_mining_model_settings
where model_name = 'PREDICT_INCOME';
ALGO_NAME ALGO_SUPPORT_VECTOR_MACHINES
SVMS_ACTIVE_LEARNING SVMS_AL_ENABLE
PREP_AUTO ON
SVMS_COMPLEXITY_FACTOR 0.495891
SVMS_KERNEL_FUNCTION SVMS_LINEAR
SVMS_CONV_TOLERANCE .001
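To tell user-specified settings apart from the ones the system picked, the view also exposes a SETTING_TYPE column. The sketch below assumes the documented values, 'INPUT' for settings supplied at build time and 'DEFAULT' for everything else; check the values returned on your own release before relying on them.

```sql
-- Show where each setting came from.
-- Assumption: SETTING_TYPE is 'INPUT' for user-supplied settings
-- and 'DEFAULT' for values chosen by the system.
select setting_name, setting_value, setting_type
from all_mining_model_settings
where model_name = 'PREDICT_INCOME'
order by setting_type, setting_name;
```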
The above three catalog views provide high-level information that is relevant to most Oracle Data Mining algorithms. Additional, deeper insight is often available by querying the details of the model. Each algorithm tends to use a different structure to represent what was learned while training, and therefore the structure of this returned information will vary from one algorithm to the next (possibly from one flavor of a given algorithm to the next). For SVM with linear kernel, the model will contain a set of coefficients, as shown below for the PREDICT_INCOME model.
select class, attribute_name, attribute_value, coefficient
from table(dbms_data_mining.get_model_details_svm('PREDICT_INCOME')) a, table(a.attribute_set) b
order by abs(coefficient) desc;
>50K CAPITAL_GAIN (null) 8.1179930161383904
>50K (null) (null) -4.1469740381933802
>50K EDUCATION_NUM (null) 1.85498650687918
>50K HOURS_PER_WEEK (null) 1.80588516494733
>50K CAPITAL_LOSS (null) 1.28361583304225
>50K AGE (null) 1.20889883984869
>50K EDUCATION Doctorate 1.1139153328993401
>50K NATIVE_COUNTRY Nicaragua -1.0957665557355201
>50K WORKCLASS Without-pay -0.99178110036931799
>50K NATIVE_COUNTRY Columbia -0.99178110036931699
>50K RELATIONSHIP Wife 0.99046458006739702
>50K NATIVE_COUNTRY Hungary -0.973898034330827
...
In order to understand the information from the above query, it is necessary to scratch the surface of the Support Vector Machine algorithm. As a result of training the model, we generate an equation which includes the following snippet:
a_1*x_1 + a_2*x_2 + a_3*x_3 + ... + a_n*x_n
where a_i is a coefficient from the above query result and x_i is the corresponding value of that attribute in the incoming record. For example, for a record with HOURS_PER_WEEK of 10, the contribution of HOURS_PER_WEEK toward determining the income level is 10 * 1.80588516494733. For a categorical attribute, x is 1 for the attribute_value that is present in the incoming record and 0 for all the others. So if the incoming record represents a person whose NATIVE_COUNTRY is Nicaragua, the equation includes 1 * -1.0957665557355201 = -1.0957665557355201, while the terms for Columbia, Hungary, etc. drop out because their x values are 0.
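The arithmetic for those two attributes can be checked directly in SQL. This is only a sketch for a hypothetical record (HOURS_PER_WEEK = 10, NATIVE_COUNTRY = Nicaragua) with the coefficients copied by hand from the query result above. It shows just two terms of the sum; a full score would add a term for every attribute in the record, plus the row with a null attribute name, which I am assuming is the intercept of the linear model.

```sql
-- Hypothetical record: HOURS_PER_WEEK = 10, NATIVE_COUNTRY = 'Nicaragua'.
-- Coefficients are copied from the GET_MODEL_DETAILS_SVM output above.
select 10 * 1.80588516494733      as hours_per_week_term,  -- numerical: x is the value itself
       1 * (-1.0957665557355201)  as nicaragua_term,       -- categorical value present: x = 1
       0 * (-0.99178110036931699) as columbia_term,        -- categorical value absent: x = 0
       10 * 1.80588516494733
         + 1 * (-1.0957665557355201) as partial_sum        -- only these two attributes
from dual;
```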
So, what does all this mean?
This means that, according to the trained model, larger values of HOURS_PER_WEEK make it more likely that the predicted income level is >50K (the class value identified in the above query result), because the coefficient is positive. It also means that people whose native country is Nicaragua are slightly less likely to have an income level >50K than people whose native country is Columbia: both coefficients are negative, and the one for Nicaragua has the larger magnitude, so it pushes further away from the high-income class.
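To make this kind of comparison across every value of a single attribute, the details query can be filtered by attribute name. The sketch below reuses the same GET_MODEL_DETAILS_SVM call as above and lists the NATIVE_COUNTRY coefficients from most negative to most positive; the b alias qualifiers are added here for clarity.

```sql
-- List the coefficient for each NATIVE_COUNTRY value, most negative first.
select b.attribute_value, b.coefficient
from table(dbms_data_mining.get_model_details_svm('PREDICT_INCOME')) a,
     table(a.attribute_set) b
where b.attribute_name = 'NATIVE_COUNTRY'
order by b.coefficient;
```

Values near the top of the result push a record away from the >50K class, and values near the bottom push toward it.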
By taking information from all of the attributes together, the model is able to yield a prediction as to whether or not a particular individual is likely to earn more than 50K. The final part of this ODM primer will demonstrate how the database provides these resulting predictions.