Title: How do we mine data
1How do we mine data?
- The process of data mining is described as a process of model building.
- Five main steps to data mining:
- 1. Data Preparation
- 2. Defining a study
- Reading the data and building a model
- Understanding the model
- 3. Data Mining
- 4. Analysis of Results.
- 5. Assimilation of Knowledge.
2Step(1) Data Preparation
- It is considered the heart of data mining.
- Example: if you want to find out who will respond to a direct marketing process, you need data about customers who have previously responded to a mailer.
- Example: if you have their names and addresses, you should know that this type of information is unique to a customer and therefore not the best data to be mined.
- Information like city and state is descriptive information. Demographic information such as age, income, interests, and household type is more valuable.
3Data Preparation Issues
- (1) Data Cleaning
- Consistency problem: a column containing a list of soft drinks may have the values Pepsi, Coca Cola and Cola. These values refer to the same drink (soft drink), but the computer does not know that they are the same.
- Stale data problem: a database has to be continually updated, because people may move and their addresses change. An old address that is no longer correct is often referred to as stale.
- Typographical errors: words are frequently misspelled or typed incorrectly.
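The consistency problem above can be sketched in a few lines of Python. This is only an illustration: the mapping table `CANONICAL`, the function name `normalize_drink`, and the label "Cola" are assumptions made for the example, not part of any particular tool.

```python
# Hypothetical mapping from spellings found in the data to one canonical label.
# In a real project this table is built by inspecting the column's distinct values.
CANONICAL = {
    "pepsi": "Cola",
    "coca cola": "Cola",
    "cola": "Cola",
}


def normalize_drink(value):
    """Map known variants to one label; leave unknown values unchanged."""
    # Lowercase and collapse whitespace so "Coca  Cola " matches "coca cola".
    key = " ".join(value.strip().lower().split())
    return CANONICAL.get(key, value.strip())


drinks = ["Pepsi", "Coca Cola", " cola ", "Lemonade"]
cleaned = [normalize_drink(d) for d in drinks]
print(cleaned)
```

Values that are not in the table (here "Lemonade") pass through untouched, so they can be reviewed by hand rather than silently changed.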
4- (2) Missing Values
- Some data mining techniques require rows of data to be complete in order to mine the data. If too many values are missing in a data set, it becomes hard to extract any useful information or to make predictions.
- (3) Data Derivation
- The most interesting data may have to be derived from existing columns, for example with functions such as MAX or SUM.
5- (4) Merging Data
- Data are stored in the form of tables. Merging
data can be achieved in a number of ways such as
SQL statements or export of the data into a file.
6Effort Required for Each Data Mining Process Step
[Figure: bar chart of Effort (axis 10 to 70) for each step: Defining a study, Data Preparation, Data Mining, Analysis of Results and Knowledge Assimilation]
7The Data Mining Process Begins and Ends with the Business Objectives
[Figure: process flow. Database --> Select --> Selected Data --> Preprocess --> Preprocessed Data --> Transform --> Transformed Data --> Mine --> Extracted Information --> Analyze and Assimilate --> Assimilated Knowledge]
- The data mining process is an iterative process.
8Example of Data on patient recovery from severe back pain
9Data Preparation
- Getting at your data
- It is not a straightforward task if the data is stored in many places.
- Example: data about patients, doctors, hospitals, insurance, ... may be stored in different databases.
- Even if the data is in one relational database, it is likely to be stored in multiple tables.
10Ways to Access Data for Data Mining
- Accessing Data Warehouses
- Accessing Data through Relational Databases (by creating a view on the database side, which is a way to make multiple tables appear as one).
- Accessing Data through Conversion Utilities (if the data is stored in a different format than what the tool supports).
- Accessing Data Using Query Tools (to join tables and create files).
- Accessing Data from Flat Files (very fast to read, but they have to be created from somewhere and are difficult to manipulate).
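As a rough illustration of the view-based approach listed above, the following sketch uses Python's built-in `sqlite3` module with an in-memory database. The table names (`patients`, `visits`), the columns, and the view name `mining_view` are all hypothetical.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patients (patient_id INTEGER PRIMARY KEY, age INTEGER);
CREATE TABLE visits (visit_id INTEGER PRIMARY KEY,
                     patient_id INTEGER, hospital TEXT);
INSERT INTO patients VALUES (1, 45), (2, 62);
INSERT INTO visits VALUES (10, 1, 'North'), (11, 2, 'South'), (12, 1, 'South');

-- The view joins both tables so a mining tool sees a single table.
CREATE VIEW mining_view AS
    SELECT v.visit_id, p.patient_id, p.age, v.hospital
    FROM visits AS v
    JOIN patients AS p ON p.patient_id = v.patient_id;
""")

rows = conn.execute("SELECT * FROM mining_view ORDER BY visit_id").fetchall()
for row in rows:
    print(row)
```

The view stores no data of its own; it is re-evaluated on each query, so it always reflects the current contents of the underlying tables.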
11Data Preparation - Stage 1 - Data Selection
- Goal: identify the available data sources and extract the data that is needed for preliminary analysis in preparation for further mining.
- Data selection will vary with the business objectives.
- For each of the selected variables, associated semantic information (metadata) is needed to understand what the variable means.
- Metadata must include business definitions of the data, clear descriptions of data types, potential values, original source systems, data formats and other characteristics.
12Types of Variables
- (1) Categorical: the possible values are finite and differ in kind.
- (a) Nominal variables name the kind of object to which they refer. There is no order among the possible values. Examples: marital status (married, single, divorced, unknown), gender (male, female), educational level (university, college, high school).
- (b) Ordinal variables have an order among the possible values. Example: customer credit rating (good, regular, poor).
- (2) Quantitative: there is a measurable difference between the possible values.
- (a) Continuous (real numbers): income, average number of purchases.
- (b) Discrete (integers): number of employees, time of year (month, season, quarter).
13- Active variables: the variables selected for data mining are called active variables in the sense that they are actively used to distinguish segments, make predictions or perform some other data mining operations.
- Supplementary variables: these variables are not used in data mining analysis but are useful in helping to visualize and explain the results.
14Example
- The source is a database of 15,000 customers whose supermarket purchases had been tracked for three years.
- From this database, only those who had purchased orange juice more than 25 times in the last three years were selected. The list of items purchased in each supermarket visit was called a basket.
- A few variables were used to describe each basket: householdID, date of purchase, basket contents, basket value, product quantity, and promotion ID...
15Data Preparation - Stage 2 - Data Preprocessing
- Goal: to ensure the quality of the selected data.
- Clean and well-understood data is a clear prerequisite for successful data mining.
- Why is it the most problematic phase? Because most operational data has never been captured or modeled for data mining purposes.
- It includes:
- a general review of the structure of the data, and
- some measuring of its quality using statistical and visualization methods.
- Representative sampling of the selected data is a useful technique, as large data volumes would otherwise make the review process very time consuming.
16- A scatterplot is a graphical tool that represents the relationship between two or more continuous variables.
[Figure: scatterplot of Income (0k to 150k) against Age (20 to 60)]
17Boxplot diagrams are very useful for comparing the center (average) or spread (deviation) of two or more variables.
[Figure: boxplots of Income (0k to 150k) for Women and Men, marking the extreme upper value, upper quartile (75), median value, lower quartile (25) and extreme lower value]
18Noisy Data
- Outlier: one or more variables have values that are significantly out of line with what is expected for those variables.
- An outlier may mark the true maximum or minimum of a variable, but it may also be nothing more than invalid data.
- One kind of outlier may be the result of a human error (example: an age of 654 or a negative income). It should either be corrected (if possible) or dropped from the analysis.
- Another kind of outlier is created when changes in operational systems have not been reflected in the data mining environment. For example, new product codes introduced in operational systems show up initially as outliers. In this case you have to update the metadata.
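A simple way to catch the human-error kind of outlier described above is a plausibility check per field. The sketch below is only one possible approach; the limits in `VALID_RANGES` and the function name `find_outliers` are assumptions, and in practice the limits would come from the metadata.

```python
# Hypothetical plausibility limits per field: (lowest, highest) allowed value.
VALID_RANGES = {
    "age": (0, 120),
    "income": (0, float("inf")),
}


def find_outliers(record):
    """Return the names of fields whose values fall outside the valid range."""
    bad = []
    for field, (low, high) in VALID_RANGES.items():
        value = record.get(field)
        # Missing values are not outliers; they are handled separately.
        if value is not None and not (low <= value <= high):
            bad.append(field)
    return bad


records = [
    {"age": 34, "income": 52000},
    {"age": 654, "income": 48000},   # impossible age
    {"age": 41, "income": -100},     # negative income
]

# Drop every record that has at least one implausible value.
clean = [r for r in records if not find_outliers(r)]
print(clean)
```

Here invalid records are dropped; correcting them instead would require going back to the source of the data.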
19Missing Values
- Missing values include values that are not present in the selected data and those invalid values that we may delete during noise detection.
- Values may be missing because of human error, because the information was not available at the time of input, or because the data was selected across heterogeneous sources.
- One way is to eliminate the observations that have missing values. (This is easy, but it has the drawback of losing valuable data, especially if the data set to be mined is small or if fraud detection or quality control is the objective.)
- Another solution is to drop the whole variable from the analysis.
- Another solution is to replace the missing value with its most likely value. For quantitative variables, this most likely value could be the mean or mode.
- For categorical variables it could be the mode or a newly created value for the variable called unknown.
20- A more sophisticated approach for both quantitative and categorical variables is to use a predictive model (will be discussed later) to predict the most likely value for a variable on the basis of the values of the other variables in the observation.
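The simpler replacement strategies (mean for quantitative variables, mode or an "unknown" value for categorical ones) might be sketched as follows. The function names and sample data are made up for illustration, and missing values are assumed to be represented by `None`.

```python
from statistics import mean, mode


def impute_mean(values):
    """Replace missing (None) quantitative values with the mean of the rest."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def impute_categorical(values, use_unknown=False):
    """Replace missing categorical values with the mode or with 'unknown'."""
    known = [v for v in values if v is not None]
    fill = "unknown" if use_unknown else mode(known)
    return [fill if v is None else v for v in values]


incomes = [30000, None, 50000, 40000]
status = ["married", None, "single", "married"]

print(impute_mean(incomes))
print(impute_categorical(status))
print(impute_categorical(status, use_unknown=True))
```

Filling with the mean or mode keeps every observation but shrinks the variable's spread, which is one reason the predictive-model approach can be preferable.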
21Data Qualification Issues
- You would not mine fields like CustomerID, FirstName, LastName or Address, because they are unique fields and there are no patterns to find in unique fields.
22Data Quality Issues - Examples
- This study is supposed to show only one patient record for each patient.
- The fact that there are 33 records for one patient means that we have redundant data in this study and it must be cleaned.
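Redundant records of this kind can be found by counting how often each identifier occurs. The sketch below assumes a hypothetical `patient_id` field, and keeping the first record per patient is just one possible cleaning rule.

```python
from collections import Counter


def duplicated_ids(records, key="patient_id"):
    """Return {id: count} for every id that occurs more than once."""
    counts = Counter(r[key] for r in records)
    return {pid: n for pid, n in counts.items() if n > 1}


def keep_first(records, key="patient_id"):
    """Keep only the first record seen for each id (an assumed policy)."""
    seen = set()
    unique = []
    for r in records:
        if r[key] not in seen:
            seen.add(r[key])
            unique.append(r)
    return unique


records = [
    {"patient_id": 1, "recovery": "0-2 Weeks"},
    {"patient_id": 2, "recovery": "3-4 Weeks"},
    {"patient_id": 1, "recovery": "0-2 Weeks"},
]

print(duplicated_ids(records))
print(keep_first(records))
```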
23- There are inconsistencies and misspellings in the values that should read 0-2 Weeks.
24- Simple graphical tools (histograms and pie charts) can quickly plot the contribution made by each value of a categorical variable and therefore help to identify distribution skews and invalid or missing values.
- When dealing with quantitative variables, data analysts may be interested in such measures as maxima and minima, mean (average), mode (most frequently occurring value), median (midpoint value), and several statistical measures of central tendency (tendency for values to cluster around the mean).
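The measures listed above are available in Python's standard `statistics` module. The following sketch computes them for a small, made-up list of incomes (in thousands); the function name `summarize` is an assumption.

```python
from statistics import mean, median, mode


def summarize(values):
    """Return the basic descriptive measures for a quantitative variable."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
    }


incomes = [20, 20, 30, 40, 140]  # hypothetical incomes in thousands
summary = summarize(incomes)
print(summary)
```

In this sample the mean (50) lies well above the median (30), which is the kind of gap that hints at a skewed distribution or an outlier.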
25- Variance in Defining Terms
- You may ask what makes a person an occasional smoker versus a frequent smoker?
- If two hospitals vary in their definition, your data is skewed. (Example: five days a week or more than 12 times a week.)
26- Skewed distributions often indicate outliers.
- Example: a histogram may show that most of the people in a target group have low incomes and only a few are high earners. This may be the result of poor data collection, e.g. the group may consist mainly of retired people.
27- Data Preparation involves finding the answers to several questions, including:
- How do you create the table?
- How do you mine data that is not in the right form?
- How do you handle data that is not entirely clean?
28Binning - Examples
- The field was already binned before you mined it.
- When you have fields that hold a range of numbers, it is often best to bin them, i.e. to define them in categories.
- Most data mining tools offer ways to bin data.
- How many bins should you have? It depends on the data distribution.
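A minimal binning sketch, assuming hand-picked bin edges for an age field. The edges, labels, and function name are illustrative only; as noted above, a sensible choice depends on the data distribution.

```python
from bisect import bisect_right

# Hypothetical bin edges and labels for an age field.
AGE_EDGES = [18, 35, 50, 65]
AGE_LABELS = ["<18", "18-34", "35-49", "50-64", "65+"]


def bin_age(age):
    """Return the label of the bin that contains the given age."""
    # bisect_right counts how many edges are <= age, which is the bin index.
    return AGE_LABELS[bisect_right(AGE_EDGES, age)]


ages = [17, 18, 34, 35, 64, 65]
print([bin_age(a) for a in ages])
```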
29Data Derivation - Examples
- Suppose there are two fields, Weight and WeightLastYear. Another field that might be interesting for data mining is one that shows the difference in a patient's weight, which can be derived by taking the difference between the two columns.
- It can be derived using built-in functions, SQL...
- Another example: deriving the name of a state from the area code.
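The weight example can be sketched in Python as follows. The field names `Weight` and `WeightLastYear` come from the slide; the derived field name `WeightChange`, the helper function, and the sample values are assumptions.

```python
def add_weight_change(row):
    """Return a copy of the row with a derived WeightChange field."""
    derived = dict(row)  # copy, so the original record is left untouched
    derived["WeightChange"] = row["Weight"] - row["WeightLastYear"]
    return derived


patients = [
    {"Weight": 82.0, "WeightLastYear": 85.5},
    {"Weight": 70.0, "WeightLastYear": 68.0},
]

with_change = [add_weight_change(p) for p in patients]
print(with_change)
```

A negative `WeightChange` means the patient lost weight over the year; the sign convention is a choice made for this example.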
30Data Preparation - Stage 3 - Data Transformation
- During data transformation, the preprocessed data is transformed to produce the analytical data model.
- The techniques used can range from simple data format conversions to complex statistical data reduction tools (for example, converting between US and European formats, or date of birth to age).
- Data reduction is another transformation technique, in which we combine several existing variables into one new variable. For example, income, ZIP code and level of education can be combined to find the attractiveness of the prospect.
- Data reduction --> a smaller and more manageable set for further analysis, but (1) it is not easy to determine which variables can be combined, (2) combining variables will cause some loss of information, and (3) the final result will be all the more difficult to interpret.
31- Many techniques (like neural networks) can accept only numeric input in the 0.0 to 1.0 or -1.0 to 1.0 range. In these cases, continuous parameter values must be scaled so that all have the same order of magnitude.
- Discretization: a technique to convert quantitative variables into categorical variables by dividing the values of the input variable into buckets. (Income of 0-9999 --> 1, 10000-19999 --> 2, ...)
- One-of-N transformation: converts a categorical variable to a numeric representation. (The 4 values of the variable TypeOfCar could be represented by 1000, 0100, 0010 and 0001.)
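The three transformations above can be sketched as small Python functions. This is an illustration under assumptions: min-max scaling is only one way to reach the 0.0 to 1.0 range, and the function names and the four car types are invented for the example.

```python
def min_max_scale(values):
    """Scale values linearly into the 0.0 to 1.0 range."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def income_bucket(income, width=10000):
    """Discretize income: 0-9999 -> 1, 10000-19999 -> 2, and so on."""
    return int(income // width) + 1


def one_of_n(value, categories):
    """Encode a categorical value as a list with a single 1 (one-of-N)."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical values for the TypeOfCar variable.
CAR_TYPES = ["sedan", "coupe", "suv", "truck"]

print(min_max_scale([20, 40, 60]))
print([income_bucket(i) for i in (0, 9999, 10000, 19999)])
print(one_of_n("coupe", CAR_TYPES))
```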
32Step(2) Defining a Study (Business Objective
Determination)
- The goal is to ensure that there is a real, critical business issue that is worth solving.
- The only way to find out whether a data mining solution is really needed is to properly define the business objectives.
- Ill-defined projects are not likely to succeed or result in added value.
- It requires the collaboration of the business analyst, who has domain knowledge, and the data analyst, who can begin to translate the objectives into a data mining application.
33Step (3) Data Mining
- The objective is to apply the selected data mining algorithm or algorithms to the preprocessed data.
- What happens during the data mining step will vary with the kind of application that is under development.
- In the case of data segmentation, one or two runs of the algorithm may be sufficient to clear this step and move into analysis of the results.
- In the case of developing a predictive model, there will be a cyclical process where the models are repeatedly trained and retrained on sample data before being tested against the real database.
34- One difficulty in predictive modeling is that of overtraining, where the model predicts well on the training data but performs poorly on the real test data (i.e. the model learns the detailed patterns of the training data but cannot generalize well when confronted with new observations from the test data set).
35- All results were extensively cross-validated
using a technique that is sometimes called
10-fold cross-validation. The entire database was
divided into 10 equal parts. The models were then
trained on only nine-tenths of the database and
tested on the remaining one-tenth, which had been
held out. This process was repeated until each of
the other tenths had also been used for testing.
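The splitting scheme described above can be sketched as a small generator. Shuffling the rows first and the function name `k_fold_indices` are assumptions of this sketch; training and scoring a model on each split is left out.

```python
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_rows))
    # Shuffle (an assumption of this sketch) so the folds do not depend
    # on the order in which rows are stored.
    random.Random(seed).shuffle(indices)
    # Every k-th index goes to the same fold, giving k near-equal parts.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(100, k=10))
for train, test in splits:
    print(len(train), len(test))
```

Each row is used for testing exactly once and for training in the other nine rounds.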
36Step (4) Analysis of Results
- After mining the data --> have we found something that is interesting, valid and actionable?
- Data mining is different from traditional statistical analysis.
- With statistics, the answer is generally yes or no (i.e. the hypothesis is correct or incorrect).
- With data mining, if it is done well, the results either suggest the answer or at least point the team in the direction of another avenue of research.
37Examples of rules output
- If purchases OJ in large (> 12 oz) cans > 58% of the time, Then Predict Loyal.
- If primary brand is Brand X, Then Predict Vulnerable.
- If buys at warehouse stores > 11% of the time, Then Predict Vulnerable.
- If buys > 24.26 ounces per shopping trip on average AND average price per ounce > 0.10, Then Predict Loyal.