Title: Introduction to Data Mining 2
1 Introduction to Data Mining 2
2 Data Warehouse
- For organizational learning to take place, data from many sources must be gathered together and organized in a consistent and useful way; hence, Data Warehousing (DW)
- DW allows an organization (enterprise) to remember what it has noticed about its data
- Data Mining techniques make use of the data in a DW
3 Data Warehouse
[Diagram: the Enterprise Database (Customers, Orders, Transactions, Vendors, etc.) is copied, organized, and summarized into the Data Warehouse, which is then used for Data Mining.]
- Data Miners
- Farmers - they know
- Explorers - unpredictable
4 Data Warehouse
- A data warehouse is a copy of transaction data specifically structured for querying, analysis and reporting.
- Note that the data warehouse contains a copy of the transactions, which are not updated or changed later by the transaction system.
- Also note that this data is specially structured, and may have been transformed when it was copied into the data warehouse.
5 Data Mart
- A Data Mart is a smaller, more focused Data Warehouse: a mini-warehouse.
- A Data Mart typically reflects the business rules of a specific business unit within an enterprise.
6 Data Mining Flavors
- Supervised or Directed: Attempts to explain or categorize some particular target field such as income or response. (e.g. regression)
- Unsupervised or Undirected: Attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes. (e.g. clustering)
7 Terminologies
- Independent variable
- Input, predictor, attribute, x
- Dependent variable
- Output, target, outcome, response, y
8 Terminologies
- Supervised learning
- Given D, a set of (x, y) pairs, find f such that y ≈ f(x). x can be a vector.
- Classification: when y is categorical, e.g. yes/no
- Regression or Prediction: when y is continuous
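The two supervised settings can be illustrated with a minimal sketch. The toy data, the least-squares line, and the 1-nearest-neighbor rule below are illustrative assumptions chosen for brevity, not methods taken from the slides; any procedure that turns (x, y) pairs into a function f would fit the definition.

```python
def fit_line(pairs):
    """Regression: least-squares fit of f(x) = a*x + b to (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return lambda x: a * x + b


def fit_nearest_neighbor(pairs):
    """Classification: predict the label of the training x closest to the query."""
    def f(x):
        _, nearest_label = min(pairs, key=lambda p: abs(p[0] - x))
        return nearest_label
    return f


# Hypothetical toy data sets (assumptions for illustration only).
regression_data = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]            # y is continuous
classification_data = [(1.0, "no"), (2.0, "no"), (8.0, "yes"), (9.0, "yes")]  # y is categorical

f_reg = fit_line(regression_data)
f_cls = fit_nearest_neighbor(classification_data)

predicted_value = f_reg(4.0)   # a number
predicted_label = f_cls(7.5)   # a category
```

In both cases the learned f is applied to an x that was not in D; only the type of y differs.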
9 Terminologies
- Danger of over-fitting
- T-shirts for all SNU students, fit to sample students (sample vs population)
- Fig 2.1 and 2.2 of textbook
- Complex model (e.g. higher-order polynomial) overfits
- Training / Validation / Test split of D for supervised learning
- Training: used to find candidate fs
- Validation: used to find the final f (avoid overfitting)
- Test: used to evaluate the final f
- Fig 2.3
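The role of each split can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic (y = 2x plus noise), the 60/20/20 split is arbitrary, and a k-nearest-neighbor regressor stands in for the higher-order polynomial mentioned on the slide, with small k playing the part of the complex model.

```python
import random


def knn_regressor(train, k):
    """Return f(x) that averages y over the k training points nearest to x."""
    def f(x):
        nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)
    return f


def mse(f, pairs):
    """Mean squared error of f on a set of (x, y) pairs."""
    return sum((f(x) - y) ** 2 for x, y in pairs) / len(pairs)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Hypothetical data set D: y = 2x plus Gaussian noise.
D = [(x, 2 * x + rng.gauss(0, 1.0)) for x in [i / 10 for i in range(100)]]
rng.shuffle(D)

train, val, test = D[:60], D[60:80], D[80:]

# Training: build one candidate f per model complexity.
candidates = {k: knn_regressor(train, k) for k in (1, 3, 5, 9)}
train_err = {k: mse(f, train) for k, f in candidates.items()}

# Validation: pick the final f using data the candidates have not seen.
val_err = {k: mse(f, val) for k, f in candidates.items()}
best_k = min(val_err, key=val_err.get)

# Test: evaluate the final f once, on data used for nothing else.
test_err = mse(candidates[best_k], test)
```

Note that k = 1 reproduces every training point exactly, so its training error is zero. That is the overfitting trap: training error alone would always favor the most complex candidate, which is why the choice is made on the validation set.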
10 Terminologies
- Unsupervised learning
- Given D, a set of x,
- Clustering is to find a partition where each subset contains similar x while different subsets contain different xs.
- Association Rule mining or Affinity Grouping is to find an association rule/pattern among xs.
- Overfitting? Train/Validation/Test split?
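Both unsupervised tasks can be sketched briefly. The code below is a rough illustration, not the course's method: it assumes a tiny 1-D k-means for clustering and the usual support and confidence measures for association rules, and all data, initial centers, and function names are hypothetical.

```python
def kmeans_1d(xs, centers, iterations=10):
    """Tiny 1-D k-means: alternate assignment and center-update steps."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for x in xs:
            nearest = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return clusters, centers


def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Of the transactions containing lhs, the fraction that also contain rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)


# Clustering: similar xs end up in the same subset of the partition.
xs = [1.0, 1.2, 0.8, 9.0, 9.5, 10.0]
clusters, centers = kmeans_1d(xs, centers=[0.0, 5.0])

# Association rule "bread -> milk" over hypothetical market baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
rule_support = support(baskets, {"bread", "milk"})
rule_confidence = confidence(baskets, {"bread"}, {"milk"})
```

There is no y in either task, so there is no prediction error to measure on a held-out split in the supervised sense; the closing question on the slide is asking exactly that.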
11 Terminologies
- Variable selection or dimensionality reduction
- Parsimony or compactness
- Data/Pattern selection/reduction, sampling
- Outliers
- Data that lie outside the usual range
- Error? Or important pattern?
- Missing value
- Remove
- Impute or replace with a value
- Normalizing or standardization or scaling
- Age (0-100) vs salary (0-10M)
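Two of the preprocessing steps above, imputing a missing value and rescaling, can be sketched as follows. This is a minimal illustration with made-up ages and salaries; mean imputation, min-max scaling, and z-score standardization are common choices assumed here, not the only ones.

```python
from statistics import mean, pstdev


def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical records: age has a missing value, salary is on a far larger scale.
ages = [20, 40, None, 60]
salaries = [1_000_000, 3_000_000, 10_000_000, 2_000_000]

ages_filled = impute_mean(ages)
ages_scaled = min_max_scale(ages_filled)
salaries_scaled = min_max_scale(salaries)
salaries_std = standardize(salaries)
```

After scaling, age and salary both lie in [0, 1], so a distance computed over the two variables is no longer dominated by salary.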
12 Data Mining's Biggest Challenge
- The largest challenge a data miner may face is the sheer volume of data in the data warehouse.
- It is quite important, then, that summary data also be available to get the analysis started.
- A major problem is that this sheer volume may mask the important relationships the data miner is interested in.
- The ability to overcome the volume and be able to interpret the data is quite important.
13 But
- Finding patterns is not enough
- Business must
- Respond to the patterns by taking action
- Turning
- Data into Information
- Information into Action
- Action into Value
- Hence, the Virtuous Cycle of DM
14 Data Mining's Virtuous Cycle
- Identify the business opportunity
- Mining data to transform it into actionable information
- Acting on the information
- Measuring the results
Textbook interchanges "problem" with "opportunity"
15 1. Identify the Business Opportunity
- Many business processes are good candidates
- New product introduction
- Direct marketing campaign
- Understanding customer attrition/churn
- Evaluating the results of a test market
- Measurements from past DM efforts
- What types of customers responded to our last campaign?
- Where do the best customers live?
- Are long waits in check-out lines a cause of customer attrition?
- What products should be promoted with our XYZ product?
- TIP: When talking with business users about data mining opportunities, make sure you focus on the business problems/opportunities and not on technology and algorithms.
16 2. Mining data to transform it into actionable information
- Success is making business sense of the data
- Numerous data issues
- Bad data formats (alpha vs numeric, missing, null, bogus data)
- Confusing data fields (synonyms and differences)
- Lack of functionality ("I wish I could ...")
- Legal ramifications (privacy, etc.)
- Organizational factors ("unwilling to change our ways")
- Lack of timeliness
17 3. Acting on the Information
- This is the purpose of Data Mining, with the hope of adding value
- What type of action?
- Interactions with customers, prospects, suppliers
- Modifying service procedures
- Adjusting inventory levels
- Consolidating
- Expanding
- Etc
18 4. Measuring the Results
- Assesses the impact of the action taken
- Often overlooked, ignored, skipped
- Planning for the measurement should begin when analyzing the business opportunity, not after it is all over
- Assessment questions (examples)
- Did this ____ campaign do what we hoped?
- Did some offers work better than others?
- Did these customers purchase additional products?
- Tons of others
19 Data Mining's Virtuous Cycle
- Identify the business opportunity
- Mining data to transform it into actionable information
- Acting on the information
- Measuring the results
Textbook interchanges "problem" with "opportunity"
20 Learning Things that are not True
- Patterns may not represent any underlying rule
- Sample may not reflect its parent population, hence bias
- Data may be at the wrong level of detail (granularity, aggregation)
- Examples?
21 Example
22 Things that are True, but not Useful
- Learning things that cannot be used
- Examples?
- result of marketing campaign
23 Data Mining Steps
- Translate business opportunity (problem) into DM opportunity (problem)
- Select appropriate data
- Get to know the data
- Create a model set
- Fix problems with the data
- Transform data to bring information to the surface
- Build models
- Assess models
- Deploy models
- Assess results
- Begin again
24 Data mining is not a linear process
25 Data Mining in Press
- The 2008 emerging technologies by Technology Review
- Read two articles
- Reality Mining
- Surprise Modeling