Title: Data Mining and Knowledge Discovery
1Data Mining and Knowledge Discovery
By Matt Goliber and Jim Hougas
2What is Data Mining?
- Not like gold or diamond mining
- Mining of knowledge from data
- Important to many different fields
- A part of Knowledge Discovery in Databases (KDD)
3The Process of Knowledge Discovery
- Raw data
- Data cleaning and integration -> Data Warehouse
- Data transformation, selection, and mining -> Patterns
- Pattern evaluation and knowledge presentation -> KNOWLEDGE!
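The stages of the process can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the function names, and the `min_count` threshold are assumptions made up for this example, not part of any particular KDD system.

```python
from collections import Counter

# Hypothetical raw records from two sources; None marks a missing value.
store_a = [("alice", "bagels"), ("bob", None), ("carol", "bagels")]
store_b = [("dave", "cream cheese"), ("alice", "cream cheese")]

def clean(records):
    """Data cleaning: drop records that have a missing field."""
    return [r for r in records if all(field is not None for field in r)]

def integrate(*sources):
    """Data integration: merge the cleaned sources into one 'warehouse' list."""
    return [r for source in sources for r in source]

def transform(warehouse):
    """Transformation and selection: group purchased items by customer."""
    baskets = {}
    for customer, item in warehouse:
        baskets.setdefault(customer, set()).add(item)
    return baskets

def mine(baskets):
    """Mining: count in how many baskets each item appears."""
    return Counter(item for items in baskets.values() for item in items)

def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep patterns seen in at least min_count baskets."""
    return {item: n for item, n in patterns.items() if n >= min_count}

warehouse = integrate(clean(store_a), clean(store_b))
knowledge = evaluate(mine(transform(warehouse)))
print(knowledge)
```

Each function stands in for one stage, so swapping in a real cleaning or mining step would not change the overall shape of the pipeline.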
4Why is Data Mining useful?
- We are data rich but information poor
- - Internet
- - Intelligence
- Humans often lack the ability to comprehend and manage the immense amount of available and sometimes seemingly unrelated data
5How long has this idea been around?
- Late '60s and early '70s
- Stanford's Meta-DENDRAL (1970-76)
- - Extension of DENDRAL
- Doug Lenat with AM (1976)
6Meta-DENDRAL
- Extension of the DENDRAL (1965) program
- One of the first expert systems
- Interpreted mass spectra
- Meta-DENDRAL took the mass spectra of compounds of known 3-D structure and formulated rules about the interpretation of the spectra
- Came up with known rules and some new ones!
7Sample Mass Spec
ethyl 3-oxo-3-phenylpropanoate (ethyl benzoylacetate)
8AM
- Doug Lenat, 1976
- Name means nothing, stand alone
- AM was given sets, bags, ordered sets, and lists
- AM was also given operations to perform on these data sets
- Union, Intersection, etc.
- Came up with ideas about counting, addition, multiplication, prime numbers, and Goldbach's conjecture
- AM thought that these were all uninteresting
- Liked maximally divisible numbers though
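For readers unfamiliar with the term, a maximally divisible number is taken here to mean a number with more divisors than any smaller number. That reading is an assumption for this sketch, and the function names are made up for illustration; this is not AM's own code.

```python
def divisor_count(n):
    """Count the positive divisors of n by trial division."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)

def maximally_divisible(limit):
    """Return the numbers up to limit that have more divisors than any smaller number."""
    best, found = 0, []
    for n in range(1, limit + 1):
        count = divisor_count(n)
        if count > best:
            best = count
            found.append(n)
    return found

print(maximally_divisible(60))  # [1, 2, 4, 6, 12, 24, 36, 48, 60]
```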
9What next?
- Not a whole lot
- Databases were not prevalent enough, no great demand
- Did benefit from machine learning research
- Beginning of the 1990s, "The next area"
- - Ranked as one of the most promising research areas (NSF)
- - Information explosion
- Early commercial systems
- - Farm Journal
- - GM
10Next Generation Techniques
- Decision Trees
- Each branch is a classification question
- Allows businesses to segment customers, products, and sales regions
- Questions organize the data
- Rule Induction
- All patterns are pulled from the data
- Accuracy and Significance are then added to them
- Helps the user know how strong a pattern is and the likelihood of it occurring again
- Ex: If bagels are purchased then cream cheese is purchased 90% of the time and this pattern occurs in 3% of all shopping baskets
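The two numbers attached to a rule can be computed directly from a list of shopping baskets. The sketch below assumes that "accuracy" means the share of baskets containing the first item that also contain the second, and that "significance" means the share of all baskets containing both items. The baskets and the function name `rule_stats` are hypothetical.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (accuracy, significance) for the rule 'antecedent -> consequent'.

    accuracy:     fraction of baskets with the antecedent that also hold the consequent
    significance: fraction of all baskets that hold both items
    (Both definitions are assumptions made for this sketch.)
    """
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    accuracy = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    significance = len(with_both) / len(baskets) if baskets else 0.0
    return accuracy, significance

# Hypothetical shopping baskets.
baskets = [
    {"bagels", "cream cheese"},
    {"bagels", "cream cheese", "coffee"},
    {"bagels"},
    {"milk"},
    {"coffee", "cream cheese"},
]

accuracy, significance = rule_stats(baskets, "bagels", "cream cheese")
print(f"accuracy={accuracy:.2f}, significance={significance:.2f}")
```

In this toy data the rule holds in two of the three bagel baskets and the pattern appears in two of the five baskets overall.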
11Decision Trees vs. Rule Induction
- Decision Trees
- Always one and only one rule covers an instance
- Rule Induction
- Many rules may cover the same instance, or
- no rule may cover an instance
- Example
- With two correlated predictors such as height and shoe size (to determine the size of a person), a decision tree uses one or the other
- Rule Induction keeps rules for both
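How the two techniques cover a record can be seen in a small sketch. A decision tree is a nest of questions, so every record ends in exactly one leaf, whereas independently induced rules can overlap or leave a record uncovered. The thresholds, field names, and rules below are invented for illustration.

```python
def tree_rule(person):
    """Decision tree: nested questions, so each record reaches exactly one leaf."""
    if person["height_cm"] < 160:
        return "small"
    elif person["height_cm"] < 180:
        return "medium"
    return "large"

# Induced rules as (name, test, predicted size); they are not mutually exclusive.
rules = [
    ("tall", lambda p: p["height_cm"] >= 180, "large"),
    ("big shoes", lambda p: p["shoe_eu"] >= 45, "large"),
    ("short", lambda p: p["height_cm"] < 160, "small"),
]

def matching_rules(person):
    """Return the names of every induced rule that fires for this record."""
    return [name for name, test, _ in rules if test(person)]

tall_person = {"height_cm": 185, "shoe_eu": 46}
average_person = {"height_cm": 170, "shoe_eu": 42}

print(tree_rule(tall_person), matching_rules(tall_person))        # large ['tall', 'big shoes']
print(tree_rule(average_person), matching_rules(average_person))  # medium []
```

The tall person is covered by two rules at once and the average person by none, while the tree assigns each of them a single leaf.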
12Examples of Significant Developments
- Stock Market Advances (1991)
- Astrophysicists Doyne Farmer and Norman Packard
- Prediction Company could predict stock market trends
- Bell Atlantic (1996)
- Consumer phone buying trends
- Rule Induction
- Advanced Scout (1997)
- Inderpal Bhandari assists NBA coaches
- Rule Induction
- Persuade 400,000 undecided voters (2004)
- MoveOn attempts to influence the election
- Decision Tree
13Challenges
- Large Data Sets with High Complexity
- - One or the other is currently possible, but not both
- Expensive
- - Costs of Bell Atlantic (Experts are needed)
- - Cost for a two-day course in Las Vegas ($1,300)
- - Software ($100,000)
14Research
- DARPA
- Defense Advanced Research Projects Agency
- ACLU claims this is an invasion of privacy
- Decision Tree
- Uncovering Terrorists in public chat rooms
- Tracks the times that messages are sent
- Advanced Scout
- Bhandari is working on Advanced Scout for the NHL
- Rule Induction
15Current State
- Out of the Lab
- Into Fortune 500 companies
- Automate Model Scoring
- Fingers are currently crossed in hopes that
scoring by IT personnel is done correctly
16Future States
- Utilizing Company Warehouses
- Data miners must take advantage of a million-dollar warehouse that a company builds
- Effort Knob
- Low for quick model, high for quality model
- Computed Target Columns
- User could create a new target variable
- Ex: finance information that a business has
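One way to picture a computed target column is a new field derived from columns the business already stores. The sketch below is only a guess at what such a feature might look like: the finance fields, the `profitable` target, and the helper `add_target` are all hypothetical.

```python
# Hypothetical finance records that a business might already hold.
customers = [
    {"name": "A", "revenue": 1200.0, "cost": 900.0},
    {"name": "B", "revenue": 400.0, "cost": 650.0},
]

def add_target(rows, name, compute):
    """Return copies of the rows with a new column computed from existing ones."""
    return [{**row, name: compute(row)} for row in rows]

# The user defines the target variable instead of relying on a stored column.
labeled = add_target(customers, "profitable", lambda r: r["revenue"] > r["cost"])
print([(r["name"], r["profitable"]) for r in labeled])  # [('A', True), ('B', False)]
```

The original records are left untouched, so several candidate targets can be tried against the same data.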
17Sources
- http://web.media.mit.edu/haase/thesis/node54.htmlSECTION00711000000000000000
- http://smi-web.stanford.edu/projects/history.htmlMETADENDRAL
- http://www.cs.cf.ac.uk/Dave/AI2/node151.html
- http://64.233.161.104/search?qcacheQ6eMD9tEKwIJwww.cosc.brocku.ca/Offerings/4P79/Week12.pptmeta-dendralhlen
- http://laurel.actlab.utexas.edu/cynbe/muq/muf3_21.html
- http://64.233.161.104/search?qcacheyft0cQ5tZJQJwww.cs.uwaterloo.ca/shallit/Talks/cct.ps22fundamentaltheoremofarithmetic22computerdataminingprovehlen
- http://mathworld.wolfram.com/GoldbachConjecture.html
- http://www.quantlet.com/mdstat/scripts/csa/html/node202.html
- http://www.thearling.com
- http://www.wired.com
- http://www.dmreview.com
- http://www.ebscohost.com
- http://www.thearling.com/text/dmtechniques/dmtechniques.htm
- http://www.aaai.org/Library/Magazine/Vol13/13-03/vol13-03.html
- Han, J. and Kamber, M. Data Mining: Concepts and Techniques.