Data Mining and Knowledge Discovery - PowerPoint PPT Presentation

Slides: 18
Provided by: jimho5
1
Data Mining and Knowledge Discovery
By Matt Goliber and Jim Hougas
2
What is Data Mining?
  • Not like gold or diamond mining
  • Mining of knowledge from data
  • Important to many different fields
  • A part of Knowledge Discovery in Databases (KDD)
3
The Process of Knowledge Discovery
  • Raw data
  • - Data cleaning and integration
  • Data Warehouse
  • - Data transformation, selection, and mining
  • Patterns
  • - Pattern evaluation and knowledge presentation
  • KNOWLEDGE!
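
The stages above can be sketched as a chain of small functions. This is only an illustrative sketch: the records, the function names, and the min_count threshold are all invented for the example and do not come from the presentation.

```python
from collections import Counter

# Hypothetical raw records from two sources; names and values are invented.
store_a = [{"customer": " Ann ", "item": "bagels"},
           {"customer": "Bob", "item": None}]
store_b = [{"customer": "ann", "item": "cream cheese"},
           {"customer": "Cy", "item": "bagels"}]


def clean(records):
    """Data cleaning: drop records with a missing item, normalize names."""
    return [
        {"customer": r["customer"].strip().lower(), "item": r["item"]}
        for r in records
        if r["item"] is not None
    ]


def integrate(*sources):
    """Data integration: merge the cleaned sources into one 'warehouse'."""
    warehouse = []
    for source in sources:
        warehouse.extend(clean(source))
    return warehouse


def transform(warehouse):
    """Transformation and selection: one basket (set of items) per customer."""
    baskets = {}
    for row in warehouse:
        baskets.setdefault(row["customer"], set()).add(row["item"])
    return baskets


def mine(baskets):
    """Mining: count how many baskets contain each item."""
    counts = Counter()
    for items in baskets.values():
        counts.update(items)
    return counts


def evaluate(patterns, min_count=2):
    """Pattern evaluation: keep patterns seen in at least min_count baskets."""
    return {item: n for item, n in patterns.items() if n >= min_count}


warehouse = integrate(store_a, store_b)
baskets = transform(warehouse)
knowledge = evaluate(mine(baskets))
print(knowledge)  # → {'bagels': 2}
```

Real systems replace each function with far heavier machinery, but the order of the stages is the same as on the slide.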
4
Why is Data Mining useful?
  • We are data rich but information poor
  • - Internet
  • - Intelligence
  • Humans often lack the ability to comprehend and
    manage the immense amount of available and
    sometimes seemingly unrelated data
5
How long has this idea been around?
  • Late 60s and early 70s
  • Stanford's Meta-DENDRAL (1970-76)
  • - Extension of DENDRAL
  • Doug Lenat with AM (1976)
6
Meta-DENDRAL
  • Extension of the DENDRAL (1965) program
  • One of the first expert systems
  • Interpreted mass spectra
  • Meta-DENDRAL took the mass spectra of compounds
    of known 3-D structure and formulated rules about
    the interpretation of the spectra
  • Came up with known rules and some new ones!

7
Sample Mass Spec
ethyl 3-oxo-3-phenylpropanoate (ethyl
benzoylacetate)
8
AM
  • Doug Lenat, 1976
  • The name is not an acronym; it stands alone
  • AM was given sets, bags, ordered sets, and
    lists
  • AM was also given operations to perform on
    these data sets
  • Union, Intersection, etc.
  • Came up with ideas about counting, addition,
    multiplication, prime numbers, and Goldbach's
    conjecture
  • AM thought that these were all uninteresting
  • Liked maximally divisible numbers though
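
To make the last bullet concrete, here is a small sketch that lists "maximally divisible" numbers. It assumes the term means numbers with more divisors than every smaller number (usually called highly composite numbers today). The function names are made up for this illustration and are not taken from AM.

```python
def divisor_count(n):
    """Count the positive divisors of n by trial division up to sqrt(n)."""
    count = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            # d and n // d are both divisors; count a square root only once.
            count += 1 if d * d == n else 2
        d += 1
    return count


def maximally_divisible(limit):
    """Return numbers <= limit with more divisors than any smaller number."""
    best = 0
    result = []
    for n in range(1, limit + 1):
        c = divisor_count(n)
        if c > best:
            best = c
            result.append(n)
    return result


print(maximally_divisible(100))  # → [1, 2, 4, 6, 12, 24, 36, 48, 60]
```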

9
What next?
  • Not a whole lot
  • - Databases were not prevalent enough, no great
    demand
  • - Did benefit from machine learning research
  • Beginning of the 1990s: the next area
  • - Ranked as one of the most promising research
    areas (NSF)
  • - Information explosion
  • Early commercial systems
  • - Farm Journal
  • - GM
10
Next Generation Techniques
  • Decision Trees
  • - Each branch is a classification question
  • - Allows businesses to segment customers, products,
    and sales regions
  • - Questions organize the data
  • Rule Induction
  • - All patterns are pulled from the data
  • - Accuracy and significance are then added to them
  • - These help the user know how strong a pattern is
    and the likelihood of it occurring again
  • - Ex: If bagels are purchased, then cream cheese is
    purchased 90% of the time, and this pattern occurs
    in 3% of all shopping baskets
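
The two numbers attached to a rule can be computed directly from a list of shopping baskets. The sketch below uses invented baskets and a hypothetical rule_stats helper. It assumes "accuracy" means how often the consequent appears when the antecedent does (often called confidence), and "significance" means the share of all baskets containing the whole pattern (often called support).

```python
# Invented example baskets; each basket is the set of items in one purchase.
baskets = [
    {"bagels", "cream cheese"},
    {"bagels", "cream cheese", "coffee"},
    {"bagels"},
    {"milk"},
    {"coffee", "milk"},
]


def rule_stats(baskets, antecedent, consequent):
    """Return (accuracy, significance) for the rule antecedent -> consequent."""
    with_antecedent = [b for b in baskets if antecedent in b]
    with_both = [b for b in with_antecedent if consequent in b]
    if not baskets or not with_antecedent:
        return 0.0, 0.0
    accuracy = len(with_both) / len(with_antecedent)   # rule holds when it applies
    significance = len(with_both) / len(baskets)       # pattern among all baskets
    return accuracy, significance


accuracy, significance = rule_stats(baskets, "bagels", "cream cheese")
print(f"accuracy={accuracy:.0%}, significance={significance:.0%}")
```

In this toy data the rule holds in 2 of the 3 baskets containing bagels, and the full pattern appears in 2 of the 5 baskets overall.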

11
Decision Trees vs. Rule Induction
  • Decision Trees
  • - Always one and only one rule covers an instance
  • Rule Induction
  • - Many rules may cover the same instance, or
  • - no rule may cover an instance
  • Example: determining the size of a person
  • - Rule Induction uses both height and shoe size
  • - Decision Trees use one or the other
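
Rule coverage can be checked by counting how many rules match each record. In this sketch the rules read off a decision tree are its root-to-leaf paths, so they partition the data and each record matches exactly one. Independently induced rules carry no such guarantee and may overlap or leave a record uncovered. The thresholds, labels, and people below are invented for illustration.

```python
# Each rule is (description, predicate, predicted label).
# Rules read off a tiny hand-built tree that splits only on height.
tree_rules = [
    ("height >= 180", lambda p: p["height"] >= 180, "large"),
    ("165 <= height < 180", lambda p: 165 <= p["height"] < 180, "medium"),
    ("height < 165", lambda p: p["height"] < 165, "small"),
]

# Independently induced rules; redundant predictors each get their own rule.
induced_rules = [
    ("height >= 180", lambda p: p["height"] >= 180, "large"),
    ("shoe >= 44", lambda p: p["shoe"] >= 44, "large"),
    ("height < 165", lambda p: p["height"] < 165, "small"),
]

people = [
    {"height": 185, "shoe": 45},
    {"height": 172, "shoe": 41},
    {"height": 160, "shoe": 37},
]


def matches(rules, person):
    """Return the descriptions of every rule whose predicate fits the person."""
    return [desc for desc, predicate, _label in rules if predicate(person)]


tree_counts = [len(matches(tree_rules, p)) for p in people]
induced_counts = [len(matches(induced_rules, p)) for p in people]
print(tree_counts)     # → [1, 1, 1]
print(induced_counts)  # → [2, 0, 1]
```

The first person is covered by two induced rules and the second by none, while the tree rules cover every person exactly once.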

12
Examples of Significant Developments
  • Stock Market Advances (1991)
  • - Astrophysicists Doyne Farmer and Norman Packard
  • - Prediction Company could predict stock market
    trends
  • Bell Atlantic (1996)
  • - Consumer phone buying trends
  • - Rule Induction
  • Advanced Scout (1997)
  • - Inderpal Bhandari assists NBA coaches
  • - Rule Induction
  • Persuade 400,000 undecided voters (2004)
  • - MoveOn attempts to influence the election
  • - Decision Tree

13
Challenges
  • Large Data Sets with High Complexity
  • - One or the other is currently possible, but
    not both
  • Expensive
  • - Costs of Bell Atlantic (Experts are needed)
  • - Cost for a two-day course in Las Vegas
    ($1,300)
  • - Software ($100,000)

14
Research
  • DARPA
  • Defense Advanced Research Projects Agency
  • ACLU claims this is an invasion of privacy
  • Decision Tree
  • Uncovering Terrorists in public chat rooms
  • Tracks the times that messages are sent
  • Advanced Scout
  • Bhandari is working on Advanced Scout for the NHL
  • Rule Induction

15
Current State
  • Out of the Lab
  • Into Fortune 500 companies
  • Automate Model Scoring
  • Fingers are currently crossed in hopes that
    scoring by IT personnel is done correctly

16
Future States
  • Utilizing Company Warehouses
  • Data miners must take advantage of the
    million-dollar warehouse that a company builds
  • Effort Knob
  • Low for quick model, high for quality model
  • Computed Target Columns
  • User could create a new target variable
  • Ex: financial information that a business has
