Title: Automated Learning and Discovery
1Automated Learning and Discovery
- Professor Tom M. Mitchell, Director
- Center for Automated Learning and Discovery
- Carnegie Mellon University
- www.cs.cmu.edu/cald
- March, 1999
2Talk Organization
- Overview of datamining issues
- An example research project
- Mining the world wide web
3The Opportunity
- Explosion in online data
- Inexpensive computational power
- Advances in automated learning algorithms
How can we best use historical data to improve
future decisions?
4Typical Datamining Task
- Data
  Feature                  Patient103 (time t0)   Patient103 (time t1)   Patient103 (time tn)
  Age                      23                     23                     23
  FirstPregnancy           no                     no                     no
  Anemia                   no                     no                     no
  Diabetes                 no                     YES                    no
  PreviousPrematureBirth   no                     no                     no
  Ultrasound               ?                      abnormal               ?
  Elective C-Section       ?                      no                     no
  Emergency C-Section      ?                      ?                      Yes
- Given
- 9714 patient records, each describing a pregnancy and birth
- Each patient record contains 215 features
- Learn to predict
- Classes of future patients at high risk for Emergency Cesarean Section
5Datamining Result
One of 18 learned rules:
If No previous vaginal delivery, and
   Abnormal 2nd Trimester Ultrasound, and
   Malpresentation at admission
Then Probability of Emergency C-Section is 0.6
Over training data: 26/41 = .63
Over test data: 12/20 = .60
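As a minimal sketch of how a rule like this could be checked against patient records, the Python below tests the three conditions and reports the fraction of matched patients who had an emergency C-section. The field names (`previous_vaginal_delivery` and so on) and the helper names are hypothetical, chosen only for illustration; the slide gives the counts 26/41 and 12/20 but not the record format.

```python
from fractions import Fraction


def rule_fires(patient):
    """Return True when all three conditions of the slide's rule hold."""
    return (
        not patient["previous_vaginal_delivery"]
        and patient["abnormal_2nd_trimester_ultrasound"]
        and patient["malpresentation_at_admission"]
    )


def rule_precision(patients):
    """Fraction of rule-matched patients who had an emergency C-section."""
    matched = [p for p in patients if rule_fires(p)]
    if not matched:
        return None  # the rule covers no patients in this sample
    hits = sum(1 for p in matched if p["emergency_c_section"])
    return Fraction(hits, len(matched))


# Counts quoted on the slide (matched patients, and emergency C-sections among them).
train_estimate = Fraction(26, 41)
test_estimate = Fraction(12, 20)
print(f"training: {float(train_estimate):.2f}, test: {float(test_estimate):.2f}")
```

The closeness of the training and test fractions is what suggests the rule generalizes rather than merely fitting the training records.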
6Credit Risk Analysis
  Feature                   Customer103 (time t0)   Customer103 (time t1)   Customer103 (time tn)
  Years of credit           9                       9                       9
  Loan balance              2,400                   3,250                   4,500
  Income                    52k                     ?                       ?
  Own House                 Yes                     Yes                     Yes
  Other delinquent accts    2                       2                       3
  Max billing cycles late   3                       4                       6
  Profitable customer?      ?                       ?                       NO
7Credit Risk Analysis
Rules learned from synthesized data:
If Other-Delinquent-Accounts > 2, and
   Number-Delinquent-Billing-Cycles > 1
Then Profitable-Customer? = No (Deny Credit Card Application)
If Other-Delinquent-Accounts = 0, and
   (Income > 30k) OR (Years-of-Credit > 3)
Then Profitable-Customer? = Yes (Accept Credit Card application)
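The two rules can be sketched as a small decision function. This is an illustration only: the function and parameter names are hypothetical, and the grouping of the second rule (zero delinquent accounts AND either the income or the credit-history condition) is an assumption, since the slide text does not make the precedence explicit.

```python
def classify_applicant(other_delinquent_accounts, delinquent_billing_cycles,
                       income, years_of_credit):
    """Apply the two slide rules; return 'deny', 'accept', or None if neither fires."""
    if other_delinquent_accounts > 2 and delinquent_billing_cycles > 1:
        return "deny"
    # Assumed grouping: no delinquent accounts AND (income OR credit history).
    if other_delinquent_accounts == 0 and (income > 30_000 or years_of_credit > 3):
        return "accept"
    return None  # no rule covers this applicant


print(classify_applicant(3, 6, 52_000, 9))  # deny
print(classify_applicant(0, 0, 52_000, 9))  # accept
print(classify_applicant(2, 3, 52_000, 9))  # None
```

Returning `None` makes it explicit that learned rule sets of this kind need not cover every case.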
8 - Customer purchase behavior
  Feature           Customer103 (time t0)   Customer103 (time t1)   Customer103 (time tn)
  Sex               M                       M                       M
  Age               53                      53                      53
  Income            50k                     50k                     50k
  Own House         Yes                     Yes                     Yes
  MS Products       Word                    Word                    Word
  Computer          386 PC                  Pentium                 Pentium
  Purchase Excel?   ?                       ?                       YES
9 - Customer retention
  Feature             Customer103 (time t0)   Customer103 (time t1)   Customer103 (time tn)
  Sex                 M                       M                       M
  Age                 53                      53                      53
  Income              50k                     50k                     50k
  Own House           Yes                     Yes                     Yes
  Checking            5k                      20k                     0
  Savings             15k                     0                       0
  Current-customer?   yes                     yes                     NO
10 - Process optimization
  Feature                Product72 (time t0)   Product72 (time t1)   Product72 (time tn)
  Stage                  mix                   cook                  cool
                         Mixing-speed 60rpm    Temperature 325       Fan-speed medium
  Viscosity              1.3                   3.2                   1.3
  Fat content            15                    12                    12
  Density                2.8                   1.1                   1.2
  Spectral peak          2800                  3200                  3100
  Product underweight?   ?                     ?                     YES
11Where Is this Headed?
- Today: tip of the iceberg
- First generation algorithms: regression, neural nets, d-trees, ...
- Applied to single databases
- Budding industry
- Tomorrow
- Learn across multiple media data
- Learn across multiple databases, including web, newsfeeds
- Learn through active experimentation
- Learn decisions rather than predictions
12A Datamining Research Agenda
- Scientific Issues, Basic Technologies
- Learn from mixed media data, e.g., numeric, text, image, voice, sensor
- Active experimentation, exploration
- Optimize decisions rather than predictions
- Invent new features to improve accuracy
- Learn from multiple databases and the world wide web
- Applications
- Medicine, Manufacturing, Financial, Intelligence analysis, Public Policy, Marketing
13Part II Mining the World Wide Web
14Project Goals
- Observation
- The web is rapidly becoming the world's largest information resource
- Retrievable by computer, but readable only to humans
- Our goal
- Automatic construction of computer readable knowledge base from the web
15Automatically Extracted WebKB Description
17Automatically Extracted WebKB Description
18Hypertext Information Extraction
Analyze hypertext at 3 levels of resolution
- Individual web pages
- faculty, student, company, product
- Groups of interconnected web pages
- teaches-course(p, c)
- Individual sentences and fragments
- we produce oranges in Florida
19Bag of Words Classification
  Word       Count
  aardvark   0
  about      2
  all        2
  Africa     1
  apple      0
  anxious    0
  ...
  gas        1
  ...
  oil        1
  Zaire      0
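A bag-of-words representation reduces a page to per-word counts over a fixed vocabulary, discarding word order. The sketch below illustrates the idea; the vocabulary is the fragment shown on the slide, while the sample sentence, the tokenizer, and the function name `bag_of_words` are assumptions made for this example.

```python
import re
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count occurrences of each vocabulary word, ignoring case and word order."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word.lower()] for word in vocabulary]


vocabulary = ["aardvark", "about", "all", "Africa", "apple",
              "anxious", "gas", "oil", "Zaire"]
# Hypothetical page text, written so that it reproduces the counts on the slide.
page = "All about oil and gas: all about our operations in Africa."
print(dict(zip(vocabulary, bag_of_words(page, vocabulary))))
```

The resulting count vector is what a text classifier would take as input in place of the raw page.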
20Total Oil Corporation Web Site
21Learned First Order Rules
IF Person(B), Research_Project(A), Hyperlink(C,A,D),
   Neighborhood_word(C,people), Hyperlink(E,D,B)
Then Member_Of_Research_Project(A,B)
Accuracy 135/138
[Diagram labels: A, B, D, Research Project, Person, people]
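To make the rule concrete, here is a rough sketch that applies it to a toy set of facts. It assumes Hyperlink(C,A,D) means "hyperlink C on page A points to page D" and Neighborhood_word(C,people) means "the word people appears near hyperlink C"; that reading of the argument order, the page names, and the function `infer_members` are all assumptions for illustration.

```python
def infer_members(persons, projects, hyperlinks, neighborhood_words):
    """Derive Member_Of_Research_Project(A, B) pairs from the rule's conditions.

    hyperlinks: dict mapping link id -> (source page, target page)
    neighborhood_words: dict mapping link id -> set of words near that link
    """
    members = set()
    for c, (a, d) in hyperlinks.items():
        # Research_Project(A) and Neighborhood_word(C, people)
        if a not in projects or "people" not in neighborhood_words.get(c, set()):
            continue
        # Hyperlink(E, D, B) with Person(B)
        for e, (source, b) in hyperlinks.items():
            if source == d and b in persons:
                members.add((a, b))
    return members


# Hypothetical pages and links.
persons = {"alice.html", "bob.html"}
projects = {"webkb.html"}
hyperlinks = {
    "link1": ("webkb.html", "webkb-people.html"),
    "link2": ("webkb-people.html", "alice.html"),
    "link3": ("webkb.html", "sponsors.html"),
    "link4": ("sponsors.html", "bob.html"),
}
neighborhood_words = {"link1": {"project", "people"}, "link3": {"funding"}}
print(infer_members(persons, projects, hyperlinks, neighborhood_words))
```

Only the path through the "people" link yields a membership, so `bob.html`, reached through a link without that neighborhood word, is not inferred to be a project member.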
22Learning to Extract Information
We are headquartered in sunny Tehran.
- Focus
- linguistic structure
- train with minimal effort
If Verb phrase ?V has direct object ?O and head of ?V is "headquartered"
Then Location is ?O
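The extraction rule can be sketched over an already-parsed verb phrase. The sketch assumes an upstream parser has produced the phrase's head and direct object; the `VerbPhrase` class, the function name, and the hand-built parse of the example sentence are hypothetical and stand in for whatever linguistic analysis the project actually used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerbPhrase:
    head: str                      # head word of the verb phrase (?V)
    direct_object: Optional[str]   # direct object (?O), if the parser found one


def extract_location(vp):
    """Return the direct object as a location when the head is 'headquartered'."""
    if vp.head.lower() == "headquartered" and vp.direct_object:
        return vp.direct_object
    return None


# Hand-built parse of "We are headquartered in sunny Tehran." (assumed parser output).
parsed = VerbPhrase(head="headquartered", direct_object="Tehran")
print(extract_location(parsed))  # Tehran
```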
23Web Information Extraction
  Task                  Default Accuracy   Learned Accuracy
  Economic Sector       1                  76 (cov 100)
  Corporate Locations   <49                83 (cov 60)
  Student               <5                 72
  Research Project      <5                 89
  Course                <5                 73
24Automatically Extracted KB