Title: Solving Inverse Problems via Machine Learning and Knowledge Discovery
1. Solving Inverse Problems via Machine Learning and Knowledge Discovery
Na Tang
Computer Science, UC Davis
natang_at_ucdavis.edu
2. Outline
- Inverse problem
- Background
  - Bayesian networks
  - Related work
- Our knowledge discovery approach
  - Architecture
  - C1: Documents Collection and Classification
  - C2: Text Analysis and Information Extraction
  - C3: Missing Information Restoration
- Experiments
- Conclusion & Future Work
3. Inverse Problem
[Diagram: Inputs -> Internal Structure (Model) -> Outputs]
- Goal: to obtain the internal structure from indirect, noisy observations (input/output data).
- A specific internal structure (model): Bayesian network. Observable data: incomplete.
- Our inverse problem: to learn Bayesian net models from incomplete data.
5. Bayesian Network
- Bayesian net: a directed graph in which the nodes are variables (inputs/outputs) and the edges indicate dependence among variables.
- Calculate P(Out = oi | In1 = in1, ..., Inn = inn) for each possible oi.
- Predict Out = oi with the largest probability value. (The Ink are attributes; Out is the class variable.)
- Different learning methods are available to find a good Bayesian net from the data.
- n nodes: more than 2^n possible networks.
[Diagram: Inputs + Complete Data -> Bayesian Net Construction -> Predict -> Outputs]
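The prediction step above can be sketched on a toy network. Everything here is hypothetical (two attributes A1, A2 that each depend only on Out): score each candidate class value by its joint probability under the model's conditional probability tables and pick the largest.

```python
# Toy conditional probability tables (hypothetical): Out -> A1, Out -> A2.
p_out = {0: 0.6, 1: 0.4}
p_a1 = {(0, 'x'): 0.7, (0, 'y'): 0.3, (1, 'x'): 0.2, (1, 'y'): 0.8}
p_a2 = {(0, 'u'): 0.5, (0, 'v'): 0.5, (1, 'u'): 0.9, (1, 'v'): 0.1}

def predict(a1, a2):
    # P(Out = o | a1, a2) is proportional to P(o) * P(a1 | o) * P(a2 | o),
    # so the normalizing constant can be skipped when taking the argmax.
    scores = {o: p_out[o] * p_a1[(o, a1)] * p_a2[(o, a2)] for o in p_out}
    return max(scores, key=scores.get)

print(predict('y', 'u'))  # -> 1  (0.4*0.8*0.9 = 0.288 beats 0.6*0.3*0.5 = 0.09)
```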
6. Bayesian Network (Cont.)
- Example: heart disease data sets (UCI).
  - Age, Sex, ..., Thal: attributes; Outcome: class variable.
- Once the Bayesian net is built, calculate P(Outcome | Age, Sex, ..., Thal) to predict heart disease.
Fig. 1: Cleveland data (one of the four heart disease data sets) from the UCI repository.
7. Bayesian Network Learning Methods
- Naive Bayes (NB)
  - Simple; strong assumption of independence between any attributes Ai and Aj (given the class).
  - The assumption does not hold in reality, but accuracy is still good.
  - P(Outcome | Age, Sex, ..., Thal)
    = P(Outcome) P(Age | Outcome) ... P(Thal | Outcome) / P(Age, Sex, ..., Thal)
    which is proportional to P(Outcome) P(Age | Outcome) ... P(Thal | Outcome).
Fig. 2: A Naive Bayes network model for heart disease (Outcome is the parent of Age, Sex, ChestPain, Thal).
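The NB factorization above can be trained by simple counting. A minimal sketch on a hypothetical miniature heart-disease-like table (the attribute values and rows are made up for illustration): estimate P(Outcome) and each P(Attr | Outcome) from frequencies, then predict by the product rule.

```python
from collections import Counter, defaultdict

# Hypothetical toy training table.
rows = [
    {'Age': 'old', 'Sex': 'm', 'Outcome': 1},
    {'Age': 'old', 'Sex': 'f', 'Outcome': 1},
    {'Age': 'young', 'Sex': 'm', 'Outcome': 0},
    {'Age': 'young', 'Sex': 'f', 'Outcome': 0},
    {'Age': 'old', 'Sex': 'm', 'Outcome': 0},
]

class_counts = Counter(r['Outcome'] for r in rows)
cond_counts = defaultdict(Counter)  # (attribute, class) -> value counts
for r in rows:
    for a in ('Age', 'Sex'):
        cond_counts[(a, r['Outcome'])][r[a]] += 1

def p_class(c):
    return class_counts[c] / len(rows)

def p_attr(a, v, c):
    return cond_counts[(a, c)][v] / class_counts[c]

def predict(age, sex):
    # argmax over classes of P(c) * P(age | c) * P(sex | c)
    scores = {c: p_class(c) * p_attr('Age', age, c) * p_attr('Sex', sex, c)
              for c in class_counts}
    return max(scores, key=scores.get)

print(predict('old', 'm'))  # -> 1
```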
8. Bayesian Network Learning Methods
- Tree Augmented Naive Bayes (TAN)
  - More complex than NB, but outperforms NB.
  - Searches a restricted space: all possible trees in which the class variable C is a parent of every attribute and each Ai has at most one more parent Aj. Find the tree that fits the data best.
Fig. 3: A TAN network model for heart disease (nodes: Outcome, Age, Sex, ChestPain, Thal, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRate, Angina, OldPeak, STSlope, Vessel).
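The standard way to score candidate attribute-attribute edges in TAN (following Friedman et al.'s construction, which the slide's "fits the data most" refers to) is conditional mutual information I(Ai; Aj | C): TAN builds a maximum-weight spanning tree over the attributes under this score, then adds C as a parent of every attribute. A sketch of the score on hypothetical toy triples (Ai value, Aj value, class):

```python
from collections import Counter
from math import log2

# Hypothetical toy data: (Age bucket, Cholesterol bucket, class) triples.
rows = [('old', 'high', 1), ('old', 'high', 1), ('young', 'low', 0),
        ('young', 'high', 0), ('old', 'low', 0)]

def cmi(rows):
    """Empirical I(X; Y | Z) = sum p(x,y,z) log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]."""
    n = len(rows)
    pxyz = Counter(rows)
    pxz = Counter((x, z) for x, _, z in rows)
    pyz = Counter((y, z) for _, y, z in rows)
    pz = Counter(z for _, _, z in rows)
    total = 0.0
    for (x, y, z), c in pxyz.items():
        # Count form: p(x,y,z) p(z) / (p(x,z) p(y,z)) = c * pz / (pxz * pyz).
        total += (c / n) * log2((c * pz[z]) / (pxz[(x, z)] * pyz[(y, z)]))
    return total

print(round(cmi(rows), 3))  # -> 0.151
```

TAN computes this score for every attribute pair and keeps the tree with the largest total weight.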
9. Bayesian Network Learning Methods
- Other learning methods
  - More expensive; outperform NB and TAN.
- An important assumption for all of these methods: complete data.
- How to deal with incomplete data?
  - Incomplete attribute: an attribute containing missing values.
Fig. 4: Switzerland data (one of the four heart disease data sets) from the UCI repository. All -999s stand for missing values.
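Detecting which attributes are incomplete is a simple scan for the sentinel value. A minimal sketch (the rows are made up; -999 marks missing, as in the Switzerland data):

```python
# Hypothetical rows using the -999 missing-value sentinel.
rows = [
    {'Age': 63, 'Cholesterol': -999, 'Outcome': 1},
    {'Age': 44, 'Cholesterol': 230, 'Outcome': 0},
    {'Age': 57, 'Cholesterol': -999, 'Outcome': 1},
]

# Count missing entries per attribute; any attribute with at least one
# missing entry is an "incomplete attribute" in the slide's terminology.
missing = {a: sum(1 for r in rows if r[a] == -999) for a in rows[0]}
incomplete = [a for a, m in missing.items() if m > 0]
print(incomplete)  # -> ['Cholesterol']
```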
10. Related Work
- Statistical methods to deal with incomplete data
  - Fill in missing values using the available data as a guide.
  - EM (expectation-maximization) algorithm
  - Evolutionary algorithms
- Most of these methods do not work well when a large percentage of the data is missing.
11. Our Knowledge Discovery Approach
[Architecture: keywords built from the incomplete attributes are sent to search engines on the WWW; the returned documents flow through three components.]
- C1: Documents Collection & Classification. Search results (documents containing the keywords) are split into positive documents (containing probability information) and negative documents (all others).
- C2: Text Analysis & Information Extraction. Extracts probability information from the positive documents.
- C3: Missing Information Restoration. Combines the extracted probability information with the available data to generate the missing data for the incomplete attributes, which feeds model construction.
Fig. 5: Implementation architecture for generating missing information.
12. C1: Documents Collection and Classification
- Document collection phase
  - Collect N documents from Google using keywords.
  - E.g., Cholesterol is an incomplete attribute. Keywords input to the search engine would pair cholesterol with other attributes: "cholesterol age", "cholesterol gender", "cholesterol heart disease".
- Document classification phase
  - Naive Bayes text classifier (Rainbow)
  - Bag-of-words representation.
  - Outperforms KNN (k-nearest neighbor) in our case.
  - Divides the collected documents into two classes:
    - Positive class: documents containing information on heart disease causes and probability information.
    - Negative class: all other documents.
13. C1 (Cont.)
- Training data and parameter settings for the classifier
- Training data
  - 292 documents from professional web sites that specialize in heart disease, hand-labeled as positive or negative (78 positive, 214 negative).
- Parameter settings
  - Token options: skip-html, stop-list, no stemming
  - Event model: word-based (multinomial)
  - Smoothing method: none
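The word-based (multinomial) classifier can be sketched in a few lines. This is a hypothetical miniature, not Rainbow itself: the training snippets are invented, and Laplace smoothing is used here for robustness on unseen words even though the slide's Rainbow run used no smoothing.

```python
from collections import Counter
from math import log

# Hypothetical labeled snippets: "pos" = probability-bearing heart-disease
# text, "neg" = everything else.
train = [
    ("cholesterol raises heart disease risk", "pos"),
    ("high cholesterol probability of disease", "pos"),
    ("site map contact login", "neg"),
    ("terms of service privacy", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())  # bag of words

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    best, best_score = None, float("-inf")
    for label in word_counts:
        total = sum(word_counts[label].values())
        # log P(label) + sum of log P(word | label), Laplace-smoothed.
        score = log(doc_counts[label] / len(train))
        for w in text.split():
            score += log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("cholesterol and heart disease"))  # -> pos
```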
14. C2: Text Analysis and Information Extraction
- Rules are generated to extract the probability information.
- Two categories of probability information:
  - Point probabilities
    - P(Ai = ai | Aj = aj) = c, where c stands for some constant.
  - Qualitative influences
    - Positive influence from Ai to Aj: choosing a higher value for Ai makes a higher value for Aj more likely.
    - P(Aj > aj | Ai = ai1) > P(Aj > aj | Ai = ai2), given ai1 > ai2.
- Other forms of probability information (e.g. comparisons, qualitative synergies).
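One such rule can be sketched as a pattern match. This is a hypothetical example rule, not the paper's actual rule set: a sentence of the shape "NN% of <group> have/develop <condition>" yields the point probability P(condition | group) = NN/100.

```python
import re

def extract(sentence):
    """Map 'NN% of <group> have/develop <condition>.' to a point probability.
    Returns (condition, group, probability) or None if the rule doesn't fire."""
    m = re.search(r"(\d+)%\s+of\s+(.+?)\s+(?:have|develop)\s+(.+?)\.", sentence)
    if not m:
        return None
    return m.group(3), m.group(2), int(m.group(1)) / 100

s = "About 20% of adults with high cholesterol develop heart disease."
print(extract(s))  # (condition, group, probability)
```

A real system needs many such rules plus handling for tables, ranges, and the qualitative-influence phrasings shown on the next slides.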
15. An Example of Point Probability Information
"In general, total cholesterol is considered high when 240 or more, borderline when 200-239, and desirable when 200 or less."
a) Point probabilities expressed in a table
b) Point probabilities expressed in regular sentences
Output:
P(Outcome = 1 | Cholesterol < 160) = degree1 (optimal),
P(Outcome = 1 | Cholesterol < 200) = degree2 (desirable),
P(Outcome = 1 | 200 <= Cholesterol <= 239) = degree3 (borderline),
P(Outcome = 1 | Cholesterol >= 240) = degree4 (high).
16. An Example of Qualitative Influence Information
- "As people get older, their cholesterol levels rise."
- "Cholesterol levels naturally rise as men and women age."
- "Old people have higher cholesterol levels than the young."
Output: positive influence from Age to Cholesterol:
P(Cholesterol > v | Age = a1) > P(Cholesterol > v | Age = a2), given a1 > a2.
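The extracted influence statement can be checked against data directly. A sketch on hypothetical (age, cholesterol) pairs: estimate P(Cholesterol > v) in an older and a younger group and compare.

```python
# Hypothetical (age, cholesterol) observations.
pairs = [(45, 190), (48, 210), (62, 230), (70, 250), (38, 180), (66, 235)]

def p_chol_above(v, age_pred):
    """Empirical P(Cholesterol > v) within the subgroup selected by age_pred."""
    group = [c for a, c in pairs if age_pred(a)]
    return sum(1 for c in group if c > v) / len(group)

older = p_chol_above(220, lambda a: a >= 60)
younger = p_chol_above(220, lambda a: a < 60)
print(older > younger)  # positive influence holds on this toy data -> True
```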
17. C3: Missing Information Restoration
- Inputs: probability outputs from C2; available incomplete data. Outputs: missing data.
- Approach:
  1) Calculate probabilities from the available data, e.g. P(Age, Sex, Outcome) = v.
  2) Convert all the probability information (the outputs from C2 and the probability constraints from step 1) into a linear system of equalities and inequalities; calculate bounds on the probabilities of interest and elicit the required probabilities P(Cholesterol | Age, Sex, Outcome) [DG '95].
  3) Fill in the missing values in the data set based on these probabilities.
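Step 2) can be illustrated with a tiny bounding sketch. The constraints here are hypothetical: x1, x2, x3 are the three cholesterol-bin probabilities for one conditioning context, expressed in integer percent (so x1 + x2 + x3 = 100 and floating point is avoided), and the "extracted" information says x3 >= 40 and x1 <= x2. Enumeration then bounds x1; a real system would solve the linear system analytically or with an LP solver instead.

```python
# Enumerate all integer-percent distributions (x1, x2, x3 >= 0, summing to
# 100) satisfying the hypothetical constraints, and bound x1.
feasible_x1 = [x1
               for x1 in range(101)
               for x2 in range(101 - x1)          # x3 = 100 - x1 - x2 >= 0
               if (100 - x1 - x2) >= 40 and x1 <= x2]
print(min(feasible_x1), max(feasible_x1))  # -> 0 30
```

The bound max(x1) = 30 follows because x1 <= x2 and x1 + x2 <= 60 force x1 <= 30.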
18. C3 (Cont.)
- An example of step 3). Suppose after steps 1) and 2) we got:
  - P(Cholesterol < 200 | Age < 50, Sex = female, HeartDisease = 1) = v1,
  - P(200 < Cholesterol < 240 | Age < 50, Sex = female, HeartDisease = 1) = v2,
  - P(Cholesterol > 240 | Age < 50, Sex = female, HeartDisease = 1) = v3.
- If Age(patient) < 50, Sex(patient) = female, and HeartDisease(patient) = 1,
- then Cholesterol(patient) is drawn from the set {<200, 200-240, >240} with probabilities v1/v, v2/v, and v3/v respectively, where v = v1 + v2 + v3.
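The fill-in rule above is a weighted random draw. A minimal sketch with hypothetical elicited values v1, v2, v3:

```python
import random

# Hypothetical elicited conditional probabilities for one patient context.
v1, v2, v3 = 0.5, 0.3, 0.2
bins = ["<200", "200-240", ">240"]
v = v1 + v2 + v3  # normalizer, as on the slide

random.seed(0)  # fixed seed so the example is reproducible
# Sample the patient's cholesterol bin with probabilities v1/v, v2/v, v3/v.
filled = random.choices(bins, weights=[v1 / v, v2 / v, v3 / v])[0]
print(filled)
```

Each patient matching the condition gets an independent draw, so the filled-in column reproduces the elicited distribution in aggregate.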
19. Experiment Results
- Cleveland data (complete, no missing values).
- 2/3 of the data for training, 1/3 for testing.
- Construct NB and TAN on the complete training data.
- Assume all the cholesterol values in the training data are missing, fill them in using our approach, and then construct NB and TAN on the filled-in training data.
Table 1: A comparison of prediction accuracies with complete and filled-in data.
20. Conclusion & Future Work
- Conclusion
  - The WWW is used as a source for filling in missing information in relational tables.
  - Bayesian models constructed on the restored data achieve good prediction accuracy.
- Future work
  - Consider the reliability of web information.
  - Try different automatic extraction methods.
  - Handle the case where little data is available: extend our method to estimate the Bayesian parameters directly from web knowledge. This is feasible only if we can retrieve enough probability information from the web.