Title: Solving Inverse Problems via Machine Learning and Knowledge Discovery
1. Solving Inverse Problems via Machine Learning and Knowledge Discovery
Na Tang
Computer Science, UC Davis
natang_at_ucdavis.edu
2. Outline
- Inverse problem
- Background
  - Bayesian networks
  - Related work
- Our knowledge discovery approach
  - Architecture
  - C1: Documents Collection and Classification
  - C2: Text Analysis and Information Extraction
  - C3: Missing Information Restoration
- Experiments
- Conclusion & Future Work
3. Inverse Problem
[Diagram: Inputs -> Internal Structure (Model) -> Outputs]
- Goal: to obtain the internal structure from indirect, noisy observations (input/output data).
- A specific internal structure (model): Bayesian network. Observable data: incomplete.
- Our inverse problem: to learn Bayesian net models from incomplete data.
5. Bayesian Network
- Bayesian net: a directed graph in which the nodes are variables (inputs/outputs) and the edges indicate dependence among variables.
- Calculate P(Out = oi | In1 = in1, ..., Inn = inn) for each possible oi.
- Predict Out = oi with the largest probability value. (The Ink are attributes; Out is the class variable.)
- Different learning methods are available to find a good Bayesian net from the data.
- n nodes: more than 2^n possible networks.
[Diagram: Inputs + Complete Data -> Bayesian Net Construction -> Predict -> Outputs]
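The prediction step above can be sketched on a toy network. Everything here is hypothetical (two attributes A1, A2 that each depend only on Out): score each candidate class value by its joint probability under the model's conditional probability tables and pick the largest.

```python
# Toy conditional probability tables (hypothetical): Out -> A1, Out -> A2.
p_out = {0: 0.6, 1: 0.4}
p_a1 = {(0, 'x'): 0.7, (0, 'y'): 0.3, (1, 'x'): 0.2, (1, 'y'): 0.8}
p_a2 = {(0, 'u'): 0.5, (0, 'v'): 0.5, (1, 'u'): 0.9, (1, 'v'): 0.1}

def predict(a1, a2):
    # P(Out = o | a1, a2) is proportional to P(o) * P(a1 | o) * P(a2 | o),
    # so the normalizing constant can be skipped when taking the argmax.
    scores = {o: p_out[o] * p_a1[(o, a1)] * p_a2[(o, a2)] for o in p_out}
    return max(scores, key=scores.get)

print(predict('y', 'u'))  # -> 1  (0.4*0.8*0.9 = 0.288 beats 0.6*0.3*0.5 = 0.09)
```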
6. Bayesian Network (Cont.)
- Example: heart disease data sets (UCI).
  - Age, Sex, ..., Thal: attributes; Outcome: class variable.
- Once the Bayesian net is built, calculate P(Outcome | Age, Sex, ..., Thal) to predict heart disease.
Fig. 1: Cleveland data (one of the four heart disease data sets) from the UCI repository.
7. Bayesian Network Learning Methods
- Naive Bayes (NB)
  - Simple; strong assumption of independence between any attributes Ai and Aj (given the class).
  - The assumption does not hold in reality, but accuracy is still good.
  - P(Outcome | Age, Sex, ..., Thal)
    = P(Outcome) P(Age | Outcome) ... P(Thal | Outcome) / P(Age, Sex, ..., Thal)
    which is proportional to P(Outcome) P(Age | Outcome) ... P(Thal | Outcome).
Fig. 2: A Naive Bayes network model for heart disease (Outcome is the parent of Age, Sex, ChestPain, Thal).
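The NB factorization above can be trained by simple counting. A minimal sketch on a hypothetical miniature heart-disease-like table (the attribute values and rows are made up for illustration): estimate P(Outcome) and each P(Attr | Outcome) from frequencies, then predict by the product rule.

```python
from collections import Counter, defaultdict

# Hypothetical toy training table.
rows = [
    {'Age': 'old', 'Sex': 'm', 'Outcome': 1},
    {'Age': 'old', 'Sex': 'f', 'Outcome': 1},
    {'Age': 'young', 'Sex': 'm', 'Outcome': 0},
    {'Age': 'young', 'Sex': 'f', 'Outcome': 0},
    {'Age': 'old', 'Sex': 'm', 'Outcome': 0},
]

class_counts = Counter(r['Outcome'] for r in rows)
cond_counts = defaultdict(Counter)  # (attribute, class) -> value counts
for r in rows:
    for a in ('Age', 'Sex'):
        cond_counts[(a, r['Outcome'])][r[a]] += 1

def p_class(c):
    return class_counts[c] / len(rows)

def p_attr(a, v, c):
    return cond_counts[(a, c)][v] / class_counts[c]

def predict(age, sex):
    # argmax over classes of P(c) * P(age | c) * P(sex | c)
    scores = {c: p_class(c) * p_attr('Age', age, c) * p_attr('Sex', sex, c)
              for c in class_counts}
    return max(scores, key=scores.get)

print(predict('old', 'm'))  # -> 1
```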
8. Bayesian Network Learning Methods
- Tree Augmented Naive Bayes (TAN)
  - More complex than NB, but outperforms NB.
  - Searches a restricted space: all possible trees in which the class variable C is a parent of every attribute and each Ai has at most one more parent Aj. Find the tree that fits the data best.
Fig. 3: A TAN network model for heart disease (nodes: Outcome, Age, Sex, ChestPain, Thal, RestBP, Cholesterol, BloodSugar, ECG, MaxHeartRate, Angina, OldPeak, STSlope, Vessel).
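The standard way to score candidate attribute-attribute edges in TAN (following Friedman et al.'s construction, which the slide's "fits the data most" refers to) is conditional mutual information I(Ai; Aj | C): TAN builds a maximum-weight spanning tree over the attributes under this score, then adds C as a parent of every attribute. A sketch of the score on hypothetical toy triples (Ai value, Aj value, class):

```python
from collections import Counter
from math import log2

# Hypothetical toy data: (Age bucket, Cholesterol bucket, class) triples.
rows = [('old', 'high', 1), ('old', 'high', 1), ('young', 'low', 0),
        ('young', 'high', 0), ('old', 'low', 0)]

def cmi(rows):
    """Empirical I(X; Y | Z) = sum p(x,y,z) log[ p(x,y,z) p(z) / (p(x,z) p(y,z)) ]."""
    n = len(rows)
    pxyz = Counter(rows)
    pxz = Counter((x, z) for x, _, z in rows)
    pyz = Counter((y, z) for _, y, z in rows)
    pz = Counter(z for _, _, z in rows)
    total = 0.0
    for (x, y, z), c in pxyz.items():
        # Count form: p(x,y,z) p(z) / (p(x,z) p(y,z)) = c * pz / (pxz * pyz).
        total += (c / n) * log2((c * pz[z]) / (pxz[(x, z)] * pyz[(y, z)]))
    return total

print(round(cmi(rows), 3))  # -> 0.151
```

TAN computes this score for every attribute pair and keeps the tree with the largest total weight.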
9. Bayesian Network Learning Methods
- Other learning methods
  - More expensive; outperform NB and TAN.
- An important assumption for all of these methods: complete data.
- How to deal with incomplete data?
  - Incomplete attribute: an attribute containing missing values.
Fig. 4: Switzerland data (one of the four heart disease data sets) from the UCI repository. All -999s stand for missing values.
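Detecting which attributes are incomplete is a simple scan for the sentinel value. A minimal sketch (the rows are made up; -999 marks missing, as in the Switzerland data):

```python
# Hypothetical rows using the -999 missing-value sentinel.
rows = [
    {'Age': 63, 'Cholesterol': -999, 'Outcome': 1},
    {'Age': 44, 'Cholesterol': 230, 'Outcome': 0},
    {'Age': 57, 'Cholesterol': -999, 'Outcome': 1},
]

# Count missing entries per attribute; any attribute with at least one
# missing entry is an "incomplete attribute" in the slide's terminology.
missing = {a: sum(1 for r in rows if r[a] == -999) for a in rows[0]}
incomplete = [a for a, m in missing.items() if m > 0]
print(incomplete)  # -> ['Cholesterol']
```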
10. Related Work
- Statistical methods to deal with incomplete data
  - Fill in missing values using the available data as a guide.
  - EM (expectation-maximization) algorithm
  - Evolutionary algorithms
- Most of these methods do not work well when a large percentage of the data is missing.
11. Our Knowledge Discovery Approach
[Architecture: keywords built from the incomplete attributes are sent to search engines on the WWW; the returned documents flow through three components.]
- C1: Documents Collection & Classification. Search results (documents containing the keywords) are split into positive documents (containing probability information) and negative documents (all others).
- C2: Text Analysis & Information Extraction. Extracts probability information from the positive documents.
- C3: Missing Information Restoration. Combines the extracted probability information with the available data to generate the missing data for the incomplete attributes, which feeds model construction.
Fig. 5: Implementation architecture for generating missing information.
12. C1: Documents Collection and Classification
- Document collection phase
  - Collect N documents from Google using keywords.
  - E.g., Cholesterol is an incomplete attribute. Keywords input to the search engine would pair cholesterol with other attributes: "cholesterol age", "cholesterol gender", "cholesterol heart disease".
- Document classification phase
  - Naive Bayes text classifier (Rainbow)
  - Bag-of-words representation.
  - Outperforms KNN (k-nearest neighbor) in our case.
  - Divides the collected documents into two classes:
    - Positive class: documents containing information on heart disease causes and probability information.
    - Negative class: all other documents.
13. C1 (Cont.)
- Training data and parameter settings for the classifier
- Training data
  - 292 documents from professional web sites that specialize in heart disease, hand-labeled as positive or negative (78 positive, 214 negative).
- Parameter settings
  - Token options: skip-html, stop-list, no stemming
  - Event model: word-based (multinomial)
  - Smoothing method: none
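The word-based (multinomial) classifier can be sketched in a few lines. This is a hypothetical miniature, not Rainbow itself: the training snippets are invented, and Laplace smoothing is used here for robustness on unseen words even though the slide's Rainbow run used no smoothing.

```python
from collections import Counter
from math import log

# Hypothetical labeled snippets: "pos" = probability-bearing heart-disease
# text, "neg" = everything else.
train = [
    ("cholesterol raises heart disease risk", "pos"),
    ("high cholesterol probability of disease", "pos"),
    ("site map contact login", "neg"),
    ("terms of service privacy", "neg"),
]

word_counts = {"pos": Counter(), "neg": Counter()}
doc_counts = Counter()
for text, label in train:
    doc_counts[label] += 1
    word_counts[label].update(text.split())  # bag of words

vocab = {w for c in word_counts.values() for w in c}

def classify(text):
    best, best_score = None, float("-inf")
    for label in word_counts:
        total = sum(word_counts[label].values())
        # log P(label) + sum of log P(word | label), Laplace-smoothed.
        score = log(doc_counts[label] / len(train))
        for w in text.split():
            score += log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = label, score
    return best

print(classify("cholesterol and heart disease"))  # -> pos
```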
14. C2: Text Analysis and Information Extraction
- Rules are generated to extract the probability information.
- Two categories of probability information:
  - Point probabilities
    - P(Ai = ai | Aj = aj) = c, where c stands for some constant.
  - Qualitative influences
    - Positive influence from Ai to Aj: choosing a higher value for Ai makes a higher value for Aj more likely.
    - P(Aj > aj | Ai = ai1) > P(Aj > aj | Ai = ai2), given ai1 > ai2.
- Other forms of probability information (e.g. comparisons, qualitative synergies).
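One such rule can be sketched as a pattern match. This is a hypothetical example rule, not the paper's actual rule set: a sentence of the shape "NN% of <group> have/develop <condition>" yields the point probability P(condition | group) = NN/100.

```python
import re

def extract(sentence):
    """Map 'NN% of <group> have/develop <condition>.' to a point probability.
    Returns (condition, group, probability) or None if the rule doesn't fire."""
    m = re.search(r"(\d+)%\s+of\s+(.+?)\s+(?:have|develop)\s+(.+?)\.", sentence)
    if not m:
        return None
    return m.group(3), m.group(2), int(m.group(1)) / 100

s = "About 20% of adults with high cholesterol develop heart disease."
print(extract(s))  # (condition, group, probability)
```

A real system needs many such rules plus handling for tables, ranges, and the qualitative-influence phrasings shown on the next slides.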
15. An Example of Point Probability Information
"In general, total cholesterol is considered high when 240 or more, borderline when 200-239, and desirable when 200 or less."
a) Point probabilities expressed in a table
b) Point probabilities expressed in regular sentences
Output:
P(Outcome = 1 | Cholesterol < 160) = degree1 (optimal),
P(Outcome = 1 | Cholesterol < 200) = degree2 (desirable),
P(Outcome = 1 | 200 <= Cholesterol <= 239) = degree3 (borderline),
P(Outcome = 1 | Cholesterol >= 240) = degree4 (high).
16. An Example of Qualitative Influence Information
- "As people get older, their cholesterol levels rise."
- "Cholesterol levels naturally rise as men and women age."
- "Old people have higher cholesterol levels than the young."
Output: positive influence from Age to Cholesterol:
P(Cholesterol > v | Age = a1) > P(Cholesterol > v | Age = a2), given a1 > a2.
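The extracted influence statement can be checked against data directly. A sketch on hypothetical (age, cholesterol) pairs: estimate P(Cholesterol > v) in an older and a younger group and compare.

```python
# Hypothetical (age, cholesterol) observations.
pairs = [(45, 190), (48, 210), (62, 230), (70, 250), (38, 180), (66, 235)]

def p_chol_above(v, age_pred):
    """Empirical P(Cholesterol > v) within the subgroup selected by age_pred."""
    group = [c for a, c in pairs if age_pred(a)]
    return sum(1 for c in group if c > v) / len(group)

older = p_chol_above(220, lambda a: a >= 60)
younger = p_chol_above(220, lambda a: a < 60)
print(older > younger)  # positive influence holds on this toy data -> True
```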
17. C3: Missing Information Restoration
- Inputs: probability outputs from C2; available incomplete data. Outputs: missing data.
- Approach:
  1) Calculate probabilities from the available data, e.g. P(Age, Sex, Outcome) = v.
  2) Convert all the probability information (the outputs from C2 and the probability constraints from step 1) into a linear system of equalities and inequalities; calculate bounds on the probabilities of interest and elicit the required probabilities P(Cholesterol | Age, Sex, Outcome) [DG '95].
  3) Fill in the missing values in the data set based on these probabilities.
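Step 2) can be illustrated with a tiny bounding sketch. The constraints here are hypothetical: x1, x2, x3 are the three cholesterol-bin probabilities for one conditioning context, expressed in integer percent (so x1 + x2 + x3 = 100 and floating point is avoided), and the "extracted" information says x3 >= 40 and x1 <= x2. Enumeration then bounds x1; a real system would solve the linear system analytically or with an LP solver instead.

```python
# Enumerate all integer-percent distributions (x1, x2, x3 >= 0, summing to
# 100) satisfying the hypothetical constraints, and bound x1.
feasible_x1 = [x1
               for x1 in range(101)
               for x2 in range(101 - x1)          # x3 = 100 - x1 - x2 >= 0
               if (100 - x1 - x2) >= 40 and x1 <= x2]
print(min(feasible_x1), max(feasible_x1))  # -> 0 30
```

The bound max(x1) = 30 follows because x1 <= x2 and x1 + x2 <= 60 force x1 <= 30.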
18. C3 (Cont.)
- An example of step 3). Suppose after steps 1) and 2) we got:
  - P(Cholesterol < 200 | Age < 50, Sex = female, HeartDisease = 1) = v1,
  - P(200 < Cholesterol < 240 | Age < 50, Sex = female, HeartDisease = 1) = v2,
  - P(Cholesterol > 240 | Age < 50, Sex = female, HeartDisease = 1) = v3.
- If Age(patient) < 50, Sex(patient) = female, and HeartDisease(patient) = 1,
- then Cholesterol(patient) is drawn from the set {<200, 200-240, >240} with probabilities v1/v, v2/v, and v3/v respectively, where v = v1 + v2 + v3.
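The fill-in rule above is a weighted random draw. A minimal sketch with hypothetical elicited values v1, v2, v3:

```python
import random

# Hypothetical elicited conditional probabilities for one patient context.
v1, v2, v3 = 0.5, 0.3, 0.2
bins = ["<200", "200-240", ">240"]
v = v1 + v2 + v3  # normalizer, as on the slide

random.seed(0)  # fixed seed so the example is reproducible
# Sample the patient's cholesterol bin with probabilities v1/v, v2/v, v3/v.
filled = random.choices(bins, weights=[v1 / v, v2 / v, v3 / v])[0]
print(filled)
```

Each patient matching the condition gets an independent draw, so the filled-in column reproduces the elicited distribution in aggregate.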
19. Experiment Results
- Cleveland data (complete, no missing values).
- 2/3 of the data for training, 1/3 for testing.
- Construct NB and TAN on the complete training data.
- Assume all the cholesterol values in the training data are missing, fill them in using our approach, and then construct NB and TAN on the filled-in training data.
Table 1: A comparison of prediction accuracies with complete and filled-in data.
20. Conclusion & Future Work
- Conclusion
  - The WWW is used as a source for filling in missing information in relational tables.
  - Bayesian models constructed on the restored data achieve good prediction accuracy.
- Future work
  - Consider the reliability of web information.
  - Try different automatic extraction methods.
  - Handle the case where little data is available: extend our method to estimate the Bayesian parameters directly from web knowledge. This is feasible only if we can retrieve enough probability information from the web.