Transcript and Presenter's Notes

Title: Solving Inverse Problems via Machine Learning and Knowledge Discovery


1
Solving Inverse Problems via Machine Learning and
Knowledge Discovery
  • -- Na Tang
  • Computer Science, UC Davis
  • natang@ucdavis.edu

2
Outline
  • Inverse Problem
  • Background
  • Bayesian Network
  • Related work
  • Our Knowledge Discovery Approach
  • Architecture
  • C1: Document Collection and Classification
  • C2: Text Analysis and Information Extraction
  • C3: Missing Information Restoration
  • Experiments
  • Conclusion & Future Work

3
Inverse Problem
[Diagram: Inputs and Outputs are observable; the Internal Structure
(Model) between them is unknown]

Goal: to obtain the internal structure from
indirect, noisy observations (input/output
data).
  • A specific internal structure (model):
    Bayesian Network
  • Observable data: incomplete
  • Our inverse problem: to learn Bayesian net models
    from incomplete data
5
Bayesian Network
  • Bayesian Net -- a directed graph where the nodes
    are variables (inputs/outputs) and the edges
    indicate the dependence among variables.
  • Calculate P(Out = o_i | In_1 = in_1, ..., In_n = in_n)
    for each possible o_i.
  • Predict the Out = o_i that has the largest
    probability value (see the sketch below).
  • In_k: attributes; Out: class variable.
  • Different learning methods are available to find
    a good Bayesian Net from the data.
  • N nodes, 2^N possible networks.

[Diagram: Complete Data and Inputs feed Bayesian Net Construction;
the constructed net is used to Predict the Outputs]
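As a quick illustration of that prediction step, here is a minimal
sketch; joint_prob is a hypothetical stand-in for inference in a
learned network, not code from this presentation.

```python
# Pick the class value o_i that maximizes
# P(Out = o_i | In_1 = in_1, ..., In_n = in_n).
# P(In_1 = in_1, ..., In_n = in_n) is the same for every o_i, so it is
# enough to maximize the joint probability P(Out = o_i, In = inputs).

def predict(joint_prob, class_values, inputs):
    # joint_prob(o, inputs) -> P(Out = o, In = inputs), e.g. read off a
    # learned Bayesian network (hypothetical interface for this sketch).
    return max(class_values, key=lambda o: joint_prob(o, inputs))
```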
6
Bayesian Network Cont.
  • Example
  • Heart disease data sets (UCI).
  • Age, Sex, ..., Thal -- attributes
  • Outcome -- class variable.
  • Once the Bayesian net is built, calculate
    P(Outcome | Age, Sex, ..., Thal) to predict heart
    disease.

Fig. 1: Cleveland data (one of the four heart
disease data sets) from the UCI repository.
7
Bayesian Network Learning Methods
  • Naïve Bayes (NB)
  • Simple; strong assumption of independence between
    any attributes A_i and A_j given the class.
  • The assumption does not hold in reality, but
    accuracy is still good.
  • P(Outcome | Age, Sex, ..., Thal)
    = P(Outcome) P(Age|Outcome) ... P(Thal|Outcome)
    / P(Age, Sex, ..., Thal)
    ∝ P(Outcome) P(Age|Outcome) ... P(Thal|Outcome)
    (sketched in code after the figure)

[Diagram: Outcome is the parent of each attribute -- Age, Sex,
ChestPain, Thal, ... -- with no edges among the attributes]

Fig. 2: A Naïve Bayes network model for Heart
Disease.
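A compact sketch of this computation from counts (hypothetical data
layout, not the authors' implementation; no smoothing, so attribute
values never seen with a class get probability zero):

```python
from collections import Counter, defaultdict

# Naive Bayes from counts: score(c) = P(c) * prod_k P(a_k | c), then
# take the argmax over classes c. Records are (attribute_tuple, class)
# pairs of categorical values.

def train_nb(records):
    class_counts = Counter()
    cond_counts = defaultdict(Counter)   # (attr_index, class) -> value counts
    for attrs, c in records:
        class_counts[c] += 1
        for k, a in enumerate(attrs):
            cond_counts[(k, c)][a] += 1
    return class_counts, cond_counts

def predict_nb(class_counts, cond_counts, attrs):
    n = sum(class_counts.values())
    def score(c):                        # proportional to P(c | attrs)
        p = class_counts[c] / n
        for k, a in enumerate(attrs):
            p *= cond_counts[(k, c)][a] / class_counts[c]
        return p
    return max(class_counts, key=score)
```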
8
Bayesian Network Learning Methods
  • Tree Augmented Naïve Bayes (TAN)
  • More complex than NB but outperforms NB
  • Searches a restricted space -- all possible trees
    in which the class variable C is a parent of every
    attribute and each A_i has at most one more parent
    A_j. Find the tree that best fits the data (a
    Chow-Liu sketch follows the figure).

[Diagram: Outcome is a parent of every attribute -- Vessel, OldPeak,
Age, MaxHeartRate, STSlope, Thal, ChestPain, RestBP, Sex, BloodSugar,
ECG, Angina, Cholesterol -- and tree edges give each attribute at
most one additional attribute parent]

Fig. 3: A TAN network model for Heart Disease.
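The presentation doesn't show the search procedure itself; the
standard one for TAN (Friedman et al.) is Chow-Liu style: weight every
attribute pair by conditional mutual information given the class, then
keep a maximum-weight spanning tree. A sketch, assuming small lists of
categorical values:

```python
import itertools, math
from collections import Counter

# Chow-Liu step for TAN: edge weight = conditional mutual information
# I(A_i; A_j | C); a maximum-weight spanning tree (Prim's algorithm)
# then gives each attribute at most one extra parent besides the class.

def cond_mutual_info(xs, ys, cs):
    n = len(cs)
    nxyc = Counter(zip(xs, ys, cs))
    nxc, nyc, nc = Counter(zip(xs, cs)), Counter(zip(ys, cs)), Counter(cs)
    return sum(k / n * math.log(k * nc[c] / (nxc[(x, c)] * nyc[(y, c)]))
               for (x, y, c), k in nxyc.items())

def tan_tree(columns, cs):
    m = len(columns)
    w = {(i, j): cond_mutual_info(columns[i], columns[j], cs)
         for i, j in itertools.combinations(range(m), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < m:
        e = max((e for e in w if (e[0] in in_tree) != (e[1] in in_tree)),
                key=lambda e: w[e])
        edges.append(e)
        in_tree |= set(e)
    return edges   # undirected tree edges among the attributes
```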
9
Bayesian Network Learning Methods
  • Other learning methods
  • Expensive; outperform NB and TAN.
  • Important assumption for all of these methods --
    complete data.
  • How do we deal with incomplete data?
  • Incomplete attribute -- an attribute containing
    missing values.

Fig. 4: Switzerland data (one of the four heart
disease data sets) from the UCI repository. All
"-999"s stand for missing values.
10
Related Work
  • Statistical methods to deal with incomplete data:
  • Fill in missing values using the available data as
    a guide.
  • EM (Expectation-Maximization) algorithm
  • Evolutionary algorithms
  • Most of these methods do not work well when a
    large percentage of the data is missing.

11
Our Knowledge Discovery Approach
[Diagram: keywords built from the incomplete attributes are sent to
search engines on the WWW; the returned search results (documents
containing the keywords) enter C1, Document Collection and
Classification, which separates positive documents (those containing
probability information) from all other (negative) documents; C2,
Text Analysis and Information Extraction, turns the positive
documents into probability information; C3, Missing Information
Restoration, combines that with the available data to produce the
missing data for the incomplete attributes, which together with the
available data feeds Model Construction]

Fig. 5: Implementation architecture for
generating missing information.
12
C1: Document Collection and Classification
  • Document Collection Phase
  • Collect N documents from Google using keywords.
  • E.g., Cholesterol is an incomplete attribute. The
    keywords fed to the search engine pair Cholesterol
    with the other attributes: "cholesterol age",
    "cholesterol gender", "cholesterol heart disease",
    ... (see the query sketch below)
  • Document Classification Phase
  • Naïve Bayes text classifier (Rainbow)
  • Bag-of-words representation.
  • Outperforms KNN (k-nearest neighbor) in our case.
  • Divide the collected documents into two classes:
  • positive class: documents containing information
    on heart disease causes and probability
    information.
  • negative class: all other documents.
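A sketch of the query-generation step (function and attribute names
are illustrative; the slides only say Google was queried with such
keyword pairs):

```python
# Pair the incomplete attribute with every other attribute and with the
# class variable to form search-engine queries. Hypothetical helper.

def build_queries(incomplete_attr, other_attrs, class_var="heart disease"):
    return [f"{incomplete_attr} {a}" for a in other_attrs + [class_var]]

print(build_queries("cholesterol", ["age", "gender"]))
# ['cholesterol age', 'cholesterol gender', 'cholesterol heart disease']
```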

13
C1 Cont.
  • Training data and parameter settings for the
    classifier (an equivalent scikit-learn setup is
    sketched below)
  • Training data:
  • 292 documents from professional web sites that
    specialize in heart disease, hand-labeled as
    positive or negative (78 positive, 214
    negative).
  • Parameter settings:
  • Token options: skip-html, stop-list, no stemming
  • Event model: word-based (multinomial)
  • Smoothing method: none
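Rainbow itself is a standalone toolkit; an equivalent bag-of-words
multinomial setup can be sketched with scikit-learn (illustrative only
-- exact parity with the Rainbow options is not claimed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Bag-of-words counts (stop-list, no stemming) feeding a word-based
# multinomial Naive Bayes; a near-zero alpha mimics "no smoothing".
clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    MultinomialNB(alpha=1e-10),
)

docs = ["high cholesterol raises heart disease risk", "buy cheap tickets now"]
labels = ["positive", "negative"]        # hand-labeled training documents
clf.fit(docs, labels)
print(clf.predict(["cholesterol levels rise with age"]))
```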

14
C2: Text Analysis and Information Extraction
  • Rules are generated to extract the probability
    information.
  • Two categories of probability information:
  • Point probabilities
  • P(A_i = a_i | A_j = a_j) = c, where c stands for
    some constant (a regex sketch of one such rule
    follows this list).
  • Qualitative influences
  • Positive influence from A_i to A_j: choosing a
    higher value for A_i makes a higher value for A_j
    more likely.
  • P(A_j > a_j | A_i = a_i1) > P(A_j > a_j | A_i = a_i2),
    given a_i1 > a_i2
  • Other forms of probability information (e.g.
    comparisons, qualitative synergies)
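The rules themselves are not listed in the presentation; a minimal
regex-based sketch of one point-probability rule (the pattern and
output format are assumptions) could look like:

```python
import re

# Find phrases like "<attr> is considered <label> when <number> or
# more/less" and emit (attribute, comparison, threshold, label) tuples.
# Hypothetical rule; the actual C2 rule set is not published here.
RULE = re.compile(
    r"(?P<attr>cholesterol)\s+is\s+considered\s+(?P<label>\w+)\s+when\s+"
    r"(?P<value>\d+)\s+or\s+(?P<dir>more|less)",
    re.IGNORECASE,
)

def extract_point_probabilities(sentence):
    return [(m["attr"].lower(),
             ">=" if m["dir"] == "more" else "<=",
             int(m["value"]),
             m["label"])
            for m in RULE.finditer(sentence)]

print(extract_point_probabilities(
    "Total cholesterol is considered high when 240 or more."))
# [('cholesterol', '>=', 240, 'high')]
```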

15
An example of Point Probability information
"In general, total cholesterol is considered high
when 240 or more, borderline when 200-239, and
desirable when 200 or less."
a) Point probabilities expressed in a table (not
transcribed)
b) Point probabilities expressed in regular
sentences (the quote above)
Output: P(Outcome=1 | Cholesterol < 160) = degree_1
("optimal"), P(Outcome=1 | Cholesterol < 200) =
degree_2 ("desirable"), P(Outcome=1 | 200 <=
Cholesterol <= 239) = degree_3 ("borderline"),
P(Outcome=1 | Cholesterol >= 240) = degree_4
("high").
16
An example of Qualitative Influence information
  • "As people get older, their cholesterol levels
    rise."
  • "Cholesterol levels naturally rise as men and
    women age."
  • "Old people have higher cholesterol levels than
    young people."

Output: positive influence from Age to Cholesterol --
P(Cholesterol > v | Age = a1) > P(Cholesterol > v | Age = a2),
given a1 > a2.
17
C3: Missing Information Restoration
  • Inputs: probability outputs from C2, plus the
    available (incomplete) data
  • Outputs: missing data
  • Approach
  • 1) Calculate probabilities from the available
    data,
  • e.g. P(Age, Sex, ..., Outcome) = v
  • 2) Convert all the probability information
    (outputs from C2 and the probability constraints
    from 1)) into a linear system of equalities and
    inequalities, calculate bounds on the
    probabilities of interest, and elicit the required
    probabilities -- P(Cholesterol | Age, Sex, ...,
    Outcome) [DG '95] (a toy LP sketch follows this
    list)
  • 3) Fill in the missing values in the data set based
    on these probabilities.
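Step 2) is a linear-programming computation: each probability
statement becomes a linear constraint over the unknown probabilities,
and bounds come from minimizing and maximizing the quantity of
interest. A toy sketch with scipy (the three-band variable and all
numeric constraints are invented for illustration):

```python
from scipy.optimize import linprog

# Unknowns p1, p2, p3: a conditional distribution over three cholesterol
# bands given fixed Age/Sex/Outcome. They must sum to 1; extracted
# statements bound individual entries. Bounds on p1 follow from
# minimizing +p1 and -p1 subject to the same constraints.
A_eq, b_eq = [[1, 1, 1]], [1]        # p1 + p2 + p3 = 1
A_ub, b_ub = [[0, 1, 0]], [0.4]      # e.g. extracted: p2 <= 0.4
bnds = [(0.1, 1), (0, 1), (0, 1)]    # e.g. extracted: p1 >= 0.1

lo = linprog([1, 0, 0], A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bnds)
hi = linprog([-1, 0, 0], A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bnds)
print(lo.fun, -hi.fun)               # lower and upper bound on p1
```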

18
C3 Cont.
  • An example of step 3) -- suppose after 1) and 2) we
    got:
  • P(Cholesterol < 200 | Age < 50, Sex = female,
    HeartDisease = 1) = v1,
  • P(200 <= Cholesterol < 240 | Age < 50, Sex = female,
    HeartDisease = 1) = v2,
  • P(Cholesterol >= 240 | Age < 50, Sex = female,
    HeartDisease = 1) = v3
  • If
  • Age(patient) < 50 and Sex(patient) = female and
    HeartDisease(patient) = 1
  • Then
  • Cholesterol(patient) takes one of the values from
    the set { <200, 200-240, >=240 } respectively with
    probabilities v1/v, v2/v and v3/v, where v = v1 +
    v2 + v3 (see the sketch below).
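That fill-in rule is a weighted draw over the bands; a sketch (field
names are illustrative, and random.choices normalizes the weights by
v = v1 + v2 + v3 itself):

```python
import random

bands = ["<200", "200-240", ">=240"]
v1, v2, v3 = 0.5, 0.3, 0.2           # elicited weights (illustrative values)

def fill_cholesterol(patient):
    # Only patients matching the conditioning context get this draw.
    if patient["age"] < 50 and patient["sex"] == "female" \
            and patient["heart_disease"] == 1:
        patient["cholesterol"] = random.choices(bands, weights=[v1, v2, v3])[0]
    return patient

print(fill_cholesterol(
    {"age": 42, "sex": "female", "heart_disease": 1, "cholesterol": None}))
```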

19
Experiment Results
  • Cleveland data (complete, no missing values).
  • 2/3 of the data for training, 1/3 for testing.
  • Construct NB and TAN on the complete training
    data.
  • Assume all the cholesterol values in the training
    data are missing, fill the values in using our
    approach, and then construct NB and TAN on the
    filled-in training data.

Table 1: A comparison of prediction accuracies
with complete and filled-in data.
20
Conclusion & Future Work
  • Conclusion
  • The WWW is used as a source for filling in missing
    information in relational tables.
  • Bayesian models constructed on the restored data
    achieve good prediction accuracy.
  • Future Work
  • Consider the reliability of the web
    information.
  • Try different automatic extraction methods.
  • Handle the case where little data is available:
    extend our method to directly estimate the
    Bayesian parameters from web knowledge --
    feasible only if we can retrieve enough
    probability information from the web.