Data Mining of Gene Expression Profiles for the Diagnosis and Understanding of Diseases

1
Data Mining of Gene Expression Profiles for the
Diagnosis and Understanding of Diseases
  • Limsoon Wong
  • Institute for Infocomm Research

2
Plan
  • Some accomplishments and challenges in knowledge
    discovery from biological and clinical data
  • Data mining in microarray analysis
  • diagnosis of disease state and subtype
  • derivation of treatment plan
  • understanding of gene interaction network

3
Knowledge Discovery from Biological and Clinical
Data: MOTIVATION
4
Driving Forces: Genes, Proteins, Interactions,
Diagnosis, Cures
5
If we figure out how these work, we get these
Benefits
  • To the patient: better drug, better treatment
  • To the pharma: save time, save cost, make more
  • To the scientist: better science
6
To figure these out, we bet on...
solution: Data Mgmt, Knowledge Discovery
  • Data Mgmt: Integration, Transformation, Cleansing
  • Knowledge Discovery: Statistics, Algorithms,
    Databases
7
Knowledge Discovery from Biological and Clinical
Data: ACCOMPLISHMENT
8
8 years of bioinformatics R&D in Singapore
9
Predict Epitopes, Find Vaccine Targets
  • Vaccines are often the only solution for viral
    diseases
  • Finding and developing effective vaccine targets
    is a slow and expensive process

10
Recognize Functional Sites, Help Scientists
  • Effective recognition of initiation, control, and
    termination of biological processes is crucial to
    speeding up and focusing scientific experiments
  • Data mining of bio seqs to find rules for
    recognizing and understanding functional sites

Dragon's 10x reduction of TSS recognition false
positives
11
Diagnose Leukaemia, Benefit Children
  • Childhood leukaemia is a heterogeneous disease
  • Treatment is based on subtype
  • 3 different tests and 4 different experts are
    needed for accurate diagnosis
  • Curable in USA, fatal in Indonesia
  • A single platform diagnosis based on gene
    expression
  • Data mining to discover rules that are easy for
    doctors to understand

12
Understand Proteins, Fight Diseases
  • Understanding function and role of protein needs
    organised info on interaction pathways
  • Such info is often reported in scientific papers
    but is seldom found in structured databases
  • Knowledge extraction system to process free
    text, extract protein names, and extract
    interactions

13
Data Mining in Microarray Analysis: MICROARRAY
BACKGROUND
14
What's a Microarray?
  • Contains a large number of DNA molecules spotted
    on glass slides, nylon membranes, or silicon wafers
  • Measure expression of thousands of genes
    simultaneously

15
Affymetrix GeneChip Array
16
Making Affymetrix GeneChip
17
Gene Expression Measurement by GeneChip
18
A Sample Affymetrix GeneChip File (U95A)
19
Data Mining in Microarray Analysis: DISEASE
SUBTYPE DIAGNOSIS
20
Pediatric Acute Lymphoblastic Leukemia
  • A heterogeneous disease with more than 12
    subtypes, e.g., T-ALL, E2A-PBX1, TEL-AML1,
    BCR-ABL, MLL, and Hyperdip>50
  • Treatment response is subtype dependent
  • 80% continuous remission if subtype is correctly
    diagnosed and the corresponding treatment plan is
    applied

21
Subtype Diagnosis
  • Require different tests: immunophenotyping,
    cytogenetics, molecular diagnostics
  • Require different experts: hematologist,
    oncologist, pathologist, cytogeneticist

22
Difficulties and Implications
  • The different tests and experts are not commonly
    available within a single hospital, especially in
    less advanced countries
  • An 80%-curable disease in USA can be a fatal
    disease in Indonesia!
  • Is there a single diagnostic platform that does
    not need multiple human specialists?

23
A Potential Solution by Microarrays (Yeoh et al.,
Cancer Cell 1:133--143, 2002)
24
Some Caveats
  • Study was performed on Americans
  • May not be applicable to Singaporeans,
    Malaysians, Indonesians, etc.
  • Large-scale study on local populations currently
    in the works

25
Typical Procedure in Analysing Gene Expression
for Diagnosis
  • Gene expression data collection
  • Gene selection
  • Classifier training
  • Classifier tuning (optional for some machine
    learning methods)
  • Apply classifier for diagnosis of future cases

26
Feature Selection Methods
A refresher of feature selection methods
27
Signal Selection (Basic Idea)
  • Choose a signal w/ low intra-class distance
  • Choose a signal w/ high inter-class distance

28
Signal Selection (e.g., t-statistics)
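As a rough illustration of t-statistic scoring, the sketch below assumes the common unequal-variance (Welch-style) form: the gap between the class means divided by the pooled spread. A large score means high inter-class distance relative to intra-class distance. The function name and the expression values are hypothetical, and each class is assumed to have at least two samples with non-zero variance.

```python
from math import sqrt
from statistics import mean, variance

def t_statistic(class_a, class_b):
    """Welch-style t-statistic for one gene measured in two classes."""
    na, nb = len(class_a), len(class_b)
    # Spread term combines the sample variances of both classes.
    spread = sqrt(variance(class_a) / na + variance(class_b) / nb)
    return abs(mean(class_a) - mean(class_b)) / spread

# Hypothetical expression values: (class 1 samples, class 2 samples).
gene_good = ([1.0, 1.2, 0.9, 1.1], [3.0, 3.1, 2.9, 3.2])  # well separated
gene_poor = ([1.0, 3.0, 2.0, 4.0], [1.5, 3.5, 2.5, 4.5])  # heavily overlapping

# The well-separated gene should receive the higher score.
print(t_statistic(*gene_good) > t_statistic(*gene_poor))
```

Genes would then be ranked by this score and the top ones kept.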
29
Signal Selection (e.g., χ²)
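As a rough illustration of χ² scoring, the sketch below assumes the standard Pearson χ² statistic computed on a contingency table of a discretized gene (rows: expression intervals) against class (columns). The function name and the counts are hypothetical, and every row and column is assumed to have a non-zero total.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if the interval and the class were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical counts: rows = (low, high) expression, columns = (class 1, class 2).
informative = [[10, 0], [0, 10]]   # interval predicts class perfectly
uninformative = [[5, 5], [5, 5]]   # interval says nothing about class

print(chi_square(informative) > chi_square(uninformative))
```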
30
Signal Selection (e.g., CFS)
  • Instead of scoring individual signals, how about
    scoring a group of signals as a whole?
  • CFS: Correlation-based Feature Selection
  • A good group contains signals that are highly
    correlated with the class, and yet uncorrelated
    with each other
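The group-scoring idea can be sketched with the usual CFS merit, k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation. The sketch below uses absolute Pearson correlation as a stand-in correlation measure, which is an assumption (CFS implementations often use other measures). All names and values are hypothetical, and no feature is assumed to be constant.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cfs_merit(features, labels):
    """Merit of a feature group: high class correlation, low redundancy."""
    k = len(features)
    r_cf = sum(abs(pearson(f, labels)) for f in features) / k
    pairs = list(combinations(features, 2))
    r_ff = sum(abs(pearson(f, g)) for f, g in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

labels = [0, 0, 0, 1, 1, 1]          # hypothetical class labels
f1 = [1, 2, 1, 5, 6, 5]
f2 = [2, 4, 2, 10, 12, 10]           # exact multiple of f1: fully redundant
f3 = [3, 1, 2, 7, 9, 8]              # informative, less redundant with f1

# A group with a redundant member scores lower than a more diverse group.
print(cfs_merit([f1, f3], labels) > cfs_merit([f1, f2], labels))
```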

31
Gene Expression Profile Classification
An introduction to gene expression profile
classification, using ALL subtype diagnosis as the
example
32
Subtype Classification of ALL
A tree-structured diagnostic workflow was
recommended by the doctors, as per Yeoh et al.,
Cancer Cell 1:133--143, 2002
33
Training and Testing Sets
34
Our procedure for ALL subtype diagnosis
  • Gene expression data collection
  • Gene selection by entropy
  • Classifier training by emerging pattern
  • Classifier tuning (optional for some machine
    learning methods)
  • Apply classifier for diagnosis of future cases by
    PCL

35
Signal Selection (e.g., entropy)
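A minimal sketch of entropy-based selection, assuming the usual approach of searching for the cut point on a gene's expression values that minimizes the weighted class entropy of the two resulting intervals. It omits any stopping criterion for deciding whether a gene has a usable cut point at all. Function names and values are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_cut(values, labels):
    """Return (cut_point, weighted_entropy) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (None, entropy(labels))  # no cut: entropy of the whole set
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot cut between identical values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if e < best[1]:
            best = ((lo + hi) / 2, e)
    return best

# Hypothetical gene: low in normal (N) samples, high in tumour (T) samples.
values = [100, 200, 300, 400, 500, 600]
labels = ["N", "N", "N", "T", "T", "T"]
cut, ent = best_cut(values, labels)
print(cut, ent)
```

Genes whose best cut gives the lowest entropy separate the classes most cleanly and would be ranked first.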
36
Emerging Patterns (EPs)
  • An EP is a set of conditions, usually involving
    several features, that most members of a class
    satisfy but none or few of the other class satisfy
  • A jumping EP is an EP that some members of a
    class satisfy but no members of the other class
    satisfy
  • We use only the most general jumping EPs
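The definitions above can be sketched as follows, assuming each sample has already been discretized into a set of items such as "g1<=50" (one item per gene interval). The function names, item labels and samples are hypothetical. "Most general" is taken here to mean that no proper subset of the pattern is itself a jumping EP.

```python
from itertools import combinations

def frequency(pattern, samples):
    """Fraction of samples that satisfy every condition in the pattern."""
    return sum(pattern <= s for s in samples) / len(samples)

def is_jumping_ep(pattern, home, other):
    """True if some home-class samples match and no other-class sample does."""
    return frequency(pattern, home) > 0 and frequency(pattern, other) == 0

def is_most_general(pattern, home, other):
    """True if the pattern is a jumping EP and no proper subset is one."""
    if not is_jumping_ep(pattern, home, other):
        return False
    for size in range(1, len(pattern)):
        for subset in combinations(pattern, size):
            if is_jumping_ep(frozenset(subset), home, other):
                return False
    return True

# Hypothetical discretized samples.
normal = [frozenset({"g1<=50", "g2>30"}),
          frozenset({"g1<=50", "g2>30"}),
          frozenset({"g1<=50", "g2<=30"})]
tumour = [frozenset({"g1>50", "g2>30"}),
          frozenset({"g1<=50", "g2<=30"})]

pattern = frozenset({"g1<=50", "g2>30"})
print(is_most_general(pattern, normal, tumour))
```

Here neither condition alone separates the classes, but the pair does, which is why the two-item pattern is a most general jumping EP for the normal class.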

37
PCL: Prediction by Collective Likelihood
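A hedged sketch of PCL scoring, reconstructed from the description in the Li and Wong papers listed in the references: for each class, the top-k EPs of that class found in the test sample are compared, rank by rank, against the top-k EPs of the class overall, and the frequency ratios are summed. The class with the highest score wins. The value of k, the tie handling, the function names and the data are all assumptions for illustration.

```python
def pcl_score(sample, eps_with_freq, k=3):
    """Score one class for a sample.

    eps_with_freq: list of (pattern, frequency) pairs for the class,
    sorted by decreasing frequency.
    """
    top = eps_with_freq[:k]
    # Top-k EPs of the class that the sample actually contains.
    matched = [(p, f) for p, f in eps_with_freq if p <= sample][:k]
    return sum(mf / tf for (_, mf), (_, tf) in zip(matched, top))

def pcl_predict(sample, eps_by_class, k=3):
    """Predict the class whose EPs give the sample the highest score."""
    return max(eps_by_class, key=lambda c: pcl_score(sample, eps_by_class[c], k))

# Hypothetical EPs and frequencies per class.
eps_by_class = {
    "normal": [(frozenset({"a"}), 0.9), (frozenset({"a", "b"}), 0.6)],
    "tumour": [(frozenset({"c"}), 0.8), (frozenset({"d"}), 0.5)],
}
sample = frozenset({"a", "b", "d"})
print(pcl_predict(sample, eps_by_class, k=2))
```

A sample that contains the most frequent EPs of a class scores close to k for that class; one that contains only rarer EPs scores lower.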
38
Accuracy (using 20 genes of lowest entropy)
39
Comprehensibility
40
Gene Expression Profile Classification: How about
other feature selection and classification
methods?
41
Some gene selection heuristics
  • all-CFS: all features from CFS
  • top20-χ²: 20 features w/ highest χ² stats
  • top20-t: 20 features w/ highest t-stats
  • top20-mit: 20 features w/ highest MIT stats
  • entropy: 20 features w/ lowest entropy
  • all-χ²: all features meeting 5% significance
    level of χ² stats

42
Some other classification methods
  • k-NN (k=1): majority vote of the k nearest
    neighbours, determined by Euclidean distance
  • C4.5: widely used decision tree method
  • Naïve Bayes (NB): probabilistic prediction using
    Bayes' rule
  • SVM: (linear) discriminant function that maximizes
    separation of boundary samples
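Of the methods listed, k-NN is simple enough to sketch in a few lines. The version below is a minimal illustration, assuming each training case is a (vector, label) pair of selected gene expression values. The function name, vectors and labels are hypothetical, and `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k=1):
    """Majority vote among the k training cases nearest to the query."""
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical two-gene expression vectors with subtype labels.
train = [((1.0, 1.0), "T-ALL"),
         ((1.2, 0.9), "T-ALL"),
         ((5.0, 5.1), "MLL")]

print(knn_predict(train, (4.8, 5.0), k=1))
```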

43
Accuracy
  • Feature selection improves performance
  • Entropy+PCL has consistently high performance

44
When 20 genes are selected randomly
Average over 100 experiments
Cf. 7-15 mistakes total with good feature
selection
45
Data Mining in Microarray Analysis: TREATMENT
PLAN DERIVATION
A pure speculation!
46
Can we do more with EPs?
  • Detect gene groups that are significantly related
    to a disease
  • Derive coordinated gene expression patterns from
    these groups
  • Derive treatment plan based on these patterns

47
Colon Tumour Dataset (Alon et al., PNAS
96:6745--6750, 1999)
  • We use the colon tumour dataset above to
    illustrate our ideas
  • 22 normal samples
  • 40 colon tumour samples

48
Detect Gene Groups
  • Feature selection: use the entropy method; 35
    genes have cut points
  • Generate EPs: 19501 EPs in normals, 2165 EPs in
    tumours
  • EPs with largest support are gene groups
    significantly correlated with the disease

49
Top 20 EPs
50
Observation 1
  • Some EPs contain a large number of genes and
    still have high freq
  • E.g., {2, 3, 6, 7, 13, 17, 33} has freq 90.91% in
    normal and 0% in cancer samples
  • Nearly all normal samples' gene expr. values
    satisfy all conds. implied by these 7 items

51
Observation 2
  • Freq of a singleton EP is not necessarily larger
    than that of an EP having multiple genes
  • E.g., {5} is an EP in cancer samples and has freq
    32.5%
  • E.g., {16, 58, 62} is an EP in cancer samples and
    has freq 75.5%
  • Groups of genes and their correlations could be
    more important than single genes

52
Observation 3
  • M33680 has lowest entropy of the 35 genes if
    cutpoint is set at 352
  • 18/40 of cancer samples shift expr level of
    M33680 from its normal range to its abnormal range

53
Treatment Plan Idea
  • Increase/decrease expression level of particular
    genes in a cancer cell so that
  • it has the common EPs of normal cells
  • it has no common EPs of cancer cells

54
Treatment Plan Example
  • From the EP {2, 3, 6, 7, 13, 17, 33}:
  • 91% of normal cells express the 7 genes (T51560,
    T49941, M62994, R34701, L02426, U20428, R10707)
    in the corresponding intervals
  • a cancer cell never expresses all 7 genes in the
    same way
  • if expression level of improperly expressed genes
    can be adjusted, the cancer cell can have one
    common EP of normal cells
  • a cancer cell can then be iteratively converted
    into a normal one
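The adjustment step can be sketched as below, purely as an in-silico illustration of the speculative idea above. It assumes an EP is stored as a mapping from gene to a closed expression interval, and it moves each out-of-range gene to the nearest interval boundary. The function name, gene names, intervals and values are hypothetical.

```python
def adjust_to_pattern(sample, pattern):
    """Clamp a sample's expression values into the intervals of one EP.

    sample:  dict mapping gene -> expression level
    pattern: dict mapping gene -> (low, high) closed interval
    Returns the adjusted sample and the list of genes that were changed.
    """
    adjusted = dict(sample)
    changed = []
    for gene, (low, high) in pattern.items():
        value = sample[gene]
        if not (low <= value <= high):
            # Move the value to the nearest boundary of the interval.
            adjusted[gene] = min(max(value, low), high)
            changed.append(gene)
    return adjusted, changed

# Hypothetical tumour sample and a hypothetical EP of normal cells.
tumour_sample = {"geneA": 500.0, "geneB": 120.0, "geneC": 80.0}
normal_ep = {"geneA": (0.0, 396.0), "geneB": (100.0, 300.0)}

adjusted, changed = adjust_to_pattern(tumour_sample, normal_ep)
print(adjusted, changed)
```

Repeating this for several common EPs of normal cells gives the iterative conversion described in the last bullet.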

55
Choosing Genes to Adjust
56
Doing more adjustments...
  • Down regulating T49941 leads to 2 more top 10 EPs
    of normal cells to show up in the adjusted T1
  • Down regulating X62153 to below 396 and T72403 to
    below 296 leads to T1 having 9 top 10 EPs of
    normal cells
  • Ave. no. of EPs in normal cells is 9
  • So the adjusted T1 now has impt features of
    normal cells

57
Next, eliminate common EPs of cancer cells in T1
  • 6 more genes (K03001, T49732, U29171, R76254,
    D31767, L40992) are adjusted
  • All top 10 EPs of cancer cells now disappear from
    T1
  • Ave. no. of top 10 EPs contained in cancer cells
    is 6
  • The adjusted T1 now holds enough common features
    of normal cells and no features of cancer cells
  • T1 is converted to a normal cell

58
Treatment Plan Validation
  • Adjustments were made to the 40 colon tumour
    samples based on EPs as described
  • Classifiers trained on original samples were
    applied to the adjusted samples

It works!
59
A Big But...
  • Effective means for identifying mechanisms and
    pathways through which to modulate gene
    expression of selected genes need to be developed

60
Data Mining in Microarray Analysis: GENE
INTERACTION PREDICTION
61
Beyond Classification of Gene Expression Profiles
  • After identifying the candidate genes by feature
    selection, do we know which ones are causal genes
    and which ones are surrogates?

62
Gene Regulatory Circuits
  • Genes are connected in a circuit or network
  • Expression of a gene in a network depends on
    expression of some other genes in the network
  • Can we reconstruct the gene network from gene
    expression data?

63
Key Questions
  • For each gene in the network:
  • Which genes affect it?
  • How do they affect it? Positively? Negatively?
    In more complicated ways?

64
Some Techniques
  • Bayesian networks: Friedman et al., JCB
    7:601--620, 2000
  • Boolean networks: Akutsu et al., PSB 2000, pages
    293--304
  • Differential equations: Chen et al., PSB 1999,
    pages 29--40
  • Classification-based method: Soinov et al.,
    Towards reconstruction of gene networks from
    expression data by supervised learning, Genome
    Biology 4:R6.1--9, 2003

65
A Classification-based Technique (Soinov et al.,
Genome Biology 4:R6.1--9, Jan 2003)
  • Given a gene expression matrix X: each row is a
    gene, each column is a sample, and each element
    x_ij is the expression of gene i in sample j
  • Find the average value a_i of each gene i
  • Denote s_ij as the state of gene i in sample j:
  • s_ij = up if x_ij > a_i
  • s_ij = down if x_ij ≤ a_i
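The state assignment can be sketched directly: each gene's expression values are compared with that gene's own average across samples. The function name and matrix values are hypothetical.

```python
def gene_states(matrix):
    """Convert an expression matrix (rows = genes, columns = samples)
    into up/down states relative to each gene's own average."""
    states = []
    for row in matrix:
        avg = sum(row) / len(row)
        states.append(["up" if x > avg else "down" for x in row])
    return states

# Hypothetical matrix: 2 genes measured in 4 samples.
X = [[1.0, 2.0, 3.0, 6.0],   # average 3.0
     [5.0, 5.0, 5.0, 5.0]]   # average 5.0, so never strictly above it

S = gene_states(X)
print(S)
```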

66
A Classification-based Technique (Soinov et al.,
Genome Biology 4:R6.1--9, Jan 2003)
  • To see whether the state of gene g is determined
    by the state of other genes
  • we see whether {s_ij | i ≠ g} can predict s_gj
  • if s_gj can be predicted with high accuracy,
    then yes
  • Any classifier can be used, such as C4.5, PCL,
    SVM, etc.
  • To see how the state of gene g is determined by
    the state of other genes
  • apply C4.5 (or PCL or other rule-based
    classifiers) to predict s_gj from {s_ij | i ≠ g}
  • and extract the decision tree or rules used
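A minimal sketch of both questions, using a one-rule decision stump as a deliberately simple stand-in for C4.5 (this swap is an assumption made to keep the example short, not the method of the cited paper). Leave-one-out accuracy answers "whether" the target gene's state is predictable from the other genes, and the extracted rule answers "how". Gene names and states are hypothetical.

```python
from collections import Counter

def one_rule(columns, target):
    """Pick the predictor gene whose states best predict the target.

    columns: dict mapping gene -> list of states, one per sample
    target:  list of states of the target gene, one per sample
    Returns (gene, rule), where rule maps predictor state -> target state.
    """
    best = None
    for gene, states in columns.items():
        rule = {}
        for value in set(states):
            matching = [t for s, t in zip(states, target) if s == value]
            rule[value] = Counter(matching).most_common(1)[0][0]
        correct = sum(rule[s] == t for s, t in zip(states, target))
        if best is None or correct > best[0]:
            best = (correct, gene, rule)
    return best[1], best[2]

def loo_accuracy(columns, target):
    """Leave-one-out accuracy of the one-rule predictor."""
    n = len(target)
    hits = 0
    for j in range(n):
        keep = [i for i in range(n) if i != j]
        train_cols = {g: [s[i] for i in keep] for g, s in columns.items()}
        train_target = [target[i] for i in keep]
        gene, rule = one_rule(train_cols, train_target)
        # Fall back to the majority state if the held-out value is unseen.
        default = Counter(train_target).most_common(1)[0][0]
        hits += rule.get(columns[gene][j], default) == target[j]
    return hits / n

# Hypothetical states: target gene g is always the opposite of gene A.
columns = {"A": ["up", "up", "down", "down", "up", "down"],
           "B": ["up", "down", "up", "down", "down", "up"]}
g = ["down", "down", "up", "up", "down", "up"]

gene, rule = one_rule(columns, g)
print(gene, rule, loo_accuracy(columns, g))
```

In this toy data the extracted rule maps "up" to "down" and "down" to "up", which would be read as gene A affecting g negatively.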

67
Advantages of this method
  • Can identify genes affecting a target gene
  • Don't need discretization thresholds
  • Each data sample is treated as an example
  • Explicit rules can be extracted from the
    classifier (assuming C4.5 or PCL)
  • Generalizable to time series

68
Acknowledgements
69
Data Mining in Microarray Analysis: NOTES
70
References
  • J. Li, L. Wong, Geography of differences between
    two classes of data, Proc. 6th European Conf. on
    Principles of Data Mining and Knowledge
    Discovery, pp. 325--337, 2002
  • J. Li, L. Wong, Identifying good diagnostic genes
    or gene groups from gene expression data by using
    the concept of emerging patterns,
    Bioinformatics, 18:725--734, 2002
  • J. Li et al., A comparative study on feature
    selection and classification methods using a
    large set of gene expression profiles, GIW,
    13:51--60, 2002

71
References
  • E.-J. Yeoh et al., Classification, subtype
    discovery, and prediction of outcome in pediatric
    acute lymphoblastic leukemia by gene expression
    profiling, Cancer Cell, 1:133--143, 2002
  • U. Alon et al., Broad patterns of gene expression
    revealed by clustering analysis of tumor and
    normal colon tissues probed by oligonucleotide
    arrays, PNAS, 96:6745--6750, 1999
  • L. A. Soinov et al., Towards reconstruction of
    gene networks from expression data by supervised
    learning, Genome Biology, 4:R6.1--9, 2003

72
Data Mining of Gene Expression Profiles for the
Diagnosis and Understanding of Diseases

This talk is divided into two parts. In Part I, I
will provide a brief overview of some
accomplishments and challenges in Bioinformatics.
In Part II, I will discuss the data mining in the
analysis of microarray gene expression profiles for
(a) diagnosis of disease state or subtype, (b)
derivation of disease treatment plan, and (c)
understanding of gene interaction networks.