1
Statistics and bioinformatics applied to omics technologies
Part II: Integrating biological knowledge
  • Frédéric Schütz
  • Frederic.Schutz_at_isb-sib.ch

Center for Integrative Genomics, University of Lausanne, Switzerland
Bioinformatics Core Facility, Swiss Institute of Bioinformatics
2
Contents
Slides
  • Class prediction 1-19
  • Gene Ontology analysis 20-29
  • Geneset analysis (GSEA, etc) 30-39

3
Class discovery and class prediction
  • Example: patients from which we obtained
    measurements (e.g. gene expression)

Class discovery: find natural groups in the data (e.g. sets of
patients with similar gene expression)
Class prediction: given previous measurements for which the
grouping is known (red and blue), can we predict
the group to which a new observation belongs?
[Figure: two scatter plots of Gene 2 against Gene 1, one for each task]
4
Why do we want to do class prediction?
  • Many questions in biology and medicine are class
    prediction questions
  • Does a patient have a predisposition for a given
    disease?
  • What is the prognosis for this patient?
  • What will be the response of this patient to a
    given drug?
  • Is this tumour benign or malignant?
  • What type is this tumour?
  • Which treatment should be used?

5
Class prediction: easy case
[Figure: Gene 2 plotted against Gene 1, with a threshold separating the two groups]
Classify everything on one side of the threshold as blue
Classify everything on the other side as red
6
Example
Blue points represent oestrogen receptor (ER)
status positive, as determined by immunohistochemistry.
Pierre Farmer et al. Identification of molecular
apocrine breast tumours by microarray analysis.
Oncogene (2005) 24, 4660-4671
7
Class prediction in practice
[Figure: Gene 2 plotted against Gene 1]
  • The two groups are not perfectly separated (and
    this is still a pretty good case)
  • One variable (gene) is not sufficient to assign
    patients to groups
  • Remember that with microarrays, we are not
    talking about just 2 measurements, but tens of
    thousands.

8
Discrimination in general
  • Goal: assign objects (e.g. patients) to classes
    based on some measurements (e.g. gene expression)
  • Typically, in a microarray setting:
      • 10s or (at best) 100s of patients
      • 10,000s of genes
  • Unsupervised learning: nothing is known about the
    grouping of the data, and we try to find natural
    groups in the data
  • Supervised learning: the classes are predefined;
    we use previously labelled objects to create a
    procedure for classification of future
    observations.

9
Some supervised analysis methods
  • K-nearest neighbours
  • Linear Discriminant Analysis
  • Classification trees
  • Support Vector Machines (SVM)
  • etc.

10
Example: 3-nearest neighbours
[Figure: Gene 2 plotted against Gene 1, with a new unlabelled point]
Red or blue?
11
Example: 3-nearest neighbours
[Figure: Gene 2 plotted against Gene 1, with the three nearest neighbours of the new point]
2 red vs 1 blue: the point is assigned to red
12
K-nearest neighbours
  • Choose a value for k (typical values: 3 or 5); in
    practice it can be chosen using the learning data
    (the value that produces the best result)
  • Find the k observations in the learning set that
    are closest to the new, unknown, observation
  • Predict the class by a majority vote, that is,
    choose the class that is most common among the
    neighbours.
  • Very simple method, with surprisingly good
    performance
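
The three steps above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `knn_predict`, the use of Euclidean distance, and the expression values are assumptions made for the example, not part of the original slides.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, new_point, k=3):
    """Predict the class of new_point by majority vote among its k nearest neighbours."""
    # Distance (Euclidean, an assumption) from the new observation
    # to every observation of the learning set
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Keep the k closest observations
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among the labels of the neighbours
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical (Gene 1, Gene 2) expression values for six labelled patients
train_points = [(1.0, 1.2), (1.5, 0.8), (2.0, 1.8), (3.0, 3.2), (3.5, 2.8), (2.4, 2.6)]
train_labels = ["blue", "blue", "blue", "red", "red", "red"]

# The three nearest neighbours of (2.5, 2.0) are 2 red and 1 blue
prediction = knn_predict(train_points, train_labels, (2.5, 2.0), k=3)
print(prediction)
```

With an odd k and two classes, the vote can never be tied, which is one reason why 3 and 5 are common choices.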

13
Linear Discriminant Analysis
  • Suggested by R.A. Fisher in 1936
  • Procedure to find a linear combination of the
    observed variables that best separates
    (discriminates) two classes of objects.
  • Using the new variable, objects from the same
    class are close together, and objects from
    different classes are further apart.
  • Straightforward to calculate
  • Can easily be extended to more than two classes
  • Similar idea to Principal Component Analysis
    (PCA)
  • Often forgotten in favour of PCA

14
Back to the easy case
[Figure: Gene 2 plotted against Gene 1, with a threshold on the discriminant]
Classify everything on one side of the threshold as blue
(low value of the discriminant)
Classify everything on the other side as red
(high value of the discriminant)
Discriminant = Gene 1
15
Linear Discriminant Analysis: example
[Figure: Gene 2 plotted against Gene 1]
  • The two groups are well separated
  • Neither Gene1 nor Gene2 is able to discriminate
    between the two categories

16
Linear Discriminant Analysis: example
[Figure: Gene 2 plotted against Gene 1, with the direction of L marked from low values to high values]
  • However, the linear combination
  • L Gene1 Gene2
  • discriminates well between the two groups
  • Blue points tend to have smaller L values
  • Red points tend to have bigger L values

17
Linear Discriminant Analysis: example
[Figure: Gene 2 plotted against Gene 1, with the direction of L marked from low values to high values, and a threshold]
  • A threshold is set halfway between the means of
    the two groups
  • Points with a value L above the threshold are
    classified as red
  • Points with a value L below the threshold are
    classified as blue
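
A minimal sketch of this procedure for two genes, assuming Fisher's classical two-class formulation: the weights of the linear combination are obtained from the pooled within-class scatter, and the threshold is placed halfway between the projected group means. The function names and the expression values are hypothetical; in the data below, neither gene separates the groups on its own.

```python
def mean_vector(points):
    """Mean of a list of (Gene 1, Gene 2) observations."""
    n = len(points)
    return [sum(p[0] for p in points) / n, sum(p[1] for p in points) / n]


def scatter_matrix(points, mean):
    """2x2 scatter matrix: sum of outer products of the centred observations."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mean[0], p[1] - mean[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s


def dot(w, x):
    return w[0] * x[0] + w[1] * x[1]


def fit_lda(blue, red):
    """Return the weights w of L = w[0]*Gene1 + w[1]*Gene2 and the threshold."""
    m_blue, m_red = mean_vector(blue), mean_vector(red)
    s_blue, s_red = scatter_matrix(blue, m_blue), scatter_matrix(red, m_red)
    # Pooled within-class scatter
    sw = [[s_blue[i][j] + s_red[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [m_red[0] - m_blue[0], m_red[1] - m_blue[1]]
    # w = inverse(sw) * (m_red - m_blue), written out for the 2x2 case
    w = [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
         (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]
    # Threshold halfway between the projected means of the two groups
    threshold = 0.5 * (dot(w, m_blue) + dot(w, m_red))
    return w, threshold


def classify(w, threshold, point):
    # Red points have larger values of L by construction of w
    return "red" if dot(w, point) > threshold else "blue"


# Hypothetical data: same Gene 1 values in both groups, overlapping Gene 2 ranges
blue = [(1.0, 0.2), (2.0, 1.1), (3.0, 1.9), (4.0, 3.2)]
red = [(1.0, 1.8), (2.0, 3.1), (3.0, 3.9), (4.0, 5.2)]

w, threshold = fit_lda(blue, red)
print(w, threshold)
print(classify(w, threshold, (2.5, 4.0)))
```

Because the weights are estimated from the labelled data, the same code extends to more genes once the 2x2 inverse is replaced by a general linear solve.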

18
Caveats: overfitting
  • It is easy to create classifiers which fit the
    training data perfectly
  • It is harder to find classifiers which still work
    as well when validated on new data
  • A classifier must ALWAYS be tested on data
    independent of the data used to train the
    classifier.
  • This is particularly important in microarray
    analysis:
      • Few samples
      • Many different measurements
  • If you are not careful, it is always possible to
    find a classifier that works well for your
    training data!

19
Caveats: overfitting
[Figure: Gene 2 plotted against Gene 1, with a marked region]
Classify everything in this region as red
  • Perfect classifier for this data
  • Probably not so good with any new data
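
The effect can be reproduced with purely random data. In the sketch below (an illustration with made-up numbers, not taken from the slides), the class labels are assigned at random, so there is nothing to learn; a 1-nearest-neighbour classifier is nevertheless perfect on its own training data, because each patient is its own nearest neighbour, while its accuracy on independent data is expected to be close to chance.

```python
import math
import random

rng = random.Random(0)
N_GENES = 1000      # many measurements
N_PATIENTS = 20     # few samples


def random_patients(n):
    """Random expression profiles with labels unrelated to the profiles."""
    profiles = [[rng.gauss(0.0, 1.0) for _ in range(N_GENES)] for _ in range(n)]
    labels = [rng.choice(["red", "blue"]) for _ in range(n)]
    return profiles, labels


def nearest_label(train_x, train_y, x):
    """1-nearest-neighbour prediction: label of the closest training profile."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], x))
    return train_y[best]


def accuracy(train_x, train_y, test_x, test_y):
    hits = sum(nearest_label(train_x, train_y, x) == y
               for x, y in zip(test_x, test_y))
    return hits / len(test_x)


train_x, train_y = random_patients(N_PATIENTS)
test_x, test_y = random_patients(N_PATIENTS)

# Evaluated on the training data itself: always a perfect score
train_accuracy = accuracy(train_x, train_y, train_x, train_y)
# Evaluated on independent data: no better than guessing, on average
test_accuracy = accuracy(train_x, train_y, test_x, test_y)
print(train_accuracy, test_accuracy)
```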

20
Gene Ontology analysis
  • Many microarray experiments produce lists of
    genes that are significantly differentially
    expressed between two conditions (gene
    comparison).
  • In some (rare) cases, only a few genes are of
    interest, and they can easily be examined and
    validated.
  • In most cases, however, a long list of
    differentially expressed genes is returned, and
    these genes cannot be considered individually.
  • It is harder to obtain biological understanding
    from this data.
  • One strategy: consider the functional annotation
    of the differentially expressed genes.
  • Question: what do these genes have in common that
    could be of interest?

21
Reminder: Gene Ontology (GO) project
  • Collaborative effort to address the need for
    consistent descriptions of gene products in
    different databases.
  • Three structured, controlled vocabularies
    (ontologies) that describe gene products in terms
    of their associated
      • biological processes
      • cellular components
      • molecular functions
    in a species-independent manner.

(From http://www.geneontology.org/)
22
Example
PPARA, NR1C1, PPAR: Peroxisome proliferator-activated
receptor alpha
(TAS: Traceable Author Statement, IPI: Inferred
from Physical Interaction)
(From http://www.geneontology.org/)
23
Example of GO analysis
  • Simple microarray experiment: WT vs KO
  • The microarray has 10,000 genes, 100 of which
    have the GO annotation "fatty acid transport"
  • I obtain 1000 differentially expressed genes (10%
    of all genes)
  • If my experiment has nothing to do with fatty
    acid transport, I expect on average about 10% of
    the fatty acid transport genes (i.e. 10 genes) to
    be differentially expressed.
  • If this proportion is higher, it means the list
    of differentially expressed genes is enriched in
    fatty acid transport genes
  • If the difference is significant, it suggests a
    link between differential expression and this GO
    annotation: genes with this annotation are more
    likely to be differentially expressed than others
  • This indicates that this biological process may
    be related to my KO experiment.

[Figure: 10,000 genes in total, of which 1000 (10%) are differentially expressed and 90% are not]
24
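Number of "fatty acid transport" genes
[Figure: 10,000 genes in total, of which 1000 (10%) are differentially expressed and 90% are not; the 100 "fatty acid transport" genes can be split between the two parts in different ways]
  • 10 (10%) differentially expressed, 90 (90%) not:
    looks like a random distribution, no apparent
    association
  • . . . : ?
  • 100 (100%) differentially expressed, 0 (0%) not:
    strong association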
25
Statistical analysis
  • Assume that I found 20 differentially expressed
    genes with the GO annotation of interest.
  • Count the numbers of genes with the GO annotation
    or not, and compare with differential expression
  • A statistical test such as Fisher's exact test
    can tell us the probability of observing this
    result (or a more extreme one) if there is no
    association between the rows and columns
  • In this case, this probability (p-value) is 0.002
  • This indicates that this biological process may
    be important in the difference between WT and KO.

                       Differentially expressed   Not D.E.   Total
Fatty acid transport                         20         80     100
Others                                      980       8920    9900
Total                                      1000       9000   10000
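
The test on this table can be sketched with the hypergeometric distribution, which is what Fisher's exact test is based on. The function name `enrichment_p_value` is hypothetical, and the sketch computes a one-sided p-value (enrichment only); this choice is an assumption, so the result need not match the value quoted on the slide exactly, although it should be of a similar order of magnitude.

```python
from math import comb


def enrichment_p_value(total_genes, annotated, selected, annotated_selected):
    """One-sided Fisher's exact test.

    Probability of finding at least `annotated_selected` annotated genes
    among `selected` genes drawn at random from `total_genes` genes,
    `annotated` of which carry the annotation.
    """
    denominator = comb(total_genes, selected)
    upper = min(annotated, selected)
    numerator = sum(
        comb(annotated, k) * comb(total_genes - annotated, selected - k)
        for k in range(annotated_selected, upper + 1)
    )
    # Exact integer arithmetic, converted to a float only at the end
    return numerator / denominator


# Numbers from the example: 10,000 genes, 100 annotated,
# 1000 differentially expressed, 20 of them annotated
expected = 1000 * 100 / 10000   # annotated genes expected by chance
p_value = enrichment_p_value(10000, 100, 1000, 20)
print(expected, p_value)
```

Observing 20 annotated genes where 10 are expected is what drives the small p-value; the same function can be called once per GO annotation when many annotations are screened.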
26
In practice
  • One can either suggest a GO annotation and see if
    it is enriched in the list of differentially
    expressed genes
  • Or we may want to go fishing and try all
    potentially interesting GO annotations to see if
    any of them is enriched.
  • Easy to do
  • Multiple services available on the web
  • User indicates the list of genes differentially
    expressed
  • Returns the most significant GO annotations

27
Gene Ontology analysis example. I
  • Microarray with about 22,000 genes
  • We look at the 1% of the genes that are most
    different between different subtypes of cancer.
  • Which processes are likely to be different
    between these subtypes?
  • Those for which more than 1% of the genes are
    differentially expressed are good candidates

Pierre Farmer et al. Identification of molecular
apocrine breast tumours by microarray analysis.
Oncogene (2005) 24, 4660-4671
28
Gene Ontology analysis example. II
  • To apply this GO analysis, we need first to
    define a list of differentially expressed genes.
  • This usually means calculating a score (e.g.
    p-value), and selecting a cut-off point.
  • While there are some traditional cut-off points
    (0.001, 0.01 or the magical 0.05), they remain
    fairly arbitrary
  • Is there really a difference between a gene
    associated with a p-value of 0.049 and another
    one with a p-value of 0.051 ?

29
Gene Ontology analysis example. III
  • Some genes may be differentially expressed, but
    the change may be so small (lost in the noise)
    that it will not appear in the list.
  • However, the difference in expression may appear
    at the level of a set of genes rather than
    individual genes
  • A set of genes may correspond e.g. to co-regulated
    genes, or genes belonging to the same pathway
  • If the change of expression is consistent across
    genes in the set, it may indicate that the set is
    of interest, even if no individual gene shows a
    significant difference.

30
Gene set enrichment analysis (GSEA)
31
Gene set enrichment analysis (GSEA)
  • Series of papers describing a method for
    analyzing the expression of sets of genes
  • Software available, along with a database of
    biologically relevant gene sets
  • Relatively hot topic in bioinformatics/statistics:
    many different papers and methods published on
    the topic, with small or large differences
  • GSEA usually refers to this particular program,
    but sometimes indicates any such method which
    examines sets of genes.

32
Principle of GSEA
  • We have a list of genes sorted according to a
    given measure (score for differential expression,
    correlation to a phenotype, etc)
  • Among this list, we have a smaller set of genes
    of interest (e.g. all belonging to a given
    pathway)
  • Is the smaller set distributed randomly in the
    sorted list of genes ?
  • If yes, the set is less likely to be of interest
  • If no, it may indicate that the function
    represented by the set is linked with the measure.

33
Principle of GSEA (most methods)
[Figure: all genes, sorted from low values (e.g. down-regulated) to high values (e.g. up-regulated), with marks showing the position in the list of the genes of our set of interest]
The locations of the genes of our set of interest
within the list seem random (uniform): the set
does not appear to be linked with differential
expression.
34
Principle of GSEA (most methods)
[Figure: all genes, sorted from low values (e.g. down-regulated) to high values (e.g. up-regulated), shown twice with the position in the list of the genes of our set of interest]
  • Set located among the high values: link with
    up-regulation
  • Set located among the low values: link with
    down-regulation
35
Statistical analysis
  • Random walk
  • The list of genes is walked down from left to
    right
  • Every time a gene belongs to our set S, the score
    goes up
  • Every time a gene does not belong to the set, it
    goes down
  • If the genes of the set are uniformly
    distributed, the score will never go very high
    (an up is soon followed by a down)
  • If the genes are clustered together, the score
    will go higher before getting back to 0.
  • Using a permutation test, a p-value can be
    associated with the gene set.

From fig. 1 of Subramanian et al. PNAS 2005, 102:
15545-15550
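
The random walk and the permutation test can be sketched as follows. This is a simplified, unweighted illustration and not the published GSEA implementation: the step sizes (chosen so that the walk ends at 0), the use of random gene sets of the same size as the null model, and all names and data are assumptions made for the example.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Largest deviation from 0 of the running sum along the ranked list."""
    n_hits = sum(g in gene_set for g in ranked_genes)
    n_miss = len(ranked_genes) - n_hits
    running, best = 0.0, 0.0
    for gene in ranked_genes:
        # Up for a gene of the set, down otherwise; the steps sum to 0 overall
        running += 1.0 / n_hits if gene in gene_set else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Share of random gene sets of the same size scoring at least as high."""
    rng = random.Random(seed)
    observed = abs(enrichment_score(ranked_genes, gene_set))
    size = sum(g in gene_set for g in ranked_genes)
    count = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= observed:
            count += 1
    # Add-one correction so that the estimated p-value is never exactly 0
    return (count + 1) / (n_perm + 1)


# Hypothetical ranked list of 200 genes and two sets of 10 genes
ranked = [f"g{i}" for i in range(200)]
clustered = set(ranked[:10])    # all at one end of the list
spread = set(ranked[::20])      # evenly spread along the list

es_clustered = enrichment_score(ranked, clustered)
es_spread = enrichment_score(ranked, spread)
p_clustered = permutation_p_value(ranked, clustered)
p_spread = permutation_p_value(ranked, spread)
print(es_clustered, es_spread, p_clustered, p_spread)
```

The clustered set drives the running sum far from 0 and gets a small p-value, while the evenly spread set keeps the sum close to 0 and is indistinguishable from a random set.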
36
Statistical analysis
  • How can we summarise and assess an apparent link
    between a set and differential expression ?
  • Each method uses different statistics
  • The original GSEA method was based on the
    Kolmogorov-Smirnov test (comparing the distribution
    of the genes with a uniform distribution)
  • It was later replaced by an Enrichment Score
    (similar but weighted)

37
Example
  • mRNA expression profiles from lymphoblastoid cell
    lines derived from 15 males and 17 females
  • Identify gene sets correlated with the difference
    between males and females

[Table: FDR = False Discovery Rate]
From table 2 of Subramanian et al. PNAS 2005,
102: 15545-15550
38
Example
  • Gene expression patterns from a collection of 50
    cancer cell lines
  • p53 regulates gene expression in response to
    various signals of cellular stress
  • 33 cell lines carry a mutation in the p53 gene,
    and 17 are normal.

From table 2 of Subramanian et al. PNAS 2005,
102: 15545-15550
39
Conclusions
  • Gene set enrichment analysis methods have quickly
    become widespread in the microarray community.
  • Intuitive method
  • Can be used to confirm a known or suspected
    association (use a given gene set)
  • or to go fishing for unknown associations (use
    a database of gene sets)
  • More generally, microarray analysis increasingly
    relies on this external biological knowledge.