Lecture Outline - PowerPoint PPT Presentation

About This Presentation
Title:

Lecture Outline

Slides: 53
Provided by: muld3

Transcript and Presenter's Notes


1
Lecture Outline
  • Introduction
  • Data mining sources
  • GO, InterPro, KEGG, UniProt
  • Tools to do the data mining
  • FatiGO
  • FatiWISE

2
Data mining Microarray results
  • Microarray experiments are done to answer a
    biological question
  • Results generate sets of numbers (intensities)
    which are then clustered to find data points of
    interest
  • These themselves don't necessarily answer the
    research question; they need to be converted to
    biological information first

3
Purpose of data mining
  • Validation of results: understanding why these
    genes are grouped together
  • Using biological information to find significant
    associations of biological terms to sets of genes
  • Understanding of the roles of the genes at the
    molecular level

4
Data mining (1)
Add gene identifiers
-AB02387 -SB07593 -AA00498 -AC008742
-AB083121
5
Data mining (2)
Add gene descriptions
-AB02387 -SB07593 -AA00498 -AC008742
-AB083121
-RNA polymerase -Glycosyl hydrolase
-Phosphofructokinase
-Transcription factor -Glucose transporter
6
Data mining (3)
Add GO terms
-AB02387 -SB07593 -AA00498 -AC008742
-AB083121
-RNA polymerase -Glycosyl hydrolase
-Phosphofructokinase
-Transcription factor -Glucose transporter
-GO0003456 -GO0006783
-GO0142291 -GO0054198 -GO0000234
7
Data mining (4)
-AB02387 -SB07593 -AA00498 -AC008742
-AB083121
-RNA polymerase -Glycosyl hydrolase
-Phosphofructokinase
-Transcription factor -Glucose transporter
-GO0003456 -GO0006783
-GO0142291 -GO0054198 -GO0000234
Add functional annotation
8
Data mining (5)
-AB02387 -SB07593 -AA00498 -AC008742
-AB083121
-RNA polymerase -Glycosyl hydrolase
-Phosphofructokinase
-Transcription factor -Glucose transporter
-GO0003456 -GO0006783
-GO0142291 -GO0054198 -GO0000234
Store results in database
Map onto pathways
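The step-by-step annotation shown on the five slides above can be sketched as a series of lookups keyed on gene identifier. This is only an illustrative sketch: the pairing of each identifier with a description and a GO term is assumed from the order in which they appear on the slides, and the dictionaries are hypothetical stand-ins for real database queries.

```python
# Identifiers, descriptions and GO terms are the placeholder values from the
# slides. Pairing them by position is an assumption made for illustration.
gene_ids = ["AB02387", "SB07593", "AA00498", "AC008742", "AB083121"]

descriptions = dict(zip(gene_ids, [
    "RNA polymerase",
    "Glycosyl hydrolase",
    "Phosphofructokinase",
    "Transcription factor",
    "Glucose transporter",
]))

go_terms = dict(zip(gene_ids, [
    "GO0003456", "GO0006783", "GO0142291", "GO0054198", "GO0000234",
]))


def annotate(ids, descriptions, go_terms):
    """Build one record per gene, adding each layer of annotation in turn."""
    records = []
    for gene_id in ids:
        records.append({
            "id": gene_id,                            # step 1: identifier
            "description": descriptions.get(gene_id),  # step 2: description
            "go": go_terms.get(gene_id),               # step 3: GO term
        })
    return records


annotated = annotate(gene_ids, descriptions, go_terms)
for record in annotated:
    print(record["id"], record["description"], record["go"])
```

The later steps (functional annotation, storing results, pathway mapping) would extend each record in the same way.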
9
Sources of biological information
  • Free text e.g. Medline
  • Using text processing tools
  • Curated repositories e.g. GO, KEGG, UniProt,
    InterPro etc.
  • Using data mining
  • Using tools e.g. FatiGO and FatiWISE

10
Free text mining
  • Advantages
  • Vast amounts of data
  • Many associated terms for each gene
  • Disadvantages
  • Synonyms and acronyms
  • Context information
  • Irrelevant terms
  • Need to divide into entities and relationships to
    structure text

11
Example of problems
  • The Sch9 protein kinase regulates
    Hsp90-dependent signal transduction activity in
    the budding yeast Saccharomyces cerevisiae. This
    interaction was suppressed by decreased signaling
    through the protein kinase A (PKA) signal
    transduction pathway.

Text is unstructured: it needs to be divided into
entities and relationships
12
Example of problems
Protein
Verb
Pathway
  • The Sch9 protein kinase regulates
    Hsp90-dependent signal transduction activity in
    the budding yeast Saccharomyces cerevisiae. This
    interaction was suppressed by decreased signaling
    through the protein kinase A (PKA) signal
    transduction pathway.

Organism
Acronym could be used elsewhere for a different
gene
Negative term used
Some problems can be overcome using statistics for
better detection of entities and relationships
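A naive dictionary lookup shows why the example sentence is hard to structure. The sketch below is an assumption-laden illustration, not a real text-mining tool: the lexicon is hand-made for this one sentence, and treating "suppressed" as the negative term is a guess at what the slide labels.

```python
import re

sentence = (
    "The Sch9 protein kinase regulates Hsp90-dependent signal "
    "transduction activity in the budding yeast Saccharomyces "
    "cerevisiae. This interaction was suppressed by decreased "
    "signaling through the protein kinase A (PKA) signal "
    "transduction pathway."
)

# Hypothetical hand-made lexicon; a real system would use curated resources.
lexicon = {
    "Sch9": "protein",
    "Hsp90": "protein",
    "PKA": "acronym",
    "Saccharomyces cerevisiae": "organism",
    "regulates": "verb",
    "suppressed": "negative verb",
}


def tag_entities(text, lexicon):
    """Return (phrase, label) pairs in the order they occur in the text."""
    hits = []
    for phrase, label in lexicon.items():
        pattern = r"\b" + re.escape(phrase) + r"\b"
        for match in re.finditer(pattern, text):
            hits.append((match.start(), phrase, label))
    return [(phrase, label) for _pos, phrase, label in sorted(hits)]


tags = tag_entities(sentence, lexicon)
for phrase, label in tags:
    print(f"{phrase}: {label}")
```

The lookup finds the strings but says nothing about which gene "PKA" refers to in another paper, or what "This interaction" points back to, which is where statistical methods come in.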
13
Curated repositories
  • These have reliable annotation
  • Annotation is standardised
  • They are usually well structured
  • However, they usually have less annotation
  • Examples: GenBank, GO (FatiGO), UniProt,
    InterPro, KEGG (FatiWISE)

14
Gene Ontology (GO)
  • http://www.geneontology.org
  • Many annotation systems are organism-specific or
    use different levels of granularity
  • GO introduced a standard vocabulary, first used
    for mouse, fly and yeast, but now generic
  • An ontology is a formal specification of terms
    and relationships between them

15
GO Ontologies
  • Molecular function: tasks performed by a gene
    product, e.g. G-protein coupled receptor
  • Biological process: broad biological goals
    accomplished by one or more gene products, e.g.
    G-protein signaling pathway
  • Cellular component: part(s) of a cell of which a
    gene product is a component; includes the
    extracellular environment of cells, e.g. nucleus,
    membrane etc.

16
GO relationships
  • is-a: e.g. mitochondrial membrane is a membrane
  • part of: e.g. nuclear membrane is part of nucleus

DAG structure
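A DAG differs from a tree in that a term may have more than one parent. The sketch below is a toy fragment built from the two slide examples; the extra edges (for instance, nuclear membrane is-a membrane, and the shared root) are assumptions added so that one term has two parents.

```python
# child -> list of (relationship, parent). Toy data, not the real ontology.
edges = {
    "mitochondrial membrane": [("is_a", "membrane")],
    "nuclear membrane": [("is_a", "membrane"), ("part_of", "nucleus")],
    "membrane": [("is_a", "cellular component")],
    "nucleus": [("is_a", "cellular component")],
    "cellular component": [],
}


def ancestors(term, edges):
    """Collect every term reachable by following is_a / part_of edges upward."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in edges.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("nuclear membrane", edges)))
```

Because "nuclear membrane" reaches the root through two different parents, a gene annotated to it is implicitly annotated to both "membrane" and "nucleus".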
17
Current Mappings to GO
  • Consortium mappings -MGD, SGD, RGD,
    FlyBase, TAIR
  • GOA (Gene Ontology Annotation)
  • Swiss-Prot keywords
  • EC numbers
  • InterPro entries
  • Manual mappings
  • Unigene
  • Medline ID mappings, etc.

FatiGO
Evidence codes NB
18
GO Slim
  • Slimmed down version of GO ontologies
  • Selection of high level terms covering all or
    most biological functions, processes and cell
    locations
  • Many different GO Slims available with different
    depths and detail
  • Used to make comparisons between annotated
    gene/protein sets easier (each gene may be mapped
    to different granularity)

19
Applications of GO slim
20
GO consortium page
21
UniProt annotation
  • Protein sequence database from EMBL translations
    and direct sequencing
  • Structured into specific fields e.g. description,
    comments, feature table, keywords
  • Each field may have controlled vocabulary or
    specific syntax
  • Swiss-Prot is well annotated, TrEMBL is not, and
    may have less structured text

22
Example Swiss-Prot entry
Annotation
23
KEGG
  • Kyoto Encyclopedia of Genes and Genomes
  • Molecular interaction networks in biological
    processes -PATHWAY database
  • Genes and proteins -GENES/SSDB/KO databases
  • Chemical compounds and reactions
    -COMPOUND/GLYCAN/REACTION databases
  • Includes most organisms and info on orthologues

24
Example KEGG entry
25
InterPro
  • Integrates protein signature databases e.g. Pfam,
    PROSITE, Prints etc.
  • Classifies proteins into families and domains and
    lists all UniProt proteins belonging to each
  • Provides annotation on the family/domain and
    links to 3D structure, GO, Enzyme Classification
  • Used to functionally characterise a protein

26
Example InterPro entry
27
FatiGO
  • Connecting microarray results with these
    biological data sources answers questions, e.g. do
    my differentially expressed genes have different
    functions?
  • FatiGO is used to extract relevant GO terms for a
    group of genes with respect to a set of reference
    genes (the rest)
  • Can be used to list proportions of GO terms in a
    set of genes

http://fatigo.bioinfo.cnio.es
28
FatiGO data sources
  • Uses tables of correspondences between genes and
    their GO terms (human, mouse, Drosophila, yeast,
    worm and UniProt proteins; curated if possible)
  • Uses genes from GenBank, UniProt
    (Swiss-Prot/TrEMBL), Ensembl etc.
  • Problem: lack of standardisation of names. They
    use EBI xrefs to link them, and for other
    databases they use their own gene IDs
  • For GO associations they include GO evidence
    codes, e.g. IEA

29
Using the GO hierarchy
  • Different levels in the GO hierarchy can be
    chosen, depending on specificity required
  • FatiGO suggests using level 3 (questionable?)
  • The deeper you go (more specific), the fewer
    genes are annotated to the terms
  • Once the level is set, for each gene FatiGO moves
    up the hierarchy until the set level is reached.
    This increases the no. of terms mapped to this
    level, making it easier to find relevance in
    different distributions of GO terms
  • Repeated genes are counted once
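Moving annotations up to a chosen level can be sketched as follows. This is a simplified illustration, not FatiGO's code: the hierarchy is a hypothetical tree with single parents and made-up term names, with the root counted as level 1, whereas real GO terms can have several parents and therefore several possible levels.

```python
# Hypothetical toy hierarchy: term -> parent (None marks the root, level 1).
parent = {
    "root": None,
    "A": "root", "B": "root",
    "A1": "A", "A2": "A", "B1": "B",
    "A1a": "A1",
}


def path_from_root(term, parent):
    """List of terms from the root down to the given term."""
    path = []
    while term is not None:
        path.append(term)
        term = parent[term]
    return path[::-1]


def map_to_level(term, parent, level):
    """Ancestor of the term at the requested level, or None if too shallow."""
    path = path_from_root(term, parent)
    if len(path) < level:
        return None
    return path[level - 1]


def genes_per_term(annotations, parent, level):
    """Count genes per term at the chosen level, counting each gene once."""
    genes_by_term = {}
    for gene, terms in annotations.items():
        mapped = {map_to_level(term, parent, level) for term in terms}
        mapped.discard(None)
        for term in mapped:
            genes_by_term.setdefault(term, set()).add(gene)
    return {term: len(genes) for term, genes in genes_by_term.items()}


annotations = {
    "gene1": ["A1a", "A2"],  # both map to A, so gene1 counts once for A
    "gene2": ["A1"],
    "gene3": ["B1"],
    "gene4": ["root"],       # annotated above level 2, so dropped
}
counts = genes_per_term(annotations, parent, level=2)
print(counts)
```

Terms that were too sparse at deeper levels pool their genes at the chosen level, which is what makes differences in distribution easier to detect.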

30
How FatiGO works
  • Given two sets of genes, and selected GO level
  • Retrieves GO terms for each gene on correct level
  • Applies Fisher's exact test for 2x2 contingency
    tables for comparing 2 sets of genes (to get
    p-values)
  • Extracts GO terms with significantly different
    distributions
  • After correcting for multiple testing, provides
    adjusted p-values from 3 methods:
  • Step-down minP method (Westfall and Young)
  • FDR independent (Benjamini and Hochberg)
  • FDR arbitrary dependent (Benjamini and Yekutieli)
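The per-term test can be sketched with a small standard-library implementation of the two-sided Fisher's exact test. This is an illustrative sketch rather than FatiGO's own code; the table layout (rows are the two gene sets, columns are genes with and without the GO term) and the example counts are assumptions.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    a, b: genes in set 1 with / without the term
    c, d: genes in set 2 with / without the term
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total_tables = comb(row1 + row2, col1)

    def table_probability(x):
        # Hypergeometric probability of x annotated genes in set 1,
        # with all row and column totals held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    observed = table_probability(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_value = 0.0
    for x in range(lowest, highest + 1):
        probability = table_probability(x)
        # Sum every table that is as likely as, or less likely than, the
        # observed one (small tolerance for floating-point ties).
        if probability <= observed * (1 + 1e-7):
            p_value += probability
    return min(1.0, p_value)


# Hypothetical counts: 60 of 100 genes annotated in set 1, 20 of 100 in set 2.
p_different = fisher_exact_two_sided(60, 40, 20, 80)
# Hypothetical counts: 20 of 100 genes annotated in both sets.
p_same = fisher_exact_two_sided(20, 80, 20, 80)
print(p_different, p_same)
```

One such p-value is computed for every GO term at the selected level, which is why the multiple-testing corrections listed above are needed.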

31
Testing sets of GO terms
Comparison of GO term distributions in gene set 1
and gene set 2
  • Transport: 60 vs 20. Significantly higher
    distribution in set 1 than in set 2
  • Regulation: 20 vs 20. Same distribution
  • The test considers the observed difference and
    possible stronger differences
32
Multiple testing
  • P-value is the probability, under the null
    hypothesis, of obtaining the observed result or
    a more extreme one
  • Testing multiple null hypotheses (one per GO
    term) that there is no difference in the
    frequency of terms in each set
  • For 1 test, the type I error rate (probability of
    rejecting a true null hypothesis) is 0.05, but
    for multiple tests the family-wise error rate
    (probability that one or more of the rejected
    nulls are true) increases
  • Multiple testing correction allows control of the
    Family Wise Error Rate (FWER) and the False
    Discovery Rate (FDR)
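How quickly the family-wise error rate grows can be shown with a short calculation. The sketch assumes the tests are independent, which is a simplification (GO terms are related through the hierarchy), so the numbers are indicative only.

```python
def family_wise_error_rate(alpha, n_tests):
    """P(at least one false rejection) for n independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 10, 100):
    rate = family_wise_error_rate(0.05, n_tests)
    print(f"{n_tests:>3} tests: FWER = {rate:.3f}")
```

With one test per GO term, even a modest number of terms makes at least one false positive very likely unless the p-values are adjusted.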

33
Step down min-P method
  • Controls FWER
  • Procedure with a test statistic equivalent to
    Fisher's exact test for 2x2 contingency tables
  • No. of random permutations set at 10000
  • Examines how many of the permuted p-values are
    smaller than the one under consideration
  • Adjusted p-value for hypothesis H is level of
    entire test set procedure at which H would be
    rejected, given values of all test statistics
    involved
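The counting step can be sketched as below. This is a simplified illustration of a step-down minP adjustment, not FatiGO's implementation: it assumes the p-values for each permutation have already been computed (one row per permutation, one column per GO term) and uses a tiny hypothetical matrix instead of 10000 permutations.

```python
def step_down_minp(raw_p, perm_p):
    """Step-down minP adjusted p-values.

    raw_p:  observed p-values, one per hypothesis
    perm_p: list of rows; each row holds the p-values of all hypotheses
            recomputed on one random permutation of the gene labels
    """
    m = len(raw_p)
    order = sorted(range(m), key=lambda i: raw_p[i])  # most significant first
    counts = [0] * m
    for row in perm_p:
        running_min = 1.0
        # Successive minima, from the least to the most significant hypothesis.
        for j in range(m - 1, -1, -1):
            running_min = min(running_min, row[order[j]])
            if running_min <= raw_p[order[j]]:
                counts[j] += 1
    adjusted_sorted = [count / len(perm_p) for count in counts]
    # Enforce monotonicity so adjusted values follow the raw ordering.
    for j in range(1, m):
        adjusted_sorted[j] = max(adjusted_sorted[j], adjusted_sorted[j - 1])
    adjusted = [0.0] * m
    for j, i in enumerate(order):
        adjusted[i] = adjusted_sorted[j]
    return adjusted


# Hypothetical values for three GO terms and four permutations.
raw_p = [0.01, 0.20, 0.03]
perm_p = [
    [0.5, 0.6, 0.02],
    [0.9, 0.1, 0.8],
    [0.3, 0.15, 0.7],
    [0.005, 0.9, 0.6],
]
adjusted = step_down_minp(raw_p, perm_p)
print(adjusted)
```

In practice each permutation row would come from shuffling the gene-set labels and rerunning the per-term test.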

34
Controlling False Discovery Rate
  • Tends to be more liberal than controlling FWER
  • Controls the expected proportion of false
    rejections (Type I errors) among rejected
    hypotheses
  • Consider the proportion of erroneous rejections
    to the total number of rejections. The average
    value of this proportion is the FDR
  • FDR control can assume independent test
    statistics or allow dependence; FatiGO gives:
  • adjusted p-value using the FDR method of
    Benjamini and Hochberg: control of FDR under
    independence
  • adjusted p-value using the FDR method of
    Benjamini and Yekutieli: control of FDR under
    arbitrary dependence structures
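Both FDR adjustments can be sketched in a few lines. This is a generic illustration of the Benjamini and Hochberg and Benjamini and Yekutieli adjusted p-values, not FatiGO's own code; the input p-values are made up.

```python
def bh_adjust(p_values):
    """Benjamini and Hochberg adjusted p-values (FDR under independence)."""
    m = len(p_values)
    # Walk from the largest p-value to the smallest, keeping a running minimum.
    order = sorted(range(m), key=lambda i: p_values[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for position, i in enumerate(order):
        rank = m - position  # rank of this p-value in ascending order
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def by_adjust(p_values):
    """Benjamini and Yekutieli adjusted p-values (arbitrary dependence)."""
    m = len(p_values)
    # Extra penalty: the harmonic sum 1 + 1/2 + ... + 1/m.
    penalty = sum(1.0 / k for k in range(1, m + 1))
    return [min(1.0, p * penalty) for p in bh_adjust(p_values)]


p_values = [0.01, 0.04, 0.03, 0.005]  # hypothetical unadjusted p-values
bh = bh_adjust(p_values)
by = by_adjust(p_values)
print(bh)
print(by)
```

The Benjamini and Yekutieli values are never smaller than the Benjamini and Hochberg ones, which is the price of dropping the independence assumption.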

35
Using FatiGO -Input
  • Search for Unigene cluster ID, or specific gene
    IDs
  • Input results from SotaTree or Pomelo
  • Or input Excel or text file with list of gene or
    protein IDs, each on a new line
  • Input reference set of genes
  • Select GO ontology and level (inclusive)
  • Select whether multiple test should include
    adjusted p-values for minP test
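A text input of this kind (one identifier per line) can be prepared with a few lines of code. The sketch below is an assumption about a sensible pre-processing step, not part of FatiGO: it drops blank lines and repeated identifiers, since repeated genes are counted once, and uses an in-memory sample in place of a real file.

```python
import io


def read_gene_ids(handle):
    """Read one gene or protein ID per line, skipping blanks and repeats."""
    seen = set()
    gene_ids = []
    for line in handle:
        gene_id = line.strip()
        if gene_id and gene_id not in seen:
            seen.add(gene_id)
            gene_ids.append(gene_id)
    return gene_ids


# In-memory stand-in for a text file; the IDs are the slide placeholders.
sample = io.StringIO("AB02387\nSB07593\n\nAB02387\nAA00498\n")
query_ids = read_gene_ids(sample)
print(query_ids)
```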

36
FatiGO interface (1)
37
FatiGO interface (2)
38
FatiGO output
  • FatiGO returns four columns: the unadjusted
    p-value (p-value from Fisher's exact test without
    adjusting for multiple comparisons) and adjusted
    p-values based on the three methods
  • Results are ordered by increasing value of the
    adjusted p-value, facilitating the selection of
    GO terms with the most significant differences.
  • P-value of 0.01-0.05: some evidence; 0.01-0.001:
    strong evidence; and < 0.001: very strong
    evidence against the null
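The ordering and the rule of thumb for reading p-values can be sketched as follows. The rows are hypothetical, the handling of values exactly on a boundary is an assumption, and the label for p-values above 0.05 is added here for completeness.

```python
def evidence_label(p_value):
    """Rule-of-thumb strength of evidence against the null hypothesis."""
    if p_value < 0.001:
        return "very strong"
    if p_value < 0.01:
        return "strong"
    if p_value <= 0.05:
        return "some"
    return "not significant"  # assumption: label not given on the slide


# Hypothetical output rows, one per GO term.
rows = [
    {"term": "term A", "unadjusted": 0.0004, "adjusted": 0.004},
    {"term": "term B", "unadjusted": 0.03, "adjusted": 0.2},
    {"term": "term C", "unadjusted": 0.00001, "adjusted": 0.0002},
]

# Order by increasing adjusted p-value, as in the results table.
ranked = sorted(rows, key=lambda row: row["adjusted"])
report = [(row["term"], evidence_label(row["adjusted"])) for row in ranked]
for term, label in report:
    print(term, label)
```

Note that term B looks significant before adjustment (0.03) but not after, which is the reason to read the adjusted columns.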

39
FatiGO example output
Query set
Reference set
Unadjusted p-value
FDR (indep) adjusted
FDR (depend) adjusted
41
Link to AmiGO
42
Other features of FatiGO
  • You can input a list of genes and extract the GO
    terms sorted by percentages
  • You can use GO results as a way to find
    differentially expressed genes: see if, after
    correcting for multiple testing, some GO terms
    are overrepresented (provides more resolution
    where p-value has no meaning)
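Listing GO terms sorted by percentage for a single gene list can be sketched like this. The gene names and term assignments are hypothetical, and each gene is counted at most once per term.

```python
from collections import Counter


def term_percentages(annotations):
    """Percentage of genes in the set annotated to each term, highest first."""
    n_genes = len(annotations)
    counts = Counter(
        term for terms in annotations.values() for term in set(terms)
    )
    table = [(term, 100.0 * count / n_genes) for term, count in counts.items()]
    return sorted(table, key=lambda item: (-item[1], item[0]))


# Hypothetical annotations: gene -> GO terms at the chosen level.
annotations = {
    "gene1": ["transport", "regulation"],
    "gene2": ["transport"],
    "gene3": ["transport", "transport"],  # repeated term counted once
    "gene4": [],                          # unannotated gene still in the total
}
percentages = term_percentages(annotations)
for term, percent in percentages:
    print(f"{term}: {percent:.0f}%")
```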

43
Percentages of GO terms within a set of genes
44
FatiWISE
  • Data mining to retrieve additional biological
    info on InterPro motifs, KEGG pathways and
    Swiss-Prot keywords
  • Uses Fisher's exact test for 2x2 contingency
    tables for comparing two sets of genes and
    finding significantly different distributions
  • Corrects for multiple testing to get adjusted
    p-value
  • Can get stats for one set of genes or compare 2
    sets

45
FatiWISE input and output
  • Data sources: KEGG, InterPro, UniProt
  • Input
  • one or two sets of genes
  • Selection of organism (for pathway)
  • Output
  • Unadjusted p-value
  • Step-down min P adjusted p-value
  • FDR (arbitrary dependent) adjusted p-value

46
FatiWISE interface
47
FatiWISE InterPro output
48
FatiWISE KEGG output
49
FatiWISE keyword output
50
Summary
  • Data mining is used to bring the biology into
    results
  • Curated data sources are the best for this, due
    to structure and controlled vocabulary
  • FatiGO and FatiWISE are simple web tools enabling
    data mining on 1 or 2 sets of genes
  • Exercises: http://cbio.uct.ac.za/courses/MicroDM/

51
Websites for Annotation
  • Webgestalt: http://genereg.ornl.gov/webgestalt/login.php
  • Fatigo: http://babelomics.bioinfo.cipf.es/

52
Websites for Sequence Analysis and Motif Finding
  • Martview: http://www.ensembl.org/Multi/martview
  • TOUCAN: http://homes.esat.kuleuven.be/~saerts/software/tutorial1/TOUCAN_Tutorial_Overview.html
  • SeqVista: http://zlab.bu.edu/SeqVISTA/tutorials/motif.htm
  • Mitra: http://fluff.cs.columbia.edu:8080/domain/mitra.html
  • Spex: http://ep.ebi.ac.uk/EP/SPEXS/
  • Gene Expression Analysis: http://geneontology.org/GO.tools.microarray.shtml