Lecture Outline - PowerPoint PPT Presentation

Provided by: mul47

1
Lecture Outline
  • Introduction
  • Analysing biological information for gene sets
  • Predictors and signatures
  • Data mining sources: GO, UniProt, InterPro, KEGG
  • Tools to do the data mining: FatiGO (Babelomics,
    part of GEPAS), pathway tools

2
Data mining Microarray results
  • Microarray experiments are done to answer a
    biological question
  • Results generate sets of numbers (intensities)
    which are then clustered to find data points of
    interest
  • These themselves don't necessarily answer the
    research question; they need to be converted to
    biological information first

3
Purpose of data mining
  • Validation of results: understanding why these
    genes are grouped together
  • Using biological information to find significant
    associations between biological terms and sets of
    genes
  • Understanding of the roles of the genes at the
    molecular level

4
Data mining
Add gene identifiers
  • Gene identifiers: AB02387, SB07593, AA00498,
    AC008742, AB083121
5
Data mining
Add gene descriptions
  • Gene identifiers: AB02387, SB07593, AA00498,
    AC008742, AB083121
  • Descriptions: RNA polymerase, glycosyl hydrolase,
    phosphofructokinase, transcription factor,
    glucose transporter
6
Data mining
Add GO terms
  • Gene identifiers: AB02387, SB07593, AA00498,
    AC008742, AB083121
  • Descriptions: RNA polymerase, glycosyl hydrolase,
    phosphofructokinase, transcription factor,
    glucose transporter
  • GO terms: GO:0003456, GO:0006783, GO:0142291,
    GO:0054198, GO:0000234
7
Data mining
Add functional annotation
  • Gene identifiers: AB02387, SB07593, AA00498,
    AC008742, AB083121
  • Descriptions: RNA polymerase, glycosyl hydrolase,
    phosphofructokinase, transcription factor,
    glucose transporter
  • GO terms: GO:0003456, GO:0006783, GO:0142291,
    GO:0054198, GO:0000234
8
Data mining
Map onto pathways
  • Gene identifiers: AB02387, SB07593, AA00498,
    AC008742, AB083121
  • Descriptions: RNA polymerase, glycosyl hydrolase,
    phosphofructokinase, transcription factor,
    glucose transporter
  • GO terms: GO:0003456, GO:0006783, GO:0142291,
    GO:0054198, GO:0000234
9
FatiGO/Babelomics tools
  • Aim to identify a set of differentially expressed
    genes
  • Then see if there is an enrichment of a type of
    biological label in this set compared to the
    background (rest)
  • Biological label could be e.g. GO terms,
    functional assignment, pathways etc.

10
Sources of biological information
  • Free text, e.g. Medline
      • Using text processing tools
  • Curated repositories, e.g. GO, KEGG, UniProt,
    InterPro etc.
      • Using data mining
      • Using tools, e.g. FatiGO

11
Free text mining
  • Advantages
      • Vast amounts of data
      • Many associated terms for each gene
  • Disadvantages
      • Synonyms and acronyms
      • Context information
      • Irrelevant terms
      • Need to divide into entities and relationships
        to structure text

12
Example of problems
  • The Sch9 protein kinase regulates
    Hsp90-dependent signal transduction activity in
    the budding yeast Saccharomyces cerevisiae. This
    interaction was suppressed by decreased signaling
    through the protein kinase A (PKA) signal
    transduction pathway.

Text is unstructured: it needs to be divided into
entities and relationships
13
Example of problems
  • The Sch9 protein kinase regulates
    Hsp90-dependent signal transduction activity in
    the budding yeast Saccharomyces cerevisiae. This
    interaction was suppressed by decreased signaling
    through the protein kinase A (PKA) signal
    transduction pathway.

Labels marked on the sentence:
  • Protein
  • Verb
  • Pathway
  • Organism
  • Acronym: could be used elsewhere for a different
    gene
  • Negative term used

Some problems are overcome using statistics: better
detection of entities and relationships
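
Dividing the example sentence into entities can be sketched with a simple dictionary lookup. This is only an illustration: the lexicon, the entity types and the function name are hypothetical, and real text-mining tools use far richer methods. It shows how a full name and its acronym can be resolved to the same canonical entity.

```python
import re

# Hypothetical mini-lexicon: surface form -> (entity type, canonical name).
LEXICON = {
    "Sch9": ("protein", "Sch9"),
    "Hsp90": ("protein", "Hsp90"),
    "protein kinase A": ("protein", "PKA"),
    "PKA": ("protein", "PKA"),
    "Saccharomyces cerevisiae": ("organism", "Saccharomyces cerevisiae"),
}

def tag_entities(text):
    """Return (start, surface form, type, canonical name) for each lexicon hit."""
    hits = []
    for surface, (etype, canonical) in LEXICON.items():
        # Word boundaries stop "PKA" from matching inside a longer token.
        pattern = r"\b" + re.escape(surface) + r"\b"
        for match in re.finditer(pattern, text):
            hits.append((match.start(), surface, etype, canonical))
    return sorted(hits)

sentence = (
    "The Sch9 protein kinase regulates Hsp90-dependent signal "
    "transduction activity in the budding yeast Saccharomyces cerevisiae. "
    "This interaction was suppressed by decreased signaling through the "
    "protein kinase A (PKA) signal transduction pathway."
)

entities = tag_entities(sentence)
for start, surface, etype, canonical in entities:
    print(start, surface, etype, canonical)
```

A lookup like this cannot tell whether "PKA" refers to the same gene in another document, nor that "suppressed" reverses the meaning, which is where the statistical methods mentioned on the slide come in.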
14
Curated repositories
  • These have reliable annotation
  • Annotation is standardised
  • They are usually well structured
  • However, they usually have less annotation
  • Examples: GenBank, GO, UniProt, InterPro, KEGG

15
Gene Ontology (GO)
  • http://www.geneontology.org
  • Many annotation systems are organism-specific or
    use different levels of granularity
  • GO introduced a standard vocabulary, first used
    for mouse, fly and yeast, but now generic
  • An ontology is a formal specification of terms
    and relationships between them

16
GO Ontologies
  • Molecular function: tasks performed by a gene
    product, e.g. G-protein coupled receptor
  • Biological process: broad biological goals
    accomplished by one or more gene products, e.g.
    G-protein signaling pathway
  • Cellular component: part(s) of a cell of which a
    gene product is a component; includes the
    extracellular environment of cells, e.g. nucleus,
    membrane etc.

17
GO relationships
  • is-a e.g. mitochondrial membrane is a membrane
  • part of e.g. nuclear membrane is part of nucleus

DAG structure
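
The is-a and part-of relationships form a directed acyclic graph (DAG), in which a term can have more than one parent. A minimal sketch of walking such a graph upwards follows. The tiny ontology fragment is a hypothetical simplification built around the two examples on the slide, and is not the real GO structure.

```python
# Hypothetical ontology fragment: child -> list of (relationship, parent).
PARENTS = {
    "mitochondrial membrane": [("is_a", "membrane"), ("part_of", "mitochondrion")],
    "nuclear membrane": [("is_a", "membrane"), ("part_of", "nucleus")],
    "membrane": [("is_a", "cellular component")],
    "mitochondrion": [("is_a", "cellular component")],
    "nucleus": [("is_a", "cellular component")],
    "cellular component": [],
}

def ancestors(term):
    """Collect every term reachable by following is_a / part_of edges upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for _relationship, parent in PARENTS.get(current, []):
            if parent not in seen:  # a DAG can reach the same parent twice
                seen.add(parent)
                stack.append(parent)
    return seen

print(sorted(ancestors("nuclear membrane")))
```

Because a gene annotated to a specific term is implicitly annotated to all of its ancestors, a traversal like this is what lets annotations at different depths be compared.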
18
Current Mappings to GO
  • Consortium mappings: MGD, SGD, RGD,
    FlyBase, TAIR
  • GOA (Gene Ontology Annotation)
  • Swiss-Prot keywords
  • EC numbers
  • InterPro entries
  • Manual mappings
  • Medline ID mappings, etc.

FatiGO
Evidence codes NB
19
GO Slim
  • Slimmed down version of GO ontologies
  • Selection of high-level terms covering all or
    most biological functions, processes and cell
    locations
  • Many different GO Slims available with different
    depths and detail
  • Used to make comparisons between annotated
    gene/protein sets easier (each gene may be mapped
    to a different level of granularity)

20
UniProt annotation
  • Protein sequence database from EMBL translations
    and direct sequencing
  • Structured into specific fields e.g. description,
    comments, feature table, keywords
  • Each field may have controlled vocabulary or
    specific syntax
  • Swiss-Prot is well annotated, TrEMBL is not, and
    may have less structured text

21
Example Swiss-Prot entry
Annotation
22
KEGG
  • Kyoto Encyclopedia of Genes and Genomes
  • Molecular interaction networks in biological
    processes: PATHWAY database
  • Genes and proteins: GENES/SSDB/KO databases
  • Chemical compounds and reactions:
    COMPOUND/GLYCAN/REACTION databases
  • Includes most organisms and info on orthologues

23
Example KEGG entry
24
InterPro
  • Integrates protein signature databases, e.g. Pfam,
    PROSITE, PRINTS etc.
  • Classifies proteins into families and domains and
    lists all UniProt proteins belonging to each
  • Provides annotation on the family/domain and
    links to 3D structure, GO, Enzyme Classification
  • Used to functionally characterise a protein

25
Example InterPro entry
26
Babelomics -FatiGO
  • Connecting microarray results with these
    biological data sources answers questions, e.g. do
    my differentially expressed genes have similar
    functions?
  • FatiGO is used to extract relevant GO terms,
    InterPro results, KEGG pathways etc. for a group
    of genes with respect to a set of reference genes
    (the rest)
  • Can also be used to list proportions of GO terms
    in a set of genes

http://babelomics.bioinfo.cipf.es/fatigoplus/cgi-bin/fatigoplus.cgi
27
FatiGO data sources
  • Uses tables of correspondences between genes and
    their GO terms or biological labels (human,
    mouse, Drosophila, yeast, worm and UniProt
    proteins)
  • Uses genes from GenBank, UniProt
    (Swiss-Prot/TrEMBL), Ensembl etc.
  • Problem: names are not standardised. EBI xrefs
    are used to link them, and for other databases
    their own gene IDs are used
  • For GO associations they include GO evidence
    codes, e.g. IEA

28
Using the GO hierarchy
  • GO terms are tested from level 3 to depth 9 and
    only the deepest significant term is reported for
    each branch of the GO hierarchy
  • The deeper you go (more specific), the fewer genes
    are annotated to the terms
  • For each level, FatiGO moves up the hierarchy
    until the set level is reached. This increases the
    number of terms mapped to this level, making it
    easier to find relevance in different
    distributions of GO terms
  • Repeated genes are counted once
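
Moving annotations up to a set level, and counting each gene once, can be sketched as follows. The hierarchy, gene names and level numbering are hypothetical, and for simplicity each term has a single parent, whereas real GO terms can have several. This is not FatiGO's own code.

```python
from collections import defaultdict

# Hypothetical toy hierarchy with one parent per term (real GO is a DAG).
PARENT = {
    "biological process": None,            # level 1 (root)
    "localization": "biological process",  # level 2
    "transport": "localization",           # level 3
    "ion transport": "transport",          # level 4
    "cation transport": "ion transport",   # level 5
}

def level(term):
    """Level 1 is the root; each step away from the root adds one."""
    depth = 1
    while PARENT[term] is not None:
        term = PARENT[term]
        depth += 1
    return depth

def lift_to_level(term, target_level):
    """Move up the hierarchy until the term sits at target_level or above."""
    while level(term) > target_level:
        term = PARENT[term]
    return term

def genes_per_term(annotations, target_level):
    """annotations: iterable of (gene, term). Each gene counts once per term."""
    genes = defaultdict(set)
    for gene, term in annotations:
        genes[lift_to_level(term, target_level)].add(gene)
    return {term: len(members) for term, members in genes.items()}

annotations = [
    ("geneA", "cation transport"),
    ("geneA", "ion transport"),  # same gene, lifts to the same level-3 term
    ("geneB", "ion transport"),
    ("geneC", "transport"),
]
counts = genes_per_term(annotations, target_level=3)
print(counts)
```

Using a set per term is what ensures that geneA, annotated twice below "transport", is counted only once at level 3.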

29
How FatiGO works (1)
  • Given two sets of genes, and selected biological
    label(s)
  • Retrieves label (e.g. GO terms) for each gene
  • Applies Fisher's exact test for 2x2 contingency
    tables for comparing 2 sets of genes (to get
    p-values)
  • Extracts labels with significantly different
    distributions

30
Testing sets of GO terms
Term         Gene set 1   Gene set 2   Interpretation
Transport    60           20           Significantly higher distribution in 1 than 2
Regulation   20           20           Same distribution

Observed difference and possible stronger
differences
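
Fisher's exact test on a 2x2 table, as named in "How FatiGO works (1)", can be sketched with the standard library alone. The function name is hypothetical, and treating the 60 and 20 above as counts out of 100 genes per set is an assumption made only for illustration; the slide does not give units or set sizes.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    a: genes in set 1 with the label,  b: genes in set 1 without it
    c: genes in set 2 with the label,  d: genes in set 2 without it
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x labelled genes falling in set 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the observed table and every table at least as unlikely
    # (the "possible stronger differences").
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )

# Assumed counts: 100 genes per set.
p_transport = fisher_exact_two_sided(60, 40, 20, 80)
p_regulation = fisher_exact_two_sided(20, 80, 20, 80)
print(p_transport, p_regulation)
```

Under these assumed counts the Transport p-value is very small, while the Regulation p-value is 1 because the two sets have identical proportions.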
31
Multiple testing
  • The p-value is the probability, under the null
    hypothesis, of obtaining the observed result or a
    more extreme one
  • Multiple null hypotheses are tested (one per GO
    term): that there is no difference in the
    frequency of the term between the two sets
  • For 1 test, the type I error rate (probability of
    rejecting a true null hypothesis) is 0.05, but
    for multiple tests this increases: the family-wise
    error rate is the probability that one or more of
    the rejected nulls are true
  • Multiple testing correction allows control of the
    family-wise error rate (FWER) and the false
    discovery rate (FDR)
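
How quickly the family-wise error rate grows can be shown with a short calculation. It assumes that the tests are independent and that every null hypothesis is true, which is a simplification, since GO terms overlap heavily. The function name is hypothetical.

```python
def fwer_independent(alpha, n_tests):
    """Probability of at least one false rejection among n independent true nulls."""
    return 1 - (1 - alpha) ** n_tests

for n_tests in (1, 10, 100):
    print(n_tests, round(fwer_independent(0.05, n_tests), 3))
```

With 10 tests at 0.05 the chance of at least one false positive is already about 0.40, and with 100 tests it exceeds 0.99.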

32
How FatiGO works (2)
  • Corrects for multiple testing, providing
    adjusted p-values from 3 procedures
  • Step-down minP method (Westfall and Young):
    controls FWER
  • FDR: controls the expected proportion of false
    rejections (Type I errors) among rejected
    hypotheses
      • Independent tests (Benjamini-Hochberg)
      • Arbitrary dependence (Benjamini-Yekutieli)
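
The two FDR procedures can be sketched as one function that returns adjusted p-values. This is a minimal sketch of the standard Benjamini-Hochberg and Benjamini-Yekutieli adjustments, not FatiGO's own code, and the function and parameter names are hypothetical. The Westfall and Young minP method is not shown because it relies on permutation resampling.

```python
def fdr_adjust(pvalues, dependent=False):
    """Benjamini-Hochberg adjusted p-values; Benjamini-Yekutieli if dependent=True."""
    m = len(pvalues)
    # Benjamini-Yekutieli multiplies by the harmonic sum 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / k for k in range(1, m + 1)) if dependent else 1.0
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
bh = fdr_adjust(raw)
by = fdr_adjust(raw, dependent=True)
print(bh)
print(by)
```

The Benjamini-Yekutieli values are never smaller than the Benjamini-Hochberg ones, which is the price of allowing arbitrary dependence between tests.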

33
Controlling False Discovery Rate
  • Tends to be more liberal than controlling FWER
  • Controls the expected proportion of false
    rejections (Type I errors) among rejected
    hypotheses
  • Consider the proportion of erroneous rejections
    among the total number of rejections. The average
    (expected) value of this proportion is the FDR
  • FatiGO calculates the FDR

34
Using FatiGO -Input
  • Input results from SotaTree
  • Or input Excel or text file with list of gene or
    protein IDs, each on a new line
  • Input reference set of genes
  • Select biological label to analyse
  • Select organism

35
FatiGO interface for a single gene set
36
FatiGO interface for comparing gene sets
Query set
Ref set
Different biological labels to compare
37
Example output summary
38
For Biological process, list of GO terms at
different levels that are significant
Significant genes at lowest part of hierarchy
39
Query set
Reference set
Unadjusted p-value
FDR (indep) adjusted
40
P-values
  • P-values
      • < 0.05: significant
      • 0.01-0.05: some evidence
      • 0.01-0.001: strong evidence
      • < 0.001: very strong evidence against the null
  • If you do not have any a priori hypothesis on
    the biological process in a cluster of genes, look
    at the second column: FDR-adjusted p-values

41
Additional pathway tools
  • Cytoscape
      • http://www.cytoscape.org/
      • Install locally, can display expression data
  • MapMan
      • http://gabi.rzpd.de/projects/MapMan/
      • Displays large datasets onto diagrams of
        metabolic pathways
  • Reactome's SkyPainter
      • http://www.reactome.org/cgi-bin/skypainter2

42
SkyPainter
  • Reactome is a curated repository of pathways in
    eukaryotes
  • SkyPainter allows you to enter a gene list and
    retrieve significantly overrepresented pathways
    or reactions
  • Given M genes in a reaction out of X genes in the
    organism, and a submitted list of genes of which
    N are involved in this event, it calculates the
    probability of picking N or more genes involved
    in the event by chance
  • Not corrected for multiple testing
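
The probability of picking N or more genes by chance is a hypergeometric tail, which can be sketched as below. M, X and N follow the slide; K, the size of the submitted list, is a name introduced here because the slide leaves it unnamed. This is an illustration of the calculation described, not SkyPainter's actual code.

```python
from math import comb

def prob_n_or_more(X, M, K, N):
    """P(at least N of K submitted genes fall in an event with M of X genes)."""
    upper = min(M, K)
    favourable = sum(
        comb(M, n) * comb(X - M, K - n)  # n genes from the event, the rest outside
        for n in range(N, upper + 1)
    )
    return favourable / comb(X, K)

# Toy numbers: 10 genes in the organism, 4 in the event, 3 submitted.
print(prob_n_or_more(X=10, M=4, K=3, N=3))
print(prob_n_or_more(X=10, M=4, K=3, N=1))
```

Since the slide notes that these probabilities are not corrected for multiple testing, a small value for one pathway among many tested should be read with caution.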

43
Example SkyPainter output
Has movie option for time course experiments!
44
Summary
  • Data mining is used to bring the biology into the
    results and interpret them
  • Curated data sources are the best for this, due
    to structure and controlled vocabulary
  • FatiGO is a simple web tool enabling data mining
    on 1 or 2 sets of genes
  • Additional tools are available for pathway
    analysis, e.g. Reactome's SkyPainter
  • Exercises: http://cbio.uct.ac.za/training/courses/microarray-data-analysis-course/MicroDM/MicroarrayDM/