Title: Agenda
- Biological databases related to microarray 
- Gene Ontology 
- KEGG 
- Pathway enrichment analysis 
- Motif finding
1. Databases
Biological pathways and knowledge are very
complex.
- Is it possible to establish a database?
- To systematically structure and manage the
 knowledge?
- To validate analysis results or be incorporated
 into the analysis?
1.1 Gene Ontology
- Ontologies: controlled vocabularies to describe
 functions of genes.
- The database is structured as directed acyclic 
 graphs (DAGs), which differ from hierarchical
 trees in that a 'child' (more specialized term)
 can have many 'parents' (less specialized terms).
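The difference between a tree and a DAG can be seen in a minimal R sketch. The four term names below are placeholders invented for illustration, not real GO identifiers, and the edge-list representation is only one possible way to store such a graph.

```r
# Toy ontology stored as an edge list: each row links a child term to one parent.
# The term names A-D are illustrative placeholders, not real GO terms.
edges <- data.frame(
  child  = c("B", "C", "D", "D"),
  parent = c("A", "A", "B", "C"),
  stringsAsFactors = FALSE
)

# Direct parents of a term.
parents_of <- function(term) edges$parent[edges$child == term]

# All ancestors of a term, found by walking up the graph.
ancestors_of <- function(term) {
  found <- character(0)
  queue <- parents_of(term)
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    if (!(current %in% found)) {
      found <- c(found, current)
      queue <- c(queue, parents_of(current))
    }
  }
  sort(found)
}

parents_of("D")    # "B" "C": two parents, which a strict tree would not allow
ancestors_of("D")  # "A" "B" "C"
```

A gene annotated to term D is implicitly annotated to every ancestor of D, which is one reason why tests on related terms are not independent.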
1.1 Gene Ontology
Three major categories in Gene Ontology
Current term counts as of April 2, 2005 at 18:00
Pacific time: 17708 terms, 93.8% with
definitions.
- 9263 biological_process
- 1496 cellular_component
- 6949 molecular_function
1.1 Gene Ontology
Evidence code: how is the information collected?
- IC: inferred by curator
- IDA: inferred from direct assay
- IEA: inferred from electronic annotation
- IEP: inferred from expression pattern
- IGI: inferred from genetic interaction
- IMP: inferred from mutant phenotype
- IPI: inferred from physical interaction
- ISS: inferred from sequence or structural
 similarity
- NAS: non-traceable author statement
- ND: no biological data available
- RCA: inferred from reviewed computational analysis
- TAS: traceable author statement
- NR: not recorded
- There may be (a lot of) errors in the database!!
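One way to account for annotation quality is to filter annotations by evidence code before analysis. The R sketch below assumes a small made-up annotation table. Which codes count as "weak" is a judgment call made here for illustration, not a rule stated by GO.

```r
# Hypothetical annotation table: gene names and evidence assignments are made up.
annotations <- data.frame(
  gene     = c("geneA", "geneB", "geneC", "geneD", "geneE"),
  go_term  = rep("cell cycle", 5),
  evidence = c("IDA", "IEA", "IMP", "ND", "TAS"),
  stringsAsFactors = FALSE
)

# Codes treated as weak evidence in this sketch (an assumption, adjust as needed).
weak_codes <- c("IEA", "ND", "NR", "NAS")

# Keep only annotations backed by the remaining, stronger codes.
curated <- annotations[!(annotations$evidence %in% weak_codes), ]

table(annotations$evidence)  # how many annotations carry each code
curated$gene                 # "geneA" "geneC" "geneE"
```

Dropping electronically inferred annotations (IEA) usually removes a large share of the data, so the stricter filter trades coverage for reliability.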
1.1 Gene Ontology
- Demo
- Go to GO: http://www.geneontology.org
- Go to "Tools" and click on "AmiGO".
- Click "Browse". Click on the boxes with "+" to
 expand any category to look at its subcategories.
 Click on "-" to collapse again.
- Type the term "cell cycle" in the "Search GO"
 field. Press "Submit". You will then see all
 GO categories containing this word.
- Click on a GO term, say "cell cycle arrest".
 Genes belonging to this GO term can be shown.
 Further filter genes by "Data source" or
 "Species".
- Type the name "cyclin" in AmiGO. Change to the
 "genes or proteins" selection button and press
 "Submit". You will then see a number of genes
 containing this name. Press some of the "Tree
 view" links.
- Note that in some cases, the same term category 
 can exist in different places in the tree. This
 ontology is thus not strictly hierarchical, but
 shows complex "many-to-many" relationships
 between gene products, ontology terms and
 branches in the ontology tree.
1.2 KEGG
http://www.genome.jp/kegg/pathway.html
1.2 KEGG: Kyoto Encyclopedia of Genes and Genomes
KEGG is a suite of databases and associated
software, integrating our current knowledge on
molecular interaction networks in biological
processes (PATHWAY database), the information
about the universe of genes and proteins
(GENES/SSDB/KO databases), and the information
about the universe of chemical compounds and
reactions (COMPOUND/GLYCAN/REACTION databases).
The current statistics of the KEGG databases are as
follows:
- Number of pathways: 23,574 (PATHWAY database)
- Number of reference pathways: 265 (PATHWAY database)
- Number of ortholog tables: 87 (PATHWAY database)
- Number of organisms: 272 (GENOME database)
- Number of genes: 911,584 (GENES database)
- Number of ortholog clusters: 35,456 (SSDB database)
- Number of KO assignments: 6,221 (KO database)
- Number of chemical compounds: 12,737 (COMPOUND database)
- Number of glycans: 11,017 (GLYCAN database)
- Number of chemical reactions: 6,399 (REACTION database)
- Number of reactant pairs: 5,953 (RPAIR database)
1.2 KEGG
RNA polymerase
1.2 KEGG
Cell cycle
1.2 KEGG
Parkinson's disease
Alzheimer's disease, Huntington's disease, prion
disease.
2. Enrichment analysis
- After selecting differentially expressed (DE)
 genes, classification, or clustering, we are
 usually given a gene list for further
 investigation.
How do we validate the information contained in the
gene list against available biological knowledge?
2. Enrichment analysis
Cell cycle data: cells are synchronized and
samples are taken at various time points (covering 2
cell cycles). 6162 genes are included.
From Fourier analysis, 800 genes with a cyclic gene
expression pattern are selected for further
investigation. Are these 800 genes really
involved in the cell cycle?
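The idea of scoring genes by Fourier analysis can be sketched in R as follows. This is not the procedure used to select the 800 genes, which the slides do not describe. The number of time points, the synthetic profiles and the score itself (share of spectral power at the cell-cycle frequency) are all assumptions made for illustration.

```r
set.seed(1)                       # reproducible toy data
n_time   <- 16                    # assumed number of time points
n_cycles <- 2                     # the time course covers 2 cell cycles
t_idx    <- seq_len(n_time) - 1

# Two synthetic expression profiles: one oscillating once per cell cycle,
# one consisting of noise only.
cyclic_gene <- sin(2 * pi * n_cycles * t_idx / n_time) + rnorm(n_time, sd = 0.1)
noise_gene  <- rnorm(n_time)

# Share of the one-sided spectral power that falls on the frequency
# "n_cycles oscillations per time course".
cycle_score <- function(x, n_cycles) {
  power     <- Mod(fft(x - mean(x)))^2
  one_sided <- power[2:(floor(length(x) / 2) + 1)]  # frequencies 1 .. n/2
  one_sided[n_cycles] / sum(one_sided)
}

cycle_score(cyclic_gene, n_cycles)  # close to 1 for the oscillating profile
cycle_score(noise_gene, n_cycles)   # no systematic peak expected for noise
```

Ranking all genes by such a score and keeping the top ones gives a gene list, which is the starting point for the enrichment question asked above.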
2. Enrichment analysis
http://db.yeastgenome.org/cgi-bin/GO/goTermMapper
2. Enrichment analysis
Is the selected set of genes enriched in the GO 
term of cell cycle? 
2. Enrichment analysis
R code for chi-square test without continuity
correction:
> chisq.test(matrix(c(285, 5012, 100, 691), 2, 2), correct=F)
        Pearson's Chi-squared test
data:  matrix(c(285, 5012, 100, 691), 2, 2)
X-squared = 61.2644, df = 1, p-value = 4.99e-15
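To make the 2 x 2 table behind this test easier to read, the sketch below attaches labels to the same four counts and prints the expected counts under independence. The labels are an assumed reading of the numbers (the slides that defined the cells are not available), so treat the row and column names as hypothetical.

```r
# Same counts as in the chisq.test() call above; matrix() fills column by column.
# ASSUMPTION: rows are GO annotation, columns are membership in the selected set.
tab <- matrix(c(285, 5012, 100, 691), 2, 2,
              dimnames = list(annotation = c("cell cycle", "other"),
                              gene_set   = c("not selected", "selected")))

res <- chisq.test(tab, correct = FALSE)

res$expected                     # counts expected if annotation and selection were independent
round(unname(res$statistic), 4)  # 61.2644, the X-squared value reported above

# Share of cell-cycle genes in each column: about 0.054 vs 0.126.
round(tab["cell cycle", ] / colSums(tab), 3)
```

Under this reading, about 50 cell-cycle genes would be expected in the selected set by chance, against 100 observed.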
2. Enrichment analysis
The chi-squared test is an approximate test and may
not perform well when the sample size is small.
Fisher's exact test is a better alternative.
Fisher's exact test: G genes in the genome
(G = 1663) are analyzed, with functional category F
(six functional categories). If, in a cluster of size
C, h genes are found to be in a functional
category F with m genes, then the p-value (i.e. the
probability of observing h or more annotated
genes in the cluster) is calculated from the
hypergeometric distribution (Tavazoie et al.
1999).
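With the symbols defined above, the probability of observing h or more annotated genes in the cluster is the upper tail of a hypergeometric distribution. The formula is reconstructed here from those definitions, since the original equation did not survive in the text:

$$P = 1 - \sum_{i=0}^{h-1} \frac{\binom{m}{i}\binom{G-m}{C-i}}{\binom{G}{C}}$$

A minimal R sketch of this calculation follows. G = 1663 is taken from the text, while the values of m, C and h are made up for illustration.

```r
# Hypergeometric tail probability P(X >= h).
# G: genes analyzed; m: genes in category F; C: cluster size; h: overlap.
enrichment_p <- function(h, C, m, G) {
  phyper(h - 1, m, G - m, C, lower.tail = FALSE)
}

# G is from the text; m, C and h are hypothetical example values.
G <- 1663; m <- 100; C <- 50; h <- 10
p_tail <- enrichment_p(h, C, m, G)

# The same value from the explicit sum in the formula.
i <- 0:(h - 1)
p_sum <- 1 - sum(choose(m, i) * choose(G - m, C - i) / choose(G, C))

# And from a one-sided Fisher's exact test on the corresponding 2 x 2 table.
tab <- matrix(c(h, m - h, C - h, G - m - C + h), 2, 2)
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

all.equal(p_tail, p_sum)     # TRUE
all.equal(p_tail, p_fisher)  # TRUE
```

The three routes agree because the one-sided Fisher's exact test is defined through the same hypergeometric tail.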
2. Enrichment analysis
- In practice, we need to search through thousands
 of GO terms to determine which GO term is
 enriched in the selected gene set.
- Multiple comparison problem!
- Difficulties: tests are highly dependent.
- Hierarchical structure of the GO
- e.g. Cell Proliferation is a parent GO term of
 Cell Cycle.
- Each gene can belong to multiple GO terms.
- e.g. the human HoxA7 gene belongs to four GO terms:
 Development, Nucleus, DNA-dependent
 regulation of transcription, Transcription
 factor activity.
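The multiple comparison adjustment can be sketched with base R's p.adjust(). The p-values and term names below are made up. Note that neither correction shown here models the dependence between GO terms described above: Bonferroni stays valid under any dependence but is conservative, and Benjamini-Hochberg is only guaranteed under independence or certain forms of positive dependence.

```r
# Hypothetical raw p-values, one per GO term tested (made up for illustration).
p_raw <- c(term1 = 0.0001, term2 = 0.004, term3 = 0.02, term4 = 0.03, term5 = 0.5)

p_bonf <- p.adjust(p_raw, method = "bonferroni")  # controls the family-wise error rate
p_bh   <- p.adjust(p_raw, method = "BH")          # controls the false discovery rate

round(cbind(raw = p_raw, bonferroni = p_bonf, BH = p_bh), 4)

names(p_raw)[p_bonf < 0.05]  # "term1" "term2"
names(p_raw)[p_bh < 0.05]    # "term1" "term2" "term3" "term4"
```

With thousands of GO terms instead of five, the gap between the raw and adjusted p-values becomes much larger.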
2. Enrichment analysis
- Simple Fisher's exact test
- Ingenuity Pathway: a commercial package with a good
 interface and human-curated annotation. Can
 generate network figures.
- NIH DAVID: free and web-based. Performs enrichment
 analysis (Fisher's exact test), adjusts for
 multiple comparisons and generates a table of
 results. Uses multiple databases.
- GOstats package in Bioconductor: a free R package.
 Performs enrichment analysis (Fisher's exact
 test) and generates a table of results. Uses only
 the GO database.
- More sophisticated and systematic methods
- Gene set enrichment analysis (GSEA; MIT,
 Mesirov's group): http://www.broad.mit.edu/gsea/
- Gene set analysis (GSA; Stanford, Tibshirani's
 group): http://www-stat.stanford.edu/~tibs/GSA/
2. Enrichment analysis
- Things to note when using biological databases
- Biological pathways and gene functions are
 complex and difficult to quantify.
- Data may not be accurate. The analysis should
 take the strength of evidence into account.
- You may need to go to a specific database for a
 particular organism (e.g. SGD for yeast; FlyBase
 and BDGP for fly).
- Systematically collecting and managing massive
 biological knowledge from publications and
 experiments is an important and active research
 topic in bioinformatics.
3. Motif Finding
http://web.indstate.edu/thcme/mwking/gene-regulation.html
3. Motif Finding
- Genes in a cluster have similar expression 
 patterns.
- They might share common regulatory motifs so they 
 are expressed simultaneously.
- It is of interest to find motifs from the gene 
 clusters.
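As a rough illustration of the idea, the R sketch below counts how many upstream sequences in a cluster contain each 6-mer. This is naive k-mer enumeration, chosen only because it is short; it is not the method presented in the motif-finding slides. The sequences and gene names are made up.

```r
# Hypothetical upstream sequences for the genes in one cluster (made up).
upstream <- c(
  gene1 = "TTACGCGTAAGG",
  gene2 = "CCACGCGTTTAC",
  gene3 = "GGGACGCGTCAT",
  gene4 = "ATATATATATAT"
)

# All overlapping substrings of length k in one sequence.
kmers_in <- function(s, k) {
  starts <- seq_len(nchar(s) - k + 1)
  substring(s, starts, starts + k - 1)
}

# For each k-mer, count how many sequences contain it at least once.
k <- 6
present <- lapply(upstream, function(s) unique(kmers_in(s, k)))
counts  <- sort(table(unlist(present)), decreasing = TRUE)

head(counts, 3)
names(counts)[1]  # "ACGCGT", present in 3 of the 4 sequences
```

A real analysis would also compare these counts against background sequences, since a k-mer that is common everywhere in the genome is not informative about the cluster.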
3. Motif Finding
The following materials are obtained from Shirley 
Liu at Harvard. 