Title: Automated cis-Regulatory Annotation of genomes
1 Automated cis-Regulatory Annotation of genomes
- Saurabh Sinha
- Dept. of Computer Science,
- UIUC.
2 Automated genome annotation
- Routine steps today
- Gene prediction
- Orthology maps
- Gene functions
3 Genes are not the whole story
Genetic regulatory network controlling the
development of the body plan of the sea urchin
embryo. Davidson et al., Science,
295(5560):1669-1678.
4 Gene Regulation
- Biological processes, including development, are
coordinated by spatio-temporal interactions among
genes
- Some genes (transcription factors) regulate the
expression of other genes
- GENE REGULATORY NETWORK (GRN)
- GRN is a key substrate of evolution
- Morphological diversity arises out of evolution
tinkering with GRN
5 Annotating cis-Regulatory elements
- Goal is to unravel the GRN for some biological
process, in some species
- Each edge of GRN is of the form "transcription
factor X regulates gene Y"
- Many other possible forms; this is the most
well-studied
- Molecular implementation of this edge: binding
site for the transcription factor, located near
the gene
- Sub-goal: Find these binding sites in the genome
6 Preview
- Different bioinformatics methods to annotate
genomic footprints of cis-regulatory interactions
- Methods differ in terms of
- Input data: e.g., single species or multiple
species; known motifs or unknown motifs; etc.
- Precise goal: find actual binding sites; find
clusters of binding sites; find target genes;
etc.
7 (1) Annotation of transcription factor binding
sites (TFBS)
- Output: Predicted binding sites (~10 bp long) of
a given TF
- Input: Prior characterization of the binding site
affinity of a transcription factor ("motifs")
- From high-throughput techniques (e.g., Chromatin
Immunoprecipitation, Bacterial 1-Hybrid, etc.)
- Input: multiple, closely related genomes
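
A common way to turn a known motif into predicted binding sites is to score every window of the genome against a position weight matrix (PWM). The sketch below illustrates that idea only; it is not the method of any particular paper cited in these slides. The count matrix, the uniform background, the pseudocount, and the score threshold are all made-up assumptions, and only the forward strand is scanned.

```python
import math

BASES = "ACGT"

# Hypothetical count matrix for an invented 4-bp motif (not from a real assay).
# Each row is one motif position; columns are counts of A, C, G, T.
COUNTS = [
    [8, 1, 0, 1],  # position 1: mostly A
    [0, 9, 1, 0],  # position 2: mostly C
    [1, 0, 9, 0],  # position 3: mostly G
    [0, 1, 0, 9],  # position 4: mostly T
]


def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert counts to log2(probability / background), one dict per position."""
    matrix = []
    for row in counts:
        total = sum(row) + 4 * pseudocount
        matrix.append({
            base: math.log2(((count + pseudocount) / total) / background)
            for base, count in zip(BASES, row)
        })
    return matrix


def scan(sequence, pwm, threshold):
    """Return (start, window, score) for each forward-strand window at or above threshold."""
    width = len(pwm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, score))
    return hits


pwm = log_odds_matrix(COUNTS)
hits = scan("TTACGTGGACGTAA", pwm, threshold=4.0)
for start, window, score in hits:
    print(start, window, round(score, 2))
```

In practice the threshold controls the trade-off between missed sites and false positives, and requiring a hit to be conserved in the related genomes is one way to filter the latter.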
8 Examples
- Harbison et al. (Nature 2004) on yeast
- > 3000 binding sites for over 100 TFs
- Stark et al. (Genome Res. 2007) on Drosophila
- > 46000 regulatory interactions
Binding site prediction
9 (2) Genome-wide prediction of motifs
- If the binding specificity of a TF is not known
from experiments, it can be computationally
predicted
- Output: A set of motifs that probably represent
TF binding affinities
- Input: multiple, closely related genomes. Some
methods attempt to find motifs from a single
species alone.
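
One very simplified way to see how related genomes can suggest motifs is to count short words (k-mers) that appear in the orthologous upstream regions of every species, gene after gene. The sketch below is a toy illustration of that conservation idea, not the algorithm of any study listed in these slides. The species names, sequences, and the choice of k are invented for the example, and no alignment or statistical test is performed.

```python
from collections import Counter


def kmers(sequence, k):
    """Set of all k-mers occurring in a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def conserved_kmer_counts(orthologous_regions, k):
    """Count, over genes, the k-mers present in the upstream region of every species.

    orthologous_regions: list of dicts (one per gene) mapping species -> sequence.
    """
    counts = Counter()
    for region in orthologous_regions:
        if not region:
            continue  # no sequences for this gene
        shared = set.intersection(*(kmers(seq, k) for seq in region.values()))
        counts.update(shared)
    return counts


# Invented toy data: three genes, three hypothetical species each.
regions = [
    {"sp1": "AATGACTCAGG", "sp2": "CCTGACTCATT", "sp3": "GTGACTCAAC"},
    {"sp1": "GGTGACTCACC", "sp2": "TTTGACTCAGA", "sp3": "ATGACTCATG"},
    {"sp1": "ACGTACGTAC", "sp2": "TTGGCCAATT", "sp3": "GGGAAACCCT"},
]

counts = conserved_kmer_counts(regions, k=7)
print(counts.most_common(3))
```

Words that are conserved near many genes are candidate motifs. A realistic pipeline would also have to allow degenerate positions and compare against a background expectation.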
10 Examples
- Kellis et al. (Nature 2003) on yeast
- 72 motifs identified
- Xie et al. (Nature 2005) on human
- 174 motifs identified
- Out of how many? Perhaps 2500, but not all have
distinct motifs
- Down et al. (PLoS CB, 2006) on Drosophila
- 120 motifs identified (out of 700)
11 (3) Prediction of clusters of TFBS
- In many cases, binding sites do not work alone
- Clusters of binding sites, of the same or
different TFs, together mediate a particular
expression pattern
- Such clusters are called cis-regulatory modules
(CRMs). Typically, CRM prediction is more
accurate than individual TFBS prediction
- Output: Annotation of CRMs (~1000 bp long)
involved in a certain biological process. Expect
1-3 per gene.
- Input: Set of motifs involved in a biological
process
- Input (optional): multiple, closely related
genomes
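
The clustering idea can be sketched as a sliding window over predicted site positions: any stretch of roughly CRM length that contains enough sites is reported as a candidate module. This is a minimal illustration, not the scoring scheme of the CRM predictors cited in these slides. The function name, the 1000 bp window, the minimum of three sites, and the example coordinates are all assumptions.

```python
def find_site_clusters(site_positions, window=1000, min_sites=3):
    """Return merged (start, end) intervals where >= min_sites sites span < window bp."""
    positions = sorted(site_positions)
    clusters = []
    right = 0
    for left in range(len(positions)):
        if right < left:
            right = left
        # Extend the window to the last site less than `window` bp from the left site.
        while (right + 1 < len(positions)
               and positions[right + 1] - positions[left] < window):
            right += 1
        n_sites = right - left + 1
        if n_sites >= min_sites:
            start, end = positions[left], positions[right]
            if clusters and start <= clusters[-1][1]:
                # Overlaps the previous cluster: extend it rather than add a new one.
                prev_start, prev_end = clusters[-1]
                clusters[-1] = (prev_start, max(prev_end, end))
            else:
                clusters.append((start, end))
    return clusters


# Hypothetical start coordinates of predicted binding sites on one chromosome.
sites = [120, 400, 650, 900, 5000, 9100, 9300, 9450, 20000]
clusters = find_site_clusters(sites)
print(clusters)  # [(120, 900), (9100, 9450)]
```

Isolated sites (here at 5000 and 20000) are dropped, which is one intuition for why cluster-level prediction can tolerate a noisy site-level scan.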
12 Examples
- Blanchette et al. (Genome Res. 2006)
- human vs. mouse comparison; > 100000 modules
predicted
- motifs taken from large database (TRANSFAC)
- Noyes et al. (Unpublished, 2007)
- multiple Drosophila species comparison
- motifs taken from high throughput assay (B1H)
- Based on our earlier work (ISMB 2003, ISMB 2006)
- Smith et al. (Mol. Sys. Biol. 2007)
- multiple mammalian species
- motifs taken from large databases
- tissue-specific gene expression data
13 (4) Prediction of TF-gene interactions
- Individual TFBS prediction may suffer from a high
false positive rate
- May be more practical to only predict "TF X
targets Gene Y"
- Output: Pairs of (TF, Gene) regulatory
interactions
- Input: Motifs of known TFs
- Input (optional): multiple, closely related
genomes
14 Examples
- Sinha et al. (PNAS 2006)
- Honeybee genome
- Motifs from another insect (fruitfly)
- Sinha et al. (Genome Res. 2007)
- Human genome
- Motifs from TRANSFAC
- Pennacchio et al. (Genome Res. 2007)
- Human genome
- Multiple mammalian species analyzed
15 (5) Annotation of miRNA targets
- Transcription factor binding to DNA is not the
only mode of gene regulation
- MicroRNAs binding to 3' UTRs of mRNA is another
major mode of gene regulation
- Output: (miRNA, Gene) interactions
- Input: miRNA sequence
- Input: multiple, closely related genomes
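
A widely used first pass for miRNA target prediction is to look in a gene's 3' UTR for sites complementary to the miRNA "seed" (nucleotides 2-8 from the 5' end), and to keep only genes where such a site is present in the related species as well. The sketch below shows that seed-match step under simplifying assumptions. It is not the full scoring used by the tools cited in these slides; the miRNA is a let-7-like sequence chosen for illustration, and the UTR sequences and species names are invented.

```python
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def seed_match_site(mirna, seed_start=2, seed_end=8):
    """Reverse complement of the seed (1-based, inclusive positions from the 5' end)."""
    seed = mirna[seed_start - 1:seed_end]
    return seed.translate(COMPLEMENT)[::-1]


def find_seed_matches(utr, mirna):
    """Start positions (0-based) of perfect seed matches in a 3' UTR."""
    site = seed_match_site(mirna)
    utr = utr.upper().replace("T", "U")  # accept DNA or RNA alphabets
    hits = []
    pos = utr.find(site)
    while pos != -1:
        hits.append(pos)
        pos = utr.find(site, pos + 1)
    return hits


def is_conserved_target(utrs_by_species, mirna):
    """True if every species' orthologous UTR has at least one seed match."""
    return all(find_seed_matches(utr, mirna) for utr in utrs_by_species.values())


mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # let-7-like sequence, for illustration only
site = seed_match_site(mirna)

# Invented orthologous 3' UTR fragments for two hypothetical species.
utrs = {
    "species_a": "AAGCUACCUCAUUGGCUACCUCGG",
    "species_b": "AAGCTACCTCATTGG",
}
print(site, find_seed_matches(utrs["species_a"], mirna), is_conserved_target(utrs, mirna))
```

Requiring the match in every species is a deliberately strict filter; relaxing it trades fewer missed targets for more false positives.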
16 Examples
- Krek et al. (Nature Genetics 2005)
- human genome
- Lewis et al. (Cell 2005)
- human genome
- Grun et al. (PLoS CB, 2005)
- Drosophila genome
17 (6) CRM Prediction without known motifs
- More on the research side
- Output: CRMs in a genome
- Input: Set of functionally related CRMs in the
same genome
- Current estimates (unpublished data)
- 50% specificity in Drosophila
- Joint work with Gene Robinson; preliminary work
in ISMB 2007.
18 (7) CRM Prediction across evolutionary gaps
- Output: CRMs in a new genome
- Input: Orthologous CRMs in a different genome
- Assumption: Alignment not available as a guide
- E.g., knowledge of CRMs in fruitfly; export
this knowledge to the honeybee or wasp genome
- Ongoing joint work with Gene Robinson.
19 Summary
- Already achievable: Genome-wide prediction of
- binding sites
- binding site affinities (motifs)
- cis-regulatory modules (CRMs)
- TF-gene interactions
- miRNA targets
- Issue: Accuracy not very high, but still very
useful!
- One direction for the future of automatic
annotation
- Exporting annotation from one species to
another, without the aid of alignments