Title: Genome annotation techniques: new approaches and challenges, Drug Discovery Today, Volume 7, Issue 11, 6 May 2002, Pages 570-576. Alistair G. Rust, Emmanuel Mongin and Ewan Birney.
1 Genome annotation techniques: new approaches and challenges
Presented by Haili Ping
2
- The exponential increase in the amount of human genomic sequence, and of genomes from other species, needs to be matched by increases in the accurate annotation of this huge variety of genomes.
- Accurate annotation of the human genome and of other species' genomes is an essential element in supporting current drug discovery efforts.
- Bioinformatics solutions are increasingly required to develop automatic annotation techniques that support and complement the manual curation process.
3 Automatic genome annotation pipelines
- The primary goal is to deliver highly accurate and reliable genome annotations, using the widest range of evidence from available databases.
- In essence, pipelines are the integration of suites of bioinformatics software tools with multiple databases, to manage automatically the analysis and storage of genomic sequence.
- Trend: from single-algorithm methods to consensus-based approaches
- Combined results of gene predictors and
similarity search methods are used
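The consensus idea on this slide can be illustrated with a small sketch. It is not the method of any particular pipeline: it assumes each predictor reports half-open (start, end) intervals on the same sequence, that intervals from one predictor do not overlap each other, and it keeps only regions supported by at least `min_votes` predictors. The function name and the example coordinates are hypothetical.

```python
from collections import Counter

def consensus_intervals(predictions, min_votes):
    """Return regions covered by at least min_votes predictors.

    predictions: one list per predictor of half-open (start, end) intervals.
    Assumes intervals from a single predictor do not overlap each other.
    """
    # Sweep line: +1 where an interval starts, -1 where it ends.
    events = Counter()
    for intervals in predictions:
        for start, end in intervals:
            events[start] += 1
            events[end] -= 1

    result = []
    depth = 0          # number of predictors covering the current position
    open_at = None     # start of the current consensus region, if any
    for pos in sorted(events):
        depth += events[pos]
        if depth >= min_votes and open_at is None:
            open_at = pos
        elif depth < min_votes and open_at is not None:
            result.append((open_at, pos))
            open_at = None
    return result

# Hypothetical output of three predictors on one sequence.
example_predictions = [
    [(100, 200), (300, 400)],
    [(150, 250), (300, 380)],
    [(120, 180)],
]
example_consensus = consensus_intervals(example_predictions, min_votes=2)
```

Raising `min_votes` trades sensitivity for specificity; a real system would also weight predictors by their reliability rather than counting them equally.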
4 The generic structure of an automatic genome
annotation pipeline and delivery system
5 Box 1. Useful human genome annotation and browser URLs
Automated annotation pipelines
- EBI/Sanger Institute Ensembl Project: http://www.ensembl.org/Homo_sapiens/
- NCBI Human Genome Browser: http://proxy.library.uiuc.edu:3367/genome/guide/human/
- The Oak Ridge National Laboratories Genome Channel: http://compbio.ornl.gov/channel/
- Celera Discovery System: http://cds.celera.com/
- Incyte Genomics Genomics Knowledge Platform: http://www.incyte.com/incyte_science/technology/gkp/
- Paracel GeneMatcher2 System: http://www.paracel.com/products/gm2.html
Human genome browsers
- UCSC Human Genome Browser: http://genome.cse.ucsc.edu/cgi-bin/hgGateway/
- Softberry Genome Explorer: http://www.softberry.com/berry.phtml?topic=genomexp
- Viaken Enterprise Ensembl Solution: http://www.viaken.com/ns/solutions/ensembl.html
- LabBook Inc. Genomic Explorer Suite: http://www.labbook.com/products/ExplorerSuite.asp
- University of Tokyo Gene Resource Locator Browser: http://grl.gi.k.u-tokyo.ac.jp/
Other useful sites
- The Institute for Genomic Research (TIGR): http://www.tigr.org/
- Human Genome Central: http://www.ensembl.org/genome/central/ and http://proxy.library.uiuc.edu:3528/genome/central/
6 From raw sequence to gene predictions
- Raw sequence pre-processing
- Masking known repeats and low-complexity sequences using RepeatMasker
- Identifying homology matches using BLAST
- Scanning for other features, such as sequence tagged site (STS) markers and CpG islands
- Gene prediction
- Predictions based on protein matches
- Predictions based on DNA sequence
- Ab initio gene prediction programs
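One of the feature scans listed above, the search for CpG islands, is simple enough to sketch. The version below is only an illustration and not the scanner used by any of the pipelines: it slides a fixed window along the sequence and applies the commonly quoted thresholds (GC fraction of at least 0.5 and an observed/expected CpG ratio of at least 0.6 over a 200 bp window). The function names and default values are assumptions made for this example.

```python
def cpg_window_stats(window):
    """Return (GC fraction, observed/expected CpG ratio) for one window."""
    window = window.upper()
    n = len(window)
    c = window.count("C")
    g = window.count("G")
    cpg = window.count("CG")  # CpG dinucleotides cannot overlap themselves
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp

def find_cpg_windows(seq, window=200, step=1, min_gc=0.5, min_obs_exp=0.6):
    """Return start positions of windows that pass both thresholds."""
    hits = []
    for start in range(0, len(seq) - window + 1, step):
        gc, obs_exp = cpg_window_stats(seq[start:start + window])
        if gc >= min_gc and obs_exp >= min_obs_exp:
            hits.append(start)
    return hits

# Synthetic test sequence: a CpG-rich block flanked by AT repeats.
example_seq = "AT" * 100 + "CG" * 100 + "AT" * 100
example_hits = find_cpg_windows(example_seq, window=200, step=100)
```

Overlapping hit windows would normally be merged into islands in a further step; that step is left out here to keep the sketch short.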
7 A simplified schematic of algorithmic gene prediction
8 Gene function characterization
- Mapping to known genes
- RefSeq and SWISS-PROT
- HUGO (NCBI, UCSC and Ensembl)
- Protein domain annotation
- Pfam, PRINTS, PROSITE, ProDom, BLOCKS and SMART
- The InterPro project creates a unique characterization for a given protein family, domain or functional site. Domains of protein sequences can then be identified using this signature method. InterPro provides the least-redundant and most extensive annotation currently available.
- Gene ontology
- The Gene Ontology (GO) project aims at defining common terms to specify molecular function,
biological process and cellular location
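To make the signature method concrete, here is a minimal sketch that translates a PROSITE-style pattern into a Python regular expression and scans a protein sequence with it. It covers only the basic syntax (residues, `x`, `[..]`, `{..}`, repetition counts and the `<` / `>` anchors) and is not how InterPro itself is implemented; the function name and the toy sequence are hypothetical.

```python
import re

def prosite_to_regex(pattern):
    """Translate a basic PROSITE-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")   # pattern tied to N-terminus
        anchor_end = element.endswith(">")       # pattern tied to C-terminus
        element = element.strip("<>")
        # Split off an optional repetition such as (3) or (2,4).
        repeat = ""
        if element.endswith(")"):
            element, _, count = element[:-1].partition("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."                            # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"     # any residue except these
        else:
            core = element                        # one residue or a [..] set
        parts.append(("^" if anchor_start else "") + core + repeat
                     + ("$" if anchor_end else ""))
    return "".join(parts)

# The N-glycosylation site pattern, applied to a made-up sequence.
glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}")
glyco_match = re.search(glyco_regex, "AANGSAA")
```

Short patterns like this one match very often by chance, which is one reason integrated resources combine several signature types rather than relying on a single pattern.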
9 Sharing genome annotations
- Website display and FTP sites
- Figure: Chromosome 20 Overview
11
- Pros: web-based displays do not require expert bioinformatics skills, so they are accessible to a wide range of researchers wishing to gain access to genomic annotation
- Cons: they make it difficult to perform large-scale data mining
- Solution: enabling more experienced users to retrieve the data they require and to run analyses locally
- Open annotation
- Researchers need access to annotations available in the community and a way to share their own contributions with the community
- A common protocol between systems is needed that enables genome data to be freely exchanged
- Examples: the AGAVE (Architecture for Genomic Annotation, Visualization and Exchange) and the Distributed
Annotation System (DAS) projects
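As an illustration of what a common exchange protocol looks like in practice, the sketch below builds a DAS-style feature request and parses a small XML reply. The URL layout and element names follow the general shape of DAS 1.x as far as I understand it and should be checked against the specification; the server name, data source name and sample response are invented for this example, and no network call is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

def das_features_url(server, source, reference, start, stop):
    """Build a DAS-style 'features' request for one segment (assumed layout)."""
    query = urlencode({"segment": f"{reference}:{start},{stop}"}, safe=":,")
    return f"{server}/das/{source}/features?{query}"

# Simplified, hand-written reply; a real server returns more elements.
SAMPLE_RESPONSE = """<DASGFF><GFF version="1.0">
<SEGMENT id="20" start="1" stop="100000">
<FEATURE id="gene1" label="Example gene">
<TYPE id="gene">gene</TYPE><METHOD id="pipeline">pipeline</METHOD>
<START>1200</START><END>5600</END><ORIENTATION>+</ORIENTATION>
</FEATURE>
</SEGMENT></GFF></DASGFF>"""

def parse_das_features(xml_text):
    """Extract one dictionary per FEATURE element from a DAS-style reply."""
    root = ET.fromstring(xml_text)
    features = []
    for segment in root.iter("SEGMENT"):
        for feature in segment.iter("FEATURE"):
            features.append({
                "segment": segment.get("id"),
                "id": feature.get("id"),
                "type": feature.findtext("TYPE", default="").strip(),
                "start": int(feature.findtext("START")),
                "end": int(feature.findtext("END")),
                "strand": feature.findtext("ORIENTATION", default="0").strip(),
            })
    return features

# Hypothetical server and data source names.
example_url = das_features_url("http://das.example.org", "human", "20", 1, 100000)
example_features = parse_das_features(SAMPLE_RESPONSE)
```

The point of such a protocol is that a client can merge replies from several independent annotation servers over the same reference coordinates.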
12 Challenges facing automatic annotation systems
- Data warehousing: a solution for large-scale data mining
- First, the desired query statement might be too complex to implement
- Second, the computing power needed might be too expensive in most cases for queries performed on large, monolithic databases
- Solution
- The business sector uses data warehousing, which segregates information into denormalized databases, enabling fast querying and data retrieval
- A large variety of data-mining tools can then extract datasets of interest efficiently for subsequent stages of statistical analysis or data
mining
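The denormalization idea can be shown with a toy example using Python's built-in `sqlite3` module. The table and column names below are made up for illustration and do not correspond to any real annotation schema: three normalized tables are flattened once into a single wide table, after which a typical mining query needs no joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized core tables (hypothetical schema).
CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, name TEXT, chromosome TEXT);
CREATE TABLE transcript (transcript_id INTEGER PRIMARY KEY,
                         gene_id INTEGER, length INTEGER);
CREATE TABLE domain_hit (transcript_id INTEGER, domain TEXT);

INSERT INTO gene VALUES (1, 'geneA', '20'), (2, 'geneB', '20');
INSERT INTO transcript VALUES (10, 1, 1500), (11, 1, 900), (20, 2, 2100);
INSERT INTO domain_hit VALUES (10, 'kinase'), (20, 'zinc_finger');

-- Denormalized warehouse table: the joins are paid once, at build time.
CREATE TABLE gene_mart AS
SELECT g.name AS gene, g.chromosome, t.transcript_id, t.length, d.domain
FROM gene g
JOIN transcript t ON t.gene_id = g.gene_id
LEFT JOIN domain_hit d ON d.transcript_id = t.transcript_id;

CREATE INDEX idx_mart_domain ON gene_mart (domain);
""")

# A mining-style query now touches a single table.
kinase_rows = conn.execute(
    "SELECT gene, transcript_id FROM gene_mart "
    "WHERE domain = ? ORDER BY transcript_id",
    ("kinase",),
).fetchall()
mart_size = conn.execute("SELECT COUNT(*) FROM gene_mart").fetchone()[0]
```

The cost of this design is redundancy: the wide table has to be rebuilt or refreshed whenever the underlying annotation changes.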
13 The requirement to remain flexible
- The development of automated annotation pipelines is an evolving process.
- The quality of sequences and assemblies continues to improve, and redundant sequences are replaced with new, superior sequences
- This demands a flexible system in which new, individual sequences can be added and analysed without disrupting the whole system
- New, improved algorithms and methodologies keep appearing
- This demands a pipeline architecture flexible enough to incorporate them into the analysis process
without redesign of the system.
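One way to picture these two demands is a pipeline that keeps a registry of analyses and a record of which (sequence, analysis) pairs have already been computed. The class below is a hypothetical sketch, not the design of any existing pipeline; it assumes sequence identifiers are versioned, so that a replaced sequence arrives under a new identifier.

```python
class AnnotationPipeline:
    """Toy pipeline that only computes missing (sequence, analysis) pairs."""

    def __init__(self):
        self.analyses = {}   # analysis name -> function(sequence) -> result
        self.results = {}    # (sequence id, analysis name) -> result

    def register(self, name, func):
        """Plug in a new analysis without touching existing ones."""
        self.analyses[name] = func

    def run(self, sequences):
        """Run outstanding work; return the pairs computed in this call."""
        computed = []
        for seq_id, seq in sequences.items():
            for name, func in self.analyses.items():
                key = (seq_id, name)
                if key not in self.results:
                    self.results[key] = func(seq)
                    computed.append(key)
        return computed

pipeline = AnnotationPipeline()
pipeline.register("gc", lambda s: (s.count("G") + s.count("C")) / len(s))
first_run = pipeline.run({"s1": "GGCC"})

# Later: a new algorithm and a new sequence arrive; old results are reused.
pipeline.register("length", len)
second_run = pipeline.run({"s1": "GGCC", "s2": "ATAT"})
```

Only the new combinations are computed in the second call, which is the property the slide asks for: additions do not force a rerun or redesign of the whole system.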
14 Future opportunities
- Comparative genomics
- As more genomes are sequenced and become publicly available in the next few years, comparative genomics will become one of the greatest areas of development
- Cross-species analysis: human-mouse
- Protein-coding genes are likely to be highly conserved between closely related species (e.g. mouse and human), and other regions, such as RNA genes and regulatory regions, could also be elucidated
- Need for the development of bioinformatics tools
- Vista, Synplot and FamilyJewels
- The integration of such tools with the current automated approaches
- The design of genome browsers and websites that can intelligently display and annotate
comparative results
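The kind of conservation profile that comparative tools display can be approximated by computing percent identity in sliding windows over a pairwise alignment. The sketch below is a simplified illustration, not the algorithm of any of the tools named above; it assumes two already aligned sequences of equal length with `-` as the gap character, and the function name and window sizes are arbitrary choices.

```python
def window_identity(aln_a, aln_b, window=100, step=50):
    """Return (start, percent identity) for sliding windows over an alignment."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for start in range(0, len(aln_a) - window + 1, step):
        a = aln_a[start:start + window]
        b = aln_b[start:start + window]
        # Count identical columns; a gap aligned to a gap is not a match.
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        profile.append((start, 100.0 * matches / window))
    return profile

# Tiny made-up alignment with one mismatch in the first window.
example_profile = window_identity("ACGTACGTAC", "ACGTTCGTAC", window=5, step=5)
```

Windows whose identity stays high outside known coding exons are the candidates for conserved RNA genes or regulatory regions.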
15 Integrating and delivering new data
- Horizontal integration
- Genomic systems should be able to cross-match species that can be sensibly compared
- Vertical integration
- New flows of data coming from proteomics and microarray sources will soon have to be incorporated
16 Concluding remarks
- Automatic genome annotation systems have increased and are increasing
- Grounded upon central cores of bioinformatics software tools and associated relational databases
- Sequenced genomes: integration of new genomes into the current systems
- The demand for openness towards the distribution of annotation data
- The delivery of genomic data in forms suitable for large-scale data mining
17 References
1. Alistair G. Rust, Emmanuel Mongin and Ewan Birney. Genome annotation techniques: new approaches and challenges. Drug Discovery Today, Volume 7, Issue 11, 6 May 2002, Pages 570-576.
2. Weizhong Li and Adam Godzik. Discovering new genes with advanced homology detection. Trends in Biotechnology, Volume 20, Issue 8, 1 August 2002, Pages 315-316.
3. Biswas M, O'Rourke JF, Camon E, Fraser G, Kanapin A, Karavidopoulou Y, Kersey P, Kriventseva E, Mittard V, Mulder N, Phan I, Servant F, Apweiler R. Applications of InterPro in protein annotation and genome analysis. Brief Bioinform. 2002 Sep;3(3):285-95. PMID: 12230037. http://www.ebi.ac.uk/interpro/
4. Loraine AE, Helt GA. Visualizing the genome: techniques for presenting human genome data and annotations. BMC Bioinformatics. 2002 Jul 30;3(1):19. http://www.pubmedcentral.gov/articlerender.fcgi?tool=pubmed&pubmedid=12149135
5. Oshiro G, Wodicka LM, Washburn MP, Yates JR 3rd, Lockhart DJ, Winzeler EA. Parallel identification of new genes in Saccharomyces cerevisiae. Genome Res. 2002 Aug;12(8):1210-20. PMID: 12176929. http://www.genome.org/cgi/content/full/12/8/1210