Title: Part I: Identifying sequences with
1. Part I: Identifying sequences with
Speaker: S. Gaj
Date: 11-01-2005
2. Annotation
- Annotation
- Best possible description available for a given sequence at the current time.
- How to annotate?
- Combining
- Alignment tools
- Databases
- Datamining (scripts)
Background
3. Microarrays
6. Introduction
- Global alignment
- Optimal alignment between two sequences containing as many characters of the query as possible.
- Ex.: predicting evolutionary relationships between genes
- Local alignment
- Optimal alignment between two sequences identifying identical area(s).
- Ex.: identifying key molecular structures (S-bonds, α-helices, ...)
Background
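The difference between the two alignment types can be illustrated with a minimal Python sketch. It computes only the optimal score, not the alignment itself, using the classic dynamic-programming schemes (Needleman-Wunsch for global, Smith-Waterman for local). The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration, not values from the slides.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: all of a must be aligned against all of b."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of substrings of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


query = "ACGT"
subject = "TTTTACGTTTTT"  # contains the query as an exact substring
print(local_score(query, subject))   # 4: the identical area is found
print(global_score(query, subject))  # -4: 4 matches minus 8 end gaps
```

The local score rewards the shared region only, while the global score is pulled down by the unaligned flanks, which is why a local method suits searching a short query against long database sequences.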
7. Introduction
- Basic Local Alignment Search Tool
- Aligning an unknown sequence (query) against all sequences present in a chosen database, based on a score value.
- Aim
- Obtaining structural or functional information on the unknown sequence.
BLAST
8. Programs
- Different BLAST programs available
- Usable criteria
- E-value, Gap Opening Penalty (GOP), Gap Extension Penalty (GEP), ...
- Terms
- Query: sequence which will be aligned
- Subject: sequence present in database
- Hit: alignment result
BLAST
9. Common BLAST problems
[Diagram labels: Clone seq, mRNA, Sequencing Error]
BLAST
- Solution
- Low penalty for GOP and GEP 1
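Why low gap penalties help with sequencing errors can be shown with a little arithmetic. The sketch below assumes an affine gap cost in which a gap of length k costs GOP + k * GEP (one common convention; some tools charge GOP + (k - 1) * GEP instead), a match score of +1, and perfectly matching flanks on both sides of a single inserted or deleted base. The function names and the penalty values are illustrative assumptions.

```python
def gap_cost(length, gop, gep):
    """Affine gap cost, assumed convention: opening charge plus one extension charge per gapped position."""
    if length <= 0:
        return 0
    return gop + gep * length


def bridge_or_split(left_matches, right_matches, gap_length, gop, gep, match=1):
    """Compare one alignment spanning the indel with stopping at the indel."""
    bridged = match * (left_matches + right_matches) - gap_cost(gap_length, gop, gep)
    split = match * max(left_matches, right_matches)
    return ("bridged", bridged) if bridged > split else ("split", split)


# A clone with one missing base in the middle of 40 otherwise matching bases.
print(bridge_or_split(20, 20, 1, gop=2, gep=1))   # ('bridged', 37)
print(bridge_or_split(20, 20, 1, gop=30, gep=5))  # ('split', 20)
```

With low penalties the best alignment runs across the sequencing error and covers the whole clone; with high penalties the alignment breaks at the error and only one flank is reported.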
10. Translation Problems
>embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
Frame 1: L A L P S S Q H E G S H C S G A
ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...
BLAST
11. Translation Problems
>embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
Reading frames: 3, 2, 1, -1, -2, -3
Frame 2: H S D L A V N M K A L I V L G
Frame 1: L A L P S S Q H E G S H C S G A
ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...
BLAST
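The six reading frames of the lysozyme fragment can be reproduced with a short Python sketch. It uses the standard genetic code; stop codons are printed as `*` and incomplete trailing codons are dropped. Numbering the reverse frames -1 to -3 by offset on the reverse complement is an assumed convention, and the function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at offset; unknown codons become X."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Map frame number (1, 2, 3, -1, -2, -3) to its translation."""
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna, offset)
        frames[-(offset + 1)] = translate(rc, offset)
    return frames


fragment = "ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggc"
frames = six_frames(fragment)
print(frames[1])  # LAL*PSSQHEGSHCSG
print(frames[2])  # *HSDLAVNMKALIVLG
```

Frame 1 runs into a stop codon (tga) after three residues, whereas frame 2 contains the ATG start codon followed by K A L I V L G, so the choice of frame decides whether a translated search can find the protein.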
12. Common BLAST problems
[Diagram labels: Gene X, intron, exon, Splicing, full mRNA, mRNA, Translation]
BLAST
13. Common BLAST problems
[Diagram labels: mRNA, Clones derived from mRNA]
BlastX against protein sequence
3 possible hit situations
BLAST
14. Common BLAST problems
- Aligns with protein in 1 of the 6 frames.
- Part perfect alignment
or
BLAST
15. Part II: Databases and annotation
16. Introduction
- Primary database
- DNA sequence (EMBL, GenBank, ...)
- Amino acid sequence (SwissProt, PIR, ...)
- Protein structure (PDB, ...)
- Secondary database
- Derived from primary DB
- DNA sequence (UniGene, RefSeq, ...)
- Combination of all (LocusLink, ENSEMBL, ...)
- Structure
- Flat file databases
Databases
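A flat file database stores each entry as plain text lines. As a minimal sketch, the Python below reads an EMBL-style record, assuming the usual layout: a two-letter line code followed by data, a sequence block after the SQ line, and `//` closing the record. The sample record is heavily simplified (a real ID line carries more fields); its accession, name, description and bases are taken from the lysozyme example used earlier. The function name is an illustrative assumption.

```python
def parse_flat_file(text):
    """Parse EMBL-style records into dicts of line code -> list of values."""
    records, current, in_sequence = [], {}, False
    for line in text.splitlines():
        if line.startswith("//"):
            # End of record.
            if current:
                records.append(current)
            current, in_sequence = {}, False
        elif in_sequence:
            # Sequence lines: keep letters only, dropping spaces and counters.
            current["sequence"] += "".join(ch for ch in line if ch.isalpha())
        elif line.startswith("SQ"):
            in_sequence = True
            current["sequence"] = ""
        elif len(line) >= 2 and line[:2].strip():
            code, value = line[:2], line[2:].strip()
            current.setdefault(code, []).append(value)
    if current:
        records.append(current)  # tolerate a missing final terminator
    return records


# Simplified sample record (not a complete EMBL entry).
SAMPLE = """ID   HSLSZ
AC   J03801;
DE   Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
SQ   Sequence
     ctagcactct gacctagcag tcaacatgaa
//
"""

for record in parse_flat_file(SAMPLE):
    print(record["AC"], record["sequence"])
```

Because every field sits on its own coded line, datamining scripts can pull out IDs, descriptions and crosslinks with simple line-based parsing like this.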
17. Primary Databases
- EMBL
- DNA sequence
- Human: 4.126.190.851 nucleotides in 292.205 entries
- Clones, mRNA, (Riken) cDNA, ...
- New sequences can be submitted by anyone.
- No curation check before acceptance.
Databases
18. Primary Databases
- SwissProt
- Amino acid sequence
- Human
- Contains protein information
- SwissProt (EU) and PIR (USA)
- Crosslinks to the most informative DBs (PDB, OMIM)
- Part of the UniProt consortium.
- Each addition needs validation by appointed curators.
- Highly curated
Databases
19. Secondary Databases
- TrEMBL
- Translated EMBL
- Hypothetical proteins
- After careful assessment → SpTrEMBL → SwissProt
Databases
20. Secondary Databases
- UniGene
- Automated clustering of sequences with high similarity
- Derived from GenBank / EMBL
- 1 consensus sequence
- Species-specific
Databases
21. Secondary Databases
- LocusLink
- Curated sequences
- Descriptive information about genetic loci
- RefSeq
- Non-redundant set of sequences.
- Genomic DNA, mRNA, protein
- Stable reference for gene identification and characterization.
- High curation
Databases
22. Database Quality?
[Diagram labels: mRNA, Protein, DNA, EMBL, SwissProt, Submitter, Curators, Database Manager]
Databases
23. How to Annotate?
- BlastN against random nucleotide DB
- ESTs
- BlastN against structured nucleotide DB (UniGene, RefSeq)
- mRNA hits
- Sometimes not annotated at all
- Best information
Databases
24. Microarrays
27. Part III: Annotation Techniques
28. What do we have?
- Probe sequence
- Alignment tools (e.g. BLAST)
- Databases
- What to choose?
Annotation
29. Possibilities?
- 1. Do it like everyone else does.
- 2. Make use of the curation of certain databases
- Goal
- Annotate as many genes with as much information as possible (e.g. SwissProt ID)
Annotation
30. 1st Approach - General
- Done by most array manufacturers
- Step-by-step approach
- BLAST sequences against a nucleic acid database (preferably UniGene)
- Extract high-quality (HQ) hits (>95%)
- For each HQ hit, search crosslinks.
- Find a well-described (SwissProt) ID for each sequence.
Annotation Techniques
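The step-by-step approach can be sketched as a small filter over BLAST results. This is a minimal Python sketch, not the pipeline actually used: it assumes hits arrive in the standard 12-column tabular BLAST format (query, subject, percent identity, alignment length, mismatches, gap opens, query start/end, subject start/end, E-value, bit score), that the >95 cut-off refers to percent identity, and that crosslinks are available as a simple lookup table. All probe, cluster and SwissProt IDs below are hypothetical placeholders.

```python
from collections import namedtuple

Hit = namedtuple("Hit", "query subject identity length evalue")


def parse_tabular(lines):
    """Parse 12-column tab-separated BLAST hits, skipping blanks and comments."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        hits.append(Hit(f[0], f[1], float(f[2]), int(f[3]), float(f[10])))
    return hits


def annotate(hits, crosslinks, min_identity=95.0):
    """Keep the best HQ hit per query and follow its crosslink to an SP ID."""
    best = {}
    for hit in hits:
        if hit.identity > min_identity:
            if hit.query not in best or hit.identity > best[hit.query].identity:
                best[hit.query] = hit
    # None marks an HQ hit whose subject has no known crosslink.
    return {query: crosslinks.get(hit.subject) for query, hit in best.items()}


# Hypothetical example data.
blast_lines = [
    "probe_1\tUG_A\t98.5\t200\t3\t0\t1\t200\t11\t210\t1e-100\t370",
    "probe_1\tUG_B\t96.0\t150\t6\t0\t1\t150\t5\t154\t1e-60\t250",
    "probe_2\tUG_C\t90.0\t180\t18\t0\t1\t180\t1\t180\t1e-40\t180",
]
crosslinks = {"UG_A": "SP_0001", "UG_B": "SP_0002", "UG_C": "SP_0003"}

annotation = annotate(parse_tabular(blast_lines), crosslinks)
print(annotation)  # {'probe_1': 'SP_0001'}
```

Here probe_2 stays unannotated because its only hit falls below the identity cut-off, which mirrors how reporters end up without an ID in this approach.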
31. 1st Approach - Concept
Annotation Techniques
32. 2nd Approach - General
- Make use of present database curation
- Other way around
- Use SwissProt to clean out EMBL
- Result
- Cleaned EMBL database (cEMBL) with direct SP crosslinks
- BLAST against cEMBL
- Extract high-quality alignment hits (>95%)
- Convert EMBL ID to SP ID.
Annotation Techniques
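The cleaning step of the second approach amounts to inverting the SwissProt crosslinks. The Python sketch below is an illustration under simplifying assumptions: EMBL entries are held in a dict of ID to sequence, SwissProt crosslinks in a dict of SP ID to the EMBL IDs it references, and when several SwissProt entries point to the same EMBL entry the first one seen is kept. The function name and all IDs are hypothetical.

```python
def build_cembl(embl_entries, swissprot_crosslinks):
    """Return (cEMBL subset, EMBL ID -> SP ID map) for crosslinked entries only."""
    embl_to_sp = {}
    for sp_id, embl_ids in swissprot_crosslinks.items():
        for embl_id in embl_ids:
            if embl_id in embl_entries:
                # First SwissProt entry wins; an assumed tie-breaking rule.
                embl_to_sp.setdefault(embl_id, sp_id)
    cembl = {embl_id: embl_entries[embl_id] for embl_id in embl_to_sp}
    return cembl, embl_to_sp


# Hypothetical example data.
embl_entries = {"EMBL_1": "acgtacgt", "EMBL_2": "ttgacc", "EMBL_3": "ggcatg"}
swissprot_crosslinks = {"SP_A": ["EMBL_1"], "SP_B": ["EMBL_3", "EMBL_9"]}

cembl, embl_to_sp = build_cembl(embl_entries, swissprot_crosslinks)
print(sorted(cembl))         # ['EMBL_1', 'EMBL_3']
print(embl_to_sp["EMBL_3"])  # SP_B
```

After a BLAST run against the reduced database, converting a hit to a SwissProt ID is a single lookup in `embl_to_sp`, since every remaining entry has a crosslink by construction.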
33. 2nd Approach - Concept
Annotation Techniques
34. Annotating Incyte Reporters
- Total: 13.497
- cEMBL approach: 2.898 (21,47%) SP-IDs
- DM approach: 10.013 (74,18%) UG-IDs, in which
- M: 4.723 (34,9%) SP-IDs
- MR: 5.147 (38,1%) SP-IDs
- MRH: 6.641 (49,2%) SP-IDs
Results
35. Annotating Incyte Reporters
- All reporters present on Incyte Mouse UniGene 1 converted
- Total: 9.596 reporters
- Old annotation: 9.370 (97,6%) UG-IDs, in which
- Non-existing UG-IDs: 5.713 (59,5%)
- M: 1.939 (20,2%) SP-IDs
- MR: 2.096 (21,8%) SP-IDs
- MRH: 2.582 (26,9%) SP-IDs
- Datamining approach: 8.532 (88,9%) UG-IDs, in which
- M: 4.145 (43,2%) SP-IDs
- MR: 4.499 (38,1%) SP-IDs
- MRH: 5.576 (60,1%) SP-IDs
- Custom EMBL approach: 2.898 (30,2%) SP-IDs
Results
36. Annotating Incyte Reporters
- Combined methods: Incyte Mouse UniGene 1 reporters
- Total: 9.596 reporters
- No annotation: 1.062 (11%) reporters
- Annotated with SP-ID: 5.895 (61,3%) reporters, of which
- 2.184 (22,7%) identical SP-IDs
- 532 (5%) reporters with improved SP-IDs by EMBL method
- 174 (1,8%) reporters with different mouse SP-IDs
- 5 reporters found only by EMBL method
Results
37. Conclusions
- Annotation is much needed
- Array sequences can point to different genes
- Direct translation into protein is not the best option
- Sequencing errors
- Addition or deletion of nucleotides
- 6-frame window
- Public nucleotide databases are redundant.
- Sequencing errors
- Differences in sequence length
- Attachment of vector sequence
Conclusions
38. Questions?
End