Part I: Identifying sequences with

Provided by: BiGF
1
Part I: Identifying sequences with
Speaker: S. Gaj
Date: 11-01-2005
2
Annotation
  • Annotation
    • Best possible description available for a given
      sequence at the current time.
  • How to annotate?
    • Combining
      • Alignment tools
      • Databases
      • Data mining (scripts)

Background
3
Microarrays
6
Introduction
  • Global alignment
    • Optimal alignment between two sequences
      containing as many characters of the query as
      possible.
    • Ex.: predicting evolutionary relationships
      between genes, ...
  • Local alignment
    • Optimal alignment between two sequences
      identifying identical area(s).
    • Ex.: identifying key molecular structures
      (S-bonds, α-helices, ...)
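
The difference between the two alignment types can be made concrete with a small dynamic-programming sketch. The global variant follows the Needleman-Wunsch recurrence and the local variant the Smith-Waterman recurrence. The scoring values (match +1, mismatch -1, linear gap -1), the function names and the toy sequences are assumptions chosen for illustration; they do not come from the presentation.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: only gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                          # first column: only gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)  # never below 0
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


query = "ACGT"
subject = "TTTTACGTTTTT"
print(global_score(query, subject))  # -4: four matches, eight forced gaps
print(local_score(query, subject))   # 4: the shared ACGT region only
```

The short query is penalised in the global alignment for every unmatched position of the longer sequence, while the local alignment only reports the identical area.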

Background
7
Introduction
  • Basic Local Alignment Search Tool
    • Aligning an unknown sequence (query) against all
      sequences present in a chosen database based on a
      score value.
  • Aim
    • Obtaining structural or functional information
      on the unknown sequence.

BLAST
8
Programs
  • Different BLAST programs available
  • Usable criteria
    • E-value, Gap Opening Penalty (GOP), Gap Extension
      Penalty (GEP), ...
  • Terms
    • Query: sequence which will be aligned
    • Subject: sequence present in the database
    • Hit: alignment result
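
How the gap opening and gap extension penalties listed among the criteria combine can be shown with a short sketch of an affine gap model. It assumes one common convention, in which a gap of length k costs GOP + k × GEP; tools differ on whether the first gap position is also charged an extension. The penalty values and the function name are illustrative assumptions.

```python
def gap_cost(length, gop=5, gep=2):
    """Cost of one contiguous gap under an affine model (assumed convention:
    opening penalty once, extension penalty for every gap position)."""
    if length <= 0:
        return 0
    return gop + gep * length


# One gap of three positions is cheaper than three separate one-position gaps.
print(gap_cost(3))        # 11
print(3 * gap_cost(1))    # 21

# Lowering GOP and GEP makes an alignment more tolerant of inserted or deleted bases.
print(gap_cost(1, gop=1, gep=1))  # 2
```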

BLAST
9
Common BLAST problems
  • BlastN

(Figure: clone sequence aligned against mRNA, with a sequencing error)
BLAST
  • Solution
  • Low penalty for GOP and GEP 1

10
Translation Problems
  • 6-Frame translation

>embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
Frame 1: L A L P S S Q H E G S H C S G A
ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...
BLAST
11
Translation Problems
  • 6-Frame translation

>embl|J03801|HSLSZ Human lysozyme mRNA, complete cds with an Alu repeat in the 3' flank.
Frame 3:
Frame 2: H S D L A V N M K A L I V L G
Frame 1: L A L P S S Q H E G S H C S G A
ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggct...
Frame -1:
Frame -2:
Frame -3:
BLAST
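
The 6-frame translation of the sequence fragment on this slide can be reproduced with a short sketch. It assumes the standard genetic code and uses the first 50 nucleotides shown on the slide; the function names are illustrative. Stop codons are printed as "*", a symbol the transcript of the slide appears to have lost.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Map frame number (1, 2, 3, -1, -2, -3) to its translation."""
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


seq = "ctagcactctgacctagcagtcaacatgaaggctctcattgttctggggc"
frames = six_frame_translation(seq)
print(frames[1])  # LAL*PSSQHEGSHCSG
print(frames[2])  # *HSDLAVNMKALIVLG
```

Only frame 2 contains a start codon followed by an open stretch (M K A L I V L G), which is why a single inserted or deleted nucleotide in a clone sequence can move the real protein into another frame.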
12
Common BLAST problems
(Figure: Gene X, with introns and exons, is spliced to mRNA; further labels: full mRNA, translation)
BLAST
13
Common BLAST problems
mRNA
Clones derived from mRNA
BLAST
BlastX against protein sequence
3 possible hit-situations
14
Common BLAST problems
  • Yields no protein hit
  • Aligns with protein in 1 of the 6 frames.
  • Part perfect alignment
    or
BLAST
15
Part II Databases and annotation
16
Introduction
  • Primary database
    • DNA sequence (EMBL, GenBank, ...)
    • Amino acid sequence (SwissProt, PIR, ...)
    • Protein structure (PDB, ...)
  • Secondary database
    • Derived from primary DB
    • DNA sequence (UniGene, RefSeq, ...)
    • Combination of all (LocusLink, ENSEMBL, ...)
  • Structure
    • Flat file databases

Databases
17
Primary Databases
  • EMBL
    • DNA sequence
    • Human: 4.126.190.851 nucleotides in 292.205
      entries
    • Clones, mRNA, (Riken) cDNA, ...
    • New sequences can be submitted by anyone.
    • No curation check before admission.

Databases
18
Primary Databases
  • SwissProt
    • Amino acid sequence
    • Human
    • Contains protein information
    • SwissProt (EU), PIR (USA)
    • Crosslinks to the most informative DBs (PDB, OMIM)
    • Part of the UniProt consortium.
    • Each addition needs validation by appointed
      curators.
    • Highly curated

Databases
19
Secondary Databases
  • TrEMBL
    • Translated EMBL
    • Hypothetical proteins
    • After careful assessment → SpTrEMBL → SwissProt

Databases
20
Secondary Databases
  • UniGene
    • Automated clustering of sequences with high
      similarity
    • Derived from GenBank / EMBL
    • 1 consensus sequence
    • Species-specific

Databases
21
Secondary Databases
  • LocusLink
    • Curated sequences
    • Descriptive information about genetic loci
  • RefSeq
    • Non-redundant set of sequences.
    • Genomic DNA, mRNA, protein
    • Stable reference for gene identification and
      characterization.
    • Highly curated

Databases
22
Database Quality?
(Figure: submission routes for EMBL (DNA, mRNA) and SwissProt (protein); labels: Submitter, Curators, Database Manager)
Databases
23
How to Annotate?
  • BlastN against random nucleotide DB
  • ESTs
  • BlastN against structured nucleotide DB
    (UniGene, RefSeq)
  • mRNA hits
  • Sometimes not annotated at all
  • Best information

Databases
24
Microarrays
27
Part III Annotation Techniques
28
What do we have?
  • Probe sequence
  • Alignment Tools (e.g. BLAST)
  • Databases
  • !?! What to choose ?!?

Annotation
29
Possibilities?
  • 1. Do it like everyone else does.
  • 2. Make use of the curation of certain
    databases.
  • Goal
    • Annotate as many genes with as much information
      as possible (e.g. SwissProt ID)

Annotation
30
1st Approach - General
  • Done by most array manufacturers
  • Step-by-step approach
    • BLAST sequences against a nucleotide database
      (preferably UniGene)
    • Extract high-quality (HQ) hits (>95%)
    • For each HQ hit, search crosslinks.
    • Find a well-described (SwissProt) ID for each
      sequence.
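
The step-by-step approach can be sketched as a small filter-and-lookup routine. Everything concrete here is a hypothetical stand-in: the `Hit` record, the probe names, the UniGene-style and SwissProt-style IDs and the crosslink table. The 95% cut-off is read as percent identity, which is an assumption about what the slide's threshold measures.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    query: str       # probe / reporter sequence name
    subject: str     # ID of the database entry that was hit
    identity: float  # percent identity of the alignment (assumed metric)


def annotate_via_crosslinks(hits, crosslinks, min_identity=95.0):
    """For each query, take the best HQ hit that has a crosslink to an SP-style ID."""
    annotation = {}
    for hit in sorted(hits, key=lambda h: h.identity, reverse=True):
        if hit.identity <= min_identity:
            break  # hits are sorted, so all remaining ones are below the cut-off
        if hit.query in annotation:
            continue  # a better hit already annotated this query
        sp_id = crosslinks.get(hit.subject)
        if sp_id is not None:
            annotation[hit.query] = sp_id
    return annotation


# Hypothetical toy data, not taken from the presentation.
hits = [
    Hit("probe_1", "UG_A", 99.0),  # best hit, but no crosslink available
    Hit("probe_1", "UG_B", 96.0),
    Hit("probe_2", "UG_C", 91.0),  # below the HQ threshold
    Hit("probe_3", "UG_D", 97.5),
]
crosslinks = {"UG_B": "SP_1", "UG_D": "SP_2"}

result = annotate_via_crosslinks(hits, crosslinks)
print(result)  # {'probe_3': 'SP_2', 'probe_1': 'SP_1'}
```

A probe whose best hit has no crosslink falls back to its next HQ hit, and a probe without any HQ hit stays unannotated.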

Annotation Techniques
31
1st Approach - Concept
Annotation Techniques
32
2nd Approach - General
  • Make use of present database curation
  • Other way around
    • Use SwissProt to clean out EMBL
  • Result
    • Cleaned EMBL database (cEMBL) with direct SP
      crosslinks
  • BLAST against cEMBL
  • Extract high-quality alignment hits (>95%)
  • Convert EMBL ID to SP ID.
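
A minimal sketch of this second approach, under a stated assumption: "cleaning out" EMBL is taken to mean keeping only those EMBL entries that a SwissProt entry crosslinks to, so that every remaining entry converts directly to an SP-ID. The function names, IDs and hit tuples are hypothetical, and the 95% cut-off is again read as percent identity.

```python
def build_cembl(embl_ids, embl_to_sp):
    """Keep only EMBL IDs with a SwissProt crosslink (assumed meaning of 'cleaned')."""
    return {embl_id: embl_to_sp[embl_id] for embl_id in embl_ids if embl_id in embl_to_sp}


def annotate_with_cembl(hits, cembl, min_identity=95.0):
    """Convert the best HQ hit of each query from EMBL ID to SP-style ID.

    `hits` is an iterable of (query, embl_id, percent_identity) tuples.
    """
    annotation = {}
    best_identity = {}
    for query, embl_id, identity in hits:
        if identity <= min_identity or embl_id not in cembl:
            continue
        if identity > best_identity.get(query, 0.0):
            best_identity[query] = identity
            annotation[query] = cembl[embl_id]
    return annotation


# Hypothetical toy data, not taken from the presentation.
embl_ids = ["EMBL_A", "EMBL_B", "EMBL_C"]
embl_to_sp = {"EMBL_A": "SP_1", "EMBL_C": "SP_2"}
cembl = build_cembl(embl_ids, embl_to_sp)

hits = [
    ("probe_1", "EMBL_A", 98.0),
    ("probe_1", "EMBL_C", 99.5),
    ("probe_2", "EMBL_C", 90.0),  # below the HQ threshold
]
sp_annotation = annotate_with_cembl(hits, cembl)
print(cembl)          # {'EMBL_A': 'SP_1', 'EMBL_C': 'SP_2'}
print(sp_annotation)  # {'probe_1': 'SP_2'}
```

Because the database is filtered before the search, no crosslink lookup can fail afterwards; the cost is that sequences without a curated protein entry can no longer be hit at all.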

Annotation Techniques
33
2nd Approach - Concept
Annotation Techniques
34
Annotating Incyte Reporters
  • Total: 13.497
  • cEMBL approach: 2.898 (21,47%) SP-IDs
  • DM approach: 10.013 (74,18%) UG-IDs, of which
    • M: 4.723 (34,9%) SP-IDs
    • MR: 5.147 (38,1%) SP-IDs
    • MRH: 6.641 (49,2%) SP-IDs

Results
35
Annotating Incyte Reporters
  • All reporters present on Incyte Mouse UniGene 1
    converted
  • Total: 9.596 reporters
  • Old annotation: 9.370 (97,6%) UG-IDs, of which
    • Non-existing UG-IDs: 5.713 (59,5%)
    • M: 1.939 (20,2%) SP-IDs
    • MR: 2.096 (21,8%) SP-IDs
    • MRH: 2.582 (26,9%) SP-IDs
  • Datamining approach: 8.532 (88,9%) UG-IDs, of
    which
    • M: 4.145 (43,2%) SP-IDs
    • MR: 4.499 (38,1%) SP-IDs
    • MRH: 5.576 (60,1%) SP-IDs
  • Custom EMBL approach: 2.898 (30,2%) SP-IDs
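
The shares on this slide can be recomputed from the counts. The sketch below assumes that every percentage is relative to the total of 9.596 reporters, which matches most of the listed values (the slide writes a period as thousands separator and a comma as decimal mark). For the datamining approach, the recomputed MR and MRH shares come out at 46.9% and 58.1% rather than the 38,1 and 60,1 listed; from the transcript alone it cannot be decided whether the counts or the percentages are off.

```python
def share(count, total):
    """Percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)


TOTAL_REPORTERS = 9596  # "Total 9.596 reporters"

# Counts for the datamining approach and the custom EMBL approach, as listed.
counts = {
    "Datamining UG-IDs": 8532,
    "Datamining M": 4145,
    "Datamining MR": 4499,
    "Datamining MRH": 5576,
    "Custom EMBL SP-IDs": 2898,
}

for label, count in counts.items():
    print(f"{label}: {count} of {TOTAL_REPORTERS} = {share(count, TOTAL_REPORTERS)}%")
```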

Results
36
Annotating Incyte Reporters
  • Combined methods: Incyte Mouse UniGene 1
    reporters
  • Total: 9.596 reporters
  • No annotation: 1.062 (11%) reporters
  • Annotated with SP-ID: 5.895 (61,3%) reporters, of
    which
    • 2.184 (22,7%) identical SP-IDs
    • 532 (5%) reporters with improved SP-IDs by
      EMBL-method
    • 174 (1,8%) reporters with different mouse SP-IDs
    • 5 reporters found only by EMBL-method

Results
37
Conclusions
  • Annotation is much needed
    • Array sequences can point to different genes
  • Direct translation into protein is not the best option
    • Sequencing errors
    • Addition or deletion of nucleotides
    • 6-Frame window
  • Public nucleotide databases are redundant.
    • Sequencing errors
    • Differences in sequence length
    • Attachment of vector sequence

Conclusions
38
Questions?
End