Title: INDIAN INITIATIVE FOR RICE GENOME SEQUENCING
ANNUAL PROGRESS REPORT 17 Jan, 2002-2003
VIVEK DALAL
PROCESS FLOWCHART IN IIRGS
(Flowchart spread over two slides; stages in the order shown, responsible group in parentheses)
- START
- IDENTIFY CLONE (Physical Mapping)
- LIB. PREPARATION & SEQUENCING (Shot Gun Cloning & Sequencing)
- QUALITY CHECK (Genoinformatics): FAIL / PASSED
- TEMPLATE PREP. (Shot Gun Cloning)
- DNA SEQUENCING (Sequencing)
- TCF (Data Store)
- ASSEMBLY (Genoinformatics)
- NCBI GENBANK (Submission)
- ORIENTATION (PHASE II)
- GENE PREDICTION
- ANNOTATION
- FINISHING (PHASE III)
- NCBI GENBANK (Resubmission)
- STOP
Remaining group labels in the figure: (Genoinformatics) on both slides; TCF (Data Store) on the second slide.
CRITERIA FOR QUALITY CHECK
- To treat all hits with E. coli as contamination.
- To treat all hits with pBeloBAC as contamination.
- To treat all significant hits with pUC19 or any other cloning or sub-cloning vector as contamination.
- Up to 10% of maximum total contamination is allowed.
- Templates are estimated based on Mean Read Length & Avg. Success Rate.
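The criteria above can be sketched in Python. This is a minimal illustration, not the IIRGS pipeline: the hit records, the function names, the reading of the 10% limit as a fraction of reads, and the template formula (target bases divided by mean read length times success rate) are all assumptions made for the example.

```python
import math

# Sources named on the slide; any significant hit to one of them counts as contamination.
CONTAMINANT_SOURCES = {"E. coli", "pBeloBAC", "pUC19"}
MAX_CONTAMINATION = 0.10  # assumption: the limit is 10% of the reads in a library


def contamination_fraction(read_hits):
    """read_hits maps a read name to the set of sources it hit significantly."""
    if not read_hits:
        return 0.0
    contaminated = sum(1 for hits in read_hits.values() if hits & CONTAMINANT_SOURCES)
    return contaminated / len(read_hits)


def passes_quality_check(read_hits, limit=MAX_CONTAMINATION):
    """A library passes when its contaminated fraction does not exceed the limit."""
    return contamination_fraction(read_hits) <= limit


def estimate_templates(target_bases, mean_read_length, success_rate):
    """Assumed formula: templates needed = target bases / (read length * success rate)."""
    return math.ceil(target_bases / (mean_read_length * success_rate))


# Hypothetical library: 10 reads, one of which hits the BAC vector.
library = {f"read{i}": set() for i in range(9)}
library["read9"] = {"pBeloBAC"}
print(passes_quality_check(library))
print(estimate_templates(1_000_000, 500, 0.8))
```

Keeping the contaminant list and the limit as named constants makes it easy to add other cloning vectors or to change the threshold.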
RESULTS OF QUALITY CHECK
ASSEMBLY
Assembly uses a combination of 3 programs, namely:
- PHRED: Assigns quality values to each base.
- PHRAP: Trims vector sequences & assembles the reads into contigs.
- CONSED: Provides a graphical view of the assembly.
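The quality values that PHRED assigns are on a logarithmic scale, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. A short Python sketch of the conversion in both directions (the function names are illustrative and are not part of PHRED itself):

```python
import math


def phred_quality(error_probability):
    """Quality value for a given probability that the base call is wrong."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Inverse conversion: probability of a wrong base call for a quality value."""
    return 10 ** (-quality / 10)


# Q20 corresponds to 1 error in 100 bases, Q40 to 1 error in 10,000 bases.
for q in (20, 30, 40):
    print(q, error_probability(q))
```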
Phase I (figure labels: A, B, E, C, D, H, F, G)
Sequence Assembly - I
Avg. seq. reads: 2000/day
Avg. no. of bases: 7,75,000/day, i.e. 52,50,000/wk
OSJNBa0079N13
- No. of plates: 8 F/R
- Coverage: 6.3X
- Total contigs: 21
- No. of contigs > 2K: 12
- Largest contig: 27.9K
Sequence Assembly - II
OSJNBa0079N13
- No. of plates: 16 F/R
- Coverage: 10X
- Total contigs: 11
- Contigs > 2K: 6
- Largest contig: 66Kb
- Submitted to GenBank: 140Kb
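Fold coverage such as the 6.3X and 10X figures above is conventionally computed as the total number of sequenced bases divided by the clone length. The sketch below shows that arithmetic; the read count and mean read length are hypothetical inputs chosen for illustration, and only the 140 kb clone size is taken from the slide.

```python
import math


def fold_coverage(num_reads, mean_read_length, clone_length):
    """Coverage = total sequenced bases / length of the clone."""
    return num_reads * mean_read_length / clone_length


def reads_for_coverage(target_coverage, mean_read_length, clone_length):
    """Number of successful reads needed to reach a target fold coverage."""
    return math.ceil(target_coverage * clone_length / mean_read_length)


CLONE_LENGTH = 140_000   # 140 kb, the size submitted to GenBank
MEAN_READ_LENGTH = 500   # hypothetical mean read length in bases

print(fold_coverage(1400, MEAN_READ_LENGTH, CLONE_LENGTH))
print(reads_for_coverage(10, MEAN_READ_LENGTH, CLONE_LENGTH))
```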
Verification of BAC ends from sequence reads
AAGCTT: HindIII site
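Locating the HindIII recognition sequence in a read can be sketched as a plain string search. This is an illustration only: the example read is made up, and it assumes that the insert boundary is marked by an AAGCTT site. Because AAGCTT is its own reverse complement, one search covers both strands.

```python
HINDIII_SITE = "AAGCTT"  # recognition sequence shown on the slide


def find_sites(sequence, site=HINDIII_SITE):
    """Return the 0-based start positions of every occurrence of the site."""
    sequence = sequence.upper()
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        start = sequence.find(site, start + 1)
    return positions


# Hypothetical read containing two sites.
read = "ggAAGCTTccaagcttA"
print(find_sites(read))
```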
Validation with BAC End Sequences
(Figure: two alignments)
- Ba70D14 end contig vs. Ba70D14 forward BAC end seq.
- Ba70D14 end contig vs. Ba70D14 reverse BAC end seq.
Orientation of BAC Clones
(Figure: clones 52B22, 89M05 and 85G12; 89M05 forward end contig; clone-end label 85G12--FE--; two panels numbered 1-4)
Orientation of BAC Clones (continued)
(Figure: clones 52B22, 89M05 and 85G12; reverse BAC end sequence of 52B22; 89M05 reverse end contig; clone-end labels FE--, --RE, 85G12--, --52B22; two panels numbered 1-4)
SUMMARY OF PHASE II SUBMISSION TO GENBANK
STRATEGY FOR GENE PREDICTION
ANNOTATION STANDARDS (IRGSP, FEB. 2002)
- Sequences with 100% identity at the amino acid level to known proteins will receive the same, original gene name.
- Sequences with less than 100% identity but with significant homology to known proteins will be called "putative" proteins of the same name.
- Protein matches with BLASTP bit scores of > 100, e-values of < e-20, or equivalent criteria, will be regarded as significant homologies.
- Sequences with homology to unknown ESTs will be called "unknown."
- Sequences predicted by multiple gene prediction programs with no homology to any EST will be called "hypothetical protein."
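The naming rules above can be written as a small decision function. This is a sketch under assumptions: the function and parameter names are hypothetical, the order in which the rules are tested is a choice made for the example, and a gene model that matches none of the rules is returned as `None`, since the standards do not say how to label it.

```python
def has_significant_homology(bit_score=None, e_value=None):
    """Significant if the BLASTP bit score is > 100 or the e-value is < 1e-20."""
    return (bit_score is not None and bit_score > 100) or (
        e_value is not None and e_value < 1e-20
    )


def annotation_label(percent_identity=None, bit_score=None, e_value=None,
                     est_match=False, num_predictors=0):
    """Map the evidence for one gene model to an annotation category."""
    if percent_identity is not None and percent_identity == 100:
        return "exact"          # receives the original gene name
    if has_significant_homology(bit_score, e_value):
        return "putative"       # "putative" protein of the same name
    if est_match:
        return "unknown"        # homology only to unknown ESTs
    if num_predictors >= 2:
        return "hypothetical"   # predicted by multiple programs, no EST support
    return None                 # not covered by the rules as stated


print(annotation_label(percent_identity=85.0, bit_score=250, e_value=1e-60))
print(annotation_label(num_predictors=2))
```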
GENE PREDICTION & ANNOTATION RESULTS
- Total No. of Genes Predicted - 984
- Exact / 100% identical - 156
- Putative - 339
- Unknown - 78
- Hypothetical - 411
Known genes: Pi-ta, Pib, Xa2, Xa21, RGAs, Yr10, NBS-LRR, salinity tolerance, Gag-Pol polyprotein, etc.
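As a quick consistency check, the four category counts reported above can be summed and expressed as shares of the total. The snippet uses only the numbers on the slide; the dictionary keys are shorthand chosen for the example.

```python
# Category counts as reported on the slide.
counts = {"exact": 156, "putative": 339, "unknown": 78, "hypothetical": 411}

total = sum(counts.values())  # 984, matching the reported total

# Share of each category as a percentage of all predicted genes.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in shares.items():
    print(f"{name}: {counts[name]} genes ({share}%)")
```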
Problems in sequences
(Figure labels: single clone area; gap; gap; single strand area; multiple clone coverage on both strands)
Finishing DNA Sequences
Finishing is the process of polishing raw sequences, transforming the fragmented rough draft into a long, continuous final product without breaks or errors.
Objectives:
- Resolve sequence ambiguities and discrepancies, such that the error rate is less than one in 10,000 bases.
- Provide double-stranded coverage for every base:
- minimum of two different clones
- two different directions
- two different chemistries
- Achieve contiguity.
- Delineate vector/insert junctions.
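Two of the objectives above lend themselves to a simple automated check: the error-rate target and coverage in both directions. The sketch below is illustrative only. It assumes Phred-style quality values for the consensus and per-base read depths for each direction, both hypothetical inputs, and it does not test the two-clone or two-chemistry criteria.

```python
def estimated_error_rate(qualities):
    """Mean per-base error probability implied by Phred-style quality values."""
    return sum(10 ** (-q / 10) for q in qualities) / len(qualities)


def single_strand_positions(forward_depth, reverse_depth):
    """Positions covered by reads in only one direction, or by no reads at all."""
    return [
        i for i, (fwd, rev) in enumerate(zip(forward_depth, reverse_depth))
        if fwd == 0 or rev == 0
    ]


def meets_finishing_targets(qualities, forward_depth, reverse_depth,
                            max_error_rate=1e-4):
    """True when the error rate is below 1 in 10,000 and both strands are covered."""
    return (
        estimated_error_rate(qualities) < max_error_rate
        and not single_strand_positions(forward_depth, reverse_depth)
    )


# Hypothetical 4-base consensus: good quality, but base 2 lacks reverse reads.
print(single_strand_positions([2, 1, 3, 1], [1, 1, 0, 2]))
print(meets_finishing_targets([50, 50, 50, 50], [2, 1, 3, 1], [1, 1, 0, 2]))
```

Positions returned by `single_strand_positions` are the natural candidates for additional directed reads during finishing.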
www.nrcpb.org
www.nrcpb.org/rgp.html
THANK YOU