Provided by: nicolasg
From cDNA to integrative protein annotation and beyond: application to
the Alvinella pompejana cDNA collection
Gagnière, N. (1), Bigot, Y. (2), Gaill, F. (3), Higuet, D. (4),
Jollivet, D. (5), Leize, E. (6), Perrodou, E. (1), Rees, J.F. (7),
Weissenbach, J. (8), Zal, F. (9), Poch, O. (1), Lecompte, O. (1)
1 CNRS-INSERM-ULP, UMR7104/U596 LBGI Laboratoire de Biologie et Génomique Intégratives
2 CNRS-UFR FRE 2535 - Laboratoire d'Etude des Parasites Génétiques
3 CNRS-UPMC-MNHN-IRD, UMR 7138 Systématique, Adaptation, Evolution
4 CNRS-UPMC-MNHN-IRD, UMR 7138 Génétique et Evolution
5 CNRS-UPMC, UMR 7144 - Evolution et Génétique des Populations Marines
6 CNRS-ULP, UMR 7512 - Laboratoire de Spectrométrie de masse BioOrganique
7 ISV-UCL, Laboratoire de Biologie cellulaire (Belgium)
8 GENOSCOPE
9 CNRS-UPMC Equipe Ecophysiologie Adaptation et Evolution Moléculaires
Available cDNA libraries
  • Full-length enriched cDNA libraries were generated at the
    Genoscope (http://www.genoscope.cns.fr/) for:
      • whole animal (Cloneminer method)
      • gills (Oligo-capping method)
      • ventral tissue (Oligo-capping method)
      • pygidium (Cloneminer method, sequencing in progress)
  • Whole animals as well as dissected tissues were collected
    during the oceanographic Biospeedo cruise on the Pacific Ridge
    in 2004. Sequencing of the 5' ends is ongoing at Genoscope on
    an ABI 3730 sequencer using dye-terminator fluorescent DNA
    sequencing technology. A total of 200,000 reads will be
    produced. Using these sequence data, we will select about
    10,000 full-length cDNAs, and the entire sequence of the
    selected clones will be determined.

Abstract
Alvinella pompejana, the "Pompeii worm", is a
Polychaete Annelid discovered in 1980. This
tubicolous worm colonizes hydrothermal vents,
where it faces extreme and variable
physico-chemical conditions including very high
temperatures (from 20 to over 80 °C), anoxic
conditions, low pH, and high concentrations of heavy
metals and sulfide. This environment makes A.
pompejana an ideal model for studies aimed at
deciphering adaptation in general, as well as a
unique source of thermostable proteins of
eukaryotic origin for structural studies. For
these reasons, the Alvinella consortium initiated
a massive cDNA sequencing project. To exploit the
first 70,000 reads, we have designed a
semi-automated protocol that goes from the Alvinella cDNA
collection to annotated proteins. This
protocol includes chromatogram base calling, raw
sequence cleaning and assembly, as well as
original strategies for protein creation and
annotation.
Semi-automated cDNA sequence analysis protocol
Cleaning and assembling process
Protein creation and integrative annotation with MACSIMS
Contigs and singlets are annotated with the GScope
software platform, developed at the LBGI
(R. Ripp, manuscript in preparation). GScope
manages, integrates, validates, analyses and
visualizes high-throughput information (genomic
and protein sequences, transcriptomics). Classical
tools for similarity search, gene prediction and
codon usage determination are implemented, as well
as in-house programs for specialised analyses
(start codon validation, frameshift detection,
oligonucleotide design, target analysis,
phylogenetic distribution).

Protein sequence prediction. We developed an original
BlastX-based approach to detect and translate Alvinella CDS
segments, complementing the hidden Markov model (HMM) CDS
prediction program ESTscan2 (Lottaz et al.). Because
too few Alvinella coding and non-coding cDNA
sequences were available, a robust Alvinella HMM
could not be built, so we used
the bundled human model, which proved to be
efficient. This result is linked to the close
relationship between A. pompejana and
vertebrates (Alvinella consortium, manuscript in
preparation).

MACS creation. All the annotation
programs rely on high-quality clustered
multiple alignments generated by the PipeAlign
(http://bips.u-strasbg.fr/PipeAlign/) protein
analysis toolkit. This allows the reliable
characterization of a target protein sequence in
its evolutionary context.

Annotation. We used
MACSIMS (http://bips.u-strasbg.fr/MACSIMS) to
propagate structural and functional information
mined from public databases to the Alvinella
sequences. In addition, the GOAnno program
(http://bips.u-strasbg.fr/GOAnno/) annotates
proteins according to the Gene Ontology, and a
data-mining program generates a consensus
functional definition and a consensus EC number
from close homologs.

Throughout the whole
analysis protocol, fine-grained information about
cDNAs (tissue of origin, cloning errors, sequence
quality, ...) is maintained in a relational
database to facilitate comparison of tissue
libraries, variant comparison and efficient
exploitation of A. pompejana cDNAs.
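
The poster does not say how the consensus EC number is derived from close homologs. As a purely illustrative sketch, one simple possibility is to keep the EC levels on which all homologs agree and replace the remaining levels with "-". The function name `consensus_ec` and this agreement rule are assumptions, not the consortium's data-mining program.

```python
def consensus_ec(ec_numbers):
    """Return the EC prefix shared by all homologs, padded with '-'.

    Hypothetical rule for illustration only: a level is kept when every
    homolog carries the same value at that level; the first disagreement
    (or an unspecified '-' level) ends the shared prefix.
    """
    if not ec_numbers:
        return None
    levels = [ec.split(".") for ec in ec_numbers]
    shared = []
    for values in zip(*levels):  # walk the four EC levels in parallel
        if len(set(values)) == 1 and values[0] != "-":
            shared.append(values[0])
        else:
            break
    if not shared:
        return None  # homologs disagree from the first level on
    return ".".join(shared + ["-"] * (4 - len(shared)))


if __name__ == "__main__":
    # Three homologs that agree down to the third EC level.
    print(consensus_ec(["1.1.1.1", "1.1.1.2", "1.1.1.1"]))
```

A majority vote per level would be an equally plausible rule; strict agreement is used here only because it is the most conservative choice.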
BlastX-based protein sequence prediction. The
significant BlastX HSPs of each assembled sequence are
mapped onto the corresponding cDNA segments, which are
translated in the correct reading frame. Unmatched
cDNA segments and overlapping HSP segments are
padded with X characters. Finally, the protein is
extended in both directions until a stop codon or
a cDNA extremity is reached.
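
The procedure in this caption can be sketched in a few lines of Python. This is a simplified illustration, not the GScope implementation: it assumes forward-strand HSPs given as hypothetical `(start, end)` nucleotide coordinates (0-based, end-exclusive) that are already codon-aligned and do not overlap, and it pads each unmatched gap with one X per started codon.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate codon by codon; codons with ambiguous bases become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def extend_upstream(cdna, start):
    """Move left in frame until a stop codon or the cDNA start."""
    pos = start
    while pos - 3 >= 0 and CODON_TABLE.get(cdna[pos - 3:pos], "X") != "*":
        pos -= 3
    return pos


def extend_downstream(cdna, end):
    """Move right in frame until a stop codon or the cDNA end."""
    pos = end
    while pos + 3 <= len(cdna) and CODON_TABLE.get(cdna[pos:pos + 3], "X") != "*":
        pos += 3
    return pos


def predict_protein(cdna, hsps):
    """Build a protein from HSP segments, X-padded gaps and extensions."""
    cdna = cdna.upper()
    hsps = sorted(hsps)
    first_start, last_end = hsps[0][0], hsps[-1][1]
    parts = [translate(cdna[extend_upstream(cdna, first_start):first_start])]
    prev_end = None
    for start, end in hsps:
        if prev_end is not None and start > prev_end:
            # Unmatched segment: one X per started codon (assumption).
            parts.append("X" * ((start - prev_end + 2) // 3))
        parts.append(translate(cdna[start:end]))
        prev_end = end
    parts.append(translate(cdna[last_end:extend_downstream(cdna, last_end)]))
    return "".join(parts)


if __name__ == "__main__":
    # Two HSPs separated by a one-base frameshift.
    toy_cdna = "TAAGCTATGAAACGGGTTTCCCTGAAA"
    print(predict_protein(toy_cdna, [(6, 12), (13, 19)]))
```

In the toy example the second HSP is in a different frame from the first, which is the frameshift case the X padding is meant to absorb.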
Propagation of functional and structural
information using MACSIMS
Cleaning and assembling process. For the 70,000
available reads, base calling and trimming of
low-quality (Q < 13) regions were performed
using the Phred program. Vector sequences and
other contaminants were masked using Cross-match.
Poly(A/T) regions as well as repetitive sequences
were masked using ad hoc scripts. After sequence
trimming and masking, sequences with fewer than
100 unmasked bases were excluded from further
processing. Cleaned sequences of each library
were assembled separately using Cap3, leading to
a total of 13,000 contigs and singlets. Mean
contig length is > 900 bp and the library
redundancy ranges from 53% to 79%.
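
The filtering rule described above can be illustrated with a minimal sketch. It assumes masked positions are written as X or N, and it trims only the read ends with a naive per-base threshold, which is simpler than Phred's own trimming; all function names are hypothetical.

```python
def phred_error_probability(q):
    """Error probability for a Phred quality value: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def trim_low_quality_ends(seq, quals, min_q=13):
    """Drop bases below min_q from both ends of a read (naive trimming)."""
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end]


def count_unmasked(seq):
    """Count bases that are not masked (masking assumed to use X or N)."""
    return sum(1 for base in seq if base in "ACGTacgt")


def keep_read(seq, min_unmasked=100):
    """Reads with fewer than min_unmasked unmasked bases are excluded."""
    return count_unmasked(seq) >= min_unmasked


if __name__ == "__main__":
    print(round(phred_error_probability(13), 3))   # roughly a 5% error rate
    print(trim_low_quality_ends("ACGTAC", [5, 20, 30, 30, 12, 8]))
    print(keep_read("ACGT" * 24 + "XXXX"))          # only 96 unmasked bases
```

A quality value of 13 corresponds to an error probability of about 0.05, which is why it is a common trimming cut-off.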
Multiple Alignment of Complete Sequences (MACS)
creation using PipeAlign.
PFAM-A annotation display using JalView
(www.jalview.org/). Propagated features appear in
a lighter color than database mined features.
Conclusion and perspectives
Ongoing developments
Annotation results summary
To facilitate and speed up oligo design for
future protein expression tests, we have
developed a new program called OliDA (Oligo
Design Automatization) that automatically determines
optimized cDNA and protein boundaries by
analysing MACSIMS results. Boundary determination
combines PFAM-A domain or PDB structure
boundaries with phylogenetic distribution and
conservation patterns. This program is integrated
into the GScope platform upstream of oligo
ordering for PCR and will be available as a web
application.
Annotation protocol
Overview of the OliDA decision tree. Since
the sequenced 3' cDNA extremities are often unusable,
when the C-terminal extremity of the protein is
expected to lie within the insert (1,200 base pairs
on average), the program uses vector-specific,
hand-designed oligos called run-off oligos.
These oligos match the vector downstream of the
insert, so that the endogenous stop codon of the
protein is used.
Beta version of the OliDA Web 2.0 results page. The
red lines indicate the proposed boundaries. The user
can correct cloning boundaries by clicking on the
alignment.
References
  • Chalmel F, Lardenois A, Thompson JD, Muller J,
    Sahel JA, Leveillard T, Poch O. GOAnno: GO
    annotation based on multiple alignment.
    Bioinformatics. 2005.
  • Clamp M, Cuff J, Searle SM, Barton GJ. The
    Jalview Java Alignment Editor. Bioinformatics.
    2004.
  • Ewing B, Hillier L, Wendl MC, Green P.
    Base-calling of automated sequencer traces using
    phred. Genome Res. 1998.
  • Huang X, Madan A. CAP3: A DNA sequence assembly
    program. Genome Res. 1999.
  • Lecompte O, Thompson JD, Plewniak F, Thierry J,
    Poch O. Multiple alignment of complete sequences
    (MACS) in the post-genomic era. Gene. 2001.
  • Lottaz C, Iseli C, Jongeneel CV, Bucher P.
    Modeling sequencing errors by combining Hidden
    Markov models. Bioinformatics. 2003.
  • Plewniak F, Bianchetti L, Brelivet Y, Carles A,
    Chalmel F, Lecompte O, Mochel T, Moulinier L,
    Muller A, Muller J, Prigent V, Ripp R, Thierry
    JC, Thompson JD, Wicker N, Poch O. PipeAlign: A
    new toolkit for protein family analysis. Nucleic
    Acids Res. 2003.
  • Thompson JD, Muller A, Waterhouse A, Procter J,
    Barton GJ, Plewniak F, Poch O. MACSIMS: multiple
    alignment of complete sequences information
    management system. BMC Bioinformatics. 2006.

Annotation results summary
  • About 30% of the initial cDNA sequences were
    discarded from the assembly by the cleaning
    process. Although some short sequences of good
    quality were removed, the vast majority of the
    discarded sequences were empty vector sequences
    and chimeric inserts.
  • Of the 13,000 assembled sequences, only half
    have significant BlastX homologs for
    protein creation and annotation. ESTscan2
    prediction using the human model on the sequences
    without homologs showed many long open reading
    frames with biased composition.
  • Almost all the proteins have been annotated with
    PFAM-A domains, Gene Ontology terms, a functional
    definition or an EC number. Annotation verification
    is in progress. In addition, we will
    implement a scoring function to help check,
    semi-automatically, the consistency of the
    annotation for each sequence.