Title: FASGCGFRAMESSEQWEBTRANSLATE
1FAS(GCG) -> FRAMES(SEQWEB) -> TRANSLATE -> BLAST(NCBI)
-> PROTEIN ANALYSIS(EXPASY)
2LECTURE 92-13
Petrus Tang, Ph.D. Graduate Institute of Basic
Medical Sciences and Bioinformatics Center, Chang
Gung University. petang_at_mail.cgu.edu.tw
http://petang.cgu.edu.tw
4THE WORLD OF GENOMICS
Published Complete Genome Projects: 95 (including 3 chromosomes)
Prokaryotic Ongoing Genome Projects: 310
Eukaryotic Ongoing Genome Projects: 211 (including 11 chromosomes)
Last update: 18 July 2002
5GenBank Sequences
GenBank is the National Institutes of Health (NIH)
genetic sequence database, an annotated
collection of all publicly available DNA
sequences.
Genetic Sequence Data Bank, Apr 15 2004, Release 141.0:
33,676,218 loci, 38,989,342,565 bases, from 33,676,218 reported sequences
Homo sapiens: 7,714,277 sequences, 10,569,756,393 bases
EST: 5,491,557 sequences
6Gene Expression Post-Translational Modification
of Proteins
Cell Growth, External Stress
7Gene Expression
A unique set of genes is expressed under different
growth conditions and at different stages.
8Functional Genomics
9Global Analysis of Gene Expression
Analysis of 10,000-50,000 messages in a
transcriptome will generate a relevant profile of
gene expression within a cell, providing a
quantitative measurement of transcripts for gene
discovery.
10Microarray
11The Art of Microarray
Marilyn Monroe: Marilyn was printed using a
MicroGrid Compact with MicroSpot 10K Quill pins
onto Corning GAPS2 slides. Cy3- and Cy5-labelled
M13FWD oligonucleotides were made up to 1µM in 1x
EZRays spotting solution (Apogent Discoveries),
0.01 N-lauroyl sarcosine. Eight 2-fold dilutions
of the oligonucleotides were made in the same
spotting buffer. A 20µl aliquot of each
oligonucleotide dilution was placed in the
appropriate wells of Greiner 384-well V-bottom
microplates.
The image of Marilyn was downloaded from the web,
then pixelised and grayscaled using image manipulation
software. Pixel information was exported to
Excel, from which grid patterns were generated. Ms
Monroe was scanned at 10µm resolution using the
ArrayWorX scanner from Applied Precision.
12The Art of Microarray
Mona Lisa: Mona was printed using a MicroGrid
Compact with MicroSpot 10K Quill pins onto
Corning GAPS2 slides. Cy3-labelled M13FWD
oligonucleotides were made up to 1µM in 1x EZRays
spotting solution (Apogent Discoveries), 0.01
N-lauroyl sarcosine. Eight 2-fold dilutions of
the oligonucleotides were made in the same
spotting buffer. A 20µl aliquot of each
oligonucleotide dilution was placed in the
appropriate wells of Greiner 384-well V-bottom
microplates.
The image of Mona was downloaded from the web,
then pixelised and grayscaled using image manipulation
software. Pixel information was exported to
Excel, from which grid patterns were
generated. The Mona Lisa was scanned at 10µm
resolution using the ArrayWorX scanner from
Applied Precision.
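As a rough illustration of how a grayscale picture can be turned into a print layout, the sketch below computes a 2-fold dilution series and maps each pixel's gray value to one of the eight dilutions. Several things here are assumptions for illustration only: that the series starts at the 1µM stock, that brighter pixels receive the more concentrated spots, and the function names themselves. The slides do not describe the exact procedure that was carried out in Excel.

```python
def dilution_series(stock_um=1.0, steps=8, fold=2.0):
    """Concentrations (in µM) of a serial dilution, starting at the stock."""
    return [stock_um / fold**i for i in range(steps)]


def pixel_to_level(gray, steps=8):
    """Map an 8-bit gray value (0-255) to a dilution index.

    Assumption: 255 (white) uses the most concentrated spot (index 0)
    and 0 (black) uses the most dilute one (index steps - 1).
    """
    if not 0 <= gray <= 255:
        raise ValueError("gray value must be in 0..255")
    bin_width = 256 / steps
    return steps - 1 - int(gray // bin_width)


def image_to_grid(pixels, steps=8):
    """Convert rows of gray values into rows of dilution indices."""
    return [[pixel_to_level(g, steps) for g in row] for row in pixels]


series = dilution_series()
grid = image_to_grid([[0, 64, 128, 255]])
print(series)  # eight concentrations, from 1 µM downwards
print(grid)    # one row of dilution indices, one per pixel
```

Each index in the grid would then select the microplate well holding the corresponding dilution.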
13Serial Analysis of Gene Expression (SAGE)
14What are ESTs?
Expressed Sequence Tags are small pieces of DNA
sequence (usually 200 to 500 nucleotides long)
that are generated by sequencing either one or
both ends of an expressed gene. The idea is to
sequence bits of DNA that represent genes
expressed in certain cells, tissues, or organs
from different organisms and use these "tags" to
fish a gene out of a portion of chromosomal DNA
by matching base pairs. The challenge associated
with identifying genes from genomic sequences
varies among organisms and is dependent upon
genome size as well as the presence or absence of
introns--the intervening DNA sequences
interrupting the protein coding sequence of a
gene.
15Expressed Sequence Tags (EST)
5'-EST
3'-EST
17Basic Features and Tools of an Automated EST
Analysis Pipeline
- Relational database (Oracle 8i)
- Automatic data validation
- Quality score generation
- Automatic trimming of low-quality, vector, adaptor, poly-A tail, low-complexity and contaminant sequences
- Automatic running of selected BLAST algorithms, with user-defined parameters and user-selected reference databases, and storage of top results (by user-defined cutoffs) in the database
- Includes a web interface for viewing the data in the database, according to the permissions allowed to the viewer (by individual, project, lab or institution)
- Includes a Java tool for dbEST submission of newly generated ESTs at intervals defined by the users
- System can be readily and simply deployed at any of the partners' institutions
- Includes methods for defining a UniGene set for a library
Additional functionalities are needed by the members of the
current co-development group, including:
- Tissue or organism, integration of gene expression data
- Annotations: Gene Ontology annotations, functional motif annotation, metabolic pathway annotations, signal transduction pathways
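The trimming step listed among the pipeline features can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the function names, the minimum poly-A run length of 10 and the quality cutoff of 20 are all assumptions chosen for the example.

```python
import re


def trim_poly_a(seq, min_run=10):
    """Remove a trailing poly-A run of at least min_run bases (assumed cutoff)."""
    match = re.search(r"A{%d,}$" % min_run, seq.upper())
    return seq[:match.start()] if match else seq


def trim_low_quality(seq, quals, min_q=20):
    """Clip both ends until a base with quality >= min_q is reached.

    seq and quals must have the same length; min_q is an assumed cutoff.
    """
    start, end = 0, len(seq)
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    return seq[start:end], quals[start:end]


read = "ACGTACGT" + "A" * 12
print(trim_poly_a(read))
print(trim_low_quality("ACGTAC", [5, 10, 30, 40, 25, 8]))
```

Vector, adaptor and contaminant screening would need a sequence comparison against reference sequences and is not covered by this sketch.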
18Data Processing Raw Nucleotide Sequence
EST or SAGE clones sequenced
PC
UNIX
Ewing B et al. (1998) Base-calling of automated
sequencer traces using phred. I. Accuracy
assessment. Genome Res. 8(3):175-85
Ewing B, Green P (1998) Base-calling
of automated sequencer traces using phred. II.
Error probabilities. Genome Res. 8(3):186-94
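The quality values reported by phred are related to the estimated base-calling error probability P by Q = -10 log10(P), so Q20 corresponds to a 1 in 100 chance of a wrong call and Q30 to 1 in 1000. The helper names below are illustrative, not part of phred itself.

```python
import math


def phred_to_error(q):
    """Error probability for a phred quality value: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p):
    """Phred quality value for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


def expected_errors(quals):
    """Expected number of miscalled bases in a read with these qualities."""
    return sum(phred_to_error(q) for q in quals)


print(phred_to_error(20))
print(error_to_phred(0.001))
print(expected_errors([10, 20, 30]))
```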
19FREEWARES
20EST Analysis Clustering
21Similarity Search Blastx
BLAST uses a heuristic algorithm which seeks
local as opposed to global alignments and is
therefore able to detect relationships among
sequences which share only isolated regions of
similarity (Altschul et al., 1990)
TV007D02
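blastx compares a nucleotide query against a protein database by first translating the query in all six reading frames. The sketch below shows only that translation step, using the standard genetic code; it does not reproduce the BLAST search heuristic. The function names are illustrative.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return the translations of frames +1..+3 and -1..-3."""
    frames = {}
    for strand, s in (("+", seq.upper()), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
```

An asterisk marks a stop codon. For an EST, the frame with the longest stretch free of stop codons is often the most promising candidate for a protein-level comparison.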
22InterPro provides an integrated view of the
commonly used signature databases, and has an
intuitive interface for text- and sequence-based
searches.
Bioinformatics infrastructural activities are
crucial to modern biological research. Complete
and up-to-date databases of biological knowledge
are vital for the increasingly information-dependent
biological and biotechnological research.
Secondary protein databases on functional sites
and domains like PROSITE, PRINTS, SMART, Pfam,
ProDom, etc. are vital resources for identifying
distant relationships in novel sequences, and
hence for predicting protein function and
structure. Unfortunately, these signature
databases do not share the same formats and
nomenclature, and each database has its own
strengths and weaknesses. To capitalise on
these, the following partners (EBI, SIB,
University of Manchester, Sanger Institute,
GENE-IT, CNRS/INRA, LION bioscience AG and
University of Bergen) unified PROSITE, PRINTS,
ProDom and Pfam into InterPro (Integrated
resource of Protein Families, Domains and Sites).
The latest databases to join the project were
SMART and, more recently, TIGRFAMs.
23Annotation - GO
The goal of the Gene Ontology(TM) Consortium is to
produce a dynamic controlled vocabulary that can
be applied to all organisms even as knowledge of
gene and protein roles in cells is accumulating
and changing.
Molecular Function: the tasks performed by individual gene products; examples are transcription factor and DNA helicase.
Biological Process: broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions.
Cellular Component: subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex.
p53
24Classification According to Metabolic
Signalling Pathways
BioCarta (http://biocarta.com)
Kyoto Encyclopedia of Genes and Genomes (KEGG) http://www.genome.ad.jp/kegg/
The Cancer Genome Anatomy Project (CGAP)
http://cgap.nci.nih.gov/
25Annotation
ESTs are categorized into the following classes:
- ESTs that show homology to known protein motifs/domains
- Unique ESTs with no matches
- ESTs that match known protein sequences exactly
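The three classes above could be assigned from the best database hit of each EST, for example as follows. This is a hypothetical sketch: the `BestHit` record, the class labels and the rule that an exact match means 100% identity are assumptions, since the slides do not state the criteria that were actually applied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestHit:
    """Best similarity-search hit for one EST (hypothetical record layout)."""
    subject_id: str
    percent_identity: float


def classify_est(hit: Optional[BestHit]) -> str:
    """Assign an EST to one of the three annotation classes."""
    if hit is None:
        return "unique"        # no match in any database
    if hit.percent_identity >= 100.0:
        return "exact match"   # identical to a known protein sequence
    return "homolog"           # similar to a known protein, motif or domain


examples = [None, BestHit("hit_A", 100.0), BestHit("hit_B", 47.5)]
print([classify_est(h) for h in examples])
```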
26Top 50 Highly Expressed ESTs
[Figure: chart of the top 50 highly expressed ESTs; axes: Number of ESTs and Percentage of total ESTs (labelled values 0.17 and 2.31)]
27GO Classification of ESTs
BIOLOGICAL PROCESS
MOLECULAR FUNCTION
CELLULAR COMPONENT
28COG Classification of ESTs
29Automated EST Analysis Pipeline
Project Management, Sequence Management, Clustering,
Sequence Analysis, Annotation, dbEST
GenBank is the National Institutes of Health
genetic sequence database, an annotated
collection of all publicly available DNA
sequences. There are approximately
20,648,748,345 bases in 17,471,130 sequence
records as of June 2002, Release 130 (12,055,326
sequences in dbEST, 4,500,000 from Homo sapiens).
12,261,869 (Aug 2002)
30EST Databases: dbEST, UniGene
dbEST (http://www.ncbi.nlm.nih.gov/dbEST/index.html)
is a division of GenBank that contains
sequence data and other information on
"single-pass" cDNA sequences, or Expressed
Sequence Tags, from a number of organisms.
UniGene (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=unigene)
is an experimental
system for automatically partitioning GenBank
sequences into a non-redundant set of
gene-oriented clusters. Each UniGene cluster
contains sequences that represent a unique gene,
as well as related information such as the tissue
types in which the gene has been expressed and
map location.
31dbEST Record
NCBI dbEST accession numbers for Trichomonas
vaginalis ESTs: BQ621379-BQ621732,
BQ625216-BQ625229, BQ640771-BQ640943
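Assuming each pair of accession numbers above denotes an inclusive range, the individual accessions can be enumerated with a small helper. The function name is illustrative, and the range interpretation is an assumption based on the pairs being listed in ascending order.

```python
import re


def expand_range(first, last):
    """List every accession from first to last (inclusive)."""
    m1 = re.fullmatch(r"([A-Z]+)(\d+)", first)
    m2 = re.fullmatch(r"([A-Z]+)(\d+)", last)
    if not m1 or not m2 or m1.group(1) != m2.group(1):
        raise ValueError("accessions must share the same letter prefix")
    prefix, width = m1.group(1), len(m1.group(2))
    start, stop = int(m1.group(2)), int(m2.group(2))
    return [f"{prefix}{n:0{width}d}" for n in range(start, stop + 1)]


# Ranges as listed on the slide.
RANGES = [
    ("BQ621379", "BQ621732"),
    ("BQ625216", "BQ625229"),
    ("BQ640771", "BQ640943"),
]

accessions = [acc for first, last in RANGES for acc in expand_range(first, last)]
print(len(accessions), accessions[0], accessions[-1])
```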
1: BQ640943. TVEST017.H09 Tv30...[gi:21765401] Taxonomy
IDENTIFIERS
dbEST Id: 12791004
EST name: TVEST017.H09
GenBank Acc: BQ640943
GenBank gi: 21765401
CLONE INFO
Clone Id: (5')
DNA type: cDNA
PRIMERS
PCR forward: T7
PCR backward: T3
Sequencing: T3
PolyA Tail: Unknown
SEQUENCE
ATTACAGCAATTGCCGATGATTGGCTTGGCATCACTGGCTGGCGTATCGA
AAACTTTAAG CTCGTTAAAGTTGCAGAGATGGG
CGCCTTCCACACAGGAGATTCTTATTTGTATCTTCAC
GCTTACCTTGNTTGGCACAAGCAAGCTCGTCCATCGTGATATTTAC
TTCTGGCAGGGCTC CACATCCACAACAGATGAG
CGCGGTGCTGTTGCTATCAAGGCTGTTGAACTTGATGACAG
ATTTGGAGGCTCTCCAAAGCAACACAGAGAAGTCCAGAACCA
CGAGTCAGACCAGTTCAT
TGGACTCTTCGATCAGTTTGGCGGTGTTCGCTACCTCGATGGCGGTGTTG
AATCAGGATT CCACAAAGTCACAACATCTGCAA
AGGTTGAGATGTACAGAATCAAGGGAAGAAAGCGCCC
AATTCTCCAGATCGTTCCAGCTCAGCGCTCCTCCCTCAACCATGGA
GATGTTTTCATTAT CCATGC
Entry Created: Jul 8 2002
Last Updated: Jul 15 2002
PUTATIVE ID (Assigned by submitter): ACTIN-BINDING PROTEIN FRAGMIN P.
LIBRARY
Lib Name: Tv30236_PT cDNA Library
Organism: Trichomonas vaginalis
Cell line: ATCC30236
Develop. stage: Trophozoites at mid-log phase
Lab host: XL1 Blue-MRF'
Vector: Lambda ZAP-Express (Stratagene)
R. Site 1: EcoRI
R. Site 2: XhoI
SUBMITTER
Name: Tang, P.
Lab: Molecular Regulation and Bioinformatics Laboratory, College of Medicine
Institution: Chang Gung University
Address: 259 Wenhwa 1st. Road, Kweishan, Taoyuan 333, Taiwan
Tel: 886 3 3283016 EXT5136
Fax: 886 3 3283031
E-mail: petang_at_mail.cgu.edu.tw
CITATIONS
Title: Analysis of Gene Expression Profile in Trichomonas vaginalis by EST Sequencing
Authors: Zhou,Y., Shu,W.M., Huang,S.C.C., Huang,K.Y., Tang,P.
Year: 2003
Status: Unpublished
http://www.ncbi.nlm.nih.gov/dbEST/index.html
Entrez query: trichomonas vaginalis AND gbdiv_est[PROP]
32http://www.ncbi.nlm.nih.gov/SAGE/