Title: Outline
1. Outline
- Biology 101: cells, DNA, DNA replication
- Genes, EST
- Genetic markers, SSR
- Bioinformatics
- Database, BLAST
- Predict SSR from public EST sequences
- Goal, method and results
2. Cell: Smallest Unit of Life
- Basic unit of life
- Organisms
- Single cell
- Colonies/clusters
- Multicellular (cell differentiation)
- Multicellular organisms
- cell -> tissue -> organ -> system -> organism
3. Chemical Foundation for Cells
- Organic compounds: molecules of life
- Consist of atoms of C, H, O and sometimes other elements
- Carbohydrates: starches, sugars, cellulose; used for energy and building material
- Lipids: store fat for energy
- Protein: building material, structural, enzyme
- Nucleic acids: DNA, RNA; basis of inheritance and cell reproduction
4. Cell Structure and Function
- Nucleus: DNA, RNA, replication
- Ribosomes: protein synthesis
- Endoplasmic reticulum: synthesis of proteins, lipids and others
- Golgi apparatus: processing, packaging and secretion of proteins
- Mitochondrion: cellular respiration
- Chloroplasts (plant only): photosynthesis
- Vacuoles, vesicles: storage of substances
- Lysosomes: intracellular digestion
5. DNA, Chromosome and Gene
- DNA: An enormous double-stranded, twisted molecule densely coiled around molecular beads of histone protein to form a chromosome.
- Genes: A strand of DNA from a chromosome with a very long series of coded messages.
- Each gene is composed of many thousands of letters called bases: A, T, G, C (A-T and G-C are pairs).
- In order for the cell to decipher the DNA strand, the bases must be read in groups of 3 letters called codons.
- It is also important to start reading on the correct letter of the base sequence, where the first codon of the gene begins.
- Since there are four different possibilities for each base of a codon, the total number of DNA codons is 4^3, or 64.
- Some codons represent "punctuation marks" marking the beginning or the end of a gene message.
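The codon arithmetic on this slide can be checked with a short Python sketch. The function names `all_codons` and `split_codons` are illustrative and do not come from the slides; the example sequence is the one used on the next slide.

```python
from itertools import product

BASES = "ATGC"


def all_codons():
    """Enumerate every possible 3-letter codon over the four DNA bases."""
    return ["".join(triplet) for triplet in product(BASES, repeat=3)]


def split_codons(sequence, start=0):
    """Read a sequence in groups of 3 letters, beginning at offset `start`.

    Trailing bases that do not fill a complete codon are ignored.
    """
    sequence = sequence.upper()
    return [sequence[i:i + 3] for i in range(start, len(sequence) - 2, 3)]


if __name__ == "__main__":
    print(len(all_codons()))                   # 4 ** 3 possible codons
    print(split_codons("AATCGCATACTAAGACAG"))  # reading from the first base
    print(split_codons("AATCGCATACTAAGACAG", start=1))  # shifted reading frame
```

Changing `start` shows why beginning on the correct letter matters: every codon after the shift is different.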
6.
- LSHTHEMANANDBOYARESAD -> THE MAN AND BOY ARE SAD
- DNA: AUGAATCGCATACTAAGACAG -> AAT CGC ATA CTA AGA CAG (AUG: start codon)
- Point mutation
- AATCGCATACTAAGATAG
- THEMANANDBOYAREBAD
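As a rough illustration of the point mutation on this slide, the following Python sketch compares two equal-length sequences letter by letter and reports which codon each change falls in. The function name `point_mutations` is hypothetical, and the sketch assumes substitutions only (no insertions or deletions).

```python
def point_mutations(reference, mutant):
    """List single-letter substitutions between two equal-length sequences.

    Each entry is (1-based position, reference letter, mutant letter,
    1-based codon number), assuming the reading frame starts at position 1.
    """
    if len(reference) != len(mutant):
        raise ValueError("sequences must have the same length")
    changes = []
    for index, (ref, mut) in enumerate(zip(reference, mutant)):
        if ref != mut:
            changes.append((index + 1, ref, mut, index // 3 + 1))
    return changes


if __name__ == "__main__":
    # DNA example from the slide (start codon already removed)
    print(point_mutations("AATCGCATACTAAGACAG", "AATCGCATACTAAGATAG"))
    # English analogy from the slide
    print(point_mutations("THEMANANDBOYARESAD", "THEMANANDBOYAREBAD"))
```

In both examples a single changed letter alters only the last three-letter word.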
8. DNA -> RNA -> Protein
9. DNA Replication
- DNA replicates before cell division
- The 2 complementary strands serve as templates for the new construction
- DNA helicase unwinds the DNA and separates the 2 strands
- Replication origins (characterized by a weak bond between the two DNA strands) are formed
- A small RNA primer is attached to the single-stranded DNA by DNA primase
- DNA polymerase copies the DNA template with high fidelity
- It only works in the 5'-3' direction, so the two strands are copied by different approaches
(Figure panels: Replication, leading strand; Replication, lagging strand)
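The idea that each strand serves as a template for its partner can be sketched in Python using the A-T and G-C pairing rule. The names `complement` and `reverse_complement` are illustrative; the sketch models base pairing only, not the enzymes involved.

```python
# Pairing rule: A pairs with T, G pairs with C
PAIRING = str.maketrans("ATGC", "TACG")


def complement(strand):
    """Return the base-paired partner of each letter, in the same order."""
    return strand.upper().translate(PAIRING)


def reverse_complement(strand):
    """Return the partner strand written in its own 5'-3' direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    template = "AATCGC"
    print(complement(template))
    print(reverse_complement(template))
```

Because the two strands run in opposite directions, the partner strand read 5'-3' is the reversed complement, which is one way to see why the polymerase needs a different approach on each strand.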
11. EST: Expressed Sequence Tags
- Genomes are big: 600,000 to 3 billion base pairs
- ESTs cover a small fraction of the genome, usually the expressed genes
12. Genetic Markers
- Phenotype (characteristics of individuals or different species)
- Genotype (genetic information, DNA sequence)
- Genetic diversity is characterized by genetic markers
- DNA sequence difference
- Phenotype difference
- No phenotype difference
- Different types of genetic markers
13. Types of Genetic Markers
- AFLP: Amplified Fragment Length Polymorphism
- RFLP: Restriction Fragment Length Polymorphism
- Allozymes: protein variants
- cDNA markers: coding region of genes
- Microsatellites (SSR: simple sequence repeat)
14. SSR
- Molecular marker loci consisting of tandem repeat units of a very short (1-5 base pairs) nucleotide motif
- TTTTTTTTTT, ATATATATATAT
- The number of repeats differs between species and between individuals
- If the nucleotide sequences in the flanking regions of the microsatellite are known, specific primers (generally 20-25 bp) can be designed to amplify the microsatellite by the Polymerase Chain Reaction (PCR)
- The size of the amplified fragment differs and can be detected by gel electrophoresis.
- How to identify the flanking region of the SSR?
16.
- Constructing a small-insert genomic library, screening the library with a synthetic labeled oligonucleotide repeat, and sequencing the positive clones
- Primers designed for closely related species may be used
- Bioinformatics approach: searching the publicly available EST sequence database
- Large amount of EST available
- Fast, large-scale analysis
- Lower cost
17. What is Bioinformatics
- Study of biological information
- Interface of biology and computers
- Computational molecular biology
- Includes genomics
- Subfields: DNA informatics, protein informatics, proteomics
- Person is a bioinformaticist
- Sometimes bioinformatician, bioinformatist
18. What is Bioinformatics
- A narrower definition
- The use of computer tools to store, access, and analyze nucleic acid and amino acid sequence data and protein structural data.
- A broader definition
- The use of computer tools to store, access, and analyze all types of biological data, including text data, phylogenetic trees, metabolic maps.
- Many applications in biological research, drug design, medicine.
19. Bioinformatics Related Fields
- Biotechnology
- Computer simulations in biology
- Genetics, biochemistry, cell biology
- Structural biology
- Phylogenetics and taxonomy
- Quantitative biology, mathematical biology, theoretical biology
- Epidemiology
- Medical informatics
- Drug design
20. Early History
- First protein amino acid sequence database in the mid 1960s.
- Researchers developing algorithms to analyze these data for individual research projects in the 1960s and 1970s.
- GenBank and other public databases with freely available analysis tools in the 1980s
- Enormous growth of GenBank and PDB in the 1990s
21. Recent Trend
- A great surge in genomics
- The Human Genome Project
- Genome projects for 400 organisms
- >100 completed published genomes
- Recent advances in molecular genetics technologies, especially microarrays
- Push to analyze genes and gene products, and to determine protein structure/function relationship
- High-throughput biology, large-scale data analysis
22. Data Explosion
- These research trends, especially the sequencing projects, have resulted in huge amounts of raw data that must be stored, manipulated, and analyzed.
- Computer information technology has also advanced very rapidly and has provided the means for handling this data explosion.
23. GenBank increase in base pairs and DNA sequence files
24. A Very Broad Field
- mapping
- evolutionary studies
- sequencing
- statistical analysis of biopolymers
- sequence comparisons
- gene modeling
- DNA sequence -> DNA structure
- molecular modeling
- structure comparisons
- sequences/structures -> function
- gene expression and genetic/metabolic networks
- database
- visualization and interaction
- comparing tools
- new technologies
- ethics
25. Problems in Bioinformatics
- Genomics
- Gene finding
- Annotation
- Sequence alignment and database search
- Functional genomics
- Microarray expression, gene chips
- Proteomics
- Structure prediction
- Comparative modeling
- Function prediction
- Structural bioinformatics
- Molecular docking, screening, etc.
26. Bioinformatics Tools and Services
- Databases: text, sequence, structure
- Database annotation: text searches
- Sequence similarity search tools
- Sequence and structure analysis tools
- 3D structure visualization tools
- Structure prediction tools
- Phylogenetic analysis tools
- Metabolic analysis tools
27. Major Bioinformatics Sites
- NCBI, The National Center for Biotechnology Information: http://www.ncbi.nlm.nih.gov
- EMBL, The European Molecular Biology Laboratory: http://www.ebi.ac.uk
- PDB, Protein Data Bank: www.rcsb.org/pdb
- This is just a sample list of the most used sites
28. NCBI
- Entrez: DNA, protein, genome database
- BLAST: offers BLAST searches against many NCBI databases
- PubMed: literature search service that provides access to >10 million citations.
- COGs (Clusters of Orthologous Groups): ancient conserved protein domains from the complete genomes
- ORF Finder: finds all open reading frames in a user or database sequence.
- FTP server for databases and software
- Many other features, including examples and tutorials
29. What is BLAST
- BLAST: Basic Local Alignment Search Tool
- Most widely used search program
- Finds and aligns matching sequences.
- Fast
- Several versions, e.g. DNA or protein query
- Current since 1997: BLAST 2.0 (gapped BLAST)
- Does alignments of local regions, not entire sequences.
30. Sequence Questions
- Does the DNA sequence contain a gene?
- Is the gene a member of a known gene family? What is the encoded protein?
- What is the function of the protein?
- Do other organisms have the protein or the gene?
31. Sequence Implications
- The DNA sequence of a gene determines the amino acid sequence of a protein. Primary sequence determines the structure and function of a protein.
- Proteins with similar sequences and structures have similar functions
- Similar sequences should have long regions of identical or similar residues. Why?
- Evolution: descent from a common ancestral sequence. Homology is implied for each similar region
- Very rarely have functional convergence without sequence or structure similarity.
32. Sequence Similarity Search
- Look for DNAs or proteins with similar sequence to query, by searching a sequence database.
- Search requires
- Search software
- Databases of annotated sequences
- Useful output
- Results must be interpreted and evaluated.
33. Similarity Search Terms
- A similarity search of a database is performed by aligning a single query sequence to each sequence in the database in turn. Good matches may be found to subject sequences
- Local alignment: aligns local regions of sequences. Local alignment returns a list of HSPs (High-scoring Segment Pairs).
34. Alignment Quality
- Features of good alignments
- Many exact matches of residues between sequences.
- Some mismatches which nevertheless are AAs with similar physicochemical properties.
- Few gaps.
- Long stretches of very good match.
  Query 162 ---LKFGNMKVETFYPGKGH 181
                 FG   VE FYPG  H
  Sbjct 162 GDAVRFG--PVELFYPGAAH 182
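The features listed above (exact matches, gaps, long matching stretches) can be counted directly from a pair of aligned strings. The Python sketch below does this for the example alignment on this slide; the function name `alignment_summary` is hypothetical, and the sketch marks exact matches only, whereas BLAST output also flags similar residues.

```python
def alignment_summary(query, subject, gap="-"):
    """Summarize two aligned, equal-length strings.

    Returns (midline, identities, gaps): the midline shows a letter where
    both rows hold the same residue and a space elsewhere.
    """
    if len(query) != len(subject):
        raise ValueError("aligned rows must have the same length")
    midline = "".join(
        q if q == s and q != gap else " " for q, s in zip(query, subject)
    )
    identities = sum(1 for letter in midline if letter != " ")
    gaps = query.count(gap) + subject.count(gap)
    return midline, identities, gaps


if __name__ == "__main__":
    query = "---LKFGNMKVETFYPGKGH"
    subject = "GDAVRFG--PVELFYPGAAH"
    midline, identities, gaps = alignment_summary(query, subject)
    print(query)
    print(midline)
    print(subject)
    print(f"identities={identities}, gaps={gaps}, columns={len(query)}")
```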
35. Different BLASTs
- BLASTN
- Compares a DNA query to DNA database
- Searches both strands automatically
- Is optimised for speed, not sensitivity
- BLASTP
- Compares a protein query to protein database
- BLASTX
- Compares a DNA query to protein database
- Does 6-frame translations of the query
- TBLASTN
- Compares a protein query to a DNA database
- Does 6-frame translations of the database
36. Different BLASTs
- TBLASTX
- Compares a DNA query to a DNA database
- Does 6-frame translations of the query and of the database
- Each search is comparable to 36 BLASTP searches!
- BLAST2
- Also called 'advanced' BLAST
- Can perform gapped alignments
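A 6-frame translation reads a DNA sequence in three offsets on each strand. The Python sketch below illustrates the idea with the standard genetic code; it is not how BLAST itself is implemented, and the names `translate` and `six_frame_translations` are illustrative. With six frames for the query and six for the database, TBLASTX compares 6 x 6 = 36 frame pairs, which is where the figure of 36 comes from.

```python
# Standard genetic code, written in TCAG order ("*" marks a stop codon)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the first base; unknown codons give X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return translations for offsets 0, 1, 2 on both strands."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [
        translate(strand[offset:])
        for strand in (dna, reverse)
        for offset in range(3)
    ]


if __name__ == "__main__":
    for frame in six_frame_translations("AATCGCATACTAAGACAG"):
        print(frame)
```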
37. The pipeline to generate potential SSR markers from the EST sequence
- EST sequences
- Pre-processing (remove polyA/T, etc.)
- SSR search: identification of SSR; definition of minimum number of repeats: N(10), NN(6), NNN(5), NNNN(5), NNNNN(5), NNNNNN(5)
- Primer3: primer prediction
- CAP3: assemble the ESTs, remove redundancy
- Output of results: 1. Position and type of repeats of identified SSR 2. Flanking primer sequences, melting T, product size, etc. 3. Annotation by searching TGI
- Lab work to confirm the predicted SSR markers
38. Download EST from GenBank
39. Clean Up the EST sequences
- Trim the poly A or poly T sequences
- Remove the low-quality sequences (if there are a lot of Ns)
- Remove sequence > 700 bp in an EST
- Remove the very short sequences, < 100 bp
- EST: 94,423 to 94,340 sequences
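The clean-up rules above might be sketched in Python as follows. Several details are assumptions, not taken from the slides: a poly-A/T run is taken to be 5 or more bases at the sequence ends, "a lot of Ns" is taken to mean more than 10% of the sequence, and the 700 bp rule is interpreted as keeping only the first 700 bp. The function name `clean_est` is hypothetical.

```python
import re


def clean_est(sequence, min_length=100, max_length=700, max_n_fraction=0.10):
    """Apply the clean-up steps to one EST; return None if it is rejected."""
    sequence = sequence.upper()
    # Trim a leading poly-T run and a trailing poly-A run (assumed >= 5 bases)
    sequence = re.sub(r"^T{5,}", "", sequence)
    sequence = re.sub(r"A{5,}$", "", sequence)
    # Keep at most max_length bases (assumed reading of the 700 bp rule)
    sequence = sequence[:max_length]
    # Drop very short sequences
    if len(sequence) < min_length:
        return None
    # Drop low-quality sequences with too many ambiguous bases
    if sequence.count("N") / len(sequence) > max_n_fraction:
        return None
    return sequence


if __name__ == "__main__":
    raw = "T" * 10 + "ACGT" * 50 + "A" * 12
    cleaned = clean_est(raw)
    print(len(raw), "->", len(cleaned))
```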
40. Perl scripts to find the SSR sequence
- Criteria for SSR
- Definition of SSR (unit size / minimum number of repeats): (1/10) (2/6) (3/5) (4/5) (5/5) (6/5)
- Maximal number of bases interrupting 2 SSRs in a compound microsatellite: 100
- For each individual sequence, read the bases one by one and count the number of repeats; if the criteria are met, output the sequence and the position and type of the SSR to one file.
- Results
- Total number of identified SSRs: 6725
- Number of SSR-containing sequences: 6051
- Number of sequences containing more than 1 SSR: 600
- Number of SSRs present in compound formation: 39
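The slides describe Perl scripts; as a language-neutral illustration, the same unit size / minimum repeat criteria can be sketched in Python with regular expressions instead of base-by-base counting. This is not the original script. The names `find_ssrs` and `is_minimal_motif` are hypothetical, and the sketch covers only the repeat criteria, not the grouping of SSRs into compound microsatellites.

```python
import re

# Unit size -> minimum number of repeats, as listed on the slide
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_minimal_motif(motif):
    """True if the motif is not itself a repetition of a shorter unit."""
    n = len(motif)
    return not any(
        n % k == 0 and motif == motif[:k] * (n // k) for k in range(1, n)
    )


def find_ssrs(sequence):
    """Return sorted (start, end, motif, repeats) tuples, 1-based inclusive."""
    sequence = sequence.upper()
    hits = []
    for size, min_repeats in MIN_REPEATS.items():
        # A motif of `size` bases followed by at least min_repeats - 1 copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_repeats - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_minimal_motif(motif):  # e.g. skip "AA" x 6, already "A" x 12
                repeats = len(match.group(0)) // size
                hits.append((match.start() + 1, match.end(), motif, repeats))
    return sorted(hits)


if __name__ == "__main__":
    est = "GGC" + "A" * 12 + "CGG" + "AT" * 6 + "GCC" + "CAG" * 5 + "T"
    for start, end, motif, repeats in find_ssrs(est):
        print(f"{start}-{end}: ({motif}){repeats}")
```

The start and end positions are what a later primer-design step would need in order to pick primers in the flanking regions.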
41. Primer3 to design the primer
- Primer3: freely available software to design PCR primers based on DNA sequence
- Can define the primer picking criteria
- Results
- 5045 SSR sequences were successfully predicted with primer pairs
- 1282 SSR sequences failed
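Primer3 reads its input as "TAG=value" records terminated by a line containing only "=". The Python sketch below builds one such record for an SSR-containing sequence. It assumes the tag names of Primer3 version 2.x (older releases used different names) and is not the script used for these results. The helper name `primer3_record` and the product size range are assumptions; the 20-25 bp primer size follows the range given on the SSR slide.

```python
def primer3_record(seq_id, template, ssr_start, ssr_length,
                   product_size_range="100-300"):
    """Build one Primer3 input record that targets an SSR.

    ssr_start is 1-based; PRIMER_FIRST_BASE_INDEX=1 tells Primer3 so.
    """
    lines = [
        f"SEQUENCE_ID={seq_id}",
        f"SEQUENCE_TEMPLATE={template}",
        # Primers must flank this region: start,length
        f"SEQUENCE_TARGET={ssr_start},{ssr_length}",
        "PRIMER_FIRST_BASE_INDEX=1",
        "PRIMER_MIN_SIZE=20",
        "PRIMER_OPT_SIZE=22",
        "PRIMER_MAX_SIZE=25",
        f"PRIMER_PRODUCT_SIZE_RANGE={product_size_range}",  # assumed range
        "=",  # record terminator
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical EST with an (AT)6 repeat starting at position 19
    est = "GGC" + "A" * 12 + "CGG" + "AT" * 6 + "GCC" + "CAG" * 5 + "T"
    print(primer3_record("example_est_1", est, ssr_start=19, ssr_length=12))
```

The resulting text would be passed to the Primer3 command-line program, which reports primer pairs or the reason none could be picked.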
42. Use CAP3 to assemble the SSR-containing EST sequences to avoid redundancy
- CAP3: freely available software to assemble DNA sequences based on sequence similarity
- CAP3 assembles the EST sequences using BLAST
- BLAST: most heavily used searching tool
- Blastn, blastp, blast2
- Use publicly available databases: nr, nraa, est, human, mouse, rat
- Use customized database
43. All the ESTs represent the same part of the DNA sequence in this figure
44. Summary of Statistics
- Total number of sequences examined: 94340
- Total size of examined sequences (bp): 50197414
- Total number of identified SSRs: 6725
- Number of SSR-containing sequences: 6051
- Number of sequences containing > 1 SSR: 600
- Number of SSRs present in compound formation: 398
- Unit size: Number of SSRs
- 1: 3141
- 2: 805
- 3: 2652
- 4: 53
- 5: 10
- 6: 64
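A few derived figures follow from the numbers above. The Python sketch below uses only values from this slide; the variable names are illustrative. It checks that the per-unit-size counts add up to the reported total and computes the mean EST length and the average spacing between SSRs.

```python
total_sequences = 94340
total_bp = 50197414
total_ssrs = 6725
ssr_containing_sequences = 6051
ssrs_by_unit_size = {1: 3141, 2: 805, 3: 2652, 4: 53, 5: 10, 6: 64}

# The per-unit-size counts should account for every identified SSR
unit_size_total = sum(ssrs_by_unit_size.values())

mean_est_length = total_bp / total_sequences          # bp per sequence
bp_per_ssr = total_bp / total_ssrs                    # one SSR every N bp
fraction_with_ssr = ssr_containing_sequences / total_sequences

if __name__ == "__main__":
    print(f"sum over unit sizes: {unit_size_total} (reported {total_ssrs})")
    print(f"mean EST length: {mean_est_length:.1f} bp")
    print(f"one SSR per {bp_per_ssr:.0f} bp on average")
    print(f"sequences with at least one SSR: {fraction_with_ssr:.1%}")
    for size, count in ssrs_by_unit_size.items():
        print(f"unit size {size}: {count / total_ssrs:.1%} of SSRs")
```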