Outline - PowerPoint PPT Presentation

Provided by: robin45
Transcript and Presenter's Notes

Title: Outline


1
Outline
  • Biology 101-Cells, DNA, DNA replication
  • Genes, EST
  • Genetic markers, SSR
  • Bioinformatics
  • Database, Blast
  • Predict SSR from public EST sequences
  • Goal, Method and Results

2
Cell-Smallest unit of Life
  • Basic unit of life
  • Organisms
  • Single Cell
  • Colonies/clusters
  • Multicellular (cell differentiation)
  • Multicellular organisms
  • cell -> tissue -> organ -> system -> organism

3
Chemical Foundation for Cells
  • Organic compounds: molecules of life
  • consist of atoms of C, H, O and sometimes other
    elements
  • Carbohydrates: starches, sugars, cellulose;
    used for energy and building material
  • Lipids: store fat for energy
  • Proteins: building material, structural, enzymes
  • Nucleic acids: DNA, RNA; basis of inheritance and
    cell reproduction

4
Cell Structure and Function
  • Nucleus: DNA, RNA, replication
  • Ribosomes: protein synthesis
  • Endoplasmic reticulum: synthesis of proteins,
    lipids, and others
  • Golgi apparatus: processing, packaging, secretion
    of proteins
  • Mitochondrion: cellular respiration
  • Chloroplasts (plants only): photosynthesis
  • Vacuoles, vesicles: storage of substances
  • Lysosomes: intracellular digestion

5
DNA, Chromosome and Gene
  • DNA: An enormous double-stranded, twisted
    molecule densely coiled around molecular beads of
    histone protein to form a chromosome.
  • Genes: A strand of DNA from a chromosome with a
    very long series of coded messages.
  • Each gene is composed of many thousands of
    letters called bases: A, T, G, C (A-T and G-C are
    pairs).
  • In order for the cell to decipher the DNA strand,
    the bases must be read in groups of 3 letters
    called codons.
  • It is also important to start reading on the
    correct letter of the base sequence, where the
    first codon of the gene begins.
  • Since there are four different possibilities for
    each base of a codon, the total number of DNA
    codons is 4^3, or 64.
  • Some codons represent "punctuation marks" marking
    the beginning or the end of a gene message.
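
A minimal Python sketch of the codon arithmetic above: it enumerates all three-letter combinations of the four bases and reads a sequence in groups of 3 from a chosen start position. The function name `split_codons` is a hypothetical helper, not part of the original slides; the sample sequence is the one used on the next slide.

```python
from itertools import product

BASES = "ATGC"

# Every ordered triple of the four bases: 4 * 4 * 4 combinations.
ALL_CODONS = ["".join(triple) for triple in product(BASES, repeat=3)]


def split_codons(dna, start=0):
    """Read `dna` in groups of 3 bases, beginning at index `start`.

    Trailing bases that do not fill a complete codon are ignored.
    """
    usable_end = len(dna) - (len(dna) - start) % 3
    return [dna[i:i + 3] for i in range(start, usable_end, 3)]


print(len(ALL_CODONS))
print(split_codons("AATCGCATACTAAGACAG"))     # reading from the correct letter
print(split_codons("AATCGCATACTAAGACAG", 1))  # starting one base late changes every codon
```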

6
  • LSHTHEMANANDBOYARESAD -> THE MAN AND BOY ARE SAD
  • AUGAATCGCATACTAAGACAG -> AAT CGC ATA CTA AGA CAG
  • Point mutation
  • AATCGCATACTAAGATAG -> AAT CGC ATA CTA AGA TAG
  • THEMANANDBOYAREBAD -> THE MAN AND BOY ARE BAD
  • DNA

Start codon
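
The point mutation on this slide can be located programmatically. The sketch below compares the two DNA strings from the slide position by position and reports the codon that changed; `point_mutations` is a hypothetical helper name. In the standard genetic code the mutated codon, TAG, is a stop codon, which is one of the "punctuation marks" mentioned earlier.

```python
def point_mutations(before, after):
    """Return (1-based position, old base, new base) for each differing position."""
    if len(before) != len(after):
        raise ValueError("sequences must have equal length")
    return [(i + 1, a, b)
            for i, (a, b) in enumerate(zip(before, after))
            if a != b]


original = "AATCGCATACTAAGACAG"
mutated = "AATCGCATACTAAGATAG"

for pos, old, new in point_mutations(original, mutated):
    codon_start = (pos - 1) // 3 * 3  # index of the first base of the affected codon
    print(pos, old, new,
          original[codon_start:codon_start + 3], "->",
          mutated[codon_start:codon_start + 3])
```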
8
DNA -> RNA -> Protein
9
DNA Replication
  • DNA replicates before cell division
  • The 2 complementary strands serve as templates for
    the new construction
  • DNA helicase unwinds the DNA and separates the 2
    strands
  • Replication origins (characterized by a weak bond
    between the two DNA strands) are formed
  • DNA primase attaches a small RNA to the
    single-stranded DNA
  • DNA polymerase copies the DNA template with high
    fidelity
  • It works only in the 5'-3' direction, so the two
    strands are copied by different approaches
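
Because A pairs with T and G pairs with C, the sequence of a new strand follows directly from its template. A minimal sketch, assuming plain A/T/G/C strings written 5' to 3'; the function names are illustrative only.

```python
# Base-pairing rules: A-T and G-C.
COMPLEMENT = str.maketrans("ATGC", "TACG")


def complement(strand):
    """Pair each base with its partner, keeping the same left-to-right order."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand):
    """Sequence of the complementary strand, also written 5' to 3'."""
    return complement(strand)[::-1]


template = "AATCGCATAC"
print(complement(template))
print(reverse_complement(template))
```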

Replication, leading strand; replication, lagging strand
11
EST-Expressed Sequence Tags
  • Genomes are big: 600,000 to 3 billion base pairs
  • ESTs cover a small fraction of the genome, usually
    expressed genes

12
Genetic Markers
  • Phenotype (characteristics of individuals or
    different species)
  • Genotype (genetic information, DNA sequence)
  • Genetic diversity is characterized by genetic
    markers
  • DNA sequence difference
  • Phenotype difference
  • No phenotype difference
  • Different types of genetic markers

13
Types of genetic markers
  • AFLP: Amplified Fragment Length Polymorphism
  • RFLP: Restriction Fragment Length Polymorphism
  • Allozymes: protein variants
  • cDNA markers: coding regions of genes
  • Microsatellites (SSR, simple sequence repeats)

14
SSR
  • Molecular marker loci consisting of tandem repeat
    units of a very short (1-5 base pairs) nucleotide
    motif
  • TTTTTTTTTT, ATATATATATAT
  • The number of repeats differs between species
    and between individuals
  • If the nucleotide sequences in the flanking
    regions of the microsatellite are known, specific
    primers (generally 20-25 bp) can be designed to
    amplify the microsatellite by the Polymerase
    Chain Reaction (PCR)
  • The size of the amplified fragment therefore
    differs and can be detected by gel electrophoresis.
  • How to identify the flanking region of the SSR?

16
  • Constructing a small-insert genomic library,
    screening of the library with a synthetic labeled
    oligonucleotide repeat and sequencing of the
    positive clones
  • Primers designed for closely related species may
    be used
  • Bioinformatics approach: searching the publicly
    available EST sequence databases
  • Large number of ESTs available
  • Fast, large scale analysis
  • Lower cost

17
What is Bioinformatics
  • Study of biological information
  • Interface of biology and computers
  • Computational molecular biology
  • Includes genomics
  • Subfields: DNA informatics, protein informatics,
    proteomics
  • A practitioner is a 'bioinformaticist'
  • Sometimes 'bioinformatician' or 'bioinformatist'

18
What is Bioinformatics
  • A narrower definition
  • The use of computer tools to store, access, and
    analyze nucleic acid and amino acid sequence data
    and protein structural data.
  • A broader definition
  • The use of computer tools to store, access, and
    analyze all types of biological data, including
    text data, phylogenetic trees, metabolic maps.
  • Many applications in biological research, drug
    design, medicine.

19
Bioinformatics Related Fields
  • Biotechnology
  • Computer simulations in biology
  • Genetics, biochemistry, cell biology
  • Structural biology
  • Phylogenetics and taxonomy
  • Quantitative biology, mathematical biology,
    theoretical biology
  • Epidemiology
  • Medical informatics
  • Drug design

20
Early History
  • First protein amino acid sequence database in the
    mid-1960s.
  • Researchers developing algorithms to analyze
    these data for individual research projects in
    1960s and 1970s.
  • GenBank and other public databases with freely
    available analysis tools in the 1980s
  • Enormous growth of GenBank and PDB in 1990s

21
Recent Trend
  • A great surge in genomics
  • The Human Genome Project
  • Genome projects for 400 organisms
  • >100 completed, published genomes
  • Recent advances in molecular genetics
    technologies, especially microarrays
  • Push to analyze genes and gene products, and to
    determine protein structure/function relationship
  • High-throughput biology, large-scale data
    analysis

22
Data Explosion
  • These research trends, especially the sequencing
    projects, have resulted in huge amounts of raw
    data that must be stored, manipulated, and
    analyzed.
  • Computer information technology has also advanced
    very rapidly and has provided the means for
    handling this data explosion.

23
GenBank: increase in base pairs and DNA sequence
files
24
A Very Broad Field
  • mapping
  • evolutionary studies
  • sequencing
  • statistical analysis of biopolymers
  • sequence comparisons
  • gene modeling
  • DNA sequence -> DNA structure
  • molecular modeling
  • structure comparisons
  • sequences/structures -> function
  • gene expression and genetic/metabolic networks
  • database
  • visualization and interaction
  • comparing tools
  • new technologies
  • ethics

25
Problems in Bioinformatics
  • Genomics
  • Gene finding
  • Annotation
  • Sequence alignment and database search
  • Functional genomics
  • Microarray expression, gene chips
  • Proteomics
  • Structure prediction
  • Comparative modeling
  • Function prediction
  • Structural bioinformatics
  • Molecular docking, screening, etc.

26
Bioinformatics Tools and Services
  • Databases: text, sequence, structure
  • Database annotation text searches
  • Sequence similarity search tools
  • Sequence and structure analysis tools
  • 3D structure visualization tools
  • Structure prediction tools
  • Phylogenetic analysis tools
  • Metabolic analysis tools

27
Major Bioinformatics Sites
  • NCBI, The National Center for Biotechnology
    Information: http://www.ncbi.nlm.nih.gov
  • EMBL, The European Molecular Biology Laboratory:
    http://www.ebi.ac.uk
  • PDB, Protein Data Bank: www.rcsb.org/pdb
  • This is just a sample list of the most used sites

28
NCBI
  • Entrez: DNA, protein, genome databases
  • BLAST: offers BLAST searches against many NCBI
    databases
  • PubMed: literature search service that provides
    access to >10 million citations.
  • COGs (Clusters of Orthologous Groups): ancient
    conserved protein domains from the complete
    genomes
  • ORF Finder: finds all open reading frames in a
    user or database sequence.
  • FTP server for databases and software
  • Many other features, including examples and
    tutorials

29
What is BLAST
  • BLAST: Basic Local Alignment Search Tool
  • Most widely used search program
  • Finds and aligns matching sequences.
  • Fast
  • Several versions, e.g. DNA or protein query
  • Current since 1997: BLAST 2.0 (gapped BLAST)
  • Does alignments of local regions, not entire
    sequences.

30
Sequence Questions
  • Does the DNA sequence contain a gene?
  • Is the gene a member of a known gene family? What
    is the encoded protein?
  • What is the function of the protein?
  • Do other organisms have the protein or the gene?

31
Sequence Implications
  • The DNA sequence of a gene determines the amino
    acid sequence of a protein. The primary sequence
    determines the structure and function of the
    protein.
  • Proteins with similar sequences and structures
    have similar functions
  • Similar sequences should have long regions of
    identical or similar residues. Why?
  • Evolution: descent from a common ancestral
    sequence. Homology is implied for each similar
    region
  • Very rarely have functional convergence without
    sequence or structure similarity.

32
Sequence Similarity Search
  • Look for DNAs or proteins with similar sequence
    to query, by searching a sequence database.
  • Search requires
  • Search software
  • Databases of annotated sequences
  • Useful output
  • Results must be interpreted and evaluated.

33
Similarity Search Terms
  • A similarity search of a database is performed by
    aligning a single query sequence to each
    sequence in the database in turn. Good matches
    may be found to subject sequences
  • Local alignment: aligns local regions of
    sequences. Local alignment returns a list of
    HSPs (High-scoring Segment Pairs).
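
To make "local alignment" concrete, here is a small sketch of the Smith-Waterman algorithm, the classic exact method for finding the best-scoring local region shared by two sequences. This is not BLAST's own algorithm, which uses faster heuristics, and the scoring values (`match`, `mismatch`, `gap`) are arbitrary assumptions chosen for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; scores never drop below zero, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(0, diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero score is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTACGTACGGG", "CCACGTACCC"))
```

Only the shared region is reported; the unrelated ends of both sequences are left out, which is the difference from a global alignment.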

34
Alignment Quality
  • Features of good alignments
  • Many exact matches of residues between sequences.
  • Some mismatches which nevertheless are AAs with
    similar physicochemical properties.
  • Few gaps.
  • Long stretches of very good match.
  • Query 162 ---LKFGNMKVETFYPGKGH 181
  •                FG   VE FYPG  H
  • Sbjct 162 GDAVRFG--PVELFYPGAAH 182
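
The features listed above can be counted directly from an alignment. The sketch below tallies exact matches, mismatches, and gap positions for the example alignment on this slide and rebuilds its middle line; `alignment_summary` is a hypothetical helper, and it does not score physicochemically similar residues.

```python
def alignment_summary(query, subject):
    """Count identities, mismatches and gap columns in a pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    identities = mismatches = gaps = 0
    midline = []
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            gaps += 1
            midline.append(" ")
        elif q == s:
            identities += 1
            midline.append(q)  # identical residues are echoed on the middle line
        else:
            mismatches += 1
            midline.append(" ")
    return {
        "identities": identities,
        "mismatches": mismatches,
        "gaps": gaps,
        "midline": "".join(midline),
    }


summary = alignment_summary("---LKFGNMKVETFYPGKGH", "GDAVRFG--PVELFYPGAAH")
print(summary)
```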

35
Different BLASTS
  • BLASTN
  • Compares a DNA query to DNA database
  • Searches both strands automatically
  • Is optimised for speed, not sensitivity
  • BLASTP
  • Compares a protein query to protein database
  • BLASTX
  • Compares a DNA query to protein database
  • Does 6-frame translations of the query
  • TBLASTN
  • Compares a protein query to a DNA database
  • Does 6-frame translations of the database
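
A sketch of what "6-frame translation" means: a DNA sequence is translated starting at offsets 0, 1, and 2 on the given strand and on its reverse complement. The code uses the standard genetic code; the function names and the `X` placeholder for codons containing unknown bases are assumptions for this illustration, not BLAST internals.

```python
from itertools import product

# Standard genetic code, with codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of `dna`; '*' marks a stop."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna):
    """Return the translations for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames["+%d" % (offset + 1)] = translate(dna[offset:])
        frames["-%d" % (offset + 1)] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGAAATAG")
print(frames)
```

When both the query and the database are translated this way, 6 query frames are compared against 6 database frames, which gives 36 protein-level comparisons.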

36
Different BLASTS
  • TBLASTX
  • Compares a DNA query to a DNA database
  • Does 6-frame translations of the query and of the
    database
  • Each search is comparable to 36 BLASTP searches!
  • BLAST2
  • Also called 'advanced' BLAST
  • Can perform gapped alignments

37
The pipeline to generate potential SSR markers
from the EST sequence
  • EST sequences
  • Pre-processing (remove polyA/T, etc.)
  • SSR search: identification of SSRs; definition of
    minimum number of repeats N(10), NN(6), NNN(5),
    NNNN(5), NNNNN(5), NNNNNN(5)
  • Primer3: primer prediction
  • CAP3: assemble the ESTs, remove redundancy
  • Output of results
  • 1. Position and type of repeats of identified SSRs
  • 2. Flanking primer sequences, melting T, product
    size, etc.
  • 3. Annotation by searching TGI
  • Lab work to confirm the predicted SSR markers
38
Download ESTs from GenBank
39
Clean Up the EST sequences
  • Trim the poly-A or poly-T sequences
  • Remove the low-quality sequences (those with a
    lot of Ns)
  • Remove sequence > 700 bp in an EST
  • Remove the very short sequences (< 100 bp)
  • ESTs: 94,423 reduced to 94,340 sequences
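
A hedged sketch of these clean-up rules. The 100 bp and 700 bp limits come from the slide; everything else is an assumption made for illustration: the name `clean_est`, a minimum tail length of 10 bases, a 10% limit on Ns, and reading "remove sequence > 700 bp" as truncating each EST to its first 700 bases.

```python
import re


def clean_est(seq, min_len=100, max_len=700, max_n_fraction=0.1, min_tail=10):
    """Return the cleaned EST, or None if the sequence should be discarded."""
    seq = seq.upper()
    # Trim a poly-A tail at the 3' end and a poly-T stretch at the 5' end.
    seq = re.sub(r"A{%d,}$" % min_tail, "", seq)
    seq = re.sub(r"^T{%d,}" % min_tail, "", seq)
    # Assumption: keep only the first max_len bases of long reads.
    seq = seq[:max_len]
    if len(seq) < min_len:
        return None  # too short after trimming
    if seq.count("N") / len(seq) > max_n_fraction:
        return None  # too many ambiguous bases
    return seq


print(clean_est("ACGT" * 50 + "A" * 20))
```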

40
Perl scripts to find the SSR sequences
  • Criteria for SSR
  • Definition of SSR (unit size / minimum number of
    repeats): (1/10) (2/6) (3/5) (4/5) (5/5) (6/5)
  • Maximum number of bases interrupting 2 SSRs in a
    compound microsatellite: 100
  • For each individual sequence, read the bases one
    by one and count the number of repeats; if the
    criteria are met, output the sequence and the
    position and type of the SSR to one file.
  • Results
  • Total number of identified SSRs: 6725
  • Number of SSR-containing sequences: 6051
  • Number of sequences containing more than 1 SSR:
    600
  • Number of SSRs present in compound formation:
    39
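
The original Perl scripts are not shown in the slides. The following Python sketch applies the same unit-size and minimum-repeat criteria, but it uses regular expressions instead of reading base by base, and it does not handle compound microsatellites. All names are hypothetical.

```python
import re

# Unit size -> minimum number of repeats, as listed on the slide.
CRITERIA = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """False if the motif is itself a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return not any(n % k == 0 and motif == motif[:k] * (n // k)
                   for k in range(1, n))


def find_ssrs(seq, criteria=CRITERIA):
    """Return sorted (1-based start, motif, repeat count) tuples for each SSR."""
    seq = seq.upper()
    hits = []
    for unit, min_repeats in sorted(criteria.items()):
        # A unit followed by at least (min_repeats - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_repeats - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            if is_primitive(motif):  # skip e.g. poly-A reported as an 'AA' repeat
                hits.append((m.start() + 1, motif, len(m.group(0)) // unit))
    return sorted(hits)


example = "GGC" + "A" * 12 + "CCGT" + "AG" * 7 + "TTC"
print(find_ssrs(example))
```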

41
Primer3 to design the primer
  • Primer3: freely available software for designing
    PCR primers based on a DNA sequence
  • The primer-picking criteria can be defined by the
    user
  • Results
  • Primer pairs were successfully predicted for 5045
    SSR sequences
  • 1282 SSR sequences failed
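
For a quick sanity check of a candidate primer, the sketch below reports length, GC content, and a rough melting temperature from the Wallace rule (2 °C per A or T, 4 °C per G or C). This is a rule of thumb for short oligonucleotides, not the calculation Primer3 performs, which relies on a more detailed thermodynamic model. The function name and the example primer are made up for illustration.

```python
def primer_summary(primer):
    """Length, GC percentage and Wallace-rule melting temperature of a primer."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return {
        "length": len(primer),
        "gc_percent": 100.0 * gc / len(primer),
        "wallace_tm_c": 2 * at + 4 * gc,  # rough estimate in degrees Celsius
    }


info = primer_summary("ATGCATGCATGCATGCATGC")
print(info)
```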

42
Use CAP3 to assemble the SSR-containing EST
sequences to avoid redundancy
  • CAP3: freely available software that assembles DNA
    sequences based on sequence similarity
  • CAP3 assembles the EST sequences using BLAST
  • BLAST: most heavily used search tool
  • Blastn, blastp, blast2
  • Use publicly available databases: nr, nraa, est,
    human, mouse, rat
  • Use customized databases

43
All the ESTs in this figure represent the same part
of the DNA sequence
44
Summary of Statistics
  • Total number of sequences examined: 94340
  • Total size of examined sequences (bp): 50197414
  • Total number of identified SSRs: 6725
  • Number of SSR-containing sequences: 6051
  • Number of sequences containing >1 SSR: 600
  • Number of SSRs present in compound formation: 398
  • Unit size: number of SSRs
  • 1: 3141
  • 2: 805
  • 3: 2652
  • 4: 53
  • 5: 10
  • 6: 64
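
A short sketch that derives a few ratios from the figures on this slide: it checks that the per-unit-size counts add up to the reported total, then computes the share of each unit size, the fraction of ESTs carrying an SSR, and the average spacing between SSRs. The variable names are illustrative; the numbers are copied from the slide.

```python
ssr_by_unit_size = {1: 3141, 2: 805, 3: 2652, 4: 53, 5: 10, 6: 64}
total_sequences = 94340
total_bp = 50197414
total_ssrs = 6725
ssr_sequences = 6051

# The per-unit counts should account for every identified SSR.
assert sum(ssr_by_unit_size.values()) == total_ssrs

percent_by_unit = {unit: 100.0 * count / total_ssrs
                   for unit, count in ssr_by_unit_size.items()}
percent_sequences_with_ssr = 100.0 * ssr_sequences / total_sequences
bp_per_ssr = total_bp / total_ssrs  # average number of examined bases per SSR

for unit, share in percent_by_unit.items():
    print("unit size %d: %.1f%% of SSRs" % (unit, share))
print("ESTs with at least one SSR: %.1f%%" % percent_sequences_with_ssr)
print("one SSR per %.0f bp on average" % bp_per_ssr)
```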