Bioinformatics Tools - PowerPoint PPT Presentation

Slides: 51
Provided by: Stuart217
Transcript and Presenter's Notes

Title: Bioinformatics Tools


1
Bioinformatics Tools
  • Stuart M. Brown, Ph.D
  • Dept of Cell Biology
  • NYU School of Medicine

3
Overview
  • This lecture will summarize a huge amount of
    bioinformatics material that is usually presented
    as a full 12-week course.
  • Data management and analysis of sequences from
    the HGP
  • A quick look at GenBank and ENTREZ.
  • Gene finding and translation
  • Similarity searching and alignment (BLAST)
  • Protein structure and function

4
Data Management and Analysis
  • The Human Genome Project has generated huge
    quantities of DNA sequence data.
  • This data will lead to many medical advances.
  • But a great deal of analysis and research will be
    needed.

5
Access to the Data
  • Organize the genome data and provide access for
    scientists
  • Use the Internet
  • The data is public, so anyone can access it.

6
GenBank
  • All Genome Project data is stored in a database
    called GenBank managed by the National Center for
    Biotechnology Information (NCBI)
  • The NCBI is a branch of the National Library of
    Medicine, which is part of the NIH (National
    Institutes of Health).
  • http://ncbi.nlm.nih.gov

8
GenBank Sections
  • In addition to DNA sequences of genes, GenBank
    has a number of other sections, including:
  • Protein sequences (translated from DNA)
  • Short RNA fragments (ESTs, expressed sequence tags)
  • Cancer Genome Anatomy Project (CGAP): gene
    expression profiles of normal, pre-cancer, and
    cancer cells from a wide variety of tissue types
  • Single Nucleotide Polymorphisms (SNPs), which
    represent genetic variations in the human
    population
  • Online Mendelian Inheritance in Man (OMIM): a
    database of human genetic disorders

9
Finding Genes
  • GenBank contains approximately 13 billion bases
    in 12 million sequence records (as of August
    2001).
  • These billions of G, A, T, and C letters would be
    almost useless without descriptions of what genes
    they contain, the organisms they come from, etc.
  • All of this information is contained in the
    "annotation" part of each sequence record.

11
Entrez is a Tool for Finding Sequences
  • NCBI has created a Web-based tool called Entrez
    for finding sequences in GenBank.
  • Each sequence in GenBank has a unique accession
    number.
  • Entrez can also search for keywords such as gene
    names, protein names, and the names of organisms
    or biological functions
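
As a rough illustration of searching by accession number or keyword, the sketch below only builds a query URL and does not contact any server. It assumes NCBI's E-utilities `esearch` endpoint, which is not mentioned in these slides; the helper name `build_entrez_query` and the default parameter values are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: the NCBI E-utilities "esearch" endpoint (not named in the slides).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_entrez_query(term, db="nucleotide", retmax=5):
    """Return a search URL for an accession number or a keyword phrase."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Search by accession number, then by organism plus keyword.
print(build_entrez_query("BE588357"))
print(build_entrez_query("Bos taurus[Organism] AND cDNA"))
```

The same URL can be pasted into a browser; the point is only that an accession number and a keyword phrase go through the same search box.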

13
Entrez has links to Medline
  • Entrez is much more than just a tool for finding
    sequences by keywords.
  • It contains links to PubMed/Medline
  • Entrez also contains all known protein sequences
    and 3-D protein structures.

15
Entrez is Internally Cross-linked
  • DNA and protein sequences are linked to other
    similar sequences
  • Medline citations are linked to other citations
    that contain similar keywords
  • 3-D structures are linked to similar structures

17
  • These relationships might include genes in a
    multi-gene family, related journal articles, or
    other proteins in the same biochemical pathway
  • This potential for horizontal movement through
    the linked databases makes Entrez a dynamic tool.
  • You can start with only a vague set of keywords
    or a sequence from the laboratory and rapidly
    access a set of relevant literature and related
    database sequences.

18
Similarity Searching
  • There are a variety of computer programs that are
    used for making comparisons between DNA
    sequences.
  • The most popular is known as BLAST (Basic Local
    Alignment Search Tool)
  • BLAST is free at the NCBI website

20
BLAST Searches GenBank
  • The NCBI BLAST web server lets you compare your
    query sequence to various sections of GenBank
  • nr: non-redundant (main sections)
  • month: new sequences from the past few weeks
  • ESTs
  • human, Drosophila, yeast, or E. coli genomes
  • proteins (by automatic translation)
  • This is a VERY fast and powerful computer.

21
BLAST is Complex
  • Similarity searching relies on the concepts of
    alignment and distance between pairs of
    sequences.
  • Distances can only be measured between aligned
    sequences (match vs. mismatch at each position).
  • A similarity search is a process of testing the
    best alignment of a query sequence with every
    sequence in a database.
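
The idea of measuring distance between aligned sequences and testing a query against every database entry can be sketched as follows. This is a deliberately naive, ungapped illustration and not how BLAST works internally; the function names and the three toy database sequences are made up for the example.

```python
def mismatches(a, b):
    """Distance between two aligned, equal-length sequences:
    the number of positions where they differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)


def best_ungapped_hit(query, subject):
    """Slide the query along the subject (no gaps) and return
    (offset, distance) for the best window, or None if it does not fit."""
    best = None
    for offset in range(len(subject) - len(query) + 1):
        d = mismatches(query, subject[offset:offset + len(query)])
        if best is None or d < best[1]:
            best = (offset, d)
    return best


def search(query, database):
    """Test the query against every sequence; rank hits by distance."""
    hits = []
    for name, seq in database.items():
        hit = best_ungapped_hit(query, seq)
        if hit is not None:
            hits.append((hit[1], name, hit[0]))
    return sorted(hits)


# Toy database (hypothetical sequences).
database = {
    "seq1": "ttgacgactccacagat",
    "seq2": "ccgaagccatggcaagc",
    "seq3": "acgtacgtacgtacgta",
}
print(search("gccatgg", database))
```

Each result is a (distance, name, offset) tuple, so the closest match sorts first. Real tools also allow gaps and weight mismatches differently, which this sketch ignores.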

22
Search with Protein not DNA
  • 1) 4 DNA bases vs. 20 amino acids: less random
    similarity
  • 2) Can have varying degrees of similarity between
    different AAs
  • (number of mutations, chemical similarity, PAM matrix)
  • 3) Protein databanks are much smaller than DNA
    databanks.

23
BLAST has Automatic Translation
  • BLASTX makes automatic translation (in all 6
    reading frames) of your DNA query sequence to
    compare with protein databanks
  • TBLASTN makes automatic translation of an entire
    DNA database to compare with your protein query
    sequence
  • Only make a DNA-DNA search if you are working
    with a sequence that does not code for protein.
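
Translation in all 6 reading frames can be sketched in a few lines. The example below assumes the standard genetic code and uses hypothetical helper names (`translate`, `six_frames`); it illustrates the idea behind BLASTX-style translation, not the BLAST implementation itself. The test sequence is the start of the query from the sample alignment in this lecture.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Translate three frames on each strand, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


query = "aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca"
for name, protein in six_frames(query).items():
    print(name, protein)
```

Only one of the six frames is usually the real coding frame; the others tend to hit stop codons ('*') quickly.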

24
  >gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
  Length = 369

  Score = 272 bits (137), Expect = 4e-71
  Identities = 258/297 (86%), Gaps = 1/297 (0%)
  Strand = Plus / Plus

  Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
  Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

  Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
  Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

  Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
  Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

  Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256

25
Understand the Statistics!
  • BLAST produces an E-value for every match
  • This is the number of equally good matches
    expected by chance; for small values it is nearly
    the same as the P value in a statistical test
  • A match is generally considered significant if
    the E-value < 0.05 (smaller numbers are more
    significant)
  • Very low E-values (e-100) are homologs or
    identical genes
  • Moderate E-values are related genes
  • Long regions of moderate similarity are more
    important than short regions of high identity.
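
The relationship between score, database size, and E-value can be sketched with the standard Karlin-Altschul bit-score formula, E = m × n × 2^(−S), which is not stated in these slides. The sketch ignores the length corrections that real BLAST applies, and the numbers in the example are made up.

```python
import math


def e_value(bit_score, query_len, db_len):
    """Expected number of chance matches with at least this bit score
    (simplified: no effective-length correction)."""
    return query_len * db_len * 2.0 ** (-bit_score)


def p_value(e):
    """Probability of at least one chance match: P = 1 - exp(-E)."""
    return -math.expm1(-e)


# Hypothetical search: a 300-base query against a 1-billion-base database.
for bits in (30, 40, 50, 100):
    e = e_value(bits, 300, 1_000_000_000)
    print(bits, e, p_value(e))
```

For small values E and P are almost identical, which is why the two are often treated as interchangeable. Note also that the same score gives a larger (less significant) E-value as the database grows.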

26
BLAST is Approximate
  • BLAST makes similarity searches very quickly
    because it takes shortcuts.
  • looks for short, nearly identical words (11
    bases)
  • It also makes errors
  • misses some important similarities
  • makes many incorrect matches
  • easily fooled by repeats or skewed composition
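
The word-based shortcut can be illustrated with a minimal sketch that indexes every 11-base word of a subject sequence and looks up the words of the query. It uses exact word matches only, which is a simplification, and the function names are hypothetical. The first pair of sequences is taken from the sample alignment in this lecture.

```python
from collections import defaultdict


def index_words(seq, w=11):
    """Map every word of length w to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def seed_hits(query, subject, w=11):
    """Return (query_pos, subject_pos) pairs where a word matches exactly.
    A full search would extend each seed into a longer alignment."""
    index = index_words(subject, w)
    hits = []
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            hits.append((q, s))
    return hits


query = "agcaagggcttgcaggacctgaagcaacaggtggagggg"
subject = "agcaagggcttgcaggacctgaagaagcaagtggagggg"
print(seed_hits(query, subject))

# Two sequences that are 95% identical but share no 11-base word:
# the single mismatch sits inside every possible word.
print(seed_hits("a" * 10 + "c" + "a" * 10, "a" * 10 + "g" + "a" * 10))
```

The second call returns no seeds at all, which shows how a word-based search can miss a real similarity when mismatches are spread out.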

27
Bad Genome Annotation
  • Gene finding is at best only 90% accurate.
  • New sequences are automatically annotated with
    BLAST scores.
  • Bad annotations propagate
  • It's going to take us 10-20 years or more to sort
    this mess out!

28
Protein Function
  • The ultimate goal of the HGP is to identify all
    of the genes and determine their functions
  • Genes function by being translated into proteins
  • structural
  • enzymes
  • regulatory
  • signalling

29
Translation
  • Once we have found the DNA sequence of a gene, we
    can decode the amino acid sequence of the
    corresponding protein.
  • The Genetic Code is actually quite simple.

30
Chemical Properties
  • Some chemical properties of a protein can be
    calculated from its amino acid sequence
  • molecular weight
  • charge/pH
  • hydrophobicity
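
As a small sketch of calculating properties from sequence alone, the code below computes average hydrophobicity using the Kyte-Doolittle hydropathy scale and a crude charge estimate that simply counts charged residues. Both choices are assumptions for illustration (the slides do not name a scale or method), and the peptide is made up.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(protein):
    """Average hydropathy over all residues."""
    protein = protein.upper()
    return sum(KYTE_DOOLITTLE[aa] for aa in protein) / len(protein)


def net_charge_estimate(protein):
    """Crude charge near neutral pH: +1 for K and R, -1 for D and E.
    Ignores histidine, the termini, and real pKa values."""
    protein = protein.upper()
    positive = sum(protein.count(aa) for aa in "KR")
    negative = sum(protein.count(aa) for aa in "DE")
    return positive - negative


peptide = "MKKLLVVAIDEG"  # hypothetical peptide
print(gravy(peptide), net_charge_estimate(peptide))
```

A strongly positive average suggests a hydrophobic (possibly membrane-associated) stretch; a proper charge calculation would use pKa values at a given pH.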

31
Patterns in Proteins
32
Conserved Domains
  • Proteins are built out of functional units known
    as domains (or motifs)
  • These domains have conserved sequences
  • Often much more similar than their respective
    proteins
  • Exon splicing theory (W. Gilbert)
  • Exons correspond to folding domains which in
    turn serve as functional units
  • Unrelated proteins may share a single similar
    exon (e.g., ATPase or DNA-binding function)

33
Simple Structures
  • Some motifs form structures that can be
    recognized as simple sequence patterns
  • transmembrane domains
  • coiled coils
  • helix-turn-helix
  • signal peptides

34
Functional Motifs
  • Other functional portions of proteins can be
    recognized by their sequence, even if their 3-D
    structure is not known.
  • There are many databases of protein
    motifs/domains: ProSite, Pfam, ProDom, etc.

35
Tools for Finding Motifs
  • Define a motif from a set of known proteins that
    share a similar sequence and function.
  • A pattern is a list of amino acids that can occur
    at each position in the motif.
  • A profile is a matrix that assigns a value to
    every amino acid at every position in the motif.
  • An HMM (hidden Markov model) is a more complex
    profile based on pairs of amino acids.
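
The difference between a pattern and a profile can be sketched as follows. The pattern is written as a regular expression for the well-known N-glycosylation site (N, then anything but P, then S or T, then anything but P). The profile is a log-odds matrix built from four made-up aligned motif instances, with a pseudocount and a uniform background as simplifying assumptions. No HMM is shown.

```python
import math
import re
from collections import Counter

AMINO = "ACDEFGHIKLMNPQRSTVWY"

# Pattern: a list of the residues allowed at each position.
PATTERN = re.compile(r"N[^P][ST][^P]")


def build_profile(aligned, pseudocount=1.0):
    """Profile: a log-odds value for every amino acid at every position."""
    profile = []
    total = len(aligned) + pseudocount * len(AMINO)
    background = 1.0 / len(AMINO)
    for pos in range(len(aligned[0])):
        counts = Counter(seq[pos] for seq in aligned)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO
        })
    return profile


def score(profile, window):
    """Sum the per-position values for one window of the protein."""
    return sum(column[aa] for column, aa in zip(profile, window))


def best_match(profile, protein):
    """Return (score, start) of the best-scoring window."""
    w = len(profile)
    return max(
        (score(profile, protein[i:i + w]), i)
        for i in range(len(protein) - w + 1)
    )


motif_instances = ["NGSA", "NASG", "NVTL", "NGTV"]  # hypothetical alignment
protein = "MKNGSAPLNPTQ"                            # hypothetical protein

print(PATTERN.findall(protein))
print(best_match(build_profile(motif_instances), protein))
```

The pattern gives a yes/no answer at each position, while the profile gives a graded score, so it can rank near-matches that a strict pattern would reject.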

37
Protein 3-D Structure
38
Structure = Function
  • Proteins function by 3-D interactions with other
    molecules (i.e. physical chemistry).
  • So for a protein, 3-D structure is function.
  • But we can't accurately determine 3-D structure
    from gene sequence.

39
Structure Prediction
  • Predicting a protein's 3-D structure from its
    amino acid sequence is incredibly complex.
  • proteins are polypeptides (long chains of amino
    acids)
  • can fold and rotate around bonds within each
    amino acid as well as the bonds between them
  • it is not possible to evaluate every possible
    folding pattern for an amino acid sequence
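
A back-of-the-envelope calculation shows why exhaustive evaluation is out of reach. The sketch assumes, purely for illustration, that each residue can adopt 3 conformations and that a computer can check one trillion conformations per second; neither number comes from these slides.

```python
SECONDS_PER_YEAR = 3600 * 24 * 365


def conformations(n_residues, states_per_residue=3):
    """Number of chain conformations if every residue chooses independently."""
    return states_per_residue ** n_residues


def years_to_enumerate(n_residues, per_second=1e12, states_per_residue=3):
    """Years needed to check every conformation at the assumed rate."""
    seconds = conformations(n_residues, states_per_residue) / per_second
    return seconds / SECONDS_PER_YEAR


for n in (10, 50, 100):
    print(n, conformations(n), years_to_enumerate(n))
```

Even with these generous assumptions, a 100-residue protein needs far longer than the age of the universe, so prediction methods have to search selectively.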

40
Secondary Structure
  • The local structure of the amino acids in a
    protein can also be predicted to some extent.
  • Each amino acid has a tendency to form either an
    alpha helix or a beta sheet

41
Threading
  • Rather than computing a 3-D structure from
    scratch, it may be possible to find a similar
    structure.
  • Must have at least 25% aa sequence identity.
  • Uses a process called threading to create a new
    structure based on a known structure.
  • This still requires HUGE amounts of computer
    power.

43
Protein Data Bank
  • There is a database of all known protein
    structures called the PDB.
  • These have been determined by X-ray
    crystallography and/or NMR.
  • Anyone can download and view these structures
    with a PDB viewer program.

44
RasMol
  • RasMol is the simplest PDB viewer.
  • http://www.umass.edu/microbio/rasmol/
  • It can work together with a web browser to let
    you view the structure of any sequence found with
    Entrez that has a known 3-D structure.

45
Gene Finding and Translation
  • How can we find genes on chromosomes?
  • Genome project data is just huge chunks of DNA.
  • Does automatic annotation work?

46
Raw Genome Data
47
Finding Genes is Not Easy
  • Perhaps 1% of human DNA encodes functional genes.
  • Genes are interspersed among long stretches of
    non-coding DNA.
  • Repeats, pseudo-genes, and introns confound
    matters

48
Pattern Finding Tools
  • It is possible to use DNA sequence patterns to
    predict genes
  • Promoters
  • translational start and stop codons (ORFs)
  • intron splice sites
  • codon usage
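
The simplest of these patterns, an open reading frame (ORF), can be found with a short scan. The sketch below looks only at the forward strand, treats ATG as the only start codon, and ignores introns, promoters, and codon usage, so it is far from a real gene finder. The function name, the minimum-length cutoff, and the test sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=3):
    """Return (start, end) index pairs of ORFs on the forward strand.
    Each ORF runs from an ATG to the first in-frame stop codon (inclusive)."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count coding codons, excluding the stop itself.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return sorted(orfs)


dna = "CCATGAAATTTGGGTAACC"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

In bacterial genomes long ORFs are good gene candidates; in human DNA, introns split genes into short exons, which is one reason ORF scanning alone is not enough.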

50
Similarity to Known Genes
  • It is also possible to scan new DNA sequence for
    known genes
  • Can look for annotated genes/proteins
  • Or just for RNAs (ESTs)