Bioinformatics Resources for Computer Science Educators

1
Bioinformatics Resources for Computer Science
Educators
  • Debra T. Burhans, Ph.D.
  • Director, Bioinformatics Program
  • Canisius College
  • burhansd_at_canisius.edu
  • Gary R. Skuse, Ph.D.
  • Director, Bioinformatics Program
  • Rochester Institute of Technology
  • gary_at_bioinformatics.rit.edu
  • SIGCSE Workshop, March 2004

2
Outline
  • General resources
  • NCBI
  • Bioinformatics data and software
  • Other useful resources
  • Summary

3
General Resources
4
Comprehensive Web Sites
  • NCBI: National Center for Biotechnology
    Information
  • http://www.ncbi.nih.gov/
  • Bioinformatics.ca: Central bioinformatics site
    in Canada
  • http://www.bioinformatics.ca/
  • EBI: European Bioinformatics Institute
  • http://www.ebi.ac.uk/
  • TIGR: The Institute for Genomic Research
  • http://www.tigr.org/
  • DDBJ: DNA Data Bank of Japan
  • http://www.ddbj.nig.ac.jp/
  • Bioinformatics.org: resources in bioinformatics
  • http://www.bioinformatics.org/

5
Comprehensive Web Sites
  • Canadian Bioinformatics Resources
  • http://cbrmain.cbr.nrc.ca:8080/cbr/servlet/ListCLAppsServlet?type=8&lang=eng
  • CCR at SUNY Buffalo (some public, some member
    resources): http://bioinformatics.ccr.buffalo.edu
  • National Library of Medicine
  • http://www.nlm.nih.gov/nlmhome.html
  • Bio-IT World
  • http://www.bio-itworld.com
  • National Science Digital Library
  • http://www.nsdl.org
  • CITIDEL
  • http://www.citidel.org/
  • Pevsner, Bioinformatics and Functional Genomics,
    Wiley 2003 (book): ch. 1 URLs overview
  • http://www.bioinfbook.org/chapt1.htm

6
NCBI
7
Entrez
8
Just how much information?
  • GenBank, the primary gene sequence database at
    the NCBI (National Center for Biotechnology
    Information at the National Institutes of
    Health), Release 135, April 2003, comprises:
  • 24,027,936 records
  • 31,099,264,455 nucleotides
  • 120,000 species
  • 114 GB

9
GenBank Growth
10
Usage Statistics
July 2001
11
Bioinformatics Data and Software
12
What sort of data?
  • Sequence
  • DNA/gene (genome)
  • Amino acid/protein (proteome)
  • Structure
  • Protein structure
  • Expression
  • Gene and protein expression
  • Interaction
  • Biological pathways (metabolic pathways:
    metabolome)
  • Evolutionary biology
  • Molecular phylogenetics
  • Disease patterns and inheritance
  • OMIM (Online Mendelian Inheritance in Man)
  • Biomedical literature (biobibliome)

13
Sequence
14
Sequence Data - DNA
  • DNA is double stranded
  • Think of DNA sequence information as a complex
    language
  • We know the alphabet (A, C, G, T)
  • We know that subsequences correspond to
    functional components of the genome
  • Genes (average size is 3000 bases)
  • Regulatory sequences
  • We don't know how to identify all of these
    components: we don't know all of the words of
    the language
  • Complex code with a 4-letter alphabet
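Because DNA is double stranded, any sequence over the 4-letter alphabet implies a second sequence on the opposite strand. The following is a minimal sketch of computing that strand; the function name `reverse_complement` is illustrative and not part of the presentation, and the code assumes the input contains only a, c, g and t.

```python
# Map each base to its pairing partner (lowercase, as in GenBank records).
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}


def reverse_complement(seq: str) -> str:
    """Return the opposite strand, read in its own 5'->3' direction.

    Assumes the sequence holds only a/c/g/t; any other symbol
    (for example 'n') raises KeyError in this sketch.
    """
    seq = seq.lower()
    return "".join(COMPLEMENT[base] for base in reversed(seq))


if __name__ == "__main__":
    print(reverse_complement("aaaaaggaag"))  # cttccttttt
```

Applying the function twice returns the original sequence, which makes a convenient sanity check for students.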

15
DNA Sequence
  • 1 aaaaaggaag cgttcgccga gatcgcagcg
    gctgcgccgg ggtatgcgga acgggctcgt
  • 61 gtggctgctg caccccgcgc tgcccggcac
    cttgcgctcc atcctcggcg cccgcccgcc
  • 121 gcccgcgaag cgactgtgtg gattcccaaa
    acagacttac agcacaatga gtaatccggc
  • 181 catccagaga atagaagacc aaattgtcaa
    gtctcctgaa gacaaacggg aataccgtgg
  • 241 actagagctg gccaatggca tcaaagtgct
    tctcatcagc gatcccacca cagacaagtc
  • 301 ctcagcggcc ctcgatgtgc acataggttc
    actgtcagac cctccaaata ttcctggctt
  • 361 aagtcatttt tgtgaacata tgctgttttt
    gggaaccaag aaatatccta aagaaaatga
  • 421 atatagccag tttctcagtg aacatgctgg
    aagttcaaat gcattcacca gtggagaaca
  • 481 caccaattat tatttcgatg tttcccatga
    acacttggaa ggagccctgg acaggtttgc
  • 541 gcagtttttc ctgtgcccct tgtttgatgc
    aagttgtaaa gacagagagg tgaacgctgt
  • 601 cgattcagaa catgagaaga atgtgatgaa
    cgatgcctgg agactcttcc agctggaaaa
  • 661 ggctacgggg aaccccaaac accccttcag
    caaatttggg acaggaaaca aatatactct
  • 721 agagactcgg cccaaccaag aaggcatcga
    cgtaagggaa gagctcttga aatttcactc
  • 781 tacgtattat tcgtccaatc tgatggcgat
    ttgtgtttta ggtcgagaat ccttagacga

16
Reading Frames
From http://www.ebi.ac.uk/help/frames_frame.html
17
ORFs
  • Open reading frames
  • A random piece of DNA has 6 different reading
    frames associated with it: 3 in the forward
    direction and 3 in the reverse
  • Different reading frames produce different amino
    acid sequences
  • ORF finder (NCBI)
  • http://www.ncbi.nih.gov/gorf/gorf.html
  • Good student exercise! Write an ORF finding
    program
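One possible starting point for the ORF exercise is sketched below. It is not the NCBI ORF finder's algorithm: it simply scans all six reading frames with the standard genetic code and reports each run that starts at ATG and ends at the first in-frame stop codon. The names `find_orfs` and `min_codons` are assumptions made for this example, and runs that reach the end of the sequence without a stop codon are dropped.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Opposite strand of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=2):
    """Yield (strand, frame, protein) for each ATG...stop run in six frames."""
    dna = dna.upper()
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = None  # None means we are not inside an ORF
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                aa = CODON_TABLE.get(codon, "X")  # X for ambiguous codons
                if protein is None:
                    if codon == "ATG":
                        protein = ["M"]
                elif aa == "*":
                    if len(protein) >= min_codons:
                        yield strand, frame, "".join(protein)
                    protein = None
                else:
                    protein.append(aa)


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATGGTAACC"):
        print(orf)  # ('+', 2, 'MKW')
```

A natural extension for students is to report start and end coordinates, or to compare the output with the NCBI tool on a real GenBank record.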

18
Sequence Data - Protein
  • A protein is a sequence of amino acids
  • There are 20 amino acids
  • Table to the right lists one and three letter
    abbreviations with names
    (http://bioinformatics.org/tutorial/1-3.html)
  • Protein sequences are represented using the one
    letter code
  • Web site about biology and alphabets at Brandeis
  • http://ocelot.bio.brandeis.edu/pages/classes/InterpGenes/Project/bit8.htm
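The mapping between three-letter and one-letter abbreviations is a small lookup table. As a rough sketch, the code below converts a three-letter string to the one-letter code; the hyphen-separated input format and the function name `to_one_letter` are assumptions made for this example.

```python
# The 20 standard amino acids: three-letter -> one-letter abbreviation.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}


def to_one_letter(three_letter: str, sep: str = "-") -> str:
    """Convert e.g. 'Met-Asn-Ile-Ser' to 'MNIS'.

    Unknown residue names raise KeyError in this sketch.
    """
    residues = three_letter.split(sep)
    return "".join(THREE_TO_ONE[res.capitalize()] for res in residues)


if __name__ == "__main__":
    print(to_one_letter("Met-Asn-Ile-Ser"))  # MNIS
```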

19
Protein Sequence
  • >gi|42718017|ref|NP_976037.1| retinoblastoma
    binding protein 8 isoform b; CTBP-interacting
    protein; retinoblastoma-interacting myosin-like
    [Homo sapiens]
    MNISGSSCGSPNSADTSSDFKDLWTKLKECHDREV
    QGLQVKVTKLKQERILDAQRLEEFFTKNQQLREQQ
    KVLHETIKVLEDRLRAGLCDRCAVTEEHMRKKQQEFENIRQQNLKLITEL
    MNERNTLQEENKKLSEQLQQ KIENDQQHQAAELECEEDVIPDSPITAFS
    FSGVNRLRRKENPHVRYIEQTHTKLEHSVCANEMRKVSKSS
    THPQHNPNENEILVADTYDQSQSPMAKAHGTSSYTPDKSSFNLATVVAET
    LGLGVQEESETQGPMSPLGD ELYHCLEGNHKKQPFEESTRNTEDSLRFS
    DSTSKTPPQEELPTRVSSPVFGATSSIKSGLDLNTSLSPSL
    LQPGKKKHLKTLPFSNTCISRLEKTRSKSEDSALFTHHSLGSEVNKIIIQ
    SSNKQILINKNISESLGEQN RTEYGKDSNTDKHLEPLKSLGGRTSKRKK
    TEEESEHEVSCPQASFDKENAFPFPMDNQFSMNGDCVMDKP
    LDLSDRFSAIQRQEKSQGSETSKNKFRQVTLYEALKTIPKGFSSSRKASD
    GNCTLPKDSPGEPCSQECII LQPLNKCSPDNKPSLQIKEENAVFKIPLR
    PRESLETENVLDDIKSAGSHEPIKIQTRSDHGGCELASVLQ
    LNPCRTGKIKSLQNNQDVSFENIQWSIDPGADLSQYKMDVTVIDTKDGSQ
    SKLGGETVDMDCTLVSETVL LKMKKQEQKGEKSSNEERKMNDSLEDMFD
    RTTHEEYESCLADSFSQAADEEEELSTATKKLHTHGDKQDK
    VKQKAFVEPYFKGDESIMQICQQKKEKRNWLPAQDTDSATFHPTHQRIFG
    KLVFLPLRLVWKEVILRKIL ILVLVQKDVSLTTQYFLQKARSRRHRR

20
Software/Computing Tools
  • Sequence alignment (Paul's talk)
  • BLAST (NCBI)
  • http://www.ncbi.nlm.nih.gov/BLAST/
  • FASTA
  • http://fasta.bioch.virginia.edu/
  • ClustalW (EBI): multiple alignment
  • http://www.ebi.ac.uk/clustalw/
  • EBI sequence analysis tools
  • http://www.ebi.ac.uk/Tools/sequence.html
  • Gene Boy
  • http://www.dnai.org/geneboy/index.html

21
Software/Computing Tools
  • Gene Finding/Gene structure
  • GLIMMER (bacteria and archaea primarily)
  • http://www.tigr.org/~salzberg/glimmer.html
  • Database searching, profile building for protein
    sequence analysis
  • HMMER (Hidden Markov Models)
  • http://hmmer.wustl.edu/

22
Data and Formats
  • Data storage and formatting
  • FASTA format (raw sequence)
  • GenBank records (e.g. fly database)
  • Can display in many different formats
  • XML (e.g. retinoblastoma binding protein)
  • There are many good problems in parsing and
    database design that can be illustrated using
    this data
  • Pevsner, Chapter 2: sequence data URLs overview
  • http://www.bioinfbook.org/chapt2.htm
  • NCBI FTP site includes data repository and tools
  • http://www.ncbi.nlm.nih.gov/Ftp/index.html
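As one example of the parsing problems mentioned above, here is a minimal sketch of a FASTA reader. It assumes the common convention that each record starts with a `>` header line followed by one or more sequence lines; the function name `parse_fasta` and the sample records are invented for illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {header: sequence} for FASTA-formatted text lines.

    The header is stored without the leading '>'; sequence lines that
    were wrapped in the file are joined into one string.
    """
    records = {}
    header = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


# Made-up sample input, not real database records.
SAMPLE = """>seq1 example DNA
aaaaaggaag cgttcgccga
gatcgcagcg
>seq2 example protein
MNISGSS
"""

if __name__ == "__main__":
    for name, seq in parse_fasta(SAMPLE.splitlines()).items():
        print(name, len(seq))
```

The sketch keeps the internal spaces found in GenBank-style sequence lines; stripping them, validating the alphabet, and handling duplicate headers are reasonable follow-up exercises.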

23
Structure
24
Protein Representation
  • Proteins have structure at different levels
  • Primary (sequence)
  • Secondary (local folding)
  • Tertiary (global folding)
  • Quaternary (interactions)
  • Protein Structure viewing tools
  • Cn3D
  • ftp://ftp.ncbi.nih.gov/cn3d/
  • RasMol
  • http://www.chemistry.wustl.edu/edudev/rasdir.html
  • Protein Explorer
  • http://molvis.sdsc.edu/protexpl/frntdoor.htm

25
Protein Structure Prediction
  • This is a critical problem that attracts the
    efforts of many laboratories around the world
  • Protein structures can be studied directly using
    X-ray crystallography
  • There are many more protein sequences than known
    structures for them
  • See Paul Craig's slides from the RIT workshop on
    Predicting and Visualizing Protein Structure for
    more information
  • Protein data is available in a variety of formats
    including flat files whose data can be input to a
    3-D modeling program

26
Expression
27
Expression Data
  • The context (e.g. tissue type, stage of growth of
    an organism, etc.) of a cell determines its
    pattern of gene and protein expression
  • Expression patterns may be measured using
    microarrays
  • Each spot on a microarray attracts and binds
    particular sequences
  • The amount of sequence bound to a spot can be
    quantified (though there are problems with this)
  • Data is available in a variety of formats, for
    example:
  • Spreadsheet
  • Image
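To give a feel for working with spreadsheet-style expression data, the sketch below reads comma-separated intensities and computes a log2 ratio per gene. Everything specific here is an assumption for illustration: the column names `gene`, `sample` and `control`, the two-condition layout, and the made-up numbers. Real microarray exports differ by platform.

```python
import csv
import io
import math


def log_ratios(csv_text):
    """Return {gene: log2(sample / control)} for rows with positive values."""
    ratios = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        sample = float(row["sample"])
        control = float(row["control"])
        if sample > 0 and control > 0:  # the logarithm is undefined otherwise
            ratios[row["gene"]] = math.log2(sample / control)
    return ratios


# Hypothetical spot intensities, not real measurements.
EXAMPLE = """gene,sample,control
geneA,800,200
geneB,100,400
geneC,0,350
"""

if __name__ == "__main__":
    for gene, ratio in log_ratios(EXAMPLE).items():
        print(gene, ratio)
```

Rows with a zero intensity are skipped rather than guessed at, which is one simple way to acknowledge the quantification problems noted on the slide.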

28
Microarray Data
29
Microarray Data
30
Microarray Data in Spreadsheet
  • Spreadsheet file

31
Resources for Expression Data - 1
  • NCBI Gene Expression Omnibus
  • http://www.ncbi.nlm.nih.gov/geo/
  • Microarray data and tools at EBI
  • http://www.ebi.ac.uk/microarray/
  • Stanford Microarray database
  • http://genome-www5.stanford.edu/
  • Affymetrix
  • http://www.affymetrix.com/index.affx
  • Wake Forest Gene Expression Technology Group
    Links
  • http://www.wfubmc.edu/physpharm/genetech/genetechlinks.html

32
Resources for Expression Data - 2
  • Gene Expression Page at EBI
  • http://industry.ebi.ac.uk/~alan/MicroArray/
  • Microarray links from U Berlin
  • http://www.bioinf.mdc-berlin.de/~schober/ArrayLinks.htm
  • Rockefeller University Gene Array Resources
  • http://www.rockefeller.edu/genearray/software

33
Interaction
34
Biological Pathways
  • Determine genes that are expressed together
  • Determine how different proteins interact in
    complex metabolic pathways
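One simple way to look for genes that are expressed together is to correlate their expression profiles across conditions. The sketch below uses Pearson correlation with an arbitrary threshold; this is only one of many possible approaches, and the gene names, profile values and the `threshold` parameter are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric profiles.

    Divides by zero if either profile is constant; a fuller version
    would handle that case explicitly.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def coexpressed_pairs(profiles, threshold=0.9):
    """List gene pairs whose profiles correlate at or above the threshold."""
    genes = sorted(profiles)
    return [(g, h)
            for i, g in enumerate(genes)
            for h in genes[i + 1:]
            if pearson(profiles[g], profiles[h]) >= threshold]


# Hypothetical expression levels across four conditions.
PROFILES = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}

if __name__ == "__main__":
    print(coexpressed_pairs(PROFILES))
```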

35
Pathways Resources
  • BIND Database/BluePrint
  • http://www.blueprint.org/bind/bind.php
  • Comprehensive Web site with links to pathways
    resources
  • http://www.hgmp.mrc.ac.uk/GenomeWeb/prot-interaction.html

36
Evolutionary Biology
37
Comparative Genomics
  • Comparative genomics is the analysis of
    molecular data from multiple species.
  • Biological applications: the field of
    systematics and the tools of molecular biology
    have combined to form molecular phylogenetics.
  • Biologists use molecular phylogenetics to
    reconstruct evolutionary trees based on DNA or
    protein sequences.
  • Comprehensive page at Penn State with links
  • http://posnania.biotec.psu.edu/tools/resources.html#phylogeny
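Many tree-building methods start from pairwise distances between aligned sequences. As a hedged illustration, the sketch below computes the simplest such measure, the proportion of differing sites (p-distance), for toy sequences that are assumed to be already aligned. The sequence names and contents are invented, and real phylogenetic tools apply more careful evolutionary models.

```python
def p_distance(a, b):
    """Fraction of compared sites at which two aligned sequences differ.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    return sum(x != y for x, y in pairs) / len(pairs)


def distance_matrix(aligned):
    """Return {(name1, name2): p-distance} for every ordered pair."""
    names = sorted(aligned)
    return {(m, n): p_distance(aligned[m], aligned[n])
            for m in names for n in names}


# Toy alignment, not real data.
ALIGNED = {
    "seq1": "ACGTACGT",
    "seq2": "ACGTACGA",
    "seq3": "TCGAACGA",
}

if __name__ == "__main__":
    for pair, dist in sorted(distance_matrix(ALIGNED).items()):
        print(pair, dist)
```

In this toy data seq1 and seq2 are closest, so a distance-based method would join them first.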

39
Disease Patterns and Inheritance
40
The link to medicine
  • By understanding the genetic code we gain an
    understanding of disease
  • By understanding how we are related to other
    organisms we can understand better how model
    organisms relate to us and which model organisms
    might be appropriate to use when studying disease
  • If we know exactly what has caused disease
    (mutations) we might be able to fix it (gene
    therapy)
  • Subtyping of diseases based on expression
    patterns has already improved disease treatment
    for leukemia
  • This is a thriving and important area of
    bioinformatics research

41
Your genome and health
  • Richard A. Young from the Whitehead Institute
    imagines a health-care system in which, shortly
    after a baby is born, doctors take a tiny piece
    of tissue and test its genes to predict the
    baby's medical future. (Boston Globe/February 17,
    2004, Carlene Hempel)
  • OMIM: Online Mendelian Inheritance in Man,
    database maintained at Johns Hopkins University
    and available at NCBI
  • http://www.ncbi.nih.gov/entrez/query.fcgi?db=OMIM
  • NCBI Genes and Disease: link off main page
    (right hand side)
  • http://www.ncbi.nih.gov

42
Biomedical Literature
43
Literature Resources
  • Electronic repository of biomedical journals,
    including abstracts and (for many articles) full
    text
  • Available through PubMed at NCBI
  • http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed
  • The MEDLINE database currently comprises nearly
    500 files of 30,000 lines each (baseline data
    approx. 40 GB)
  • Can be downloaded from the National Library of
    Medicine (NLM)
  • Database resource page at the NLM
  • http://www.nlm.nih.gov/databases/databases.html
  • NCBI Bookshelf: online access and downloadable
    full text
  • http://www.ncbi.nih.gov/entrez/query.fcgi?db=Books

44
Other Useful Resources
45
Ethical, Legal and Social Implications (ELSI)
  • ELSI page: Human Genome Project (DOE)
  • http://www.ornl.gov/sci/techresources/Human_Genome/elsi/elsi.shtml
  • ELSI Institute, Dartmouth College
  • http://www.dartmouth.edu/ethics/programs.html

46
Ontology Resources
  • Ontologies: terminologies arranged hierarchically
  • Allow for standardization of terms
  • Gene Ontology Consortium
  • http://www.geneontology.org/
  • Open Biological Ontologies
  • http://obo.sourceforge.net/
  • UMLS at the National Library of Medicine
  • http://www.nlm.nih.gov/research/umls/umlsmain.html
  • Robert Stevens (U Manchester) Ontology Page
  • http://www.cs.man.ac.uk/~stevensr/ontology.html
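A hierarchically arranged terminology can be modeled as a graph in which each term points to its more general parents. The sketch below is a toy illustration of that idea, not the data model of any ontology listed above; the terms in `PARENTS` and the function name `ancestors` are invented for the example.

```python
# Toy hierarchy: each term maps to its direct, more general parents.
# These terms are made up and are not taken from a real ontology file.
PARENTS = {
    "thing": [],
    "function": ["thing"],
    "binding": ["function"],
    "protein binding": ["binding"],
    "catalysis": ["function"],
}


def ancestors(term, parents=PARENTS):
    """Return every more general term reachable from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("protein binding")))
    # ['binding', 'function', 'thing']
```

Because terms are stored with lists of parents, the same code works when a term has more than one parent, which is what distinguishes a graph-shaped ontology from a simple tree.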

47
Programming Language Resources
  • Perl is very popular
  • Perl, Python and Java have special modules for
    bioinformatics
  • ActiveState: Perl, Python and other languages
  • http://www.activestate.com/
  • CPAN Perl Archive
  • http://www.cpan.org/
  • BioPerl
  • http://www.bioperl.org/
  • Biopython
  • http://www.biopython.org/
  • BioJava
  • http://www.biojava.org/

48
Educational Resources
  • Human Genome Project at DOE
  • http://www.doegenomes.org/
  • NCBI Education Site
  • http://www.ncbi.nih.gov/Education/index.html
  • Geospiza
  • http://www.geospiza.com/outreach/
  • EBI 2can
  • http://www.ebi.ac.uk/2can/home.html
  • Dolan DNA Learning Center
  • http://www.dnalc.org

49
Academic Programs
  • Bio-IT World overview
  • http://www.bio-itworld.com/careers/biotrain/index.html
  • Check this site out; if you find omissions or
    errors, email them to Bio-IT World to help
    create a complete and correct list of
    bioinformatics programs

50
Summary
  • Enormous amount of data
  • Multitude of formats
  • An important problem is translation among formats
  • Proprietary vs. open source
  • Communities of biologists are banding together to
    create important web-based resources
  • There are a large number of resources on the WWW
  • There are research issues involved for computer
    scientists and biologists

51
Lincoln Stein on Bioinformatics
  • Lincoln Stein's keynote at the O'Reilly
    Bioinformatics Technology Conference was
    provocatively titled "Bioinformatics Gone in
    2012." Despite the title, Stein is optimistic
    about the future for people doing bioinformatics.
    But he explained that "the field of
    bioinformatics will be gone by 2012. The field
    will be doing the same thing but it won't be
    considered a field."