Introduction to Bioinformatics - PowerPoint PPT Presentation

Slides: 37
Provided by: Nat158
Title: Introduction to Bioinformatics


1
Introduction to Bioinformatics
2
SIB and EMBnet Bioinformatics resources for
biomedical scientists
3
The Swiss Institute of Bioinformatics
  • Founded in March 1998
  • Collaborative structure: Lausanne - Geneva - Basel
  • Groups at ISREC, Ludwig Institute, Unil, HUG,
    UniGe, recently UniBas and soon EPFL
  • Several roles: teaching, services, research
  • Currently 160 employees

4
Projects at SIB
  • Databases
      • SWISS-PROT, PROSITE, EPD, World-2DPAGE,
        SWISS-MODEL
      • TrEST, TrGEN (predicted proteins), tromer
        (transcriptome)
  • Software
      • Melanie, DeepView, proteomic tools, ESTScan,
        pftools, Java applets
  • Services
      • Web servers: ExPASy, EMBnet, MyHits
      • Teaching and helpdesk
  • Research
      • Mostly sequence and expression analysis, 3D
        structure, and proteomics

6
Teaching
  • Master's degrees in Bioinformatics (Bologna type),
    90 ECTS credits, at UniGe, Unil and UniBas
  • EMBnet courses: 4 x 1 week per year in Lausanne,
    Basel and Zürich
  • Undergraduate courses at the Universities of
    Geneva, Fribourg and Lausanne
  • Other courses at CHUV and EPFL
  • Courses in other countries: Colombia, Cambodia,
    Peru, ...

7
Research
  • New algorithms (faster alignments)
  • New technology (GRID or cluster computing)
  • New tools (protein analysis, microarrays,
    confocal microscopy)
  • New databases (microarrays, transcriptome,
    proteome)
  • Collaborations with lab researchers!

8
Three levels of service
  • Simple web access to software and databases
      • Easy to use for basic, occasional research with
        few sequences
      • Potentially insecure
  • Command-line access with a local Unix account
      • More powerful (automation) and secure
      • Requires an understanding of Unix and frequent
        practice
  • Collaboration with SIB
      • Access to experts in the field (help desk)
      • For projects requiring extensive programming or
        special hardware resources
  • Help desk
      • helpdesk_at_mail.ch.embnet.org or
        http://www.expasy.org/contact.html

9
SIB's important sites
  • Home
      • www.isb-sib.ch
  • ExPASy - Expert Protein Analysis System
      • www.expasy.org
  • MyHits database and tools
      • myhits.isb-sib.ch
  • EMBnet Switzerland
      • www.ch.embnet.org
  • Geneva Bioinformatics
      • www.genebio.ch

10
SIB home
11
Expert Protein Analysis System
12
MyHits: http://myhits.isb-sib.ch
13
Swiss node: http://www.ch.embnet.org
14
EMBnet organisation
  • European at its founding in 1988, now spread
    worldwide
  • 32 country nodes, 8 special nodes
  • Role
      • Training, education (EMBER)
      • Software development (EMBOSS, SRS)
      • Computing resources (databases, websites,
        services)
      • Helpdesk and technical support
      • Publications (EMBnet.news, Briefings in
        Bioinformatics)
  • Access: www.embnet.org
      • Each node is at www.xx.embnet.org, where xx is
        the country code (e.g., ch for Switzerland)

15
EMBnet home
16
European Molecular Biology Open Software Suite (EMBOSS)
  • Free and open source (for most Unix platforms)
  • GCG successor (compatible with GCG file format)
  • More than 150 programs (ver. 2.9.0)
  • Easy to install locally
      • but no interface, requires local databases
      • Unix command-line only
  • Interfaces
      • Jemboss, wEMBOSS, www2gcg, w2h (with account)
      • Pise, EMBOSS-GUI, SRSWWW (no account)
      • Staden, Kaptain, CoLiMate, Jemboss (local)
  • Access: www.emboss.org or emboss.sourceforge.net

17
Other important sites
  • ExPASy - Expert Protein Analysis System
      • www.expasy.org
  • EBI - European Bioinformatics Institute
      • www.ebi.ac.uk
  • NCBI - National Center for Biotechnology
    Information
      • www.ncbi.nlm.nih.gov
  • Sanger - The Sanger Institute
      • www.sanger.ac.uk

18
Bioinformatics definition
  • Every application of computer science to biology
      • Sequence analysis, image analysis, sample
        management, population modelling, ...
  • Analysis of data coming from large-scale
    biological projects
      • Genomes, transcriptomes, proteomes, metabolomes,
        etc.

19
The new biology
  • Traditional biology
      • Small team working on a specialized topic
      • Well-defined experiment to answer precise
        questions
  • New "high-throughput" biology
      • Large international teams using cutting-edge
        technology, which defines the project
      • Results are released raw to the scientific
        community without any underlying hypothesis

20
Examples of "high-throughput"
  • Complete genome sequencing
  • Large-scale sampling of the transcriptome (EST)
  • Simultaneous expression analysis of thousands of
    genes (DNA microarrays, SAGE)
  • Large-scale sampling of the proteome
  • Protein-protein interaction analysis: large-scale
    2-hybrid (yeast, worm)
  • Large-scale 3D structure production (yeast)
  • Metabolism modelling
  • Simulations
  • Biodiversity

21
Role of bioinformatics
  • Control and management of the data
  • Analysis of primary data, e.g.
      • Base calling from chromatograms
      • Mass spectra analysis
      • DNA microarray image analysis
  • Statistics
  • Database storage and access
  • Results analysis in a biological context

22
First information: a sequence?
  • Nucleotide
      • RNA (or cDNA)
      • Genomic (intron-exon)
      • Complete or incomplete?
      • mRNA with 5' and 3' UTR regions
      • Entire chromosome
  • Protein
      • Pre/Pro or functional protein?
      • Function prediction
      • Post-translational modifications?
      • Holy Grail: 3D structure?

23
Genomes in numbers
  • Sizes
      • virus: 10^3 to 10^5 nt
      • bacteria: 10^5 to 10^7 nt
      • yeast: 1.35 x 10^7 nt
      • mammals: 10^8 to 10^10 nt
      • plants: 10^10 to 10^11 nt
  • Gene number
      • virus: 3 to 100
      • bacteria: 1000
      • yeast: 7000
      • mammals: 30000
      • plants: 30000-50000?
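
As a rough illustration of these numbers, dividing genome size by gene number gives the average stretch of genome per gene. The sketch below uses the yeast figures from this slide and, as an assumption, pairs the mammalian gene count with a human-sized genome of 3 x 10^9 nt. The function name `nt_per_gene` is hypothetical and only serves the example.

```python
# Rough gene density from the slide's figures.
# Assumption: the mammalian gene count (30000) is paired with a
# human-sized haploid genome of 3e9 nt; the slide itself gives a range.
GENOMES = {
    # name: (genome size in nt, approximate gene number)
    "yeast": (1.35e7, 7000),
    "mammal (human-sized)": (3e9, 30000),
}


def nt_per_gene(size_nt, genes):
    """Average number of genome nucleotides per gene."""
    return size_nt / genes


for name, (size_nt, genes) in GENOMES.items():
    print(f"{name:22s} {nt_per_gene(size_nt, genes):>10,.0f} nt per gene")
```

With these inputs, yeast comes out at roughly 2 x 10^3 nt per gene and the human-sized genome at 10^5 nt per gene, which is one way to see that larger genomes are far less gene-dense.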

24
Sequencing projects
  • "Small" genomes (<10^7 nt): bacteria, viruses
      • Many already sequenced (industry excluded)
      • More than 150 microbial genomes already in the
        public domain
      • More to come! (one new every two weeks)
  • "Large" genomes (10^7-10^10 nt): eukaryotes
      • >30 finished (S. cerevisiae, S. pombe,
        E. cuniculi, G. theta, C. elegans,
        D. melanogaster, A. gambiae, P. falciparum,
        P. yoelii, D. rerio, F. rubripes, A. thaliana,
        O. sativa (2x), M. musculus, Homo sapiens,
        P. troglodytes, R. norvegicus, C. familiaris,
        G. gallus)
      • Many more to come: cat, elephant, pig, cow, maize
        (and other plants), insects, fishes, many
        pathogenic parasites (Leishmania)
  • EST sequencing
      • Partial mRNA sequences: 20 x 10^6 sequences in
        the public domain

25
Human genome
  • Size: 3 x 10^9 nt for a haploid genome
  • Highly repetitive sequences: 25%; moderately
    repetitive sequences: 25-30%
  • Size of a gene: from 900 to >2000000 bases
    (introns included)
  • Proportion of the genome coding for proteins:
    5-7%
  • Number of chromosomes: 22 autosomes and 1 sex
    chromosome (haploid set)
  • Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases

26
How to sequence the human genome?
  • Consortium "international" approach
      • Generate genetic maps (meiotic recombination) and
        pseudogenetic maps (chromosome hybrids) for
        indicator sequences
      • Generate a physical map based on large clones
        (BAC or PAC)
      • Sequence enough large clones to cover the genome
  • "Commercial" approach (Celera)
      • Generate random libraries of fixed-length genomic
        clones (2 kb and 10 kb)
      • Sequence both ends of enough clones to obtain
        10x coverage
      • Use computer techniques to reconstitute the
        chromosomal sequences, check with the public
        project physical map
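
The 10x coverage target translates into a read count once a read length is fixed. The following is a minimal sketch, assuming reads of 500 nt (a value not given in the slides), reads falling uniformly at random on the genome, and "10x" meaning sequence coverage. The uncovered-fraction estimate exp(-c) is the usual Lander-Waterman idealisation; it ignores repeats, which matter a great deal for a real assembly.

```python
import math


def reads_needed(genome_nt, coverage, read_nt):
    """Reads required so that reads * read_nt >= coverage * genome_nt."""
    return math.ceil(coverage * genome_nt / read_nt)


def expected_uncovered_fraction(coverage):
    """Idealised probability that a given base is hit by no read."""
    return math.exp(-coverage)


GENOME_NT = 3e9  # haploid human genome size (from the slides)
COVERAGE = 10    # 10x coverage (from the slides)
READ_NT = 500    # ASSUMPTION: read length, not stated in the slides

n_reads = reads_needed(GENOME_NT, COVERAGE, READ_NT)
n_clones = n_reads // 2  # both ends of each clone are sequenced

print(f"reads needed:  {n_reads:,}")
print(f"clones needed: {n_clones:,}")
print(f"expected uncovered fraction: {expected_uncovered_fraction(COVERAGE):.2e}")
```

Under these assumptions the project needs tens of millions of reads, which is why reconstructing the chromosomal sequences is a computational problem rather than a manual one.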

27
Interpretation of the human draft
  • All chromosomes considered as finished
  • Even a genomic sequence does not tell you where
    the genes are encoded. The genome is far from
    being "decoded"
  • One must combine genome and transcriptome to have
    a better idea

Last freeze: NCBI34, July 2003
28
The transcriptome
  • The set of all functional RNAs (tRNA, rRNA, mRNA,
    etc.) that can potentially be transcribed from
    the genome
  • The documentation of the localization (cell type)
    and conditions under which these RNAs are
    expressed
  • The documentation of the biological function(s)
    of each RNA species

29
Public draft transcriptome
  • Information about the expression specificity and
    the function of mRNAs
  • "Full" cDNA sequences of known function
  • "Full" cDNA sequences (HTC), but "anonymous"
    (e.g. KIAA or DKFZ collections)
  • EST sequences
  • cDNA libraries derived from many different
    tissues
  • Rapid random sequencing of the ends of all clones
  • ORESTES sequences
  • Growing set of expression data (microarrays, SAGE,
    etc.)
  • Increasing evidence for multiple alternative
    splicing and polyadenylation

30
Example: mapping of ESTs and mRNAs
[Figure: tracks labelled mRNAs, ESTs and Computer prediction]
31
The proteome
  • Set of proteins present in a particular cell type
    under particular conditions
  • Set of proteins potentially expressed from the
    genome
  • Information about the specific expression and
    function of the proteins

32
Information on the proteome
  • Separation of a complex mixture of proteins
      • 2D PAGE (IEF + SDS-PAGE)
      • Capillary chromatography
  • Individual characterisation of proteins
      • Tryptic peptide signature (MS)
      • Sequencing by chemistry or MS/MS
  • All post-translational modifications (PTMs)!
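
The tryptic peptide signature rests on trypsin cutting a protein at predictable positions. A minimal sketch of an in-silico digest follows, assuming the commonly used simplified rule (cleave after K or R, but not when the next residue is P) and no missed cleavages. The function name `tryptic_digest` and the test sequence are made up for illustration, and peptide masses are not computed here.

```python
def tryptic_digest(protein):
    """Split a protein sequence after K or R, except when followed by P.

    Simplified rule: no missed cleavages, no other exceptions.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # remaining C-terminal peptide
    return peptides


# Made-up sequence, for illustration only.
for peptide in tryptic_digest("MKWVTFRPAGKLLR"):
    print(peptide)
```

In a mass-spectrometry workflow, the masses of such predicted peptides would then be compared with the measured ones; that step needs a residue mass table and is left out of this sketch.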

33
Three-dimensional structures
  • Methods to determine structures
      • X-ray crystallography
      • NMR
  • Data format
      • Atom coordinates (except H) in a Cartesian space
  • Databases
      • For proteins and nucleic acids (RCSB PDB)
      • Independent databases for sugars and small
        organic molecules
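
To make the data format concrete, here is a minimal sketch of reading Cartesian coordinates from one ATOM record. It assumes the fixed-column PDB flat-file layout (x, y, z in columns 31-38, 39-46 and 47-54, in ångströms). The helper names `make_atom` and `parse_atom` are hypothetical, and the coordinate values are invented for the example.

```python
def make_atom(serial, name, residue, chain, resseq, x, y, z):
    """Build the first 54 columns of an ATOM record (illustrative values)."""
    return (f"ATOM  {serial:>5} {name:^4} {residue:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom(line):
    """Extract atom name, residue, chain and x, y, z from an ATOM record."""
    return {
        "name": line[12:16].strip(),     # columns 13-16
        "residue": line[17:20].strip(),  # columns 18-20
        "chain": line[21],               # column 22
        "xyz": (float(line[30:38]),      # columns 31-38
                float(line[38:46]),      # columns 39-46
                float(line[46:54])),     # columns 47-54
    }


# Invented coordinates, for illustration only.
record = make_atom(1, "CA", "MET", "A", 1, 27.340, 24.430, 2.614)
atom = parse_atom(record)
print(record)
print(atom["name"], atom["residue"], atom["chain"], atom["xyz"])
```

Slicing by column rather than splitting on whitespace is a deliberate choice: in this layout adjacent fields may touch when values are wide, so whitespace splitting can merge them.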

34
Visualisation of the structures
  • Secondary structure elements
      • Alpha helices, beta sheets, other
  • Software
      • Various representations (atoms, bonds,
        secondary structure)
      • Wide choice of commercial and free software
        (e.g., DeepView)

35
Sequence information, and so what?
  • How to store and organise?
      • Databases (next lecture)
  • How to access, search, compare?
      • Pairwise alignments, dot plots (Tuesday)
      • BLAST searches in db (Tuesday)
      • Patterns, PSI-BLAST, Profiles and HMMs
        (Wednesday)
      • Gene prediction (Wednesday)
      • EST clustering (Thursday)
      • Multiple Alignments (Thursday)
      • Protein function prediction (Friday)
      • Users' problems (Friday)

36
Thank you