Introduction to Bioinformatics - PowerPoint PPT Presentation

Slides: 37
Provided by: nat1153

Transcript and Presenter's Notes

1
Introduction to Bioinformatics
2
SIB and EMBnet Bioinformatics resources for
biomedical scientists
3
The Swiss Institute of Bioinformatics
  • Founded in March 1998
  • Collaborative structure: Lausanne - Geneva - Basel
  • Groups at ISREC, Ludwig Institute, Unil, HUG,
    UniGe, recently UniBas and soon EPFL
  • Several roles: teaching, services, research
  • Currently 130 employees

4
Projects at SIB
  • Databases
    • SWISS-PROT, PROSITE, EPD, World-2DPAGE,
      SWISS-MODEL
    • TrEST, TrGEN (predicted proteins), tromer
      (transcriptome)
  • Software
    • Melanie, DeepView, proteomics tools, ESTScan,
      pftools, Java applets
  • Services
    • Web servers: ExPASy, EMBnet
    • Teaching and helpdesk
  • Research
    • Mostly sequence and expression analysis, 3D
      structure, and proteomics

6
Teaching
  • DEA (master's degree) in Bioinformatics: 1 year
    full time, first diploma common to UniGe and
    Unil
  • EMBnet courses: 2 x 1 week per year in Lausanne,
    to be extended to Basel
  • Undergraduate courses at the Universities of
    Geneva, Fribourg and Lausanne
  • Other courses at CHUV and EPFL
  • Courses in other countries, including Colombia,
    Cambodia and Peru

7
Research
  • New algorithms (faster alignments)
  • New technology (GRID or cluster computing)
  • New tools (protein analysis, microarrays,
    confocal microscopy)
  • New databases (microarrays, transcriptome,
    proteome)
  • Collaborations with lab researchers!

8
Three levels of services
  • Simple web access to software and databases
    • Easy to use for basic occasional research with
      few sequences
    • Potentially insecure
  • Command-line access with a local Unix account
    • More powerful (automation) and secure
    • Requires understanding the Unix system and
      frequent practice
  • Collaboration with SIB
    • Access to experts in the field (help desk)
    • For projects requiring heavy programming or
      special hardware resources

9
SIB's important sites
  • Home
    • www.isb-sib.ch
  • ExPASy - Expert Protein Analysis System
    • www.expasy.org
  • Hits database and tools
    • hits.isb-sib.ch
  • EMBnet Switzerland
    • www.ch.embnet.org
  • Geneva Bioinformatics
    • www.genebio.ch

10
SIB home
11
Expert Protein Analysis System
12
Swiss node: http://www.ch.embnet.org
13
EMBnet organisation
  • Founded in Europe in 1988, now spread worldwide
  • 32 country nodes, 8 special nodes
  • Role
    • Training, education (EMBER)
    • Software development (EMBOSS, SRS)
    • Computing resources (databases, websites,
      services)
    • Helpdesk and technical support
    • Publications (EMBnet.news, Briefings in
      Bioinformatics)
  • Access: www.embnet.org
    • Each node is at www.xx.embnet.org, where xx is
      the country code (e.g., ch for Switzerland)
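
The node address pattern above can be sketched in a few lines of Python. The function name `embnet_node_url` is hypothetical, and the check for a two-letter code is an assumption based only on the "xx" placeholder on the slide; not every country code necessarily has a node.

```python
def embnet_node_url(country_code: str) -> str:
    """Build a node address following the www.xx.embnet.org pattern."""
    code = country_code.strip().lower()
    # Assumption: "xx" stands for a two-letter ASCII country code.
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError("expected a two-letter country code")
    return f"www.{code}.embnet.org"


print(embnet_node_url("ch"))  # www.ch.embnet.org
```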

14
EMBnet home
15
European Molecular Biology Open Software Suite
  • Free, open source (for most Unix platforms)
  • GCG successor (compatible with GCG file format)
  • More than 200 programs
  • Easy to install locally
    • but no interface, requires local databases
    • Unix command-line only
  • Interfaces
    • Jemboss, www2gcg, w2h, wemboss (with account)
    • Pise, EMBOSS-GUI (no account)
  • Access: www.emboss.org

16
Other important sites
  • ExPASy - Expert Protein Analysis System
    • www.expasy.org
  • EBI - European Bioinformatics Institute
    • www.ebi.ac.uk
  • NCBI - National Center for Biotechnology
    Information
    • www.ncbi.nlm.nih.gov
  • Sanger - The Sanger Institute
    • www.sanger.ac.uk

17
Bioinformatics definition
  • Every application of computer science to biology
    • Sequence analysis, image analysis, sample
      management, population modelling, etc.
  • Analysis of data coming from large-scale
    biological projects
    • Genomes, transcriptomes, proteomes, metabolomes,
      etc.

18
The new biology
  • Traditional biology
    • Small team working on a specialized topic
    • Well-defined experiment to answer precise
      questions
  • New "high-throughput" biology
    • Large international teams using cutting-edge
      technology that defines the project
    • Results are released raw to the scientific
      community, without any underlying hypothesis

19
Examples of "high-throughput"
  • Complete genome sequencing
  • Large-scale sampling of the transcriptome (EST)
  • Simultaneous expression analysis of thousands of
    genes (DNA microarrays, SAGE)
  • Large-scale sampling of the proteome
  • Protein-protein interaction analysis: large-scale
    two-hybrid (yeast, worm)
  • Large-scale 3D structure production (yeast)
  • Metabolism modelling
  • Simulations
  • Biodiversity

20
Role of bioinformatics
  • Control and management of the data
  • Analysis of primary data, e.g.
    • Base calling from chromatograms
    • Mass spectra analysis
    • DNA microarray image analysis
  • Statistics
  • Database storage and access
  • Results analysis in a biological context

21
First information: a sequence?
  • Nucleotide
    • RNA (or cDNA)
    • Genomic (intron-exon)
    • Complete or incomplete?
    • mRNA with 5' and 3' UTRs
    • Entire chromosome
  • Protein
    • Pre/pro or functional protein?
    • Function prediction
    • Post-translational modifications?
    • Holy Grail: 3D structure?

22
Genomes in numbers
  • Sizes
    • viruses: 10^3 to 10^5 nt
    • bacteria: 10^5 to 10^7 nt
    • yeast: 1.35 x 10^7 nt
    • mammals: 10^8 to 10^10 nt
    • plants: 10^10 to 10^11 nt
  • Gene number
    • viruses: 3 to 100
    • bacteria: 1000
    • yeast: 7000
    • mammals: 30000
    • plants: 30000-50000?
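
As a rough worked example, the sizes and gene numbers above give an average amount of genome per gene. The sketch below uses the yeast figures from this slide and, for a mammal, the 3 x 10^9 nt haploid human genome size quoted on the "Human genome" slide; the helper name `nt_per_gene` is hypothetical, and the result is only an order-of-magnitude average, not a measured gene length.

```python
# Genome size (nt) and approximate gene number, taken from the slides.
genomes = {
    "yeast": (1.35e7, 7000),
    "human": (3e9, 30000),  # size from the "Human genome" slide
}


def nt_per_gene(genome_size: float, gene_count: int) -> float:
    """Average number of nucleotides of genome per gene."""
    return genome_size / gene_count


for name, (size, genes) in genomes.items():
    print(f"{name}: about {nt_per_gene(size, genes):,.0f} nt per gene")
# yeast: about 1,929 nt per gene
# human: about 100,000 nt per gene
```

The roughly 50-fold difference reflects how much more non-coding sequence a mammalian genome carries per gene.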

23
Sequencing projects
  • "Small" genomes (<10^7): bacteria, viruses
    • Many already sequenced (industry excluded)
    • More than 100 microbial genomes already in the
      public domain
    • More to come! (one new every two weeks)
  • "Large" genomes (10^7-10^10): eukaryotes
    • 15 finished (S. cerevisiae, S. pombe, E. cuniculi,
      G. theta, C. elegans, D. melanogaster, A. gambiae,
      P. falciparum, P. yoelii, D. rerio, F. rubripes,
      A. thaliana, O. sativa (2x), M. musculus, Homo
      sapiens)
    • Many more to come: rat, pig, cow, maize (and
      other plants), insects, fishes, many pathogenic
      parasites (Leishmania)
  • EST sequencing
    • Partial mRNA sequences: 15 x 10^6 sequences in
      the public domain

24
Human genome
  • Size: 3 x 10^9 nt for a haploid genome
  • Highly repetitive sequences: 25%; moderately
    repetitive sequences: 25-30%
  • Size of a gene: from 900 to >2,000,000 bases
    (introns included)
  • Proportion of the genome coding for proteins:
    5-7%
  • Number of chromosomes: 22 autosomes and 1 sex
    chromosome (haploid set)
  • Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases

25
How to sequence the human genome?
  • "International consortium" approach
    • Generate genetic maps (meiotic recombination) and
      pseudogenetic maps (chromosome hybrids) for
      indicator sequences
    • Generate a physical map based on large clones
      (BAC or PAC)
    • Sequence enough large clones to cover the genome
  • "Commercial" approach (Celera)
    • Generate random libraries of fixed-length genomic
      clones (2 kb and 10 kb)
    • Sequence both ends of enough clones to obtain
      10x coverage
    • Use computer techniques to reconstitute the
      chromosomal sequences; check against the public
      project's physical map
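
To give a feel for what "10x coverage" means in the shotgun approach, here is a minimal back-of-the-envelope sketch. The 500 nt read length is an assumption (the slide does not give one), and the uncovered-fraction estimate uses the idealised Poisson model of random sequencing (Lander-Waterman), which is not mentioned on the slide and ignores repeats and cloning bias.

```python
import math


def reads_for_coverage(genome_size: float, coverage: float, read_length: float) -> int:
    """Number of reads so that total sequenced bases = coverage x genome size."""
    return math.ceil(coverage * genome_size / read_length)


def uncovered_fraction(coverage: float) -> float:
    """Expected fraction of bases hit by no read, assuming uniformly random reads."""
    return math.exp(-coverage)


GENOME_SIZE = 3e9   # haploid human genome, from the "Human genome" slide
READ_LENGTH = 500   # assumption: typical read length, not stated in the slides

reads = reads_for_coverage(GENOME_SIZE, 10, READ_LENGTH)
clones = reads // 2  # both ends of each clone are sequenced
print(f"reads needed: {reads:,}")    # reads needed: 60,000,000
print(f"clones needed: {clones:,}")  # clones needed: 30,000,000
print(f"expected uncovered fraction: {uncovered_fraction(10):.2e}")
```

Even under this optimistic model a small fraction of the genome stays unsequenced, which is one reason the assembly is checked against a physical map.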

26
Sequencing progression
27
Interpretation of the human draft
  • Still many gaps and unordered small pieces
    (except for chr 6, 7, 13, 14, 20, 21, 22, Y)
  • Even a genomic sequence does not tell you where
    the genes are encoded. The genome is far from
    being "decoded"
  • One must combine genome and transcriptome to have
    a better idea

Last freeze: NCBI30, June 24, 2002
28
The transcriptome
  • The set of all functional RNAs (tRNA, rRNA, mRNA
    etc) that can potentially be transcribed from
    the genome
  • The documentation of the localization (cell type)
    and conditions under which these RNAs are
    expressed
  • The documentation of the biological function(s)
    of each RNA species

29
Public draft transcriptome
  • Information about the expression specificity and
    the function of mRNAs
    • "Full" cDNA sequences of known function
    • "Full" cDNA sequences, but "anonymous" (e.g.
      KIAA or DKFZ collections)
  • EST sequences
    • cDNA libraries derived from many different
      tissues
    • Rapid random sequencing of the ends of all clones
  • ORESTES sequences
  • Growing set of expression data (microarrays, SAGE,
    etc.)
  • Increasing evidence for multiple alternative
    splicing and polyadenylation

30
Example: mapping of ESTs and mRNAs
[Figure tracks: mRNAs, ESTs, computer prediction]
31
The proteome
  • Set of proteins present in a particular cell type
    under particular conditions
  • Set of proteins potentially expressed from the
    genome
  • Information about the specific expression and
    function of the proteins

32
Information on the proteome
  • Separation of a complex mixture of proteins
    • 2D PAGE (IEF + SDS-PAGE)
    • Capillary chromatography
  • Individual characterisation of proteins
    • Tryptic peptide signature (MS)
    • Sequencing by chemistry or MS/MS
    • All post-translational modifications (PTMs)!
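
The "tryptic peptide signature" rests on the fact that trypsin cuts a protein at predictable positions. A minimal sketch of an in-silico digest follows, using the common textbook rule (cleave after K or R, but not when the next residue is P). The function name and the example sequence are hypothetical, and real tools also handle missed cleavages and peptide masses, which are left out here.

```python
def tryptic_digest(protein: str) -> list[str]:
    """Split a protein sequence after K or R, except when followed by P."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Trailing peptide that does not end in K or R.
        peptides.append(protein[start:])
    return peptides


# Hypothetical sequence, for illustration only.
print(tryptic_digest("MKRPAAKGR"))  # ['MK', 'RPAAK', 'GR']
```

The set of masses of these peptides, measured by MS, is what gets compared against the predicted digests of database proteins.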

33
Three-dimensional structures
  • Methods to determine structures
    • X-ray crystallography
    • NMR
  • Data format
    • Atom coordinates (except H) in Cartesian space
  • Databases
    • For proteins and nucleic acids (RCSB, was PDB)
    • Independent databases for sugars and small
      organic molecules
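
To make the "atom coordinates in Cartesian space" format concrete, the sketch below reads two ATOM records in the fixed-column PDB text layout and computes the distance between the atoms. The column positions follow the PDB file format as I understand it (x, y, z in columns 31-38, 39-46, 47-54); the two records are illustrative, truncated after the coordinates, and `parse_atom` is a hypothetical helper, not part of any library.

```python
import math

# Two illustrative ATOM records (truncated after the z coordinate).
PDB_LINES = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842",
]


def parse_atom(line: str) -> dict:
    """Extract a few fields from one fixed-width ATOM record."""
    return {
        "name": line[12:16].strip(),     # atom name, columns 13-16
        "residue": line[17:20].strip(),  # residue name, columns 18-20
        "chain": line[21],               # chain identifier, column 22
        "xyz": (
            float(line[30:38]),          # x in angstroms, columns 31-38
            float(line[38:46]),          # y, columns 39-46
            float(line[46:54]),          # z, columns 47-54
        ),
    }


atoms = [parse_atom(line) for line in PDB_LINES if line.startswith("ATOM")]
distance = math.dist(atoms[0]["xyz"], atoms[1]["xyz"])
print(f"{atoms[0]['name']}-{atoms[1]['name']} distance: {distance:.2f} angstroms")
```

Slicing by column rather than splitting on whitespace matters because neighbouring fields in real files can touch without a separating space.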

34
Visualisation of the structures
  • Secondary structure elements
    • Alpha helices, beta sheets, other
  • Software
    • Various representations (atoms, bonds,
      secondary structure)
    • Wide choice of commercial and free software
      (e.g., DeepView)

35
Sequence information, and so what?
  • How to store and organise?
    • Databases (next lecture)
  • How to access, search, compare?
    • Pairwise alignments, dot plots (Tuesday)
    • BLAST searches in db (Tuesday)
    • EST clustering (Wednesday)
    • Multiple alignments (Wednesday)
    • Patterns, PSI-BLAST, profiles and HMMs (Thursday)
    • Gene prediction (Thursday)
    • Protein function prediction (Friday)
    • User problems (Friday)
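
As a small preview of the dot plots listed for Tuesday, here is a minimal sketch of the simplest variant: mark every position where the residue of one sequence equals the residue of the other. The function name and the two short sequences are hypothetical, and real dot-plot programs usually add a sliding window and a score threshold to reduce noise, which this sketch omits.

```python
def dot_plot(seq_a: str, seq_b: str) -> str:
    """Return a text dot plot: rows follow seq_a, columns follow seq_b."""
    lines = ["  " + seq_b]  # header row, offset to align with the matrix
    for a in seq_a:
        row = "".join("*" if a == b else "." for b in seq_b)
        lines.append(a + " " + row)
    return "\n".join(lines)


# Two hypothetical sequences differing at one position.
print(dot_plot("GATTACA", "GATCACA"))
```

Similar regions show up as runs of `*` along a diagonal; a break or shift in the diagonal points to a substitution or an insertion/deletion.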

36
Thank you