Title: Introduction to Bioinformatics
1Introduction to Bioinformatics
2SIB and EMBnet Bioinformatics resources for
biomedical scientists
3The Swiss Institute of Bioinformatics
- Founded in March 1998
- Collaborative structure: Lausanne - Geneva - Basel
- Groups at ISREC, Ludwig Institute, Unil, HUG,
UniGe, recently UniBas and soon EPFL.
- Several roles: teaching, services, research
- Currently 130 employees
4Projects at SIB
- Databases
- SWISS-PROT, PROSITE, EPD, World-2DPAGE,
SWISS-MODEL
- TrEST, TrGEN (predicted proteins), tromer
(transcriptome)
- Software
- Melanie, DeepView, proteomic tools, ESTScan,
pftools, Java applets
- Services
- Web servers: ExPASy, EMBnet
- Teaching and helpdesk
- Research
- Mostly sequence and expression analysis, 3D
structure, and proteomics
6Teaching
- DEA (master's degree) in Bioinformatics: 1 year
full time, first diploma common to UniGe and
Unil.
- EMBnet courses: 2 x 1 week per year in Lausanne,
to be extended to Basel
- Undergraduate courses at Geneva, Fribourg and Lausanne
Universities
- Other courses at CHUV and EPFL
- Courses in other countries: Colombia, Cambodia,
Peru, ...
7Research
- New algorithms (faster alignments)
- New technology (GRID or cluster computing)
- New tools (protein analysis, microarrays,
confocal microscopy)
- New databases (microarrays, transcriptome,
proteome)
- Collaborations with lab researchers!
8Three levels of services
- Simple web access to software and databases
- Easy to use for basic occasional research with
few sequences
- Potentially insecure
- Command-line access with a local Unix account
- More powerful (automation) and secure
- Requires understanding of the Unix system and frequent
practice
- Collaboration with SIB
- Access to experts in the field (help desk)
- For projects requiring extensive programming or
special hardware resources
9SIB's important sites
- Home
- www.isb-sib.ch
- ExPASy - Expert Protein Analysis System
- www.expasy.org
- Hits database and tools
- hits.isb-sib.ch
- EMBnet Switzerland
- www.ch.embnet.org
- Geneva Bioinformatics
- www.genebio.ch
10SIB home
11Expert Protein Analysis System
12Swiss node http://www.ch.embnet.org
13EMBnet organisation
- European at its founding in 1988, now worldwide
- 32 country nodes, 8 special nodes.
- Role
- Training, education (EMBER)
- Software development (EMBOSS, SRS)
- Computing resources (databases, websites,
services)
- Helpdesk and technical support
- Publications (EMBnet.news, Briefings in
Bioinformatics)
- Access: www.embnet.org
- Each node is at www.xx.embnet.org, where xx is
the country code (e.g., ch for Switzerland)
14EMBnet home
15European Molecular Biology Open Software Suite
- Free Open Source (for most Unix platforms)
- GCG successor (compatible with GCG file format)
- More than 200 programs
- Easy to install locally
- but no interface, requires local databases
- Unix command-line only
- Interfaces
- Jemboss, www2gcg, w2h, wemboss (with account)
- Pise, EMBOSS-GUI (no account)
- Access: www.emboss.org
16Other important sites
- ExPASy - Expert Protein Analysis System
- www.expasy.org
- EBI - European Bioinformatics Institute
- www.ebi.ac.uk
- NCBI - National Center for Biotechnology
Information
- www.ncbi.nlm.nih.gov
- Sanger - The Sanger Institute
- www.sanger.ac.uk
17Bioinformatics definition
- Every application of computer science to biology
- Sequence analysis, image analysis, sample
management, population modelling, ...
- Analysis of data coming from large-scale
biological projects
- Genomes, transcriptomes, proteomes, metabolomes,
etc.
18The new biology
- Traditional biology
- Small team working on a specialized topic
- Well-defined experiment to answer precise
questions
- New "high-throughput" biology
- Large international teams using cutting-edge
technology defining the project
- Results are given raw to the scientific community
without any underlying hypothesis
19Examples of "high-throughput"
- Complete genome sequencing
- Large-scale sampling of the transcriptome (EST)
- Simultaneous expression analysis of thousands of
genes (DNA microarrays, SAGE)
- Large-scale sampling of the proteome
- Protein-protein interaction analysis: large-scale 2-hybrid
(yeast, worm)
- Large-scale 3D structure production (yeast)
- Metabolism modelling
- Simulations
- Biodiversity
20Role of bioinformatics
- Control and management of the data
- Analysis of primary data, e.g.
- Base calling from chromatograms
- Mass spectra analysis
- DNA microarray image analysis
- Statistics
- Database storage and access
- Results analysis in a biological context
21First information: a sequence?
- Nucleotide
- RNA (or cDNA)
- Genomic (intron-exon)
- Complete or incomplete?
- mRNA with 5' and 3' UTRs
- Entire chromosome
- Protein
- Pre/Pro or functional protein?
- Function prediction
- Post-translational modifications?
- Holy Grail: 3D structure?
22Genomes in numbers
- Sizes
- virus: 10^3 to 10^5 nt
- bacteria: 10^5 to 10^7 nt
- yeast: 1.35 x 10^7 nt
- mammals: 10^8 to 10^10 nt
- plants: 10^10 to 10^11 nt
- Gene number
- virus: 3 to 100
- bacteria: 1000
- yeast: 7000
- mammals: 30000
- plants: 30000-50000?
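As a rough worked example of these figures, the average amount of genome per gene is simply genome size divided by gene number. The sketch below uses the yeast values from this slide (1.35 x 10^7 nt, about 7000 genes) and, as an assumption, a 3 x 10^9 nt mammalian genome paired with the 30000 genes quoted for mammals. The function name is illustrative only.

```python
def nt_per_gene(genome_size_nt: float, gene_count: int) -> float:
    """Average number of nucleotides per gene (genome size / gene number)."""
    if gene_count <= 0:
        raise ValueError("gene_count must be positive")
    return genome_size_nt / gene_count


# Yeast: values taken from the slide.
yeast = nt_per_gene(1.35e7, 7000)
# Mammal: 3 x 10^9 nt is an assumed size inside the slide's 10^8-10^10 range.
mammal = nt_per_gene(3e9, 30000)

print(f"yeast:  about {yeast:,.0f} nt per gene")
print(f"mammal: about {mammal:,.0f} nt per gene")
```

Under these assumptions a mammalian genome carries roughly fifty times more sequence per gene than yeast, which is mostly introns and intergenic sequence rather than longer coding regions.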
23Sequencing projects
- "Small" genomes (<10^7 nt): bacteria, viruses
- Many already sequenced (industry excluded)
- More than 100 microbial genomes already in the
public domain
- More to come! (one new every two weeks)
- "Large" genomes (10^7-10^10 nt): eukaryotes
- 15 finished (S. cerevisiae, S. pombe, E. cuniculi,
G. theta, C. elegans, D. melanogaster, A. gambiae,
P. falciparum, P. yoelii, D. rerio, F. rubripes,
A. thaliana, O. sativa (2x), M. musculus, Homo
sapiens)
- Many more to come: rat, pig, cow, maize (and
other plants), insects, fishes, many pathogenic
parasites (Leishmania)
- EST sequencing
- Partial mRNA sequences: 15 x 10^6 sequences in the
public domain
24Human genome
- Size: 3 x 10^9 nt for a haploid genome
- Highly repetitive sequences: 25%; moderately
repetitive sequences: 25-30%
- Size of a gene: from 900 to >2,000,000 bases
(introns included)
- Proportion of the genome coding for proteins:
5-7%
- Number of chromosomes: 22 autosomes, 1 sex
chromosome
- Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases
25How to sequence the human genome?
- Consortium "international" approach
- Generate genetic maps (meiotic recombination) and
pseudogenetic maps (chromosome hybrids) for
indicator sequences
- Generate a physical map based on large clones
(BAC or PAC)
- Sequence enough large clones to cover the genome
- "Commercial" approach (Celera)
- Generate random libraries of fixed-length genomic
clones (2 kb and 10 kb)
- Sequence both ends of enough clones to obtain
10x coverage
- Use computational techniques to reconstitute the
chromosomal sequences; check against the public
project physical map
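To give a feel for what 10x coverage means in the shotgun approach, here is a minimal sketch of the arithmetic. Coverage is taken as (number of reads x read length) / genome size. The 3 x 10^9 nt genome size is the haploid human figure; the 500 nt read length is an assumption for illustration, not a value from these slides. The uncovered-fraction estimate uses the standard Lander-Waterman (Poisson) approximation, which ignores repeats and cloning bias.

```python
import math


def reads_for_coverage(genome_size: float, read_length: float, coverage: float) -> int:
    """Number of reads N such that N * read_length / genome_size >= coverage."""
    return math.ceil(coverage * genome_size / read_length)


def expected_uncovered_fraction(coverage: float) -> float:
    """Poisson (Lander-Waterman) estimate of the fraction of bases hit by no read."""
    return math.exp(-coverage)


GENOME_SIZE = 3e9   # haploid human genome, nt
READ_LENGTH = 500   # assumed read length, nt (hypothetical value)
COVERAGE = 10       # target fold coverage

reads = reads_for_coverage(GENOME_SIZE, READ_LENGTH, COVERAGE)
clones = math.ceil(reads / 2)  # both ends of each clone are sequenced

print(f"reads needed:  {reads:,}")
print(f"clones needed: {clones:,}")
print(f"expected uncovered fraction: {expected_uncovered_fraction(COVERAGE):.2e}")
```

Even at 10x, the idealised model leaves a small fraction of bases unsequenced, which is one reason the assembly is checked against an independent physical map.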
26Sequencing progression
27Interpretation of the human draft
- Still many gaps and unordered small pieces
(except for chr 6, 7, 13, 14, 20, 21, 22, Y)
- Even a genomic sequence does not tell you where
the genes are encoded. The genome is far from
being "decoded"
- One must combine genome and transcriptome to get
a better idea
Last freeze: NCBI30, June 24, 2002
28The transcriptome
- The set of all functional RNAs (tRNA, rRNA, mRNA,
etc.) that can potentially be transcribed from
the genome
- The documentation of the localization (cell type)
and conditions under which these RNAs are
expressed
- The documentation of the biological function(s)
of each RNA species
29Public draft transcriptome
- Information about the expression specificity and
the function of mRNAs
- "Full" cDNA sequences of known function
- "Full" cDNA sequences, but "anonymous" (e.g.
KIAA or DKFZ collections)
- EST sequences
- cDNA libraries derived from many different
tissues
- Rapid random sequencing of the ends of all clones
- ORESTES sequences
- Growing set of expression data (microarrays, SAGE,
etc.)
- Increasing evidence for multiple alternative
splicing and polyadenylation
30Example: mapping of ESTs and mRNAs
[Figure: genome tracks for mRNAs, ESTs, and computer prediction]
31The proteome
- Set of proteins present in a particular cell type
under particular conditions
- Set of proteins potentially expressed from the
genome
- Information about the specific expression and
function of the proteins
32Information on the proteome
- Separation of a complex mixture of proteins
- 2D PAGE (IEF + SDS-PAGE)
- Capillary chromatography
- Individual characterisation of proteins
- Tryptic peptide signature (MS)
- Sequencing by chemistry or MS/MS
- All post-translational modifications (PTMs)!
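The tryptic peptide signature can be illustrated with a small in-silico digest. The sketch below applies the commonly used trypsin rule (cleave after K or R, but not when the next residue is P) and ignores missed cleavages and modifications. The function name and the toy sequence are made up for illustration; real tools would also compute peptide masses to compare against the MS spectrum.

```python
import re
from typing import List


def tryptic_digest(protein: str) -> List[str]:
    """Split a protein sequence after K or R, except when followed by P."""
    protein = protein.upper()
    # Zero-width split point: preceded by K/R and not followed by P.
    # Splitting on empty matches requires Python 3.7 or later.
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in pieces if p]


# Toy sequence (not a real protein): the R before P is not cleaved.
peptides = tryptic_digest("MKTAYRPLKGGR")
print(peptides)  # ['MK', 'TAYRPLK', 'GGR']
```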
33Three-dimensional structures
- Methods to determine structures
- X-ray crystallography
- NMR
- Data format
- Atom coordinates (except H) in Cartesian space
- Databases
- For proteins and nucleic acids (RCSB, formerly PDB)
- Independent databases for sugars and small
organic molecules
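As an illustration of the coordinate data format, the sketch below reads the Cartesian x, y, z values from one ATOM record of a PDB-format file, which stores each field in fixed columns (x in columns 31-38, y in 39-46, z in 47-54). The helper that builds the sample line and the coordinate values themselves are made up for this example; a real file should be read with an established parser.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    name: str
    residue: str
    chain: str
    residue_number: int
    x: float
    y: float
    z: float


def format_atom_line(serial, name, residue, chain, residue_number, x, y, z):
    """Build a minimal ATOM record (hypothetical helper, sample data only)."""
    return (f"{'ATOM':<6}{serial:>5} {name:<4} {residue:>3} {chain:1}"
            f"{residue_number:>4}    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom_line(line: str) -> Atom:
    """Extract atom identity and coordinates from fixed-width columns."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM/HETATM record")
    return Atom(
        name=line[12:16].strip(),         # columns 13-16
        residue=line[17:20].strip(),      # columns 18-20
        chain=line[21],                   # column 22
        residue_number=int(line[22:26]),  # columns 23-26
        x=float(line[30:38]),             # columns 31-38
        y=float(line[38:46]),             # columns 39-46
        z=float(line[46:54]),             # columns 47-54
    )


line = format_atom_line(1, " N", "MET", "A", 1, 27.340, 24.430, 2.614)
atom = parse_atom_line(line)
print(line)
print(atom)
```

Slicing by column rather than splitting on whitespace matters here, because adjacent fields in this format can touch when values are wide.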
34Visualisation of the structures
- Secondary structure elements
- Alpha helices, beta sheets, other
- Software
- Various representations (atoms, bonds,
secondary structure)
- Wide choice of commercial and free software (e.g.,
DeepView)
35Sequence information, and so what?
- How to store and organise?
- Databases (next lecture)
- How to access, search, compare?
- Pairwise alignments, dot plots (Tuesday)
- BLAST searches in db (Tuesday)
- EST clustering (Wednesday)
- Multiple Alignments (Wednesday)
- Patterns, PSI-BLAST, Profiles and HMMs (Thursday)
- Gene prediction (Thursday)
- Protein function prediction (Friday)
- User problems (Friday)
36Thank you