Title: Introduction to Bioinformatics
1Introduction to Bioinformatics
2SIB and EMBnet Bioinformatics resources for
biomedical scientists
3The Swiss Institute of Bioinformatics
- Founded in March 1998
- Collaborative structure: Lausanne, Geneva, Basel
- Groups at ISREC, Ludwig Institute, Unil, HUG, UniGe, recently UniBas and soon EPFL
- Several roles: teaching, services, research
- Currently 160 employees
4Projects at SIB
- Databases
- SWISS-PROT, PROSITE, EPD, World-2DPAGE, SWISS-MODEL
- TrEST, TrGEN (predicted proteins), tromer (transcriptome)
- Software
- Melanie, DeepView, proteomic tools, ESTScan, pftools, Java applets
- Services
- Web servers ExPASy, EMBnet, MyHits
- Teaching and helpdesk
- Research
- Mostly sequence and expression analysis, 3D structure, and proteomics
6Teaching
- Master's degrees in Bioinformatics (Bologna type), 90 ECTS credits, at UniGe, Unil and UniBas
- EMBnet courses: 4 x 1 week per year in Lausanne, Basel and Zürich
- Undergraduate courses at the Universities of Geneva, Fribourg and Lausanne
- Other courses at CHUV and EPFL
- Courses in other countries: Colombia, Cambodia, Peru, etc.
7Research
- New algorithms (faster alignments)
- New technology (GRID or cluster computing)
- New tools (protein analysis, microarrays, confocal microscopy)
- New databases (microarrays, transcriptome, proteome)
- Collaborations with lab researchers!
8Three levels of services
- Simple web access to software and databases
- Easy to use for basic, occasional research with few sequences
- Potentially insecure
- Command-line access with a local Unix account
- More powerful (automation) and secure
- Requires an understanding of Unix and frequent practice
- Collaboration with SIB
- Access to experts in the field (help desk)
- For projects requiring extensive programming or special hardware resources
- Help desk
- helpdesk_at_mail.ch.embnet.org or http://www.expasy.org/contact.html
9SIB's important sites
- Home
- www.isb-sib.ch
- ExPASy - Expert Protein Analysis System
- www.expasy.org
- MyHits database and tools
- myhits.isb-sib.ch
- EMBnet Switzerland
- www.ch.embnet.org
- Geneva Bioinformatics
- www.genebio.ch
10SIB home
11Expert Protein Analysis System
12MyHits: http://myhits.isb-sib.ch
13Swiss node: http://www.ch.embnet.org
14EMBnet organisation
- Founded as a European network in 1988, now spread worldwide
- 32 country nodes, 8 special nodes
- Role
- Training, education (EMBER)
- Software development (EMBOSS, SRS)
- Computing resources (databases, websites, services)
- Helpdesk and technical support
- Publications (EMBnet.news, Briefings in Bioinformatics)
- Access: www.embnet.org
- Each node is at www.xx.embnet.org, where xx is the country code (e.g., ch for Switzerland)
15EMBnet home
16European Molecular Biology Open Software Suite
- Free Open Source (for most Unix platforms)
- GCG successor (compatible with GCG file format)
- More than 150 programs (ver. 2.9.0)
- Easy to install locally
- but no interface, requires local databases
- Unix command-line only
- Interfaces
- Jemboss, wEMBOSS, www2gcg, w2h (with account)
- Pise, EMBOSS-GUI, SRSWWW (no account)
- Staden, Kaptain, CoLiMate, Jemboss (local)
- Access: www.emboss.org or emboss.sourceforge.net
17Other important sites
- ExPASy - Expert Protein Analysis System
- www.expasy.org
- EBI - European Bioinformatics Institute
- www.ebi.ac.uk
- NCBI - National Center for Biotechnology Information
- www.ncbi.nlm.nih.gov
- Sanger - The Sanger Institute
- www.sanger.ac.uk
18Bioinformatics definition
- Every application of computer science to biology
- Sequence analysis, image analysis, sample management, population modelling, etc.
- Analysis of data coming from large-scale biological projects
- Genomes, transcriptomes, proteomes, metabolomes, etc.
19The new biology
- Traditional biology
- Small team working on a specialized topic
- Well-defined experiments to answer precise questions
- New "high-throughput" biology
- Large international teams using cutting-edge technology defining the project
- Results are released raw to the scientific community, without any underlying hypothesis
20Example of "high-throughput"
- Complete genome sequencing
- Large-scale sampling of the transcriptome (EST)
- Simultaneous expression analysis of thousands of genes (DNA microarrays, SAGE)
- Large-scale sampling of the proteome
- Protein-protein interaction analysis: large-scale 2-hybrid (yeast, worm)
- Large-scale 3D structure production (yeast)
- Metabolism modelling
- Simulations
- Biodiversity
21Role of bioinformatics
- Control and management of the data
- Analysis of primary data, e.g.
- Base calling from chromatograms
- Mass spectra analysis
- DNA microarray image analysis
- Statistics
- Database storage and access
- Results analysis in a biological context
22First information: a sequence?
- Nucleotide
- RNA (or cDNA)
- Genomic (intron-exon)
- Complete or incomplete?
- mRNA with 5' and 3' UTR regions
- Entire chromosome
- Protein
- Pre/Pro or functional protein?
- Function prediction
- Post-translational modifications?
- Holy Grail: 3D structure?
23Genomes in numbers
- Sizes
- virus: 10^3 to 10^5 nt
- bacteria: 10^5 to 10^7 nt
- yeast: 1.35 x 10^7 nt
- mammals: 10^8 to 10^10 nt
- plants: 10^10 to 10^11 nt
- Gene number
- virus: 3 to 100
- bacteria: 1000
- yeast: 7000
- mammals: 30000
- plants: 30000-50000?
24Sequencing projects
- "Small" genomes (<10^7 nt): bacteria, viruses
- Many already sequenced (industry excluded)
- More than 150 microbial genomes already in the public domain
- More to come! (one new genome every two weeks)
- "Large" genomes (10^7-10^10 nt): eukaryotes
- >30 finished (S. cerevisiae, S. pombe, E. cuniculi, G. theta, C. elegans, D. melanogaster, A. gambiae, P. falciparum, P. yoelii, D. rerio, F. rubripes, A. thaliana, O. sativa (2x), M. musculus, Homo sapiens, P. troglodytes, R. norvegicus, C. familiaris, G. gallus)
- Many more to come: cat, elephant, pig, cow, maize (and other plants), insects, fishes, many pathogenic parasites (Leishmania)
- EST sequencing
- Partial mRNA sequences: 20 x 10^6 sequences in the public domain
25Human genome
- Size: 3 x 10^9 nt for a haploid genome
- Highly repetitive sequences: 25%; moderately repetitive sequences: 25-30%
- Size of a gene: from 900 to >2,000,000 bases (introns included)
- Proportion of the genome coding for proteins: 5-7%
- Number of chromosomes: 22 autosomes and 1 sex chromosome
- Size of a chromosome: 5 x 10^7 to 5 x 10^8 bases
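The figures on this slide can be cross-checked with a little arithmetic. The sketch below is illustrative only: it reads the "5-7" coding proportion as a percentage of the 3 x 10^9 nt haploid genome, and the helper name `fraction_to_nt` is made up for this example.

```python
GENOME_SIZE_NT = 3 * 10**9  # haploid human genome size quoted on the slide


def fraction_to_nt(percent, genome_size=GENOME_SIZE_NT):
    """Convert a percentage of the genome into a number of nucleotides."""
    return genome_size * percent // 100


# Protein-coding share, assuming the slide's 5-7 means 5% to 7%.
coding_low = fraction_to_nt(5)
coding_high = fraction_to_nt(7)

# Mean chromosome size for a haploid set of 22 autosomes + 1 sex chromosome.
mean_chromosome = GENOME_SIZE_NT // 23

print(f"coding sequence: {coding_low:,} to {coding_high:,} nt")
print(f"mean chromosome size: {mean_chromosome:,} nt")
```

Under these assumptions the mean chromosome size falls inside the quoted 5 x 10^7 to 5 x 10^8 range, so the slide's numbers are at least mutually consistent.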
26How to sequence the human genome?
- Consortium "international" approach
- Generate genetic maps (meiotic recombination) and pseudogenetic maps (chromosome hybrids) for marker sequences
- Generate a physical map based on large clones (BAC or PAC)
- Sequence enough large clones to cover the genome
- "Commercial" approach (Celera)
- Generate random libraries of fixed-length genomic clones (2 kb and 10 kb)
- Sequence both ends of enough clones to obtain 10x coverage
- Use computational techniques to reconstruct the chromosomal sequences; check against the public project's physical map
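To get a feel for what 10x coverage means in the shotgun approach, here is a rough sketch. The 500 nt read length is an assumption, not a figure from the slides, and the uncovered-fraction estimate uses the simple Poisson (Lander-Waterman style) model, which ignores repeats and the overlap needed for assembly. Function names are hypothetical.

```python
import math


def reads_for_coverage(genome_size, read_length, coverage):
    """Reads needed so that total sequenced bases = coverage x genome size."""
    return math.ceil(coverage * genome_size / read_length)


def expected_uncovered_fraction(coverage):
    """Poisson estimate of the fraction of bases hit by no read at all."""
    return math.exp(-coverage)


GENOME_SIZE = 3 * 10**9  # haploid human genome, in nt (from the slides)
READ_LENGTH = 500        # ASSUMED average read length, in nt
COVERAGE = 10            # 10x coverage, as on the slide

n_reads = reads_for_coverage(GENOME_SIZE, READ_LENGTH, COVERAGE)
n_clones = n_reads // 2  # both ends of each clone are sequenced

print(f"reads needed:  {n_reads:,}")
print(f"clones needed: {n_clones:,}")
print(f"expected uncovered fraction: {expected_uncovered_fraction(COVERAGE):.2e}")
```

With these assumed values, tens of millions of reads are required, which is why the reconstruction step has to be done by computer.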
27Interpretation of the human draft
- All chromosomes considered as finished
- Even a genomic sequence does not tell you where the genes are encoded. The genome is far from being "decoded"
- One must combine genome and transcriptome to get a better idea
Last freeze: NCBI34, July 2003
28The transcriptome
- The set of all functional RNAs (tRNA, rRNA, mRNA, etc.) that can potentially be transcribed from the genome
- The documentation of the localization (cell type) and conditions under which these RNAs are expressed
- The documentation of the biological function(s) of each RNA species
29Public draft transcriptome
- Information about the expression specificity and the function of mRNAs
- "Full" cDNA sequences of known function
- "Full" cDNA sequences (HTC), but "anonymous" (e.g. KIAA or DKFZ collections)
- EST sequences
- cDNA libraries derived from many different tissues
- Rapid random sequencing of the ends of all clones
- ORESTES sequences
- Growing set of expression data (microarrays, SAGE, etc.)
- Increasing evidence for multiple alternative splicing and polyadenylation
30Example: mapping of ESTs and mRNAs
[Figure: tracks labelled mRNAs, ESTs, and Computer prediction]
31The proteome
- Set of proteins present in a particular cell type under particular conditions
- Set of proteins potentially expressed from the genome
- Information about the specific expression and function of the proteins
32Information on the proteome
- Separation of a complex mixture of proteins
- 2D PAGE (IEF + SDS-PAGE)
- Capillary chromatography
- Individual characterisation of proteins
- Tryptic peptide signature (MS)
- Sequencing by chemistry or MS/MS
- All post-translational modifications (PTMs)!
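The tryptic peptide signature mentioned above can be illustrated with an in silico digest. This is a minimal sketch, assuming the common simplified rule (cleave after K or R, except before P, no missed cleavages) and unmodified peptides. The residue masses are standard monoisotopic values and should be checked against a reference table before real use; the example protein is made up.

```python
# Monoisotopic residue masses in Da (standard values; verify before real use).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MASS = 18.01056  # one water is added per free peptide


def tryptic_digest(protein):
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER_MASS


protein = "MKRPGAKLR"  # hypothetical sequence, for illustration only
peptides = tryptic_digest(protein)
print(peptides)  # ['MK', 'RPGAK', 'LR']
for pep in peptides:
    print(pep, round(peptide_mass(pep), 4))
```

The resulting list of masses is the "signature" that is compared with the measured spectrum; post-translational modifications shift these masses and are not handled here.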
33Three-dimensional structures
- Methods to determine structures
- X-ray crystallography
- NMR
- Data format
- Atomic coordinates (except H) in Cartesian space
- Databases
- For proteins and nucleic acids (RCSB, formerly PDB)
- Independent databases for sugars and small organic molecules
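As an illustration of the coordinate data format, the sketch below reads the Cartesian coordinates from PDB-style ATOM records, which use fixed columns (x, y and z in columns 31-38, 39-46 and 47-54). The two atoms are invented for the example, and `make_atom_line` is a hypothetical helper used only to build correctly aligned sample lines.

```python
import math


def make_atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a minimal, column-aligned ATOM record (illustration only)."""
    return (f"{'ATOM':<6}{serial:>5} {name:<4} {res_name:>3} "
            f"{chain}{res_seq:>4}    {x:>8.3f}{y:>8.3f}{z:>8.3f}")


def parse_atom_line(line):
    """Extract fields from an ATOM/HETATM record using fixed columns."""
    return {
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "coords": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }


# Two made-up alpha-carbon atoms, not taken from any real structure.
lines = [
    make_atom_line(1, " CA", "GLY", "A", 1, 0.0, 0.0, 0.0),
    make_atom_line(2, " CA", "ALA", "A", 2, 3.0, 4.0, 0.0),
]
atoms = [parse_atom_line(line) for line in lines]

# Distance between the two atoms, in the same units as the coordinates.
distance = math.dist(atoms[0]["coords"], atoms[1]["coords"])
print(atoms[1]["res_name"], atoms[1]["coords"], distance)
```

Slicing by column rather than splitting on whitespace matters because adjacent fields in real files can touch without a separating space.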
34Visualisation of the structures
- Secondary structure elements
- Alpha helices, beta sheets, other
- Software
- Various representations (atoms, bonds, secondary structure)
- Wide choice of commercial and free software (e.g., DeepView)
35Sequence information, and so what?
- How to store and organise?
- Databases (next lecture)
- How to access, search, compare?
- Pairwise alignments, dot plots (Tuesday)
- BLAST searches in databases (Tuesday)
- Patterns, PSI-BLAST, profiles and HMMs (Wednesday)
- Gene prediction (Wednesday)
- EST clustering (Thursday)
- Multiple alignments (Thursday)
- Protein function prediction (Friday)
- Users' problems (Friday)
36Thank you