Title: Some research at the Bioinformatics Research Centre
1Some research at the Bioinformatics Research Centre
- brc_at_brc.dcs.gla.ac.uk
- www.brc.dcs.gla.ac.uk
- Department of Computing Science
- 416 Davidson Building (Biochemistry and Molecular Biology, Institute of Biomedical and Life Sciences)
- University of Glasgow
2Bioinformatics Research Centre
- Provides an environment for collaborative interdisciplinary research in Bioinformatics.
- Hosts researchers from:
- Department of Computing Science (5 RAE)
- Institute of Biomedical and Life Sciences (5/5 RAE)
- Physically located in the Institute of Biomedical and Life Sciences (Davidson Building - Biochemistry and Molecular Biology)
- Strong links with:
- Sir Henry Wellcome Functional Genomics Facility
- Statistical Bioinformatics
- Mathematical Biology
- Protein Crystallography
- Outreach programme (visitors etc.)
3What is Bioinformatics
Aims of research in Bioinformatics
- Bio - molecular biology
- Informatics - computer science.
- Development and application of computing, mathematical and statistical methods to analyse biological, biochemical and biophysical data, and to guide biological assays.
- Computational Biology (USA).
- Understand the functioning of living things - to improve the quality of life:
- drug design
- identification of genetic risk factors
- gene therapy
- genetic modification of food crops and animals, etc.
4Bioinformatics in context
5Bioinformatics in context (applications)
6Computational techniques we use
- Knowledge discovery and machine learning
- Algorithm design
- Computations over graphs, constraint computation
- Reasoning under uncertainty
- Databases and database indexing for new data types.
- Software and data integration.
- Distributed computations and data use over the GRID.
- Innovative visualisation and query mechanisms for large data sets
7Applications
- Biological areas
- sequence and structure analysis,
- biochemical networks,
- experimental design.
- Specifically
- disease gene finding,
- protein function prediction,
- protein structure modelling,
- microarray analysis.
8BRC Members
- Investigators
- David Gilbert (Director, Machine learning, Biochemical networks, protein structure) c
- David Leader (Visualisation tools) b
- Rod Page (Phylogenetic trees) b
- Pawel Herzyk (Protein structure) b
- Ela Hunt (Database indexing) c
- Gerhard May (Signalling pathways) b
- Olivier Sand (Transcriptional regulatory regions) c
- Richard Sinnott (Grid computing / eScience) c
- Juris Viksna (Graph algorithms) c
- Research Assistants: Brian Ross, Neil Hanlon, Nigel Harding, Gilleain Torrance
- Research students: Ali Al-Shahib, Iain Darroch, Susan Fairley, Eilidh Grant, Andrew Jones, Aik Choon Tan, Tim Troup, Mallika Veeramalai
- Associated: Malcolm Atkinson, Ernst Wit, John McClure
9Some funded projects
- Malcolm Atkinson (National e-Science Centre), David Gilbert, Ela Hunt, Anna Dominiczak and David White (IBM UK Life Sciences), "BRIDGES: BioMedical Research Informatics Delivered by Grid Enabled Services", EPSRC / E-Science Core Programme
- Mathis Riehle, David Gilbert, Chris Wilkinson and Stephen Yarwood, "Engineering Cell Form and Function with Nanometric Surfaces", a collaboration between the Centre for Cell Engineering and the Bioinformatics Research Centre. Funded by the Royal Society Wolfson Foundation Laboratory Refurbishment Scheme, February 2003 for 1 year
- Yves Deville and David Gilbert, "Application of constraint satisfaction for the analysis of biochemical networks", September 2002 for 1 year, funded by the EPSRC
- "A Software Tool for the Simulation and Analysis of Biochemical Networks", DTI, David Gilbert, Muffy Calder, Walter Kolch
- "Cardiovascular Functional Genomics", Wellcome Trust, Anna Dominiczak, Malcolm Atkinson, David Gilbert, and Ela Hunt
10Some funded projects (2)
- MRC Special Research Training Fellowship in Bioinformatics, Ela Hunt, MRC
- "SAILS: Self-Adjusting Indexes for Large Sequences", EPSRC, Malcolm Atkinson (Nigel Harding RA)
- "Structural patterns in the composite regulatory regions of genomes", Olivier Sand, European Union
- "Patterns, functions and structures on a protein topology database", BBSRC/EPSRC Bioinformatics Initiative, David Gilbert (Gilleain Torrance RA); joint project with David Westhead, Leeds University (Department of Biochemistry and Molecular Biology).
- "Automatic extraction of rules for protein classification", Juris Viksna, Wellcome Trust.
11Where we are
Functional Genomics (Joseph Black)
12BRC location
13BRC Phase 1
14Davidson level 4
BRC Phase 1
Gardiner lab
BRC Phase 2
15Bioinformatics Research Centre Davidson Building
20 workstations visitors facilities
Webserver
Fileserver
Unix Appserver
Microsoft App server
Cluster ( Scotgrid)
17 Lilybank Gardens
Boyd-Orr Building (backup)
16DNA to proteins
compute
compute
?how?
17Database Growth
PDB protein structures
18How can we analyse the flood of data?
- Data: don't just store it, analyze it! By comparing sequences, one can find out about things like:
- how organisms are related (evolution)
- protein structures
- how proteins function
- population variability
- diseases
19Computational bottlenecks
- Caused by
- Data characteristics
- Lots of it
- heterogeneous
- distributed
- incomplete
- dirty
- (Traditional) complexity issues: time, space
- Induction: constructing discriminatory/descriptive functions from large data sets
20Computational bottlenecks
- Data representation
- sequences (DNA, RNA, amino-acid)
- trees (phylogenetic, ...)
- graphs (protein structure, biochemical networks)
- matrices (micro-arrays, metabolic pathways)
21What kind of computational approaches do we use?
- Operations over
- sequences (match)
- trees (e.g. suffix trees, supertree, joining, ...)
- graphs (sub-graph isomorphism, maximal common subgraph, path searching)
- Data modelling, databases, data conversion
- Machine learning, knowledge discovery, pattern discovery, ...
- Clustering
- Theorem proving, concurrency analysis, ...
- Integration: data, knowledge
- Data visualisation
- Web services, Grid, coarse-grain parallelism, eScience, ...
22Data, information, knowledge
- Information: where are the genes? Found using a classifier, pattern or rule which has been mined/discovered
- Knowledge: facts and rules
- If a gene X has a weak psi-blast assignment to a function F
- and that gene is in an expression cluster
- and sufficient members of that cluster are known to have function F,
- ⇒ then believe assignment of F to X.
23An abstract view
- Given: p9, p1, q8, p3, q2, q6, p5, q4, p7, q0
- Cluster: {p9, p1, p3, p5, p7} {q8, q2, q6, q4, q0}
- Background knowledge: gt, -
- Induce: 0 is q; X is q if X-2 is q and X gt 0
- X is p if not(X is q)
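The induced rules can be read as a small recursive program. The following is a minimal sketch, assuming each item is a class letter followed by a non-negative integer; the function names `is_q` and `is_p` are illustrative and do not come from the slides.

```python
def is_q(x: int) -> bool:
    """Induced rule: 0 is q; X is q if X-2 is q and X > 0."""
    if x == 0:
        return True
    return x > 0 and is_q(x - 2)


def is_p(x: int) -> bool:
    """Induced rule: X is p if not (X is q)."""
    return not is_q(x)


# The items listed on the slide, with their true class letter in front.
given = ["p9", "p1", "q8", "p3", "q2", "q6", "p5", "q4", "p7", "q0"]

# Re-derive each label from the number alone, using the induced rules.
predicted = [("q" if is_q(int(item[1:])) else "p") + item[1:] for item in given]
```

Under these assumptions the rules reproduce every label in the given set, which is what a correct induction over this data should do.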
24To Learn
to acquire knowledge of (a subject) or skill
in (an art, etc.) as a result of study,
experience, or teaching (OED)
What is Machine Learning?
a computer program that can learn from
experience with respect to some class of tasks
and performance measure (Mitchell, 1997)
25Learning Approaches
(Ramaswamy and Golub 2002)
26Machine learning tasks (in bioinformatics as
elsewhere)
- Classification: predicting the class of an item
- Clustering: finding groups of items
- Characterisation: describing a group
- Deviation Detection: finding changes
- Linkage Analysis: finding relationships and associations
- Visualisation: presenting data visually to facilitate knowledge discovery by humans (human in the loop)
27Types of Machine Learning
- Symbolic approaches
- Employ some kind of description language in which the learned pattern is expressed.
- Much more transparent and easier to interpret.
28Single Machine Learning Approach
Machine Learning
Classifier
C4.5 SVM k-NN ANN
29Decision Trees (Quinlan, 1993)
If gene_1671 < 56.9 Then Normal
If gene_1671 ≥ 56.9 and gene_682 < 107.4 Then Normal
If gene_1671 ≥ 56.9 and gene_682 ≥ 107.4 and gene_201 < 3145.5 Then Tumour
If gene_1671 ≥ 56.9 and gene_682 ≥ 107.4 and gene_201 ≥ 3145.5 Then Normal
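The four rules are the root-to-leaf paths of one decision tree, so they can be written as nested tests. This is a sketch only: it reads "lt" as < and assumes the other comparison is ≥. The function name `classify` and the dictionary input are illustrative choices, not part of C4.5.

```python
from typing import Mapping


def classify(sample: Mapping[str, float]) -> str:
    """Apply the slide's decision rules to one expression profile."""
    if sample["gene_1671"] < 56.9:
        return "Normal"
    # From here on gene_1671 >= 56.9.
    if sample["gene_682"] < 107.4:
        return "Normal"
    # From here on gene_682 >= 107.4 as well.
    if sample["gene_201"] < 3145.5:
        return "Tumour"
    return "Normal"


example = {"gene_1671": 60.0, "gene_682": 200.0, "gene_201": 1000.0}
label = classify(example)
```

Each `if` corresponds to one internal node of the tree, which is why symbolic classifiers of this kind are easy to inspect.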
30Ensemble Machine Learning
Combined Classifier
31Why Ensemble Learning?
- Advantages
- Complement each other's weaknesses
- Increase predictive power
- Approximate the true hypothesis
- Disadvantage
- Difficult to combine - lack of coherence
- Increase computational time
32 Cross-validation
33Confusion matrix / Contingency Table
Training set test set
True Positives (TP): x ∈ X+ and h(x) = TRUE
True Negatives (TN): x ∈ X- and h(x) = FALSE
False Positives (FP): x ∈ X- and h(x) = TRUE
False Negatives (FN): x ∈ X+ and h(x) = FALSE
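The four cells can be counted directly from the true labels and the classifier output h(x). This is a minimal sketch, assuming boolean labels (True for a positive example); the function names and the toy data are illustrative.

```python
def confusion_counts(labels, predictions):
    """Return (TP, TN, FP, FN) for boolean labels and predictions h(x)."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(labels, predictions):
        if actual and predicted:
            tp += 1          # positive example, h(x) = TRUE
        elif not actual and not predicted:
            tn += 1          # negative example, h(x) = FALSE
        elif not actual and predicted:
            fp += 1          # negative example, h(x) = TRUE
        else:
            fn += 1          # positive example, h(x) = FALSE
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Fraction of positive examples that are recognised."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of negative examples that are rejected."""
    return tn / (tn + fp)


labels = [True, True, True, False, False, False, False, True]
predictions = [True, False, True, False, True, False, False, True]
tp, tn, fp, fn = confusion_counts(labels, predictions)
```

Sensitivity and specificity, used later for rating patterns, are simple ratios of these counts.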
34Classification conservation problems
Classification: + and - examples
Characterisation: + examples only
[Figure: clean and noisy training data for the classes S, S-, F and F-, with unlabelled (?) items]
35The challenge of increasing data
Language of the pattern L(P)
36Protein family analysis
- Collect sequences (structures) in family
- Analyze
- local multiple alignment
- global multiple alignment
- pattern discovery
- Make family description
- Pick up more family members?
- Analyze extended set
37String or structure comparison motif discovery
Str Database
Eidhammer, Jonassen and Taylor, Structure Comparison and Structure Patterns, JCB, 7(5), pp 685-716, 2000.
38What is a pattern?
39Types of Pattern
- Deterministic
- is a boolean function which either matches a given object (i.e. sequence, structure) or not
- R-x-Y-[ST]
- (e.g. regular expression for sequence pattern)

     1   2   3   4   5   6   7   8   9   10
S1   R   V   Q   R   A   Y   S   Y   V   N
S2   P   L   M   R   A   Y   S   I   A   S
S3   L   V   I   R   P   Y   T   P   V   S
S4   L   C   M   R   A   Y   T   P   T   S
S5   E   K   L   R   L   Y   S   I   A   S

     R.2 V.4 Q.2 R1  A.6 Y1  S.6 Y.2 V.4 N.2
     P.2 L.2 M.4     P.2     T.4 I.4 A.4 S.8
     L.4 V.2 I.2     L.2         P.4 T.2
     E.2

- Probabilistic
- Assigns each sequence a probability that it was generated by the model. The higher the probability, the better the match between a sequence and a pattern
- (e.g. Profile for sequence pattern)
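Both kinds of pattern can be tried on the five aligned sequences from the slide. This sketch assumes the deterministic pattern reads R-x-Y-[ST], where x is any residue and [ST] means S or T. It also assumes the profile is the per-column residue frequency, with a score equal to the product of those frequencies and no pseudocounts. The function names are illustrative.

```python
import re
from collections import Counter

sequences = {
    "S1": "RVQRAYSYVN",
    "S2": "PLMRAYSIAS",
    "S3": "LVIRPYTPVS",
    "S4": "LCMRAYTPTS",
    "S5": "EKLRLYSIAS",
}

# Deterministic pattern: R-x-Y-[ST] written as a regular expression.
pattern = re.compile(r"R.Y[ST]")
matches = {name: bool(pattern.search(seq)) for name, seq in sequences.items()}


def build_profile(seqs):
    """Per-column residue frequencies of equal-length aligned sequences."""
    n = len(seqs)
    return [
        {residue: count / n for residue, count in Counter(column).items()}
        for column in zip(*seqs)
    ]


def score(profile, seq):
    """Probability of seq under the profile (product of column frequencies)."""
    p = 1.0
    for column, residue in zip(profile, seq):
        p *= column.get(residue, 0.0)
    return p


profile = build_profile(list(sequences.values()))
```

The regular expression gives a yes/no answer, while `score` ranks sequences. A residue never seen in a column gives a score of zero here, which is why practical profiles usually add pseudocounts.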
40Approaches to pattern discovery
- Pattern driven
- enumerate all (or some) patterns up to a certain complexity (length), calculate the score for each, and report the best
- Sequence driven
- look for patterns by aligning the given sequences
Brazma et al., Approaches to the automatic discovery of patterns in biosequences, Journal of Computational Biology, 5(2):277-303, 1998
41Sequence driven algorithms
- Group similar sequences together (e.g., in pairs)
- For each group find a common pattern (e.g., by dynamic programming)
- Group similar patterns together and repeat the previous step until there is only one group left
42Sequence driven approach
s1
p1
s2
p4
s3
p2
s4
p3
s5
43Pattern driven approach
- Given a set of examples E
- Set pattern P = ø
- While (match_all(P, E) = true) do
- P = P + c
- Return P
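The loop above can be sketched for plain strings. Several assumptions are made here: a pattern is a substring, `match_all` means the pattern occurs in every example, and a component c is a single character added at either end. Unlike the slide's loop, this version returns the last pattern that still matches all examples. The greedy search is not guaranteed to find the longest common pattern.

```python
def match_all(pattern, examples):
    """True if the pattern occurs (as a substring) in every example."""
    return all(pattern in example for example in examples)


def extend_pattern(examples, alphabet):
    """Greedily grow a pattern one character at a time while it matches all."""
    pattern = ""
    extended = True
    while extended:
        extended = False
        for c in alphabet:
            # Try the component c on the right, then on the left.
            for candidate in (pattern + c, c + pattern):
                if match_all(candidate, examples):
                    pattern = candidate
                    extended = True
                    break
            if extended:
                break
    return pattern


examples = ["RVQRAYSYVN", "PLMRAYSIAS", "LCMRAYTPTS"]
amino_acids = "ACDEFGHIKLMNPQRSTVWY"
common = extend_pattern(examples, amino_acids)
```

The cost grows with the alphabet size and the pattern length, which is the usual trade-off of pattern-driven search against sequence-driven alignment.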
44Topological pattern discovery (pattern extension
and repeated matching)
Repeat
Works (in theory) on set of any size
- Find new sheet
- Extend current sheet
- Find circuits
45Rating patterns
- Size (e.g. number of characters)
- Compression
- measure of how much of each of the items in the learning set is described
- Sensitivity, Specificity etc.
- requires evaluation against learning / training / test sets
46Compression
- Send the pattern once, and
- Send the uncovered parts of each structure
Pattern
Domain 1
Domain 2
Special case: when there are 2 examples, compression gives comparison
47TOPS: Protein topology
David Gilbert, Juris Viksna, Gilleain Torrance (BRC, Glasgow), David Westhead and Ioannis Michalopoulos (Leeds)
BBSRC/EPSRC funded
48TOPS website
http://www.tops.leeds.ac.uk/
49Topological structure comparison: 1 against all
Structure comparison server: http://tops.ebi.ac.uk/tops
50Comparing structures - NADP binding domains
dihydrofolate reductase
51Dendrogram of comparisons
Pairwise comparison, all x all: n(n-1)/2 = 14 x 13 / 2 = 91
hierarchical clustering
52Pairwise comparison
- Pairwise comparison all x all
- n (n-1) / 2
- 14,000 protein domains ⇒ approx. 10^8 comparisons
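The growth in the number of comparisons is easy to check. A minimal sketch, with the function name `n_pairwise` and the domain labels chosen purely for illustration:

```python
from itertools import combinations


def n_pairwise(n: int) -> int:
    """Number of unordered pairs among n items: n(n-1)/2."""
    return n * (n - 1) // 2


# Small case: enumerate the pairs explicitly for 14 structures.
domains = [f"d{i}" for i in range(14)]
pairs = list(combinations(domains, 2))

# Large case: only count, since enumerating is already expensive.
large = n_pairwise(14000)
```

For 14 structures this gives 91 pairs. For 14,000 domains it gives 97,993,000, i.e. of the order of 10^8 comparisons.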
53Coverage vs Error
54Topological representation of transcriptional regulatory regions
Olivier Sand, osand_at_brc.dcs.gla.ac.uk
55Hierarchical Machine Learning of Patterns for
Characterising Protein Families
Aik Choon TAN, actan_at_brc.dcs.gla.ac.uk
Research aim: to construct a novel approach to induce invariant relationships between distributed heterogeneous biological databases using knowledge discovery and hierarchical machine learning techniques.
56Biological Data: Distributed and Heterogeneous!
57Development of Twilight Friendly Software Using
a Rule-Based System that Provides Functional
Annotation with Varying Levels of Uncertainty
Ali Al-Shahib, www.brc.dcs.gla.ac.uk/alshahib, alshahib_at_brc.dcs.gla.ac.uk
If a gene X has a weak psi-blast assignment to a function F, and that gene is in an expression cluster, and sufficient members of that cluster are known to have function F, ⇒ then believe assignment of F to X (for some suitable values of "weak" and "sufficient").
- WEIGHTED LOGICAL RULES
- Rules that will limit the measures of uncertainty
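The rule can be sketched as a function with two tunable thresholds. Everything concrete here is an assumption made for illustration: the hit strength is a hypothetical normalised score in [0, 1] (higher is stronger) rather than a real PSI-BLAST E-value, and the default values of `weak` and `sufficient` are arbitrary.

```python
from typing import Optional, Sequence


def believe_assignment(
    hit_score: float,
    cluster_member_functions: Optional[Sequence[Optional[str]]],
    function: str,
    weak: float = 0.2,        # hypothetical threshold for a "weak" hit
    sufficient: float = 0.5,  # hypothetical fraction of supporting members
) -> bool:
    """Decide whether to believe the assignment of `function` to gene X."""
    if hit_score < weak:
        return False  # not even a weak assignment
    if not cluster_member_functions:
        return False  # gene is not in an expression cluster
    # Members with unknown function are given as None and never count as support.
    support = sum(1 for f in cluster_member_functions if f == function)
    return support / len(cluster_member_functions) >= sufficient


cluster = ["kinase", "kinase", None, "kinase"]
decision = believe_assignment(0.3, cluster, "kinase")
```

Exposing the thresholds as parameters reflects the idea that the rule holds "for some suitable values", and each value could carry a weight expressing the uncertainty of the resulting annotation.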
58Data storage, integration and indexing
Ela Hunt, ela_at_brc.dcs.gla.ac.uk
External databases: NCBI, Ensembl, TIGR, Mouse database, Drosophila database, Mouse microarrays, chromosome 5 db
Experimental data: images; images converted to numbers or strings
59Indexing
Ela Hunt, ela_at_brc.dcs.gla.ac.uk
- String indexing structures can be used to index DNA, proteins, XML and phylogenetic trees
- All data is read once; the index is created on disk
- The index reduces the search space of the query (we read only a fraction of the disk)
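How an index narrows the search can be shown with a very simple structure. This sketch uses an in-memory k-mer index as a stand-in; it is not the on-disk suffix-tree style index used in the project, and the function names and example text are illustrative.

```python
from collections import defaultdict
from typing import Dict, List


def build_index(text: str, k: int) -> Dict[str, List[int]]:
    """Read the text once and record every position of each k-mer."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return dict(index)


def search(text: str, index: Dict[str, List[int]], k: int, query: str) -> List[int]:
    """Find all occurrences of query, checking only positions from the index."""
    if len(query) < k:
        raise ValueError("query must be at least k characters long")
    candidates = index.get(query[:k], [])
    return [i for i in candidates if text.startswith(query, i)]


dna = "ACGTACGTTACG"
index = build_index(dna, 3)
hits = search(dna, index, 3, "ACGT")
```

The query only inspects the positions listed under its first k-mer (three of them here) instead of scanning the whole text, which is the reduction of search space the slide refers to.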
60New database technologies for storing the output
from high-throughput biological experiments
Andrew Jones
- Proteomics: study the set of proteins expressed in a sample
- Complex, variable output:
- High-resolution images
- Numerical data generated by lab equipment and software
- Human annotation
- The data is not suitable for storage in a standard relational database
- Storage, retrieval and exchange of data is important
- XML (Extensible Markup Language) is being investigated for storing such data
61SAILS: SELF ADJUSTING INDEXES FOR LARGE SEQUENCES
Nigel Harding, hardingn_at_dcs.gla.ac.uk
- EPSRC, 30 months from August 2001; Malcolm Atkinson P.I.
- Developing indexes for very large collections of reference data (e.g. mammalian genomes) that will tune themselves automatically in response to the queries being submitted against them.
- However, before introducing self-adjustment, we need to establish how query performance depends upon index structure.
62Loading GenBank files; Viewing Gene Information
BugView - A Genome Visualization Tool
David P. Leader, d.leader_at_bio.gla.ac.uk
63Comparing Genes from different Bacterial Genomes
64Molecular Evolution: A Phylogenetic Approach
Rod Page, r.page_at_bio.gla.ac.uk
Locating genome duplications. Q: did one or more genome-wide events affect all gene families?
65Supertrees: combining small trees into one large tree
Input: k trees
Q: can we do this in polynomial time?
supertree
graph
all minimum cuts
66Data complexity: Methionine Biosynthesis in E. coli
67Biochemical networks
- Pathway navigation
- Pathway comparison
- Pathway motif discovery
- Pathway simulation
- High-level abstraction inferred from low-level descriptions
- Novel pathways from gene expression experiments
68Biochemical Pathway Simulator: A Software Tool for Simulation and Analysis of Biochemical Networks
DTI Beacon project, 0.9M, 4 years
- Muffy Calder muffy_at_dcs.gla.ac.uk
- David Gilbert drg_at_brc.dcs.gla.ac.uk
- Walter Kolch wkolch_at_beatson.gla.ac.uk
- Keith van Rijsbergen keith_at_dcs.gla.ac.uk
- Brian Ross rossbs_at_dcs.gla.ac.uk
69Complexity: real bioinformatics
Closing the loop from wet lab to in-silico
Abstract model
Human feedback (in-the-loop)
Simulator
Database MAPK
Lab MAPK
Analysis
Web portal
DATA
User Interface
Pathway Editor
Rules
Database Apoptosis
Text miner
Simulator: Calder, Ross (Concurrency theory)
Bioinformatics: Gilbert (Tools, database, interface)
Bio: Kolch (Lab/Literature)
70Distributed databases and computation
Cardiovascular Functional Genomics
- -5.4 million project, 5 UK Universities.
- Combined studies
- scientific models of disease (Rat)
- parallel studies of patients
- large family and population DNA collections
- 3 pronged approach
- Targeted transcript sequencing
- Microarray gene expression profiling
- Comparative genome analysis.
- Data generated at each of the 5 sites made available for analysis
- Issues of distributed data and computation.
- Mapping gene sequences: Rat ↔ Mouse ↔ Human
- an added layer of complexity in the computation.
71Cardiovascular Functional Genomics
- Glasgow: Anna Dominiczak, John Connell, David Gilbert, Malcolm Atkinson, Ela Hunt
- Leicester: Nilesh Samani, Richard Trembath, Paul Burton
- Edinburgh: John Mullins, Ian Wilmut, Jonathan Seckl
- Imperial/MRC Clinical Sciences Centre: Timothy Aitman, James Scott, Helen Causton
- Oxford: Dominique Gauguier, Hugh Watkins
- Maastricht: Henry Struijker Boudier
72Distributed data computation
73Wellcome Trust Cardiovascular Functional
Genomics
74BRIDGES: BioMedical Research Informatics Delivered by Grid Enabled Services
- National e-Science Centre, Bioinformatics Research Centre, IBM UK Life Sciences
- Incrementally develop and explore database integration over 6 geographically distributed research sites within the framework of the large Wellcome Trust biomedical research project Cardiovascular Functional Genomics.
- Three classes of integration will be developed to support a sophisticated bioinformatics infrastructure supporting:
- data sources (both public and project generated),
- bioinformatics analysis and visualisation tools,
- research activities combining shared and private data.
- The inclusion of patient records and animal experiment data means that privacy and access control are particular concerns.
- An exploration of index factories accelerating sequence processing will test the hypothesis that the Grid makes a new class of e-Science indexes feasible. Both OGSA-DAI and IBM DiscoveryLink technology will be employed and a report will identify how each performed in this context.
75The Scottish Bioinformatics Forum (SBF)
- Network of Bioinformatics researchers and industries in Scotland
- A vehicle for developing Scotland as a Centre of Bioinformatics Excellence
- Nodes in Glasgow, Edinburgh, Dundee, Aberdeen, ...
- Promoting collaborative research
- Development of a Bioinformatics educational programme
- www.sbforum.org, sbforum-general_at_sbforum.org
76The Future
GPCVIII, ECCB04 and ISMB04 at Glasgow
Scottish Bioinformatics Forum (SBF)
- Closing the loop from wet lab to in silico!
Collaboration!
http://www.brc.dcs.gla.ac.uk