Some research at the Bioinformatics Research Centre

Slides: 77

Provided by: davidg77

Transcript and Presenter's Notes

1
Some research at the Bioinformatics Research
Centre

brc_at_brc.dcs.gla.ac.uk
www.brc.dcs.gla.ac.uk
Department of Computing Science
416 Davidson Building (Biochemistry & Molecular
Biology, Institute of Biomedical and Life Sciences)
University of Glasgow

2
Bioinformatics Research Centre

Provides an environment for collaborative
interdisciplinary research in Bioinformatics.
Hosts researchers from
Department of Computing Science (5 RAE)
Institute of Biomedical and Life Sciences.
(5/5 RAE)
Physically located in the Institute of Biomedical
and Life Sciences (Davidson Building -
Biochemistry & Molecular Biology)
Strong links with
Sir Henry Wellcome Functional Genomics Facility.
Statistical Bioinformatics
Mathematical Biology
Protein Crystallography
Outreach programme (visitors etc)

3
What is Bioinformatics?
Aims of research in Bioinformatics

Bio - molecular biology
Informatics - computer science.
Development and application of computing,
mathematical and statistical methods to analyse
biological, biochemical and biophysical data, and
to guide biological assays.
Computational Biology (USA).

Understand the functioning of living things - to
improve the quality of life.
drug design
identification of genetic risk factors
gene therapy
genetic modification of food crops and animals,
etc.

4
Bioinformatics in context
5
Bioinformatics in context (applications)
6
Computational techniques we use

Knowledge discovery and machine learning
Algorithm design
Computations over graphs, constraint
computation
Reasoning under uncertainty
Databases and database indexing for new data
types.
Software and data integration.
Distributed computations and data use over the
GRID.
Innovative visualisation and query mechanisms
for large data sets

7
Applications

Biological areas
sequence and structure analysis,
biochemical networks,
experimental design.
Specifically
disease gene finding,
protein function prediction,
protein structure modelling,
microarray analysis.

8
BRC Members

Investigators
David Gilbert (Director, Machine learning,
Biochemical networks, protein structure) c
David Leader (Visualisation tools) b
Rod Page (Phylogenetic trees) b
Pawel Herzyk (Protein structure) b
Ela Hunt (Database indexing) c
Gerhard May (Signalling pathways) b
Olivier Sand (Transcriptional regulatory
regions) c
Richard Sinnott (Grid computing / eScience) c
Juris Viksna (Graph algorithms) c
Research Assistants: Brian Ross, Neil Hanlon,
Nigel Harding, Gilleain Torrance
Research students: Ali Al-Shahib, Iain Darroch,
Susan Fairley, Eilidh Grant, Andrew Jones, Aik
Choon Tan, Tim Troup, Mallika Veeramalai
Associated: Malcolm Atkinson, Ernst Wit, John
McClure

9
Some funded projects

Malcolm Atkinson (National e-Science Centre),
David Gilbert, Ela Hunt, Anna Dominiczak and
David White (IBM UK Life Sciences), "BRIDGES
BioMedical Research Informatics Delivered by Grid
Enabled Services", EPSRC / E-Science Core
Programme
Mathis Riehle, David Gilbert, Chris Wilkinson and
Stephen Yarwood, "Engineering Cell Form and
Function with Nanometric Surfaces", a
collaboration between the Centre for Cell
Engineering and the Bioinformatics Research
Centre. Funded by the Royal Society Wolfson
Foundation Laboratory Refurbishment Scheme,
February 2003 for 1 year
Yves Deville and David Gilbert, Application of
constraint satisfaction for the analysis of
biochemical networks, September 2002 for 1 year,
funded by the EPSRC
A Software Tool for the Simulation and Analysis
of Biochemical Networks, DTI, David Gilbert,
Muffy Calder, Walter Kolch
Cardiovascular Functional Genomics, Wellcome
Trust, Anna Dominiczak, Malcolm Atkinson, David
Gilbert, and Ela Hunt

10
Some funded projects (2)

MRC Special Research Training Fellowship in
Bioinformatics Ela Hunt, MRC
SAILS Self-Adjusting Indexes for Large
Sequences, EPSRC, Malcolm Atkinson (Nigel Harding
RA)
"Structural patterns in the composite regulatory
regions of genomes", Olivier Sand, European Union
"Patterns, functions and structures on a protein
topology database", BBSRC/EPSRC Bioinformatics
Initiative, David Gilbert (Gilleain Torrance
RA), joint project with David Westhead, Leeds
University (Department of Biochemistry &
Molecular Biology).
"Automatic extraction of rules for protein
classification", Juris Viksna, Wellcome Trust.

11
Where we are
Functional Genomics (Joseph Black)
12
BRC location
13
BRC Phase 1
14
Davidson level 4
BRC Phase 1
Gardiner lab
BRC Phase 2
15
Bioinformatics Research Centre Davidson Building
20 workstations visitors facilities
Webserver
Fileserver
Unix Appserver
Microsoft App server
Cluster ( Scotgrid)
17 Lilybank Gardens
Boyd-Orr Building (backup)
16
DNA to proteins
compute
compute
?how?
17
Database Growth
PDB protein structures
18
How can we analyse the flood of data ?

Data: don't just store it, analyze it! By
comparing sequences, one can find out about
things like
how organisms are related (evolution)
protein structures
how proteins function
population variability
diseases

19
Computational bottlenecks

Caused by
Data characteristics
Lots of it
heterogeneous
distributed
incomplete
dirty
(Traditional) complexity issues: time, space
Induction: constructing discriminatory/descriptive
functions from large data sets

20
Computational bottlenecks

Data representation
sequences (DNA, RNA, amino-acid)
trees (phylogenetic, ...)
graphs (protein structure, biochemical networks)
matrices (micro-arrays, metabolic pathways)

21
What kind of computational approaches do we use?

Operations over
sequences (match)
trees (e.g. suffix trees, supertree, joining,
...)
graphs (sub-graph isomorphism, maximal common
subgraph, path searching)
Data modelling, databases, data conversion
Machine learning, knowledge discovery, pattern
discovery,...
Clustering
Theorem proving, concurrency analysis,
Integration data, knowledge
Data visualisation
Web services, Grid, Coarse Grain parallelism,
eScience,...

22
Data, information, knowledge

data: nucleotide sequence

information: where are the genes?

Found using a classifier, pattern or rule which has
been mined/discovered

knowledge: facts and rules
If a gene X has a weak psi-blast assignment to a
function F
and that gene is in an expression cluster
and sufficient members of that cluster are known
to have function F,
then believe assignment of F to X.

23
An abstract view

Given: p9, p1, q8, p3, q2, q6, p5, q4, p7, q0
Cluster: {p9, p1, p3, p5, p7}, {q8, q2, q6, q4, q0}
Background knowledge: >, -
Induce: 0 is q
X is q if X-2 is q and X > 0
X is p if not(X is q)
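
The induced rule can be checked against the given items in a few lines of Python. This is only a sketch: the function names `is_q` and `is_p` and the list `given` are illustrative and do not come from the slides.

```python
def is_q(x: int) -> bool:
    """Induced rule: 0 is q; X is q if X-2 is q and X > 0."""
    if x == 0:
        return True
    return x > 0 and is_q(x - 2)


def is_p(x: int) -> bool:
    """X is p if not (X is q)."""
    return not is_q(x)


# The numeric parts of the given items p9, p1, q8, ...
given = [9, 1, 8, 3, 2, 6, 5, 4, 7, 0]
p_cluster = [x for x in given if is_p(x)]
q_cluster = [x for x in given if is_q(x)]
```

Applied to the ten items, the rule reproduces the two clusters listed on the slide.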

24
To Learn
to acquire knowledge of (a subject) or skill
in (an art, etc.) as a result of study,
experience, or teaching (OED)
What is Machine Learning?
a computer program that can learn from
experience with respect to some class of tasks
and performance measure (Mitchell, 1997)
25
Learning Approaches
(Ramaswamy and Golub 2002)
26
Machine learning tasks (in bioinformatics as
elsewhere)

Classification: predicting the class of an item
Clustering: finding groups of items
Characterisation: describing a group
Deviation Detection: finding changes
Linkage Analysis: finding relationships and
associations
Visualisation: presenting data visually to
facilitate knowledge discovery by humans (human
in the loop)

27
Types of Machine Learning

Symbolic approaches
Employ some kind of description language in which
the learned pattern is expressed.
Much more transparent and easier to interpret.

28
Single Machine Learning Approach
Machine Learning
Classifier
C4.5 SVM k-NN ANN
29
Decision Trees (Quinlan, 1993)
If gene_1671 < 56.9 Then Normal
If gene_1671 ≥ 56.9 and gene_682 < 107.4 Then Normal
If gene_1671 ≥ 56.9 and gene_682 ≥ 107.4 and gene_201 < 3145.5 Then Tumour
If gene_1671 ≥ 56.9 and gene_682 ≥ 107.4 and gene_201 ≥ 3145.5 Then Normal
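
The four rules describe one path through a decision tree and can be written as nested tests. The sketch below assumes that each threshold splits values into "less than" and "greater than or equal to"; the function name `classify` is illustrative.

```python
def classify(gene_1671: float, gene_682: float, gene_201: float) -> str:
    """Apply the decision-tree rules for the three expression values."""
    if gene_1671 < 56.9:
        return "Normal"
    # From here on gene_1671 >= 56.9
    if gene_682 < 107.4:
        return "Normal"
    # From here on gene_682 >= 107.4 as well
    if gene_201 < 3145.5:
        return "Tumour"
    return "Normal"
```

Each rule on the slide corresponds to exactly one `return`, which is what makes this kind of classifier easy to read.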
30
Ensemble Machine Learning
Combined Classifier
31
Why Ensemble Learning?

Advantages
Complement each other's weaknesses
Increase predictive power
Approximate the true hypothesis
Disadvantages
Difficult to combine - lack of coherence
Increase computational time
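
The slides do not say how the individual classifiers are combined. Majority voting is one common scheme, and the sketch below assumes it; the function name `majority_vote` is illustrative.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label predicted by the largest number of base classifiers."""
    return Counter(predictions).most_common(1)[0][0]


# Hypothetical outputs of three base classifiers for one sample
combined = majority_vote(["Tumour", "Normal", "Tumour"])
```

Every base classifier has to be run on every sample, which is the extra computational time listed among the disadvantages.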

32
Cross-validation
33
Confusion matrix / Contingency Table
Training set / test set
True Positives (TP): x ∈ X+ and h(x) = TRUE
True Negatives (TN): x ∈ X- and h(x) = FALSE
False Positives (FP): x ∈ X- and h(x) = TRUE
False Negatives (FN): x ∈ X+ and h(x) = FALSE
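
The four counts can be computed directly from the true labels and the predictions h(x). A minimal sketch, with illustrative function names and made-up example data; sensitivity and specificity use their standard definitions.

```python
def confusion_counts(labels, predictions):
    """Count TP, TN, FP, FN for boolean labels and boolean predictions h(x)."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, h in pairs if y and h)
    tn = sum(1 for y, h in pairs if not y and not h)
    fp = sum(1 for y, h in pairs if not y and h)
    fn = sum(1 for y, h in pairs if y and not h)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Fraction of positive examples that are predicted positive."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of negative examples that are predicted negative."""
    return tn / (tn + fp)


# Made-up example: three positive and two negative items
labels = [True, True, True, False, False]
predictions = [True, True, False, False, True]
tp, tn, fp, fn = confusion_counts(labels, predictions)
```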
34
Classification conservation problems
Classification: + and - examples
Characterisation: + examples only
[Figure: regions labelled S, S-, F, F- and ?, shown for clean training data and for noisy training data]
35
The challenge of increasing data
Language of the pattern L(P)
36
Protein family analysis

Collect sequences (structures) in family
Analyze
local multiple alignment
global multiple alignment
pattern discovery
Make family description
Pick up more family members?
Analyze extended set

37
String or structure comparison, motif discovery
Str Database
Eidhammer, Jonassen & Taylor, Structure
Comparison and Structure Patterns, JCB, 7(5), pp.
685-716, 2000.
38
What is a pattern?
39
Types of Pattern

Deterministic
is a boolean function which either matches a
given object (i.e. sequence, structure) or not
R-x-Y-[ST]
(e.g. regular expression for sequence pattern)

     1    2    3    4    5    6    7    8    9    10
S1   R    V    Q    R    A    Y    S    Y    V    N
S2   P    L    M    R    A    Y    S    I    A    S
S3   L    V    I    R    P    Y    T    P    V    S
S4   L    C    M    R    A    Y    T    P    T    S
S5   E    K    L    R    L    Y    S    I    A    S

     R.2  V.4  Q.2  R1   A.6  Y1   S.6  Y.2  V.4  N.2
     P.2  L.2  M.4       P.2       T.4  I.4  A.4  S.8
     L.4  V.2  I.2       L.2            P.4  T.2
     E.2

Probabilistic
Assigns each sequence a probability
that it was generated by the
model. The higher the probability,
the better the match between a
sequence and a pattern
(e.g. Profile for sequence pattern)
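
Both kinds of pattern can be tried on the five aligned sequences from the slide. The sketch below makes two assumptions: the deterministic pattern is read as R, any residue, Y, then S or T, and the profile is a plain table of per-column frequencies without pseudocounts. The function names are illustrative.

```python
import re
from collections import Counter

# The five aligned sequences S1..S5 from the slide
alignment = [
    "RVQRAYSYVN",
    "PLMRAYSIAS",
    "LVIRPYTPVS",
    "LCMRAYTPTS",
    "EKLRLYSIAS",
]

# Deterministic pattern: matches or does not match
deterministic = re.compile(r"R.Y[ST]")
matches = [bool(deterministic.search(s)) for s in alignment]


def column_profile(seqs):
    """Per-column residue frequencies of equal-length aligned sequences."""
    n = len(seqs)
    return [{aa: c / n for aa, c in Counter(col).items()} for col in zip(*seqs)]


def profile_probability(profile, seq):
    """Probability of seq under the profile, as a product of column frequencies."""
    p = 1.0
    for freqs, aa in zip(profile, seq):
        p *= freqs.get(aa, 0.0)
    return p


profile = column_profile(alignment)
```

A sequence containing a residue never seen in some column gets probability zero here; practical profiles usually add pseudocounts to avoid that.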

40
Approaches to pattern discovery

Pattern driven
enumerate all (or some) patterns up to certain
complexity (length), for each calculate the
score, and report the best
Sequence driven
look for patterns by aligning the given sequences

Brazma et al., Approaches to the automatic
discovery of patterns in biosequences, Journal of
Computational Biology, 5(2):277-303, 1998
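
A pattern-driven search can be sketched by enumerating every substring pattern up to a fixed length and scoring each one. The scoring used here (number of sequences containing the pattern, with longer patterns preferred on ties) and the example sequences are assumptions made for illustration.

```python
from itertools import product


def best_pattern(seqs, alphabet, max_len):
    """Enumerate all patterns up to max_len and return the best one with its coverage."""
    best, best_key = "", (0, 0)
    for length in range(1, max_len + 1):
        for chars in product(alphabet, repeat=length):
            pattern = "".join(chars)
            coverage = sum(pattern in s for s in seqs)
            key = (coverage, length)  # more coverage first, then longer pattern
            if key > best_key:
                best, best_key = pattern, key
    return best, best_key[0]


# Made-up DNA fragments
example_seqs = ["ACGTAC", "TTACGA", "GACGTT"]
pattern, coverage = best_pattern(example_seqs, "ACGT", 3)
```

The number of candidate patterns grows exponentially with the maximum length, which is why the slide speaks of enumerating "all (or some)" patterns.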
41
Sequence driven algorithms

Group similar sequences together (e.g., in
pairs)
For each group find a common pattern (e.g., by
dynamic programming)
Group similar patterns together and repeat the
previous step until there is only one group left
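
The three steps can be sketched with the longest common subsequence, computed by dynamic programming, standing in for the "common pattern". Two simplifications are assumed: neighbours in the list are paired rather than the most similar sequences, and a pattern is just a string of residues. The function names are illustrative.

```python
def lcs(a, b):
    """Longest common subsequence of two strings by dynamic programming."""
    m, n = len(a), len(b)
    table = [[""] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + a[i]
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j], key=len)
    return table[m][n]


def sequence_driven(seqs):
    """Merge patterns pairwise until a single common pattern is left."""
    patterns = list(seqs)
    if not patterns:
        return ""
    while len(patterns) > 1:
        merged = [lcs(patterns[i], patterns[i + 1])
                  for i in range(0, len(patterns) - 1, 2)]
        if len(patterns) % 2:  # an unpaired pattern is carried forward
            merged.append(patterns[-1])
        patterns = merged
    return patterns[0]


sequences = ["RVQRAYSYVN", "PLMRAYSIAS", "LVIRPYTPVS", "LCMRAYTPTS", "EKLRLYSIAS"]
common = sequence_driven(sequences)
```

The result is a subsequence of every input, but it depends on the pairing order, so a different grouping can give a different pattern.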

42
Sequence driven approach
s1
p1
s2
p4
s3
p2
s4
p3
s5
43
Pattern driven approach

Given a set of examples E
Set pattern P = ø
While (match_all(P, E) = true) do
P = P extended by c
Return P
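
One way to read this loop is as a greedy extension: start from the empty pattern and keep adding a component while the extended pattern still matches every example. The sketch below assumes that reading, treats patterns as substrings, and returns the last pattern that matched all examples. The names `extend_pattern` and `examples` are illustrative.

```python
def extend_pattern(examples, alphabet):
    """Greedily grow a substring pattern that matches all examples."""
    if not examples:
        return ""
    pattern = ""
    while True:
        for c in alphabet:
            # Keep the first extension that still matches every example
            if all(pattern + c in e for e in examples):
                pattern += c
                break
        else:
            # No extension matches all examples: stop
            return pattern


# Made-up DNA fragments
examples = ["ACGTAC", "TTACGA", "GACGTT"]
greedy = extend_pattern(examples, "ACGT")
```

Because the first workable extension is always taken, the result depends on the order of the alphabet and is not guaranteed to be the longest common pattern.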

44
Topological pattern discovery (pattern extension
and repeated matching)
Repeat
Works (in theory) on set of any size
- Find new sheet
- Extend current sheet
- Find circuits
45
Rating patterns

Size (e.g. number of characters)
Compression
measure of how much of each of the items in the
learning set is described
Sensitivity, Specificity etc.
requires evaluation against learning (training)
and test sets

46
Compression

Send the pattern once, and
Send the uncovered parts of each structure

Pattern
Domain 1
Domain 2
Special case: with 2 examples, compression gives
comparison
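
The idea can be made concrete by counting what has to be sent. In the sketch below strings stand in for structures and a pattern covers an item when it occurs in it as a substring; both are assumptions made for illustration, as is the function name.

```python
def compressed_size(pattern, items):
    """Size of sending the pattern once plus the uncovered part of each item."""
    size = len(pattern)  # the pattern is sent once
    for item in items:
        covered = len(pattern) if pattern in item else 0
        size += len(item) - covered  # only the uncovered part is sent
    return size


items = ["ACGTAC", "TTACGA", "GACGTT"]
raw = compressed_size("", items)       # no pattern: everything is sent
with_pattern = compressed_size("ACG", items)
```

A pattern that is shared by more items, or that covers more of each item, gives a smaller total and so rates better.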
47
TOPS: Protein topology
David Gilbert, Juris Viksna, Gilleain Torrance (BRC, Glasgow),
David Westhead and Ioannis Michalopoulos (Leeds)
BBSRC/EPSRC funded
48
TOPS website
http://www.tops.leeds.ac.uk/
49
Topological structure comparison 1 against all
Structure comparison server: http://tops.ebi.ac.uk/tops
50
Comparing structures - NADP binding domains
dihydrofolate reductase
51
Dendrogram of comparisons
Pairwise comparison, all x all: n (n-1) / 2
14 x 13 / 2 = 91; hierarchical clustering
52
Pairwise comparison

Pairwise comparison, all x all:
n (n-1) / 2
14,000 protein domains: approximately 10^8 comparisons
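
The count follows directly from the formula n (n-1) / 2. A one-function sketch, with an illustrative function name:

```python
def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs among n items: n (n-1) / 2."""
    return n * (n - 1) // 2


small = pairwise_comparisons(14)      # 14 structures
large = pairwise_comparisons(14000)   # 14,000 protein domains
```

For 14,000 domains this gives 97,993,000 comparisons, which is roughly 10^8.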

53
Coverage vs Error
54
Topological representation of transcriptional
regulatory regions
Olivier Sand, osand_at_brc.dcs.gla.ac.uk
55
Hierarchical Machine Learning of Patterns for
Characterising Protein Families
Aik Choon TAN actan_at_brc.dcs.gla.ac.uk
Research aim: to construct a novel approach to
induce invariant relationships between
distributed heterogeneous biological databases
using knowledge discovery and hierarchical
machine learning techniques.
56
Biological Data: Distributed and Heterogeneous!!
57
Development of Twilight Friendly Software Using
a Rule-Based System that Provides Functional
Annotation with Varying Levels of Uncertainty
Ali Al-Shahib www.brc.dcs.gla.ac.uk/alshahib
alshahib_at_brc.dcs.gla.ac.uk
If a gene X has a weak psi-blast assignment to a
function F and that gene is in an expression
cluster and sufficient members of that cluster
are known to have function F, then believe
assignment of F to X (for some suitable values
of weak and sufficient).
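
The rule can be sketched as a function with "weak" and "sufficient" as explicit thresholds. Everything below is hypothetical: the slides give no score scale or threshold values, so `hit_score`, the default values 0.3 and 0.5, and the function name are placeholders chosen only to make the example run.

```python
def believe_assignment(hit_score, cluster_functions, function,
                       weak=0.3, sufficient=0.5):
    """Decide whether to believe the assignment of `function` to a gene.

    hit_score: strength of the psi-blast assignment (hypothetical 0..1 scale).
    cluster_functions: known functions of the other genes in the same
        expression cluster, with None for genes of unknown function.
    """
    if hit_score < weak or not cluster_functions:
        return False
    support = sum(1 for f in cluster_functions if f == function)
    return support / len(cluster_functions) >= sufficient


# Three of four cluster members are known to have the function
decision = believe_assignment(0.4, ["kinase", "kinase", None, "kinase"], "kinase")
```

Keeping the two thresholds as parameters leaves room to tune them, or to replace the hard cut-offs with weights.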

WEIGHTED LOGICAL RULES
Rules that will limit the measures of uncertainty

58
Data storage, integration and indexing
Ela Hunt ela_at_brc.dcs.gla.ac.uk
External databases: NCBI, Ensembl, TIGR, Mouse database,
Drosophila database, Mouse microarrays, chromosome 5 db
Experimental data: images, images converted to
numbers or strings
59
Indexing
Ela Hunt ela_at_brc.dcs.gla.ac.uk

String indexing structures can be used to index
DNA, proteins, XML and phylogenetic trees
All data is read once, index is created on disk
Index reduces the search space of the query (we
read only part of the disk)
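
How an index narrows the search can be shown with a small in-memory k-mer index. This is a simpler stand-in, not the on-disk indexing structure the slide refers to; the function names and the example string are illustrative.

```python
from collections import defaultdict


def build_index(text, k):
    """Read the text once and record the positions of every k-mer."""
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)
    return index


def find(text, index, k, query):
    """Find all occurrences of query (length >= k) by checking indexed positions only."""
    hits = []
    for pos in index.get(query[:k], []):
        if text.startswith(query, pos):
            hits.append(pos)
    return hits


dna = "ACGTACGTTACG"
kmer_index = build_index(dna, 3)
positions = find(dna, kmer_index, 3, "ACGT")
```

Only the positions listed under the first k-mer of the query are examined, instead of scanning the whole sequence.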

60
New database technologies for storing the output
from high-throughput biological experiments
Andrew Jones

Proteomics: the study of the set of proteins
expressed in a sample
Complex, variable output
High-Resolution images
Numerical data generated by lab. equipment and
software
Human Annotation
The data is not suitable for storage in a
standard relational database
Storage, retrieval and exchange of data is
important
XML (Extensible Markup Language) is being
investigated for storing such data

61
SAILS: SELF ADJUSTING INDEXES FOR LARGE
SEQUENCES
Nigel Harding hardingn_at_dcs.gla.ac.uk

EPSRC, 30 months from August 2001
Malcolm Atkinson P.I.
Developing indexes for very large collections of
reference data (e.g. mammalian genomes) that will
tune themselves automatically in response to the
queries being submitted against them.
However, before introducing self-adjustment we
need to establish how query performance depends
upon index structure.

62
Loading GenBank files; viewing gene information
BugView - A Genome Visualization Tool
David P. Leader d.leader_at_bio.gla.ac.uk
63
Comparing Genes from different Bacterial Genomes
64
Molecular Evolution: A Phylogenetic Approach
Rod Page r.page_at_bio.gla.ac.uk
Locating genome duplications. Q: did one or more
genome-wide events affect all gene families?
65
Supertrees: combining small trees into one large
tree
Input: k trees
Q: can we do this in polynomial time?
supertree
graph
all minimum cuts
66
Data complexity: Methionine Biosynthesis in E. coli
67
Biochemical networks

Pathway navigation
Pathway comparison
Pathway motif discovery
Pathway simulation
High-level abstraction inferred from low-level
descriptions
Novel pathways from gene expression experiments

68
Biochemical Pathway Simulator: A Software Tool
for Simulation and Analysis of Biochemical Networks
DTI Beacon project, 0.9M, 4 years

Muffy Calder muffy_at_dcs.gla.ac.uk
David Gilbert drg_at_brc.dcs.gla.ac.uk
Walter Kolch wkolch_at_beatson.gla.ac.uk
Keith van Rijsbergen keith_at_dcs.gla.ac.uk
Brian Ross rossbs_at_dcs.gla.ac.uk

69
Complexity: real bioinformatics
Closing the loop from wet lab to in-silico
Abstract model
Human feedback (in-the-loop)
Simulator
Database MAPK
Lab MAPK
Analysis
Web portal
DATA
User Interface
Pathway Editor
Rules
Database Apoptosis
Text miner
Simulator: Calder, Ross (concurrency theory)
Bioinformatics: Gilbert (tools, database, interface)
Bio: Kolch (lab/literature)
70
Distributed databases and computation
Cardiovascular Functional Genomics

-5.4 million project, 5 UK Universities.
Combined studies
scientific models of disease (Rat)
parallel studies of patients
large family and population DNA collections
3 pronged approach
Targeted transcript sequencing
Microarray gene expression profiling
Comparative genome analysis.
Data generated at each of the 5 sites made
available for analysis
Issues of distributed data and computation.
Mapping gene sequences Rat ↔ Mouse ↔ Human
an added layer of complexity in the computation.

71
Cardiovascular Functional Genomics

Glasgow: Anna Dominiczak, John Connell, David
Gilbert, Malcolm Atkinson, Ela Hunt
Leicester: Nilesh Samani, Richard Trembath, Paul
Burton
Edinburgh: John Mullins, Ian Wilmut, Jonathon
Seckl
Imperial/MRC Clinical Sciences Centre: Timothy
Aitman, James Scott, Helen Causton
Oxford: Dominique Gauguier, Hugh Watkins
Maastricht: Henry Struijker Boudier

72
Distributed data and computation
73
Wellcome Trust Cardiovascular Functional
Genomics
74
BRIDGES: BioMedical Research Informatics
Delivered by Grid Enabled Services

National e-Science Centre, Bioinformatics
Research Centre, IBM UK Life Sciences
Incrementally develop and explore database
integration over 6 geographically distributed
research sites within the framework of the large
Wellcome Trust biomedical research project
Cardiovascular Functional Genomics.
Three classes of integration will be developed to
support a sophisticated bioinformatics
infrastructure supporting
data sources (both public and project generated),
bioinformatics analysis and visualisation tools,
research activities combining shared and private
data.
The inclusion of patient records and animal
experiment data means that privacy and access
control are particular concerns.
An exploration of index factories accelerating
sequence processing will test the hypothesis that
the Grid makes a new class of e-Science indexes
feasible. Both OGSA-DAI and IBM DiscoveryLink
technology will be employed and a report will
identify how each performed in this context.

75
The Scottish Bioinformatics Forum (SBF)

Network of Bioinformatics researchers and
industries in Scotland
A vehicle for developing Scotland as a Centre of
Bioinformatics Excellence
Nodes in Glasgow, Edinburgh, Dundee, Aberdeen,
...
Promoting collaborative research
Development of a Bioinformatics educational
programme
www.sbforum.org, sbforum-general_at_sbforum.org

76
The Future
GPCVIII, ECCB04 and ISMB04 at Glasgow
Scottish Bioinformatics Forum (SBF)
