Title: The Aim
1- The Aim
- Integrate genomic mapping data across different
data sources on the Internet
- To allow exploration of mapping information
- To discover or predict new information
- By exploiting Conservation of Synteny
across species boundaries.
2- The Problems
- Lots of different types of mapping data
- from different types of experiments
- in various species
- different locations / computer systems
- different representations of data
- different terminology
- different storage formats
- data of variable quality
- et cetera
3- (Part Of) The Solution
- Define a language/terminology to describe and
represent mapping data unambiguously
- Capturing both the meaning of the data and the
relationships represented in the data.
- Use this language as a common representation for
exchanging and querying data and providing
results.
- Perhaps use the semantics captured in this
language to automatically discover new
information.
4- What is an Ontology?
[Figure: Increasing Formality of Ontology]
5- Defining the ComparaGRID Domain Ontology
- Ontology: a (more or less) 'formal' specification
of a domain of knowledge
- (Here: Genomic Mapping data across all
species)
- What types of concepts are there (defined terms,
things we need to talk about)
- And how these concepts might (or necessarily
must) relate to each other
- Can be used to control the vocabulary used for
storing or describing data
- Can represent Formal Logics that allow 'reasoning'
about data (software can check the validity of
data and deduce new information).
6- Formal Logics. What?
- If we 'know' that
- 1. Concept A is related to Concept B: α → β
- 2. Concept B is related to Concept C: β → γ
- Can we deduce/reason anything about possible
relationships between Concepts A and C? α ??? γ
7- Formal Logics. Why?
- For example
- If in species A
- Gene A is syntenic with Gene B: α → β
- Gene B is syntenic with Gene C: β → γ
- Gene C is syntenic with Gene D: γ → δ
- If we have defined Synteny to be Transitive
- We can deduce that
- Gene A is syntenic with Gene D: α → δ
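The deduction on this slide can be sketched in a few lines of Python. This is only an illustration, assuming synteny assertions are stored as ordered pairs of gene names; the function name `transitive_closure` and the gene labels are made up for the example and are not part of the ComparaGRID software.

```python
from itertools import product


def transitive_closure(pairs):
    """Return the smallest transitive relation that contains `pairs`."""
    closure = set(pairs)
    while True:
        # Chain every (a, b) with every (b, d) to propose (a, d).
        new_pairs = {
            (a, d)
            for (a, b), (c, d) in product(closure, repeat=2)
            if b == c and (a, d) not in closure
        }
        if not new_pairs:
            return closure
        closure |= new_pairs


# Asserted facts from the slide: A~B, B~C, C~D (all within one species).
asserted = {("GeneA", "GeneB"), ("GeneB", "GeneC"), ("GeneC", "GeneD")}
inferred = transitive_closure(asserted) - asserted
print(sorted(inferred))
# [('GeneA', 'GeneC'), ('GeneA', 'GeneD'), ('GeneB', 'GeneD')]
```

The pair ("GeneA", "GeneD") is never stated in the input; it appears only because the relation is treated as transitive, which is the point the slide makes.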
8- Formal Logics. Why?
- More complex example
- If in species A
- Gene A1 is syntenic with Gene A2: α1 → α2
- Gene A2 is syntenic with Gene A3: α2 → α3
- And in a somewhat related Species B
- Gene B2 is syntenic with Gene B3: β2 → β3
- Gene B3 is syntenic with Gene B4: β3 → β4
- And sequence comparisons establish that A2/B2,
A3/B3 and A4/B4 are orthologues.
- We might be able to postulate that an orthologue
of A1 might be found syntenic with B2, B3 and B4,
and that A4 might be syntenic with A1, A2 and A3.
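A rough sketch of this cross-species reasoning follows. It assumes each species contributes one set of genes known to be syntenic, and that orthologues are given as a simple dictionary from species A genes to species B genes. The function `predict_candidates` and its return format are hypothetical, and the output is a list of candidates to test, not established facts.

```python
def predict_candidates(group_a, group_b, orthologues):
    """Suggest candidate relationships from conserved synteny.

    group_a, group_b: sets of genes known to be syntenic in species A / B.
    orthologues: dict mapping a species A gene to its species B orthologue.
    Returns (genes in group_a whose orthologue may lie in the group_b region,
             species A genes that may belong with group_a).
    """
    # Genes of group_a whose orthologue sits in group_b anchor the two regions.
    anchored = {gene for gene in group_a if orthologues.get(gene) in group_b}
    if not anchored:
        return [], []  # no evidence that the two regions correspond
    orthologue_expected_in_b = sorted(group_a - anchored)
    expected_in_a = sorted(
        gene_a
        for gene_a, gene_b in orthologues.items()
        if gene_b in group_b and gene_a not in group_a
    )
    return orthologue_expected_in_b, expected_in_a


species_a = {"A1", "A2", "A3"}
species_b = {"B2", "B3", "B4"}
known_orthologues = {"A2": "B2", "A3": "B3", "A4": "B4"}

print(predict_candidates(species_a, species_b, known_orthologues))
# (['A1'], ['A4'])
```

Here A1 is flagged because its orthologue may lie near B2, B3 and B4, and A4 is flagged because its orthologue B4 already sits in the corresponding region of species B.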
9- Exploiting Conserved Synteny to predict
candidate genes
10- The ComparaGRID Ontology
- The ComparaGRID ontology defines the terminology
used in the domain of Comparative Genomics, and
how this terminology can be used.
- There are two components of the ontology:
- 1. Classes (Concepts, or terms with
definitions), and
- 2. Properties (simple relationships between a
Class and a Value; the value can be another
Class, or a simple number etc.)
11- Example Concepts and Properties
Map is a Concept. It has a definition: "The
abstract (typically linear) representation of an
informational macromolecule or chromosome etc.,
allowing the positioning of identifiable markers
along the length of the map..."
hasScaleUnit is a Property.
In our ontology we can define which
Concepts can have particular Properties, and
which Concepts may be the values of particular
Properties.
Ontology Statement: Map hasScaleUnit ScaleUnit
Real Data: <RFxWL_UppsalaChromosome1> hasScaleUnit
<centiMorgan>
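To make the link between the ontology statement and the real data concrete, here is a minimal sketch of how software might check a data statement against the permitted domain and range of a property. The dictionaries `PROPERTY_RULES` and `INSTANCE_TYPES` and the function `is_valid_statement` are invented for this illustration; a real ontology would normally be expressed in an ontology language and checked by a reasoner.

```python
# Ontology level: property -> (permitted domain Concept, permitted range Concept).
PROPERTY_RULES = {
    "hasScaleUnit": ("Map", "ScaleUnit"),
}

# Data level: which Concept each individual belongs to (illustrative typing).
INSTANCE_TYPES = {
    "RFxWL_UppsalaChromosome1": "Map",
    "centiMorgan": "ScaleUnit",
    "PIGA": "Marker",
}


def is_valid_statement(subject, prop, value):
    """Check a (subject, property, value) statement against domain and range."""
    if prop not in PROPERTY_RULES:
        return False  # unknown property: not part of the controlled vocabulary
    domain, value_range = PROPERTY_RULES[prop]
    return (
        INSTANCE_TYPES.get(subject) == domain
        and INSTANCE_TYPES.get(value) == value_range
    )


print(is_valid_statement("RFxWL_UppsalaChromosome1", "hasScaleUnit", "centiMorgan"))
# True
print(is_valid_statement("PIGA", "hasScaleUnit", "centiMorgan"))
# False (a Marker is not in the domain of hasScaleUnit)
```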
12- Building the Ontology
The process of Ontology Definition involves
collecting all the terms and relationships in the
knowledge domain:
- Providing definitions for terms (Concepts)
- Classifying Concepts into related groups in a
hierarchical tree
- Defining the relationships found in the data
(Properties)
- Specifying the permitted domain and range for
these properties
- Specifying which properties are allowed, which
must always be true, and which are disallowed
13- CONCEPTS
SIMPLE RELATIONSHIPS
transcribedFrom
Microsatellite
identifier
PartOf
QuantitativeTrait
hasAbbreviation
TechniqueUsed
Chromosome
COMPLEX RELATIONSHIPS
DNADuplication
Orthology
Interval Position
Mapping
Reciprocal BestMatch
GeneticLinkageMap
14- Example: Modelling Maps
- WHAT IS A MAP? Information about the presence
and ordering of Markers on an abstract
representation of a macromolecule (DNA Molecule,
Chromosome or even a Polypeptide).
- Linkage Group: the simplest Map
- a collection or set of markers that are
inherited together, without implied order.
- i.e. the relationship between a Marker and a
Linkage Group is a Containment: the Linkage
Group contains Markers.
- A true Map
- has some sort of ordering of Markers belonging
to a Linkage Group.
- i.e. the relationship between a Marker and a Map
is a Mapping which has a Position. This
Position may be purely ordinal, or may be
a co-ordinate associated with Scale Units.
The Map maps Markers with a Position.
- A Map is a specialized type of Linkage Group
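The distinction drawn above between Containment and Mapping can be sketched with two small Python classes. This is an assumed, simplified model: the class and attribute names are illustrative and the marker names and positions are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class LinkageGroup:
    """An unordered set of markers inherited together (Containment)."""
    name: str
    markers: set = field(default_factory=set)

    def contain(self, marker: str) -> None:
        self.markers.add(marker)


@dataclass
class Mapping:
    """The relationship between a Marker and a Map; it carries a Position."""
    marker: str
    position: float


@dataclass
class Map(LinkageGroup):
    """A specialized Linkage Group whose markers also have positions."""
    scale_unit: Optional[str] = None  # None means positions are purely ordinal
    mappings: list = field(default_factory=list)

    def map_marker(self, marker: str, position: float) -> None:
        self.contain(marker)  # a mapped marker is also contained
        self.mappings.append(Mapping(marker, position))

    def ordered_markers(self) -> list:
        return [m.marker for m in sorted(self.mappings, key=lambda m: m.position)]


example_map = Map("ExampleMap", scale_unit="centiMorgan")
example_map.map_marker("M2", 12.0)
example_map.map_marker("M1", 3.5)
print(example_map.ordered_markers())  # ['M1', 'M2']
print(isinstance(example_map, LinkageGroup))  # True
```

Making `Map` a subclass of `LinkageGroup` mirrors the last bullet: anything that can be said about a Linkage Group (it contains markers) also holds for a Map, which adds ordering on top.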
15- Modelling Maps
Workshop One distinguished two types of Maps:
Physical Maps and Probabilistic Maps.
1. Physical Map: A map of the locations of
identifiable landmarks on DNA (e.g.,
restriction-enzyme cutting sites, genes),
regardless of inheritance. At highest resolution,
distance is measured in base pairs; other units
may be used. For a given genome, the
lowest-resolution physical map might be the
banding patterns on the different chromosomes;
the highest-resolution physical map of a DNA
Molecule is its complete nucleotide
sequence.
e.g. Contig Map, Cytogenetic Map,
Breakpoint Map, Deletion Map, Fingerprint Map,
Restriction Site Map, Sequence Map (Amino Acid,
DNA, RNA)
16- Modelling Maps
2. Probabilistic Map: A map of the relative
locations of markers on a chromosome, derived from
an experimental analysis tracking the propensity
of markers to be inherited together following
natural or induced chromosomal disruption, i.e.
based on some probabilistic measure of
closeness.
e.g. Genetic Linkage Map, Meiotic
Linkage Map, Radiation Hybrid Map, HAPPY Map
In addition we might represent:
3. Integrated Map: A map combining mapping data
from multiple map sources and experiments
17- The Importance of Relationships
Defining concepts is easy ;-)
In many respects defining concepts such as maps,
genes, positions, chromosomes etc. to represent the
species-specific maps in existing datasources is
straightforward. This language defines the nuts
and bolts used to represent and exchange the data
between individual datasources.
However, some concepts are problematic even within
one datasource, e.g. what is meant by a
Marker?
Even more complicated are the
Relationships that we want to express between
data, in different datasources and between
different organisms. This represents the
primary scientific challenge for ComparaGRID.
18- The Importance of Relationships
For example: A pig database records the mapping
of some marker PIGA on a map at position SSC9
30.1, and associates that marker observation with
a technique, PCR, and some reagents: primers P1
and P2 with Sequences S1 and S2.
[Figure: marker PIGA placed between positions 30
and 31 on map SSC9]
PIGA hasEvidence PCRDetection
hasReagent Primer1 (with sequence S1)
Primer2 (with sequence S2)
19- The Importance of Relationships
A cattle database records the mapping of a
marker COWX on a map at position BTA4 105.3, and
associates that marker observation with a
technique, PCR, and some reagents: primers P1 and
P2 with Sequences S1 and S2.
[Figure: marker COWX placed between positions 105
and 106 on map BTA4]
COWX hasEvidence PCRDetection
hasReagent Primer1 (with sequence S1)
Primer2 (with sequence S2)
20- The Importance of Relationships
- Pig primers P1 and P2 with Sequences S1 and S2
(detecting Marker A) are identical to Cow
primers P1 and P2 with Sequences S1 and S2
(detecting Marker X).
- What can we say about the possible relationships
between Marker A and Marker X?
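As a sketch of how such a coincidence could be detected automatically, the snippet below groups marker records from different databases by their set of primer sequences. The record layout and the function `shared_primer_candidates` are assumptions for illustration; "S1" to "S4" stand in for real sequences, and the third record is invented to show a non-matching case. A match is only a prompt for further investigation, not proof that the markers are the same.

```python
from collections import defaultdict


def shared_primer_candidates(records):
    """Group markers from different databases that use identical primer sequences."""
    by_primers = defaultdict(list)
    for record in records:
        # A frozenset ignores primer order and can be used as a dictionary key.
        key = frozenset(record["primer_sequences"])
        by_primers[key].append((record["database"], record["marker"]))
    return [
        sorted(markers)
        for markers in by_primers.values()
        if len({database for database, _ in markers}) > 1
    ]


records = [
    {"database": "pig", "marker": "PIGA", "map": "SSC9", "position": 30.1,
     "primer_sequences": ["S1", "S2"]},
    {"database": "cattle", "marker": "COWX", "map": "BTA4", "position": 105.3,
     "primer_sequences": ["S2", "S1"]},
    {"database": "cattle", "marker": "OTHER", "map": "BTA4", "position": 12.0,
     "primer_sequences": ["S3", "S4"]},  # hypothetical unrelated marker
]

print(shared_primer_candidates(records))
# [[('cattle', 'COWX'), ('pig', 'PIGA')]]
```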
21- The Importance of Relationships
- What can we discover about the relationships
between these mapping data? (And HOW can we
discover any relationships between these data?)
- Can we draw any inference from the use of
identical primer sequences and a similar
detection technique?
- Does this imply a relationship between the
cattle and pig markers?
- Does it imply homology?
- Is it evidence that they are or could be
considered the same marker?
- How good or reliable is any such inference?
- How can we represent different values/qualities
of such inferences to allow weighting of
evidence?
- How can we accumulate different strands of
evidence to establish a real relationship between
these markers and these regions of the two
genomes?
22- ComparaGRID Ontology: Classification of
Relationships
Some of the relationships that we want to capture
in our data can be represented by simple binary
properties: Concept A → Property → Concept B
- hasPosition
- hasScaleUnit
- hasProduct
- hasEvidence
- hasPart
- mappedOn
- containedOn
- hasMarker
- hasValue
- hasLatinName
23- Simple relationships can be represented as
Properties
Homo sapiens
24- ComparaGRID Ontology: Classification of
Relationships
Other relationships are more complicated, and
might link multiple concepts and have properties
attached to them. These are modelled as complex
concepts so that we can represent more details
about them:
- Mapping
- Synteny
- Orthology
- Paralogy
- Containment
- Similarity
- TaxonomicIdentification
- IsMapOf
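The difference between the two kinds of relationship can be illustrated as follows. A simple property fits in a three-part statement, whereas a complex relationship becomes an object of its own so that further details can be attached to it. The class `Orthology`, its fields, and the subject name "Human" are hypothetical choices for this sketch and do not reproduce the actual ontology definitions.

```python
from dataclasses import dataclass, field

# A simple binary property: (Concept A, Property, value).
simple_statement = ("Human", "hasLatinName", "Homo sapiens")


@dataclass
class Orthology:
    """A complex relationship modelled as a Concept in its own right."""
    gene_a: str
    gene_b: str
    evidence: list = field(default_factory=list)  # e.g. how it was established

    def add_evidence(self, description: str) -> None:
        self.evidence.append(description)


# The relationship itself can now carry properties, here its supporting evidence.
relation = Orthology(gene_a="A2", gene_b="B2")
relation.add_evidence("sequence comparison")

print(simple_statement[1])  # hasLatinName
print(relation)
# Orthology(gene_a='A2', gene_b='B2', evidence=['sequence comparison'])
```

A plain triple has nowhere to record how, or how reliably, the relationship was established; promoting the relationship to a concept provides that place.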
25- More complex relationships are modelled as
Concepts
26- Mapping is a type of Relationship
You can make a map of any DomainConcept made of
a biological informational macromolecule (DNA,
RNA, Protein...)
Any Concept that can experimentally be placed on
a Map/LinkageGroup, e.g. a Gene, Gene Product,
Genetic Variation, QTL, Phenotype, STS, EST, SNP,
nucleotide etc.
27- What's the point of all this ontological
classification etc?
- A structured classification makes it easier for
the human user to understand and navigate the
terminology.
- The meaning of terms, and how the terms relate
to each other, is more precisely captured.
- We can see how terms used in
different datasets relate to each other.
- We can integrate datasets that are described
using this common vocabulary.
- We can link data and make
inferences between species based on formalised
rules and conditions.
- Automatic classification
and reasoning about data is feasible.