Title: Semantic empowerment of Life Science Applications October 2006
1Semantic empowerment of Life Science
Applications, October 2006
- Amit Sheth
- LSDIS Lab, Department of Computer Science,
- University of Georgia
Acknowledgement: NCRR-funded Bioinformatics of
Glycan Expression; collaborators and partners at
CCRC (Dr. William S. York); and Satya S. Sahoo,
Cartic Ramakrishnan, Christopher Thomas, Cory
Henson.
2Computation, data and semantics in life sciences
- "The development of a predictive biology will
likely be one of the major creative enterprises
of the 21st century." Roger Brent, 1999
- "The future will be the study of the genes and
proteins of organisms in the context of their
informational pathways or networks." L. Hood,
2000
- "Biological research is going to move from being
hypothesis-driven to being data-driven." Robert
Robbins
- "We'll see over the next decade complete
transformation (of life science industry) to very
database-intensive as opposed to wet-lab
intensive." Debra Goldfarb
- We will show how semantics is a key enabler for
achieving the above predictions and visions, in
which information and process play a critical role.
3Semantic Web and Life Science
- Data captured per year: 1 exabyte (10^18 bytes)
(Eric Neumann, Science, 2005)
- How much is that?
- Compare it to the estimate of the total words
ever spoken by humans: 12 exabytes
- Death by data
- The need for
- Search
- Integration
- Analysis, decision support
- Discovery
Not data, but analysis and insight, leading to
decisions and discovery
4Semantic empowerment of Life Science Applications
- Life Science research today deals with highly
heterogeneous as well as massive amounts of data
distributed across the world.
- We need more automated ways for integration and
analysis leading to insight and discovery
- to understand cellular components, molecular
functions and biological processes, and more
importantly complex interactions and
interdependencies between them.
5Benefits of Semantics
- Development of large domain-specific knowledge
- for reference, common nomenclature, tagging
- Integration of heterogeneous multi-source data:
biomedical documents (text), scientific/experimental
data and structured databases
- Semantic search, browsing, integration, analysis,
and discovery
- Faster and more reliable discovery leading to
quality of life improvements
6What is semantics and the Semantic Web
- Meaning and use of data
- From syntax and structure to semantics (beyond
formatting, organization, query interfaces, ...)
- XML -> RDF -> OWL -> Rules -> Trust
- Ontologies at the heart of Semantic Web,
capturing agreement and domain knowledge
- (Automatic) Semantic annotation, reasoning,
- Also, increasing use of Service-Oriented
Architecture -> semantic Web services
- W3C SW for Health Care and Life Sciences
7Semantic empowerment of Life Science Applications
- This talk will demonstrate some of the efforts
in
- Building large (populated) life science
ontologies (GlycO, ProPreO)
- Gathering/extracting knowledge and metadata:
entity and relationship extraction from
unstructured data, automatic semantic annotation
of scientific/experimental data (e.g., mass
spectrometry)
- Semantic web services and registries, leading to
better discovery/reuse of scientific tools and
their composition
- Ontology-driven applications developed
8Semantic Applications
- Active Semantic Medical Records (Demo): an
operational health care application using
multiple ontologies, semantic annotations and
rule-based decision support
- Semantic Browser (Demo): contextual browsing of
PubMed aided by ontology and schema (in future
instance) level relationships
- N-glycosylation process: an example of scientific
workflow
- Integrated Semantic Information and Knowledge
System (ISIS): integrated access and analysis of
structured databases, scientific literature and
experimental data
- Others we will not discuss: SemBowser, SemDrug,
...
Let us start with a couple of simple applications
9Life Science Ontologies
- GlycO
- An ontology for structure and function of
Glycopeptides
- 573 classes, 113 relationships
- Published through the National Center for
Biomedical Ontology (NCBO)
- ProPreO
- An ontology for capturing process and lifecycle
information related to proteomic experiments
- 398 classes, 32 relationships
- 3.1 million instances
- Published through the National Center for
Biomedical Ontology (NCBO) and Open Biomedical
Ontologies (OBO)
10N-Glycosylation metabolic pathway
GNT-I attaches GlcNAc at position 2
11GlycO ontology
- Challenge: model hundreds of thousands of
complex carbohydrate entities
- But, the differences between the entities are
small (e.g., just one component)
- How to model all the concepts but preclude
redundancy -> ensure maintainability, scalability
12GlycoTree
N. Takahashi and K. Kato, Trends in Glycosciences
and Glycotechnology, 15: 235-251
13EnzyO
- The enzyme ontology EnzyO is highly intertwined
with GlycO. While its structure is mostly that
of a taxonomy, it is highly restricted at the
class level and hence allows for comfortable
classification of enzyme instances from multiple
organisms
- GlycO together with EnzyO contains all the
information that is needed for the description of
metabolic pathways
- e.g. N-Glycan Biosynthesis
14Pathway representation in GlycO
Pathways do not need to be explicitly defined in
GlycO. The residue-, glycan-, enzyme- and
reaction descriptions contain all the knowledge
necessary to infer pathways.
15Zooming in a little
The N-Glycan with KEGG ID 00015 is the substrate
to the reaction R05987, which is catalyzed by an
enzyme of the class EC 2.4.1.145.
The product of this reaction is the Glycan with
KEGG ID 00020.
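The idea that pathways can be inferred from reaction descriptions rather than stored explicitly can be sketched in a few lines. This is a minimal illustration in plain Python, not how GlycO itself performs inference. The first reaction uses the identifiers from this slide; the second reaction and all of its identifiers are hypothetical and exist only to give the chain a second link. The sketch also assumes at most one reaction per substrate.

```python
from collections import namedtuple

# One record per reaction description: which glycan goes in, which comes out,
# and which enzyme class catalyzes the step.
Reaction = namedtuple("Reaction", "reaction_id substrate product enzyme_class")

REACTIONS = [
    # From the slide: glycan 00015 -> glycan 00020 via R05987 (EC 2.4.1.145).
    Reaction("R05987", "00015", "00020", "EC 2.4.1.145"),
    # Hypothetical follow-on step (all identifiers invented for illustration).
    Reaction("R_HYPOTHETICAL", "00020", "G_HYPOTHETICAL", "EC_HYPOTHETICAL"),
]


def infer_pathway(start, reactions):
    """Follow substrate -> product links from `start` until no reaction applies."""
    by_substrate = {r.substrate: r for r in reactions}  # assumes one reaction per substrate
    steps, current, seen = [], start, set()
    while current in by_substrate and current not in seen:  # `seen` guards against cycles
        seen.add(current)
        step = by_substrate[current]
        steps.append(step)
        current = step.product
    return steps


for step in infer_pathway("00015", REACTIONS):
    print(step.substrate, "--", step.reaction_id, "/", step.enzyme_class, "->", step.product)
```

No pathway object appears anywhere in the data; the ordered list of steps falls out of the substrate and product fields alone.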
16GlycO population
- Multiple data sources used in populating the
ontology
- KEGG - Kyoto Encyclopedia of Genes and Genomes
- SWEETDB
- CARBANK Database
- Each data source has a different schema for storing
data
- There is significant overlap of instances in the
data sources
- Hence, entity disambiguation and a common
representational format are needed
17Ontology population workflow
18Ontology population workflow
Asn(41)b-D-GlcpNAc (41)b-D-GlcpNAc
(41)b-D-Manp (31)a-D-Manp (21)b-D-GlcpNAc
(41)b-D-GlcpNAc(61)a-D-Manp (21)b-D-GlcpNAc
19Ontology population workflow
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
          <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man">
          <residue link="2" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
          </residue>
        </residue>
      </residue>
    </residue>
  </residue>
</Glycan>
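As a rough illustration of why the nested XML form is a convenient intermediate step, the sketch below walks a glycan document of this shape and lists every parent-to-child linkage. It uses only the Python standard library. The embedded XML is an abbreviated fragment of the structure on this slide (the terminal GlcNAc residues are left out for brevity), and the function name `linkages` is an assumption made for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated fragment in the same element/attribute vocabulary as the slide.
GLYCAN_XML = """
<Glycan>
  <aglycon name="Asn"/>
  <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
    <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="GlcNAc">
      <residue link="4" anomeric_carbon="1" anomer="b" chirality="D" monosaccharide="Man">
        <residue link="3" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man"/>
        <residue link="6" anomeric_carbon="1" anomer="a" chirality="D" monosaccharide="Man"/>
      </residue>
    </residue>
  </residue>
</Glycan>
"""


def linkages(element, parent_name):
    """Yield (parent, link position, anomer, child) for each residue, depth first."""
    for residue in element.findall("residue"):  # direct children only
        name = residue.get("monosaccharide")
        yield (parent_name, residue.get("link"), residue.get("anomer"), name)
        yield from linkages(residue, name)


root = ET.fromstring(GLYCAN_XML.strip())
aglycon = root.find("aglycon").get("name")
edges = list(linkages(root, aglycon))
for parent, link, anomer, child in edges:
    print(f"{parent} <-({link}, {anomer})- {child}")
```

Each tuple could then be turned into an ontology instance or checked against existing instances to avoid duplicates.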
20Ontology population workflow
21ProPreO ontology
- Two aspects of glycoproteomics
- What is it? -> identification
- How much of it is there? -> quantification
- Heterogeneity in data generation process,
instrumental parameters, formats
- Need data and process provenance ->
ontology-mediated provenance
- Hence, ProPreO models both the glycoproteomics
experimental process and attendant data
22ProPreO population: transformation to RDF
[Figure labels: Scientific Data; Computational Methods; Ontology instances]
23ProPreO population: transformation to RDF
[Figure: Scientific Data and Computational Methods. Protein Data (amino-acid sequence) follows a Protein Path and a Peptide Path. Computational methods shown: Extract Peptide Amino-acid Sequence from Protein Amino-acid Sequence; Calculate Chemical Mass; Calculate Monoisotopic Mass; Determine N-glycosylation Consensus. Their outputs are Chemical Mass RDF, Monoisotopic Mass RDF and Amino-acid Sequence RDF, assembled into Protein RDF (chemical mass, monoisotopic mass, amino-acid sequence, N-glycosylation consensus) and Peptide RDF (chemical mass, monoisotopic mass, amino-acid sequence, N-glycosylation consensus, parent protein).]
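The computational steps named on this slide (monoisotopic mass, N-glycosylation consensus, RDF output) can be sketched as follows. This is an illustrative sketch, not the ProPreO population code. It assumes the usual N-X-S/T sequon with X not proline as the consensus, standard monoisotopic residue masses, a hypothetical namespace `http://example.org/propreo#`, and hypothetical property names; the peptide `ANVTK` is an invented example.

```python
import re

# Standard monoisotopic residue masses (Da); one water is added per peptide.
MONOISOTOPIC = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
NS = "http://example.org/propreo#"  # hypothetical namespace, not the real ProPreO IRI


def monoisotopic_mass(sequence):
    """Sum of residue masses plus one water molecule."""
    return sum(MONOISOTOPIC[aa] for aa in sequence) + WATER


def n_glycosylation_sites(sequence):
    """1-based positions of N in an N-X-S/T sequon, where X is not proline."""
    # The lookahead lets overlapping sequons be reported independently.
    return [m.start() + 1 for m in re.finditer(r"N(?=[^P][ST])", sequence)]


def peptide_triples(peptide_id, sequence, parent_protein_id):
    """Serialize the computed properties as N-Triples lines (property names are assumptions)."""
    subject = f"<{NS}{peptide_id}>"
    has_sequon = "true" if n_glycosylation_sites(sequence) else "false"
    return [
        f'{subject} <{NS}has_amino_acid_sequence> "{sequence}" .',
        f'{subject} <{NS}has_monoisotopic_mass> "{monoisotopic_mass(sequence):.5f}" .',
        f'{subject} <{NS}has_n_glycosylation_consensus> "{has_sequon}" .',
        f"{subject} <{NS}has_parent_protein> <{NS}{parent_protein_id}> .",
    ]


for line in peptide_triples("peptide_1", "ANVTK", "protein_1"):
    print(line)
```

Chemical (average) mass would follow the same pattern with a table of average residue masses.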
24Semantic empowerment of Life Science Applications
- This talk will demonstrate some of the efforts
in
- building large life science ontologies (GlycO, an
ontology for structure and function of
Glycopeptides, and ProPreO, an ontology for
capturing process and lifecycle information
related to proteomic experiments) and their
application in advanced ontology-driven semantic
applications
- entity and relationship extraction from
unstructured data, automatic semantic annotation
of scientific/experimental data (e.g., mass
spectrometry), and resulting capability in
integrated access and analysis of structured
databases, scientific literature and experimental
data
- semantic web services and registries, leading to
better discovery/reuse of scientific tools and
composition of scientific workflows that process
high-throughput data and can be adaptive
- semantic applications developed
25Relationship extraction from unstructured
data (other related research: biological entity
extraction)
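26Overview
[Figure: UMLS schema level: "Biologically active substance" is linked to "Disease or Syndrome" by the relationships affects, complicates and causes. MeSH instance level: Fish Oils (instance_of Biologically active substance) and Raynaud's Disease (instance_of Disease or Syndrome), with the relationship between the two instances unknown (???????). PubMed document counts shown: 9284 documents, 4733 documents, 5 documents.]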
27About the data used
- UMLS: a high-level schema of the biomedical
domain
- 136 classes and 49 relationships
- Synonyms of all relationships using variant
lookup (tools from NLM)
- MeSH
- Terms already asserted as instances of one or more
classes in UMLS
- PubMed
- Abstracts annotated with one or more MeSH terms
T147: effect, induce, etiology, cause, effecting, induced
28Example PubMed abstract (for the domain expert)
29Method: Parse Sentences in PubMed
SS-Tagger (University of Tokyo)
SS-Parser (University of Tokyo)
(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ
endogenous) (CC or) (JJ exogenous) ) (NN
stimulation) ) (PP (IN by) (NP (NN estrogen) ) )
) (VP (VBZ induces) (NP (NP (JJ adenomatous) (NN
hyperplasia) ) (PP (IN of) (NP (DT the) (NN
endometrium) ) ) ) ) ) )
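A bracketed parse like the one above can be processed mechanically. The sketch below reads the bracketed string into a nested structure and pulls out the subject phrase, the verb and the object phrase of the top-level sentence. It is a simplified stand-in for the entity and relationship identification steps, not the method used in the talk: it handles only a plain S -> NP VP sentence, and the function names are assumptions made for this example.

```python
SENTENCE = """(TOP (S (NP (NP (DT An) (JJ excessive) (ADJP (JJ endogenous)
(CC or) (JJ exogenous) ) (NN stimulation) ) (PP (IN by) (NP (NN estrogen) ) ) )
(VP (VBZ induces) (NP (NP (JJ adenomatous) (NN hyperplasia) ) (PP (IN of)
(NP (DT the) (NN endometrium) ) ) ) ) ) )"""


def parse_tree(text):
    """Parse a Penn Treebank style bracketed string into (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])  # a word under a part-of-speech tag
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """Return the words covered by a node, left to right."""
    if isinstance(node, str):
        return [node]
    words = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def extract_svo(tree):
    """Return (subject phrase, verb, object phrase) for a simple S -> NP VP sentence."""
    sentence = next(c for c in tree[1] if c[0] == "S")
    subject = next(c for c in sentence[1] if c[0] == "NP")
    verb_phrase = next(c for c in sentence[1] if c[0] == "VP")
    verb = next(c for c in verb_phrase[1] if c[0].startswith("VB"))
    obj = next(c for c in verb_phrase[1] if c[0] == "NP")
    return " ".join(leaves(subject)), " ".join(leaves(verb)), " ".join(leaves(obj))


print(extract_svo(parse_tree(SENTENCE)))
```

In the setting described in this talk, the verb ("induces") would then be matched against relationship synonyms such as those listed for T147, and the noun phrases against MeSH terms.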
30Method: Identify entities and Relationships in
Parse Tree
31Method: Identify entities and Relationships in
Parse Tree
32Method: Fact Extraction from Parse Tree
33Semantic annotation of scientific/experimental
data
34ProPreO Ontology-mediated provenance
Mass Spectrometry (MS) Data: ms/ms peaklist data
parent ion m/z    parent ion abundance    parent ion charge
830.9570          194.9604                2
fragment ion m/z  fragment ion abundance
580.2985          0.3592
688.3214          0.2526
779.4759          38.4939
784.3607          21.7736
1543.7476         1.3822
1544.7595         2.9977
1562.8113         37.4790
1660.7776         476.5043
35ProPreO Ontology-mediated provenance
<ms-ms_peak_list>
  <parameter instrument="micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer" mode="ms-ms"/>
  <parent_ion m-z="830.9570" abundance="194.9604" z="2"/>
  <fragment_ion m-z="580.2985" abundance="0.3592"/>
  <fragment_ion m-z="688.3214" abundance="0.2526"/>
  <fragment_ion m-z="779.4759" abundance="38.4939"/>
  <fragment_ion m-z="784.3607" abundance="21.7736"/>
  <fragment_ion m-z="1543.7476" abundance="1.3822"/>
  <fragment_ion m-z="1544.7595" abundance="2.9977"/>
  <fragment_ion m-z="1562.8113" abundance="37.4790"/>
  <fragment_ion m-z="1660.7776" abundance="476.5043"/>
</ms-ms_peak_list>
[Figure labels: Ontological Concepts; Semantically Annotated MS Data]
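The step from a raw ms/ms peak list to the annotated form can be sketched with the Python standard library. This is a minimal illustration assuming the layout used in this example: the first line holds parent ion m/z, abundance and charge, and every later line holds one fragment ion m/z and abundance. Element and attribute names follow the annotated example on this slide; the function name and the shortened peak list are choices made for this sketch.

```python
import xml.etree.ElementTree as ET

PEAK_LIST = [
    "830.9570 194.9604 2",   # parent ion: m/z, abundance, charge
    "580.2985 0.3592",       # fragment ions: m/z, abundance
    "688.3214 0.2526",
    "779.4759 38.4939",
]

INSTRUMENT = "micromass_QTOF_2_quadropole_time_of_flight_mass_spectrometer"


def annotate_peak_list(lines, instrument, mode="ms-ms"):
    """Wrap raw peak-list values in elements named after ontology concepts."""
    parent_mz, parent_abundance, charge = lines[0].split()
    root = ET.Element("ms-ms_peak_list")
    ET.SubElement(root, "parameter", {"instrument": instrument, "mode": mode})
    ET.SubElement(root, "parent_ion",
                  {"m-z": parent_mz, "abundance": parent_abundance, "z": charge})
    for line in lines[1:]:
        mz, abundance = line.split()
        ET.SubElement(root, "fragment_ion", {"m-z": mz, "abundance": abundance})
    return root


annotated = annotate_peak_list(PEAK_LIST, INSTRUMENT)
print(ET.tostring(annotated, encoding="unicode"))
```

Values are kept as strings so that the original precision of the instrument output is preserved in the annotation.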
36Semantic empowerment of Life Science Applications
- This talk will demonstrate some of the efforts
in
- building large life science ontologies (GlycO, an
ontology for structure and function of
Glycopeptides, and ProPreO, an ontology for
capturing process and lifecycle information
related to proteomic experiments) and their
application in advanced ontology-driven semantic
applications
- entity and relationship extraction from
unstructured data, automatic semantic annotation
of scientific/experimental data (e.g., mass
spectrometry), and resulting capability in
integrated access and analysis of structured
databases, scientific literature and experimental
data
- semantic web services and registries, leading to
better discovery/reuse of scientific tools and
composition of scientific workflows that process
high-throughput data and can be adaptive
- semantic applications developed
37N-Glycosylation Process (NGP)
38Semantic Web Process to incorporate provenance
Semantic Annotation Applications
39Converting biological information to the W3C
Resource Description Framework (RDF): Experience
with Entrez Gene
- Collaboration with Dr. Olivier Bodenreider
(US National Library of Medicine, NIH, Bethesda,
MD)
40Biomedical Knowledge Repository
Entrez
41Implementation
Entrez Gene
Entrez Gene XML
XSLT
Entrez Gene RDF graph
Entrez Gene RDF
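The pipeline on this slide uses XSLT to turn Entrez Gene XML into RDF. As a rough illustration of the same mapping, the sketch below does the conversion in plain Python instead of a stylesheet, since the standard library has no XSLT processor. The XML is a heavily simplified, hypothetical fragment (the real records nest these elements much more deeply), the namespace `http://example.org/entrezgene#` is an assumption, and only one predicate is produced.

```python
import xml.etree.ElementTree as ET

EG = "http://example.org/entrezgene#"  # hypothetical namespace for the eg: prefix

# Simplified, hypothetical fragment; real Entrez Gene XML nests these elements deeper.
SAMPLE_XML = """
<Entrezgene>
  <Gene-track_geneid>351</Gene-track_geneid>
  <Prot-ref_name_E>amyloid beta A4 protein</Prot-ref_name_E>
</Entrezgene>
"""


def entrezgene_to_triples(xml_text):
    """Map gene id and protein reference names to (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text.strip())
    gene_id = root.findtext(".//Gene-track_geneid")
    subject = f"{EG}Gene-track_geneid/{gene_id}"
    predicate = f"{EG}has_protein_reference_name_E"
    # One triple per protein reference name found anywhere in the record.
    return [(subject, predicate, name.text) for name in root.iter("Prot-ref_name_E")]


def to_ntriples(triples):
    """Serialize triples with literal objects as N-Triples lines."""
    return [f'<{s}> <{p}> "{o}" .' for s, p, o in triples]


triples = entrezgene_to_triples(SAMPLE_XML)
print("\n".join(to_ntriples(triples)))
```

An XSLT stylesheet expresses the same element-to-predicate mapping declaratively, which is easier to maintain when many element types need to be covered.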
42Web interface
ENTREZ GENE
ENTREZ GENE XML
XSLT
ENTREZ GENE RDF GRAPH
ENTREZ GENE RDF
43Implementation
44XML
45Implementation
46RDF Graph
eg:has_protein_reference_name_E
APP (geneid-351)
Alzheimer's Disease
subject
predicate
object
47RDF Graph
Entrez Gene RDF graph (W3C Validator Site -
http://www.w3.org/RDF/Validator/)
48Implementation
49RDF
50Implementation
51Connecting different genes
amyloid-beta protein
protease nexin-II
beta-amyloid peptide
APP gene Homo sapiens
A4 amyloid protein
cerebral vascular amyloid peptide
Human APP gene is implicated in Alzheimer's
disease. Which genes are functionally homologous
to this gene?
amyloid beta (A4) precursor protein (protease
nexin-II, Alzheimer disease)
amyloid beta A4 protein
amyloid beta A4 protein
amyloid beta A4 protein
APP gene Gallus gallus
APP gene Canis familiaris
amyloid protein
eg:has_protein_reference_name_E
52Inference
- Rules are objects that allow inference from RDF
data [1]
- Oracle 10g allows the creation of a rulebase based
on RDFS (RDF Schema)
amyloid beta (A4) precursor protein (protease
nexin-II, Alzheimer disease)
eg:has_protein_reference_name_E
eg:is_associated_with
eg:Gene-track_geneid/351
eg:Neurodegenerative Diseases
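How a rule derives a new triple from existing RDF data can be sketched in plain Python. This is not Oracle 10g rulebase syntax, and the rule itself is hypothetical: the slide shows the input triple and the inferred association but not the rule that connects them, so the sketch simply assumes that a protein reference name mentioning "Alzheimer disease" implies an association with Neurodegenerative Diseases.

```python
GENE = "eg:Gene-track_geneid/351"

# Asserted data, taken from the labels on this slide.
TRIPLES = {
    (GENE, "eg:has_protein_reference_name_E",
     "amyloid beta (A4) precursor protein (protease nexin-II, Alzheimer disease)"),
}


def apply_rule(triples):
    """Return the input triples plus any triples the (hypothetical) rule infers."""
    inferred = set()
    for subject, predicate, obj in triples:
        # Antecedent: the gene has a protein reference name mentioning the disease.
        if predicate == "eg:has_protein_reference_name_E" and "Alzheimer disease" in obj:
            # Consequent: associate the gene with the broader disease class.
            inferred.add((subject, "eg:is_associated_with", "eg:Neurodegenerative Diseases"))
    return triples | inferred


for triple in sorted(apply_rule(TRIPLES)):
    print(triple)
```

A rulebase in a triple store plays the same role, with antecedent and consequent written as triple patterns instead of Python conditions.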
53Integrated Semantic Information and Knowledge
System (ISIS)
Example question: Is the result erroneous? Give me all result
files from a similar organism, cell,
preparation, mass spectrometric conditions and
compare results.
[Figure: ISIS architecture. Components shown: SPARQL query-based User Interface; ProPreO ontology; Experimental Data Semantic Annotation; Metadata File; Semantic Metadata Registry; ProteomeCommons; Experimental Data. Proteomics workflow: Raw -> Raw2mzXML -> mzXML -> mzXML2Pkl -> Pkl -> Pkl2pSplit -> pSplit -> MASCOT Search -> MASCOT result -> ProVault -> ProVault result.]
54Summary, Observations, Conclusions
- We now have semantics- and services-enabled
approaches that support semantic search, semantic
integration, semantic analytics, decision support
and validation (e.g., error prevention in
healthcare), knowledge discovery, process/pathway
discovery, ...
55- http://lsdis.cs.uga.edu
- http://knoesis.org
- http://lsdis.cs.uga.edu/projects/asdoc/
- http://lsdis.cs.uga.edu/projects/glycomics/