Title: The Semantic Web: New-style data-integration (and how it works for life-scientists too!)
1The Semantic Web: New-style data-integration
(and how it works for life-scientists too!)
- Frank van Harmelen
- AI Department
- Vrije Universiteit Amsterdam
2What's the problem? (data-mess in bio-inf)
3Life Science Data
Kenneth Griffiths and Richard Resnick, Tutorial at
Intelligent Systems for Molecular Biology, 2003
Recent focus on genetic data. Genomics: the study
of genes and their function. Recent advances in
genomics are bringing about a revolution in our
understanding of the molecular mechanisms of
disease, including the complex interplay of
genetic and environmental factors. Genomics is
also stimulating the discovery of breakthrough
healthcare products by revealing thousands of new
biological targets for the development of drugs,
and by giving scientists innovative ways to
design new drugs, vaccines and DNA diagnostics.
Genomics-based therapeutics include "traditional"
small chemical drugs, protein drugs, and
potentially gene therapy.
Source: The Pharmaceutical Research and Manufacturers of America,
http://www.phrma.org/genomics/lexicon/g.html
- Study of genes and their function
- Understanding molecular mechanisms of disease
- Development of drugs, vaccines, and diagnostics
4The Study of Genes...
- Chromosomal location
- Sequence
- Sequence Variation
- Splicing
- Protein Sequence
- Protein Structure
5... and Their Function
- Homology
- Motifs
- Publications
- Expression
- HTS
- In Vivo/Vitro Functional Characterization
6Understanding Mechanisms of Disease
7Development of Drugs, Vaccines, Diagnostics
- Differing types of Drugs, Vaccines, and Diagnostics
- Small molecules
- Protein therapeutics
- Gene therapy
- In vitro, In vivo diagnostics
- Development requires
- Preclinical research
- Clinical trials
- Long-term clinical research
- All of which often feeds back into ongoing
Genomics research and discovery.
8The Industry's Problem
- Too much unintegrated data
- from a variety of incompatible sources
- no standard naming convention
- each with a custom browsing and querying mechanism (no common interface)
- and poor interaction with other data sources
9What are the Data Sources?
- Flat Files
- URLs
- Proprietary Databases
- Public Databases
- Data Marts
- Spreadsheets
- Emails
-
10Sample Problem: Hyperprolactinemia
- Overproduction of prolactin
- prolactin stimulates mammary gland development and milk production
- Hyperprolactinemia is characterized by
- inappropriate milk production
- disruption of menstrual cycle
- can lead to conception difficulty
11Understanding transcription factors for prolactin
production
Show me all genes in the public literature that
are putatively related to hyperprolactinemia,
have more than 3-fold expression differential
between hyperprolactinemic and normal pituitary
cells, and are homologous to known transcription
factors.
(Q1 ∩ Q2 ∩ Q3)
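The query on this slide combines three conditions, each typically answered by a different kind of source (literature, expression data, homology data). A minimal Python sketch of that combination as a set intersection follows; the gene names, fold-change values, and the `answer_query` helper are all hypothetical and only illustrate the shape of the problem.

```python
# Hypothetical data: gene names and numbers are made up for illustration.
literature_hits = {"GENE_A", "GENE_B", "GENE_C"}   # Q1: linked to hyperprolactinemia in the literature
expression_fold_change = {                          # input for Q2: fold change per gene
    "GENE_A": 4.2,
    "GENE_B": 1.5,
    "GENE_D": 6.0,
}
tf_homologs = {"GENE_A", "GENE_D"}                  # Q3: homologous to known transcription factors


def answer_query(literature, fold_changes, homologs, min_fold=3.0):
    """Intersect the three partial answers (Q1, Q2, Q3)."""
    # Q2: keep genes whose expression differential exceeds the threshold.
    q2 = {gene for gene, fold in fold_changes.items() if fold > min_fold}
    return literature & q2 & homologs


result = answer_query(literature_hits, expression_fold_change, tf_homologs)
print(sorted(result))
```

The intersection is only meaningful if all three sources name genes in the same way, which is the naming problem the preceding slides describe.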
12The Complexity of Biological Data
13Pharmaceutical Productivity
Source: PhRMA, FDA 2003
14Stitching this all together by hand?
Source: Stephens et al., J. Web Semantics, 2006
15The Medical tower of Babel
- MeSH
- Medical Subject Headings, National Library of Medicine
- 22,000 descriptions
- EMTREE
- Commercial (Elsevier), drugs and diseases
- 45,000 terms, 190,000 synonyms
- UMLS
- Integrates 100 different vocabularies
- SNOMED
- 200,000 concepts, College of American Pathologists
- Gene Ontology
- 15,000 terms in molecular biology
- NCI Cancer Ontology
- 17,000 classes (about 1M definitions)
16Problem with the Current WWW
17Why would Semantic Web technology help?
18machine accessible meaning (What it's like
to be a machine)
META-DATA
19What is meta-data?
- it's just data
- it's data describing other data
- it's meant for machine consumption
20Required are
- one or more standard vocabularies
- so search engines, producers and consumers all speak the same language
- a standard syntax,
- so meta-data can be recognised as such
- lots of resources with meta-data attached
- mechanisms for attribution and trust (is this page really about Pamela Anderson?)
21What are ontologies and what are they used for?
[Diagram: world, concept, language]
- No shared understanding
- Conceptual and terminological confusion
- Agree on a conceptualization
- Make it explicit in some language
- Actors: both humans and machines
22standard vocabularies (Ontologies)
- Identify the key concepts in a domain
- Identify a vocabulary for these concepts
- Identify relations between these concepts
- Make these precise enough so that they can be shared between
- humans and humans
- humans and machines
- machines and machines
23Shared content-vocabularies: Ontologies
- Formal,
- explicit specification
- of a
- shared
- conceptualisation
24Real life examples
- handcrafted
- music: CDnow (2410/5), MusicMoz (1073/7)
- biomedical: SNOMED (200k), GO (15k), Emtree (45k+190k), Systems biology
- ranging from lightweight (Yahoo, UNSPSC, Open directory (400k)) to heavyweight (Cyc (300k))
- ranging from small (METAR) to large (UNSPSC)
25Biomedical ontologies (a few..)
- MeSH
- Medical Subject Headings, National Library of Medicine
- 22,000 descriptions
- EMTREE
- Commercial (Elsevier), drugs and diseases
- 45,000 terms, 190,000 synonyms
- UMLS
- Integrates 100 different vocabularies
- SNOMED
- 200,000 concepts, College of American Pathologists
- Gene Ontology
- 15,000 terms in molecular biology
- NCI Cancer Ontology
- 17,000 classes (about 1M definitions)
26What's inside an ontology?
- terms + specialisation hierarchy
- classes + class-hierarchy
- instances
- slots/values
- inheritance (multiple? defaults?)
- restrictions on slots (type, cardinality)
- properties of slots (symm., trans., ...)
- relations between classes (disjoint, covers)
- reasoning tasks: classification, subsumption
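As a rough illustration of the class-hierarchy and subsumption items above, here is a minimal Python sketch. The class names and the hierarchy are invented for the example and are not taken from any of the vocabularies named in these slides; `ancestors` and `subsumes` are hypothetical helper names.

```python
# Toy class hierarchy: each class maps to its direct superclasses.
# A set of parents per class allows multiple inheritance.
SUBCLASS_OF = {
    "Hyperprolactinemia": {"PituitaryDisorder"},
    "PituitaryDisorder": {"EndocrineDisorder"},
    "EndocrineDisorder": {"Disease"},
    "Disease": set(),
}


def ancestors(cls, hierarchy):
    """Return every class above cls (transitive closure of the subclass relation)."""
    seen = set()
    stack = list(hierarchy.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(hierarchy.get(parent, ()))
    return seen


def subsumes(general, specific, hierarchy):
    """True if every instance of `specific` is also an instance of `general`."""
    return general == specific or general in ancestors(specific, hierarchy)


print(subsumes("Disease", "Hyperprolactinemia", SUBCLASS_OF))
print(subsumes("Hyperprolactinemia", "Disease", SUBCLASS_OF))
```

Real ontology reasoners also handle restrictions, disjointness and property characteristics; this sketch covers only the hierarchy.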
27NB: we're not doing philosophy
- Ontologies are not
- definitive descriptions of what exists in the world (philosophy)
- Ontologies are
- models of the world constructed to facilitate communication
- Yes, ontologies exist (because we build them)
28Remember: required are
- one or more standard vocabularies
- so search engines, producers and consumers all speak the same language
- a standard syntax,
- so meta-data can be recognised as such
- lots of resources with meta-data attached
29Stack of languages
30Stack of languages
- XML
- Surface syntax, no semantics
- XML Schema
- Describes structure of XML documents
- RDF
- Datamodel for relations between things
- RDF Schema
- RDF Vocabulary Definition Language
- OWL
- A more expressive Vocabulary Definition Language
31Why XML
- Structuring data in documents on the internet
- HTML not meant to store data
The Netherlands
Geography
Capital Amsterdam
(The Hague is the seat of the government)
Neighboring countries Germany, Belgium

<h2>The Netherlands</h2>
<b>Geography</b><br>
<i>Capital</i> Amsterdam <br>
(The Hague is the seat of the government)<br>
<i>Neighboring countries</i> Germany, Belgium
32Why XML - 2
- Humans understand information written in HTML
- Computers cannot work with it
- meaning of pieces of text, and their relation?
- XML makes this partly explicit by
- giving possibly meaningful names to tags
- allowing the nesting of tags (tags inside tags)
<country name="The Netherlands">
  <geography>
    <capital name="Amsterdam">
      <remark>The Hague is the seat of the government</remark>
    </capital>
    <neighboring_country>Germany</neighboring_country>
    <neighboring_country>Belgium</neighboring_country>
  </geography>
</country>
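Because the tags are named and nested, a program can pick out individual facts without guessing from layout. A small sketch using Python's standard `xml.etree.ElementTree` module on the country example (the variable names are illustrative):

```python
import xml.etree.ElementTree as ET

# The country example from the slide, embedded as a string.
DOC = """
<country name="The Netherlands">
  <geography>
    <capital name="Amsterdam">
      <remark>The Hague is the seat of the government</remark>
    </capital>
    <neighboring_country>Germany</neighboring_country>
    <neighboring_country>Belgium</neighboring_country>
  </geography>
</country>
"""

root = ET.fromstring(DOC)

# Attributes and nested elements are addressed by name, not by position on the page.
country = root.get("name")
capital = root.find("geography/capital").get("name")
neighbours = [node.text.strip() for node in root.iter("neighboring_country")]

print(country, capital, neighbours)
```

The program still does not know what "capital" means; XML supplies the structure only, which is the gap the later layers of the stack address.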
33XML data model
- document is an ordered, labeled tree, with nodes
to represent the document entity, elements,
attributes, processing instructions, and comments
country
  comment: "Should be..."
  name: "The Netherlands"
  geography
    capital
      name: "Amsterdam"
      remark: "The Hague is the seat of the government"
    neighboring country: "Germany"
    neighboring country: "Belgium"
34Structuring methods
- DTDs
- document type definitions
- traditional: inherited from SGML
- #PCDATA: parsed character data
- no other datatypes
<!ELEMENT country (geography, people, economy)>
<!ATTLIST country name CDATA #REQUIRED>
<!ELEMENT geography (capital, neighboring_country)>
<!ELEMENT capital (remark)>
<!ATTLIST capital name CDATA #REQUIRED>
<!ELEMENT remark (#PCDATA)>
<!ELEMENT neighboring_country (#PCDATA)>
...
35Structuring methods - 2
- XML Schema
- quite new (Rec. 02 May 2001)
- same function as DTD: prescribes structure
- but has some advantages
- XML Schema is XML itself
- simple datatyping
- richer grammar
- type hierarchy with derivation
<complexType name="subject">
  <element name="title" type="string"/>
  <element ref="lecture" maxOccurs="unbounded"/>
</complexType>
36XML Schema - richer grammar
- content models
- grouping, by choice, sequence or all
- cardinality
- attributes: minOccurs, maxOccurs
- defaults and constants
- attributes: default, fixed
<complexType name="WindowsType">
  <element name="version" type="string" minOccurs="0" maxOccurs="1" default="W98"/>
  <element name="includedBrowser" type="string" minOccurs="0" maxOccurs="1" fixed="Internet Explorer"/>
</complexType>
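The cardinality and fixed-value attributes can be read as simple checks on how often an element occurs and what it may contain. A rough Python sketch of that reading follows. `ElementRule` and `check` are hypothetical names invented here and are not part of any XML Schema library; real validation needs a schema processor, and the `default` attribute is deliberately left out of the sketch.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ElementRule:
    """Hypothetical stand-in for one <element> declaration."""
    name: str
    min_occurs: int = 1
    max_occurs: Optional[int] = 1   # None stands for "unbounded"
    fixed: Optional[str] = None     # if set, every occurrence must have this value


def check(rule: ElementRule, values: List[str]) -> List[str]:
    """Return human-readable violations of the rule for the observed values."""
    problems = []
    if len(values) < rule.min_occurs:
        problems.append(f"{rule.name}: fewer than {rule.min_occurs} occurrence(s)")
    if rule.max_occurs is not None and len(values) > rule.max_occurs:
        problems.append(f"{rule.name}: more than {rule.max_occurs} occurrence(s)")
    if rule.fixed is not None and any(v != rule.fixed for v in values):
        problems.append(f"{rule.name}: value must be {rule.fixed!r}")
    return problems


# Mirrors the includedBrowser declaration: optional, at most once, fixed value.
browser = ElementRule("includedBrowser", min_occurs=0, max_occurs=1,
                      fixed="Internet Explorer")
print(check(browser, []))
print(check(browser, ["Netscape"]))
```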
37Stack of languages
- XML
- Surface syntax, no semantics
- XML Schema
- Describes structure of XML documents
- RDF
- Datamodel for relations between things
- RDF Schema
- RDF Vocabulary Definition Language
- OWL
- A more expressive Vocabulary Definition Language
38RDF Triples in Life Sciences
39Bluffer's guide to RDF (1)
- Object --Attribute--> Value triples
- objects are web-resources
- Value is again an Object
- triples can be linked
- data-model = graph
40Bluffer's guide to RDF (2)
- Every identifier is a URL
- world-wide unique naming!
- Has XML syntax
-
- Any statement can be an object
- graphs can be nested
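The triple model from these two slides can be sketched in plain Python as a set of (object, attribute, value) tuples. The `http://example.org/` identifiers and the `values` and `follow` helpers are hypothetical; no RDF library is used, so this shows only the data model, not the XML syntax or nested statements.

```python
# Hypothetical identifiers under example.org; every identifier is a URL-style string.
EX = "http://example.org/"

triples = {
    (EX + "gene/PRL", EX + "encodes", EX + "protein/Prolactin"),
    (EX + "protein/Prolactin", EX + "stimulates", EX + "process/MilkProduction"),
    (EX + "gene/PRL", EX + "locatedOn", EX + "chromosome/6"),
}


def values(subject, predicate, graph):
    """All values of one attribute for one object."""
    return {o for s, p, o in graph if s == subject and p == predicate}


def follow(subject, predicates, graph):
    """Follow a chain of attributes: the value of one triple is the object of the next."""
    frontier = {subject}
    for predicate in predicates:
        frontier = {o for s in frontier for o in values(s, predicate, graph)}
    return frontier


# Which processes are stimulated by the product of this gene?
print(follow(EX + "gene/PRL", [EX + "encodes", EX + "stimulates"], triples))
```

Because a value is itself an object with a globally unique name, triples published by different sources link up into one graph simply by taking the union of the sets.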
41What does RDF Schema add?
- Defines vocabulary for RDF
- Organizes this vocabulary in a typed hierarchy
- Class, subClassOf, type
- Property, subPropertyOf
- domain, range
[Diagram: Teacher and Student are each subClassOf Person; the property supervises has a domain and a range among these classes; Frank and Marta each have a type and are linked by supervises]
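One way to see what the schema adds: with domain, range and subClassOf declared, the types of instances follow from a single supervises statement. The Python sketch below assumes a particular reading of the diagram (supervises has domain Teacher and range Student, and Frank supervises Marta); the flattened slide does not state these directions explicitly, and `infer_types` is a hypothetical helper.

```python
# Assumed schema: directions of domain/range and of the supervises link are a guess
# at the diagram, not something the slide text confirms.
SUBCLASS_OF = {"Teacher": "Person", "Student": "Person"}
DOMAIN = {"supervises": "Teacher"}   # the subject of supervises is a Teacher
RANGE = {"supervises": "Student"}    # the value of supervises is a Student

facts = [("Frank", "supervises", "Marta")]


def infer_types(statements):
    """Derive instance types from domain/range, then climb the class hierarchy."""
    types = {}
    for subject, prop, obj in statements:
        if prop in DOMAIN:
            types.setdefault(subject, set()).add(DOMAIN[prop])
        if prop in RANGE:
            types.setdefault(obj, set()).add(RANGE[prop])
    # Every instance of a class is also an instance of its superclasses.
    for classes in types.values():
        for cls in list(classes):
            while cls in SUBCLASS_OF:
                cls = SUBCLASS_OF[cls]
                classes.add(cls)
    return types


print(infer_types(facts))
```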
42Stack of languages
- XML
- Surface syntax, no semantics
- XML Schema
- Describes structure of XML documents
- RDF
- Datamodel for relations between things
- RDF Schema
- RDF Vocabulary Definition Language
- OWL
- A more expressive Vocabulary Definition Language
43OWL: things RDF Schema can't do
- equality
- enumeration
- number restrictions
- Single-valued/multi-valued
- Optional/required values
- inverse, symmetric, transitive
- boolean algebra
- Union, complement
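To make the inverse, symmetric and transitive items concrete, here is a small Python sketch that derives the extra statements such property declarations license. The property names (`partOf`, `hasPart`, `interactsWith`) and the `close` function are illustrative assumptions; an OWL reasoner does considerably more than this fixed-point loop.

```python
def close(facts, symmetric=(), transitive=(), inverse_of=None):
    """Add statements implied by symmetric, transitive and inverse properties
    until nothing new can be derived."""
    inverse_of = inverse_of or {}
    facts = set(facts)
    while True:
        new = set()
        for s, p, o in facts:
            if p in symmetric:
                new.add((o, p, s))
            if p in inverse_of:
                new.add((o, inverse_of[p], s))
            if p in transitive:
                # Chain s -p-> o -p-> o2 into s -p-> o2.
                for s2, p2, o2 in facts:
                    if p2 == p and s2 == o:
                        new.add((s, p, o2))
        if new <= facts:
            return facts
        facts |= new


# Illustrative statements, not taken from a real ontology.
stated = {
    ("nucleus", "partOf", "cell"),
    ("cell", "partOf", "tissue"),
    ("proteinX", "interactsWith", "proteinY"),
}
closed = close(
    stated,
    symmetric={"interactsWith"},
    transitive={"partOf"},
    inverse_of={"partOf": "hasPart"},
)
print(sorted(closed - stated))
```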
44OWL: more expressivity
[Diagram: OWL sublanguages Full, DL, Lite]
45Remember: required are
- one or more standard vocabularies
- so search engines, producers and consumers all speak the same language
- a standard syntax,
- so meta-data can be recognised as such
- lots of resources with meta-data attached
46Question: who writes the ontologies?
- Professional bodies, scientific communities, companies, publishers, ...
- See previous slide on Biomedical ontologies
- Same developments in many other fields
- Good old fashioned Knowledge Engineering
-
- Convert from DB-schema, UML, etc.
47Question: Who writes the meta-data?
- Automated learning
- shallow natural language analysis
- Concept extraction
Example: Encyclopedia Britannica on Amsterdam
48Question: Who writes the meta-data?
- exploit existing legacy-data
- Amazon
- Lab equipment?
- side-effect from user interaction
- MIT Lab photo-annotator
- NOT from manual effort
- Web 2.0 community/social interaction
49Remember: required are
- one or more standard vocabularies
- so search engines, producers and consumers all speak the same language
- a standard syntax,
- so meta-data can be recognised as such
- lots of resources with meta-data attached
50Some working examples?
- DOPE
- HCLS (http://www.w3.org/2001/sw/hcls/)
51DOPE: Background
- Vertical Information Provision
- Buy a topic instead of a journal!
- Web provides new opportunities
- Business driver: drug development
- Rich, information-hungry market
- Good thesaurus (EMTREE)
52The Data
- Document repositories
- ScienceDirect: approx. 500,000 full-text articles
- MEDLINE: approx. 10,000,000 abstracts
- Extracted Metadata
- The Collexis Metadata Server: concept-extraction ("semantic fingerprinting")
- Thesauri and Ontologies
- EMTREE: 60,000 preferred terms, 200,000 synonyms
53Query interface
Architecture
54Architecture
[Architecture diagram. Components: GUI (Spectacle, Aduna); Mediator (Sesame, Aduna); Document Model (RDFS); EMTREE Thesaurus (RDFS); Source Model (RDF); Metadata Server (Collexis); Java Client. Connections labelled: HTTP requests, SeRQL, SOAP.]
66Some working examples?
- DOPE
- Community analysis http://flink.semanticweb.org
67Author teams in HIV research?
68Some working examples?
- DOPE
- Community analysis http://flink.semanticweb.org
- Biological pathway database http://pkb.stanford.edu/
69Stanford University Use Case
Source: http://pkb.stanford.edu/
70Summarising
- Data integration on the Web
- machine processable data besides human processable data
- Syntax for meta-data
- XML (not much meaning)
- RDF (some meaning)
- RDF Schema (some meaning)
- OWL (more meaning)
- Vocabularies for meta-data
- Lots of them in bio-inf.
- Actual meta-data
- Lots in bio-inf.
- Will enable
- Better search engines (recall, precision, concepts)
- Combining information across pages (inference)
71Things to do for you
- Practical: Use existing software to construct new use-scenarios
- Conceptual: Create an ontology for some area of bio-medical expertise
- from scratch
- as a refinement of an existing ontology
- Technical: Transform an existing data-set in meta-data format, and provide a query interface (for humans and machines)