The Semantic Web: New-style data-integration (and how it works for life-scientists too!)

Slides: 61
Provided by: Frankvan3

Transcript and Presenter's Notes

Title: The Semantic Web: New-style data-integration (and how it works for life-scientists too!)


1
The Semantic Web: New-style data-integration
(and how it works for life-scientists too!)
  • Frank van Harmelen
  • AI Department
  • Vrije Universiteit Amsterdam

2
What's the problem? (data-mess in bio-inf)
3
Life Science Data
Kenneth Griffiths and Richard Resnick, Tutorial at
Intelligent Systems for Molecular Biology, 2003
Recent focus on genetic data
Genomics: the study of genes and their function.
Recent advances in genomics are bringing about a
revolution in our understanding of the molecular
mechanisms of disease, including the complex
interplay of genetic and environmental factors.
Genomics is also stimulating the discovery of
breakthrough healthcare products by revealing
thousands of new biological targets for the
development of drugs, and by giving scientists
innovative ways to design new drugs, vaccines and
DNA diagnostics. Genomics-based therapeutics
include "traditional" small chemical drugs,
protein drugs, and potentially gene therapy.
The Pharmaceutical Research and Manufacturers of America -
http://www.phrma.org/genomics/lexicon/g.html
  • Study of genes and their function
  • Understanding molecular mechanisms of disease
  • Development of drugs, vaccines, and diagnostics
4
The Study of Genes...
  • Chromosomal location
  • Sequence
  • Sequence Variation
  • Splicing
  • Protein Sequence
  • Protein Structure

5
and Their Function
  • Homology
  • Motifs
  • Publications
  • Expression
  • HTS
  • In Vivo/Vitro Functional Characterization

6
Understanding Mechanisms of Disease
7
Development of Drugs, Vaccines, Diagnostics
  • Differing types of Drugs, Vaccines, and
    Diagnostics
    • Small molecules
    • Protein therapeutics
    • Gene therapy
    • In vitro, In vivo diagnostics
  • Development requires
    • Preclinical research
    • Clinical trials
    • Long-term clinical research
  • All of which often feeds back into ongoing
    Genomics research and discovery.

8
The Industry's Problem
  • Too much unintegrated data
    • from a variety of incompatible sources
    • no standard naming convention
    • each with a custom browsing and querying
      mechanism (no common interface)
    • and poor interaction with other data sources

9
What are the Data Sources?
  • Flat Files
  • URLs
  • Proprietary Databases
  • Public Databases
  • Data Marts
  • Spreadsheets
  • Emails

10
Sample Problem: Hyperprolactinemia
  • Overproduction of prolactin
    • prolactin stimulates mammary gland development
      and milk production
  • Hyperprolactinemia is characterized by
    • inappropriate milk production
    • disruption of menstrual cycle
    • can lead to conception difficulty

11
Understanding transcription factors for prolactin
production
Show me all genes in the public literature that
are putatively related to hyperprolactinemia,
have more than 3-fold expression differential
between hyperprolactinemic and normal pituitary
cells, and are homologous to known transcription
factors.
(Q1 ∩ Q2 ∩ Q3)
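One way to read the combined question on this slide is as the intersection of three separate result sets, each of which would come from a different data source. A minimal Python sketch follows; the gene names, fold-change values and variable names are made-up placeholders, not real findings or real database calls.

```python
# Hypothetical result sets; gene names and numbers are placeholders.

# Q1: genes linked to hyperprolactinemia in the public literature
q1_literature = {"geneA", "geneB", "geneC", "geneD"}

# Q2: genes with more than 3-fold expression differential,
# derived here from an invented {gene: fold change} table
expression_fold_change = {"geneA": 4.2, "geneB": 1.5, "geneC": 3.6, "geneE": 7.0}
q2_expression = {g for g, fold in expression_fold_change.items() if fold > 3.0}

# Q3: genes homologous to known transcription factors
q3_homology = {"geneC", "geneD", "geneE"}

# The combined question is the intersection of the three answers.
candidates = q1_literature & q2_expression & q3_homology
print(sorted(candidates))
```

The hard part in practice is not the intersection itself but getting the three sources to agree on what counts as the same gene, which is the integration problem the following slides address.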
12
The Complexity of Biological Data
13
Pharmaceutical Productivity
Source: PhRMA, FDA, 2003
14
Stitching this all together by hand?
Source: Stephens et al., J. Web Semantics, 2006
15
The Medical Tower of Babel
  • MeSH
    • Medical Subject Headings, National Library of
      Medicine
    • 22.000 descriptions
  • EMTREE
    • Commercial (Elsevier), drugs and diseases
    • 45.000 terms, 190.000 synonyms
  • UMLS
    • Integrates 100 different vocabularies
  • SNOMED
    • 200.000 concepts, College of American
      Pathologists
  • Gene Ontology
    • 15.000 terms in molecular biology
  • NCI Cancer Ontology
    • 17.000 classes (about 1M definitions)

16
Problem with the Current WWW
17
Why would Semantic Web technology help?
18
Machine-accessible meaning (what it's like
to be a machine)
META-DATA
19
What is meta-data?
  • it's just data
  • it's data describing other data
  • it's meant for machine consumption

20
Required are
  • one or more standard vocabularies
    • so search engines, producers and consumers all
      speak the same language
  • a standard syntax
    • so meta-data can be recognised as such
  • lots of resources with meta-data attached
  • mechanisms for attribution and trust: is this
    page really about Pamela Anderson?

21
What are ontologies and what are they used for?
[Diagram relating world, concept and language]
  • No shared understanding: conceptual and
    terminological confusion
  • Agree on a conceptualization
  • Make it explicit in some language
  • Actors: both humans and machines
22
Standard vocabularies (Ontologies)
  • Identify the key concepts in a domain
  • Identify a vocabulary for these concepts
  • Identify relations between these concepts
  • Make these precise enough so that they can be
    shared between
    • humans and humans
    • humans and machines
    • machines and machines

23
Shared content-vocabularies: Ontologies
  • Formal,
  • explicit specification
  • of a
  • shared
  • conceptualisation

24
Real life examples
  • handcrafted
    • music: CDnow (2410/5), MusicMoz (1073/7)
    • biomedical: SNOMED (200k), GO (15k),
      Emtree (45k + 190k), Systems biology
  • ranging from lightweight
    (Yahoo, UNSPSC, Open Directory (400k)) to
    heavyweight (Cyc (300k))
  • ranging from small (METAR) to large (UNSPSC)

25
Biomedical ontologies (a few...)
  • MeSH
    • Medical Subject Headings, National Library of
      Medicine
    • 22.000 descriptions
  • EMTREE
    • Commercial (Elsevier), drugs and diseases
    • 45.000 terms, 190.000 synonyms
  • UMLS
    • Integrates 100 different vocabularies
  • SNOMED
    • 200.000 concepts, College of American
      Pathologists
  • Gene Ontology
    • 15.000 terms in molecular biology
  • NCI Cancer Ontology
    • 17.000 classes (about 1M definitions)

26
What's inside an ontology?
  • terms + specialisation hierarchy
  • classes + class-hierarchy
  • instances
  • slots/values
  • inheritance (multiple? defaults?)
  • restrictions on slots (type, cardinality)
  • properties of slots (symm., trans., ...)
  • relations between classes (disjoint, covers)
  • reasoning tasks: classification, subsumption
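To make the class-hierarchy and subsumption items above concrete, here is a minimal Python sketch of a class hierarchy with multiple inheritance and a subsumption test. The class names and the `SUBCLASS_OF` table are illustrative assumptions, not taken from any of the ontologies named in this talk.

```python
# Toy class hierarchy: child -> set of direct parents (multiple inheritance allowed).
# All class names are illustrative assumptions.
SUBCLASS_OF = {
    "Prolactin": {"PeptideHormone"},
    "PeptideHormone": {"Hormone", "Protein"},
    "TranscriptionFactor": {"Protein"},
    "Hormone": {"Molecule"},
    "Protein": {"Molecule"},
}

def ancestors(cls):
    """Return every class above cls in the hierarchy (transitively), excluding cls."""
    seen = set()
    stack = list(SUBCLASS_OF.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(SUBCLASS_OF.get(parent, ()))
    return seen

def subsumes(general, specific):
    """Subsumption test: is `specific` the same as, or a descendant of, `general`?"""
    return general == specific or general in ancestors(specific)

print(subsumes("Molecule", "Prolactin"))
print(subsumes("Hormone", "TranscriptionFactor"))
```

Real ontology reasoners do considerably more (slot restrictions, disjointness, classification of defined classes); this only shows the hierarchy-walking core.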

27
NB: we're not doing philosophy
  • Ontologies are not
    • definitive descriptions of what exists in the
      world (= philosophy)
  • Ontologies are
    • models of the world, constructed to facilitate
      communication
  • Yes, ontologies exist (because we build them)

28
Remember: required are
  • one or more standard vocabularies
    • so search engines, producers and consumers all
      speak the same language
  • a standard syntax
    • so meta-data can be recognised as such
  • lots of resources with meta-data attached

29
Stack of languages
30
Stack of languages
  • XML
    • Surface syntax, no semantics
  • XML Schema
    • Describes structure of XML documents
  • RDF
    • Datamodel for relations between things
  • RDF Schema
    • RDF Vocabulary Definition Language
  • OWL
    • A more expressive Vocabulary Definition Language

31
Why XML
  • Structuring data in documents on the internet
  • HTML not meant to store data

Rendered page:
The Netherlands
Geography
Capital Amsterdam
(The Hague is the seat of the government)
Neighboring countries Germany, Belgium

HTML source:
<h2>The Netherlands</h2>
<b>Geography</b><br>
<i>Capital</i> Amsterdam<br>
(The Hague is the seat of the government)<br>
<i>Neighboring countries</i> Germany, Belgium
32
Why XML - 2
  • Humans understand information written in HTML
  • Computers cannot work with it
    • meaning of pieces of text, and their relation?
  • XML makes this partly explicit by
    • giving possibly meaningful names to tags
    • allowing the nesting of tags (tags inside tags)

<country name="The Netherlands">
  <geography>
    <capital name="Amsterdam">
      <remark>The Hague is the seat of the government</remark>
    </capital>
    <neighboring_country>Germany</neighboring_country>
    <neighboring_country>Belgium</neighboring_country>
  </geography>
</country>
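Because the XML version names and nests its parts, a program can pull out individual facts without guessing at layout. A small sketch using Python's standard `xml.etree.ElementTree` module on the country document from this slide (markup restored; the variable names are my own):

```python
import xml.etree.ElementTree as ET

# The country document from the slide, with its markup restored.
DOC = """
<country name="The Netherlands">
  <geography>
    <capital name="Amsterdam">
      <remark>The Hague is the seat of the government</remark>
    </capital>
    <neighboring_country>Germany</neighboring_country>
    <neighboring_country>Belgium</neighboring_country>
  </geography>
</country>
"""

root = ET.fromstring(DOC)

# Attributes and nesting make each piece of data addressable by name.
country = root.get("name")
capital = root.find("./geography/capital").get("name")
remark = root.find("./geography/capital/remark").text.strip()
neighbours = [n.text.strip() for n in root.iter("neighboring_country")]

print(country, capital, neighbours)
```

Note that the program still has no idea what a "capital" is; the tag names are meaningful only to the human who chose them, which is the gap the later layers of the stack try to close.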
33
XML data model
  • document is an ordered, labeled tree, with nodes
    to represent the document entity, elements,
    attributes, processing instructions, and comments

[Tree diagram of the country document: root country (name: The Netherlands; comment: "Should be...") with child geography; geography has children capital (name: Amsterdam; remark: The Hague is the seat of the government) and two neighboring country nodes (Germany, Belgium)]
34
Structuring methods
  • DTDs
    • document type definitions
    • traditional: inherited from SGML
    • #PCDATA: parsed character data
    • no other datatypes

<!ELEMENT country (geography, people, economy)>
<!ATTLIST country name CDATA #REQUIRED>
<!ELEMENT geography (capital, neighboring_country)>
<!ELEMENT capital (remark)>
<!ATTLIST capital name CDATA #REQUIRED>
<!ELEMENT remark (#PCDATA)>
<!ELEMENT neighboring_country (#PCDATA)>
...
35
Structuring methods - 2
  • XML Schema
    • quite new (Rec. 02 May 2001)
    • same function as DTD: prescribes structure
    • but has some advantages
      • XML Schema is XML itself
      • simple datatyping
      • richer grammar
      • type hierarchy with derivation

<complexType name="subject">
  <element name="title" type="string"/>
  <element ref="lecture" maxOccurs="unbounded"/>
</complexType>
36
XML Schema - richer grammar
  • content models
    • grouping, by choice, sequence or all
  • cardinality
    • attributes: minOccurs, maxOccurs
  • defaults and constants
    • attributes: default, fixed

<complexType name="WindowsType">
  <element name="version" type="string"
           minOccurs="0" maxOccurs="1" default="W98"/>
  <element name="includedBrowser" type="string"
           minOccurs="0" maxOccurs="1" fixed="Internet Explorer"/>
</complexType>
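The cardinality attributes on this slide say how often a child element may occur. The sketch below imitates that one check by hand in Python; it is not an XML Schema validator, the `CARDINALITY` table is written out manually rather than read from a schema, and the `<windows>` root element is a hypothetical instance document.

```python
import xml.etree.ElementTree as ET

# (minOccurs, maxOccurs) per child element, mirroring the WindowsType example.
# Hand-written for illustration; a real validator would read this from the schema.
CARDINALITY = {"version": (0, 1), "includedBrowser": (0, 1)}

def cardinality_errors(xml_text):
    """Return a message for every child that occurs too rarely or too often."""
    root = ET.fromstring(xml_text)
    errors = []
    for name, (min_occurs, max_occurs) in CARDINALITY.items():
        count = len(root.findall(name))
        if not min_occurs <= count <= max_occurs:
            errors.append(f"{name}: found {count}, allowed {min_occurs}..{max_occurs}")
    return errors

# Hypothetical instance documents.
ok = "<windows><version>W98</version></windows>"
bad = "<windows><version>W98</version><version>XP</version></windows>"

print(cardinality_errors(ok))
print(cardinality_errors(bad))
```

A DTD can express "optional" and "repeatable" too, but only through `?`, `*` and `+`; explicit numeric bounds are one of the ways the XML Schema grammar is richer.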
37
Stack of languages
  • XML
    • Surface syntax, no semantics
  • XML Schema
    • Describes structure of XML documents
  • RDF
    • Datamodel for relations between things
  • RDF Schema
    • RDF Vocabulary Definition Language
  • OWL
    • A more expressive Vocabulary Definition Language

38
RDF Triples in Life Sciences
39
Bluffer's guide to RDF (1)
  • Object --Attribute--> Value triples
  • objects are web-resources
  • Value is again an Object
    • triples can be linked
    • data-model = graph
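The triple model can be sketched in a few lines of plain Python, with each triple as a tuple and the graph as a set of tuples. This is only an illustration of the data model, not an RDF library; the `http://example.org/` identifiers and the attribute names are hypothetical.

```python
# Each triple is (object, attribute, value); identifiers are hypothetical URLs.
EX = "http://example.org/"
triples = {
    (EX + "prolactin", EX + "encodedBy", EX + "PRL"),
    (EX + "PRL", EX + "locatedOn", EX + "chromosome6"),
    (EX + "prolactin", EX + "stimulates", EX + "milkProduction"),
}

def values(subject, attribute):
    """All values v such that (subject, attribute, v) is in the graph."""
    return {v for s, a, v in triples if s == subject and a == attribute}

def follow(start, *attributes):
    """Walk a chain of attributes through the graph, starting from one node."""
    frontier = {start}
    for attribute in attributes:
        frontier = {v for node in frontier for v in values(node, attribute)}
    return frontier

# Because the value of one triple is the object of another, triples link up
# into a graph that can be traversed.
print(follow(EX + "prolactin", EX + "encodedBy", EX + "locatedOn"))
```

Since every node is named by a globally unique identifier, two such sets published by different sources can simply be unioned, and any shared identifiers join the graphs together.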

40
Bluffer's guide to RDF (2)
  • Every identifier is a URL
    • world-wide unique naming!
  • Has XML syntax
  • Any statement can be an object
    • graphs can be nested

41
What does RDF Schema add?
  • Defines vocabulary for RDF
  • Organizes this vocabulary in a typed hierarchy
    • Class, subClassOf, type
    • Property, subPropertyOf
    • domain, range

[Diagram: class Person with subclasses Teacher and Student (subClassOf); property supervises with domain and range; instances Marta and Frank, each with a type, linked by supervises]
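What RDF Schema buys you is inference: from domain, range and subClassOf statements, new type facts follow. The Python sketch below applies those three rules to the vocabulary on this slide. Two things are assumptions on my part: that supervises has domain Teacher and range Student, and that it is Frank who supervises Marta; the transcript preserves only the labels, not the arrows. The plain-string predicates stand in for the real `rdfs:` and `rdf:` terms.

```python
# Schema and data as (subject, predicate, object) triples.
# Assumed, not confirmed by the slide text: the domain/range assignment and
# the direction of the supervises link.
triples = {
    ("Teacher", "subClassOf", "Person"),
    ("Student", "subClassOf", "Person"),
    ("supervises", "domain", "Teacher"),
    ("supervises", "range", "Student"),
    ("Frank", "supervises", "Marta"),
}

def rdfs_closure(graph):
    """Apply the domain, range and subClassOf typing rules until nothing new appears."""
    graph = set(graph)
    while True:
        new = set()
        for s, p, o in graph:
            for p2, rel, cls in graph:
                if p2 == p and rel == "domain":
                    new.add((s, "type", cls))   # subject gets the domain class
                if p2 == p and rel == "range":
                    new.add((o, "type", cls))   # object gets the range class
            if p == "type":
                for sub, rel, sup in graph:
                    if rel == "subClassOf" and sub == o:
                        new.add((s, "type", sup))  # types propagate up the hierarchy
        if new <= graph:
            return graph
        graph |= new

closed = rdfs_closure(triples)
print(sorted(t for t in closed if t[1] == "type"))
```

Nobody stated that Frank is a Person or even a Teacher; both facts are derived from the single supervises statement plus the schema.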
42
Stack of languages
  • XML
    • Surface syntax, no semantics
  • XML Schema
    • Describes structure of XML documents
  • RDF
    • Datamodel for relations between things
  • RDF Schema
    • RDF Vocabulary Definition Language
  • OWL
    • A more expressive Vocabulary Definition Language

43
OWL: things RDF Schema can't do
  • equality
  • enumeration
  • number restrictions
    • Single-valued/multi-valued
    • Optional/required values
  • inverse, symmetric, transitive
  • boolean algebra
    • Union, complement
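The "inverse, symmetric, transitive" item refers to declarations about properties, from which a reasoner can derive extra statements. The Python sketch below mimics those derivations on plain sets of pairs. It is not an OWL reasoner, and the property names and example pairs are invented for illustration.

```python
def close_property(pairs, symmetric=False, transitive=False):
    """Return the pairs entailed by declaring a property symmetric and/or transitive."""
    closed = set(pairs)
    changed = True
    while changed:
        changed = False
        new = set()
        if symmetric:
            new |= {(b, a) for a, b in closed}
        if transitive:
            new |= {(a, d) for a, b in closed for c, d in closed if b == c}
        if not new <= closed:
            closed |= new
            changed = True
    return closed

def inverse(pairs):
    """Pairs of the inverse property, e.g. hasPart as the inverse of partOf."""
    return {(b, a) for a, b in pairs}

# Invented example data.
part_of = {("nucleus", "cell"), ("cell", "tissue")}
homologous_to = {("geneA", "geneB")}

print(sorted(close_property(part_of, transitive=True)))
print(sorted(close_property(homologous_to, symmetric=True)))
print(sorted(inverse(part_of)))
```

In RDF Schema these extra pairs would have to be stated explicitly; with OWL, a single declaration on the property is enough for a reasoner to supply them.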

44
OWL: more expressivity
[Diagram: nested sublanguages OWL Full ⊃ OWL DL ⊃ OWL Lite]
45
Remember: required are
  • one or more standard vocabularies
    • so search engines, producers and consumers all
      speak the same language
  • a standard syntax
    • so meta-data can be recognised as such
  • lots of resources with meta-data attached

46
Question: who writes the ontologies?
  • Professional bodies, scientific communities,
    companies, publishers, ...
  • See previous slide on Biomedical ontologies
  • Same developments in many other fields
  • Good old-fashioned Knowledge Engineering
  • Convert from DB-schema, UML, etc.

47
Question: who writes the meta-data?
  • Automated learning
    • shallow natural language analysis
    • Concept extraction

Example: Encyclopedia Britannica on Amsterdam
48
Question: who writes the meta-data?
  • exploit existing legacy-data
    • Amazon
    • Lab equipment?
  • side-effect from user interaction
    • MIT Lab photo-annotator
  • NOT from manual effort
  • Web 2.0: community/social interaction

49
Remember: required are
  • one or more standard vocabularies
    • so search engines, producers and consumers all
      speak the same language
  • a standard syntax
    • so meta-data can be recognised as such
  • lots of resources with meta-data attached

50
  • Some working examples?
  • DOPE
  • HCLS (http://www.w3.org/2001/sw/hcls/)

51
DOPE: Background
  • Vertical Information Provision
    • Buy a topic instead of a Journal!
    • Web provides new opportunities
  • Business driver: drug development
    • Rich, information-hungry market
    • Good thesaurus (EMTREE)

52
The Data
  • Document repositories
    • ScienceDirect: approx. 500.000 fulltext articles
    • MEDLINE: approx. 10.000.000 abstracts
  • Extracted Metadata
    • The Collexis Metadata Server: concept-extraction
      ("semantic fingerprinting")
  • Thesauri and Ontologies
    • EMTREE: 60.000 preferred terms, 200.000 synonyms

53
Query interface
Architecture
54
Architecture
[Architecture diagram, labels as extracted: GUI: Spectacle (Aduna); http requests; Mediator: Sesame (Aduna); SeRQL; Document Model (RDFS); EMTREE Thesaurus (RDFS); SeRQL; Source Model (RDF); SOAP; Metadata Server (Collexis); Java Client]
66
  • Some working examples?
  • DOPE
  • Community analysis: http://flink.semanticweb.org

67
Author teams in HIV research?
68
  • Some working examples?
  • DOPE
  • Community analysis: http://flink.semanticweb.org
  • Biological pathway database: http://pkb.stanford.edu/

69
Stanford University Use Case
Source: http://pkb.stanford.edu/
70
Summarising
  • Data integration on the Web
    • machine processable data besides human
      processable data
  • Syntax for meta-data
    • XML (not much meaning)
    • RDF (some meaning)
    • RDF Schema (some meaning)
    • OWL (more meaning)
  • Vocabularies for meta-data
    • Lots of them in bio-inf.
  • Actual meta-data
    • Lots in bio-inf.
  • Will enable
    • Better search engines (recall, precision,
      concepts)
    • Combining information across pages (inference)

71
Things to do for you
  • Practical: Use existing software to construct
    new use-scenarios
  • Conceptual: Create an ontology for some area of
    bio-medical expertise
    • from scratch
    • as a refinement of an existing ontology
  • Technical: Transform an existing data-set into
    meta-data format, and provide a query interface
    (for humans and machines)