Biological Data Extraction and Integration A Research Area Background Study

About This Presentation
Slides: 29
Provided by: degByuEdu
Learn more at: https://www.deg.byu.edu

Transcript and Presenter's Notes

Title: Biological Data Extraction and Integration A Research Area Background Study


1
Biological Data Extraction and Integration A
Research Area Background Study
  • Cui Tao
  • Department of Computer Science
  • Brigham Young University

2
Research Field Overview
My research
Semantic Web
Data Integration
Schema Matching
Information Extraction
Bioinformatics
3
Information Extraction
  • Information extraction systems process text
    documents and locate a specific set of relevant
    items. Califf99

4
Information Extraction
  • Information extraction systems process text
    documents and locate a specific set of relevant
    items. Califf99
  • Because the WWW consists primarily of text,
    information extraction is central to any effort
    that would use the web as a resource for
    knowledge discovery. Freitag98

5
Information Extraction
  • Traditional information extraction
  • Hidden web crawling
  • Biological data extraction

6
Traditional Information Extraction
  • Different groups of IE tools Laender02
      • Wrapper generation tools
      • NLP-based and learning-based tools
      • Ontology-based tools

7
Traditional Information Extraction
  • Wrapper generation tools
      • Lixto Baumgartner01
          • Supervised wrapper generation
          • Semi-automatic
          • Not robust; does not work well with unstructured
            data
      • ROADRUNNER Crescenzi01
          • Fully automatic wrapper generation
          • Does not generate robust and general wrappers
          • Only works for highly regular web pages

8
Traditional Information Extraction
  • NLP-based and learning-based tools
      • SRV Freitag98
          • Top-down learner
          • Learns based on simple and relational features
          • Single slot filling
      • RAPIER Califf99
          • Bottom-up learner
          • Learns pre-filler, slot filler, and post-filler
            patterns
          • Only works for free text
          • Single slot filling
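
The pre-filler / slot-filler / post-filler idea can be sketched with one hand-written rule. This is only an illustration: RAPIER learns its patterns from examples and uses its own rule language, whereas the `RULE` dictionary, the `extract_slot` function, and the "salary" slot below are assumptions made up for this sketch.

```python
import re

# One hand-written rule for a hypothetical "salary" slot.
# RAPIER would learn such a rule bottom-up from examples; here it is fixed.
RULE = {
    "pre_filler": r"salary\s+of\s+",   # context that must precede the filler
    "filler": r"\$[\d,]+",             # the text that fills the slot
    "post_filler": r"\s+per\s+year",   # context that must follow the filler
}

def extract_slot(text, rule=RULE):
    """Return the slot filler if all three parts match in sequence, else None."""
    pattern = rule["pre_filler"] + "(" + rule["filler"] + ")" + rule["post_filler"]
    match = re.search(pattern, text, flags=re.IGNORECASE)
    return match.group(1) if match else None
```

Because the rule fills exactly one slot, a record with several fields would need one such rule per field, which is the "single slot filling" limitation noted above.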

9
Traditional Information Extraction
  • Ontology-based tools
      • BYU Ontos Embley99
          • Based on domain-specific extraction ontologies
          • Robust to changes
          • Multiple slot filling
          • Ontologies have to be built manually
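
A minimal sketch of ontology-driven, multiple-slot extraction, assuming a tiny hand-built "ontology" in which every concept carries a regular-expression recognizer. The `GENE_ONTOLOGY` table and `extract_record` function are hypothetical and are not the format or algorithm used by BYU Ontos.

```python
import re

# Hypothetical, manually built extraction ontology: concept -> value recognizer.
# Each recognizer has one capture group holding the value to extract.
GENE_ONTOLOGY = {
    "Gene": r"\bgene\s+([A-Za-z0-9]+)",
    "Chromosome": r"\bchromosome\s+(\d+|[XY])\b",
    "Length": r"\b(\d[\d,]*)\s*bp\b",
}

def extract_record(text, ontology=GENE_ONTOLOGY):
    """Fill every slot whose recognizer fires; return a concept -> value dict."""
    record = {}
    for concept, recognizer in ontology.items():
        match = re.search(recognizer, text, flags=re.IGNORECASE)
        if match:
            record[concept] = match.group(1)
    return record
```

The recognizers depend on the domain vocabulary rather than on page layout, which is why this style tolerates layout changes but requires someone to write the ontology by hand.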

10
Hidden Web Crawling
  • Traditional IE tools: publicly indexable web
    pages
  • Hidden web crawling
      • Crawl the hidden web according to a user's query
  • HiWE (Hidden Web Exposer) Raghavan01
      • Source form representation ↔ task-specific DB
        concepts
      • Fill out and submit forms
      • Retrieve information hidden behind the form
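
The form-filling step can be sketched as matching visible form labels against a table of task-specific concepts and then encoding the chosen values as a query string. Everything named here (`TASK_VALUES`, `FORM_FIELDS`, `fill_form`) is a hypothetical stand-in; HiWE's actual label matching is more elaborate, and no form is submitted in this sketch.

```python
from urllib.parse import urlencode

# Hypothetical task-specific concept table: concept -> value to submit.
TASK_VALUES = {"organism": "Homo sapiens", "gene": "BRCA1"}

# Hypothetical form description: field name -> label shown next to the field.
FORM_FIELDS = {"org_name": "Organism:", "q": "Gene name", "max": "Results per page"}

def fill_form(form_fields, task_values):
    """Assign a value to each field whose label mentions a known concept."""
    assignment = {}
    for field_name, label in form_fields.items():
        normalized = label.lower()
        for concept, value in task_values.items():
            if concept in normalized:
                assignment[field_name] = value
                break
    return assignment

def build_query(form_fields, task_values):
    """Encode the filled fields as a URL query string (not sent anywhere)."""
    return urlencode(fill_form(form_fields, task_values))
```

Fields whose labels match no concept, such as `max` above, are left unfilled.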

11
Biological Data Extraction
  • Mainly from plain text
  • Extract biological terms
      • Dictionary-based
      • Rule-based
  • Extract relationships between biological
    terms/elements
  • Example systems
      • BLAST-based name identifier Krauthammer00
      • PASTA (Protein Active Site Template Acquisition)
        Gaizauskas03
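
The two term-extraction styles listed above can be contrasted in a short sketch. The dictionary contents and the "ends in -ase" rule are illustrative assumptions, not the methods of the BLAST-based identifier or PASTA.

```python
import re

# Hypothetical mini-dictionary of known protein names (lower-cased).
PROTEIN_DICTIONARY = {"p53", "insulin", "hemoglobin"}

# Hypothetical surface rule: a word ending in "ase" may name an enzyme.
ENZYME_RULE = re.compile(r"\b[A-Za-z]+ase\b")

def dictionary_terms(text):
    """Dictionary-based: keep tokens that appear in the term dictionary."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t for t in tokens if t.lower() in PROTEIN_DICTIONARY]

def rule_terms(text):
    """Rule-based: keep tokens that match the surface pattern."""
    return ENZYME_RULE.findall(text)
```

The dictionary approach misses names it has never seen, while the rule approach finds unseen names but also fires on ordinary words such as "database", so the two are often combined.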

12
The Semantic Web
  • Machine-understandable web
      • Gives information a well-defined meaning
      • Allows automation of tasks
  • Provides biologists with
      • Intelligent information services
      • Personalized web resources
      • Semantically empowered search engines

13
The Semantic Web
  • Semantic web languages
      • XOL (XML-based Ontology Exchange Language)
      • SHOE (Simple HTML Ontology Extension)
      • OML (Ontology Markup Language)
      • RDF(S) (Resource Description Framework (Schema))
      • OIL (Ontology Interchange Language)
      • DAML+OIL (DARPA Agent Markup Language + OIL)
      • OWL (Web Ontology Language)
  • Semantic annotation
      • Old: indexing of publications in libraries
      • New: information extraction

14
Schema Matching
  • Previous methods Raghavan01
      • Individual matchers vs. combining matchers
      • Schema-based matchers vs. instance-based matchers
      • Learning-based matchers vs. rule-based matchers
      • Element-level matchers vs. structure-level
        matchers

15
Schema Matching
  • LSD (Learning Source Description) Doan01
      • Semi-automatic
      • Learning-based
      • Both schema-level and instance-level
      • Only 1-1 mappings
  • GLUE and CGLUE DMD03
      • Ontology alignment
      • CGLUE: complex (non-1-1) mappings
16
Schema Matching
  • Cupid Madhavan01
      • Rule-based matcher
      • Both element-level and structure-level
      • Schema-based
      • Works on hierarchical schemas with schema tree
      • Linguistic similarity and structural similarity
      • Matches tree elements by weighted similarities
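
A minimal sketch of scoring two schema-tree elements by a weighted mix of linguistic and structural similarity. The measures are stand-ins chosen for brevity: string similarity from `difflib` for the linguistic part and Jaccard overlap of child names for the structural part. The weight `w_struct` and all function names are assumptions; Cupid's own measures are more elaborate.

```python
from difflib import SequenceMatcher

def linguistic_sim(name_a, name_b):
    """Stand-in linguistic similarity: character-level ratio of the two names."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()

def structural_sim(children_a, children_b):
    """Stand-in structural similarity: Jaccard overlap of child element names."""
    a = {c.lower() for c in children_a}
    b = {c.lower() for c in children_b}
    if not a and not b:
        return 1.0  # two leaves are treated as structurally identical
    return len(a & b) / len(a | b)

def weighted_sim(name_a, children_a, name_b, children_b, w_struct=0.5):
    """Weighted similarity: w_struct * structural + (1 - w_struct) * linguistic."""
    return (w_struct * structural_sim(children_a, children_b)
            + (1 - w_struct) * linguistic_sim(name_a, name_b))
```

Raising `w_struct` favors elements whose subtrees look alike even when their names differ, and lowering it favors name agreement.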

17
Schema Matching
  • COMA (COmbining MAtch) Do02
      • Combines different matchers
      • Interactive with users
      • Also an evaluation platform for different matchers
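
Combining matchers can be sketched as running several independent matchers on each element pair, aggregating their scores, and keeping the best candidate above a threshold. The two matchers, the averaging step, and the threshold below are illustrative assumptions, not COMA's actual matcher library or combination strategies.

```python
from difflib import SequenceMatcher

def name_matcher(a, b):
    """Score by string similarity of the element names."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def type_matcher(a, b):
    """Score 1.0 when the declared data types agree, else 0.0."""
    return 1.0 if a["type"] == b["type"] else 0.0

def combine(a, b, matchers=(name_matcher, type_matcher)):
    """Aggregate the individual matcher scores by averaging them."""
    scores = [matcher(a, b) for matcher in matchers]
    return sum(scores) / len(scores)

def best_matches(source, target, threshold=0.5):
    """For each source element, keep the top-scoring target above the threshold."""
    result = {}
    for s in source:
        best = max(target, key=lambda t: combine(s, t))
        if combine(s, best) >= threshold:
            result[s["name"]] = best["name"]
    return result
```

Swapping the tuple passed as `matchers` is enough to compare different matcher combinations on the same schemas, which is the sense in which such a system doubles as an evaluation platform.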

18
Biological Data Integration
  • Challenges
      • Huge amount, growing rapidly
      • Highly diverse in granularity and variety
      • Different terminologies, ID systems, units
      • Unstable and unpredictable
      • Different interfaces and querying capabilities

19
Biological Data Integration
  • SRS (Sequence Retrieval System) Etzold96
      • Keyword-based retrieval system
      • Returns simple aggregation of matched records
      • Only works for relational databases
  • BioKleisli Davidson97
      • Integrated digital library in biomedical domain
      • No global schema or ontology
      • A mediator works on top of source-specific
        wrappers
      • Horizontal integration
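
The mediator-over-wrappers arrangement can be sketched with two in-memory sources. Each wrapper hides its source's native format behind a common `lookup` method, and the mediator merges the answers. All class names, field names, and data are hypothetical; this is not BioKleisli's architecture or query language.

```python
class FlatFileWrapper:
    """Wraps a hypothetical tab-separated source with lines 'gene<TAB>organism'."""

    def __init__(self, lines):
        self.lines = lines

    def lookup(self, gene):
        for line in self.lines:
            name, organism = line.split("\t")
            if name == gene:
                return {"organism": organism}
        return {}


class RecordWrapper:
    """Wraps a hypothetical list-of-dicts source that uses its own field names."""

    def __init__(self, records):
        self.records = records

    def lookup(self, gene):
        for record in self.records:
            if record["symbol"] == gene:
                return {"function": record["func"]}
        return {}


class Mediator:
    """Sends one query to every wrapper and merges the partial answers."""

    def __init__(self, wrappers):
        self.wrappers = wrappers

    def query(self, gene):
        merged = {"gene": gene}
        for wrapper in self.wrappers:
            merged.update(wrapper.lookup(gene))
        return merged
```

Adding a source means writing one more wrapper. The mediator is unchanged, but without a global schema the caller still has to know which fields each source contributes.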

20
Biological Data Integration
  • DiscoveryLink Haas01
      • Mediator-based, wrapper-oriented
      • Provides virtual DB access from different sources
      • Cannot deal with complex source data
      • Hard to add new sources
      • Requires knowledge of specific query language
  • TAMBIS (Transparent Access to Multiple
    Bioinformatics Information Sources) Stevens00
      • Mediator-based
      • Uses global ontology and schema
      • Maps source and target concepts manually
      • Not robust to changes
      • Hard to add new sources

21
Bioinformatics
  • Biological ontology
  • Bioinformatics data source discovery
  • Trustworthiness and provenance

22
Bioinformatics
  • Biological ontology
      • GO (Gene Ontology) Ashburner00
          • Controlled vocabulary
              • Molecular Function (7278 terms)
              • Biological Process (8151 terms)
              • Cellular Component (1379 terms)
          • Represents knowledge hierarchically
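
Hierarchical representation means a query about a general term can be answered from annotations on its more specific descendants, which requires walking parent links. The sketch below uses a toy hierarchy: the term names were chosen for illustration and the parent links are simplified assumptions, not copied from GO.

```python
# Toy hierarchy: term -> list of parent terms (a term may have several parents).
PARENTS = {
    "DNA repair": ["DNA metabolic process", "response to stress"],
    "DNA metabolic process": ["metabolic process"],
    "response to stress": ["biological process"],
    "metabolic process": ["biological process"],
    "biological process": [],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def is_a(term, candidate, parents=PARENTS):
    """True if candidate is the term itself or one of its ancestors."""
    return term == candidate or candidate in ancestors(term, parents)
```

The `seen` set matters because a term with several parents can reach the same ancestor along more than one path.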

23
Bioinformatics
  • Biological ontology
      • LinKBase Verschelde03
          • Originally a biomedical ontology
          • Over 2,000,000 medical concepts
          • Over 5,300,000 instantiations
          • 543 relations
          • Expanded using GO
          • Only describes simple binary relationships

24
Bioinformatics
  • Bioinformatics data source discovery
      • First step in integrating or answering queries
      • Example system Rocco03
          • Pre-defined classes with class descriptions
          • Tries to map a source to a class
  • Trustworthiness and provenance
      • Trustworthiness
          • Consistency
          • Reliability
          • Competence
          • Honesty
      • Provenance
          • Record history
          • Transformations
          • Annotations
          • Updates
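
One way to picture the provenance items above is a small log attached to each extracted data item. The `ProvenanceLog` class, its fields, and the event kinds are assumptions made for this sketch and do not describe any system cited in the slides.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceLog:
    """Hypothetical provenance record for one extracted data item."""
    source: str
    events: list = field(default_factory=list)

    def record(self, kind, detail):
        """Append an event; kind must be one of the provenance categories."""
        allowed = {"transformation", "annotation", "update"}
        if kind not in allowed:
            raise ValueError("unknown provenance event kind: " + kind)
        self.events.append((kind, detail))

    def history(self):
        """Return event details in the order they were recorded."""
        return [detail for _, detail in self.events]
```

Keeping the events in recording order gives the record history directly, and filtering on `kind` separates transformations from annotations and updates.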

28
Summary and Future Work
  • Overcome drawbacks of existing systems
  • Elaborate new algorithms to solve the problem of
    locating and extracting data from heterogeneous
    biological sources