Pattern Markup Language

About This Presentation
Title: Pattern Markup Language

Description: Pattern Markup Language, a tool for simplifying data extraction from semi-structured sources.

Slides: 32
Provided by: deg7
Learn more at: https://www.deg.byu.edu

Transcript and Presenter's Notes


1
Pattern Markup Language
  • A tool for simplifying data extraction from semi-structured sources
  • Jonathan Baker, Hilton Campbell, Jordan Crabtree, David W. Embley

2
Many Sites with Genealogical Data
3
(No Transcript)
4
(No Transcript)
5
Structural Patterns
6
(No Transcript)
7
(No Transcript)
8
(No Transcript)
9
(No Transcript)
10
Programmer-Defined Regular Expressions
11
Programmer-Defined Regular Expressions
12
Programmer-Defined Regular Expressions
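The slide images are not in the transcript, so the following is only a minimal sketch of what a programmer-defined regular expression for this kind of page might look like. The HTML fragment, the `PERSON` pattern, and the `extract_people` helper are all hypothetical, chosen for illustration rather than taken from the presentation.

```python
import re

# Hypothetical HTML fragment; real genealogy pages will differ.
SAMPLE = """
<tr><td class="name">John Smith</td><td>b. 12 Mar 1850</td></tr>
<tr><td class="name">Mary Jones</td><td>b. 3 Jul 1854</td></tr>
"""

# One programmer-defined pattern with a named group for each fact.
PERSON = re.compile(
    r'<td class="name">(?P<name>[^<]+)</td>\s*'
    r'<td>b\.\s*(?P<birth>\d{1,2} \w{3} \d{4})</td>'
)

def extract_people(html):
    """Return one dict per table row matched by PERSON."""
    return [m.groupdict() for m in PERSON.finditer(html)]

people = extract_people(SAMPLE)
```

Named groups keep each extracted fact labeled, which makes it easier to see later which expression produced which value.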
13
Which Relationships Found?
14
Simple Schema Represents Relationships
15
Combine Schema and Regular Expressions
Tree represented by XML (PatML)
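The transcript does not reproduce the actual PatML syntax, so the sketch below only illustrates the general idea of attaching regular expressions to a schema tree written in XML. The `node` element, its `name` and `pattern` attributes, and the `extract` function are assumptions made for this example and should not be read as the real PatML vocabulary.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical PatML-like tree: element and attribute names are
# assumptions for illustration, not the real PatML format.
PATTERN_XML = r"""
<node name="person" pattern="&lt;tr&gt;(.*?)&lt;/tr&gt;">
  <node name="name" pattern="&lt;b&gt;(.*?)&lt;/b&gt;"/>
  <node name="birth" pattern="born (\d{4})"/>
</node>
"""

def extract(node, text):
    """Apply a node's pattern to text; run child patterns inside each match."""
    results = []
    children = list(node)
    for m in re.finditer(node.get("pattern"), text, re.DOTALL):
        captured = m.group(1)
        if not children:
            # Leaf node: the captured text is the extracted fact.
            results.append(captured.strip())
            continue
        # Inner node: the tree nesting relates the child facts to each other.
        record = {}
        for child in children:
            found = extract(child, captured)
            record[child.get("name")] = found[0] if found else None
        results.append(record)
    return results

PAGE = ("<tr><b>John Smith</b>, born 1850</tr>"
        "<tr><b>Mary Jones</b>, born 1854</tr>")
records = extract(ET.fromstring(PATTERN_XML), PAGE)
```

In this sketch each child expression only sees the text captured by its parent, so the schema tree, not the expressions themselves, decides which name belongs with which birth year.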
16
(No Transcript)
17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
PatML Generation Tools
Schema Generator: establishes relationships
21
PatML Generation Tools
PatML Editor: helps write the regular expressions
and establish which facts they match
22
(No Transcript)
23
Using PatML Editor
  • Get your schema file
  • Browse for sample page
  • Add nodes
  • Add expressions
  • See the highlights in source
  • Adjust
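The "see the highlights in source" step can be approximated outside the editor. The following is a rough sketch, not the PatML Editor's implementation: the `highlight` function and its `[[` / `]]` markers are hypothetical stand-ins for whatever visual highlighting the editor applies.

```python
import re

def highlight(source, pattern, open_mark="[[", close_mark="]]"):
    """Wrap every match of pattern in visible markers."""
    out, last = [], 0
    for m in re.finditer(pattern, source):
        out.append(source[last:m.start()])                 # text before the match
        out.append(open_mark + m.group(0) + close_mark)    # the match itself
        last = m.end()
    out.append(source[last:])                              # trailing text
    return "".join(out)

marked = highlight("<b>John Smith</b>, born 1850", r"\d{4}")
```

Seeing exactly which span an expression matched is what makes the final "adjust" step practical: an expression that matches too much or too little is visible at once.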

24
PatML Editor Interface
Tree representing PatML structure
Text area with sample page source
Browser with rendered sample page
25
(No Transcript)
26
Fast and Versatile
  • Regular sites can be integrated in hours
  • Adaptable to any type of information

27
Implementation to Date
  • Genesis uses PatML files to search a variety of
    sites
  • Searches TNG, Retrospect-GDS, Family Search,
    GedCom and Kansas Gunslingers
  • Standardizes information for a common data model
  • Simultaneously searches other sites (in different
    formats) for people with similar information
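The slides do not describe the common data model, so the sketch below only illustrates the idea of standardizing records from differently formatted sites before comparing them. The `Person` class, the two raw record layouts, and the `similar` rule are all assumptions invented for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Person:
    """Hypothetical common data model for extracted people."""
    given_name: str
    surname: str
    birth_year: Optional[int]

def _year(text):
    """Pull the first four-digit year out of a free-form date string."""
    m = re.search(r"\d{4}", text or "")
    return int(m.group(0)) if m else None

def from_site_a(raw):
    # Assumed layout: {"name": "John Smith", "birth": "1850"}
    given, _, surname = raw["name"].rpartition(" ")
    return Person(given, surname.title(), _year(raw.get("birth")))

def from_site_b(raw):
    # Assumed layout: {"surname": "SMITH", "given": "John", "born": "12 Mar 1850"}
    return Person(raw["given"], raw["surname"].title(), _year(raw.get("born")))

def similar(a, b):
    """Very rough similarity rule: same surname and same birth year."""
    return a.surname == b.surname and a.birth_year == b.birth_year

a = from_site_a({"name": "John Smith", "birth": "1850"})
b = from_site_b({"surname": "SMITH", "given": "John", "born": "12 Mar 1850"})
```

Once both records are in the same shape, a search across sites reduces to comparing `Person` values, whatever format each site originally used.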

28
Results
29
Results
  • Produced PatML that correctly extracts data from
    TNG, RGDS, GedCom Sites, and Kansas Gunslingers
  • User interface provides an improved debugging
    environment
  • One-tenth the coding time with PatML generation tools
    compared with similarly functioning hand-coded
    parsers

30
Limitations
  • Sites must be recognizable with regular
    expressions
  • Even regular sites have page-to-page HTML
    variations
  • Regular expressions are prone to programmer error
  • Regular expression operations can be slow

31
Future Work
  • Automatic regular expression generation
  • Parsing links to extract data on connected pages
  • Use in other applications and fields
  • XPath approaches
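As one possible reading of the "XPath approaches" item, the sketch below selects the same kind of facts by document structure instead of by regular expression. It uses the limited XPath subset in Python's standard `xml.etree.ElementTree`, and it assumes the page is well-formed XML; the sample markup and class names are hypothetical, and real HTML would usually need a tolerant HTML parser first.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment.
PAGE = """
<table>
  <tr><td class="name">John Smith</td><td class="birth">1850</td></tr>
  <tr><td class="name">Mary Jones</td><td class="birth">1854</td></tr>
</table>
"""

root = ET.fromstring(PAGE)

# Each row becomes one record; cells are selected by attribute value.
rows = []
for tr in root.findall(".//tr"):
    name = tr.find("td[@class='name']")
    birth = tr.find("td[@class='birth']")
    rows.append({
        "name": name.text if name is not None else None,
        "birth": birth.text if birth is not None else None,
    })
```

Structural selection of this kind depends on element nesting instead of the exact character sequence, so it may tolerate some of the page-to-page variation listed under Limitations.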