Automatic Extraction From and Reasoning About Genealogical Records: A Prototype

Title: Retrieving Danish Genealogical Records on the Semantic Web
Author: Charla Woodbury
Last modified by: Charla Woodbury
Created Date: 12/1/2004 4:00:40 PM

Slides: 41
Provided by: charlaw
Learn more at: https://www.deg.byu.edu


1
Automatic Extraction From and Reasoning About
Genealogical Records: A Prototype
  • By
  • Charla J. Woodbury
  • A thesis submitted to the faculty of
  • Brigham Young University
  • in partial fulfillment of the requirements for
    the degree of
  • Master of Science

2
Digital Images Human Index
  • Large number of competing family history
    websites
  • Digital images
  • Human indexes
  • Researchers hunting through records and
    indexes to put families together

3
Problem
  • Large amounts of primary genealogical data
  • Big projects to index and extract records
  • Two independent indexers and adjudication
  • Millions of human hours used to index or match
    records for names and families

4
Automated Extraction Solution
  • Create a specialized extraction ontology to
    interpret and label genealogical data
  • Add rules and logic that:
    • Label family roles - husband, daughter, etc.
    • Link family relationships
      • HUSBAND - WIFE
      • PARENT - CHILD

5
Outline
  1. Data Preparation
  2. Ontology Extraction System (OntoES)
  3. OWL File and SWRL Rules
  4. SPARQL Queries
  5. Experimental Results
  6. Conclusions

6
1. Data Preparation
  • Collect machine-readable records from three
    different countries
  • Format records as HTML for extraction
  • Prepare lexicons for names, places, etc.

7
New England Vital Records: Beverly,
Massachusetts, 1668-1849
8
Danish Parish: Maglebye, Praesto, 1646-1813
9
English Parish: South Petherton, Somersetshire,
1574-1901
10
SOUTH PETHERTON MARRIAGES (from genuki)
  • same day 1576 Nicholas Patch and Christian Denman
  • 26 Jan 1605 Richard Patch and Joan Lavor
  • 25-Sep 1613 John Elliott and Joan Woodbery
  • 7-Aug 1615 Thomas Prime and Maria Parry
  • 29-Jan 1616 William Woodbery and Elizabeth Patch
  • 2-May 1620 William Hillerd and Fortu Patch
  • 17-Sep 1622 Nicholas Patch and Elizabeth Owsley
  • 22-Jan 1627 Richard Patch and Mary White
  • 15-Jan 1630 Andrew Elliott and Joan Patch
  • 12-Feb 1639 Andrew Elliott and Joan Pitts
11
Transcript of the original record
  • 1576/1577 eodem die Nicholaus Patch Christinam
    Denman
  • 26 Jan 1605 Richard Patch et Joanna Lavor
  • 1613 Septembris 26 Johannes Elliott et Joanna
    Woodbery matrimonis
    cominguntur
  • 1615 Augusti 7 Thoms Prime et Maria Patch
    matrimonio cominguntur
  • 1616/1617 Januarij 29 Wilhelmus Woodbery et
    Elizabetha Patch matrimonio cominguntur
  • 1620 Maij 2 Wilhelmus Hillerd et Fortu Patch
  • 1622 Septembris 17 Nicholas Patch et Elizabetha
    Owsley matrimonio cominguntur
  • 1627/1628 Januarij 22 Richardus Patch et Maria
    White matrimonio cominguntur
  • 1630/1631 Januarij 15 Andreas Elliott et Joanna
    Patch matrimonio cominguntur
  • 1639/1640 Februarij 12 Andreas Elliott et Joanna
    Pittes matrimonio cominguntur

12
2. Ontology Extraction System
  • OntoES automatically interprets and correctly
    labels genealogical data using:
    • Data frames
    • Regular expressions
    • Lexicons
    • Date conversion methods

13
Marriage Ontology
14
Data Frame Editor
15
Regular expressions
  • MARDATE
    • Value expression
      • Example type: 25-Sep 1613
      • (0\d|1\d|2\d|30|31|\d)-Month\.?\s(\d\d\d\d)
    • Keyword expression
      • (\b(md\.?|marry|marriage|married|maried|wed|wedding)\b)
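
A minimal Python sketch of how a value expression of this shape might be applied to a marriage entry. It assumes the Month placeholder is expanded from a month lexicon; the `MONTHS` list and the `find_mardates` helper are hypothetical names for illustration and are not part of OntoES.

```python
import re

# Hypothetical mini-lexicon; a real month lexicon would be far larger.
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Expand the Month placeholder into an alternation of lexicon entries.
MONTH_ALT = "|".join(re.escape(m) for m in MONTHS)

# Day, hyphen, month, optional period, whitespace, four-digit year.
MARDATE = re.compile(
    rf"\b(0\d|1\d|2\d|30|31|\d)-({MONTH_ALT})\.?\s(\d\d\d\d)\b",
    re.IGNORECASE,
)

def find_mardates(text):
    """Return (day, month, year) tuples for every date found in text."""
    return [(int(d), m.lower(), int(y)) for d, m, y in MARDATE.findall(text)]
```

This pattern only covers the hyphenated example type shown on the slide; an entry written as "26 Jan 1605" would need its own value expression.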

16
Sample MONTH LEXICON
  • 10ber
  • 7ber
  • 8ber
  • 9ber
  • apr
  • april
  • aprilis
  • aug
  • august
  • augusti
  • augustus
  • avr
  • avril
  • avrilis
  • dec
  • december
  • decembr
  • decembre
  • decembri
  • feb
  • febr
  • februari
  • february
  • jan
  • januarij
  • january
  • jul
  • juli
  • julius
  • july
  • jun
  • june
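
A small sketch of how lexicon variants like these could be mapped to a month number before date conversion. The `MONTH_NUMBER` table and `month_number` helper are assumptions made for illustration; the numeral forms follow the Latin month names (7ber is September, 8ber October, and so on).

```python
# Hypothetical mapping from lexicon variants in the sample to month numbers.
MONTH_NUMBER = {
    "jan": 1, "januarij": 1, "january": 1,
    "feb": 2, "febr": 2, "februari": 2, "february": 2,
    "apr": 4, "april": 4, "aprilis": 4, "avr": 4, "avril": 4, "avrilis": 4,
    "jun": 6, "june": 6,
    "jul": 7, "juli": 7, "julius": 7, "july": 7,
    "aug": 8, "august": 8, "augusti": 8, "augustus": 8,
    "7ber": 9, "8ber": 10, "9ber": 11, "10ber": 12,
    "dec": 12, "december": 12, "decembr": 12, "decembre": 12, "decembri": 12,
}

def month_number(token):
    """Look up a month token, ignoring case and a trailing period."""
    return MONTH_NUMBER.get(token.strip().rstrip(".").lower())
```

Unknown tokens return `None`, so a caller can tell a lexicon miss apart from a valid month.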

17
Object Level
18
CANONICALIZATION METHODS inside the ontology
  • Regularize date (Julian format YYYYddd)
    • 1620 2-May → 1620093
  • Display stored Julian format as DD MMM YYYY
    • 1620093 → 2 MAY 1620

19
Feast Dates
  • Dates expressed as a holy day
    • Fixed Dates
      • Christmas 1720 → 25 DEC 1720
    • Moveable Dates around Easter
      (36 possible Easter dates with leap year
      variation)
      • 1723 Dnica Septuagesima → 24 JAN 1723
  • Same day as previous entry
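
The feast-date conversion can be sketched in Python as follows. This is an illustration under stated assumptions, not the prototype's method: it uses the well-known anonymous Gregorian Easter computation, so it fits records kept under the Gregorian calendar (Denmark from 1700), while earlier English entries would need a Julian Easter instead. The names `gregorian_easter`, `feast_date`, `display`, and the small feast tables are hypothetical.

```python
from datetime import date, timedelta

MONTH_ABBR = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def gregorian_easter(year):
    """Easter Sunday in the Gregorian calendar (anonymous algorithm)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)

# Day offsets from Easter Sunday for a few moveable feasts.
MOVEABLE = {"septuagesima": -63, "ascension": 39, "pentecost": 49}
# Fixed feasts as (month, day).
FIXED = {"christmas": (12, 25)}

def feast_date(name, year):
    """Resolve a feast name and year to a calendar date."""
    key = name.lower()
    if key in FIXED:
        month, day = FIXED[key]
        return date(year, month, day)
    return gregorian_easter(year) + timedelta(days=MOVEABLE[key])

def display(d):
    """Format a date as DD MMM YYYY, matching the slide's display form."""
    return f"{d.day} {MONTH_ABBR[d.month - 1]} {d.year}"
```

For the slide's two examples, `display(feast_date("Septuagesima", 1723))` gives `24 JAN 1723` and `display(feast_date("Christmas", 1720))` gives `25 DEC 1720`.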

20
Run Ontology
  • Input
    • Ontology (Created with OntoES)
    • HTML data (Hypertext Markup Language)
  • Output
    • RDF database (Resource Description Framework)
    • OWL file (Web Ontology Language)

21
Ontology Workbench
22
Extracted Marriages
Bet Date MarDate NameM NameF NameU
same day 1576 Nicholas Patch Christian Denman
26 JAN 1605 Richard Patch Joan Lavor
26 SEP 1613 John Elliott Joan Woodbery
7 AUG 1615 Thomas Prime Maria Parry
29 JAN 1616 William Woodbery Elizabeth Patch
2 MAY 1620 William Hillerd Fortu Patch
17 SEP 1622 Nicholas Patch Elizabeth Owlsey
22 JAN 1627 Richard Patch Mary White
16 JAN 1630 Andrew Elliott Joan Patch
12 FEB 1639 Andrew Elliott Joan Pitts
23
3. OWL File and SWRL Rules
  • OWL HEADER
    • <owl:Class rdf:ID="MarriageRecord"/>
    • <owl:Class rdf:ID="Person"/>
    • <owl:Class rdf:ID="NameU"/>
    • <owl:DatatypeProperty rdf:ID="NameUValue">
    • <rdfs:domain rdf:resource="#NameU"/>
    • <rdfs:range rdf:resource="&xsd;string"/>
    • </owl:DatatypeProperty>
  • PERSON - NAMEU
    • <owl:ObjectProperty rdf:ID="Person-NameU">
    • <rdfs:domain rdf:resource="#Person"/>
    • <rdfs:range rdf:resource="#NameU"/>
    • <owl:inverseOf>
    • <owl:ObjectProperty rdf:ID="NameU-Person"/>
    • </owl:inverseOf>
    • </owl:ObjectProperty>

24
Sample RDF Triples
  • Person_10 sameAs Person_10
  • Person_10 type Thing
  • Person_10 type Person
  • NameU_0 NameUValue Christian Denman
  • NameU_0 sameAs NameU_0
  • NameU_0 type Thing
  • NameU_0 type NameU
  • NameM_4 NameMValue Nicholas Patch
  • NameM_4 sameAs NameM_4
  • NameM_4 type Thing
  • NameM_4 type NameM

25
SWRL Rules
  • Define OWL Class
    • Example: Husband
    • <owl:Class rdf:ID="Husband"/>
  • Define Rule
    • Example: Person with male name is a Husband
    • Person-NameM(?x,?y) -> Husband(?x)

26
Marriage Person_10 to Person_4
  • Person_10
    • <Person rdf:ID="Person_10">
    • <Person-NameU rdf:resource="#NameU_0" />
    • </Person>
  • MarriageRecord_7
    • <MarriageRecord rdf:ID="MarriageRecord_7">
    • <MarriageRecord-Person rdf:resource="#Person_4" />
    • <MarriageRecord-Person rdf:resource="#Person_10" />
    • </MarriageRecord>
  • NameM_4
    • <NameM rdf:ID="NameM_4">
    • <NameMValue>Nicholas Patch</NameMValue>
    • </NameM>
  • Person_4
    • <Person rdf:ID="Person_4">
    • <Person-NameM rdf:resource="#NameM_4" />
    • </Person>

27
Rule HEAD in OWL file
  • <swrl:Imp rdf:ID="Def-Husband">
  • <swrl:head rdf:parseType="Collection">
  • <swrl:ClassAtom>
  • <swrl:argument1 rdf:resource="#x"/>
  • <swrl:classPredicate rdf:resource="#Husband"/>
  • </swrl:ClassAtom>
  • </swrl:head>

28
Rule BODY in OWL file
  • <swrl:body>
  • <swrl:IndividualPropertyAtom>
  • <swrl:propertyPredicate rdf:resource="#Person-NameM"/>
  • <swrl:argument1 rdf:resource="#x"/>
  • <swrl:argument2 rdf:resource="#y"/>
  • </swrl:IndividualPropertyAtom>
  • </swrl:body>
  • </swrl:Imp>

29
Related Rules
  • If NameF is populated, then the person named in NameU is the Husband
  • Person-NameF(?w,?v) ∧ MarriageRecord-Person(?z,?w)
    ∧ MarriageRecord-Person(?z,?x) ∧ Person-NameU(?x,?y)
    -> Husband(?x)

30
HusbandOf Rule
  • Husband(?x) ∧ Wife(?y) ∧ MarriageRecord-Person(?z,?x)
    ∧ MarriageRecord-Person(?z,?y)
    -> HusbandOf(?x,?y)
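
To make the rule chain concrete, here is a minimal Python sketch that applies the Husband and HusbandOf rules to the facts of the Person_10 / Person_4 marriage. It stands in for a SWRL reasoner and is not the prototype's engine. The Wife rule is not shown on the slides, so the version here (an unknown-gender name in the same record as a male name) is an assumption, and the `FACTS`, `subjects`, and `infer` names are hypothetical.

```python
# Extracted facts as (subject, property, object) triples.
FACTS = {
    ("Person_10", "Person-NameU", "NameU_0"),
    ("Person_4", "Person-NameM", "NameM_4"),
    ("MarriageRecord_7", "MarriageRecord-Person", "Person_4"),
    ("MarriageRecord_7", "MarriageRecord-Person", "Person_10"),
}

def subjects(facts, prop):
    """All subjects that carry the given property."""
    return {s for s, p, _ in facts if p == prop}

def infer(facts):
    """Apply the three rules once, in dependency order."""
    male = subjects(facts, "Person-NameM")
    unknown = subjects(facts, "Person-NameU")
    # (record, person) pairs from MarriageRecord-Person.
    members = {(s, o) for s, p, o in facts if p == "MarriageRecord-Person"}

    # Person-NameM(?x,?y) -> Husband(?x)
    husbands = set(male)
    # Assumed Wife rule: unknown-gender name sharing a record with a male name.
    wives = {x for z, x in members if x in unknown
             for z2, w in members if z2 == z and w in male}
    # Husband(?x) and Wife(?y) in the same record -> HusbandOf(?x,?y)
    husband_of = {(x, y) for z, x in members if x in husbands
                  for z2, y in members if z2 == z and y in wives}

    inferred = set(facts)
    inferred |= {(x, "type", "Husband") for x in husbands}
    inferred |= {(y, "type", "Wife") for y in wives}
    inferred |= {(x, "HusbandOf", y) for x, y in husband_of}
    return inferred
```

A single pass is enough here because HusbandOf depends only on Husband and Wife, which are computed first; a general rule engine would iterate until no new facts appear.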

31
Auxiliary Name Rules
  • NameM(?x) -> Name(?x)
  • NameF(?x) -> Name(?x)
  • NameU(?x) -> Name(?x)
  • NameMValue(?x) -> NameValue(?x)
  • NameFValue(?x) -> NameValue(?x)
  • NameUValue(?x) -> NameValue(?x)
  • Person-NameM(?x,?y) -> Person-Name(?x,?y)
  • Person-NameF(?x,?y) -> Person-Name(?x,?y)
  • Person-NameU(?x,?y) -> Person-Name(?x,?y)

32
SPARQL Query: Who is the Husband of Christian Denman?
  • PREFIX : <http://www.deg.byu.edu/ontology/Marriage#>
  • SELECT ?Husband
  • WHERE {
  • ?X :NameValue "Christian Denman" .
  • ?Y :Person-Name ?X .
  • ?W :HusbandOf ?Y .
  • ?W :Person-Name ?V .
  • ?V :NameValue ?Husband }
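
The query is a chain of five triple patterns joined on shared variables. The Python sketch below walks the same chain over a hand-written list of triples to show how the answer is reached; it is an illustration only and does not use a SPARQL engine. The `TRIPLES` list (which already includes facts the rules would infer), `match`, and `husbands_of` are hypothetical names.

```python
# Extracted and inferred facts relevant to the query, as
# (subject, property, object) triples.
TRIPLES = [
    ("NameU_0", "NameValue", "Christian Denman"),
    ("NameM_4", "NameValue", "Nicholas Patch"),
    ("Person_10", "Person-Name", "NameU_0"),
    ("Person_4", "Person-Name", "NameM_4"),
    ("Person_4", "HusbandOf", "Person_10"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)]

def husbands_of(name, triples):
    """Follow the five query patterns in order, joining on shared variables."""
    results = []
    for x, _, _ in match(triples, p="NameValue", o=name):          # ?X NameValue name
        for y, _, _ in match(triples, p="Person-Name", o=x):       # ?Y Person-Name ?X
            for w, _, _ in match(triples, p="HusbandOf", o=y):     # ?W HusbandOf ?Y
                for _, _, v in match(triples, s=w, p="Person-Name"):   # ?W Person-Name ?V
                    for _, _, h in match(triples, s=v, p="NameValue"):  # ?V NameValue ?Husband
                        results.append(h)
    return results
```

Calling `husbands_of("Christian Denman", TRIPLES)` returns a list containing only "Nicholas Patch".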

33
Query Results
  • Husband
  • "Nicholas Patch"^^<http://www.w3.org/2001/XMLSchema#string>

34
Query Results
  • Husband
  • "Nicholas Patch"^^<http://www.w3.org/2001/XMLSchema#string>

South Petherton Marriages
  • same day 1576 Nicholas Patch and Christian Denman
  • 26 Jan 1605 Richard Patch and Joan Lavor
  • 25-Sep 1613 John Elliott and Joan Woodbery
  • 7-Aug 1615 Thomas Prime and Maria Parry
  • 29-Jan 1616 William Woodbery and Elizabeth Patch
  • 2-May 1620 William Hillerd and Fortu Patch
  • 17-Sep 1622 Nicholas Patch and Elizabeth Owsley
  • 22-Jan 1627 Richard Patch and Mary White
  • 15-Jan 1630 Andrew Elliott and Joan Patch
  • 12-Feb 1639 Andrew Elliott and Joan Pitts
Nicholas Patch because
  • NameValue(Nicholas Patch) and Name-NameValue(n1, Nicholas Patch)
    and Name(n1) is NameM(n1) and Person-NameM(p1, n1)
  • NameValue(Christian Denman) and Name-NameValue(n2, Christian Denman)
    and Name(n2) is NameU(n2) and Person-NameU(p2, n2)
  • Husband(p1) because Person-NameM(p1, n1)
  • Wife(p2) because Person-NameU(p2, n2)
    and Person-MarriageRecord(p2, r1) and MarriageRecord-Person(r1, p1)
    and Person-NameM(p1, n1)
  • HusbandOf(p1, p2) because Husband(p1) and Wife(p2)
    and MarriageRecord-Person(r1, p1) and MarriageRecord-Person(r1, p2)
35
5. Experimental Results
  • Extraction Results
  • American Extraction Problem
  • Rule Results

36
Extraction Results
            Records  Entities  Recalled  Recall %  Errors  Precision %
MARRIAGES
  English       188       594       588      99.0       8         98.7
  American      608      1824      1630      89.4      34         98.0
  Danish        171       543       538      99.1      10         98.2
BIRTHS
  English      3153      9489      9394      99.0      61         99.4
  American      675      2055      1809      88.0      33         98.2
  Danish        677      2061      2042      99.1      15         99.3
DEATHS
  English      3458      8675      8589      99.0      83         99.0
  American      510      1305      1148      88.0      28         97.6
  Danish        833      2113      2093      99.1      19         99.1
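
The slide does not state how recall and precision were computed. The sketch below uses formulas inferred from the table itself, which appear to reproduce the reported percentages: recall as recalled entities over total entities, and precision as recalled entities over recalled plus errors. The function names are hypothetical.

```python
def recall(recalled, total_entities):
    """Percentage of entities in the records that were extracted."""
    return round(100 * recalled / total_entities, 1)

def precision(recalled, errors):
    """Percentage of extracted entities that were correct (inferred formula)."""
    return round(100 * recalled / (recalled + errors), 1)
```

For the English marriages row, `recall(588, 594)` gives 99.0 and `precision(588, 8)` gives 98.7, matching the table.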
37
American Difficulty
  • BIRTH
  • WOODBURY, Charles Henry Charles William, P. R.
    4., s. Henry housewright. dup. and Henrietta
    (Galloup), Dec. 4, 1845.
  • Extra information inside brackets and parentheses
  • Charles William twin of Charles Henry
  • Henry housewright identified as NAME
  • Henrietta (Galloup) identified as NAME

38
Rules Results
  • 100% Precision and Recall
  • (Once rules are well-defined, the results are
    perfect.)
  • Database Size
  • (The RDF database is much larger when rule
    triples are added.)
  • NEW PROPERTIES: husband, wife, parent, child
  • NEW LINKS

39
Size Impact of Adding Rules
MARRIAGE (21 rules)
              Triples  OWL (# lines)  OWL File (kilobytes)
OWL File          814            498                    14
W/Rules          1009            785                    31
Difference        195            287                    17
Increase (%)    23.96          57.63                121.43

EVENT (30 rules)
              Triples  OWL (# lines)  OWL File (kilobytes)
OWL File         2232           1405                    15
W/Rules          2983           1873                    75
Difference        751            468                    60
Increase (%)    33.65          33.31                400.00
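
The Increase row is the difference expressed as a percentage of the original OWL file figure. A short sketch of that arithmetic, with a hypothetical helper name:

```python
def percent_increase(before, after):
    """Growth from before to after, as a percentage of before."""
    return round(100 * (after - before) / before, 2)
```

For example, the marriage ontology grows from 814 to 1009 triples, and `percent_increase(814, 1009)` gives 23.96.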
40
6. Conclusions
  • Speed up data indexing
  • Make production of a full index easier
  • Ground the index in original documents
  • Provide for inferred facts
  • Simplify as well as augment record search
  • Help link records and form family groups and
    ancestral lines
