Title: Efficiently Querying Contradictory and Uncertain Genealogical Data
1. Efficiently Querying Contradictory and Uncertain Genealogical Data
- Lars E. Olson and David W. Embley
- DEG Lab
- BYU Computer Science Dept.
Supported by National Science Foundation Grant 0083127
2. Introduction
- Integrating data from multiple sources
- Some data just doesn't fit the data model
- Multiple data sources lead to conflicting data
- Uncertain or imprecise data
- Data that violates constraints
- Sometimes it's not possible to resolve the data
- PAF / Gedcom
3. Disjunctive Databases
OR-tables (Imielinski and Vadaparty, 1989)
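A minimal sketch of the OR-table idea, assuming each attribute simply holds the set of values it may take. The row layout and the `possible_worlds` helper are illustrative names, not taken from Imielinski and Vadaparty; the sample values are borrowed from the example used later in these slides.

```python
from itertools import product

# Hypothetical OR-table row: each attribute maps to the set of values it may take.
# A singleton set is a definite value; a larger set is a disjunction (an OR-object).
person_row = {
    "name": {"John Doe"},
    "birth_date": {"12 Mar 1840", "12 Mar 1841"},
    "marriage_date": {"15 Jun 1869", "16 Jun 1869"},
}

def possible_worlds(row):
    """Yield every definite tuple obtained by picking one value per attribute."""
    attrs = sorted(row)
    for choice in product(*(sorted(row[a]) for a in attrs)):
        yield dict(zip(attrs, choice))

worlds = list(possible_worlds(person_row))
print(len(worlds))  # two independent two-way disjunctions
```

The number of possible worlds multiplies with every disjunction, which is why naive enumeration does not scale.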
4. Shortcomings of OR-tables
- Can't correlate between possible values
- Answering queries in general is coNP-complete (Imielinski and Vadaparty)
5. Sub-relation Data Construct
- Solution: store the correlated data in its own relation
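One way to picture the difference, as a sketch only: independent OR-values allow every combination, while a sub-relation keeps correlated values together. The pairing of birth dates with cities below is a made-up example, and the variable names are hypothetical.

```python
from itertools import product

# OR-table style: each attribute carries its own independent set of alternatives.
or_row = {
    "birth_date": {"12 Mar 1840", "12 Mar 1841"},
    "birth_city": {"Nauvoo", "Quincy"},
}
or_worlds = set(product(or_row["birth_date"], or_row["birth_city"]))

# Sub-relation style: correlated attributes are stored together,
# one tuple per alternative (hypothetical pairing, e.g. one tuple per source).
birth_subrelation = {
    ("12 Mar 1840", "Nauvoo"),
    ("12 Mar 1841", "Quincy"),
}

# Combinations the OR-table admits even though no source ever claimed them.
spurious = or_worlds - birth_subrelation
print(sorted(spurious))
```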
6. Disjunctive Database Problems
- How do we avoid the coNP-completeness problem and answer queries efficiently?
- If more than one value is possible, which one is the most likely?
- Other questions to be solved:
  - Where are the constraint violations?
  - How do we map sub-relations to physical storage?
  - How do we efficiently update the database?
7. Transitive Closure of Disjunctive Graphs
Solving the coNP-completeness problem [LYY95]
[Figure: two panels, "Disjunctive graph" and "Possible interpretation", each drawn over nodes a, b, c, d, e, f]
Transitive closure of a: {a, d, e}
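A brute-force sketch of the idea, assuming a disjunctive edge has one source and a set of alternative targets, and an interpretation keeps exactly one target per edge. The edge list is hypothetical (the slide's actual edges are not recoverable here), and this enumeration is exponential; it illustrates the semantics only and is not the efficient method of [LYY95].

```python
from itertools import product

# Hypothetical disjunctive graph: each entry is (source, alternative targets).
disjunctive_edges = [
    ("a", ("b", "d")),  # a -> b OR a -> d
    ("b", ("c",)),
    ("d", ("e",)),
    ("c", ("f",)),
]

def interpretations(edges):
    """Yield every definite edge list, choosing one target per disjunctive edge."""
    sources = [src for src, _ in edges]
    for targets in product(*(alts for _, alts in edges)):
        yield list(zip(sources, targets))

def closure(start, edges):
    """Nodes reachable from start (start included) in a definite graph."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

closures = [closure("a", g) for g in interpretations(disjunctive_edges)]
certain = set.intersection(*closures)   # reachable in every interpretation
possible = set.union(*closures)         # reachable in at least one
```

With these assumed edges, the interpretation that picks a -> d gives the closure {a, d, e}.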
8. Using Disjunctive Graphs to Answer Queries
Table Person
Table Place
9. Using Disjunctive Graphs to Answer Queries
[Figure: disjunctive graph linking a Person node (ID, Name, Birth Date, Marriage Date, Birth Place) to Place nodes (ID, City, State). Values shown: ID 1, John Doe, 12 Mar 1840, 12 Mar 1841, 16 Jun 1869, 15 Jun 1869 for Person; IDs 1 and 2, Nauvoo, Commerce, Quincy, Illinois for Place]
10. Using Disjunctive Graphs to Answer Queries
[Figure: portion of the disjunctive graph linking the Person node (ID 1) to the Place nodes (ID, City, State). Values shown: IDs 1 and 2, Nauvoo, Commerce, Quincy, Illinois]
11. Using Disjunctive Graphs to Answer Queries
meaning what?
- Definitely known?
- All possible values?
- Most likely value?
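The three readings can be sketched as three small functions over a weighted disjunction. This assumes alternatives carry probabilities that sum to 1; the assignment of 0.8 and 0.2 to particular cities is hypothetical, as are the function names.

```python
# Hypothetical weighted disjunction: birth-place alternatives with probabilities.
birth_city = {"Nauvoo": 0.8, "Quincy": 0.2}
state_of = {"Nauvoo": "Illinois", "Quincy": "Illinois"}

def definitely_known(alternatives):
    """Return the value only if it is the single alternative, else None."""
    return next(iter(alternatives)) if len(alternatives) == 1 else None

def all_possible(alternatives):
    """Return every value that holds in at least one interpretation."""
    return set(alternatives)

def most_likely(alternatives):
    """Return the alternative with the highest probability."""
    return max(alternatives, key=alternatives.get)

# Project the uncertain city onto its state, summing probabilities per state.
birth_state = {}
for city, prob in birth_city.items():
    state = state_of[city]
    birth_state[state] = birth_state.get(state, 0.0) + prob
```

Under these assumptions the city is uncertain, yet the state is definitely known, because every alternative agrees on it.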
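[Figure: the Person/Place disjunctive graph from the previous slides (Person ID 1 with Birth Place; Place IDs 1 and 2 with City values Nauvoo, Commerce, Quincy and State Illinois), now with edge weights 1.0, 0.2, and 0.8]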
12. Using Disjunctive Graphs to Answer Queries
[Figure: two copies of the Person relation, P1 and P2, each with IDs 1 and 2. Values shown: John Doe, 12 Mar 1840, 12 Mar 1841, James Doe, 13 Mar 1840]
13. Limiting the Search Space
- In genealogy, most disjunctions are mutually independent
- Disjunctions that aren't independent are limited to immediate family relations
- Build a relation containing all immediate family members
(Person P1 ⋈[P1.parent = P2.ID] Person P2 ⋈[P2.ID = P3.parent] Person P3)
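Reading the expression above as a three-way self-join of Person on P1.parent = P2.ID and P2.ID = P3.parent (an interpretation of the slide), the immediate-family relation can be sketched as follows. The rows, the single-parent simplification, and the name "Parent Doe" are hypothetical.

```python
# Hypothetical Person rows: (id, name, parent_id); one parent only, for brevity.
persons = [
    (1, "John Doe", 3),
    (2, "James Doe", 3),
    (3, "Parent Doe", None),
]

def immediate_family(rows):
    """Self-join: P1.parent = P2.ID and P2.ID = P3.parent.

    Returns (child_id, parent_id, sibling_id) triples; a child is paired
    with itself as well, exactly as the plain join would produce.
    """
    by_id = {row[0]: row for row in rows}
    result = []
    for p1 in rows:
        p2 = by_id.get(p1[2])
        if p2 is None:
            continue  # P1 has no recorded parent
        for p3 in rows:
            if p3[2] == p2[0]:
                result.append((p1[0], p2[0], p3[0]))
    return result

family = immediate_family(persons)
```

Constraint checks then only need to consider the disjunctions inside one such family group, rather than the whole database.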
14. Limiting the Search Space
- Example constraints
- Each parent should be born before their children
- Children of the same parent should be born at least 9 months apart (except multiple births)
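A sketch of how such constraints could prune alternatives within one family. All dates except John's two candidate birth dates are hypothetical, and the 270-day gap and 2-day multiple-birth window are assumed approximations, not values from the slides.

```python
from datetime import date
from itertools import product

# Hypothetical family: one parent, two children with alternative birth dates.
parent_birth = date(1815, 5, 1)
child_births = {
    "John": [date(1840, 3, 12), date(1841, 3, 12)],
    "James": [date(1840, 6, 1)],
}

MIN_GAP_DAYS = 270    # assumed stand-in for "9 months"
TWIN_WINDOW_DAYS = 2  # assumed tolerance for multiple births

def consistent(parent, births):
    """Check both example constraints for one choice of birth dates."""
    if any(parent >= b for b in births):
        return False  # parent must be born before every child
    ordered = sorted(births)
    for earlier, later in zip(ordered, ordered[1:]):
        gap = (later - earlier).days
        if TWIN_WINDOW_DAYS < gap < MIN_GAP_DAYS:
            return False  # too close, and not a multiple birth
    return True

names = sorted(child_births)
surviving = [
    dict(zip(names, combo))
    for combo in product(*(child_births[n] for n in names))
    if consistent(parent_birth, combo)
]
```

With these assumed dates, only the alternative in which John was born in 1841 satisfies both constraints.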
[Figure: three copies of the Person relation (P1, P2, P3), each listing IDs 1 to 4, connected by edges labeled "parent" and "child = parent^-1"; two edges carry weight 1.0]
15. Conclusions
- Genealogical data can be stored in a disjunctive database format.
- Many common queries can be computed in polynomial time.
- We can detect intractable queries and limit the search space required, usually enough to get polynomial time.