Title: Target schema and domain evolution
1 Determine the likely arson suspect
- limited transportation
- fondness for patterns
Name of person or incident: String
Incident cause: String
Location:
- street: string, from P.O. street list
- city: string, from city list
- zip: integer, from P.O. list in area of interest
- lat, long: float, in area of interest
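The target schema above can be sketched as a small record type with domain checks. This is only an illustration: the class name `TargetRecord`, the sample street, city, and zip lists, and the bounding box for the area of interest are hypothetical placeholders, not values from the slides.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical domains; real ones would come from P.O. street, city, and zip lists.
STREETS = {"Twinford Drive"}
CITIES = {"Exampleville"}
ZIPS = {12345}
# Hypothetical area of interest: (min_lat, max_lat, min_long, max_long)
AREA = (40.0, 41.0, -75.0, -74.0)


@dataclass
class TargetRecord:
    name: str                  # name of person or incident
    cause: Optional[str]       # incident cause (None for people)
    street: Optional[str]
    city: Optional[str]
    zip: Optional[int]
    lat: float
    long: float

    def domain_errors(self):
        """Return the names of attributes whose values fall outside their domain."""
        errors = []
        if self.street is not None and self.street not in STREETS:
            errors.append("street")
        if self.city is not None and self.city not in CITIES:
            errors.append("city")
        if self.zip is not None and self.zip not in ZIPS:
            errors.append("zip")
        min_lat, max_lat, min_long, max_long = AREA
        if not (min_lat <= self.lat <= max_lat):
            errors.append("lat")
        if not (min_long <= self.long <= max_long):
            errors.append("long")
        return errors
```

As the schema and its domains evolve during the task, only the domain sets and the field list would need to change.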
Guess a task-specific schema
One person living at the center of a geographic
pattern of historical and current incidents
Hypothesis formation
Target schema and domain evolution
Source metadata preparation
Transformation and analysis
Find sources to fill in target
Familiarization
Clarify semantics and domains
- Current events (xml)
- Historical events (xls)
- Historical events (html)
- People (xls)
- Convert to CSV, then KML
- Load to Google Maps
- Set icon colors for visibility
- Make judgement about pattern
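The CSV-to-KML step might look roughly like the following sketch, which uses only the Python standard library. The function name `csv_to_kml` and the column names `name`, `lat`, and `long` are assumptions for illustration; the slides do not say how the conversion was actually done, and icon styling is omitted.

```python
import csv
import io
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def csv_to_kml(csv_text):
    """Convert CSV rows with name, lat, long columns into a KML document string."""
    # Serialize with KML as the default namespace (no prefix on tags).
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for row in csv.DictReader(io.StringIO(csv_text)):
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = row["name"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML orders coordinates as longitude,latitude.
        coords = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coords.text = f'{row["long"]},{row["lat"]}'
    return ET.tostring(kml, encoding="unicode")
```

The resulting string can be saved as a `.kml` file for upload to a mapping tool.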
Visualization
Mapping of inexpressible data
Theory formulation
Verification
Identify missing pieces
An answer: Jimmy West
Target data instantiation
Source data preparation
- Create target instance using CHIME (acceleration via learning by example)
- Select people and suspicious events
- Project down to name, lat, long
- Resolve duplicate entities with CHIME
Fill in target relation, learning by example
Remove extraneous data by projection
De-duplicate entities and attributes
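Projection and de-duplication can be illustrated with a few lines of Python. This sketch is a simplification: it treats rows as dictionaries and merges duplicates by exact match on a key, whereas the slides describe CHIME resolving duplicates with human guidance. The function names `project` and `dedupe` are hypothetical.

```python
def project(rows, attrs):
    """Keep only the listed attributes of each row (relational projection)."""
    return [{a: r[a] for a in attrs} for r in rows]


def dedupe(rows, key_attrs):
    """Keep the first row seen for each distinct key value, preserving order."""
    seen = {}
    for r in rows:
        key = tuple(r[a] for a in key_attrs)
        seen.setdefault(key, r)
    return list(seen.values())
```

For example, `dedupe(project(rows, ["name", "lat", "long"]), ["name"])` would project down to the three target attributes and then collapse rows that share a name.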
Data assessment and profiling
Find or build extension functions
- spelling errors in cause field
- Twinford Drive inconsistent with geo data
- Geo data and park names switched on two other fires
- Extension functions ready
  - split street, city, state, zip
  - street name to lat/long
  - zip code to lat/long
  - bin causes as suspicious or null
  - CSV to KML for map upload
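Two of the extension functions named above, splitting an address and binning causes, could be sketched as follows. The address pattern ("street, city, ST zip"), the keyword list, and the function names `split_address` and `bin_cause` are assumptions made for illustration. Plain keyword matching would not catch the spelling errors noted in the cause field, so real data would need fuzzier matching or human review.

```python
import re

# Hypothetical keywords that mark a cause as suspicious.
SUSPICIOUS_WORDS = {"arson", "suspicious", "incendiary"}


def split_address(address):
    """Split 'street, city, ST zip' into parts; return None if the pattern does not match."""
    m = re.fullmatch(r"\s*(.+?),\s*(.+?),\s*([A-Z]{2})\s+(\d{5})\s*", address)
    if m is None:
        return None
    street, city, state, zip_code = m.groups()
    # The target schema declares zip as an integer, so leading zeros are dropped here.
    return {"street": street, "city": city, "state": state, "zip": int(zip_code)}


def bin_cause(cause):
    """Return 'suspicious' if the cause text contains a suspicious keyword, else None."""
    text = (cause or "").lower()
    return "suspicious" if any(w in text for w in SUSPICIOUS_WORDS) else None
```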
Metadata matching
T.name ← People_info.name; historical_events.id; current_events.fire.dtg
T.cause ← bin(historical_events.cause)
T.street, .city, .zip ← split(People_info.address)
T.lat, .long ← getLatLong(split(People_info.address)); current_events.fire.latitude, .longitude; historical_events.Lat, .Lng
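The source-to-target mappings listed above can be read as one small function per source. The sketch below assumes each source record is a Python dictionary with the field names shown in the mapping; the helper names (`empty_target`, `from_person`, `from_historical`, `from_current`) are hypothetical, and the `split`, `get_lat_long`, and `bin_cause` extension functions are passed in rather than defined here.

```python
def empty_target():
    """Target tuple T with every attribute unset."""
    return {"name": None, "cause": None, "street": None,
            "city": None, "zip": None, "lat": None, "long": None}


def from_person(person, split, get_lat_long):
    """People_info: name, split address, and geocoded lat/long."""
    parts = split(person["address"])
    t = empty_target()
    t["name"] = person["name"]
    t["street"], t["city"], t["zip"] = parts["street"], parts["city"], parts["zip"]
    t["lat"], t["long"] = get_lat_long(parts)
    return t


def from_historical(event, bin_cause):
    """historical_events: id as name, binned cause, Lat/Lng."""
    t = empty_target()
    t["name"] = event["id"]
    t["cause"] = bin_cause(event["cause"])
    t["lat"], t["long"] = event["Lat"], event["Lng"]
    return t


def from_current(event):
    """current_events: fire.dtg as name, fire.latitude/longitude."""
    fire = event["fire"]
    t = empty_target()
    t["name"] = fire["dtg"]
    t["lat"], t["long"] = fire["latitude"], fire["longitude"]
    return t
```

Keeping one mapper per source makes it easy to revise a single mapping when the target schema changes.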
Map source schemas to target
2 Arson Suspect Target Schema and Solution Map
3 Rescue Order Target Schema and Solution List
John and Joan, then Jenny, then Jack
4 A fly (or many?) in the ointment
- We don't know how to compute or verify a task-specific schema automatically
Guess a task-specific schema
Hypothesis formation
Target schema and domain evolution
Source metadata preparation
Transformation and analysis
Find sources to fill in target
Familiarization
Clarify semantics and domains
- No good language for semantics
- Don't know how to compute the right domains from source metadata
- Data in diverse formats too complex for rapid human review
- Some things, like geometric and spatial recognition, require human interpretation
- "What's missing, where do I find it?" is not computable by a machine
Visualization
Mapping of inexpressible data
Theory formulation
Verification
Identify missing pieces
Target data instantiation
Source data preparation
- Partial attribute values and unstructured data lack semantics; these must come from human knowledge (but the copy and paste action required is learnable by example!)
- Entity attribute resolution requires human-guided choices, e.g. John and Joan Smith resolve to just one household, but which purchase year is right?
Fill in target relation, learning by example
Remove extraneous data by projection
De-duplicate entities and attributes
- Need to determine keys and check FDs, then clean data
- No language to describe semantics of inputs and outputs of extension functions, so selection of functions cannot be automatic
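Checking a candidate functional dependency (FD) is one part of this step that can be mechanized once a human has proposed the dependency. The sketch below reports the left-hand-side values that map to more than one right-hand-side value. The function name `fd_violations` and the dictionary row format are assumptions for illustration.

```python
from collections import defaultdict


def fd_violations(rows, lhs, rhs):
    """Return LHS values that map to more than one RHS value, violating lhs -> rhs."""
    seen = defaultdict(set)
    for r in rows:
        lhs_value = tuple(r[a] for a in lhs)
        rhs_value = tuple(r[a] for a in rhs)
        seen[lhs_value].add(rhs_value)
    return {k: v for k, v in seen.items() if len(v) > 1}
```

An empty result means the data is consistent with the FD. A non-empty result, such as one household with two different purchase years, flags the values that still need a human-guided choice.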
Data assessment and profiling
Find or build extension functions
Metadata matching
- Matching source to target requires semantic knowledge held only by humans
Map source schemas to target
5 Our approach
- Assist users in
  - data familiarization
  - data assessment and profiling
  - mapping
  - entity/attribute resolution
- Let human judgment make the call
- Accelerate human effort via learn-by-example
- Our integration research projects
  - Quarry
  - Infosonde
  - CHIME
6 CHIME is an information integration application
to capture evolving human knowledge about
task-specific data
- Evolving task-specific schema and entity sets
- Mapping diverse data to correct attributes and entities
- Learning by example to speed integration where possible
- Resolving entities and attributes, and recording user choices
- Navigating and revising the history of integration decisions made in a dataset
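Recording user choices and revising them later suggests an append-only decision log. The following is a minimal sketch of that idea, not a description of how CHIME is actually built: the `Decision` and `DecisionLog` classes and their methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    kind: str       # e.g. "entity_resolution" or "attribute_resolution"
    inputs: tuple   # the candidate values shown to the user
    choice: object  # the value the user picked


@dataclass
class DecisionLog:
    decisions: list = field(default_factory=list)

    def record(self, kind, inputs, choice):
        """Append a decision and return its index for later revision."""
        self.decisions.append(Decision(kind, tuple(inputs), choice))
        return len(self.decisions) - 1

    def revise(self, index, new_choice):
        """Append a revised decision; the earlier entry stays in the history."""
        old = self.decisions[index]
        self.decisions.append(Decision(old.kind, old.inputs, new_choice))

    def current_choice(self, kind, inputs):
        """Return the most recent choice for these inputs, or None if none exists."""
        for d in reversed(self.decisions):
            if d.kind == kind and d.inputs == tuple(inputs):
                return d.choice
        return None
```

Because revisions are appended rather than overwritten, the full sequence of integration decisions remains available for review.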
7 CHIME is part of an architecture for capturing
and sharing semantics, annotations, and usage of
sub-document data
Architecture diagram (labels as extracted, in figure order):
Feed Browser
Mark headline browsing; Mark visit initiation; Feed query creation
MarkAPI
Repository (three instances)
Information Integration Application
Mark semantics review; Schema creation; Entity resolution; Attribute resolution; Literal mark creation; Add to mark semantics
DocAPI
Pub/Sub UI
CHIME
1
Web UI
View marks in context; Mark semantics review; Mark history review; Mark creation; Mark deployment to docs; Context gathering; Annotation; Add to mark semantics
Editing/Markup Application
2
Copy/Paste
Ontology Inference and Authoring
Infer over mark semantics; Mark browsing; Mark searching; Document submission
Share mark references
3
Populated schemas
1
New and updated documents
2
Mark browsing; Mark searching; Mark semantics review
3
Ontologies and thesauri
8 Metrics
- Scale-up improvement
- Scale-out improvement
- % of target schema successfully integrated
- % of identified user tasks automated or assisted
- % of data discrepancies detected, corrected automatically
- cold-start to warm-start time-to-solution ratio