Target schema and domain evolution - PowerPoint PPT Presentation

About This Presentation

Slides: 9
Provided by: DaveA166
Learn more at: http://web.cecs.pdx.edu

Transcript and Presenter's Notes

Title: Target schema and domain evolution


1
Determine the likely arson suspect
  • limited transportation
  • fondness for patterns
{Name of person or incident: String,
 Incident cause: String,
 Location {street: string in P.O. street list,
           city: string in city list,
           zip: integer in P.O. list in area of interest,
           lat, long: float in area of interest}}
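The target schema above pairs each attribute with a type and a domain. A minimal sketch of how that could be written down and checked follows. The domain lists, the city name, and the bounding box are hypothetical placeholders, since the slide says only that such lists and an area of interest exist.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical domain lists; the slide only states that such lists exist.
PO_STREETS = {"Twinford Drive"}
CITIES = {"Exampleville"}
PO_ZIPS = {97201}
# Hypothetical bounding box standing in for the "area of interest": (min, max).
LAT_RANGE = (45.0, 46.0)
LONG_RANGE = (-123.0, -122.0)


@dataclass
class Location:
    street: Optional[str] = None
    city: Optional[str] = None
    zip: Optional[int] = None
    lat: Optional[float] = None
    long: Optional[float] = None

    def domain_errors(self) -> List[str]:
        """Return the names of attributes whose values fall outside their domain."""
        errors = []
        if self.street is not None and self.street not in PO_STREETS:
            errors.append("street")
        if self.city is not None and self.city not in CITIES:
            errors.append("city")
        if self.zip is not None and self.zip not in PO_ZIPS:
            errors.append("zip")
        if self.lat is not None and not LAT_RANGE[0] <= self.lat <= LAT_RANGE[1]:
            errors.append("lat")
        if self.long is not None and not LONG_RANGE[0] <= self.long <= LONG_RANGE[1]:
            errors.append("long")
        return errors


@dataclass
class TargetRow:
    name: str                 # name of person or incident
    cause: Optional[str]      # incident cause
    location: Location
```

Missing values are allowed (None) because the target is filled in incrementally; only values that are present are checked against their domain.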
Guess a task-specific schema
One person living at the center of a geographic pattern of historical and current incidents
Hypothesis formation
Target schema and domain evolution
Source metadata preparation
Transformation and analysis
Find sources to fill in target
Familiarization
Clarify semantics and domains
Current events (xml), Historical events (xls), Historical events (html), People (xls)
Convert to CSV, then KML
Load to Google Maps
Set icon colors for visibility
Make judgement about pattern
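The "Convert to CSV, then KML" step can be sketched with the Python standard library. This is an illustration rather than the presenters' tool: the column names `name`, `lat`, and `long` are assumed from the projection step on this slide, and icon styling is left out.

```python
import csv
import io
import xml.etree.ElementTree as ET

KML_NS = "http://www.opengis.net/kml/2.2"


def csv_to_kml(csv_text: str) -> str:
    """Convert CSV rows with name, lat, long columns into a KML document string."""
    # Serialize KML elements in the default namespace (no prefix).
    ET.register_namespace("", KML_NS)
    kml = ET.Element(f"{{{KML_NS}}}kml")
    doc = ET.SubElement(kml, f"{{{KML_NS}}}Document")
    for row in csv.DictReader(io.StringIO(csv_text)):
        placemark = ET.SubElement(doc, f"{{{KML_NS}}}Placemark")
        ET.SubElement(placemark, f"{{{KML_NS}}}name").text = row["name"]
        point = ET.SubElement(placemark, f"{{{KML_NS}}}Point")
        # KML coordinates are written longitude first, then latitude.
        coords = ET.SubElement(point, f"{{{KML_NS}}}coordinates")
        coords.text = f'{row["long"]},{row["lat"]}'
    return ET.tostring(kml, encoding="unicode")
```

Each CSV row becomes one Placemark; the resulting string can be saved as a `.kml` file for upload to a mapping tool.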
Visualization
Mapping of inexpressible data
Theory formulation
Verification
Identify missing pieces
An answer: Jimmy West
Target data instantiation
Source data preparation
Create target instance using CHIME (acceleration via learning by example)
Select people and suspicious events
Project down to name, lat, long
Resolve duplicate entities with CHIME
Fill in target relation, learning by example
Remove extraneous data by projection
De-duplicate entities and attributes
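The select, project, and de-duplicate steps listed above can be illustrated on plain dictionaries. This sketch is not CHIME: it drops only exact duplicates, whereas the slides describe entity resolution in CHIME as human-guided. The row layout and the `keep` predicate are assumptions.

```python
def select_project_dedupe(rows, keep):
    """Select rows where keep(row) is true, project to (name, lat, long),
    and drop exact duplicates while preserving first-seen order."""
    seen = set()
    result = []
    for row in rows:
        if not keep(row):
            continue  # selection
        projected = (row["name"], row["lat"], row["long"])  # projection
        if projected not in seen:  # exact-match de-duplication
            seen.add(projected)
            result.append(projected)
    return result
```

For example, `select_project_dedupe(rows, lambda r: r["cause"] == "suspicious")` keeps only the suspicious events and discards every attribute other than name and coordinates.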
Data assessment and profiling
Find or build extension functions
  • spelling errors in cause field
  • Twinford Drive inconsistent with geo data
  • Geo data and park names switched on two other
    fires
  • Extension functions ready
    - split street, city, state, zip
    - street name to lat/long
    - zip code to lat/long
    - bin causes as suspicious or null
    - CSV to KML for map upload
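Two of the extension functions named above, splitting an address and binning causes, might look roughly like this. The address format (`street, city, ST zip`), the function names, and the keyword list are all assumptions; the slides do not show the real implementations.

```python
import re

# Hypothetical keywords; the slides do not say how causes were binned.
SUSPICIOUS_KEYWORDS = ("arson", "suspicious", "unknown")


def split_address(address: str) -> dict:
    """Split 'street, city, ST zip' into parts; raise ValueError on other shapes."""
    match = re.fullmatch(
        r"\s*(.+?),\s*(.+?),\s*([A-Za-z]{2})\s+(\d{5})\s*", address
    )
    if match is None:
        raise ValueError(f"unrecognized address format: {address!r}")
    street, city, state, zip_code = match.groups()
    # The target schema declares zip as an integer, so convert it here.
    return {"street": street, "city": city, "state": state.upper(), "zip": int(zip_code)}


def bin_cause(cause):
    """Return 'suspicious' if the cause text contains a keyword, else None."""
    if cause and any(keyword in cause.lower() for keyword in SUSPICIOUS_KEYWORDS):
        return "suspicious"
    return None
```

A plain keyword match would miss the misspelled causes noted in the profiling step, so in practice the cause field would need cleaning first.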

Metadata matching
T.name ← People_info.name, historical_events.id, current_events.fire.dtg
T.cause ← bin(historical_events.cause)
T.street, .city, .zip ← split(People_info.address)
T.lat, .long ← getLatLong(split(People_info.address)); current_events.fire.latitude, .longitude; historical_events.Lat, .Lng
Map source schemas to target
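The source-to-target mappings for the two event sources can be sketched as small functions. The field names come from the slide; reading `historical_events.id` and `current_events.fire.dtg` as the sources for `T.name` is an interpretation of the flattened diagram, and the record shapes (flat and nested dictionaries) are assumptions.

```python
def map_historical(event: dict, bin_fn) -> dict:
    """Map one historical_events record to a target row.

    bin_fn is a caller-supplied stand-in for the slide's bin() function.
    """
    return {
        "name": event["id"],
        "cause": bin_fn(event.get("cause")),
        "lat": float(event["Lat"]),
        "long": float(event["Lng"]),
    }


def map_current(event: dict) -> dict:
    """Map one current_events record (with a nested 'fire' element) to a target row."""
    fire = event["fire"]
    return {
        "name": fire["dtg"],
        "cause": None,  # the slide lists no cause mapping for current events
        "lat": float(fire["latitude"]),
        "long": float(fire["longitude"]),
    }
```

Passing the binning function in as a parameter keeps the mapping separate from the choice of extension function, which the slides treat as a human decision.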
2
Arson Suspect: Target Schema and Solution Map
3
Rescue Order: Target Schema and Solution List
John and Joan, then Jenny, then Jack
4
A fly (or many?) in the ointment
- We don't know how to compute or verify a task-specific schema automatically
Guess a task-specific schema
Hypothesis formation
Target schema and domain evolution
Source metadata preparation
Transformation and analysis
Find sources to fill in target
Familiarization
Clarify semantics and domains
  • No good language for semantics
  • Don't know how to compute the right domains from
    source metadata
  • Data in diverse formats too complex for rapid
    human review
  • Some things, like geometric and spatial
    recognition, require human interpretation
  • "What's missing, and where do I find it?" is not
    computable by a machine

Visualization
Mapping of inexpressible data
Theory formulation
Verification
Identify missing pieces
Target data instantiation
Source data preparation
  • Partial attribute values and unstructured data
    lack semantics; these must come from human
    knowledge (but the copy-and-paste action required
    is learnable by example!)
  • Entity and attribute resolution requires
    human-guided choices, e.g. John and Joan Smith
    resolve to just one household, but which
    purchase year is right?
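The household example above suggests a merge step that stops and asks whenever attribute values conflict, and that records the choice made. A minimal sketch under that reading follows; the `choose` callback stands in for the human, and nothing here reflects CHIME's actual interface.

```python
def resolve_entity(records, choose):
    """Merge records believed to describe one entity.

    choose(attr, values) is called whenever records disagree on an attribute;
    every such decision is returned so it can be reviewed or revised later.
    """
    merged, decisions = {}, []
    attributes = sorted({attr for record in records for attr in record})
    for attr in attributes:
        values = []
        for record in records:
            value = record.get(attr)
            if value is not None and value not in values:
                values.append(value)
        if len(values) == 1:
            merged[attr] = values[0]  # no conflict, nothing to decide
        elif len(values) > 1:
            merged[attr] = choose(attr, values)  # human-guided choice
            decisions.append((attr, values, merged[attr]))
    return merged, decisions
```

In an interactive tool `choose` would prompt the user; in a test it can be any function, such as one that always picks the latest year.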

Fill in target relation, learning by example
Remove extraneous data by projection
De-duplicate entities and attributes
  • Need to determine keys and check functional
    dependencies (FDs), then clean data
  • No language to describe semantics of inputs and
    outputs of extension functions, so selection of
    functions cannot be automatic
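Checking a functional dependency, as the first bullet above calls for, amounts to finding left-hand-side values that map to more than one right-hand-side value. A small sketch, with rows assumed to be dictionaries and attribute names supplied by the caller:

```python
def fd_violations(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs over a list of dict rows.

    Returns {lhs_value_tuple: set of rhs_value_tuples} for every lhs value
    that maps to more than one rhs value; an empty dict means the FD holds.
    """
    seen = {}
    for row in rows:
        key = tuple(row[attr] for attr in lhs)
        seen.setdefault(key, set()).add(tuple(row[attr] for attr in rhs))
    return {key: values for key, values in seen.items() if len(values) > 1}
```

The violations returned point at the rows that need cleaning, for example two spellings of a city under the same zip code.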

Data assessment and profiling
Find or build extension functions
Metadata matching
  • Matching source to target requires semantic
    knowledge held only by humans

Map source schemas to target
5
Our approach
  • Assist users in
    - data familiarization
    - data assessment and profiling
    - mapping
    - entity/attribute resolution
  • Let human judgment make the call
  • Accelerate human effort via learn-by-example
  • Our integration research projects
    - Quarry
    - Infosonde
    - CHIME

6
CHIME is an information integration application
to capture evolving human knowledge about
task-specific data
  • Evolving task-specific schema and entity sets
  • Mapping diverse data to correct attributes and
    entities
  • Learning by example to speed integration where
    possible
  • Resolving entities and attributes, and recording
    user choices
  • Navigating and revising the history of
    integration decisions made in a dataset

7
CHIME is part of an architecture for capturing
and sharing semantics, annotations, and usage of
sub-document data
Feed Browser
Mark headline browsing; Mark visit initiation; Feed query creation
MarkAPI
Repository
Repository
Repository
Information Integration Application
Mark semantics review; Schema creation; Entity resolution; Attribute resolution; Literal mark creation; Add to mark semantics
DocAPI
Pub/Sub UI
CHIME
1
Web UI
View marks in context; Mark semantics review; Mark history review; Mark creation; Mark deployment to docs; Context gathering; Annotation; Add to mark semantics
Editing/Markup Application
2
Copy/Paste
Ontology Inference and Authoring
Infer over mark semantics; Mark browsing; Mark searching; Document submission
Share mark references
3
Populated schemas
1
New and updated documents
2
Mark browsing; Mark searching; Mark semantics review
3
Ontologies and thesauri
8
Metrics
  • Scale-up improvement
  • Scale-out improvement
  • % of target schema successfully integrated
  • % of identified user tasks automated or assisted
  • % of data discrepancies detected, corrected
    automatically
  • cold-start to warm-start time-to-solution
    ratio