Extracting and Delivering Stories from Heterogeneous Information Sources


Title: Extracting and Delivering Stories from Heterogeneous Information Sources


1
Extracting and Delivering Stories from
Heterogeneous Information Sources
V.S. Subrahmanian, M. Fayzullin (University of Maryland)
M. Albanese, C. Cesarano, A. Picariello (Univ. of Napoli, Italy)
2
Talk Outline
  • Motivating examples
  • STORY Architecture
  • Theoretical Model
  • Algorithms
  • OptStory
  • DynStory
  • GenStory
  • Experimental results

3
Motivating example: Pakistani Nuclear Scientists
  • Nuclear proliferation is the issue of the day
  • Complex web of
  • Nuclear scientists
  • Personnel at weapons locations
  • Arms dealers
  • Customs officials
  • Shipping companies
  • Front companies
  • Manufacturers
  • Nuclear monitors may want the story on any
    person or place or event to decide if further
    investigation is warranted.

Huge amounts of data need to be processed and
filtered so that only the relevant data is shown
to the analyst.
4
Motivating example: US Immigration
  • Customs official sees a traveller.
  • Wants the quick story on him
  • Where does he work?
  • Who does he work for?
  • What is his area of expertise?
  • Any warrants?
  • Is he on a watch list?
  • Who are his associates? Anyone suspicious?
  • Just the right data should be presented to him.

5
A motivating example: Pompeii
  • Pompeii is a spectacular archaeological site.
  • Visitor experience can be greatly improved by
  • Automatically notifying visitors of interesting
    phenomena without posting extra signs
  • Allowing visitors to explore the stories of
    various monuments, paintings, sculptures, etc. in
    Pompeii.
  • Allowing visitors to explore the stories of the
    characters, events and places depicted in these
    monuments, paintings, sculptures, etc.
  • Visitors' interests vary, so information about
    exhibits must adapt in real time to their
    interests to enhance the experience of the
    visitor.

6
Pompeii Visitors
Visitor arrives at ticket counter and buys ticket.
7
Pompeii Visitors
ANALOG: Soldier in Baghdad sets out on a mission.
8
Pompeii Visitors
Ticket agent asks if they would like to use the
story facility and if they would like to use
their cell phone and/or PDA to get stories of
interest to them.
9
Pompeii Visitors
ANALOG: Soldier in Baghdad chooses to receive
stories on his radio or PDA.
10
Pompeii Visitors
As the visitor walks through Pompeii, STORY
identifies where he is and predicts
(probabilistically) where he might go in the
future. For example, if he is at location L, it
might predict that he will go to the House of
the Vettii.
11
Pompeii Visitors
ANALOG: As the soldier drives through Baghdad, STORY
identifies where he is and correlates where he
will go with his route plan.
12
Pompeii Visitors
See items
You are here (Triclinium in the House of the
Vettii)
Based on this prediction of where he might go in
the future, STORY identifies potential stories he
might be interested in and downloads parts of
these stories to his PDA/cell phone. For example,
it might download stories about Pentheus.
13
Pompeii Visitors
ANALOG: STORY finds stories satisfying the
soldier's conditions of interest and downloads
them to his PDA or to the nearest radio broadcast
location.
14
Pompeii Visitors
The visitor chooses which story he is interested
in. STORY dynamically generates the story and
delivers it to the user's PDA/cell phone, e.g.
the user might choose the story of Pentheus.
15
Pompeii Visitors
ANALOG: STORY delivers the story to the soldier.
He can then further interact with the story if
needed using voice and cursor prompts.
16
Pompeii Visitors
The user can choose to explore the story in
greater detail (e.g. if he is seeing the story of
Pentheus, he can also explore the story of Agave).
17
Stories depend upon context
  • The concept of story is dramatically different
    for the examples mentioned earlier.
  • Pompeii visitor: cares about mythological,
    historical, artistic facts.
  • Soldier in Baghdad: cares about security and
    mission-related facts, i.e. who the people around
    him are, not who is depicted on the walls.
  • Nuclear analyst: cares about the nuclear networks.
    Who is selling what to whom? Who is moving the
    money? What front companies are involved?
  • What goes into a story depends not only on basic
    facts about entity of interest but also on the
    application domain and specific items of interest
    to the user.

18
STORY Architecture
19
RDF Triples
  • Consist of 3 parts
  • An entity
  • An attribute
  • A value
  • STORY also
  • allows time-stamped values.
  • allows attributes to have set-valued types.
  • Examples
  • Attribute: mother, Value: Agave
  • Attribute: cartag, Value: AMD 124
  • Attribute: employers, Value: {ibm, hp}

20
RDF Triples
  • Time Varying Attribute (TVA)
  • Example
  • Attribute: job
  • Value: {(cardinal, 1500, 1509), (pope, 1510, 1545)}
  • Example
  • Attribute: worked-for
  • Value: {(ibm, 1990, 1998), (hp, 1999, 2004)}
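A minimal sketch of how such triples could be represented, assuming Python and a (value, start, end) encoding for time-stamped values. The class name `Triple`, the helper `value_at`, and the entity names are hypothetical; the slides do not name the entities these attributes belong to.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union

# A time-stamped value: (value, start_year, end_year), e.g. (pope, 1510, 1545).
TimedValue = Tuple[str, int, int]
# A value is a set of plain values or a set of time-stamped values.
Value = Union[FrozenSet[str], FrozenSet[TimedValue]]


@dataclass(frozen=True)
class Triple:
    entity: str
    attribute: str
    value: Value


def value_at(triple: Triple, year: int) -> FrozenSet[str]:
    """For a time varying attribute, return the values that hold in `year`."""
    return frozenset(v for (v, start, end) in triple.value if start <= year <= end)


# Entity names below are placeholders, not taken from the slides.
job = Triple("some_person", "job",
             frozenset({("cardinal", 1500, 1509), ("pope", 1510, 1545)}))
employers = Triple("some_employee", "employers", frozenset({"ibm", "hp"}))

print(value_at(job, 1505))
```

`value_at` only makes sense for time varying attributes; an ordinary set-valued attribute such as `employers` is read directly from `value`.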

21
Story Schema
Unlike DBs, no need to declare schema in advance.
  • A story schema is a pair (E,A), where E is a set
    of entities and A is a set of attributes
  • Examples
  • Set of entities in Pompeii
  • Set of all objects in Pompeii
  • Set of all objects and events depicted
  • Any entities related to the previous categories.
  • Set of all people/organizations associated with
    Iraqi cars
  • Set of all car ids
  • Set of owners of such cars
  • Set of people associated with such owners via one
    or many links.

22
Story Instance
Not all attribute values needed for all entities.
  • An instance w.r.t. story schema (E,A) is a
    partial mapping
  • Input
  • an entity of E and an attribute of A
  • Output
  • a value v in dom(A) if A is an ordinary
    attribute, or
  • a timevalue if A is a TVA
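One way to picture the partial mapping is a dictionary keyed by (entity, attribute) pairs, where missing keys are simply undefined. This is an illustrative sketch only; the name `lookup` and the sample entries are assumptions.

```python
from typing import Dict, Optional, Tuple

# Hypothetical story instance: (entity, attribute) -> value.
# Pairs that are absent are undefined, which is what makes the mapping partial.
instance: Dict[Tuple[str, str], object] = {
    ("Pentheus", "mother"): "Agave",
    ("Pentheus", "father"): "Cadmus",
    # Time varying attribute: a set of (value, start, end) entries.
    ("some_employee", "worked-for"): {("ibm", 1990, 1998), ("hp", 1999, 2004)},
}


def lookup(entity: str, attribute: str) -> Optional[object]:
    """Return the value for (entity, attribute), or None if it is undefined."""
    return instance.get((entity, attribute))


print(lookup("Pentheus", "mother"))
print(lookup("Pentheus", "cartag"))
```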

23
Extracting RDF from text
  • Text needs to be parsed in order to understand
    its structure before extracting RDF triples
  • Context free grammars to parse the text
  • A set of template-based rules to extract triples
    from parsed text
  • Rules can be derived from examples

24
Generating rules from examples
  • Example sentence: Rome is the capital of Italy
  • Syntactic parsing
  • Manually mark nodes corresponding to entities,
    attributes and values. Add alternatives for
    constant tokens (e.g. of, in)
  • Validate and define extraction patterns (see next
    slide)
25
Generating rules from examples
Each extraction pattern defines which marked node
acts as the entity, which one as the attribute
and which one as the value.
26
Generating rules from examples
The same node may act as the entity w.r.t. an
extraction pattern, and as the value w.r.t.
another extraction pattern.
27
Triples extraction
  • Each sentence is parsed, generating one or more
    parse trees.
  • Each parse tree is matched against the parse tree
    that represents an extraction rule using a tree
    matching algorithm.
  • If the match succeeds, the pieces of information
    corresponding to the marked template nodes are
    extracted and triples are built according to the
    extraction patterns.

Probabilistic tree matching algorithms are in progress.
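The matching step can be sketched with a simple exact matcher (not the probabilistic variant). Everything here is an assumption made for illustration: parse trees are nested tuples of the form (label, child, ...), constant tokens are sets of alternatives, and the class `Slot` stands for a marked template node. The tree shape and the labels are hypothetical, not the output of any particular parser.

```python
from typing import Dict, Optional


class Slot:
    """Marked template node; captures the text of the subtree it meets."""
    def __init__(self, role: str):
        self.role = role  # "E" (entity), "A" (attribute) or "V" (value)


def text(tree) -> str:
    """Concatenate the leaf tokens of a parse (sub)tree."""
    if isinstance(tree, str):
        return tree
    return " ".join(text(child) for child in tree[1:])


def match(template, tree, out: Dict[str, str]) -> bool:
    """Exact top-down match of a template tree against a parse tree."""
    if isinstance(template, Slot):
        out[template.role] = text(tree)
        return True
    if isinstance(template, (set, frozenset)):  # constant token with alternatives
        return isinstance(tree, str) and tree in template
    if isinstance(tree, str) or template[0] != tree[0] or len(template) != len(tree):
        return False
    return all(match(t, c, out) for t, c in zip(template[1:], tree[1:]))


def extract(template, tree) -> Optional[Dict[str, str]]:
    """Return the E/A/V bindings if the template matches, else None."""
    out: Dict[str, str] = {}
    return out if match(template, tree, out) else None


# Hypothetical parse of "Rome is the capital of Italy".
tree = ("S", ("NP", "Rome"),
        ("VP", ("V", "is"),
         ("NP", ("NP", ("DT", "the"), ("NN", "capital")),
          ("PP", ("IN", "of"), ("NP", "Italy")))))

template = ("S", Slot("V"),
            ("VP", ("V", {"is"}),
             ("NP", ("NP", ("DT", {"the"}), Slot("A")),
              ("PP", ("IN", {"of", "in"}), Slot("E")))))

print(extract(template, tree))
```

A sentence with several parse trees would simply be run through `extract` once per tree, which is how different interpretations can yield different triples.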
28
Example: Iran is one of the most dangerous
enemies of the United States
29
Example: Iran is one of the most dangerous
enemies of the United States
  • Allows 4 different interpretations, corresponding
    to different parse trees.
  • All 4 parse trees match the template
  • 2 of them allow us to extract the triple
  • E = the most dangerous enemies of the United
    States
  • A = one
  • V = Iran
  • 2 of them allow us to extract the triple
  • E = the United States
  • A = one of the most dangerous enemies
  • V = Iran

30
Example: Hu Jintao is the most popular leader in
China
31
Example: Hu Jintao is the most popular leader in
China
  • Allows 2 different interpretations, corresponding
    to different parse trees.
  • The first parse tree does not match the template
  • The second parse tree matches the template and
    allows us to extract the triple
  • E = China
  • A = the most popular leader
  • V = Hu Jintao

32
How the system works
  • The story application developer first specifies a
    set of data sources that are to be accessed, e.g.
  • www
  • a relational database
  • an object oriented database
  • database of web documents
  • a set of URLs
  • Some combination of the above.
  • The STORY crawler extracts a full instance.
  • Set of triples obtained from all sources
    specified by the user.
  • Full instances do not resolve inconsistencies,
    generalize data, etc.
  • Stories are then created on demand using the full
    instance and using appropriate conflict
    resolution, generalization, and other modules.

33
XML sources
  • Consider an XML node
  • N = (name, value, c1, ..., cn), where c1, ..., cn
    are children nodes
  • Assume that N is the root node of an XML
    document, and that nodes may act both as
    entities and as attributes.
  • e is an entity
  • A is an attribute

34
GetXMLAttr(N,e,A)
GetXMLAttr(N,e,A)
begin
    Result ← ∅
    if N.value = e or N.name = e then
        for each child c of N such that c.name = A do
            Result ← Result ∪ {c.value}
        end for
    else
        for each child c of N do
            Result ← Result ∪ GetXMLAttr(c,e,A)
        end for
    end if
    return Result
end
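The procedure translates almost line for line into Python. This sketch assumes the standard-library `xml.etree.ElementTree` model, mapping a node's name to the element tag and its value to the element text; that mapping, the function name `get_xml_attr`, and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET
from typing import Set


def get_xml_attr(node: ET.Element, entity: str, attribute: str) -> Set[str]:
    """Collect the values of `attribute` for `entity` under `node`."""
    result: Set[str] = set()
    value = (node.text or "").strip()
    if value == entity or node.tag == entity:
        # The node denotes the entity: read the matching children.
        for child in node:
            if child.tag == attribute:
                result.add((child.text or "").strip())
    else:
        # Otherwise keep searching in every subtree.
        for child in node:
            result |= get_xml_attr(child, entity, attribute)
    return result


# Hypothetical source document.
doc = ET.fromstring(
    "<myths><Pentheus><mother>Agave</mother>"
    "<father>Cadmus</father></Pentheus></myths>"
)
print(get_xml_attr(doc, "Pentheus", "mother"))
```

As in the pseudocode, the search stops descending once a node matching the entity is found, so nested occurrences of the same entity below that node are not visited.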

35
CPR
  • There are good stories and bad stories
  • The STORY architecture supports the goals of
    succinctness and exploration and creates stories
    with respect to three important parameters
  • the priority of the story content,
  • the continuity of the story,
  • the non-repetition of facts covered by the story
  • We want to deliver the most important facts to
    the intended audience.
  • So far, we have focused primarily on priority and
    non-repetition, worrying less about continuity.

36
CPR examples
  • In the story of Pentheus, it makes more sense to
    first say that his parents were Cadmus and Agave,
    then say he reigned as King of Thebes, and then
    explain why he was killed.
  • This rendering of the story is in chronological
    order, ensuring a kind of temporal continuity.
  • Other measures of continuity are also possible
    within the STORY framework.
  • A repetition function evaluates how much
    repetition there is in a given story.
  • For example, in the case of Pentheus, we may
    extract the fact that Agave is a parent of
    Pentheus, and that Agave is the mother of
    Pentheus. Including both these facts in a story is
    repetitive as the latter fact subsumes the former.

37
Story evaluation function
  • eval(s) = α·π(s) + β·κ(s) − γ·ρ(s)
  • π, κ, ρ are arbitrary functions from the set of
    all possible stories S about some entities to
    [0,1]; α, β, γ are their weights
  • π describes whether high priority facts are
    included in the story.
  • For example, the fact that Pentheus' mother was
    Agave is more important than the length of
    Pentheus' big toe.
  • κ describes how continuous the story is.
  • This means that a story should not jump wildly
    from one fact to another.
  • ρ describes repetition.
  • Clearly, stories that repeat the same or similar
    facts over and over again leave much to be
    desired.
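A sketch of the evaluation function as a weighted combination: priority and continuity add to the score, repetition subtracts from it. The names `make_eval`, `w_p`, `w_c`, `w_r` and the three toy scoring functions are assumptions; any functions returning values in [0,1] could be plugged in instead.

```python
from typing import Callable, Sequence, Tuple

Triple = Tuple[str, str, str]
Story = Sequence[Triple]
Score = Callable[[Story], float]  # expected to return a value in [0, 1]


def make_eval(priority: Score, continuity: Score, repetition: Score,
              w_p: float = 1.0, w_c: float = 1.0, w_r: float = 1.0) -> Score:
    """Build eval(s) = w_p*priority(s) + w_c*continuity(s) - w_r*repetition(s)."""
    def evaluate(story: Story) -> float:
        return (w_p * priority(story)
                + w_c * continuity(story)
                - w_r * repetition(story))
    return evaluate


# Toy scoring functions, for illustration only.
def toy_priority(story: Story) -> float:
    important = {"mother", "father"}
    return sum(a in important for _, a, _ in story) / len(story) if story else 0.0


def toy_continuity(story: Story) -> float:
    return 1.0  # placeholder: treats every story as fully continuous


def toy_repetition(story: Story) -> float:
    return 1.0 - len(set(story)) / len(story) if story else 0.0


evaluate = make_eval(toy_priority, toy_continuity, toy_repetition)
repeated = [("Pentheus", "mother", "Agave"), ("Pentheus", "mother", "Agave")]
distinct = [("Pentheus", "mother", "Agave"), ("Pentheus", "father", "Cadmus")]
print(evaluate(repeated), evaluate(distinct))
```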

38
CPR functions
  • There are many ways of defining how continuous a
    story is, how repetitive a story is, etc.
  • Our story creation algorithms can work with any
    continuity, priority and repetition functions
    whatsoever.

39
Attribute Hierarchy
  • The attributes of interest are arranged in an
    attribute hierarchy where attributes can be
    labeled with priorities.
  • The story application developer can browse and
    edit this hierarchy (for example if he wishes to
    add new attributes).
  • He can add priorities to selected items in the
    hierarchy (all sub elements of a given element in
    the hierarchy will inherit the priority value for
    the parent unless otherwise stated).
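Priority inheritance in such a hierarchy can be sketched as a walk from an attribute up to its nearest ancestor with an explicit priority. The hierarchy, the priority values and the function name `priority` are made up for the example.

```python
from typing import Dict, Optional

# Hypothetical hierarchy: attribute -> parent attribute (None for roots).
parent: Dict[str, Optional[str]] = {
    "family": None,
    "parent": "family",
    "mother": "parent",
    "father": "parent",
    "anatomy": None,
    "toe_length": "anatomy",
}

# Priorities set explicitly by the developer; all other attributes inherit.
explicit_priority: Dict[str, float] = {"family": 0.9, "anatomy": 0.1, "mother": 1.0}


def priority(attribute: str, default: float = 0.0) -> float:
    """Return the attribute's own priority, or that of its nearest ancestor."""
    node: Optional[str] = attribute
    while node is not None:
        if node in explicit_priority:
            return explicit_priority[node]
        node = parent.get(node)
    return default


print(priority("father"), priority("mother"), priority("toe_length"))
```

Here `father` inherits 0.9 from `family`, while `mother` overrides the inherited value with its own explicit priority.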

41
Conflict Management
  • As multiple data sources may be used to extract
    attributes, conflicts might occur.
  • For example, one source may say that Pentheus'
    mother is Agave, while another may say it is
    Hera.
  • STORY allows conflict resolution with an
    application specific method.
  • Conflicts do not always need to be resolved.
    Sometimes, you just report the existence of a
    conflict, and specify what should be reported.

42
Conflict Management Policy
  • Temporal conflict resolution.
  • Suppose different data sources provide different
    values v1, ..., vn. Suppose value vi was inserted
    into the data source at time ti. In this case,
    we pick the value vi such that
    ti = max{t1, t2, ..., tn}. If multiple such values
    exist, one is selected randomly.
  • Source based conflict resolution.
  • The developer of a story may assign a credibility
    ci to each source si that provides a value vi for
    attribute A of entity e. This strategy picks
    value vi such that ci = max{c1, ..., cn}. If
    multiple such values exist, one is selected
    randomly.
  • Voting based conflict resolution.
  • Each value vi returned by at least one data
    source has a vote that represents the number of
    sources that return value vi. In this case, this
    conflict resolution strategy returns the value
    with the highest vote. If multiple vi's have the
    same highest vote, one is picked randomly and
    returned.
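The three policies can be sketched as follows. The `Report` record, the function names and the sample data are assumptions introduced for the example; ties are broken with `random.choice`, matching the random selection described above.

```python
import random
from collections import Counter
from typing import Dict, List, NamedTuple


class Report(NamedTuple):
    value: str        # value reported for attribute A of entity e
    source: str       # name of the data source
    inserted_at: int  # time the value was inserted into the source


def _pick(candidates: List[str]) -> str:
    """Pick one candidate at random (only matters when there is a tie)."""
    return random.choice(sorted(set(candidates)))


def temporal(reports: List[Report]) -> str:
    latest = max(r.inserted_at for r in reports)
    return _pick([r.value for r in reports if r.inserted_at == latest])


def source_based(reports: List[Report], credibility: Dict[str, float]) -> str:
    best = max(credibility[r.source] for r in reports)
    return _pick([r.value for r in reports if credibility[r.source] == best])


def voting(reports: List[Report]) -> str:
    votes: Counter = Counter()
    # Count each source at most once per value.
    for value, _source in {(r.value, r.source) for r in reports}:
        votes[value] += 1
    top = max(votes.values())
    return _pick([v for v, n in votes.items() if n == top])


# Hypothetical conflicting reports about Pentheus' mother.
reports = [Report("Agave", "s1", 2001),
           Report("Hera", "s2", 2003),
           Report("Agave", "s3", 2002)]
cred = {"s1": 0.9, "s2": 0.4, "s3": 0.7}
print(temporal(reports), source_based(reports, cred), voting(reports))
```

On this data the policies disagree: the most recent source says Hera, while both the most credible source and the majority say Agave, which is why the choice of policy is left to the application.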

43
Generalization Module
  • Goal: to generalize multiple RDF triples into
    one.
  • For example, if we know that Pentheus's father is
    Cadmus, and his mother is Agave, we may want to
    generalize this to say that Pentheus's parents
    are Cadmus and Agave.
  • If Pentheus was king of one town for some period,
    king of another town for another period of time,
    and so on, we may merely want to say that
    Pentheus was king of many places.
  • The Generalization Module looks at the
    RDF triples stored in the RDF database and
    augments it with triples that include
    generalization attributes that succinctly
    summarize a set of less general (i.e. more
    specific) attributes.

44
Generalized Story Schema
  • A generalized story schema consists of a regular
    story schema, a function that associates an
    equivalence relation with each attribute domain
    and a function that associates a generalization
    function with each attribute domain.
  • An equivalence relation on the domain dom(A) of
    attribute A specifies when certain values in
    the domain are considered equivalent. For
    example, we may consider the string values
    "king" and "monarch" to be equivalent in
    dom(occupation).
  • For a time varying attribute we may consider
    (king, L, U) and (monarch, L', U') to be
    equivalent independently of whether L = L' and
    U = U' hold.
  • Our system uses WordNet and some heuristics to
    infer equivalence relationships between terms.
  • Generalization is currently being plugged into
    the system.

45
STORY creation
  • STORY constructs a story of length k or less from
    the RDF database by
  • examining all triples about the entity of
    interest,
  • including triples extracted from the data sources
    by the attribute extractor as well as triples
    created by the generalization module.
  • It then finds the (at most) k triples that
    optimize an objective function.
  • The objective function must be monotonic in the
    priority of the triples, monotonic w.r.t. the
    continuity function selected by the STORY
    application developer, and anti-monotonic in the
    amount of repetition between triples.

46
Closed Instance
  • We first compute the full instance associated
    with our source access table.
  • For each entity e and attribute A, we then split
    the values in this instance into equivalence
    classes using the equivalence relation
    associated with A.
  • Suppose the equivalence classes thus generated
    are X1, ..., Xn.
  • For each equivalence class Xi we compute the
    generalization vi using the generalization
    function associated with attribute A. We insert
    the tuple (e, A, vi) into the full instance.
  • This process is repeated for all entities e and
    all attributes A.
  • After all tuples of the form shown above are
    inserted into the full instance, it becomes the
    closed instance.
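The closure step can be sketched as a group-then-generalize pass. The function `close_instance`, the callbacks `class_of` and `generalize`, and the small synonym table are assumptions standing in for the equivalence relation and generalization function attached to each attribute.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Set, Tuple

Triple = Tuple[str, str, str]


def close_instance(full: Set[Triple],
                   class_of: Callable[[str, str], Hashable],
                   generalize: Callable[[str, List[str]], str]) -> Set[Triple]:
    """class_of(attribute, value) names the equivalence class of a value;
    generalize(attribute, values) returns one value summarizing a class."""
    classes: Dict[Tuple[str, str, Hashable], List[str]] = defaultdict(list)
    for entity, attribute, value in full:
        classes[(entity, attribute, class_of(attribute, value))].append(value)
    closed = set(full)
    for (entity, attribute, _), values in classes.items():
        closed.add((entity, attribute, generalize(attribute, sorted(values))))
    return closed


# Hypothetical synonym table playing the role of the equivalence relation.
SYNONYMS = {"king": "ruler", "monarch": "ruler"}

full = {("Pentheus", "occupation", "king"),
        ("Pentheus", "occupation", "monarch"),
        ("Pentheus", "mother", "Agave")}

closed = close_instance(
    full,
    class_of=lambda attr, value: SYNONYMS.get(value, value),
    generalize=lambda attr, values: SYNONYMS.get(values[0], values[0]),
)
print(sorted(closed))
```

The original triples are kept and the generalized triple `("Pentheus", "occupation", "ruler")` is added alongside them, so a later selection step can choose between specific and general facts.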

47
Story Computation Problem
  • Given a closed instance I, a positive integer k,
    and an entity e as input, find a story of size ≤
    k that maximizes the value of a given evaluation
    function eval.
  • In this case, the found story is called an
    optimal story.
  • Theorem: Finding an optimal story is NP-hard
    (even after the full instance is created).
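To make the problem statement concrete, here is a brute-force search that tries every subset of at most k triples about the entity. It is plain exhaustive enumeration, not the OptStory algorithm from the talk, and it ignores the order of facts within a story. The names `optimal_story` and `toy_eval`, the priorities and the sample instance are assumptions.

```python
from itertools import combinations
from typing import Callable, Sequence, Set, Tuple

Triple = Tuple[str, str, str]


def optimal_story(instance: Set[Triple], entity: str, k: int,
                  evaluate: Callable[[Sequence[Triple]], float]) -> Tuple[Triple, ...]:
    """Return a best-scoring story of size <= k by exhaustive enumeration."""
    facts = sorted(t for t in instance if t[0] == entity)
    best: Tuple[Triple, ...] = ()
    best_score = evaluate(best)
    for size in range(1, min(k, len(facts)) + 1):
        for story in combinations(facts, size):
            score = evaluate(story)
            if score > best_score:
                best, best_score = story, score
    return best


# Toy evaluation: summed attribute priority minus a penalty for repeated values.
PRIORITY = {"mother": 1.0, "parent": 0.8, "toe_length": 0.1}


def toy_eval(story: Sequence[Triple]) -> float:
    gain = sum(PRIORITY.get(a, 0.0) for _, a, _ in story)
    values = [v for _, _, v in story]
    return gain - (len(values) - len(set(values)))


instance = {("Pentheus", "mother", "Agave"),
            ("Pentheus", "parent", "Agave"),
            ("Pentheus", "toe_length", "short"),
            ("Agave", "son", "Pentheus")}
print(optimal_story(instance, "Pentheus", 2, toy_eval))
```

The number of candidate stories grows combinatorially with the number of facts, which is consistent with the search for faster approximate algorithms.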

48
Story Algorithms
  • OptSTORY algorithm finds the story that
    optimizes the objective function.
  • This algorithm has the disadvantage of being very
    slow.
  • Multiple alternative BestSTORY algorithms:
  • DynStory(S), which uses a dynamic programming
    approach
  • GenStory(S), which is based on genetic
    programming.
  • DynStory and GenStory find suboptimal stories,
    but do so very fast.

49
GPS Support Subsystem: Current implementation
  • Outdoor positioning at Pompeii implemented using
    DGPS
  • Mobile devices are equipped with IEEE 802.11b
    wireless Ethernet to allow internet connection

50
GIS Support Subsystem: Outdoor and indoor
positioning
  • Outdoor positioning
  • GPS has been successfully adopted in a lot of
    applications
  • Indoor positioning
  • GPS receivers are blind in indoor spaces
  • Different kinds of positioning systems will be
    used
  • Infrared or ultrasound sensors
  • Radio Frequency sensors
  • WLAN-based positioning
  • We have methods to optimally position a set of
    sensors to monitor the site, but the system is
    not yet implemented.

51
STORY presentation
  • Our STORY architecture applies to several
    different hardware options
  • our current implementation works for both PDAs
    and laptops.
  • Multiple languages
  • we currently support English, Spanish and
    Italian.
  • Multiple output rendering
  • via a graphical user interface or via speech

53
Methods to merge multiple such sentences into one
are being implemented.
54
STORY Experiments
  • Parameters to be evaluated
  • Value of the facts included in the stories
  • Quality of the prose (does it read nicely?)
  • Experiment plan
  • 61 students enrolled as reviewers
  • 51 non-experts (no a priori knowledge about the
    subjects of the stories)
  • 10 experts (a priori knowledge)
  • Facts and prose evaluated for
  • Different algorithms
  • Different rendering techniques
  • Different CPR parameters settings
  • Different lengths of the stories

55
Value of the facts vs. length of the story: Trends
56
Value of the facts vs. length of the story:
Considerations
  • Highest Priorities
  • GenSTORY (version 1 using original sentences
    from sources if available instead of only using
    templates) wins
  • Runner up is DynSTORY (version 1)
  • Even if we ignore how the stories are rendered,
    GenSTORY still wins.
  • Including the original sentences in the story
    adds more information content than rendering the
    same fact through a template.

57
Quality of the prose vs. length of the story:
Trends
58
Quality of the prose vs. length of the story:
Considerations
  • The quality of the prose is high and seems
    independent of the algorithm used
  • Quality of prose decreases as the story length
    increases (not surprising).
  • Including sentences from text sources into
    stories improves story quality.

59
Value of the facts and quality of the prose:
Summary
60
Value of the facts vs. CPR parameters: Trends
61
Value of the facts vs. CPR parameters:
Considerations
  • Best value of facts is obtained when the
    priority is set to a high value
  • Users are more interested in priority than in
    continuity and repetition
  • Repetition should be avoided when the story is
    very short
  • For low values of L the best results are
    obtained when R is set to a high value