Text Summarization In Search of Effective Ideas and Techniques - PowerPoint PPT Presentation

Description:

Shuhua Liu, IIS/IAMSR, A. Text Summarization -- In Search of Effective Ideas and Techniques ... Shuhua Liu, IIS/IAMSR, A. Phase 1: Theme detection, topic ... – PowerPoint PPT presentation

Title: Text Summarization In Search of Effective Ideas and Techniques


1
Text Summarization -- In Search of Effective
Ideas and Techniques
  • Shuhua Liu, Assistant Professor
  • Department of Information Systems
  • Åbo Akademi University, Finland
  • Visiting scholar at BISC, UC Berkeley
  • Berkeley, September 21, 2004

2
What is text summarization?
  • to reduce (long) textual information to its most
    essential points
  • to distill the most important information from a
    source or sources to produce an abridged version
    of it (Endres-Niggemeyer, 1998; Mani and Maybury,
    1999; Spärck-Jones, 1999).

3
Text summarization: a context-dependent activity
4
Text summarization
  • Key issues:
  • how to identify the most important content and
    separate it from the rest of the text?
  • how to synthesize the substance and formulate a
    summary text based on the identified content?
  • Major approaches:
  • Selection-based: produces extracts
  • Text-understanding-based: produces abstracts

6
Selection-based summarization: how does it work?
  • The most content-bearing sentences or passages
    are identified and selected to compose a summary.
  • Compute a significance value for each sentence
    (Luhn, 1958; Edmundson, 1969), based on:
  • word frequency
  • the keywords, title words, and cue words it
    contains
  • the position of the sentence
  • RST (Rhetorical Structure Theory) based discourse
    analysis (Marcu, 1997)
  • Passage and sentence similarity analysis
    (Goldstein et al., 2000; CMU)
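
The significance-value idea above can be sketched in a few lines. This is a minimal illustration, not the exact method of Luhn or Edmundson: the stopword and cue-word lists, the equal feature weights, and the rule that the first and last sentences get a position bonus are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny word lists; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "it", "this", "that", "for", "on"}
CUE_WORDS = {"significant", "important", "conclusion", "summary", "results"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def score_sentences(sentences, title, weights=(1.0, 1.0, 1.0, 1.0)):
    """Return one significance value per sentence.

    Features: average word frequency, title-word overlap, cue-word
    overlap, and a position bonus (assumed: first or last sentence).
    """
    w_freq, w_title, w_cue, w_pos = weights
    freq = Counter(t for s in sentences for t in tokenize(s)
                   if t not in STOPWORDS)
    title_words = set(tokenize(title)) - STOPWORDS
    last = len(sentences) - 1
    scores = []
    for i, sentence in enumerate(sentences):
        tokens = [t for t in tokenize(sentence) if t not in STOPWORDS]
        if not tokens:
            scores.append(0.0)
            continue
        avg_freq = sum(freq[t] for t in tokens) / len(tokens)
        title_hits = len(title_words & set(tokens))
        cue_hits = len(CUE_WORDS & set(tokens))
        position = 1.0 if i in (0, last) else 0.0
        scores.append(w_freq * avg_freq + w_title * title_hits
                      + w_cue * cue_hits + w_pos * position)
    return scores


def extract(sentences, title, k=2):
    """Pick the k highest-scoring sentences, kept in document order."""
    scores = score_sentences(sentences, title)
    top = sorted(range(len(sentences)), key=lambda i: scores[i],
                 reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```

Calling `extract(sentences, title, k=2)` returns the two top-scoring sentences in their original order, which is the basic shape of an extract.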

7
MSWord AutoSummarize
8
MEAD/NewsINEssence (Radev et al, 2003)
9
MEAD/NewsINEssence (Radev et al, 2003)
10
MEAD/NewsINEssence (Radev et al, 2003)
11
Text understanding system
  • A text understanding task often aims to recover
    all of the information that there is in a text,
    including what is only implicit in what is
    actually written.
  • All the richness of natural language becomes
    fair game, including metaphor, metonymy,
    discourse structure, and the recognition of the
    author's underlying intentions, and the full
    interplay between language and world knowledge
    becomes central to the task.

12
Text understanding based summarization
  • Depends on complete sentence analysis and
    discourse analysis with full knowledge support
  • Syntactic parser, semantic interpreter
  • Linguistic knowledge, world knowledge, domain
    knowledge
  • Reasoning mechanisms that work effectively over
    huge knowledge collections.

13
Selection-based vs. understanding-based
  • Selection-based: generally applicable, but
    produces incoherent content and poor readability
    due to unclear relationships between the selected
    text excerpts, dangling references, and so on.
  • Understanding-based: high precision, but very
    slow, with a large amount of wasted computation,
    and highly domain-specific.
  • Endres-Niggemeyer (2000) found that people
    (sometimes) prefer extractive summaries to
    gloss-over abstractive summaries!

14
The reality
  • The dominant approach in practice is still
    selection-based
  • Understanding-based systems exist only in theory,
    and will remain so for quite a while
  • However, certain text understanding tasks at
    small scale or in restricted domains can be done.

15
Topic-guided text summarization: TIDE
  • TIDE is our effort to apply text summarization
    techniques to business applications.
  • Such real-world applications require a
    combination of the different forms of
    summarization.
  • An extractive summary alone will not do.
  • An abstractive summary alone will not do.
  • Information extraction alone will not do.

16
Topic-guided text summarization: TIDE
  • Text summarization as a process of topic
    analysis, passage extraction, text
    understanding, information integration/fusion,
    and text generation.
  • Passage extraction guided by topic structure is
    expected to preserve the logical relationships
    between the extracted text parts, e.g. sentences
    are arranged logically according to topic
    structure
  • Topic representation will also be very helpful in
    the next phase: text analysis and information
    integration.

17
Phase 1: Theme detection, topic labels,
sentence/passage selection
  • Theme detection through pairwise passage
    similarity analysis
  • Vector space model of terms and documents
  • TF-IDF baseline method
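
A TF-IDF baseline for pairwise passage similarity can be sketched as follows. The tokenizer, the smoothed IDF formula, and the function names are assumptions chosen for the illustration; the slides do not say which TF-IDF variant was used.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(passages):
    """Represent each passage as a sparse {term: weight} vector."""
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in d)
    # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    return [{t: tf * idf[t] for t, tf in d.items()} for d in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(passages):
    """Pairwise cosine similarity between all passages."""
    vectors = tfidf_vectors(passages)
    return [[cosine(a, b) for b in vectors] for a in vectors]
```

Thresholding this matrix is one way to arrive at a passage adjacency matrix; the threshold itself would have to be tuned per corpus.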

18
Passage similarity analysis with the LSA method
  • LSA (Latent Semantic Analysis)
  • http://www.cs.utk.edu/lsi/ (Deerwester et al.,
    1990)
  • http://lsa.colorado.edu/
  • Results similar to those obtained with TF-IDF
  • Fuzzy LSI approach (Nikravesh, 2002)

19
Passage similarity analysis
  • OKAPI (TREC-3, Robertson et al., 1996)
  • Weight functions take into account document
    length, average document length, and relevance
    feedback factors, in addition to term frequency
    and collection frequency
  • Current standard
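
The best-known Okapi weighting function is BM25. The sketch below uses the commonly cited form with length normalization and leaves out the relevance feedback factors mentioned above. The parameter values `k1=1.2` and `b=0.75` and the "+1" inside the IDF logarithm (which keeps weights non-negative) are assumptions for the illustration, not settings taken from the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bm25_scores(query, passages, k1=1.2, b=0.75):
    """Score every passage against the query with a BM25-style weight."""
    docs = [tokenize(p) for p in passages]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n if n else 0.0
    if avg_len == 0.0:
        return [0.0] * n
    # Collection frequency: number of passages containing each term.
    df = Counter(t for d in docs for t in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        # Longer-than-average passages are penalized, shorter ones boosted.
        length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1.0)
            score += idf * tf[t] * (k1 + 1.0) / (tf[t] + length_norm)
        scores.append(score)
    return scores
```

For passage-to-passage similarity, one passage can be passed as `query` and the remaining passages as the collection.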

20
Passage adjacency matrix (partial)
21
Passage Relation Map
22
Passage Extraction Rules
  • Passage clusters help us to identify themes and
    topics; unconnected passages form distinct topics
    covered in a document.
  • The sentence/passage closest to the centroid of a
    cluster is chosen for inclusion in the summary.
  • The MMR algorithm (CMU) (Goldstein et al., 2000):
    sentences that are maximally similar to the
    document and maximally dissimilar to sentences
    already in the summary are selected to compose a
    summary.
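
The MMR selection rule can be sketched as a greedy loop. This is a minimal illustration under stated assumptions: similarity is plain bag-of-words cosine, the whole document stands in for the query, and the trade-off parameter `lam` is a free choice; none of these details come from the slides.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Term-count vector for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mmr_select(sentences, k=2, lam=0.5):
    """Greedily pick k sentence indices by maximal marginal relevance.

    Each step maximizes
        lam * sim(sentence, document)
        - (1 - lam) * max sim(sentence, already selected sentence).
    """
    vectors = [bag_of_words(s) for s in sentences]
    document = sum(vectors, Counter())
    selected = []
    candidates = list(range(len(sentences)))

    def marginal_relevance(i):
        relevance = cosine(vectors[i], document)
        redundancy = max((cosine(vectors[i], vectors[j]) for j in selected),
                         default=0.0)
        return lam * relevance - (1.0 - lam) * redundancy

    while candidates and len(selected) < k:
        best = max(candidates, key=marginal_relevance)
        selected.append(best)
        candidates.remove(best)
    return selected
```

With `lam=1.0` the redundancy term vanishes and the loop reduces to plain relevance ranking; lower values trade relevance for diversity.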

23
Creating theme labels
  • Keywords (TF-based)
  • Word families (semantically related words in a
    passage cluster)
  • Key phrases:
  • Linguistic approach
  • Statistical approach: simple heuristics (Kelledy
    and Smeaton, 1997) seem quite effective.
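
The TF-based keyword option is the simplest of these to illustrate. The sketch below labels a passage cluster with its most frequent non-stopword terms; the stopword list and the alphabetical tie-breaking are assumptions, and the key-phrase heuristics of Kelledy and Smeaton are not reproduced here.

```python
import re
from collections import Counter

# Hypothetical, tiny stopword list for the example.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "as", "it", "on", "for"}


def theme_label(passages, n_keywords=3):
    """Return the most frequent non-stopword terms of a passage cluster."""
    counts = Counter()
    for passage in passages:
        counts.update(t for t in re.findall(r"[a-z]+", passage.lower())
                      if t not in STOPWORDS)
    # Sort by descending frequency, then alphabetically for stable ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:n_keywords]]
```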

24
Next step
25
WordNet, since 1985
  • Lexical database developed at Princeton
    University, led by George Miller
  • Hand-coded, freely available
  • Word knowledge of nouns, verbs, adjectives,
    adverbs
  • Semantic network representation with only a few
    semantic relations
  • Synonym, hypernym
  • Categorization relation: is-a
  • Widely used in query expansion, word similarity
    determination (based on synsets)

29
FrameNet, The Berkeley FrameNet project
  • A lexical resource, but one that contains much
    richer information about words than WordNet
  • Contains rich linguistic knowledge necessary for
    text understanding
  • Documents the range of semantic and syntactic
    combinatory possibilities of each word (nouns,
    verbs, adjectives) in each of its senses
  • Manual annotation of example sentences;
    automatic capture and organization of the
    annotation results (using IE technology)
  • Can be displayed and queried via the web

32
How to use FrameNet?
  • Frames are formed in accord with the various uses
    of prepositions around the verb sense
  • The cases associated with a verb sense are
    related to the questions that we would usually
    ask about an event, such as who did what to whom,
    and when?
  • Parts of a sentence are used to instantiate a
    frame, and content is recognized from the text
    segments to fill in the frame slots.
  • Much remains to be explored.
  • Its coverage is limited.

35
ConceptNet, MIT Media Lab
  • Common sense knowledge base with NLP capability
  • Extracted automatically from common sense
    knowledge expressed in semi-structured NL
    sentences from OMCSNet (Open Mind Common Sense),
    applying about 50 extraction rules
  • Example sentence: "The effect of falling off a
    bike is you get hurt."
  • "A lime is a very sour fruit" in OMCS is
    extracted into two assertions:
  • IsA (lime, fruit)
  • PropertyOf (lime, very sour)
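
A rule of this kind can be approximated with a regular expression over the semi-structured sentence. The two patterns below are hypothetical stand-ins written for this example, not ConceptNet's actual extraction rules, and the relation name `EffectOf` for the bike sentence is an assumption.

```python
import re

# Hypothetical pattern for "A/An/The X is a/an/the [modifiers] Y".
ISA_RULE = re.compile(
    r"^(?:an?|the)\s+(?P<x>\w+)\s+is\s+(?:an?|the)\s+"
    r"(?:(?P<prop>.+)\s+)?(?P<y>\w+)\.?$",
    re.IGNORECASE,
)
# Hypothetical pattern for "The effect of X is Y".
EFFECT_RULE = re.compile(
    r"^the\s+effect\s+of\s+(?P<cause>.+?)\s+is\s+(?P<effect>.+?)\.?$",
    re.IGNORECASE,
)


def extract_assertions(sentence):
    """Turn one semi-structured sentence into (relation, arg1, arg2) tuples."""
    text = sentence.strip()
    match = EFFECT_RULE.match(text)
    if match:
        return [("EffectOf", match.group("cause").lower(),
                 match.group("effect").lower())]
    match = ISA_RULE.match(text)
    if match:
        x = match.group("x").lower()
        assertions = [("IsA", x, match.group("y").lower())]
        if match.group("prop"):
            # Modifiers before the head noun become a property of X.
            assertions.append(("PropertyOf", x, match.group("prop").lower()))
        return assertions
    return []
```

Applied to the lime sentence, `extract_assertions` yields the same two assertions listed above.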

37
ConceptNet (Liu and Singh, 2004a, 2004b)
  • Inference:
  • Spreading activation: node activation radiating
    outward from an origin node
  • GetContext (node)
  • GetAnalogousConcept (node)
  • Graph traversal:
  • FindPathBetweenNodes (node1, node2)
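
Both operations are easy to illustrate on a toy graph. The functions below only echo the names on the slide; they are not the ConceptNet API. The graph, the decay factor, the hop limit, and the simplification that each node keeps the activation from its first visit are all assumptions for this sketch.

```python
from collections import deque

# Hypothetical toy graph; each edge is listed in both directions.
GRAPH = {
    "bike": ["ride", "fall"],
    "fall": ["bike", "hurt"],
    "hurt": ["fall", "pain"],
    "ride": ["bike"],
    "pain": ["hurt"],
}


def get_context(graph, origin, decay=0.5, max_hops=2):
    """Spread activation outward from origin, decaying at every hop."""
    activation = {origin: 1.0}
    frontier = [origin]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbor in graph.get(node, []):
                if neighbor not in activation:
                    activation[neighbor] = activation[node] * decay
                    next_frontier.append(neighbor)
        frontier = next_frontier
    del activation[origin]
    # Strongest activation first; ties broken alphabetically.
    return sorted(activation.items(), key=lambda item: (-item[1], item[0]))


def find_path_between_nodes(graph, start, goal):
    """Breadth-first search for a shortest path, or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbor in graph.get(path[-1], []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(path + [neighbor])
    return None
```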

38
ConceptNet (Liu and Singh, 2004a, 2004b)
  • Supports:
  • Topic sensing
  • Query expansion
  • Semantic similarity of words
  • Lexical generalization
  • Thematic generalization
  • Much remains to be examined
  • Uncontrolled vocabulary; the content can be
    biased, but the knowledge seems quite reliable.

39
Topic-Sensing
40
Eurovoc multilingual thesaurus
  • Controlled vocabulary, 20 languages, broad fields:
  • politics, international relations, European
    Communities, law, economics, trade, finance,
    social questions, education, science,
    international organizations, employment and
    working conditions
  • industry, business and competition, production,
    technology and research,
  • transport, environment, energy,
  • agriculture, forestry and fisheries,
    agri-foodstuffs,
  • geography

41
Next steps
  • It is not clear how the various current knowledge
    resources will help in real-world business
    applications, but it is important to take a
    deeper look into them.
  • Study the peculiarities of particular business
    document corpora to improve the selection
    process.
  • Other knowledge resources