Content Extraction in Majordome

Provided by: Vail9

1
Content Extraction in Majordome
  • Overall objective: quick detection of short
    information elements for message filtering and
    reporting to the user
  • Functional position of this processing phase:
  • Server-side, event-oriented background task
  • Subsequent and/or parallel to speech recognition
    (voice messages) or image processing (faxes);
    prior to text summarizing

2
Useful applications (1)
  • Name/date/subject identification (this task is
    especially useful for fax and voice messages,
    which have no standardized fields for storing
    this information)
  • Example: "You have 1 fax message from Mrs Diaconu
    about attending the Barcelona meeting"
  • Backup information: the user's address book (PABX
    info yields the sender's phone number)

3
Useful applications (2)
  • Message filtering
  • Example: "You have received 14 personal e-mail
    messages, among which 3 messages from friends,
    6 requests from students or colleagues, and
    5 spam messages; you have received 26 mailing
    list messages, among which 3 calls for papers,
    11 conference announcements, and 12 other."
  • Backup information: RFC-822 From and Subject
    fields.
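
For e-mail, the backup information above can be read straight from the message headers. A minimal sketch using Python's standard `email` package; the helper name `backup_fields` and the sample message are illustrative assumptions, not part of Majordome.

```python
from email import message_from_string
from email.utils import parseaddr

def backup_fields(raw_message: str) -> dict:
    """Return sender name, sender address and subject from RFC-822 headers."""
    msg = message_from_string(raw_message)
    # parseaddr splits "Display Name <user@host>" into its two parts.
    name, address = parseaddr(msg.get("From", ""))
    return {"name": name, "address": address, "subject": msg.get("Subject", "")}

# Made-up sample message, for illustration only.
SAMPLE = (
    "From: Jane Doe <jane.doe@example.org>\n"
    "Subject: Barcelona meeting\n"
    "\n"
    "I will attend the meeting.\n"
)

fields = backup_fields(SAMPLE)
```

Fax and voice messages carry no such headers, which is why the name/date/subject identification task matters most for them.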

4
Techniques (1)
  • Text statistics measures
  • Frequency of occurrence of certain words,
    morphological categories, or syntactic
    structures in different types of messages
  • E.g. the noun/verb frequency ratio is higher in
    technical texts; style markers are specific to
    some text genres (e.g. frequent use of ! or in
    advertisements; loose-style abbreviations like
    CU, IMHO in English, or A in French)
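
Two of the style markers mentioned above can be measured with a few lines of Python. This is only a sketch under stated assumptions: the function name `style_markers`, the per-word normalization, and the abbreviation list (beyond CU and IMHO, which come from the slide) are hypothetical choices, not the Majordome toolbox.

```python
import re

# Hypothetical marker list; only CU and IMHO are taken from the slide.
LOOSE_ABBREVIATIONS = {"CU", "IMHO"}

def style_markers(text: str) -> dict:
    """Compute two simple style-marker frequencies, normalized per word."""
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)  # avoid division by zero on empty text
    exclamations = text.count("!")
    abbreviations = sum(1 for w in words if w.upper() in LOOSE_ABBREVIATIONS)
    return {
        "exclamations_per_word": exclamations / n_words,
        "abbreviations_per_word": abbreviations / n_words,
    }

# Made-up sample texts, for illustration only.
ad = style_markers("Buy now! Best price! Free gift!")
chat = style_markers("IMHO the talk was fine, CU tomorrow")
```

The noun/verb ratio would additionally need a part-of-speech tagger, which is left out of this sketch.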

5
Techniques (2)
  • Text skimming
  • Spotting good candidates for specific word
    types (e.g. proper names): selecting capitalized
    words,
  • comparing with entries in a database of common
    first names / family names, and/or
  • using local grammars to disambiguate other
    cases.
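
The three skimming steps above can be sketched as follows. The name set, the title list standing in for a "local grammar", and the function name `find_name_candidates` are all assumptions made for illustration; a real system would use a full names database and richer grammars.

```python
import re

# Tiny stand-in for a database of common first and family names (assumption).
KNOWN_NAMES = {"maria", "john", "smith"}
# Toy "local grammar" (assumption): a capitalized word right after a title
# is treated as a name even when it is not in the database.
TITLES = {"mr", "mrs", "ms", "dr", "prof"}

def find_name_candidates(text: str) -> list:
    """Return capitalized words judged to be proper names, in text order."""
    tokens = re.findall(r"[A-Za-z]+", text)
    names = []
    for i, token in enumerate(tokens):
        if not token[0].isupper():
            continue  # step 1: keep only capitalized words
        in_database = token.lower() in KNOWN_NAMES               # step 2
        after_title = i > 0 and tokens[i - 1].lower() in TITLES  # step 3
        if in_database or after_title:
            names.append(token)
    return names

result = find_name_candidates(
    "You have 1 fax message from Mrs Diaconu about attending "
    "the Barcelona meeting. Regards, John"
)
```

Here "Diaconu" is accepted by the title rule and "John" by the database lookup, while "You", "Barcelona" and "Regards" are capitalized but rejected.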

6
Techniques (3)
  • Merging visual clues and textual clues for mutual
    reinforcement of identification probability.
  • E.g. the probability that an unidentified,
    capitalized character string is the proper name
    of a fax's sender increases if it stands alone on
    a line at the top of the image.
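
One possible way to merge such clues is to treat each one as a likelihood ratio that multiplies the prior odds (a naive-Bayes-style update). The slides do not say which combination rule Majordome uses, so this is a swapped-in technique; the function name `combine_clues` and every number below are hypothetical.

```python
def combine_clues(prior: float, likelihood_ratios: list) -> float:
    """Update a prior probability with clues given as likelihood ratios.

    Each ratio is P(clue | is sender name) / P(clue | is not sender name).
    Clues are assumed independent, which is a simplification.
    """
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

# Hypothetical values, chosen only to illustrate the effect.
PRIOR = 0.2          # capitalized string not found in the names database
ALONE_ON_LINE = 3.0  # visual clue: the string stands alone on its line
TOP_OF_PAGE = 2.0    # visual clue: the line is near the top of the image

textual_only = combine_clues(PRIOR, [])
with_visual = combine_clues(PRIOR, [ALONE_ON_LINE, TOP_OF_PAGE])
```

With these made-up numbers the probability rises from 0.2 to 0.6 once both visual clues are taken into account.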

7
Content Extraction: Current Developments
  • Toolbox for text statistics (word frequency,
    contextual windows, co-occurrence frequency)
  • Tool for determining fuzzy membership in a given
    class of words
  • Tool for determining document language and
    segmenting multilingual documents
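
The text statistics named in the first bullet (word frequency, contextual windows, co-occurrence frequency) can be illustrated with the standard library alone. The function names, the tokenization rule, and the window size are assumptions for this sketch, not the actual toolbox.

```python
from collections import Counter
import re

def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def word_frequency(tokens: list) -> Counter:
    """Count how often each token occurs."""
    return Counter(tokens)

def cooccurrence(tokens: list, window: int = 2) -> Counter:
    """Count unordered pairs of distinct words at most `window` tokens apart."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # The contextual window: the next `window` tokens after position i.
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs

# Made-up sample sentence, for illustration only.
tokens = tokenize("call for papers: the call for papers is open")
freq = word_frequency(tokens)
pairs = cooccurrence(tokens, window=2)
```

Pairs such as ("call", "papers") that recur inside the window are the kind of signal a later categorization module could use to recognize, say, a call for papers.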

8
Content Extraction: Future Developments
  • Text categorization module for message sorting
    and filtering
  • Text genre database with (user-controlled)
    learning capabilities