CIS392 Text Retrieval - PowerPoint PPT Presentation

Provided by: cisN9

Title: CIS392 Text Retrieval


1
CIS392 Text Retrieval & Mining
  • Structure of Text
  • Material: Sullivan, Ch. 2

2
Structure of Text
  • Databases: rows, columns, tables, relations
  • Text: words, phrases, sentences, paragraphs

3
Natural Structures for Text
  • Natural languages are composed of words and rules
    for combining words.

4
Building Blocks of Text
  • Morphology (the study of the structure and form
    of words)
  • Syntax (the study of how words and phrases form
    sentences)
  • Semantics (the study of the meaning of words)
  • Phonology (the study of sounds in language)
  • Pragmatics (the study of language use in context,
    including idiomatic phrases that cannot be
    analyzed with strict semantic analysis)

5
Morphology
  • Subparts of words
  • Stems: core-meaning-bearing elements
  • Affixes: prefixes (un-, in-) and suffixes (-able,
    -ment)
  • Inflectional elements: used with verbs to
    distinguish past from present, and with nouns to
    distinguish singular from plural.

6
Morphological analysis
  • Reducing complexity of analysis (stemming)
  • Reducing complexity of representing word meanings
    (no need to store all variations of words in
    electronic lexicons)
  • Supporting other text mining operations (e.g.
    identifying concepts that are in noun forms)
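
The stemming idea above can be sketched as naive suffix stripping. This is only an illustration: the suffix list and the `min_stem` cutoff are assumptions made for this example, and a real stemmer applies far more careful rules.

```python
# Illustrative suffix list (an assumption, not a complete morphology).
SUFFIXES = ("ments", "ment", "ables", "able", "ing", "ed", "es", "s")


def naive_stem(word: str, min_stem: int = 3) -> str:
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def build_index(words):
    """Group surface forms under a shared stem, so the lexicon
    stores one entry per stem instead of one per variation."""
    index = {}
    for w in words:
        index.setdefault(naive_stem(w), set()).add(w)
    return index


index = build_index(["link", "links", "linked", "linking"])
```

Here all four surface forms end up under the single stem `link`, which is the reduction in lexicon size the slide refers to.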

7
Syntax
  • Syntax helps to recognize and utilize phrase
    structure. (It's useful for information
    extraction.)

8
Should be article

9
Prof. Wu's Noun Phrase Extractor

10
Syntax: case assignment
  • Case assignment: an element of syntax defining
    the relationship between verbs and noun phrases,
    especially the role of a noun in a sentence.
    E.g. the verb "to fine" takes an agent doing the
    fining, a recipient being fined, and optionally
    an amount of the fine.
  • Example: The SEC fined Alpha Beta Gamma
    Industries one million dollars.
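
As a rough sketch of how these roles might be captured, the example sentence can be mapped onto a small record. The `FineFrame` class and the regular expression are hypothetical, written only for this one sentence shape; they are not a general parser.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class FineFrame:
    """Roles assigned by the verb "to fine" (hypothetical structure)."""
    agent: str
    recipient: str
    amount: Optional[str] = None


# Assumption: amounts look like "<number or number word> [scale] dollars".
AMOUNT = (
    r"(?P<amount>(?:\d[\d,.]*|one|two|three|four|five|six|seven|eight|nine|ten)"
    r"(?: (?:hundred|thousand|million|billion))? dollars)"
)
PATTERN = re.compile(
    rf"^(?P<agent>.+?) fined (?P<recipient>.+?)(?: {AMOUNT})?\.?$",
    re.IGNORECASE,
)


def parse_fine_sentence(sentence: str) -> Optional[FineFrame]:
    """Return the roles if the sentence has the form 'X fined Y [amount]'."""
    m = PATTERN.match(sentence.strip())
    if m is None:
        return None
    return FineFrame(m["agent"], m["recipient"], m["amount"])


frame = parse_fine_sentence(
    "The SEC fined Alpha Beta Gamma Industries one million dollars."
)
```

The amount is optional in the pattern, matching the slide's description of it as an optional role.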

11
Semantics
  • Semantics in text mining: how to represent the
    meaning of words, phrases, and sentences.
  • Semantic networks use nodes and arcs to represent
    objects, events, concepts, and the relations
    among them.
  • Classification hierarchies and taxonomies are a
    limited type of semantic network that can
    represent type-of and part-of relationships.
  • They are useful for searching by topics instead
    of keywords.
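
A minimal sketch of topic search over a type-of hierarchy follows. The `IS_A` taxonomy and the sample documents are invented for illustration.

```python
# Hypothetical type-of hierarchy: child term -> parent term.
IS_A = {
    "laptop": "computer",
    "server": "computer",
    "computer": "hardware",
    "printer": "hardware",
}


def ancestors(term):
    """Walk up the hierarchy and return every broader term."""
    chain = []
    while term in IS_A:
        term = IS_A[term]
        chain.append(term)
    return chain


def search_by_topic(topic, documents):
    """Return ids of documents that mention the topic or any narrower term."""
    hits = []
    for doc_id, text in documents.items():
        words = set(text.lower().split())
        if any(w == topic or topic in ancestors(w) for w in words):
            hits.append(doc_id)
    return hits


docs = {
    "d1": "the laptop battery failed",
    "d2": "new printer installed",
    "d3": "quarterly budget report",
}
computer_docs = search_by_topic("computer", docs)
hardware_docs = search_by_topic("hardware", docs)
```

A keyword search for "computer" would miss `d1`, since the word never appears there; the hierarchy is what connects "laptop" to the topic.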

12
Limits of Natural Language Processing
  • Identifying the role of phrases
  • "I saw the technician with the instrument." (Who
    has the instrument: the speaker or the
    technician?)
  • Representing abstract concepts
  • Difficulty in drawing conclusions and deductions.
  • Synonyms and multiple terms
  • Representing different concepts
  • Custom-made hierarchies for particular
    applications.

13
Statistical techniques and NLP
  • Document summaries can be derived by creating a
    semantic network and finding the important
    concepts in it.
  • Alternative: word frequency.
  • Define the importance of a sentence as the sum of
    the frequencies of the words in it. Rank the
    sentences and select the most important ones to
    form the summary.
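
The word-frequency alternative can be sketched as follows. The sentence splitter and tokenizer are simplistic assumptions, and no stop-word filtering is applied.

```python
import re
from collections import Counter


def tokenize(sentence):
    """Lowercase word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", sentence.lower())


def summarize(text, n_sentences=1):
    """Score each sentence by the summed document frequency of its words,
    then return the top-scoring sentences in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(w for s in sentences for w in tokenize(s))
    scored = [
        (sum(freq[w] for w in tokenize(s)), i, s) for i, s in enumerate(sentences)
    ]
    top = sorted(scored, key=lambda t: (-t[0], t[1]))[:n_sentences]
    return [s for _, _, s in sorted(top, key=lambda t: t[1])]


text = (
    "Text mining finds patterns in text. Cats sleep. "
    "Text mining uses word frequency in text."
)
summary = summarize(text, 1)
```

Because scores are plain sums, longer sentences tend to score higher, so a practical version would likely normalize by length or drop common words.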

14
Generating semantic network
  • Use the number of co-occurrences between words as
    the degree of correlation between them in the
    network.
  • This method allows semantic networks to be
    generated automatically.
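
One way to sketch this is to count how often two words appear in the same sentence and use the count as the edge weight. Treating the sentence as the co-occurrence window is an assumption; the slides do not say which window is used.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_network(sentences):
    """Return edge weights: (word_a, word_b) -> number of sentences
    in which both words appear. Pairs are stored in sorted order."""
    edges = Counter()
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        for a, b in combinations(words, 2):
            edges[(a, b)] += 1
    return edges


edges = cooccurrence_network(
    ["sec fined alpha", "sec fined beta", "alpha sued beta"]
)
strongest = edges.most_common(1)[0]
```

In this toy input the strongest edge links "fined" and "sec", which co-occur in two sentences.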

15
[Figure 2.7: applicability of technique vs. size of text, for statistical techniques and linguistic analysis]

16
Problems of automatic methods
  • Frequency and length do not always reflect
    degree of importance.
  • Not intelligent enough to recognize topical
    sections of a document. (Containing a topic word
    does not mean the paragraph is all about that
    topic.)

17
Artificial Structures for Text
  • Markup languages
  • Define structure and describe formatting.
  • Two structures: semi-structured text and hypertext.
  • XML: widely used for data exchange because XML
    tags allow easy parsing.
  • HTML: allows hypertext formatting.

18
SGML
  • Standard Generalized Markup Language (SGML)
  • Many markup languages derive from SGML: HTML
    (through version 4) is an SGML application and
    XML is a simplified subset.
  • Provides a language for describing the structure
    of documents.

19
XML tags define structuring elements
  • <?xml version="1.0"?>
  • <MarketingStatus>
  • <ReportDate>April 10, 2001</ReportDate>
  • <MarketingPeriod> Q1, 2001</MarketingPeriod>
  • <Region> Northeast </Region>
  • <Status>
  • This quarter has seen a significant shift in
    marketing efforts away from
  • </Status>
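
To illustrate why tagged text is easy to parse, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. The closing `</MarketingStatus>` tag and the `...` marking the truncated status text are assumptions added so that the fragment is well-formed.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, closed off so it parses (the closing tag is assumed).
REPORT = """<?xml version="1.0"?>
<MarketingStatus>
  <ReportDate>April 10, 2001</ReportDate>
  <MarketingPeriod> Q1, 2001</MarketingPeriod>
  <Region> Northeast </Region>
  <Status>
    This quarter has seen a significant shift in
    marketing efforts away from ...
  </Status>
</MarketingStatus>
"""


def extract_fields(xml_text):
    """Map each child tag of the root to its whitespace-normalized text."""
    root = ET.fromstring(xml_text)
    return {child.tag: " ".join((child.text or "").split()) for child in root}


fields = extract_fields(REPORT)
```

Each structured field (date, period, region) is now addressable by name, while the free text of the status section stays available for text mining.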

20
XML: XPointer
  • XPointer: a means of addressing fragments within
    an XML document, such as a budget status section.
  • Absolute: identifies a specific point in the XML
    tree without reference to any other point in the
    tree, e.g. the root.
  • Relative: describes a node in the XML tree using a
    starting node and a navigation term.
  • Spanning: identifies a sub-section of a tree
    between a starting and an ending point.
  • String: uses a string to specify a point in the
    tree.
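
The four addressing styles can be loosely imitated with the limited path support in Python's `xml.etree.ElementTree`. This is an analogy only: it does not use XPointer syntax, and the sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical report with three sections.
DOC = """<Report>
  <Marketing><Status>shift in efforts</Status></Marketing>
  <Budget><Status>over budget</Status></Budget>
  <Staffing><Status>two open roles</Status></Staffing>
</Report>"""

root = ET.fromstring(DOC)

# Absolute: a full path from the root, independent of any other node.
absolute = root.find("./Budget/Status")

# Relative: a starting node plus a navigation step.
budget = root.find("Budget")
relative = budget.find("Status")

# Spanning: the run of sibling sections from a start node to an end node.
children = list(root)
start = children.index(root.find("Marketing"))
end = children.index(root.find("Budget"))
span = [child.tag for child in children[start : end + 1]]

# String: locate nodes by the text they contain.
by_string = [el for el in root.iter() if el.text and "open roles" in el.text]
```

The absolute and relative lookups reach the same budget status node by different routes, which is the distinction the slide draws.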

21
XML Linking
  • Features extended beyond HTML linking:
  • Bi-directional linking
  • Links that annotate read-only documents
    (automatically generate links between documents
    that are not linked to each other.)
  • Additional attributes and descriptive roles
    (semantic info) for links

22
Hypertext
  • Provides ways to structure collections of
    documents as an interconnected network of text,
    from intra-document to inter-document links.

23
Hub and Authority
  • Hubs are documents that link to many other
    documents, e.g. web directories like Yahoo!
  • Authorities are documents that are pointed to by
    many other documents, e.g. the IRS web site.
  • Hub and authority scores can be used to improve
    the ranking of retrieved documents.
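
Hub and authority scores are commonly computed iteratively, in the manner of Kleinberg's HITS algorithm. The slides do not name an algorithm, so the sketch below is one plausible reading, and the tiny link graph is invented.

```python
def hits(links, iterations=20):
    """Iteratively compute hub and authority scores.

    links maps each page to the pages it links to.
    Returns (hub, authority) dicts with L2-normalized scores.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    out = {p: set(links.get(p, ())) for p in pages}
    hub = dict.fromkeys(pages, 1.0)
    auth = dict.fromkeys(pages, 1.0)
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {p: sum(hub[q] for q in pages if p in out[q]) for p in pages}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # A page's hub score is the sum of the authority scores it links to.
        hub = {p: sum(auth[t] for t in out[p]) for p in pages}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: a directory links widely, and the IRS site is linked to most.
links = {"directory": ["irs", "news"], "blog": ["irs"], "irs": []}
hub, auth = hits(links)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
```

A retrieval system could then combine these scores with its text-match score when ordering results.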