COMP 7118 Fall 2004 - PowerPoint PPT Presentation

1
COMP 7118 Fall 2004
  • Text Mining

2
Text Databases
  • Large collections of documents from various
    sources: news articles, research papers, books,
    digital libraries, e-mail messages, Web pages,
    and library databases
  • Properties
  • Unstructured in general (semi-structured with
    help, e.g. XML)
  • Semantics, not only syntax, is important
  • Non-numeric in nature

3
Text Database and Information Retrieval
  • Information retrieval
  • Traditional study of how to retrieve information
    from text documents
  • Information is organized into (a large number of)
    documents
  • Information retrieval problem: locating relevant
    documents based on user input, such as keywords
    or example documents

4
Text Database and Information Retrieval
  • Typical IR systems
  • Online library catalogs
  • Online document management systems
  • Information retrieval vs. database systems
  • Some DB problems are not present in IR, e.g.,
    update, transaction management, complex objects
  • Some IR problems are not addressed well in DBMS,
    e.g., unstructured documents, approximate search
    using keywords and relevance

5
Basic Measures for Text Retrieval
  • Precision: the percentage of retrieved documents
    that are in fact relevant to the query (i.e.,
    correct responses)
  • Recall: the percentage of documents that are
    relevant to the query and were, in fact, retrieved
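The two measures above can be sketched as set operations over document IDs (a minimal illustration; the function and document names are hypothetical):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved documents that are relevant.
    Recall: fraction of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)   # correct responses
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```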

6
Basic retrieval problems
  • Keyword-based
  • Find documents that contain certain keywords
  • Expression of keywords
  • Boolean: car and repair shop, tea or coffee, DBMS
    but not Oracle
  • Regular expression: car ? ? repair

7
Basic retrieval problems
  • Similarity-based
  • Finds similar documents based on a set of common
    keywords
  • Answers should be ranked by degree of relevance,
    based on the nearness of the keywords, relative
    frequency of the keywords, etc.

8
Challenges in text retrieval
  • Semantics
  • Synonymy: a keyword T does not appear anywhere in
    the document, even though the document is closely
    related to T, e.g., data mining
  • Polysemy: the same keyword may mean different
    things in different contexts, e.g., mining

9
Challenges in text retrieval
  • Data cleaning
  • Stop list: a set of words that are deemed
    irrelevant, even though they may appear
    frequently
  • E.g., a, the, of, for, with, etc.
  • Stop lists may vary as the document set varies
  • Word stem: several words are small syntactic
    variants of each other since they share a common
    word stem
  • E.g., drug, drugs, drugged
  • Tagging: sometimes it is better to view a group
    of words as a single unit (like a noun phrase)
  • E.g., data mining
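The cleaning steps above can be sketched as a small pipeline; the stop list and the crude suffix-stripping rule here are toy stand-ins (a real system would use a proper stemmer such as Porter's):

```python
STOP_WORDS = {"a", "the", "of", "for", "with", "is", "this"}

def clean_tokens(text, stop_words=STOP_WORDS):
    """Lowercase the text, drop stop words, and crudely strip suffixes
    so that small syntactic variants map to one stem."""
    tokens = [t for t in text.lower().split() if t not in stop_words]
    stems = []
    for t in tokens:
        for suffix in ("ged", "s"):  # drug, drugs, drugged -> drug
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems
```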

10
Challenges in text retrieval
  • Data representation
  • How to represent the data for each data mining
    task
  • Clustering: how to define similarity?
  • Classification: which attributes to use?

11
Challenges in text retrieval
  • Example: classifying text documents
  • News items
  • Each item belongs to a pre-defined topic
  • Build a classifier to classify them
  • What should the attributes be?
  • Individual words?
  • Even after removing stop words, the number of
    words may be large
  • A training set of 100 documents typically has
    1000 distinct words
  • Each word appears in very few documents

12
Challenges in text retrieval
  • Assume we use a Naïve Bayes classifier
  • Each word appears in very few documents
  • Moreover, many words do not appear in various
    classes
  • Each test document does not contain many words
  • ⇒ Probabilities are very small, and computational
    error can be introduced
  • The same concept can also be represented by
    different words/phrases
  • Challenge
  • How well can we do without semantics?
  • How can we incorporate semantic information?

13
Data Representation
  • Document vector / frequency matrix
  • Each document is represented by a vector
  • Each dimension of the vector is associated with a
    word/term
  • For each document, the value of each dimension is
    the frequency with which that word occurs in the
    document

14
Data Representation
  • Example
  • Document 1: This is a database system textbook
  • Document 2: Oracle database sells for 1000 this
    year
  • Vector dimensions: (database, system, textbook,
    oracle, sells, year)
  • D1 = (1, 1, 1, 0, 0, 0)
  • D2 = (1, 0, 0, 1, 1, 1)
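A minimal sketch of building these vectors over a fixed vocabulary (function names are illustrative):

```python
def doc_vector(doc, vocab):
    """Term-frequency vector for one document over a fixed vocabulary."""
    counts = {}
    for token in doc.lower().split():
        counts[token] = counts.get(token, 0) + 1
    return [counts.get(term, 0) for term in vocab]

vocab = ["database", "system", "textbook", "oracle", "sells", "year"]
d1 = doc_vector("This is a database system textbook", vocab)
d2 = doc_vector("Oracle database sells for 1000 this year", vocab)
# d1 -> [1, 1, 1, 0, 0, 0]; d2 -> [1, 0, 0, 1, 1, 1]
```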

15
Data Representation
  • Variations
  • Binary values: only the presence/absence of a word
    is recorded
  • Normalized values: normalize each dimension to the
    (0,1) range
  • Weighted frequency

16
Data Representation (tf-idf scheme)
  • Weighted frequency (tf-idf scheme)
  • Useful for static data sets
  • Term frequency (tf_ij): how often term j appears
    in document i (normalized)
  • Document frequency (df_j): how many documents
    contain term j
  • For document i and term j:
    w_ij = tf_ij × log(d / df_j)
  • where d is the number of documents in the
    database
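A sketch of the weighting, assuming tf is normalized by the largest raw count in the document (one common choice; the slide does not fix the normalization):

```python
import math

def tf_idf(freq_matrix):
    """freq_matrix[i][j] is the raw count of term j in document i.
    Returns the weights w_ij = tf_ij * log(d / df_j)."""
    d = len(freq_matrix)
    n_terms = len(freq_matrix[0])
    # df_j: number of documents containing term j
    df = [sum(1 for i in range(d) if freq_matrix[i][j] > 0)
          for j in range(n_terms)]
    weights = []
    for row in freq_matrix:
        max_f = max(row) or 1          # normalize tf by the max count
        weights.append([(f / max_f) * math.log(d / df[j]) if df[j] else 0.0
                        for j, f in enumerate(row)])
    return weights
```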

17
Data Representation Example (tf-idf scheme)
  • D1: This is a database system textbook
  • D2: Oracle database sells for 1000 this year
  • D3: My oracle database textbook for my database
    class
  • Raw frequencies

18
Data Representation Example (tf-idf scheme)
  • D1: This is a database system textbook
  • D2: Oracle database sells for 1000 this year
  • D3: My oracle database textbook for my database
    class
  • Normalized frequencies

19
Data Representation Example (tf-idf scheme)
  • D1: This is a database system textbook
  • D2: Oracle database sells for 1000 this year
  • D3: My oracle database textbook for my database
    class
  • Document frequencies

20
[Tables: normalized term frequencies (tf_ij) and document frequencies (df_j)]
21
Data Representation (tf-idf scheme)
  • Properties
  • Words that appear in every document have a weight
    of 0
  • Words that appear in very few documents have
    higher weight
  • Good for clustering? Classification?

22
Data Representation
  • Term-frequency matrix/table
  • The combination of all the document vectors forms
    a matrix
  • Query vector
  • Queries usually consist of a set of keywords and
    thus can be represented as vectors
  • A query can also be a separate document, with its
    vector calculated as before

23
Data Representation: Similarity measures
  • For various tasks, we need a measure of the
    similarity between documents
  • Cosine similarity
  • Focuses on the co-occurrence of words
  • This corresponds to the cosine of the angle
    between the two vectors
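A minimal sketch of the measure over term-frequency vectors:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|): 1.0 for identical directions,
    0.0 when the documents share no terms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

On the example vectors from slide 14, D1 = (1, 1, 1, 0, 0, 0) and D2 = (1, 0, 0, 1, 1, 1) share only the term "database", giving a similarity of 1/(√3 · 2) ≈ 0.29.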

24
Data Representation: Latent Semantic Indexing
  • Weaknesses of keyword-based techniques
  • Lack of semantics
  • Cannot identify similar words/concepts without
    help
  • Observation
  • Words/phrases that represent similar concepts are
    usually grouped together
  • The most important unit of information for
    documents may not be the word but the concept

25
Data Representation: Latent Semantic Indexing
  • Latent Semantic Indexing is an attempt to produce
    such information
  • Find a (relatively) small number of concepts to
    serve as the vector dimensions
  • Try to approximate the original information by
    these dimensions

26
Data Representation: Latent Semantic Indexing
  • Start with the term-frequency matrix M
  • M is of size (m × n)
  • m: number of terms
  • n: number of documents
  • Find the singular values of M
  • These are the square roots of the eigenvalues of
    MM^T
  • Assume we have r singular values (r = rank(M))

27
Data Representation: Latent Semantic Indexing
  • Then M = U S V^T
  • S: (r × r) diagonal matrix of the singular
    values
  • U: (m × r) matrix of term vectors
  • V: (n × r) matrix of document vectors
  • Interpretation
  • There are r distinct concepts within this set
    of documents
  • Each row in U is a term vector that corresponds
    to a term
  • The value in each dimension indicates how strongly
    the term corresponds to that concept
  • Similarly for V

28
Data Representation: Latent Semantic Indexing
  • Notice that S is diagonal
  • We can reorganize the values in descending order
  • Many values tend to be small
  • Replace those values with 0 to form a new matrix
    S'
  • Then M' = U S' V^T is a good approximation of
    the original matrix M
  • Notice that each document is now represented by a
    shorter vector: data reduction
  • And each dimension now corresponds to a concept,
    which is based on the co-occurrence of words
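The truncation step can be sketched with NumPy's SVD (assuming NumPy is available; `lsi_reduce` and `k` are illustrative names):

```python
import numpy as np

def lsi_reduce(M, k):
    """Keep only the k largest singular values of the term-by-document
    matrix M (zeroing the rest) and return the rank-k approximation
    M' = U S' V^T."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_kept = np.zeros_like(s)
    s_kept[:k] = s[:k]    # singular values come back sorted descending
    return U @ np.diag(s_kept) @ Vt
```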

29
Data Representation: Latent Semantic Indexing --
Example
  • (From Berry, Dumais, O'Brien 1994)

30
Types of Text Data Mining
  • Keyword-based association analysis
  • Automatic document classification
  • Similarity detection
  • Cluster documents by a common author
  • Cluster documents containing information from a
    common source

31
Types of Text Data Mining
  • Link analysis: unusual correlations between
    entities
  • Sequence analysis: predicting a recurring event
  • Anomaly detection: finding information that
    violates usual patterns
  • Hypertext analysis
  • Patterns in anchors/links
  • Anchor-text correlations with linked objects

32
Keyword-based association analysis
  • Collect sets of keywords or terms that occur
    frequently together and then find the association
    or correlation relationships among them
  • First preprocess the text data by parsing,
    stemming, removing stop words, etc.
  • Then invoke association mining algorithms
  • Consider each document as a transaction
  • View the set of keywords in the document as the
    set of items in the transaction
  • Term-level association mining
  • No need for human effort in tagging documents
  • The number of meaningless results and the
    execution time are greatly reduced
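The transaction view above can be sketched as a candidate-pair count, one step of an Apriori-style algorithm (function and parameter names are illustrative):

```python
from collections import Counter
from itertools import combinations

def frequent_keyword_pairs(docs, min_support):
    """Treat each document's keyword set as a transaction and count
    co-occurring keyword pairs; keep pairs meeting the support threshold."""
    counts = Counter()
    for doc in docs:
        for pair in combinations(sorted(set(doc)), 2):
            counts[pair] += 1
    return {pair: c for pair, c in counts.items() if c >= min_support}
```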

33
Keyword-based association analysis -- SNOWBALL
  • Goal: discover relationships in text documents
  • E.g., companies and their headquarters
  • Microsoft, headquartered at Redmond, WA
  • Boeing, a Seattle-based company
  • the Santa Clara, CA company Intel
  • Can we automatically retrieve such information
    from news articles?

34
SNOWBALL
  • Assumptions
  • Articles tend to follow the same context
  • A base set of (seed) knowledge is available
  • General idea
  • Find documents containing the seed knowledge
  • Extract the patterns where the knowledge occurs
  • Find similar patterns in other documents
  • Extract new knowledge from there

35
SNOWBALL
  • Example of seed knowledge
  • <organization, location> pairs
  • Seed tuples

36
SNOWBALL: Extracting patterns
  • Find the locations where each pair is present
  • Extract the patterns
  • <Organization>'s headquarters in <Location>
  • <Location>-based <Organization>
  • <Organization>, <Location>
  • Need a good tagger to tag the information first

37
SNOWBALL: Extracting patterns
  • Represent the pattern as a 5-tuple
  • <Left, tag1, Middle, tag2, Right>
  • Left, right, and middle are contexts
  • The words/symbols attached around the tags
  • Example: The Seattle-based Boeing Company
  • Left: <The>
  • Middle: <->, <based>
  • Right: <Company>
  • Each context has a weight associated with it
  • A function of the frequency of the word in that
    context
  • Scaled: middle contexts have higher weight
  • Normalized
  • E.g., <The, 0.2>

38
SNOWBALL: Extracting patterns
  • Similar patterns are combined
  • A clustering algorithm is run
  • Similarity function: inner product of the weights
  • Each cluster is represented by its centroid
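A sketch of the similarity used for this clustering, representing each context as a word-to-weight dict (the dict layout is an assumption for illustration):

```python
def context_similarity(p1, p2):
    """Inner product of weighted context terms, summed over the left,
    middle, and right contexts of two patterns. Each context is a dict
    mapping a word/symbol to its weight."""
    def dot(a, b):
        return sum(w * b.get(term, 0.0) for term, w in a.items())
    return sum(dot(p1[part], p2[part]) for part in ("left", "middle", "right"))
```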

39
SNOWBALL: finding new tuples
  • Read new documents
  • Locate the <organization> and <location> tags
  • Locate the patterns
  • See if they match the centroids
  • If so, add them as candidates
  • Good candidates become new seed tuples

40
SNOWBALL: finding new tuples
  • However, how trustworthy are they?
  • Two directions
  • Patterns
  • If a pattern matches too many new tuples, then it
    is suspicious
  • Example: Microsoft, Redmond, announced
  • Left: <>, Middle: <,>, Right: <,>
  • Matches everything
  • Not very useful

41
SNOWBALL: finding new tuples
  • However, how trustworthy are they?
  • Two directions
  • Tuples
  • If a tuple appears in many locations, then it is
    more likely to be true
  • Measure the confidence of tuples and patterns to
    determine which can be used

42
Automatic document classification
  • Motivation
  • Automatic classification for the tremendous
    number of on-line text documents (Web pages,
    e-mails, etc.)
  • A classification problem
  • Text document classification differs from the
    classification of relational data
  • Document databases are not structured according
    to attribute-value pairs

43
Association-based document classification
  • Extract keywords and terms by information
    retrieval and simple association analysis
    techniques
  • Obtain concept hierarchies of keywords and terms
    using
  • Available term classes, such as WordNet
  • Expert knowledge
  • Some keyword classification systems
  • Classify documents in the training set into class
    hierarchies
  • Apply term association mining method to discover
    sets of associated terms

44
Association-based document classification
  • Use the terms to maximally distinguish one class
    of documents from others
  • Derive a set of association rules associated with
    each document class
  • Order the classification rules based on their
    occurrence frequency and discriminative power
  • Use the rules to classify new documents

45
Document clustering
  • Automatically group related documents based on
    their contents
  • Requires no training sets or predetermined
    taxonomies; generates a taxonomy at runtime
  • One approach: define standard similarity
    measures and use hierarchical clustering
  • One potential drawback: a document may have
    multiple themes

46
Document clustering
  • Potential solutions
  • Fuzzy clustering: allow documents in multiple
    clusters
  • Drawback: can be inefficient
  • Form multiple base clusters
  • Document can reside in multiple clusters
  • Combine the base clusters in a hierarchical
    fashion

47
Example: Suffix Tree Clustering
  • Basic idea
  • Form base clusters based on common phrases
  • Each phrase is assigned a score; only clusters
    with a high enough score survive
  • Use a suffix tree to help recognize the clusters
  • Each cluster is represented by the documents that
    are in it
  • Inter-cluster distance is measured by the number
    of documents the clusters have in common
  • Single-link hierarchical clustering is used to
    join the base clusters

48
Example: Suffix Tree Clustering
  • Example: 3 documents
  • Cat ate cheese
  • Mouse ate cheese too
  • Cat ate mouse too
  • Recognize all the suffixes
  • Cat ate cheese:
  • cheese, ate cheese, cat ate cheese
  • Mouse ate cheese too:
  • too, cheese too, ate cheese too, mouse ate
    cheese too
  • Cat ate mouse too:
  • too, mouse too, ate mouse too, cat ate
    mouse too
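Enumerating the word-level suffixes, as in the example above, is straightforward (a minimal sketch):

```python
def word_suffixes(sentence):
    """All word-level suffixes of a sentence, longest first."""
    words = sentence.lower().split()
    return [" ".join(words[i:]) for i in range(len(words))]
```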

49
[Suffix tree for documents 1 "cat ate cheese", 2 "mouse ate cheese too", 3 "cat ate mouse too"]
50
Suffix Tree Clustering
  • Find base clusters
  • Documents that share the same prefix of a suffix
    are treated as a base cluster
  • These are the internal nodes of the suffix tree
  • Base clusters are evaluated for their goodness
  • Score = N × F(suffix)
  • N: number of documents in that cluster
  • F(suffix): a function of the suffix, based on the
    frequency of the words, the length of the
    phrase, etc.
  • Stop words are removed from the phrase
  • Take only base clusters above a certain score

51
Suffix Tree Clustering
  • Combine base clusters
  • Combine clusters when they share at least half of
    their elements
  • Resulting clusters are represented by the
    combined keywords
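The merge rule can be sketched as an overlap test on the base clusters' document sets (the 0.5 threshold follows the half-sharing criterion above; names are illustrative):

```python
def should_merge(a, b, threshold=0.5):
    """Link two base clusters when their overlap exceeds the threshold
    fraction of each cluster's documents; single-link merging then
    joins all connected base clusters."""
    a, b = set(a), set(b)
    overlap = len(a & b)
    return overlap / len(a) > threshold and overlap / len(b) > threshold
```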