Title: COMP 7118 Fall 2004
1. COMP 7118 Fall 2004
2. Text Databases
- Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases
- Properties
  - Unstructured in general (semi-structured with help, e.g., XML)
  - Semantics, not only syntax, is important
  - Non-numeric in nature
3. Text Databases and Information Retrieval
- Information retrieval
  - Traditional study of how to retrieve information from text documents
  - Information is organized into (a large number of) documents
  - The information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
4. Text Databases and Information Retrieval
- Typical IR systems
  - Online library catalogs
  - Online document management systems
- Information retrieval vs. database systems
  - Some DB problems are not present in IR, e.g., updates, transaction management, complex objects
  - Some IR problems are not addressed well in DBMS, e.g., unstructured documents, approximate search using keywords and relevance
5. Basic Measures for Text Retrieval
- Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., correct responses)
- Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
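As a concrete illustration (not from the slides), a minimal Python sketch computing both measures from sets of document IDs; the sample sets are made up:

    # Precision = |relevant ∩ retrieved| / |retrieved|
    # Recall    = |relevant ∩ retrieved| / |relevant|
    def precision_recall(retrieved, relevant):
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical query result: 4 documents retrieved, 5 truly relevant
    print(precision_recall({1, 2, 3, 4}, {2, 3, 4, 5, 6}))  # (0.75, 0.6)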
6. Basic Retrieval Problems
- Keyword-based
  - Find documents that contain certain keywords
  - Expression of keywords
    - Boolean: car and repair shop; tea or coffee; DBMS but not Oracle
    - Regular expression: car ? ? repair
7. Basic Retrieval Problems
- Similarity-based
  - Find similar documents based on a set of common keywords
  - Answers should be ranked by degree of relevance, based on the nearness of the keywords, the relative frequency of the keywords, etc.
8. Challenges in Text Retrieval
- Semantics
  - Synonymy: a keyword T does not appear anywhere in the document, even though the document is closely related to T (e.g., "data mining")
  - Polysemy: the same keyword may mean different things in different contexts (e.g., "mining")
9. Challenges in Text Retrieval
- Data cleaning
  - Stop list: a set of words that are deemed irrelevant, even though they may appear frequently
    - E.g., a, the, of, for, with, etc.
    - Stop lists may vary when the document set varies
  - Word stems: several words may be small syntactic variants of each other because they share a common word stem
    - E.g., drug, drugs, drugged
  - Tagging: sometimes it is better to view a group of words as a single unit (like a noun phrase)
    - E.g., "data mining"
10. Challenges in Text Retrieval
- Data representation
  - How to represent the data for each data mining task
    - Clustering: how to define similarity?
    - Classification: what attributes to use?
11. Challenges in Text Retrieval
- Example: classifying text documents
  - News items
  - Each item belongs to a pre-defined topic
  - Build a classifier to classify them
- What should the attributes be?
  - Individual words?
    - Even minus stop words, the number of words may be large
    - A training set of 100 documents typically contains 1000 distinct words
    - Each word appears in very few documents
12. Challenges in Text Retrieval
- Assume we use a Naïve Bayes classifier
  - Each word appears in very few documents
  - Moreover, many words do not appear at all in some classes
  - Each test document contains only a few of the words
  - => Probabilities are very small, and computational error can be introduced
- The same concept can also be represented by different words/phrases
- Challenges
  - How well can we do without semantics?
  - How can we incorporate semantic information?
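One standard remedy for the underflow problem above (not spelled out on the slide) is to sum log-probabilities instead of multiplying raw ones; a minimal sketch with made-up counts, Laplace smoothing, and class priors ignored:

    import math

    # Hypothetical per-class word counts from a tiny training set
    class_word_counts = {
        "sports": {"game": 30, "team": 25, "score": 10},
        "tech":   {"database": 20, "oracle": 15, "system": 12},
    }
    vocab_size = 6  # distinct words across all classes

    def log_score(doc_words, cls):
        counts = class_word_counts[cls]
        total = sum(counts.values())
        score = 0.0
        for w in doc_words:
            # Laplace smoothing keeps unseen words from zeroing the product
            p = (counts.get(w, 0) + 1) / (total + vocab_size)
            score += math.log(p)  # summing logs avoids numerical underflow
        return score

    doc = ["oracle", "database", "game"]
    print(max(class_word_counts, key=lambda c: log_score(doc, c)))  # tech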
13. Data Representation
- Document vector / frequency matrix
  - Each document is represented by a vector
  - Each dimension of the vector is associated with a word/term
  - For each document, the value of each dimension is the frequency of that word in the document
14. Data Representation
- Example
  - Document 1: "This is a database system textbook"
  - Document 2: "Oracle database sells for 1000 this year"
  - Vector dimensions: (database, system, textbook, oracle, sells, year)
  - D1 = (1, 1, 1, 0, 0, 0)
  - D2 = (1, 0, 0, 1, 1, 1)
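A minimal sketch of building these vectors (the stop-word list is a made-up stand-in for whatever list the slides assume):

    STOP_WORDS = {"this", "is", "a", "for", "my", "1000"}

    def doc_vector(text, dimensions):
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        return [words.count(dim) for dim in dimensions]

    dims = ["database", "system", "textbook", "oracle", "sells", "year"]
    print(doc_vector("This is a database system textbook", dims))
    # [1, 1, 1, 0, 0, 0]
    print(doc_vector("Oracle database sells for 1000 this year", dims))
    # [1, 0, 0, 1, 1, 1]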
15. Data Representation
- Variations
  - Binary values: only the presence/absence of a word is recorded
  - Normalized values: normalize each dimension to the (0, 1) range
  - Weighted frequency
16. Data Representation (tf-idf scheme)
- Weighted frequency (tf-idf scheme)
  - Useful for static data sets
  - Term frequency (tf_ij): how often term j appears in document i (normalized)
  - Document frequency (df_j): how many documents contain term j
  - For document i and term j, the weight is
      w_ij = tf_ij * log(d / df_j)
    where d is the number of documents in the database
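A minimal sketch of this weighting, assuming raw counts for tf_ij (the slides' exact normalization appears on slide 20, whose tables did not survive):

    import math

    def tfidf(docs):
        """docs: list of token lists. Returns one {term: weight} dict per doc."""
        d = len(docs)
        df = {}
        for doc in docs:
            for term in set(doc):
                df[term] = df.get(term, 0) + 1
        weights = []
        for doc in docs:
            # Weight: w_ij = tf_ij * log(d / df_j)
            weights.append({term: doc.count(term) * math.log(d / df[term])
                            for term in set(doc)})
        return weights

    docs = [["database", "system", "textbook"],
            ["oracle", "database", "sells", "year"],
            ["oracle", "database", "textbook", "database", "class"]]
    for w in tfidf(docs):
        print(w)  # "database" gets weight 0: it appears in every document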
17. Data Representation Example (tf-idf scheme)
- D1: "This is a database system textbook"
- D2: "Oracle database sells for 1000 this year"
- D3: "My oracle database textbook for my database class"
- Raw frequencies over (database, system, textbook, oracle, sells, year):
  - D1 = (1, 1, 1, 0, 0, 0)
  - D2 = (1, 0, 0, 1, 1, 1)
  - D3 = (2, 0, 1, 1, 0, 0)
18. Data Representation Example (tf-idf scheme)
- D1: "This is a database system textbook"
- D2: "Oracle database sells for 1000 this year"
- D3: "My oracle database textbook for my database class"
- Normalized frequencies (table not preserved in this transcription)
19. Data Representation Example (tf-idf scheme)
- D1: "This is a database system textbook"
- D2: "Oracle database sells for 1000 this year"
- D3: "My oracle database textbook for my database class"
- Document frequencies: database 3, system 1, textbook 2, oracle 2, sells 1, year 1
20. (Tables of the normalized term frequencies tf_ij and document frequencies df_j; not preserved in this transcription)
21. Data Representation (tf-idf scheme)
- Properties
  - Words that appear in every document have a weight of 0
  - Words that appear in very few documents have higher weights
  - Good for clustering? Classification?
22. Data Representation
- Term-frequency matrix/table
  - Combining the vectors for all the documents forms a matrix
- Query vector
  - Queries usually consist of a set of keywords and thus can be represented as vectors
  - A query can also be a separate document, with its vector calculated as before
23. Data Representation: Similarity Measures
- For various tasks, we need a measure of similarity between documents
- Cosine similarity
  - Focuses on the co-occurrence of words
  - Corresponds to the angle between the two vectors:
      sim(d1, d2) = (d1 · d2) / (|d1| |d2|)
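A minimal sketch, reusing the document vectors from slide 14:

    import math

    def cosine_similarity(v1, v2):
        dot = sum(a * b for a, b in zip(v1, v2))
        norm1 = math.sqrt(sum(a * a for a in v1))
        norm2 = math.sqrt(sum(b * b for b in v2))
        return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

    d1 = [1, 1, 1, 0, 0, 0]   # slide 14's D1
    d2 = [1, 0, 0, 1, 1, 1]   # slide 14's D2
    print(cosine_similarity(d1, d2))  # 0.289: one shared word ("database")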
24. Data Representation: Latent Semantic Indexing
- Weakness of keyword-based techniques
  - Lack of semantics
  - Cannot identify similar words/concepts without help
- Observation
  - Words/phrases that represent similar concepts are usually grouped together
  - The most important unit of information for documents may not be the word, but the concept
25. Data Representation: Latent Semantic Indexing
- Latent Semantic Indexing is an attempt to produce such information
  - Find a (relatively) small number of concepts to serve as the vector dimensions
  - Try to approximate the original information using these dimensions
26. Data Representation: Latent Semantic Indexing
- Start with the term-frequency matrix M
  - M is of size (m x n)
    - m = number of terms
    - n = number of documents
- Find the singular values of M
  - The square roots of the eigenvalues of M M^T
  - Assume we have r singular values (r = rank(M))
27. Data Representation: Latent Semantic Indexing
- Then M = U S V^T
  - S: (r x r) diagonal matrix of the singular values
  - U: (m x r) matrix of term vectors
  - V: (n x r) matrix of document vectors
- Interpretation
  - There are r distinct "concepts" within this set of documents
  - Each row in U is a term vector that corresponds to a term
    - The value in each dimension indicates how strongly that term relates to that concept
  - Similarly with V
28. Data Representation: Latent Semantic Indexing
- Notice that S is diagonal
  - We can reorganize the singular values in descending order
  - Many values tend to be small
  - Replace those small values with 0 to form a new matrix S'
  - Then M' = U S' V^T is a good approximation of the original matrix M
- Notice that each document is now represented by a shorter vector: data reduction
- Each dimension now corresponds to a "concept", based on similar co-occurrence of words
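A minimal sketch of the truncation step using NumPy (the toy matrix and the rank-2 cutoff are made up):

    import numpy as np

    # Toy term-document matrix: m = 4 terms x n = 3 documents
    M = np.array([[1.0, 1.0, 2.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 1.0],
                  [1.0, 0.0, 1.0]])

    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    k = 2                              # keep the k largest singular values
    M_approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # Each document (column) is now described by a k-dimensional concept vector
    doc_vectors = np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(M_approx, 2))
    print(np.round(doc_vectors.T, 2))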
29. Data Representation: Latent Semantic Indexing -- Example
- (From Berry, Dumais, and O'Brien, 1994; example not preserved in this transcription)
30. Types of Text Data Mining
- Keyword-based association analysis
- Automatic document classification
- Similarity detection
  - Cluster documents by a common author
  - Cluster documents containing information from a common source
31. Types of Text Data Mining
- Link analysis: unusual correlations between entities
- Sequence analysis: predicting recurring events
- Anomaly detection: finding information that violates usual patterns
- Hypertext analysis
  - Patterns in anchors/links
  - Anchor text correlations with linked objects
32. Keyword-Based Association Analysis
- Collect sets of keywords or terms that occur frequently together, and then find the association or correlation relationships among them
- First preprocess the text data by parsing, stemming, removing stop words, etc.
- Then invoke association mining algorithms
  - Consider each document as a transaction
  - View the set of keywords in the document as the set of items in the transaction (see the sketch below)
- Term-level association mining
  - No need for human effort in tagging documents
  - The number of meaningless results and the execution time are greatly reduced
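A minimal sketch of the document-to-transaction mapping with a naive frequent-pair count (the stop-word list, threshold, and sample text are made up; a real system would use Apriori or FP-growth):

    from itertools import combinations

    STOP_WORDS = {"this", "is", "a", "for", "my"}

    def to_transaction(text):
        # One document -> one transaction: its set of (cleaned) keywords
        return {w for w in text.lower().split() if w not in STOP_WORDS}

    docs = ["This is a database system textbook",
            "Oracle database sells this year",
            "My oracle database textbook"]
    transactions = [to_transaction(d) for d in docs]

    min_support = 2
    pair_counts = {}
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            pair_counts[pair] = pair_counts.get(pair, 0) + 1
    print({p: c for p, c in pair_counts.items() if c >= min_support})
    # {('database', 'textbook'): 2, ('database', 'oracle'): 2}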
33. Keyword-Based Association Analysis -- SNOWBALL
- Goal: discover relationships in text documents
  - E.g., companies and their headquarters:
    - "Microsoft, headquartered in Redmond, WA"
    - "Boeing, a Seattle-based company"
    - "the Santa Clara, CA company Intel"
  - Can we automatically retrieve such information from news articles?
34. SNOWBALL
- Assumptions
  - Articles tend to follow the same context
  - A base set of (seed) knowledge is available
- General idea
  - Find documents containing the seed knowledge
  - Extract the patterns in which the knowledge occurs
  - Find similar patterns in other documents
  - Extract new knowledge from those documents
35. SNOWBALL
- Example of seed knowledge
  - <organization, location> pairs
  - Seed tuples (table not preserved in this transcription)
36. SNOWBALL: Extracting Patterns
- Find the locations where each pair is present
- Extract the patterns
  - <Organization>'s headquarters in <Location>
  - <Location>-based <Organization>
  - <Organization>, <Location>
- Need a good tagger to tag the information first
37. SNOWBALL: Extracting Patterns
- Represent each pattern as a 5-tuple
  - (Left, <tag1>, Middle, <tag2>, Right)
  - Left, middle, and right are contexts
    - The words/symbols attached around the tags
  - Example: "The Seattle-based Boeing Company"
    - Left: <The>
    - Middle: <->, <based>
    - Right: <Company>
- Each context term has a weight associated with it
  - A function of the frequency of the word in that context
  - Scaled: middle contexts have higher weight
  - Normalized
  - E.g., <The, 0.2>
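A minimal sketch of this representation, together with the inner-product similarity used on the next slide (all weights and helper names are hypothetical, not SNOWBALL's actual code):

    # A pattern is (left, tag1, middle, tag2, right); each context is a
    # {term: weight} dict
    pattern_a = ({"the": 0.2}, "LOCATION", {"-": 0.5, "based": 0.5},
                 "ORGANIZATION", {"company": 0.4})
    pattern_b = ({"the": 0.1}, "LOCATION", {"-": 0.6, "based": 0.4},
                 "ORGANIZATION", {"company": 0.5})

    def context_dot(c1, c2):
        return sum(w * c2.get(t, 0.0) for t, w in c1.items())

    def pattern_similarity(p1, p2):
        # Inner product over the three contexts; zero if the tags differ
        if p1[1] != p2[1] or p1[3] != p2[3]:
            return 0.0
        return (context_dot(p1[0], p2[0]) + context_dot(p1[2], p2[2])
                + context_dot(p1[4], p2[4]))

    print(pattern_similarity(pattern_a, pattern_b))  # 0.72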
38. SNOWBALL: Extracting Patterns
- Similar patterns are combined
  - A clustering algorithm is run
  - Similarity function: inner product of the weights
  - Each cluster is represented by its centroid
39. SNOWBALL: Finding New Tuples
- Read new documents
- Locate <organization> and <location> tags
- Locate the patterns
  - See whether they match the centroids
  - If so, add them as candidates
- Good candidates will become new seed tuples
40. SNOWBALL: Finding New Tuples
- However, how trustworthy are they?
- Two directions
  - Patterns
    - If a pattern matches too many new tuples, then it is suspicious
    - Example: "Microsoft, Redmond, announced"
      - Left: <>, Middle: <,>, Right: <,>
      - Matches nearly everything
      - Not very useful
41. SNOWBALL: Finding New Tuples
- However, how trustworthy are they?
- Two directions
  - Tuples
    - If a tuple appears in many locations, then it is more likely to be true
- Measure the confidence of each tuple and pattern to determine which can be used
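One plausible confidence measure (a hypothetical instantiation, not necessarily the slides' exact formula) scores a pattern by the fraction of its matches that agree with already-known tuples:

    def pattern_confidence(positive, negative):
        # positive: matches agreeing with known tuples; negative: contradictions
        return positive / (positive + negative) if positive + negative else 0.0

    # Hypothetical counts: 8 matches agreed with known tuples, 2 contradicted
    print(pattern_confidence(8, 2))  # 0.8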
42. Automatic Document Classification
- Motivation
  - Automatic classification of the tremendous number of on-line text documents (Web pages, e-mails, etc.)
- A classification problem
  - Text document classification differs from the classification of relational data
    - Document databases are not structured according to attribute-value pairs
43. Association-Based Document Classification
- Extract keywords and terms by information retrieval and simple association analysis techniques
- Obtain concept hierarchies of keywords and terms using
  - Available term classes, such as WordNet
  - Expert knowledge
  - Some keyword classification systems
- Classify documents in the training set into class hierarchies
- Apply a term association mining method to discover sets of associated terms
44. Association-Based Document Classification
- Use the terms to maximally distinguish one class of documents from the others
- Derive a set of association rules for each document class
- Order the classification rules based on their occurrence frequency and discriminative power
- Use the rules to classify new documents
45. Document Clustering
- Automatically group related documents based on their contents
- Requires no training sets or predetermined taxonomies; generates a taxonomy at runtime
- One approach: define standard similarity measures and use hierarchical clustering
- One potential drawback: a document may have multiple themes
46. Document Clustering
- Potential solutions
  - Fuzzy clustering: allow documents to belong to multiple clusters
    - Drawback: can be inefficient
  - Form multiple base clusters
    - A document can reside in multiple clusters
    - Combine the base clusters in a hierarchical fashion
47. Example: Suffix Tree Clustering
- Basic idea
  - Form base clusters based on common phrases
  - Each phrase is assigned a score; only clusters with a high enough score survive
  - Use a suffix tree to help recognize the clusters
  - Each cluster is represented by the documents in it
  - Inter-cluster distance is measured by the number of documents two clusters have in common
  - Single-link hierarchical clustering is used to join the base clusters
48. Example: Suffix Tree Clustering
- Example: 3 documents
  - "Cat ate cheese"
  - "Mouse ate cheese too"
  - "Cat ate mouse too"
- Recognize all the suffixes
  - "Cat ate cheese": cheese; ate cheese; cat ate cheese
  - "Mouse ate cheese too": too; cheese too; ate cheese too; mouse ate cheese too
  - "Cat ate mouse too": too; mouse too; ate mouse too; cat ate mouse too
49. (Suffix tree over documents 1 "Cat ate cheese", 2 "Mouse ate cheese too", 3 "Cat ate mouse too"; diagram not preserved in this transcription)
50. Suffix Tree Clustering
- Find base clusters
  - Documents that share the same prefix of a suffix are treated as a base cluster
    - These correspond to internal nodes of the suffix tree
  - Base clusters are evaluated for their goodness
    - Score = N * F(suffix)
      - N: number of documents in that cluster
      - F(suffix): a function of the suffix, based on the frequency of the words, the length of the phrase, etc.
      - Stop words are removed from the phrase
  - Take only base clusters with a high enough score
51. Suffix Tree Clustering
- Combine base clusters
  - Combine two clusters when each shares half of its elements with the other
  - Resulting clusters are represented by the combined keywords
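A minimal end-to-end sketch of base-cluster formation and merging; brute-force substring enumeration stands in for the suffix tree, and the scoring function is a simplified stand-in for Score = N * F(suffix):

    def base_clusters(docs, min_score=2):
        # Map each phrase to the set of documents containing it
        phrase_docs = {}
        for i, doc in enumerate(docs):
            words = doc.lower().split()
            for start in range(len(words)):
                for end in range(start + 1, len(words) + 1):
                    phrase = " ".join(words[start:end])
                    phrase_docs.setdefault(phrase, set()).add(i)
        # Simplified score: N * phrase length, kept only if shared by >1 doc
        return {p: d for p, d in phrase_docs.items()
                if len(d) > 1 and len(d) * len(p.split()) >= min_score}

    def should_merge(docs_a, docs_b):
        # Merge when each base cluster shares at least half of its documents
        overlap = len(docs_a & docs_b)
        return overlap / len(docs_a) > 0.5 and overlap / len(docs_b) > 0.5

    docs = ["Cat ate cheese", "Mouse ate cheese too", "Cat ate mouse too"]
    clusters = base_clusters(docs)
    print(clusters)   # e.g. 'ate cheese' -> {0, 1}, 'cat ate' -> {0, 2}, ...
    print(should_merge(clusters["ate cheese"], clusters["cat ate"]))  # False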