Title: Text
1Text and Web Mining
2Structured Data
- So far we have focused on mining from structured data
Attribute = Value, Attribute = Value, Attribute = Value, ..., Attribute = Value
Outlook = Sunny, Temperature = Hot, Windy = Yes, Humidity = High, Play = Yes
- Most data mining involves such data
3Complex Data Types
- Increased importance of complex data
- Spatial data: includes geographic data and medical and satellite images
- Multimedia data: images, audio, video
- Time-series data: for example, banking data and stock exchange data
- Text data: word descriptions for objects
- World Wide Web: highly unstructured text and multimedia data
4Text Databases
- Many text databases exist in practice
- News articles
- Research papers
- Books
- Digital libraries
- E-mail messages
- Web pages
- Growing rapidly in size and importance
5Semi-Structured Data
- Text databases are often semi-structured
- Example
- Title
- Author
- Publication_Date
- Length
- Category
- Abstract
- Content
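As a rough illustration, a record like the one above could be represented as follows. The field names come from the slide; the class name `Document`, the field types, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Document:
    # Structured fields: fixed type, queryable like ordinary attributes.
    title: str
    author: str
    publication_date: date
    length: int          # unit (e.g., words) is an assumption
    category: str
    # Unstructured fields: free text that needs IR or text mining.
    abstract: str
    content: str


# Placeholder values, for illustration only.
doc = Document(
    title="Sample title",
    author="A. Author",
    publication_date=date(2000, 1, 1),
    length=3,
    category="example",
    abstract="Short free-text summary.",
    content="Free text body.",
)

# The structured part can be handled like any attribute-value record.
structured = {k: v for k, v in vars(doc).items() if k not in ("abstract", "content")}
```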
6Handling Text Data
- Modeling semi-structured data
- Information Retrieval (IR) from unstructured documents
- Text mining
  - Compare documents
  - Rank importance and relevance
  - Find patterns or trends across documents
7Information Retrieval
- IR locates relevant documents
  - Keywords
  - Similar documents
- IR systems
  - On-line library catalogs
  - On-line document management systems
8Performance Measure
[Figure: Venn diagram within the set of all documents, showing the retrieved documents, the relevant documents, and their overlap (relevant and retrieved)]
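These sets are conventionally summarized by two ratios: precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved). A minimal sketch of those standard definitions follows; the function name and the document ids are hypothetical.

```python
def precision_recall(relevant, retrieved):
    """Return (precision, recall) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = relevant & retrieved            # relevant AND retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document ids, for illustration only.
relevant = {1, 2, 3, 4}
retrieved = {3, 4, 5}
p, r = precision_recall(relevant, retrieved)
print(p, r)
```

Here two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved.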
9Retrieval Methods
- Keyword-based IR
  - E.g., "data" and "mining"
  - Synonymy problem: a document may talk about "knowledge discovery" instead
  - Polysemy problem: "mining" can mean different things
- Similarity-based IR
  - Set of common keywords
  - Return the degree of relevance
  - Problem: what is the similarity of "data mining" and "data analysis"?
10Modeling a Document
- Set of n documents and m terms
- Each document is a vector v in R^m
- The j-th coordinate of v measures the association of the j-th term with the document
- Here r is the number of occurrences of the j-th term and R is the number of occurrences of any term.
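A minimal sketch of this model, assuming the j-th coordinate is the relative frequency r/R. That is one common weighting consistent with the definitions of r and R, and other weightings exist. The helper name `document_vector`, the whitespace tokenizer, and the sample text are illustrative.

```python
from collections import Counter


def document_vector(text, terms):
    """Map a document to a vector in R^m, one coordinate per term."""
    tokens = text.lower().split()      # naive whitespace tokenizer
    counts = Counter(tokens)
    # R: occurrences of any term, counted here over all tokens (an assumption).
    total = len(tokens)
    # r / R for each of the m terms; 0.0 for an empty document.
    return [counts[t] / total if total else 0.0 for t in terms]


terms = ["association", "mining"]      # m = 2
v = document_vector("mining association rules for data mining", terms)
print(v)
```

Stacking one such vector per document gives the n-by-m frequency matrix.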
11Frequency Matrix
12Similarity Measures
Dot product
Norm of the vectors
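A dot product divided by the norms of the two vectors is the cosine measure, which is presumably the similarity intended here. A minimal sketch under that assumption, with hypothetical vectors:

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))        # dot product
    norm1 = math.sqrt(sum(a * a for a in v1))       # norm of v1
    norm2 = math.sqrt(sum(b * b for b in v2))       # norm of v2
    if norm1 == 0 or norm2 == 0:
        return 0.0                                  # convention for empty documents
    return dot / (norm1 * norm2)


# Hypothetical vectors, for illustration only.
sim_parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])     # same direction
sim_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])   # no shared terms
```

Documents with the same mix of terms score close to 1 regardless of length, and documents with no terms in common score 0.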
13Example
- Google search for "association mining"
- Two of the documents retrieved
  - Idaho Mining Association: mining in Idaho (doc 1)
  - Scalable Algorithms for Association Mining (doc 2)
- Using only the two terms
14New Model
- Add the term "data" to the document model
15Frequency Matrix
Will quickly become large
16Association Analysis
- Collect sets of keywords frequently used together and find associations among them
- Apply any association rule algorithm to a database in the format
  - document_id, a_set_of_keywords
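As a simplified sketch of this idea, the code below counts keyword pairs that co-occur in enough documents. It is not a full association rule algorithm such as Apriori, only the frequent-pair step. The function name, the support threshold, and the sample database are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_keyword_pairs(db, min_support):
    """Return keyword pairs whose support (fraction of documents
    containing both keywords) is at least min_support."""
    pair_counts = Counter()
    for keywords in db.values():
        # Sort so each pair has one canonical (alphabetical) order.
        for pair in combinations(sorted(keywords), 2):
            pair_counts[pair] += 1
    n = len(db)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}


# Hypothetical database: document_id -> a_set_of_keywords
db = {
    "d1": {"data", "mining", "association"},
    "d2": {"data", "mining"},
    "d3": {"mining", "idaho"},
}
pairs = frequent_keyword_pairs(db, min_support=0.6)
print(pairs)
```

Only the pair ("data", "mining") appears in at least 60% of these documents.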
17Document Classification
- Need already classified documents as training set
- Induce a classification model
- Any difference from before?
A set of keywords associated with a document has
no fixed set of attributes or dimensions
18Association-Based Classification
- Classify documents based on associated, frequently occurring text patterns
- Extract keywords and terms with IR and simple association analysis
- Create a concept hierarchy of terms
- Classify training documents into class hierarchies
- Use association mining to discover associated terms to distinguish one class from another
19Remember: Generalized Association Rules
Taxonomy
- Clothes
  - Outerwear
    - Jackets
    - Ski Pants
  - Shirts
- Footwear (ancestor of Shoes and Hiking Boots)
  - Shoes
  - Hiking Boots
Generalized association rule: X ⇒ Y, where no item in Y is an ancestor of an item in X
20Classifiers
- Let X be a set of terms
- Let Anc(X) be those terms and their ancestor terms
- Consider a rule X ⇒ C and document d
- If X ⊆ Anc(d) then X ⇒ C covers d
- A rule that covers d may be used to classify d (but only one can be used)
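The covering test can be sketched as follows. The taxonomy is a hypothetical child-to-parent map modeled on the clothing example, and the function names `anc` and `covers` are chosen for illustration.

```python
# Hypothetical taxonomy: child term -> parent term
PARENT = {
    "shoes": "footwear",
    "hiking boots": "footwear",
    "jackets": "outerwear",
    "ski pants": "outerwear",
    "outerwear": "clothes",
    "shirts": "clothes",
}


def anc(terms):
    """Anc(X): the terms in X together with all of their ancestors."""
    result = set()
    for term in terms:
        # Walk up the taxonomy; stop at a root or at a term already seen.
        while term is not None and term not in result:
            result.add(term)
            term = PARENT.get(term)
    return result


def covers(rule_terms, doc_terms):
    """A rule X => C covers document d if X is a subset of Anc(d)."""
    return set(rule_terms) <= anc(doc_terms)


d = {"jackets", "shoes"}                              # terms of a document
general_rule = covers({"outerwear", "footwear"}, d)   # uses ancestor terms
unrelated_rule = covers({"shirts"}, d)
```

A rule stated in terms of general concepts can therefore cover a document that mentions only specific terms.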
21Procedure
- Step 1: Generate all generalized association rules X ⇒ C, where X is a set of terms and C is a class, that satisfy minimum support.
- Step 2: Rank the rules according to some rule ranking criterion
- Step 3: Select rules from the list
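A minimal sketch of the three steps under stated assumptions: candidate rules are supplied by hand instead of being mined, the ranking criterion is confidence with support as tie-breaker, and a document is classified by the first ranked rule that covers it. All names and numbers are hypothetical.

```python
from typing import FrozenSet, List, NamedTuple, Optional, Set


class Rule(NamedTuple):
    terms: FrozenSet[str]   # X
    label: str              # C
    support: float
    confidence: float


def rank_rules(rules: List[Rule], min_support: float) -> List[Rule]:
    """Stand-in for Step 1 (filter given candidates by minimum support)
    and Step 2 (rank by confidence, then support: an assumed criterion)."""
    kept = [r for r in rules if r.support >= min_support]
    return sorted(kept, key=lambda r: (r.confidence, r.support), reverse=True)


def classify(anc_d: Set[str], ranked: List[Rule],
             default: Optional[str] = None) -> Optional[str]:
    """Step 3: use the first (highest-ranked) rule that covers d.
    anc_d is Anc(d), the document's terms plus their ancestors."""
    for rule in ranked:
        if rule.terms <= anc_d:          # X is a subset of Anc(d)
            return rule.label
    return default


# Hand-written candidate rules; a real system would mine them from data.
rules = [
    Rule(frozenset({"footwear"}), "outdoor", support=0.30, confidence=0.70),
    Rule(frozenset({"outerwear", "footwear"}), "winter", support=0.10, confidence=0.90),
    Rule(frozenset({"shirts"}), "casual", support=0.01, confidence=0.95),
]
ranked = rank_rules(rules, min_support=0.05)
label = classify({"jackets", "outerwear", "clothes", "shoes", "footwear"}, ranked)
```

Both remaining rules cover the sample document, but only the higher-ranked one is used, which matches the "only one can be used" restriction.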
22Web Mining
- The World Wide Web may have more opportunities for data mining than any other area
- However, there are serious challenges
  - It is too huge
  - Complexity of Web pages is greater than any traditional text document collection
  - It is highly dynamic
  - It has a broad diversity of users
  - Only a tiny portion of the information is truly useful
23Search Engines → Web Mining
- Current technology: search engines
  - Keyword-based indices
  - Too many relevant pages
  - Synonymy and polysemy problems
- More challenging: web mining
  - Web content mining
  - Web structure mining
  - Web usage mining
24Web Content Mining
25Example: Classification of Web Documents
- Assign a class to each document based on predefined topic categories
- E.g., use Yahoo!'s taxonomy and associated documents for training
- Keyword-based document classification
- Keyword-based association analysis
26Web Structure Mining
27Authoritative Web Pages
- High-quality relevant Web pages are termed authoritative
- Explore linkages (hyperlinks)
- Linking to a Web page can be considered an endorsement of that page
- Those pages that are linked to frequently are considered authoritative
- (This has its roots in IR methods based on journal citations)
28Structure via Hubs
- A hub is a set of Web pages containing collections of links to authorities
- There is a wide variety of hubs
  - Simple list of recommended links on a person's home page
  - Professional resource lists on commercial sites
29HITS
- Hyperlink-Induced Topic Search (HITS)
- Form a root set of pages using the query terms in an index-based search (200 pages)
- Expand into a base set by including all pages the root set links to and all pages that link to the root set (1000-5000 pages)
- Go into an iterative process to determine hubs and authorities
30Calculating Weights
- Authority weight
- Hub weight
Page p is pointed to by page q
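In the standard formulation of HITS, which these weights presumably follow, each weight is a sum over linked pages. The symbols a_p (authority weight of page p) and h_p (hub weight of page p) are notation chosen here, and q → p means that page p is pointed to by page q.

```latex
a_p = \sum_{q \,:\, q \to p} h_q,
\qquad
h_p = \sum_{q \,:\, p \to q} a_q
```

A page is a good authority if good hubs point to it, and a good hub if it points to good authorities.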
31Adjacency Matrix
- Let's number the pages 1, 2, ..., n
- The adjacency matrix is defined by
- By writing the authority and hub weights as vectors we have
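One consistent way to write this, assuming the convention that entry (i, j) marks a link from page i to page j, and writing the authority and hub weights as the vectors a and h (notation chosen here):

```latex
A_{ij} =
\begin{cases}
1 & \text{if page } i \text{ links to page } j,\\
0 & \text{otherwise,}
\end{cases}
\qquad
\mathbf{a} = A^{\mathsf{T}} \mathbf{h},
\qquad
\mathbf{h} = A\,\mathbf{a}
```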
32Recursive Calculations
- We now have
- By linear algebra theory this converges to the principal eigenvectors of the two matrices
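In the usual formulation, with A the adjacency matrix (entry (i, j) is 1 if page i links to page j), substituting one update into the other gives a = AᵀA a and h = AAᵀ h up to normalization, so the two matrices are presumably AᵀA and AAᵀ. The iteration can be sketched in plain Python as below; the function name, the fixed iteration count, and the small graph are assumptions for illustration.

```python
import math


def hits(links, n, iterations=50):
    """links: list of (i, j) pairs meaning page i links to page j,
    with pages numbered 0..n-1. Returns (authority, hub) weight lists."""
    auth = [1.0] * n
    hub = [1.0] * n
    for _ in range(iterations):
        # Authority update: sum the hub weights of pages pointing to j.
        new_auth = [0.0] * n
        for i, j in links:
            new_auth[j] += hub[i]
        # Hub update: sum the new authority weights of pages i points to.
        new_hub = [0.0] * n
        for i, j in links:
            new_hub[i] += new_auth[j]
        # Normalize to unit length so the weights stay bounded.
        a_norm = math.sqrt(sum(x * x for x in new_auth)) or 1.0
        h_norm = math.sqrt(sum(x * x for x in new_hub)) or 1.0
        auth = [x / a_norm for x in new_auth]
        hub = [x / h_norm for x in new_hub]
    return auth, hub


# Hypothetical 4-page graph: pages 0 and 1 both link to pages 2 and 3,
# and page 0 also links to page 1.
links = [(0, 2), (0, 3), (1, 2), (1, 3), (0, 1)]
auth, hub = hits(links, n=4)
top_authority = max(range(4), key=lambda p: auth[p])
top_hub = max(range(4), key=lambda p: hub[p])
```

Sorting pages by `auth` and by `hub` gives the two short lists that the algorithm reports.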
33Output
- The HITS algorithm finally outputs
- Short list of pages with high hub weights
- Short list of pages with high authority weights
- Have not accounted for context
34Applications
- The Clever Project at IBM's Almaden Labs
  - Developed the HITS algorithm
- Google
  - Developed at Stanford
  - Uses algorithms similar to HITS (PageRank)
  - On-line version
35Web Usage Mining
36Complex Data Types Summary
- Emerging areas of mining complex data types
- Text mining can be done quite effectively, especially if the documents are semi-structured
- Web mining is more difficult due to lack of such structure
- Data includes text documents, hypertext documents, link structure, and logs
- Need to rely on unsupervised learning, sometimes followed up with supervised learning such as classification