Text - PowerPoint PPT Presentation

About This Presentation
Title:

Text

Description:

... geographic data and medical & satellite images ... Multimedia data: images, audio, & video. Time-series data: for example banking data and stock exchange data ... – PowerPoint PPT presentation

Slides: 37
Provided by: sigurduro5
Category:
Tags: text


Transcript and Presenter's Notes


1
Text & Web Mining
2
Structured Data
  • So far we have focused on mining from structured
    data

Attribute = Value, Attribute = Value, ..., Attribute = Value
Outlook = Sunny, Temperature = Hot, Windy = Yes, Humidity = High, Play = Yes
Most data mining involves such data
3
Complex Data Types
  • Increased importance of complex data
  • Spatial data: includes geographic data and
    medical & satellite images
  • Multimedia data: images, audio, & video
  • Time-series data: for example banking data and
    stock exchange data
  • Text data: word descriptions for objects
  • World-Wide-Web: highly unstructured text and
    multimedia data

4
Text Databases
  • Many text databases exist in practice
  • News articles
  • Research papers
  • Books
  • Digital libraries
  • E-mail messages
  • Web pages
  • Growing rapidly in size and importance

5
Semi-Structured Data
  • Text databases are often semi-structured
  • Example
  • Title
  • Author
  • Publication_Date
  • Length
  • Category
  • Abstract
  • Content

6
Handling Text Data
  • Modeling semi-structured data
  • Information Retrieval (IR) from unstructured
    documents
  • Text mining
  • Compare documents
  • Rank importance & relevance
  • Find patterns or trends across documents

7
Information Retrieval
  • IR locates relevant documents
  • Key words
  • Similar documents
  • IR Systems
  • On-line library catalogs
  • On-line document management systems

8
Performance Measure
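  • Two basic measures
  • Precision: |Relevant ∩ Retrieved| / |Retrieved|
  • Recall: |Relevant ∩ Retrieved| / |Relevant|

[Figure: Venn diagram within the set of all documents, showing the retrieved documents, the relevant documents, and their overlap (relevant & retrieved)]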
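
The two basic IR measures are conventionally precision and recall, both computed from the overlap between the retrieved and the relevant documents. A minimal sketch, assuming documents are identified by hypothetical ids such as "d1":

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 5 relevant, 3 in common.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d3", "d4", "d7", "d9"}
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.75 0.6
```

Precision drops when the system returns many irrelevant documents; recall drops when it misses relevant ones.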
9
Retrieval Methods
  • Keyword-based IR
  • E.g., "data" and "mining"
  • Synonymy problem: a document may talk about
    "knowledge discovery" instead
  • Polysemy problem: "mining" can mean different
    things
  • Similarity-based IR
  • Set of common keywords
  • Return the degree of relevance
  • Problem: what is the similarity of "data mining"
    and "data analysis"?

10
Modeling a Document
  • Set of n documents and m terms
  • Each document is a vector v in R^m
  • The j-th coordinate of v measures the association
    of the j-th term
  • Here r is the number of occurrences of the j-th
    term and R is the number of occurrences of any
    term.
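
The formula for the j-th coordinate did not survive in this transcript. The sketch below assumes it is the relative term frequency r / R, which is what the definitions of r and R suggest; the vocabulary and the sample sentence are hypothetical.

```python
from collections import Counter


def document_vector(text, terms):
    """Relative term frequency r / R for each term (assumed weighting)."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)  # R: number of occurrences of any term
    # counts[t] is r: number of occurrences of term t in this document
    return [counts[t] / total if total else 0.0 for t in terms]


terms = ["data", "mining", "web"]  # hypothetical vocabulary, m = 3
doc = "data mining finds patterns in data"
vector = document_vector(doc, terms)
print(vector)
```

Here "data" occurs twice among six words, so its coordinate is 2/6, and "web" does not occur, so its coordinate is 0.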

11
Frequency Matrix
12
Similarity Measures
  • Dot product: v1 · v2 = Σ_j v1_j v2_j
  • Cosine measure: sim(v1, v2) = (v1 · v2) / (|v1| |v2|)

Here |v| = sqrt(v · v) is the norm of the vector v
13
Example
  • Google search for "association mining"
  • Two of the documents retrieved
  • Idaho Mining Association: mining in Idaho (doc 1)
  • Scalable Algorithms for Association Mining (doc
    2)
  • Using only the two terms "association" and
    "mining"
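
The term counts used on the original slide are not preserved here. As a rough illustration, the sketch below counts the two terms in the document titles only (an assumption, since the slide presumably used the full documents) and compares the resulting vectors with the cosine measure.

```python
import math
import re
from collections import Counter


def term_counts(text, terms):
    """Count occurrences of each term, ignoring case and punctuation."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    return [counts[t] for t in terms]


def cosine(u, v):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


terms = ["association", "mining"]
doc1 = "Idaho Mining Association: mining in Idaho"
doc2 = "Scalable Algorithms for Association Mining"

v1 = term_counts(doc1, terms)  # [1, 2]
v2 = term_counts(doc2, terms)  # [1, 1]
print(round(cosine(v1, v2), 3))  # 0.949
```

With only these two terms the documents look nearly identical, even though one is about the mining industry and the other about data mining.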

14
New Model
  • Add the term "data" to the document model

15
Frequency Matrix
Will quickly become large
16
Association Analysis
  • Collect set of keywords frequently used together
    and find association among them
  • Apply any association rule algorithm to a
    database in the format
  • document_id, a_set_of_keywords
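
As a small illustration of the (document_id, a_set_of_keywords) format, the sketch below counts keyword pairs by brute force and keeps those that meet a minimum support. It is not a full association rule algorithm such as Apriori; the documents, keywords, and threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Hypothetical database in the (document_id, a_set_of_keywords) format.
documents = {
    1: {"data", "mining", "association"},
    2: {"data", "mining", "web"},
    3: {"web", "search", "mining"},
    4: {"data", "mining", "association", "rules"},
}


def frequent_pairs(db, min_support):
    """Keyword pairs occurring together in at least min_support of the documents."""
    pair_counts = Counter()
    for keywords in db.values():
        # Sort so each pair has one canonical ordering.
        pair_counts.update(combinations(sorted(keywords), 2))
    n = len(db)
    return {pair: c / n for pair, c in pair_counts.items() if c / n >= min_support}


pairs = frequent_pairs(documents, min_support=0.5)
for pair, support in sorted(pairs.items()):
    print(pair, support)
```

In this toy database the pair ("data", "mining") appears in three of the four documents, so its support is 0.75.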

17
Document Classification
  • Need already classified documents as training set
  • Induce a classification model
  • Any difference from before?

A set of keywords associated with a document has
no fixed set of attributes or dimensions
18
Association-Based Classification
  • Classify documents based on associated,
    frequently occurring text patterns
  • Extract keywords and terms with IR and simple
    association analysis
  • Create a concept hierarchy of terms
  • Classify training documents into class
    hierarchies
  • Use association mining to discover associated
    terms to distinguish one class from another

19
Remember Generalized Association Rules
Taxonomy:
  Clothes
    Outerwear
      Jackets
      Ski Pants
    Shirts
  Footwear (ancestor of Shoes and Hiking Boots)
    Shoes
    Hiking Boots
Generalized association rule X ⇒ Y, where no item
in Y is an ancestor of an item in X
20
Classifiers
  • Let X be a set of terms
  • Let Anc(X) be those terms and their ancestor
    terms
  • Consider a rule X ⇒ C and document d
  • If X ⊆ Anc(d) then X ⇒ C covers d
  • A rule that covers d may be used to classify d
    (but only one can be used)
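
The covering test can be sketched as a subset check against a term set extended with its ancestors. The child-to-parent dictionary below reuses the clothing taxonomy from the generalized association rules slide purely for illustration; the names `PARENT`, `anc`, and `covers` are hypothetical.

```python
# Hypothetical taxonomy stored as child -> parent.
PARENT = {
    "jackets": "outerwear",
    "ski pants": "outerwear",
    "outerwear": "clothes",
    "shirts": "clothes",
    "shoes": "footwear",
    "hiking boots": "footwear",
}


def anc(terms):
    """Anc(X): the terms in X together with all of their ancestors."""
    result = set()
    for term in terms:
        # Walk up the taxonomy; stop early if this chain was already added.
        while term is not None and term not in result:
            result.add(term)
            term = PARENT.get(term)
    return result


def covers(rule_terms, document_terms):
    """A rule X => C covers d when X is a subset of Anc(d)."""
    return set(rule_terms) <= anc(document_terms)


doc = {"jackets", "hiking boots"}
print(covers({"outerwear", "footwear"}, doc))  # True
print(covers({"shirts"}, doc))  # False
```

A rule stated at a general level ("outerwear") therefore covers documents that only mention the specific terms ("jackets").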

21
Procedure
  • Step 1: Generate all generalized association
    rules X ⇒ C, where X is a set of terms and C is a
    class, that satisfy minimum support.
  • Step 2: Rank the rules according to some rule
    ranking criterion
  • Step 3: Select rules from the list
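
Steps 2 and 3 can be sketched as follows, assuming Step 1 has already produced the rules. The slides do not say which ranking criterion or selection scheme is used, so this sketch assumes ranking by confidence and then support, and classifies with the first covering rule in that order. The rules, taxonomy, and class names are all hypothetical.

```python
# Hypothetical taxonomy stored as child -> parent.
PARENT = {
    "apriori": "data mining",
    "data mining": "computer science",
    "gold": "mining industry",
}

# Hypothetical output of Step 1: (X, C, support, confidence).
rules = [
    ({"computer science"}, "CS", 0.50, 0.70),
    ({"mining industry"}, "Geology", 0.25, 0.80),
    ({"data mining"}, "CS", 0.30, 0.90),
]


def anc(terms):
    """The given terms together with all of their ancestors."""
    result = set()
    for term in terms:
        while term is not None and term not in result:
            result.add(term)
            term = PARENT.get(term)
    return result


# Step 2 (assumed criterion): highest confidence first, then highest support.
ranked = sorted(rules, key=lambda r: (r[3], r[2]), reverse=True)


def classify(document_terms, ranked_rules, default=None):
    """Step 3 (assumed scheme): use only the first rule that covers d."""
    ancestors = anc(document_terms)
    for x, c, _support, _confidence in ranked_rules:
        if x <= ancestors:
            return c
    return default


print(classify({"apriori", "gold"}, ranked))  # CS
print(classify({"gold"}, ranked))  # Geology
```

Both the "CS" and "Geology" rules cover the first document, but only the highest-ranked covering rule is used, which matches the remark that only one rule can classify a document.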

22
Web Mining
  • The World Wide Web may have more opportunities
    for data mining than any other area
  • However, there are serious challenges
  • It is too huge
  • Complexity of Web pages is greater than any
    traditional text document collection
  • It is highly dynamic
  • It has a broad diversity of users
  • Only a tiny portion of the information is truly
    useful

23
Search Engines → Web Mining
  • Current technology: search engines
  • Keyword-based indices
  • Too many relevant pages
  • Synonymy and polysemy problems
  • More challenging: web mining
  • Web content mining
  • Web structure mining
  • Web usage mining

24
Web Content Mining
25
Example Classification of Web Documents
  • Assign a class to each document based on
    predefined topic categories
  • E.g., use Yahoo!'s taxonomy and associated
    documents for training
  • Keyword-based document classification
  • Keyword-based association analysis

26
Web Structure Mining
27
Authoritative Web Pages
  • High-quality, relevant Web pages are termed
    authoritative
  • Explore linkages (hyperlinks)
  • Linking to a Web page can be considered an
    endorsement of that page
  • Those pages that are linked to frequently are
    considered authoritative
  • (This has its roots in IR methods based on
    journal citations)

28
Structure via Hubs
  • A hub is a set of Web pages containing
    collections of links to authorities
  • There is a wide variety of hubs
  • Simple list of recommended links on a person's
    home page
  • Professional resource lists on commercial sites

29
HITS
  • Hyperlink-Induced Topic Search (HITS)
  • Form a root set of pages using the query terms in
    an index-based search (about 200 pages)
  • Expand into a base set by including all pages the
    root set links to and all pages that link to the
    root set (1000-5000 pages)
  • Go into an iterative process to determine hubs
    and authorities

30
Calculating Weights
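  • Authority weight: a_p = Σ h_q, summed over all
    pages q that point to p
  • Hub weight: h_p = Σ a_q, summed over all pages q
    that p points to

Page p is pointed to by page q when q links to p (q → p)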
31
Adjacency Matrix
  • Let's number the pages 1, 2, ..., n
  • The adjacency matrix A is defined by A_ij = 1 if
    page i links to page j, and A_ij = 0 otherwise
  • By writing the authority and hub weights as
    vectors a and h we have a = A^T h and h = A a

32
Recursive Calculations
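  • We now have a = A^T h = (A^T A) a and
    h = A a = (A A^T) h, where A is the adjacency
    matrix and a and h are the authority and hub
    weight vectors
  • By linear algebra theory, repeating these updates
    (with normalization) converges to the principal
    eigenvectors of the two matrices A^T A and A A^T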
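
The iteration can be sketched in a few lines of plain Python. This is a minimal illustration on a hypothetical four-page link graph, assuming the usual HITS updates (authority weights from the hubs pointing to a page, hub weights from the authorities a page points to) with normalization after each step; pages are numbered from 0 here for convenience.

```python
import math

# Hypothetical link graph: page -> pages it links to.
links = {0: [2, 3], 1: [2, 3], 2: [3], 3: []}
n = len(links)

# Adjacency matrix: A[i][j] = 1 if page i links to page j, else 0.
A = [[1 if j in links[i] else 0 for j in range(n)] for i in range(n)]


def normalize(v):
    """Scale v to unit Euclidean length (left unchanged if all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else v


def hits(A, iterations=50):
    """Return (authority, hub) weight vectors after repeated updates."""
    n = len(A)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iterations):
        # a = A^T h: sum the hub weights of the pages q that point to p.
        a = normalize([sum(A[q][p] * h[q] for q in range(n)) for p in range(n)])
        # h = A a: sum the authority weights of the pages q that p points to.
        h = normalize([sum(A[p][q] * a[q] for q in range(n)) for p in range(n)])
    return a, h


authority, hub = hits(A)
best_authority = max(range(n), key=lambda p: authority[p])
print(best_authority)  # 3
print([round(x, 3) for x in hub])
```

Page 3 is linked to by every other page and ends up with the highest authority weight, while pages 0 and 1, which both link to the two best authorities, share the highest hub weight.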

33
Output
  • The HITS algorithm finally outputs
  • Short list of pages with high hub weights
  • Short list of pages with high authority weights
  • Have not accounted for context

34
Applications
  • The Clever Project at IBM's Almaden Labs
  • Developed the HITS algorithm
  • Google
  • Developed at Stanford
  • Uses algorithms similar to HITS (PageRank)
  • On-line version

35
Web Usage Mining
36
Complex Data Types Summary
  • Emerging areas of mining complex data types
  • Text mining can be done quite effectively,
    especially if the documents are semi-structured
  • Web mining is more difficult due to lack of such
    structure
  • Data includes text documents, hypertext
    documents, link structure, and logs
  • Need to rely on unsupervised learning, sometimes
    followed up with supervised learning such as
    classification