WEB MINING AND APPLICATIONS - PowerPoint PPT Presentation

1
WEB MINING AND
APPLICATIONS
  • Pallavi Tripathi 105956127
  • Vaishali Kshatriya 105951122
  • Mehru Anand 106113525
  • Minnie Virk 106113516

2
REFERENCES
  • Data Mining: Concepts and Techniques by Jiawei
    Han and Micheline Kamber
  • Presentation Slides of Prof. Anita Wasilewska
  • http://www.cs.rpi.edu/youssefi/research/VWM/
  • http://www-sop.inria.fr/axis/personnel/Florent.Mas
    seglia/International_Book_Encyclopedia_2005.pdf
  • http://www.galeas.de/webimining.html
  • http://www.cs.helsinki.fi/u/gionis/seminar_papers/
    zaki00spade.ps

3
CITATIONS
  • Amir H. Youssefi, David J. Duke, Mohammed J.
    Zaki, Ephraim P. Glinert, Visual Web Mining, 13th
    International World Wide Web Conference (poster
    proceedings), New York, NY, May 2004.
  • Amir H. Youssefi, David Duke, Ephraim P. Glinert,
    and Mohammed J. Zaki, Toward Visual Web Mining,
    3rd International Workshop on Visual Data Mining
    (with ICDM'03), Melbourne, FL, November 2003.

4
  • With the explosive growth of information
    sources available on the World Wide Web, it has
    become increasingly necessary for users to
    utilize automated tools to find the desired
    information resources, and to track and analyze
    their usage patterns. These factors give rise to
    the necessity of creating server-side and
    client-side intelligent systems that can
    effectively mine for knowledge.

http://www.galeas.de/webimining.html
5
WHAT IS WEB MINING?
  • Web Mining is the extraction of interesting
    and potentially useful patterns and implicit
    information from artifacts or activity related to
    the World Wide Web.

6
AREAS OF CLASSIFICATION
  • WEB CONTENT MINING is the process of extracting
    knowledge from the content of documents or their
    descriptions.
  • WEB STRUCTURE MINING is the process of inferring
    knowledge from the WorldWide Web organization
    and links between references and referents in the
    Web.
  • WEB USAGE MINING, also known as WEB LOG MINING,
    is the process of extracting interesting patterns
    in web access logs
  • In addition to these three web mining types,
    there are other helpful approaches for web
    knowledge discovery, such as information
    visualization which helps us to understand the
    complex relationships and structures of many
    search results.

http://www.galeas.de/webimining.html
7
TOPICS COVERED
  • In today's presentation we will cover the
    following algorithms related to various
    aspects of Web Mining:
  • Spade Algorithm and its applications in Visual
    Web Mining
  • Sentiment Classification
  • Community Trawling Algorithm

8
VISUAL WEB MINING
  • Application of Information visualization
    techniques on results of Web Mining in order to
    further amplify the perception of extracted
    patterns and visually explore new ones in web
    domain.
  • Application Domain is Web Usage Mining and
    Web Content Mining

http://www.cs.rpi.edu/youssefi/research/VWM/
9
APPROACH USED
  • Make personalized results for targeted web
    surfers
  • Use data mining algorithms for extracting new
    insight and measures
  • Employ a database server and relational query
    language as a means to submit specific queries
    against data
  • Utilize visualization to obtain an overall picture

http://www.cs.rpi.edu/youssefi/research/VWM/
10
SPADE OVERVIEW
  • Proposed by Mohammed J. Zaki
  • Sequential PAttern Discovery using Equivalence
    classes
  • An algorithm based on Apriori for fast discovery
    of frequent sequences
  • Needs three database scans in order to extract
    sequential patterns
  • Given: a database of customer transactions, each
    having the following characteristics:
    sequence-id or customer-id, transaction-time, and
    the item involved in the transaction
  • The aim is to obtain typical behaviors according
    to the user's viewpoint.

http://www-sop.inria.fr/axis/personnel/Florent.Mas
seglia/International_Book_Encyclopedia_2005.pdf
11
DEFINITIONS
  • Item: the object bought by a customer, or the
    page requested by the user of a website, etc.
  • Itemset: a set of items grouped by timestamp.
  • Data Sequence: the sequence of itemsets
    associated with a customer.
  • Sequential Mining: discovering frequent sequences
    over time of attribute sets in large databases.
  • Frequent Sequential Pattern: a sequence whose
    statistical significance in the database is above
    a user-specified threshold.

http://www-sop.inria.fr/axis/personnel/Florent.Mas
seglia/International_Book_Encyclopedia_2005.pdf
12
SPADE ALGORITHM
  • In the first scan, find frequent items
  • The second scan aims at finding frequent
    sequences of length 2
  • The last scan associates with each frequent
    sequence of length 2 a table of the corresponding
    sequence ids and itemset ids in the database
  • Based on this representation in main memory, the
    support of a candidate sequence of length k is
    the result of join operations on the tables
    related to the frequent sequences of length (k-1)
    able to generate this candidate

http://www-sop.inria.fr/axis/personnel/Florent.Mas
seglia/International_Book_Encyclopedia_2005.pdf
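The id-list join described above can be sketched in a few lines. This is an illustration of the vertical-format idea, not Zaki's full implementation: each item maps to a list of (sequence-id, time) occurrences, and the support of a candidate "a then b" comes from a temporal join of two id-lists instead of another database scan.

```python
# Minimal sketch of SPADE's vertical id-list join (illustrative toy data).

def idlist(db, item):
    """Vertical id-list of an item: all (sequence-id, time) occurrences."""
    return [(sid, t) for sid, seq in db.items()
            for t, itemset in seq if item in itemset]

def support_seq(db, a, b):
    """# of sequence-ids where some occurrence of b follows one of a."""
    la, lb = idlist(db, a), idlist(db, b)
    return len({sb for sa, ta in la for sb, tb in lb
                if sa == sb and tb > ta})

# Toy database: sequence-id -> list of (transaction-time, itemset)
db = {
    1: [(1, {"Camera", "DVD"}), (2, {"DVD-R"})],
    2: [(1, {"Camera"}), (3, {"DVD-R"})],
    3: [(1, {"MemoryCard"}), (2, {"USB"})],
    4: [(2, {"DVD"})],
}
print(support_seq(db, "Camera", "DVD-R"))  # 2
```

With a 50% threshold over these four sequences, "Camera then DVD-R" (support 2) would count as frequent.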
13
Data Sequence of 4 customers
http://www-sop.inria.fr/axis/personnel/Florent.Mas
seglia/International_Book_Encyclopedia_2005.pdf
14
AN EXAMPLE
  • With a minimum support of 50%, a sequential
    pattern can be considered frequent if it occurs
    in the data sequences of at least 2 customers
    (2/4).
  • In this case a maximal sequential pattern mining
    process will find three patterns:
  • S1: (Camera, DVD) (DVD-R, DVD-Rec)
  • S2: (DVD-R, DVD-Rec) (Videosoft)
  • S3: (MemoryCard) (USB)

http://www-sop.inria.fr/axis/personnel/Florent.Mas
seglia/International_Book_Encyclopedia_2005.pdf
15
Determining Support
(tables: original id-list database and the suffix
join on id-lists)
http://www-sop.inria.fr/axis/personnel/Florent.Mas
seglia/International_Book_Encyclopedia_2005.pdf
16
ADVANTAGES
  • Uses simple join operations on id table
  • No complicated hash tree structures used
  • No overhead of generating and searching
    subsequences
  • Cuts down on I/O operations by limiting itself to
    three scans

http://www.cs.helsinki.fi/u/gionis/seminar_papers/
zaki00spade.ps
17
  • The Visual Web Mining framework provides a
    prototype implementation for applying information
    visualization techniques to these results.

http://www.cs.rpi.edu/youssefi/research/VWM/
18
SYSTEM ARCHITECTURE
http://www.cs.rpi.edu/youssefi/research/VWM
19
  • A robot (webbot) is used to retrieve the pages of
    the website
  • Web server log files are downloaded and processed
  • The Integration Engine is a suite of programs for
    data preparation, i.e., extracting, cleaning,
    transforming, and integrating data, loading it
    into a database, and later generating graphs
    in XGML.

http://www.cs.rpi.edu/youssefi/research/VWM
20
  • We extract user sessions from web logs; this
    yields results roughly related to a specific
    user
  • The user sessions are converted into a format
    suitable for sequence mining
  • Outputs are frequent contiguous sequences with a
    given minimum support
  • These are imported into a database
  • Different queries are executed against this data.

http://www.cs.rpi.edu/youssefi/research/VWM
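The session-extraction step above can be sketched as timeout-based sessionization. The 30-minute timeout is an assumed, conventional choice, not stated on the slide:

```python
# Sketch of sessionizing web-log hits before sequence mining: hits from
# the same user start a new session when the gap between requests
# exceeds a timeout (30 minutes here, an assumed convention).

TIMEOUT = 30 * 60  # seconds

def sessionize(hits):
    """hits: list of (user, timestamp, url), assumed sorted by time."""
    sessions = {}
    for user, ts, url in hits:
        runs = sessions.setdefault(user, [])
        if runs and ts - runs[-1][-1][0] <= TIMEOUT:
            runs[-1].append((ts, url))    # same session: small gap
        else:
            runs.append([(ts, url)])      # gap too large: new session
    return sessions

hits = [
    ("u1", 0, "/home"), ("u1", 60, "/products"),
    ("u1", 60 + 31 * 60, "/home"),         # gap > 30 min: new session
    ("u2", 10, "/home"),
]
s = sessionize(hits)
print(len(s["u1"]))  # 2
```

Each resulting session is then a data sequence that can be fed to a sequence miner such as SPADE.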
21
APPLICATIONS
  • Designing different visualization diagrams and
    exploring frequent patterns of user access on a
    website
  • Classification of web pages into two classes,
    hot and cold, attracting high and low numbers of
    visitors.
  • A webmaster can make exploratory changes to the
    website structure and analyze the change in user
    access patterns in the real world.

http://www.cs.rpi.edu/youssefi/research/VWM/
22
Sentiment Classification
  • Vaishali Kshatriya
  • 105951122

23
References
  • The Sentimental Factor: Improving Review
    Classification via Human-Provided Information -
    Philip Beineke, Shivakumar Vaithyanathan and
    Trevor Hastie
  • Thumbs Up or Thumbs Down? Semantic Orientation
    Applied to Unsupervised Classification of
    Reviews - Turney (July 2002)
  • http://wing.comp.nus.edu.sg/chime/050427/Sentiment
    Classification3_files/frame.htm
  • http://www.cse.iitb.ac.in/cs621/seminar/Sentiment
    Detection.ppt
  • Bing Liu, Minqing Hu and Junsheng Cheng. "Opinion
    Observer: Analyzing and Comparing Opinions on the
    Web" Proceedings of the 14th International World
    Wide Web Conference (WWW-2005), May 10-14, 2005,
    Chiba, Japan.

24
Sentiment Classification
  • It is the task of labeling a review document
    according to the polarity of its prevailing
    opinion.

25
Online Shopping
26
Topical vs. Sentimental Classification
  • Topical Classification
  • Classifying documents into various subjects, for
    example Mathematics, Sports, etc.
  • Comparing individual words (unigrams) in various
    subject areas (bag-of-words approach). Example:
    score, referee, football -> Sports
  • Sentiment Classification
  • Classifying documents according to the overall
    sentiment: positive vs. negative, e.g. like vs.
    dislike, recommended vs. not recommended
  • More difficult than traditional topical
    classification. May need more linguistic
    processing, e.g. "you will be disappointed" and
    "it is not satisfactory"

http://wing.comp.nus.edu.sg/chime/050427/Sentiment
Classification3_files/frame.htm
27
Challenges
  • The sentiment of a phrase depends on context:
    "unpredictable plot" vs. "unpredictable
    performance"
  • Negations have to be captured:
  • "The movie was not that bad."
  • "The pictures taken by the cell phone are not of
    the best quality."
  • Subtle expressions:
  • "How can someone sit through the entire movie?"

http://www.cse.iitb.ac.in/cs621/seminar/Sentiment
Detection.ppt
28
Unsupervised review classification (Turney ACL
-02)
  • Input: written review
  • Output: classification (i.e. positive or
    negative)
  • Step 1: Use a part-of-speech tagger to identify
    phrases
  • Step 2: Estimate the semantic orientation of each
    extracted phrase
  • Step 3: Assign the given review to a class
    (either recommended or not recommended)

Citation Thumbs Up or Thumbs Down? Semantic
orientation applied to unsupervised
classification of reviews Turney (02)
29
Step 1 Extract the phrases
  • Part-of-speech tagger is applied to the review
  • Two consecutive words are extracted from the
    review if their tags conform to any of the
    patterns in the table

where JJ = adjective and NN = noun
Citation Thumbs Up or Thumbs Down? Semantic
orientation applied to unsupervised
classification of reviews Turney (02)
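The table of tag patterns is not reproduced in the transcript; the sketch below uses a subset reconstructed from Turney's paper (JJ = adjective, NN/NNS = noun, RB/RBR/RBS = adverb) and assumes the review is already POS-tagged, so it stays self-contained:

```python
# Sketch of two-word phrase extraction over pre-tagged text. The three
# patterns are a reconstructed subset of Turney's table, not the full list.

def extract_phrases(tagged):
    """Return adjacent word pairs whose tags match an opinion pattern."""
    NOUN = {"NN", "NNS"}
    ADV = {"RB", "RBR", "RBS"}
    phrases = []
    for i in range(len(tagged) - 1):
        (w1, t1), (w2, t2) = tagged[i], tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else None
        if (t1 == "JJ" and t2 in NOUN) or \
           (t1 in ADV and t2 == "JJ" and t3 not in NOUN) or \
           (t1 == "JJ" and t2 == "JJ" and t3 not in NOUN):
            phrases.append(f"{w1} {w2}")
    return phrases

tagged = [("this", "DT"), ("is", "VBZ"), ("a", "DT"),
          ("beautiful", "JJ"), ("camera", "NN"),
          ("with", "IN"), ("very", "RB"), ("sharp", "JJ"),
          ("pictures", "NNS")]
print(extract_phrases(tagged))  # ['beautiful camera', 'sharp pictures']
```

Note how "very sharp" is skipped because the following word is a noun, matching the paper's third-word restriction.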
30
Step 2 Estimate the semantic orientation
  • Uses PMI-IR (Pointwise Mutual Information and
    Information Retrieval)
  • PMI between 2 words, word1 and word2 can be
    defined as
  • The Semantic Orientation (SO) of a phrase is
    calculated as
  • SO(phrase) PMI(phrase, excellent)
    PMI(phrase, poor)
  • SO is positive when the phrase is more strongly
    associated with excellent and negative when it is
    more strongly associated with poor.

Citation Thumbs Up or Thumbs Down? Semantic
orientation applied to unsupervised
classification of reviews Turney (02)
31
Step 2 (contd)
  • PMI-IR estimates PMI by issuing queries to a
    search engine (hence the IR in PMI-IR) and noting
    the number of hits (matching documents).
  • The experiment uses AltaVista

Citation Thumbs Up or Thumbs Down? Semantic
orientation applied to unsupervised
classification of reviews Turney (02)
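Using hit counts, the SO formula reduces to a ratio of four counts. The sketch below replaces the search engine with a hypothetical `hits` dictionary (the counts are invented for illustration; "direct deposit" and "unethical practices" are example phrases from Turney's bank reviews), and also folds in the Step 3 averaging:

```python
import math

# Sketch of PMI-IR semantic orientation. Real PMI-IR issues search-engine
# queries; `hits` is an assumed stand-in so the arithmetic is runnable.

hits = {  # invented hit counts for illustration only
    "excellent": 1000, "poor": 1000,
    ("direct deposit", "excellent"): 80, ("direct deposit", "poor"): 20,
    ("unethical practices", "excellent"): 5,
    ("unethical practices", "poor"): 50,
}

def semantic_orientation(phrase):
    """SO = log2(hits(phrase NEAR excellent) * hits(poor) /
                 (hits(phrase NEAR poor) * hits(excellent)))."""
    return math.log2(hits[(phrase, "excellent")] * hits["poor"] /
                     (hits[(phrase, "poor")] * hits["excellent"]))

def classify(phrases):
    """Step 3: average the SO of all phrases; positive -> recommended."""
    avg = sum(semantic_orientation(p) for p in phrases) / len(phrases)
    return "recommended" if avg > 0 else "not recommended"

print(semantic_orientation("direct deposit"))  # 2.0
print(classify(["direct deposit", "unethical practices"]))
```

The rewritten form of SO follows algebraically from SO = PMI(phrase, excellent) − PMI(phrase, poor) once probabilities are estimated by hit counts.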
32
Step 3 Assign a Class
  • Calculate the average SO of the phrases and
    classify the review as recommended if the average
    is positive and not recommended if the average is
    negative.

Reviews of a bank
Citation Thumbs Up or Thumbs Down? Semantic
orientation applied to unsupervised
classification of reviews Turney (02)
33
Drawbacks
  • Sentiment classification is useful, but it does
    not find what the reviewer liked or disliked.
  • A negative sentiment on an object does not imply
    that the user disliked everything about the
    product
  • Similarly, a positive sentiment does not imply
    that the user liked everything about the product
  • The solution is to go to the sentence and feature
    level

http://www.cs.uic.edu/liub/EITC-06.ppt
34
Feature based Opinion mining and summarization
(Hu and Liu 04)
  • Interested in what reviewers liked and disliked
  • Since the number of reviews of an object can be
    large, the goal was to produce a simple summary
    of the reviews
  • The summary can be easily visualized and compared

http://www.cs.uic.edu/liub/EITC-06.ppt
35
Three main tasks
  • Step 1: Identify and extract object features that
    have been commented on in each review
  • Step 2: Determine whether the opinions in the
    review are positive, negative or neutral
  • Step 3: Group synonyms of features
  • Produce a feature-based summary!

http://www.cs.uic.edu/liub/EITC-06.ppt
36
Online Shopping
37
Summary
  • Classification of reviews as good or bad:
    sentiment classification
  • Unsupervised review classification extracts the
    phrases from the review, estimates their semantic
    orientation, and assigns a class to the review
  • The solution for the shortcomings of sentiment
    classification is feature-based opinion
    extraction

38
Discovering Communities on the Web
Mehru Anand (106113525)
39
References
  • Inferring Web Communities from Link Topology
    (1998) David Gibson, Jon Kleinberg, Prabhakar
    Raghavan, UK Conference on Hypertext.
  • Trawling the Web for Emerging Cyber-communities
    (1999) Ravi Kumar, Prabhakar Raghavan, Sridhar
    Rajagopalan, Andrew Tomkins, WWW8 / Computer
    Networks.
  • Finding Related Pages in the World Wide Web
    (1999) Jeffrey Dean, Monika R. Henzinger, WWW8 /
    Computer Networks.
  • A System for Collaborative Web Resource
    Categorization and Ranking, Maxim Lifantsev.
  • Web Mining: A Bird's Eye View by Sanjay Kumar
    Madria, Department of Computer Science, University
    of Missouri-Rolla, MO, madria@umr.edu

40
Introduction
  • Introduction to cyber-communities
  • Methods to measure the similarity of web pages on
    the web graph
  • Methods to extract meaningful communities through
    the link structure

41
What is a cyber-community?
  • A community on the web is a group of web pages
    sharing a common interest
  • E.g. a group of web pages talking about pop music
  • E.g. a group of web pages interested in
    data mining
  • Main properties:
  • Pages in the same community should be similar to
    each other in content
  • The pages in one community should differ from the
    pages in another community
  • Similar to a cluster

42
Two different types of communities
  • Explicitly-defined communities
  • They are well known ones, such as the resource
    listed by Yahoo!
  • Implicitly-defined communities
  • They are communities unexpected or invisible to
    most users

e.g. a topic hierarchy: Arts -> Music, Painting;
Music -> Classic, Pop
e.g. the group of web pages interested in a
particular singer
43
44
Two different types of communities
  • The explicit communities are easy to identify
  • E.g. Yahoo!, InfoSeek, Clever System
  • In order to extract the implicit communities, we
    need to analyze the web graph objectively
  • In research, people are more interested in the
    implicit communities

45
Similarity of web pages
  • Discovering web communities is similar to
    clustering. For clustering, we must define the
    similarity of two nodes
  • Method I
  • For page A and page B, A is related to B if there
    is a hyperlink from A to B, or from B to A
  • Not so good: consider the home pages of IBM and
    Microsoft.

(diagram: a hyperlink between Page A and Page B)
46
Similarity of web pages
  • Method II (from bibliometrics)
  • Co-citation: the similarity of A and B is
    measured by the number of pages that cite both A
    and B
  • Bibliographic coupling: the similarity of A and B
    is measured by the number of pages cited by both
    A and B.

(diagrams: co-citation and bibliographic coupling)
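Both bibliometric measures are simple counts over the link graph. A minimal sketch, with an invented toy graph (page -> set of pages it links to):

```python
# Sketch of co-citation and bibliographic coupling on a tiny link graph.

links = {
    "A": {"X", "Y"}, "B": {"X", "Y", "Z"},   # A and B both cite X and Y
    "P": {"A", "B"}, "Q": {"A", "B"}, "R": {"A"},
}

def cocitation(g, a, b):
    """# of pages that link to both a and b."""
    return sum(1 for out in g.values() if a in out and b in out)

def coupling(g, a, b):
    """# of pages linked to by both a and b."""
    return len(g.get(a, set()) & g.get(b, set()))

print(cocitation(links, "A", "B"))  # 2 (P and Q cite both)
print(coupling(links, "A", "B"))    # 2 (both cite X and Y)
```

Either count can serve as the pairwise similarity that a clustering method then operates on.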
47
Methods of clustering
  • Clustering methods based on co-citation analysis
  • Methods derived from HITS (Kleinberg)
  • Using co-citation matrix
  • All of them can discover meaningful communities
  • But these methods are too expensive to apply to
    the whole World Wide Web, with billions of web
    pages.

48
Trawling the Web for Emerging Cyber-communities
Proceedings of the Eighth International Conference
on World Wide Web, Toronto, Canada, pages 1481-1493,
1999. ISSN 1389-1286.
Ravi Kumar, Prabhakar Raghavan, Sridhar
Rajagopalan, Andrew Tomkins
49
A cheaper method
  • The method from Ravi Kumar, Prabhakar Raghavan,
    Sridhar Rajagopalan, Andrew Tomkins
  • IBM Almaden Research Center
  • They call their method community trawling (CT)
  • They implemented it on a graph of 200 million
    pages, and it worked very well

50
Basic idea of CT
  • Definition of communities
  • dense directed bipartite sub graphs
  • Bipartite graph Nodes are partitioned into two
    sets, F and C
  • Every directed edge in the graph is directed from
    a node u in F to a node v in C
  • dense if many of the possible edges between F and
    C are present

(diagram: a bipartite graph with fan set F and
center set C)
51
Basic idea of CT
  • Bipartite cores:
  • A complete bipartite subgraph with at least i
    nodes from F and at least j nodes from C
  • i and j are tunable parameters
  • An (i, j) bipartite core
  • Every community has such a core for certain i
    and j.

An (i=3, j=3) bipartite core
52
Basic idea of CT
  • A bipartite core is the identity of a community
  • To extract all the communities is to enumerate
    all the bipartite cores on the web.
  • The authors invented an efficient algorithm to
    enumerate the bipartite cores. Its main idea is
    iterative pruning -- elimination-generation
    pruning

53
Complete bipartite graph: there is an edge
between each node in F and each node in C.
(i,j)-Core: a complete bipartite graph with at
least i nodes in F and j nodes in C. An (i,j)-core
is a good signature for finding online
communities. Trawling = finding cores: find all
(i,j)-cores in the Web graph; in particular,
find fans (or hubs) and centers (authorities) in
the graph. Challenge: the Web is huge. How do we
find cores efficiently?
54
Main idea pruning
  • Step 1 using out-degrees
  • Rule each fan must point to at least 6
    different websites
  • Pruning results 12 of all pages ( 24M pages)
    are potential fans
  • Retain only links, and ignore page contents

55
Step 2: Eliminate mirror pages
  • Many pages are mirrors (exactly the same page)
  • They can produce many spurious fans
  • Use a shingling method to identify and
    eliminate duplicates
  • Results:
  • 60% of the 24M potential-fan pages are removed
  • The # of potential centers is 30 times the # of
    potential fans
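The shingling idea can be sketched briefly: pages whose sets of w-word shingles have high Jaccard resemblance are treated as mirrors (the window size and the toy pages below are illustrative choices, not the paper's exact parameters):

```python
# Sketch of w-shingling for near-duplicate (mirror) detection.

def shingles(text, w=4):
    """Set of all contiguous w-word sequences in the text."""
    words = text.split()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

def resemblance(a, b, w=4):
    """Jaccard similarity of the two pages' shingle sets."""
    sa, sb = shingles(a, w), shingles(b, w)
    return len(sa & sb) / len(sa | sb)

page1 = "welcome to our home page full of useful links and resources"
page2 = "welcome to our home page full of useful links and resources"
page3 = "a completely different page about an unrelated subject entirely"
print(resemblance(page1, page2))  # 1.0 -> mirror
print(resemblance(page1, page3))  # 0.0
```

Production systems sample or hash the shingles rather than comparing full sets, but the resemblance measure is the same.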

56
Step 3: Iterative pruning
  • To find (i,j)-cores:
  • Remove all pages whose # of out-links is < j
    (they cannot be fans)
  • Remove all pages whose # of in-links is < i
    (they cannot be centers)
  • Do it iteratively
  • Step 4: inclusion-exclusion pruning
  • Idea: in each step, we
  • Either include a community
  • Or exclude a page from further contention
57
  • Check a page x with out-degree j. x is a fan of
    an (i,j)-core if:
  • There are i-1 other fans pointing to all the
    forward neighbors of x
  • This step can be checked easily using the index
    on fans and centers
  • Result: for (3,3)-cores, 5M pages remained
  • Final step:
  • Since the graph is much smaller, we can afford
    to enumerate the remaining cores

58
  • Step 5: use in-degrees of pages
  • Delete highly referenced pages, e.g., Yahoo,
    AltaVista
  • Reason: they are referenced for many reasons, and
    are not likely to form an emerging community
  • Formally: remove all pages with more than k
    in-links (k = 50, for instance)
  • Results:
  • 60M pages pointing to 20M pages
  • 2M potential fans

59
Weakness of CT
  • The bipartite graph model cannot capture all
    kinds of communities
  • The density of the community is hard to adjust

60
Experiment on CT
  • 200 million web pages
  • An IBM PC with an Intel 300MHz Pentium II
    processor and 512MB of memory, running Linux
  • i from 3 to 10 and j from 3 to 20
  • 200k potential communities were discovered
  • 29% of them cannot be found in Yahoo!.

61
Summary
  • Conclusion: the methods to discover communities
    from the web depend on how we define the
    communities through the link structure
  • Future work:
  • How to relate the contents to the link structure

62
Mining Topic-Specific Concepts and Definitions on
the Web
  • Minnie Virk
  • May 2003,  Proceedings of the 12th International
    conference on World Wide Web, ACM Press
  • Bing Liu, University of Illinois at Chicago, 851
    S. Morgan Street Chicago IL 60607-7053
  • Chee Wee Chin,
  • Hwee Tou Ng, National University of Singapore
  • 3 Science Drive 2
    Singapore

63
References
  • Agrawal, R. and Srikant, R. Fast Algorithms for
    Mining Association Rules, VLDB-94, 1994.
  • Anderson, C. and Horvitz, E. Web Montage: A
    Dynamic Personalized Start Page, WWW-02, 2002.
  • Brin, S. and Page, L. The Anatomy of a
    Large-Scale Hypertextual Web Search Engine,
    WWW7, 1998.
  • Web Mining: A Bird's Eye View by Sanjay Kumar
    Madria, Department of Computer Science, University
    of Missouri-Rolla, MO, madria@umr.edu

64
Introduction
  • When one wants to learn about a topic, one reads
    a book or a survey paper.
  • One can read the research papers about the topic.
  • None of these is very practical.
  • Learning from the Web is convenient, intuitive,
    and diverse.

65
Purpose of the Paper
  • This paper's task is mining topic-specific
    knowledge on the Web.
  • The goal is to help people learn in-depth
    knowledge of a topic systematically on the Web.

66
Learning about a New Topic
  • One needs to find definitions and descriptions of
    the topic.
  • One also needs to know the sub-topics and salient
    concepts of the topic.
  • Thus, one wants the knowledge as presented in a
    traditional book.
  • The task of this paper can be summarized as
    compiling a book on the Web.

67
Proposed Technique
  • First, identify sub-topics or salient concepts of
    that specific topic.
  • Then, find and organize the informative pages
    containing definitions and descriptions of the
    topic and sub-topics.

68
Why are the current search techniques not
sufficient?
  • For definitions and descriptions of the topic
  • Existing search engines rank web pages based
    on keyword matching and hyperlink structures,
    which are NOT very useful for measuring the
    informative value of a page.
  • For sub-topics and salient concepts of the topic
  • A single web page is unlikely to contain
    information about all the key concepts or
    sub-topics of the topic. Thus, sub-topics need to
    be discovered from multiple web pages. Current
    search engine systems do not perform this task.

69
Related Work
  • Web information extraction wrappers
  • Web query languages
  • User preference approach
  • Question answering in information retrieval
  • Question answering is a closely-related work to
    this paper. The objective of a question-answering
    system is to provide direct answers to questions
    submitted by the user. In this papers task, many
    of the questions are about definitions of terms.

70
The Algorithm
  • WebLearn (T)
  • 1) Submit T to a search engine, which returns a
    set of relevant pages
  • 2) The system mines the sub-topics or salient
    concepts of T using a set S of top ranking pages
    from the search engine
  • 3) The system then discovers the informative
    pages containing definitions of the topic and
    sub-topics (salient concepts) from S
  • 4) The user views the concepts and informative
    pages.
  • If s/he still wants to know more about
    sub-topics then
  • for each user-interested sub-topic Ti of
    T do
  • WebLearn (Ti)
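The recursion above can be sketched as runnable code. Here `search`, `mine_subtopics` and `find_definitions` are hypothetical stubs standing in for the real search-engine, phrase-mining and pattern-matching components:

```python
# Structural sketch of the WebLearn recursion; the three helpers are
# stubs, not the paper's actual components.

def search(topic, k=100):
    # stub: would submit `topic` to a search engine, return top-k pages
    return [f"page about {topic} #{n}" for n in range(3)]

def mine_subtopics(pages):
    # stub: would mine frequent emphasized phrases from the pages
    return ["classification", "clustering"] if pages else []

def find_definitions(topic, pages):
    # stub: would match definition patterns against the pages
    return [p for p in pages if topic in p]

def weblearn(topic, interested, depth=0, out=None):
    out = {} if out is None else out
    pages = search(topic)                      # step 1
    subtopics = mine_subtopics(pages)          # step 2
    out[topic] = find_definitions(topic, pages)  # step 3
    if depth < 1:  # recurse only into sub-topics the user asked for
        for sub in subtopics:
            if sub in interested:
                weblearn(sub, interested, depth + 1, out)
    return out

result = weblearn("data mining", interested={"classification"})
print(sorted(result))  # ['classification', 'data mining']
```

The depth cap is an assumed safeguard; in the paper the user decides interactively whether to drill into each sub-topic.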

71
Sub-Topic or Salient Concept Discovery
  • Observation
  • Sub-topics or salient concepts of a topic are
    important word phrases, usually emphasized using
    some HTML tags (e.g., lth1gt,...,lth4gt,ltbgt).
  • However, this is not sufficient. Data mining
    techniques are able to help to find the frequent
    occurring word phrases.

72
Sub-Topic Discovery
  • After obtaining a set of relevant top-ranking
    pages (using Google), sub-topic discovery
    consists of the following 5 steps.
  • 1) Filter out the noisy documents that rarely
    contain sub-topics or salient-concepts. The
    resulting set of documents is the source for
    sub-topic discovery.

73
Sub-Topic Discovery
  • 2) Identify important phrases in each page
    (discover phrases emphasized by HTML markup
    tags).
  • Rules to determine if a markup tag can safely
    be ignored
  • Contains a salutation title (Mr, Dr, Professor).
  • Contains a URL or an email address.
  • Contains terms related to a publication
    (conference, proceedings, journal).
  • Contains an image between the markup tags.
  • Too lengthy (the paper uses 15 words as the upper
    limit)

74
Sub-Topic Discovery
  • Also, in this step, some preprocessing techniques
    such as stopwords removal and word stemming are
    applied in order to extract quality text
    segments.
  • Stopwords removal Eliminating the words that
    occur too frequently and have little
    informational meaning.
  • Word stemming Finding the root form of a word by
    removing its suffix.

75
Sub-Topic Discovery
  • 3) Mine frequently occurring phrases
  • - Each piece of text extracted in step 2 is
    stored in a dataset called a transaction set.
  • - Then, an association rule miner based on
    Apriori algorithm is executed to find those
    frequent itemsets. In this context, an itemset is
    a set of words that occur together, and an
    itemset is frequent if it appears in more than
    two documents.
  • - We only need the first step of the Apriori
    algorithm and we only need to find frequent
    itemsets with three words or fewer (this
    restriction can be relaxed).
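Step 3 can be sketched as a single Apriori-style pass. Each emphasized segment is a transaction of words, itemsets are capped at three words, and an itemset is frequent if it appears in more than two documents (the slide's threshold); the documents below are invented:

```python
from itertools import combinations

# Sketch of frequent-itemset mining over emphasized text segments,
# counting document frequency (not transaction frequency).

def frequent_itemsets(docs, min_docs=3, max_size=3):
    counts = {}
    for doc in docs:
        seen = set()
        for segment in doc:            # each segment = one transaction
            words = frozenset(segment.split())
            for size in range(1, max_size + 1):
                for combo in combinations(sorted(words), size):
                    seen.add(frozenset(combo))
        for itemset in seen:           # count each itemset once per doc
            counts[itemset] = counts.get(itemset, 0) + 1
    return {s for s, c in counts.items() if c >= min_docs}

docs = [
    ["neural networks", "association rules"],
    ["neural networks", "web mining"],
    ["neural networks"],
]
print(frozenset({"neural", "networks"}) in frequent_itemsets(docs))  # True
```

A real implementation would prune candidates level by level as Apriori does; brute-force enumeration suffices here because itemsets are at most three words.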

76
Sub-Topic Discovery
  • 4) Eliminate itemsets that are unlikely to be
    sub-topics, and determine the sequence of words
    in a sub-topic. (postprocessing)
  • Heuristic If an itemset does not appear alone as
    an important phrase in any page, it is unlikely
    to be a main sub-topic and it is removed.

77
Sub-Topic Discovery
  • 5) Rank the remaining itemsets. The remaining
    itemsets are regarded as the sub-topics or
    salient concepts of the search topic and are
    ranked by the number of pages in which they
    occur.

78
Definition Finding
  • This step tries to identify those pages that
    include definitions of the search topic and its
    sub-topics discovered in the previous step.
  • Preprocessing steps
  • Texts that will not be displayed by browsers
    (e.g., <script>...</script>, <!-- comments -->)
    are ignored.
  • Word stemming is applied.
  • Stopwords and punctuation are kept as they serve
    as clues to identify definitions.
  • HTML tags within a paragraph are removed.

79
Definition Finding
  • After that, a set of lexical patterns (listed in
    the paper) is applied to identify definitions

[1] Bing Liu, Chee Wee Chin, Hwee Tou Ng. Mining
Topic-Specific Concepts and Definitions on the Web
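The transcript does not reproduce the paper's pattern table, so the regexes below are illustrative definition cues of the kind such systems use ("X is a/an ...", "X is defined as ...", "what is X"), not the paper's exact list:

```python
import re

# Sketch of pattern-based definition detection; the patterns are
# illustrative assumptions, not the paper's actual table.

def looks_like_definition(sentence, topic):
    t = re.escape(topic)
    patterns = [
        rf"\b{t}\s+is\s+(a|an|the)\b",
        rf"\b{t}\s+(is\s+)?defined\s+as\b",
        rf"\bwhat\s+is\s+{t}\b",
    ]
    s = sentence.lower()
    return any(re.search(p, s) for p in patterns)

print(looks_like_definition(
    "Data mining is a process of extracting patterns from data.",
    "data mining"))  # True
print(looks_like_definition(
    "We applied data mining to web logs.", "data mining"))  # False
```

Keeping stopwords and punctuation during preprocessing (as the previous slide notes) matters precisely because these patterns hinge on words like "is" and "a".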
80
Definition Finding
  • Besides using the above patterns, the paper also
    relies on HTML structuring and hyperlink
    structures.
  • 1) If a page contains only one header or one big
    emphasized text segment at the beginning in the
    entire document, then the document contains a
    definition of the concept in the header.
  • 2) Definitions at the second level of the
    hyperlink structure are also discovered. All the
    patterns and methods described above are applied
    to these second level documents.

81
Definition Finding
  • Observation Sometimes no informative page is
    found for a particular sub-topic when the pages
    for the main topic are very general and do not
    contain detailed information for sub-topics.
  • In such cases, the sub-topic can be submitted to
    the search engine and sub-subtopics may be found
    recursively.

82
Conclusions
  • The proposed techniques aim at helping Web users
    to learn an unfamiliar topic in-depth and
    systematically.
  • This is an efficient system to discover and
    organize knowledge on the web, in a way similar
    to a traditional book, to assist learning.

83
  • Questions?

84
  • Thank You!