Discovering Business Intelligence Information by Comparing Company Web Sites PowerPoint PPT Presentation


Title: Discovering Business Intelligence Information by Comparing Company Web Sites


1
Discovering Business Intelligence Information by
Comparing Company Web Sites
  • Unit 6 of Web Intelligence
  • Web Mining and Farming

2
Introduction
  • More and more companies, government
    organizations, and individuals are publishing
    their information on the Web
  • How can useful or interesting information be
    found on the Web? Existing approaches:
  • Keyword-based search
  • Manual browsing
  • Wrapper-based approaches
  • Web query languages
  • User-preference approaches
  • They only find information that matches the
    user's specifications

3
Introduction (Cont.)
  • Finding unexpected information can be very
    important
  • Human analysts have to browse the Web to
    identify these pieces of interesting (including
    unexpected) information
  • Automated assistance is urgently needed
  • Whether a piece of information is interesting or
    not is subjective
  • Similar to the interestingness problem in data
    mining

4
Interestingness Measures
  • Interestingness measures
  • Unexpectedness: a piece of information is
    interesting if it is relevant but unknown to the
    user, or if it contradicts the user's expectations
  • Actionability: a piece of information is
    actionable if the user can do something with it
    to his/her advantage
  • Actionability is a key concept but an elusive one
    (so it is left to the user to decide)
  • Information categorization
  • Information that is both unexpected and
    actionable
  • Information that is unexpected but not actionable
  • Information that is actionable but expected

5
Summary of the Proposed Approach
  • Aim to find interesting information from a
    competitor Web site
  • Input
  • A user site U (expectation of the user)
  • Some additional knowledge E that the user has
    about its competitor (expectation of the user)
  • A competitor site C
  • Actions of WebCompare
  • Analyze U to extract all the information that
    represents the user's expectations
  • Analyze C and compare the information contained
    in C with that in U and E to find various types
    of expected and unexpected information in C

6
Summary of the Proposed Approach (Cont.)
  • The information in a Web page is represented
    using two schemes
  • Vector space representation: the similarities,
    differences, and main concepts of text
    documents can be represented by the keywords that
    appear in the documents
  • Concepts
  • Combinations of keywords that occur frequently in
    the sentences of a Web page
  • Often represent significant information that the
    owner wants to emphasize

7
Vector Space Representation of Text Documents
8
Vector Space Representation of Text Documents
  • Each document is described by a set of keywords
    called index terms (or simply terms)
  • An index term is simply a word whose semantics
    helps to recall the document's main themes
  • Index terms are used to index and to summarize
    the document content
  • An index term is associated with a weight

9
Term Weight
  • Two approaches to associate a weight with an
    index term
  • Binary
  • the domain contains only the values one and zero.
  • Weighted
  • the domain is the set of all positive real
    numbers.
  • Example: a document that discusses petroleum
    refineries in Mexico

              Petroleum  Mexico  Oil  Taxes  Refineries
    Binary    1          1       1    0      1
    Weighted  2.8        1.6     3.5  0.3    3.1

10
Term Weight (Cont.)
  • Simple term frequency algorithm
  • The weight is equal to the term frequency (TF)
  • Emphasizes the use of a particular processing
    token within an item
  • If the word "computer" occurs 15 times within an
    item, it has a weight of 15
  • Problem: normalization is needed
  • The longer an item is, the more often a
    processing token may occur within the item.
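
The normalization problem can be illustrated with a small sketch. The slide does not name a remedy; dividing each count by the item length is one common choice and is an assumption here, as are the function names and the toy items.

```python
from collections import Counter

def term_frequencies(tokens):
    """Raw term frequency: the weight of a token is its count in the item."""
    return Counter(tokens)

def normalized_frequencies(tokens):
    """Length-normalized frequency (one common remedy, assumed for illustration)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Two toy items with the same content; the second is simply four times longer.
short_item = ["computer"] * 15 + ["network"] * 15
long_item = short_item * 4

raw_short = term_frequencies(short_item)        # "computer" has weight 15
raw_long = term_frequencies(long_item)          # "computer" has weight 60
norm_short = normalized_frequencies(short_item)
norm_long = normalized_frequencies(long_item)   # same share as in the short item
```

With raw counts the longer item looks four times more relevant to "computer", while the normalized weights of the two items are equal.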

11
Term Weight (Cont.)
  • Inverse document frequency
  • The weight of an index term also depends on how
    many items in the database contain it: the fewer
    items contain the term, the higher its weight
  • $WEIGHT_{ij} = TF_{ij} \times [\log_2(n) - \log_2(IF_j) + 1]$
  • $WEIGHT_{ij}$: weight assigned to term j in item i
  • $TF_{ij}$: frequency of term j in item i
  • $IF_j$: number of items in the database that have
    term j in them
  • $n$: number of items in the database

12
Term Weight (Cont.)
  • Example
  • $WEIGHT_{oil} = 4 \times (\log_2(2048) - \log_2(128) + 1) = 20$
  • $WEIGHT_{Mexico} = 8 \times (\log_2(2048) - \log_2(16) + 1) = 64$
  • $WEIGHT_{refinery} = 10 \times (\log_2(2048) - \log_2(1024) + 1) = 20$
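
The three example weights can be checked with a few lines of Python. This is a minimal sketch of the weighting formula from the preceding slide; the function and variable names are illustrative, not taken from the presentation.

```python
import math

def idf_weight(tf, n, items_with_term):
    """WEIGHT = TF * (log2(n) - log2(IF) + 1), following the slide's formula."""
    return tf * (math.log2(n) - math.log2(items_with_term) + 1)

n = 2048  # number of items in the database, as in the example

# term -> (frequency in the item, number of items containing the term)
examples = {
    "oil": (4, 128),
    "Mexico": (8, 16),
    "refinery": (10, 1024),
}

weights = {
    term: idf_weight(tf, n, n_items)
    for term, (tf, n_items) in examples.items()
}
```

"Mexico" ends up with the largest weight even though "refinery" occurs more often in the item, because "Mexico" appears in far fewer items of the database.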

13
Term Weight (Cont.)
  • Signal weighting
  • IDF does not account for the term frequency
    distribution of the processing token in the items
    that contain the term.
  • The distribution of the frequency of processing
    tokens within an item can affect the ability to
    rank items.
  • An instance of an event that occurs all the time
    has less information value than an instance of a
    seldom occurring event.

14
Similarity Measure
  • Measure the similarity between a query and a
    document
  • Similarity measure examples

15
Finding Concepts Using Association Rule Mining
  • Association rules
  • Cheese → beer [support = 10%, confidence = 80%]
  • Each rule is characterized by two measures, its
    support and its confidence
  • An association mining algorithm works in two
    steps (Apriori)
  • Generate all large (frequent) itemsets that
    satisfy minsup
  • An itemset is simply a set of items
  • A large itemset is an itemset that has
    transaction support above minsup
  • Generate all association rules that satisfy
    minconf using the large itemsets
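
The two steps can be sketched as follows. This is a simplified, unoptimized illustration rather than the full Apriori algorithm: it generates candidates level by level but omits Apriori's subset-pruning step, and the function names and toy transactions are assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Step 1: level-wise search for itemsets whose support is >= minsup."""
    n = len(transactions)
    support = {}
    candidates = {frozenset([item]) for t in transactions for item in t}
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        large = {c: k / n for c, k in counts.items() if k / n >= minsup}
        support.update(large)
        # Join large k-itemsets that differ in exactly one item into (k+1)-candidates.
        candidates = {a | b for a in large for b in large if len(a | b) == len(a) + 1}
    return support

def association_rules(support, minconf):
    """Step 2: derive rules X -> Y from the large itemsets."""
    rules = []
    for itemset, sup in support.items():
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                conf = sup / support[lhs]  # every subset of a large itemset is large
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, sup, conf))
    return rules

transactions = [
    frozenset({"cheese", "beer", "bread"}),
    frozenset({"cheese", "beer"}),
    frozenset({"cheese", "milk"}),
    frozenset({"beer", "bread"}),
]
large = frequent_itemsets(transactions, minsup=0.5)
rules = association_rules(large, minconf=0.6)
```

Each rule is returned as a tuple (X, Y, support, confidence); with these toy transactions, bread → beer has support 0.5 and confidence 1.0.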

17
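Confidence and Support
  • Confidence c of a rule X → Y: the percentage of
    transactions in D containing X that also contain Y
  • Support s of a rule X → Y: the percentage of
    transactions in D that contain both X and Y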

19
Finding Concepts Using Association Rule Mining
(Cont.)
  • Association rule mining in WebCompare
  • The set of items I is the set of keywords in a
    page
  • The keywords in each sentence of the page form a
    transaction t
  • The set of all sentences in the page gives the
    transaction set T
  • If a particular keyword occurs more than once in
    a sentence, consider it only once
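
A minimal sketch of how a page could be turned into a transaction set. The sentence splitting, the tokenization, and the stop-word list are simplifying assumptions; the slides do not say how WebCompare selects keywords, and no stemming is applied here.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "is", "in", "to", "from", "on", "we"}

def page_to_transactions(text):
    """Turn each sentence into a set of keywords (one transaction per sentence)."""
    transactions = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z]+", sentence.lower())
        # A set keeps each keyword once, however often it occurs in the sentence.
        keywords = {w for w in words if w not in STOP_WORDS}
        if keywords:
            transactions.append(keywords)
    return transactions

page = (
    "Information extraction is hard. "
    "We study extraction of information from the Web. "
    "Extraction, extraction!"
)
T = page_to_transactions(page)
```

The last sentence repeats "extraction" twice, yet its transaction contains the keyword only once.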

20
Finding Concepts Using Association Rule Mining
(Cont.)
  • WebCompare mines all large itemsets from every
    page in C and every page in U separately
  • Each page of a Web site typically focuses on a
    specific topic
  • If a page is mixed with other pages, we may not
    be able to find interesting concepts that exist
    in the page, due to the minimum support constraint
  • A concept may be large in one page, but may not
    be large when it is combined with another page,
    as the minimum support is normally specified in
    percentage
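
The effect of a percentage-based minimum support can be seen in a short worked example. The sentence counts and the 10% threshold are made-up numbers chosen for illustration.

```python
def support(occurrences, total_sentences):
    """Fraction of sentences in which the concept occurs."""
    return occurrences / total_sentences

MINSUP = 0.10       # assumed threshold, for illustration

page_a = (3, 10)    # the concept occurs in 3 of the 10 sentences of one page
page_b = (0, 40)    # it never occurs in the 40 sentences of another page

alone = support(*page_a)                                            # 3 / 10
combined = support(page_a[0] + page_b[0], page_a[1] + page_b[1])    # 3 / 50

large_alone = alone >= MINSUP
large_combined = combined >= MINSUP
```

The concept is large when the first page is mined on its own (30%) but falls below the threshold once the two pages are merged (6%).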

21
Proposed Techniques Comparing Two Web Sites
22
Overview
  • Five methods to compare the user site U and the
    competitor site C to help the user find various
    types of interesting and/or unexpected
    information
  • User site U = {u1, u2, ..., uw}
  • Competitor site C = {c1, c2, ..., cv}

23
Finding the Corresponding C Page(s) of a U Page
  • The user is interested in finding some pages in C
    that are similar to a page in U
  • Useful when the user wants to perform a detailed
    analysis of a specific topic, e.g., to see
    whether C has published on the same topic
  • Given a U page uj, use the cosine measure to
    compute the similarity between uj and each page
    in C
  • After the comparison, the pages in C are ranked
    according to their similarity values in
    descending order
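
A minimal sketch of this ranking step, with each page represented as a dictionary of term weights. The page names and weights are invented for illustration, and the helper names are not from the presentation.

```python
import math

def cosine(a, b):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_pages(u_page, c_pages):
    """Return (page name, similarity) pairs in descending order of similarity."""
    scored = [(name, cosine(u_page, vec)) for name, vec in c_pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

u1 = {"data": 2.0, "mining": 2.0}
c_pages = {
    "c1": {"data": 1.0, "mining": 1.0},
    "c2": {"data": 1.0, "classify": 1.0},
    "c3": {"contact": 3.0},
}
ranking = rank_pages(u1, c_pages)
```

Page c1 uses the same terms in the same proportions as u1 and therefore ranks first, while c3 shares no term with u1 and ranks last with similarity 0.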

24
Example
25
Finding Unexpected Terms in a C Page w.r.t. a U
Page
  • Given two similar pages, find unexpected terms
  • Allow the users to obtain the key differences of
    two pages
  • Help the user decide whether to browse the C page
    to find further details
  • Given a U page uj and a C page ci, compare the
    term weights in both documents to obtain those
    unexpected terms in ci w.r.t the terms in uj
  • Unexpectedness value of each term kr in ci w.r.t
    uj

26
Finding Unexpected Terms in a C Page w.r.t. a U
Page (Cont.)
  • After the unexpectedness value for each term kr
    is computed, all the terms in ci are ranked
    according to their unexpTr,i,j values in
    descending order
  • Example: for the unexpected terms in Cpage 1
    w.r.t. Upage 1, the term at rank 1 is "classify"
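
The transcript does not carry the formula for unexpT, so the sketch below assumes one plausible definition: a term scores 1 - tf_u / tf_c when its weight in the U page does not exceed its weight in the C page, and 0 otherwise. That definition, the function name, and the toy weights are all assumptions made for illustration.

```python
def unexpected_terms(c_page, u_page):
    """Rank the terms of a C page by an ASSUMED unexpectedness score w.r.t. a U page.

    Assumption: score = 1 - tf_u / tf_c if tf_u <= tf_c, and 0 otherwise.
    All weights in c_page are expected to be positive.
    """
    scores = {}
    for term, tf_c in c_page.items():
        tf_u = u_page.get(term, 0.0)
        scores[term] = 1.0 - tf_u / tf_c if tf_u <= tf_c else 0.0
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Made-up term weights for one C page and one U page.
c_page_1 = {"classify": 0.8, "data": 0.5, "mining": 0.2}
u_page_1 = {"data": 0.25, "mining": 0.4}

ranked = unexpected_terms(c_page_1, u_page_1)
```

Under this assumption a term that is absent from the U page gets the highest score, and a term that the U page stresses at least as much as the C page gets 0.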

27
Finding Unexpected Pages in C w.r.t. U
  • The pages found this way are often very
    interesting, as they tell the user that the
    competitor site may have some useful pages that
    the user site does not have
  • Combine all the pages in U to form a single
    document Du, and all the pages in C to form
    another single document Dc
  • Compute the unexpectedness value of each term kl
    in Dc w.r.t Du (unexpTl,c,u)
  • The unexpectedness of a page ci w.r.t U
    (unexpPi): the amount of term unexpectedness
    contained in ci

28
Finding Unexpected Pages in C w.r.t. U (Cont.)
  • After all unexpPi values are computed, we rank
    the C pages according to their unexpPi values in
    descending order
  • Example
  • Rank 1: Cpage 2
  • Rank 2: Cpage 3
  • Rank 3: Cpage 1
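
A sketch of the page-ranking procedure under stated assumptions. The transcript gives no formulas, so the code assumes that term weights are frequencies divided by the highest frequency in the combined document, that a term's unexpectedness is 1 - tf_u / tf_c when tf_u <= tf_c and 0 otherwise, and that unexpPi is the mean unexpectedness of the distinct terms in ci. The pages and their contents are invented.

```python
from collections import Counter

def normalized_tf(tokens):
    """Term frequency divided by the highest frequency in the document (assumed scheme)."""
    counts = Counter(tokens)
    top = max(counts.values())
    return {term: count / top for term, count in counts.items()}

def rank_unexpected_pages(c_pages, u_pages):
    """Rank C pages by the mean ASSUMED unexpectedness of the terms they contain."""
    d_u = normalized_tf([t for page in u_pages.values() for t in page])  # Du
    d_c = normalized_tf([t for page in c_pages.values() for t in page])  # Dc
    unexp_t = {}
    for term, tf_c in d_c.items():
        tf_u = d_u.get(term, 0.0)
        unexp_t[term] = 1.0 - tf_u / tf_c if tf_u <= tf_c else 0.0
    unexp_p = {
        name: sum(unexp_t[t] for t in set(page)) / len(set(page))
        for name, page in c_pages.items()  # pages are assumed to be non-empty
    }
    return sorted(unexp_p.items(), key=lambda pair: pair[1], reverse=True)

u_pages = {"Upage 1": ["data", "mining", "data"], "Upage 2": ["mining", "rules"]}
c_pages = {
    "Cpage 1": ["data", "mining", "data", "mining"],
    "Cpage 2": ["classify", "cluster", "classify"],
    "Cpage 3": ["data", "classify"],
}
page_ranking = rank_unexpected_pages(c_pages, u_pages)
```

In this toy data Cpage 2 contains only terms that U never uses and ranks first, while Cpage 1 contains only terms that U already covers and ranks last.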

29
Finding Unexpected Concepts in a C Page w.r.t a U
Page
  • A concept is a set of keywords that occur
    together in the sentences of a page above a
    certain user-specified minimum support (or
    frequency)
  • "information extraction", "extraction of
    information", "information is extracted"
  • Use association rule mining to discover all
    concepts
  • Treat each concept as a term or keyword, and
    apply method 2 and/or method 3
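
Treating a concept as a term can be sketched by giving it a weight, for example the fraction of sentences in which all of its keywords co-occur. That choice of weight, the helper name, and the toy sentences are assumptions made for illustration.

```python
def concept_weights(transactions, concepts):
    """Fraction of sentences (transactions) in which all keywords of a concept co-occur."""
    n = len(transactions)
    return {c: sum(1 for t in transactions if c <= t) / n for c in concepts}

# One concept: the keywords "information" and "extraction" occurring together,
# regardless of word order or of the words in between.
concept = frozenset({"information", "extraction"})

c_sentences = [
    {"information", "extraction", "system"},
    {"extraction", "information"},
    {"web", "page"},
    {"information", "retrieval"},
]
u_sentences = [{"web", "page"}, {"information", "retrieval"}]

w_c = concept_weights(c_sentences, [concept])[concept]
w_u = concept_weights(u_sentences, [concept])[concept]
```

The resulting weights (0.5 for the C page, 0 for the U page) can then be compared in the same way as ordinary term weights.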

30
Finding Unexpected Outgoing Links from C
  • May indicate some useful resources that are of
    additional help to the customers of the competitor
  • Let the set of outgoing links from U be Lu, and
    let the set of outgoing links from C be Lc.
  • The set of unexpected outgoing links in C w.r.t U
    is Lc-Lu
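
The set difference maps directly onto Python sets. In this sketch a link counts as outgoing when its host differs from the host of the site itself; that rule, the helper name, and the example URLs are assumptions.

```python
from urllib.parse import urlparse

def outgoing_links(links, site_host):
    """Keep only the links that point outside the given site."""
    return {url for url in links if urlparse(url).netloc != site_host}

# Hypothetical hosts and URLs, for illustration only.
u_links = {
    "http://user.example/products",
    "http://standards.example/spec",
}
c_links = {
    "http://competitor.example/about",
    "http://standards.example/spec",
    "http://tutorials.example/intro",
}

L_u = outgoing_links(u_links, "user.example")
L_c = outgoing_links(c_links, "competitor.example")
unexpected_links = L_c - L_u  # links that only the competitor points to
```

Only the tutorial link remains, because the standards link is also referenced by the user site.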

31
Proposed Techniques Incorporating the User's
Existing Knowledge
  • Users may have some existing knowledge about the
    application domain and its competitor
  • It enables the system to discover truly
    unexpected information
  • It allows the user to check if his/her
    expectations are correct
  • Express the user's knowledge as keywords,
    concepts, and hypertext links. E consists of Eg
    and Es
  • Eg: all the general items of the domain that the
    user knows about and does not want ranked high
  • Es: specific items of the site that the user
    knows about and does not want ranked high

32
Proposed Techniques Incorporating the User's
Existing Knowledge (Cont.)
  • In the computation, the items in E are added to
    the set of items in U
  • Keywords in E are used in methods 2 and 3
  • Concepts in method 4
  • Outgoing links in method 5
  • When a weight is needed for an item, it takes the
    maximum weight
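
One reading of the last bullet, assumed here, is that an item from E receives the largest weight in use, so that it never ranks as unexpected. The sketch below follows that reading; the function name, the default maximum of 1.0, and the toy data are hypothetical.

```python
def add_expectations(u_weights, expected_items, max_weight=1.0):
    """Return U's term weights extended with the items the user already knows.

    Assumption: each item from E is given max_weight, the largest weight in use.
    """
    merged = dict(u_weights)  # leave the original U weights untouched
    for item in expected_items:
        merged[item] = max_weight
    return merged

u_page = {"data": 0.5, "mining": 0.25}
E = {"classify", "data"}  # keywords the user already expects to see in C

u_with_e = add_expectations(u_page, E)
```

After the merge, "classify" is treated as fully expected even though the user site never mentions it.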

33
System Architecture
34
A Running Example
35
The Crawler Interface
39
Evaluation -- Application Experience
  • WebCompare allows the user to quickly focus on
    those potentially interesting pages, terms, and
    concepts
  • Due to the difficulties of manual analysis,
    before using WebCompare, the users might give up
    after browsing some top-level pages
  • If a page is long, the users often do not read it
    carefully, and thus may miss some useful
    information. WebCompare can summarize each page
    with keywords and concepts

40
Evaluation -- Efficiency
41
Future Work
  • Study the use of metadata and ontology to provide
    more information related to keywords to create a
    more intelligent system
  • Study how the links of a Web site may be used to
    infer more unexpected information
  • May be extended as a methodology for monitoring a
    competitor's Web site
  • Treat the old web pages of C as the existing
    knowledge or the U site
  • Report any unexpected changes to the old pages by
    the competitor