Title: Discovering Business Intelligence Information by Comparing Company Web Sites
1Discovering Business Intelligence Information by
Comparing Company Web Sites
- Unit 6 of Web Intelligence
- Web Mining and Farming
2Introduction
- More and more companies, government organizations, and individuals are publishing their information on the Web
- How can useful or interesting information be found on the Web?
- Keyword-based search
- Manual browsing
- Wrapper-based approaches
- Web query languages
- User-preference approaches
- These approaches only find information that matches the user's specifications
3Introduction (Cont.)
- Finding unexpected information can be very important
- Today, human analysts must browse the Web to identify interesting (including unexpected) pieces of information
- Automated assistance is urgently needed
- Whether a piece of information is interesting or not is subjective
- Similar to the interestingness problem in data mining
4Interestingness Measures
- Interestingness measures
- Unexpectedness: a piece of information is interesting if it is relevant but unknown to the user, or if it contradicts the user's expectations
- Actionability: a piece of information is actionable if the user can do something with it to his/her advantage
- Actionability is a key concept but elusive (so it is decided by the user)
- Information categorization
- Information that is both unexpected and actionable
- Information that is unexpected but not actionable
- Information that is actionable but expected
5Summary of the Proposed Approach
- Aim: find interesting information in a competitor's Web site
- Input
- A user site U (expectation of the user)
- Some additional knowledge E that the user has about the competitor (expectation of the user)
- A competitor site C
- Actions of WebCompare
- Analyze U to extract all the information that represents the user's expectations
- Analyze C and compare the information contained in C with that in U and E to find various types of expected and unexpected information in C
6Summary of the Proposed Approach (Cont.)
- The information in a Web page is represented using two schemes
- Vector space representation: similarities, differences, and the main concepts of text documents can be represented by keywords that appear in the documents
- Concepts
- Combinations of keywords that occur frequently in the sentences of a Web page
- Often represent significant information that the owner wants to emphasize
7Vector Space Representation of Text Documents
8Vector Space Representation of Text Documents
- Each document is described by a set of keywords called index terms (or simply terms)
- An index term is a word whose semantics helps in remembering the document's main themes
- Index terms are used to index and to summarize the document content
- An index term is associated with a weight
9Term Weight
- Two approaches to associating a weight with an index term
- Binary
- the domain contains only the values one and zero.
- Weighted
- the domain is the set of all positive real numbers.
- Ex: an item that discusses petroleum refineries in Mexico
Term        Petroleum  Mexico  Oil  Taxes  Refineries
Binary      1          1       1    0      1
Weighted    2.8        1.6     3.5  0.3    3.1
10Term Weight (Cont.)
- Simple term frequency algorithm
- The weight is equal to the term frequency (TF)
- Emphasizes the use of a particular processing token within an item
- If the word "computer" occurs 15 times within an item, it has a weight of 15
- Problem: normalization
- The longer an item is, the more often a processing token may occur within the item.
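The normalization problem can be illustrated with a minimal sketch. The slides do not say which normalization to use; dividing by the item length is an assumption made here for illustration, and the function names are hypothetical.

```python
from collections import Counter

def term_frequencies(tokens):
    """Raw TF: the weight of a token is its number of occurrences in the item."""
    return Counter(tokens)

def normalized_term_frequencies(tokens):
    """Length-normalized TF (one possible normalization, assumed here)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}

# Two hypothetical items: the long one mentions "computer" more often in
# absolute terms, but less often relative to its length.
short_item = ["computer"] * 3 + ["network"] * 7      # 10 tokens
long_item = ["computer"] * 15 + ["network"] * 85     # 100 tokens

raw_short = term_frequencies(short_item)["computer"]
raw_long = term_frequencies(long_item)["computer"]
norm_short = normalized_term_frequencies(short_item)["computer"]
norm_long = normalized_term_frequencies(long_item)["computer"]

print(raw_short, raw_long)    # raw TF favors the long item
print(norm_short, norm_long)  # normalized TF favors the short item
```

With raw counts the long item looks five times more relevant to "computer"; after normalization the short item ranks higher.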
11Term Weight (Cont.)
- Inverse document frequency
- The weight of a term is inversely related to the number of items in the database that contain the term
- $\mathrm{WEIGHT}_{ij} = \mathrm{TF}_{ij} \times [\log_2(n) - \log_2(\mathrm{IF}_j) + 1]$
- $\mathrm{WEIGHT}_{ij}$: weight assigned to term j in item i
- $\mathrm{TF}_{ij}$: frequency of term j in item i
- $\mathrm{IF}_j$: number of items in the database that have term j in them
- n: number of items (documents) in the database
12Term Weight (Cont.)
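- Ex
- $\mathrm{WEIGHT}_{\text{oil}} = 4 \times (\log_2(2048) - \log_2(128) + 1) = 20$
- $\mathrm{WEIGHT}_{\text{Mexico}} = 8 \times (\log_2(2048) - \log_2(16) + 1) = 64$
- $\mathrm{WEIGHT}_{\text{refinery}} = 10 \times (\log_2(2048) - \log_2(1024) + 1) = 20$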
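The three worked values can be checked with a short script. This is a minimal sketch of the inverse document frequency weighting stated on the preceding slide; the function and parameter names are hypothetical.

```python
import math

def idf_weight(tf, n, items_with_term):
    """TF * (log2(n) - log2(IF) + 1), where n is the number of items in the
    database and IF is the number of items that contain the term."""
    return tf * (math.log2(n) - math.log2(items_with_term) + 1)

n = 2048
# term -> (term frequency in the item, number of items containing the term)
examples = {"oil": (4, 128), "Mexico": (8, 16), "refinery": (10, 1024)}

weights = {term: idf_weight(tf, n, items) for term, (tf, items) in examples.items()}
for term, weight in weights.items():
    print(term, weight)
```

"refinery" has the highest term frequency, yet it ends up with the same weight as "oil" because it appears in half of all items in the database.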
13Term Weight (Cont.)
- Signal weighting
- IDF does not account for the term frequency distribution of the processing token in the items that contain the term.
- The distribution of the frequency of processing tokens within an item can affect the ability to rank items.
- An instance of an event that occurs all the time has less information value than an instance of a seldom-occurring event.
14Similarity Measure
- Measure the similarity between a query and a document
- Similarity measure examples
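The examples shown on this slide did not survive in the transcript. As one illustration, the cosine measure (which the later slides use to compare pages) can be sketched as follows. Representing vectors as dictionaries of term weights is an assumption made here, and the sample weights are illustrative only.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors.

    Each vector is a dict mapping a term to its weight; terms that are
    missing from a vector have weight 0.
    """
    dot = sum(w * b.get(term, 0.0) for term, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

query = {"petroleum": 1.0, "mexico": 1.0}
document = {"petroleum": 2.8, "mexico": 1.6, "oil": 3.5}

print(cosine_similarity(query, document))
```

The value lies between 0 (no shared terms) and 1 (identical direction) when all weights are non-negative, which makes it convenient for ranking.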
15Finding Concepts Using Association Rule Mining
- Association rule example: Cheese → beer (support = 10%, confidence = 80%)
- Each rule is characterized by its support and its confidence
- An association rule mining algorithm works in two steps (Apriori)
- Generate all large (frequent) itemsets that satisfy minsup
- An itemset is simply a set of items
- A large itemset is an itemset that has transaction support above minsup
- Generate all association rules that satisfy minconf using the large itemsets
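The two steps can be sketched as below. This is a simplified, level-wise (Apriori-style) illustration, not an optimized implementation; the function names, the sample transactions, and the use of "greater than or equal to" for both thresholds are assumptions made here.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Step 1: find all itemsets whose support is at least minsup.

    transactions is a list of sets; minsup is a fraction between 0 and 1.
    Returns a dict mapping each large itemset (frozenset) to its support.
    """
    n = len(transactions)
    items = {item for t in transactions for item in t}
    current = {frozenset([item]) for item in items}
    result = {}
    k = 1
    while current:
        # Keep the candidates that meet the minimum support.
        level = {}
        for candidate in current:
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= minsup:
                level[candidate] = support
        result.update(level)
        # Join large k-itemsets into (k+1)-candidates, then prune any
        # candidate that has a k-subset which is not large.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        current = {c for c in joined
                   if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return result

def association_rules(itemsets, minconf):
    """Step 2: derive rules lhs -> rhs with confidence at least minconf."""
    rules = []
    for itemset, support in itemsets.items():
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                confidence = support / itemsets[lhs]
                if confidence >= minconf:
                    rules.append((lhs, itemset - lhs, support, confidence))
    return rules

# Hypothetical market-basket data
baskets = [{"cheese", "beer"}, {"cheese", "beer", "bread"},
           {"cheese"}, {"bread"}, {"beer", "cheese"}]
large = frequent_itemsets(baskets, minsup=0.5)
rules = association_rules(large, minconf=0.7)
for lhs, rhs, support, confidence in rules:
    print(set(lhs), "->", set(rhs), support, confidence)
```

Every subset of a large itemset is itself large, so the lookup `itemsets[lhs]` in the second step always succeeds.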
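17Confidence and Support
- Confidence c of a rule X → Y: the fraction of the transactions in D containing X that also contain Y
- Support s of a rule X → Y: the fraction of the transactions in D that contain both X and Y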
19Finding Concepts Using Association Rule Mining
(Cont.)
- Association rule mining in WebCompare
- The set of items I is the set of keywords in a page
- The keywords in each sentence of the page form a transaction t
- The set of all sentences in the page gives the transaction set T
- If a particular keyword occurs more than once in a sentence, it is counted only once
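The mapping from a page to a transaction set can be sketched as follows. The sentence splitter, the tokenizer, and the tiny stop-word list are simplifying assumptions for illustration; the slides do not describe WebCompare's actual preprocessing.

```python
import re

# Hypothetical, deliberately tiny stop-word list
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "from", "are"}

def page_to_transactions(text):
    """Turn a page into a list of transactions, one per sentence.

    Each transaction is the set of distinct keywords in the sentence, so a
    keyword that occurs several times in one sentence is counted once.
    """
    transactions = []
    for sentence in re.split(r"[.!?]+", text):
        words = re.findall(r"[a-z]+", sentence.lower())
        keywords = {w for w in words if w not in STOP_WORDS}
        if keywords:
            transactions.append(keywords)
    return transactions

page = ("Information extraction is hard. "
        "Extraction of information from the Web is useful.")
T = page_to_transactions(page)
for t in T:
    print(sorted(t))
```

Using a set for each transaction enforces the "only once" rule without any extra bookkeeping.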
20Finding Concepts Using Association Rule Mining
(Cont.)
- WebCompare mines all large itemsets from every page in C and every page in U separately
- Each page of a Web site typically focuses on a specific topic
- If a page is mixed with other pages, interesting concepts that exist in the page may not be found, due to the minimum support constraint
- A concept may be large in one page but not large when that page is combined with another page, because the minimum support is normally specified as a percentage
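A small arithmetic sketch shows why pages are mined separately. All numbers below are hypothetical.

```python
def support(sentences_with_concept, total_sentences):
    """Fraction of sentences (transactions) that contain the concept."""
    return sentences_with_concept / total_sentences

minsup = 0.10  # minimum support given as a percentage (10%)

# Page A has 20 sentences and the concept occurs in 3 of them.
# Page B has 80 sentences and the concept never occurs.
support_alone = support(3, 20)              # page A mined on its own
support_combined = support(3 + 0, 20 + 80)  # pages A and B mixed together

print(support_alone, support_alone >= minsup)
print(support_combined, support_combined >= minsup)
```

The concept is large in page A alone (15%) but falls below the threshold (3%) once the unrelated page B dilutes the transaction set.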
21Proposed Techniques Comparing Two Web Sites
22Overview
- Five methods to compare the user site U and the competitor site C to help the user find various types of interesting and/or unexpected information
- User site U = {u1, u2, ..., uw}
- Competitor site C = {c1, c2, ..., cv}
23Finding the Corresponding C Page(s) of a U Page
- The user is interested in finding pages in C that are similar to a page in U
- Useful when the user wants to perform a detailed analysis of a specific topic, to see whether C has published on the same topic
- Given a U page uj, use the cosine measure to compute the similarity between uj and each page in C
- After the comparison, the pages in C are ranked according to their similarity values in descending order
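The ranking step can be sketched as below. Raw term counts are used as weights and whitespace splitting as tokenization; both are simplifying assumptions, and the page names and texts are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Term-frequency vector of a page (whitespace tokenization, assumed)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine measure between two Counter vectors."""
    dot = sum(w * b[term] for term, w in a.items())  # Counter gives 0 if absent
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_c_pages(u_page, c_pages):
    """Return (page name, similarity) pairs, most similar C page first."""
    u_vec = tf_vector(u_page)
    scored = [(name, cosine(u_vec, tf_vector(text)))
              for name, text in c_pages.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

u_page = "data mining classification"
c_pages = {
    "Cpage 1": "data mining clustering",
    "Cpage 2": "company news and events",
    "Cpage 3": "data mining classification tools",
}
ranking = rank_c_pages(u_page, c_pages)
for name, score in ranking:
    print(name, round(score, 3))
```

In this toy data the page that shares all three terms with the U page ranks first, and the page with no shared terms ranks last with similarity 0.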
24Example
25Finding Unexpected Terms in a C Page w.r.t. a U
Page
- Given two similar pages, find unexpected terms
- Allows the user to obtain the key differences between the two pages
- Helps the user decide whether to browse the C page to find further details
- Given a U page uj and a C page ci, compare the term weights in both documents to obtain the unexpected terms in ci w.r.t. the terms in uj
- Compute the unexpectedness value of each term kr in ci w.r.t. uj
26Finding Unexpected Terms in a C Page w.r.t. a U
Page (Cont.)
- After the unexpectedness value for each term kr is computed, all the terms in ci are ranked according to their unexpTr,i,j values in descending order
- Example: we are interested in unexpected terms in Cpage 1 w.r.t. Upage 1 → Rank 1: classify
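The formula for the unexpectedness value did not survive in the transcript. The sketch below assumes one plausible definition: with tf denoting term frequency normalized by the most frequent term in the page, a term's unexpectedness is 1 - tf_u / tf_c when tf_u <= tf_c, and 0 otherwise. Both this definition and the function names are assumptions, not the slides' stated method.

```python
from collections import Counter

def normalized_tf(tokens):
    """Term frequency divided by the count of the most frequent term."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

def unexpected_terms(c_tokens, u_tokens):
    """Rank the terms of a C page by an ASSUMED unexpectedness value.

    A term scores high when it is prominent in the C page but rare or
    absent in the U page; terms at least as prominent in U score 0.
    """
    tf_c = normalized_tf(c_tokens)
    tf_u = normalized_tf(u_tokens)
    scores = {}
    for term, w_c in tf_c.items():
        w_u = tf_u.get(term, 0.0)
        scores[term] = 1 - w_u / w_c if w_u <= w_c else 0.0
    # Descending by score; ties broken alphabetically for a stable order
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))

# Hypothetical token lists for one C page and one U page
c_page = "classify classify classify data mining".split()
u_page = "data data mining cluster".split()

ranked = unexpected_terms(c_page, u_page)
for term, score in ranked:
    print(term, score)
```

In this toy data "classify" ranks first because it dominates the C page and never appears in the U page.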
27Finding Unexpected Pages in C w.r.t. U
- The pages found are often very interesting, as they tell the user that the competitor site may have some useful pages that the user site does not have
- Combine all the pages in U to form a single document Du, and all the pages in C to form another single document Dc
- Compute the unexpectedness value of each term kl in Dc w.r.t. Du (unexpTl,c,u)
- The unexpectedness of a page ci w.r.t. U (unexpPi): the amount of term unexpectedness contained in ci
28Finding Unexpected Pages in C w.r.t. U (Cont.)
- After all unexpPi values are computed, we rank the C pages according to their unexpPi values in descending order
- Example
- Rank 1: Cpage 2
- Rank 2: Cpage 3
- Rank 3: Cpage 1
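The page-level ranking can be sketched as follows. The transcript does not preserve the exact formulas, so two assumptions are made here: term unexpectedness is 1 - tf_u / tf_c when tf_u <= tf_c (0 otherwise), and a page's unexpectedness is the average term unexpectedness over its distinct terms. A plain sum would be an equally reasonable reading. The page contents are hypothetical.

```python
from collections import Counter

def normalized_tf(tokens):
    """Term frequency divided by the count of the most frequent term."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}

def term_unexpectedness(tf_c, tf_u):
    """ASSUMED definition: 1 - tf_u / tf_c if tf_u <= tf_c, else 0."""
    scores = {}
    for term, w_c in tf_c.items():
        w_u = tf_u.get(term, 0.0)
        scores[term] = 1 - w_u / w_c if w_u <= w_c else 0.0
    return scores

def rank_unexpected_pages(c_pages, u_pages):
    """Rank C pages by the average unexpectedness of their distinct terms."""
    d_u = [tok for tokens in u_pages.values() for tok in tokens]  # document Du
    d_c = [tok for tokens in c_pages.values() for tok in tokens]  # document Dc
    unexp_t = term_unexpectedness(normalized_tf(d_c), normalized_tf(d_u))
    scores = {}
    for name, tokens in c_pages.items():
        terms = set(tokens)
        scores[name] = sum(unexp_t[t] for t in terms) / len(terms) if terms else 0.0
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

u_pages = {"Upage 1": "data mining tools".split()}
c_pages = {
    "Cpage 1": "data mining tools".split(),
    "Cpage 2": "wireless sensor hardware".split(),
    "Cpage 3": "data mining hardware".split(),
}
page_ranking = rank_unexpected_pages(c_pages, u_pages)
for name, score in page_ranking:
    print(name, round(score, 3))
```

The page whose terms are all absent from U ranks first, and the page that repeats U's content ranks last.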
29Finding Unexpected Concepts in a C Page w.r.t a U
Page
- A concept is a set of keywords that occur together in the sentences of a page above a certain user-specified minimum support (or frequency)
- Ex: "information extraction", "extraction of information", and "information is extracted" all express the same concept
- Use association rule mining to discover all concepts
- Treat each concept as a term or keyword, and apply method 2 and/or method 3
30Finding Unexpected Outgoing Links from C
- May indicate useful resources that are of additional help to the customers of the competitor
- Let the set of outgoing links from U be Lu, and let the set of outgoing links from C be Lc.
- The set of unexpected outgoing links in C w.r.t. U is Lc - Lu
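The set difference can be sketched directly with Python sets. Treating a link as outgoing when its host differs from the site's own host is an assumption made here, and all URLs are hypothetical.

```python
from urllib.parse import urlparse

def outgoing_links(links, site_host):
    """Keep only links that point to a host other than the site itself.

    Relative links (empty host) are treated as internal.
    """
    return {link for link in links
            if urlparse(link).netloc not in ("", site_host)}

def unexpected_links(c_outgoing, u_outgoing):
    """Outgoing links of C that do not appear among U's outgoing links."""
    return c_outgoing - u_outgoing

u_links = {"https://user.example/products",
           "https://standards.example/spec",
           "/about"}
c_links = {"https://competitor.example/home",
           "https://standards.example/spec",
           "https://partner.example/tools"}

Lu = outgoing_links(u_links, "user.example")
Lc = outgoing_links(c_links, "competitor.example")
print(unexpected_links(Lc, Lu))
```

The link both sites share is filtered out, leaving only the resource that the competitor points to and the user site does not.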
31Proposed Techniques Incorporating the User's
Existing Knowledge
- Users may have some existing knowledge about the application domain and the competitor
- It enables the system to discover truly unexpected information
- It allows the user to check whether his/her expectations are correct
- Express the user's knowledge as keywords, concepts, and hypertext links. E consists of Eg and Es
- Eg: all the general items of the domain that the user knows about and does not want ranked high
- Es: specific items of the site that the user knows about and does not want ranked high
32Proposed Techniques Incorporating the User's
Existing Knowledge (Cont.)
- In computation, items in E are added to the set of items in U
- Keywords in E are used in methods 2 and 3
- Concepts in method 4
- Outgoing links in method 5
- When a weight is needed for an item, it takes the maximum weight
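One way to read the last point is sketched below: every item in E is added to U's items with the largest weight found in U, so that known items cannot be ranked as unexpected. This interpretation of "maximum weight", the function name, and the sample weights are assumptions.

```python
def merge_expectations(u_weights, e_items):
    """Add the user's known items E to the weighted items of U.

    Each item in E receives the maximum weight present in U (1.0 if U is
    empty), overriding any lower weight the item already had.
    """
    max_weight = max(u_weights.values(), default=1.0)
    merged = dict(u_weights)
    for item in e_items:
        merged[item] = max_weight
    return merged

# Hypothetical normalized keyword weights extracted from U
u_weights = {"data": 1.0, "mining": 0.5}
# Hypothetical keywords the user already knows about the competitor
e_keywords = {"mining", "wireless"}

merged = merge_expectations(u_weights, e_keywords)
print(merged)
```

With the merged weights, a term such as "wireless" is treated as fully expected even though it never appears in U.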
33System Architecture
34A Running Example
35The Crawler Interface
39Evaluation -- Application Experience
- It allows the user to quickly focus on potentially interesting pages, terms, and concepts
- Due to the difficulties of manual analysis, before using WebCompare, users may give up after browsing some top-level pages
- If a page is long, users often do not read it carefully, and thus may miss some useful information. WebCompare can summarize each page with keywords and concepts
40Evaluation -- Efficiency
41Future Works
- Study the use of metadata and ontologies to provide more information related to keywords and create a more intelligent system
- Study how the links of a Web site may be used to infer more unexpected information
- The approach may be extended into a methodology for monitoring a competitor's Web site
- Treat the old Web pages of C as the existing knowledge or as the U site
- Report any unexpected changes that the competitor makes to the old pages