Discovering Business Intelligence Information by Comparing Company Web Sites presentation

1
Discovering Business Intelligence Information by
Comparing Company Web Sites

Unit 6 of Web Intelligence
Web Mining and Farming

2
Introduction

More and more companies, government
organizations, and individuals are publishing
their information on the Web
How to find useful or interesting information
on the Web?
Keyword-based search
Manual browsing
Wrapper-based approaches
Web query languages
User-preference approaches
They only find information that matches the
user's specifications

3
Introduction (Cont.)

Finding unexpected information can be very
important
Human analysts are needed to browse the Web and
identify the interesting (including unexpected)
pieces of information
Automated assistance is urgently needed
Whether a piece of information is interesting or
not is subjective
Similar to the interestingness problem in data
mining

4
Interestingness Measures

Interestingness measures
Unexpectedness: a piece of information is
interesting if it is relevant but unknown to the
user, or if it contradicts the user's expectations
Actionability: a piece of information is
actionable if the user can do something with it
to his/her advantage
A key concept, but elusive (so it is decided by the user)
Information categorization
Information that is both unexpected and
actionable
Information that is unexpected but not actionable
Information that is actionable but expected

5
Summary of the Proposed Approach

Aim: find interesting information on a
competitor's Web site
Input
A user site U (expectation of the user)
Some additional knowledge E that the user has
about its competitor (expectation of the user)
A competitor site C
Actions of WebCompare
Analyze U to extract all the information that
represents the user's expectations
Analyze C and compare the information it contains
with U and E to find various types of
expected and unexpected information in C

6
Summary of the Proposed Approach (Cont.)

The information in a Web page is represented
using two schemes
Vector space representation: similarities,
differences, and the main concepts of text
documents can be represented by the keywords that
appear in the documents
Concepts
Combinations of keywords that occur frequently in
the sentences of a Web page
Often represent significant information that the
owner wants to emphasize

7
Vector Space Representation of Text Documents
8
Vector Space Representation of Text Documents

Each document is described by a set of keywords
called index terms (or simply terms)
An index term is simply a word whose semantics
helps in remembering the document's main themes
Index terms are used to index and to summarize
the document content
An index term is associated with a weight

9
Term Weight

Two approaches to associate a weight with an
index term
Binary
the domain contains only the values one and zero
Weighted
the domain is the set of all positive real
numbers
Example: an item that discusses petroleum refineries in Mexico

            Petroleum  Mexico  Oil   Taxes  Refineries
Binary      1          1       1     0      1
Weighted    2.8        1.6     3.5   0.3    3.1
10
Term Weight (Cont.)

Simple term frequency algorithm
The weight is equal to the term frequency (TF)
Emphasizes the use of a particular processing token
within an item
If the word "computer" occurs 15 times within an
item, it has a weight of 15
Problem: normalization is needed
The longer an item is, the more often a
processing token may occur within the item
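
The normalization problem can be illustrated with a small sketch. The slides do not say which normalization is intended; dividing by the highest term frequency in the item is one common choice and is an assumption here, as are the function names and the toy items.

```python
from collections import Counter

def term_frequencies(tokens):
    """Raw TF weights: how often each processing token occurs in the item."""
    return Counter(tokens)

def max_normalized(tf):
    """Assumed normalization: divide by the largest frequency in the item."""
    peak = max(tf.values())
    return {term: count / peak for term, count in tf.items()}

# Hypothetical items: a short one and a much longer one.
short_item = ["computer"] * 3 + ["network"]
long_item = ["computer"] * 15 + ["network"] * 5 + ["filler"] * 30

raw_short = term_frequencies(short_item)
raw_long = term_frequencies(long_item)

# Raw TF favors the long item (15 vs. 3), although "computer" is the
# dominant token only in the short item.
print(raw_short["computer"], raw_long["computer"])
print(max_normalized(raw_short)["computer"], max_normalized(raw_long)["computer"])
```

With raw counts the longer item looks five times more relevant to "computer"; after normalization the short item, where the word dominates, scores higher.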

11
Term Weight (Cont.)

Inverse document frequency
The weight is inversely related to the number of
items in the database that contain the index term
$WEIGHT_{ij} = TF_{ij} \times [\log_2(n) - \log_2(IF_j) + 1]$
$WEIGHT_{ij}$: weight assigned to term j in item i
$TF_{ij}$: frequency of term j in item i
$IF_j$: number of items in the database that have
term j in them
$n$: number of items (documents) in the database

12
Term Weight (Cont.)

Example
$Weight_{oil} = 4 \times (\log_2(2048) - \log_2(128) + 1) = 20$
$Weight_{Mexico} = 8 \times (\log_2(2048) - \log_2(16) + 1) = 64$
$Weight_{Refinery} = 10 \times (\log_2(2048) - \log_2(1024) + 1) = 20$
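
The three example weights can be checked with a few lines of Python. This is only a sketch: the function name `idf_weight` and the dictionary layout are ours, and the database size of 2048 items is read off the examples.

```python
from math import log2

def idf_weight(tf, n_items, items_with_term):
    """WEIGHT_ij = TF_ij * (log2(n) - log2(IF_j) + 1)."""
    return tf * (log2(n_items) - log2(items_with_term) + 1)

# term -> (frequency in the item, number of items containing the term)
examples = {"oil": (4, 128), "Mexico": (8, 16), "refinery": (10, 1024)}
weights = {term: idf_weight(tf, 2048, items)
           for term, (tf, items) in examples.items()}
print(weights)  # {'oil': 20.0, 'Mexico': 64.0, 'refinery': 20.0}
```

"Mexico" gets the highest weight because it is both frequent in the item and rare in the database; "refinery" is more frequent in the item but appears in half of all items.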

13
Term Weight (Cont.)

Signal weighting
IDF does not account for how the frequency of a
processing token is distributed across the items
that contain the term.
The distribution of the frequency of processing
tokens within an item can affect the ability to
rank items.

An instance of an event that occurs all the time
has less information value than an instance of a
seldom occurring event.
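
The slides give no formula for signal weighting. One textbook-style formulation, assumed here rather than taken from the slides, subtracts the average information (entropy) of a token's frequency distribution from the log of its total frequency. The function name and the two toy distributions are hypothetical.

```python
from math import log2

def signal(freqs):
    """Assumed formulation: Signal = log2(TOTF) - AVE_INFO.

    freqs holds the token's frequency in each item that contains it;
    AVE_INFO is the entropy of the distribution p_k = freq_k / TOTF.
    """
    total = sum(freqs)
    ave_info = -sum((f / total) * log2(f / total) for f in freqs if f > 0)
    return log2(total) - ave_info

# Same total frequency (250) over five items, different distributions.
uniform = [50, 50, 50, 50, 50]   # token spread evenly: little ranking power
skewed = [210, 10, 10, 10, 10]   # token concentrated in one item

print(signal(uniform), signal(skewed))
```

Under this assumption the evenly spread token receives the lower signal value, which matches the intuition that an event occurring everywhere carries less information than one that is concentrated.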

14
Similarity Measure

Measure the similarity between a query and a
document
Similarity measure examples

15
Finding Concepts Using Association Rule Mining

Example rule
Cheese → beer [support = 10%, confidence = 80%]
A rule is characterized by two measures: support and confidence
An association mining algorithm works in two
steps (Apriori)
Generate all large (frequent) itemsets that
satisfy minsup
An itemset is simply a set of items
A large itemset is an itemset that has
transaction support above minsup
Generate all association rules that satisfy
minconf using the large itemsets
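
The two steps can be sketched in Python as follows. This is a compact illustration of the level-wise idea, not an optimized Apriori implementation; the function names, the toy transactions, and the use of `>=` for "satisfies minsup" are our assumptions.

```python
from itertools import combinations

def frequent_itemsets(transactions, minsup):
    """Step 1: level-wise search for all itemsets with support >= minsup."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    candidates = {frozenset([item]) for t in transactions for item in t}
    large, k = {}, 1
    while candidates:
        level = {c: support(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= minsup}
        large.update(level)
        k += 1
        # Join large (k-1)-itemsets into k-itemsets, then prune any candidate
        # that has a (k-1)-subset which is not large.
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        candidates = {c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
    return large

def association_rules(large, minconf):
    """Step 2: emit (lhs, rhs, support, confidence) with confidence >= minconf."""
    rules = []
    for itemset, sup in large.items():
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                conf = sup / large[lhs]  # every subset of a large itemset is large
                if conf >= minconf:
                    rules.append((lhs, itemset - lhs, sup, conf))
    return rules

# Hypothetical transactions.
transactions = [{"cheese", "beer"}, {"cheese", "beer", "bread"},
                {"cheese", "bread"}, {"beer"}, {"cheese", "beer"}]
large = frequent_itemsets(transactions, minsup=0.4)
rules = association_rules(large, minconf=0.7)
for lhs, rhs, sup, conf in rules:
    print(sorted(lhs), "->", sorted(rhs), f"support={sup:.2f} confidence={conf:.2f}")
```

On this toy data the pair {beer, bread} falls below the minimum support, so the three-item candidate is pruned without being counted.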

17
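Confidence c of a rule X → Y: the percentage of
transactions in D containing X that also contain Y
Support s of a rule X → Y: the percentage of
transactions in D that contain both X and Y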

19
Finding Concepts Using Association Rule Mining
(Cont.)

Association rule mining in WebCompare
The set of items I is the set of keywords in a
page
The keywords in each sentence of the page form a
transaction t
The set of all sentences in the page gives the
transaction set T
If a particular keyword occurs more than once in
a sentence, consider it only once
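
A minimal sketch of how a page could be turned into a transaction set, assuming a naive sentence splitter, a tiny illustrative stopword list, and no stemming (none of these details come from the slides):

```python
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not WebCompare's list).
STOPWORDS = {"a", "an", "and", "from", "in", "into", "is", "of", "the", "to", "we"}

def page_to_transactions(text):
    """One transaction per sentence; a keyword counts once per sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [set(re.findall(r"[a-z]+", s.lower())) - STOPWORDS for s in sentences]

# Hypothetical page text; "information" occurs twice in the first sentence.
page = ("Information extraction turns raw text into structured information. "
        "We study extraction of information from text. "
        "Pages change often.")

T = page_to_transactions(page)
pair_counts = Counter(pair for t in T for pair in combinations(sorted(t), 2))
print(pair_counts[("extraction", "information")])  # 2
```

Because each transaction is a set, the repeated keyword in the first sentence is counted only once, as the slide requires.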

20
Finding Concepts Using Association Rule Mining
(Cont.)

WebCompare mines all large itemsets from every
page in C and every page in U separately
Each page of a Web site typically focuses on a
specific topic
If we mix it with other pages, we may not be able
to find interesting concepts that exist in the
page, due to the minimum support constraint
A concept may be large in one page, but may not
be large when it is combined with another page,
as the minimum support is normally specified in
percentage
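
The effect of a percentage-based minimum support can be seen with a toy calculation. The pages, the concept, and the 20% threshold below are all hypothetical.

```python
def support(transactions, concept):
    """Fraction of transactions (sentences) that contain the whole concept."""
    return sum(concept <= t for t in transactions) / len(transactions)

concept = {"data", "mining"}
page_a = [{"data", "mining", "tools"}] * 4 + [{"contact", "us"}] * 6   # 10 sentences
page_b = [{"company", "news"}] * 30                                    # 30 sentences
minsup = 0.2

alone = support(page_a, concept)              # 4 / 10 = 0.4
combined = support(page_a + page_b, concept)  # 4 / 40 = 0.1
print(alone >= minsup, combined >= minsup)    # True False
```

The concept is large in page A on its own but disappears once the page is pooled with an unrelated page, which is why each page is mined separately.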

21
Proposed Techniques Comparing Two Web Sites
22
Overview

Five methods to compare the user site U and the
competitor site C to help the user find various
types of interesting and/or unexpected
information
User site U = {u1, u2, ..., uw}
Competitor site C = {c1, c2, ..., cv}

23
Finding the Corresponding C Page(s) of a U Page

The user is interested in finding some pages in C
that are similar to a page in U
Useful when the user wants to perform a detailed
analysis of a specific topic, e.g., to see whether C has
published on the same topic
Given a U page uj, use the cosine measure to
compute the similarity between uj and each page
in C
After the comparison, the pages in C are ranked
according to their similarity values in
descending order
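
A minimal sketch of this ranking step, assuming raw term counts as weights (the slides leave the weighting scheme open) and hypothetical page contents:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two term-weight vectors (Counters)."""
    dot = sum(w * b[term] for term, w in a.items())
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def rank_c_pages(u_page, c_pages):
    """Rank the pages of C by similarity to one U page, most similar first."""
    scores = {name: cosine(u_page, vec) for name, vec in c_pages.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical pages, already reduced to keywords.
u_page = Counter("data mining data classification".split())
c_pages = {
    "Cpage 1": Counter("data mining classification association".split()),
    "Cpage 2": Counter("contact address phone".split()),
    "Cpage 3": Counter("data warehouse products".split()),
}
ranking = rank_c_pages(u_page, c_pages)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Pages that share no keyword with the U page get a similarity of zero and end up at the bottom of the list.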

24
Example
25
Finding Unexpected Terms in a C Page w.r.t. a U
Page

Given two similar pages, find unexpected terms
Allow the users to obtain the key differences of
two pages
Help the user decide whether to browse the C page
to find further details
Given a U page uj and a C page ci, compare the
term weights in both documents to obtain those
unexpected terms in ci w.r.t the terms in uj
Unexpectedness value of each term kr in ci w.r.t
uj

26
Finding Unexpected Terms in a C Page w.r.t. a U
Page (Cont.)

After the unexpectedness value for each term kr
is computed, all the terms in ci are ranked
according to their unexpTr,i,j values in
descending order
Example: we are interested in unexpected terms in
Cpage 1 w.r.t. Upage 1; the rank 1 term is "classify"
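
The formula for the unexpectedness value did not survive in this transcript. The sketch below therefore uses an assumed definition: term frequencies are normalized by the highest frequency in the page, and a term in ci is unexpected to the degree that its weight in ci exceeds its weight in uj, i.e. 1 - tf(uj)/tf(ci) when that ratio is at most 1 and 0 otherwise. The function names and page contents are hypothetical.

```python
from collections import Counter

def normalized_tf(tokens):
    """Term frequency divided by the highest frequency in the page."""
    counts = Counter(tokens)
    peak = max(counts.values())
    return {term: count / peak for term, count in counts.items()}

def unexpected_terms(c_tokens, u_tokens):
    """Rank the terms of a C page by assumed unexpectedness w.r.t. a U page."""
    tf_c, tf_u = normalized_tf(c_tokens), normalized_tf(u_tokens)
    scores = {}
    for term, weight_c in tf_c.items():
        ratio = tf_u.get(term, 0.0) / weight_c
        scores[term] = 1.0 - ratio if ratio <= 1.0 else 0.0
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical keyword lists for one C page and one U page.
c_tokens = ["classify"] * 4 + ["mining"] * 2 + ["data"]
u_tokens = ["data"] * 4 + ["mining"]

ranking = unexpected_terms(c_tokens, u_tokens)
print(ranking)  # [('classify', 1.0), ('mining', 0.5), ('data', 0.0)]
```

A term absent from the U page gets the maximum value, a term emphasized more in the C page gets a partial value, and a term the U page already emphasizes at least as much gets zero.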

27
Finding Unexpected Pages in C w.r.t. U

Such pages are often very interesting,
as they tell the user that the competitor site
may have some useful pages that the user site
does not have
Combine all the pages in U to form a single
document Du, and all the pages in C to form
another single document Dc
Compute the unexpectedness value of each term kl
in Dc w.r.t Du (unexpTl,c,u)
The unexpectedness of a page ci w.r.t. U
(unexpPi): the amount of term unexpectedness
contained in ci

28
Finding Unexpected Pages in C w.r.t. U (Cont.)

After all unexpPi values are computed, we rank
the C pages according to their unexpPi values in
descending order
Example
Rank 1: Cpage 2
Rank 2: Cpage 3
Rank 3: Cpage 1
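
A sketch of the page-level ranking under stated assumptions: term unexpectedness is computed between the combined documents Dc and Du with the assumed definition max(0, 1 - tf(Du)/tf(Dc)) on max-normalized frequencies, and "the amount of term unexpectedness contained in ci" is taken to be the average over the distinct terms of the page. Neither formula is given in the transcript, pages are assumed non-empty, and the toy data is ours.

```python
from collections import Counter

def normalized_tf(tokens):
    counts = Counter(tokens)
    peak = max(counts.values())
    return {term: count / peak for term, count in counts.items()}

def term_unexpectedness(dc_tokens, du_tokens):
    """Assumed unexpT for every term of Dc w.r.t. Du."""
    tf_c, tf_u = normalized_tf(dc_tokens), normalized_tf(du_tokens)
    return {t: max(0.0, 1.0 - tf_u.get(t, 0.0) / w) for t, w in tf_c.items()}

def rank_unexpected_pages(c_pages, u_pages):
    """Rank C pages by the assumed unexpP (mean unexpT of their distinct terms)."""
    dc = [tok for page in c_pages.values() for tok in page]
    du = [tok for page in u_pages.values() for tok in page]
    unexp_t = term_unexpectedness(dc, du)
    unexp_p = {name: sum(unexp_t[t] for t in set(page)) / len(set(page))
               for name, page in c_pages.items()}
    return sorted(unexp_p.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical sites, each page already reduced to keywords.
u_pages = {"Upage 1": "data mining data tools".split(),
           "Upage 2": "contact us".split()}
c_pages = {"Cpage 1": "data mining tools".split(),
           "Cpage 2": "text classification text web".split(),
           "Cpage 3": "data classification contact".split()}

ranking = rank_unexpected_pages(c_pages, u_pages)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

In this toy data the page whose terms never appear on the user site ranks first, and the page that only repeats the user site's vocabulary ranks last.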

29
Finding Unexpected Concepts in a C Page w.r.t a U
Page

A concept is a set of keywords that occur
together in the sentences of a page above a
certain user-specified minimum support (or
frequency)
"information extraction", "extraction of
information", "information is extracted"
Use association rule mining to discover all
concepts
Treat each concept as a term or keyword, and
apply method 2 and/or method 3

30
Finding Unexpected Outgoing Links from C

Such links may indicate useful resources that are of
additional help to the competitor's customers
Let the set of outgoing links from U be Lu, and
let the set of outgoing links from C be Lc.
The set of unexpected outgoing links in C w.r.t. U
is Lc - Lu
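
The set difference can be sketched with the Python standard library. The sketch assumes that "outgoing" means a link to a different host than the site itself; the class and function names, host names, and HTML snippets are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects the absolute targets of all <a href="..."> tags in a page."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.add(urljoin(self.base_url, href))

def outgoing_links(pages, site_host):
    """pages maps page URL -> HTML; keep only links that leave site_host."""
    found = set()
    for url, html in pages.items():
        collector = LinkCollector(url)
        collector.feed(html)
        found |= {link for link in collector.links
                  if urlparse(link).netloc not in ("", site_host)}
    return found

u_pages = {"http://user.example/index.html":
           '<a href="/about.html">About</a> '
           '<a href="http://standards.example/spec">Spec</a>'}
c_pages = {"http://competitor.example/index.html":
           '<a href="products.html">Products</a> '
           '<a href="http://standards.example/spec">Spec</a> '
           '<a href="http://tutorials.example/intro">Tutorial</a>'}

Lu = outgoing_links(u_pages, "user.example")
Lc = outgoing_links(c_pages, "competitor.example")
print(Lc - Lu)  # {'http://tutorials.example/intro'}
```

Links that both sites share drop out of the difference, so only the resource that the competitor alone points to is reported.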

31
Proposed Techniques Incorporating the User's
Existing Knowledge

Users may have some existing knowledge about the
application domain and the competitor
It enables the system to discover truly unexpected
information
It allows the user to check whether his/her
expectations are correct
Express the user's knowledge as keywords,
concepts, and hypertext links. E consists of Eg and Es
Eg: all the general items of the domain that the
user knows about and does not want ranked
high
Es: specific items of the site that the user
knows about and does not want ranked high

32
Proposed Techniques Incorporating the User's
Existing Knowledge (Cont.)

In computation, the items in E are added to the set of
items in U
Keywords in E are used in methods 2 and 3
Concepts in method 4
Outgoing links in method 5
When a weight is needed for an item, it takes the
maximum weight
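
One way to read the last point is that every item of E enters U's term vector with the maximum weight, so that it can never be ranked as unexpected. The sketch below follows that reading; the function name, the maximum weight of 1.0 (normalized frequencies), and the unexpectedness formula max(0, 1 - tf(U)/tf(C)) are assumptions, not taken from the slides.

```python
def add_user_knowledge(tf_u, known_keywords, max_weight=1.0):
    """Return a copy of U's term weights with the items of E added at max_weight."""
    merged = dict(tf_u)
    for keyword in known_keywords:
        merged[keyword] = max(merged.get(keyword, 0.0), max_weight)
    return merged

def unexpectedness(weight_c, weight_u):
    """Assumed term unexpectedness: max(0, 1 - tf_u / tf_c)."""
    return max(0.0, 1.0 - weight_u / weight_c)

tf_u = {"data": 1.0, "mining": 0.5}   # hypothetical weights from the user site
E = {"classification", "mining"}      # hypothetical keywords the user already knows
merged = add_user_knowledge(tf_u, E)

weight_c = 0.8  # hypothetical weight of "classification" on a competitor page
before = unexpectedness(weight_c, tf_u.get("classification", 0.0))
after = unexpectedness(weight_c, merged["classification"])
print(before, after)  # 1.0 0.0
```

Before the user's knowledge is added, "classification" looks maximally unexpected; once it is part of E it drops to zero and no longer crowds out terms the user has not seen.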

33
System Architecture
34
A Running Example
35
The Crawler Interface
Evaluation -- Application Experience

It allows the user to quickly focus on those
potentially interesting pages, terms, and
concepts
Due to the difficulties of manual analysis,
before using WebCompare, the users might give up
after browsing some top-level pages
If a page is long, the users often do not read it
carefully, and thus may miss some useful
information. WebCompare can summarize each page
with keywords and concepts

40
Evaluation -- Efficiency
41
Future Work

Study the use of metadata and ontology to provide
more information related to keywords to create a
more intelligent system
Study how the links of a Web site may be used to
infer more unexpected information
May be extended as a methodology for monitoring a
competitor's Web site
Treat the old web pages of C as the existing
knowledge or the U site
Report any unexpected changes to the old pages by
the competitor
