Title: WEB MINING Prof. Navneet Goyal BITS, Pilani
2. Web Mining
- Web mining is the use of data mining techniques to automatically discover and extract information from web documents/services
- Discovering useful information from the World-Wide Web and its usage patterns
- My definition: using data mining techniques to make the web more useful and more profitable (for some) and to increase the efficiency of our interaction with the web
3. Web Mining
- Data mining techniques
  - Association rules
  - Sequential patterns
  - Classification
  - Clustering
  - Outlier discovery
- Applications to the Web
  - E-commerce
  - Information retrieval (search)
  - Network management
4. Examples of Discovered Patterns
- Association rules
  - 98% of AOL users also have E-trade accounts
- Classification
  - People with age less than 40 and salary > 40k trade on-line
- Clustering
  - Users A and B access similar URLs
- Outlier detection
  - User A spends more than twice the average amount of time surfing on the Web
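A rule such as "AOL users also have E-trade accounts" is usually scored by its support and confidence. The sketch below shows one way to compute both; the function names and the five-user data set are made up for illustration and do not come from the slides.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of antecedent -> consequent: support(both) / support(antecedent).
    Assumes the antecedent occurs at least once."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical data: the set of services each user subscribes to.
users = [
    {"AOL", "E-trade"},
    {"AOL", "E-trade"},
    {"AOL"},
    {"E-trade"},
    {"AOL", "E-trade"},
]
rule_support = support(users, {"AOL", "E-trade"})
rule_confidence = confidence(users, {"AOL"}, {"E-trade"})
```

Here three of the four AOL users also hold an E-trade account, so the rule's confidence is about 0.75.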
5. Web Mining
- The WWW is a huge, widely distributed, global information service centre for
  - Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
  - Hyper-link information
  - Access and usage information
- The WWW provides rich sources of data for data mining
6. Why Mine the Web?
- Enormous wealth of information on the Web
  - Financial information (e.g. stock quotes)
  - Book/CD/Video stores (e.g. Amazon)
  - Restaurant information (e.g. Zagats)
  - Car prices (e.g. Carpoint)
- Lots of data on user access patterns
  - Web logs contain the sequence of URLs accessed by users
- Possible to mine interesting nuggets of information
  - People who ski also travel frequently to Europe
  - Tech stocks have corrections in the summer and rally from November until February
7. Why is Web Mining Different?
- The Web is a huge collection of documents, except that it also has
  - Hyper-link information
  - Access and usage information
- The Web is very dynamic
  - New pages are constantly being generated
- Challenge: develop new Web mining algorithms and adapt traditional data mining algorithms to
  - Exploit hyper-links and access patterns
  - Be incremental
8. Web Mining Applications
- E-commerce (infrastructure)
  - Generate user profiles
  - Targeted advertising
  - Fraud
  - Similar image retrieval
- Information retrieval (search) on the Web
  - Automated generation of topic hierarchies
  - Web knowledge bases
  - Extraction of schema for XML documents
- Network management
  - Performance management
  - Fault management
9. User Profiling
- Important for improving customization
  - Provide users with pages and advertisements of interest
  - Example profiles: on-line trader, on-line shopper
- Generate user profiles based on their access patterns
  - Cluster users based on frequently accessed URLs
  - Use a classifier to generate a profile for each cluster
- Engage Technologies
  - Tracks web traffic to create anonymous user profiles of Web surfers
  - Has profiles for more than 35 million anonymous users
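Clustering users by the URLs they access can be sketched as follows. This is a minimal illustration, not the method of any product named on the slide: it assumes Jaccard similarity between URL sets and a greedy single-pass clustering, and the `access` data and the 0.5 threshold are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two sets of URLs."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cluster_users(access, threshold=0.5):
    """Greedy single-pass clustering: a user joins the first cluster whose
    seed user is at least `threshold` similar, otherwise starts a new one."""
    clusters = []  # list of (seed_urls, [user, ...])
    for user in sorted(access):
        urls = access[user]
        for seed_urls, members in clusters:
            if jaccard(urls, seed_urls) >= threshold:
                members.append(user)
                break
        else:
            clusters.append((urls, [user]))
    return [members for _, members in clusters]

# Hypothetical access data: user -> set of frequently accessed URLs.
access = {
    "u1": {"/stocks", "/quotes", "/portfolio"},
    "u2": {"/stocks", "/quotes", "/news"},
    "u3": {"/books", "/cart", "/checkout"},
    "u4": {"/books", "/cart"},
}
clusters = cluster_users(access)
```

With this data the users fall into an "on-line trader" group (u1, u2) and an "on-line shopper" group (u3, u4); a classifier could then be trained to label each cluster with a profile.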
10. Internet Advertising
- Ads are a major source of revenue for Web portals (e.g., Yahoo, Lycos) and E-commerce sites
- Plenty of startups doing internet advertising
  - Doubleclick, AdForce, Flycast, AdKnowledge
- Internet advertising is probably the hottest web mining application today
11. Internet Advertising
- Scheme 1
  - Manually associate a set of ads with each user profile
  - For each user, display an ad from the set based on profile
- Scheme 2
  - Automate the association between ads and users
  - Use ad click information to cluster users (each user is associated with the set of ads that he/she clicked on)
  - For each cluster, find the ads that occur most frequently in the cluster; these become the ads for the set of users in the cluster
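The last step of Scheme 2 can be sketched as below, assuming the user clusters have already been computed. The function name, the `clicks` data, and the cut-off `k` are illustrative assumptions.

```python
from collections import Counter

def top_ads_per_cluster(clusters, clicks, k=2):
    """For each cluster of users, return the k most frequently clicked ads."""
    result = []
    for members in clusters:
        counts = Counter(ad for user in members for ad in clicks[user])
        # Sort by descending count, then by ad name, so ties are deterministic.
        ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
        result.append([ad for ad, _ in ranked[:k]])
    return result

# Hypothetical click data: user -> set of ads the user clicked on.
clicks = {
    "u1": {"broker_ad", "bank_ad"},
    "u2": {"broker_ad", "news_ad"},
    "u3": {"book_ad", "music_ad"},
    "u4": {"book_ad"},
}
clusters = [["u1", "u2"], ["u3", "u4"]]
ads = top_ads_per_cluster(clusters, clicks, k=1)
```

Each cluster's list is then shown to every user in that cluster, including users who never clicked those particular ads.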
12. Internet Advertising
- Use collaborative filtering (e.g. Likeminds, Firefly)
- Each user Ui has a rating for a subset of ads (based on click information, time spent, items bought, etc.)
  - Rij: rating of user Ui for ad Aj
- Problem: compute user Ui's rating for an unrated ad Aj
13. Internet Advertising
- Key idea: user Ui's rating for ad Aj is set to Rkj, where Uk is the user whose rating of ads is most similar to Ui's
- User Ui's rating for an ad Aj that has not been previously displayed to Ui is computed as follows
  - Consider a user Uk who has rated ad Aj
  - Compute Dik, the distance between Ui's and Uk's ratings on common ads
  - Ui's rating for ad Aj = Rkj (Uk is the user with the smallest Dik)
- Display to Ui the ad Aj with the highest computed rating
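The nearest-neighbour procedure above can be sketched in a few lines. The slides do not say how Dik is measured, so the root-mean-square difference used here is an assumption, as are the function names and the sample ratings.

```python
import math

def distance(r_i, r_k):
    """Dik: root-mean-square rating difference over ads rated by both users.
    Returns infinity when the two users have no rated ads in common."""
    common = r_i.keys() & r_k.keys()
    if not common:
        return math.inf
    return math.sqrt(sum((r_i[a] - r_k[a]) ** 2 for a in common) / len(common))

def predict(ratings, user, ad):
    """Rkj of the user Uk with the smallest Dik among those who rated `ad`."""
    candidates = [
        (distance(ratings[user], r_k), k)
        for k, r_k in ratings.items()
        if k != user and ad in r_k
    ]
    candidates = [c for c in candidates if c[0] < math.inf]
    if not candidates:
        return None
    _, nearest = min(candidates)
    return ratings[nearest][ad]

def best_unseen_ad(ratings, user, ads):
    """Ad not yet rated by `user` with the highest predicted rating."""
    scored = {}
    for ad in ads:
        if ad not in ratings[user]:
            rating = predict(ratings, user, ad)
            if rating is not None:
                scored[ad] = rating
    return max(scored, key=scored.get) if scored else None

# Hypothetical ratings Rij: user -> {ad: rating}.
ratings = {
    "U1": {"A1": 5, "A2": 1},
    "U2": {"A1": 5, "A2": 2, "A3": 4, "A4": 1},
    "U3": {"A1": 1, "A2": 5, "A3": 1, "A4": 5},
}
choice = best_unseen_ad(ratings, "U1", ["A1", "A2", "A3", "A4"])
```

U2 is closer to U1 than U3 is on the ads they share, so U1 inherits U2's ratings for A3 and A4 and is shown A3.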
14. Fraud
- With the growing popularity of E-commerce, systems to detect and prevent fraud on the Web become important
- Maintain a signature for each user based on buying patterns on the Web (e.g., amount spent, categories of items bought)
- If the buying pattern changes significantly, then signal fraud
- HNC Software uses domain knowledge and neural networks for credit card fraud detection
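A very simple signature-based check might look like the sketch below. It is not the HNC approach: it assumes the signature is the mean and standard deviation of past amounts plus the set of categories seen, and it flags a purchase only when the amount is far from the mean and the category is new. The threshold `z_limit` and the sample history are hypothetical.

```python
from statistics import mean, pstdev

def build_signature(purchases):
    """purchases: list of (amount, category) tuples for one user."""
    amounts = [amount for amount, _ in purchases]
    return {
        "mean": mean(amounts),
        "std": pstdev(amounts),
        "categories": {category for _, category in purchases},
    }

def is_suspicious(signature, amount, category, z_limit=3.0):
    """True when the amount deviates strongly from the signature and the
    category has never been bought before (both conditions are assumptions)."""
    spread = signature["std"] or 1.0  # avoid dividing by zero for constant histories
    unusual_amount = abs(amount - signature["mean"]) / spread > z_limit
    new_category = category not in signature["categories"]
    return unusual_amount and new_category

# Hypothetical purchase history for one user.
history = [(20, "books"), (25, "books"), (30, "music"), (25, "books")]
signature = build_signature(history)
flag_large = is_suspicious(signature, 900, "jewellery")
flag_normal = is_suspicious(signature, 28, "music")
```

A production system would update the signature incrementally and combine many more features; the point here is only the compare-against-signature step.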
15. Retrieval of Similar Images
- Given
  - A set of images
- Find
  - All images similar to a given image
  - All pairs of similar images
- Sample applications
  - Medical diagnosis
  - Weather prediction
  - Web search engine for images
  - E-commerce
16. Retrieval of Similar Images
- QBIC, Virage, Photobook
  - Compute a feature signature for each image
  - QBIC uses color histograms
  - WBIIS and WALRUS use wavelets
  - Use a spatial index to retrieve the database image whose signature is closest to the query's signature
- WALRUS decomposes an image into regions
  - A single signature is stored for each region
  - Two images are considered to be similar if they have enough similar region pairs
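Signature-based retrieval with color histograms can be illustrated as follows. This is a toy sketch, not QBIC itself: images are assumed to be lists of (r, g, b) pixels, the signature is a coarse normalized histogram, distance is L1, and a linear scan stands in for the spatial index.

```python
def color_histogram(pixels, bins=2):
    """Normalized color histogram with `bins` buckets per RGB channel."""
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        # Map each 0..255 channel value to a bucket in 0..bins-1.
        rb, gb, bb = r * bins // 256, g * bins // 256, b * bins // 256
        hist[rb * bins * bins + gb * bins + bb] += 1
    return [count / len(pixels) for count in hist]

def l1_distance(h1, h2):
    """Sum of absolute differences between two signatures."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def most_similar(query_pixels, database):
    """Name of the database image whose signature is closest to the query's."""
    query = color_histogram(query_pixels)
    return min(
        database,
        key=lambda name: l1_distance(query, color_histogram(database[name])),
    )

# Hypothetical tiny "images": four pixels each.
database = {
    "sunset": [(250, 120, 10)] * 4,
    "forest": [(20, 200, 30)] * 4,
    "ocean": [(10, 40, 220)] * 4,
}
query = [(240, 100, 20)] * 3 + [(10, 30, 200)]
match = most_similar(query, database)
```

A region-based scheme such as WALRUS would compute one signature per region and count matching region pairs instead of comparing a single whole-image signature.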
17. Images retrieved by WALRUS
[Figure: query image and the images retrieved by WALRUS]
18. Problems with Web Search Today
- Today's search engines are plagued by problems
  - The abundance problem (99% of info is of no interest to 99% of people)
  - Limited coverage of the Web (internet sources hidden behind search interfaces)
    - Largest crawlers cover < 18% of all web pages
  - Limited query interface based on keyword-oriented search
  - Limited customization to individual users
19. Problems with Web Search Today
- Today's search engines are plagued by problems
  - The Web is highly dynamic
    - Lots of pages are added, removed, and updated every day
  - Very high dimensionality
20. Improve Search By Adding Structure to the Web
- Use Web directories (or topic hierarchies)
  - Provide a hierarchical classification of documents (e.g., Yahoo!)
  - Performing a search in the context of a topic restricts it to the subset of web pages related to that topic
[Figure: topic hierarchy rooted at the Yahoo home page. Topics shown: Recreation, Science, Business, News; subtopics shown: Sports, Travel, Companies, Finance, Jobs]
21. Automatic Creation of Web Directories
- In the Clever project, hyper-links between Web pages are taken into account when categorizing them
  - Use a Bayesian classifier
  - Exploit knowledge of the classes of the immediate neighbors of the document to be classified
  - Show that simply taking text from neighbors and using standard document classifiers to classify the page does not work
- Inktomi's Directory Engine uses Concept Induction to automatically categorize millions of documents
22. Network Management
- Objective: to deliver content to users quickly and reliably
  - Traffic management
  - Fault management
23. Why is Traffic Management Important?
- While annual bandwidth demand is increasing ten-fold on average, annual bandwidth supply is rising only by a factor of three
- The result is frequent congestion at servers and on network links
  - During a major event (e.g., Princess Diana's death), an overwhelming number of user requests can result in millions of redundant copies of data flowing back and forth across the world
  - Olympic sites during the games
  - NASA sites close to launch and landing of shuttles
24. Traffic Management
- Key ideas
  - Dynamically replicate/cache content at multiple sites within the network and closer to the user
  - Multiple paths between any pair of sites
  - Route user requests to the server closest to the user or to the least loaded server
  - Use the path with the least congested network links
- Akamai, Inktomi
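The "closest or least loaded server" idea can be sketched as a small selection rule. This is an illustrative policy only, not how Akamai or Inktomi route requests; the `max_load` cut-off and the server table are hypothetical.

```python
def pick_server(servers, max_load=0.8):
    """servers: name -> (distance_to_user, load between 0 and 1).
    Prefer the closest server whose load is below max_load; if every
    server is overloaded, fall back to the least loaded one."""
    available = {name: s for name, s in servers.items() if s[1] < max_load}
    if available:
        return min(available, key=lambda name: available[name][0])
    return min(servers, key=lambda name: servers[name][1])

# Hypothetical replicas: (network distance to the user, current load).
servers = {
    "mumbai": (5, 0.95),
    "delhi": (12, 0.40),
    "london": (70, 0.10),
}
chosen = pick_server(servers)
fallback = pick_server({"a": (1, 0.90), "b": (2, 0.85)})
```

The nearest replica is skipped here because it is overloaded, so the request goes to the next closest one.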
25. Traffic Management
[Figure: service provider network showing a request, a congested link, and a congested server]
26. Traffic Management
- Need to mine network and Web traffic to determine
  - What content to replicate?
  - Which servers should store replicas?
  - Which server to route a user request to?
  - What path to use to route packets?
- Network design issues
  - Where to place servers?
  - Where to place routers?
  - Which routers should be connected by links?
- One can use association rules and sequential pattern mining algorithms to cache/prefetch replicas at servers
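As a rough illustration of pattern-driven prefetching, the sketch below mines the simplest kind of sequential pattern, "page X is usually followed by page Y", from user sessions. The confidence threshold and the session data are assumptions; a full sequential pattern miner would also consider longer sequences.

```python
from collections import Counter, defaultdict

def next_page_rules(sessions, min_confidence=0.6):
    """Return {page: next_page} for transitions whose confidence,
    count(page -> next_page) / count(page -> anything), meets the threshold."""
    followers = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            followers[current][following] += 1
    rules = {}
    for page, counts in followers.items():
        following, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) >= min_confidence:
            rules[page] = following
    return rules

# Hypothetical sessions: ordered lists of URLs requested by one user.
sessions = [
    ["/home", "/scores", "/video"],
    ["/home", "/scores", "/photos"],
    ["/home", "/news"],
    ["/scores", "/video"],
]
rules = next_page_rules(sessions)
```

A server that has just delivered `/home` could use the rule `/home -> /scores` to prefetch or replicate `/scores` ahead of the next request.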
27. Fault Management
- Fault management involves
  - Quickly identifying failed/congested servers and links in the network
  - Re-routing user requests and packets to avoid congested/down servers and links
- Need to analyze alarm and traffic data to carry out root cause analysis of faults
- Bayesian classifiers can be used to predict the root cause given a set of alarms
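One possible form of such a classifier is naive Bayes over alarm sets. The sketch below assumes a Bernoulli model with Laplace smoothing trained on past incidents; the alarm names, root causes, and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def train(incidents):
    """incidents: list of (set_of_alarms, root_cause) pairs."""
    cause_counts = Counter(cause for _, cause in incidents)
    alarm_counts = defaultdict(Counter)
    vocabulary = set()
    for alarms, cause in incidents:
        vocabulary |= alarms
        for alarm in alarms:
            alarm_counts[cause][alarm] += 1
    return cause_counts, alarm_counts, vocabulary

def predict_root_cause(model, alarms):
    """Most probable root cause given the set of alarms that fired."""
    cause_counts, alarm_counts, vocabulary = model
    total = sum(cause_counts.values())
    best_cause, best_score = None, -math.inf
    for cause, n in cause_counts.items():
        score = math.log(n / total)  # log prior
        for alarm in vocabulary:
            # Laplace-smoothed probability that this alarm fires for this cause.
            p = (alarm_counts[cause][alarm] + 1) / (n + 2)
            score += math.log(p if alarm in alarms else 1 - p)
        if score > best_score:
            best_cause, best_score = cause, score
    return best_cause

# Hypothetical incident history.
incidents = [
    ({"link_down", "high_latency"}, "fiber_cut"),
    ({"link_down"}, "fiber_cut"),
    ({"cpu_high", "high_latency"}, "server_overload"),
    ({"cpu_high"}, "server_overload"),
]
model = train(incidents)
cause = predict_root_cause(model, {"link_down", "high_latency"})
```

The naive independence assumption between alarms is rarely true in real networks, so this should be read as a baseline rather than a complete root cause analysis.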
28. Web Mining Issues
- Size
  - Grows at about 1 million pages a day
  - Google indexes 9 billion documents
- Number of web sites
  - Netcraft survey says 72 million sites
  - (http://news.netcraft.com/archives/web_server_survey.html)
- Diverse types of data
  - Images
  - Text
  - Audio/video
  - XML
  - HTML
29. Number of Active Sites
[Figure: total sites across all domains, August 1995 - October 2007]
30. Systems Issues
- Web data sets can be very large
  - Tens to hundreds of terabytes
- Cannot mine on a single server!
  - Need large farms of servers
- How to organize hardware/software to mine multi-terabyte data sets
  - Without breaking the bank!
31. Different Data Formats
- Structured Data
- Unstructured Data
- OLE DB offers some solutions!
32. Web Data
- Web pages
- Intra-page structures
- Inter-page structures
- Usage data
- Supplemental data
  - Profiles
  - Registration information
  - Cookies
33. Web Usage Mining
- Pages contain information
- Links are roads
- How do people navigate the Internet?
  - => Web Usage Mining (clickstream analysis)
- Information on navigation paths is available in log files
- Logs can be mined from a client or a server perspective
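From the server perspective, the first step is usually to turn raw log lines into per-visitor navigation paths. The sketch below assumes the log is in Common Log Format, is already in time order, identifies a visitor by host address, and starts a new session after 30 minutes of inactivity; all of these are assumptions, and the sample lines are made up.

```python
import re
from datetime import datetime, timedelta

# host ident user [timestamp] "METHOD url protocol" status bytes
LOG_PATTERN = re.compile(
    r'(\S+) \S+ \S+ \[([^\]]+)\] "(?:GET|POST) (\S+)[^"]*" (\d{3}) \S+'
)

def parse_line(line):
    """Return (host, timestamp, url, status), or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    host, stamp, url, status = match.groups()
    when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
    return host, when, url, int(status)

def sessionize(lines, timeout=timedelta(minutes=30)):
    """Group requests into sessions: host -> list of URL paths."""
    sessions, last_seen = {}, {}
    for record in filter(None, map(parse_line, lines)):
        host, when, url, _ = record
        if host not in last_seen or when - last_seen[host] > timeout:
            sessions.setdefault(host, []).append([])  # start a new session
        sessions[host][-1].append(url)
        last_seen[host] = when
    return sessions

# Hypothetical log lines.
log = [
    '10.0.0.1 - - [10/Oct/2007:13:55:36 +0530] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2007:13:57:02 +0530] "GET /courses.html HTTP/1.0" 200 1045',
    '10.0.0.2 - - [10/Oct/2007:14:00:00 +0530] "GET /index.html HTTP/1.0" 200 2326',
    '10.0.0.1 - - [10/Oct/2007:15:10:00 +0530] "GET /index.html HTTP/1.0" 200 2326',
]
sessions = sessionize(log)
```

The resulting paths are the input for later steps such as clustering visitors or mining frequent navigation sequences. Identifying visitors by host address is crude, since proxies and shared machines blur individual users.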
34. Website Usage Analysis
- Why analyze Website usage?
- Knowledge about how visitors use a Website could
  - Provide guidelines for web site reorganization; help prevent disorientation
  - Help designers place important information where the visitors look for it
  - Support pre-fetching and caching of web pages
  - Provide an adaptive Website (personalization)
- Questions which could be answered
  - What are the differences in usage and access patterns among users?
  - What user behaviors change over time?
  - How do usage patterns change with quality of service (slow/fast)?
  - What is the distribution of network traffic over time?
37. Website Usage Analysis
- Analog: Web Log File Analyser
- Gives basic statistics such as
  - number of hits
  - average hits per time period
  - what are the popular pages in your site
  - who is visiting your site
  - what keywords are users searching for to get to you
  - what is being downloaded
- http://www.analog.cx/
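A few of these statistics are easy to compute by hand once the log has been parsed. The sketch below is not Analog's implementation; it assumes each request has already been reduced to a hypothetical `(host, hour, url)` tuple.

```python
from collections import Counter

def basic_stats(records):
    """records: iterable of (host, hour, url) tuples, one per request."""
    records = list(records)
    hits_per_hour = Counter(hour for _, hour, _ in records)
    return {
        "hits": len(records),
        "average_hits_per_hour": (
            len(records) / len(hits_per_hour) if hits_per_hour else 0.0
        ),
        "popular_pages": Counter(url for _, _, url in records).most_common(3),
        "visitors": sorted({host for host, _, _ in records}),
    }

# Hypothetical parsed requests.
records = [
    ("10.0.0.1", 13, "/index.html"),
    ("10.0.0.1", 13, "/courses.html"),
    ("10.0.0.2", 14, "/index.html"),
    ("10.0.0.3", 14, "/index.html"),
]
stats = basic_stats(records)
```

Note that the average is taken over hours that saw at least one request, which overstates traffic for sites with long idle periods.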
38-40. Web Usage Mining Process
41. Web Mining Outline
- Goal: examine the use of data mining on the World Wide Web
  - Web Content Mining
  - Web Structure Mining
  - Web Usage Mining
42. Web Mining Taxonomy
[Figure: Web mining taxonomy, modified from zai01]
43. Web Content Mining
- Examine the contents of web pages as well as the results of web searching
- Can be thought of as extending the work performed by basic search engines
- Search engines have crawlers to search the web and gather information, indexing techniques to store the information, and query processing support to provide information to the users
- Web Content Mining is the process of extracting knowledge from web contents
44. Semi-structured Data
- Content is, in general, semi-structured
- Example
  - Title
  - Author
  - Publication_Date
  - Length
  - Category
  - Abstract
  - Content
45. Structuring Textual Data
- Many methods are designed to analyze structured data
- If we can represent documents by a set of attributes, we will be able to use existing data mining methods
- How to represent a document?
  - Vector-based representation (referred to as "bag of words" as it is invariant to permutations)
- Use statistics to add a numerical dimension to unstructured text
46. Document Representation
- A document representation aims to capture what the document is about
- One possible approach
  - Each entry describes a document
  - Attributes describe whether or not a term appears in the document
47. Document Representation
- Another approach
  - Each entry describes a document
  - Attributes represent the frequency with which a term appears in the document
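Both representations, term presence and term frequency, can be built with a few lines of code. This is a minimal sketch that assumes whitespace tokenization and lower-casing; the function names and the two sample documents are illustrative.

```python
from collections import Counter

def build_vocabulary(documents):
    """Sorted list of all distinct terms; one attribute per term."""
    return sorted({term for text in documents for term in text.lower().split()})

def frequency_vector(text, vocabulary):
    """Attribute value = number of times the term appears in the document."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def boolean_vector(text, vocabulary):
    """Attribute value = 1 if the term appears in the document, else 0."""
    return [1 if count > 0 else 0 for count in frequency_vector(text, vocabulary)]

documents = ["web mining mines the web", "data mining"]
vocabulary = build_vocabulary(documents)
frequency_rows = [frequency_vector(d, vocabulary) for d in documents]
boolean_rows = [boolean_vector(d, vocabulary) for d in documents]
```

Reordering the words of a document leaves both vectors unchanged, which is the permutation invariance behind the name "bag of words".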
48. Document Representation
- Stop word removal: many words are not informative and thus irrelevant for document representation
  - the, and, a, an, is, of, that, ...
- Stemming: reducing words to their root form (reduces dimensionality)
  - A document may contain several occurrences of words like fish, fishes, fisher, and fishers, but would not be retrieved by a query with the keyword fishing
  - Different words share the same word stem and should be represented by that stem instead of the actual word: fish
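The two preprocessing steps can be sketched as below. The stop list is the one given on the slide; the stemmer is a deliberately naive suffix stripper written for this example, not the Porter algorithm or any standard stemmer, and its suffix list is an assumption.

```python
STOP_WORDS = {"the", "and", "a", "an", "is", "of", "that"}
SUFFIXES = ("ing", "ers", "er", "es", "s")  # longer suffixes are tried first

def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lower-case, drop stop words, and reduce the remaining words to stems."""
    return [stem(word) for word in text.lower().split() if word not in STOP_WORDS]

terms = preprocess("The fisher is fishing and the fishers catch fishes")
stems = {stem(word) for word in ["fish", "fishes", "fisher", "fishers", "fishing"]}
```

After stemming, a query for "fishing" and a document containing "fishers" both map to the stem "fish", so the document can be retrieved. A crude stripper like this will mis-stem many English words, which is why real systems use a proper stemming algorithm.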