Web Mining

Slides: 47

Provided by: ahmed8

1
Web Mining

Ahmed M. Zeki

2
Introduction

The World Wide Web is a rich source of knowledge
that can be useful to many applications.
Source?
Billions of web pages and billions of visitors
and contributors.
What knowledge?
e.g., the hyperlink structure and diversity of
languages.
Purpose?
To improve users' efficiency and effectiveness in
searching for information on the web.
Decision-making support or business management.

3
Introduction

The Web's characteristics:
Large size
Unstructured
Different data types: text, images, hyperlinks, and
user usage information
Dynamic content
Time dimension
Multilingual
Hence data mining (DM) is a significant subfield of this area.
The various activities and efforts in this area
are referred to as Web Mining.

4
Introduction
5
Introduction

Information extraction: techniques designed to
identify useful information from text documents
automatically.
Named-entity extraction: the automatic identification
from text documents of the names of entities of
interest.
Machine learning-based entity extraction systems
rely on algorithms rather than human-created
rules to extract knowledge or identify patterns
from texts. Examples of such algorithms:
Neural networks
Decision trees
Hidden Markov models
Entropy maximization

6
Introduction

Relevance feedback helps users conduct searches
iteratively and reformulate search queries based
on their evaluation of previously retrieved documents.
Using relevance feedback, a model can learn the
common characteristics of a set of relevant
documents in order to estimate the probability of
relevance for the remaining documents.
Various machine learning algorithms, such as
genetic algorithms, have been used in relevance
feedback applications.
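
One classic way to reformulate a query from judged documents is Rocchio-style reweighting. The slide does not name this technique, so it is used here only as an illustration. The sketch below assumes queries and documents are plain lists of terms; the function name and the weights alpha, beta, and gamma are illustrative choices, not values from the presentation.

```python
from collections import Counter

def rocchio(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Return a reweighted query as a dict of term -> weight.

    query is a list of terms; relevant and non_relevant are lists of
    documents, each document being a list of terms.
    """
    new_query = Counter()
    for term, count in Counter(query).items():
        new_query[term] += alpha * count
    # Move the query towards relevant documents and away from the others.
    for docs, weight in ((relevant, beta), (non_relevant, -gamma)):
        for doc in docs:
            for term, count in Counter(doc).items():
                new_query[term] += weight * count / len(docs)
    # Keep only terms that end up with a positive weight.
    return {term: w for term, w in new_query.items() if w > 0}
```

Terms that occur in documents the user marked as relevant gain weight, so the next search round favors documents that share those terms.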

7
Introduction

Information filtering techniques try to learn
about users' interests based on their evaluations
and actions, and then use this information to
analyze new documents.
Many personalization and collaborative systems
have been implemented as software agents to help
users in information systems.

8
Introduction

Text classification: the classification of textual
documents into predefined categories (supervised
learning).
E.g., Support Vector Machine (SVM), a statistical
method that tries to find a hyperplane that best
separates two classes.
Text clustering: groups documents into categories
that are not predefined but are defined
dynamically based on their similarities (unsupervised
learning).
E.g., Kohonen's Self-Organizing Map (SOM), a type of
neural network that produces a 2-dimensional grid
representation for n-dimensional features, has
been widely applied in IR.
Machine learning is the basis of most text
classification and clustering applications.

9
Introduction

Web spiders: software programs that traverse the
WWW by following hypertext links and retrieving
Web documents via HTTP. They are used:
To build the databases of search engines
To perform personal search
To archive Web sites or even the whole Web
To collect Web statistics
Intelligent Web spiders: spiders that use
more advanced algorithms during the search
process.
E.g., the Itsy Bitsy Spider searches the Web
using a best-first search and a genetic algorithm
approach.
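
The basic traversal a spider performs can be sketched as a breadth-first search over hyperlinks. This is a minimal illustration, not the algorithm of any particular spider: the `fetch` callback, the `FAKE_WEB` dictionary, and the example.com URLs are hypothetical stand-ins so that the sketch runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, fetch, max_pages=100):
    """Breadth-first traversal starting at seed.

    fetch(url) returns the page's HTML, or None if it cannot be retrieved.
    """
    seen = {seed}
    queue = deque([seed])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return pages

# Hypothetical in-memory "Web" used in place of real HTTP requests.
FAKE_WEB = {
    "http://example.com/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.com/a.html": '<a href="/b.html">B</a>',
    "http://example.com/b.html": '<a href="/">home</a> <a href="/missing.html">x</a>',
}
pages = crawl("http://example.com/", FAKE_WEB.get)
```

A real spider would replace `fetch` with an HTTP client and would also need politeness rules such as honoring robots.txt. A best-first spider would swap the queue for a priority queue ordered by an estimate of each link's relevance.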

10
Introduction

In order to extract non-English knowledge from
the web, Web Mining systems have to deal with
issues in language-specific text processing.
The base algorithms behind most machine learning
systems are language-independent. Most
algorithms, e.g., text classification and
clustering, need only to take a set of features
(a vector of keywords) for the learning process.
However, the algorithms usually depend on some
phrase segmentation and extraction programs to
generate a set of features or keywords to
represent Web documents.

11
Introduction

Web Visualization tools have been used to help
users maintain a "big picture" of the retrieval
results from search engines, web sites, a subset
of the Web, or even the whole Web.
The best-known example of using the
tree metaphor for Web browsing is the hyperbolic
tree developed by Xerox PARC.

12
Introduction

Semantic Web technology tries to add metadata to
describe data and information on the Web, based
on standards such as XML.
Machine learning can play three roles in the
Semantic Web:
It can be used to automatically create the markup or
metadata for existing unstructured textual
documents on the Web.
It can be used to create, merge, update, and
maintain ontologies.
It can understand and perform reasoning on the
metadata provided by the Semantic Web in order to
extract knowledge from the Web more effectively.

13
Web Mining

Web mining is the application of data mining
techniques to discover patterns from the Web.
The term was coined by Etzioni (1996).
How is Web mining different from classical DM?
The web is not a relation
Textual information and linkage structure
Usage data is huge and growing rapidly
Google's usage logs are bigger than their web
crawl
Data generated per day is comparable to largest
conventional data warehouses
Ability to react in real-time to usage patterns
No human in the loop

14
Benefits of Web Data Mining

Match your available resources to visitor
interests
Increase the value of each visitor
Improve the visitor's experience at the website
Perform targeted resource management
Collect information in new ways
Test the relevance of content and web site
architecture

15
Web Mining

According to its analysis targets, web mining can be
divided into three different types:
Web usage mining
Web content mining
Web structure mining

16
1. Web Usage Mining

The application of data mining to analyze
and discover interesting patterns in users' usage
data on the web.
The usage data record users' behavior as they
browse or make transactions on a web site.
Analyzing these data helps to better understand and
serve the needs of users or Web-based applications.
It is an activity that involves the automatic
discovery of patterns from one or more Web
servers.

17
1. Web Usage Mining

Organizations often generate and collect large
volumes of data. Most of this information is
usually generated automatically by Web servers
and collected in server logs. Analyzing such data
can help these organizations to determine:
the value of particular customers
cross marketing strategies across products
the effectiveness of promotional campaigns, etc.

18
1. Web Usage Mining

The first web analysis tools simply provided
mechanisms to report user activity as recorded in
the servers. Using such tools, it was possible to
determine such information as:
the number of accesses to the server
the times or time intervals of visits
the domain names and the URLs of users of the Web
server.
These tools provided little or no analysis of data
relationships among the accessed files and
directories within the Web space.
More sophisticated techniques for the discovery
and analysis of patterns are now emerging. These
tools fall into two main categories:
Pattern Discovery Tools
Pattern Analysis Tools

19
1. Web Usage Mining

Web servers, Web proxies, and client applications
can quite easily capture Web Usage data.
Web server log: records every visit to the pages,
which files were requested and when, the IP
address of the request, the error code, the
number of bytes sent to the user, and the type of
browser used.
By analyzing the Web usage data, web mining
systems can discover useful knowledge about a
system's usage characteristics and its users'
interests, which has various applications:
Personalization and Collaboration in Web-based
systems
Marketing
Web site design and evaluation
Decision support

20
1. Web Usage Mining

Web usage mining has been used for various
purposes:
A knowledge discovery process for mining
marketing intelligence information from Web data.
Web traffic patterns can also be extracted from
Web usage logs in order to improve the
performance of a Web site.
Search engine transaction logs also provide
valuable knowledge about user behavior in Web
searching.
Such information is very useful for a better
understanding of users' Web searching and
information-seeking behavior and can improve the
design of Web search systems.

21
1. Web Usage Mining

One of the major goals of Web usage mining is to
reveal interesting trends and patterns which can
often provide important knowledge about the users
of a system.
The framework for Web usage mining:
Preprocessing: data cleansing
Pattern discovery
Pattern analysis

Generic machine learning and Data mining
techniques, such as association rule mining,
classification, and clustering, often can be
applied.
22
1. Web Usage Mining

Many Web applications aim to provide personalized
information and services to users. Web usage data
provide an excellent way to learn about users'
interests.
Web usage mining on Web logs can help identify
users who have accessed similar Web pages. The
patterns that emerge can be very useful in
collaborative Web searching and filtering.
Amazon.com uses collaborative filtering to
recommend books to potential customers based on
the preferences of other customers having similar
interests or purchasing histories.
Huang et al. (2002) used a Hopfield net to model
user interests and product profiles in an online
bookstore in Taiwan.
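
The idea of recommending items based on customers with similar histories can be sketched with a simple user-based scheme. This is only an illustration under stated assumptions, not the method used by Amazon.com or by Huang et al.: the function name, the Jaccard similarity measure, and the data layout are all hypothetical choices.

```python
from collections import Counter

def recommend(target, histories, top_n=3):
    """Recommend items bought by users whose histories overlap the target's.

    histories maps a user id to the set of items that user has bought.
    """
    own = histories[target]
    scores = Counter()
    for user, items in histories.items():
        if user == target:
            continue
        # Jaccard similarity between the two purchase histories.
        union = own | items
        similarity = len(own & items) / len(union) if union else 0.0
        # Items the target has not bought yet inherit the similarity score.
        for item in items - own:
            scores[item] += similarity
    return [item for item, score in scores.most_common(top_n) if score > 0]
```

Items owned only by users with no overlap receive a score of zero and are filtered out, so a recommendation always traces back to at least one shared purchase.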

23
Web Server Log

[Figure: a user requesting http://www.kdnuggets.com/jobs/ from the KDnuggets.com server]

24
Web Server Log: A Sample

152.152.98.11
- -
16/Nov/2005:16:32:50 -0500
"GET /jobs/ HTTP/1.1"
200
15140
"http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N"
"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
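
A log entry like the sample above can be split into its fields with a regular expression. The sketch below assumes the entry follows the common Apache combined log format, with the timestamp in square brackets and the punctuation that was lost in this transcript restored. The pattern, function name, and field names are illustrative.

```python
import re
from datetime import datetime

# Assumed layout: Apache "combined" log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<name>\S+) (?P<login>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

SAMPLE = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" 200 15140 '
    '"http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"'
)
record = parse_line(SAMPLE)
```

The referrer field is often the most informative one for usage mining: here it shows that the visitor arrived from a search for salary information on data mining.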

25
Web log fields

IP: 152.152.98.11
IP address; can be converted to a host name, such
as xyz.example.com
Name
The name of the remote user (usually omitted and
replaced by a dash, -)
Login
Login of the remote user (also usually omitted
and replaced by a dash, -)
Date/Time/TZ: 16/Nov/2005:16:32:50 -0500
Request, Status code, Object size, Referrer, User
agent

26
Web Usage Mining - Basic

Totals for each component:
Hits: total number of requests
Files: number of GETs
Pages: number of HTML pages
Sites: number of unique IP addresses
Response codes
Kbytes: total Kbytes transferred
User Agents
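
These totals can be computed in a single pass over parsed log records. The sketch below makes two assumptions that go beyond the slide: Files are counted as requests with status code 200, and Pages are recognized by a URL path ending in "/", ".html", or ".htm". The record layout and function name are hypothetical.

```python
from collections import Counter

def summarize(records):
    """Compute basic usage totals.

    records: iterable of dicts with keys ip, path, status, size (bytes), agent.
    """
    records = list(records)
    ok = [r for r in records if r["status"] == 200]  # assumed meaning of "Files"
    return {
        "hits": len(records),
        "files": len(ok),
        # Assumed heuristic: a page is an HTML document identified by its path.
        "pages": sum(1 for r in ok if r["path"].endswith(("/", ".html", ".htm"))),
        "sites": len({r["ip"] for r in records}),
        "response_codes": Counter(r["status"] for r in records),
        "kbytes": sum(r["size"] for r in records) / 1024,
        "user_agents": Counter(r["agent"] for r in records),
    }
```

Real log analyzers apply more careful rules, for example configurable page extensions, so these numbers should be treated as approximations.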

27
Web Log Analysis Programs

Free:
Analog, AWStats, Webalizer
Google Analytics
Commercial:
WebTrends, WebSideStory,

28
Example: KDnuggets.com Nov 2005 totals

Monthly Statistics (from Webalizer)

Q: What is the difference between Hits and
Files?
A: The difference between Hits and
Files is the number of requests with a status code
other than 200.

29
Example: KDnuggets.com Nov 2005 totals

Q: What does the difference between
Files and Pages mean?
A: The difference between Files and Pages is the
number of non-HTML files (e.g., images, JavaScript,
etc.).
In the November 2005 KDnuggets log, HTML files were
about 1/3 of all requests.
However, this data does not separate out bot requests
(which are heavily weighted towards HTML pages).

30
2. Web Content Mining

The process of discovering useful information from
the content of a web page.
Web content may consist of:
Text
Image
Audio
Video
Web content mining is sometimes called web text
mining, because text content is the most
widely researched area.
The technologies normally used in web
content mining are:
Natural Language Processing (NLP)
Information Retrieval (IR)

31
Text Mining

The process of deriving high-quality information
from text.
Text mining is an interdisciplinary field which
draws on information retrieval, data mining,
machine learning, statistics, and computational
linguistics. As most information (over 80%) is
currently stored as text, text mining is believed
to have high commercial potential.
High-quality information is typically derived
by discovering patterns and trends
through means such as statistical pattern
learning.

32
Text Mining

Text mining usually involves the process of:
Structuring the input text by:
Parsing
Addition of some derived linguistic features and
the removal of others
Subsequent insertion into a database
Deriving patterns within the structured data
Evaluation and interpretation of the output.
"High quality" in text mining refers to some
combination of:
Relevance
Novelty
Interestingness
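
The structuring and pattern-deriving steps listed on this slide can be sketched in a few lines. This is a deliberately small illustration: the stop-word list is a tiny hypothetical sample, tokenization is a simple regular expression, and "pattern" here just means how many documents contain each term.

```python
import re
from collections import Counter

# Illustrative stop-word list; real systems use much longer ones.
STOP_WORDS = {"the", "a", "of", "and", "is", "to", "in", "from"}

def structure(text):
    """Parse text into lowercase tokens and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def document_frequencies(documents):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in documents:
        df.update(set(structure(doc)))
    return df
```

Document frequencies like these are the raw material for many of the later steps, such as weighting terms before categorization or clustering.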

33
Text Mining

Typical text mining tasks include:
Text categorization
Text clustering
Concept/entity extraction
Sentiment analysis
Document summarization
Entity relation modeling (i.e., learning
relations between named entities).

34
3. Web Structure Mining

The process of using graph theory to analyze
the node and connection structure of a web site.
Web structure mining can be divided into two
kinds:
Extracting patterns from hyperlinks in the web. A
hyperlink is a structural component that connects
a web page to a different location.
Mining the document structure: using the
tree-like structure to analyze and describe the
HTML or XML tags within a web page.

35
3. Web Structure Mining

Web structure mining has been largely influenced
by research in:
Social network analysis
Citation analysis (bibliometrics).
In-links: the hyperlinks pointing to a page.
Out-links: the hyperlinks found in a page.
Usually, the larger the number of in-links, the
better a page is considered to be.
By analyzing the pages containing a URL, we can
also obtain:
Anchor text: how other Web page authors annotate
a page, which can be useful in predicting the
content of the target page.
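
In-links and anchor text can both be collected by scanning the out-links of every known page. The sketch below uses Python's standard `html.parser` module. The class and function names are illustrative, and the link targets are kept exactly as written in each page rather than resolved to absolute URLs.

```python
from collections import defaultdict
from html.parser import HTMLParser

class AnchorCollector(HTMLParser):
    """Records (href, anchor text) for every <a href=...> element in a page."""
    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only collect text while inside a link that has an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None

def in_link_index(pages):
    """Map each target to the (source page, anchor text) pairs pointing to it."""
    index = defaultdict(list)
    for source, html in pages.items():
        collector = AnchorCollector()
        collector.feed(html)
        for href, text in collector.anchors:
            index[href].append((source, text))
    return index
```

The length of each list is the page's in-link count, and the collected anchor texts describe the target page in other authors' words.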

36
3. Web Structure Mining

A page's PageRank score is computed by weighting
each in-link to the page proportionally to the
quality of the page containing the in-link.
The qualities of these referring pages are also
determined by PageRank. Thus, the PageRank of a page p is
calculated recursively from the PageRank of the pages that link to it.
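
The formula itself is not present in this transcript. A commonly cited form, from Brin and Page's original description, is PR(p) = (1 - d) + d * (sum over all pages q linking to p of PR(q) / c(q)), where d is a damping factor between 0 and 1 and c(q) is the number of out-links of q. The sketch below iterates that recursion to a fixed point. The function name, the default d = 0.85, and the iteration count are assumptions for illustration.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively compute PageRank scores.

    links maps each page to the list of pages it links to.
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {}
        for p in pages:
            # Each page q that links to p passes on an equal share of its rank.
            incoming = sum(
                rank[q] / len(links[q])
                for q in pages
                if p in links.get(q, ())
            )
            new_rank[p] = (1 - d) + d * incoming
        rank = new_rank
    return rank
```

In this simple version, pages without out-links pass no rank to other pages; production implementations treat such pages specially and use sparse data structures rather than scanning every page pair.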

37
Ads vs. search results

Search advertising is the revenue model
Multi-billion-dollar industry
Advertisers pay for clicks on their ads
Interesting problems:
How to pick the top 10 results for a search from
2,230,000 matching pages?
What ads to show for a search?
If I'm an advertiser, which search terms should I
bid on, and how much should I bid?

38
Web Mining vs. Information Access

Text data mining involves extracting nuggets
and/or overall patterns from a collection of
textual information, independent of a user's
information need.
Information access is the process of helping
users find, create, use, re-use, and understand
information to satisfy an information need.
In other words, data mining is opportunistic,
whereas information access is goal-driven.

39
Search Engine Components

Spider (crawler/robot): builds the corpus
Collects web pages recursively
For each known URL, fetch the page, parse it, and
extract new URLs
Repeat
Additional pages from direct submissions and other
sources
Indexer: creates inverted indexes
Various policies with respect to which words are indexed,
capitalization, support for Unicode, stemming,
support for phrases, etc.
Query processor: serves query results
Front end: query reformulation, word stemming,
capitalization, optimization of Booleans, etc.
Back end: finds matching documents and ranks
them
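
The indexer and the query back end can be illustrated with a minimal inverted index. This sketch assumes lowercase word tokens with no stemming or phrase support, answers only Boolean AND queries, and does no ranking. The function names are illustrative.

```python
import re
from collections import defaultdict

def build_index(corpus):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in corpus.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query word (Boolean AND)."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result
```

Lowercasing both the documents and the query is one example of the capitalization policies mentioned on the slide: it makes "Mining" and "mining" match the same index entry.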

40
Application Areas of Web Mining

E-commerce
Search Engines
Personalization
Website Design

41
Application Areas of Web Mining

E-tailers
The ability to find new cross-sell opportunities,
enable comprehensive prospect profiling, and
improve customer satisfaction.
B2B and B2C Ventures

42
Application Areas of Web Mining

Advertising-Based Sites
When revenue is advertising-based, blindly
serving ads to visitors will not result in a
high click-through rate. Instead, ads must be
intelligently targeted to the user, providing the
visitor with products and services that they are
interested in.
Entertainment sites
Media Portals
Advertising Providers

43
Application Areas of Web Mining

Information Repositories
Information overload is a problem that grows
larger every day. Indexing, summarization, and
other metadata tasks are time-consuming. Semantic
text analyzers are capable of automating these
tasks and of creating user navigation systems on the
fly.
Libraries
Technical Support Sites
Media Sites
Content Providers

44
Application Areas of Web Mining

Security applications
One of the largest text mining applications that
exists is probably the classified ECHELON
surveillance system.
Software and Applications
Research and development departments of major
companies, including IBM and Microsoft, are
researching text mining techniques and developing
programs to further automate the mining and
analysis processes.

45
Application Areas of Web Mining

Academic applications
The issue of text mining is of importance to
publishers who hold large databases of
information requiring indexing for retrieval.

46
Conclusion