Title: Web Mining
1Web Mining
2Introduction
- The World Wide Web is a rich source of knowledge that can be useful to many applications.
- Source?
  - Billions of web pages and billions of visitors and contributors.
- What knowledge?
  - E.g., the hyperlink structure and the diversity of languages.
- Purpose?
  - To improve users' efficiency and effectiveness in searching for information on the web.
  - Decision-making support or business management.
3Introduction
- The Web's characteristics
  - Large size
  - Unstructured
  - Different data types: text, image, hyperlinks, and user usage information
  - Dynamic content
  - Time dimension
  - Multilingual
- Hence data mining (DM) is a significant subfield of this area.
- The various activities and efforts in this area are referred to as Web Mining.
4Introduction
5Introduction
- Information extraction: techniques designed to identify useful information from text documents automatically.
- Named-entity extraction: automatic identification from text documents of the names of entities of interest.
- Machine learning-based entity extraction systems rely on algorithms rather than human-created rules to extract knowledge or identify patterns from texts.
  - Neural networks
  - Decision trees
  - Hidden Markov models
  - Entropy maximization
6Introduction
- Relevance feedback helps users conduct searches iteratively and reformulate search queries based on evaluation of previously retrieved documents.
- Using relevance feedback, a model can learn the common characteristics of a set of relevant documents in order to estimate the probability of relevance for the remaining documents.
- Various machine learning algorithms, such as genetic algorithms, have been used in relevance feedback applications.
7Introduction
- Information filtering techniques try to learn about users' interests based on their evaluations and actions, and then use this information to analyze new documents.
- Many personalization and collaborative systems have been implemented as software agents to help users in information systems.
8Introduction
- Text classification: classification of textual documents into predefined categories (supervised learning).
  - E.g., Support Vector Machine (SVM), a statistical method that tries to find a hyperplane that best separates two classes.
- Text clustering: groups documents into non-predefined categories which are dynamically defined based on their similarities (unsupervised learning).
  - E.g., Kohonen's Self-Organizing Map (SOM), a type of neural network that produces a 2-dimensional grid representation for n-dimensional features, has been widely applied in IR.
- Machine learning is the basis of most text classification and clustering applications.
9Introduction
- Web spiders: software programs that traverse the WWW by following hypertext links and retrieving Web documents via the HTTP protocol.
  - To build the databases of search engines
  - To perform personal search
  - To archive Web sites or even the whole Web
  - To collect Web statistics
- Intelligent Web spiders: some spiders that use more advanced algorithms during the search process have been developed.
  - E.g., the Itsy Bitsy Spider searches the Web using a best-first search and a genetic algorithm approach.
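The traversal a basic spider performs can be sketched in Python as a breadth-first walk over hyperlinks. This is a minimal illustration, not the Itsy Bitsy Spider's best-first or genetic search: the `crawl` function, the `fetch` callback, and the `TOY_WEB` pages are hypothetical names, and an in-memory dictionary stands in for real HTTP requests so the example runs offline.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first traversal; `fetch` maps a URL to HTML text or None."""
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return visited


# Hypothetical three-page site used in place of real HTTP requests.
TOY_WEB = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="/b.html">B</a>',
    "http://example.test/a.html": '<a href="/b.html">B</a>',
    "http://example.test/b.html": '<a href="/">home</a> <a href="/missing.html">x</a>',
}

order = crawl("http://example.test/", TOY_WEB.get)
print(order)
```

A real spider would replace `TOY_WEB.get` with an HTTP client and would also need politeness rules (robots.txt, rate limits), which are left out here.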
10Introduction
- In order to extract non-English knowledge from the web, Web Mining systems have to deal with issues in language-specific text processing.
- The base algorithms behind most machine learning systems are language-independent. Most algorithms, e.g., text classification and clustering, need only a set of features (a vector of keywords) for the learning process.
- However, the algorithms usually depend on some phrase segmentation and extraction programs to generate a set of features or keywords to represent Web documents.
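A small Python sketch of the split described above: the similarity computation works on keyword vectors and is language-independent, while the tokenizer that produces the keywords is the language-specific part. The regular-expression tokenizer and the tiny stopword list are assumptions that only suit space-delimited text such as English; a language without word delimiters would need a proper segmentation program in place of `keyword_vector`.

```python
import math
import re
from collections import Counter

# Assumed stopword list, for illustration only.
STOPWORDS = frozenset({"the", "a", "of", "and", "is", "to"})


def keyword_vector(text):
    """Language-specific step: split text into keywords and count them."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine_similarity(u, v):
    """Language-independent step: compare two keyword vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


doc_a = keyword_vector("Web mining is the application of data mining to the Web")
doc_b = keyword_vector("Data mining of Web usage logs")
doc_c = keyword_vector("Hyperbolic tree visualization")

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```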
11Introduction
- Web visualization tools have been used to help users maintain a "big picture" of the retrieval results from search engines, web sites, a subset of the Web, or even the whole Web.
- The best-known example of using the tree metaphor for Web browsing is the hyperbolic tree developed by Xerox PARC.
12Introduction
- Semantic Web technology tries to add metadata to describe data and information on the Web, based on standards like XML.
- Machine learning can play three roles in the Semantic Web:
  - It can be used to automatically create the markup or metadata for existing unstructured textual documents on the Web.
  - It can be used to create, merge, update, and maintain ontologies.
  - It can understand and perform reasoning on the metadata provided by the Semantic Web in order to extract knowledge from the Web more effectively.
13Web Mining
- Web mining is the application of data mining techniques to discover patterns from the Web.
  - The term was coined by Etzioni (1996).
- How is Web mining different from classical DM?
  - The web is not a relation
    - Textual information and linkage structure
  - Usage data is huge and growing rapidly
    - Google's usage logs are bigger than their web crawl
    - Data generated per day is comparable to the largest conventional data warehouses
  - Ability to react in real time to usage patterns
    - No human in the loop
14Benefits of Web Data Mining
- Match your available resources to visitor interests
- Increase the value of each visitor
- Improve the visitor's experience at the website
- Perform targeted resource management
- Collect information in new ways
- Test the relevance of content and web site
architecture
15Web Mining
- According to analysis targets, web mining can be divided into three different types:
  - Web usage mining
  - Web content mining
  - Web structure mining
161. Web Usage Mining
- The application that uses data mining to analyze and discover interesting patterns in users' usage data on the web.
- The usage data records the user's behavior when the user browses or makes transactions on the web site; it is analyzed in order to better understand and serve the needs of users or Web-based applications.
- It is an activity that involves the automatic discovery of patterns from one or more Web servers.
171. Web Usage Mining
- Organizations often generate and collect large volumes of data; most of this information is usually generated automatically by Web servers and collected in server logs. Analyzing such data can help these organizations determine:
  - the value of particular customers
  - cross-marketing strategies across products
  - the effectiveness of promotional campaigns, etc.
181. Web Usage Mining
- The first web analysis tools simply provided mechanisms to report user activity as recorded in the servers. Using such tools, it was possible to determine such information as:
  - the number of accesses to the server
  - the times or time intervals of visits
  - the domain names and the URLs of users of the Web server.
- These tools provide little or no analysis of data relationships among the accessed files and directories within the Web space.
- More sophisticated techniques for discovery and analysis of patterns are now emerging. These tools fall into two main categories:
  - Pattern Discovery Tools
  - Pattern Analysis Tools
191. Web Usage Mining
- Web servers, Web proxies, and client applications can quite easily capture Web usage data.
  - Web server log: every visit to the pages, which files have been requested and when, the IP address of the request, the error code, the number of bytes sent to the user, and the type of browser used.
- By analyzing the Web usage data, web mining systems can discover useful knowledge about a system's usage characteristics and the users' interests, which has various applications:
  - Personalization and collaboration in Web-based systems
  - Marketing
  - Web site design and evaluation
  - Decision support
201. Web Usage Mining
- Web usage mining has been used for various purposes:
  - A knowledge discovery process for mining marketing intelligence information from Web data.
  - Web traffic patterns can also be extracted from Web usage logs in order to improve the performance of a Web site.
  - Search engine transaction logs also provide valuable knowledge about user behavior in Web searching.
  - Such information is very useful for a better understanding of users' Web searching and information-seeking behavior and can improve the design of Web search systems.
211. Web Usage Mining
- One of the major goals of Web usage mining is to reveal interesting trends and patterns which can often provide important knowledge about the users of a system.
- The framework for Web usage mining:
  - Preprocessing: data cleansing
  - Pattern discovery
  - Pattern analysis
- Generic machine learning and data mining techniques, such as association rule mining, classification, and clustering, can often be applied.
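As one illustration of the pattern discovery step, the Python sketch below mines simple two-page association rules from user sessions. It assumes preprocessing has already grouped cleaned log entries into sessions; the function name `pair_rules`, the thresholds, and the sample sessions are hypothetical, and a full association rule miner (e.g., Apriori) would also handle larger itemsets.

```python
from collections import Counter
from itertools import combinations


def pair_rules(sessions, min_support=2, min_confidence=0.6):
    """Return rules (lhs, rhs, support_count, confidence) between page pairs."""
    page_counts = Counter()
    pair_counts = Counter()
    for session in sessions:
        pages = sorted(set(session))  # count each page once per session
        page_counts.update(pages)
        pair_counts.update(combinations(pages, 2))

    rules = []
    for (a, b), n in pair_counts.items():
        if n < min_support:
            continue
        # Test the rule in both directions: a -> b and b -> a.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = n / page_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, n, confidence))
    return sorted(rules)


# Hypothetical sessions: each list is the pages one visitor requested.
sessions = [
    ["/", "/jobs/", "/courses/"],
    ["/", "/jobs/"],
    ["/jobs/", "/courses/"],
    ["/", "/news/"],
]

rules = pair_rules(sessions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}  support={support}  confidence={confidence:.2f}")
```

Pattern analysis would then filter these rules for the ones that are actually interesting to the site owner.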
221. Web Usage Mining
- Many Web applications aim to provide personalized information and services to users. Web usage data provide an excellent way to learn about users' interests.
- Web usage mining on Web logs can help identify users who have accessed similar Web pages. The patterns that emerge can be very useful in collaborative Web searching and filtering.
  - Amazon.com uses collaborative filtering to recommend books to potential customers based on the preferences of other customers having similar interests or purchasing histories.
  - Huang et al. (2002) used a Hopfield net to model user interests and product profiles in an online bookstore in Taiwan.
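The idea of finding users who accessed similar pages can be sketched in Python with a simple user-based collaborative filter. This is an illustrative assumption, not the method used by Amazon.com or by Huang et al.: similarity is measured with the Jaccard coefficient over visited-page sets, and the `visits` data and the `recommend` function are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two sets of pages, from 0.0 to 1.0."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, visits, top_n=3):
    """Suggest pages seen by similar users but not yet by `target`."""
    seen = visits[target]
    scores = {}
    for user, pages in visits.items():
        if user == target:
            continue
        weight = jaccard(seen, pages)
        if weight == 0:
            continue  # ignore users with nothing in common
        for page in pages - seen:
            scores[page] = scores.get(page, 0.0) + weight
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ranked[:top_n]]


# Hypothetical page sets derived from Web logs, one per user.
visits = {
    "u1": {"/jobs/", "/courses/"},
    "u2": {"/jobs/", "/courses/", "/software/"},
    "u3": {"/news/"},
    "u4": {"/jobs/", "/datasets/"},
}

suggestions = recommend("u1", visits)
print(suggestions)
```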
23Web Server Log
- Figure: a user requests http://www.kdnuggets.com/jobs/ from the KDnuggets.com server
24Web Server Log A Sample
- 152.152.98.11
- - -
- 16/Nov/2005:16:32:50 -0500
- "GET /jobs/ HTTP/1.1"
- 200
- 15140
- "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N"
- "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322)"
25Web log fields
- IP
  - 152.152.98.11
  - IP address; can be converted to a host name, such as xyz.example.com
- Name
  - The name of the remote user (usually omitted and replaced by a dash "-")
- Login
  - Login of the remote user (also usually omitted and replaced by a dash "-")
- Date/Time/TZ
  - 16/Nov/2005:16:32:50 -0500
- Request, Status code, Object size, Referrer, User agent
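The fields listed above can be pulled out of a log line with a regular expression. The Python sketch below assumes the Apache "combined" log layout, in which the timestamp is wrapped in square brackets; `LOG_PATTERN` and `parse_line` are hypothetical names, and the sample line uses shortened referrer and user-agent strings for readability.

```python
import re
from datetime import datetime

# One named group per field: IP, name, login, date/time/TZ, request,
# status code, object size, referrer, user agent.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<name>\S+) (?P<login>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    # A dash means no bytes were sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] "GET /jobs/ HTTP/1.1" '
    '200 15140 "http://www.google.com/search?q=salary+for+data+mining" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)

record = parse_line(sample)
print(record["ip"], record["status"], record["size"], record["time"].isoformat())
```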
26Web Usage Mining - Basic
- Totals for each component
  - Hits: total number of requests
  - Files: number of GETs
  - Pages: number of HTML pages
  - Sites: unique IP addresses
  - Response codes
  - Kbytes: total Kbytes transferred
  - User agents
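A rough Python sketch of how such totals could be computed from parsed log records. Two assumptions are made here: "Files" counts requests answered with status code 200, and "Pages" counts those whose path ends in `/`, `.html`, or `.htm`. Real log analyzers use configurable rules, and the `records` list is made-up sample data.

```python
from collections import Counter

# Assumed suffixes that mark a request as an HTML page.
HTML_SUFFIXES = ("/", ".html", ".htm")


def summarize(records):
    """Compute basic totals from a list of parsed log records."""
    files = [r for r in records if r["status"] == 200]
    pages = [r for r in files if r["path"].split("?")[0].endswith(HTML_SUFFIXES)]
    return {
        "hits": len(records),
        "files": len(files),
        "pages": len(pages),
        "sites": len({r["ip"] for r in records}),
        "kbytes": sum(r["size"] for r in records) / 1024,
        "response_codes": Counter(r["status"] for r in records),
    }


# Hypothetical parsed records (only the fields needed for the totals).
records = [
    {"ip": "152.152.98.11", "path": "/jobs/", "status": 200, "size": 15140},
    {"ip": "152.152.98.11", "path": "/images/logo.gif", "status": 200, "size": 2048},
    {"ip": "10.0.0.7", "path": "/index.html", "status": 200, "size": 4096},
    {"ip": "10.0.0.7", "path": "/old.html", "status": 404, "size": 512},
]

totals = summarize(records)
print(totals)
```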
27Web Log Analysis Programs
- Free
  - Analog, awstats, webalizer
  - Google Analytics
- Commercial
  - WebTrends, WebSideStory
28Example KDnuggets.com Nov 2005 totals
- Monthly statistics (from webalizer)
- Q: What is the difference between Hits and Files?
- A: The difference between Hits and Files is the number of requests with a status code other than 200.
29Example KDnuggets.com Nov 2005 totals
- Q: What is the meaning of the difference between Files and Pages?
- A: The difference between Files and Pages is the number of non-HTML files (e.g., images, JavaScript, etc.).
- In the November 2005 KDnuggets log, HTML files were about 1/3 of all requests.
- However, this data does not separate bot requests (which are heavily weighted towards HTML pages).
302. Web Content Mining
- The process of discovering useful information from the content of a web page.
- The web content may consist of:
  - Text
  - Image
  - Audio
  - Video
- Web content mining is sometimes called web text mining, because text content is the most widely researched area.
- The technologies that are normally used in web content mining are:
  - Natural Language Processing (NLP)
  - Information Retrieval (IR)
31Text Mining
- The process of deriving high-quality information from text.
- Text mining is an interdisciplinary field which draws on information retrieval, data mining, machine learning, statistics, and computational linguistics. As most information (over 80%) is currently stored as text, text mining is believed to have high commercial potential.
- High-quality information is typically derived through the divining of patterns and trends through means such as statistical pattern learning.
32Text Mining
- Text mining usually involves the process of:
  - Structuring the input text by
    - Parsing
    - Addition of some derived linguistic features and the removal of others
    - Subsequent insertion into a database
  - Deriving patterns within the structured data
  - Evaluation and interpretation of the output.
- 'High quality' in text mining refers to some combination of:
  - Relevance
  - Novelty
  - Interestingness
33Text Mining
- Typical text mining tasks include:
  - Text categorization
  - Text clustering
  - Concept/entity extraction
  - Sentiment analysis
  - Document summarization
  - Entity relation modeling (i.e., learning relations between named entities).
343. Web Structure Mining
- The process of using graph theory to analyze the node and connection structure of a web site. Web structure mining can be divided into two kinds:
  - Extracting patterns from hyperlinks in the web. A hyperlink is a structural component that connects the web page to a different location.
  - Mining the document structure: using the tree-like structure to analyze and describe the HTML or XML tags within the web page.
353. Web Structure Mining
- Web structure mining has been largely influenced by research in:
  - Social network analysis
  - Citation analysis (bibliometrics).
- In-links: the hyperlinks pointing to a page.
- Out-links: the hyperlinks found in a page.
- Usually, the larger the number of in-links, the better a page is considered to be.
- By analyzing the pages containing a URL, we can also obtain:
  - Anchor text: how other Web page authors annotate a page; it can be useful in predicting the content of the target page.
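A minimal Python sketch of collecting out-links, in-links, and anchor text from a set of pages. The three-page `pages` dictionary and the names `AnchorParser` and `link_summary` are hypothetical, and links are kept as written rather than resolved to absolute URLs.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Records (href, anchor text) for each <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:  # only keep text inside a link
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def link_summary(pages):
    """Build out-links, in-links, and anchor text per target URL."""
    out_links = {}
    in_links = defaultdict(set)
    anchor_text = defaultdict(list)
    for url, html in pages.items():
        parser = AnchorParser()
        parser.feed(html)
        out_links[url] = [href for href, _ in parser.anchors]
        for href, text in parser.anchors:
            in_links[href].add(url)
            anchor_text[href].append(text)
    return out_links, in_links, anchor_text


# Hypothetical pages, keyed by URL.
pages = {
    "a.html": '<p>See <a href="c.html">data mining jobs</a></p>',
    "b.html": '<a href="c.html">KDD careers</a> and <a href="a.html">index</a>',
    "c.html": "<p>No links here.</p>",
}

out_links, in_links, anchor_text = link_summary(pages)
print(len(in_links["c.html"]), anchor_text["c.html"])
```

Here `c.html` has two in-links, and the anchor text other authors used for it hints at its topic even though the page itself was never inspected.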
363. Web Structure Mining
- The PageRank score is computed by weighting each in-link to a page proportionally to the quality of the page containing the in-link.
- The qualities of these referring pages are also determined by PageRank. Thus, the PageRank of a page p is calculated recursively as follows
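A minimal Python sketch of the recursive computation, assuming the commonly cited form PR(p) = (1 - d) + d * sum of PR(q) / c(q) over all pages q that link to p, where d is a damping factor between 0 and 1 and c(q) is the number of out-links on q. The value d = 0.85, the fixed iteration count, and the three-page graph are illustrative assumptions.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate PR(p) = (1 - d) + d * sum(PR(q) / c(q)) for q linking to p."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)

    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for q, targets in out_links.items():
            if not targets:
                continue  # a page without out-links passes on no rank
            share = d * rank[q] / len(targets)
            for p in targets:
                new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical graph: page -> pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

ranks = pagerank(graph)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

Page C ends up ranked highest because it receives links from both A and B, while B receives only half of A's weight.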
37Ads vs. search results
- Search advertising is the revenue model
  - Multi-billion-dollar industry
  - Advertisers pay for clicks on their ads
- Interesting problems
  - How to pick the top 10 results for a search from 2,230,000 matching pages?
  - What ads to show for a search?
  - If I'm an advertiser, which search terms should I bid on and how much should I bid?
38Web Mining vs. Information Access
- Text data mining involves extracting nuggets and/or overall patterns from a collection of textual information, independent of a user's information need.
- Information access is the process of helping users find, create, use, re-use, and understand information to satisfy an information need.
- In other words, data mining is opportunistic, whereas information access is goal-driven.
39Search Engine Components
- Spider (crawler/robot): builds the corpus
  - Collects web pages recursively
    - For each known URL, fetch the page, parse it, and extract new URLs
    - Repeat
  - Additional pages from direct submissions and other sources
- The indexer: creates inverted indexes
  - Various policies with respect to which words are indexed, capitalization, support for Unicode, stemming, support for phrases, etc.
- Query processor: serves query results
  - Front end: query reformulation, word stemming, capitalization, optimization of Booleans, etc.
  - Back end: finds matching documents and ranks them
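The indexer and the matching part of the query back end can be sketched in Python as follows. The policy choices here (lowercasing, no stemming, AND semantics for multi-word queries, no ranking) are simplifying assumptions, and the document collection is hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Indexing policy assumed here: lowercase alphanumeric terms only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Inverted index: term -> set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical corpus collected by the spider.
docs = {
    1: "Web mining applies data mining to the Web",
    2: "Web usage mining analyzes server logs",
    3: "Text mining derives patterns from text",
}

index = build_index(docs)
print(search(index, "web mining"))
print(search(index, "Mining TEXT"))
```

A production query processor would add ranking on top of the matching set, for example by combining term statistics with link-based scores.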
40Application Areas of Web Mining
- E-commerce
- Search Engines
- Personalization
- Website Design
41Application Areas of Web Mining
- E-tailers
- The ability to find new cross-sell opportunities, enable comprehensive prospect profiling, and improve customer satisfaction.
- B2B and B2C Ventures
42Application Areas of Web Mining
- Advertising-Based Sites
- When revenue is advertising-based, blindly serving ads to visitors will not result in a large click-through rate. Instead, ads must be intelligently targeted to the user, providing the visitor with products and services that they are interested in.
- Entertainment sites
- Media Portals
- Advertising Providers
43Application Areas of Web Mining
- Information Repositories
- Information overload is a problem that grows larger every day. Indexing, summarization, and other metadata tasks are time consuming. Semantic text analyzers can automate these tasks and create user navigation systems on the fly.
- Libraries
- Technical Support Sites
- Media Sites
- Content Providers
44Application Areas of Web Mining
- Security applications
- One of the largest text mining applications that exists is probably the classified ECHELON surveillance system.
- Software and Applications
- Research and development departments of major companies, including IBM and Microsoft, are researching text mining techniques and developing programs to further automate the mining and analysis processes.
45Application Areas of Web Mining
- Academic applications
- The issue of text mining is of importance to
publishers who hold large databases of
information requiring indexing for retrieval.
46Conclusion
- Major limitations of Web mining research
  - Lack of suitable test collections that can be reused by researchers.
  - Difficult to collect Web usage data across different Web sites.
- Future research directions
  - Multimedia data mining: a picture is worth a thousand words.
  - Multilingual knowledge extraction: Web page translations
  - Wireless Web: WML and HDML.
  - The Hidden Web: forms, dynamically generated Web pages.
  - Semantic Web
This presentation is reproduced from the
articles attached