Title: Web Mining Research: A Survey
1. Web Mining Research: A Survey
- Authors
- Raymond Kosala
- Hendrik Blockeel
- Heverlee, Belgium
- Presented by
- Devesh Sinha
2. A Survey in Web Mining
- Web mining is the use of data mining techniques to automatically discover and extract information from Web documents/services (Etzioni, 1996).
- Web mining research is at the crossroads of research from several communities (Kosala and Blockeel, July 2000), such as
- database (DB)
- information retrieval (IR)
- the sub-areas of machine learning (ML) and natural language processing (NLP)
3. Mining the World-Wide Web
- Motivation / Opportunity
- The WWW is a huge, widely distributed, global information service center for
- Information services: news, advertisements, consumer information, financial management, education, government, e-commerce, etc.
- Hyper-link information
- Access and usage information
- WWW provides rich sources for data mining
4. Mining the World-Wide Web
- Growing and changing very rapidly
- Broad diversity of user communities
- Only a small portion of the information on the Web is truly relevant or useful
- 99% of the Web information is useless to 99% of Web users
- How can we find high-quality Web pages on a specified topic?
5. Challenges on WWW interactions
- Finding Relevant Information
- Creating knowledge from Information available
- Personalization of the information
- Learning about customers / individual users
6. Web Mining: A more challenging task
- Searches for
- Web access patterns
- Web structures
- Regularity and dynamics of Web contents
- Problems
- The abundance problem
- Limited coverage of the Web: hidden Web sources, majority of data in DBMS
- Limited query interface based on keyword-oriented search
- Limited customization to individual users
7. Web Mining Subtasks
- Resource Finding
- Task of retrieving intended Web documents
- Information Selection and Pre-processing
- Automatically selecting and pre-processing specific information from retrieved Web resources
- Generalization
- Automatic discovery of patterns in Web sites
- Analysis
- Validation and/or interpretation of mined patterns
8. Discussion Question
- What is the difference between Information Retrieval and Information Extraction?
9. IE - IR
- Information Retrieval
- Automatic retrieval of relevant documents
- Primary Goals
- Indexing text
- Searching for useful documents in a collection
- Bag of unordered words
- Web document classification task is an instance of IR
- Information Extraction
- Extract relevant facts from documents
- Primary Goals
- Transform a collection of retrieved documents into information
- Structure or representation of a document
- IE has a higher level of granularity
- Result
- Structured database
- Compression or summary of text or documents
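The contrast above can be sketched in a few lines of Python. This is only an illustration under assumptions: the two sample documents, the function names `bag_of_words`, `retrieve` and `extract_appointments`, and the hand-written extraction pattern are all hypothetical and not taken from the survey. The IR half treats each document as a bag of unordered words and returns whole documents; the IE half pulls one structured fact out of a sentence.

```python
import re
from collections import Counter

# Hypothetical two-document collection used only for illustration.
docs = {
    "d1": "Acme Corp appointed Jane Doe as CEO in 2001.",
    "d2": "The weather in Leuven was mild in 2001.",
}

def bag_of_words(text):
    """IR view: a document is an unordered multiset of lowercased words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, documents):
    """Return ids of documents containing query words, best match first."""
    query_words = bag_of_words(query)
    scores = {}
    for doc_id, text in documents.items():
        bag = bag_of_words(text)
        scores[doc_id] = sum(bag[word] for word in query_words)
    matching = [d for d in scores if scores[d] > 0]
    return sorted(matching, key=lambda d: -scores[d])

# IE view: a hand-written pattern (an assumption, not a learned rule)
# that turns a sentence into a structured record.
APPOINTMENT = re.compile(
    r"(?P<company>[A-Z]\w+(?: [A-Z]\w+)*) appointed "
    r"(?P<person>[A-Z]\w+ [A-Z]\w+) as (?P<role>[A-Z]+)"
)

def extract_appointments(text):
    """Return one dict per appointment fact found in the text."""
    return [m.groupdict() for m in APPOINTMENT.finditer(text)]

relevant = retrieve("CEO appointed", docs)   # whole documents
facts = extract_appointments(docs["d1"])     # rows for a structured database
```

The IR result is a list of documents, while the IE result is a list of records that could be loaded into a structured database, which matches the "Result" bullets on the slide.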
10. Types of IE
- IE from unstructured texts (Classical)
- Unstructured: free texts, e.g. news stories
- Basic to deep linguistic processing
- IE from semi-structured texts (Structural)
- Semi-structured: HTML
- Uses meta-information, e.g. HTML tags
- Wrapper induction
- Machine learning used to build systems (semi-)automatically
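To make the idea of a wrapper concrete, here is a minimal hand-written one, assuming a page whose records sit in table rows. It only shows how HTML tags serve as meta-information for extraction; a wrapper induction system would learn such rules from examples instead of having them coded by hand. The class `TableWrapper`, the function `extract_records`, the sample page and the field names are all hypothetical.

```python
from html.parser import HTMLParser

class TableWrapper(HTMLParser):
    """Collect the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None
        self._in_cell = False  # True while inside a <td> element

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr":
            self._in_cell = False
            # Rows without <td> cells (e.g. header rows) are skipped.
            if self._row:
                self.rows.append(self._row)
            self._row = None

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()

def extract_records(html, fields):
    """Map each table row onto the given field names."""
    wrapper = TableWrapper()
    wrapper.feed(html)
    wrapper.close()
    return [dict(zip(fields, row)) for row in wrapper.rows]

# Hypothetical semi-structured page.
PAGE = """
<table>
  <tr><th>Title</th><th>Price</th></tr>
  <tr><td>Data Mining</td><td>59.00</td></tr>
  <tr><td>Web Search</td><td>42.50</td></tr>
</table>
"""

records = extract_records(PAGE, ["title", "price"])
```

No linguistic processing is involved here: the tags alone delimit the fields, which is what distinguishes structural IE from the classical free-text case.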
11. Discussion Question
- Is Web mining the same as learning from the Web, or as machine learning techniques applied to the Web?
12. Agent Paradigm
- Software / Intelligent Agents
- User Interface Agents
- Maximize productivity of current user interaction by adapting behaviour
- Distributed Agents
- Problem solving by a group of agents
- Relevant Agents
- Mobile Agents
13. Web Mining Taxonomy
14. Web Content Mining
- Discovery of useful information from Web contents / data / documents
- Information Retrieval View (Structured / Semi-Structured)
- Assist / improve information finding
- Filter information to users based on user profiles
- Database View
- Model data on the Web
- Integrate them for more sophisticated queries
15. A Survey in Web Mining
- What has been done in Web content mining?
1. Developing intelligent tools for IR
- Finding keywords and keyphrases
- Discovering grammatical rules and collocations
- Hypertext classification/categorization
- Extracting keyphrases from text documents
- Learning extraction models/rules
- Hierarchical clustering
- Predicting (words) relationship
2. Developing Web query systems
- Many applications, such as WebLog (Lakshmanan et al., 1996)
3. Mining multimedia data
- Fayyad et al. (1996): mining images from satellites
- Smyth et al. (1996): mining images to identify small volcanoes on Venus
16. Multiple Layered Web Architecture
(Diagram: stacked layers from Layer 0 at the bottom, through Layer 1 with generalized descriptions, up to Layer n with more generalized descriptions.)
17. Web Structure Mining
- Finding authoritative Web pages
- Retrieving pages that are not only relevant, but also of high quality, or authoritative on the topic
- Hyperlinks can infer the notion of authority
- The Web consists not only of pages, but also of hyperlinks pointing from one page to another
- These hyperlinks contain an enormous amount of latent human annotation
- A hyperlink pointing to another Web page can be considered the author's endorsement of that page
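One way to turn "a link is an endorsement" into numbers is a hub/authority iteration in the style of Kleinberg's HITS. The slide does not name an algorithm at this point, so the sketch below is an assumption chosen for illustration; the function names and the four-page link graph are hypothetical. A page is a good authority if good hubs link to it, and a good hub if it links to good authorities.

```python
import math

def _normalize(scores):
    """Scale scores to unit Euclidean length (no-op if all are zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {page: v / norm for page, v in scores.items()}

def hits(links, iterations=50):
    """links maps a page to the list of pages it links to.

    Returns (authority, hub) score dictionaries.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    authority = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link in.
        new_authority = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for target in targets:
                new_authority[target] += hub[source]
        # Hub: sum of authority scores of the pages linked to.
        new_hub = {
            p: sum(new_authority[t] for t in links.get(p, []))
            for p in pages
        }
        authority = _normalize(new_authority)
        hub = _normalize(new_hub)
    return authority, hub

# Hypothetical graph: three pages endorse "c", one of them also endorses "a".
graph = {"a": ["c"], "b": ["c"], "d": ["c", "a"]}
authority, hub = hits(graph)
best_authority = max(authority, key=authority.get)
best_hub = max(hub, key=hub.get)
```

In this toy graph the page with the most incoming endorsements ends up as the top authority, and the page that points to the most authoritative pages ends up as the top hub.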
18. A Survey in Web Mining
- What has been done in Web structure mining?
1. Calculating the quality and relevancy of each Web page
- Web page categorization (Chakrabarti et al., 1998)
- Discovering micro-communities on the Web
- Example: Clever system (Chakrabarti et al., 1999)
- Example: Google (Brin and Page, 1998)
2. Mining context of Web warehouse (Madria et al., 1999)
- Measuring the completeness of Web sites
- Measuring the replication of Web documents
19. Web Usage Mining
- Web usage mining, also known as Web log mining, is the process of discovering interesting patterns in Web access logs.
- Commonly used approaches (Borges and Levene, 1999)
- Map the log data into relational tables before an adapted data mining technique is performed.
- Use the log data directly by utilizing special pre-processing techniques.
- Typical problems
- Distinguishing among unique users, server sessions, episodes, etc. in the presence of caching and proxy servers (McCallum et al., 2000; Srivastava et al., 2000).
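A common pre-processing heuristic for the session problem is to split each client's requests wherever the gap between two requests exceeds a timeout. The sketch below assumes a 30-minute timeout and identifies clients by IP address; both are assumptions for illustration and not taken from the slides, and the function name `sessionize` and the sample requests are hypothetical. As the slide notes, proxies and caching make IP addresses an unreliable stand-in for users, so real systems combine this with other signals.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed, commonly used threshold

def sessionize(requests, timeout=SESSION_TIMEOUT):
    """Group requests into sessions per client.

    requests: iterable of (client_id, unix_time, url) tuples.
    Returns {client_id: [[url, ...], ...]} with one inner list per session.
    """
    by_client = defaultdict(list)
    for client, timestamp, url in requests:
        by_client[client].append((timestamp, url))

    sessions = {}
    for client, visits in by_client.items():
        visits.sort()  # chronological order
        client_sessions = []
        last_time = None
        for timestamp, url in visits:
            # A long silence starts a new session.
            if last_time is None or timestamp - last_time > timeout:
                client_sessions.append([])
            client_sessions[-1].append(url)
            last_time = timestamp
        sessions[client] = client_sessions
    return sessions

# Hypothetical requests: (client IP, seconds since start, URL).
log = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.1", 60, "/a"),
    ("10.0.0.2", 10, "/"),
    ("10.0.0.1", 4000, "/b"),
]
sessions = sessionize(log)
```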
20. Mining the World-Wide Web
(Taxonomy diagram nodes: Web Content Mining, Web Structure Mining, Web Usage Mining, Web Page Content Mining, Search Result Mining, General Access Pattern Tracking, Customized Usage Tracking. This slide expands Web Page Content Mining.)
- Web Page Content Mining
- Web Page Summarization
- WebLog (Lakshmanan et al., 1996), WebOQL (Mendelzon et al., 1998): Web structuring query languages
- Can identify information within given Web pages
- Ahoy! (Etzioni et al., 1997): uses heuristics to distinguish personal home pages from other Web pages
- ShopBot (Etzioni et al., 1997): looks for product prices within Web pages
21. Mining the World-Wide Web
(Taxonomy diagram nodes: Web Content Mining, Web Structure Mining, Web Usage Mining, Web Page Content Mining, Search Result Mining, General Access Pattern Tracking, Customized Usage Tracking. This slide expands Search Result Mining.)
- Search Result Mining
- Search Engine Result Summarization
- Clustering Search Results (Leouski and Croft, 1996; Zamir and Etzioni, 1997)
- Categorizes documents using phrases in titles and snippets
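As a rough illustration of grouping search results by the phrases in their titles and snippets, the sketch below indexes every two-word phrase and keeps those shared by at least two results. It is a deliberately simplified stand-in, not the method of the cited papers; the function names, the choice of two-word phrases and the sample results are assumptions.

```python
import re
from collections import defaultdict

def phrases(text, n=2):
    """Return the set of n-word phrases (lowercased) in a text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def cluster_by_phrase(results, min_docs=2):
    """Group results that share a phrase in their title or snippet.

    results: list of (title, snippet) pairs.
    Returns {phrase: sorted list of result indices}.
    """
    index = defaultdict(set)
    for i, (title, snippet) in enumerate(results):
        # Title and snippet are handled separately so that no phrase
        # spans the boundary between them.
        for phrase in phrases(title) | phrases(snippet):
            index[phrase].add(i)
    return {p: sorted(ids) for p, ids in index.items() if len(ids) >= min_docs}

# Hypothetical search results for the ambiguous query "jaguar".
results = [
    ("Jaguar cars", "Luxury jaguar cars for sale"),
    ("Jaguar habitat", "The big cat lives in rain forest"),
    ("Used jaguar cars", "Dealers near you"),
    ("Rain forest animals", "Big cat species"),
]
clusters = cluster_by_phrase(results)
```

Clusters may overlap: a result can appear under several phrases, which is often desirable when a page covers more than one topic.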
22. Mining the World-Wide Web
(Taxonomy diagram nodes: Web Content Mining, Web Structure Mining, Web Usage Mining, Web Page Content Mining, Search Result Mining, General Access Pattern Tracking, Customized Usage Tracking. This slide expands Web Structure Mining.)
- Web Structure Mining
- Using Links
- PageRank (Brin et al., 1998)
- CLEVER (Chakrabarti et al., 1998)
- Use interconnections between Web pages to give weight to pages.
- Using Generalization
- MLDB (1994), VWV (1998)
- Uses a multi-level database representation of the Web. Counters (popularity) and link lists are used for capturing structure.
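The idea of using interconnections to weight pages can be sketched with a small power-iteration version of PageRank. This is a textbook-style simplification, not the production algorithm: the damping factor of 0.85, the even redistribution of rank from pages without outgoing links, the function name and the sample graph are assumptions made for the example.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps a page to the list of distinct pages it links to.

    Returns {page: rank}; the ranks sum to 1.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Assumed handling of pages without outgoing links:
                # spread their rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical four-page Web.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
top_page = max(ranks, key=ranks.get)
```

Unlike a raw count of incoming links, the weight a page passes on depends on its own rank, so a link from an important page counts for more.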
23. Mining the World-Wide Web
(Taxonomy diagram nodes: Web Content Mining, Web Structure Mining, Web Usage Mining, Web Page Content Mining, Search Result Mining, General Access Pattern Tracking, Customized Usage Tracking. This slide expands General Access Pattern Tracking.)
- General Access Pattern Tracking
- Web Log Mining (Zaïane, Xin and Han, 1998)
- Uses KDD techniques to understand general access patterns and trends.
- Can shed light on better structure and grouping of resource providers.
24. Mining the World-Wide Web
(Taxonomy diagram nodes: Web Content Mining, Web Structure Mining, Web Usage Mining, Web Page Content Mining, Search Result Mining, General Access Pattern Tracking, Customized Usage Tracking. This slide expands Customized Usage Tracking.)
- Customized Usage Tracking
- Adaptive Sites (Perkowitz and Etzioni, 1997)
- Analyzes access patterns of each user at a time.
- Web site restructures itself automatically by learning from user access patterns.
25. Web Usage Mining
- Mining Web log records to discover user access patterns of Web pages
- Applications
- Target potential customers for electronic commerce
- Enhance the quality and delivery of Internet information services to the end user
- Improve Web server system performance
- Identify potential prime advertisement locations
- Web logs provide rich information about Web dynamics
- Typical Web log entry includes the URL requested, the IP address from which the request originated, and a timestamp
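The three fields named above can be pulled out of a log line with a regular expression. The sketch below assumes the server writes entries in Common Log Format and that month names are in English; the sample line, the function name `parse_entry` and the returned field names are made up for illustration.

```python
import re
from datetime import datetime

# Common Log Format: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)

def parse_entry(line):
    """Return the IP, URL, timestamp and status of one log line, or None."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    return {
        "ip": match.group("host"),
        "url": match.group("url"),
        # Example time field: 10/Oct/2000:13:55:36 -0700
        "timestamp": datetime.strptime(
            match.group("time"), "%d/%b/%Y:%H:%M:%S %z"
        ),
        "status": int(match.group("status")),
    }

# Hypothetical log line.
LINE = (
    '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] '
    '"GET /index.html HTTP/1.0" 200 2326'
)
entry = parse_entry(LINE)
```

Lines that do not match the pattern return `None`, so malformed entries can be counted and dropped during cleaning rather than crashing the parser.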
26. Discussion Question
- What are the four subtasks of Web Mining?
- 1.
- 2.
- 3.
- 4.
27. Techniques for Web usage mining
- Construct multidimensional view on the Weblog database
- Perform multidimensional OLAP analysis to find the top N users, top N accessed Web pages, most frequently accessed time periods, etc.
- Perform data mining on Weblog records
- Find association patterns, sequential patterns, and trends of Web accessing
- May need additional information, e.g., user browsing sequences of the Web pages in the Web server buffer
- Conduct studies to
- Analyze system performance, improve system design by Web caching, Web page prefetching, and Web page swapping
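Two of the analyses listed above can be sketched with plain counting: a top-N summary over one field, and a very simple sequential pattern, namely page-to-page transitions that occur in at least a minimum number of sessions. The record layout, the function names and the support threshold are assumptions for the example; real sequential pattern miners handle longer and non-contiguous sequences.

```python
from collections import Counter

def top_n(records, field, n=3):
    """Return the n most frequent values of a field with their counts."""
    return Counter(record[field] for record in records).most_common(n)

def frequent_transitions(sessions, min_support=2):
    """Return page pairs (from, to) visited consecutively in at least
    min_support sessions, with the number of supporting sessions."""
    support = Counter()
    for pages in sessions:
        # set() so that a pair counts once per session.
        support.update(set(zip(pages, pages[1:])))
    return {pair: c for pair, c in support.items() if c >= min_support}

# Hypothetical cleaned log records and browsing sessions.
records = [
    {"user": "u1", "url": "/"},
    {"user": "u1", "url": "/a"},
    {"user": "u2", "url": "/"},
    {"user": "u1", "url": "/"},
]
sessions = [["/", "/a", "/b"], ["/", "/a"], ["/a", "/b"]]

top_user = top_n(records, "user", n=1)
top_page = top_n(records, "url", n=1)
transitions = frequent_transitions(sessions)
```

Frequent transitions of this kind are the sort of evidence that could motivate prefetching the likely next page.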
28. Mining the World-Wide Web
- Design of a Web Log Miner
- Web log is filtered to generate a relational database
- A data cube is generated from the database
- OLAP is used to drill down and roll up in the cube
- OLAM is used for mining interesting knowledge
(Diagram: Web log -> 1 Data Cleaning -> Database -> 2 Data Cube Creation -> Data Cube -> 3 OLAP -> Sliced and diced cube -> 4 Data Mining -> Knowledge)
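The cube operations in this design can be imitated in memory with a counter keyed by dimension values. The sketch assumes three dimensions (user, page, hour) and a request count as the only measure; the dimension names, function names and sample records are hypothetical, and a real Web log miner would use a database or OLAP engine instead of a Python dictionary.

```python
from collections import Counter

DIMENSIONS = ("user", "page", "hour")  # assumed cube dimensions

def build_cube(records):
    """Count requests for every (user, page, hour) combination."""
    return Counter(tuple(r[d] for d in DIMENSIONS) for r in records)

def roll_up(cube, keep):
    """Aggregate the cube over all dimensions not listed in keep."""
    positions = [DIMENSIONS.index(d) for d in keep]
    result = Counter()
    for cell, count in cube.items():
        result[tuple(cell[i] for i in positions)] += count
    return result

def slice_cube(cube, dimension, value):
    """Keep only the cells where one dimension has the given value."""
    i = DIMENSIONS.index(dimension)
    return Counter({c: n for c, n in cube.items() if c[i] == value})

# Hypothetical cleaned log records.
records = [
    {"user": "u1", "page": "/", "hour": 9},
    {"user": "u1", "page": "/a", "hour": 9},
    {"user": "u2", "page": "/", "hour": 10},
    {"user": "u1", "page": "/", "hour": 10},
]

cube = build_cube(records)
hits_per_page = roll_up(cube, ("page",))
users_at_ten = roll_up(slice_cube(cube, "hour", 10), ("user",))
```

Drilling down is the reverse of `roll_up`: calling it with more dimensions in `keep` returns the finer-grained counts.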
29. Website Usage Analysis (SUA)
- Why develop a Website usage/utilization analysis tool?
- Knowledge about how visitors use a Website could
- prevent disorientation and help designers place important information/functions exactly where visitors look for them and in the way users need them
- especially help to build an adaptive Website server
30. Website Usage Analysis (SUA)
- What does SUA do?
- Discover user navigation patterns in using the Website
- Establish an aggregated log structure as a preprocessor to reduce the search space before the actual log mining phase
- Introduce a model for Website usage pattern discovery by extending the classical mining model, and establish the processing framework of this model
31. Website Usage Analysis (SUA)
- Website client-server architecture facilitates recording user behavior at every step by
- submitting client-side log files to the server when users use clear functions or exit windows/modules
- The special design for local and universal back/forward/clear functions makes users' navigation patterns clearer for the designer by
- analyzing local back/forward history and incorporating it with universal back/forward history
32. Website Usage Analysis (SUA)
- What will be included in SUA
1. Identify and collect log data
2. Transfer the data to the server side and save them in a structure desired for analysis
3. Prepare mined data by establishing a customized aggregated log tree/frame
4. Use modifications of typical data mining methods, particularly an extension of a traditional sequence discovery algorithm, to mine user navigation patterns
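The slides do not define the aggregated log tree, so the sketch below assumes one plausible reading: a prefix tree in which each node counts how many sessions passed through that navigation path. Merging sessions this way shrinks the search space, because a branch whose count is below the threshold can be skipped together with everything beneath it. The class and function names and the sample sessions are hypothetical.

```python
class Node:
    """One page on a navigation path, with the number of sessions
    that reached it along this path."""

    def __init__(self):
        self.count = 0
        self.children = {}

def build_aggregated_tree(sessions):
    """Merge sessions (lists of page names) into a counted prefix tree."""
    root = Node()
    for pages in sessions:
        node = root
        node.count += 1
        for page in pages:
            node = node.children.setdefault(page, Node())
            node.count += 1
    return root

def frequent_paths(root, min_count):
    """Return {path: count} for paths followed by at least min_count sessions."""
    found = {}
    stack = [((), root)]
    while stack:
        path, node = stack.pop()
        for page, child in node.children.items():
            # A child's count never exceeds its parent's, so branches
            # below the threshold are pruned with their whole subtree.
            if child.count >= min_count:
                new_path = path + (page,)
                found[new_path] = child.count
                stack.append((new_path, child))
    return found

# Hypothetical navigation sessions.
sessions = [
    ["home", "search", "item"],
    ["home", "search", "cart"],
    ["home", "help"],
]
tree = build_aggregated_tree(sessions)
patterns = frequent_paths(tree, min_count=2)
```

This only finds patterns that start at the beginning of a session; an extended sequence discovery algorithm, as mentioned in step 4, would also look for patterns that begin mid-session.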
33. Website Usage Analysis (SUA)
- Problems that need to be considered
- How to identify the log data when a user goes through an uninteresting function/module
- What marks the end of a user session?
- Clients connect to the Website through proxy servers
- Differences between Website usage analysis and common Web usage mining
- Client-side log files are available
- Log file format (Web log files commonly follow the Common Log Format used by HTTP servers)
- No need for log file cleaning/filtering (which is usually performed in the preprocessing step of Web log mining)
34. WebSift Project
35. Reference
- Cooley, R., Mobasher, B., and Srivastava, J. Web mining: Information and pattern discovery on the World Wide Web. IEEE Computer, pages 558-566, 1997.
- Etzioni, O. The World Wide Web: Quagmire or gold mine. Communications of the ACM, 39(11):65-68, 1996.
- Fayyad, U., Djorgovski, S., and Weir, N. Automating the analysis and cataloging of sky surveys. In Advances in Knowledge Discovery and Data Mining, pages 471-493. AAAI Press, 1996.
- Kosala, R. and Blockeel, H. Web mining research: A survey. SIGKDD Explorations, 2(1):1-15, 2000.
36. Reference
- Langley, P. User modeling in adaptive interfaces. In Proceedings of the Seventh International Conference on User Modeling, pages 357-370, 1999.
- Madria, S.K., Bhowmick, S.S., Ng, W.K., and Lim, E.-P. Research issues in web data mining. In Proceedings of Data Warehousing and Knowledge Discovery, First International Conference, DaWaK 99, pages 303-312, 1999.
- Masand, B. and Spiliopoulou, M. WebKDD-99: Workshop on web usage analysis and user profiling. SIGKDD Explorations, 1(2), 2000.
- Mobasher, B., Jain, N., Han, E.H., and Srivastava, J. Web mining: Pattern discovery from world wide web transactions. Technical Report TR 96-060, University of Minnesota, Dept. of Computer Science, Minneapolis, 1996.
37. Reference
- Smyth, P., Fayyad, U.M., Burl, M.C., and Perona, P. Modeling subjective uncertainty in image annotation. In Advances in Knowledge Discovery and Data Mining, pages 517-539, 1996.
- Spiliopoulou, M. Data mining for the web. In Principles of Data Mining and Knowledge Discovery, Second European Symposium, PKDD 99, pages 588-589, 1999.
- Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N. Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations, 1(2), 2000.
- Zaïane, O.R., Xin, M., and Han, J. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. IEEE, pages 19-29, 1998.