1
Web Mining Research: A Survey
  • Authors: Raymond Kosala and Hendrik Blockeel
    (Heverlee, Belgium)
  • Presented by: Devesh Sinha

2
A Survey in Web Mining
  • Web mining is the use of data mining techniques
    to automatically discover and extract information
    from Web documents/services (Etzioni, 1996).
  • Web mining research sits at the crossroads of
    several research communities (Kosala and
    Blockeel, July 2000), such as
    - database (DB)
    - information retrieval (IR)
    - the AI sub-areas of machine learning (ML) and
      natural language processing (NLP)

3
Mining the World-Wide Web
  • Motivation / Opportunity
  • The WWW is a huge, widely distributed, global
    information service center for
    - Information services: news, advertisements,
      consumer information, financial management,
      education, government, e-commerce, etc.
    - Hyper-link information
    - Access and usage information
  • WWW provides rich sources for data mining

4
Mining the World-Wide Web
  • Growing and changing very rapidly
  • Broad diversity of user communities
  • Only a small portion of the information on the
    Web is truly relevant or useful
  • 99% of the Web information is useless to 99% of
    Web users
  • How can we find high-quality Web pages on a
    specified topic?

5
Challenges in WWW Interactions
  • Finding Relevant Information
  • Creating knowledge from the available information
  • Personalization of the information
  • Learning about customers / individual users

6
Web Mining: A More Challenging Task
  • Searches for
    - Web access patterns
    - Web structures
    - Regularity and dynamics of Web contents
  • Problems
    - The abundance problem
    - Limited coverage of the Web: hidden Web
      sources, majority of data in DBMS
    - Limited query interface based on
      keyword-oriented search
    - Limited customization to individual users

7
Web Mining Subtasks
  • Resource finding: the task of retrieving the
    intended Web documents
  • Information selection and pre-processing:
    automatically selecting and pre-processing
    specific information from the retrieved Web
    resources
  • Generalization: automatic discovery of patterns
    in Web sites
  • Analysis: validation and/or interpretation of
    the mined patterns
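
The four subtasks can be read as a pipeline. The sketch below is only an illustration under loud assumptions: the in-memory `PAGES` dictionary stands in for the Web, the function names are hypothetical, and "patterns" are reduced to terms shared by several documents.

```python
import re
from collections import Counter

# Hypothetical in-memory "Web" (URL -> raw HTML); a real system would crawl or query.
PAGES = {
    "http://example.org/a": "<html><title>Data mining</title><p>Web mining uses data mining.</p></html>",
    "http://example.org/b": "<html><p>Cooking pasta at home.</p></html>",
    "http://example.org/c": "<html><p>Mining Web logs reveals usage patterns.</p></html>",
}

def find_resources(pages, keyword):
    """Resource finding: retrieve the documents that mention the keyword."""
    return {url: html for url, html in pages.items() if keyword.lower() in html.lower()}

def select_and_preprocess(html):
    """Information selection and pre-processing: strip tags, lowercase, tokenize."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.findall(r"[a-z]+", text.lower())

def generalize(token_lists):
    """Generalization: count in how many documents each term occurs."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(set(tokens))
    return counts

def analyze(counts, min_docs):
    """Analysis: keep only the terms supported by at least min_docs documents."""
    return sorted(term for term, n in counts.items() if n >= min_docs)

docs = find_resources(PAGES, "mining")
tokens = [select_and_preprocess(html) for html in docs.values()]
patterns = analyze(generalize(tokens), min_docs=2)
print(patterns)
```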

8
Discussion Question
  • What is the difference between Information
    Retrieval and Information Extraction?

9
IE vs. IR
  • Information Retrieval (IR)
    - Automatic retrieval of relevant documents
    - Primary goals: indexing text and searching for
      useful documents in a collection
    - Treats a document as a bag of unordered words
    - The Web document classification task is an
      instance of IR
  • Information Extraction (IE)
    - Extracts relevant facts from documents
    - Primary goal: transform a collection of
      retrieved documents into information
    - Concerned with the structure or representation
      of a document
    - Works at a finer level of granularity than IR
    - Result: a structured database, or a
      compression or summary of the text or documents
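
To make the contrast concrete, here is a minimal sketch: IR ranks whole documents by bag-of-words overlap, while IE pulls structured fields out of one document. The sample sentences, the `APPOINTMENT` pattern, and the function names are all invented for illustration and are not from the survey.

```python
import re
from collections import Counter

DOCS = {
    "d1": "Acme Corp appointed Jane Doe as CEO on 3 May 1999.",
    "d2": "The weather in Leuven was mild in May.",
}

def bag_of_words(text):
    """IR view: a document is an unordered multiset of words."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(docs, query):
    """IR: return the ids of documents sharing words with the query, best first."""
    q = bag_of_words(query)
    scores = {doc_id: sum((bag_of_words(text) & q).values()) for doc_id, text in docs.items()}
    return [d for d, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s > 0]

# IE view: a hand-written pattern that depends on word order and structure.
APPOINTMENT = re.compile(
    r"(?P<company>[A-Z]\w+(?: [A-Z]\w+)*) appointed "
    r"(?P<person>[A-Z]\w+ [A-Z]\w+) as (?P<role>[A-Z]+)"
)

def extract(text):
    """IE: return a structured record, or None if the pattern does not apply."""
    m = APPOINTMENT.search(text)
    return m.groupdict() if m else None

print(retrieve(DOCS, "CEO appointed"))
print(extract(DOCS["d1"]))
```

The IR function returns documents; the IE function returns a record that could be stored as a database row.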

10
Types of IE
  • IE from unstructured texts (classical)
    - Unstructured: free texts, e.g., news stories
    - Basic to deep linguistic processing
  • IE from semi-structured texts (structural)
    - Semi-structured: HTML
    - Uses meta-information, e.g., HTML tags
    - Wrapper induction
    - Machine learning used to build systems
      (semi-)automatically
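
As an illustration of structural IE, the sketch below is a hand-written wrapper that uses HTML tags as meta-information to pull rows out of a table. It is not wrapper induction (which would learn such rules from labeled pages); the class name `RowWrapper` and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser

class RowWrapper(HTMLParser):
    """Collects the text of each <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

# Invented sample page with a two-column product table.
PAGE = (
    "<table><tr><td>Widget</td><td>$9.99</td></tr>"
    "<tr><td>Gadget</td><td>$24.50</td></tr></table>"
)

wrapper = RowWrapper()
wrapper.feed(PAGE)
products = [{"name": name, "price": float(price.lstrip("$"))} for name, price in wrapper.rows]
print(products)
```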

11
Discussion Question
  • Is Web mining the same as learning from the Web,
    or as machine learning techniques applied to the
    Web?

12
Agent Paradigm
  • Software / Intelligent Agents
  • User Interface Agents
  • Maximize productivity of current user interaction
    by adapting behaviour
  • Distributed Agents
  • Problem Solving by group of agents Relevant
    Agents
  • Mobile Agents

13
Web Mining Taxonomy

14
Web Content Mining
  • Discovery of useful information from web contents
    / data / documents
  • Information Retrieval view (structured and
    semi-structured data)
    - Assist / improve information finding
    - Filter information for users based on user
      profiles
  • Database view
    - Model data on the Web
    - Integrate it to support more sophisticated
      queries

15
A Survey in Web Mining
  • What has been done in Web content mining?
    1. Developing intelligent tools for IR
    - Finding keywords and keyphrases
    - Discovering grammatical rules and collocations
    - Hypertext classification/categorization
    - Extracting keyphrases from text documents
    - Learning extraction models/rules
    - Hierarchical clustering
    - Predicting (words) relationship
    2. Developing Web query systems
    - Many applications, such as WebLog
      (Lakshmanan, et al., 1996)
    3. Mining multimedia data
    - Fayyad, et al. (1996): mining images from
      satellites
    - Smyth, et al. (1996): mining images to
      identify small volcanoes on Venus

16
Multiple Layered Web Architecture
  • Layer n: more generalized descriptions
  • ...
  • Layer 1: generalized descriptions
  • Layer 0

17
Web Structure Mining
  • Finding authoritative Web pages
  • Retrieving pages that are not only relevant, but
    also of high quality, or authoritative on the
    topic
  • Hyperlinks can be used to infer the notion of
    authority
  • The Web consists not only of pages, but also of
    hyperlinks pointing from one page to another
  • These hyperlinks contain an enormous amount of
    latent human annotation
  • When an author creates a hyperlink pointing to
    another Web page, this can be considered the
    author's endorsement of that page
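
The simplest way to act on the "hyperlink as endorsement" idea is to count how many distinct pages link to each page. The sketch below does only that; the link graph and the function name `endorsement_counts` are invented, and real authority measures weight endorsements rather than just counting them.

```python
from collections import Counter

def endorsement_counts(links):
    """links maps a page to the pages it links to; returns in-link counts per page."""
    counts = Counter()
    for source, targets in links.items():
        for target in set(targets):       # a page endorses another page at most once
            if target != source:          # ignore self-links
                counts[target] += 1
    return counts

# Invented link graph.
LINKS = {
    "a.html": ["survey.html", "b.html"],
    "b.html": ["survey.html"],
    "c.html": ["survey.html", "c.html"],
}

counts = endorsement_counts(LINKS)
print(counts.most_common())
```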

18
A Survey in Web Mining
  • What has been done in Web structure mining?
    1. Calculating the quality and relevancy of each
       Web page
    - Web page categorization (Chakrabarti, et al.,
      1998)
    - Discovering micro-communities on the Web
    - Example: Clever system (Chakrabarti, et al.,
      1999)
    - Example: Google (Brin and Page, 1998)
    2. Mining the context of a Web warehouse
       (Madria, et al., 1999)
    - Measuring the completeness of Web sites
    - Measuring the replication of Web documents

19
Web Usage Mining
  • Web usage mining, also known as Web log mining,
    is the process of discovering interesting
    patterns in Web access logs.
  • Commonly used approaches (Borges and Levene,
    1999)
    - Map the log data into relational tables
      before an adapted data mining technique is
      performed.
    - Use the log data directly by utilizing
      special pre-processing techniques.
  • Typical problems
    - Distinguishing among unique users, server
      sessions, episodes, etc. in the presence of
      caching and proxy servers (McCallum, et al.,
      2000; Srivastava, et al., 2000).
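
A minimal sketch of the session problem: group log records by IP address and start a new session after a period of inactivity. The 30-minute timeout and the record layout `(ip, unix_time, url)` are assumptions for the example, not values from the slides, and an IP address does not identify a unique user when proxies or caches are involved.

```python
from collections import defaultdict

SESSION_TIMEOUT = 30 * 60  # seconds; an assumed heuristic, not taken from the slides

def sessionize(records, timeout=SESSION_TIMEOUT):
    """records: iterable of (ip, unix_time, url). Returns a list of sessions (lists of URLs)."""
    by_ip = defaultdict(list)
    for ip, ts, url in records:
        by_ip[ip].append((ts, url))
    sessions = []
    for ip in sorted(by_ip):
        last_ts = None
        for ts, url in sorted(by_ip[ip]):
            # A long gap (or the first request from this IP) opens a new session.
            if last_ts is None or ts - last_ts > timeout:
                sessions.append([])
            sessions[-1].append(url)
            last_ts = ts
    return sessions

# Invented log records.
LOG = [
    ("10.0.0.1", 0, "/index.html"),
    ("10.0.0.2", 10, "/index.html"),
    ("10.0.0.1", 60, "/products.html"),
    ("10.0.0.1", 4000, "/index.html"),
]

sessions = sessionize(LOG)
print(sessions)
```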

20
Mining the World-Wide Web
  • [Figure: Web mining taxonomy. Web Content Mining
    (Web Page Content Mining, Search Result Mining),
    Web Structure Mining, Web Usage Mining (General
    Access Pattern Tracking, Customized Usage
    Tracking)]
  • Web Content Mining: Web Page Content Mining
  • Web page summarization
  • WebLog (Lakshmanan et al., 1996), WebOQL
    (Mendelzon et al., 1998): Web structuring query
    languages that can identify information within
    given Web pages
  • Ahoy! (Etzioni et al., 1997): uses heuristics to
    distinguish personal home pages from other Web
    pages
  • ShopBot (Etzioni et al., 1997): looks for product
    prices within Web pages

21
Mining the World-Wide Web
  • Web Content Mining: Search Result Mining
  • Search engine result summarization
  • Clustering search results (Leouski and Croft,
    1996; Zamir and Etzioni, 1997): categorizes
    documents using phrases in titles and snippets

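
A heavily simplified sketch of grouping search results by phrases shared in their titles and snippets. It uses plain two-word phrases rather than the methods in the cited papers; the result texts and function names are invented for illustration.

```python
import re
from collections import defaultdict

def phrases(text, n=2):
    """Return the set of n-word phrases in a title or snippet."""
    words = re.findall(r"[a-z]+", text.lower())
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def cluster_by_phrase(results, min_size=2):
    """results maps a result id to its title/snippet text; returns phrase -> ids sharing it."""
    groups = defaultdict(set)
    for rid, text in results.items():
        for phrase in phrases(text):
            groups[phrase].add(rid)
    return {p: sorted(ids) for p, ids in groups.items() if len(ids) >= min_size}

# Invented search results for the ambiguous query "jaguar".
RESULTS = {
    1: "Jaguar cars: official dealer",
    2: "Used Jaguar cars for sale",
    3: "Jaguar habitat in the rain forest",
    4: "Rain forest animals",
}

clusters = cluster_by_phrase(RESULTS)
print(clusters)
```
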
22
Mining the World-Wide Web
  • Web Structure Mining
  • Using links
    - PageRank (Brin et al., 1998)
    - CLEVER (Chakrabarti et al., 1998)
    - Use interconnections between Web pages to give
      weight to pages.
  • Using generalization
    - MLDB (1994), VWV (1998)
    - Uses a multi-level database representation of
      the Web. Counters (popularity) and link lists
      are used for capturing structure.

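
A minimal power-iteration sketch of the PageRank idea of weighting pages by their interconnections. The damping factor 0.85, the iteration count, the handling of pages without out-links, and the toy graph are assumptions for the example; this is not the production algorithm of any search engine.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps a page to the pages it links to; returns page -> rank (ranks sum to 1)."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            # A page without out-links spreads its rank evenly over all pages.
            targets = links.get(p) or pages
            share = damping * rank[p] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

# Invented link graph: a, b and d all point to c; c points back to a.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}

rank = pagerank(LINKS)
for page in sorted(rank, key=rank.get, reverse=True):
    print(page, round(rank[page], 3))
```
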
23
Mining the World-Wide Web
  • Web Usage Mining: General Access Pattern Tracking
  • Web Log Mining (Zaïane, Xin and Han, 1998)
  • Uses KDD techniques to understand general access
    patterns and trends.
  • Can shed light on better structure and grouping
    of resource providers.

24
Mining the World-Wide Web
  • Web Usage Mining: Customized Usage Tracking
  • Adaptive Sites (Perkowitz and Etzioni, 1997)
  • Analyzes the access patterns of one user at a
    time.
  • The Web site restructures itself automatically
    by learning from user access patterns.

25
Web Usage Mining
  • Mining Web log records to discover user access
    patterns of Web pages
  • Applications
  • Target potential customers for electronic
    commerce
  • Enhance the quality and delivery of Internet
    information services to the end user
  • Improve Web server system performance
  • Identify potential prime advertisement locations
  • Web logs provide rich information about Web
    dynamics
  • A typical Web log entry includes the URL
    requested, the IP address from which the request
    originated, and a timestamp
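
As an illustration of such an entry, the sketch below parses one line in the Common Log Format and keeps the three fields named above plus the status code. The sample line and the function name `parse_entry` are invented; real logs may use extended formats that this pattern does not cover.

```python
import re
from datetime import datetime

# host ident authuser [date] "request" status bytes
CLF = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def parse_entry(line):
    """Return a dict with ip, url, timestamp and status, or None for a malformed line."""
    m = CLF.match(line)
    if m is None:
        return None
    return {
        "ip": m.group("ip"),
        "url": m.group("url"),
        "timestamp": datetime.strptime(m.group("time"), "%d/%b/%Y:%H:%M:%S %z"),
        "status": int(m.group("status")),
    }

line = '192.0.2.7 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
entry = parse_entry(line)
print(entry)
```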

26
Discussion Question
  • What are the four subtasks of Web mining?
  • 1.
  • 2.
  • 3.
  • 4.

27
Techniques for Web usage mining
  • Construct multidimensional view on the Weblog
    database
  • Perform multidimensional OLAP analysis to find
    the top N users, top N accessed Web pages, most
    frequently accessed time periods, etc.
  • Perform data mining on Weblog records
  • Find association patterns, sequential patterns,
    and trends of Web accessing
  • May need additional information, e.g., user
    browsing sequences of the Web pages in the Web
    server buffer
  • Conduct studies to analyze system performance
    and improve system design through Web caching,
    Web page prefetching, and Web page swapping
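
A small sketch of two of these techniques on already-parsed log records: top-N counts along one dimension, and a crude association pattern (pairs of pages visited by the same user). The `(user, page, hour)` record layout, the sample data, and the function names are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

# Invented records: (user, page, hour of access).
RECORDS = [
    ("u1", "/index.html", 9), ("u1", "/products.html", 9), ("u1", "/cart.html", 10),
    ("u2", "/index.html", 9), ("u2", "/products.html", 14),
    ("u3", "/index.html", 14),
]

def top_n(records, field, n):
    """Top-N values along one dimension (0 = user, 1 = page, 2 = hour)."""
    return Counter(r[field] for r in records).most_common(n)

def frequent_pairs(records, min_support):
    """Page pairs visited together by at least min_support users."""
    pages_by_user = {}
    for user, page, _ in records:
        pages_by_user.setdefault(user, set()).add(page)
    counts = Counter()
    for pages in pages_by_user.values():
        counts.update(combinations(sorted(pages), 2))
    return {pair: c for pair, c in counts.items() if c >= min_support}

print(top_n(RECORDS, 0, 1))   # most active user
print(top_n(RECORDS, 1, 1))   # most accessed page
print(top_n(RECORDS, 2, 1))   # busiest hour
print(frequent_pairs(RECORDS, min_support=2))
```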

28
Mining the World-Wide Web
  • Design of a Web Log Miner
  • Web log is filtered to generate a relational
    database
  • A data cube is generated from the database
  • OLAP is used to drill-down and roll-up in the
    cube
  • OLAM is used for mining interesting knowledge
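
The cube operations can be sketched on a tiny in-memory cube. Here the cube is simply a counter keyed by `(user, page, hour)` cells; that layout, the sample rows, and the function names are assumptions for the example rather than the design of the Web log miner described above.

```python
from collections import Counter

# Invented cleaned log rows: (user, page, hour).
ROWS = [
    ("u1", "/a", 9),
    ("u1", "/a", 9),
    ("u1", "/b", 10),
    ("u2", "/a", 9),
]

def build_cube(rows):
    """Data cube creation: one cell per (user, page, hour), valued by hit count."""
    return Counter(rows)

def roll_up(cube, keep):
    """Roll-up: aggregate away every dimension whose index is not in keep."""
    out = Counter()
    for cell, hits in cube.items():
        out[tuple(cell[i] for i in keep)] += hits
    return out

def slice_cube(cube, dim, value):
    """Slice: keep only the cells where dimension dim equals value."""
    return Counter({cell: hits for cell, hits in cube.items() if cell[dim] == value})

cube = build_cube(ROWS)
hits_per_page = roll_up(cube, keep=(1,))
users_at_nine = roll_up(slice_cube(cube, dim=2, value=9), keep=(0,))
print(hits_per_page)
print(users_at_nine)
```

Drilling down is the reverse of `roll_up`: going back from `hits_per_page` to the finer `(user, page, hour)` cells of `cube`.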

[Figure: Web log → (1) Data Cleaning → Database → (2) Data Cube Creation → Data Cube → (3) OLAP → Sliced and diced cube → (4) Data Mining → Knowledge]

29
Website Usage Analysis (SUA)
  • Why develop a Website usage/utilization
    analysis tool?
    Knowledge about how visitors use a Website could
    - prevent disorientation and help designers
      place important information/functions exactly
      where visitors look for them and in the way
      users need them
    - especially help to build an adaptive Website
      server

30
Website Usage Analysis (SUA)
  • What does SUA do?
    - Discover user navigation patterns in using
      the Website
    - Establish an aggregated log structure as a
      preprocessor to reduce the search space before
      the actual log mining phase
    - Introduce a model for Website usage pattern
      discovery by extending the classical mining
      model, and establish the processing framework
      of this model


31
Website Usage Analysis (SUA)
  • The Website client-server architecture
    facilitates recording user behavior at every
    step by
    - submitting client-side log files to the server
      when users use clear functions or exit
      windows/modules
  • The special design of local and universal
    back/forward/clear functions makes users'
    navigation patterns clearer for the designer by
    - analyzing local back/forward history and
      incorporating it with universal back/forward
      history

32
Website Usage Analysis (SUA)
  • What will be included in SUA?
    1. Identify and collect log data
    2. Transfer the data to the server side and save
       it in a structure suited to analysis
    3. Prepare the data for mining by establishing a
       customized aggregated log tree/frame
    4. Use modifications of typical data mining
       methods, particularly an extension of a
       traditional sequence discovery algorithm, to
       mine user navigation patterns
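
Steps 3 and 4 can be sketched with a prefix tree that aggregates sessions and a walk that reports frequently followed paths. The slides do not specify the aggregated log tree or the sequence algorithm, so the node layout, the function names, and the sample sessions below are all assumptions; only paths starting at the first page of a session are found.

```python
def build_tree(sessions):
    """Aggregate sessions into a prefix tree; each node is {'count': int, 'children': dict}."""
    root = {"count": 0, "children": {}}
    for session in sessions:
        node = root
        node["count"] += 1
        for page in session:
            node = node["children"].setdefault(page, {"count": 0, "children": {}})
            node["count"] += 1
    return root

def frequent_paths(node, min_count, prefix=()):
    """Return {path: count} for every path prefix followed by at least min_count sessions."""
    found = {}
    for page, child in node["children"].items():
        if child["count"] >= min_count:
            path = prefix + (page,)
            found[path] = child["count"]
            found.update(frequent_paths(child, min_count, path))
    return found

# Invented navigation sessions.
SESSIONS = [
    ["home", "search", "item"],
    ["home", "search", "cart"],
    ["home", "help"],
]

tree = build_tree(SESSIONS)
patterns = frequent_paths(tree, min_count=2)
print(patterns)
```

Because infrequent branches are pruned as soon as they are met, the tree acts as the search-space reduction that the aggregated structure is meant to provide.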

33
Website Usage Analysis (SUA)
  • Problems that need to be considered
    - How to identify the log data when a user goes
      through an uninteresting function/module
    - What marks the end of a user session?
    - Clients connecting to the Website through
      proxy servers
  • Differences between Website usage analysis and
    common Web usage mining
    - Client-side log files are available
    - Log file format (Web log files follow the
      Common Log Format specified as a part of the
      HTTP protocol)
    - Log file cleaning/filtering (usually performed
      in the preprocessing step of Web log mining)
      is not necessary

34
WebSift Project

35
Reference
  • Cooley, R., Mobasher, B., and Srivastava, J. Web
    Mining: Information and Pattern Discovery on the
    World Wide Web. IEEE Computer, pages 558-566,
    1997.
  • Etzioni, O. The World Wide Web: Quagmire or gold
    mine? Communications of the ACM, 39(11):65-68,
    1996.
  • Fayyad, U., Djorgovski, S., and Weir, N.
    Automating the analysis and cataloging of sky
    surveys. In Advances in Knowledge Discovery and
    Data Mining, pages 471-493. AAAI Press, 1996.
  • Kosala, R. and Blockeel, H. Web Mining Research:
    A Survey. SIGKDD Explorations, 2(1):1-15, 2000.

36
Reference
  • Langley, P. User modeling in adaptive
    interfaces. In Proceedings of the Seventh
    International Conference on User Modeling, pages
    357-370, 1999.
  • Madria, S.K., Bhowmick, S.S., Ng, W.K., and Lim,
    E.-P. Research issues in Web data mining. In
    Proceedings of Data Warehousing and Knowledge
    Discovery, First International Conference, DaWaK
    '99, pages 303-312, 1999.
  • Masand, B. and Spiliopoulou, M. WEBKDD-99:
    Workshop on Web usage analysis and user
    profiling. SIGKDD Explorations, 1(2), 2000.
  • Mobasher, B., Jain, N., Han, E.H., and
    Srivastava, J. Web mining: Pattern discovery from
    World Wide Web transactions. Technical Report TR
    96-060, University of Minnesota, Dept. of
    Computer Science, Minneapolis, 1996.

37
Reference
  • Smyth, P., Fayyad, U.M., Burl, M.C., and Perona,
    P. Modeling subjective uncertainty in image
    annotation. In Advances in Knowledge Discovery
    and Data Mining, pages 517-539, 1996.
  • Spiliopoulou, M. Data mining for the Web. In
    Principles of Data Mining and Knowledge
    Discovery, Second European Symposium, PKDD '99,
    pages 588-589, 1999.
  • Srivastava, J., Cooley, R., Deshpande, M., and
    Tan, P.-N. Web usage mining: Discovery and
    applications of usage patterns from Web data.
    SIGKDD Explorations, 1(2), 2000.
  • Zaïane, O.R., Xin, M., and Han, J. Discovering
    Web access patterns and trends by applying OLAP
    and data mining technology on Web logs. IEEE,
    pages 19-29, 1998.