Web Mining - PowerPoint PPT Presentation

Provided by: ahmed8

Title: Web Mining


1
Web Mining
  • Ahmed M. Zeki

2
Introduction
  • The World Wide Web is a rich source of knowledge
    that can be useful to many applications.
  • Source?
  • Billions of web pages and billions of visitors
    and contributors.
  • What knowledge?
  • e.g., the hyperlink structure and diversity of
    languages.
  • Purpose?
  • To improve users' efficiency and effectiveness in
    searching for information on the web.
  • Decision-making support or business management.

3
Introduction
  • The Web's characteristics:
  • Large size
  • Unstructured
  • Different data types: text, images, hyperlinks, and
    user usage information
  • Dynamic content
  • Time dimension
  • Multilingual
  • Hence data mining (DM) is a significant subfield
    of this area.
  • The various activities and efforts in this area
    are referred to as Web Mining.

4
Introduction
5
Introduction
  • Information extraction: techniques designed to
    identify useful information from text documents
    automatically.
  • Named-entity extraction: the automatic
    identification, from text documents, of the names
    of entities of interest.
  • Machine learning-based entity extraction systems
    rely on algorithms rather than human-created
    rules to extract knowledge or identify patterns
    from texts.
  • Neural networks
  • Decision tree
  • Hidden Markov Model
  • Entropy maximization

6
Introduction
  • Relevance feedback helps users conduct searches
    iteratively and reformulate search queries based
    on evaluation of previously retrieved documents.
  • Using relevance feedback, a model can learn the
    common characteristics of a set of relevant
    documents in order to estimate the probability of
    relevance for the remaining documents.
  • Various machine learning algorithms, such as
    genetic algorithms, have been used in relevance
    feedback applications.

7
Introduction
  • Information filtering techniques try to learn
    about users' interests based on their evaluations
    and actions, and then to use this information to
    analyze new documents.
  • Many personalization and collaborative systems
    have been implemented as software agents to help
    users in information systems.

8
Introduction
  • Text classification: the classification of textual
    documents into predefined categories (supervised
    learning).
  • E.g., Support Vector Machine (SVM), a statistical
    method that tries to find a hyperplane that best
    separates two classes.
  • Text clustering: groups documents into categories
    that are not predefined but are defined
    dynamically based on their similarities
    (unsupervised learning).
  • E.g., Kohonen's Self-Organizing Map (SOM), a type of
    neural network that produces a 2-dimensional grid
    representation for n-dimensional features, has
    been widely applied in IR.
  • Machine learning is the basis of most text
    classification and clustering applications.

9
Introduction
  • Web spiders: software programs that traverse the
    WWW by following hypertext links and retrieving
    Web documents via the HTTP protocol.
  • To build the databases of search engines
  • To perform personal search
  • To archive Web sites or even the whole Web
  • To collect Web statistics
  • Intelligent Web spiders: spiders that use more
    advanced algorithms during the search process.
  • E.g., the Itsy Bitsy Spider searches the Web
    using a best-first search and a genetic algorithm
    approach.
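
A minimal sketch of the traversal a spider performs, under stated assumptions: the PAGES dictionary is a hypothetical in-memory stand-in for the Web (no HTTP request is made), and the crawl order is plain breadth-first, not the best-first or genetic search used by the Itsy Bitsy Spider.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical stand-in for the Web: URL -> HTML body (no network access).
PAGES = {
    "/a": '<a href="/b">B</a> <a href="/c">C</a>',
    "/b": '<a href="/c">C</a>',
    "/c": '<a href="/a">A</a> <a href="/d">D</a>',
    "/d": "no links here",
}


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch=PAGES.get, limit=100):
    """Breadth-first crawl from seed; returns URLs in visiting order."""
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(visited) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # unknown page: skip it
            continue
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                queue.append(link)
    return visited
```

A real spider would replace the fetch argument with an HTTP client and swap the queue for a priority queue to get best-first behavior.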

10
Introduction
  • In order to extract non-English knowledge from
    the web, Web Mining systems have to deal with
    issues in language-specific text processing.
  • The base algorithms behind most machine learning
    systems are language-independent. Most
    algorithms, e.g., text classification and
    clustering, need only a set of features
    (a vector of keywords) for the learning process.
  • However, the algorithms usually depend on some
    phrase segmentation and extraction programs to
    generate a set of features or keywords to
    represent Web documents.
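
The feature-generation step can be sketched as follows. This is an illustration under assumptions: the regular-expression tokenizer only works for English-like text, which is exactly the language-specific part; the function names are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-letters; adequate for English only."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs):
    """Sorted list of every distinct keyword in the collection."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})


def to_vector(doc, vocabulary):
    """Term-frequency vector of doc over the shared vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocabulary]


docs = ["Web mining mines the Web", "Text mining"]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

Once documents are vectors over a shared vocabulary, the learning algorithm no longer needs to know which language they came from; only tokenize would have to change for a language without whitespace word boundaries.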

11
Introduction
  • Web Visualization tools have been used to help
    users maintain a "big picture" of the retrieval
    results from search engines, web sites, a subset
    of the Web, or even the whole Web.
  • The best-known example of using the tree
    metaphor for Web browsing is the hyperbolic
    tree developed by Xerox PARC.

12
Introduction
  • Semantic Web technology tries to add metadata to
    describe data and information on the Web, based
    on standards such as XML.
  • Machine learning can play three roles in the
    Semantic Web:
  • It can be used to automatically create the markup
    or metadata for existing unstructured textual
    documents on the Web.
  • It can be used to create, merge, update, and
    maintain ontologies.
  • It can understand and perform reasoning on the
    metadata provided by the Semantic Web in order to
    extract knowledge from the Web more effectively.

13
Web Mining
  • Web mining is the application of data mining
    techniques to discover patterns from the Web.
  • The term was coined by Etzioni (1996).
  • How is Web mining different from classical DM?
  • The web is not a relation
  • Textual information and linkage structure
  • Usage data is huge and growing rapidly
  • Google's usage logs are bigger than their web
    crawl
  • Data generated per day is comparable to largest
    conventional data warehouses
  • Ability to react in real-time to usage patterns
  • No human in the loop

14
Benefits of Web Data Mining
  • Match your available resources to visitor
    interests
  • Increase the value of each visitor
  • Improve the visitor's experience at the website
  • Perform targeted resource management
  • Collect information in new ways
  • Test the relevance of content and web site
    architecture

15
Web Mining
  • According to analysis targets, web mining can be
    divided into three different types
  • Web usage mining
  • Web content mining
  • Web structure mining

16
1. Web Usage Mining
  • The application of data mining to analyze and
    discover interesting patterns in users' usage
    data on the web.
  • Usage data records users' behavior as they browse
    or make transactions on a web site. Analyzing it
    helps to better understand and serve the needs of
    users or Web-based applications.
  • It is an activity that involves the automatic
    discovery of patterns from one or more Web
    servers.

17
1. Web Usage Mining
  • Organizations often generate and collect large
    volumes of data. Most of this information is
    generated automatically by Web servers and
    collected in server logs. Analyzing such data
    can help these organizations to determine:
  • the value of particular customers
  • cross marketing strategies across products
  • the effectiveness of promotional campaigns, etc.

18
1. Web Usage Mining
  • The first web analysis tools simply provided
    mechanisms to report user activity as recorded in
    the servers. Using such tools, it was possible to
    determine such information as
  • the number of accesses to the server
  • the times or time intervals of visits
  • the domain names and the URLs of users of the Web
    server.
  • These tools provide little or no analysis of data
    relationships among the accessed files and
    directories within the Web space.
  • More sophisticated techniques for the discovery
    and analysis of patterns are now emerging. These
    tools fall into two main categories:
  • Pattern Discovery Tools
  • Pattern Analysis Tools

19
1. Web Usage Mining
  • Web servers, Web proxies, and client applications
    can quite easily capture Web Usage data.
  • Web server log: records every visit to the pages,
    which files were requested and when, the IP
    address of the request, the error code, the
    number of bytes sent to the user, and the type of
    browser used.
  • By analyzing Web usage data, web mining systems
    can discover useful knowledge about a system's
    usage characteristics and its users' interests.
    This knowledge has various applications:
  • Personalization and Collaboration in Web-based
    systems
  • Marketing
  • Web site design and evaluation
  • Decision support

20
1. Web Usage Mining
  • Web usage mining has been used for various
    purposes
  • A knowledge discovery process for mining
    marketing intelligence information from Web data.
  • Web traffic patterns also can be extracted from
    Web usage logs in order to improve the
    performance of a Web site.
  • Search engine transaction logs also provide
    valuable knowledge about user behavior on Web
    searching.
  • Such information is very useful for a better
    understanding of users' Web searching and
    information seeking behavior and can improve the
    design of Web search systems.

21
1. Web Usage Mining
  • One of the major goals of Web usage mining is to
    reveal interesting trends and patterns which can
    often provide important knowledge about the users
    of a system.
  • The framework for Web usage mining:
  • Preprocessing: data cleansing
  • Pattern discovery
  • Pattern analysis

Generic machine learning and Data mining
techniques, such as association rule mining,
classification, and clustering, often can be
applied.
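
One common preprocessing task before pattern discovery is grouping raw requests into sessions. The sketch below is illustrative only: identifying a visitor by IP address alone and the 30-minute inactivity timeout are assumptions (a widely used heuristic, not something stated on the slide), and timestamps are plain integers in seconds.

```python
from collections import defaultdict

TIMEOUT = 30 * 60  # assumed inactivity threshold, in seconds


def sessionize(requests, timeout=TIMEOUT):
    """Group (ip, timestamp, url) tuples into per-visitor sessions.

    A new session starts when the same IP has been idle for longer
    than timeout. Returns a list of URL lists, ordered by IP and time.
    """
    by_ip = defaultdict(list)
    for ip, ts, url in requests:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip in sorted(by_ip):
        current, last_ts = [], None
        for ts, url in sorted(by_ip[ip]):
            if last_ts is not None and ts - last_ts > timeout:
                sessions.append(current)  # idle gap: close the session
                current = []
            current.append(url)
            last_ts = ts
        sessions.append(current)
    return sessions


# Hypothetical cleaned log entries: (ip, seconds since start, url)
log = [
    ("10.0.0.1", 0, "/"),
    ("10.0.0.1", 60, "/jobs/"),
    ("10.0.0.2", 90, "/"),
    ("10.0.0.1", 4000, "/news/"),
]
```

The resulting sessions can be fed to association rule mining or clustering in the same way as market-basket transactions.
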
22
1. Web Usage Mining
  • Many Web applications aim to provide personalized
    information and services to users. Web usage data
    provide an excellent way to learn about users'
    interests.
  • Web usage mining on Web logs can help identify
    users who have accessed similar Web pages. The
    patterns that emerge can be very useful in
    collaborative Web searching and filtering.
  • Amazon.com uses collaborative filtering to
    recommend books to potential customers based on
    the preferences of other customers having similar
    interests or purchasing histories.
  • Huang et al. (2002) used a Hopfield net to model
    user interests and product profiles in an online
    bookstore in Taiwan.
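
A minimal sketch of the neighbor-based idea behind collaborative filtering. It is not Amazon's actual method nor the Hopfield net model: it simply uses Jaccard similarity between the sets of pages each user accessed, and the user names and histories are hypothetical.

```python
def jaccard(a, b):
    """Overlap of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(target, histories):
    """Suggest items seen by the most similar other user but not by target."""
    mine = histories[target]
    others = [user for user in histories if user != target]
    if not others:
        return []
    nearest = max(others, key=lambda user: jaccard(mine, histories[user]))
    return sorted(histories[nearest] - mine)


# Hypothetical access histories mined from a Web log.
histories = {
    "ann": {"/jobs/", "/news/", "/courses/"},
    "bob": {"/jobs/", "/news/", "/software/"},
    "cat": {"/meetings/"},
}
```

Here recommend("ann", histories) picks bob as the nearest neighbor (two shared pages out of four distinct ones) and suggests the one page bob visited that ann has not.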

23
Web Server Log
[Figure: a user, the KDnuggets.com server, and a request for
http://www.kdnuggets.com/jobs/]
24
Web Server Log: A Sample
  • 152.152.98.11
  • - -
  • 16/Nov/2005:16:32:50 -0500
  • "GET /jobs/ HTTP/1.1"
  • 200
  • 15140
  • "http://www.google.com/search?q=salary+for+data+mining&hl=en&lr=&start=10&sa=N"
  • "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT
    5.1; SV1; .NET CLR 1.1.4322)"

25
Web log fields
  • IP
  • 152.152.98.11
  • IP address - can be converted to host name, such
    as xyz.example.com
  • Name
  • The name of the remote user (usually omitted and
    replaced by a dash -)
  • Login
  • Login of the remote user (also usually omitted
    and replaced by a dash -)
  • Date/Time/TZ
  • 16/Nov/2005:16:32:50 -0500
  • Request, Status code, Object size, Referrer, User
    agent
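
The fields above can be pulled out of a raw log line with a regular expression. This is a sketch assuming the server writes the Apache-style combined log format; the SAMPLE line follows the slide's example with punctuation restored, a shortened user agent, and a placeholder referrer.

```python
import re

# One named group per field of the combined log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<name>\S+) (?P<login>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A dash in the size field means no body was sent.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


SAMPLE = (
    '152.152.98.11 - - [16/Nov/2005:16:32:50 -0500] '
    '"GET /jobs/ HTTP/1.1" 200 15140 '
    '"http://www.example.com/search?q=data+mining" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"'
)
```

Lines that do not match (for example, a server using a custom format) yield None instead of raising, so a cleansing pass can count and discard them.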

26
Web Usage Mining - Basic
  • Totals for each component
  • Hits: total number of requests
  • Files: number of GETs
  • Pages: number of HTML pages
  • Sites: unique IP addresses
  • Response codes
  • Kbytes: total Kbytes transferred
  • User Agents
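
A sketch of how such totals could be computed from parsed log records. The record layout (dicts with ip, url, status, size) is hypothetical, and two rules are assumptions for illustration: a "file" is a request answered with status 200, and a "page" is a file whose URL ends in "/", ".html", or ".htm". Real analyzers such as webalizer use configurable rules.

```python
def summarize(records):
    """Webalizer-style totals from dicts with ip, url, status, size."""
    files = [r for r in records if r["status"] == 200]
    pages = [r for r in files if r["url"].endswith(("/", ".html", ".htm"))]
    return {
        "hits": len(records),                      # every request
        "files": len(files),                       # successful requests
        "pages": len(pages),                       # successful HTML requests
        "sites": len({r["ip"] for r in records}),  # distinct IP addresses
        "kbytes": sum(r["size"] for r in records) / 1024,
    }


# Hypothetical parsed log records.
records = [
    {"ip": "10.0.0.1", "url": "/", "status": 200, "size": 2048},
    {"ip": "10.0.0.1", "url": "/logo.gif", "status": 200, "size": 1024},
    {"ip": "10.0.0.2", "url": "/jobs/index.html", "status": 200, "size": 4096},
    {"ip": "10.0.0.2", "url": "/missing.html", "status": 404, "size": 0},
]
```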

27
Web Log Analysis Programs
  • Free
  • Analog, awstats, webalizer
  • Google analytics
  • Commercial
  • WebTrends, WebSideStory,

28
Example: KDnuggets.com Nov 2005 totals
  • Monthly Statistics (from webalizer)

Q: What is the difference between Hits and Files?
A: The difference between Hits and Files is the
number of requests with a status code other than
200.
29
Example: KDnuggets.com Nov 2005 totals
  • Q: What does the difference between Files and
    Pages mean?
  • A: The difference between Files and Pages is the
    number of non-HTML files (e.g., images,
    JavaScript, etc.).
  • In the November 2005 KDnuggets log, HTML files
    were about 1/3 of all requests.
  • However, this data does not separate bot requests
    (which are heavily weighted towards HTML pages)

30
2. Web Content Mining
  • The process of discovering useful information
    from the content of web pages.
  • Web content may consist of:
  • Text
  • Image
  • Audio
  • Video
  • Web content mining sometimes is called web text
    mining, because the text content is the most
    widely researched area.
  • The technologies that are normally used in web
    content mining are
  • Natural Language Processing (NLP)
  • Information Retrieval (IR)

31
Text Mining
  • The process of deriving high quality information
    from text.
  • Text mining is an interdisciplinary field which
    draws on information retrieval, data mining,
    machine learning, statistics, and computational
    linguistics. As most information (over 80%) is
    currently stored as text, text mining is believed
    to have high potential commercial value.
  • High quality information is typically derived
    through the discovery of patterns and trends by
    means such as statistical pattern learning.

32
Text Mining
  • Text mining usually involves the process of
  • Structuring the input text by
  • Parsing
  • Addition of some derived linguistic features and
    the removal of others
  • Subsequent insertion into a database
  • Deriving patterns within the structured data
  • Evaluation and interpretation of the output.
  • 'High quality' in text mining refers to some
    combination of
  • Relevance
  • Novelty
  • Interestingness

33
Text Mining
  • Typical text mining tasks include
  • Text categorization
  • Text clustering
  • Concept/entity extraction
  • Sentiment analysis
  • Document summarization
  • Entity relation modeling (i.e., learning
    relations between named entities).

34
3. Web Structure Mining
  • The process of using graph theory to analyze
    the node and connection structure of a web site.
    Web structure mining can be divided into two
    kinds:
  • Extracting patterns from hyperlinks in the web.
    A hyperlink is a structural component that
    connects the web page to a different location.
  • Mining the document structure. This uses the
    tree-like structure to analyze and describe the
    HTML or XML tags within the web page.

35
3. Web Structure Mining
  • Web structure mining has been largely influenced
    by research in
  • Social network analysis
  • Citation analysis (bibliometrics).
  • In-links: the hyperlinks pointing to a page.
  • Out-links: the hyperlinks found in a page.
  • Usually, the larger the number of in-links, the
    better a page is considered to be.
  • By analyzing the pages containing a URL, we can
    also obtain:
  • Anchor text: shows how other Web page authors
    annotate a page and can be useful in predicting
    the content of the target page.

36
3. Web Structure Mining
  • PageRank is computed by weighting each in-link
    to a page proportionally to the quality of the
    page containing the in-link.
  • The qualities of these referring pages also are
    determined by PageRank. Thus, the PageRank of a
    page p is calculated recursively as follows:
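
The formula itself is not preserved in this transcript. A commonly cited formulation, given here as an assumption and not as the slide's exact wording, is PR(p) = (1 - d) + d * (sum over all pages q linking to p of PR(q) / c(q)), where d is a damping factor between 0 and 1 and c(q) is the number of out-links of q. The sketch below evaluates that recursion by simple iteration on a hypothetical three-page graph; d = 0.85 and the iteration count are illustrative choices.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iteratively evaluate PR(p) = (1 - d) + d * sum(PR(q) / c(q)).

    links maps each page to the list of pages it links to; every
    page is assumed to have at least one out-link.
    """
    ranks = {page: 1.0 for page in links}
    for _ in range(iterations):
        new_ranks = {}
        for page in links:
            # Each referring page q passes on an equal share of its rank.
            incoming = sum(
                ranks[q] / len(out)
                for q, out in links.items()
                if page in out
            )
            new_ranks[page] = (1 - d) + d * incoming
        ranks = new_ranks
    return ranks


# Hypothetical graph: A links to B and C, B links to C, C links to A.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
```

In this graph C collects in-links from both A and B and ends up ranked highest, while B, whose only in-link is half of A's vote, ranks lowest.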

37
Ads vs. search results
  • Search advertising is the revenue model
  • Multi-billion-dollar industry
  • Advertisers pay for clicks on their ads
  • Interesting problems
  • How to pick the top 10 results for a search from
    2,230,000 matching pages?
  • What ads to show for a search?
  • If I'm an advertiser, which search terms should I
    bid on, and how much should I bid?

38
Web Mining vs. Information Access
  • Text data mining involves extracting nuggets
    and/or overall patterns from a collection of
    textual information, independent of users'
    information needs.
  • Information access is the process of helping
    users find, create, use, re-use, and understand
    information to satisfy an information need.
  • In other words, data mining is opportunistic,
    whereas information access is goal-driven.

39
Search Engine Components
  • Spider (crawler/robot) builds corpus
  • Collects web pages recursively
  • For each known URL, fetch the page, parse it, and
    extract new URLs
  • Repeat
  • Additional pages from direct submissions and
    other sources
  • The indexer creates inverted indexes
  • Various policies regarding which words are
    indexed, capitalization, support for Unicode,
    stemming, support for phrases, etc.
  • Query processor serves query results
  • Front end: query reformulation, word stemming,
    capitalization, optimization of Booleans, etc.
  • Back end: finds matching documents and ranks
    them
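
A minimal sketch of the indexer and the back end's matching step. The policies are assumptions chosen for brevity: words are lowercased, there is no stemming or phrase support, a query is a Boolean AND of its words, and ranking is omitted. DOCS is a hypothetical corpus.

```python
import re
from collections import defaultdict


def build_index(docs):
    """Map each lowercased word to the sorted IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def search(index, query):
    """Boolean AND: IDs of documents that contain every query word."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical corpus collected by the spider.
DOCS = {
    1: "Web mining and text mining",
    2: "Web usage logs",
    3: "Text clustering",
}
index = build_index(DOCS)
```

The index is called inverted because it maps words to documents instead of documents to words, so a query only touches the posting lists of its own terms.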

40
Application Areas of Web Mining
  • E-commerce
  • Search Engines
  • Personalization
  • Website Design

41
Application Areas of Web Mining
  • E-tailers
  • The ability to find new cross-sell opportunities,
    enable comprehensive prospect profiling, and
    improve customer satisfaction.
  • B2B and B2C Ventures

42
Application Areas of Web Mining
  • Advertising-Based Sites
  • When revenue is advertising-based, blindly
    serving ads to visitors will not result in a
    high click-through rate. Instead, ads must be
    intelligently targeted to the user, providing the
    visitor with products and services that they are
    interested in.
  • Entertainment sites
  • Media Portals
  • Advertising Providers

43
Application Areas of Web Mining
  • Information Repositories
  • Information overload is a problem that grows
    larger every day. Indexing, summarization, and
    other metadata tasks are time-consuming. Semantic
    text analyzers are capable of automating these
    tasks and of creating user navigation systems on
    the fly.
  • Libraries
  • Technical Support Sites
  • Media Sites
  • Content Providers

44
Application Areas of Web Mining
  • Security applications
  • One of the largest text mining applications that
    exists is probably the classified ECHELON
    surveillance system.
  • Software and Applications
  • Research and development departments of major
    companies, including IBM and Microsoft, are
    researching text mining techniques and developing
    programs to further automate the mining and
    analysis processes.

45
Application Areas of Web Mining
  • Academic applications
  • The issue of text mining is of importance to
    publishers who hold large databases of
    information requiring indexing for retrieval.

46
Conclusion
  • Major limitations of Web mining research
  • Lack of suitable test collections that can be
    reused by researchers.
  • Difficult to collect Web usage data across
    different Web sites.
  • Future research directions
  • Multimedia data mining: a picture is worth a
    thousand words.
  • Multilingual knowledge extraction: Web page
    translations.
  • Wireless Web: WML and HDML.
  • The Hidden Web: forms, dynamically generated Web
    pages.
  • Semantic Web

This presentation is reproduced from the
articles attached