Data Mining for Web Intelligence - PowerPoint PPT Presentation

1
Data Mining for Web Intelligence
  • Presentation by
  • Julia Erdman

2
Data Mining the Web
  • Searching, comprehending, and using the
    semi-structured data on the web poses a
    significantly greater challenge than data mining
    in a commercial database system
  • The data from the web is more sophisticated and
    dynamic
  • Data mining helps search engines find high-quality
    web pages

3
Why Data Mining?
  • Challenges of data mining the web
  • Web page complexity far exceeds the complexity of
    any traditional text document collection
  • The Web constitutes a highly dynamic information
    source
  • The Web serves a broad spectrum of user
    communities
  • Only a small portion of the Web's pages contain
    truly relevant or useful information

4
Why Data Mining?
  • Approaches to accessing information on the web
  • Keyword-based search or topic-directory browsing
  • e.g. Google, Yahoo
  • Querying deep Web sources
  • e.g. Amazon.com, Realtor.com
  • Random surfing

5
Design Challenges
  • Traditional schemes for accessing data on the web
    are based on text-oriented, keyword-based web
    pages
  • The current access schemes must be replaced with
    more sophisticated schemes in order to exploit
    the Web completely

6
Access Limitations
  • Lack of high-quality keyword-based searches
  • A search can return many answers
  • e.g. searching popular categories, like sports or
    politics
  • Overloading keyword semantics can return many
    low-quality answers
  • e.g. a search for jaguar could be for an animal,
    a car, or a sports team
  • A search can miss many highly related pages that
    do not contain the posed keywords
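
The two keyword problems above can be reproduced with a few lines of code. This is a minimal sketch of naive exact-word matching, not how any particular search engine works; the function name `keyword_search` and the sample documents are made up for illustration.

```python
def keyword_search(query, documents):
    """Return ids of documents that contain every query word (exact match)."""
    words = query.lower().split()
    return [
        doc_id
        for doc_id, text in documents.items()
        if all(w in text.lower().split() for w in words)
    ]

# Hypothetical mini-collection, invented for this example.
documents = {
    "zoo": "the jaguar is a large cat native to the americas",
    "dealer": "test drive the new jaguar sedan today",
    "team": "the jaguars won the football game",      # plural form only
    "wildlife": "big cats of the amazon rainforest",  # related, keyword absent
}

results = keyword_search("jaguar", documents)
print(results)
```

The result mixes the animal and the car, because the keyword alone carries no sense information, and it misses both the page that uses a different word form and the related page that never mentions the keyword.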

7
Access Limitations
  • Lack of effective deep-Web access
  • There are at least 100,000 searchable databases
    on the Web with high-quality, well-maintained
    information, but they are not effectively accessible
  • There is an extremely large collection of
    autonomous and heterogeneous databases, each
    supporting specific query interfaces with
    different schema and query constraints

8
Access Limitations
  • Lack of automatically constructed directories
  • A topic or type-oriented Web information
    directory creates an organized picture of a web
    sector
  • Developers must organize these directories
    manually
  • Costly
  • Provides only limited coverage
  • Not easily scalable or adaptable

9
Access Limitations
  • Lack of semantics-based query primitives
  • Most keyword-based searches only allow a small
    set of search options

10
Access Limitations
  • Lack of feedback on human activities
  • Web links may not be updated frequently,
    regularly, or at all
  • Changes in access frequency do not automatically
    adjust search results

11
Access Limitations
  • Lack of multidimensional analysis and data mining
    support
  • Cannot drill deeply into sites in order to find
    the data we are looking for

12
Mining Web search-engine data
  • Current keyword-based search engines have several
    deficiencies
  • A widely covered topic can contain hundreds of
    thousands of documents
  • Highly relevant documents may not contain the
    keywords used in the search

13
Analyzing the Web's link structure
  • When one web page contains a link to another,
    this can be considered an endorsement of the
    linked page
  • Collected endorsements of the same page from many
    different web authors identify it as an
    authoritative web page
  • A hub is a single web page that contains a
    collection of links to authoritative web pages
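
The mutual relationship between hubs and authorities can be sketched as a small iterative computation. The code below follows the general style of the HITS algorithm and is only an illustration under stated assumptions; the function name `hub_authority_scores`, the page names, and the link graph are all hypothetical.

```python
import math

def hub_authority_scores(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that link to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph, invented for this example.
links = {
    "directory": ["site_a", "site_b", "site_c"],
    "blog": ["site_a", "site_b"],
    "news": ["site_a"],
}
hub, auth = hub_authority_scores(links)
best_hub = max(hub, key=hub.get)
best_auth = max(auth, key=auth.get)
print(best_hub, best_auth)
```

In this toy graph the page endorsed by the most authors ends up with the highest authority score, and the page that links to the most authorities ends up as the strongest hub.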

14
Classifying Web documents automatically
  • Generally, human readers classify Web documents,
    but an automatic classification is highly
    desirable
  • Hyperlinks contain high-quality semantic clues to
    a page's topic, which can help achieve accurate
    classifications
  • However, links to unrelated sites can cloud the
    classification
  • e.g. many sites have a link to weather.com, but
    generally are not weather sites
  • Automatic classification can determine which
    class a web page belongs to, but not which
    classes it does not belong to
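
One way to picture link-based classification, and the noise that links such as weather.com introduce, is a simple vote over the known topics of linked pages. This is a rough sketch under assumptions, not the method described in the source paper: the function `classify_by_links`, the `max_inlink_share` cutoff, and all site names other than weather.com are hypothetical. The cutoff assumes that a target linked from most pages says little about any one page's topic.

```python
from collections import Counter

def classify_by_links(page, links, labels, max_inlink_share=0.5):
    """Vote on a page's topic using the known topics of the pages it links to.

    Targets linked from more than max_inlink_share of all pages are skipped
    (an assumption made for this sketch).
    """
    inlinks = Counter(t for targets in links.values() for t in set(targets))
    votes = Counter()
    for target in links[page]:
        if target not in labels:
            continue  # no known topic for this target
        if inlinks[target] / len(links) > max_inlink_share:
            continue  # too widely linked to be a useful clue
        votes[labels[target]] += 1
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical link data and topic labels, invented for this example.
links = {
    "golf_club": ["weather.com", "pga.example", "golf_tips.example"],
    "news_site": ["weather.com", "politics.example"],
    "travel_blog": ["weather.com", "hotels.example"],
    "fan_page": ["pga.example"],
}
labels = {
    "weather.com": "weather",
    "pga.example": "sports",
    "golf_tips.example": "sports",
    "politics.example": "politics",
}
print(classify_by_links("golf_club", links, labels))
print(classify_by_links("news_site", links, labels))
```

A page with no usable labeled links gets `None` rather than a guess, which keeps the sketch from claiming more than the link evidence supports.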

15
Mining Web page semantic structures and page
contents
  • Fully automatic extraction of Web page structures
    and semantic contents can be difficult due to the
    limitations of automated natural-language
    parsing
  • Semiautomatic methods can recognize a portion of
    such structures
  • Then further analysis can see how the contents
    fit into these structures

16
Mining Web page semantic structures and page
contents
  • To identify the structures to extract, either an
    expert manually specifies the structures, or
    techniques must be developed to automatically
    produce the structures
  • Or developers can use Web page classes for
    automatic extraction
  • Semantic page structure and content recognition
    will provide for more in-depth analysis of Web
    pages

17
Mining Web dynamics
  • Contents, structures, and access patterns change
    on the Web
  • Storing historical data about Web pages assists
    in finding changes in content and links
  • But due to the phenomenal breadth of the Web, it is
    impossible to store images and updates
  • Mining web log records can provide quality
    results
  • This data needs to be analyzed and transformed
    into useful, significant information
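
As a small illustration of turning raw web log records into something usable, the sketch below counts successful page requests per path. It assumes the server writes records in the Common Log Format, which the slides do not specify; the pattern `LOG_PATTERN`, the function `page_hits`, and the sample records are made up for this example.

```python
import re
from collections import Counter

# Assumed record layout: host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+'
)

def page_hits(log_lines):
    """Count successful (status 200) GET requests per requested path."""
    hits = Counter()
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("method") == "GET" and m.group("status") == "200":
            hits[m.group("path")] += 1
    return hits

# Hypothetical log records, invented for this example.
sample_log = [
    '10.0.0.1 - - [10/Nov/2002:10:00:00 +0000] "GET /index.html HTTP/1.0" 200 1043',
    '10.0.0.2 - - [10/Nov/2002:10:00:05 +0000] "GET /products.html HTTP/1.0" 200 2310',
    '10.0.0.1 - - [10/Nov/2002:10:00:09 +0000] "GET /products.html HTTP/1.0" 200 2310',
    '10.0.0.3 - - [10/Nov/2002:10:00:12 +0000] "GET /missing.html HTTP/1.0" 404 210',
]

hits = page_hits(sample_log)
print(hits.most_common(1))
```

Per-page counts are only a first step; the same parsed fields could feed session or path analysis, which is closer to the access-pattern mining the slide refers to.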

18
Building a multilayered, multidimensional Web
  • Systematically analyze a set of Web pages
  • Group closely related local Web pages or an
    individual page into a cluster, called a semantic
    page
  • The analysis provides a descriptor for the
    cluster
  • Then create a semantics-based, evolving,
    multidimensional, multilayered Web information
    directory
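
The grouping and descriptor steps above might look like the following in miniature. This is a sketch under strong assumptions: pages count as "closely related" when they share a host and first path segment, and the descriptor is simply the most frequent words in the group. Neither choice comes from the source; the function `build_semantic_pages` and the URLs are hypothetical.

```python
from collections import Counter, defaultdict
from urllib.parse import urlparse

def build_semantic_pages(pages, top_terms=2):
    """Group pages by (host, first path segment) and describe each group
    by its most frequent words. Both choices are illustrative assumptions."""
    clusters = defaultdict(list)
    for url, text in pages.items():
        parsed = urlparse(url)
        segment = parsed.path.strip("/").split("/")[0]
        clusters[(parsed.netloc, segment)].append(text)
    descriptors = {}
    for key, texts in clusters.items():
        counts = Counter(w for t in texts for w in t.lower().split())
        descriptors[key] = [w for w, _ in counts.most_common(top_terms)]
    return descriptors

# Hypothetical pages, invented for this example.
pages = {
    "http://cs.example.edu/db/index.html": "database research database group",
    "http://cs.example.edu/db/papers.html": "database research papers",
    "http://cs.example.edu/ai/index.html": "machine learning agents",
}

descriptors = build_semantic_pages(pages)
print(descriptors[("cs.example.edu", "db")])
```

Each cluster key with its descriptor would correspond to one entry in a higher layer of the directory; repeating the grouping on the descriptors is one conceivable way to add further layers.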

19
Questions? Comments?

20
  • Han, J., Chang, K.C.-C. "Data mining for Web
    intelligence" IEEE Computer, Volume 35, Issue 11,
    Nov. 2002, pp. 64-70.