WEB STRUCTURE MINING - PowerPoint PPT Presentation

About This Presentation

Title:

WEB STRUCTURE MINING

Slides: 23

Provided by: Bij78

Transcript and Presenter's Notes

Title: WEB STRUCTURE MINING

1
WEB STRUCTURE MINING

SUBMITTED BY
BLESSY JOHN
R7A
ROLL NO: 18

2
INTRODUCTION

Web mining is the application of data mining
techniques to Web data, as used in search engines.
Data mining is the process of discovering useful
knowledge from data sources.
Web mining automatically discovers and extracts
information from Web documents.
Web structure mining discovers useful information
from hyperlinks.

3
WEB MINING

Extraction of useful patterns from WWW resources
The WWW is a widely distributed, global information
service centre that constitutes a rich source for
data mining
Employs techniques from data mining,
information retrieval, etc.

4
NEED FOR WEB MINING

Aims at finding and extracting relevant
information that is hidden in web-related data.
The challenge is to bring back the semantics of
hypertext documents
To turn web data into web knowledge

5
CLASSIFICATION

6
WEB STRUCTURE MINING

Generates a structural summary of the Web site
and Web page
Uses graph theory to analyse the node and connection
structure of a web site
Analyses the link structure of the web; its
purpose is to identify the more preferable
documents
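
The graph view described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `links` dictionary is a hypothetical four-page site, and `degree_summary` is a made-up helper name. It counts incoming and outgoing links per page, which is one very simple structural summary.

```python
from collections import defaultdict

# Hypothetical site (assumption): each page maps to the pages it links to.
links = {
    "home": ["about", "products", "contact"],
    "about": ["home"],
    "products": ["home", "contact"],
    "contact": ["home"],
}

def degree_summary(graph):
    """Return {page: (in_degree, out_degree)} for a directed link graph."""
    in_degree = defaultdict(int)
    for source, targets in graph.items():
        for target in targets:
            in_degree[target] += 1
    pages = set(graph) | set(in_degree)
    return {p: (in_degree[p], len(graph.get(p, []))) for p in pages}

summary = degree_summary(links)
for page, (n_in, n_out) in sorted(summary.items()):
    print(f"{page}: in={n_in} out={n_out}")
```

Pages with many incoming links are candidates for the "more preferable" documents; HITS and PageRank refine this basic idea.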

7
WEB STRUCTURE MINING (cont.)

Discovers the nature of the hierarchy of
hyperlinks in the website and its structure
A hyperlink indicates the author's endorsement of
the other web page
Retrieves information about the relevance and
the quality of the web page.

8
Page Layout and Link Analysis for Web Images

9
WEB BASICS

The Web is a huge collection of documents linked
together by references.
References from one document to another are based
on hypertext and embedded in HTML
HTML describes how the document should be displayed
in the browser window
Each web document has a web address, called a URL,
that identifies it uniquely.
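
As a rough illustration of how references embedded in HTML can be read back out, the sketch below uses Python's standard `html.parser` and `urllib.parse.urljoin`. The `LinkExtractor` class name, the sample page and the example.com/example.org addresses are all assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the href attributes of <a> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative references against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

# Hypothetical page content and address, for illustration only.
page = ('<p>See <a href="/docs/intro.html">intro</a> and '
        '<a href="http://example.org/">more</a>.</p>')
extractor = LinkExtractor("http://example.com/index.html")
extractor.feed(page)
print(extractor.links)
```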

10
WEB CRAWLERS

Collect all web documents by browsing the Web
systematically and exhaustively
The region of the web to be crawled can be specified
by using the URL structure.
Used by search engines to provide local access
to the most recent versions of possibly all web
pages
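
A minimal sketch of such a crawl, assuming a tiny in-memory "web" instead of real HTTP requests. The `WEB` dictionary, the `crawl` function and its `fetch_links` parameter are hypothetical names; the URL prefix plays the role of the region restriction mentioned above.

```python
from collections import deque

# Hypothetical in-memory "web" (assumption): URL -> list of outgoing links.
WEB = {
    "http://example.com/": ["http://example.com/a", "http://other.org/"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/"],
    "http://example.com/b": [],
    "http://other.org/": ["http://example.com/b"],
}

def crawl(seed, prefix, fetch_links):
    """Breadth-first crawl from seed, visiting only URLs that start with prefix."""
    visited = []
    seen = {seed}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # Stay inside the chosen region and never queue a URL twice.
            if link.startswith(prefix) and link not in seen:
                seen.add(link)
                queue.append(link)
    return visited

order = crawl("http://example.com/", "http://example.com/",
              lambda u: WEB.get(u, []))
print(order)
```

A real crawler would replace the lambda with a function that downloads the page and extracts its links, and would also need politeness delays and error handling, which are left out here.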

11
INDEXING AND KEYWORD SEARCH

There are two types of data:
structured and unstructured
Structured data have keys associated with each
data item that reflect its content
Keyword search is content-based access to
unstructured data without considering its
meaning
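
Keyword search is commonly supported by an inverted index, which maps each word to the documents containing it. The sketch below is one possible illustration, not the method of any particular search engine; the `docs` collection and the `build_index` and `search` names are assumptions.

```python
from collections import defaultdict

# Hypothetical document collection (assumption): id -> text.
docs = {
    1: "web mining uses data mining techniques",
    2: "hyperlinks connect web documents",
    3: "data mining discovers knowledge",
}

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query word (AND search)."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(docs)
print(search(index, "data mining"))
print(search(index, "web"))
```

The lookup only matches character strings; it has no notion of what the words mean, which is exactly the limitation the slide points out.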

12
DOCUMENT REPRESENTATION

To facilitate the process of matching keywords
and documents, some preprocessing steps are taken
first:
Documents are tokenized
Characters are converted to upper or lower case
Words are reduced to canonical form
Stopwords are usually removed
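
The preprocessing steps listed above can be sketched as a small pipeline. Everything here is illustrative: the stopword list is a short made-up sample, and `crude_stem` is a deliberately rough stand-in for a real stemmer such as Porter's.

```python
import re

# Small illustrative stopword list (assumption); real lists are much longer.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "in", "to", "and"}

def crude_stem(word):
    """Very rough suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[A-Za-z0-9]+", text)        # tokenize
    tokens = [t.lower() for t in tokens]              # convert to lower case
    tokens = [crude_stem(t) for t in tokens]          # reduce to canonical form
    return [t for t in tokens if t not in STOPWORDS]  # remove stopwords

print(preprocess("The crawlers are indexing the pages of the Web"))
```

The same function would be applied to both documents and queries so that, for example, "pages" in a query matches "page" in the index.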

13
ALGORITHMS

There are two main algorithms used in web
structure mining:
1. HITS (Hypertext-Induced Topic Search)
2. PageRank algorithm

14
HITS (Hypertext-Induced Topic Search)

Link analysis algorithm
Rates web pages
Developed by Jon Kleinberg
Determines two values for a page:
Authority: estimates the value of the content of
the page
Hub: estimates the value of its links to other
pages

15
Hubs and Authorities

Hub pages contain interesting links that point to
authorities, i.e. relevant pages
Authorities are the targets of hub pages

16
Hubs and Authorities (cont.)

Authority and hub values are defined in terms of
one another in a mutual recursion
HITS is executed at query time, with the associated
hit on performance
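
The mutual recursion can be made concrete with a small iterative sketch. It assumes a hypothetical five-page `graph`, a fixed number of iterations and Euclidean normalization; none of these details come from the slides, and a real system would run this on the subgraph retrieved for a query.

```python
import math

# Hypothetical link graph (assumption): page -> pages it links to.
graph = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2"],
    "a1": [],
    "a2": [],
    "x": ["a1"],
}

def hits(graph, iterations=50):
    """Return (authority, hub) score dicts after repeated mutual updates."""
    pages = set(graph) | {t for targets in graph.values() for t in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority update: sum of hub scores of the pages linking in.
        new_auth = {p: 0.0 for p in pages}
        for source, targets in graph.items():
            for target in targets:
                new_auth[target] += hub[source]
        # Hub update: sum of authority scores of the pages linked to.
        new_hub = {p: sum(new_auth[t] for t in graph.get(p, [])) for p in pages}
        # Normalize so that the scores do not grow without bound.
        a_norm = math.sqrt(sum(v * v for v in new_auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in new_hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in new_auth.items()}
        hub = {p: v / h_norm for p, v in new_hub.items()}
    return auth, hub

authority, hub_score = hits(graph)
print(max(authority, key=authority.get))
print(sorted(hub_score, key=hub_score.get, reverse=True)[:2])
```

In this toy graph the page with the most links from good hubs ends up as the top authority, and the pages linking to both authorities end up as the top hubs.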

17
PageRank

Link analysis algorithm
Assigns a numerical weight to each element of
a hyperlinked set of documents
The weight of an element E is denoted by PR(E)
Relies on the uniquely democratic nature of the web
A link from page A to page B is a vote, by page A,
for page B

18
PageRank (cont.)

A vote from a page A that is itself important weighs
more and helps to make B important
PageRank is also a probability distribution that
represents the likelihood that a person randomly
clicking on links arrives at any particular page
A PageRank of 0.5 means a 50% chance that a person
clicking on a random link will be directed to the
document with the 0.5 PageRank
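
The voting and probability ideas can be sketched with a simple power iteration. The four-page `graph`, the damping factor of 0.85 and the handling of pages without out-links are assumptions chosen for illustration; the slides do not specify them.

```python
# Hypothetical link graph (assumption): page -> pages it links to.
graph = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}

def pagerank(graph, damping=0.85, iterations=100):
    """Power-iteration PageRank; the returned ranks sum to 1."""
    pages = sorted(set(graph) | {t for ts in graph.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not graph.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for source in pages:
            targets = graph.get(source, [])
            for target in targets:
                # Each page splits its vote evenly among its out-links.
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank

ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

Because the ranks form a probability distribution, each value can be read as the chance that a random surfer is on that page at a given moment.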

19
APPLICATIONS

Information retrieval in social networks
Finding the relevance of each Web page
Measuring the completeness of Web sites
Used in search engines to find relevant
information

20
CONCLUSION

Search engines use web structure mining to find
information.
We can create new knowledge out of the available
information
Web content mining can be added to it to enhance
the performance of search engines.