Deep Web - PowerPoint PPT Presentation

About This Presentation

Title:

Deep Web

Slides: 36

Provided by: Sou131

Transcript and Presenter's Notes

Title: Deep Web

1
Deep Web

Under the guidance of
Prof. Pushpak Bhattacharyya

Presented by (Group 4):
Jayanta Das (11305R012)
Souvik Pal (113059003)
Subhro Bhattacharyya (113059005)
2
Introduction

What is Deep Web

3
Introduction: What is Deep Web?

Modern Internet: the most effective source of information.
Most popular search engine: Google.
In 2008, Google's systems counted their trillionth (10^12) unique web link!
Its index stores several billion documents!
Despite this, we are often not satisfied with the search results.
43% of users report dissatisfaction with the results.

4
Real Life Example
5
Motivation: Why Deep Web?

Then why does Google fail?
Most of the Web's information is buried far down
on dynamically generated sites.
A traditional web crawler cannot reach it.
A large portion of the data is literally unexplored.
The quest to explore the unknown is a human instinct.
There is a need for more specific information that is
stored in databases.
It can only be obtained if we have access to the
database containing the information.

6
Evolution of Deep Web

Early days: static HTML pages, which crawlers can
easily reach.
Mid-90s: introduction of dynamic pages, which
are generated as the result of a query.
1994: Jill Ellsworth used the term Invisible
Web to refer to these websites.
2001: Bergman coined the term Deep Web.

7
Measuring the Deep Web (1)

"When you can measure what you are speaking
about, and express it in numbers, you know
something about it." (Lord Kelvin)
First attempt: Bergman (2000)
Size of the surface web is around 19 TB.
Size of the Deep Web is around 7500 TB.
The Deep Web is nearly 400 times larger than the
Surface Web (7500 / 19 is roughly 395).

8
Measuring the Deep Web (2)

In 2004, Mitesh classified the deep web more
accurately.
Most of the HTML forms are found either on the
first hop or the second hop from the home page.

9
Measuring the Deep Web (3)

Unstructured: data objects as unstructured media
(text, images, audio, video),
e.g. www.cnn.com
Structured: data objects as structured
relational records with attribute-value pairs.

10
Deep Resources

Dynamic web pages:
returned in response to a submitted query or
accessed only through a form
Unlinked content:
pages without any backlinks
Private web:
sites requiring registration and login
(password-protected resources)
Limited-access web:
sites with CAPTCHAs or no-cache pragma HTTP headers
Scripted pages:
pages produced by JavaScript, Flash, AJAX etc.
Non-HTML content:
multimedia files, e.g. images or videos

11
Approach towards crawling Deep Web
12
Timeline: How it all started!

2001: Raghavan et al. -> Hidden Web Exposer
domain-specific, human-assisted crawler
2002: StumbleUpon used human crawlers
human crawlers can find relevant links that
algorithmic crawlers miss.
2003: Bergman introduced LexiBot
used for quantifying the deep web
2004: Yahoo! Content Acquisition Program
paid inclusion for webmasters

13
Timeline (contd.)

2005: Yahoo! Subscriptions
Yahoo started searching subscription-only sites,
e.g. WSJ
2005: Ntoulas et al. -> Hidden Web Crawler
automatically generated meaningful queries to
issue against search forms
2005: Google Sitemaps
allows webmasters to inform search engines about
URLs on their websites that are available for
crawling.

14
Present Deep Web Search Scenario

Federated Search
Google's surfacing

15
Federated Search

Federated search is the process of performing a
real-time search of multiple diverse and
distributed sources from a single search page,
with the federated search engine acting as
intermediary.
Why federated?
Content from different sources is combined
instead of searching the sources one at a time.

16
Federated Search Properties (1)

Real time:
Federated search occurs live and results are current.
Diverse and distributed sources:
Multiple sources at different locations
on the web are searched. The sources are diverse in
nature, containing text, documents, PDFs, PPTs etc.

17
Federated Search Properties (2)

Single search page:
Federated search engines provide a single point of
searching.
Federated search engine acts as intermediary:
The user does not communicate directly with the
content sources when performing searches. The
search engine does it on the user's behalf.

18
Federated Search Method

Works by filling out forms on web pages.
The search engine is programmed with the
knowledge of each form that it has to search.
It knows how to fill out the form, press the
submit button and retrieve the results.
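
The form-filling step described above can be sketched roughly as follows. This is only an illustration, not how any particular federated search engine is built: the `FormSpec` class, the field names and the example.org URL are all hypothetical, and the sketch only builds the request a GET form would send instead of actually submitting it.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class FormSpec:
    """Hand-written knowledge about one search form (hypothetical structure)."""
    action: str                                        # URL the form submits to
    query_field: str                                   # name of the text box holding the query
    fixed_fields: dict = field(default_factory=dict)   # hidden or default inputs


def build_request_url(spec: FormSpec, query: str) -> str:
    """Fill out the form for `query` and return the URL a GET submit would request."""
    params = dict(spec.fixed_fields)
    params[spec.query_field] = query
    return f"{spec.action}?{urlencode(params)}"


# Example source; the URL and field names are made up for illustration.
library = FormSpec(
    action="https://library.example.org/search",
    query_field="q",
    fixed_fields={"lang": "en"},
)
print(build_request_url(library, "deep web"))
```

A real connector would also fetch the resulting page and parse the result list, and that parsing is usually specific to each source.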

19
Web Form example
A web form that a normal search engine cannot
crawl. This involves filling in the textbox,
clicking search and retrieving the results.
20
Federated search example
WorldWideScience.org: searches science content
from all over the world, from government
agencies, research and academic organizations.
21
Fed Search In Action
Incremental search: federated search engines do
not wait for results from all sources. To improve
response time, results are displayed in chunks
while the search continues in the background.
When a new result set is available, the user is
prompted.
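
One way to picture incremental search is with concurrent tasks whose results are handled as each one finishes. The sketch below assumes simulated sources: `query_source` is a hypothetical stand-in that sleeps instead of contacting a real site, and the source names and delays are invented.

```python
import asyncio


async def query_source(name: str, delay: float, query: str) -> tuple[str, list[str]]:
    """Stand-in for a real source; sleeps to simulate network latency."""
    await asyncio.sleep(delay)
    return name, [f"{name}: result for {query!r}"]


async def incremental_search(query: str, sources: dict[str, float]) -> list[str]:
    """Query all sources at once and return their names in order of arrival."""
    tasks = [asyncio.create_task(query_source(n, d, query)) for n, d in sources.items()]
    arrival_order = []
    for finished in asyncio.as_completed(tasks):
        name, results = await finished
        arrival_order.append(name)
        # A real interface would prompt the user that a new chunk is available.
        print(f"new results from {name}: {results}")
    return arrival_order


order = asyncio.run(
    incremental_search("deep web", {"slow": 0.2, "fast": 0.01, "medium": 0.1})
)
```

The fastest source is shown first, so the user is not held up by the slowest one.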
22
Metasearch vs Fed Search

Metasearch is similar to federated search.
Here the search engine searches other search
engines in real time.
Even though it queries the underlying search
engines in real time, those engines
may not have the most current information,
as they themselves rely on crawlers.
It is NOT a Deep Web search!
People often confuse metasearch with federated
search.

23
Metasearch example
24
Federated Search (Advantages)

Efficiency, time savings:
Instead of the user querying many sources one at a
time, the federated search engine does it on the
user's behalf.
Quality of results:
Searches only authoritative sources, since it has
been programmed to do so.
Most current content:
Searches in real time.

25
Federated Search (Challenges)

Aggregation:
The process of combining search results from
different sources in some helpful way,
e.g. sorting by date, title, author
Ranking:
Displaying the results most relevant to the search first
De-duplication:
A federated search engine may retrieve the same
result from multiple sources
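
Aggregation and de-duplication can be illustrated with a short sketch. It assumes each source returns records with a title, author and date. The `Result` class, the duplicate key (normalised title plus author) and the sample records are all assumptions made for the example, and sorting by date stands in for a real relevance ranking.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    title: str
    author: str
    published: date
    source: str


def normalise(title: str) -> str:
    """Crude duplicate key: lower-case and collapse whitespace."""
    return " ".join(title.lower().split())


def aggregate(result_sets: list[list[Result]]) -> list[Result]:
    """Merge result sets, drop duplicates, and sort newest first."""
    seen = {}
    for results in result_sets:
        for r in results:
            key = (normalise(r.title), r.author.lower())
            seen.setdefault(key, r)  # keep the first copy encountered
    return sorted(seen.values(), key=lambda r: r.published, reverse=True)


# Invented sample records from two hypothetical sources.
source_a = [
    Result("Ocean Currents", "Rao", date(2005, 3, 1), "A"),
    Result("Polar Ice Cores", "Sen", date(2008, 6, 1), "A"),
]
source_b = [Result("ocean  currents", "rao", date(2005, 3, 1), "B")]

merged = aggregate([source_a, source_b])
print([r.title for r in merged])
```

Real systems need fuzzier matching than this, since the same item often appears with slightly different titles or author spellings.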

26
Google's reasons to move away from Fed Search

Federated search works quite well when it is
restricted to one domain.
For general search involving multiple
domains it is not as effective:
The number of domains is extremely large.
Defining the boundary of a domain is difficult.
Mapping a query to a domain is difficult.
It is dependent on the latency of deep web sources.

27
Case Study: Google's Crawling
28
Case Study: Google's crawling (1)

Two approaches for Deep Web Crawling
Virtual Integration
Surfacing

29
Case Study: Google's crawling (2)

Virtual Integration (domain specific)
A mediator form is created for each domain.
A semantic mapping is built between the individual data sources
and the mediator form.
Performed in real time.
Drawbacks:
Cost of building the mediator form and the mappings.
Identifying relevant queries for a particular
domain.
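
The semantic mapping between a mediator form and the individual sources can be pictured as a per-source table of field names. The sketch below is only an illustration of that idea: the car-search domain, the source names and every field name are invented.

```python
# Hypothetical per-source mappings: mediator field -> the source's own form field.
MAPPINGS = {
    "cars-r-us.example": {"make": "brand", "max_price": "price_to"},
    "autofinder.example": {"make": "mk", "max_price": "maxp"},
}


def translate(mediator_query: dict, source: str) -> dict:
    """Rewrite a mediator-form query into the source's own field names.

    Mediator fields that the source has no mapping for are dropped.
    """
    mapping = MAPPINGS[source]
    return {mapping[f]: v for f, v in mediator_query.items() if f in mapping}


query = {"make": "Honda", "max_price": 5000, "colour": "red"}
for source in MAPPINGS:
    print(source, translate(query, source))
```

Every new source needs another hand-built entry in the table, which is the cost the slide lists as a drawback.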

30
Case Study: Google's crawling (3)

Surfacing
Precomputes the most relevant form values for
interesting HTML forms.
The resulting URLs are generated offline and indexed.
Helps in retaining the existing infrastructure while
including the Deep Web.
Covers the maximum number of web pages while bounding the total
number of web form submissions.
GET vs POST method
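
A rough sketch of the surfacing idea: given candidate values for a form's inputs, generate the URLs offline and stop at a fixed budget of submissions. It handles GET forms only, because a GET submission is fully described by its URL while a POST submission carries its values in the request body. The form URL, input names, candidate values and the simple "take the first N combinations" policy are assumptions for illustration, not Google's actual selection method.

```python
from itertools import islice, product
from urllib.parse import urlencode


def surface_urls(action: str, candidates: dict[str, list[str]], max_submissions: int) -> list[str]:
    """Generate at most `max_submissions` GET URLs from candidate input values."""
    names = list(candidates)
    combos = product(*(candidates[n] for n in names))
    return [
        f"{action}?{urlencode(dict(zip(names, combo)))}"
        for combo in islice(combos, max_submissions)
    ]


urls = surface_urls(
    "https://cars.example.org/search",          # hypothetical form action
    {"make": ["honda", "ford"], "state": ["CA", "NY", "TX"]},
    max_submissions=4,
)
for u in urls:
    print(u)
```

The generated URLs can then be handed to the ordinary crawler and indexed like any other page.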

31
Case Study: Google's crawling (4)

Challenges
Which form inputs to fill
Appropriate values for those inputs
Google's approach
Selecting a wild card for form submission
Some fields are mandatory
Query templates
Testing with all possible values in a select menu
Predicting form values from data types
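
The query-template idea can be sketched as follows, reading a template as a choice of which inputs get concrete values while the remaining inputs keep a default (wild-card) value. The input names, the select-menu values and the limit on template size are invented for the example, and the sketch does not attempt the step of judging which templates are worth submitting.

```python
from itertools import combinations, product


def query_templates(inputs: list[str], max_binding: int) -> list[tuple[str, ...]]:
    """List every choice of 1..max_binding inputs to fill; the rest stay at defaults."""
    return [t for k in range(1, max_binding + 1) for t in combinations(inputs, k)]


def submissions(template, select_values: dict[str, list[str]], defaults: dict[str, str]):
    """Yield one filled form per combination of select-menu values in the template."""
    for combo in product(*(select_values[name] for name in template)):
        filled = dict(defaults)            # unfilled inputs keep their default (wild card)
        filled.update(zip(template, combo))
        yield filled


# Hypothetical form with two select menus.
select_values = {"make": ["honda", "ford"], "state": ["CA", "NY", "TX"]}
defaults = {"make": "", "state": ""}

templates = query_templates(list(select_values), max_binding=2)
print(templates)
for form in submissions(("make",), select_values, defaults):
    print(form)
```

The number of submissions grows quickly with template size, which is why the number of inputs filled at once has to be kept small.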

32
Subconscious Mind and Deep Web

Inspiration behind exploration of deep web
Analogy
Iceberg example
Real life example

33
References(1)

Wikipedia, http://en.wikipedia.org/wiki/Deep_web
Bergman, Michael K., "The Deep Web: Surfacing
Hidden Value". The Journal of Electronic
Publishing, August 2001
Alex Wright, "Exploring a 'Deep Web' That Google
Can't Grasp". The New York Times, Feb 23, 2009.
http://www.nytimes.com/2009/02/23/technology/internet/23search.html?themcth
Jesse Alpert, Nissan Hajaj, "We knew the web was
big", 2008.
http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
He, Bin; Patel, Mitesh; Zhang, Zhen; Chang, Kevin
Chen-Chuan, "Accessing the Deep Web: A Survey".
Communications of the ACM (CACM), May 2007

34
References(2)

Madhavan, Jayant; David Ko; Lucja Kot; Vignesh
Ganapathy; Alex Rasmussen; Alon Halevy, "Google's
Deep-Web Crawl", 2008
Maureen Flynn-Burhoe, "Timeline of events related
to the Deep Web", 2008.
http://papergirls.wordpress.com/2008/10/07/timeline-deep-web/
Darcy Pedersen, "Federated Search Finds Content
that Google Can't Reach, Part I of III", 2009.
http://deepwebtechblog.com/federated-search-finds-content-that-google-cant-reach-part-i-of-iii/
Darcy Pedersen, "A Federated Search Primer, Part
II of III", 2009.
http://deepwebtechblog.com/a-federated-search-primer-part-ii-of-iii/
Darcy Pedersen, "A Federated Search Primer, Part
III of III", 2009.
http://deepwebtechblog.com/a-federated-search-primer-part-iii-of-iii/