Title: Deep Web
1 Deep Web
- Under the guidance of Prof. Pushpak Bhattacharyya
- Presented by Jayanta Das (11305R012), Souvik Pal (113059003), Subhro Bhattacharyya (113059005) (Group 4)
2 Introduction
3 Introduction: What is the Deep Web?
- Modern Internet: the most effective source of information.
- Most popular search engine: Google.
- In 2008, Google added the trillionth (10^12) web link to its index database!
- It stores several billion documents!
- Despite this, we are often not satisfied with the search results.
- 43% of users report dissatisfaction with the results.
4 Real Life Example
5 Motivation: Why Deep Web?
- Then why does Google fail?
- Most of the Web's information is buried far down on dynamically generated sites.
- Traditional web crawlers cannot reach it.
- A large portion of the data is literally unexplored.
- Quest for exploration of the unknown: a human instinct.
- Need for more specific information stored in databases.
- This can only be obtained if we have access to the database containing the information.
6 Evolution of Deep Web
- Early days: static HTML pages, which crawlers can easily reach.
- In the mid-90s: introduction of dynamic pages, which are generated as the result of a query.
- In 1994, Jill Ellsworth used the term "Invisible Web" to refer to these websites.
- In 2001, Bergman coined the term "Deep Web".
7 Measuring the Deep Web (1)
- "When you can measure what you are speaking about, and express it in numbers, you know something about it." (Lord Kelvin)
- First attempt: Bergman (2000)
- Size of the surface web is around 19 TB
- Size of the Deep Web is around 7500 TB
- The Deep Web is nearly 400 times larger than the Surface Web
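The "nearly 400 times" figure follows directly from the two size estimates above. A quick check in Python (the variable names are only illustrative):

```python
# Bergman's (2000) size estimates as quoted on the slide, in terabytes.
surface_web_tb = 19
deep_web_tb = 7500

# How many times larger the deep web is than the surface web.
ratio = deep_web_tb / surface_web_tb
print(f"Deep web is about {ratio:.0f} times larger")  # about 395, i.e. nearly 400
```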
8 Measuring the Deep Web (2)
- In 2004, Mitesh classified the deep web more accurately.
- Most of the HTML forms are found on either the first or the second hop from the home page.
9 Measuring the Deep Web (3)
- Unstructured: data objects as unstructured media (text, images, audio, video)
- e.g. www.cnn.com
- Structured: data objects as structured relational records with attribute-value pairs.
10 Deep Resources
- Dynamic Web Pages
- Returned in response to a submitted query or accessed only through a form
- Unlinked Contents
- Pages without any backlinks
- Private Web
- Sites requiring registration and login (password-protected resources)
- Limited Access Web
- Sites with CAPTCHAs or no-cache pragma HTTP headers
- Scripted Pages
- Pages produced by JavaScript, Flash, AJAX, etc.
- Non-HTML Contents
- Multimedia files, e.g. images or videos
11 Approach towards crawling Deep Web
12 Timeline: How it all started!
- 2001: Raghavan et al. -> Hidden Web Exposer
- Domain-specific, human-assisted crawler
- 2002: StumbleUpon used human crawlers
- Human crawlers can find relevant links that algorithmic crawlers miss.
- 2003: Bergman introduced LexiBot
- Used for quantifying the deep web
- 2004: Yahoo! Content Acquisition Program
- Paid inclusion for webmasters
13 Timeline contd.
- 2005: Yahoo! Subscriptions
- Yahoo! started searching subscription-only sites
- e.g. WSJ
- 2005: Ntoulas et al. -> Hidden Web Crawler
- Automatically generated meaningful queries to issue against search forms
- 2005: Google Sitemap
- Allows webmasters to inform search engines about URLs on their websites that are available for crawling.
14 Present Deep Web Search Scenario
- Federated Search
- Google's surfacing
15 Federated Search
- Federated search is the process of performing a real-time search of multiple diverse and distributed sources from a single search page, with the federated search engine acting as an intermediary.
- Why federated?
- Content from different sources is combined instead of searching the sources one at a time.
16 Federated Search Properties (1)
- Real Time
- Federated search occurs live, so results are current.
- Diverse and Distributed Sources
- Multiple sources located in different places on the web are searched. Sources are diverse in nature, containing text, documents, PDFs, PPTs, etc.
17 Federated Search Properties (2)
- Single Search Page
- Federated search engines provide a single point of searching.
- Federated search engine acts as intermediary
- The user does not communicate directly with the content sources when performing searches. The search engine does it on the user's behalf.
18 Federated Search Method
- Works by filling out forms on web pages.
- The search engine is programmed with the knowledge of each form that it has to search.
- It knows how to fill out the form, press the submit button and retrieve the results.
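The idea of an engine "programmed with the knowledge of each form" can be sketched as a small table of per-source form descriptions. This is a minimal illustration, not any real engine's implementation: the source names, URLs and field names are hypothetical, and the sketch only builds the form submissions without sending them.

```python
from dataclasses import dataclass
from urllib.parse import urlencode


@dataclass
class FormKnowledge:
    """What the engine is programmed to know about one source's search form."""
    action_url: str      # where the form submits to (hypothetical)
    query_field: str     # name of the text box that takes the query
    fixed_fields: dict   # other inputs the form expects, with preset values


# Hypothetical sources; the URLs and field names are made up for illustration.
SOURCES = {
    "library": FormKnowledge("https://library.example/search", "q", {"lang": "en"}),
    "patents": FormKnowledge("https://patents.example/find", "term", {"page": "1"}),
}


def fill_form(form: FormKnowledge, query: str) -> str:
    """Fill out the form and return the URL a GET submission would request."""
    fields = {form.query_field: query, **form.fixed_fields}
    return f"{form.action_url}?{urlencode(fields)}"


def federated_requests(query: str) -> dict:
    """Build one form submission per known source for the same user query."""
    return {name: fill_form(form, query) for name, form in SOURCES.items()}


requests_to_send = federated_requests("deep web")
for name, url in requests_to_send.items():
    print(name, url)
```

Retrieving the results would additionally need an HTTP client and a per-source parser for each results page, which is where most of the per-form programming effort goes.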
19 Web Form example
A web form that a normal search engine cannot crawl. Searching it involves filling in the textbox, clicking search and retrieving the results.
20 Federated search example
WorldWideScience.org: searches science content from all over the world, from government agencies, research and academic organizations.
21 Fed Search In Action
Incremental search: federated search engines do not wait for results from all sources. To improve response time, results are displayed in chunks while the search continues in the background. When a new result set is available, the user is prompted.
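Incremental display can be sketched with a thread pool that hands back each source's results as soon as that source answers. The sources below are simulated with canned results and artificial delays; the names and latencies are assumptions made purely for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Simulated sources: name -> (latency in seconds, canned results).
FAKE_SOURCES = {
    "fast_source": (0.01, ["result A", "result B"]),
    "medium_source": (0.05, ["result C"]),
    "slow_source": (0.10, ["result D", "result E"]),
}


def search_source(name: str, query: str):
    """Stand-in for a live query against one source."""
    latency, results = FAKE_SOURCES[name]
    time.sleep(latency)  # pretend to wait on the network
    return name, [f"{r} for '{query}'" for r in results]


def incremental_search(query: str):
    """Yield (source, results) chunks as soon as each source answers."""
    with ThreadPoolExecutor(max_workers=len(FAKE_SOURCES)) as pool:
        futures = [pool.submit(search_source, name, query) for name in FAKE_SOURCES]
        for future in as_completed(futures):
            yield future.result()


chunks = []
for source, results in incremental_search("deep web"):
    # A real interface would prompt the user that new results are available.
    print(f"{len(results)} new result(s) from {source}")
    chunks.append((source, results))
```

Because chunks are yielded in completion order, a slow source delays only its own results and not the whole page.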
22 Metasearch vs Fed Search
- Metasearch is similar to federated search.
- Here the search engine searches other search engines in real time.
- Even though it queries the underlying search engines in real time, those engines may not have the most current information, as they themselves rely on crawlers.
- It is NOT a Deep Web search!
- People often confuse metasearch with federated search.
23 Metasearch example
24 Federated Search (Advantages)
- Efficiency, Time Savings
- Instead of the user querying many search engines one at a time, the federated search engine does it on the user's behalf.
- Quality of results
- Searches only authoritative sources, since it has been programmed to do so.
- Most Current Content
- Searches in real time.
25 Federated Search (Challenges)
- Aggregation
- The process of combining search results from different sources in some helpful way
- e.g. sorting by date, title, author
- Ranking
- Displaying the results most relevant to the search
- De-duplication
- A federated search engine may retrieve the same result from multiple sources
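Aggregation and de-duplication can be sketched as a merge step over per-source result lists. The `Result` fields, the URL-based duplicate test and the sample data are assumptions for illustration; relevance ranking is not attempted here, and sorting by a single field stands in for it.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Result:
    title: str
    url: str
    published: date
    source: str


def deduplicate(results):
    """Keep the first result seen for each URL (compared case-insensitively,
    ignoring a trailing slash). Real engines would need fuzzier matching."""
    seen = set()
    unique = []
    for r in results:
        key = r.url.lower().rstrip("/")
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique


def aggregate(result_sets, sort_key="published"):
    """Merge per-source result lists, drop duplicates, sort by one field."""
    merged = [r for results in result_sets for r in results]
    unique = deduplicate(merged)
    # Newest first for dates; alphabetical for text fields such as title.
    reverse = sort_key == "published"
    return sorted(unique, key=lambda r: getattr(r, sort_key), reverse=reverse)


source_a = [
    Result("Deep Web Survey", "http://a.example/survey", date(2007, 5, 1), "A"),
    Result("Crawling Forms", "http://a.example/forms", date(2008, 8, 1), "A"),
]
source_b = [
    Result("Deep Web Survey", "http://a.example/survey/", date(2007, 5, 1), "B"),
]

combined = aggregate([source_a, source_b])
print([r.title for r in combined])
```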
26 Google's reasons to move away from Fed Search
- Federated search works quite well when it is restricted to one domain.
- For general search involving multiple domains it is not as effective.
- The number of domains is extremely large.
- Defining the boundary of a domain is difficult.
- Mapping a query to a domain is difficult.
- Response time depends on the latency of the deep web sources.
27 Case Study: Google's Crawling
28 Case Study: Google's crawling (1)
- Two approaches for Deep Web Crawling
- Virtual Integration
- Surfacing
29 Case Study: Google's crawling (2)
- Virtual Integration (Domain Specific)
- A mediator form is created for each domain.
- Semantic mappings are defined between the individual data sources and the mediator form.
- Performed in real time.
- Drawbacks
- Cost of building the mediator forms and mappings.
- Identifying the relevant queries for a particular domain.
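The mediator form and its semantic mappings can be pictured as a lookup table that rewrites one query into each source's own field names. The domain, site names and field names below are invented for illustration and are not taken from any real system.

```python
# Mediator form for a hypothetical "used cars" domain.
MEDIATOR_FIELDS = ("make", "model", "max_price")

# Semantic mapping: mediator field -> field name used by each source's form.
# Source names and field names are invented for illustration.
SOURCE_MAPPINGS = {
    "cars_site_1": {"make": "brand", "model": "model_name", "max_price": "price_to"},
    "cars_site_2": {"make": "mk", "model": "mdl"},  # this form has no price input
}


def translate_query(mediator_query: dict, source: str) -> dict:
    """Rewrite a mediator-form query into the source form's own field names.
    Mediator fields the source cannot express are dropped."""
    unknown = set(mediator_query) - set(MEDIATOR_FIELDS)
    if unknown:
        raise ValueError(f"not mediator fields: {sorted(unknown)}")
    mapping = SOURCE_MAPPINGS[source]
    return {mapping[f]: v for f, v in mediator_query.items() if f in mapping}


query = {"make": "Honda", "model": "Civic", "max_price": "5000"}
per_source = {s: translate_query(query, s) for s in SOURCE_MAPPINGS}
print(per_source)
```

Every new source needs its own hand-built entry in the mapping table, which is the cost named in the drawbacks above.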
30 Case Study: Google's crawling (3)
- Surfacing
- Precomputes the most relevant form values for interesting HTML forms
- Resulting URLs are generated offline and indexed
- Helps in retaining the existing infrastructure while including the Deep Web
- Covers the maximum number of web pages while bounding the total number of web form submissions
- GET vs POST method
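A rough sketch of the surfacing idea: enumerate form submissions offline, under a fixed submission budget, and emit the resulting URLs for indexing. The form URL, input names and values are hypothetical, and this is not Google's actual procedure, only an illustration of precomputed values turning into crawlable URLs.

```python
from itertools import product
from urllib.parse import urlencode


def surface_urls(action_url: str, candidate_values: dict, max_submissions: int):
    """Enumerate GET submission URLs for a form, stopping at a fixed budget.

    candidate_values maps each form input to the values precomputed for it.
    """
    names = list(candidate_values)
    urls = []
    for combo in product(*(candidate_values[n] for n in names)):
        if len(urls) >= max_submissions:
            break  # bound the total number of form submissions
        urls.append(f"{action_url}?{urlencode(dict(zip(names, combo)))}")
    return urls


# Hypothetical form with two inputs; URL and values are illustrative only.
urls = surface_urls(
    "https://cars.example/search",
    {"make": ["honda", "ford"], "state": ["CA", "NY", "TX"]},
    max_submissions=4,
)
for u in urls:
    print(u)
```

The GET vs POST distinction matters here because a GET submission encodes the inputs in the URL, so each result page has its own address that can be indexed like any other page, whereas a POST submission sends the inputs in the request body and has no such distinct URL.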
31 Case Study: Google's crawling (4)
- Challenges
- Which form inputs to fill
- Appropriate values for those inputs
- Google's approach
- Selecting wild card for form submission
- Some fields are mandatory
- Query template
- Testing with all possible values in select menu
- Predicting form values from datatypes
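One way to picture query templates: a template fixes which inputs are given concrete values, while every other input is left at its wildcard or default. The sketch below enumerates templates that always include the mandatory inputs and then tries every select-menu value for the bound inputs. The form description, the empty string as wildcard and the `max_binding` parameter are assumptions for illustration, not details confirmed by the slides.

```python
from itertools import combinations, product

# Hypothetical form description: each select menu with its possible values.
# An empty string stands for the wildcard / default ("any") choice.
SELECT_MENUS = {
    "make": ["honda", "ford"],
    "color": ["red", "blue"],
    "state": ["CA", "NY"],
}
MANDATORY = {"state"}  # assumed: the form rejects submissions without it


def query_templates(max_binding: int):
    """List the sets of inputs to bind; every template includes mandatory ones.

    max_binding limits how many optional inputs a template may bind.
    """
    optional = sorted(set(SELECT_MENUS) - MANDATORY)
    templates = []
    for size in range(0, max_binding + 1):
        for chosen in combinations(optional, size):
            templates.append(tuple(sorted(MANDATORY | set(chosen))))
    return templates


def submissions(template):
    """All value assignments for one template; unbound inputs get the wildcard."""
    for combo in product(*(SELECT_MENUS[name] for name in template)):
        filled = {name: "" for name in SELECT_MENUS}  # wildcard everywhere
        filled.update(zip(template, combo))
        yield filled


templates = query_templates(max_binding=1)
print(templates)
first = next(submissions(templates[1]))
print(first)
```

Keeping templates small limits the number of submissions, since the count grows with the product of the menu sizes of the bound inputs.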
32 Subconscious Mind and Deep Web
- Inspiration behind exploration of the deep web
- Analogy
- Iceberg example
- Real life example
33 References (1)
- Wikipedia, http://en.wikipedia.org/wiki/Deep_web
- Bergman, Michael K., "The Deep Web: Surfacing Hidden Value". The Journal of Electronic Publishing, August 2001
- Alex Wright, "Exploring a 'Deep Web' That Google Can't Grasp". The New York Times, Feb 23, 2009. http://www.nytimes.com/2009/02/23/technology/internet/23search.html?themcth
- Jesse Alpert and Nissan Hajaj, "We knew the web was big", 2008. http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
- He, Bin; Patel, Mitesh; Zhang, Zhen; Chang, Kevin Chen-Chuan, "Accessing the Deep Web: A Survey". Communications of the ACM (CACM), May 2007
34 References (2)
- Jayant Madhavan, David Ko, Lucja Kot, Vignesh Ganapathy, Alex Rasmussen, Alon Halevy, "Google's Deep-Web Crawl", 2008
- Maureen Flynn-Burhoe, "Timeline of events related to the Deep Web", 2008, http://papergirls.wordpress.com/2008/10/07/timeline-deep-web/
- Darcy Pedersen, "Federated Search Finds Content that Google Can't Reach: Part I of III", 2009, http://deepwebtechblog.com/federated-search-finds-content-that-google-cant-reach-part-i-of-iii/
- Darcy Pedersen, "A Federated Search Primer: Part II of III", 2009, http://deepwebtechblog.com/a-federated-search-primer-part-ii-of-iii/
- Darcy Pedersen, "A Federated Search Primer: Part III of III", 2009, http://deepwebtechblog.com/a-federated-search-primer-part-iii-of-iii/
35 THANK YOU