Deep Web - PowerPoint PPT Presentation

About This Presentation
Title:

Deep Web

Slides: 36
Provided by: Sou131

Transcript and Presenter's Notes

1
Deep Web
  • Under the guidance of
  • Prof. Pushpak Bhattacharyya

Presented by (Group 4):
  • Jayanta Das (11305R012)
  • Souvik Pal (113059003)
  • Subhro Bhattacharyya (113059005)
2
Introduction
  • What is the Deep Web?

3
Introduction: What is the Deep Web?
  • Modern Internet: the most effective source of
    information.
  • Most popular search engine: Google
  • In 2008, Google reported that its systems had
    seen one trillion (10^12) unique URLs on the web!
  • Its index stores several billion documents!
  • Despite this, we are often not satisfied with
    the search results.
  • 43% of users report dissatisfaction with the
    results

4
Real Life Example
5
Motivation: Why the Deep Web?
  • Then why does Google fail?
  • Most of the Web's information is buried far down
    on dynamically generated sites.
  • Traditional web crawlers cannot reach it.
  • A large portion of the data is literally
    unexplored.
  • Quest to explore the unknown: a human instinct
  • Need for more specific information stored in
    databases
  • It can only be obtained if we have access to the
    database containing the information.

6
Evolution of Deep Web
  • Early days: static HTML pages, which crawlers
    can easily reach
  • Mid-90s: introduction of dynamic pages that are
    generated as the result of a query
  • In 1994, Jill Ellsworth used the term "Invisible
    Web" to refer to these websites.
  • In 2001, Bergman coined the term "Deep Web".

7
Measuring the Deep Web (1)
  • "When you can measure what you are speaking
    about, and express it in numbers, you know
    something about it." (Lord Kelvin)
  • First attempt: Bergman (2000)
  • Size of the Surface Web: around 19 TB
  • Size of the Deep Web: around 7,500 TB
  • The Deep Web is nearly 400 times larger than the
    Surface Web
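
As a quick check of the "nearly 400 times" figure, the ratio can be recomputed from the two size estimates on this slide. The variable names are only illustrative.

```python
# Size estimates quoted on the slide (Bergman, 2000), in terabytes.
surface_web_tb = 19
deep_web_tb = 7500

# Ratio of Deep Web size to Surface Web size.
ratio = deep_web_tb / surface_web_tb
print(f"Deep Web / Surface Web = {ratio:.1f}")
```

The result is a little under 395, which is consistent with the rounded claim of "nearly 400 times".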

8
Measuring the Deep Web (2)
  • In 2004, Mitesh Patel et al. classified the Deep
    Web more accurately.
  • Most HTML forms are found on either the first or
    the second hop from the home page.

9
Measuring the Deep Web (3)
  • Unstructured: data objects as unstructured media
    (text, images, audio, video)
  • e.g. www.cnn.com
  • Structured: data objects as structured
    relational records with attribute-value pairs

10
Deep Resources
  • Dynamic Web Pages: returned in response to a
    submitted query or accessed only through a form
  • Unlinked Content: pages without any backlinks
  • Private Web: sites requiring registration and
    login (password-protected resources)
  • Limited Access Web: sites with CAPTCHAs or
    no-cache Pragma HTTP headers
  • Scripted Pages: pages produced by JavaScript,
    Flash, AJAX, etc.
  • Non-HTML Content: multimedia files, e.g. images
    or videos

11
Approaches to Crawling the Deep Web
12
Timeline: How it all started!
  • 2001: Raghavan et al. -> Hidden Web Exposer, a
    domain-specific, human-assisted crawler
  • 2002: StumbleUpon used human crawlers; human
    crawlers can find relevant links that
    algorithmic crawlers miss.
  • 2003: Bergman introduced LexiBot, used for
    quantifying the Deep Web
  • 2004: Yahoo! Content Acquisition Program, paid
    inclusion for webmasters

13
Timeline (contd.)
  • 2005: Yahoo! Subscriptions. Yahoo started
    searching subscription-only sites, e.g. WSJ
  • 2005: Ntoulas et al. -> Hidden Web Crawler,
    which automatically generated meaningful queries
    to issue against search forms
  • 2005: Google Sitemaps, which allows webmasters
    to inform search engines about URLs on their
    websites that are available for crawling
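
To make the sitemap idea concrete, here is a minimal sketch of how a webmaster might generate such a file. It assumes the public sitemap protocol's `urlset`/`url`/`loc` elements; the example.com URLs and the `build_sitemap` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the public sitemap protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls):
    """Return a sitemap XML string listing the given URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for u in urls:
        url_el = ET.SubElement(urlset, "url")
        ET.SubElement(url_el, "loc").text = u
    return ET.tostring(urlset, encoding="unicode")

# Hypothetical form-generated pages that no hyperlink points to.
sitemap_xml = build_sitemap([
    "https://example.com/catalog?item=1",
    "https://example.com/catalog?item=2",
])
print(sitemap_xml)
```

Listing such URLs explicitly lets a crawler reach pages it would not find by following links alone.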

14
Present Deep Web Search Scenario
  • Federated Search
  • Google's surfacing

15
Federated Search
  • Federated search is the process of performing a
    real-time search of multiple diverse and
    distributed sources from a single search page,
    with the federated search engine acting as
    intermediary.
  • Why federated?
  • Content from different sources is combined,
    instead of searching the sources one at a time.

16
Federated Search Properties (1)
  • Real Time: federated search occurs live, so
    results are current.
  • Diverse and Distributed Sources: multiple
    sources in different locations on the web are
    searched. Sources are diverse in nature,
    containing text, documents, PDFs, PPTs, etc.

17
Federated Search Properties (2)
  • Single Search Page: federated search engines
    provide a single point of searching.
  • Federated Search Engine Acts as Intermediary:
    the user does not communicate directly with the
    content sources when performing searches. The
    search engine does it on the user's behalf.

18
Federated Search Method
  • Works by filling out forms on web pages.
  • The search engine is programmed with the
    knowledge of each form that it has to search.
  • It knows how to fill out the form, press the
    submit button and retrieve the results.

19
Web Form example
A web form that a normal search engine cannot
crawl. Searching it involves filling in the
textbox, clicking search and retrieving the
results.
20
Federated search example
WorldWideScience.org: searches science content
from all over the world, from government
agencies and research and academic organizations.
21
Fed Search In Action
Incremental search: federated search engines do
not wait for results from all sources. To improve
response time, results are displayed in chunks
while the search continues in the background.
When a new result set is available, the user is
prompted.
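
A minimal sketch of incremental search, assuming each source can be queried independently. The source names, delays and the `search_source` stand-in are hypothetical; a real engine would issue network requests instead of sleeping.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def search_source(name, delay, query):
    """Stand-in for querying one remote source; delay simulates latency."""
    time.sleep(delay)
    return name, [f"{name}: result for {query!r}"]

# Hypothetical sources with different response times (seconds).
SOURCES = {"fast": 0.01, "medium": 0.05, "slow": 0.1}

def incremental_search(query):
    """Yield (source, results) chunks as soon as each source responds."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        futures = [
            pool.submit(search_source, name, delay, query)
            for name, delay in SOURCES.items()
        ]
        # as_completed hands back futures in completion order,
        # so faster sources are shown without waiting for slower ones.
        for future in as_completed(futures):
            yield future.result()

arrival_order = []
for source, results in incremental_search("deep web"):
    arrival_order.append(source)
    print(source, results)
```

Each chunk can be displayed as it arrives, so the overall response time is governed by the fastest source and not by the slowest one.
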
22
Metasearch vs Fed Search
  • Metasearch is similar to federated search.
  • Here the search engine searches other search
    engines in real time.
  • Even though the underlying search engines are
    queried in real time, they may not have the most
    current information, since they themselves rely
    on crawlers.
  • It is NOT a Deep Web search!
  • People often confuse metasearch with federated
    search.

23
Metasearch example
24
Federated Search (Advantages)
  • Efficiency, Time Savings: instead of the user
    querying many search engines one at a time, the
    federated search engine does it on the user's
    behalf.
  • Quality of Results: searches only authoritative
    sources, since it has been programmed to do so.
  • Most Current Content: searches in real time.

25
Federated Search (Challenges)
  • Aggregation: the process of combining search
    results from different sources in some helpful
    way, e.g. sorting by date, title or author
  • Ranking: displaying the results most relevant to
    the search first
  • De-duplication: a federated search engine may
    retrieve the same result from multiple sources
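
The aggregation and de-duplication challenges can be illustrated with a small sketch. The records, the URL-based duplicate key and the sort-by-date ordering are assumptions made for the example; real engines need more careful matching and a proper relevance ranking.

```python
from datetime import date

# Hypothetical result records returned by two sources.
source_a = [
    {"title": "Paper A", "url": "http://example.org/a", "date": date(2008, 8, 1)},
    {"title": "Paper B", "url": "http://example.org/b", "date": date(2007, 5, 1)},
]
source_b = [
    {"title": "Paper B", "url": "http://EXAMPLE.org/b/", "date": date(2007, 5, 1)},
    {"title": "Paper C", "url": "http://example.org/c", "date": date(2001, 8, 1)},
]

def normalize(url):
    """Crude duplicate key: lower-case the URL and drop a trailing slash."""
    return url.lower().rstrip("/")

def aggregate(*result_lists):
    """Merge result lists, drop duplicates, and sort newest first."""
    seen, merged = set(), []
    for results in result_lists:
        for record in results:
            key = normalize(record["url"])
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return sorted(merged, key=lambda r: r["date"], reverse=True)

merged = aggregate(source_a, source_b)
for record in merged:
    print(record["date"], record["title"])
```

Here "Paper B" is returned by both sources but appears only once in the merged list.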

26
Google's Reasons to Move Away from Fed Search
  • Federated search works quite well when it is
    restricted to one domain.
  • For general search involving multiple domains,
    it is not as effective:
  • The number of domains is extremely large.
  • Defining the boundary of a domain is difficult.
  • Mapping a query to a domain is difficult.
  • It depends on the latency of Deep Web sources.

27
Case Study: Google's Crawling
28
Case Study: Google's Crawling (1)
  • Two approaches for Deep Web Crawling
  • Virtual Integration
  • Surfacing

29
Case Study: Google's Crawling (2)
  • Virtual Integration (domain specific)
  • A mediator form is created for each domain.
  • Semantic mappings are built between the
    individual data sources and the mediator form.
  • Queries are performed in real time.
  • Drawbacks
  • Cost of building the mediator forms and
    mappings.
  • Identifying the relevant queries for a
    particular domain.
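
A minimal sketch of the mediator-form idea, assuming a "used cars" domain. The source names, field names and the `translate` helper are all hypothetical; the point is only that one mediator query is rewritten per source through a hand-built mapping.

```python
# Hypothetical semantic mappings for one domain:
# mediator field name -> field name used by each source's own form.
SOURCE_MAPPINGS = {
    "cars-site-a": {"make": "brand", "model": "model_name", "max_price": "price_to"},
    "cars-site-b": {"make": "mk", "model": "mdl"},  # this source has no price filter
}

def translate(mediator_query, mapping):
    """Rewrite a mediator-form query into one source's own field names."""
    return {
        mapping[field]: value
        for field, value in mediator_query.items()
        if field in mapping  # silently skip fields the source cannot express
    }

# A query expressed against the mediator form.
query = {"make": "Honda", "model": "Civic", "max_price": 5000}
per_source = {name: translate(query, m) for name, m in SOURCE_MAPPINGS.items()}
print(per_source)
```

Every new source needs its own entry in `SOURCE_MAPPINGS`, and every new domain needs a new mediator form, which is the building cost listed among the drawbacks.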

30
Case Study: Google's Crawling (3)
  • Surfacing
  • Precomputes the most relevant form values for
    interesting HTML forms
  • The resulting URLs are generated offline and
    indexed
  • Helps retain the existing infrastructure while
    including Deep Web content
  • Covers as many web pages as possible while
    bounding the total number of web form
    submissions
  • GET vs POST method
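
Surfacing can be sketched as offline URL generation for a GET form with a cap on the number of submissions. The form, its candidate values and the `surface_urls` helper are hypothetical, and taking the first few combinations is a simplification: the slide says the real system chooses values to maximize coverage.

```python
from itertools import islice, product
from urllib.parse import urlencode

# Hypothetical GET form: action URL plus candidate values per input.
ACTION = "https://example.com/cars/search"
CANDIDATE_VALUES = {
    "make": ["honda", "toyota", "ford"],
    "state": ["ca", "ny"],
}

def surface_urls(action, candidates, max_submissions):
    """Yield at most max_submissions URLs, one per combination of values."""
    names = sorted(candidates)
    combos = product(*(candidates[name] for name in names))
    # islice enforces the bound on the total number of form submissions.
    for values in islice(combos, max_submissions):
        yield f"{action}?{urlencode(dict(zip(names, values)))}"

urls = list(surface_urls(ACTION, CANDIDATE_VALUES, max_submissions=4))
for url in urls:
    print(url)
```

This also suggests why the GET vs POST distinction matters: a GET submission encodes the form values in the URL, so each result page has its own address that can be indexed like any other page, whereas POST submissions all share the same action URL.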

31
Case Study: Google's Crawling (4)
  • Challenges
  • Which form inputs to fill
  • Appropriate values for those inputs
  • Google's approach
  • Selecting wild cards for form submission
  • Some fields are mandatory
  • Query templates
  • Testing with all possible values in select menus
  • Predicting form values from data types
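
One way to read the "query template" and "select menu" bullets is sketched below, assuming a template is a subset of inputs that receive concrete values while the remaining inputs stay at their wildcard option. The form, the empty-string wildcard and both helper functions are hypothetical, not Google's actual implementation.

```python
from itertools import combinations, product

# Hypothetical form: select menus with their options.
SELECT_INPUTS = {
    "make": ["honda", "toyota"],
    "state": ["ca", "ny"],
    "color": ["red", "blue"],
}
WILDCARD = ""  # assumed to stand for the "Any" option of a select menu

def query_templates(inputs, max_binding):
    """Enumerate templates: subsets of inputs that get concrete values."""
    names = sorted(inputs)
    for size in range(1, max_binding + 1):
        yield from combinations(names, size)

def submissions(inputs, template):
    """All form fillings for one template; unbound inputs stay wildcard."""
    for values in product(*(inputs[name] for name in template)):
        filling = {name: WILDCARD for name in inputs}
        filling.update(zip(template, values))
        yield filling

templates = list(query_templates(SELECT_INPUTS, max_binding=2))
fillings = list(submissions(SELECT_INPUTS, ("make", "state")))
print(len(templates), "templates;", len(fillings), "fillings for (make, state)")
```

Limiting the template size keeps the number of submissions manageable, since binding every input at once multiplies the option counts of all menus.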

32
Subconscious Mind and Deep Web
  • Inspiration behind exploration of deep web
  • Analogy
  • Iceberg example
  • Real life example

33
References (1)
  1. Wikipedia, http://en.wikipedia.org/wiki/Deep_web
  2. Bergman, Michael K., "The Deep Web: Surfacing
    Hidden Value". The Journal of Electronic
    Publishing, August 2001
  3. Alex Wright, "Exploring a 'Deep Web' That Google
    Can't Grasp". The New York Times, Sept 23, 2009.
    http://www.nytimes.com/2009/02/23/technology/internet/23search.html?themcth
  4. Jesse Alpert and Nissan Hajaj, "We knew the web
    was big", 2008.
    http://googleblog.blogspot.com/2008/07/we-knew-web-was-big.html
  5. He, Bin; Patel, Mitesh; Zhang, Zhen; Chang, Kevin
    Chen-Chuan, "Accessing the Deep Web: A Survey".
    Communications of the ACM (CACM), May 2007

34
References (2)
  1. Jayant Madhavan, David Ko, Lucja Kot, Vignesh
    Ganapathy, Alex Rasmussen, Alon Halevy, "Google's
    Deep-Web Crawl", 2008
  2. Maureen Flynn-Burhoe, "Timeline of events related
    to the Deep Web", 2008.
    http://papergirls.wordpress.com/2008/10/07/timeline-deep-web/
  3. Darcy Pedersen, "Federated Search Finds Content
    that Google Can't Reach: Part I of III", 2009.
    http://deepwebtechblog.com/federated-search-finds-content-that-google-cant-reach-part-i-of-iii/
  4. Darcy Pedersen, "A Federated Search Primer: Part
    II of III", 2009.
    http://deepwebtechblog.com/a-federated-search-primer-part-ii-of-iii/
  5. Darcy Pedersen, "A Federated Search Primer: Part
    III of III", 2009.
    http://deepwebtechblog.com/a-federated-search-primer-part-iii-of-iii/

35
THANK YOU