1
CS621 Seminar-2008
  • DEEP WEB
  • Shubhangi Agrawal (08305044)
  • Jayalekshmy S. Nair (08305056)

2
Introduction
  • Deep Web: the part of the Web that is not part of the surface Web.
  • Surface Web: the part of the World Wide Web that is crawled and indexed by conventional search engines.
  • The Deep Web consists of about 91,000 terabytes of data, whereas the surface Web contains only about 167 terabytes.

3
Contextual View Of The Deep Web
4
What Constitutes Deep Web
  • Dynamic content: pages that are generated dynamically in response to a submitted query.
  • Unlinked content: pages that are not linked to from other pages.
  • Private Web: sites that require registration and login.

5
What Constitutes Deep Web
  • Limited-access content: sites that restrict access to their pages in a technical way.
  • Scripted content: pages that are only reachable through links produced by JavaScript.
  • Non-HTML/text content: textual content encoded in multimedia (image or video) files or in file formats not handled by search engines.

6
Why Is The Information Not Accessible
  • Conventional search engines use programs called spiders or crawlers.
  • When a crawler reaches a page, it captures the text on that page, indexes it, and follows any static hyperlinks on that page to further pages.
  • It cannot crawl and index information stored in databases, because the pages generated from them do not have static URLs (see the small illustration below).
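A small illustration of this point, assuming Python; the page content is a made-up example, not taken from the presentation:

    import re

    page = '<a href="/papers/2008.html">Papers</a> <form action="/search"><input name="q"></form>'

    # A crawler can discover and follow the static hyperlink embedded in the page text ...
    static_links = re.findall(r'href="([^"]+)"', page)
    print(static_links)        # prints ['/papers/2008.html']

    # ... but the database records behind the query form have no static URL to follow;
    # a page such as /search?q=deep+web exists only after someone submits the form.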

7
Why Use The Deep Web
  • Very vast: about 550 times the size of the surface Web (91,000 TB vs. 167 TB)
  • Quality of content / higher level of authority
  • Comprehensiveness
  • Focused content
  • Timeliness
  • Material that isn't available elsewhere on the Web

8
How To Access Contents Of Deep Web
  • Manually search all the databases
  • Human Crawlers (Web Harvesting)
  • Federated Search

9
Web Harvesting
Web harvesting is an implementation of a Web
crawler that uses human expertise or machine guidance
to direct the crawler to URLs which compose a
specialized collection or set of knowledge. Web
harvesting can be thought of as focused or
directed Web crawling.
10
Process
  • Identify and specify, as input to a computer program, a list of URLs that defines a specialized collection or set of knowledge.
  • The program then begins to download the pages in this list of URLs.
  • A crawl depth can be defined, and crawling need not be recursive.
  • The downloaded content is then indexed by the search engine application and offered to information customers as a searchable Web application (a minimal sketch of this process follows below).
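A minimal sketch of such a directed crawler, assuming Python and only its standard library; the seed URLs, depth limit, and the in-memory index are illustrative placeholders, not part of the original system:

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects href targets of <a> tags (the static hyperlinks a crawler can follow)."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def harvest(seed_urls, max_depth=1):
        """Download a human-chosen list of seed URLs, optionally following links up to max_depth."""
        index = {}                      # URL -> page text (stands in for the search-engine index)
        frontier = [(url, 0) for url in seed_urls]
        seen = set()
        while frontier:
            url, depth = frontier.pop()
            if url in seen or depth > max_depth:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                # slow or unreachable sites are simply skipped
            index[url] = html
            if depth < max_depth:       # crawl depth is bounded; crawling need not be recursive
                parser = LinkExtractor()
                parser.feed(html)
                frontier.extend((urljoin(url, link), depth + 1) for link in parser.links)
        return index

    # Usage (hypothetical seed list): index = harvest(["https://example.org/collection/"], max_depth=1)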

11
Limitations
  • Amount of human intervention needed is high.
  • Some sites are very slow, particularly during
    busy periods, so getting all the information
    needed within a limited time window may be
    impossible.

12
Federated Search
  • Simultaneous search of multiple online databases
  • User enters the query in a single interface
  • Query is sent to different databases associated
    with the search engine.
  • Results are presented in a manner suitable to the
    user

13
Process
  • Transforming a query and broadcasting it to a
    group of databases with the appropriate syntax
  • Merging the results collected from the databases
  • Presenting them in a unified format with minimal
    duplication
  • Providing a means, applied either automatically or by the portal user, to sort the merged result set (a sketch of these steps follows below).
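A minimal sketch of the broadcast-and-merge steps, assuming Python and its standard library; the per-database connector functions and the relevance score field are illustrative assumptions, not part of any real portal:

    from concurrent.futures import ThreadPoolExecutor

    def federated_search(query, databases):
        """Broadcast `query` to every database connector, then merge, de-duplicate, and sort."""
        # Each connector is a callable that rewrites the query into its own syntax
        # and returns a list of {"url": ..., "title": ..., "score": ...} dicts.
        with ThreadPoolExecutor() as pool:
            result_lists = list(pool.map(lambda search: search(query), databases))

        merged = {}
        for results in result_lists:
            for hit in results:
                merged.setdefault(hit["url"], hit)      # minimal duplication: keep one copy per URL

        # Present in a unified format, sorted by the (assumed) relevance score.
        return sorted(merged.values(), key=lambda hit: hit["score"], reverse=True)

    # Usage with two hypothetical connectors:
    # results = federated_search("ocean acidification", [search_db_one, search_db_two])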

14
Federated Search (contd.)
  • Advantage: results are as current as the information sources, since the sources are searched in real time.
  • E.g., WorldWideScience, which contains 40 information sources, several of them federated search portals themselves.

15
Limitations
  • Scalability
  • The vast amount of information returned can be a problem.
  • Not all databases can be covered.
  • Either the entire database is searched, or user intervention is required.
  • Results depend on the user supplying the correct keywords.

16
Automatic Information Discovery From The Invisible Web
A system that maintains information about the specialized search engines in the invisible Web. When a query arrives, the system not only finds the most appropriate specialized engines, but also redirects the query automatically, so that the user can directly receive the appropriate query results.
Characteristics:
  • Database of specialized search engines
  • Automatic search engine selection
  • Data mining for better query specification and search

17
System Architecture
18
System Overview
1. Populate the search engine database
  • Crawlers identify candidate search engines by looking for HTML form tags (a sketch of this step appears below, after step 2).
  • Along with the URL, an engine description is also stored in the database.

2. Query pre-processing
  • The query keywords are sent to some general search engines, and the top results are retrieved.
  • Based on these results, words and phrases that often appear together with the search keywords are identified.
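A minimal sketch of how a crawler might record a search interface from a page's form tags, assuming Python's standard-library HTML parser; the EngineRecord fields are illustrative assumptions, not the paper's actual schema:

    from dataclasses import dataclass, field
    from html.parser import HTMLParser

    @dataclass
    class EngineRecord:
        """One entry in the search-engine database: where to submit queries and with which fields."""
        action_url: str = ""
        method: str = "get"
        text_fields: list = field(default_factory=list)
        description: str = ""           # e.g. page title / surrounding text, used later for matching

    class FormFinder(HTMLParser):
        """Detects <form> tags and the free-text <input> fields inside them."""
        def __init__(self):
            super().__init__()
            self.engines = []
            self._current = None
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "form":
                self._current = EngineRecord(action_url=a.get("action") or "",
                                             method=(a.get("method") or "get").lower())
            elif tag == "input" and self._current is not None:
                if (a.get("type") or "text") == "text" and a.get("name"):
                    self._current.text_fields.append(a["name"])
        def handle_endtag(self, tag):
            if tag == "form" and self._current is not None:
                if self._current.text_fields:        # keep only forms that accept free text
                    self.engines.append(self._current)
                self._current = None

    # Usage: finder = FormFinder(); finder.feed(html); store finder.engines in the database.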

19
System Overview
3. Engine selection
  • Each keyword/phrase generated in the pre-processing step is matched against the search-engine descriptions stored in the database.

4. Query execution and result post-processing
  • After the search engines are selected, the system automatically sends the query to all of them and waits for the results to return.
  • Based on the information stored in the database, the system can automatically generate the query string and send the appropriate query to each website (a sketch of steps 3 and 4 follows below).
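A minimal sketch of selecting engines by keyword overlap with the stored descriptions and then building each engine's query URL, reusing the illustrative EngineRecord layout from the previous sketch; the scoring rule is an assumption, not the paper's method:

    from urllib.parse import urlencode

    def select_engines(phrases, engines, top_k=3):
        """Score each stored engine by how many query phrases appear in its description."""
        def score(engine):
            text = engine.description.lower()
            return sum(1 for phrase in phrases if phrase.lower() in text)
        ranked = sorted(engines, key=score, reverse=True)
        return [engine for engine in ranked if score(engine) > 0][:top_k]

    def build_query_url(engine, query):
        """Generate the query string for a GET-based engine from its stored form fields."""
        if not engine.text_fields:
            return None
        params = {engine.text_fields[0]: query}        # fill the first free-text field with the query
        return engine.action_url + "?" + urlencode(params)

    # Usage: for engine in select_engines(phrases, finder.engines):
    #            url = build_query_url(engine, "deep web crawling")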

20
Conclusion
  • The Deep Web constitutes a large repository of information which is getting deeper and bigger all the time. There are various possible ways in which the information in it can be accessed. There has been continuous improvement in this field, but more efficient methods still need to be implemented commercially.

21
References
  • Bergman, M. K. (2001). The deep web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1). Retrieved from http://www.press.umich.edu/jep/07-01/bergman.html
  • King-Ip Lin, Hui Chen, "Automatic Information Discovery from the 'Invisible Web'," ITCC, p. 332, International Conference on Information Technology: Coding and Computing, 2002.
  • www.wikipedia.com
  • http://worldwidescience.org/
  • http://science.gov/

22
Queries ???