1
CS621 Seminar-2008
  • DEEP WEB
  • Shubhangi Agrawal (08305044)
  • Jayalekshmy S. Nair (08305056)

2
Introduction
  • Deep Web: the part of the Web that is not part of the surface Web.
  • Surface Web: the part of the World Wide Web that is crawled and indexed by conventional search engines.
  • The Deep Web consists of about 91,000 terabytes of data, whereas the surface Web contains only about 167 terabytes.

3
Contextual View Of The Deep Web
4
What Constitutes Deep Web
  • Dynamic content: pages that are generated dynamically in response to a submitted query.
  • Unlinked content: pages that are not linked to from other pages.
  • Private Web: sites that require registration and login.

5
What Constitutes Deep Web
  • Limited-access content: sites that restrict access to their pages in a technical way.
  • Scripted content: pages that are only reachable through links produced by JavaScript.
  • Non-HTML/text content: textual content encoded in multimedia (image or video) files or in file formats not handled by search engines.

6
Why Is The Information Not Accessible
  • Conventional search engines use programs called spiders or crawlers.
  • When a crawler reaches a page, it captures the text on that page, indexes it, and follows any static hyperlinks on that page to further pages.
  • It cannot crawl and index information stored in databases, because the pages generated from them do not have static URLs (see the small illustration below).
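A small illustration of this point, assuming Python; the page content is a made-up example, not taken from the presentation:

    import re

    page = '<a href="/papers/2008.html">Papers</a> <form action="/search"><input name="q"></form>'

    # A crawler can discover and follow the static hyperlink embedded in the page text ...
    static_links = re.findall(r'href="([^"]+)"', page)
    print(static_links)        # prints ['/papers/2008.html']

    # ... but the database records behind the query form have no static URL to follow;
    # a page such as /search?q=deep+web exists only after someone submits the form.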

7
Why Use The Deep Web
  • Very vast: about 550 times the size of the surface Web (91,000 TB vs. 167 TB)
  • Quality of content / higher level of authority
  • Comprehensiveness
  • Focused content
  • Timeliness
  • Material that isn't available elsewhere on the Web

8
How To Access Contents Of Deep Web
  • Manually search all the databases
  • Human Crawlers (Web Harvesting)
  • Federated Search

9
Web Harvesting
Web harvesting is an implementation of a Web
crawler that uses human expertise or machine guidance
to direct the crawler to URLs which compose a
specialized collection or set of knowledge. Web
harvesting can be thought of as focused or
directed Web crawling.
10
Process
  • Identify and specify, as input to a computer program, a list of URLs that defines a specialized collection or set of knowledge.
  • The program then begins to download the pages in this list of URLs.
  • A crawl depth can be defined, and crawling need not be recursive.
  • The downloaded content is then indexed by the search engine application and offered to information customers as a searchable Web application (a minimal sketch of this process follows below).
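A minimal sketch of such a directed crawler, assuming Python and only its standard library; the seed URLs, depth limit, and the in-memory index are illustrative placeholders, not part of the original system:

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkExtractor(HTMLParser):
        """Collects href targets of <a> tags (the static hyperlinks a crawler can follow)."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def harvest(seed_urls, max_depth=1):
        """Download a human-chosen list of seed URLs, optionally following links up to max_depth."""
        index = {}                      # URL -> page text (stands in for the search-engine index)
        frontier = [(url, 0) for url in seed_urls]
        seen = set()
        while frontier:
            url, depth = frontier.pop()
            if url in seen or depth > max_depth:
                continue
            seen.add(url)
            try:
                html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except OSError:
                continue                # slow or unreachable sites are simply skipped
            index[url] = html
            if depth < max_depth:       # crawl depth is bounded; crawling need not be recursive
                parser = LinkExtractor()
                parser.feed(html)
                frontier.extend((urljoin(url, link), depth + 1) for link in parser.links)
        return index

    # Usage (hypothetical seed list): index = harvest(["https://example.org/collection/"], max_depth=1)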

11
Limitations
  • Amount of human intervention needed is high.
  • Some sites are very slow, particularly during
    busy periods, so getting all the information
    needed within a limited time window may be
    impossible.

12
Federated Search
  • Simultaneous search of multiple online databases
  • User enters the query in a single interface
  • Query is sent to different databases associated
    with the search engine.
  • Results are presented in a manner suitable to the
    user

13
Process
  • Transforming a query and broadcasting it to a
    group of databases with the appropriate syntax
  • Merging the results collected from the databases
  • Presenting them in a unified format with minimal
    duplication
  • Providing a means, applied either automatically or by the portal user, to sort the merged result set (a sketch of these steps follows below).
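A minimal sketch of the broadcast-and-merge steps, assuming Python and its standard library; the per-database connector functions and the relevance score field are illustrative assumptions, not part of any real portal:

    from concurrent.futures import ThreadPoolExecutor

    def federated_search(query, databases):
        """Broadcast `query` to every database connector, then merge, de-duplicate, and sort."""
        # Each connector is a callable that rewrites the query into its own syntax
        # and returns a list of {"url": ..., "title": ..., "score": ...} dicts.
        with ThreadPoolExecutor() as pool:
            result_lists = list(pool.map(lambda search: search(query), databases))

        merged = {}
        for results in result_lists:
            for hit in results:
                merged.setdefault(hit["url"], hit)      # minimal duplication: keep one copy per URL

        # Present in a unified format, sorted by the (assumed) relevance score.
        return sorted(merged.values(), key=lambda hit: hit["score"], reverse=True)

    # Usage with two hypothetical connectors:
    # results = federated_search("ocean acidification", [search_db_one, search_db_two])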

14
Federated Search (contd.)
  • Advantage: results are as current as the information sources, since the sources are searched in real time.
  • E.g., WorldWideScience, which contains 40 information sources, several of them federated search portals themselves.

15
Limitations
  • Scalability
  • The vast amount of information returned can be a problem.
  • Not all databases can be covered.
  • Either the entire database is searched, or user intervention is required.
  • Results depend on the user supplying the correct keywords.

16
Automatic Information Discovery From The Invisible Web
A system that maintains information about the specialized search engines in the invisible Web. When a query arrives, the system not only finds the most appropriate specialized engines, but also redirects the query automatically, so that the user can directly receive the appropriate query results.
Characteristics:
  • Database of specialized search engines
  • Automatic search engine selection
  • Data mining for better query specification and search

17
System Architecture
18
System Overview
1. Populate the search engine database
  • Crawlers identify candidate search engines by looking for HTML form tags (a sketch of this step appears below, after step 2).
  • Along with the URL, an engine description is also stored in the database.

2. Query pre-processing
  • The query keywords are sent to some general search engines, and the top results are retrieved.
  • Based on these results, words and phrases that often appear together with the search keywords are identified.
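A minimal sketch of how a crawler might record a search interface from a page's form tags, assuming Python's standard-library HTML parser; the EngineRecord fields are illustrative assumptions, not the paper's actual schema:

    from dataclasses import dataclass, field
    from html.parser import HTMLParser

    @dataclass
    class EngineRecord:
        """One entry in the search-engine database: where to submit queries and with which fields."""
        action_url: str = ""
        method: str = "get"
        text_fields: list = field(default_factory=list)
        description: str = ""           # e.g. page title / surrounding text, used later for matching

    class FormFinder(HTMLParser):
        """Detects <form> tags and the free-text <input> fields inside them."""
        def __init__(self):
            super().__init__()
            self.engines = []
            self._current = None
        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            if tag == "form":
                self._current = EngineRecord(action_url=a.get("action") or "",
                                             method=(a.get("method") or "get").lower())
            elif tag == "input" and self._current is not None:
                if (a.get("type") or "text") == "text" and a.get("name"):
                    self._current.text_fields.append(a["name"])
        def handle_endtag(self, tag):
            if tag == "form" and self._current is not None:
                if self._current.text_fields:        # keep only forms that accept free text
                    self.engines.append(self._current)
                self._current = None

    # Usage: finder = FormFinder(); finder.feed(html); store finder.engines in the database.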

19
System Overview
3. Engine selection
  • Each keyword/phrase generated in the pre-processing step is matched against the search-engine descriptions stored in the database.

4. Query execution and result post-processing
  • After the search engines are selected, the system automatically sends the query to all of them and waits for the results to return.
  • Based on the information stored in the database, the system can automatically generate the query string and send the appropriate query to each website (a sketch of steps 3 and 4 follows below).
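A minimal sketch of selecting engines by keyword overlap with the stored descriptions and then building each engine's query URL, reusing the illustrative EngineRecord layout from the previous sketch; the scoring rule is an assumption, not the paper's method:

    from urllib.parse import urlencode

    def select_engines(phrases, engines, top_k=3):
        """Score each stored engine by how many query phrases appear in its description."""
        def score(engine):
            text = engine.description.lower()
            return sum(1 for phrase in phrases if phrase.lower() in text)
        ranked = sorted(engines, key=score, reverse=True)
        return [engine for engine in ranked if score(engine) > 0][:top_k]

    def build_query_url(engine, query):
        """Generate the query string for a GET-based engine from its stored form fields."""
        if not engine.text_fields:
            return None
        params = {engine.text_fields[0]: query}        # fill the first free-text field with the query
        return engine.action_url + "?" + urlencode(params)

    # Usage: for engine in select_engines(phrases, finder.engines):
    #            url = build_query_url(engine, "deep web crawling")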

20
Conclusion
  • The Deep Web constitutes a large repository of information which is getting deeper and bigger all the time. There are various possible ways in which the information in it can be accessed. There has been continuous improvement in this field, but more efficient methods still need to be implemented commercially.

21
References
  • Bergman, M. K. (2001). The deep web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1). Retrieved from http://www.press.umich.edu/jep/07-01/bergman.html
  • King-Ip Lin, Hui Chen, "Automatic Information Discovery from the 'Invisible Web'," ITCC, p. 332, International Conference on Information Technology: Coding and Computing, 2002.
  • www.wikipedia.com
  • http://worldwidescience.org/
  • http://science.gov/

22
Queries ???