Title: CS621 : Seminar 2008
1. CS621 Seminar 2008
- DEEP WEB
- Shubhangi Agrawal (08305044)
- Jayalekshmy S. Nair (08305056)
2. Introduction
- Deep Web: the part of the Web that does not come under the surface Web.
- Surface Web: the part of the World Wide Web that is crawled and indexed by conventional search engines.
- The Deep Web consists of 91,000 terabytes of data, whereas the surface Web contains only 167 terabytes.
3. Contextual View Of The Deep Web
4. What Constitutes The Deep Web
- Dynamic content: dynamic pages that are returned in response to a submitted query.
- Unlinked content: pages that are not linked to from other pages.
- Private Web: sites that require registration and login.
5. What Constitutes The Deep Web contd...
- Limited access content: sites that limit access to their pages in a technical way.
- Scripted content: pages that are only accessible through links produced by JavaScript.
- Non-HTML/text content: textual content encoded in multimedia (image or video) files or in specific file formats not handled by search engines.
6. Why Is The Information Not Accessible
- Conventional search engines use programs called spiders or crawlers.
- When a search engine reaches a page, it captures the text on that page, indexes it, and crawls on to any pages reachable through static hyperlinks (a sketch follows below).
- It cannot crawl and index information in databases, because database records do not have static URLs.
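To make the point above concrete, here is a minimal sketch (illustrative only, not part of the seminar material) of a conventional crawler: it indexes page text and follows static hyperlinks it finds, so pages that exist only as responses to submitted queries are never reached.

```python
# Minimal sketch of a conventional (surface-Web) crawler.
# It indexes page text and follows static hyperlinks only, so database
# content behind query forms is never reached.
import re
import urllib.request
from collections import deque

HREF_RE = re.compile(r'href="(http[^"]+)"')

def crawl(seed_urls, max_pages=50):
    index = {}                      # url -> page text
    queue = deque(seed_urls)
    seen = set(seed_urls)
    while queue and len(index) < max_pages:
        url = queue.popleft()
        try:
            html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue                # unreachable page, skip it
        index[url] = html           # "capture the text ... index it"
        for link in HREF_RE.findall(html):   # follow static hyperlinks only
            if link not in seen:
                seen.add(link)
                queue.append(link)
    # Pages generated only in response to a submitted form/query have no
    # static URL in any crawled page, so they never enter the queue.
    return index
```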
7. Why Use The Deep Web
- Very vast: 550 times the size of the surface Web
- Quality of content / higher level of authority
- Comprehensiveness
- Focused
- Timeliness
- The material isn't available elsewhere on the Web
8. How To Access The Contents Of The Deep Web
- Manually search all the databases
- Human Crawlers (Web Harvesting)
- Federated Search
9. Web Harvesting
Web harvesting is an implementation of a Web crawler that uses human expertise or machine guidance to direct the crawler to URLs which compose a specialized collection or set of knowledge. Web harvesting can be thought of as focused or directed Web crawling.
10. Process
- Identify and specify, as input to a computer program, a list of URLs that defines a specialized collection or set of knowledge.
- The computer program then begins to download this list of URLs.
- Crawl depth can be defined, and crawling need not be recursive.
- The downloaded content is then indexed by the search engine application and offered to information customers as a searchable Web application (see the sketch below).
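A minimal sketch of this harvesting process, assuming a hand-curated seed list and a configurable crawl depth; the function names and the link-extraction regex are illustrative, not a specific harvester's API.

```python
# Sketch of directed (focused) Web harvesting: a human-supplied URL list
# defines the collection, and crawling stops at a configurable depth.
import re
import urllib.request

HREF_RE = re.compile(r'href="(http[^"]+)"')

def harvest(seed_urls, crawl_depth=1):
    """Download the seed list, then optionally follow links up to crawl_depth."""
    collection = {}                               # url -> raw content for indexing
    frontier = [(url, 0) for url in seed_urls]    # human-specified input list
    seen = set(seed_urls)
    while frontier:
        url, depth = frontier.pop()
        try:
            page = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except Exception:
            continue
        collection[url] = page
        if depth < crawl_depth:                   # crawling need not be recursive
            for link in HREF_RE.findall(page):
                if link not in seen:
                    seen.add(link)
                    frontier.append((link, depth + 1))
    return collection                             # handed to the indexer afterwards

# Example: harvest a two-URL "specialized collection" without following links.
# docs = harvest(["http://example.org/a", "http://example.org/b"], crawl_depth=0)
```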
11. Limitations
- Amount of human intervention needed is high.
- Some sites are very slow, particularly during
busy periods, so getting all the information
needed within a limited time window may be
impossible.
12. Federated Search
- Simultaneous search of multiple online databases
- User enters the query in a single interface
- Query is sent to the different databases associated with the search engine.
- Results are presented in a manner suitable to the user.
13. Process
- Transform the query and broadcast it to a group of databases with the appropriate syntax.
- Merge the results collected from the databases.
- Present them in a unified format with minimal duplication.
- Provide a means, performed either automatically or by the portal user, to sort the merged result set (see the sketch below).
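A minimal sketch of these steps, assuming each database is wrapped by a connector function that already speaks its native query syntax; the connector callables and the result fields (`title`, `url`) are assumptions for illustration.

```python
# Sketch of federated search: broadcast one query to several sources,
# merge the results, drop duplicates, and sort the merged set.
# The `search` callables stand in for per-database connectors (hypothetical).

def federated_search(query, sources, sort_key="title"):
    merged, seen_urls = [], set()
    for name, search in sources.items():          # broadcast with appropriate syntax
        try:
            results = search(query)               # each connector handles its own syntax
        except Exception:
            continue                               # a slow or unavailable source is skipped
        for r in results:
            if r["url"] not in seen_urls:          # minimal duplication
                seen_urls.add(r["url"])
                merged.append({**r, "source": name})
    return sorted(merged, key=lambda r: r.get(sort_key, ""))   # unified, sorted format

# Example with two fake connectors:
# sources = {
#     "db_a": lambda q: [{"title": "Deep Web survey", "url": "http://a/1"}],
#     "db_b": lambda q: [{"title": "Hidden databases", "url": "http://b/1"}],
# }
# for hit in federated_search("deep web", sources):
#     print(hit["source"], hit["title"])
```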
14. Federated Search contd...
- Advantage: results are as current as the information sources, since the sources are searched in real time.
- E.g. WorldWideScience
- Contains 40 information sources, several of which are federated search portals themselves.
15. Limitations
- Scalability
- The vast amount of information coming back can be a problem
- Not all the databases can be covered
- Either the entire database is searched or user intervention is required
- Results depend on the user supplying the correct keywords
16. Automatic Information Discovery From The Invisible Web
A system that maintains information about the specialized search engines in the invisible Web. When a query arrives, the system not only finds the most appropriate specialized engines, but also redirects the query automatically so that the user can directly receive the appropriate query results.
Characteristics
- Database of specialized search engines
- Automatic search engine selection
- Data mining for better query specification and search
17. System Architecture
18. System Overview
1. Populate the search engine database
- Crawlers identify search engines using form tags.
- Along with the URL, an engine description is also stored in the database.
2. Query pre-processing
- Send the query keywords to some general search engines and retrieve the top results.
- Based on the results, find words and phrases that often appear together with the search keywords (a sketch of step 1 is given below).
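A minimal sketch of step 1, under the assumption that a page counts as a specialized search engine when it contains a form with a text input; the detection heuristic and the stored record fields are illustrative, not the exact scheme of the Lin and Chen system.

```python
# Sketch of step 1: detect a query form on a page via its <form> tag and
# store the URL together with a crude engine "description" (the page title
# and the form's text-field names). The heuristic is illustrative only.
from html.parser import HTMLParser

class FormFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.forms = []            # list of (action, [text input names])
        self._current = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self.in_title = True
        elif tag == "form":
            self._current = (a.get("action", ""), [])
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None:
            if a.get("type", "text") == "text":
                self._current[1].append(a.get("name", ""))

    def handle_data(self, data):
        if self.in_title:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

def describe_engine(url, html):
    """Return a database record for the page if it looks like a search engine."""
    finder = FormFinder()
    finder.feed(html)
    if not finder.forms:
        return None                               # no form tag: not a search interface
    action, fields = finder.forms[0]
    return {"url": url, "form_action": action,
            "query_fields": fields, "description": finder.title.strip()}
```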
19. System Overview
3. Engine selection
- Each keyword/phrase generated in the pre-processing step is matched against the search engine descriptions in the database.
4. Query execution and result post-processing
- After the search engines are selected, the system automatically sends the query to all of them and waits for the results to return.
- Based on the information stored in the database, the system can automatically generate the query string and send the appropriate query to the websites (a sketch of steps 3 and 4 is given below).
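A minimal sketch of steps 3 and 4, assuming each stored engine record carries a bag-of-words description and a query-URL template; the overlap score and the `{q}` template format are assumptions made for illustration.

```python
# Sketch of steps 3-4: rank stored engines by keyword overlap with their
# descriptions, then generate a query string for the selected engines.
# The overlap score and the {q} URL template are illustrative assumptions.
import urllib.parse

def select_engines(expanded_keywords, engine_db, top_k=3):
    def score(engine):
        desc = set(engine["description"].lower().split())
        return sum(1 for kw in expanded_keywords if kw.lower() in desc)
    ranked = sorted(engine_db, key=score, reverse=True)
    return [e for e in ranked[:top_k] if score(e) > 0]

def build_query_url(engine, user_query):
    # e.g. a template like "http://example.org/search?q={q}" stored with the engine
    return engine["query_template"].format(q=urllib.parse.quote_plus(user_query))

# Example (hypothetical records):
# db = [{"description": "medical literature abstracts search",
#        "query_template": "http://medsearch.example/find?q={q}"}]
# for engine in select_engines(["medical", "abstracts"], db):
#     print(build_query_url(engine, "deep web access"))
```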
20. Conclusion
- The Deep Web constitutes a large repository of information which is getting deeper and bigger all the time. There are various possible ways in which the information in it can be accessed. There has been continuous improvement in this field, but more efficient methods still need to be implemented commercially.
21. References
- Bergman, M. K. (2001). The deep Web: Surfacing hidden value. The Journal of Electronic Publishing, 7(1). Retrieved from http://www.press.umich.edu/jep/07-01/bergman.html
- King-Ip Lin, Hui Chen. "Automatic Information Discovery from the 'Invisible Web'," International Conference on Information Technology: Coding and Computing (ITCC), pp. 332, 2002.
- www.wikipedia.com
- http://worldwidescience.org/
- http://science.gov/
22. Queries?