Title: SEARCHING FOR TRUTH
1SEARCHING FOR TRUTH
Chapter 5
- Locating Information on the WWW
2Information Retrieval
- Major issues
- Finding information
- How is it organized
- Is it searchable
- Is it available
- Much older content is not digitized
- How to digitize flowing text, ancient scripts, etc.
- Who should have access at what cost
- Intersection of computer science and library
science
3Information Retrieval
- Online access to card catalogs: first achievement
- Digital library databases: current state of the art
- Electronic bookshelves
- Restricted access since most are commercial
- You can access these via Pitt library
- Future trend is to make the entire web an
intelligent information repository
4UW libraries Top 20 Databases with links
5Summary links for UW libraries reference materials
6National Public Radio (NPR) home page www.npr.org
7The NPR home page Programming pull-down menu
What are the top level classifications?
8The NPR programming hierarchy tree
How many levels in the hierarchy?
9Top level, second level and third level
classifications of a collection
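
The two questions above (what the top-level classifications are, and how many levels the hierarchy has) can be answered mechanically once the hierarchy is written down as a tree. The following is a minimal sketch, assuming a nested dictionary; the category labels are invented placeholders, not the actual NPR programming menu.

```python
# Hypothetical three-level classification. The labels are placeholders,
# not the real NPR programming hierarchy.
hierarchy = {
    "News": {
        "Morning shows": {"Show A": {}, "Show B": {}},
        "Evening shows": {"Show C": {}},
    },
    "Music": {
        "Jazz": {"Show D": {}},
    },
}


def depth(tree):
    """Return the number of levels below this node (0 for a leaf)."""
    if not tree:
        return 0
    return 1 + max(depth(child) for child in tree.values())


def labels_at_level(tree, level):
    """Return the sorted labels found at the given level (1 = top level)."""
    if level == 1:
        return sorted(tree)
    found = []
    for child in tree.values():
        found.extend(labels_at_level(child, level - 1))
    return sorted(found)


top_level = labels_at_level(hierarchy, 1)
levels = depth(hierarchy)
```

Here `top_level` holds the top-level classifications and `levels` holds the number of levels in the placeholder tree.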
10Information Retrieval
- How search engines find pages
- Crawl the web, scanning pages for keywords and following links to other pages
- Build huge databases of web resources (docs, images, etc.)
- Google has over 6 billion in its DB
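
The crawl-and-index idea above can be sketched as follows. This is an illustration only: the "web" is a small in-memory dictionary with invented URLs and text, and a real crawler would fetch pages over HTTP and handle far more cases.

```python
from collections import deque

# Toy in-memory "web": URL -> (page text, outgoing links). All URLs and
# text are invented for illustration.
TOY_WEB = {
    "a.example/home": ("thai restaurants in town",
                       ["a.example/menu", "b.example/review"]),
    "a.example/menu": ("thai curry menu", ["a.example/home"]),
    "b.example/review": ("restaurant review thai curry", []),
    "c.example/orphan": ("never linked so never found", []),
}


def crawl(start_url, web):
    """Breadth-first crawl from start_url; return keyword -> set of URLs."""
    index = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        text, links = web[url]
        # Scan the page for keywords and record where each one occurs.
        for word in text.split():
            index.setdefault(word, set()).add(url)
        # Follow links to pages that have not been visited yet.
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return index


index = crawl("a.example/home", TOY_WEB)
```

A page that no other page links to (`c.example/orphan` here) is never reached, which is one reason some content is not searchable.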
11Information Retrieval
- How search engines rank pages
- Page ranking based on location and frequency of keywords
- Page more relevant if keyword in title or top of body
- Page more relevant if keyword occurs frequently
- Webmasters can manipulate page to increase ranking (meta tag in head section)
- Monitor links clicked to measure relevance
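
A minimal sketch of ranking by keyword location and frequency, as described above. The weights and the sample pages are arbitrary assumptions for illustration; no real search engine is claimed to use these numbers.

```python
def score(page, keyword):
    """Score a page for one keyword by location and frequency.

    The weights (5 for the title, 3 for the first 20 body words, 1 per
    occurrence in the body) are arbitrary choices for this sketch.
    """
    keyword = keyword.lower()
    title_words = page["title"].lower().split()
    body_words = page["body"].lower().split()
    points = 0
    if keyword in title_words:
        points += 5          # keyword in title
    if keyword in body_words[:20]:
        points += 3          # keyword near the top of the body
    points += body_words.count(keyword)  # keyword frequency
    return points


# Invented sample pages.
pages = [
    {"url": "p1", "title": "HTML Tutorial",
     "body": "learn html step by step with this html guide"},
    {"url": "p2", "title": "Cooking Tips",
     "body": "a page that mentions html once"},
    {"url": "p3", "title": "Gardening",
     "body": "nothing relevant here"},
]

ranked = sorted(pages, key=lambda p: score(p, "html"), reverse=True)
ranked_urls = [p["url"] for p in ranked]
```

Because the score depends only on words the page author controls, a webmaster can raise it by repeating keywords, which is the manipulation the slide warns about.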
12The Google search engine's advanced search view
13Restricting the hits for Thai restaurants by eliminating any page containing the word "review".
14Making Effective Queries
- Queries use the three logical operators
- worda AND wordb -- both words must appear
- worda OR wordb -- either word may appear
- NOT worda -- the word is prohibited from appearing
- Google has separate windows for each
- When only one window is available, write a formula
- (Lincoln OR Jefferson) AND NOT Memorial
Which parts of the diagram do the operators cover?
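
The three operators correspond to set operations on the pages that contain each word, which is also what the regions of the diagram represent. A small sketch, assuming an invented index of page numbers:

```python
# Hypothetical index: keyword -> set of page ids containing it.
index = {
    "lincoln": {1, 2, 3},
    "jefferson": {3, 4, 5},
    "memorial": {2, 5, 6},
}
all_pages = {1, 2, 3, 4, 5, 6, 7}


def pages_with(word):
    """Return the set of pages containing the word (empty if unknown)."""
    return index.get(word.lower(), set())


# AND is set intersection, OR is set union, and NOT is the complement
# relative to all pages.
both = pages_with("Lincoln") & pages_with("Jefferson")
either = pages_with("Lincoln") | pages_with("Jefferson")
not_memorial = all_pages - pages_with("Memorial")

# (Lincoln OR Jefferson) AND NOT Memorial
result = either & not_memorial
```

With this sample data, `result` contains the pages that mention Lincoln or Jefferson but never mention Memorial.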
15Information Retrieval
- Alternative Search Engines
- A9.com
- Search service provided by Amazon.com
- Piggybacks on Google
- Sorts web files into categories
- Files, images, books, etc.
- Can set up an account and save search results
- Amazon can track your search usage!
- Search results for HTML Tutorial
16Information Retrieval
- Vivisimo.com
- Located in Pittsburgh, founded by CMU researchers
- Uses clustering technique to sort web files into topical folders (common associations between keywords and other words)
- Off to a successful start; Google is pursuing the same approach
- Search results for HTML Tutorial
- Search engines are not impartial
- Most surfers can't identify links that are effectively advertisements (see article)
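
The clustering idea mentioned for Vivisimo above (grouping results into topical folders by the words that commonly appear alongside the query keywords) can be illustrated with a toy example. This is not Vivisimo's actual algorithm; the stopword list, the tie-breaking rule, and the sample snippets are all assumptions made for this sketch.

```python
from collections import Counter

STOPWORDS = {"a", "an", "the", "for", "and", "of", "to", "in", "with"}


def cluster(snippets, query):
    """Group snippets into folders named after a co-occurring word.

    Each snippet goes to the folder of its most widely shared non-query
    word; ties are broken alphabetically.
    """
    query_words = set(query.lower().split())

    def other_words(snippet):
        return {w for w in snippet.lower().split()
                if w not in query_words and w not in STOPWORDS}

    # Count in how many snippets each co-occurring word appears.
    doc_freq = Counter()
    for s in snippets:
        doc_freq.update(other_words(s))

    folders = {}
    for s in snippets:
        words = other_words(s)
        if not words:
            label = "other"
        else:
            label = min(words, key=lambda w: (-doc_freq[w], w))
        folders.setdefault(label, []).append(s)
    return folders


# Invented result snippets for the query "HTML tutorial".
snippets = [
    "HTML tutorial for beginners",
    "Beginners guide HTML tutorial",
    "HTML tutorial tables",
    "Advanced tables HTML tutorial",
]
folders = cluster(snippets, "HTML tutorial")
```

The four snippets end up in two folders, one per shared topic word, instead of one flat ranked list.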
17Information Retrieval
- Semantic Web
- Tim Berners-Lee's article in Scientific American
- Today's web is for people to read, not computers
- Goal is to bring a knowledge structure to the WWW
- Software agents will roam the WWW and perform tasks on behalf of the user
- Based on knowledge representation and inference rules, from the field of Artificial Intelligence (AI)
- XML and other technologies used to structure knowledge
- WWW becomes an enormous database where new knowledge can be discovered
- Will be used to control physical devices too
18Information Retrieval
- Semantic Web
- Software agents will follow hyperlinks to definitions of key terms
- And will be able to reason logically
- Users will compose semantic web pages using commercial software
- Web will be structured info with sets of inference rules
- Knowledge representation is part of AI
- Basic computer science research that predates the web
19Information Retrieval
- Semantic Web
- Custom XML tags used to structure the web document
- Resource Description Framework (RDF) used to define what the XML tags mean
- Subject - Verb - Object triple
- John - is the brother of - Mary
- Subjects, objects, and verbs are defined by a URI
- URI is Uniform Resource Identifier (a URL is a type of URI)
- Ontology is a collection of logical statements written in RDF
- Can also be used to identify synonyms
- e.g., zip code and postal code
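
The triple structure, the use of URIs, synonym statements, and inference rules can be sketched together in a few lines. This is plain Python, not an RDF serialization or an RDF library; every `example.org` URI, the `sameAs` and `isSiblingOf` terms, and the two rules are hypothetical names introduced for illustration.

```python
# Hypothetical URIs; the example.org names are placeholders, not a real
# vocabulary.
EX = "http://example.org/"
JOHN = EX + "people/John"
MARY = EX + "people/Mary"
BROTHER_OF = EX + "terms/isBrotherOf"
SIBLING_OF = EX + "terms/isSiblingOf"
SAME_AS = EX + "terms/sameAs"
ZIP_CODE = EX + "terms/zipCode"
POSTAL_CODE = EX + "terms/postalCode"

# Each statement is a (subject, verb, object) triple of URIs.
triples = {
    (JOHN, BROTHER_OF, MARY),
    (ZIP_CODE, SAME_AS, POSTAL_CODE),
}


def infer(facts):
    """Apply two toy inference rules until no new triples appear.

    Rule 1: X isBrotherOf Y  =>  X isSiblingOf Y
    Rule 2: X isSiblingOf Y  =>  Y isSiblingOf X
    """
    facts = set(facts)
    while True:
        new = set()
        for s, p, o in facts:
            if p == BROTHER_OF:
                new.add((s, SIBLING_OF, o))
            if p == SIBLING_OF:
                new.add((o, SIBLING_OF, s))
        if new <= facts:
            return facts
        facts |= new


def synonyms(term, facts):
    """Return the terms linked to `term` by sameAs in either direction."""
    result = {term}
    for s, p, o in facts:
        if p == SAME_AS and term in (s, o):
            result.update((s, o))
    return result


known = infer(triples)
```

After inference, `known` also states that Mary is a sibling of John, a fact that was never written down explicitly; this is the kind of logical reasoning the agents are expected to perform.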
20Information Retrieval
- Semantic Web is a potential Killer App
- Improved web searches based on meaning of keywords
- Automated discovery of useful information via agents
- Entire WWW becomes vast repository of linked information understandable by computers
- Agents will be able to read digital signatures that prove authenticity of information
- Exponential increase in the usefulness of information
21Information Retrieval
- Digital Libraries
- Effort to turn WWW into immense library collection
- Recently Google announced it'll scan/digitize books from
- Harvard, Stanford, Oxford, Michigan, and NYC libraries
- Like semantic web, based on knowledge representation
- More than just an electronic bookshelf
- Organization, classification, retrieval, and analysis of information
- Pitt SIS planning a Master's program in Digital Library Information Management (DLIM)
- The WWW of the Future
- An automated, intelligent web containing all of the world's information (even historical archives!) that real-time decisions can be based on