1 Lecture 23: Web Searching
Principles of Information Retrieval
- Prof. Ray Larson
- University of California, Berkeley
- School of Information
- Tuesday and Thursday, 10:30 am - 12:00 pm
- Spring 2007
- http://courses.ischool.berkeley.edu/i240/s07
2 Mini-TREC
- Proposed Schedule
- February 15: Database and previous queries
- February 27: Report on system acquisition and setup
- March 8: New queries for testing
- April 19: Results due (next Thursday)
- April 24 or 26: Results and system rankings
- May 8: Group reports and discussion
3 All Mini-TREC Runs
4 All Groups' Best Runs
5 All Groups' Best Runs (RRL)
6 Results Data
- trec_eval runs for each submitted file have been put into a new directory called RESULTS in your group directories
- The trec_eval parameters used for these runs are -o for the .res files and -o -q for the .resq files. The .dat files contain the recall level and precision values used for the preceding plots
- The qrels for the Mini-TREC queries are available now in the /projects/i240 directory as MINI_TREC_QRELS
7 Mini-TREC Reports
- In-class presentations May 8th
- Written report due May 8th (last day of class), 4-5 pages
- Content:
- System description
- What approach/modifications were taken?
- Results of official submissions (see RESULTS)
- Results of new post-runs using MINI_TREC_QRELS and trec_eval
8 Term Paper
- Should be about 8-15 pages on:
- Some area of IR research (or practice) that you are interested in and want to study further
- Experimental tests of systems or IR algorithms
- Build an IR system, test it, and describe the system and its performance
- Due May 8th (last day of class)
9 Today
- Review
- Web Crawling and Search Issues
- Web Search Engines and Algorithms
- Web Search Processing
- Parallel Architectures (Inktomi - Brewer)
- Cheshire III Design
Credit for some of the slides in this lecture
goes to Marti Hearst and Eric Brewer
10 Web Crawlers
- How do the web search engines get all of the items they index?
- More precisely:
- Put a set of known sites on a queue
- Repeat the following until the queue is empty:
- Take the first page off of the queue
- If this page has not yet been processed:
- Record the information found on this page
- Positions of words, links going out, etc.
- Add each link on the current page to the queue
- Record that this page has been processed
- In what order should the links be followed? (See the sketch below.)
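A minimal sketch of the crawl loop just described, in Python. The fetch_page and extract_links helpers are hypothetical placeholders for the network fetch and HTML parsing steps:

    from collections import deque

    def crawl(seed_urls, fetch_page, extract_links):
        queue = deque(seed_urls)       # start from a set of known sites
        processed = set()              # pages already handled
        index = {}                     # url -> recorded page information
        while queue:
            url = queue.popleft()      # take the first page off the queue
            if url in processed:
                continue               # already processed; skip it
            page = fetch_page(url)     # fetch the page (may fail in practice)
            index[url] = page          # record words, positions, out-links, etc.
            for link in extract_links(page):
                queue.append(link)     # add each link on the page to the queue
            processed.add(url)         # record that this page has been processed
        return index

Because popleft takes pages in arrival order, this crawls breadth-first; popping from the end of the queue instead would give depth-first order, which is exactly the choice the next slide illustrates.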
11 Page Visit Order
- Animated examples of breadth-first vs. depth-first search on trees
- http://www.rci.rutgers.edu/~cfs/472_html/AI_SEARCH/ExhaustiveSearch.html
12 Sites Are Complex Graphs, Not Just Trees
13 Web Crawling Issues
- Keep-out signs
- A file called robots.txt tells the crawler which directories are off limits
- Freshness
- Figure out which pages change often
- Recrawl these often
- Duplicates, virtual hosts, etc.
- Convert page contents with a hash function
- Compare new pages to the hash table
- Lots of problems
- Server unavailable
- Incorrect HTML
- Missing links
- Infinite loops
- Web crawling is difficult to do robustly! (A sketch of two of these safeguards follows.)
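A minimal sketch of two of the safeguards above: honoring robots.txt (using Python's standard urllib.robotparser) and detecting exact duplicates by hashing page contents. Real crawlers also use near-duplicate detection (e.g., shingling), which this does not attempt:

    import hashlib
    from urllib.robotparser import RobotFileParser

    def allowed(url, robots_url, agent="*"):
        rp = RobotFileParser(robots_url)
        rp.read()                          # fetch and parse robots.txt
        return rp.can_fetch(agent, url)    # is this URL off limits?

    seen_hashes = set()

    def is_duplicate(page_text):
        # identical content (e.g., a virtual-host mirror) hashes identically
        digest = hashlib.sha1(page_text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return True
        seen_hashes.add(digest)
        return False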
14 Search Engines
- Crawling
- Indexing
- Querying
15 Web Search Engine Layers
From a description of the FAST search engine by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
16 Standard Web Search Engine Architecture
[Architecture diagram: crawlers crawl the web; documents are checked for duplicates and stored with DocIds; an inverted index is created from them; search engine servers use the inverted index to answer the user query and show results to the user]
17 More detailed architecture, from Brin & Page 98. Only covers the preprocessing in detail, not the query serving.
18 Indexes for Web Search Engines
- Inverted indexes are still used, even though the web is so huge
- Most current web search systems partition the indexes across different machines
- Each machine handles different parts of the data (Google uses thousands of PC-class processors and keeps most things in main memory)
- Other systems duplicate the data across many machines
- Queries are distributed among the machines
- Most do a combination of these (see the scatter-gather sketch below)
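A minimal sketch of document-partitioned query serving: each machine indexes its own slice of the pages, the query goes to every partition, and the partial top-k lists are merged. The Partition class and its index layout are illustrative assumptions:

    import heapq

    class Partition:
        def __init__(self, inverted_index):
            self.index = inverted_index         # term -> [(score, doc_id), ...]

        def search(self, term, k):
            postings = self.index.get(term, [])
            return heapq.nlargest(k, postings)  # top k from this slice only

    def distributed_search(partitions, term, k=10):
        partials = []
        for p in partitions:                    # in reality these run in parallel
            partials.extend(p.search(term, k))
        return heapq.nlargest(k, partials)      # merge the partial result lists

Replicating each partition across several machines (the "combination" case) adds query throughput on top of this data scaling, which is the arrangement the next slide quantifies.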
19 Search Engine Querying
In this example, the data for the pages is partitioned across machines. Additionally, each partition is allocated multiple machines to handle the queries. Each row can handle 120 queries per second; each column can handle 7M pages. To handle more queries, add another row.
From a description of the FAST search engine by Knut Risvik: http://www.infonortics.com/searchengines/sh00/risvik_files/frame.htm
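The scaling arithmetic behind that grid, as a short sketch (the 4x3 grid size is an illustrative assumption; the per-row and per-column figures come from the slide):

    rows, cols = 4, 3                      # hypothetical grid dimensions
    qps_per_row = 120                      # each row handles 120 queries/sec
    pages_per_col = 7_000_000              # each column handles 7M pages

    total_qps = rows * qps_per_row         # 480 queries/sec
    total_pages = cols * pages_per_col     # 21M pages indexed
    machines = rows * cols                 # 12 machines in the grid

Rows and columns scale independently: adding a row buys query throughput, adding a column buys index capacity.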
20 Querying: Cascading Allocation of CPUs
- A variation on this that produces a cost savings:
- Put high-quality/common pages on many machines
- Put lower-quality/less common pages on fewer machines
- Query goes to the high-quality machines first
- If no hits are found there, go to the other machines (see the sketch below)
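A minimal sketch of that cascade: try the heavily replicated high-quality tier first and fall through only when it returns nothing. The tier.search method is a hypothetical per-tier index lookup:

    def cascaded_search(tiers, query, k=10):
        for tier in tiers:                 # ordered best tier first
            hits = tier.search(query, k)
            if hits:                       # found matches: stop cascading
                return hits
        return []                          # no tier had a match

Most queries hit popular pages and are answered by the first (cheap, replicated) tier, so the rarely used lower tiers need far less hardware.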
21 Google
- Google maintains (probably) the world's largest Linux cluster (over 15,000 servers)
- These are partitioned between index servers and page servers
- Index servers resolve the queries (massively parallel processing)
- Page servers deliver the results of the queries
- Over 8 billion web pages are indexed and served by Google
22 Search Engine Indexes
- Starting points for users include:
- Manually compiled lists
- Directories
- Page popularity
- Frequently visited pages (in general)
- Frequently visited pages as a result of a query
- Link co-citation
- Which sites are linked to by other sites?
23 Starting Points: What is Really Being Used?
- Today's search engines combine these methods in various ways
- Integration of directories
- Today most web search engines integrate categories into the results listings
- Lycos, MSN, Google
- Link analysis
- Google uses it; others are also using it
- Words on the links seem to be especially useful
- Page popularity
- Many use DirectHit's popularity rankings
24 Web Page Ranking
- Varies by search engine
- Pretty messy in many cases
- Details usually proprietary and fluctuating
- Combining subsets of (see the sketch below):
- Term frequencies
- Term proximities
- Term position (title, top of page, etc.)
- Term characteristics (boldface, capitalized, etc.)
- Link analysis information
- Category information
- Popularity information
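One common way to combine such signals is a weighted sum, sketched below. The signal names and weights are illustrative assumptions, not any engine's actual (proprietary) formula:

    # Hypothetical relevance signals, each normalized to [0, 1]
    WEIGHTS = {
        "term_frequency": 1.0,
        "term_proximity": 0.5,
        "in_title":       2.0,
        "link_score":     3.0,
        "popularity":     1.5,
    }

    def page_score(signals):
        # signals: dict mapping signal name -> value for one page
        return sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())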
25 Ranking: Hearst 96
- Proximity search can help get high-precision results if > 1 term
- Combine Boolean and passage-level proximity
- Shows significant improvements when retrieving top 5, 10, 20, 30 documents
- Results reproduced by Mitra et al. 98
- Google uses something similar
26 Ranking: Link Analysis
- Assumptions:
- If the pages pointing to this page are good, then this is also a good page
- The words on the links pointing to this page are useful indicators of what this page is about
- References: Page et al. 98, Kleinberg 98
27 Ranking: Link Analysis
- Why does this work?
- The official Toyota site will be linked to by lots of other official (or high-quality) sites
- The best Toyota fan-club site probably also has many links pointing to it
- Less high-quality sites do not have as many high-quality sites linking to them
28 Ranking: PageRank
- Google uses PageRank
- We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; it is usually set to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows:
- PR(A) = (1 - d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))
- Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one (see the iterative sketch below)
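Because the definition is recursive, PageRank is computed iteratively until the values settle. A minimal sketch over a tiny hypothetical link graph (note it uses the slide's PR(A) = (1-d) + d(...) form, which actually sums to N rather than 1 unless divided by the number of pages):

    def pagerank(links, d=0.85, iterations=50):
        # links: dict mapping each page to the list of pages it links to
        pages = list(links)
        pr = {p: 1.0 for p in pages}                  # initial guess
        out = {p: len(links[p]) or 1 for p in pages}  # C(p), avoiding /0
        for _ in range(iterations):
            pr = {a: (1 - d) + d * sum(pr[t] / out[t]
                                       for t in pages if a in links[t])
                  for a in pages}
        return pr

    graph = {"A": ["B", "C"], "B": ["A"], "C": ["A"]}
    print(pagerank(graph))   # A ranks highest: two pages cite it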
29 PageRank
[Example link-graph diagram: pages T1-T8 and X1, X2 pointing toward page A. Note: these are not real PageRanks, since they include values > 1. Sample values shown: T1 = 0.725, T2-T7 = 1, T8 = 2.46625, A = 4.2544375.]
30 PageRank
- Similar to calculations used in scientific citation analysis (e.g., Garfield et al.) and social network analysis (e.g., Wasserman et al.)
- Similar to other work on ranking (e.g., the hubs and authorities of Kleinberg et al.)
- How is Amazon similar to Google in terms of the basic insights and techniques of PageRank?
- How could PageRank be applied to other problems and domains?
31 Today
- Review
- Web Crawling and Search Issues
- Web Search Engines and Algorithms
- Web Search Processing
- Parallel Architectures (Inktomi - Eric Brewer)
- Cheshire III Design
Credit for some of the slides in this lecture goes to Marti Hearst and Eric Brewer
[Slides 32-51: no transcript available]
52 Grid-based Search and Data Mining Using Cheshire3
Presented by Ray R. Larson, University of California, Berkeley, School of Information
- In collaboration with:
- Robert Sanderson
- University of Liverpool
- Department of Computer Science
53 Overview
- The Grid, Text Mining and Digital Libraries
- Grid Architecture
- Grid IR Issues
- Cheshire3: Bringing Search to Grid-Based Digital Libraries
- Overview
- Grid Experiments
- Cheshire3 Architecture
- Distributed Workflows
54 Grid Architecture (Dr. Eric Yen, Academia Sinica, Taiwan)
[Layer diagram:
- Applications: high energy physics, chemical engineering, climate, astrophysics, cosmology, combustion, ...
- Application Toolkits: remote computing, remote visualization, collaboratories, remote sensors, data grid, portals, ...
- Grid Services (middleware): protocols, authentication, policy, instrumentation, resource management, discovery, events, etc.
- Grid Fabric: storage, networks, computers, display devices, etc., and their associated local services]
55 Grid Architecture (ECAI/AS Grid Digital Library Workshop)
[The same layer diagram with digital-library additions:
- Applications adds: digital libraries, humanities computing, bio-medical
- Application Toolkits adds: text mining, metadata management, search & retrieval
- Grid Services and Grid Fabric layers as before]
56 Grid-Based Digital Libraries
- Large-scale distributed storage requirements and technologies
- Organizing distributed digital collections
- Shared metadata standards and requirements
- Managing distributed digital collections
- Security and access control
- Collection replication and backup
- Distributed information retrieval issues and algorithms
57 Grid IR Issues
- Want to preserve the same retrieval performance (precision/recall) while hopefully increasing efficiency (i.e., speed)
- Very large-scale distribution of resources is a challenge for sub-second retrieval
- Different from most other typical Grid processes, IR is potentially less computing-intensive and more data-intensive
- In many ways Grid IR replicates the process (and problems) of metasearch or distributed search
58 Introduction
- Cheshire history:
- Developed at UC Berkeley originally
- Solution for library data (C1), then SGML (C2), then XML
- Monolithic applications for indexing and retrieval; server in C + TCL scripting
- Cheshire3:
- Developed at Liverpool, plus Berkeley
- XML, Unicode, Grid scalable; standards based
- Object-oriented framework
- Easy to develop and extend in Python
59 Introduction
- Today:
- Version 0.9.4
- Mostly stable, but needs thorough QA and docs
- Grid, NLP and classification algorithms integrated
- Near future:
- June: Version 1.0
- Further DM/TM integration, docs, unit tests, stability
- December: Version 1.1
- Grid out-of-the-box, configuration GUI
60 Context
- Environmental requirements:
- Very large scale information systems
- Terabyte scale (Data Grid)
- Computationally expensive processes (Comp. Grid)
- Digital preservation
- Analysis of data, not just retrieval (data/text mining)
- Ease of extensibility, customizability (Python)
- Open source
- Integrate, not re-implement
- "Web 2.0" interactivity and dynamic interfaces
61 Context
62 Cheshire3 Object Model
[Object-model diagram; visible labels include Protocol Handler and Record]
63 Object Configuration
- One XML 'record' per non-data object
- Very simple base schema, with extensions as needed
- Identifiers for objects unique within a context (e.g., unique at the individual database level, but not necessarily between all databases)
- Allows workflows to reference by identifier but act appropriately within different contexts
- Allows multiple administrators to define objects without reference to each other
64 Grid
- Focus on ingest, not discovery (yet)
- Instantiate architecture on every node
- Assign one node as master, the rest as slaves. The master then divides the processing as appropriate.
- Calls between slaves possible
- Calls as small and simple as possible: (objectIdentifier, functionName, arguments)
- Typically: ('workflow-id', 'process', 'document-id') (see the dispatch sketch below)
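A minimal sketch of how a node might dispatch such a call: resolve the object identifier locally, then invoke the named function. The registry and transport here are illustrative assumptions, not Cheshire3's actual RPC layer:

    REGISTRY = {}                        # objectIdentifier -> local object

    def handle_call(message):
        object_id, function_name, args = message
        target = REGISTRY[object_id]     # resolve within this node's context
        return getattr(target, function_name)(*args)

    # e.g., the master handing one document to a slave:
    # handle_call(('workflow-id', 'process', ('document-id',)))

Keeping the message this small means any node can act on it, because every node has instantiated the same architecture and can resolve the identifier in its own context.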
65 Grid Architecture
[Diagram: the Master Task sends (workflow, process, document) messages to Slave Tasks 1..N; each slave fetches its document from the Data Grid and writes extracted data to GPFS temporary storage]
66 Grid Architecture - Phase 2
[Diagram: the Master Task sends (index, load) messages to Slave Tasks 1..N; each slave fetches extracted data from GPFS temporary storage and stores the resulting index in the Data Grid]
67 Workflow Objects
- Written as XML within the configuration record
- Rewritten and compiled to Python code on object instantiation
- Current instructions:
- object
- assign
- fork
- for-each
- break/continue
- try/except/raise
- return
- log (send text to the default logger object)
- Yes, there is no if!
68 Workflow Example

  <subConfig id="buildSingleWorkflow">
    <objectType>workflow.SimpleWorkflow</objectType>
    <workflow>
      <object type="workflow" ref="PreParserWorkflow"/>
      <try>
        <object type="parser" ref="NsSaxParser"/>
      </try>
      <except>
        <log>Unparsable Record</log>
        <raise/>
      </except>
      <object type="recordStore" function="create_record"/>
      <object type="database" function="add_record"/>
      <object type="database" function="index_record"/>
      <log>Loaded Record input.id</log>
    </workflow>
  </subConfig>
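On instantiation that XML is compiled to Python; conceptually the result is roughly the sketch below (an illustrative assumption, not Cheshire3's actual generated code; the object names mirror the refs in the XML):

    def build_single_workflow(session, input_doc):
        record = pre_parser_workflow.process(session, input_doc)
        try:
            record = ns_sax_parser.process_document(session, record)
        except Exception:
            logger.log("Unparsable Record")
            raise                      # matches the <raise/> instruction
        record = record_store.create_record(session, record)
        database.add_record(session, record)
        database.index_record(session, record)
        logger.log("Loaded Record " + str(record.id))
        return record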
69 Text Mining
- Integration of Natural Language Processing tools
- Including:
- Part-of-speech taggers (noun, verb, adjective, ...)
- Phrase extraction
- Deep parsing (subject, verb, object, preposition, ...)
- Linguistic stemming (is/be fairy/fairy vs. is/is fairy/fairi)
- Planned: information extraction tools
70 Data Mining
- Integration of toolkits is difficult unless they support sparse vectors as input: text is high-dimensional, but has lots of zeroes (see the sketch below)
- Focus on automatic classification for predefined categories rather than clustering
- Algorithms integrated/implemented:
- Perceptron, Neural Network (pure Python)
- Naïve Bayes (pure Python)
- SVM (libsvm integrated with a Python wrapper)
- Classification Association Rule Mining (Java)
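A minimal sketch of the sparse-vector representation the toolkits need: a document over a vocabulary of, say, 10^5 terms is stored as a {term_id: value} dict, and operations skip the zero entries entirely:

    def sparse_dot(u, v):
        # iterate over the smaller vector; zeroes are simply absent
        if len(u) > len(v):
            u, v = v, u
        return sum(val * v.get(term, 0.0) for term, val in u.items())

    doc_a = {3: 1.0, 17: 2.0, 905: 1.0}   # only 3 nonzero dimensions
    doc_b = {17: 1.0, 905: 3.0}
    print(sparse_dot(doc_a, doc_b))        # 5.0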
71 Data Mining
- Modelled as a multi-stage PreParser object (training phase, prediction phase)
- Plus the need for an AccumulatingDocumentFactory to merge document vectors together into a single output for training some algorithms (e.g., SVM)
- The prediction phase attaches metadata (the predicted class) to the document object, which can be stored in a DocumentStore
- Document vectors are generated per index per document, so integrated NLP document normalization comes for free
72 Data Mining + Text Mining
- Testing the integrated environment with 500,000 MEDLINE abstracts, using various NLP tools, classification algorithms, and evaluation strategies
- Computational grid for distributing expensive NLP analysis
- Results show better accuracy with fewer attributes
73 Applications (1)
- Automated Collection Strength Analysis
- Primary aim: test whether data mining techniques could be used to develop a coverage map of items available in the London libraries
- The strengths within the library collections were automatically determined through enrichment and analysis of bibliographic-level metadata records
- This involved very large scale processing of records to:
- Deduplicate millions of records
- Enrich deduplicated records against a database of 45 million
- Automatically reclassify enriched records using machine learning processes (Naïve Bayes)
74 Applications (1)
- Data mining enhances collection mapping strategies by making a larger proportion of the data usable, discovering hidden relationships between textual subjects and hierarchically based classification systems
- The graph shows the comparison of the numbers of books classified in the domain of Psychology originally and after enhancement using data mining
75 Applications (2)
- Assessing the Grade Level of NSDL Education Material
- The National Science Digital Library has assembled a collection of URLs that point to educational material for scientific disciplines at all grade levels. These are harvested into the SRB data grid.
- Working with SDSC, we assessed grade-level relevance by examining the vocabulary used in the material present at each registered URL
- We determined the vocabulary-based grade level with the Flesch-Kincaid grade level assessment (sketched below). The domain of each website was then determined using data mining techniques (a TF-IDF derived fast domain classifier).
- This processing was done on the TeraGrid cluster at SDSC
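The standard Flesch-Kincaid grade-level formula used for that assessment, as a short sketch (the word, sentence, and syllable counts must be computed from the text; the example numbers are made up):

    def flesch_kincaid_grade(words, sentences, syllables):
        # 0.39 * (words per sentence) + 11.8 * (syllables per word) - 15.59
        return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

    # e.g., a page with 1000 words, 60 sentences, 1500 syllables:
    print(round(flesch_kincaid_grade(1000, 60, 1500), 2))  # 8.61 -> ~8th-9th grade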
76 Cheshire3 Grid Tests
- Running on a 30-processor cluster in Liverpool using PVM (Parallel Virtual Machine)
- Using 16 processors with one master and 22 slave processes, we were able to parse and index MARC data at about 13,000 records per second
- On a similar setup, 610 MB of TEI data can be parsed and indexed in seconds
77 SRB and SDSC Experiments
- We are working with SDSC to include SRB support
- We are planning to continue working with SDSC and to run further evaluations using the TeraGrid server(s) through a small grant for 30,000 CPU hours
- SDSC's TeraGrid cluster currently consists of 256 IBM cluster nodes, each with dual 1.5 GHz Intel Itanium 2 processors, for a peak performance of 3.1 teraflops. The nodes are equipped with four gigabytes (GB) of physical memory per node. The cluster is running SuSE Linux and is using Myricom's Myrinet cluster interconnect network.
- Planned large-scale test collections include NSDL, the NARA repository, CiteSeer, and the "million books" collections of the Internet Archive
78 Conclusions
- Scalable Grid-based digital library services can be created to support very large collections with improved efficiency
- The Cheshire3 IR and DL architecture can provide Grid (or single-processor) services for next-generation DLs
- Available as open source via:
- http://cheshire3.sourceforge.net or
- http://www.cheshire3.org/