Title: Amanda Spink: Analysis of Web Searching and Retrieval
1. Amanda Spink: Analysis of Web Searching and Retrieval
- Larry Reeve
- INFO861 - Topics in Information Science
- Dr. McCain - Winter 2004
2. Background
- Amanda Spink
- Self-described areas of work
- Information Retrieval
- Web Retrieval
- Human Information Behavior / Information Seeking
- Medical Informatics
- Ph.D. 1993 Rutgers University
- Thesis - Feedback in Information Retrieval
- Studied under Tefko Saracevic
3. Background
- Amanda Spink
- Over 140 papers published
- 5th in journal article production and 18th in citation production among U.S. IS faculty
- Institute for Information Science most highly cited paper in Web Retrieval: "Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web" (2000)
4. Background
- Amanda Spink
- Associate Professor at University of Pittsburgh
- School of Information Sciences
- Prior faculty positions
- Pennsylvania State University
- School of Information Sciences and Technology
- Web Research Group
- University of North Texas
- School of Library and Information Sciences
5. Background
- Tefko Saracevic
- Associate Dean
- School of Communication, Information and Library Studies, Rutgers University
- Related research
- Test and Evaluation of IR systems
- Relevance in Information Science
- Analysis of web queries
6. Web Searching and Retrieval
- Analyze user queries
- Important for building future IR systems on Web
- Focus on search terms
- Failure analysis in query construction
- Term Relevance Feedback (TRF)
- Topics / Classification
- Use of language
7. Studies Conducted
- U.S. Excite (www.excite.com)
- 51K study
- 51,473 queries
- 18,113 users
- March 9, 1997
- 1M study
- 1,025,910 queries
- 211,063 users
- September 16, 1997
8. Studies Conducted
- European - AllTheWeb.com
- 1 million queries
- 200,000 users
- Logs from two days
- February 6, 2001
- May 28, 2002
- Most users from Norway and Germany
9. Studies Conducted
- Issues with Web transaction logs
- Where does session start and end?
- Temporal boundary: Spink found 15 minutes on average
- Others found 5 minutes, 12 minutes, 32 minutes, and 2 hours
- Numerical boundary: 100 entries
- How to eliminate non-individual users?
- Meta-search engines, other agents
- No insight into the user's process
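The two boundary ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(user_id, timestamp, query)`, the function name `split_sessions`, and the reading of "15 minutes" as an inactivity cutoff and "100 entries" as a per-user cap for filtering out agents are all hypothetical, not taken from the studies.

```python
from datetime import datetime, timedelta

def split_sessions(records, gap=timedelta(minutes=15), max_entries=100):
    """Group (user_id, timestamp, query) records into per-user sessions.

    Assumptions (not from the studies): a session ends when a user is
    idle longer than `gap`, and users with more than `max_entries`
    records are treated as non-individual agents and dropped.
    """
    by_user = {}
    for user_id, ts, query in sorted(records, key=lambda r: (r[0], r[1])):
        by_user.setdefault(user_id, []).append((ts, query))

    sessions = {}
    for user_id, entries in by_user.items():
        if len(entries) > max_entries:
            continue  # numerical boundary: likely a meta-search engine or agent
        user_sessions = [[entries[0]]]
        for prev, cur in zip(entries, entries[1:]):
            if cur[0] - prev[0] > gap:
                user_sessions.append([cur])   # temporal boundary crossed
            else:
                user_sessions[-1].append(cur)
        sessions[user_id] = user_sessions
    return sessions
```

Changing `gap` to 5, 12, or 32 minutes, or to 2 hours, shows how strongly the session count depends on the chosen temporal boundary.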
10. Findings
- Relevance Feedback
- Advanced Search Techniques
- Term Characteristics
- Query Classification
- American vs. European
11. Findings: Relevance Feedback
- Term Relevance Feedback (TRF) rarely used
- 51K study
- 1,597 queries from 823 users (<5% of queries)
- Those using TRF had longer sessions
- Successful 60% of the time
- Implications
- Failure rate of 40% may be too high
- IR designers could automatically perform TRF
12. Findings: Relevance Feedback
- Mediated searching
- 11% of search terms come from TRF
- 37% from users, 63% from mediators
- 2/3 of TRF contributed positively
13. Findings: Relevance Feedback
- Identified 6 session states
- Initial Query, Modified Query, Next Page, New Query, Relevance Feedback, Prev Query
- Identified 4 session patterns
- Using the 6 session states
- Implication: IR designers should accommodate these states and patterns
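As a rough illustration of how the six session states might be assigned to a logged session, the sketch below labels each action in order. The rules are assumptions made for this example, not the definitions used in the study: an action flagged as feedback is Relevance Feedback, a repeat of the previous query is Next Page, a return to an earlier query is Prev Query, a query sharing at least one term with the previous one is Modified Query, and anything else is New Query. The function name `label_states` and the `(query, is_feedback)` input layout are hypothetical.

```python
def label_states(actions):
    """Label each (query, is_feedback) action with one of six session states.

    The labeling rules are illustrative assumptions, not the study's own.
    """
    labels = []
    seen = []      # term sets of earlier queries in this session
    prev = None    # term set of the most recent query
    for query, is_feedback in actions:
        if is_feedback:
            labels.append("Relevance Feedback")
            continue
        terms = frozenset(query.lower().split())
        if prev is None:
            label = "Initial Query"
        elif terms == prev:
            label = "Next Page"        # same query submitted again
        elif terms in seen:
            label = "Prev Query"       # return to an earlier query
        elif terms & prev:
            label = "Modified Query"   # overlaps the previous query
        else:
            label = "New Query"        # no terms in common
        labels.append(label)
        seen.append(terms)
        prev = terms
    return labels
```

Counting the resulting label sequences across sessions is one possible way to look for recurring session patterns.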
14. Findings: Relevance Feedback
- Relevance Feedback Session Patterns
15. Findings: Advanced Search Techniques
- Includes
- Boolean operators
- Modifiers +, -
- Quotes (phrases)
- Not often used by Web users, but used more in mediated searching
- Boolean <10%, modifiers 9%, phrases 6%
- Used incorrectly
- Boolean: AND 50%, OR 28%, AND NOT 19%
- Modifiers: 75% of the time
- Phrases: 8%
- Users and advanced techniques do not get along!
16. Findings: Advanced Search Techniques
- Boolean, most common problems
- Not capitalizing AND
- Confusing the AND operator with the conjunction "and"
- e.g. Science and Technology
- Science AND Technology
- Modifiers, most common problems
- Modifiers are prefix, but users write them in the mathematical position between terms
- +news +weather rather than news+weather
- No space between a modifier and its term, whereas Boolean operators require spaces
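The common mistakes listed above could be flagged by a simple query checker, for example to drive a hint in the search interface. The sketch below is an assumption-laden illustration: the function name `check_query` and the warning labels are hypothetical, and the rules only approximate the operator syntax described on this slide.

```python
import re

def check_query(query):
    """Return warning labels for likely operator mistakes in a query.

    Heuristics only: a lowercase and/or/not between terms may be an
    intended Boolean operator, a '+' between two words is probably a
    misplaced prefix modifier, and a '+' or '-' followed by a space is
    detached from its term.
    """
    warnings = []
    tokens = query.split()
    if any(t in ("and", "or", "not") for t in tokens[1:-1]):
        warnings.append("lowercase-operator")
    if re.search(r"\w\+\w", query):
        warnings.append("infix-plus")
    if re.search(r"(?:^|\s)[+-](?:\s|$)", query):
        warnings.append("detached-modifier")
    return warnings
```

For instance, `check_query("news+weather")` reports `infix-plus`, while `check_query("+news +weather")` reports nothing. A lowercase "and" may of course be a genuine conjunction, so a warning here is a prompt to the user rather than an error.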
17. Findings: Term Characteristics
- Terms per query
- 1 term: 26.6%, 2 terms: 31.5%, 3 terms: 18.2%, >7 terms: 1.8%
- Mediated searching 7-15 terms
- Distribution of terms not quite Zipf
- Top terms account for 10% of all terms
- Single-use terms account for 9% of all terms
- Not understood why this occurs
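The term statistics on this slide are the kind of figures that can be computed directly from a list of query strings. The following is a minimal sketch, assuming whitespace tokenization, case-insensitive terms, and at least one non-empty query. The function name `term_stats` and the `top_n` cutoff are hypothetical, since the slide does not say how many terms count as "top terms".

```python
from collections import Counter

def term_stats(queries, top_n=100):
    """Compute terms-per-query shares and term-frequency shares (in percent).

    Returns (length_share, top_share, single_share):
      length_share: {number of terms: % of queries}
      top_share:    % of all term occurrences covered by the top_n terms
      single_share: % of all term occurrences that are single-use terms
    """
    lengths = Counter(len(q.split()) for q in queries)
    length_share = {n: 100 * c / len(queries) for n, c in sorted(lengths.items())}

    term_counts = Counter(t.lower() for q in queries for t in q.split())
    total_terms = sum(term_counts.values())
    top_share = 100 * sum(c for _, c in term_counts.most_common(top_n)) / total_terms
    single_share = 100 * sum(1 for c in term_counts.values() if c == 1) / total_terms
    return length_share, top_share, single_share
```

Sorting `term_counts` by frequency and plotting rank against count on log-log axes would be one way to see how far the distribution departs from Zipf.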
18. Findings: Query Classification
- Classification of queries based on Rutgers Web Classification
19. Findings: Query Classification
- What users are looking for is not what is on Web
- Distribution of content
- 83% Commercial, 6% Educational, 3% Health
- Example: 10% of searches are for Health
- Searchers find classifications understandable
- IR system presentation design
20. Findings: American vs. European Searching
- Commonalities
- Three or fewer terms
- American 80%, European 85%
- Predominantly use English terms
- Relevance judgments: less than 15 minutes viewing retrieved documents
- Information seeking sessions short
21. Findings: American vs. European Searching
- Differences
- Categories
- American: Entertainment, Sex, Commerce
- European: People-places-things, Computers, Commerce
- American searchers spent more time searching e-commerce sites than European counterparts
- Did not examine
- Use of advanced techniques
- Relevance feedback
- First in initial set of studies?
22. Findings: Summary
- Number of query terms is about 2
- TRF is not used often
- Boolean operators and modifiers not used often; difficulty in using them correctly
- Users do not spend much time making relevance judgments
- Term frequency distribution is a few terms used often, many terms used only once
23. Findings: Summary
- Most users had a single query only and did not follow up with successive queries
- Average viewing of 2 pages
- 50% did not access beyond the first page; more than 75% did not go beyond 2 pages
24. Implications / Further Research
- Improve use of advanced search techniques
- UI changes, Venn Diagrams
- Improve use of relevance feedback
- Automatic generation of TRF results
- Improve classification of results
- UI changes, result overview
- Improve understanding of language use
- Adapt IR designs to language
- Examine cultural differences
- TRF, advanced search techniques (same or different?)
25. Amanda Spink: Web Searching and Retrieval