Title: Web Search Studies: Approaches and Methods
1Web Search Studies: Approaches and Methods
- Amanda Spink
- Queensland University of Technology
2Overview
- Introduction
- Key Issues Search Studies
- Transaction log analysis
- Search Evaluation
- Conclusions
3Search Studies
- 1960s onwards: online search, library focus
- 1990s onwards: Web search focus (commercial search engines, enterprise search)
- User search modeling studies are important for academia, industry and organizations offering information via search engines
4Book
- Amanda Spink and Jim Jansen (2004). Web Search: Public Searching of the Web. Springer.
5Research Background
- Since 1990: user/IR system interaction
- Since 1997: user/Web search engine interaction
- Relevance measurement: relevance regions
- IR interaction measures
6Research Background
- Examine patterns, trends, user modeling, systems/interface design ideas and significant insights
- Human information behavior (HIB) focus: system interaction embedded in HIB
7Web Search Studies
- Web search engines: Alta Vista, Ask Jeeves, Excite, AlltheWeb, Vivisimo, Dogpile
- Transaction log analysis studies
- Focus on user search analysis for competitive advantage
8Data Collection Methods
- Various combinations of methods and approaches
- Transaction log analysis
- Videotaping and audio-taping
- Think aloud protocols
- Usability and HCI techniques
- Focus groups
- Interviews
- Surveys
- Experiments
- Diaries
9Data Analysis Methods
- Quantitative and statistical analysis
- Qualitative analysis: grounded theory
- Combination of both methods
10Key Issues Search Studies
- What is the goal of the project? Insights and understanding to develop theory; user modeling; trend analysis; interface/systems design; user training
- What resources are available: sample size, expertise and funds?
- Academic or industry research?
- Time pressures?
11Key Issues Search Studies
- What variables to measure?
- How much data is enough?
- Methods used: single or multiple?
- HCI approach: test interface/system features
12Transaction Log Analysis (TLA)
- File or log of communications between user and system
- File recorded on a server: server-side recordings
- Log or file formats vary
13Why Collect and Analyze Log Data?
- Gain understanding of user interaction with system and interface
- Goal: to improve system and interface design, and improve user training
- Transaction log analysis is extensively used in academia and industry
14TLA Process
- Goals and objectives
- Data collection
- Log preparation
- Data analysis
- Making sense
15Goals and Objectives
- Gain understanding of user interaction with system and interface
- Theoretical modeling and user modeling
- Improve system and interface design, and improve user training
- Examine trends and patterns
16Data Collection
- Process of collecting the interaction data for a given period in a transaction log
- Collect data on the search episode
- User identification
- Date
- Time
- Search session content
- Resources accessed, e.g., URLs
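The fields listed above could be held in a simple record type. The following is a minimal sketch, assuming a hypothetical tab-separated log line of the form `user_id<TAB>date<TAB>time<TAB>query<TAB>url`. Real log formats vary, and the names `LogRecord` and `parse_line` are illustrative, not taken from any particular search engine or logging tool.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LogRecord:
    user_id: str          # user identification (e.g., an anonymous cookie or IP)
    timestamp: datetime   # date and time of the interaction
    query: str            # search session content as entered by the user
    url: Optional[str]    # resource accessed, if any


def parse_line(line: str) -> Optional[LogRecord]:
    """Parse one line of the assumed tab-separated format; None if malformed."""
    parts = line.rstrip("\n").split("\t")
    if len(parts) != 5:
        return None
    user_id, date, time, query, url = parts
    try:
        # Assumed date/time layout; adjust to the actual log format.
        timestamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return LogRecord(user_id, timestamp, query, url or None)
```

Returning `None` for malformed lines keeps corrupted entries out of the data set while leaving the decision about how to count them to the later preparation step.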
17Logging Software
- Custom commercial applications
- WinWhatWhere spy software
- Morae 1.1 software
- Camtasia Studio
18Data Preparation
- Process of cleaning and preparing the log data for analysis
- Loading the log data into a relational database
- Cleaning the log: removing corrupted data
- Parsing the log, e.g., removing Web sessions with over 100 queries
- Normalizing the log
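The cleaning, parsing and normalizing steps above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the studies: records are simplified to `(user_id, query)` pairs, each user identifier is treated as one session, "corrupted" means a missing identifier or empty query, and normalization means lowercasing and collapsing whitespace. Loading into a relational database is left out.

```python
from collections import Counter

MAX_QUERIES_PER_SESSION = 100  # threshold from the slide; larger sessions are dropped


def normalize_query(query):
    """Lowercase and collapse whitespace (one possible normalization)."""
    return " ".join(query.lower().split())


def prepare_log(records):
    """Clean, parse and normalize a list of (user_id, query) pairs."""
    # Cleaning: drop corrupted records (missing user id or empty query).
    cleaned = [(u, q) for u, q in records if u and q and q.strip()]
    # Parsing: remove sessions with over 100 queries.
    counts = Counter(u for u, _ in cleaned)
    kept = [(u, q) for u, q in cleaned if counts[u] <= MAX_QUERIES_PER_SESSION]
    # Normalizing: bring every query into a consistent form.
    return [(u, normalize_query(q)) for u, q in kept]
```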
19Log Analysis Three Levels
20Term Level Analysis
- Term occurrence
- Total terms
- High- and low-usage terms
- Term distribution
- Co-occurring terms
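The term-level measures above reduce to counting. A minimal sketch, assuming that terms are whitespace-separated tokens compared case-insensitively and that co-occurrence means two distinct terms appearing in the same query (both are simplifying assumptions; the function name is illustrative):

```python
from collections import Counter
from itertools import combinations


def term_statistics(queries):
    """Return (term_counts, total_terms, pair_counts) for a list of query strings."""
    term_counts = Counter()
    pair_counts = Counter()
    for query in queries:
        terms = query.lower().split()
        term_counts.update(terms)
        # Count each unordered pair of distinct terms once per query.
        pair_counts.update(combinations(sorted(set(terms)), 2))
    total_terms = sum(term_counts.values())
    return term_counts, total_terms, pair_counts
```

`term_counts.most_common()` then gives the high-usage terms, and the low-usage tail (for example, terms that occur only once) can be read from the same counter to describe the term distribution.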
21Term Distribution
22Terms Per Query 1997-2001
23Queries Per User 1997-2001
24Pages Viewed Per User 1997-2001
25Top 10 Query Terms 1997-2001
26Query Level Analysis
- Initial query
- Subsequent queries
- Modified queries: query reformulation
- Identical queries
- Query complexity
- Boolean use
- Spelling
- Types of queries
- Query topics
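Several of the query-level categories above can be assigned mechanically. The sketch below assumes one session's queries arrive in time order, treats every non-repeated subsequent query as a modification (a simplification, since a user may also start a new topic), and assumes Boolean operators are written in upper case. The operator set and function names are illustrative.

```python
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}  # assumed operator set; engines differ


def classify_session_queries(queries):
    """Label each query in one session as 'initial', 'identical' or 'modified'."""
    labels = []
    seen = set()
    for i, query in enumerate(queries):
        key = " ".join(query.lower().split())  # compare ignoring case and spacing
        if i == 0:
            labels.append("initial")
        elif key in seen:
            labels.append("identical")
        else:
            labels.append("modified")
        seen.add(key)
    return labels


def uses_boolean(query):
    """True if the query contains an upper-case Boolean operator as its own term."""
    return any(term in BOOLEAN_OPERATORS for term in query.split())
```

Spelling, query type and topic classification usually need dictionaries or manual coding and are not covered by this sketch.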
27Query Subjects: Alta Vista 2002 and Vivisimo 2004
Alta Vista 2002 (%)
- 1. People/Places 49.2
- 2. Commerce, etc. 12.5
- 3. Computers, etc. 12.4
- 4. Health/sciences 7.4
- 5. Education/Humanities 5
- 6. Entertainment, etc. 4.5
- 7. Sex/Pornography 3.2
- 8. Society/Culture, etc. 3.1
- 9. Government 1.5
- 10. Performing/Fine Arts 0.6
Vivisimo 2004 (%)
- 1. Commerce, etc. 21
- 2. Indiscernible 19
- 3. People/Places, etc. 15
- 4. Computers/Internet 13
- 5. Social/Culture 9
- 6. Health/Sciences 6
- 7. Education/Humanities 5
- 8. Sex/Pornography 4
- 9. Performing/Fine Arts 3
- 10. Government 3
- 11. Entertainment, etc. 2
28Session Level Analysis
- Duration
- Patterns
- Successive and multitasking sessions
- Page or resource viewing
29Web Session Duration (Minutes)
- 56% of sessions less than 1 minute
- 72% of sessions less than 5 minutes
- 81% of sessions less than 15 minutes
- Mean approx. 58 minutes and 2 seconds
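Figures of this kind can be derived from timestamped log events. A minimal sketch, assuming a session is everything logged under one user identifier and its duration is the time between that user's first and last event (session boundaries are defined differently across studies; the function names are illustrative):

```python
from datetime import datetime


def session_durations(events):
    """Map each user_id to the seconds between its first and last logged event."""
    first, last = {}, {}
    for user_id, ts in events:
        first[user_id] = min(ts, first.get(user_id, ts))
        last[user_id] = max(ts, last.get(user_id, ts))
    return {u: (last[u] - first[u]).total_seconds() for u in first}


def share_under(durations, minutes):
    """Percentage of sessions shorter than the given number of minutes."""
    values = list(durations.values())
    if not values:
        return 0.0
    return 100.0 * sum(d < minutes * 60 for d in values) / len(values)
```

Calling `share_under(durations, 1)`, `share_under(durations, 5)` and `share_under(durations, 15)` yields cumulative shares comparable to the ones listed. A mean far above these thresholds would suggest that a small number of very long sessions dominate the average.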
30Pages Viewed Per User
- 2004: Most users view VERY FEW pages beyond the first or first two pages.
- 14% of users view Web pages for less than 30 seconds
31Log Analysis Methods
- Quantitative and statistical analysis requires software and expertise
- Qualitative analysis requires training
- Creativity factor
- Combination of quantitative and qualitative methods
32TLA Strengths
- Data from a large user base
- Reasonable and non-intrusive
- Less time than other methods
- Can be relatively inexpensive
33TLA Limitations
- TLA does not include user demographic and other data
- Lacks data on search reasons and motivations
- Incomplete data due to corrupted logging
34Relevance Judgments
- User-centered approaches to relevance have led to a better understanding of the user/IR system interaction process
- Studies have addressed the limitations of precision and recall as effective measures of IR performance
- Few studies have offered any new IR evaluation measures
35Relevance Judgment Assumptions
- Relevance research is based on assumptions about user behavior
- Only highly relevant items are important to the user
- Partially relevant items are not important to the user
36Relevance Judgments Distribution
37Relevance Judgment Levels
- Relevant: A judgment that confirms that some relationship by inference exists between the retrieved item and the information problem at hand
- Partially Relevant: A judgment that confirms that some relation by inference exists, but the relationship is weaker than a relevant judgment
- Partially Not Relevant: A judgment that confirms that some non-relation by inference exists, but the relationship is not strong enough to totally reject the relationship as not relevant
- Not Relevant: A judgment that confirms that a relationship by inference does not exist between the retrieved item and the information problem at hand
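The four judgment levels can be treated as an ordinal scale, which makes it easy to see how much a precision figure depends on where the "relevant" cut-off is drawn. The sketch below is an illustration only: the numeric encoding and the threshold-based `precision` function are assumptions, not measures proposed in the talk.

```python
from enum import IntEnum


class Relevance(IntEnum):
    # Ordinal encoding of the four levels (an assumption for illustration).
    NOT_RELEVANT = 0
    PARTIALLY_NOT_RELEVANT = 1
    PARTIALLY_RELEVANT = 2
    RELEVANT = 3


def precision(judgments, threshold):
    """Fraction of retrieved items judged at or above the threshold level."""
    if not judgments:
        return 0.0
    return sum(j >= threshold for j in judgments) / len(judgments)
```

For one item at each level, `precision(judged, Relevance.RELEVANT)` is 0.25, while `precision(judged, Relevance.PARTIALLY_RELEVANT)` is 0.5. Counting only fully relevant items therefore discards the partially relevant judgments entirely.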
38Measuring Search Impact
- IR interaction measures: impact of system interaction on users' information-seeking progress
39One Search Assumption
- One search assumption: users conduct single searches on an information problem
- Search is more complex and holistic
- Search is embedded in human information behaviors
40Search Levels
41Conclusions
- Search analysis is a complex process with many choices
- TLA is a powerful tool
- Requires planning, training and expertise
- Can be combined with other data collection and analysis techniques
42Conclusions
- Search is more complex than the Web single-search, single-query paradigm
- Search context is important
- Search technology is changing; however, many user search characteristics are relatively stable
- How will new search technology (e.g. history, visualization) impact user search behavior?
43Conclusions
- Need for more comparison of Web search engine performance
- Comparison of single versus meta-search engines
- Need for better user-based evaluation measures
- Better usability testing of Web search engine interfaces and techniques
44Conclusions
- How do users coordinate their information behaviors?
- Relation to information seeking stage, domain knowledge, gender or other cognitive variables?
- Model multitasking and dual-tasking behaviors
45Conclusions
- Need for improved search technology
- Need for improved user effort
- Technology is not the complete answer: users need more awareness of the search process and their own human information behavior
- Significant improvement in search will only come from improved user effort
46Further Reading
- Jansen, B. J. (forthcoming). Search log analysis: What is it, what's been done, how to do it. Library and Information Science Research.
- Jansen, B. J., & Spink, A. (2005). How are we searching the Web? A comparison of nine search engine transaction logs. Information Processing and Management, 42(1), 248-263.
- Peters, T. (1993). The history and development of transaction log analysis. Library Hi Tech, 11(2), 41-66.
- Spink, A., & Jansen, B. J. (2004). Web Search: Public Searching of the Web. Springer.
- Spink, A., Jansen, B. J., Wolfram, D., & Saracevic, T. (2002). From e-sex to e-commerce: Web search changes. IEEE Computer, 35(3), 133-135.
- Spink, A., Park, M., & Jansen, B. J. (2006). Multitasking during Web search sessions. Information Processing and Management, 42(1), 264-275.
47QUESTIONS??