Title: Designing and Evaluating Search Interfaces
1Designing and Evaluating Search Interfaces
Prof. Marti Hearst, School of Information, UC Berkeley
2Outline
- Why is Supporting Search Difficult?
- What Works?
- How to Evaluate?
3Why is Supporting Search Difficult?
- Everything is fair game
- Abstractions are difficult to represent
- The vocabulary disconnect
- Users' lack of understanding of the technology
- Clutter vs. Information
4Everything is Fair Game
- The scope of what people search for is all of human knowledge and experience.
- Other interfaces are more constrained (word processing, formulas, etc.)
- Interfaces must accommodate human differences in
- Knowledge / life experience
- Cultural background and expectations
- Reading / scanning ability and style
- Methods of looking for things (pilers vs. filers)
5Abstractions Are Hard to Represent
- Text describes abstract concepts
- Difficult to show the contents of text in a visual or compact manner
- Exercise
- How would you show the preamble of the US Constitution visually?
- How would you show the contents of Joyce's Ulysses visually? How would you distinguish it from Homer's The Odyssey or McCourt's Angela's Ashes?
- The point: it is difficult to show text without using text
6Vocabulary Disconnect
- If you ask a set of people to describe a set of
things there is little overlap in the results.
7 The Vocabulary Problem
- Data sets examined (and number of participants)
- Main verbs used by typists to describe the kinds of edits that they do (48)
- Commands for a hypothetical message decoder computer program (100)
- First word used to describe 50 common objects (337)
- Categories for 64 classified ads (30)
- First keywords for each of a set of recipes (24)
Furnas, Landauer, Gomez, Dumais: The Vocabulary Problem in Human-System Communication. Commun. ACM 30(11): 964-971 (1987)
8 The Vocabulary Problem
- These are really bad results
- If one person assigns the name, the probability of it NOT matching with another person's is about 80%
- What if we pick the most commonly chosen words as the standard? Still not good
Furnas, Landauer, Gomez, Dumais: The Vocabulary Problem in Human-System Communication. Commun. ACM 30(11): 964-971 (1987)
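A rough way to see where a mismatch probability like the one quoted above comes from is to count how often two people pick the same name for the same thing. The sketch below is only illustrative: the response list is made up, and the function names are assumptions, not anything from the Furnas et al. study.

```python
from collections import Counter
from itertools import combinations

def pairwise_agreement(names):
    """Fraction of distinct pairs of people who chose the same name."""
    pairs = list(combinations(names, 2))
    if not pairs:
        return 0.0
    same = sum(1 for a, b in pairs if a == b)
    return same / len(pairs)

def agreement_with_most_common(names):
    """Fraction of people whose name matches the single most popular one."""
    counts = Counter(names)
    return counts.most_common(1)[0][1] / len(names)

# Hypothetical answers from ten people asked to name the same editing action.
responses = ["delete", "remove", "erase", "delete", "cut",
             "kill", "remove", "omit", "delete", "strike"]

mismatch = 1 - pairwise_agreement(responses)
print(f"Chance two people disagree: {mismatch:.2f}")
print(f"Match rate if 'standard' = most common name: "
      f"{agreement_with_most_common(responses):.2f}")
```

Even with the most popular word adopted as the standard, most people in this toy sample would still type something else.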
9Lack of Technical Understanding
- Most people don't understand the underlying methods by which search engines work.
10People Don't Understand Search Technology
- A study of 100 randomly-chosen people found
- 14 never type a URL directly into the address bar
- Several tried to use the address bar, but did it wrong
- Put spaces between words
- Combinations of dots and spaces
- "nursing spectrum.com", "consumer reports.com"
- Several use search form with no spaces
- "plumberslocal9", "capitalhealthsystem"
- People do not understand the use of quotes
- Only 16 use quotes
- Of these, some use them incorrectly
- Around all of the words, making results too restrictive
- "lactose intolerance -recipies"
- Here the - excludes the recipes
- People don't make use of advanced features
- Only 1 used "find in page"
- Only 2 used Google cache
Hargittai, Classifying and Coding Online Actions, Social Science Computer Review 22(2), 2004, 210-227.
11People Don't Understand Search Technology
- Without appropriate explanations, most of 14 people had strong misconceptions about
- ANDing vs ORing of search terms
- Some assumed ANDing search engine indexed a smaller collection; most had no explanation at all
- For empty results for query "to be or not to be"
- 9 of 14 could not explain in a method that remotely resembled stop word removal
- For term order variation "boat fire" vs. "fire boat"
- Only 5 out of 14 expected different results
- Understanding was vague, e.g.
- "Lycos separates the two words and searches for the meaning, instead of what're your looking for. Google understands the meaning of the phrase."
Muramatsu and Pratt, Transparent Queries: Investigating Users' Mental Models of Search Engines, SIGIR 2001.
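The behaviors that confused participants can be reproduced with a very small toy engine. This is a sketch under stated assumptions: the stop word list, the three documents, and the `search` function are invented for illustration and do not describe how any real search engine of the time was implemented.

```python
# Illustrative stop word list; real engines used their own lists.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of"}

# Hypothetical three-document collection.
DOCS = {
    1: "fire boat crews fight a fire on the river",
    2: "a boat fire closed the marina",
    3: "to be or not to be",
}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def search(query, mode="and"):
    """Return ids of matching documents using bag-of-words AND or OR."""
    terms = set(tokenize(query))
    if not terms:
        # Every query word was a stop word, so nothing is left to match.
        return []
    hits = []
    for doc_id, text in DOCS.items():
        words = set(tokenize(text))
        matched = terms <= words if mode == "and" else bool(terms & words)
        if matched:
            hits.append(doc_id)
    return hits

print(search("to be or not to be"))        # empty: all terms removed
print(search("boat fire"), search("fire boat"))  # word order is ignored here
print(search("boat marina", "and"), search("boat marina", "or"))
```

In this bag-of-words sketch, term order makes no difference, ANDing narrows the result set relative to ORing, and a query made entirely of stop words returns nothing.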
12What Works?
13Cool Doesn't Cut It
- It's very difficult to design a search interface that users prefer over the standard
- Some ideas have a strong WOW factor
- Examples
- Kartoo
- Groxis
- Hyperbolic tree
- But they don't pass the "will you use it" test
- Even some simpler ideas fall by the wayside
- Example
- Visual ranking indicators for results set listings
14Early Visual Rank Indicators
17Metadata Matters
- When used correctly, text to describe text, images, video, etc. works well
- Searchers often turn into browsers with appropriate links
- However, metadata has many perils
- The Kosher Recipe Incident
18Small Details Matter
- UIs for search especially require great care in small details
- In part due to the text-heavy nature of search
- A tension between more information and introducing clutter
- How and where to place things is important
- People tend to scan or skim
- Only a small percentage reads instructions
19Small Details Matter
- UIs for search especially require endless tiny adjustments
- In part due to the text-heavy nature of search
- Example
- In an earlier version of the Google Spellchecker, people didn't always see the suggested correction
- Used a long sentence at the top of the page
- "If you didn't find what you were looking for"
- People complained they got results, but not the right results.
- In reality, the spellchecker had suggested an appropriate correction.
Interview with Marissa Mayer by Mark Hurst, http://www.goodexperience.com/columns/02/1015google.html
20Small Details Matter
- The fix
- Analyzed logs, saw people didn't see the correction
- clicked on first search result,
- didn't find what they were looking for (came right back to the search page)
- scrolled to the bottom of the page, did not find anything
- and then complained directly to Google
- Solution was to repeat the spelling suggestion at the bottom of the page.
- More adjustments
- The message is shorter, and different on the top vs. the bottom
Interview with Marissa Mayer by Mark Hurst, http://www.goodexperience.com/columns/02/1015google.html
22Small Details Matter
- Layout, font, and whitespace for information-centric interfaces require very careful design
- Example
- Photo thumbnails
- Search results summaries
23What Works for Search Interfaces?
- Query term highlighting
- in results listings
- in retrieved documents
- Term Suggestions (if done right)
- Sorting of search results according to important criteria (date, author)
- Grouping of results according to well-organized category labels (see Flamenco)
- DWIM only if highly accurate
- Spelling correction/suggestions
- Simple relevance feedback (more-like-this)
- Certain types of term expansion
- So far: not really visualization
Hearst et al., Finding the Flow in Web Site Search, CACM 45(9), 2002.
24Highlighting Query Terms
- Boldface or color
- Adjacency of terms with relevant context is a
useful cue.
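Query term highlighting in a results listing can be sketched in a few lines. This is a minimal illustration, assuming HTML bold tags as the highlight style and whole-word, case-insensitive matching; the `highlight` function and its parameters are hypothetical names, not part of any system discussed here.

```python
import re

def highlight(snippet, query, open_tag="<b>", close_tag="</b>"):
    """Wrap each whole-word occurrence of a query term in highlight tags."""
    terms = [re.escape(t) for t in query.split() if t]
    if not terms:
        return snippet
    # Try longer terms first so a short term never pre-empts a longer one.
    terms.sort(key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    # Keep the original casing of the matched text inside the tags.
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, snippet)

print(highlight("Microsoft blames blackout on US grid", "microsoft blackout"))
```

Because the surrounding words are left in place, the reader still sees the terms in context, which is the cue the slide describes.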
27Highlighted query term hits using Google toolbar
[Screenshot; labels visible: Microsoft, US, Blackout, PGA]
28How to Introduce New Features?
- Example: Yahoo shortcuts
- Search engines now provide groups of enriched content
- Automatically infer related information, such as sports statistics
- Accessed via keywords
- User can quickly specify very specific information
- united 570 (flight arrival time)
- map san francisco
- We're heading back to command languages!
33Introducing New Features
- A general technique: scaffolding
- Scaffolding
- Facilitate a student's ability to build on prior knowledge and internalize new information.
- The activities provided in scaffolding instruction are just beyond the level of what the learner can do already.
- Learning the new concept moves the learner up one step on the conceptual ladder
34Scaffolding Example
- The problem: how do people learn about these fantastic but unknown options?
- Example: scaffolding the definition function
- Where to put a suggestion for a definition?
- Google used to simply hyperlink it next to the statistics for the word.
- Now a hint appears to alert people to the feature.
35Unlikely to notice the function here
36Scaffolding to teach what is available
37Query Term Suggestions
38Query Reformulation
- Query reformulation
- After receiving unsuccessful results, users modify their initial queries and submit new ones intended to more accurately reflect their information needs.
- Web search logs show that searchers often reformulate their queries
- A study of 985 Web user search sessions found
- 33% went beyond the first query
- Of these, 35% retained the same number of terms while 19% had 1 more term and 16% had 1 fewer
Spink, Jansen and Ozmutlu, Use of query reformulation and relevance feedback by Excite users, Internet Research 10(4), 2001
39Query Reformulation
- Many studies show that if users engage in relevance feedback, the results are much better.
- In one study, participants did 17-34% better with RF
- They also did better if they could see the RF terms than if the system did it automatically (DWIM)
- But the effort required for doing so is usually a roadblock.
- Before the web and in most research, searchers have to select MANY relevant documents or MANY terms.
Koenemann and Belkin, A Case for Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness, CHI'96
40Query Reformulation
- What happens when the web search engine suggests new terms?
- Web log analysis study using the Prisma term suggestion system
Anick, Using Terminological Feedback for Web Search Refinement: A Log-based Study, SIGIR'03.
41Query Reformulation Study
- Feedback terms were displayed to 15,133 user sessions.
- Of these, 14% used at least one feedback term
- For all sessions, 56% involved some degree of query refinement
- Within this subset, use of the feedback terms was 25%
- By user id, 16% of users applied feedback terms at least once on any given day
- Looking at a 2-week session of feedback users
- Of the 2,318 users who used it once, 47% used it again in the same 2-week window.
- Comparison was also done to a baseline group that was not offered feedback terms.
- Both groups ended up making a page-selection click at the same rate.
Anick, Using Terminological Feedback for Web Search Refinement: A Log-based Study, SIGIR'03.
42Query Reformulation Study
Anick, Using Terminological Feedback for Web Search Refinement: A Log-based Study, SIGIR'03.
43Query Reformulation Study
- Other observations
- Users prefer refinements that contain the initial query terms
- Presentation order does have an influence on term uptake
Anick, Using Terminological Feedback for Web Search Refinement: A Log-based Study, SIGIR'03.
44Query Reformulation Study
Anick, Using Terminological Feedback for Web Search Refinement: A Log-based Study, SIGIR'03.
45Prognosis: Query Reformulation
- Researchers have always known it can be helpful, but the methods proposed for user interaction were too cumbersome
- Had to select many documents and then do feedback
- Had to select many terms
- Was based on statistical ranking methods which are hard for people to understand
- RF is promising for web-based searching
- The dominance of AND-based searching makes it easier to understand the effects of RF
- Automated systems built on the assumption that the user will only add one term now work reasonably well
- This kind of interface is simple
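The "add one term to an AND query" style of refinement can be sketched as follows. This is a toy illustration only: the five documents are invented, and ranking suggestions by simple co-occurrence counts is an assumption made for the example, not the method used by Prisma or any other system cited here.

```python
from collections import Counter

# Hypothetical mini-collection; each string is one document.
DOCS = [
    "jaguar car dealer prices",
    "jaguar car review",
    "jaguar cat habitat rainforest",
    "jaguar cat diet",
    "used car prices",
]

def and_search(terms, docs=DOCS):
    """Return the documents that contain every query term."""
    wanted = set(terms)
    return [d for d in docs if wanted <= set(d.split())]

def suggest_terms(terms, docs=DOCS, k=3):
    """Suggest up to k terms that co-occur most often with the current results."""
    counts = Counter()
    for d in and_search(terms, docs):
        counts.update(set(d.split()) - set(terms))
    # Sort by descending count, then alphabetically, for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

query = ["jaguar"]
print(len(and_search(query)), suggest_terms(query, k=2))
# Accepting one suggestion narrows the AND result set in a predictable way.
print(and_search(query + ["cat"]))
```

With AND semantics, the user can see exactly what one added term did: the result list shrinks to the documents that also contain it.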
46Supporting the Search Process
- We should differentiate among searching
- The Web
- Personal information
- Large collections of like information
- Different cues useful for each
- Different interfaces needed
- Examples
- The Stuff I've Seen Project
- The Flamenco Project
47The Stuff I've Seen project
- Did intense studies of how people work
- Used the results to design an integrated search framework
- Did extensive evaluations of alternative designs
- The following slides are modifications of ones supplied by Sue Dumais, reproduced with permission.
Dumais, Cutrell, Cadiz, Jancke, Sarin and Robbins, Stuff I've Seen: A system for personal information retrieval and re-use. SIGIR 2003.
48Searching Over Personal Information
- Many locations, interfaces for finding things
(e.g., web, mail, local files, help, history,
notes)
Slide adapted from Sue Dumais.
49The Stuff I've Seen project
- Unified index of items touched recently by user
- All types of information, e.g., files of all types, email, calendar, contacts, web pages, etc.
- Full-text index of content plus metadata attributes (e.g., creation time, author, title, size)
- Automatic and immediate update of index
- Rich UI possibilities, since it's your content
- Search only over things already seen
- Re-use vs. initial discovery
Slide adapted from Sue Dumais.
50SIS Interface
Slide adapted from Sue Dumais
51Search With SIS
Slide adapted from Sue Dumais
52Evaluating SIS
- Internal deployment
- 1500 downloads
- Users include program management, test, sales, development, administrative, executives, etc.
- Research techniques
- Free-form feedback
- Questionnaires; structured interviews
- Usage patterns from log data
- UI experiments (randomly deploy different versions)
- Lab studies for richer UI (e.g., timeline, trends)
- But even here must work with users' own content
Slide adapted from Sue Dumais
53SIS Usage Data
- Detailed analysis for 234 people, 6 weeks usage
- Personal store characteristics
- 5k-100k items; index <150 meg
- Query characteristics
- Short queries (1.59 words)
- Few advanced operators or fielded search in query box (7.5%)
- Frequent use of query iteration (48%)
- 50% refined queries involve filters; type, date most common
- 35% refined queries involve changes to query
- 13% refined queries involve re-sort
- Query content
- Importance of people
- 29% of the queries involve people's names
Slide adapted from Sue Dumais
54SIS Usage Data, cont'd
- Characteristics of items opened
- File types opened
- 76% Email
- 14% Web pages
- 10% Files
- Age of items opened
- 7% today
- 22% within the last week
- 46% within the last month
- Ease of finding information
- Easier after SIS for web, email, files
- Non-SIS search decreases for web, email, files
log(Freq) = -0.68 log(DaysSinceSeen) + 2.02
Slide adapted from Sue Dumais
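The fitted line on the slide above describes a power-law drop in how often items are opened as they age. The sketch below evaluates it under two assumptions that the transcript does not confirm: that the relation reads log(Freq) = -0.68 log(DaysSinceSeen) + 2.02 (the operators were lost in extraction), and that the logarithms are base 10. The function names are hypothetical.

```python
import math

SLOPE = -0.68      # from the slide
INTERCEPT = 2.02   # from the slide; its meaning depends on the assumed log base

def predicted_log_freq(days_since_seen):
    """Predicted log10 frequency of opening an item last seen this many days ago."""
    return SLOPE * math.log10(days_since_seen) + INTERCEPT

def predicted_freq(days_since_seen):
    """Predicted frequency on the original (non-log) scale."""
    return 10 ** predicted_log_freq(days_since_seen)

for days in (1, 7, 30, 365):
    print(f"{days:>3} days: {predicted_freq(days):.1f}")

# The ratio between two ages depends only on the slope, not on the log base.
print(f"1-day vs 10-day ratio: {predicted_freq(1) / predicted_freq(10):.2f}")
```

Whatever the base, the slope implies that items seen ten times longer ago are opened roughly 10^0.68 (a bit under five) times less often.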
55SIS Usage, cont'd
- UI Usage
- Small effects of Top/Side, Previews
- Sort order
- Date by far the most common sort field, even for people who had Okapi Rank as default
- Importance of time
- Few searches for best match; many other criteria
[Chart: Number of Queries Issued]
Slide adapted from Sue Dumais
56Web Sites and Collections
- A report by Forrester Research in 2001 showed that while 76% of firms rated search as extremely important, only 24% consider their Web site's search to be extremely useful.
Johnson, K., Manning, H., Hagen, P.R., and Dorsey, M. Specialize Your Site's Search. Forrester Research, (Dec. 2001), Cambridge, MA
www.forrester.com/ER/Research/Report/Summary/0,1338,13322,00
57There are many ways to do it wrong
- Examples
- Melvyl online catalog
- no way to browse enormous category listings
- Audible.com, BooksOnTape.com, and BrillianceAudio
- no way to browse a given category and simultaneously select unabridged versions
- Amazon.com
- has finally gotten browsing over multiple kinds of features working; this is a recent development
- but still restricted on what can be added into the query
72The Flamenco Project
- Incorporating Faceted Hierarchical Metadata into Interfaces for Large Collections
- Key Goals
- Support integrated browsing and keyword search
- Provide an experience of browsing the shelves
- Add power and flexibility without introducing confusion or a feeling of clutter
- Allow users to take the path most natural to them
- Method
- User-centered design, including needs assessment and many iterations of design and testing
Yee, Swearingen, Li, Hearst, Faceted Metadata for Image Search and Browsing, Proceedings of CHI 2003.
73Some Challenges
- Users don't like new search interfaces.
- How to show lots more information without overwhelming or confusing?
- Our approach
- Integrate the search seamlessly into the information architecture.
- Use proper HCI methodologies.
- Use faceted metadata
74The Flamenco Interface
- Hierarchical facets
- Chess metaphor
- Opening
- Middle game
- End game
- Tightly Integrated Search
- Expand as well as Refine
- Intermediate pages for large categories
- For this design, small details really matter
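The refine, preview, and expand operations behind a faceted interface can be sketched with plain dictionaries. This is a simplified, flat (non-hierarchical) illustration: the items, facet names, and function names are all hypothetical and are not taken from the Flamenco code.

```python
from collections import Counter

# Hypothetical image records, each tagged with a value for three facets.
ITEMS = [
    {"title": "Harbor at Dusk", "media": "woodcut", "location": "US", "decade": "1930s"},
    {"title": "Mill Town", "media": "woodcut", "location": "US", "decade": "1940s"},
    {"title": "Bridge Study", "media": "etching", "location": "US", "decade": "1930s"},
    {"title": "Winter Shrine", "media": "woodcut", "location": "Japan", "decade": "1930s"},
    {"title": "Dust Storm", "media": "woodcut", "location": "US", "decade": "1930s"},
]

def refine(items, selections):
    """Keep only the items matching every selected facet value."""
    return [it for it in items
            if all(it.get(facet) == value for facet, value in selections.items())]

def query_previews(items, facet):
    """Count how many of the current items fall under each value of a facet."""
    return Counter(it[facet] for it in items if facet in it)

def expand(selections, facet):
    """Drop one facet constraint, widening the result set."""
    return {f: v for f, v in selections.items() if f != facet}

selections = {"media": "woodcut", "location": "US"}
current = refine(ITEMS, selections)
print(len(current), dict(query_previews(current, "decade")))

wider = refine(ITEMS, expand(selections, "location"))
print(len(wider))
```

Showing the counts before the user clicks is what lets them avoid empty result sets, and `expand` corresponds to the "Expand as well as Refine" point above.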
84What is Tricky About This?
- It is easy to do it poorly
- Yahoo directory structure
- It is hard to be not overwhelming
- Most users prefer simplicity unless complexity really makes a difference
- It is hard to make it flow
- Can it feel like browsing the shelves?
85Using HCI Methodology
- Identify Target Population
- Architects, city planners
- Needs assessment.
- Interviewed architects and conducted contextual inquiries.
- Lo-fi prototyping.
- Showed paper prototype to 3 professional architects.
- Design / Study Round 1.
- Simple interactive version. Users liked metadata idea.
- Design / Study Round 2
- Developed 4 different detailed versions; evaluated with 11 architects; results somewhat positive but many problems identified. Matrix emerged as a good idea.
- Metadata revision.
- Compressed and simplified the metadata hierarchies
86Using HCI Methodology
- Design / Study Round 3.
- New version based on results of Round 2
- Highly positive user response
- Identified new user population/collection
- Students and scholars of art history
- Fine arts images
- Study Round 4
- Compare the metadata system to a strong,
representative baseline
87Most Recent Usability Study
- Participants and Collection
- 32 Art History Students
- 35,000 images from SF Fine Arts Museum
- Study Design
- Within-subjects
- Each participant sees both interfaces
- Balanced in terms of order and tasks
- Participants assess each interface after use
- Afterwards they compare them directly
- Data recorded in behavior logs, server logs, paper surveys; one or two experienced testers at each trial.
- Used 9 point Likert scales.
- Session took about 1.5 hours; pay was $15/hour
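Balancing a within-subjects study "in terms of order and tasks" is usually done by rotating participants through every combination of interface order and task-set order. The sketch below shows one way to do that; the task-set labels "A" and "B" and the round-robin assignment are assumptions for illustration, since the slides do not say how the actual assignment was made.

```python
from collections import Counter
from itertools import cycle, permutations, product

def counterbalance(participant_ids, interfaces=("Baseline", "FC"),
                   task_sets=("A", "B")):
    """Assign each participant an interface order and a task-set order.

    Conditions are every pairing of an interface order with a task-set
    order; participants are dealt across them round-robin.
    """
    conditions = list(product(permutations(interfaces), permutations(task_sets)))
    return {pid: cond for pid, cond in zip(participant_ids, cycle(conditions))}

plan = counterbalance(range(1, 33))  # 32 participants, as in the study
for condition, n in Counter(plan.values()).items():
    print(condition, n)
```

With two interfaces and two task sets there are four conditions, so 32 participants divide evenly and any order or task effect is spread equally across both interfaces.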
88The Baseline System
- Floogle
- Take the best of the existing keyword-based image
search systems
89Comparison of Common Image Search Systems
System      Collection         Results/page   Categories?   Familiar
Google      Web                20             No            27
AltaVista   Web                15             No            8
Corbis      Photos             9-36           No            8
Getty       Photos, Art        12-90          Yes           6
MS Office   Photos, Clip art   6-100          Yes           N/A
Thinker     Fine arts images   10             Yes           4
BASELINE    Fine arts images   40             Yes           N/A
90[Screenshots: query "sword"]
94Evaluation Quandary
- How to assess the success of browsing?
- Timing is usually not a good indicator
- People often spend longer when browsing is going well.
- Not the case for directed search
- Can look for comprehensiveness and correctness (precision and recall)
- But subjective measures seem to be most important here.
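For reference, precision and recall over a set of retrieved items can be computed as below. This is a generic sketch with made-up item ids; the function name is an assumption and the example does not reproduce any figures from the study.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Correctness: what share of what the participant found is relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Comprehensiveness: what share of the relevant items was found.
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical task: 5 relevant images exist, the participant collected 4.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7])
print(f"precision={p:.2f} recall={r:.2f}")
```

A participant "found all relevant results" exactly when recall is 1.0, which is the form of the recall measure reported later in the results.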
95Hypotheses
- We attempted to design tasks to test the following hypotheses
- Participants will experience greater search satisfaction, feel greater confidence in the results, produce higher recall, and encounter fewer dead ends using FC over Baseline
- FC will be perceived to be more useful and flexible than Baseline
- Participants will feel more familiar with the contents of the collection after using FC
- Participants will use FC to create multi-faceted queries
96Four Types of Tasks
- Unstructured (3): Search for images of interest
- Structured Task (11-14): Gather materials for an art history essay on a given topic, e.g.
- Find all woodcuts created in the US
- Choose the decade with the most
- Select one of the artists in this period and show all of their woodcuts
- Choose a subject depicted in these works and find another artist who treated the same subject in a different way.
- Structured Task (10): compare related images
- Find images by artists from 2 different countries that depict conflict between groups.
- Unstructured (5): search for images of interest
97Other Points
- Participants were NOT walked through the interfaces.
- The wording of Task 2 reflected the metadata; not the case for Task 3
- Within tasks, queries were not different in difficulty (t's < 1.7, p > 0.05 according to post-task questions)
- Flamenco is an order of magnitude slower than Floogle on average.
- In task 2 users were allowed 3 more minutes in FC than in Baseline.
- Time spent in tasks 2 and 3 was significantly longer in FC (about 2 min more).
98Results
- Participants felt significantly more confident they had found all relevant images using FC (Task 2: t(62) = 2.18, p < .05; Task 3: t(62) = 2.03, p < .05)
- Participants felt significantly more satisfied with the results (Task 2: t(62) = 3.78, p < .001; Task 3: t(62) = 2.03, p < .05)
- Recall scores
- Task 2a: In Baseline 57% of participants found all relevant results, in FC 81% found all.
- Task 2b: In Baseline 21% found all relevant, in FC 77% found all.
99Post-Interface Assessments
All significant at p < .05 except simple and overwhelming
100Perceived Uses of Interfaces
[Chart; legend: Baseline, FC]
101Post-Test Comparison
[Chart; legend: FC, Baseline]
Which Interface Preferable For
- Find images of roses
- Find all works from a given period
- Find pictures by 2 artists in same media
Overall Assessment
- More useful for your tasks
- Easiest to use
- Most flexible
- More likely to result in dead ends
- Helped you learn more
- Overall preference
102Facet Usage
- Facets driven largely by task content
- Multiple facets 45% of time in structured tasks
- For unstructured tasks,
- Artists (17%)
- Date (15%)
- Location (15%)
- Others ranged from 5-12%
- Multiple facets 19% of time
- From end game, expansion from
- Artists (39%)
- Media (29%)
- Shapes (19%)
103Qualitative Observations
- Baseline
- Simplicity, similarity to Google a plus
- Also noted the usefulness of the category links
- FC
- Starting page well-organized, gave ideas for what to search for
- Query previews were commented on explicitly by 9 participants
- Commented on matrix prompting where to go next
- 3 were confused about what the matrix shows
- Generally liked the grouping and organizing
- End game links seemed useful; 9 explicitly remarked positively on the guidance provided there.
- Often get requests to use the system in future
104Study Results Summary
- Overwhelmingly positive results for the faceted metadata interface.
- Somewhat heavy use of multiple facets.
- Strong preference over the current state of the art.
- This result not seen in similarity-based image search interfaces.
- Hypotheses are supported.
105Summary
- Usability studies done on 3 collections
- Recipes: 13,000 items
- Architecture Images: 40,000 items
- Fine Arts Images: 35,000 items
- Conclusions
- Users like and are successful with the dynamic faceted hierarchical metadata, especially for browsing tasks
- Very positive results, in contrast with studies on earlier iterations
- Note: it seems you have to care about the contents of the collection to like the interface
106Using DWIM
- DWIM: Do What I Mean
- Refers to systems that try to be smart by guessing users' unstated intentions or desires
- Examples
- Automatically augment my query with related terms
- Automatically suggest spelling corrections
- Automatically load web pages that might be relevant to the one I'm looking at
- Automatically file my incoming email into folders
- Pop up a paperclip that tells me what kind of help I need.
- THE CRITICAL POINT
- Users love DWIM when it really works
- Users DESPISE it when it doesn't
- unless not very intrusive
107DWIM that Works
- Amazon's "customers who bought X also bought Y"
- And many other recommendation-related features
108DWIM Example: Spelling Correction/Suggestion
- Google's spelling suggestions are highly accurate
- But this wasn't always the case.
- Google introduced a version that wasn't very accurate. People hated it. They pulled it. (According to a talk by Marissa Mayer of Google.)
- Later they introduced a version that worked well. People love it.
- But don't get too pushy.
- For a while if the user got very few results, the page was automatically replaced with the results of the spelling correction
- This was removed, presumably due to negative responses
Information from a talk by Marissa Mayer of Google
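The "only if highly accurate, and not too pushy" rule can be sketched as a suggestion function that stays silent unless it finds a close match, and that only ever offers the correction rather than applying it. This uses the standard library's `difflib.get_close_matches`; the vocabulary and the 0.8 cutoff are assumptions chosen for the example and have nothing to do with how Google's spelling system works.

```python
import difflib

# Hypothetical vocabulary of known-good query words.
VOCABULARY = ["recipes", "lactose", "intolerance", "search", "interface"]

def suggest_spelling(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return a suggested correction, or None when unsure or not needed.

    A high cutoff trades coverage for accuracy: it is better to say
    nothing than to offer a wrong guess.
    """
    if word in vocabulary:
        return None
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for query_word in ("recipies", "search", "xyzzy"):
    suggestion = suggest_spelling(query_word)
    if suggestion:
        # Offer the correction as a link; do not replace the user's results.
        print(f"Did you mean: {suggestion}")
    else:
        print(f"(no suggestion for {query_word!r})")
```

Keeping the suggestion as an optional link mirrors the less intrusive behavior described above, where automatic replacement of the results page was withdrawn.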
109What We've Covered
- Introduction
- Why is designing for search difficult?
- How to Design for Search
- HCI and iterative design
- What works?
- Small details matter
- Scaffolding
- The Role of DWIM
- Core Problems
- Query specification and refinement
- Browsing and searching collections
110Final Words
- User interfaces for search remain a fascinating and challenging field
- Search has taken a primary role in the web and internet business
- Thus, we can continue to expect fascinating developments, and maybe some breakthroughs, in the next few years!
111Thank you!
- Marti Hearst
- http://www.ischool.berkeley.edu/hearst
112References
- Anick, Using Terminological Feedback for Web Search Refinement: A Log-based Study, SIGIR'03.
- Bates, The Berry-Picking Search: UI Design, in User Interface Design, Thimbleby (Ed.), Addison-Wesley 1990
- Chen, Houston, Sewell, and Schatz, JASIS 49(7)
- Chen and Yu, Empirical studies of information visualization: a meta-analysis, IJHCS 53(5), 2000
- Dumais, Cutrell, Cadiz, Jancke, Sarin and Robbins, Stuff I've Seen: A system for personal information retrieval and re-use. SIGIR 2003.
- Furnas, Landauer, Gomez, Dumais: The Vocabulary Problem in Human-System Communication. Commun. ACM 30(11): 964-971 (1987)
- Hargittai, Classifying and Coding Online Actions, Social Science Computer Review 22(2), 2004, 210-227.
- Hearst, English, Sinha, Swearingen, Yee. Finding the Flow in Web Site Search, CACM 45(9), 2002.
- Hearst, User Interfaces and Visualization, Chapter 10 of Modern Information Retrieval, Baeza-Yates and Ribeiro-Neto (Eds), Addison-Wesley 1999.
- Johnson, Manning, Hagen, and Dorsey. Specialize Your Site's Search. Forrester Research, (Dec. 2001), Cambridge, MA
113References
- Koenemann and Belkin, A Case for Interaction: A Study of Interactive Information Retrieval Behavior and Effectiveness, CHI'96
- Marissa Mayer Interview by Mark Hurst, http://www.goodexperience.com/columns/02/1015google.html
- Muramatsu and Pratt, Transparent Queries: Investigating Users' Mental Models of Search Engines, SIGIR 2001.
- O'Day and Jeffries, Orienteering in an information landscape: how information seekers get from here to there, Proceedings of InterCHI '93.
- Rose and Levinson, Understanding User Goals in Web Search, Proceedings of WWW'04
- Russell, Stefik, Pirolli, Card, The Cost Structure of Sensemaking, Proceedings of InterCHI '93.
- Sebrechts, Cugini, Laskowski, Vasilakis and Miller, Visualization of search results: a comparative evaluation of text, 2D, and 3D interfaces, SIGIR '99.
- Swan and Allan, Aspect windows, 3-D visualizations, and indirect comparisons of information retrieval systems, SIGIR 1998.
- Spink, Jansen and Ozmutlu, Use of query reformulation and relevance feedback by Excite users, Internet Research 10(4), 2001
- Yee, Swearingen, Li, Hearst, Faceted Metadata for Image Search and Browsing, Proceedings of CHI 2003