1
SIMS 247 Information Visualization and Presentation
  • Marti Hearst
  • March 15, 2002

2
Outline
  • Why Text is Tough
  • Visualizing Concept Spaces
  • Clusters
  • Category Hierarchies
  • Visualizing Query Specifications
  • Visualizing Retrieval Results
  • Usability Study Meta-Analysis

3
Why Visualize Text?
  • To help with Information Retrieval
  • give an overview of a collection
  • show user what aspects of their interests are
    present in a collection
  • help user understand why documents were retrieved
    in response to a query
  • Text Data Mining
  • Mainly clustering and nodes-and-links displays
  • Software Engineering
  • not really text, but has some similar properties

4
Why Text is Tough
  • Text is not pre-attentive
  • Text consists of abstract concepts
  • which are difficult to visualize
  • Text represents similar concepts in many
    different ways
  • space ship, flying saucer, UFO, figment of
    imagination
  • Text has very high dimensionality
  • Tens or hundreds of thousands of features
  • Many subsets can be combined together

5
Why Text is Tough
The Dog.
6
Why Text is Tough
The Dog.
The dog cavorts.
The dog cavorted.
7
Why Text is Tough
The man.
The man walks.
8
Why Text is Tough
The man walks the cavorting dog.
So far, we can sort of show this in pictures.
9
Why Text is Tough
As the man walks the cavorting dog,
thoughts arrive unbidden of the previous spring,
so unlike this one, in which walking was marching
and dogs were baleful sentinals outside unjust
halls.
How do we visualize this?
10
Why Text is Tough
  • Abstract concepts are difficult to visualize
  • Combinations of abstract concepts are even more
    difficult to visualize
  • time
  • shades of meaning
  • social and psychological concepts
  • causal relationships

11
Why Text is Tough
  • Language only hints at meaning
  • Most meaning of text lies within our minds and
    common understanding
  • "How much is that doggy in the window?"
  • "how much" implies a social system of barter and
    trade (not the size of the dog)
  • "doggy" implies childlike, plaintive, probably
    cannot do the purchasing on their own
  • "in the window" implies behind a store window,
    not really inside a window; requires the notion
    of window shopping

12
Why Text is Tough
  • General categories have no standard ordering
    (nominal data)
  • Categorization of documents by single topics
    misses important distinctions
  • Consider an article about
  • NAFTA
  • The effects of NAFTA on truck manufacture
  • The effects of NAFTA on productivity of truck
    manufacture in the neighboring cities of El Paso
    and Juarez

13
Why Text is Tough
  • Other issues about language
  • ambiguous (many different meanings for the same
    words and phrases)
  • different combinations imply different meanings

14
Why Text is Tough
  • I saw Pathfinder on Mars with a telescope.
  • Pathfinder photographed Mars.
  • The Pathfinder photograph mars our perception of
    a lifeless planet.
  • The Pathfinder photograph from Ford has arrived.
  • The Pathfinder forded the river without marring
    its paint job.

15
Why Text is Easy
  • Text is highly redundant
  • When you have lots of it
  • Pretty much any simple technique can pull out
    phrases that seem to characterize a document
  • Instant summary
  • Extract the most frequent words from a text
  • Remove the most common English words
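
A minimal sketch of this "instant summary" recipe; the stopword list here is a tiny illustrative stand-in, not a real system's list. Run over a large, familiar text it produces lists much like the one on the next slide.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real system would use a longer one.
STOPWORDS = {"the", "and", "of", "to", "a", "in", "that", "he", "his",
             "it", "was", "is", "for", "with", "as", "i", "you", "be"}

def instant_summary(text, k=15):
    """Return the k most frequent words, minus common English words."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(k)
```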

16
Guess the Text
  • 478 said
  • 233 god
  • 201 father
  • 187 land
  • 181 jacob
  • 160 son
  • 157 joseph
  • 134 abraham
  • 121 earth
  • 119 man
  • 118 behold
  • 113 years
  • 104 wife
  • 101 name
  • 94 pharaoh

17
Text Collection Overviews
  • How can we show an overview of the contents of a
    text collection?
  • Show info external to the docs
  • e.g., date, author, source, number of inlinks
  • does not show what they are about
  • Show the meanings or topics in the docs
  • a list of titles
  • results of clustering words or documents
  • organize according to categories (next time)

18
Clustering for Collection Overviews
  • Scatter/Gather
  • show main themes as groups of text summaries
  • Scatter Plots
  • show docs as points; closeness indicates
    nearness in cluster space
  • show main themes of docs as visual clumps or
    mountains
  • Kohonen Feature maps
  • show main themes as adjacent polygons
  • BEAD
  • show main themes as links within a force-directed
    placement network

19
Clustering for Collection Overviews
  • Two main steps
  • cluster the documents according to the words they
    have in common
  • map the cluster representation onto a
    (interactive) 2D or 3D representation
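
Those two steps map directly onto standard tools. A sketch using scikit-learn (an assumption for illustration; the systems above each used their own algorithms):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD

def cluster_and_project(docs, n_clusters=5):
    # Step 1: cluster documents by the words they have in common.
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X)
    # Step 2: map the high-dimensional vectors onto 2D for display.
    coords = TruncatedSVD(n_components=2, random_state=0).fit_transform(X)
    return labels, coords  # scatter-plot coords, colored by cluster label
```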

20
Text Clustering
  • Finds overall similarities among groups of
    documents
  • Finds overall similarities among groups of tokens
  • Picks out some themes, ignores others

21
Scatter/Gather
22
S/G Example: query on "star"
  • Encyclopedia text
  • 14 sports
  • 8 symbols
  • 47 film, tv
  • 68 film, tv (p)
  • 7 music
  • 97 astrophysics
  • 67 astronomy (p)
  • 12 stellar phenomena
  • 10 flora/fauna
  • 49 galaxies, stars
  • 29 constellations
  • 7 miscellaneous
  • Clustering and re-clustering is entirely
    automated

23
Scatter/Gather
  • Cutting, Pedersen, Tukey & Karger 92, 93;
    Hearst & Pedersen 95
  • How it works
  • Cluster sets of documents into general themes,
    like a table of contents
  • Display the contents of the clusters by showing
    topical terms and typical titles
  • User chooses subsets of the clusters and
    re-clusters the documents within
  • Resulting new groups have different themes
  • Originally used to give collection overview
  • Evidence suggests more appropriate for displaying
    retrieval results in context
  • Appearing (sort-of) in commercial systems
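
A rough sketch of that interaction loop. Cluster labels here are just frequent terms, and `cluster_fn` is any routine that splits documents into k groups (e.g., wrapping the KMeans sketch above); the real system used its own Buckshot/Fractionation algorithms and also showed typical titles.

```python
from collections import Counter

def summary(cluster, n=5):
    """Stand-in for the topical terms shown for each cluster."""
    counts = Counter(w for doc in cluster for w in doc.lower().split())
    return [term for term, _ in counts.most_common(n)]

def scatter_gather(docs, cluster_fn, k=5):
    """One session: scatter into k clusters, let the user gather a
    subset, re-cluster the survivors, and repeat."""
    while len(docs) > k:
        clusters = cluster_fn(docs, k)                  # "scatter"
        for i, c in enumerate(clusters):
            print(i, len(c), summary(c))
        keep = input("cluster numbers to keep (e.g. 0,2): ")
        docs = [doc for i in map(int, keep.split(","))  # "gather"
                for doc in clusters[i]]
    return docs
```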

24
Northern Light Web Search: started out with
clustering, then integrated with categories.
Now does not do web search and uses only
categories.
25
Teoma appears to combine categories and clusters
26
Scatter Plot of Clusters (Chen et al. 97)
27
BEAD (Chalmers 97)
28
BEAD (Chalmers 96)
An example layout produced by Bead, seen in
overview, of 831 bibliography entries. The
dimensionality (the number of unique words in
the set) is 6925. A search for "cscw" or
"collaborative" shows the pattern of occurrences
coloured dark blue, mostly to the right. The
central rectangle is the visualizer's motion
control.
29
Example Themescapes (Wise et al. 95)
30
Clustering for Collection Overviews
  • Since text has tens of thousands of features
  • the mapping to 2D loses a tremendous amount of
    information
  • only very coarse themes are detected

31
Galaxy of News (Rennison 95)
32
Galaxy of News (Rennison 95)
33
Kohonen Feature Maps (Lin 92, Chen et al. 97)
(594 docs)
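
For intuition about how such maps are built, a generic self-organizing-map sketch, assuming documents have already been turned into term vectors X (this is textbook SOM training, not the implementation from Lin or Chen et al.):

```python
import numpy as np

def train_som(X, rows=10, cols=10, iters=2000, lr0=0.5, sigma0=3.0):
    """Fit a rows x cols Kohonen map to document vectors X, so similar
    documents end up in nearby grid cells (the adjacent regions above)."""
    rng = np.random.default_rng(0)
    W = rng.random((rows, cols, X.shape[1]))      # one weight vector per cell
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    for t in range(iters):
        x = X[rng.integers(len(X))]               # random training document
        d = ((W - x) ** 2).sum(axis=2)
        bi, bj = np.unravel_index(d.argmin(), d.shape)  # best-matching cell
        frac = 1 - t / iters                      # decay rate and radius
        lr, sigma = lr0 * frac, sigma0 * frac + 0.5
        h = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / (2 * sigma ** 2))
        W += lr * h[:, :, None] * (x - W)         # pull neighborhood toward x
    return W  # label each cell by its dominant terms to draw the map
```
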
34
Study of Kohonen Feature Maps
  • H. Chen, A. Houston, R. Sewell, and B. Schatz,
    JASIS 49(7)
  • Comparison: Kohonen Map and Yahoo
  • Task:
  • Window shop for interesting home page
  • Repeat with other interface
  • Results:
  • Starting with map, could repeat in Yahoo (8/11)
  • Starting with Yahoo, unable to repeat in map (2/14)

35
How Useful is Collection Cluster Visualization
for Search?
  • Three studies find negative results

36
Study 1
  • Kleiboemer, Lazear, and Pedersen. Tailoring a
    retrieval system for naive users. In Proc. of
    the 5th Annual Symposium on Document Analysis and
    Information Retrieval, 1996
  • This study compared
  • a system with 2D graphical clusters
  • a system with 3D graphical clusters
  • a system that shows textual clusters
  • Novice users
  • Only textual clusters were helpful (and they were
    difficult to use well)

37
Study 2: Kohonen Feature Maps
  • H. Chen, A. Houston, R. Sewell, and B. Schatz,
    JASIS 49(7)
  • Comparison: Kohonen Map and Yahoo
  • Task:
  • Window shop for interesting home page
  • Repeat with other interface
  • Results:
  • Starting with map, could repeat in Yahoo (8/11)
  • Starting with Yahoo, unable to repeat in map (2/14)

38
Study 2 (cont.)
  • Participants liked
  • Correspondence of region size to documents
  • Overview (but also wanted zoom)
  • Ease of jumping from one topic to another
  • Multiple routes to topics
  • Use of category and subcategory labels

39
Study 2 (cont.)
  • Participants wanted
  • hierarchical organization
  • other ordering of concepts (alphabetical)
  • integration of browsing and search
  • correspondence of color to meaning
  • more meaningful labels
  • labels at same level of abstraction
  • fit more labels in the given space
  • combined keyword and category search
  • multiple category assignment (sports + entertainment)

40
Study 3: NIRVE
  • NIRVE Interface by Cugini et al. 96. Each
    rectangle is a cluster. Larger clusters closer
    to the pole. Similar clusters near one
    another. Opening a cluster causes a projection
    that shows the titles.

41
Study 3
  • Visualization of search results: a comparative
    evaluation of text, 2D, and 3D interfaces.
    Sebrechts, Cugini, Laskowski, Vasilakis and
    Miller, Proceedings of SIGIR 99, Berkeley, CA,
    1999.
  • This study compared
  • 3D graphical clusters
  • 2D graphical clusters
  • textual clusters
  • 15 participants, between-subject design
  • Tasks
  • Locate a particular document
  • Locate and mark a particular document
  • Locate a previously marked document
  • Locate all clusters that discuss some topic
  • List the most frequently represented topics

42
Study 3
  • Results (time to locate targets)
  • Text clusters fastest
  • 2D next
  • 3D last
  • With practice (6 sessions) 2D neared text
    results; 3D still slower
  • Computer experts were just as fast with 3D
  • Certain tasks equally fast with 2D and text
  • Find particular cluster
  • Find an already-marked document
  • But anything involving text (e.g., find title)
    much faster with text.
  • Spatial location rotated, so users lost context
  • Helpful viz features
  • Color coding (helped text too)
  • Relative vertical locations

43
Visualizing Clusters
  • Huge 2D maps may be an inappropriate focus for
    information retrieval
  • cannot see what the documents are about
  • space is difficult to browse for IR purposes
  • (tough to visualize abstract concepts)
  • Perhaps more suited for pattern discovery and
    gist-like overviews

44
Co-Citation Analysis
  • Has been around since the 50s (Small, Garfield,
    White & McCain)
  • Used to identify core sets of
  • authors, journals, articles for particular fields
  • Not for general search
  • Main Idea
  • Find pairs of papers that are cited together by
    third papers
  • Look for commonalities
  • A nice demonstration by Eugene Garfield at
  • http://165.123.33.33/eugene_garfield/papers/mapsciworld.html
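
The counting itself is simple. A sketch, with hypothetical bibliography data:

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(bibliographies):
    """Count how often each pair of papers is cited together by a third
    paper; heavily co-cited pairs mark the core of a field."""
    pairs = Counter()
    for cited in bibliographies.values():
        pairs.update(combinations(sorted(cited), 2))
    return pairs

# Hypothetical data: citing paper -> set of papers it cites.
bibs = {"p1": {"small73", "garfield79"},
        "p2": {"small73", "garfield79", "white81"},
        "p3": {"small73", "white81"}}
print(cocitation_counts(bibs).most_common(2))
```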

45
Co-citation analysis (From Garfield 98)
46
Co-citation analysis (From Garfield 98)
47
Co-citation analysis (From Garfield 98)
48
Category Combinations
  • Let's show categories instead of clusters

49
DynaCat (Pratt, Hearst, Fagan 99)
50
DynaCat (Pratt 97)
  • Decide on important question types in advance
  • What are the adverse effects of drug D?
  • What is the prognosis for treatment T?
  • Make use of MeSH categories
  • Retain only those types of categories known to be
    useful for this type of query.

51
DynaCat Study
  • Design
  • Three queries
  • 24 cancer patients
  • Compared three interfaces
  • ranked list, clusters, categories
  • Results
  • Participants strongly preferred categories
  • Participants found more answers using categories
  • Participants took same amount of time with all
    three interfaces

52
HiBrowse
53
Category Combinations
  • HiBrowse Problem
  • Search is not integrated with browsing of
    categories
  • Only see the subset of categories selected (and
    the corresponding number of documents)

54
MultiTrees (Furnas & Zacks 94)
55
Cat-a-Cone: Multiple Simultaneous Categories
  • Key Ideas
  • Separate documents from category labels
  • Show both simultaneously
  • Link the two for iterative feedback
  • Distinguish between
  • Searching for Documents vs.
  • Searching for Categories

56
Cat-a-Cone Interface
57
Cat-a-Cone
  • Catacomb
  • (definition 2b, online Webster's)
  • A complex set of interrelated things
  • Makes use of earlier PARC work on 3D animation

  • Rooms (Henderson and Card 86)
  • IV Cone Tree (Robertson, Card, Mackinlay 93)
  • Web Book (Card, Robertson, York 96)
58
(Diagram: search and browse paths connect the query
terms, the Category Hierarchy, the Collection, and
the Retrieved Documents.)
59
ConeTree for Category Labels
  • Browse/explore category hierarchy
  • by search on label names
  • by growing/shrinking subtrees
  • by spinning subtrees
  • Affordances
  • learn meaning via ancestors, siblings
  • disambiguate meanings
  • all categories simultaneously viewable

60
Virtual Book for Result Sets
  • Categories on Page (Retrieved Document) linked to
    Categories in Tree
  • Flipping through Book Pages causes some Subtrees
    to Expand and Contract
  • Most Subtrees remain unchanged
  • Book can be Stored for later Re-Use

61
Improvements over Standard Category Interfaces
  • Integrate category selection with viewing of
    categories
  • Show all categories in context
  • Show relationship of retrieved documents to the
    category structure
  • But do users understand and like the 3D?

62
The FLAMENCO Project
  • Basic idea similar to Cat-a-Cone
  • But use familiar HTML interaction to achieve
    similar goals
  • Usability results are very strong for users who
    care about the collection.

63
Query Specification
64
Command-Based Query Specification
  • Syntax: command attribute value connector
  • Example: find pa shneiderman and tw user
  • What are the attribute names?
  • What are the command names?
  • What are allowable values?
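
A toy recognizer for this syntax makes the usability problem concrete: the user must already know the command, attribute, and connector vocabularies. The attribute meanings here (pa = personal author, tw = title word) are assumptions based on the example above.

```python
COMMANDS = {"find"}
ATTRIBUTES = {"pa", "tw"}   # assumed: personal author, title word
CONNECTORS = {"and", "or"}

def parse(query):
    """Parse 'find pa shneiderman and tw user' into (attribute, value)
    clauses joined by connectors."""
    tokens = query.lower().split()
    if not tokens or tokens[0] not in COMMANDS:
        raise ValueError("unknown command")
    clauses, i = [], 1
    while i + 1 < len(tokens):
        attr, value = tokens[i], tokens[i + 1]
        if attr not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {attr}")
        clauses.append((attr, value))
        i += 2
        if i < len(tokens):
            if tokens[i] not in CONNECTORS:
                raise ValueError(f"unknown connector: {tokens[i]}")
            clauses.append(tokens[i])
            i += 1
    return clauses

print(parse("find pa shneiderman and tw user"))
# [('pa', 'shneiderman'), 'and', ('tw', 'user')]
```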

65
Form-Based Query Specification (Altavista)
66
Form-Based Query Specification (Melvyl)
67
Form-based Query Specification (Infoseek)
68
Direct Manipulation Spec.: VQUERY (Jones 98)
69
Menu-based Query Specification (Young &
Shneiderman 93)
70
Context
71
Putting Results in Context
  • Visualizations of Query Term Distribution
  • KWIC, TileBars, SeeSoft
  • Visualizing Shared Subsets of Query Terms
  • InfoCrystal, VIBE, Lattice Views
  • Table of Contents as Context
  • Superbook, Cha-Cha, DynaCat
  • Organizing Results with Tables
  • Envision, SenseMaker
  • Using Hyperlinks
  • WebCutter

72
Putting Results in Context
  • Interfaces should
  • give hints about the roles terms play in the
    collection
  • give hints about what will happen if various
    terms are combined
  • show explicitly why documents are retrieved in
    response to the query
  • summarize compactly the subset of interest

73
KWIC (Keyword in Context)
  • An old standard, ignored until recently by
    internet search engines
  • used in some intranet engines, e.g., Cha-Cha
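
The technique is decades old and takes only a few lines. A minimal sketch:

```python
import re

def kwic(text, term, width=30):
    """Print every occurrence of term with its surrounding context."""
    for m in re.finditer(re.escape(term), text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        print(f"...{left}[{m.group()}]{right}...")

kwic("As the man walks the cavorting dog, thoughts arrive unbidden.", "dog")
```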

74
Display of Retrieval Results
  • Goal: minimize time/effort for deciding which
    documents to examine in detail
  • Idea: show the roles of the query terms in the
    retrieved documents, making use of document
    structure

75
TileBars
  • Graphical Representation of Term Distribution and
    Overlap
  • Simultaneously Indicate
  • relative document length
  • query term frequencies
  • query term distributions
  • query term overlap
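
A rough ASCII rendering of the idea (the segmentation and shading here are simplified stand-ins; the real TileBars segmented documents into passages and drew grayscale tiles):

```python
SHADES = " .:*#"  # darker = more hits, echoing the grayscale ordering

def tilebar(doc_segments, facets):
    """Render one document's TileBar in ASCII: one row per query facet,
    one tile per text segment, shaded by how often the facet's terms
    occur in that segment. Row length shows relative document length."""
    for facet in facets:
        row = ""
        for seg in doc_segments:
            words = seg.lower().split()
            hits = sum(words.count(t) for t in facet)
            row += SHADES[min(hits, len(SHADES) - 1)]
        print(f"|{row}|", " OR ".join(facet))

segments = ["the dbms crashed", "dbms recovery and reliability",
            "banking news", "reliability of the dbms"]
tilebar(segments, [["dbms", "database"], ["reliability", "recovery"]])
```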

76
Example
Query terms: what roles do they play in
retrieved documents?
Query: DBMS (Database Systems) + Reliability
  • Mainly about both DBMS and reliability
  • Mainly about DBMS, discusses reliability
  • Mainly about, say, banking, with a subtopic
    discussion on DBMS/Reliability
  • Mainly about high-tech layoffs
79
Exploiting Visual Properties
  • Variation in gray scale saturation imposes a
    universal, perceptual order (Bertin et al. 83)
  • Varying shades of gray show varying quantities
    better than color (Tufte 83)
  • Differences in shading should align with the
    values being presented (Kosslyn et al. 83)

80
Key Aspect: Faceted Queries
  • Conjunct of disjuncts
  • Each disjunct is a concept
  • osteoporosis, bone loss
  • prevention, cure
  • research, Mayo clinic, study
  • User does not have to specify which are main
    topics, which are subtopics
  • Ranking algorithm gives higher weight to overlap
    of topics
  • This kind of query works better for
    high-precision queries than similarity search
    (Hearst 95)
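
A minimal sketch of ranking by facet overlap, where each facet is a disjunct of terms as above; the weighting is illustrative, not the actual ranking algorithm from the paper.

```python
def facet_score(doc, facets):
    """Score a document for a conjunct-of-disjuncts query: a facet is
    satisfied if any of its terms appears, and documents matching more
    facets (more topic overlap) rank strictly above documents that
    merely repeat terms from one facet."""
    words = doc.lower().split()
    matched = [f for f in facets if any(t in words for t in f)]
    hits = sum(words.count(t) for f in matched for t in f)
    return (len(matched), hits)  # facets matched first, then raw hits

query = [["osteoporosis", "bone"], ["prevention", "cure"],
         ["research", "study"]]
docs = ["a study on osteoporosis prevention",
        "osteoporosis osteoporosis osteoporosis"]
print(sorted(docs, key=lambda d: facet_score(d, query), reverse=True))
```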

81
TileBars Summary
  • Preliminary User Studies
  • users understand them
  • find them helpful in some situations, but
    probably slower than just reading titles
  • sometimes terms need to be disambiguated

82
SeeSoft: showing text content using a linear
representation and brushing and linking (Eick &
Wills 95)
83
Query Term Subsets
  • Show which subsets of query terms occur in
    which subsets of retrieved documents
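
A sketch of the underlying grouping that displays like InfoCrystal and VIBE render graphically:

```python
from collections import defaultdict

def group_by_term_subset(docs, terms):
    """Partition retrieved documents by which subset of the query
    terms each one contains."""
    groups = defaultdict(list)
    for doc in docs:
        words = set(doc.lower().split())
        groups[frozenset(t for t in terms if t in words)].append(doc)
    return groups

docs = ["java coffee beans", "java programming", "coffee and tea"]
for subset, members in group_by_term_subset(docs, ["java", "coffee"]).items():
    print(sorted(subset), "->", members)
```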

84
Term Occurrences in Results Sets
  • Show how often each query term occurs in
    retrieved documents
  • VIBE (Korfhage 91)
  • InfoCrystal (Spoerri 94)
  • Problems
  • can't see overlap of terms within docs
  • quantities not represented graphically
  • more than 4 terms hard to handle
  • no help in selecting terms to begin with

85
InfoCrystal (Spoerri 94)
86
VIBE (Olson et al. 93, Korfhage 93)
87
Term Occurrences in Results Sets
  • Problems
  • can't see overlap of terms within docs
  • quantities not represented graphically
  • more than 4 terms hard to handle
  • no help in selecting terms to begin with

88
DLITE (Cousins 97)
  • Supporting the Information Seeking Process
  • UI to a digital library
  • Direct manipulation interface
  • Workcenter approach
  • experts create workcenters
  • lots of tools for one task
  • contents persistent

89
DLITE (Cousins 97)
  • Drag and Drop interface
  • Reify queries, sources, retrieval results
  • Animation to keep track of activity

90
IR Infovis Meta-Analysis (Chen & Yu 00)
  • Goal
  • Find invariant underlying relations suggested
    collectively by empirical findings from many
    different studies
  • Procedure
  • Examine the literature of empirical infoviz
    studies
  • 35 studies between 1991 and 2000
  • 27 focused on information retrieval tasks
  • But due to wide differences in the conduct of the
    studies and the reporting of statistics, could
    use only 6 studies

91
IR Infovis Meta-Analysis (Chen & Yu 00)
  • Conclusions
  • IR Infoviz studies not reported in a standard
    format
  • Individual cognitive differences had the largest
    effect
  • Especially on accuracy
  • Somewhat on efficiency
  • Holding cognitive abilities constant, users did
    better with simpler visual-spatial interfaces
  • The combined effect of visualization is not
    statistically significant
  • Misc
  • TileBars and Scatter/Gather are well-known enough
    to not require citations!!