Title: Text and Documents 2
1 Text and Documents 2
- CS 7450 - Information Visualization
- November 2, 2000
2 InfoVis Tasks
Recall
- Two main tasks that Information Visualization can assist with in this area
  - Enhance a person's ability to read, understand, and gain knowledge from a document
  - Understand the contents of a document or collection of documents without reading them
3 More Specific Tasks
Recall
- Which documents contain text on topic XYZ?
- Which documents are of interest to me?
- Are there other documents that might be close enough to be worthwhile?
- What are the main themes of a document?
- How are certain words or themes distributed through a document?
4 Simple Taxonomy
Enhanced presentation (syntax)
Concepts and relationships (semantics)
Single document
Collection of documents
Today's focus
5 Document Collections
- Problem or challenge is how to present the contents/semantics/themes/etc. of the documents to someone who does not have time to read them all
- Who cares?
  - Researchers, news people, CIA, ...
6 Vector Space Analysis
- How does one compare the similarity of two documents?
- One model
  - Make list of each unique word in document
  - Throw out common words (a, an, the, ...)
  - Make different forms the same (bake, bakes, baked)
  - Store count of how many times each word appeared
  - Alphabetize, make into a vector
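The steps of this model can be sketched in a few lines. This is only an illustration: the stop-word list and the suffix-stripping `crude_stem` helper are assumptions made for the example, not the method used by any particular system.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop-word list (illustrative only).
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "to", "is"}

def crude_stem(word):
    """Hypothetical stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_counts(text):
    """Count stemmed, non-stop words in a document."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)

def to_vector(counts, vocabulary):
    """Lay the counts out in the (alphabetized) vocabulary order."""
    return [counts.get(term, 0) for term in vocabulary]

docs = ["The baker bakes a pie", "She baked pies and baked bread"]
counts = [term_counts(d) for d in docs]
vocabulary = sorted(set().union(*counts))          # alphabetize
vectors = [to_vector(c, vocabulary) for c in counts]
```

Every document is projected onto the same alphabetized vocabulary, so position i means the same term in every vector.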
7 Vector Space Analysis
- Model (continued)
  - Want to see how closely two vectors go in same direction, inner product
  - Can get similarity of each document to every other one
  - Use a mass-spring layout algorithm to position representations of each document
- Some similarities to how search engines work
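A minimal sketch of the pairwise comparison, assuming the inner product is normalized by vector length (cosine similarity) so that long and short documents are comparable. The normalization and the sample vectors are assumptions for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Inner product divided by both lengths; 1.0 means same direction."""
    norm_u = math.sqrt(dot(u, u))
    norm_v = math.sqrt(dot(v, v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot(u, v) / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Similarity of each document to every other one."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Made-up term-count vectors over a shared vocabulary.
vectors = [[1, 1, 0, 1, 0], [2, 0, 1, 1, 1], [0, 0, 3, 0, 0]]
sims = similarity_matrix(vectors)
```

The resulting matrix is symmetric and could serve as the spring strengths for a mass-spring layout.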
8 Wiggle
- Not all terms or words are equally useful
- Often apply TFIDF
- Term frequency, inverse document frequency
- Weight of a word goes up if it appears often in a
document, but not often in the collection
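One common form of this weighting, sketched under the assumption that the weight is the raw term count times log(N / document frequency). Many variants exist, and the slides do not say which one is meant.

```python
import math

def tf_idf(count_vectors):
    """Weight each count by log(N / df), where df is the number of
    documents that contain the term (assumed variant, illustrative only)."""
    n_docs = len(count_vectors)
    n_terms = len(count_vectors[0])
    doc_freq = [sum(1 for vec in count_vectors if vec[t] > 0)
                for t in range(n_terms)]
    weighted = []
    for vec in count_vectors:
        row = []
        for t, count in enumerate(vec):
            idf = math.log(n_docs / doc_freq[t]) if doc_freq[t] else 0.0
            row.append(count * idf)
        weighted.append(row)
    return weighted

# Made-up counts: term 0 appears in every document, terms 1 and 2 in one each.
counts = [[3, 1, 0], [2, 0, 0], [1, 0, 4]]
weights = tf_idf(counts)
```

In this example term 0 occurs in every document, so its weight drops to zero, while terms found in a single document keep a high weight.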
9 Process
Documents -> (Analysis) -> Vectors, keywords -> (Algorithms) -> Data tables for vis -> (Visualization)
- Analysis: decomposition, statistics
- Algorithms: similarity, clustering, normalization
- Visualization: 2D, 3D display
10 Smart System
- Uses vector space model for documents
- May break document into chapters and sections and deal with those as atoms
- Plot document atoms on circumference of circle
- Draw line between items if their similarity exceeds some threshold value
Salton et al 95
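A minimal sketch of the circle-and-threshold idea, assuming a precomputed similarity matrix. The function names, the sample similarities, and the threshold value are hypothetical.

```python
import math

def circle_layout(n_items, radius=1.0):
    """Place items evenly on the circumference of a circle."""
    return [(radius * math.cos(2 * math.pi * i / n_items),
             radius * math.sin(2 * math.pi * i / n_items))
            for i in range(n_items)]

def relation_edges(sims, threshold):
    """Keep a line (i, j, similarity) only if the similarity exceeds the threshold."""
    edges = []
    for i in range(len(sims)):
        for j in range(i + 1, len(sims)):
            if sims[i][j] > threshold:
                edges.append((i, j, sims[i][j]))
    return edges

# Made-up similarities between four document atoms.
sims = [[1.0, 0.7, 0.1, 0.4],
        [0.7, 1.0, 0.5, 0.2],
        [0.1, 0.5, 1.0, 0.6],
        [0.4, 0.2, 0.6, 1.0]]
positions = circle_layout(len(sims))
edges = relation_edges(sims, threshold=0.45)
```

Each edge carries its similarity value, which could be drawn as a label on the line.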
11 Text Relation Maps
- Label on line can indicate similarity value
- Items evenly spaced
- Doesn't give viewer idea of how big each section/document is
12 Improved Design
Proportional to length of section
Links placed at correct relative position
13 Text Themes
- Look for sets of regions in a document (or sets of documents) that all have common theme
- Closely related to each other, but different from rest
- Need to run clustering process
14 Algorithm
- Recognize triangles in relation maps
  - Three items whose connecting edges are all above threshold
- Make a new vector that is centroid of the 3
- Triangles merged whenever centroid vectors are sufficiently similar
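The triangle-and-centroid procedure could be sketched as follows. The greedy merge order, the recomputation of a merged theme's centroid from all of its members, and both threshold values are assumptions made for this example.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def find_triangles(vectors, threshold):
    """All triples whose three pairwise similarities exceed the threshold."""
    return [t for t in combinations(range(len(vectors)), 3)
            if all(cosine(vectors[a], vectors[b]) > threshold
                   for a, b in combinations(t, 2))]

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def merge_themes(triangles, vectors, merge_threshold):
    """Greedily merge themes whose centroids are sufficiently similar."""
    themes = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for a, b in combinations(range(len(themes)), 2):
            ca = centroid([vectors[i] for i in themes[a]])
            cb = centroid([vectors[i] for i in themes[b]])
            if cosine(ca, cb) >= merge_threshold:
                themes[a] |= themes[b]
                del themes[b]
                merged = True
                break  # restart the scan after every merge
    return themes

# Made-up vectors: items 0-3 share one theme, items 4-6 another.
vectors = [[1, 0, 0], [0.9, 0.1, 0], [0.8, 0.2, 0], [0.9, 0, 0.1],
           [0, 0, 1], [0, 0.1, 0.9], [0.1, 0, 0.9]]
triangles = find_triangles(vectors, threshold=0.9)
themes = merge_themes(triangles, vectors, merge_threshold=0.9)
```

With these vectors the four overlapping triangles among items 0 to 3 collapse into one theme, and items 4 to 6 form a second.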
15 Text Theme Example
- Triangles shown
- Colored in to help presentation
16 Skimming and Summarization
- Can use graph traversal to follow specific themes throughout collection
- Walk along connected edges
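One way to walk along connected edges is a breadth-first traversal of the relation map. The slides do not name a traversal order, so breadth-first is an assumption here, and the edge list is made up.

```python
from collections import deque

def walk_theme(edges, start):
    """Visit every item reachable from `start` along relation-map edges."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, []).append(b)
        neighbors.setdefault(b, []).append(a)
    visited, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(neighbors.get(node, [])):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return order

# Made-up edges: items 0-3 are chained together, items 4-5 form a separate theme.
edges = [(0, 1), (1, 2), (2, 3), (4, 5)]
path_a = walk_theme(edges, 0)
path_b = walk_theme(edges, 4)
```

Reading only the items on one walk gives a skim of a single theme without touching unrelated sections.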
17 Helpful
18 VIBE System
- Smaller sets of documents than whole library
- Example: Set of 100 documents retrieved from a web search
- Idea is to understand how contents of documents relate to each other
Olsen et al 93
19 Focus
- Points of Interest
  - Terms or keywords that are of interest to user
  - Example: cooking, pies, apples
- Want to visualize a document collection where each document's relation to points of interest is shown
- Also visualize how documents are similar or different
20 Technique
- Represent points of interest as vertices on convex polygon
- Documents are small points inside the polygon
- How close a point is to a vertex represents how strong that term is within the document
[Figure: triangle with POI vertices P1, P2, P3]
21 Algorithm
- Example: 3 POIs
- Document (P1, P2, P3) = (0.4, 0.8, 0.2)
- Take first two
  0.4 / (0.4 + 0.8) = 0.333
  1/3 of way from P2 to P1
[Figure: segment P1-P2]
22 Algorithm
- Combine weight of first two (0.4 + 0.8 = 1.2) and make a new point, P
- Do same thing for third point
  1.2 / (1.2 + 0.2) = 0.86
  0.14 of way from P to P3
[Figure: triangle P1, P2, P3 with intermediate point P]
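The pairwise combination in the worked example can be written as a short loop. This is a sketch: the POI coordinates are made up, and `vibe_position` is a hypothetical name, not part of the VIBE system.

```python
def vibe_position(weights, poi_positions):
    """Fold the POIs in one at a time: each new POI pulls the running
    point toward itself by weight / (accumulated weight + weight)."""
    point, total = None, 0.0
    for weight, (px, py) in zip(weights, poi_positions):
        if weight <= 0:
            continue  # a term absent from the document exerts no pull
        if point is None:
            point, total = (px, py), weight
            continue
        frac = weight / (total + weight)  # fraction of the way to the new POI
        point = (point[0] + frac * (px - point[0]),
                 point[1] + frac * (py - point[1]))
        total += weight
    return point

# Assumed screen coordinates for P1, P2, P3.
pois = [(0.0, 0.0), (1.0, 0.0), (0.5, 1.0)]
doc = vibe_position([0.4, 0.8, 0.2], pois)
```

For the weights (0.4, 0.8, 0.2) the first step lands 2/3 of the way from P1 to P2 (that is, 1/3 of the way from P2 to P1), and the second step moves 0.2 / 1.4 = 0.14 of the way toward P3. The result equals the weighted average of the POI positions, so the order in which POIs are combined does not change the final point.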
23 Sample Visualization
24 Sample Visualization
25 VIBE Pros and Cons
- Effectively communicates relationships
- Straightforward methodology and vis are easy to follow
- Can show relatively large collections
- Not showing much about a document
- Single items lose detail in the presentation
- Starts to break down with large number of terms
26 Visualizing Documents
- VIBE presented documents with respect to a finite number of special terms
- How about generalizing this?
  - Show large set of documents
  - Any important terms within the set become key landmarks
  - Not restricted to convex polygon idea
27 Kohonen's Feature Maps
- AKA Self-Organizing Maps
- Expresses complex, non-linear relationships between high dimensional data items as simple geometric relationships on a 2-d display
- Uses neural network techniques
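A very small self-organizing map can be sketched in pure Python. The grid size, the linear decay schedules, the Gaussian neighborhood, and all parameter names are assumptions chosen for illustration; production implementations differ in many details.

```python
import math
import random

def best_matching_unit(grid, x):
    """Grid cell whose weight vector is closest to the input x."""
    return min(grid, key=lambda node: sum((w - xi) ** 2
                                          for w, xi in zip(grid[node], x)))

def train_som(data, rows, cols, epochs=50, lr0=0.5, radius0=None, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 or max(rows, cols) / 2
    grid = {(r, c): [rng.random() for _ in range(dim)]
            for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs          # linear decay (assumed schedule)
        lr = lr0 * decay
        radius = max(radius0 * decay, 0.5)
        for x in data:
            br, bc = best_matching_unit(grid, x)
            for (r, c), w in grid.items():
                grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                for k in range(dim):
                    # Pull the cell toward x, more strongly near the winner.
                    w[k] += lr * influence * (x[k] - w[k])
    return grid

# Made-up 2-d inputs forming two loose groups.
data = [[0.1, 0.1], [0.15, 0.1], [0.9, 0.9], [0.85, 0.95]]
som = train_som(data, rows=3, cols=3)
cells = [best_matching_unit(som, x) for x in data]
```

Each input is then drawn at the grid cell returned by `best_matching_unit`, which is how high-dimensional items end up with 2-d positions.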
28 Map Display of SOM
29 Map Attributes
- Different, colored areas correspond to different concepts in collection
- Size of area corresponds to its relative importance in set
- Neighboring regions indicate commonalities in concepts
- Dots in regions can represent documents
30 More Maps
ai2.bpa.arizona.edu/ent/
31 More Maps
lislin.gws.uky.edu/Sitemap/
Interactive demos
Xia Lin
32 Map Review
- Do you think the technique is useful?
- Strengths/weaknesses?
33 Work at PNL
- Group has developed a number of visualization techniques for document collections
  - Galaxies
  - Themescapes
  - ThemeRiver
  - ...
Wise et al 95 www.pnl.gov/infoviz
34 Galaxies
Presentation of documents where similar ones
cluster together
35 Themescapes
- Self-organizing maps didn't reflect density of regions all that well. Can we improve?
- Use 3D representation, and have height represent density or number of documents in region
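The height idea can be illustrated by counting documents per grid cell. This sketch assumes documents already have 2-d positions in the unit square and uses raw counts as heights; a real terrain display would presumably smooth the surface.

```python
def density_grid(points, n_rows, n_cols):
    """Count documents per cell; points are (x, y) pairs in [0, 1] x [0, 1]."""
    grid = [[0] * n_cols for _ in range(n_rows)]
    for x, y in points:
        col = min(int(x * n_cols), n_cols - 1)  # clamp x == 1.0 into last column
        row = min(int(y * n_rows), n_rows - 1)  # clamp y == 1.0 into last row
        grid[row][col] += 1
    return grid

# Made-up document positions from some 2-d layout.
points = [(0.1, 0.1), (0.15, 0.12), (0.2, 0.05), (0.8, 0.9), (1.0, 1.0)]
heights = density_grid(points, n_rows=2, n_cols=2)
```

Each cell value becomes the height of the terrain at that location, so crowded regions show up as peaks.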
36 Themescape
Video
37 WebTheme
38 ThemeRiver
39 Cartia
www.cartia.com www.newsmaps.com
Spinoff of PNL that uses ThemeScape idea for a
commercial tool
40 Galaxy of News
Current Issues
news information
Current info. infrastructure simply can't handle exploding scale of news information and its cross correlation.
Need for an intelligent system that automatically builds the correlations and relationships between news articles
Rennison 94
41 Objectives
Main Purpose
- Powerful relationship construction engine
- Effective navigation
Benefits
- Allows users to quickly gain a broad understanding of a news base
- Allows users to explore and effectively browse through expanding news base
- Allows users to find relationships between articles that would otherwise be unknown
42 Technique
How it works
Pyramidal representation
No global, explicit hierarchical representation
Semantic zooming and panning with fluid animation
No fixed locations, space is dynamically reconstructed
Use of galaxy as a metaphor frees the space from dimensional constraints
43 Key Features
- Users move freely, smoothly and continuously in 3D space
- Move through network of symbols that are sorted, close proximity means related
- x-y topic space, z (depth) is more detail
- User's trajectory determines what appears
- Key idea: space is not statically defined and laid out
44 Display
45 Discussion
Why not use hypermedia?
- Hyper, sudden jumping
- No idea what other info. is available
- Manually connect all documents?
Credibility issue on news filters
- Dilemma: Trust news filters vs. read them all
- Approach: Provide access to all articles, no filtering or retrieval techniques
46 Potential Application
Internet Search Engine
47 Related Idea on Web
www.plumbdesign.com/thesaurus/
48 References
- Spence and CMS texts
- All referred to papers
- F 99 slides
- Ho and Kim
- Lewis
- Miller