Text and Documents 2

Provided by: JohnS3

Transcript and Presenter's Notes

1
Text and Documents 2
  • CS 7450 - Information Visualization
  • November 2, 2000

2
InfoVis Tasks
Recall
  • Two main tasks that Information Visualization can
    assist with in this area
  • Enhance a person's ability to read, understand,
    and gain knowledge from a document
  • Understand the contents of a document or
    collection of documents without reading them

3
More Specific Tasks
Recall
  • Which documents contain text on topic XYZ?
  • Which documents are of interest to me?
  • Are there other documents that might be close
    enough to be worthwhile?
  • What are the main themes of a document?
  • How are certain words or themes distributed
    through a document?

4
Simple Taxonomy
Enhanced presentation (syntax)
Concepts and relationships (semantics)
Single document
Collection of documents
Today's focus

5
Document Collections
  • Problem or challenge is how to present the
    contents/semantics/themes/etc of the documents to
    someone who does not have time to read them all
  • Who cares?
  • Researchers, news people, CIA, ...

6
Vector Space Analysis
  • How does one compare the similarity of two
    documents?
  • One model
  • Make list of each unique word in document
  • Throw out common words (a, an, the, ...)
  • Make different forms the same (bake, bakes,
    baked)
  • Store count of how many times each word appeared
  • Alphabetize, make into a vector
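
The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the course's implementation: the stop-word list, the crude suffix stripper `crude_stem`, and the two sample sentences are all assumptions made up for the example.

```python
import re
from collections import Counter

# Assumed stop-word list (illustrative only).
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "to", "is"}

def crude_stem(word):
    """Strip a few common suffixes so bake/bakes/baked coincide.
    A hypothetical stand-in for a real stemmer."""
    for suffix in ("es", "ed", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_counts(text):
    """Count stemmed, non-stop words in one document."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(w) for w in words if w not in STOP_WORDS)

def to_vector(counts, vocabulary):
    """Lay the counts out in the (alphabetized) vocabulary order."""
    return [counts.get(term, 0) for term in vocabulary]

docs = ["She bakes the pies", "He baked a pie and an apple pie"]
counts = [term_counts(d) for d in docs]
vocabulary = sorted(set().union(*counts))   # alphabetize the unique terms
vectors = [to_vector(c, vocabulary) for c in counts]
```

Building the vocabulary over the whole collection, rather than per document, is what makes the resulting vectors comparable position by position.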

7
Vector Space Analysis
  • Model (continued)
  • Want to see how closely two vectors point in the
    same direction: use the inner product
  • Can get similarity of each document to every
    other one
  • Use a mass-spring layout algorithm to position
    representations of each document
  • Some similarities to how search engines work
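
One common way to turn the inner product into a similarity score is to normalize it by the vector lengths (cosine similarity). The slide only says "inner product", so the normalization is an assumption here, and the sample vectors are invented for illustration.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def cosine_similarity(u, v):
    """Inner product divided by both lengths; 1.0 means same direction."""
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def similarity_matrix(vectors):
    """Similarity of each document vector to every other one."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Hypothetical term-count vectors for three documents.
vectors = [[1, 1, 0], [2, 2, 0], [0, 0, 3]]
sims = similarity_matrix(vectors)
```

The first two vectors point the same way despite different lengths, so they score about 1.0. The third shares no terms with them and scores 0.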

8
Wiggle
  • Not all terms or words are equally useful
  • Often apply TFIDF
  • Term frequency, inverse document frequency
  • Weight of a word goes up if it appears often in a
    document, but not often in the collection
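
A small sketch of the weighting idea, assuming one common TFIDF variant (raw term count times the log of collection size over document frequency). Other variants exist, and the slide does not say which one is meant. The token lists are made up.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in tf.items()}
        )
    return weights

docs = [["pie", "pie", "apple"], ["pie", "oven"], ["pie", "crust"]]
w = tfidf(docs)
```

Here "pie" appears in every document, so its weight drops to zero even though it is the most frequent word in the first document. "apple" appears in only one document and keeps a positive weight.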

9
Process
Documents -> Analysis -> Vectors, keywords -> Algorithms -> Data tables for vis -> Visualization
  • Analysis: decomposition, statistics
  • Algorithms: similarity, clustering, normalization
  • Visualization: 2D, 3D display

10
Smart System
  • Uses vector space model for documents
  • May break document into chapters and sections and
    deal with those as atoms
  • Plot document atoms on circumference of circle
  • Draw line between items if their similarity
    exceeds some threshold value

Salton et al 95
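
The layout rule described on this slide (atoms on a circle, a line wherever similarity exceeds a threshold) can be sketched as follows. The function names, the similarity values, and the threshold of 0.5 are assumptions for illustration, not taken from the Smart system itself.

```python
import math

def circle_positions(n, radius=1.0):
    """Place n items evenly around the circumference of a circle."""
    return [
        (radius * math.cos(2 * math.pi * i / n),
         radius * math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]

def relation_edges(sims, threshold):
    """Return (i, j, similarity) for each pair above the threshold."""
    n = len(sims)
    return [
        (i, j, sims[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if sims[i][j] > threshold
    ]

# Hypothetical pairwise similarities between four document atoms.
sims = [
    [1.0, 0.8, 0.2, 0.1],
    [0.8, 1.0, 0.6, 0.3],
    [0.2, 0.6, 1.0, 0.9],
    [0.1, 0.3, 0.9, 1.0],
]
positions = circle_positions(len(sims))
edges = relation_edges(sims, threshold=0.5)
```

Each edge carries its similarity value, which could be drawn as a label on the line.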
11
Text Relation Maps
  • Label on line can indicate similarity value
  • Items evenly spaced
  • Doesn't give viewer an idea of how big each
    section/document is

12
Improved Design
Proportional to length of section
Links placed at correct relative position
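
A minimal sketch of the proportional layout, assuming each section gets an arc of the circle whose sweep is proportional to its length. The function name `arc_spans` and the sample lengths are hypothetical.

```python
import math

def arc_spans(lengths):
    """Return (start_angle, end_angle) in radians for each section,
    with each arc proportional to that section's length."""
    total = sum(lengths)
    spans, start = [], 0.0
    for length in lengths:
        sweep = 2 * math.pi * length / total
        spans.append((start, start + sweep))
        start += sweep
    return spans

# Three sections; the third is twice as long as the others.
spans = arc_spans([1, 1, 2])
```

A link could then be attached at the angle that matches its relative position inside the section's arc, rather than at a single point per section.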
13
Text Themes
  • Look for sets of regions in a document (or sets
    of documents) that all have common theme
  • Closely related to each other, but different from
    rest
  • Need to run clustering process

14
Algorithm
  • Recognize triangles in relation maps
  • Three with edges above threshold
  • Make a new vector that is centroid of 3
  • Triangles merged whenever centroid vectors are
    sufficiently similar
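
The triangle-and-merge procedure can be sketched as below. This is one possible reading of the slide: the greedy single-pass merge order, both thresholds, and the toy vectors are assumptions, and the published algorithm may differ in detail.

```python
import math
from itertools import combinations

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def find_triangles(sims, threshold):
    """Triples whose three pairwise similarities all exceed the threshold."""
    n = len(sims)
    return [
        t for t in combinations(range(n), 3)
        if all(sims[a][b] > threshold for a, b in combinations(t, 2))
    ]

def centroid(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def merge_themes(vectors, triangles, merge_threshold):
    """Greedily merge triangles whose centroids are similar enough."""
    themes = []  # list of (member set, centroid vector)
    for tri in triangles:
        members = set(tri)
        c = centroid([vectors[i] for i in tri])
        for k, (m, mc) in enumerate(themes):
            if cosine(c, mc) >= merge_threshold:
                merged = m | members
                themes[k] = (merged, centroid([vectors[i] for i in sorted(merged)]))
                break
        else:
            themes.append((members, c))
    return [sorted(m) for m, _ in themes]

# Toy vectors: items 0-3 share one theme, items 4-6 another.
vectors = [[1, 0.1], [1, 0.2], [0.9, 0.1], [1, 0.15],
           [0, 1], [0.1, 1], [0.05, 1]]
sims = [[cosine(u, v) for v in vectors] for u in vectors]
triangles = find_triangles(sims, threshold=0.9)
themes = merge_themes(vectors, triangles, merge_threshold=0.95)
```

With these toy values the four triangles among items 0-3 collapse into one theme, and the single triangle among items 4-6 stays separate.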

15
Text Theme Example
  • Triangles shown
  • Colored in to help presentation

16
Skimming and Summarization
  • Can use graph traversal to follow specific themes
    throughout collection
  • Walk along connected edges
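
Walking along connected edges is ordinary graph traversal. A breadth-first sketch, assuming the relation map is given as a list of undirected edges; the edge list and the function name `follow_theme` are made up for illustration.

```python
from collections import deque

def follow_theme(edges, start):
    """Visit every item reachable from start by walking connected edges."""
    adjacency = {}
    for a, b in edges:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, order, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(adjacency.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

# Hypothetical above-threshold links between five text sections.
edges = [(0, 1), (1, 2), (3, 4)]
first_theme = follow_theme(edges, 0)
second_theme = follow_theme(edges, 3)
```

Reading only the sections in `first_theme` is one way to skim a single theme without touching the rest of the collection.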

17
Helpful
  • What do you think?

18
VIBE System
  • Smaller sets of documents than whole library
  • Example: set of 100 documents retrieved from a
    web search
  • Idea is to understand how contents of documents
    relate to each other

Olsen et al 93
19
Focus
  • Points of Interest
  • Terms or keywords that are of interest to user
  • Example: cooking, pies, apples
  • Want to visualize a document collection where
    each document's relation to points of interest is
    shown
  • Also visualize how documents are similar or
    different

20
Technique
  • Represent points of interest as vertices on
    convex polygon
  • Documents are small points inside the polygon
  • How close a point is to a vertex represents how
    strong that term is within the document

P1
P2
P3
21
Algorithm
  • Example: 3 POIs
  • Document (P1, P2, P3) = (0.4, 0.8, 0.2)
  • Take first two

0.4 / (0.4 + 0.8) = 0.333
Document lies 1/3 of the way from P2 to P1

22
Algorithm
  • Combine weights of first two (0.4 + 0.8 = 1.2)
    and make a new point, P
  • Do same thing for third point

1.2 / (1.2 + 0.2) = 0.86
Document lies 0.14 of the way from P to P3

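
The pairwise combination described in the last two slides can be written as a running weighted average. The sketch below assumes 2D coordinates for the points of interest; the triangle coordinates are invented, and only the weights (0.4, 0.8, 0.2) come from the slides.

```python
def vibe_position(weights, vertices):
    """Fold in one point of interest at a time: each step moves the
    running point toward the next vertex by that vertex's share of
    the weight accumulated so far. Returns (0.0, 0.0) if all weights
    are zero."""
    total = 0.0
    x = y = 0.0
    for w, (vx, vy) in zip(weights, vertices):
        if w <= 0:
            continue
        total += w
        frac = w / total
        x += frac * (vx - x)
        y += frac * (vy - y)
    return (x, y)

# Assumed coordinates for P1, P2, P3; weights from the slide example.
vertices = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
position = vibe_position([0.4, 0.8, 0.2], vertices)
```

After the first two weights the point sits 1/3 of the way from P2 to P1. Folding in the third moves it 0.2 / 1.4, about 0.14, of the way toward P3. The end result equals the weighted average of the vertex positions, so the order of combination does not matter.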
23
Sample Visualization
24
Sample Visualization
25
VIBE Pros and Cons
  • Effectively communicates relationships
  • Straightforward methodology and vis are easy to
    follow
  • Can show relatively large collections
  • Not showing much about a document
  • Single items lose detail in the presentation
  • Starts to break down with large number of terms

26
Visualizing Documents
  • VIBE presented documents with respect to a finite
    number of special terms
  • How about generalizing this?
  • Show large set of documents
  • Any important terms within the set become key
    landmarks
  • Not restricted to convex polygon idea

27
Kohonen's Feature Maps
  • AKA Self-Organizing Maps
  • Expresses complex, non-linear relationships
    between high-dimensional data items as simple
    geometric relationships on a 2-D display
  • Uses neural network techniques
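
To make the idea concrete, here is a very small self-organizing map in plain Python. It is a teaching sketch under assumed settings (grid size, learning-rate and radius schedules, Gaussian neighborhood, random seed), not the configuration used by any system mentioned in these slides.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid cell whose weight vector is closest to the input x."""
    return min(
        weights,
        key=lambda rc: sum((a - b) ** 2 for a, b in zip(weights[rc], x)),
    )

def train_som(data, rows, cols, steps=500, seed=0):
    """Train a rows x cols map; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows) for c in range(cols)
    }
    for step in range(steps):
        progress = step / steps
        rate = 0.5 * (1.0 - progress)                       # shrinking step size
        radius = max(rows, cols) / 2.0 * (1.0 - progress) + 0.5
        x = rng.choice(data)
        br, bc = best_matching_unit(weights, x)
        for (r, c), w in weights.items():
            grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
            influence = math.exp(-grid_dist2 / (2.0 * radius ** 2))
            for k in range(dim):
                # Pull the cell toward x; nearby cells move more.
                w[k] += rate * influence * (x[k] - w[k])
    return weights

# Hypothetical 3-term document vectors forming two loose groups.
data = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
som = train_som(data, rows=3, cols=3)
cells = [best_matching_unit(som, x) for x in data]
```

Each document ends up assigned to a grid cell. Because neighboring cells are pulled together during training, documents with similar vectors tend to land in the same or adjacent cells, which is what the map display relies on.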

28
Map Display of SOM
29
Map Attributes
  • Different, colored areas correspond to different
    concepts in collection
  • Size of area corresponds to its relative
    importance in set
  • Neighboring regions indicate commonalities in
    concepts
  • Dots in regions can represent documents

30
More Maps
ai2.bpa.arizona.edu/ent/
31
More Maps
lislin.gws.uky.edu/Sitemap/
Interactive demos
Xia Lin
32
Map Review
  • Do you think the technique is useful?
  • Strengths/weaknesses?

33
Work at PNL
  • Group has developed a number of visualization
    techniques for document collections
  • Galaxies
  • Themescapes
  • ThemeRiver
  • ...

Wise et al 95 www.pnl.gov/infoviz
34
Galaxies
Presentation of documents where similar ones
cluster together
35
Themescapes
  • Self-organizing maps didn't reflect density of
    regions all that well -- Can we improve?
  • Use 3D representation, and have height represent
    density or number of documents in region
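
The height-as-density idea can be illustrated by counting documents per cell of a 2D layout. This is only a rough sketch: the grid binning, the unit-square extent, and the sample points are assumptions, and a real terrain display would presumably smooth the counts into a continuous surface.

```python
def density_grid(points, bins, extent=1.0):
    """Count how many 2D points fall in each cell of a bins x bins grid
    covering [0, extent] in both directions. The count is the height."""
    grid = [[0] * bins for _ in range(bins)]
    for x, y in points:
        col = min(int(x / extent * bins), bins - 1)
        row = min(int(y / extent * bins), bins - 1)
        grid[row][col] += 1
    return grid

# Hypothetical document positions from some 2D layout.
points = [(0.1, 0.1), (0.15, 0.2), (0.9, 0.9)]
heights = density_grid(points, bins=2)
```

Cells with more documents get taller peaks, which gives the viewer the density information that a flat map hides.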

36
Themescape
Video
37
WebTheme
38
ThemeRiver
39
Cartia
www.cartia.com www.newsmaps.com
Spinoff of PNL that uses ThemeScape idea for a
commercial tool
40
Galaxy of News
Current Issues: news information
  • Current information infrastructure simply can't
    handle the exploding scale of news information and
    its cross-correlation
  • Need for an intelligent system that automatically
    builds the correlations and relationships between
    news articles

Rennison 94
41
Objectives
Main Purpose
  • Powerful relationship construction engine
  • Effective navigation
Benefits
  • Allows users to quickly gain a broad
    understanding of a news base
  • Allows users to explore and effectively browse
    through expanding news base
  • Allows users to find relationships between
    articles that would otherwise be unknown

42
Technique
How it works
  • Pyramidal representation
  • No global, explicit hierarchical representation
  • Semantic zooming and panning with fluid animation
  • No fixed locations; space is dynamically
    reconstructed
  • Use of galaxy as a metaphor gives space freedom
    from dimensional constraints

43
Key Features
  • Users move freely, smoothly and continuously in
    3D space
  • Move through network of symbols that are sorted;
    close proximity means related
  • x-y topic space, z (depth) is more detail
  • User's trajectory determines what appears
  • Key idea: space is not statically defined and
    laid out

44
Display
45
Discussion
Why not use hypermedia?
  • Hyper, sudden jumping
  • No idea what other info is available
  • Manually connect all documents?

Credibility issue on news filters
  • Dilemma: trust news filters vs. read them all
  • Approach: provide access to all articles; no
    filtering or retrieval techniques

46
Potential Application
Internet Search Engine
47
Related Idea on Web
www.plumbdesign.com/thesaurus/
48
References
  • Spence and CMS texts
  • All referred to papers
  • F 99 slides
  • Ho and Kim
  • Lewis
  • Miller