Title: WordSieve: Learning Task Differentiating Keywords Automatically
1WordSieve Learning Task Differentiating Keywords
Automatically
- Travis Bauer
- Sandia National Laboratories
- (Research discussed today was done at Indiana
University)
2Learning Task Contexts: Calvin
- Learn what characterizes a user's task contexts
- Unobtrusive Observing
- Keyword Extraction
- Index based on Context
3Currently Used Algorithms
- TFIDF
- Latent Semantic Analysis
- Log-Entropy
4Currently Used Algorithms
- TFIDF
- "One of the most successful and well tested
techniques in Information Retrieval." (Pazzani)
- Syskill & Webert (Pazzani '96)
- Hierarchical Feature Map (Merkl '97)
- Learning in Document Filtering (Callan '98)
- Topic Detection (Schultz '99)
- Remembrance Agent (Rhodes '00)
- Lexical Signatures (Park '02)
- Latent Semantic Analysis
- Log-Entropy
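For reference, the TFIDF weighting named on this slide can be sketched in a few lines. This is a minimal textbook variant (raw term frequency times the log of inverse document frequency), not necessarily the exact formulation used in any of the cited systems; the function name `tfidf` and the sample documents are assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Hypothetical three-document corpus.
docs = [
    "gore campaign speech gore".split(),
    "bush campaign speech".split(),
    "thai curry recipe".split(),
]
w = tfidf(docs)
```

Here "gore" outweighs "campaign" in the first document because it is both more frequent there and rarer across the corpus. Note that the document frequencies require a pass over the whole collection before any weight can be computed.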
5Currently Used Algorithms
- TFIDF
- Latent Semantic Analysis
- Well known, popular, well covered in the
literature
- Grading Essay Tests
- Taking Physics tests
- Taking synonym exams
- Cross Linguistic IR (Dumais '97)
- Assigning papers for peer review (Dumais '92)
- Information Filtering (Foltz '90)
- Log-Entropy
6Currently Used Algorithms
- TFIDF
- Latent Semantic Analysis
- Log-Entropy
- Not used as much for Personal Information
Retrieval
- Higher overhead than TFIDF
- Indexes based on the distribution of terms across
documents; potentially better performance
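The "distribution of terms across documents" point can be illustrated with the commonly cited log-entropy scheme: a local weight log(1 + tf) multiplied by a global weight 1 + sum_j p_ij log(p_ij) / log(n), where p_ij is the share of term i's occurrences that fall in document j. The sketch below assumes that standard form; the function name `log_entropy` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def log_entropy(docs):
    """docs: list of token lists (at least two). Returns {term: weight} per document."""
    n_docs = len(docs)
    if n_docs < 2:
        raise ValueError("log-entropy needs at least two documents")
    tfs = [Counter(tokens) for tokens in docs]
    # Global frequency: total occurrences of each term in the collection.
    gf = Counter()
    for tf in tfs:
        gf.update(tf)
    # Global weight: 1 for a term confined to one document,
    # 0 for a term spread evenly over all documents.
    global_w = {}
    for term, total in gf.items():
        ent = 0.0
        for tf in tfs:
            if term in tf:
                p = tf[term] / total
                ent += p * math.log(p)
        global_w[term] = 1.0 + ent / math.log(n_docs)
    return [{t: math.log(1 + c) * global_w[t] for t, c in tf.items()}
            for tf in tfs]

# Hypothetical three-document corpus.
docs = [
    "gore campaign news".split(),
    "bush campaign news".split(),
    "thai curry news".split(),
]
w = log_entropy(docs)
```

In this example "news" appears once in every document and receives a weight of zero, while "gore" appears in only one document and keeps its full local weight. The extra pass to compute per-term entropy is one source of the higher overhead compared with TFIDF.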
7Comparison to Current Techniques
- Current Techniques
- Static Corpora
- Comprehensive Statistics
- WordSieve
- Neural Network-like processing
- Stream of data
- Local learning
- Competitive Learning
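The contrast above (a stream of documents, local updates, competition for limited capacity) can be made concrete with a toy example. The sketch below is not the WordSieve algorithm itself: the class name `FrequentWordLayer`, its parameters, and its update rule are all assumptions chosen only to show how a fixed number of nodes can track currently frequent words without corpus-wide statistics.

```python
class FrequentWordLayer:
    """Toy fixed-size layer in which words compete for a limited number of nodes.

    Hypothetical illustration only; not the update rule from the talk.
    """

    def __init__(self, size=4, gain=1.0, decay=0.5, evict_below=0.5):
        self.size = size                # number of nodes available
        self.gain = gain                # activation added per occurrence
        self.decay = decay              # multiplicative decay per document
        self.evict_below = evict_below  # a node this weak can be taken over
        self.nodes = {}                 # word -> activation

    def observe(self, tokens):
        """Update the layer from one document in the stream."""
        for word in tokens:
            if word in self.nodes:
                self.nodes[word] += self.gain
            elif len(self.nodes) < self.size:
                self.nodes[word] = self.gain
            else:
                # Competition: a new word can only displace the weakest node,
                # and only once that node's activation has decayed enough.
                weakest = min(self.nodes, key=self.nodes.get)
                if self.nodes[weakest] < self.evict_below:
                    del self.nodes[weakest]
                    self.nodes[word] = self.gain
        # Local learning: every node decays after each document.
        for word in self.nodes:
            self.nodes[word] *= self.decay

# Hypothetical stream: three documents on one topic, then three on another.
layer = FrequentWordLayer(size=3)
stream = ["gore campaign vote".split()] * 3 + ["thai curry rice".split()] * 3
snapshots = []
for doc in stream:
    layer.observe(doc)
    snapshots.append(set(layer.nodes))
```

After the first three documents the nodes hold the political terms; one document into the new topic they still do, and by the end of the stream the cooking terms have taken over. Each update touches only the current document and the nodes, so no statistics over the full collection are kept.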
8Good Discriminator of Context
9WordSieve Concept
User Browsing
Attributes Term Activation Priming
10WordSieve 1
Words Absent in Document Sequences
User Profile
Context Profile
Words Occurring in Document Sequences
Words Currently Occurring Frequently
11WordSieve 2
User Profile
Words Reflecting Context
Context Profile
Words Currently Occurring Frequently
12Web Browsing Data Set
- Sixteen Users
- Four Topics, 10 minutes Each
- Political Life Al Gore
- Political Life George Bush
- Traditional Indonesian Cooking
- Traditional Thai Cooking
Categorized Document Set
Automatically Generated Queries
13Browsing Results
14Contributions
- It is possible to extract context differentiating
terms from document streams using unsupervised
competitive learning.
- Comprehensive statistics are not necessary in the
described situations given an ordering of the
documents.
- Performance is comparable to LSI and better than
Log-Entropy and TFIDF.
15Potential Next Steps
- WordSieve
- Automate Parameter Optimization
- Co-occurrence of terms
- Other Domains
- Multi-dimensional data stream
- Machine Vision
16Support
- This work was conducted under the advisement of
David Leake at Indiana University.
- It was sponsored in part by the GAANN fellowship.
- The original version of the personal information
agent was designed and written with partial
support from NASA under award No NCC 2-1035
17For More Information
- Travis Bauer
- www.cs.indiana.edu/trbauer/publications.htm
18Usenet Data Set
- Three sets of 5 newsgroups
- alt.atheism, talk.religion.misc, soc.religion.christian,
rec.sport.baseball, rec.sport.hockey
- comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware,
comp.sys.mac.hardware, rec.autos, rec.motorcycles
- talk.politics.guns, talk.politics.misc, sci.electronics,
sci.med, sci.space
Categorized Document Set
Automatically Generated Queries
19Usenet Results