Title: Text Data Mining
 1Text Data Mining
- Prof. Marti Hearst 
- UC Berkeley SIMS 
- Guest Lecture, ME 290M 
- Prof. Agogino 
- May 4, 1999
2There's Lots of Text Out There
- Is it Information Overload?
3- Why not TURBO-Text? 
- How can we SYNTHESIZE what's there to make new 
 discoveries?
4Talk Outline
- Definitions 
- What is Data Mining? 
- What is Text Data Mining? 
-  Text data mining examples 
- Lexical knowledge acquisition 
- Merging textual records 
- Finding cures for diseases (from medical 
 literature)
- Future Directions
5What is Data Mining? (Fayyad & Uthurusamy 96, 
Fayyad 97)
- Fitting models to or determining patterns from 
 very large datasets.
- A regime which enables people to interact 
 effectively with massive data stores.
- Deriving new information from data. 
- finding patterns across large datasets 
- discovering heretofore unknown information
6What is Data Mining?
- Potential point of confusion 
- The "extracting ore from rock" metaphor does not 
 really apply to the practice of data mining
- If it did, then standard database queries would 
 fit under the rubric of data mining
- "Find all employee records in which the employee 
 earns $300/month less than their manager"
- In practice, DM refers to 
- finding patterns across large datasets 
- discovering heretofore unknown information
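To make the contrast concrete, the employee query above can be written as an ordinary self-join. This is a minimal sketch: the `employee` table, its columns, and the sample rows are hypothetical, and "$300/month less" is read here as "at least $300 less". The query returns facts already stored in the table; it does not look for patterns across the data.

```python
import sqlite3


def underpaid_vs_manager(rows, gap=300):
    """Return names of employees earning at least `gap` less than their manager.

    `rows` are (id, name, monthly_salary, manager_id) tuples; this schema is
    an assumption made for illustration only.
    """
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE employee ("
        "id INTEGER PRIMARY KEY, name TEXT, "
        "monthly_salary INTEGER, manager_id INTEGER)"
    )
    con.executemany("INSERT INTO employee VALUES (?, ?, ?, ?)", rows)
    # Self-join: pair each employee row (e) with its manager's row (m).
    result = con.execute(
        "SELECT e.name FROM employee e "
        "JOIN employee m ON e.manager_id = m.id "
        "WHERE m.monthly_salary - e.monthly_salary >= ? "
        "ORDER BY e.name",
        (gap,),
    ).fetchall()
    con.close()
    return [name for (name,) in result]


# Invented sample data.
EXAMPLE_ROWS = [
    (1, "Ada", 5000, None),  # top of the hierarchy, no manager
    (2, "Ben", 4800, 1),     # 200 below Ada
    (3, "Cy", 4500, 1),      # 500 below Ada
    (4, "Di", 4300, 3),      # 200 below Cy
]

print(underpaid_vs_manager(EXAMPLE_ROWS))
```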
7Why Data Mining?
- Because the data is there. 
- Because current DBMS technology does not support 
 data analysis.
- Because 
- larger disks 
- faster cpus 
- high-powered visualization 
- networked information 
- are becoming widely available.
8DM Touchstone Applications (CACM 39 (11) Special 
Issue)
- Finding patterns across data sets 
- Reports on changes in retail sales 
- to improve sales 
- Patterns of sizes of TV audiences 
- for marketing 
- Patterns in NBA play 
- to alter, and so improve, performance 
- Deviations in standard phone calling behavior 
- to detect fraud 
- for marketing
9DM Touchstone Applications (CACM 39 (11) Special 
Issue)
- Separating signal from noise 
- Classifying faint astronomical objects 
- Finding genes within DNA sequences 
- Discovering novel tectonic activity
10What is Text Data Mining?
- People's first thought: 
- Make it easier to find things on the Web. 
- This is information retrieval! 
- The metaphor of extracting ore from rock does 
 make sense for extracting documents of interest
 from a huge pile.
- But does not reflect notions of DM in practice 
- finding patterns across large collections 
- discovering heretofore unknown information
11Text DM ≠ IR
- Data Mining 
- Patterns, Nuggets, Exploratory Analysis 
- Information Retrieval 
- Finding and ranking documents that match a user's 
 information need
- ad hoc query 
- filtering/standing query 
- Rarely Patterns, Exploratory Analysis
12Real Text DM
- The point 
- Discovering heretofore unknown information is not 
 what we usually do with text.
- (If it weren't known, it could not have been 
 written by someone.)
- However 
- There is a field whose goal is to learn about 
 patterns in text for its own sake ...
13Computational Linguistics
- Goal: automated language understanding 
- this isn't possible 
- instead, go for subgoals, e.g., 
- word sense disambiguation 
- phrase recognition 
- semantic associations 
- Current approach 
- statistical analyses of very large text 
 collections
14WordNet: A Lexical Database
A list of hypernyms for each sense of crow 
 15Lexicographic Knowledge Acquisition
- Given a large lexical database ... 
- WordNet: Miller, Fellbaum et al. at Princeton 
- http://www.cogsci.princeton.edu/wn 
- ... and a huge text collection 
- How to automatically add new relations?
16Idea: Use Simple Lexico-Syntactic Analysis
- Patterns of the following type work 
- NP0 such as NP1, NP2, ..., (and | or) NPi 
- i ≥ 1, implies 
- for all NPi, i ≥ 1, hyponym(NPi, NP0) 
- Example 
- Agar is a substance prepared from a mixture of 
 red algae, such as Gelidium, for laboratory or
 industrial use.
- implies hyponym(Gelidium, red algae)
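The "such as" pattern above can be sketched with a regular expression. This is an illustrative sketch, not the implementation behind the slides: it assumes noun phrases have already been marked with square brackets by some upstream chunker (the bracket notation and the function name are assumptions), and it only handles the single pattern shown.

```python
import re

# NP0 [,] such as NP1 [, NP2 ...] [(and | or) NPi], where each NP is
# written as "[...]" by a hypothetical upstream phrase chunker.
SUCH_AS = re.compile(
    r"\[(?P<hyper>[^\]]+)\]\s*,?\s*such as\s+"
    r"(?P<hypos>\[[^\]]+\](?:\s*,?\s*(?:and\s+|or\s+)?\[[^\]]+\])*)"
)


def such_as_hyponyms(chunked_sentence):
    """Return (hyponym, hypernym) pairs suggested by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(chunked_sentence):
        hypernym = match.group("hyper")
        # Every bracketed NP in the list after "such as" is a candidate hyponym.
        for hyponym in re.findall(r"\[([^\]]+)\]", match.group("hypos")):
            pairs.append((hyponym, hypernym))
    return pairs


AGAR = (
    "[Agar] is [a substance] prepared from [a mixture] of [red algae], "
    "such as [Gelidium], for [laboratory or industrial use]."
)
print(such_as_hyponyms(AGAR))  # [('Gelidium', 'red algae')]
```

The regular expression cannot tell where the list of hyponyms really ends, so a bracketed phrase that directly follows the list would be picked up by mistake. Suggested pairs still need to be checked.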
17More Examples
- "Felonies, such as shootings and stabbings ..." 
 implies
- hyponym(shootings, felonies) 
- hyponym(stabbings, felonies) 
- Is this in the WordNet hierarchy? 
18Linking Killing to Felonies 
 19Another Example
- Einstein is (was) a physicist. 
- Is/was he a genius?
20Making Einstein a Genius 
 21Results from "such as" lexico-syntactic relation 
 22Results with the "or other" lexico-syntactic 
relation 
 23Procedure
- Discover a pattern that indicates a lexical 
 relationship
- Scan through a large collection and extract 
 sentences that match the pattern
- Extract the NPs from the sentence 
- requires some phrase parsing 
- Check if suggested relation is in WordNet or not 
- this part not automated, but could be
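The last step, checking whether a suggested relation is already in the lexical database, could be automated along these lines. This is a sketch under stated assumptions: `TOY_HYPERNYMS` is a tiny hand-made stand-in, not actual WordNet content, and a real system would query WordNet itself and handle word senses and plurals.

```python
def is_known_hyponym(child, parent, hypernyms):
    """True if `parent` is reachable from `child` by following hypernym links."""
    seen = set()
    frontier = [child]
    while frontier:
        term = frontier.pop()
        if term in seen:
            continue  # guard against cycles in the hand-made map
        seen.add(term)
        for hypernym in hypernyms.get(term, ()):
            if hypernym == parent:
                return True
            frontier.append(hypernym)
    return False


def split_known_and_new(candidates, hypernyms):
    """Split (hyponym, hypernym) candidates into already-known and new pairs."""
    known, new = [], []
    for child, parent in candidates:
        target = known if is_known_hyponym(child, parent, hypernyms) else new
        target.append((child, parent))
    return known, new


# Toy hypernym map: term -> list of direct hypernyms (illustrative only).
TOY_HYPERNYMS = {
    "crow": ["bird"],
    "bird": ["animal"],
    "shooting": ["killing"],
}

known, new = split_known_and_new(
    [("crow", "animal"), ("shooting", "felony")], TOY_HYPERNYMS
)
print("already known:", known)
print("candidate additions:", new)
```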
24Discovering New Patterns
- Suggested algorithm 
- Decide on a lexical relation of interest, e.g., 
 hyponymy
- Derive a list of word pairs from WordNet that are 
 known to hold that relation
- e.g., (crow, bird) 
- Extract sentences from the text collection in 
 which both terms occur
- Find commonalities among lexico-syntactic contexts 
- Test these out against other word pairs known to 
 hold the relationship in WordNet
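A rough sketch of the middle steps of this algorithm: for word pairs known to hold the relation, collect the text that appears between the two terms and count how often each context recurs. The sentences and pairs below are invented, the crude plural handling (an optional trailing "s") is an assumption, and real lexico-syntactic contexts would need parsing rather than raw strings.

```python
import re
from collections import Counter


def contexts_between(pairs, sentences, max_gap=30):
    """Count the short text spans linking known (hyponym, hypernym) pairs."""
    counts = Counter()
    for hypo, hyper in pairs:
        # Try both orders, since the hypernym may come first or second.
        orders = (
            (hyper, hypo, "HYPER {} HYPO"),
            (hypo, hyper, "HYPO {} HYPER"),
        )
        for sentence in sentences:
            text = sentence.lower()
            for first, second, template in orders:
                pattern = (
                    r"\b" + re.escape(first) + r"s?\b"
                    r"(.{1," + str(max_gap) + r"}?)"
                    r"\b" + re.escape(second) + r"s?\b"
                )
                match = re.search(pattern, text)
                if match:
                    counts[template.format(match.group(1).strip())] += 1
    return counts


KNOWN_PAIRS = [("crow", "bird"), ("oak", "tree")]  # e.g. drawn from a lexicon
SENTENCES = [
    "Birds such as crows are clever.",
    "Trees such as oaks lose their leaves.",
    "Crows and other birds gathered.",
]

for context, n in contexts_between(KNOWN_PAIRS, SENTENCES).most_common():
    print(n, context)
```

Contexts that recur across many known pairs are candidates for new patterns. The final step on the slide, testing them against other known pairs, is not shown here.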
25Text Merging Example: Discovering Hypocritical 
Congresspersons 
 26Discovering Hypocritical Congresspersons
- Feb 1, 1996 
- US House of Reps votes to pass Telecommunications 
 Reform Act
- this contains the CDA (Communications Decency 
 Act)
- violators subject to fines of $250,000 and 5 
 years in prison
- eventually struck down by court
27Discovering Hypocritical Congresspersons
- Sept 11, 1998 
- US House of Reps votes to place the Starr report 
 online
- the content would (most likely) have violated the 
 CDA
- 365 people were members for both votes 
- 284 members voted aye both times 
- 185 (94%) Republicans voted aye both times 
- 96 (57%) Democrats voted aye both times 
 30How to find Hypocritical Congresspersons?
- This must have taken a lot of work 
- Hand cutting and pasting 
- Lots of picky details 
- Some people voted on one but not the other bill 
- Some people share the same name 
- Check for different county/state 
- Still messed up on Bono 
- Taking stats at the end on various attributes 
- Which state 
- Which party 
- Tools should help streamline, reuse results
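The cut-and-paste merge described above can be sketched as a join on a composite key. Everything here is an assumption for illustration: the record fields, the `(name, state)` key, and the sample members are invented and are not the actual roll-call data. Keying on state as well as name separates members who share a name, and intersecting the keys drops members who voted on only one bill.

```python
from collections import Counter


def aye_both_times(vote_a, vote_b):
    """Merge two roll calls and tally members who voted aye on both.

    Each roll call is a list of dicts with keys: name, state, party, vote.
    """
    def index(records):
        # (name, state) disambiguates members who share a name.
        return {(r["name"], r["state"]): r for r in records}

    a, b = index(vote_a), index(vote_b)
    shared = sorted(a.keys() & b.keys())  # members present for both votes
    both = [
        key for key in shared
        if a[key]["vote"] == "aye" and b[key]["vote"] == "aye"
    ]
    by_party = Counter(a[key]["party"] for key in both)
    return shared, both, by_party


# Invented sample records.
VOTE_1996 = [
    {"name": "Smith", "state": "CA", "party": "R", "vote": "aye"},
    {"name": "Smith", "state": "TX", "party": "D", "vote": "aye"},
    {"name": "Jones", "state": "NY", "party": "D", "vote": "nay"},
    {"name": "Lee", "state": "OH", "party": "R", "vote": "aye"},
]
VOTE_1998 = [
    {"name": "Smith", "state": "CA", "party": "R", "vote": "aye"},
    {"name": "Smith", "state": "TX", "party": "D", "vote": "nay"},
    {"name": "Jones", "state": "NY", "party": "D", "vote": "aye"},
    {"name": "Park", "state": "WA", "party": "D", "vote": "aye"},
]

shared, both, by_party = aye_both_times(VOTE_1996, VOTE_1998)
print(len(shared), "members in both votes;", len(both), "voted aye twice")
print(dict(by_party))
```

A name-plus-state key can still conflate two different people, for example when one member is succeeded by another with the same surname from the same state, so a real tool would need a stronger identifier.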
31How to find Hypocritical Congresspersons?
- The hard part? 
- Knowing to compare these two sets of voting 
 records.
32How to find causes of disease? Don Swanson's 
Medical Work
- Given 
- medical titles and abstracts 
- a problem (incurable rare disease) 
- some medical expertise 
- find causal links among titles 
- symptoms 
- drugs 
- results 
33Swanson Example (1991)
- Problem: Migraine headaches (M) 
- stress associated with M 
- stress leads to loss of magnesium 
- calcium channel blockers prevent some M 
- magnesium is a natural calcium channel blocker 
- spreading cortical depression (SCD) implicated in 
 M
- high levels of magnesium inhibit SCD 
- M patients have high platelet aggregability 
- magnesium can suppress platelet aggregability 
- All extracted from medical journal titles
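The linking idea in this example can be sketched as a search for intermediate terms: terms that appear in titles mentioning migraine and, separately, in titles mentioning magnesium. This is a simplified sketch, not Swanson's actual procedure. The titles below are the slide's bullet statements with "M" spelled out, and the term vocabulary is supplied by hand, which is where the medical expertise comes in.

```python
def linking_terms(titles, a_term, c_term, vocabulary):
    """Return vocabulary terms that co-occur with a_term in some titles and
    with c_term in other titles, ignoring titles that mention both."""
    a_side, c_side = set(), set()
    for title in titles:
        text = title.lower()
        present = {term for term in vocabulary if term in text}
        if a_term in text and c_term not in text:
            a_side |= present
        elif c_term in text and a_term not in text:
            c_side |= present
    return sorted(a_side & c_side)


TITLES = [
    "Stress associated with migraine",
    "Stress leads to loss of magnesium",
    "Calcium channel blockers prevent some migraine",
    "Magnesium is a natural calcium channel blocker",
    "Spreading cortical depression implicated in migraine",
    "High levels of magnesium inhibit spreading cortical depression",
    "Migraine patients have high platelet aggregability",
    "Magnesium can suppress platelet aggregability",
]
# Hand-picked candidate linking terms (an assumption of this sketch).
VOCABULARY = [
    "stress",
    "calcium channel blocker",
    "spreading cortical depression",
    "platelet aggregability",
]

for term in linking_terms(TITLES, "migraine", "magnesium", VOCABULARY):
    print("migraine <->", term, "<-> magnesium")
```

Substring matching finds the candidate links but says nothing about the direction or plausibility of each one. Judging that is still left to the user.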
34Swanson's TDM
- Two of his hypotheses have received some 
 experimental verification.
- His technique 
- Only partially automated 
- Required medical expertise 
- Few people are working on this.
35How to Automate This?
- Idea: mixed-initiative interaction 
- User applies tools to help explore the hypothesis 
 space
- System runs suites of algorithms to help explore 
 the space, suggest directions
36Our Proposed Approach
- Three main parts 
- UI for building/using strategies 
- Backend for interfacing with various databases 
 and translating different formats
- Content analysis/machine learning for figuring 
 out good hypotheses/throwing out bad ones
37The UI part
- Need support for building strategies 
- Mixed-initiative system 
- Trade-off between user-initiated hypothesis 
 exploration and system-initiated suggestions
- Information visualization 
- Another way to show lots of choices
38Candidate Associations
Suggested Strategies
Current Retrieval Results 
 39LINDI: Linking Information for Novel Discovery 
and Insight
- Just starting up now (fall 98) 
- Initial work: Hao Chen, Ketan Mayer-Patel, 
 Shankar Raman
40Ore-Filled Text Collections
- Congressional Voting Records 
- Answer questions like 
- Who are the most hypocritical congresspeople? 
- Medical Articles 
- Create hypotheses about causes of rare diseases 
- Create hypotheses about gene function 
- Patent Law 
- Answer questions like 
- Is government funding of research worthwhile? 
41Summary
- Text Data Mining 
- Extracting heretofore undiscovered information 
 from large text collections
- Not the same as information retrieval 
- Examples 
- Lexicographic knowledge acquisition 
- Merging of text representations 
- Linking related information 
- The truth is out there!