Title: Text Tango: A New Text Data Mining Project
1 Text Tango: A New Text Data Mining Project
- Marti A. Hearst
- GUIR Meeting, Sept 17, 1998
2 Talk Outline
- What is Data Mining?
- What isn't Text Data Mining?
- What is Text Data Mining?
- Examples
- A proposal for a system for Text Data Mining
3 What is Data Mining? (Fayyad & Uthurusamy 96, Fayyad 97)
- Fitting models to or determining patterns from very large datasets.
- A regime which enables people to interact effectively with massive data stores.
- Deriving new information from data.
- finding patterns across large datasets
- discovering heretofore unknown information
4 What is Data Mining?
- Potential point of confusion
- The extracting ore from rock metaphor does not really apply to the practice of data mining
- If it did, then standard database queries would fit under the rubric of data mining
- Find all employee records in which employee earns 300/month less than their managers
- In practice, DM refers to
- finding patterns across large datasets
- discovering heretofore unknown information
5 DM Touchstone Applications (CACM 39 (11) Special Issue)
- Finding patterns across data sets
- Reports on changes in retail sales
- to improve sales
- Patterns of sizes of TV audiences
- for marketing
- Patterns in NBA play
- to alter, and so improve, performance
- Deviations in standard phone calling behavior
- to detect fraud
- for marketing
6 What is Text Data Mining?
- People's first thought
- Make it easier to find things on the Web.
- This is information retrieval!
- The metaphor of extracting ore from rock does make sense for extracting documents of interest from a huge pile.
- But it does not reflect notions of DM in practice
- finding patterns across large collections
- discovering heretofore unknown information
7 Text DM != IR
- Data Mining
- Patterns, Nuggets, Exploratory Analysis
- Information Retrieval
- Finding and ranking documents that match users' information need
- ad hoc query
- filtering/standing query
8 Real Text DM
- What would finding a pattern across a large text collection really look like?
9 Bill Gates MS-DOS in the Bible!
From "The Internet Diary of the man who cracked the Bible Code," Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
(William Gates, agitator, leader)
10 From "The Internet Diary of the man who cracked the Bible Code," Brendan McKay, Yahoo Internet Life, www.zdnet.com/yil
11 Real Text DM
- The point
- Discovering heretofore unknown information is not what we usually do with text.
- (If it weren't known, it could not have been written by someone.)
- However
- There are some interesting problems of this type!
12 Combining Data Types for Novel Tasks
- Text + Links to find authority pages (Kleinberg at Cornell, Page at Stanford)
- Usage + Time + Links to study evolution of web and information use (Pitkow et al. at PARC)
13 Ore-Filled Text Collections
- Congressional Voting Records
- Answer questions like
- Who are the most hypocritical congresspeople?
- Medical Articles
- Create hypotheses about causes of rare diseases
- Create hypotheses about gene function
- Patent Law
- Answer questions like
- Is government funding of research worthwhile?
16 How to find Hypocritical Congresspersons?
- This must have taken a lot of work
- Hand cutting and pasting
- Lots of picky details
- Some people voted on one but not the other bill
- Some people share the same name
- Check for different county/state
- Still messed up on Bono
- Taking stats at the end on various attributes
- Which state
- Which party
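The hand work listed above (dropping people who voted on only one bill, telling apart people who share a name, taking stats at the end) can be sketched in a few lines. This is only an illustration: it assumes "hypocritical" means voting differently on two bills that the analyst considers to call for the same position, and the records, names, and the helpers `index_votes`, `inconsistent_voters`, and `tally_by_party` are all made up for the example.

```python
from collections import Counter

# Hypothetical roll-call records: (name, state, party, vote). Not real data.
BILL_A = [
    ("Smith", "CA", "D", "yea"),
    ("Smith", "TX", "R", "yea"),
    ("Jones", "NY", "R", "nay"),
    ("Lee", "OH", "D", "yea"),
]
BILL_B = [
    ("Smith", "CA", "D", "nay"),
    ("Smith", "TX", "R", "yea"),
    ("Jones", "NY", "R", "nay"),
    ("Park", "WA", "D", "yea"),
]


def index_votes(records):
    """Key each vote by (name, state) so members who share a name stay distinct."""
    return {(name, state): (party, vote) for name, state, party, vote in records}


def inconsistent_voters(bill_a, bill_b):
    """Return members who voted on both bills but cast different votes."""
    a, b = index_votes(bill_a), index_votes(bill_b)
    both = a.keys() & b.keys()  # drops members who voted on only one bill
    return sorted(key for key in both if a[key][1] != b[key][1])


def tally_by_party(keys, bill):
    """Count flagged members per party (the same idea works per state)."""
    votes = index_votes(bill)
    return Counter(votes[key][0] for key in keys)


flagged = inconsistent_voters(BILL_A, BILL_B)
party_counts = tally_by_party(flagged, BILL_A)
print(flagged, dict(party_counts))
```

Keying on (name, state) is the simplest stand-in for the "check for different county/state" step; it would still confuse two members with the same name from the same state.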
17 How to find causes of disease? Don Swanson's Medical Work
- Given
- medical titles and abstracts
- a problem (incurable rare disease)
- some medical expertise
- find causal links among titles
- symptoms
- drugs
- results
18 Swanson Example (1991)
- Problem: Migraine headaches (M)
- stress associated with M
- stress leads to loss of magnesium
- calcium channel blockers prevent some M
- magnesium is a natural calcium channel blocker
- spreading cortical depression (SCD) implicated in M
- high levels of magnesium inhibit SCD
- M patients have high platelet aggregability
- magnesium can suppress platelet aggregability
- All extracted from medical journal titles
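The chain of links above has a simple shape: migraine co-occurs with some intermediate term, that term co-occurs with magnesium, and migraine and magnesium never appear together. A minimal sketch of that shape follows. It is not Swanson's procedure, which the slides describe as only partially automated and dependent on medical expertise. The "titles" are toy term sets paraphrasing the bullets, and `cooccurrence` and `indirect_links` are hypothetical helper names.

```python
from collections import defaultdict
from itertools import combinations

# Toy "titles", each reduced to a set of concept terms. These paraphrase the
# bullets above and are illustrative stand-ins, not real journal titles.
TITLES = [
    {"stress", "migraine"},
    {"stress", "magnesium"},
    {"calcium channel blocker", "migraine"},
    {"calcium channel blocker", "magnesium"},
    {"spreading cortical depression", "migraine"},
    {"spreading cortical depression", "magnesium"},
    {"platelet aggregability", "migraine"},
    {"platelet aggregability", "magnesium"},
]


def cooccurrence(titles):
    """Map each term to the set of terms that share at least one title with it."""
    neighbors = defaultdict(set)
    for terms in titles:
        for a, b in combinations(terms, 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    return neighbors


def indirect_links(titles, start):
    """Find terms never seen with `start` directly, with the terms that bridge them."""
    neighbors = cooccurrence(titles)
    direct = neighbors[start]
    bridges = defaultdict(set)
    for b in direct:
        for c in neighbors[b]:
            if c != start and c not in direct:
                bridges[c].add(b)
    return bridges


links = indirect_links(TITLES, "migraine")
for target, via in links.items():
    print(target, "via", sorted(via))
```

On real titles a search like this would return a very large number of candidate links, which is one reason domain expertise was needed to pick the plausible ones.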
19 Swanson's TDM
- Two of his hypotheses have received some experimental verification.
- His technique
- Only partially automated
- Required medical expertise
- Few people are working on this.
20 How to find functions of genes?
- Important problem in molecular biology
- Have the genetic sequence
- Don't know what it does
- But
- Know which genes it coexpresses with
- Some of these have known function
- So: Infer function based on function of co-expressed genes
- This is new work by Michael Walker and others at Incyte Pharmaceuticals
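The inference step can be pictured as a vote among the co-expressed genes whose function is already known. The sketch below is an assumption about how such a vote might look, not the method used at Incyte. The gene names, function labels, and `infer_function` are placeholders.

```python
from collections import Counter

# Hypothetical annotations; gene names and function labels are placeholders.
KNOWN_FUNCTION = {
    "geneA": "prostate cancer marker",
    "geneB": "prostate cancer marker",
    "geneC": "digestion",
}
COEXPRESSED_WITH = {
    # geneD has no known function, so it contributes no vote.
    "mystery": ["geneA", "geneB", "geneC", "geneD"],
}


def infer_function(gene, coexpressed, known):
    """Rank candidate functions by how many co-expressed genes carry them."""
    labels = [known[g] for g in coexpressed.get(gene, []) if g in known]
    return Counter(labels).most_common()


ranking = infer_function("mystery", COEXPRESSED_WITH, KNOWN_FUNCTION)
print(ranking)
```

The ranking is only a list of hypotheses to check against the literature and the lab.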
21 Gene Co-expression: Role in the genetic pathway
(Diagram labels: Kall., PSA, PAP, g?, h?)
- Other possibilities as well
22 Make use of the literature
- Look up what is known about the other genes.
- Different articles in different collections
- Look for commonalities
- Similar topics indicated by Subject Descriptors
- Similar words in titles and abstracts
- adenocarcinoma, neoplasm, prostate, prostatic neoplasms, tumor markers, antibodies ...
23 Developing Strategies
- Different strategies seem needed for different situations
- First see what is known about Kallikrein.
- 7341 documents. Too many
- AND the result with disease category
- If result is non-empty, this might be an interesting gene
- Now get 803 documents
- AND the result with PSA
- Get 11 documents. Better!
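The narrowing above is a sequence of boolean ANDs, with the result size checked after each step. A minimal sketch over a toy inverted index follows. The index contents and document ids are made up (they do not reproduce the counts above), and `narrow` is a hypothetical helper, not part of any particular IR system.

```python
# Toy inverted index: term -> set of document ids. Ids and postings are made up.
INDEX = {
    "kallikrein": {1, 2, 3, 4, 5, 6, 7, 8},
    "disease": {2, 3, 4, 5, 9, 10},
    "psa": {3, 4, 11},
}


def narrow(index, terms):
    """AND the terms in order, recording the result size after each step."""
    result = None
    trace = []
    for term in terms:
        postings = index.get(term, set())
        result = set(postings) if result is None else result & postings
        trace.append((term, len(result)))
        if not result:  # an empty result ends this line of inquiry
            break
    return result, trace


docs, trace = narrow(INDEX, ["kallikrein", "disease", "psa"])
print(trace)
print(sorted(docs))
```

Keeping the trace is what lets the researcher judge "too many" versus "better" at each step without rerunning the query by hand.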
24 Developing Strategies
- Look for commonalities among these documents
- Manual scan through 100 category labels
- Would have been better if
- Automatically organized
- Intersections of important categories scanned for first
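One way to replace the manual scan is to count how often each category label occurs across the retrieved documents, then count how often pairs of labels occur together, so the most frequent intersections surface first. The sketch below assumes each document carries a set of category labels. The documents and the helpers `rank_categories` and `rank_category_pairs` are illustrative only.

```python
from collections import Counter
from itertools import combinations

# Hypothetical retrieved documents with their category labels (subject descriptors).
DOC_CATEGORIES = {
    "doc1": {"prostatic neoplasms", "tumor markers", "antibodies"},
    "doc2": {"prostatic neoplasms", "tumor markers"},
    "doc3": {"prostatic neoplasms", "adenocarcinoma"},
}


def rank_categories(doc_categories):
    """Order category labels by the number of documents that carry them."""
    counts = Counter(
        label for labels in doc_categories.values() for label in labels
    )
    return counts.most_common()


def rank_category_pairs(doc_categories):
    """Order pairs of labels by the number of documents carrying both."""
    pairs = Counter()
    for labels in doc_categories.values():
        # Sorting makes each pair appear in one canonical order.
        pairs.update(combinations(sorted(labels), 2))
    return pairs.most_common()


top_category = rank_categories(DOC_CATEGORIES)[0]
top_pair = rank_category_pairs(DOC_CATEGORIES)[0]
print(top_category, top_pair)
```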
25 Try a new tack
- Researcher uses knowledge of field to realize these are related to prostate cancer and diagnostic tests
- New tack: intersect search on all three known genes
- Hope they all talk about diagnostics and prostate cancer
- Fortunately, 7 documents returned
- Bingo! A relation to regulation of this cancer
26 Formulate a Hypothesis
- Hypothesis: mystery gene has to do with regulation of expression of genes leading to prostate cancer
- New tack: do some lab tests
- See if mystery gene is similar in molecular structure to the others
- If so, it might do some of the same things they do
27 Strategies again
- In hindsight, combining all three genes was a good strategy.
- Store this for later
- Might not have worked
- Need a suite of strategies
- Build them up via experience and a good UI
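"Store this for later" suggests treating a strategy as a named, reusable recipe that can be rerun on a different set of genes. The sketch below shows one possible representation, assumed here purely for illustration. The registry, the strategy names, and the query syntax are not taken from the proposed system.

```python
# A strategy here is a named recipe that turns a list of genes into a list of
# boolean queries. This representation is an assumption for illustration.
STRATEGIES = {}


def strategy(name):
    """Decorator that stores a strategy under `name` for later reuse."""
    def register(fn):
        STRATEGIES[name] = fn
        return fn
    return register


@strategy("intersect-all-known-genes")
def intersect_all(genes):
    """One query that requires every gene to be mentioned."""
    return [" AND ".join(genes)]


@strategy("narrow-one-gene-at-a-time")
def narrow_stepwise(genes):
    """A series of queries that adds one gene per step."""
    return [" AND ".join(genes[: i + 1]) for i in range(len(genes))]


def run_all(genes):
    """Generate the queries every stored strategy would issue for these genes."""
    return {name: fn(genes) for name, fn in STRATEGIES.items()}


queries = run_all(["kallikrein", "PSA", "PAP"])
for name, qs in queries.items():
    print(name, qs)
```

Because a strategy may not work in a new situation, running the whole suite and comparing results is closer to the intent than picking one in advance.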
28 The System
- Doing the same query with slightly different values each time is time-consuming and tedious
- Same goes for cutting and pasting results
- IR systems don't support varying queries like this very well.
- Each situation is a bit different
- Some automatic processing is needed in the background to eliminate/suggest hypotheses
29 The System
- Three main parts
- UI for building/using strategies
- Backend for interfacing with various databases and translating different formats
- Content analysis/machine learning for figuring out good hypotheses/throwing out bad ones
30 The UI part
- Need support for building strategies
- Lots of info lying around, so a nice option is ...
- Two-handed interface
- Big table display
- Mixed-initiative system
- Trade off between user-initiated hypotheses exploration and system-initiated suggestions
- Information visualization
- Another way to show lots of choices
31 Candidate Associations
Suggested Strategies
Current Retrieval Results
32 Other applications
- Patent example
- Political example
- The truth's out there!
33 Text Tango
- Just starting up now.
- Let me know if you'd like to work on it!