Title: CIS392 Text Retrieval
1. CIS392 Text Retrieval Mining
- Exploiting the Structure of Text
- Material: Sullivan Ch. 3 (excluding integration with data warehouses and the WWW) and Ch. 8
2. Text-Oriented Business Intelligence
- How do business intelligence analysts work?
- Summarizing documents
- Classifying and routing documents to interested readers
- Answering questions
- Searching and browsing by topic and theme
- Searching with a topic
- Browsing by topic
- Searching by example
3. Summarizing Documents
- FEDERAL RESERVE POLICY FACILITATED BY MARKET PRICE INDICATORS, ACCORDING TO NEW JEC STUDY: http://www.house.gov/jec/press/2000/10-18-0.htm
- Summarized text: http://web.njit.edu/wu/teaching/sp03/CIS392/JECPressRelease.htm
4. Summarization
- Problems
- News story summaries tend to read reasonably well. Automatic summaries of other kinds of documents may lack logical flow.
- Points of an argument may appear in the wrong order.
- Summarizing to 20% of the original can still yield a long text, but anything below 5% is not understandable.
5. Undirected Summarization
- Does not use patterns or templates.
- Selects and copies the most important sentences from the original document.
- Methods
- Add up the frequencies of the words in each sentence and select the sentences with a high total frequency.
- Find trigger words or phrases, e.g., "in conclusion."
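The frequency-based method can be sketched in a few lines of Python. This is a minimal illustration, not a production summarizer: the sentence splitter and word tokenizer are crude assumptions.

```python
import re
from collections import Counter

def summarize(text, n_sentences=2):
    # Crude sentence splitter: break on terminal punctuation.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    # Document-wide word frequencies.
    freq = Counter(re.findall(r'[a-z]+', text.lower()))
    # Score a sentence by the total frequency of its words.
    def score(s):
        return sum(freq[w] for w in re.findall(r'[a-z]+', s.lower()))
    # Keep the n highest-scoring sentences, restored to document order.
    top = sorted(sentences, key=score, reverse=True)[:n_sentences]
    return ' '.join(sorted(top, key=sentences.index))
```

Sentences that repeat the document's frequent words score highest and survive; rare-word sentences (asides, examples) are dropped.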
6. Directed Summarization
- Also called information extraction.
- The items (key phrases) to find are pre-defined.
- Templates and patterns are pre-defined.
- Processing involves noun phrase identification, pattern matching, and template filling.
7. NJIT CIS 634 Information Retrieval, Fall 2002
- Information Extraction
- Material: "Information Extraction: Techniques and Challenges," by Ralph Grishman
8. What do people want from IE?
- Lists of relevant entities rather than lists of relevant documents.
- How many companies filed for bankruptcy in 2001?
- How many universities are there in the United States?
9. Definitions
- IE is the identification of instances of a particular class of events or relationships in a natural language text, and the extraction of the relevant arguments of the event or relationship.
- It involves the creation of a structured representation of selected information drawn from the text.
10. Example
- Text (19 March): "A bomb went off this morning near a power tower in San Salvador, leaving a large part of the city without energy, but no casualties have been reported. According to unofficial sources, the bomb, allegedly detonated by urban guerrilla commandos, blew up a power tower in the northwestern part of San Salvador at 0650 (1250 GMT)."
11. Results
- INCIDENT TYPE: bombing
- DATE: March 19
- LOCATION: El Salvador: San Salvador (city)
- PERPETRATOR: urban guerrilla commandos
- PHYSICAL TARGET: power tower
- HUMAN TARGET: -
- EFFECT ON PHYSICAL TARGET: destroyed
- EFFECT ON HUMAN TARGET: no injury or death
- INSTRUMENT: bomb
12. Top-Level Overview of the Process
- Facts are extracted from text through local text analysis.
- Facts are integrated, producing larger facts or new facts.
- Facts are translated into the required format.
- Domain vs. scenario vs. template.
13. Desired Outputs
- Scenario: "Sam Schwartz retired as executive vice president of the famous hot dog manufacturer, Hupplewhite, Inc. He will be succeeded by Harry Himmelfarb."
- Templates
- Event: start job
- Person: Harry Himmelfarb
- Position: executive vice president
- Company: Hupplewhite Inc.
- Event: leave job
- Person: Sam Schwartz
- Position: executive vice president
- Company: Hupplewhite Inc.
14. Pattern Creation and Template Structure Building
- Create sets of expression patterns:
- "Person retires as position."
- "Person is succeeded by person."
- Structures for templates:
- Entities
- Events
- (The role of the patterns is to extract events or relationships relevant to the scenario.)
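One simple way to realize such surface patterns is with regular expressions. The patterns below are illustrative assumptions, hand-built for this one scenario sentence; real IE systems match over syntactic analyses, not raw strings.

```python
import re

# Hypothetical surface patterns approximating "person retires as position"
# and "person is succeeded by person".
LEAVE_JOB = re.compile(
    r'(?P<person>[A-Z][a-z]+ [A-Z][a-z]+) retired as (?P<position>[a-z ]+?) of')
SUCCEED = re.compile(
    r'will be succeeded by (?P<person>[A-Z][a-z]+ [A-Z][a-z]+)')

text = ("Sam Schwartz retired as executive vice president of the famous "
        "hot dog manufacturer, Hupplewhite, Inc. He will be succeeded by "
        "Harry Himmelfarb.")

leave = LEAVE_JOB.search(text)   # fills the "leave job" template slots
succ = SUCCEED.search(text)      # fills the "start job" person slot
```

Note that the second pattern cannot tell us *who* is being succeeded: "He" still needs coreference resolution, which is handled later in discourse analysis.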
15. Local Text Analysis, Step 1: Lexical Analysis
- The text is first divided into sentences and into tokens.
- Each token is looked up in the dictionaries (general vs. specialized) to determine its possible parts of speech and features.
16. Local Text Analysis, Steps 2 and 3
- Name Recognition
- Identifying various types of proper names and other special forms (e.g., dates, currency).
- Syntactic Structure
- Arguments are mostly noun phrases.
- Relationships are grammatical functional relations.
- Examples: company-description, company-name, position of company.
17. Example of Syntactic Structure
- [np-e1 Sam Schwartz] [vg retired] as [np-e2 executive vice president] of [np-e3 the famous hot dog manufacturer], [np-e4 Hupplewhite, Inc.]
- [np-e5 He] [vg will be succeeded] by [np-e6 Harry Himmelfarb].
18. Example (cont.)
- Semantic entities:
- Entity e1: type=person, name="Sam Schwartz"
- Entity e2: type=position, value="executive vice president"
- Entity e3: type=manufacturer
- Entity e4: type=company, name="Hupplewhite Inc."
- Entity e5: type=person
- Entity e6: type=person, name="Harry Himmelfarb"
- Updated according to the pattern "position of company":
- Entity e1: type=person, name="Sam Schwartz"
- Entity e2: type=position, value="executive vice president", company=e3
- Entity e3: type=manufacturer, name="Hupplewhite Inc."
- Entity e5: type=person
- Entity e6: type=person, name="Harry Himmelfarb"
19. Local Text Analysis, Step 4: Scenario Pattern Matching
- Extract the events or relationships relevant to the scenario, which in this case is executive succession.
- Pattern: person (A) is succeeded by person (B).
- Entity e1: type=person, name="Sam Schwartz"
- Entity e2: type=position, value="executive vice president"
- Entity e3: type=manufacturer, name="Hupplewhite Inc."
- Entity e5: type=person
- Entity e6: type=person, name="Harry Himmelfarb"
- Event e7: type=leave-job, person=e1, position=e2
- Event e8: type=succeed, person1=e6, person2=e5
20. Discourse Analysis, Step 1: Coreference Analysis
- Resolving anaphoric references made by pronouns and definite noun phrases.
- Entity e5: type=person (the pronoun "he").
- It is replaced by the most recent previously mentioned entity of type person, which is e1, Sam Schwartz.
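The "most recent entity of the right type" heuristic can be sketched directly. This is a toy resolver over the entity list from the example (the dictionary layout is an assumption for illustration); real coreference also checks gender, number, and syntactic constraints.

```python
def resolve_pronoun(entities, pronoun_index):
    # Scan backwards from the pronoun for the most recent
    # previously mentioned *named* entity of type person.
    for i in range(pronoun_index - 1, -1, -1):
        e = entities[i]
        if e['type'] == 'person' and 'name' in e:
            return e
    return None

entities = [
    {'id': 'e1', 'type': 'person', 'name': 'Sam Schwartz'},
    {'id': 'e2', 'type': 'position', 'value': 'executive vice president'},
    {'id': 'e4', 'type': 'company', 'name': 'Hupplewhite Inc.'},
    {'id': 'e5', 'type': 'person'},  # the pronoun "he"
]
antecedent = resolve_pronoun(entities, 3)  # resolves e5
```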
21. Discourse Analysis, Step 2: Inferencing and Event Merging
- leave-job(X-person, Y-job) & succeed(Z-person, X-person) => start-job(Z-person, Y-job)
- start-job(X-person, Y-job) & succeed(X-person, Z-person) => leave-job(Z-person, Y-job)
22. Inferencing and Event Merging (cont.)
- Entity e1: type=person, name="Sam Schwartz"
- Entity e2: type=position, value="executive vice president", company=e3
- Entity e3: type=manufacturer, name="Hupplewhite Inc."
- Entity e6: type=person, name="Harry Himmelfarb"
- Event e7: type=leave-job, person=e1, position=e2
- Event e8: type=succeed, person1=e6, person2=e1
- Event e9: type=start-job, person=e6, position=e2
24. Design Issues
- To parse or not to parse: linguistic complexity is involved.
- Portability: low.
- Performance: not satisfactory.
25. Classifying and Routing Documents
- Process: classify documents, then route them to specific users.
- Classify documents according to a thesaurus, subject hierarchy, taxonomy, or ontology.
26. Answering Questions
- Also called question answering.
- For very specific and straightforward questions, extract the related noun phrase from the text.
- Example: What is the capital of Denmark?
- Solution: find a document containing "capital" and "Denmark" that also has "Copenhagen" near them (note that the "C" is upper case, indicating a proper name).
27. Answering Questions (cont.)
- For complicated questions, a one-word or one-phrase answer is not enough; background information is needed.
- Example: What is document warehousing?
- If no answers are found, suggest alternate questions to users.
28. Searching and Browsing by Topic
- Ad hoc searching with topics
- Search within a category (select a domain first): http://dir.yahoo.com/Business_and_Economy/
- Browsing by topic
- Effectiveness depends on the breadth and depth of the subject hierarchy.
- Browse Yahoo!'s main page and narrow down the topic.
- Commercial databases have incorporated text processing.
29. Searching by Example
- Also called query by example.
- Google's "Similar pages" and "Page-Specific Search" (on the Advanced Search page) are examples of query by example.
- It works well for very narrow and specific topics.
30. Full-Text Searching
- Boolean operators: AND, OR, NOT
- Proximity operators: ("Food and Drug Administration" OR FDA) NEAR "clinical trials"
- Weighting operators: e.g., commodity AND wheat weighted 3
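Boolean retrieval over an indexed document reduces to set membership plus logical connectives. The sketch below evaluates nested AND/OR/NOT queries against a document's term set (proximity and weighting operators are not modeled; the query encoding as tuples is an assumption for illustration):

```python
def boolean_match(doc_terms, query):
    # doc_terms: a set of index terms.
    # query: a bare term, or a nested tuple such as
    #   ('AND', q1, q2, ...), ('OR', q1, q2, ...), ('NOT', q).
    if not isinstance(query, tuple):
        return query in doc_terms
    op, *args = query
    if op == 'AND':
        return all(boolean_match(doc_terms, q) for q in args)
    if op == 'OR':
        return any(boolean_match(doc_terms, q) for q in args)
    if op == 'NOT':
        return not boolean_match(doc_terms, args[0])
    raise ValueError('unknown operator: ' + op)

doc = {'fda', 'approved', 'clinical', 'trials'}
query = ('AND', ('OR', 'food', 'fda'), 'clinical', ('NOT', 'wheat'))
```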
31. Clustering: Definitions
- "Discovering group structure amongst the cases of an n by p matrix." (Venables, W. N., and Ripley, B. D. (1997). Modern Applied Statistics with S-Plus (2nd ed.). Statistics and Computing Series. New York: Springer.)
- Clustered groups
- Within a group, each object has a majority of the attributes, and each attribute is owned by a majority of the objects.
- The resultant groups are supposed to be as distant from each other as possible.
- Inside a group, members are supposed to be as close to each other as possible.
32. Document Clustering
- Unlike classification schemes, it does not use a pre-defined set of terms to group documents.
- Theoretically, documents are grouped together because their contents are similar.
- Closely associated documents tend to be relevant to the same query, so they are likely to be wanted together.
- Documents in the same cluster are treated the same until examined further individually.
33. Document Clustering (cont.)
- Steps
- Find the attributes, i.e., a set of keywords (the columns in the next slide), from the documents (the rows in the next slide).
- Build a vector representation, e.g., the vector for object 2 is (1, 1, 0, 1, 0, 0, 0, 0).
- Calculate the distances between document pairs.
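The steps above can be sketched directly: turn each document's keyword set into a binary vector and compare vectors pairwise. The keyword set and documents here are made up for illustration.

```python
keywords = ['bomb', 'tower', 'cluster', 'query']  # the attribute columns
docs = [
    {'bomb', 'tower'},           # doc 1
    {'cluster', 'query'},        # doc 2
    {'bomb', 'tower', 'query'},  # doc 3
]
# Binary vector: component k is 1 iff keyword k occurs in the document.
vectors = [[1 if k in d else 0 for k in keywords] for d in docs]

def euclidean(u, v):
    # Straight-line distance between two document vectors.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5
```

Doc 1 and doc 3 share two keywords and so end up close in the document space; doc 2 shares none with doc 1 and ends up far away.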
35. Document Space and Clustering
- [Figure: Doc1-Doc4 plotted as points in the document space.]
36. The Use of Clustering in IR
- Choosing a clustering method
- The method should produce stable results as the document collection grows.
- Small errors in the descriptions should lead to small changes in the clustering.
- The method should be independent of the initial ordering of the objects.
37. The Use of Clustering in IR (cont.)
- Can be used for filtering and routing.
- Can be used for creating categories for retrieval.
38. Clustering Routines (optional; won't be in exams)
- K-means
- PAM
- CLARA
- Hierarchical clustering: AGNES, DIANA, and MONA
- FANNY
- Model-based clustering: mclust
- See Kaufman and Rousseeuw (1990) for details.
39. Dissimilarity Metrics
- DAISY: a routine for calculating dissimilarity using either Euclidean or Manhattan distance.
- The following clustering routines are all based on distance measures.
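The two metrics DAISY offers differ only in how coordinate differences are combined; a minimal sketch (these helpers are illustrations, not DAISY itself, which is an S-Plus/R routine):

```python
def euclidean(u, v):
    # Square the per-coordinate differences, sum, take the root.
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def manhattan(u, v):
    # Sum the absolute per-coordinate differences ("city block" distance).
    return sum(abs(a - b) for a, b in zip(u, v))

u, v = (1, 0, 1, 1), (0, 0, 1, 0)
d_euc = euclidean(u, v)   # sqrt(2)
d_man = manhattan(u, v)   # 2
```

On binary document vectors, Manhattan distance simply counts the attributes on which two documents disagree.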
40. K-means
- The number of clusters needs to be pre-specified.
- An initial clustering is created.
- Iterative relocation: move objects from one group to another if doing so reduces the sum of squares.
41. PAM (Partitioning Around Medoids)
- The number of clusters needs to be pre-specified.
- The algorithm computes k representative objects, called medoids, which together determine a clustering.
- Each object is assigned to the nearest medoid according to the dissimilarity value.
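Only the assignment half of PAM is sketched here; the search for the best medoids (the swap phase) is omitted. Unlike a k-means center, a medoid is always an actual object from the data.

```python
def assign_to_medoids(objects, medoids, dissim):
    # Each object joins the cluster of its nearest medoid.
    clusters = {m: [] for m in medoids}
    for o in objects:
        nearest = min(medoids, key=lambda m: dissim(o, m))
        clusters[nearest].append(o)
    return clusters

objects = [0, 1, 2, 10, 11]
clusters = assign_to_medoids(objects, medoids=[1, 10],
                             dissim=lambda a, b: abs(a - b))
```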
42. CLARA (Clustering Large Applications)
- It deals with large data sets by considering data subsets of fixed size.
- Each sub-dataset is partitioned into k clusters using the same algorithm as in the PAM function. The remaining objects in the original dataset are then assigned to the nearest medoid.
- The procedure is repeated several times, and the best result is kept.
43. FANNY (Fuzzy Analysis)
- PAM and CLARA are crisp clustering methods; that is, each object belongs to exactly one cluster.
- FANNY spreads objects over groups.
- A membership value is used to determine how strongly an object belongs to each group.
44. AGNES (Agglomerative Nesting)
- At first, each object is its own cluster. Then repeat the following two steps:
- Merge the two clusters that have the smallest between-cluster dissimilarity.
- Compute the dissimilarity between the new cluster and all remaining clusters.
- AC: agglomerative coefficient.
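The merge loop above can be sketched as follows. Single linkage (minimum pairwise dissimilarity) is used as the between-cluster measure here; that is one common choice, not the only one AGNES supports, and the agglomerative coefficient is not computed.

```python
def agnes(objects, dissim, stop_at=1):
    # Start with singleton clusters.
    clusters = [[o] for o in objects]
    while len(clusters) > stop_at:
        # Find the pair of clusters with the smallest single-linkage
        # dissimilarity (minimum over all cross-cluster object pairs).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dissim(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge the two closest clusters.
        clusters[i] += clusters.pop(j)
    return clusters

clusters = agnes([0, 1, 10, 11], dissim=lambda a, b: abs(a - b), stop_at=2)
```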
45. DIANA (Divisive Analysis)
- It starts with one large cluster containing ALL objects.
- Clusters are repeatedly split in two according to the distance measure, until all clusters contain only one object.
- DC: divisive coefficient.
47. MONA (Monothetic Analysis)
- It is a divisive hierarchical method that operates on matrices with binary variables.
- For each split, MONA uses one variable at a time.
- Repeat the following steps:
- Select the variable that has the largest total association with the other variables.
- Divide the cluster into two groups: one with all objects having value 1 for that variable, and one with the objects having value 0 for it.
48. mclust (Model-Based Clustering)
- Assumption: there is an underlying probability distribution in the data; clusters have different orientations, shapes, and sizes.
- The mclust function can suggest an optimum number of clusters.