Title: CLiMB: Computational Linguistics for Metadata Building
1CLiMB: Computational Linguistics for Metadata
Building
- Center for Research on Information Access
- Columbia University Libraries
3Overall Goals
- Research: Development of richer retrieval through
increased numbers of descriptors
- Research and Practice: Creation of enabling
technologies for new large digitization projects
- Research and Practice: Expand capability for
cross-collection searching
- Practice: Development of suite of CLiMB tools
- Resources: Vocabulary list which can be used by
other visual resource professionals
- The essence of CLiMB
- Use scholars themselves as catalogers by
utilizing scholarly publications
- Enhance existing descriptive metadata
4Computational Linguistic Techniques
- What techniques have we tried?
- How well have they worked?
- What else do we want to try?
5Computational Linguistic Techniques
- What techniques have we tried?
- Goal: Identify high quality metadata terms
- Goal: Use metadata for finding images
- How well have they worked?
- What else do we want to try?
6Text about Images
- The Blacker House is known for its porte
cochère and adjacent terraces. Samuel Parker
Williams, an occasional Greene collaborator,
worked on the site, particularly on the sandstone
boulder foundation for the sleeping porch.
- -- Based on Bosley
7Techniques We Have Tried
- Supervised (using existing resources)
- Matching algorithms: proper name variants
- Back of book index analysis
- Composite list of terms from authoritative lists
- Unsupervised
- Part of speech tagging
- Noun phrase identification
- Proper noun identification
8What about LSI?
- Latent Semantic Indexing
- Builds a representation of a document
- Effective in information retrieval
- Why not for CLiMB?
- LSI is useful for text query and document
retrieval
- LSI, a statistical technique, removes phrasal
info
- CLiMB needs high quality phrases
- May be useful in later stages
9Indexing for What Purpose
- Index: find important terms and phrases
- Index: characterize a document with a set of
terms that occur in the doc
10Indexing for What Purpose
- Index: find important terms and phrases
- sleeping porch
- occasional collaborator
- sandstone boulder foundation
- Index: characterize a document with a set of
terms that occur in the doc
- sleep, porch, occas, collaborat, foundat
- enables location of docs with similar profile
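The second sense of indexing, characterizing a document by its terms so that documents with a similar profile can be located, can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses plain lowercase words rather than stems, and the function names `profile` and `cosine` are made up for this example, not part of any CLiMB tool.

```python
import math
import re
from collections import Counter

def profile(text):
    """Characterize a text as a bag of lowercase word counts (no stemming)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(p, q):
    """Cosine similarity between two word-count profiles."""
    dot = sum(count * q[term] for term, count in p.items())
    norm_p = math.sqrt(sum(c * c for c in p.values()))
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Sample texts taken from elsewhere in these slides; the query is invented.
query = profile("sandstone foundation for a sleeping porch")
doc_a = profile("Samuel Parker Williams, an occasional Greene collaborator, "
                "worked on the site, particularly on the sandstone boulder "
                "foundation for the sleeping porch.")
doc_b = profile("Latent Semantic Indexing builds a representation of a document")

print(cosine(query, doc_a) > cosine(query, doc_b))
```

Note that a profile of this kind throws away word order, so a phrase such as "sleeping porch" survives only as two unrelated words.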
11Finding Similar Documents
- Linear Algebra Techniques
- Latent Semantic Indexing
- Singular Value Decomposition (SVD)
- Semidiscrete Decomposition
- Vector Space Models
- Term by Document matrices
- Term Weighting
- Polysemy and Synonymy
- Clustering Techniques
- K-means
- EM Clustering
- Wavelet
12Computational Linguistic Techniques
- What techniques have we tried?
- Goal: Identify high quality metadata terms
- Goal: Load metadata into image search database
- Goal: Use enriched metadata for finding images
- How well have they worked?
- What else do we want to try?
13Art Object Identification (AO-ID)
- Need Unique Identifiers
- Key of database records
- Varies from collection to collection
- Greene & Greene: Project Names
- Chinese Paper Gods: God Names
- South Asian Temples: Temple Names
15Compile list of subject vocabulary
Find meaningful terms in texts
Segment relevant texts
Collect terms from all sources. Identify and
link AO-ID described in text.
Determine term relationships
Extract metadata
Insert into existing metadata records. Mount in
image search platform.
Process queries and evaluate
16Create Composite List of Subject Terms
- Philosophy: Use whatever resources exist
- Catalog records
- Robert R. Blacker house (Pasadena, Calif.)
- Greene, Charles Sumner
- Blacker, Robert R.
- Art and Architecture Thesaurus
- porte cochère
- Back of the book index
- Blacker house
17Progress: Composite List
- Greene & Greene
- Extracted back of the book indexes
- Direct matching of index terms to the text
- Terms found: highlighted in yellow
- David Gamble
- Pasadena
- Westmoreland Place
- furniture
20Three Term Types and Approaches
- 1) Art Object ID names and other proper nouns
important to the domain (Charles Pratt)
- Named Entity noun phrase finders, POS taggers
- 2) Common noun terms, semantically significant to
the domain (V-shaped plan)
- List of domain terms from authority sources
- 3) Common noun phrases in a generic domain
vocabulary (chimney)
- Statistical methods for identifying relevant terms
21Part of Speech (POS) taggers
- Why use a part of speech tagger?
- To identify nouns, verbs and proper nouns
- The Blacker House is known for its porte cochère
- <Determiner>The
- <Proper_Noun>
- <Singular_Proper_Noun>Blacker
- <Singular_Proper_Noun>House
- <Verb_Present>is
- <Verb_Past_Participle>known
- <Preposition>for
- <Possessive_Pronoun>its
- <Adjective>adjacent
- <Noun_Plural>terraces
22Part of Speech (POS) taggers
- Strength: An essential step that allows the rest of
the system to work
- Weakness: The best POS taggers have 95% accuracy
- A typical 20-word sentence is likely to have a
mistake!
- But some errors do not matter much
- E.g. sleeping porch
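The arithmetic behind the 20-word claim can be checked with a short calculation. It reads the 95 figure as 95% per-word accuracy and assumes, as a simplification, that tagging errors on different words are independent; the function name is made up for this example.

```python
def prob_at_least_one_error(accuracy, n_words):
    """Chance that a sentence of n_words contains at least one tagging error,
    assuming each word is tagged correctly with the same independent probability."""
    return 1.0 - accuracy ** n_words

p = prob_at_least_one_error(0.95, 20)
print(f"{p:.2f}")  # prints 0.64
```

Under these assumptions roughly two out of three 20-word sentences contain at least one mis-tagged word.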
23What We Tried: POS Taggers
- Mitre Alembic WorkBench
- Freeware from Mitre corporation
- Strong for proper nouns
- Average for common nouns
- IBM's Nominator
- Accurate for both
- Restrictive licensing
24Proper Nouns
- Alembic WorkBench Results
- 91.2% recall
- Misses: The senior Pratt, Hall brothers
- 97.5% precision using Alembic
- Successfully finds William Issac Ott, University
of California
- This is very good!
- Highlighted in light green
- Mary
- Greene
- Persian
- Etc.
26Noun Phrase Chunking
- The Blacker House is known for
- its Porte Cochère and adjacent terraces .
- Samuel Parker Williams,
- an occasional Greene collaborator,
- worked on the site, particularly on
- the sandstone boulder foundation
- for the sleeping porch .
- -- Based on Bosley
27NP Chunkers
- Columbia's LinkIT
- Regular expression grammar over POS tags
- Improves WorkBench results through finding
simplex NPs
- LTChunk
- By LTG Group, University of Edinburgh
- Not as many NPs
- Arizona - commercialized
- IBM also commercial
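To illustrate what a regular expression grammar over POS tags can look like, here is a minimal sketch. It is not LinkIT's actual grammar: the Penn Treebank-style tags are hand-assigned for the example, and the pattern and function names are assumptions made for illustration.

```python
import re

# Hand-assigned Penn Treebank-style tags (an assumption; a real tagger would supply these).
tagged = [("the", "DT"), ("sandstone", "NN"), ("boulder", "NN"),
          ("foundation", "NN"), ("for", "IN"), ("the", "DT"),
          ("sleeping", "VBG"), ("porch", "NN")]

# Simplex NP: optional determiner, any adjectives or gerunds, then one or more nouns.
NP_PATTERN = re.compile(r"(?:<DT>)?(?:<JJ>|<VBG>)*(?:<NN[SP]*>)+")

def chunk_noun_phrases(tagged_tokens):
    """Return the word sequences whose tag sequence matches NP_PATTERN."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    phrases = []
    for match in NP_PATTERN.finditer(tag_string):
        start = tag_string[:match.start()].count("<")  # index of first token
        end = start + match.group().count("<")         # one past the last token
        phrases.append(" ".join(word for word, _ in tagged_tokens[start:end]))
    return phrases

print(chunk_noun_phrases(tagged))
# ['the sandstone boulder foundation', 'the sleeping porch']
```

Allowing a gerund tag inside the pattern is one way a phrase like "sleeping porch" can still be recovered when the tagger labels "sleeping" as a verb form.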
28Results: Proper Nouns
29Results: Proper Nouns
30Results: NP Chunking
- Highlighted in purple
- The design process
- The southwest adobe-stucco
- July 1907
32Experiments with Algorithms
- TF/IDF and term frequency ratios
- Filter technical terms from frequent common nouns
- Term frequency ratio algorithm to improve
accuracy
- Co-occurrence
- Useful terms may appear near other good ones
- Machine learning
- Use learning algorithms to discover complex
associational context
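The two frequency-based ideas above can be sketched as follows. The slides do not define the term frequency ratio, so this sketch assumes one common reading: a term's relative frequency in the domain text divided by its relative frequency in a general reference text. The function names, the smoothing constant, and the toy token lists are all assumptions for illustration.

```python
import math
from collections import Counter

def relative_freq(tokens):
    """Map each token to its share of the token list."""
    counts = Counter(tokens)
    return {term: count / len(tokens) for term, count in counts.items()}

def frequency_ratio(domain_tokens, general_tokens, smoothing=1e-6):
    """Domain relative frequency divided by (smoothed) general relative frequency."""
    general = relative_freq(general_tokens)
    return {term: freq / (general.get(term, 0.0) + smoothing)
            for term, freq in relative_freq(domain_tokens).items()}

def tf_idf(docs):
    """docs is a list of token lists; returns one {term: score} dict per document."""
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return [{term: (count / len(doc)) * math.log(len(docs) / doc_freq[term])
             for term, count in Counter(doc).items()}
            for doc in docs]

# Toy token lists invented for the example.
domain = "the porch and the terrace of the house".split()
general = "the cat and the dog saw the house".split()

ratios = frequency_ratio(domain, general)
scores = tf_idf([domain, general])
print(ratios["porch"] > ratios["the"])   # domain-specific term ranks higher
print(scores[0]["the"])                  # 0.0: a word in every document carries no weight
```

Both measures push frequent common words down and terms concentrated in the domain text up, which is the filtering effect the bullets describe.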
34What is Segmentation?
- Divide texts into cohesive chunks
- Needed for determining associational context
- Needed to determine what terms are related to an
art object
35Results: Segmentation
- Use the frequency with which our terms appear within a
document to estimate where the document is about
that term
- This graph shows where different names are
mentioned in Bosley on Greene & Greene, Ch. 5
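A minimal sketch of this frequency-based idea: count how often each name is mentioned in each paragraph and treat the most-mentioned name as that paragraph's likely subject. The paragraphs below are made-up snippets, not quotations from Bosley, and the function names are assumptions for illustration.

```python
def mentions_by_paragraph(paragraphs, names):
    """For each paragraph, count case-insensitive mentions of each name."""
    rows = []
    for paragraph in paragraphs:
        lowered = paragraph.lower()
        rows.append({name: lowered.count(name.lower()) for name in names})
    return rows

def dominant_name(row):
    """Name mentioned most often in a paragraph, or None if no name occurs."""
    if not row:
        return None
    name, count = max(row.items(), key=lambda item: item[1])
    return name if count > 0 else None

# Invented sample paragraphs (hypothetical text for illustration only).
paragraphs = [
    "The Blacker house sits on a large lot. Blacker asked for wide terraces.",
    "Gamble chose a site on Westmoreland Place. The Gamble house followed.",
    "Furniture for both houses was designed later.",
]
rows = mentions_by_paragraph(paragraphs, ["Blacker", "Gamble"])
print([dominant_name(row) for row in rows])
# ['Blacker', 'Gamble', None]
```

Plotting the per-paragraph counts for each name gives a picture of where in a chapter each name is discussed.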
36What We've Tried: Segmenters
- Marti Hearst's TextTiling
- Performs well for a general algorithm, but not
sufficient for this specialized task
- M. Hearst, ACL, 1993
- F. Choi's C99 segmenter
- Performance comparable to TextTiling
- F. Y. Y. Choi, NAACL, 2000
- Frequency ratio approach outperformed TextTiling
- In-house tool to be tested
- Kan & Klavans, WVLC-6, 1998, Segmenter
37Meronymy as Part-Of
- Why is this potentially useful?
- A method for identifying hot paragraphs
- Descriptive text contains part-of relations
- Details that correlate to the whole
- Porch is a part of house
- An early hypothesis in testing stages
38Meronymy for Cohesion
The Spinks house design is an elaboration of the
rectangular, large-gabled form of the California
House .has porches and terraces. In front, an
expanse of lawn rises nearly to the level of the
entry terrace. The front door is approached
obliquely in the shaded recess of the terrace.
39Meronymy and Other Relations
The California House
Other Houses
Spinks House
entry terrace
front entry
terrace
porch
front door
41Progress: Project Name Matching
- Finding project names in Greene & Greene
- Challenge: finding variations
- AO-ID: Robert Roe Blacker House
- RRB House
- The house
- 1214 Fairlawn Terrace.
- Possible techniques to improve matching
- Developing a semi-automatic technique
- Use existing information to label text
- An iterative platform for manual intervention
42Variants of The Culbertson House
- Cordelia A. Culbertson house (Pasadena, Calif.)
- Francis F. Prentiss house (Pasadena, Calif.)
- Culbertson sisters house (Pasadena, Calif.)
- Prentiss, Francis F.
- Culbertson, Cordelia A.
- Allen, Elizabeth S.
- Allen, Mrs. Dudley P.
- House was purchased by Allens, who remarried and
became Prentiss!
43Zaoshen (Chinese deity)
- USE FOR Dingfuzhenjun (Chinese deity)
- USE FOR Kitchen God (Chinese deity)
- USE FOR Simingzaojun (Chinese deity)
- USE FOR Simingzaoshen (Chinese deity)
- USE FOR Ssu-ming-tsao-chün (Chinese deity)
- USE FOR Ssu-ming-tsao-shen (Chinese deity)
- USE FOR Ting-fu-chen-chün (Chinese deity)
- USE FOR Tsao-chün (Chinese deity)
- USE FOR Tsao-shen (Chinese deity)
- USE FOR Tsao-wang (Chinese deity)
- USE FOR Tsao-wang-yeh (Chinese deity)
- USE FOR Zaojun (Chinese deity)
- USE FOR Zaowang (Chinese deity)
- REFERENCE Encyc. Britannica (Tsao Shen, pinyin
Zao Shen, in Chinese mythology, the god of the
kitchen (god of the hearth), who is believed to
report to the celestial gods on family conduct
and have it within his power to bestow poverty or
riches on individual families; has also been
confused with Ho Shen (god of fire) and Tsao
Chün (Furnace Prince))
44Some Data to Illustrate
- Unaltered Project Names
- 0 matches (both case sensitive and insensitive)
- Case Insensitive Project Name matching
- 4 matches
- Theodore Irwin house occurs 1 time
- California Institute of Technology occurs 1
time
- William R. Thorsen house occurs 1 time
- William T. Bolton house occurs 1 time
- At least double in the chapter
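One way to see why unaltered names can fail to match is sketched below. It assumes, and this is only an assumption, that the unaltered project names are catalog-style headings carrying a parenthetical qualifier such as "(Pasadena, Calif.)", which running text rarely repeats. The helper names and the sample sentence (adapted from the Blacker House passage in these slides) are illustrative, not the matching code used in CLiMB.

```python
import re

# Trailing parenthetical qualifier, e.g. " (Pasadena, Calif.)".
QUALIFIER = re.compile(r"\s*\([^)]*\)\s*$")

def strip_qualifier(heading):
    """Remove a trailing parenthetical qualifier from a catalog heading."""
    return QUALIFIER.sub("", heading)

def count_occurrences(name, text, ignore_case=False):
    """Count literal occurrences of name in text, optionally ignoring case."""
    if ignore_case:
        name, text = name.casefold(), text.casefold()
    return text.count(name)

heading = "Robert R. Blacker house (Pasadena, Calif.)"
text = "The Robert R. Blacker House is known for its porte cochère."  # adapted sample

print(count_occurrences(heading, text))                                      # 0
print(count_occurrences(heading, text, ignore_case=True))                    # 0
print(count_occurrences(strip_qualifier(heading), text, ignore_case=True))   # 1
```

Even after this kind of normalization, variants such as "RRB House" or "the house" remain unmatched, which is why variant finding is listed as the challenge.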
45A Future Solution
- Bootstrapping algorithm
- Seed terms hand labelled
- Terms mapped into multi-dimensional feature space
- Other terms that are close to the seed terms are
added to the set
- Features
- Window size
- Headedness
- Modifier similar to that of a seed term
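The bootstrapping loop can be sketched in a few lines. This is a simplified illustration rather than the planned algorithm: it uses only head-word and modifier features (context-window features are left out), measures closeness with Jaccard overlap, and the threshold, function names, and candidate list are assumptions chosen for the example.

```python
def features(phrase):
    """Feature set for a non-empty phrase: its head (last word) and its modifiers."""
    words = phrase.lower().split()
    return {"head=" + words[-1]} | {"mod=" + word for word in words[:-1]}

def jaccard(a, b):
    """Overlap of two feature sets, from 0.0 (disjoint) to 1.0 (identical)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def bootstrap(seeds, candidates, threshold=0.3, max_rounds=5):
    """Repeatedly add candidates that are close to any already accepted term."""
    accepted = set(seeds)
    remaining = set(candidates) - accepted
    for _ in range(max_rounds):
        new = {c for c in remaining
               if any(jaccard(features(c), features(s)) >= threshold
                      for s in accepted)}
        if not new:
            break
        accepted |= new
        remaining -= new
    return accepted

# Hand-labelled seed and an illustrative candidate list (hypothetical).
seeds = {"sleeping porch"}
candidates = {"entry porch", "entry terrace", "front door", "occasional collaborator"}
print(sorted(bootstrap(seeds, candidates)))
# ['entry porch', 'entry terrace', 'sleeping porch']
```

In this run "entry porch" joins in the first round because it shares a head with the seed, and "entry terrace" joins in the second because it shares a modifier with "entry porch". That chaining shows both the reach of bootstrapping and the risk of drifting away from the seeds, which is where manual review can help.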
46Summary: Research Tools Tested
- Part of Speech Taggers
- Noun Phrase Chunkers
- Merging techniques
- Proper Noun Finders
- Proper Name Variant Finder
- Segmenters
48Future: Determine relationships
- The Blacker House related to Greene
- The Greenes built the house.
- Porte Cochère is related to Blacker House
- because it is directly a part of the house.
- William Issac Ott is related to
- Blacker House (on which he worked)
- Greene (with whom he worked).
- Detecting these semantic relationships
statistically is a challenge for our next steps
- Co-occurrence
- Use of subject headings
- Meronymy and other relations (WordNet)
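As a rough illustration of the co-occurrence idea, the sketch below counts how often two known terms appear in the same sentence. It uses the sample passage from these slides; sentence-level co-occurrence and the function name are assumptions, and a count alone says only that two terms are related, not how.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(sentences, terms):
    """Count, for each pair of terms, the sentences in which both occur."""
    pairs = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        present = sorted(t for t in terms if t.lower() in lowered)
        pairs.update(combinations(present, 2))
    return pairs

sentences = [
    "The Blacker House is known for its porte cochère and adjacent terraces.",
    "Samuel Parker Williams, an occasional Greene collaborator, worked on the "
    "site, particularly on the sandstone boulder foundation for the sleeping porch.",
]
terms = ["Blacker House", "porte cochère", "Greene", "sleeping porch"]

pairs = cooccurrence(sentences, terms)
print(pairs[("Blacker House", "porte cochère")])  # 1
print(pairs[("Greene", "sleeping porch")])        # 1
```

Telling a part-of relation (porte cochère and house) from a worked-on relation (a person and a house) would need additional evidence, such as subject headings or a lexical resource.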
50Thank you!
- Any questions?
- www.columbia.edu/cu/cria/climb