Title: Research update Language modeling, ngrams and search engines
1Research update: Language modeling, n-grams and search engines
- David Johnson
- UTas September 2005
Supervisors: Dr Vishv Malhotra and Dr Peter Vamplew
2Overview
- AISAT 2004
- SWIRL 2004
- Language modeling
- A New Information Retrieval Tool
- Questions
3AISAT 2004
- AISAT 2004 - The 2nd International Conference on
Artificial Intelligence in Science and
Technology
- November 2004, UTAS, Hobart
- Hosted by The School of Engineering
4AISAT 2004
- Johnson DG, Malhotra VM, Vamplew PW, Patro S, "Refining Search Queries from Examples Using Boolean Expressions and Latent Semantic Analysis"
- Extension of prior research (Vishv M. Malhotra, Sunanda Patro, David Johnson: Synthesize Web Queries: Search the Web by Examples. ICEIS 2005.); the contribution of this paper was investigating the application of LSA to improve the refined queries
- This is only a brief overview; the full paper is available from the UTAS Eprints Repository
5AISAT 2004
- Consider the problems facing a web searcher
looking for information in an unfamiliar domain
- Lack of knowledge of domain specific terms,
jargon and concepts
- Difficulty rating the importance of newly
discovered terms in targeting relevant
information
- Lack of understanding about how extra terms and
changing the structure of the query will affect
the search results
6AISAT 2004
- This often leads to problems with the resulting
query
- Poor recall: most of the relevant documents are not located
- Poor precision: many of the retrieved documents are not relevant
- Frustration often results
- MSN-Harris Survey (August 2004) reports there is a significant minority -- 29 percent of search engine users -- who only sometimes or rarely find what they want (http://www.microsoft.com/presspass/press/2004/aug04/08-02searchpollpr.mspx)
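As a rough illustration of the two measures named above, the sketch below computes precision and recall for one query. The document ids and the function name `precision_recall` are made up for the example and are not taken from the paper.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = fraction of retrieved documents that are relevant
    recall    = fraction of relevant documents that were retrieved
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ids: 4 documents retrieved, 5 documents actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d4", "d7", "d8", "d9"]
precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)
```

Here two of the four retrieved documents are relevant, and two of the five relevant documents were found.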
7AISAT 2004
- However, it is usually relatively easy for a searcher to classify some example documents (from an initial search) as relevant or not
- The text from these documents can then be analyzed to build a better query: one that will select more relevant and fewer irrelevant documents
8AISAT 2004
Resubmit to Web Search Engine
9AISAT 2004
- The new query is initially built in conjunctive
normal form (CNF)
- Original query (a OR b OR ...) AND (a OR c OR ...) AND ...
- Each maxterm is chosen to select all documents in the set Relevant and reject as many of the documents from the set Irrelevant as possible
- The CNF expression (i.e. the conjunction of all maxterms) is chosen to reject all documents from the set Irrelevant
- In order to minimise the size of the CNF
expression, terms with high selective potential
must be used
- In some cases further optimization was required; for instance, (at the time) Google would only accept queries of up to 10 keywords in length
10AISAT 2004
The potential of a candidate term, t, in a partially constructed maxterm is calculated from the following quantities:
- TR: relevant documents not yet selected
- TIR: irrelevant docs still selected by conjunction of prior maxterms
- TRt: new relevant documents selected by term t
- TIRt: irrelevant documents from TIR selected by term t
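The greedy construction described on the last two slides could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the scoring function `stand_in_potential` is a placeholder invented for the example (it is not the Potential(t) formula from the paper), each document is reduced to a set of terms, and the document ids and terms are made up.

```python
def docs_containing(term, docs, ids):
    """Ids (taken from `ids`) of the documents whose term set contains `term`."""
    return {d for d in ids if term in docs[d]}


def stand_in_potential(tr_t, tir_t):
    # ASSUMPTION: placeholder score, NOT the Potential(t) formula of the paper.
    # It rewards newly selected relevant docs and penalises irrelevant ones.
    return len(tr_t) / (1 + len(tir_t))


def build_cnf(relevant, irrelevant):
    """Greedily build a list of maxterms (each maxterm is a list of OR-ed terms)."""
    maxterms = []
    tir = set(irrelevant)            # irrelevant docs still selected by prior maxterms
    while tir:
        tr = set(relevant)           # relevant docs not yet selected by this maxterm
        maxterm, still_selected = [], set()
        while tr:
            vocabulary = sorted(set().union(*(relevant[d] for d in tr)))
            best = max(vocabulary, key=lambda t: stand_in_potential(
                docs_containing(t, relevant, tr),
                docs_containing(t, irrelevant, tir)))
            maxterm.append(best)
            tr -= docs_containing(best, relevant, tr)
            still_selected |= docs_containing(best, irrelevant, tir)
        if still_selected == tir:
            break                    # this maxterm rejects nothing further: stop
        maxterms.append(maxterm)
        tir = still_selected
    return maxterms


def to_query(maxterms):
    """Render the maxterms as a Boolean query string."""
    return " AND ".join("(" + " OR ".join(m) + ")" for m in maxterms)


# Hypothetical example documents, reduced to sets of terms.
relevant = {"r1": {"sun", "solar", "flare"}, "r2": {"sun", "core", "fusion"}}
irrelevant = {"i1": {"sun", "newspaper"}, "i2": {"sun", "java", "core"}}
maxterms = build_cnf(relevant, irrelevant)
query = to_query(maxterms)
print(query)
```

Each maxterm covers every relevant document before it is accepted, and a new maxterm is only started while some irrelevant document is still selected. A real implementation would also need to respect the search engine's keyword limit.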
11AISAT 2004
- The Potential(t) function enabled the formulation of effective queries, but they were often counterintuitive, e.g. searching for information related to the Sun (as a star)
- The synthesised query was:
- sun AND (solar AND (recent OR (field AND tour)) OR (lower AND home) OR (million AND core))
- This was felt to be at least partially due to overtraining on the relatively small sets of example documents, but could we improve the generated queries?
12AISAT 2004
- Latent Semantic Analysis (LSA) was investigated as a technique with the potential to help in selecting more meaningful terms
- It is "a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text" (Landauer and Dumais, 1997)
- It uses no humanly constructed dictionaries or knowledge bases, yet has been shown to overlap with human scores on some language-based judgment tasks
13AISAT 2004
- LSA can help overcome the problems of polysemy (same word, different meaning) and synonymy (different word, same meaning) by associating words with abstract concepts via its analysis of word usage patterns
- It can be very computationally intensive (requiring the singular value decomposition (SVD) of a large matrix), but for the small collections of relevant / irrelevant documents we are analysing it can be performed quickly (CPU time)
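For reference, the SVD step mentioned above is usually written as below. The symbols are assumptions introduced here, not notation from the paper: X is the term-by-document matrix built from the example documents, and k is the number of latent dimensions kept.

```latex
X = U \Sigma V^{\mathsf{T}}, \qquad
X \approx X_k = U_k \Sigma_k V_k^{\mathsf{T}}
```

Here U_k and V_k hold the first k left and right singular vectors and Σ_k the k largest singular values, so terms and documents are compared in the reduced k-dimensional space rather than by exact word match.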
14AISAT 2004
- The term weighting was adjusted to account for
the LSA weighting as follows
- Weight(t) Potential(t) ((Normalised LSA
weight in set relevant) (Normalised LSA
weight in set irrelevant))
- During the experimentation that led to the development of this weighting scheme, we noted that not allowing the Potential(t) values sufficient weight resulted in very long queries
- Note: the Potential(t) value is recalculated after each term is selected, but the LSA values are based on the entire document sets
15AISAT 2004
- Results - Example 1
- Naïve query "elephants": searching for information suitable for a high school project
- 7 out of first 20 documents, and around 30 of
all initial documents from the naïve query were
judged relevant
- The initial number of relevant documents was
quite high, reflecting the reasonably broad
nature of the information need
16AISAT 2004
- Results - Example 1
- Refined query - no LSA
- elephant AND (forests OR lived OR calves OR poachers OR maximus OR threaten OR electric OR tel OR kudu)
- Refined query - with LSA
- elephant AND (climate OR habitats OR grasslands OR quantities OR africana OR threaten OR insects OR electric OR kudu)
- LSA did help in the selection of intuitive words, although not as much as hoped: it avoided "lived" and "tel" (abbreviation for telephone no.), although it still selected "quantities"
17AISAT 2004
- Results - Example 2
- Naïve query "mushrooms": searching for information on growth, lifecycle and structure of mushrooms
- NOT interested in recipes, magic mushrooms, or
identifying wild edible mushrooms
- 3 of first 20 documents and 13 of initial
documents relevant
18AISAT 2004
- Results - Example 2
- Refined query - no LSA
- mushrooms AND (cylindrical OR ancestors OR hyphae OR cellulose OR issued OR hydrogen OR developing OR putting)
- Refined query - with LSA
- mushrooms AND (ascospores OR hyphae OR itis OR peroxide OR discharge OR developing OR pulled OR jean)
- In this example the LSA query has identified some additional technical terms, although still including the quite general term "pulled". Note that the selection of the term "jean" was due to Jean being the first name of several authors mentioned in relevant documents
- ITIS: Integrated Taxonomic Information System
21AISAT 2004
- Conclusions
- The refined search queries show significant
improvement over original naïve queries, without
the need for detailed domain knowledge
- LSA did assist in the formulation of more
meaningful Boolean web queries
- LSA did not significantly improve retrieval
performance compared to the original query
enhancement algorithm
22AISAT 2004
- Further Comments (Not in original
presentation/paper)
- A criticism of this approach is that the requirement for the user to review and classify several documents requires quite a bit of effort: were we really helping the user or not?
- While it is our assertion that the review / classification process required is not very time consuming, we have tried to address this criticism by taking a different approach in current work
23SWIRL 2004
- Strategic Workshop on Information Retrieval in
Lorne, December 2004
- Organized by Justin Zobel from RMIT and Alistair
Moffat from the University of Melbourne
- Funded by the Australian Academy of Technological Sciences and Engineering under their Frontiers of Science and Technology Missions and Workshops program
- 17 international and 13 Australian researchers plus 8 research students
- Included researchers from industry and government as well as academia (Microsoft, NIST, CSIRO)
- (NIST: National Institute of Standards and Technology, a US Government research organization)
24SWIRL 2004
- It was a discussion-based residential workshop
- The aim for the workshop was to try and define what we know (and also don't know) about Information Retrieval, examining past work to identify fundamental contributions, challenges, and turning points, and then to examine possible future research directions, including possible joint projects. That is, our goal was to pause for a few minutes, reflect on the lessons of past research, and reconsider what questions are important and which research might lead to genuine advances.
- From the SWIRL 2004 web site: http://www.cs.mu.oz.au/alistair/swirl2004/
25SWIRL 2004
- Attendees included researchers responsible for
many of the key innovations in web searching and
the implementation of effective and efficient
information retrieval systems.
John Tait (University of Sunderland)
Dave Harper (The Robert Gordon University,
Scotland)
Alistair Moffat (University of Melbourne)
Justin Zobel (RMIT)
Andrew Turpin (University of Melbourne)
Bruce Croft (University of Massachusetts,
Amherst)
Ross Wilkinson (CSIRO)
Bill Hersh, M.D. (Oregon Health & Science University)
Robert Dale (Macquarie University)
Kal Järvelin (University of Tampere, Finland)
Jamie Callan (Carnegie Mellon University)
David Harper (CSIRO)
26SWIRL 2004
- Lorne Beach during the site visit (Winter)
27SWIRL 2004
- Lorne Beach when we arrived (Summer)
28SWIRL 2004
- Program - Day 1
- Travel from Melbourne to Lorne (group bus)
- Keynote Presentation
- The IR Landscape - Bruce Croft
- Presentation
- Adventures in IR evaluation - Ellen Voorhees
- Group Discussion
- IR: where are we, how did we get here, and where might we go? - Mark Sanderson
- Workshop Dinner
29SWIRL 2004
- Program - Day 2
- Group Discussion
- What motivates research - Justin Zobel / Alistair Moffat
- Small Group Discussion
- Challenges in information retrieval and language modeling
- David Harper, Phil Vines and David Johnson: Challenges in Contextual Retrieval, Challenges in Metasearch, Challenges in Cross Language Information Retrieval (CLIR)
- Group Discussion
- Important papers for new research students to be aware of - David Hawking
- Small Group Project
- How to spend 1M per year for five years - Ross Wilkinson
30SWIRL 2004
- Program - Day 3
- Presentations from Group Projects
- Mark Sanderson, Ellen Voorhees, David Johnson, et al.: Experimental Educational Search Engine - Targeting information needs for upper primary to grade 10
- Group Discussion
- Where to now with SIGIR? - Jamie Callan
- Group Discussion
- Writing the ideal SIGIR 2005 paper - Susan Dumais
- Return to Melbourne
- Via Port Campbell National Park and Great Ocean Road
31SWIRL 2004
- Summary
- An excellent opportunity to meet and talk to many
respected IR researchers
- Provided guidance on the current state of the
art and ideas for the future direction of my
research
- Provided many useful contacts
32Language Modeling - Introduction
- The seminal paper for language modeling in modern IR: "A Language Modeling Approach to Information Retrieval", Jay Ponte and Bruce Croft, 1998
- Like LSA, it can deal with the problems of
polysemy (same word different meaning) and
synonymy (different word same meaning)
- Unlike LSA, it has a firm theoretical
underpinning
- Less computationally demanding than LSA, which is particularly important for large document collections
33Language Modeling - Introduction
- Documents and queries are considered to be
generated stochastically using their underlying
language model
- The document in a collection that is considered most relevant to a given query is the one with the highest probability of generating the query from its language model
- The assumption of word order independence is usually made to make the model mathematically tractable, although bi-grams, tri-grams, etc. can be accommodated
34Language Modeling - Introduction
- A language model of a document is a function that
for any given word calculates the probability of
that word appearing in the document
- The probability of a phrase (or query) is calculated by multiplying together the individual word probabilities (applying the word order independence assumption)
- All we need is a method to estimate the language models of the documents!
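The idea on this slide can be sketched with the simplest possible estimate: each word's probability is its relative frequency in the document, and a query's probability is the product of its word probabilities. This is only an illustrative sketch; the function names and the two tiny example documents are invented here, and no smoothing is applied.

```python
from collections import Counter


def language_model(text):
    """Return a function giving P(word | document) as a relative frequency."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words)
    return lambda word: counts[word] / total


def query_probability(model, query):
    """Probability of the query under a model, assuming word order independence."""
    probability = 1.0
    for word in query.lower().split():
        probability *= model(word)
    return probability


# Hypothetical two-document collection.
documents = {
    "d1": "african elephant herd elephant calves",
    "d2": "elephant seal colony on the beach",
}
models = {doc_id: language_model(text) for doc_id, text in documents.items()}
scores = {doc_id: query_probability(m, "african elephant") for doc_id, m in models.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

With these documents, d1 scores (1/5) × (2/5) = 0.08 while d2 scores 0, because "african" never occurs in d2.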
35Language Modeling - Model Estimation
- Obviously the document itself is the primary data source, but:
- If a word, w1, doesn't appear in a document, should we have P(w1|Md) = 0? (In other words, meaning that it is impossible for language model Md to generate word w1)
- If word w2 appears 5 times in a 1,000 word document, is P(w2|Md) = 0.005 really a reasonable estimate?
- It is important to smooth the language model to overcome problems caused by lack of training data
36Language Modeling - Model Estimation
- There are a number of smoothing methods in use, the basic premise being that some of the probability mass in the model should be taken from the observed data to be used as an estimate for unseen words
- They take into account the Zipf's law nature of word usage (a few words used many times, many words used infrequently) to improve estimates, for instance the Good-Turing algorithm
- Corpus or general English word usage data may also be used to augment the document data, but care has to be taken. The Ponte-Croft paper addresses this problem by calculating a risk-adjustment factor, based on the difference between corpus and document word usage
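To make the idea of moving probability mass to unseen words concrete, the sketch below uses simple linear interpolation between document and collection frequencies. Note that this is a swapped-in technique chosen for brevity: it is neither Good-Turing nor the Ponte-Croft risk adjustment. The mixing weight `lam`, the function names and the example texts are all assumptions made for illustration.

```python
from collections import Counter


def counts_and_length(text):
    """Word counts and total word count for a piece of text."""
    words = text.lower().split()
    return Counter(words), len(words)


def smoothed_probability(word, doc_text, collection_text, lam=0.5):
    """Interpolated estimate: (1 - lam) * P(word | doc) + lam * P(word | collection).

    ASSUMPTION: lam is an arbitrary illustrative mixing weight, not a tuned value.
    """
    doc_counts, doc_len = counts_and_length(doc_text)
    col_counts, col_len = counts_and_length(collection_text)
    p_doc = doc_counts[word] / doc_len
    p_col = col_counts[word] / col_len
    return (1 - lam) * p_doc + lam * p_col


# Hypothetical documents; the "collection" is simply both documents together.
d1 = "african elephant herd elephant calves"
d2 = "elephant seal colony on the beach"
collection = d1 + " " + d2

p_d1 = smoothed_probability("african", d1, collection)
p_d2 = smoothed_probability("african", d2, collection)
vocabulary = set(collection.split())
total_d2 = sum(smoothed_probability(w, d2, collection) for w in vocabulary)
print(p_d1, p_d2, total_d2)
```

The word "african" no longer has zero probability under d2, yet d1 (which actually contains it) still assigns it a higher probability, and the smoothed probabilities over the vocabulary still sum to 1.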
37Language Modeling - Model Estimation
38Language Modeling - Results
- A basic implementation of the Ponte-Croft method
proved very useful in quickly locating relevant
documents in small local document collections
- It is planned to implement an improved version as part of the information retrieval tool that is currently under development (discussed next)
39A New Information Retrieval Tool
- Goals
- Assist the user to get required information from
an ad-hoc web search more quickly and
effectively
- Data driven to allow quick access to key parts of
retrieved documents
- In some cases provide answers to questions
directly without the user needing to view
documents
- Assist in refining web queries
- Use language modeling techniques locally on retrieved documents to satisfy more complex information needs, i.e. those not able to be expressed adequately in a web query
40A New Information Retrieval Tool
Workflow diagram: User, Web Query, Search Engine, Links to Results, WWW, Retrieved Document Text, Bi-gram List
- Retrieved Document Text: 150 docs 35 seconds, 200 docs 50 seconds
- From bi-gram List
- Question may be answered
- Jump directly to relevant parts of documents containing bi-grams
- Explore documents further using language modeling
- Formulate and run a refined web query including/rejecting selected bi-grams
41A New Information Retrieval Tool
- Why use bi-grams?
- Express a concept much better than a single word
Peanut
- Foods/cooking: Peanut butter, Peanut candy, Roasted peanut, Chocolate peanut, Peanut brittle, Peanut cookie, Peanut recipe, Peanut lover, Peanut soup, Peanut oil
- Agriculture: Peanut institute, Peanut commission, Peanut grower, Peanut farmer, Peanut producer, Peanut plant
- Commercial/Brands: Peanut software, Peanut linux, Peanut van, Peanut inn, Peanut clothing, Peanut Apparel, Baby peanut, Peanut sandals, Peanut ties, Mr peanut
- Medical: Peanut allergy
- Other: Peanut gallery
42A New Information Retrieval Tool
- Why use bi-grams?
- Can be used directly as a web search term by
using quotes
- Easily combined into well targeted searches using
OR and - operators
- Reasonably easy to extract meaningful bi-grams
from document text using simple rules
- Tri-grams or higher n-grams tend to occur too
infrequently
- Also, bi-grams seem to be somewhat neglected in current IR research, possibly because of the sparse data problem: a corpus with a vocabulary of 50,000 words has the potential for 2,500,000,000 bi-grams, causing difficulty in language modeling and many other IR techniques
43A New Information Retrieval Tool
- Simple bi-gram extraction rules
- Ignore a bi-gram if it contains a stop word (a word that doesn't convey much meaning, for instance "a", "an", "and", "of", "the", etc.; without this step the most frequent bi-grams are usually "of the", "in the", "to the" and so on)
- If bi-grams are found that are the same except for plurals (e.g. "african elephant" and "african elephants"), only present the most common form to the user
- Sort bi-grams by descending occurrence count within descending document occurrence count
- Alternative method
- Part of speech filtering: almost all interesting bi-grams are of the form Adjective Noun or Noun Noun
- It isn't always possible to determine the part of speech exactly (for example unknown words, words with multiple possible parts of speech), but we can certainly reject many bi-grams that could never be of the required form
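The simple extraction rules on this slide might be sketched as follows. This is a minimal illustration rather than the tool's actual code: the stop word list is a tiny sample, the plural folding is a crude "strip a trailing s from the second word" heuristic, the counts of plural variants are merged (an assumption), and bi-grams are allowed to span sentence boundaries for simplicity. All names and example texts are invented.

```python
import re
from collections import Counter

# ASSUMPTION: a tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}


def bigrams(text):
    """Adjacent word pairs that contain no stop word."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [(a, b) for a, b in zip(words, words[1:])
            if a not in STOP_WORDS and b not in STOP_WORDS]


def fold_plural(bigram):
    """Crude key that treats 'african elephants' like 'african elephant'."""
    first, second = bigram
    if second.endswith("s") and not second.endswith("ss") and len(second) > 3:
        second = second[:-1]
    return first, second


def rank_bigrams(documents):
    """Return (bi-gram, document count, occurrence count) rows, best first."""
    form_counts = Counter()      # occurrences of each surface form
    group_counts = Counter()     # occurrences of each plural-folded group
    group_docs = {}              # documents in which each group occurs
    for doc_id, text in enumerate(documents):
        for bigram in bigrams(text):
            key = fold_plural(bigram)
            form_counts[bigram] += 1
            group_counts[key] += 1
            group_docs.setdefault(key, set()).add(doc_id)
    forms = {}
    for bigram in form_counts:
        forms.setdefault(fold_plural(bigram), []).append(bigram)
    rows = []
    for key, variants in forms.items():
        shown = max(sorted(variants), key=lambda v: form_counts[v])  # most common form
        rows.append((" ".join(shown), len(group_docs[key]), group_counts[key]))
    # Document count first, then occurrence count, both descending.
    rows.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return rows


# Hypothetical retrieved texts.
documents = [
    "The African elephant is the largest land mammal.",
    "African elephants eat grass; African elephants are a land mammal.",
]
rows = rank_bigrams(documents)
for row in rows:
    print(row)
```

In this example the plural form "african elephants" is shown because it is the more frequent variant, and it ranks first because it occurs in both documents with the highest total count.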
44A New Information Retrieval Tool
- Example of an automatically generated bi-gram
list
- 150 documents from a simple Google search for "Elephants"
- elephant jokes
- baby elephant
- elephant elephas
- 70 years
- indian elephant
- elephant society
- large ears
- wild elephants
- land mammal
- white elephant
- forest elephant
- baby elephants
- elephants eat
- ivory tusks
- family elephantidae
- species survival
- 13 feet
- largest living
- largest land
- elephant seal
- natural history
- young elephants
- female elephants
- elephant conservation
- long time
- elephant range
- give birth
- incisor teeth
- 20 years
- ivory trade
- 60 years
- blood vessels
- african elephant
- elephant man
- loxodonta africana
- elephant loxodonta
- asian elephant
- south africa
- privacy policy
- united states
- years old
- elephas maximus
- 22 months
- national park
- endangered species
- elephants live
- forest elephants
- small objects
45A New Information Retrieval Tool
- From browsing the bi-gram list the user:
- Gets an idea of the topic areas in the retrieved data
- Can jump directly to relevant parts of retrieved documents
- Can mark bi-grams to include/exclude from a new web search
- Can mark portions of retrieved documents as relevant/irrelevant to use in a more targeted local search using language modeling (this allows drilling down to discover information on concepts that are difficult to express as a web query)
- The user can also use natural language and/or keyword queries using language modeling to assist in examining local data (i.e. the document text downloaded as part of the retrieval process)
46A New Information Retrieval Tool
- Example of a new web query formulated by marking
relevant/irrelevant bi-grams
- elephant ("african elephant" OR "asian elephants" OR "loxodonta africana") -"elephant man" -"elephant seal" -"elephant jokes" -"white elephant"
- Using the criterion of our test information need (web pages with suitable information for a high school project on the land mammal elephant), 48 of the first 50 pages returned by the search were judged to be relevant.
- The web search also indicated there were about 228,000 pages matching our query, so we were still getting a very wide coverage
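Assembling such a query from the marked bi-grams is mostly string formatting. The sketch below shows one way it could be done; the function name `build_web_query` and its parameters are hypothetical, and the quoting, OR and minus syntax simply mirrors the example query on this slide.

```python
def build_web_query(keyword, include, exclude):
    """Combine a keyword with quoted bi-grams to include (OR-ed) and exclude (-)."""
    included = " OR ".join(f'"{bigram}"' for bigram in include)
    excluded = " ".join(f'-"{bigram}"' for bigram in exclude)
    parts = [keyword]
    if included:
        parts.append(f"({included})")
    if excluded:
        parts.append(excluded)
    return " ".join(parts)


# Bi-grams marked as relevant / irrelevant in the elephant example.
query = build_web_query(
    "elephant",
    include=["african elephant", "asian elephants", "loxodonta africana"],
    exclude=["elephant man", "elephant seal", "elephant jokes", "white elephant"],
)
print(query)
```

Run on the bi-grams marked in the elephant example, this reproduces the query shown on the slide.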
47A New Information Retrieval Tool
- Are the generated bi-gram lists always useful?
- At least some of the information in the retrieved
documents must be relevant for interesting
bi-grams to be generated
- On the other hand, if the bi-grams are all way off track, the user knows immediately to rethink the initial search, rather than reviewing many irrelevant documents
- Initial testing with 27 different one-word Google searches (150 documents retrieved for each search) generated useful bi-gram lists in 21 cases
- By using a two-word search (e.g. "angle geometry" instead of "angle", "cobra snake" instead of "cobra"), useful bi-gram lists were obtained for five of the remaining six cases
51A New Information Retrieval Tool
- Future work
- Continue development and refinement of this
information retrieval tool
- Identify examples of information needs that are
difficult to satisfy by direct web search alone
- User survey: how does the tool perform in practice? How could it be improved?
52Questions/Comments?