Title: Sample Searches
1. Sample Searches
- Query: smalheiser
- Response:
- DBLP: Neil Smalheiser
- PNAS abstract: Smalheiser et al. 97
- MBC abstract: Smalheiser
- UIC Dept of Psychiatry: Neil Smalheiser
2. Query: computer science genetics
- Campus program (university, campus, college and employment resources)
- SpringerLink (on-line journals and books in science, technology and medicine)
- Course (advanced topics in computer science and computational genetics)
- Annual Review of Computer Science
3. Comments for Slides 1-2
- The first two searches are intended to show that Google is reasonably good at retrieving Web pages when given a few keywords. In the first query, the name Smalheiser is submitted. Ideally, the home page of Smalheiser should be retrieved first. Instead, his publications in computer science are retrieved in the first document. This is followed by some of his publications in the medical area. Finally, his home page in Psychiatry is retrieved.
- The second query asks for important documents in the intersection of the two areas, computer science and genetics. The first retrieved document seems to be unrelated to the query. The second retrieved document seems acceptable. The third document is a course in both areas.
- The examples show that Google is still far from perfect.
4. Information Retrieval
- Document representation: remove stop words. E.g., in "Automatically Identifying Gene Terms in MEDLINE Abstracts", remove "in".
- Stemming: "Automatically" becomes "automatic", "Identifying" becomes "identify", "Abstracts" becomes "abstract".
5. Comments on last slide
The first two steps in constructing a document
representation consist of eliminating non-content
words and mapping variations of the same word to
the same stem via a process called stemming.
6. Document representation
- A document is a set of content words or terms: automatic, identify, gene, term, medline, abstract
- Sometimes, keep the locations of terms. E.g., "automatic" is the first word in the title.
7. Comments on last slide
- Location information can be important in differentiating the ordering of content words in a query from other orderings of the same words. It is also useful in determining phrases.
8. Assign weights to terms
- Term frequency: no. of times the term occurs in the document
- Document frequency: no. of documents having the term
- The weight of a term in a document is proportional to its term frequency and inversely proportional to its document frequency. E.g., weight = term frequency × log(N / document frequency), where N = no. of documents in the collection.
9. Comments on the last slide
- The well-known tf-idf weighting scheme to assign
a weight to a term is given. The weight is
proportional to the term frequency and inversely
proportional to its document frequency. There are
numerous variations of this formula, but all of
them have the property that higher weights are
given to terms with higher term frequencies and
lower document frequencies.
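
A minimal sketch of the basic tf × log(N/df) variant from slide 8, on a made-up toy collection of preprocessed documents:

```python
import math
from collections import Counter

docs = [
    ["automatic", "identify", "gene", "term", "medline", "abstract"],
    ["gene", "expression", "medline", "abstract"],
    ["computer", "science", "genetics", "course"],
]

N = len(docs)
# Document frequency: number of documents containing each term.
df = Counter()
for doc in docs:
    df.update(set(doc))

def tf_idf(term: str, doc: list[str]) -> float:
    tf = doc.count(term)                 # term frequency in this document
    return tf * math.log(N / df[term])   # rarer terms get higher weights

print(tf_idf("gene", docs[0]))      # in 2 of 3 docs -> lower weight
print(tf_idf("computer", docs[2]))  # in 1 of 3 docs -> higher weight
```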
10. Other factors in assigning weights
- Terms in the title
- Terms in the abstract
- Terms in big fonts, etc.
- Such terms get heavier weights.
11. Comments to last slide
- If a term occurs in the title, it usually gets a higher weight than the same term occurring in the main text. The same may apply to a term appearing in the abstract. If the term occurs in a big font or in a way that attracts the reader's attention, it should also get a higher weight.
- All these situations can be implemented by assuming that each occurrence of such a term is equivalent to k occurrences of the same term in the main text, with k > 1.
12. Query representation
- Two common models:
- Vector space model: the query is a set of terms, possibly ordered
- Boolean model: terms connected by AND, OR and NOT
13. Comments to the last slide
- In the information retrieval literature, it has been shown that the vector space model is usually better than the Boolean model: if a query contains quite a few terms connected by ANDs, then there may be no document satisfying the query; if the terms are connected by ORs, then there may be too many unordered documents satisfying the query, and the user has no efficient way to separate the useful documents from the irrelevant ones.
- In practice, it is likely that a hybrid model having features of both models is used for effective retrieval.
14. Vector space model
- Each dimension of a vector represents a distinct term; the dimensions cover all terms in the collection, including proper names. E.g., "automatic identify gene" becomes (1, 1, 1, 0, ...).
15. Compute the similarity between a query and a document
- Q = (q1, ..., qn), D = (d1, ..., dn)
- Dot(Q, D): counts the terms in common; favors long documents
- Norm(D): the length of vector D
- Cosine(Q, D) = Dot(Q, D) / (Norm(D) × Norm(Q))
16. Comments on last slide
- When the documents are binary vectors, the Dot product similarity function gives the number of terms the two vectors have in common. When the terms are weighted, the weights are incorporated into the similarity function. Clearly, this favors a long document such as an encyclopedia.
- To compensate, the norm (length) of a document is included in the denominator of the similarity function, so that a longer document gets a larger denominator. The query norm is used to ensure that the Cosine function returns a value between 0 and 1, if all terms have non-negative values. When the two vectors differ from each other by a positive multiplicative constant, their angular distance is 0 and the Cosine value is 1.
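
A minimal sketch of the Dot, Norm and Cosine functions from slide 15, over weighted term vectors stored as dicts (the example weights are made up):

```python
import math

def dot(q: dict, d: dict) -> float:
    # Sum of products over the terms the two vectors share.
    return sum(w * d[t] for t, w in q.items() if t in d)

def norm(v: dict) -> float:
    return math.sqrt(sum(w * w for w in v.values()))

def cosine(q: dict, d: dict) -> float:
    return dot(q, d) / (norm(q) * norm(d))

query = {"gene": 1.0, "abstract": 1.0}
doc = {"automatic": 0.5, "gene": 2.0, "medline": 1.2, "abstract": 0.8}
print(cosine(query, doc))  # between 0 and 1 for non-negative weights
```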
17. Boolean model
- gene AND abstract (sometimes "+" is used to indicate that a term must be present)
- gene OR abstract
- gene AND NOT abstract (uses "-" to indicate undesired terms, e.g., +gene -abstract)
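
A minimal sketch of Boolean retrieval over documents represented as sets of terms, where AND, OR and NOT become set operations (the toy documents are made up):

```python
docs = {
    "d1": {"gene", "term", "medline", "abstract"},
    "d2": {"gene", "expression"},
    "d3": {"abstract", "computer"},
}

def having(term: str) -> set[str]:
    # All document ids whose term set contains the given term.
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_ids = set(docs)
print(having("gene") & having("abstract"))              # gene AND abstract
print(having("gene") | having("abstract"))              # gene OR abstract
print(having("gene") & (all_ids - having("abstract")))  # gene AND NOT abstract
```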
18. Other features
- Phrase search: "information retrieval"
- Proximity search: information NEAR retrieval
- Date search: 2003
- Field search: e.g., in the field Author, look for Neal
- Wildcard search: smal*er
19. Comments on last slide
- Some systems require a query phrase such as "information retrieval" to be placed in quotes. This may require a retrieved document to contain exactly such a phrase.
- If a document containing the words "retrieval of information" is desired, the query can be reformulated as information NEAR retrieval.
- Filtering operations can be specified by filling in additional information in specific fields such as the author field. Wildcard entries such as smal*er, where * denotes zero or more characters, are allowed, provided that * does not occur in the first few characters (say 3); otherwise the space of matching strings to search will be too large.
20. Additional features
- Case sensitivity: "java" gets java, Java and JAVA; "Java" gets Java and possibly JAVA (a capital first letter implies a proper name)
- Ordered query terms: e.g., "stray dog"
- Spelling errors: if no such word exists, some search engines suggest similar words
21. Comments on last slide
- Location information in the documents and the query can be used to differentiate "stray dog" from "dog stray".
- If a word does not exist in the index of all words in the documents, then some search engines may suggest neighboring words which differ from the misspelled word by 1 or 2 characters. Note that proper names are included in the index.
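
A minimal sketch of that suggestion step: if a query word is missing from the index, return index words within edit distance 2 (the toy index is made up):

```python
def edit_distance(a: str, b: str) -> int:
    # Standard dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

index = {"smalheiser", "gene", "genetics", "retrieval", "medline"}

def suggest(word: str) -> list[str]:
    if word in index:
        return [word]
    return [w for w in index if edit_distance(word, w) <= 2]

print(suggest("retreival"))  # -> ['retrieval']
```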
22. Directory search
- Specify a subtree of the directory. E.g., top-level categories: computer, finance, medicine; under computer: hardware, software, ...
- The query "memory" under computer means computer memory, vs. human memory under medicine.
23. Comments on last slide
- Directory search may reduce ambiguities.
- In the given directory, documents or pages are classified under each node. For example, there is a set of documents classified under computer and another class under medicine. The former class contains documents about computer memory, while the latter class contains documents about human memory. If the query is restricted to the class computer, then only documents in the former class, relating to computer memory, will be retrieved.
24. Feedback
- Identify relevant documents and possibly irrelevant documents; re-formulate the query using terms from the relevant documents and from the irrelevant documents.
- Query: apple. Relevant doc: computer. Irrelevant doc: fruit.
- Modified query: apple, computer, -fruit
25. Comments to last slide
- The user needs to identify relevant documents and possibly irrelevant documents. Terms from the relevant documents may be added to the query, while terms from the irrelevant documents may be used to exclude documents having such terms from being retrieved in the next round. In the example, the term "computer" is found in the relevant documents and is added to the query, while the term "fruit" is found in the irrelevant documents and is used to exclude documents having that word.
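
A minimal sketch of this reformulation over set-based document representations (real systems weight the added terms, e.g. with Rocchio's formula, rather than using plain sets):

```python
def reformulate(query: set[str], relevant: list[set[str]],
                irrelevant: list[set[str]]) -> tuple[set[str], set[str]]:
    # Terms seen in relevant documents are added to the query; terms
    # seen only in irrelevant documents become exclusion terms.
    add = set().union(*relevant) - query if relevant else set()
    exclude = set().union(*irrelevant) - query - add if irrelevant else set()
    return query | add, exclude

query = {"apple"}
relevant = [{"apple", "computer"}]
irrelevant = [{"apple", "fruit"}]
positive, negative = reformulate(query, relevant, irrelevant)
print(positive, negative)  # {'apple', 'computer'} {'fruit'}
```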
26. Web
- Surface Web: pages linked together
- Deep Web: pages not linked; documents can be generated dynamically by programs
- Quite a few medical and bio-medical databases are in the Deep Web.
27. Comments on the last slide
- The Web is roughly classified into the Surface Web and the Deep Web. The pages in the former are hyperlinked, while pages in the latter are accessible only by submitting queries to query interfaces.
- Web crawlers, which extract content information from Surface Web pages, are unable to get into Deep Web pages for lack of hyperlinks.
28. Retrieval from the Surface Web
- Anchor text belongs to the document pointed to.
- <a href="http://tigger.uic.edu/htbin/cgiwrap/bin/newsbureau/cgi-bin/index.cgi">More News</a>
- Page rank: the importance of a Web page. Rank(P) = sum of Rank(Qi)/C(Qi) over every page Qi pointing to P, where C(Qi) is the number of links out of Qi; computed iteratively; has a Web-surfing interpretation.
29. Comments on last slide
- There are some differences between retrieval from
the Web and from non-Web sources. In the former
case, words known as anchor texts which appear
together with the link from a page A to another
page B should be utilized for retrieval.
Specifically, the anchor words should be used as
content words for page B, as they describe the
contents of B as observed by the user who creates
A.
30. Example to illustrate page rank
- Rank(P) = 1/2 Rank(A) + 1/3 Rank(B)
- [Figure: pages A and B each point to page P.]
- A lot of pages point to the IBM home page, implying that it has a very high page rank.
31. Comments on last slide
- The example illustrates how the page rank of a
page can be computed. In practice, all pages are
initialized with the same rank and the page rank
formula is applied to compute the page ranks of
all pages. This process is repeated until
convergence is reached. Under some reasonable
assumptions, convergence is guaranteed. The page
rank information is utilized to rank pages for
any user query.
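
A minimal sketch of this iterative computation on a small made-up link graph, using the simplified formula from slide 28 (no damping factor):

```python
links = {                  # page -> pages it points to
    "A": ["P", "B"],       # A has 2 outlinks, so P gets 1/2 of A's rank
    "B": ["P", "A", "C"],  # B has 3 outlinks, so P gets 1/3 of B's rank
    "C": ["A"],
    "P": ["A"],
}

rank = {page: 1.0 / len(links) for page in links}  # same initial rank
for _ in range(50):                                # iterate to convergence
    new_rank = {page: 0.0 for page in links}
    for page, outgoing in links.items():
        share = rank[page] / len(outgoing)
        for target in outgoing:
            new_rank[target] += share
    rank = new_rank

print(rank)  # P's rank converges to 1/2 Rank(A) + 1/3 Rank(B)
```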
32. Query: IBM
- Thousands of pages have that word, but among those pages, the IBM home page has the largest page rank.
- Google utilizes page rank.
33. Comments on last slide
- There are a number of ways to utilize page ranks
to rank pages for a given query. One way is to
first retrieve pages which have reasonable
similarities with the query. Then the retrieved
pages are re-ranked in descending order of page
rank. Another way is to compute the relevance of
a page based on a function of the similarity of
the page with the query and its page rank. Then
pages are re-ranked in descending order of
relevance.
34. Authority and Hub
- The query retrieves documents based on similarities.
- Expand this set by adding their parents and their children.
- Compute A(p) = sum of H(q) over each edge (q, p)
- Compute H(p) = sum of A(q) over each edge (p, q)
35. Authority and Hub, continued
- Normalize A(p) and H(p)
- Repeat until A() and H() converge
- Output the pages with the top authority scores (it has been shown that convergence is guaranteed)
- www.teoma.com (this company claims to have an advanced search capability which is more accurate than the standard authority and hub technique)
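
A minimal sketch of this authority/hub iteration on a small made-up link graph:

```python
import math

links = {              # page -> pages it points to
    "p1": ["p2", "p3"],
    "p2": ["p3"],
    "p3": ["p1"],
    "p4": ["p3"],
}

hub = {p: 1.0 for p in links}
auth = {p: 1.0 for p in links}

def normalize(scores: dict) -> dict:
    length = math.sqrt(sum(v * v for v in scores.values()))
    return {p: v / length for p, v in scores.items()}

for _ in range(30):    # repeat until (approximate) convergence
    # A(p) = sum of H(q) over pages q linking to p.
    auth = {p: sum(hub[q] for q, outs in links.items() if p in outs)
            for p in links}
    # H(p) = sum of A(q) over pages q that p links to.
    hub = {p: sum(auth[q] for q in links[p]) for p in links}
    auth, hub = normalize(auth), normalize(hub)

# p3 is pointed to by the most pages, so it gets the top authority score.
print(sorted(auth, key=auth.get, reverse=True))
```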
36.
- Various features of different search engines, including Google, AltaVista, HotBot, etc., are covered in:
- Search Engines for the World Wide Web, by Alfred and Emily Glossbrenner, 3rd edition, Peachpit Press, 2001.
37. Metasearch engine
- Connects to numerous search engines. Given a query Q, it finds suitable search engines to process the query, invokes the selected search engines, and merges their results.
38. Comments on last slide
- Instead of using a search engine such as Google, a metasearch engine which connects to numerous search engines can be utilized. Upon receiving a user query, a metasearch engine sends the query (possibly with some modifications) to appropriate search engines, then merges and re-ranks the documents returned by the invoked search engines.
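
A minimal sketch of this dispatch-and-merge flow. The two engine functions are hypothetical stand-ins returning (url, score) pairs; a real metasearch engine would issue HTTP requests and normalize the engines' scores before merging:

```python
def engine_a(query: str) -> list[tuple[str, float]]:
    return [("http://example.org/1", 0.9), ("http://example.org/2", 0.7)]

def engine_b(query: str) -> list[tuple[str, float]]:
    return [("http://example.org/2", 0.8), ("http://example.org/3", 0.6)]

def metasearch(query: str) -> list[tuple[str, float]]:
    merged: dict[str, float] = {}
    for engine in (engine_a, engine_b):
        for url, score in engine(query):
            # Keep the best score a result obtained from any engine.
            merged[url] = max(merged.get(url, 0.0), score)
    # Re-rank the merged results in descending order of score.
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)

print(metasearch("computer science genetics"))
```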
39. Advantages of Metasearch Engines over Search Engines
- No need for substantial hardware, relative to large search engines
- Large coverage
- Up-to-date information
40. Comments to last slide
- There is no need for substantial hardware,
because the searches are done by the underlying
search engines. The coverage of a metasearch
engine is the union of the coverages of the
individual search engines. That it may have more
up-to-date information than a large search engine
will be explained by the next few slides.
41. Up-to-date information
- A search engine's crawler gets the data
- It builds a large index database
- Updating a large index database is time consuming
- A metasearch engine connects to numerous small search engines
42. Comments to last slide
- A search engine utilizes a crawler to extract
contents from Surface Web pages and then builds
an index database. Upon receiving a query, the
search engine searches the index database to
determine the pages to return to the user. Since
the contents of Web pages keep on changing, the
index database needs to be updated. However, the
index database is large and refreshing it may
take a long time, say weeks. In contrast, if a
metasearch engine is connected to numerous small
search engines and each of these search engines
keeps its database up-to-date, the metasearch
engine may be able to provide current information.
43. Utilize a dictionary/ontology
- WordNet: ordinary dictionary terms
- MeSH hierarchy: medical terms
- May want to include synonyms and hyponyms of query terms in the query
- Person --- (synonyms: human, people; hyponyms: man, woman)
44. Comments to last slide
- Dictionaries or ontologies may be utilized to achieve high retrieval effectiveness. A common dictionary for the general domain is WordNet, which provides synonyms, hyponyms and other relationships for each ordinary word. As an example, if a query contains the word "person", its synonyms and hyponyms may be added to the query. Note that a word may have multiple senses (meanings), and selecting suitable synonyms and hyponyms is essential. It is worthwhile to explore the use of the MeSH hierarchy for effective retrieval in the medical domain.
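
A minimal sketch of such expansion using NLTK's WordNet interface (assumes nltk is installed and the wordnet corpus is downloadable); it naively takes the first sense, which is exactly the pitfall the next slide describes:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def expand(term: str) -> set[str]:
    senses = wn.synsets(term)
    if not senses:
        return {term}
    sense = senses[0]  # naive: take the first sense, ignoring ambiguity
    synonyms = set(sense.lemma_names())
    hyponyms = {name for h in sense.hyponyms() for name in h.lemma_names()}
    return {term} | synonyms | hyponyms

print(expand("person"))  # synonyms and hyponyms of sense 1 of "person"
```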
45. Difficulty
- A word sometimes has many senses. E.g., query: drugs for mental patients
- Senses for drugs: prescription drugs, illegal drugs
- Useful to include "antidepressant"; will retrieve a lot of irrelevant documents if "heroin" is included
46. Comments to the last slide
- The example shows that a correct addition of a
hyponym (antidepressant is a hyponym of drug)
will lead to high retrieval effectiveness while
an incorrect addition (heroin is also a hyponym
of drug) leads to poor retrieval results.
47. Natural Language Processing for Information Retrieval
- Find the part of speech of each word
- Identify noun phrases
- Identify proper names
- Recognize acronyms, e.g., CHF = congestive heart failure
- Word sense disambiguation, e.g., Apple CPU
48. Comments to last slide
- Natural language processing plays a role in information retrieval. However, so far, it is used only to identify parts of speech of words, named entities and phrases.
- Recognition of acronyms is also useful for information retrieval.
49. Word Sense Disambiguation
- Pine: 1. a kind of evergreen tree with needle-shaped leaves; 2. waste away through sorrow or illness
- Cone: 1. a solid body which narrows to a point; 2. the fruit of certain evergreen trees
- Find the combination of descriptions which have the largest number of words in common.
50. Comments to last slide
- If the query is "pine cone" and each of the two words has multiple senses, the correct senses may be identified by finding the combination of senses whose descriptions have the largest number of words in common. In this example, sense 1 of pine and sense 2 of cone have the words "evergreen tree" in common in their descriptions. These common words may be added to the query to improve retrieval effectiveness.
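
A minimal sketch of this sense-selection step (a simplified form of the Lesk method), scoring every combination of sense descriptions by word overlap:

```python
from itertools import product

senses = {
    "pine": ["kind of evergreen tree with needle-shaped leaves",
             "waste away through sorrow or illness"],
    "cone": ["solid body which narrows to a point",
             "fruit of certain evergreen trees"],
}

def overlap(desc1: str, desc2: str) -> int:
    # Number of words the two descriptions share.
    return len(set(desc1.split()) & set(desc2.split()))

# Try every (pine sense, cone sense) pair and keep the best overlap.
best = max(product(senses["pine"], senses["cone"]),
           key=lambda pair: overlap(*pair))
print(best)  # pine sense 1 and cone sense 2, which share "evergreen"
```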
51. Information extraction
- Information retrieval obtains whole documents; often users want small parts of the retrieved documents.
- Examples: from certain papers on heart disease, extract the names of the authors; from the experimental sections of papers, extract tables of interest.
52. Techniques
- (1) Construct rules involving patterns or keywords to identify the parts of interest; utilize a grammar to extract the required information. E.g., to identify terrorist events, useful keywords are kill, bomb, etc.; use a grammar to identify the subjects (terrorists) and the objects (victims).
53. Comments
- Traditionally, information extraction is achieved by manually constructing extraction rules after examining numerous instances of what is desired. To save labor cost, machine learning techniques are introduced: rules are automatically constructed, and based on positive and negative examples, promising rules are kept for future extraction.
54.
- (2) Use machine learning techniques to construct rules
- Positive and negative examples can be given to guide the construction
- Aim: reduce the manual construction of rules
55. Machine learning
- Example:
  Pavilion a230 Minitower
  AMD Athlon XP ... GHz
  Pavilion a210n Minitower
  Intel Celeron ... GHz
- Rule: (var1) (var2) GHz
56. Comments
- In this example, the user supplies a few positive instances to be extracted. Then the system automatically constructs the rule, with GHz as a landmark. The words before the landmark are captured by variables.
- In the Web environment, HTML or XML documents have tags, and these may be used to construct rules. However, rules involving tags may be site dependent, implying that new rules may need to be generated when a site changes.
57. Rules involving tags
- <b> Martino Motor Sales </b>
- <b> Currie Motors Lincoln Memory </b>
- Rule: <b> Var </b>
- Extracted data may not be that structured.
- The layout of a document can be site dependent, implying that new rules need to be constructed for new sites.
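
A minimal sketch of both rules expressed as regular expressions. The hand-written patterns stand in for rules a learner would construct from examples, and the clock speeds are made up for illustration, since the slides elide them:

```python
import re

spec_text = """Pavilion a230 Minitower
AMD Athlon XP 2.2 GHz
Pavilion a210n Minitower
Intel Celeron 2.6 GHz"""

# Rule: (var1) (var2) GHz -- capture the two tokens before the
# landmark "GHz". (The speeds 2.2 and 2.6 are hypothetical.)
for var1, var2 in re.findall(r"(\S+)\s+(\S+)\s+GHz", spec_text):
    print(var1, var2)

html = "<b> Martino Motor Sales </b> ... <b> Currie Motors Lincoln Memory </b>"

# Rule: <b> Var </b> -- capture whatever sits between the tags.
for var in re.findall(r"<b>\s*(.*?)\s*</b>", html):
    print(var)
```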
58. Summary
- Information retrieval
- User's point of view, e.g., phrases, case sensitivity
- System's point of view, e.g., feedback, query construction
- Web retrieval vs. non-Web retrieval
- Search engine vs. metasearch engine
59. Summary, continued
- Natural language processing, e.g., acronym recognition
- Information extraction
- Rules: manual, machine learning
- Rules can be site dependent