Title: Informetrics and IR
1. Informetrics and IR
- Birger Larsen
- Information Interaction and Information Architecture
- Royal School of Library and Information Science
- Copenhagen, Denmark
- blar_at_db.dk
If I have not seen as far as others, it is
because there were giants standing on my
shoulders. --Hal Abelson
2. Outline
- Part 1: Garfield's citation indexes
- Part 2: Own PhD project on automatic IR by citations
- Part 3: Pennant diagrams for visualising retrieval results
3. Part 1: Garfield's citation indexes
- Science Citation Index
- Social Science Citation Index
- Arts Humanities Citation Index
- Based on the idea that there is a subject relation between a document and its references (cf. Law)
- That is, if you find that a given document is relevant to you, then there is a fair chance that you will also find some of the references in its bibliography (and the documents later citing it) relevant
- See his website and essays: http://www.garfield.library.upenn.edu/essays.html
4. Garfield's citation indexes
- NOT developed for citation counting and research evaluation (!), but as a retrieval tool
- Advantages of using references as index terms:
- Time and money saved compared to intellectual indexing
- Facilitates a multidisciplinary index
- No inter-indexer inconsistency
- References have strong semantics for researchers (Heisenberg, 1928; Small, 1973; Garfield, 1955)
- The network of references and citations supports new retrieval strategies intuitively appealing to researchers
5. Garfield's citation indexes
- Multi-disciplinary indexes: by indexing the core journals of a number of fields, these are covered quite well (Garfield's Law of Concentration)
- The tail of the Bradford curve consists mainly of core journals of other fields
- One of the factors behind the selection of journals for the citation indexes
- High citation rates are another
- Scientific panels advise ISI on journals to include or exclude
6. Garfield's citation indexes
- Advantages
- Multidisciplinary: indexes 9,000 journals
- Indexes all references (also non-source documents), facilitating analysis of received citations
- Indexes cover-to-cover
- All authors and their addresses are indexed for
source documents
7. Garfield's citation indexes
[Figure: source documents vs. cited target documents]
- Important difference between:
- Source documents
- Have been handled by ISI indexers
- A wealth of bibliographic data
- Cited documents (targets)
- Much more meagre data (CA, CY, cited volume and page, CW)
- Can be citations to source documents, but also to anything else (e.g., Tolkien or the Bible)!
- The cited references are inverted into a citation index (sketched below)
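As a minimal sketch of that inversion, the snippet below turns the reference lists of a few source documents into a citation index. The record structure and cited-reference strings are invented for illustration; they do not mirror ISI's actual record format.

```python
from collections import defaultdict

# Minimal sketch: invert the reference lists of source documents into a
# citation index. Record structure and reference strings are illustrative only.
source_documents = {
    "SRC-1": {"references": ["GARFIELD E, 1955, SCIENCE", "SMALL H, 1973, JASIS"]},
    "SRC-2": {"references": ["GARFIELD E, 1955, SCIENCE"]},
}

citation_index = defaultdict(list)  # cited target -> citing source documents
for doc_id, record in source_documents.items():
    for cited in record["references"]:
        citation_index[cited].append(doc_id)

print(citation_index["GARFIELD E, 1955, SCIENCE"])  # -> ['SRC-1', 'SRC-2']
```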
8. Other Citation Indexes
- SCOPUS
- Produced by Elsevier (no access from RSLIS)
- 15,000 peer-reviewed journals, 1,000 Open Access journals, 500 conference proceedings, 600 trade publications, 125 book series
- Google Scholar (http://scholar.google.com)
- Automatically generated citation index; many errors
- ResearchIndex/CiteSeer (http://citeseer.ist.psu.edu/)
- Automatically generated citation index
- Good coverage of Computer Science
9. Part 2: Own PhD project on automatic IR by citations
- Title: References and citations in automatic indexing and retrieval systems: experiments with the boomerang effect
- The PhD project is an amalgamation of ideas from information retrieval and informetrics/bibliometrics
10. The Boomerang Effect
- In short, the overall objective of the project is to develop and empirically test a new method for automatic indexing of structured full-text documents, where references and citations are given special attention
- The central hypothesis is that by working with several (poly-)representations derived from different cognitive and functional origins, better performing IR systems can be constructed
11. Motivation
- The principle of polyrepresentation hypothesises that overlaps between representations can be exploited to reduce the uncertainty inherent in IR (Ingwersen, 1994, 1996) (sketched below)
- Emphasis is placed on representations of different cognitive and functional origin, generated over time from users' information needs, the documents, and other cognitive agents
- The ever growing number of documents in digital libraries makes it possible to test this, as a multitude of representations may be generated from the full-text documents (a move away from bag-of-words)
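The overlap idea can be illustrated with a small sketch: documents retrieved by several cognitively different representations are ranked ahead of documents retrieved by only one. The representation names and result sets below are invented.

```python
# Sketch of exploiting overlaps between representations (polyrepresentation):
# documents found by several cognitively different representations rank ahead
# of documents found by only one. Names and result sets are invented.
results_by_representation = {
    "title":        {"d1", "d2", "d3"},
    "abstract":     {"d2", "d3", "d4"},
    "cited_titles": {"d3", "d5"},
}

overlap = {}
for rep_docs in results_by_representation.values():
    for doc in rep_docs:
        overlap[doc] = overlap.get(doc, 0) + 1

# d3 (found by 3 representations) ranks before d2 (2), then d1, d4, d5 (1)
for doc, level in sorted(overlap.items(), key=lambda item: -item[1]):
    print(doc, level)
```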
12. Motivation
- The project attempts to exploit Eugene Garfield's original idea behind the citation indexes, which was to create an alternative form of subject access to scientific documents (Garfield, 1955)
- Studies from the online IR community have shown good performance when using citations (McCain, 1989; Pao, 1993), and web IR relies more and more on link analysis (e.g., Google, ResearchIndex.com)
- A problem, however, has been that queries cannot be stated in natural language, but only as seed documents
13. Motivation
- References and citations have shown promising results as representations in IR, but have never been integrated into the research on automatic IR systems
- Scientific documents (i.e., with outgoing references and received citations) are for the first time becoming available electronically in large quantities
- This offers new possibilities for combining conventional automatic indexing and retrieval techniques with the exploitation of references and citations
14. References and citations
- Scientists add references to earlier work in their papers as part of the scientific communication of ideas
- Scientists' motives for giving references can be analysed from both normative and social constructivist positions
- A unifying explanation is provided by Small (1978), who regards references as concept symbols
- An idea that is being used in the course of an argument
- The references actually given function as symbols for a concept
15. References and citations in IR
- Many possible retrieval strategies with citation indexes, but forward chaining is a unique capability (see the sketch below)
- It allows you to retrieve more recent, citing documents
- Forward chaining does, however, require that a seed document (i.e., a known, relevant document) is given as a starting point
- Good seed documents may be hard to obtain, particularly if the subject is not well-known or new
- The lack of available seed documents is probably one of the main reasons behind the rare inclusion of references and citations in IR research
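As promised above, a minimal sketch of forward chaining from seed documents through a citation index. The index and document identifiers are invented, and following more than one citation generation is an optional generalisation of the single forward step described on the slide.

```python
# Sketch of forward chaining: from seed documents, follow "cited by" links
# to retrieve more recent, citing documents. The citation index and document
# identifiers are invented for illustration.
citation_index = {
    "seed_1994": ["citing_1998", "citing_2001"],
    "citing_1998": ["citing_2004"],
}

def forward_chain(seeds, index, generations=1):
    frontier, seen = set(seeds), set(seeds)
    for _ in range(generations):
        frontier = {c for doc in frontier for c in index.get(doc, [])} - seen
        seen |= frontier
    return seen - set(seeds)

print(forward_chain(["seed_1994"], citation_index, generations=2))
# -> {'citing_1998', 'citing_2001', 'citing_2004'}
```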
16. The Boomerang effect: Boolean
[Figure, modified from Larsen, 2002: Step 1: documents (references) -> Step 2: references (citations) -> Step 3: documents. OL = Overlap Level]
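A schematic sketch of the three steps in the figure follows. Treating the overlap level (OL) as the number of step-2 references that a step-3 document matches is my reading of the diagram, made for illustration, not a quotation of Larsen (2002).

```python
# Schematic sketch of the Boolean boomerang effect in the figure: step 1
# retrieves documents, step 2 collects their references, and step 3 retrieves
# documents through those references, grouped by overlap level (OL).
def boomerang_boolean(query_terms, term_index, references_of, citation_index):
    # Step 1: documents matching the (Boolean OR) query
    step1_docs = set().union(*(term_index.get(t, set()) for t in query_terms))
    # Step 2: references given by the step-1 documents
    step2_refs = {r for d in step1_docs for r in references_of.get(d, [])}
    # Step 3: documents citing those references, grouped by overlap level
    overlap_level = {}
    for ref in step2_refs:
        for doc in citation_index.get(ref, []):
            overlap_level[doc] = overlap_level.get(doc, 0) + 1
    return sorted(overlap_level.items(), key=lambda item: -item[1])
```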
17. Pre-experiment
- Pre-experiment with records from SCI and 3 work tasks, with 100 documents per work task assessed by a researcher
- Good results: precision was, as expected, better at higher OLs and consistently lower at lower OLs in all work tasks, step by step
- Described in Larsen (2002), and is the background for the best match version proposed in Larsen and Ingwersen (2002) and tested experimentally in the main experiment
18. Pre-experiment
[Figure: Sample A: 135 documents from step 1 also retrieved in step 3; Sample B: 143 additional documents retrieved in step 3]
19. Pre-experiment
- Advantages
- A lot of documents retrieved (264 in step 1 -> 4,774 in step 3)
- Results ordered in overlap levels, with higher precision at the top levels
- Disadvantages
- Not real best match: no discrimination at the same overlap level
- Few representations, mostly from the author
- Effects on recall not known (presumably good, though?)
- Resource demanding; hard to handle overlaps
- Hard to compare to other IR approaches
20. Best match Boomerang effect?
- Weighted references from step 2 may be used as queries in an IR system that supports weighting of query terms (sketched below)
- The result will be a list of documents, ranked by how many of the references from step 2 they contain and how heavily weighted these are
- This output may be compared to that of other best match IR systems, and the performance of the strategy compared to known approaches
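A sketch of this best match variant is given below. Weighting each reference simply by how many step-1 documents cite it is a stand-in chosen for illustration, not the weighting scheme of Larsen and Ingwersen (2002).

```python
from collections import Counter

# Sketch of the best match boomerang: step-2 references become a weighted
# query, and step-3 documents are ranked by the summed weights of the
# references they contain. The frequency-based weights are a simplification.
def boomerang_best_match(step1_docs, references_of, citation_index):
    # Step 2: weight each reference by its frequency across step-1 documents
    reference_weights = Counter(
        ref for doc in step1_docs for ref in references_of.get(doc, [])
    )
    # Step 3: score citing documents by the weights of the references they share
    scores = Counter()
    for ref, weight in reference_weights.items():
        for doc in citation_index.get(ref, []):
            scores[doc] += weight
    return scores.most_common()  # ranked (document, score) pairs
```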
21. The Boomerang effect: best match
- Weighting the references in step 2
[Formula from Larsen and Ingwersen, 2002]
- i: references
- p: the result of a query in a representation (pool)
22. Experiment with users?
- In accord with the cognitive viewpoint, the initial plan was to test the system interactively using test persons and simulated work tasks
- However, no complete prototype -> laboratory experiments
- (which turned out to be quite a good idea in the end)
23. Main experiment - INEX (http://inex.is.informatik.uni-duisburg.de/2005/)
- Test of the best match boomerang effect within the INEX2002 test collection
- 12,107 XML documents (IEEE CS)
- 23 topics
- Relevance assessments (2 relevance dimensions, graded assessments)
- Possibility of many representations (see the sketch below)
- Generated from the XML documents (TI, ABS, keywords, figure and table captions, introductions and conclusions, cited titles, citation index)
- Interpretations by other cognitive agents: descriptors and identifiers from the INSPEC database
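The sketch below shows how several of the listed representations can be derived from one XML article. The element names are placeholders, not the actual tags of the IEEE CS DTD used in INEX2002.

```python
import xml.etree.ElementTree as ET

# Sketch of deriving several representations from one XML article.
# Element names are placeholders, not the actual IEEE CS DTD tags.
def extract_representations(xml_text):
    root = ET.fromstring(xml_text)
    return {
        "title":        [e.text for e in root.iter("title") if e.text],
        "abstract":     [e.text for e in root.iter("abstract") if e.text],
        "captions":     [e.text for e in root.iter("caption") if e.text],
        "cited_titles": [e.text for e in root.iter("ref_title") if e.text],
    }

sample = ("<article><title>An IR paper</title>"
          "<abstract>About structured retrieval.</abstract></article>")
print(extract_representations(sample))
```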
24. Best match boomerang effect
[Figure: threshold DCV_step1 applied in step 1, threshold CCV_step2 applied in step 2]
25. Main experiment - variables
- The document cut-off value in step 1 (DCV_step1)
- The use of either a flat (f) or an expanded (x) citation index to extract citations for step 2
- The citation cut-off value in step 2 (CCV_step2)
- The use of either a flat (f) or an expanded (x) citation index to run the weighted citation queries against in step 3
- Flat citation index: presence (binary). Expanded citation index: frequency (non-binary) (see the sketch below)
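One plausible reading of the binary/frequency note above is sketched below: a flat index records only whether a cited item occurs in a document, an expanded index records how often. The citation strings are invented, and this is not necessarily how the thesis builds its expanded index.

```python
from collections import Counter

# Sketch of the flat vs. expanded distinction noted above: flat = binary
# presence of a cited item, expanded = its frequency in the document.
# Citation strings are invented; this is only one plausible reading.
in_text_citations = ["SMALL 1973", "GARFIELD 1955", "SMALL 1973", "SMALL 1973"]

flat     = {ref: 1 for ref in set(in_text_citations)}  # presence only
expanded = dict(Counter(in_text_citations))            # frequencies

print(flat)      # e.g. {'SMALL 1973': 1, 'GARFIELD 1955': 1}
print(expanded)  # e.g. {'SMALL 1973': 3, 'GARFIELD 1955': 1}
```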
26. Main experiment - setup
- The best match boomerang effect with thresholds in step 1 and step 2
- Two types of citation indexes: normal and expanded
- Two baselines
- A polyrepresentation baseline without the citation network
- A traditional bag-of-words baseline
- The polyrepresentation baseline and the boomerang effect were implemented in a mix of own programming and the InQuery IR system
27. Official INEX2002 results

Run                  AvgP    Rank
boomerang            0.0227  37/49
polyrepresentation   0.0271  32/49
bag-of-words         0.0618  3/49

- Boomerang effect
- step1_threshold = 500
- No step2_threshold
- Expanded citation indexes
- Discouraging results
- Bag-of-words did very well
- Polyrep. and BE did not
- but a shot from the hip -> add thresholds
Results for the generalized INEX2002 scoring function
28. Optimised results from the main experiment
- Bag-of-words still best
- but substantial improvements for BE and Polyrep.

Run                    AvgP
boomerang (H/ff/32)    0.0422
polyrepresentation     0.0419
bag-of-words           0.0606

Results for the generalized INEX2002 scoring function
29. Main experiment - results
- It did come back! The boomerang effect makes it possible to retrieve relevant documents through the network of references and citations without specifying seed documents in advance (albeit not as effectively as traditional methods; and it hit me in the back of the neck!)
- A number of factors that affect the use of references and citations in automatic indexing and retrieval systems were investigated
30. Contribution
- The boomerang effect as a method to facilitate the exploitation of references and citations in best match IR
- Automatic selection of seed documents
- Query-biased, as opposed to Google's PageRank
- Initial experiments with polyrepresentation at the unstructured end of the polyrepresentation continuum
31. Lessons learnt
- Too complex design (Keep It Simple, Stupid)
- Find focus early on
- Even experiments without the involvement of users took a long time
- Do not get too disappointed by not-too-high performing results
- Re-use and re-development is a good thing in publication
32. Reference
- References and citations in automatic indexing and retrieval systems: experiments with the boomerang effect / Birger Larsen. Copenhagen: Department of Information Studies, Royal School of Library and Information Science, 2004. xiii, 297 p. ISBN 87-7415-275-0
- Available (along with a few defence photos) at http://www.db.dk/blar/dissertation
33. Part 3: Pennant diagrams for IR
- Pennant diagrams are an idea of Howard White
- A combination of bibliometrics, information retrieval and relevance theory
- For a thorough theoretical and methodological introduction, please consult White (2007a, 2007b)
- Uses Sperber and Wilson's relevance theory, where the ratio of cognitive effects to processing effort defines the relevance of communication
- Operationalises this ratio using the tf·idf term weighting scheme from IR for terms co-occurring with a user-supplied seed document
- -> Pennant diagrams: bibliometric distributions for IR result visualisation
34. New synthesis proposed by White
- Log(tf) of terms co-occurring with a seed term
- Measures the predicted cognitive effects within the context of that seed term (system side)
- Log(idf) for the same distribution
- Measures the predicted processing effort of the terms co-occurring in that context (system side)
- tf·idf = Cognitive Effects / Processing Effort
- Where bibliometric distributions predict degrees of cognitive effect and processing effort
- User-oriented and instrumental
35. Pennant diagrams
- Ease of processing: how easy it is to see a connection between a given term and the seed term
- Values on the cognitive effects scale are determined by the judgements of citers, authors, indexers, etc.
36. Pennant diagram: an example (1/2)
- idf is derived from "Items in File"; tf from "Items Ranked"

DIALOG RANK Results (Detailed Display)
RANK S2/1-651, Field: CA, File(s): 7
(Rank fields found in 651 records -- 9545 unique terms)

RANK No.  Items in File  Items Ranked  % Items Ranked  Term
1         651            651           100.0           INGWERSEN P
2         899            238           36.6            BELKIN NJ
3         1048           209           32.1            SARACEVIC T
4         789            153           23.5            BATES MJ
5         520            142           21.8            SPINK A
6         571            141           21.7            KUHLTHAU CC
7         967            137           21.0            ELLIS D
8         901            134           20.6            DERVIN B
9         833            113           17.4            BORGMAN CL
10        2269           110           16.9            WILSON TD
37. Pennant diagram: an example (2/2)
- The Dialog subfile (SF) is used for N

Term          tf     idf    tf·idf
INGWERSEN P   3.813  3.839  14.642
BJORNEBORN L  2.716  4.761  12.931
THELWALL M    2.924  4.367  12.772
ALMIND TC     2.732  4.670  12.762
VAKKARI P     2.929  4.289  12.566
BELKIN NJ     3.376  3.699  12.491
BARILAN J     2.785  4.457  12.415
SPINK A       3.152  3.937  12.411
BORLUND P     2.623  4.718  12.378
KUHLTHAU CC   3.149  3.896  12.271

tf·idf = (1 + log(tf)) * log(N/df)
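A worked check of this weighting is sketched below. The subfile size N is not given on the slide; a value of roughly 4.5 million records reproduces the listed idf values to within rounding, so it is assumed here purely for illustration.

```python
import math

# Worked check of the slide's weighting: tf*idf = (1 + log10(tf)) * log10(N/df),
# with tf = Items Ranked and df = Items in File from the DIALOG RANK display.
# N (the Dialog subfile size) is assumed, as it is not stated on the slide.
N = 4_490_000

def tf_idf(items_ranked, items_in_file, n=N):
    tf  = 1 + math.log10(items_ranked)
    idf = math.log10(n / items_in_file)
    return tf, idf, tf * idf

print(tf_idf(651, 651))  # INGWERSEN P -> roughly (3.81, 3.84, 14.6)
print(tf_idf(238, 899))  # BELKIN NJ   -> roughly (3.38, 3.70, 12.5)
print(tf_idf(142, 520))  # SPINK A     -> roughly (3.15, 3.94, 12.4)
```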
38. Sectors of the pennant
- A: subordinate
- B: coordinate
- C: superordinate
40. Pennant diagram: cited author
41. Predictions about sectors when the seed is a co-cited author
- Co-citee is the seed's:
- junior / younger / less famous / from subspecialities
- peer / same age / equally famous / same discipline
- senior / older / more famous / from other disciplines
42. Extending pennant diagrams: plotting a Bradford distribution
- DIALOG: s ice(w)core/ti,ab followed by RANK JN
- Bradford zones are implicitly present on the x axis
- The core journals produce their greatest effects in the context of a subject term and hence are relevant to it
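A sketch of locating Bradford zones in a journal ranking like the RANK JN output follows. The journal names and article counts are invented, and the equal-thirds split is the textbook construction of Bradford zones rather than anything specific to the slide.

```python
# Sketch of splitting a ranked journal list (cf. the RANK JN output) into
# three Bradford zones, each holding roughly a third of the articles on the
# topic. Journal names and article counts are invented for illustration.
ranked_journals = [("J01", 120), ("J02", 60), ("J03", 40), ("J04", 30),
                   ("J05", 25), ("J06", 20), ("J07", 15), ("J08", 10),
                   ("J09", 8), ("J10", 7)]

total = sum(count for _, count in ranked_journals)
boundary = total / 3
zones, cumulative = [[], [], []], 0
for journal, count in ranked_journals:
    zone = min(2, int(cumulative // boundary))  # zone of the articles seen so far
    zones[zone].append(journal)
    cumulative += count

for number, members in enumerate(zones, start=1):
    print(f"Zone {number}: {members}")
```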
43. Summary points: pennant diagrams
- Most interestingly, when bibliometric data are subjected to tf·idf, plotted as pennants, and interpreted according to relevance theory, the results evoke major variables in information science:
- Topicality (intercoherence and intercohesion among texts)
- Other types of relevance, stratified by sectors
- Cognitive effects in relation to people's questions
- Levels of expertise as a precondition for cognitive effects
- Processing effort (principle of least effort)
- Specificity of terms as it affects processing effort
- Relevance as the effects/effort ratio
- Authority of texts and their authors
44. Summary points: Informetrics and IR
- Citation indexing was invented for retrieval
- It has some strong alternative IR characteristics
- It will find other documents than searching by words
- It offers many possibilities for retrieval and visualisation of results, interlinking and categorisation