Informetrics and IR - PowerPoint PPT Presentation

1
Informetrics and IR
  • Birger Larsen
  • Information Interaction and Information
    Architecture
  • Royal School of Library and Information Science
  • Copenhagen, Denmark
  • blar_at_db.dk

If I have not seen as far as others, it is
because there were giants standing on my
shoulders. --Hal Abelson
2
Outline
  • Part 1: Garfield's citation indexes
  • Part 2: Own PhD project: automatic IR by
    citations
  • Part 3: Pennant diagrams for visualising
    retrieval results

3
Part 1: Garfield's citation indexes
  • Science Citation Index
  • Social Science Citation Index
  • Arts & Humanities Citation Index
  • Based on the idea that there is a subject
    relation between a document and its references
    (cf. Law)
  • That is, if you find that a given document is
    relevant to you, then there is a fair chance that
    you will also find that some of its references in
    its bibliography (and those later citing the
    document) are relevant

See his website and essays: http://www.garfield.library.upenn.edu/essays.html
4
Garfield's citation indexes
  • NOT developed for citation counting and research
    evaluation (!), but as a retrieval tool
  • Advantages of using references as index terms
  • Time and money saved compared to intellectual
    indexing
  • Facilitates a multidisciplinary index
  • No inter-indexer inconsistency
  • References have strong semantics for researchers
    (Heisenberg, 1928; Small, 1973; Garfield,
    1955)
  • The network of references and citations supports
    new retrieval strategies intuitively appealing to
    researchers

5
Garfield's citation indexes
  • Multi-disciplinary indexes → by indexing the core
    journals of a number of fields these are covered
    quite well → Garfield's Law of Concentration
  • The tail of the Bradford curve consists mainly of
    core journals of other fields
  • One of the factors behind selection of journals
    for the citation indexes
  • High citation rates are another
  • Scientific panels advise ISI on journals to
    include or exclude

6
Garfield's citation indexes
  • Advantages
  • Multidisciplinary; indexes 9,000 journals
  • Indexes all references (also non-source
    documents), and facilitates analysis of received
    citations
  • Indexes cover-to-cover
  • All authors and their addresses are indexed for
    source documents

7
Garfield's citation indexes
Target
Source
  • Important difference between
  • Source documents
  • Have been handled by ISI indexers
  • A wealth of bibliographic data
  • Cited documents (targets)
  • Much more meagre data (CA, CY, cited vol page,
    CW)
  • Can be citations to source documents, but also
    anything else (e.g., Tolkien or the Bible)!!
  • The cited references are inverted into a citation
    index
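As a rough illustration of what "inverting" the cited references means, the sketch below builds a tiny citation index in Python. The document identifiers and the `references` dictionary are made up for the example; they are not ISI data and do not reflect ISI's actual record format.

```python
from collections import defaultdict

# Hypothetical toy data: each source document maps to the items it cites.
# Cited items need not be source documents themselves (e.g. a novel).
references = {
    "doc_A": ["Garfield 1955", "Small 1973"],
    "doc_B": ["Garfield 1955", "Tolkien 1954"],
    "doc_C": ["Small 1973"],
}

def invert(refs_by_doc):
    """Turn document -> references into reference -> citing documents."""
    index = defaultdict(set)
    for doc, refs in refs_by_doc.items():
        for ref in refs:
            index[ref].add(doc)
    return dict(index)

citation_index = invert(references)

# Look up a known item to find the documents that cite it.
print(sorted(citation_index["Garfield 1955"]))  # ['doc_A', 'doc_B']
```

The lookup in the last line is the step that a word-based index cannot offer: it starts from a cited item and moves to the later documents citing it.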

8
Other Citation Indexes
  • SCOPUS
  • Produced by Elsevier (no access from RSLIS)
  • 15,000 peer-reviewed journals, 1,000 Open Access
    journals, 500 Conference Proceedings, 600 Trade
    Publications and 125 Book Series
  • Google Scholar (http://scholar.google.com)
  • Automatically generated citation index; many
    errors
  • Research Index/CiteSeer (http://citeseer.ist.psu.edu/)
  • Automatically generated citation index
  • Good coverage of Computer Science

9
Part 2: Own PhD project: automatic IR by
citations
  • Title: References and citations in automatic
    indexing and retrieval systems: experiments with
    the boomerang effect
  • The PhD project is an amalgamation of ideas from
    Information Retrieval and informetrics/bibliometrics

10
The Boomerang Effect
  • In short, the overall objective of the project is
    to develop and test empirically a new method for
    automatic indexing of structured full-text
    documents, where references and citations are
    given special attention
  • The central hypothesis is that by working with
    several (poly-)representations derived from
    different cognitive and functional origins,
    better-performing IR systems can be constructed

11
Motivation
  • The principle of polyrepresentation hypothesises
    that overlaps between representations can be
    exploited to reduce the uncertainty inherent in
    IR (Ingwersen, 1994, 1996)
  • Emphasis is placed on representations of
    different cognitive and functional origin,
    generated over time from the users' information
    needs, the documents, and other cognitive agents
  • The ever-growing number of documents in digital
    libraries makes it possible to test this, as a
    multitude of representations may be generated
    from the full-text documents (a move away from
    bag-of-words)

12
Motivation
  • The project attempts to exploit Eugene Garfield's
    original idea behind the citation indexes, which
    was to create an alternative form of subject
    access to scientific documents (Garfield, 1955)
  • Studies from the online IR community have shown
    good performance when using citations (McCain,
    1989; Pao, 1993), and web IR is relying more and
    more on link analysis (e.g., Google,
    ResearchIndex.com)
  • A problem, however, has been that queries cannot
    be stated in natural language, but only as seed
    documents

13
Motivation
  • References and citations have shown promising
    results as representations in IR, but have never
    been integrated into the research on automatic IR
    systems
  • Scientific documents (i.e., with outgoing
    references and received citations) are for the
    first time becoming available electronically in
    large quantities
  • This offers new possibilities for combining
    conventional automatic indexing and retrieval
    techniques with the exploitation of references
    and citations

14
References and citations
  • Scientists add references to earlier work in
    their papers as part of the scientific
    communication of ideas
  • Scientists' motives for giving references can be
    analysed from both normative and social
    constructivist positions
  • A unifying explanation is provided by Small
    (1978) who regards references as concept symbols
  • an idea that is being used in the course of an
    argument
  • the references actually given function as symbols
    for a concept

15
References and citations in IR
  • Many possible retrieval strategies with citation
    indexes, but forward chaining is a unique
    capability
  • Allows you to retrieve more recent, citing
    documents
  • Forward chaining does, however, require that a
    seed document (i.e. a known, relevant document)
    is given as starting point
  • Good seed documents may be hard to obtain,
    particularly if the subject is new or not
    well known
  • The lack of available seed documents is probably
    one of the main reasons behind the rare inclusion
    of references and citations in IR research

16
The Boomerang effect
[Figure: the Boolean boomerang effect, modified from Larsen, 2002. Step 1: documents → (references) → Step 2: references → (citations) → Step 3: documents. OL = Overlap Level]
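The three steps can be sketched in a few lines of Python. This is only one possible reading of the diagram, not the implementation from Larsen (2002): the pools, document identifiers and reference lists are invented, and the overlap level of a reference is assumed here to be the number of step 1 pools whose documents cite it.

```python
from collections import Counter, defaultdict

# Hypothetical step 1 output: documents retrieved by the same query
# in three different representations (pools).
pools = {
    "title": {"d1"},
    "abstract": {"d1", "d2"},
    "descriptors": {"d2", "d3"},
}
# Hypothetical reference lists (outgoing references) of all documents.
refs_of = {
    "d1": {"r1", "r2"},
    "d2": {"r1", "r3"},
    "d3": {"r1"},
    "d4": {"r2", "r4"},
    "d5": {"r3"},
    "d6": {"r4"},
}

def overlap_levels(pools, refs_of):
    """Step 2 (assumed): count in how many pools each reference occurs."""
    levels = Counter()
    for docs in pools.values():
        refs_in_pool = set().union(*(refs_of.get(d, set()) for d in docs))
        levels.update(refs_in_pool)
    return levels

def citing_documents(levels, refs_of):
    """Step 3 (assumed): group documents by the highest overlap level
    found among the references they cite."""
    by_level = defaultdict(set)
    for doc, refs in refs_of.items():
        best = max((levels[r] for r in refs if r in levels), default=0)
        if best:
            by_level[best].add(doc)
    return dict(by_level)

levels = overlap_levels(pools, refs_of)
result = citing_documents(levels, refs_of)
for level in sorted(result, reverse=True):
    print(level, sorted(result[level]))
```

In this toy run d4 and d5 come back in step 3 although no step 1 query retrieved them, which is the "boomerang" movement: out through the references, back through the citations.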
17
Pre-experiment
  • Pre-experiment with records from SCI and 3 work
    tasks with 100 documents per work task assessed
    by a researcher
  • Good results: as expected, precision was better at
    higher OLs and consistently lower at lower OLs,
    in all work tasks and step by step
  • Described in Larsen (2002) and is the background
    for the best match version proposed in Larsen and
    Ingwersen (2002), and tested experimentally in
    the main experiment

18
Pre-experiment
  • Example results

Sample A: 135 documents from step 1 also
retrieved in step 3
Sample B: 143 additional documents retrieved in
step 3
19
Pre-experiment
  • Advantages
  • A lot of documents retrieved (264 in step 1 →
    4774 in step 3)
  • Results ordered in overlap levels with higher
    precision at the top levels
  • Disadvantages
  • Not real best match, no discrimination at same
    overlap level
  • Few representations, mostly from the author
  • Effects on recall not known (presumably good
    though?)
  • Resource demanding, hard to handle overlaps
  • Hard to compare to other IR approaches

20
Best match Boomerang effect?
  • Weighted references from step 2 may be used as
    queries in an IR system that supports weighting
    of query terms
  • The results will be a list of documents, ranked
    by how many of the references from step 2 they
    contain and how heavily weighted these are
  • This output may be compared to that of other best
    match IR systems, and the performance of the
    strategy be compared to known approaches
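A minimal sketch of the ranking idea, assuming the simplest possible scoring: each document scores the sum of the weights of the query references it cites. The weights and documents below are placeholders; the actual weighting scheme of Larsen and Ingwersen (2002) is not reproduced here, and a real best match system such as InQuery would score differently.

```python
# Hypothetical weighted query: references from step 2 with made-up weights.
query_weights = {"r1": 3.0, "r2": 2.0, "r3": 1.0}

# Hypothetical reference lists of the documents in the collection.
refs_of = {
    "d1": {"r1", "r2"},
    "d2": {"r1"},
    "d3": {"r2", "r3"},
    "d4": {"r9"},
}

def rank_by_weighted_references(query_weights, refs_of):
    """Rank documents by the summed weight of the query references they cite."""
    scores = {}
    for doc, refs in refs_of.items():
        score = sum(w for ref, w in query_weights.items() if ref in refs)
        if score > 0:
            scores[doc] = score
    # Highest score first; ties are broken by document id for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

ranking = rank_by_weighted_references(query_weights, refs_of)
print(ranking)  # [('d1', 5.0), ('d2', 3.0), ('d3', 3.0)]
```

Unlike the Boolean version, documents with the same number of matching references can now be told apart by how heavily those references are weighted.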

21
The Boomerang effect
Best match
  • Weighting the references in step 2

From Larsen and Ingwersen, 2002
  • i: References
  • p: The result of a query in a representation
    (pool)

22
Experiment with users?
  • In accord with the cognitive viewpoint the
    initial plan was to test the system interactively
    using test persons and simulated work tasks
  • However, no complete prototype → laboratory
    experiments
  • (which turned out to be quite a good idea in the
    end)

23
Main experiment - INEX
(http://inex.is.informatik.uni-duisburg.de/2005/)
  • Test of the best match boomerang effect within
    the INEX2002 test collection
  • 12,107 XML documents (IEEE CS)
  • 23 topics
  • Relevance assessments (2 relevance dimensions,
    graded assessments)
  • Possibility of many representations
  • Generated from the XML documents (TI, ABS,
    keywords, figure and table captions,
    introductions and conclusions, cited titles,
    citation index)
  • Interpretations by other cognitive agents:
    descriptors and identifiers from the INSPEC
    database

24
Best match boomerang effect
Threshold: DCV_step1
Threshold: CCV_step2
25
Main experiment - variables
  1. The document cut-off value in step 1 (DCV_step1),
  2. The use of either a flat (f) or an expanded (x)
    citation index to extract citations for step 2,
  3. The citation cut-off value in step 2 (CCV_step2),
  4. The use of either a flat (f) or an expanded (x)
    citation index to run the weighted citation
    queries against in step 3.

Flat citation index: presence (binary)
Expanded citation index: frequency (non-binary)
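The difference between the two index types can be shown on a single document. The sketch assumes that "frequency" means how often a reference is mentioned in the running text; the list of citation markers is invented for the example.

```python
from collections import Counter

# Hypothetical document: its in-text citation markers in reading order.
in_text_citations = ["r1", "r2", "r1", "r1", "r3"]

# Flat index entry: a reference is either present or not.
flat = {ref: 1 for ref in set(in_text_citations)}

# Expanded index entry: how many times each reference is mentioned
# (an assumed interpretation of "frequency").
expanded = Counter(in_text_citations)

print(sorted(flat.items()))      # [('r1', 1), ('r2', 1), ('r3', 1)]
print(sorted(expanded.items()))  # [('r1', 3), ('r2', 1), ('r3', 1)]
```

With the expanded index a best match system can treat r1 as more central to this document than r2 or r3; with the flat index all three count the same.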
26
Main experiment - setup
  • The best match boomerang effect with thresholds
    in step 1 and step 2
  • Two types of citation indexes: normal and expanded
  • Two baselines
  • A polyrepresentation baseline without the
    citation network
  • A traditional bag-of-words baseline
  • The polyrepresentation baseline and the boomerang
    effect were implemented in a mix of own
    programming and the InQuery IR system

27
Official INEX2002 results
Run                 AvgP    Rank
boomerang           0.0227  37/49
polyrepresentation  0.0271  32/49
bag-of-words        0.0618  3/49
  • Boomerang effect
  • step1_threshold 500
  • No step2_threshold
  • Expanded citation indexes
  • Discouraging results
  • Bag-of-words did very well
  • Polyrep. and BE did not
  • but a shot from the hip
  • → Add thresholds

Results for the generalized INEX2002 scoring
function
28
Optimised results from the main experiment
  • bag-of-words still best
  • but substantial improvements for BE and Polyrep.

Run                  AvgP
boomerang (H/ff/32)  0.0422
polyrepresentation   0.0419
bag-of-words         0.0606
Results for the generalized INEX2002 scoring
function
29
Main experiment - results
  • It did come back! → The boomerang effect makes it
    possible to retrieve relevant documents through
    the network of references and citations without
    the specification of seed documents in advance
    (albeit not as effectively as traditional
    methods, and it hit me in the back of the neck!)
  • A number of factors were investigated that affect
    the use of references and citations in automatic
    indexing and retrieval systems

30
Contribution
  • The boomerang effect as a method to facilitate
    the exploitation of references and citations in
    best match IR
  • Automatic selection of seed documents
  • Query-biased as opposed to Google's PageRank
  • Initial experiments with polyrepresentation at
    the unstructured end of the polyrepresentation
    continuum

31
Lessons learnt
  • Too complex design (Keep It Simple, Stupid)
  • Find focus early on
  • Even experiments without the involvement of users
    took a long time
  • Do not get too disappointed by results that do not
    perform too well
  • Re-use and re-development is a good thing in
    publication

32
Reference
  • References and citations in automatic indexing
    and retrieval systems: experiments with the
    boomerang effect / Birger Larsen. Copenhagen:
    Department of Information Studies, Royal School
    of Library and Information Science, 2004. xiii,
    297 p. ISBN 87-7415-275-0
  • Available (along with a few defence photos)
    at http://www.db.dk/blar/dissertation

33
Part 3: Pennant diagrams for IR
  • Pennant diagrams are an idea of Howard White
  • A combination of bibliometrics, information
    retrieval and relevance theory
  • For a thorough theoretical and methodical
    introduction please consult White, H. (2007a,
    2007b)
  • Uses Sperber and Wilson's relevance theory, where
    the ratio of Cognitive Effects / Processing
    Effort defines the relevance of communication
  • Operationalises this ratio using the tf*idf
    term weighting scheme from IR for terms
    co-occurring with a user-supplied seed document
  • → Pennant diagrams: bibliometric distributions for
    IR result visualisation

34
New synthesis proposed by White
  • Log(tf) of terms co-occurring with a seed term
  • Measures the predicted cognitive effects within
    the context of that seed term (system side)
  • Log(idf) for the same distribution
  • Measures the predicted processing effort of the
    terms co-occurring in that context (system side)
  • tf*idf: Cognitive Effects / Processing Effort
  • When bibliometric distributions predict degrees
    of cognitive effect and processing effort
  • User-oriented and instrumental

35
Pennant diagrams
Ease of processing: how easy it is to see a
connection between a given term and the seed term
Values on the cognitive effects scale are
determined by the judgements of citers, authors,
indexers etc.
36
Pennant diagram: An example (1/2)
DIALOG RANK Results (Detailed Display)
RANK: S2/1-651   Field: CA   File(s): 7
(Rank fields found in 651 records -- 9545 unique terms)

RANK No.  Items in File  Items Ranked  % Items Ranked  Term
1         651            651           100.0           INGWERSEN P
2         899            238           36.6            BELKIN NJ
3         1048           209           32.1            SARACEVIC T
4         789            153           23.5            BATES MJ
5         520            142           21.8            SPINK A
6         571            141           21.7            KUHLTHAU CC
7         967            137           21.0            ELLIS D
8         901            134           20.6            DERVIN B
9         833            113           17.4            BORGMAN CL
10        2269           110           16.9            WILSON TD

idf is derived from Items in File; tf from Items Ranked
37
Pennant diagram: An example (2/2)
Dialog subfile (SF) is used for N

Term          tf     idf    tfidf
INGWERSEN P   3.813  3.839  14.642
BJORNEBORN L  2.716  4.761  12.931
THELWALL M    2.924  4.367  12.772
ALMIND TC     2.732  4.670  12.762
VAKKARI P     2.929  4.289  12.566
BELKIN NJ     3.376  3.699  12.491
BARILAN J     2.785  4.457  12.415
SPINK A       3.152  3.937  12.411
BORLUND P     2.623  4.718  12.378
KUHLTHAU CC   3.149  3.896  12.271

tfidf = (1 + log(tf)) × log(N/df)
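The two example slides can be connected with a short calculation. The sketch reads the slide's formula as tfidf = (1 + log(tf)) × log(N/df) with base-10 logarithms, takes tf from "Items Ranked" and df from "Items in File", and uses a value for N that is an assumption: the slides only say that N comes from the Dialog subfile, so the figure below is back-calculated from the idf column and should not be taken as the real file size.

```python
import math

# Counts from the DIALOG RANK display: (items in file = df, items ranked = tf).
counts = {
    "INGWERSEN P": (651, 651),
    "BELKIN NJ": (899, 238),
    "SPINK A": (520, 142),
    "KUHLTHAU CC": (571, 141),
}

# ASSUMPTION: N is not given in the slides. This value (about 4.5 million
# records) is back-calculated from the idf column, not taken from Dialog.
N = 4_498_600

def pennant_coordinates(df, tf, n=N):
    """Return (tf weight, idf weight, tf*idf) under the assumed formula."""
    tf_weight = 1 + math.log10(tf)
    idf_weight = math.log10(n / df)
    return tf_weight, idf_weight, tf_weight * idf_weight

for name, (df, tf) in counts.items():
    t, i, ti = pennant_coordinates(df, tf)
    print(f"{name:<12} {t:.3f} {i:.3f} {ti:.3f}")
```

Under these assumptions the computed values agree with the tf, idf and tfidf columns of the table to within about 0.01 for the four authors that appear on both slides. The tf and idf weights are the two quantities a pennant diagram plots against each other.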
38
Sectors of the pennant
A: subordinate
B: coordinate
C: superordinate
40
Pennant diagram: Cited author
41
Predictions about sectors when seed is a co-cited
author
Cocitee is seed's:
Junior / younger / less famous / subspecialities
Peer / same age / equally famous / discipline
Senior / older / more famous / other disciplines
42
Extending pennant diagrams: Plotting a Bradford
distribution
s ice(w)core/ti,ab
RANK JN
Bradford zones are implicitly present on the x
axis. The core journals produce their greatest
effects in the context of a subject term and
hence are relevant to it
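For readers unfamiliar with Bradford zones, here is a minimal sketch of how ranked journal counts (such as the output of a RANK JN command) could be split into zones. It uses the textbook reading of Bradford's law, in which each zone holds roughly the same number of articles; the slides do not specify a zoning procedure, and the journal names and counts are invented.

```python
# Hypothetical journal counts for one subject search (not real RANK output).
journal_counts = {
    "J1": 30, "J2": 15, "J3": 15,
    "J4": 6, "J5": 6, "J6": 6, "J7": 6, "J8": 6,
}

def bradford_zones(journal_counts, zones=3):
    """Split journals, ranked by productivity, into zones that each hold
    roughly an equal share of the articles (assumed zoning rule)."""
    result = [[] for _ in range(zones)]
    total = sum(journal_counts.values())
    if total == 0:
        return result
    ranked = sorted(journal_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    cumulative = 0
    for journal, count in ranked:
        # The zone is decided by the articles accumulated before this journal.
        zone = min(zones - 1, cumulative * zones // total)
        result[zone].append(journal)
        cumulative += count
    return result

core, middle, tail = bradford_zones(journal_counts)
print(core, middle, tail)
```

In the toy data one journal forms the core, two the middle zone and five the tail, each zone contributing 30 articles. Since the counts are the tf values, the zones fall along the same axis on which the pennant plots term frequency.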
43
Summary points, pennant diagrams
  • Most interestingly, when bibliometric data are
    subjected to tfidf, plotted as pennants, and
    interpreted according to relevance theory, the
    results evoke major variables in information
    science
  • Topicality (intercoherence and intercohesion
    among texts)
  • Other types of relevance stratified by sectors
  • Cognitive effects in relation to people's
    questions
  • Levels of expertise as a precondition for
    cognitive effects
  • Processing effort (principle of least effort)
  • Specificity of terms as it affects processing
    effort
  • Relevance as the effects/effort ratio
  • Authority of texts and their authors

44
Summary points, Informetrics and IR
  • Citation indexing was invented for retrieval
  • Has some strong alternative IR characteristics
  • Will find other documents than searching by words
  • Offers many possibilities for retrieval and
    visualisation of results, interlinking and
    categorisation

45
  • Thank you!