Organizing current awareness in a large volunteer-based digital library


1
Organizing current awareness in a large
volunteer-based digital library
  • Thomas Krichel
  • 2006-02-27

2
outline
  • Background to work that we did
  • RePEc (Research Papers in Economics)
  • NEP: New Economics Papers
  • The research
  • Theory
  • Method
  • Results
  • Other work done for NEP.

3
This talk has three parts
  • Some background
  • Two papers
  • chablis paper, with Nisa Bakkalbasi (Yale)
  • http://openlib.org/home/krichel/papers/chablis.pdf
  • shibuya paper
  • http://openlib.org/home/krichel/shibuya.pdf

4
RePEc
  • Digital library for academic Economics. It
    collects descriptions of
  • economics documents (working papers, articles
    etc)
  • collections of those documents
  • economists
  • collections of economists

5
RePEc principle
  • Many archives
  • Archives offer metadata about digital objects, or
    data about authors and institutions.
  • One database
  • Many services
  • Users can access the data through many
    interfaces.
  • Providers of archives offer their data to all
    interfaces at the same time. This provides for an
    optimal distribution.

6
it's the incentives, stupid
  • RePEc applies the ideas of open source to the
    construction of a bibliographic dataset. It
    provides an open library.
  • The entire system is constructed in such a way as
    to be sustainable without monetary exchange
    between participants.

7
some history
  • Thomas Krichel in the early 1990s dreamed about a
    current awareness service for working papers. It
    would later have electronic papers.
  • In 1993 he made the first economics working paper
    available online.
  • In 1997 he wrote the key protocols that govern
    RePEc.

8
RePEc is based on 550 archives
  • WoPEc
  • EconWPA
  • DEGREE
  • S-WoPEc
  • NBER
  • CEPR
  • Elsevier
  • US Fed in Print
  • IMF
  • OECD
  • MIT
  • University of Surrey
  • CO PAH
  • Blackwell

9
to form a 362k item dataset
  • 171,000 working papers
  • 187,000 journal articles
  • 1,300 software components
  • 2,100 book and chapter listings
  • 9,000 author contact and publication listings
  • 9,300 institutional contact listings
  • more records than
    arXiv.org

10
RePEc is used in many services
  • EconPapers
  • NEP: New Economics Papers
  • Inomics
  • RePEc author service
  • Z39.50 service by the DEGREE partners
  • IDEAS
  • RuPEc
  • EDIRC
  • LogEc
  • CitEc

11
NEP: New Economics Papers
  • This is a set of current awareness reports on new
    additions to the working paper stock only.
    Journal articles would be too old.
  • Founded by Thomas Krichel in 1998.
  • Supported by the Economics department at WUStL.
  • Initial software was written by Jose Manuel
    Barrueco Cruz.
  • First general editor was John S. Irons.

12
why NEP
  • Public aim: Current awareness, if well done, can
    be an important service in its own right. It is
    sheltered from the competition of general search
    engines.
  • Private aim: It is useful to have some
    classification information, even if it is limited.
  • for performance measures
  • for general research purposes

13
modus operandi: stage 1
  • The general editor uses a computer program that
    gathers all the new additions to the working
    paper stock. This is usually done weekly.
  • S/he filters out new descriptions of old papers, based on
  • date field
  • handle heuristics
  • The result is an issue of the nep-all report.

14
modus operandi: stage 2
  • Editors consider the papers in the nep-all report
    to pick out the papers that belong to their subject.
    This forms an issue of a subject report nep-???.
  • nep-all and the subject reports are circulated
    via email.
  • A special arrangement makes the data of NEP
    available to other RePEc services.

15
some numbers
  • There are now 60 NEP lists.
  • Over 39k subscriptions.
  • Over 16k subscribers.
  • Over 50k papers announced.
  • Over 100k announcements.
  • Homepage at http://nep.repec.org
  • All this is a fantastic
    success!!

16
problem with the private aim
  • We would have to have all the papers
    classified, not only the working papers.
  • We would need to have 100% coverage of NEP.
  • This means every paper in nep-all appears in at
    least one subject report.

17
coverage ratio
  • We call the coverage ratio the share of papers
    in nep-all that have been announced in at least
    one subject report.
  • We can define this ratio
  • for each nep-all issue
  • for a subset of nep-all issues
  • for NEP as a whole
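The definition above can be sketched in a few lines of Python. This is only an illustration: the paper handles and report contents are invented, and the function name `coverage_ratio` is not taken from any NEP software.

```python
def coverage_ratio(nep_all_issue, subject_reports):
    """Share of papers in a nep-all issue announced in at least one subject report."""
    papers = set(nep_all_issue)
    if not papers:
        raise ValueError("the nep-all issue is empty")
    # Union of all papers announced by any subject report.
    announced = set().union(*subject_reports)
    return len(papers & announced) / len(papers)


# Hypothetical handles, invented for illustration only.
issue = ["p1", "p2", "p3", "p4", "p5"]
reports = [{"p1", "p2"}, {"p2", "p4"}]
ratio = coverage_ratio(issue, reports)
print(f"coverage ratio: {ratio:.0%}")
```

The same function works for a subset of issues or for NEP as a whole if the papers of several nep-all issues are pooled before the call.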

18
coverage ratio: theory and evidence
  • Over time more and more NEP reports have been
    added. As this happens, we expect the coverage
    ratio to increase.
  • However, the evidence, from research by Barrueco
    Cruz, Krichel and Trinidad, is:
  • The coverage ratio of different nep-all issues
    varies a great deal.
  • Overall, it remains at around 70%.
  • We need some theory as to why. This is where the
    chablis paper comes in.

19
two theories
  • Target-size theory
  • Quality theory
  • descriptive quality
  • substantive quality

20
theory 1: target size theory
  • When editors compose a report issue, they have a
    size of the issue in mind.
  • If the nep-all issue is large, editors will take
    a narrow interpretation of the report subject.
  • If the nep-all issue is small, editors will take
    a wide interpretation of the report subject.

21
target size theory and static coverage
  • There are two things going on
  • The opening of new subject reports improves the
    coverage ratio.
  • The expansion of RePEc implies that the size of
    nep-all, though varying in the short-run, grows
    in the long run. Target size theory implies that
    the coverage ratio deteriorates.
  • The static coverage ratio that we observe is the
    result of both effects canceling out.

22
theory 2: quality theory
  • George W. Bush version of quality theory
  • Some papers are rubbish. They will not get
    announced.
  • The amount of rubbish in RePEc remains constant.
  • This implies constant coverage.
  • Reality is slightly more subtle.

23
two versions of quality theory
  • Descriptive quality theory: papers that are badly
    described
  • misleading titles
  • no abstract
  • languages other than English
  • Substantive quality theory: papers that are well
    described, but not good
  • from unknown authors
  • issued by institutions with unenviable research
    reputation

24
practical importance
  • We do care whether one or the other theory is
    true.
  • Target size theory implies that NEP should open
    more reports to achieve perfect coverage.
  • Quality theory suggests that opening more reports
    will have little to no impact on coverage.
  • Since operating more reports is costly, there
    should be an optimal number of reports.

25
overall model
  • We need an overall model that explains subject
    editors' behavior.
  • We can feed this model with variables that
    represent theoretical determinants of behavior.
  • We can then assess the strength of various
    factors empirically.

26
method
  • The dependent variable is announced. It is 1 if
    the paper has been announced, 0 otherwise.
  • Since we are explaining a binary variable, we can
    use binary logistic regression analysis (BLRA).
    This is a fairly flexible technique, useful when
    the probability distributions governing the
    independent variables are not well known.
  • That's why BLRA is popular in the life sciences.

27
independent variables: size
  • size is the size of the nep-all issue in which
    the paper appeared.
  • This is the critical indicator of target size
    theory. We expect it to have a negative impact on
    announced.

28
independent variables: position
  • position is the position of the paper in the
    nep-all issue.
  • The presence of this variable can be justified by
    the combined assumption of target size and editor
    myopia.
  • If editors are myopic, they will be more liberal
    at the start of nep-all than at the end of
    nep-all.

29
independent variables: title
  • title is the length of the title of the paper,
    measured by the number of characters.
  • This variable is motivated by descriptive quality
    theory. A longer title will say more about the
    paper than a short title. This makes it less
    likely that a paper is overlooked.

30
independent variables: abstract
  • abstract is the presence/absence of an abstract
    to the paper.
  • This is also motivated by descriptive quality
    theory.
  • Note that we do not use the length of the
    abstract because that would be a highly skewed
    variable.

31
independent variables: language
  • language is an indicator of whether the
    metadata is in English or not.
  • This variable is motivated by descriptive quality
    theory and the idea that English is the most
    commonly understood language.
  • While there are a lot of multilingual editors,
    customizing this variable would have been rather
    hard.

32
independent variables: series
  • series is the size of the series in which a paper
    appears.
  • This variable is motivated by substantive quality
    theory.
  • The larger a series is, the higher, usually, is
    its reputation. We can roughly classify series by size
    and quality:
  • multi-institution series (NBER, CEPR)
  • large departments
  • small departments

33
independent variables: author
  • author is the prolificacy of the authors of the
    paper.
  • It is justified by substantive quality theory.
  • This is the most difficult variable to measure.
    We use the number of papers written by the
    registered author with the highest number.
  • Since about 50% of the papers have no registered
    author, a lot of them are excluded. But the
    exclusion should introduce no bias.

34
create categorical variables
  • size_1: [179, 326)
  • size_2: [326, 835]
  • title_1: [55, 77)
  • title_2: [77, 1945]
  • position_1: [0.357, 0.704)
  • position_2: [0.704, 1.000]
  • series_1: [98, 231)
  • series_2: [231, 3654]
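A minimal sketch of how such indicator variables could be derived from a raw value. The interval bounds are the ones on the slide, read as half-open and closed intervals. Treating values below the first bound as the reference category (both indicators 0) is an assumption, as are the names `BINS` and `dummies`.

```python
# Interval bounds from the slide: name_1 = [low, mid), name_2 = [mid, high].
# Values below `low` fall into the reference category (assumption).
BINS = {
    "size": (179, 326, 835),
    "title": (55, 77, 1945),
    "position": (0.357, 0.704, 1.000),
    "series": (98, 231, 3654),
}


def dummies(name, value):
    """Return the pair (name_1, name_2) of 0/1 indicators for one observation."""
    low, mid, high = BINS[name]
    first = int(low <= value < mid)
    second = int(mid <= value <= high)
    return first, second


print(dummies("size", 200))   # falls in [179, 326)
print(dummies("size", 500))   # falls in [326, 835]
print(dummies("size", 100))   # reference category
```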

35
results
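  • $P(\mathit{announced} = 1 \mid x) = \dfrac{\exp(g(x))}{1 + \exp(g(x))}$
  • $g(x) = 0.2401 - 0.2774\,\mathit{size}_1 - 0.4657\,\mathit{size}_2
    + 0.1512\,\mathit{title}_1 + 0.2469\,\mathit{title}_2 + 0.3874\,\mathit{abstract}
    + 0.0001\,\mathit{author} + 0.7667\,\mathit{language}
    - 0.1159\,\mathit{series}_1 + 0.1958\,\mathit{series}_2$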
  • position is not significant. author just makes
    the cut.
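To make the fitted model concrete, the following sketch evaluates the announcement probability for one paper. The coefficients are those printed on the slide; terms that show no sign there are assumed to enter with a plus. The example paper is invented.

```python
import math

# Coefficients as printed on the slide. Where the slide shows no sign,
# a "+" is assumed; only the minus signs are explicit in the source.
INTERCEPT = 0.2401
COEF = {
    "size_1": -0.2774, "size_2": -0.4657,
    "title_1": 0.1512, "title_2": 0.2469,
    "abstract": 0.3874, "author": 0.0001, "language": 0.7667,
    "series_1": -0.1159, "series_2": 0.1958,
}


def g(x):
    """Linear predictor; variables missing from x count as 0."""
    return INTERCEPT + sum(COEF[name] * value for name, value in x.items())


def p_announced(x):
    """Logistic transform of the linear predictor."""
    e = math.exp(g(x))
    return e / (1.0 + e)


# Hypothetical paper: large nep-all issue, medium title, abstract present,
# most prolific registered author has 10 papers, English metadata, large series.
paper = {"size_2": 1, "title_1": 1, "abstract": 1,
         "author": 10, "language": 1, "series_2": 1}
no_abstract = dict(paper, abstract=0)

print(round(p_announced(paper), 3), round(p_announced(no_abstract), 3))
```

Dropping the abstract lowers the predicted probability, which is the direction descriptive quality theory suggests.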

36
odds ratio
  • size_1: 1.32 (1.22, 1.44)
  • size_2: 0.83 (0.76, 0.90)
  • title_1: 1.16 (1.07, 1.26)
  • title_2: 1.28 (1.18, 1.39)
  • abstract: 1.47 (1.34, 1.62)
  • language: 2.15 (1.85, 2.51)
  • series_1: 1.11 (1.02, 1.20)
  • series_2: 1.37 (1.26, 1.49)
  • author: 1.05 (1.01, 1.09)

37
scandal!
  • Substantive quality theory cannot be rejected.
    That means that the editors are selecting for
    quality as well as for the subject.
  • The editors have rejected our findings. Almost
    all protest that there is no quality filtering.
  • This is where the chablis paper ends.

38
consequences
  • There has been no program to expand the list.
  • There has to be a concentrated effort to help
    editors to find subject specific papers.
  • More effort needs to be made for editors to
    really find the subject-specific papers. This can
    be done by
  • the use of a more efficient interface
  • the use of automated resource discovery methods.

39
ernad
  • ernad stands for editing reports on new academic
    documents. It is a purpose-built software system
    for current awareness reports.
  • It has been designed by Thomas Krichel,
    http://openlib.org/home/krichel/work/altai.html.
    The design is complicated, but the system is quite
    easy to use.
  • The system was written by Roman D. Shapiro.

40
statistical learning
  • The idea is that a computer may be able to make
    decisions on the current nep-all reports based on
    the observation of earlier editorial decisions.
  • ernad now works using support vector machines
    (SVM), with titles, abstracts, author name,
    classification values and series as features.

41
SVM performance
  • If we use average search length, we can do
    performance evaluations.
  • It turns out that reports have very different
    forecastability. Some are almost perfect, others
    are weak.
  • Again, this raises a few eyebrows!

42
what is the value of an editor?
  • If the forecast is perfect, we don't need the
    editor.
  • If the forecast is very weak the editor may be a
    prankster.

43
pre-sorting reconceived
  • We should not think of pre-sorting via SVM as
    something to replace the editor.
  • We should not think about it encouraging editors
    to be lazy.
  • Instead, we should think of it as an invitation to
    examine some papers more closely than others.

44
headline vs. bottomline data
  • The editors really have a three-stage decision
    process.
  • They read the title and author names.
  • They read the abstract.
  • They read the full text.
  • A lot of papers fail at the first hurdle.
  • SVM can read the abstract and prioritize papers
    for abstract reading.
  • Editors are happy with the pre-sorting system.

45
performance evaluation
  • This is really where the shibuya paper starts.
  • How should the success or failure of a sorting
    algorithm be quantified?
  • Classic information retrieval suggests precision
    and recall.

46
precision and recall
  • precision is the number of retrieved and relevant
    documents divided by the number of retrieved
    documents.
  • recall is the number of retrieved and relevant
    documents divided by the number of relevant
    documents.
  • Both numbers are used together but recall is
    often difficult to measure.
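The two definitions translate directly into set operations. A minimal sketch, with invented document identifiers:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 relevant, 2 in common.
precision, recall = precision_recall({"a", "b", "c", "d"}, {"b", "d", "e"})
print(precision, recall)
```

Recall needs the full set of relevant documents, which is why it is hard to measure when the collection is too large to inspect.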

47
precision and recall problem
  • Precision and recall really apply to "large" IR
    problems, where the set of documents is too large
    to be examined "by hand". Users only see the set
    of retrieved papers.
  • Here we have a "small" information retrieval
    problem.

48
PR interpretation 1
  • We can argue that when we sort nep-all, recall is
    always constant at 100%.
  • Precision is the number of relevant papers in the
    issue, divided by the size of nep-all. This does
    not depend on the sorting process.
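A small sketch of this argument: if the whole sorted nep-all issue counts as the retrieved set, both numbers are fixed by the issue and not by the order. The 0/1 vector below is invented.

```python
import random


def precision_recall_full_list(x):
    """x is a 0/1 relevance vector over the whole sorted nep-all issue."""
    relevant = sum(x)
    if relevant == 0:
        raise ValueError("no relevant paper in the issue")
    retrieved = len(x)        # the editor is handed every paper
    hits = relevant           # so every relevant paper is retrieved
    return hits / retrieved, hits / relevant


x = [1, 0, 1, 0, 0]
shuffled = x[:]
random.shuffle(shuffled)      # any other sorting of the same issue

print(precision_recall_full_list(x))
print(precision_recall_full_list(shuffled))   # identical to the line above
```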

49
PR interpretation 2
  • We can look at the precision achieved at the last
    retrieved paper. This is a measure that is
    equivalent to one measure I will present later,
    that essentially looks at how low the last paper
    has fallen.
  • But recall is still useless.
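Precision at the last relevant paper can be sketched as follows. The only information it uses is the number of relevant papers and the position of the last one. The function name is illustrative.

```python
def precision_at_last_relevant(x):
    """Relevant papers divided by the position of the last relevant paper."""
    positions = [i for i, xi in enumerate(x, start=1) if xi]
    if not positions:
        raise ValueError("no relevant paper in the issue")
    return len(positions) / positions[-1]


good = [1, 0, 1, 0, 0]   # last relevant paper at position 3
bad = [0, 0, 1, 0, 1]    # last relevant paper at position 5
print(precision_at_last_relevant(good), precision_at_last_relevant(bad))
```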

50
PR interpretation 3
  • We could reduce the vector coming out of the sorting
    process to a set. We can then compare
  • set of predicted useful documents
  • set of actual used documents
  • But this would mean deliberately throwing away
    information.
  • And under this criterion different orders, which
    should differ widely for editors, can get the
    same evaluation.

51
we need some different theory!
  • We will look at some simple theory of editor
    behavior.
  • This theory is a bit like an economic theory in
    the sense that it has been made under
    ridiculously simplifying assumptions.
  • The hope is that the theory sheds light on
    basic features of the problem that remain
    operational under more realistic assumptions.

52
key assumption 1: binary decision
  • An editor faces a list of documents. Each
    document describes a working paper that has been
    added to RePEc recently. The editor examines the
    document.
  • An editor may spend a varying amount of effort
    examining a document. This would be a very
    complex decision to model. We assume it away.
  • Thus we assume a document is examined or not.

53
key assumption 2: no learning
  • The decision whether a document is relevant or
    not is assumed to only depend on the contents of
    that document.
  • It is assumed not to depend on the contents of
    any other document.
  • This assumption assumes away learning.

54
introducing cost-based reasoning
  • Editors face an optimal stopping problem.
  • There are two types of costs that editors are
    facing.
  • the cost of examining a new paper c_1. We can
    safely assume that c_1 is constant.
  • the cost associated with losing papers c_2. It
    will depend on the number of papers lost. If
    $c_2 > 0$, it will be unknown.

55
c_1 and c_2
  • c_1 and c_2 seem to dictate editor behavior
  • If $c_1 \gg c_2$ the editor will not examine any
    documents.
  • If $c_1 \ll c_2$ the editor will examine all
    documents.
  • Let us assume that the editor is conscientious.
    That is, c_1 and c_2 are such that, while there
    is a chance that there are some more relevant
    documents left, the editor will continue to
    examine the list.

56
the traffic light
  • We still have a complicated problem. Only a
    totally unrealistic assumption can save us.
  • Basically, let us assume that there is no
    uncertainty about c_2. This is the traffic light
    assumption:
  • A traffic light shows green as long as there are
    more relevant documents to be discovered.
  • The traffic light shows red

57
conscientious editor and traffic light
  • Under the traffic light scenario the
    conscientious editor will examine papers until
    the light shows red.
  • Therefore
  • $c_2 = 0$
  • examination cost is $c_1 i$, where i is the
    position of the last relevant paper in x.

58
what have we learned?
  • When presented with a series of outcomes, the
    editor will prefer the one where the last
    position of a relevant document is lower.
  • This defines a weak ordering over all outcomes.
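A sketch of the cost comparison behind this weak ordering, assuming each examined paper costs the constant c_1 and the editor stops right after the last relevant paper. Outcomes are 0/1 vectors, and the function name is illustrative.

```python
def examination_cost(x, c1=1.0):
    """Cost for a conscientious editor who stops when the light turns red."""
    # Position (1-based) of the last relevant paper; 0 if there is none.
    last = max((i for i, xi in enumerate(x, start=1) if xi), default=0)
    return c1 * last


a = [1, 0, 1, 0, 0]
b = [0, 1, 1, 0, 0]
c = [0, 0, 0, 1, 1]

# a and b tie (both stop at position 3), which is why the ordering is
# only weak; c is worse because the editor must read all five papers.
print(examination_cost(a), examination_cost(b), examination_cost(c))
```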

59
relaxing the traffic light
  • Assume that there could be some uncertainty about
    the traffic light at the end of the examination
    process.
  • Assume that it is so small that the behavior of
    the editor would be unchanged.
  • Contrast
  • ranking A: 10100
  • ranking B: 01100
  • Then A should be preferred over B.

60
the natural order
  • Repeating the previous argument, we can find a
    full ordering over all outcomes that a rational
    and conscientious editor will have.
  • I am sure the optimality of that order could be
    confirmed for more general scenarios.
  • But that is a matter of conviction.

61
notation
  • We consider a nep-all report that has n papers.
  • r of the papers are relevant.
  • x is an outcome vector.
  • $x_i = 0$ if the paper at position i is not relevant.
  • $x_i = 1$ if the paper at position i is relevant.

62
natural order when n = 5, r = 2
  • 1 1 0 0 0   |   0 0 1 1 0
  • 1 0 1 0 0   |   1 0 0 0 1
  • 0 1 1 0 0   |   0 1 0 0 1
  • 1 0 0 1 0   |   0 0 1 0 1
  • 0 1 0 1 0   |   0 0 0 1 1
  • (read down the left column first, then the right column)
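The table can be reproduced by sorting all outcomes on the positions of their relevant papers, compared from the last relevant paper backwards. This is one reading of the "repeat the previous argument" construction, not necessarily how the paper generates the order.

```python
from itertools import combinations


def natural_order(n, r):
    """All outcomes with r relevant papers among n, best outcome first."""
    def key(positions):
        # Compare the last relevant position first, then the one before, etc.
        return tuple(sorted(positions, reverse=True))

    ordered = sorted(combinations(range(1, n + 1), r), key=key)
    return ["".join("1" if i in pos else "0" for i in range(1, n + 1))
            for pos in ordered]


for outcome in natural_order(5, 2):
    print(outcome)
# 11000, 10100, 01100, 10010, 01010, 00110, 10001, 01001, 00101, 00011
```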

63
measuring success
  • Let f(x) be a measure of the goodness of an
    outcome. It appears natural to require
  • A: $f(x) > f(x')$ if x is better than x'
  • B: $f(1, \dots, 1, 0, \dots, 0) = 1$
  • C: $E\,f(x) = 0$, where E is the expected value
    operator over the entire set of outcomes
  • D: respect for the natural order
  • C calls for a closed form of the expected
    value.

64
Brookes and Swets measures
  • Brookes and Swets measure on z, the internal
    ranking variable. They measure the true
    discriminating value of z.
  • It is difficult to build a measure by
    transformation that satisfies B and C.
  • It will not satisfy D.

65
the average search length
  • This is the average position of a relevant
    document, divided by n.
  • This can be transformed to satisfy B and C.
  • The problem remains that it does not satisfy D.
  • Using a simple change such as taking the
    logarithm of the position does not help.
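A short sketch of why the average search length conflicts with the natural order. The two outcomes are taken from the n = 5, r = 2 case. The function follows the definition above and is not a transformed version satisfying B and C.

```python
def average_search_length(x):
    """Average position of a relevant document, divided by n (lower is better)."""
    positions = [i for i, xi in enumerate(x, start=1) if xi]
    return sum(positions) / len(positions) / len(x)


a = [0, 0, 1, 1, 0]   # last relevant paper at position 4
b = [1, 0, 0, 0, 1]   # last relevant paper at position 5

# The natural order prefers a, because the editor can stop earlier,
# but the average search length rates b as the better outcome.
print(average_search_length(a), average_search_length(b))
```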

66
Cooper's expected search length
  • This (roughly) is the number of non-relevant
    documents found until a target number of relevant
    documents has been found.
  • This can be transformed to satisfy B and C.
  • It can weakly impose D. But all outcomes where
    the last relevant document is at the same position
    are considered equivalent.
  • This is a problem.
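A rough sketch of the idea for a strict ranking without ties, with the target set to all relevant documents by default. It is a simplification for illustration and not Cooper's full definition.

```python
def expected_search_length(x, target=None):
    """Non-relevant documents passed before `target` relevant ones are found."""
    if target is None:
        target = sum(x)           # default: find every relevant document
    found = skipped = 0
    for xi in x:
        if found == target:
            break
        if xi:
            found += 1
        else:
            skipped += 1
    if found < target:
        raise ValueError("fewer relevant documents than the target")
    return skipped


# Three outcomes whose last relevant paper sits at position 4:
for outcome in ([1, 0, 0, 1, 0], [0, 1, 0, 1, 0], [0, 0, 1, 1, 0]):
    print(outcome, expected_search_length(outcome))
```

All three outcomes receive the same value although the natural order ranks them differently, which is the problem noted above.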

67
natural order implementation I
  • One way is to use powers. Construct a penalty
  • $y^n x_n + y^{n-1} x_{n-1} + \dots + y x_1$, where $y > 1$.
  • It is possible to find the expected value of this
    expression and construct a measure that satisfies
    B, C, and D.
  • Exact values depend on y.
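A sketch of a power-based penalty, reading it as the sum of y^i x_i over all positions i. That reading, the function name and the choice y = 2 are assumptions made for illustration. The ten outcomes are the n = 5, r = 2 natural order listed earlier.

```python
def power_penalty(x, y=2.0):
    """Sum of y**i over the positions i (1-based) that hold a relevant paper."""
    return sum(xi * y ** i for i, xi in enumerate(x, start=1))


NATURAL_ORDER = ["11000", "10100", "01100", "10010", "01010",
                 "00110", "10001", "01001", "00101", "00011"]

penalties = [power_penalty([int(ch) for ch in s]) for s in NATURAL_ORDER]
print(penalties)

# With y = 2 the penalty grows strictly along the natural order.
print(all(p < q for p, q in zip(penalties, penalties[1:])))
```

In this sketch the ordering is reproduced for y = 2, but values of y close to 1 need not preserve it. For example, with y = 1.2 the outcome 10001 gets a smaller penalty than 00110.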

68
natural order implementation II
  • Another way is to count the items in the natural
    orders, starting at zero say.
  • Finding the expected value is trivial, in this
    case.
  • But we need an algorithm that quickly finds the
    position of an outcome in the order. Such an
    algorithm is described in the paper.
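One way to find the position of an outcome without enumerating the whole order is the combinatorial number system. The sketch below is a stand-in under that assumption and is not claimed to be the algorithm of the paper. The normalization in `goodness` is likewise one possible choice that gives the best outcome the value 1 and has mean 0 over all outcomes.

```python
from math import comb


def natural_rank(x):
    """Zero-based position of outcome x in the natural order."""
    positions = [i for i, xi in enumerate(x) if xi]   # 0-based, ascending
    # Combinatorial number system: the k-th smallest position p adds C(p, k).
    return sum(comb(p, k) for k, p in enumerate(positions, start=1))


def goodness(x):
    """Map the rank linearly so that the best outcome gets 1 and the mean is 0."""
    total = comb(len(x), sum(x))      # number of possible outcomes
    if total == 1:
        return 1.0                    # only one outcome exists
    return 1.0 - 2.0 * natural_rank(x) / (total - 1)


print(natural_rank([1, 1, 0, 0, 0]), natural_rank([0, 0, 1, 1, 0]),
      natural_rank([0, 0, 0, 1, 1]))
print(goodness([1, 1, 0, 0, 0]), goodness([0, 0, 0, 1, 1]))
```

Because the ranks run from 0 to C(n, r) - 1 and every outcome is equally likely under random sorting, the expected rank is simply (C(n, r) - 1) / 2.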

69
test
  • We extract author names, titles, abstracts,
    series id, and classification codes.
  • We do a straight feature count, then normalize
    to unit Euclidean norm.
  • We set aside 300 observations for testing and use
    the rest for learning.
  • We use SVM_light. We conduct 100 tests per
    report.
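The feature preparation can be sketched as follows. The record, the field names and the whitespace tokenization are invented for illustration. The actual ernad feature extraction may differ, and no SVM_light call is shown.

```python
import math
from collections import Counter


def feature_vector(record):
    """Count field-tagged tokens, then scale the counts to unit Euclidean norm."""
    counts = Counter()
    for field, text in record.items():
        for token in text.lower().split():
            counts[f"{field}:{token}"] += 1
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {feature: c / norm for feature, c in counts.items()}


# Hypothetical paper description.
record = {
    "title": "Money demand and money supply",
    "author": "Jane Doe",
    "series": "hypothetical-series-id",
}
vector = feature_vector(record)
print(sorted(vector.items()))
```

Normalizing to unit length keeps long abstracts from dominating short ones simply because they contain more tokens.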

70
results
  • Cooper's measure does worse than the linear
    measures such as the average search length.
  • The direct imposition measures show very high
    values many times. This is the case when they
    have been able to lift the last observation, say,
    into the first half.

71
conclusion
  • Since
  • Cooper's measure and the direct imposition
    measure essentially measure the same order,
  • Cooper's measure gives relatively low values,
  • direct imposition measures give high values
  • I conclude that a linear combination of Cooper's
    measure and direct imposition measure II seems
    the way forward to measure performance.

72
to do list
  • Answer the question: why did I ever get into this
    rather convoluted topic?
  • But now that we have a criterion, we can see if we
    can improve by other methods
  • bigrams and RePEc keyword values
  • different SVM settings
  • different algorithms

73
http://openlib.org/home/krichel/
  • Thank you for your attention!