Title: Quality assessment of a current awareness system
1Quality assessment of a current awareness system
- Thomas Krichel
- LIU H??
- 20071023
2acknowledgments
- Thanks to organizers.
- I am grateful for comments by
- Bernardo Bátiz-Lazo
- Joanna P. Davies
- Marco Novarese
- Christian Zimmermann
- I thank everybody involved in RePEc and NEP, as
well as JISC.
3current awareness
- Current awareness, aka Selective Dissemination of
Information, is a simple idea: a user is informed
about new documents in her area of interest.
- Current awareness generates a double
classification, by
- subject matter
- time
4why bother?
- It is a niche activity that has been neglected by
the search engines.
- I have registered with Google and Amazon. They
give me tips but these are generally poor.
- We cannot trust computers to do it.
- Neither on subject matter
- Nor on time
5types of current awareness
- personal (amazon) vs collective (Google news)?
- machine generated vs human generated?
- Actually, I claim that the only human-generated
current awareness service for academic documents
is NEP.
6computer-based keywords
- In computer-generated current awareness one can
filter for keywords.
- In academic digital libraries, since the papers
describe research results, they contain ideas
that have not been previously seen.
- Therefore getting the keywords right is
impossible.
7computer-based categories
- It is possible to classify documents based on
categories, say football vs tennis.
- It works fine when the vocabulary used in
different categories is quite different.
- For some academic areas the differences are just
too subtle.
8computers and time problem
- In a digital library the date of a document can
mean anything.
- The metadata may be dated in some implicit form.
- Recently arrived records can be calculated.
- But record handles may be unstable.
- Recently arrived records do not automatically
mean new documents.
9we need humans
- Catalogers are expensive.
- We need volunteers to do the work.
- Junior researchers have good incentives. They
- need to be aware of the latest literature
- are absent from the informal circulation channels
of top-level academics
- need to get their name around among researchers
in the field.
10introducing NEP
- There is only one large freely-available
human-based current awareness service.
- It is NEP: New Economics Papers, at
http://nep.repec.org
- The remainder of this talk is about NEP.
11NEP New Economics Papers
- NEP is a current awareness system for the working
papers in the RePEc digital library about
economics.
- Published articles are excluded because they are
way too old.
12NEP service model
- There is a basic model behind this service we
could call the "NEP Service Model".
- two stages...
- flat report space...
13general two stage setup
- First stage: a general editor compiles a list of
all new papers. This forms an issue of the
allport.
- Second stage: a group of subject editors filter
the new allport issue into subject reports. Each
editor does this independently for the subject
reports she looks after.
14a flat space
- There is a series of reports. Each report has a
number of issues over time.
- There is an allport, a report that contains all
papers that are new in the period covered by the
issue.
15first stage in NEP
- The General Editor compiles a list of recent
additions to the RePEc working papers data.
- Computer generated
- Journal articles are excluded
- Examined by the General Editor (GE, a person)
- This list forms an issue of nep-all.
16first stage in NEP
- nep-all contains all new papers
- Circulated to
- nep-all subscribers
- Editors of subject-reports
17second stage
- Each editor creates, independently, a subject
report for her subject. She does this by removing
papers from nep-all.
- A subject report issue (sri) is the result of
this process.
- There have been over 47,000 sris issued through
the lifetime of NEP to date.
18history
- There are basically two phases in NEP: the
pre-ernad phase, 1998 to 2004, and the post-ernad
phase.
- I will deal with pre-ernad history here.
- Some research on NEP has been conducted in the
pre-ernad phase.
- This has informed the work that went into ernad.
19early history
- System was conceived by Thomas Krichel
- Name NEP by Sune Karlsson
- Implemented by José Manuel Barrueco Cruz.
- Started to run in May 1998.
20starting setup
- At first the system was all email-based.
- The nep-all was composed as an email.
- It was sent to editors as an email.
- Editors used whatever tool they wanted to compose
the email.
21web interface
- John S. Irons issued the first web interface for
report composition on 2000-02-01.
- This would just compose the report.
- Editors would still cut and paste the results of
the form into email clients.
22historic mail support
- First mail support was given by mailbase.ac.uk.
- When this was closed in 2000-11, NEP moved to
jiscmail.ac.uk.
- Since the mailing list service was only supposed
to be for the UK academic community, it was deemed
not sustainable.
- Thomas Krichel started hosting lists on
2002-11-16. It is a nightmare.
23Aeroflot document
- The Aeroflot document was a thinking piece that
Thomas Krichel wrote as early as 2001.
http://openlib.org/home/krichel/work/aeroflot.html
- This paper already sets out ideas for what would
become ernad.
- At that time the Siberian RePEc team promised
help with building such a system.
24discover disaster
- In 2002-2003 Jeremiah Cochise Trinidad-Christensen
and Thomas Krichel were the first people to try
to get a systematic picture of how NEP works.
- They discovered that this is exceedingly
difficult.
25mail log parsing
- Logs were not moved from Mailbase to JISCMail.
- Mailbase removed the logs in 2002-11. Thomas
Krichel got them just before they were destroyed.
- The mail logs were the only source for historic
NEP information.
26parsing targets
- handles: severely compromised by cut-n-paste
operations, editor locales, etc.
- date of issue: editors were free to set dates;
nep-all dates may not be preserved
- time of issue: an email is almost impossible to
time.
27state of pre-ernad data
- After a regular-expressions orgy, we can get some
approximate idea about the handles that were
used.
- Thus the thematic component is roughly intact.
- We have a problem with a bug in the discovery
program that made many papers appear several
times in nep-all. This makes it difficult to
associate subject and allport issues.
28state of pre-ernad data
- Timing of emails is extremely difficult, even
with full headers.
- The logs of the Mailbase system only have times
for when the email client said it sent the mail.
This is the local editor's PC time, which can be
years out of whack!
- We still have some data for research...
29research conducted on NEP
- Most of the research conducted on NEP has been
done in the pre-ernad phase.
- The difficulties of some of this work have
informed the construction of ernad.
30Chu and Krichel (2003)
- Heting Chu and Thomas Krichel (2003), NEP: Current
awareness service of the RePEc digital library,
http://www.dlib.org/dlib/december03/chu/12chu.html,
vaguely talks about NEP. It notes that there is a
problem of timeliness in subject report issues,
despite the very shaky data.
31Barrueco Cruz et al. (2003)
- Jose Manuel Barrueco Cruz, Thomas Krichel and
Jeremiah Cochise Trinidad-Christensen, Organizing
Current Awareness in a Large Digital Library,
http://openlib.org/home/krichel/papers/espoo.pdf,
have two themes
- overlap between reports...
- coverage ratio...
- as well as history and suggestions.
32overlap
- Barrueco Cruz et al (2003) argue that overlap
matters not when the same paper appears in two
reports, but when the two reports are read by
the same readers.
- They have data on pairwise overlap between
reports, based on crude membership data.
33overlap puzzle
- Here is a puzzle to think about.
- If a person is interested in two subject areas
because they are close, she will subscribe to
both reports.
- But since they are thematically close, she will
sometimes receive the same papers twice.
- With mail technology and asynchronous issue
generation, this appears difficult to solve.
34coverage ratio
- We call the coverage ratio the proportion of
papers in nep-all that have been announced in at
least one subject report.
- We can define this ratio
- for each nep-all issue
- for a subset of nep-all issues
- for NEP as a whole (see the sketch below).
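- A minimal computation sketch (illustrative Python, not ernad code; representing each issue as a set of paper handles is an assumption):
    # Coverage ratio: share of papers in the given nep-all issues that were
    # announced in at least one subject report issue.
    def coverage_ratio(nepall_issues, subject_report_issues):
        # each argument: an iterable of sets of paper handles, one set per issue
        announced = set()
        for sri in subject_report_issues:
            announced |= sri
        all_papers = set()
        for issue in nepall_issues:
            all_papers |= issue
        if not all_papers:
            return 0.0
        return len(all_papers & announced) / len(all_papers)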
35coverage ratio theory evidence
- Over time more and more NEP reports have been
added. As this happens, we expect the coverage
ratio to increase.
- However, the evidence from research by Barrueco
Cruz, Krichel and Trinidad is:
- The coverage ratio of different nep-all issues
varies a great deal.
- Overall, it remains at around 70%.
- We need some theory as to why.
36Krichel Bakkalbasi (2005)
- Thomas Krichel and Nisa Bakkalbasi, Developing a
predictive model of editor selectivity in a
current awareness service of a large digital
library,
http://openlib.org/home/krichel/papers/boston.pdf
37coverage ratio theories
- Krichel Bakkalbasi (2005) build two theories from
the observations of Barrueco Cruz et al. (2003).
- They are
- Target-size theory
- Quality theory
- descriptive quality
- substantive quality
38theory 1 target size theory
- When editors compose a report issue, they have a
size of the issue in mind.
- If the nep-all issue is large, editors will take
a narrow interpretation of the report subject.
- If the nep-all issue is small, editors will take
a wide interpretation of the report subject.
39target size theory static coverage
- There are two things going on.
- The opening of new subject reports improves the
coverage ratio.
- The expansion of RePEc implies that the size of
nep-all, though varying in the short run, grows
in the long run. Target-size theory implies that
the coverage ratio deteriorates.
- The static coverage ratio is the result of both
effects canceling out.
40theory 2 quality theory
- George W. Bush version of quality theory:
- Some papers are rubbish. They will not get
announced.
- The amount of rubbish in RePEc remains constant.
- This implies constant coverage.
- Reality is slightly more subtle.
412 versions of quality theory
- Descriptive quality theory: papers that are badly
described
- misleading titles
- no abstract
- languages other than English
- Substantive quality theory: papers that are well
described, but not good
- from unknown authors
- issued by institutions with an unenviable research
reputation
42practical importance
- We do care whether one or the other theory is
true.
- Target-size theory implies that NEP should open
more reports to achieve perfect coverage.
- Quality theory suggests that opening more reports
will have little to no impact on coverage.
- Since operating more reports is costly, there
should be an optimal number of reports.
43results
- Krichel Bakkalbasi (2005) build a binary logistic
regression model (sketched below).
- They find positive evidence for both target-size
and quality theory.
- The NEP editors don't like the results. They
insist that they only filter by topic.
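- A sketch of the kind of binary logit one could fit (illustrative; the data file and regressor names are assumptions, not the actual Krichel-Bakkalbasi specification):
    import pandas as pd
    import statsmodels.formula.api as smf
    # One row per (paper, report) pair taken from nep-all issues;
    # 'announced' is 1 if the editor put the paper into the subject report.
    df = pd.read_csv("editor_decisions.csv")  # hypothetical data extract
    # all_size probes target-size theory; has_abstract and is_english probe quality theory.
    result = smf.logit("announced ~ all_size + has_abstract + is_english", data=df).fit()
    print(result.summary())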
44Bátiz-Lazo Krichel (2005)
- Bernardo Bátiz-Lazo and Thomas Krichel, On-line
distribution of working papers through NEP: A
Brief Business History,
http://openlib.org/home/krichel/papers/kassel.pdf,
has an early history of NEP that covers
organizational details I don't talk about here.
45ernad
- stands for editing reports on new academic
documents.
- Software system designed by Thomas Krichel at
http://openlib.org/home/krichel/work/altai.html.
- Software written in Perl by Roman D. Shapiro.
Cost: 2000.
- Started to work after 2004-12.
46cut editor freedom I
- Editors no longer send mail to lists.
- Only one email address sends mail.
- But the mail appears to come from the editor
- From: Marcus Desjardin <ernad@nep.repec.org>
- Reply-To: Marcus Desjardin
<desjardin@econ.louvain.be>
47cut editor freedom II
- Editors can no longer edit report issue emails,
e.g. add announcements of conferences.
- They are generated from XML files into
standardized text and HTML files bound together
by MIME multipart/alternative.
- They cannot change dates of issue.
48help editors
- Provide a simple-to-use interface for the
composition of reports
- provide an easy-to-scroll input
- allow for easy sorting of the report
- do a better job at pretty-printing
- Get ready for the introduction of pre-sorting
- Actually presorting was only introduced in
2005-08.
49statistical learning
- The idea is that a computer may be able to make
decisions on the current nep-all issue based on
the observation of earlier editorial decisions.
This is known as pre-sorting.
- Thomas Krichel, Information retrieval performance
measures for a current awareness report
composition aid,
http://openlib.org/home/krichel/sendai.pdf, deals
with the evaluation of presorting.
50presorting
- When an allport issue is created, it is
presorted.
- In the allport rif each paper has a number in
document order. That number is still reported in
the presorted rif.
- The method is support vector machines (SVM),
using svm_light (a usage sketch follows).
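- A minimal sketch of the svm_light round trip (file names are hypothetical; the svm_learn and svm_classify binaries must be installed):
    import subprocess
    # Train on past editorial decisions written in svm_light's sparse input format,
    # then score the papers of a new nep-all issue.
    subprocess.run(["svm_learn", "train.dat", "report.model"], check=True)
    subprocess.run(["svm_classify", "nepall.dat", "report.model", "scores.dat"], check=True)
    with open("scores.dat") as fh:
        scores = [float(line) for line in fh]
    # Papers with higher scores are shown to the editor first.
    presorted_order = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)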
51pre-sorting reconceived
- We should not think of pre-sorting via SVM as
something to replace the editor.
- We should not think of it as encouraging editors
to be lazy.
- Instead, we should think of it as an invitation
to examine some papers more closely than others.
52headline vs. bottomline data
- The editors really have a three-stage decision
process.
- They read the title and author names.
- They read the abstract.
- They read the full text.
- A lot of papers fail at the first hurdle.
- SVM can read the abstract and prioritize papers
for abstract reading.
- Editors are happy with the presorting system.
53what is the value of an editor?
- It turns out that reports have very different
forecastability. Some are almost perfect, others
are weak.
- If the forecast is very weak the editor may be
- a genius
- a prankster.
54svm training set
- The positive examples are taken from the report
up to a certain time limit, called the experience
length.
- The negative examples are taken from nep-all,
from the date of the last issued subject report
back over the experience length.
- The experience length is fixed in ernad.conf. For
NEP it is 13 months (see the sketch below).
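- A sketch of how such a training window could be selected (the data structures and month arithmetic are assumptions, not the ernad code):
    from datetime import timedelta
    EXPERIENCE_LENGTH = timedelta(days=13 * 30)  # roughly the 13 months used by NEP
    def training_set(report_issues, nepall_papers, last_sri_date):
        # report_issues: list of (issue_date, set_of_handles) for the subject report
        # nepall_papers: dict mapping paper handle -> date it appeared in nep-all
        cutoff = last_sri_date - EXPERIENCE_LENGTH
        positives = set()
        for issue_date, handles in report_issues:
            if issue_date >= cutoff:
                positives |= handles
        negatives = {h for h, d in nepall_papers.items()
                     if cutoff <= d <= last_sri_date and h not in positives}
        return positives, negatives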
55features selection
- We use individual words from the contents of
titles, author names, abstracts, classification
data and the id of the RePEc series.
- We normalize the feature weights so that their
Euclidean norm is one.
- We run svm_light with the default settings (a
sketch follows).
56presorting timeline
- When a nep-all issue has been created, a
customized version of its rif is created in the
source/us directory.
- This issue is then presorted. The presorted
version is stored in the source/ps directory.
- Presorting therefore only takes account of the
information available at nep-all creation time.
57underlying technologies
- Written in Perl, using LibXSLT.
- Uses mod_perl under Apache 2.
- Runs on Debian GNU/Linux, could run on similar
systems.
- Ernad needs to use some sort of mailing system
but is not geared to a specific system. It
basically just sends mail.
58underlying information
- AMF is a format for the description of academic
documents and academics.
- http://amf.openlib.org/doc/amf.html
- It is based on XML Schema, itself based on XML.
- Report data and issue data are encoded in AMF.
59ernad.conf
- Ernad uses a single configuration file,
ernad.conf.
- It has a simple attribute-value structure (a
reader sketch follows).
60affordances and domains
- There are basically four things that an
ernad-based current awareness system provides
for.
- For each of these affordances, we have a separate
domain.
- This allows distinct affordances to be run under
distinct domains.
61the composition domain
- This is the domain used by the report issue
composition interface.
- This is the virtual domain that the ernad Apache
server runs under.
- The ernad process creates files, so the Apache
server is best run as the user ernad.
- Recall that ernad requires mod_perl, which in
turn is incompatible with suexec.
62the service domain
- This is where potential readers look at
information about the ernad service
- what reports are available
- who edits them
- This domain is fixed through the reports
configuration file reports.amf.xml.
63the list domain
- This is the domain that the mailing lists run
under
- domain of the web interface
- domain of the mailing lists
- Each report has a list report@listdomain, where
listdomain is the list domain.
- This domain is fixed through the reports
configuration file reports.amf.xml.
64delivery domain
- The links to the full text encode the identifier
of the paper and the identifier of the report.
- This allows us to see through which report
readers requested papers.
- It is imperative that these links are not further
disseminated. There should be no archives of nep
lists.
- It is fixed in ernad.conf.
65reports.amf.xml
- The first part has the definition of NEP itself
(next slide).
- The second part has the definition of a report
(slide after).
- The allport is the first report listed.
- reports.amf.xml fixes
- report handles
- editor information (incl. editor ids)
- list domain
- service domain
66
<amf xmlns="http://amf.openlib.org"
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://amf.openlib.org
                         http://amf.openlib.org/2001/amf.xsd"
     xmlns:ernad="http://ernad.openlib.org">
 <collection id="RePEc:nep">
  <title>NEP New Economics Papers</title>
  <accesspoint>ftp://all.repec.org/RePEc/wop/conf</accesspoint>
  <homepage>http://nep.repec.org</homepage>
  <haspart>
   <collection id="RePEc:nep:nepall">
67
<collection id="RePEc:nep:nepcba">
 <title>Central Banking</title>
 <homepage>
  http://lists.repec.org/mailman/listinfo/nep-cba
 </homepage>
 <ernad:password>...</ernad:password>
 <haseditor>
  <person>
   <name>Alexander Mihailov</name>
   <name xml:lang="en">Alexander Mihailov</name>
   <homepage>http://econpapers.repec.org/RAS/pmi59.htm</homepage>
   <email>...</email>
   <ispartof>
    <organization>...</organization>
   </ispartof>
  </person>
 </haseditor>
</collection>
68operation
- A rif always has a name of the form
yyyy-mm-dd_tist,
- where yyyy is the year, mm the month, dd the day
of the nep-all issue.
- tist is a UNIX time stamp, i.e. the number of
seconds that have passed since the first of
January 1970.
- rifs are never deleted. When an operation is
made, a new version of the rif with a new tist is
written (see the sketch below).
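- A sketch of building and parsing such a name (illustrative only):
    import time
    from datetime import date
    def rif_name(nepall_date):
        # yyyy-mm-dd of the nep-all issue, then the current UNIX time stamp
        return f"{nepall_date.isoformat()}_{int(time.time())}"
    def parse_rif_name(name):
        day, tist = name.rsplit("_", 1)
        return date.fromisoformat(day), int(tist)
    # rif_name(date(2007, 10, 23)) -> something like "2007-10-23_1193140717"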
69creating a sri 0
- After logging in to ernad, an editor sees a set
of allport issues to work on.
- This is the report selection stage.
- If no allport issue needs working on, a sorry
message is displayed.
70creating a sri 1
- When a subject report issue is created, it is
copied from the source/ps or source/us directory,
depending on whether the editor chooses to work
with the presorted or the unsorted version of the
report.
71creating a sri 1.5
- If no paper is worth including in the report, the
editor can move back from the paper selection to
the issue selection stage. There she can delete
the issue.
- The rif of the issue is not deleted. Instead an
empty issue is created in the final sent
directory.
72creating a sri 2
- Once papers have been selected, a rif is created
in the selected directory. This rif only contains
the selected papers.
- If there are changes in the selections, new rifs
are created as soon as the editor moves to the
next screen.
73creating a sri 3
- Once papers have been ordered, a rif is created
in the ordered directory.
- If there are changes in the order, new rifs are
created as soon as the editor moves to the next
screen.
74creating a sri 4
- Once the sri has been previewed, the editor can
click the send button. The rif is stored in the
sent directory.
- Ernad creates a mail file containing an HTML and
a text version of the sri and places it in the
mail directory.
- This file can be sent again if there is an email
problem.
75a lot of data
- ernad@khufu$ date
- Thu Oct 18 05:18:37 EDT 2007
- ernad@khufu$ du -s ernad/var/reports/
- 24147956 ernad/var/reports/
- ernad@khufu$ find ernad/var/reports/ -type f | grep -c
- 76043
76maintenance
- Thomas Krichel has written a technical guide,
mainly for the director and the general editor.
- It is at http://nep.repec.org/technical.
- It illustrates well that the technical
maintenance is still quite heavy.
- A lot of maintenance scripts still have to be
written by Thomas.
77assessment
- How well does NEP work?
- Some criteria are already discussed in the
literature
- coverage ratio
- overlap
78coverage to lossage
- There is a web site,
- http://nep.repec.org/lossage,
- that shows the papers that have not been sent, as
well as the coverage ratio for each issue.
- It appears that coverage has improved.
79overlap
- There is no script to compute current overlap
data.
- There is quite good historical subscriber data
for most of the post-ernad period.
- Thus it is possible to calculate the overlap of
reports for various nep-all issues.
80to improve coverage
- It would be interesting to redo the work of
Krichel and Bakkalbasi (2005).
- For English papers, we can try to presort nep-all
issues for a virtual nep-no report. This could
help to identify thematic gaps.
- Should we open language-specific reports?
81delay
- The site http://nep.repec.org/delay shows average
delays of editors.
- Half of the editors appear to do a good job. A
good job is when the average delay is below a
week.
82editor activity
- There is a web site containing activity data of
editors,
- http://nep.repec.org/editor_activity.
- There appear to be some minor problems but
overall it looks ok.
- Data is only available from 2005-06, because of a
misunderstanding between Roman D. Shapiro and
Thomas Krichel.
83downloads
- This is the ultimate measure of success.
- Downloads from a report can be measured because
of a CGI parameter identifying the report.
- Parsing the logs and matching the handles with
handles in reports is difficult (a sketch
follows).
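- A sketch of the kind of log parsing involved (the query-parameter name and log format are assumptions, not NEP's actual setup):
    import re
    from collections import Counter
    # Count full-text downloads per report, assuming links carry something like ?r=nep-cba
    report_param = re.compile(r"[?&]r=(nep-[a-z]+)")
    downloads = Counter()
    with open("access.log") as log:
        for line in log:
            match = report_param.search(line)
            if match:
                downloads[match.group(1)] += 1
    print(downloads.most_common(10))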
84downloads from issues
- A lot of downloads are made by editors when they
compose the report issue.
- A lot of others are made by robots.
- As for the rest, data is at
http://nep.repec.org/downloads
- At this time this is preliminary.
85the Kiev framework
- This is a framework I want to discuss today to
assess NEP.
- Objective: maximize downloads of papers through
NEP per paper announced.
- Means
- targeted report
- large and targeted audience
- both can be influenced by the editor
86unit of assessment
- The unit of assessment is the report issue. This
is not an assessment of
- the report
- the editor
- The dependent variable is related to the
independent variables through simple linear
regression.
87dependent variable
- It is the number of downloads made by users from
a report.
- We try to get to the true user data.
- We only look at data after pre-sorting was
introduced, say 2006 and 2007.
- In the following, I am looking at the independent
variables (i.v.).
88i.v. of normalization
- i.v. 1: issue_size
- This is the number of papers in the report.
- i.v. 2: membership_size
- This is the number of members that get the report
just before the issue is mailed.
89i.v. of membership
- One indication of membership quality is that it
is dispersed.
- i.v. 3: concentration_1
- A measure of concentration of subscribers'
top-level domains.
- i.v. 4: concentration_2
- A measure of concentration of subscribers' top-
and second-level domains (a sketch follows).
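- One way to compute such measures is a Herfindahl index over subscriber email domains (a sketch under that assumption; the actual definition may differ):
    from collections import Counter
    def concentration(addresses, levels=1):
        # levels=1 keeps the top-level domain, levels=2 the top two levels
        domains = [".".join(a.rsplit("@", 1)[-1].split(".")[-levels:]) for a in addresses]
        counts = Counter(domains)
        total = sum(counts.values())
        if total == 0:
            return 0.0
        return sum((n / total) ** 2 for n in counts.values())
    # concentration_1 = concentration(subscribers, levels=1)
    # concentration_2 = concentration(subscribers, levels=2)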
90i.v. of timeliness
- i.v. 5: all_time
- the time between nep-all and the current subject
issue
- i.v. 6: neighbour_time
- the minimum of
- the time between the current issue and the next
issue
- the time between the current issue and the
previous issue
91i.v. of composition
- i.v. 7: composition_duration
- the total time of composition
- i.v. 8: ordering_step
- the total number of times the report was ordered
- i.v. 9: trailer
- the position of the last paper selected
- i.v. 10: all_size
- the size of the corresponding nep-all
92i.v. of season
- We know that the activity of RePEc is seasonal.
- i.v. 11 to 22: m1 to m12
- dummies to indicate the month (a regression
sketch follows).
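- A sketch of the kind of regression implied by this framework (variable names follow the slides; the data file and estimation code are illustrative, not the actual analysis):
    import pandas as pd
    import statsmodels.formula.api as smf
    # One row per subject report issue; 'downloads' is the dependent variable.
    df = pd.read_csv("issues.csv")  # hypothetical data extract
    formula = ("downloads ~ issue_size + membership_size"
               " + concentration_1 + concentration_2"
               " + all_time + neighbour_time"
               " + composition_duration + ordering_step + trailer + all_size"
               " + C(month)")  # C(month) generates the month dummies
    print(smf.ols(formula, data=df).fit().summary())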
93anything missing?
94Leonardo Fernandes Souto
- is a Brazilian PhD student working on current
awareness services.
- He has (correctly) identified NEP as the model
that all should follow :-)
- His questionnaire is at
http://nep.repec.org/research/NEP_questionnaire_2007-10-06.doc.
95future extensions of ernad 1
- Use editor identities to build a customized
experience length.
- Use a collection of multi-word RePEc keywords to
aid pre-sorting.
96future extensions of ernad 2
- Deal better with duplicate papers under different
handles.
- use lists before ordering as a basis for
inclusion into pre-sorting. This will allow
editors to delete duplicates without confusing
the SVM.
- potentially detect duplicates at allport
composition time.
97future extensions of NEP
- Use RSS as an alternative dissemination method.
98tough problems
- Filtering for new papers is deficient as the date
on papers is not mandatory. Presorting for age
seems impossible.
- The fight with spam is a problem for anyone who
sends out a lot of mail.
99finally stop the press...
- 2007-10-17, Christian Zimmermann wrote: At
http://ideas.repec.org/i/e.html, I have attempted
to classify registered authors by field. For this
I used their papers as disseminated by NEP, and
if at least one fourth are in a report, authors
are considered to be a specialist of that field.
100http://openlib.org/home/krichel
- Thank you for your attention!