Title: International Accesses to a Digital Library of ETDs
ETD 2005
Ana Pavani
Departamento de Engenharia Elétrica
Pontifícia Universidade Católica do Rio de Janeiro
apavani_at_lambda.ele.puc-rio.br
http://www.maxwell.lambda.ele.puc-rio.br/
3- Presentation outline
- Profile of the digital library
- Generation of data
- Combination and analysis of data: interesting results
- Next steps
4- Profile of the digital library
- Beginning of the collection: 2nd semester of 1995
- Items to start the collection: courseware (texts, exercises, technical manuals, tests, etc.)
5- The digital library is part of a system that:
- Is an LMS (Learning Management System)
- Has administrative functions that allow data exchange with the university's administrative system
- Is linked (in both directions) to CNPq's Lattes Platform (a curricula database with more than 595,000 CVs)
- Allows the control of series collections
- Is multilingual and has interfaces in 3 languages
6- Evolution of the collection
- Administrative documents
- Preprints, published papers and online articles
- Interactive courseware
- ETDs (2000)
- Online journals (2003)
- Senior projects (2003)
- Online bulletins distributed through mailing lists, archived and published automatically (2004)
- Books (Oct. 2005)
7- Numbers of titles in the collection
- Courseware (many types): 2,700
- Administrative documents: 33
- Technical documents: 94
- ETDs: 1,873 (PUC-Rio) and 31 (UNICAP)
- Preprints, published papers and online articles: 280
- Senior projects: 305
- Online journals: 3 (plus 1 in Oct. 2005 and 1 in Dec. 2005)
- Online bulletins: 2
- Books: 1 (to be published in Oct. 2005)
- Total number of digital objects (DOs): 16,400
8- Technological characteristics
- Machine: IBM RS/6000
- Operating system: IBM AIX
- Web server: Apache
- DBMS: IBM DB2
- The Apache log contains information on accesses to ALL digital contents on the system, as well as all transactions that users perform (clicking buttons, reading posts, reading help pages, etc.); data on transactions with contents must be extracted from the server log to generate the numbers to be analyzed
9- Generation of data
- Data have 2 different natures: production and accesses
- Production data come from functions of the system that are not related to the Apache server but only to the DB
Example:
10- PUC-Rio started requiring ETDs in Aug. 2002. UNICAP does not require ETDs.
11- Access data are obtained from both the Apache server log and the DB
- Logs are mined (according to the following definitions) and the results are stored in the DB
- Mined data are combined with production data (metadata) already in the database (types of contents, authors, programs, areas of knowledge, dates, countries, etc.) to yield results
12- Definitions for mining the log
- When access statistics came into discussion, it was necessary to define how data should be mined from the log and how they should be combined afterwards
- The definitions follow: (M) mining definitions and (C) combining definitions
13- (M) Visits and complete visits
- An ETD can have one or many digital objects. The number of visits is the sum of all accesses to all of its digital objects in a given month. A complete visit is a set of visits to all of its digital objects from one country in a given month.
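The two definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout `(month, country, do_id)` and the function name are hypothetical, and counting complete visits as the smallest per-object count (every complete visit must touch each object once) is one possible reading of the definition, not necessarily the one used on the system.

```python
from collections import defaultdict

def visits_and_complete_visits(accesses, etd_objects):
    """Count visits per month and complete visits per (month, country).

    accesses: iterable of (month, country, do_id) tuples (assumed layout).
    etd_objects: set with the ids of all digital objects of one ETD.
    """
    per_key = defaultdict(lambda: defaultdict(int))
    for month, country, do_id in accesses:
        if do_id in etd_objects:
            per_key[(month, country)][do_id] += 1

    visits = defaultdict(int)
    complete = {}
    for (month, country), counts in per_key.items():
        # Visits: sum of accesses to all digital objects in the month.
        visits[month] += sum(counts.values())
        # Complete visits (assumed reading): limited by the least-accessed object.
        complete[(month, country)] = min(counts.get(d, 0) for d in etd_objects)
    return dict(visits), complete
```

With two objects `a` and `b`, three accesses from one country (`a`, `b`, `a`) give 3 visits but only 1 complete visit under this reading.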
14- (M) Country vs. IP address
- The decision to use the country and not the IP
address to establish a visit was based on the
fact that the visits to an ETD can be made at
different times (and reconnecting may assign a
new IP address) and from different locations
(with fixed IP addresses).
15- (M) Counting visits from the same IP address
- Visits from the same IP are counted individually because networks with many machines can be identified by the IP address of a firewall.
16- (M) Counting visits to restricted digital objects
- Some ETDs are totally or partially restricted: approximately 30 have some type of permanent or temporary restriction. Metadata, abstracts included, are publicly available for all of them. It was decided that attempts followed by denials of access would be counted as accesses.
- Note: this is stated in the help pages of the system; it is suggested that authors consider allowing their contents to become public if many attempts occur.
17- (C) Lines to mine
- Since the interest was in access to digital objects, the decision was to take the log lines whose requested files have extensions such as .dcr, .doc, .htm, .pdf, etc. All extensions present in the database are considered, as long as the corresponding item is cataloged in the digital library, so that an occasional static HTML system page is not counted.
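The rule above amounts to a two-part filter: the extension must be one known to the database, and the requested item must be cataloged. A minimal sketch follows, assuming the cataloged items can be represented as a set of request paths; the function name, the paths and that representation are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit

def is_cataloged_object(request_path, cataloged_paths, extensions):
    """Return True when a requested path should be mined.

    cataloged_paths: set of paths of cataloged digital objects (assumed layout).
    extensions: set of lowercase extensions present in the database.
    """
    path = urlsplit(request_path).path          # drop any query string
    ext = posixpath.splitext(path)[1].lower()   # e.g. ".pdf"
    # Both conditions are needed: a static .htm system page has a valid
    # extension but is not cataloged, so it is not counted.
    return ext in extensions and path in cataloged_paths
```

For example, with `extensions = {".dcr", ".doc", ".htm", ".pdf"}` and a catalog containing only `/etd/1234_1.pdf`, a request for `/help/index.htm` is rejected even though `.htm` is an accepted extension.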
18- Observations
- Statistics were planned on a monthly basis. The model treats data as sequences of points with discrete-time intervals of one month. Past months' data are unchanged and the current month is updated according to the Update definition.
- IPs are resolved to countries using a plug-in called GeoIP Free that is available with AWStats.
19- (C) Information to get from a log line
- The month and the year are extracted along with the identification of the digital object and the country of the IP address that accessed the digital object.
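As an illustration of this extraction, the sketch below parses a line in the Apache Common Log Format. The log format actually configured on the server is not stated, so the pattern is an assumption; `country_of_ip` is a hypothetical callable standing in for the GeoIP lookup, and the requested path stands in for the digital object identification.

```python
import re

# Assumed Common Log Format: ip ident user [dd/Mon/yyyy:hh:mm:ss zone] "METHOD path PROTO" status size
LOG_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<day>\d{2})/(?P<mon>[A-Za-z]{3})/(?P<year>\d{4}):[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)
MONTHS = {name: i + 1 for i, name in enumerate(
    ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"])}

def extract(line, country_of_ip):
    """Return (year, month, path, country) for a log line, or None if it does not parse."""
    m = LOG_RE.match(line)
    if m is None or m["mon"] not in MONTHS:
        return None
    # The status code is deliberately not filtered: denied attempts to
    # restricted objects are counted as accesses.
    return (int(m["year"]), MONTHS[m["mon"]], m["path"], country_of_ip(m["ip"]))
```

Only four fields survive the extraction, which is what keeps the stored table small.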
20- (C) Update of the DB
- The log is read every hour on the hour (00:00, 01:00, etc.); only the incremental lines are mined. Accesses are summed for each month-year-DO-country combination, so the table is not very big: in the first 6 months of 2005 the average number of lines per month was 10,000.
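One way to mine only the incremental lines is to remember the byte offset reached in the previous run and sum new accesses into a counter keyed by month-year-DO-country. The sketch below assumes exactly that; the function name, the offset bookkeeping and the `parse_line` callable are hypothetical, and the real system stores the sums in the DB rather than in memory.

```python
from collections import Counter

def mine_increment(log_path, offset, totals, parse_line):
    """Mine only the bytes appended since `offset`; return the new offset.

    totals: Counter keyed by whatever `parse_line` returns,
            e.g. (year, month, do_id, country).
    parse_line: callable returning a key, or None for lines to skip.
    """
    position = offset
    with open(log_path, "rb") as log:
        log.seek(offset)
        for raw in log:
            if not raw.endswith(b"\n"):
                break  # partial line still being written; pick it up next run
            position += len(raw)
            key = parse_line(raw.decode("utf-8", errors="replace"))
            if key is not None:
                totals[key] += 1
    return position
```

An hourly scheduler would call this with the offset saved from the previous run, so earlier lines are never counted twice.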
21- (C) When to start computing
- The log of the Apache server started being saved on Jun 01, 2004, so the statistics could start on this date or on a later one, for example Jan 01, 2005. The decision was to use all available monthly logs.
- When the process started, some days of offline processing were required. Afterwards, updating became automatic according to the Update definition.
22- Observations
- Maybe these were not the best definitions; we are willing to discuss alternatives!!
- The (original) logs are stored and saved offline in case some change in the mining strategy is decided (we have not sunk the ships!!).
23- Definitions for computing statistics
- By author
- Visited ETDs by year, month and country
- Visited ETDs by country, month and year
- 25 most visited ETDs (on the system: PUC-Rio and UNICAP)
- 20 most visited ETDs by institution
24- 10 most visited ETDs by graduate program
- Visited ETDs by institution, program, year and month
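The rankings listed above can all be derived from the summed month-year-DO-country table combined with metadata. A minimal sketch for the "most visited ETDs" case follows; the dictionary layouts (`rows`, `do_to_etd`) and the function name are assumptions made for the example, not the system's schema.

```python
from collections import Counter

def most_visited(rows, n, do_to_etd):
    """Rank ETDs by total accesses.

    rows: mapping (year, month, do_id, country) -> access count (assumed layout).
    do_to_etd: mapping from digital object id to the ETD it belongs to (metadata).
    """
    totals = Counter()
    for (year, month, do_id, country), count in rows.items():
        # An ETD may have several digital objects; their accesses are summed.
        totals[do_to_etd[do_id]] += count
    return totals.most_common(n)
```

Restricting `rows` to one institution or graduate program before calling the function would give the per-institution and per-program variants.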
25- Initial results
26- Access to ETDs is increasing (Sep 28, 2005)
- ETDs, May to Sep: change of 13%; accesses, May to Sep: change of 54.6%
27- Number of total visits is increasing (Sep 28, 2005)
- ETDs, May to Sep: change of 13%; accesses, May to Sep: change of 54.6%
28- Accumulated average of total visits is increasing (Sep 28, 2005)
- ETDs, May to Sep: change of 13%; accesses, May to Sep: change of 54.6%
29- Brazil accounts for 55% of the accesses since Jun 01, 2004 (Sep 28, 2005)
- But Brazil, Portuguese-speaking and Spanish-speaking countries together: 75%
- Brazil, the US, Portuguese-speaking and Spanish-speaking countries together: 87%
30- ETDs in Iberian languages on the NDLTD DB on Jun 15, 2007
- Brazilian ETDs were 83% of all ETDs in Iberian languages (total number: 13,369)
31- Percentage of visits from Brazil is decreasing (Sep 28, 2005)
32- Accumulated percentage averages of visits from Brazil (Sep 28, 2005)
33- Total accesses: top 10 countries (Sep 28, 2005)
- Identified countries: 122, plus unidentified countries, satellite access and host
34- Some interesting results
- Some ETDs are permanent best sellers
  - They are on specific subjects (examples: a specific philosopher and the history of modern architecture in Brazil)
  - They are linked from sites on the subjects (examples: the first from the US and Brazil, the second from Germany)
  - They are accessed from different countries
- Some topics are permanent best sellers (example: energy)
35- Some ETDs are temporary best sellers; this seems to happen when they are displayed by the last published ETDs functions (system and graduate program)
- Some graduate programs are permanent best sellers
  - They research topics that are very specific to the country (examples: education and history of culture)
  - They are indexed in other sites and/or digital libraries (example: Universia in Spain for social sciences and humanities)
  - They are accessed from different countries
36- The 25 most visited ETDs have a large number of visits
- No average is lower than 100 visits per month
37- Next steps
- Find out how readers got to ETDs (BDTD, NDLTD, SCIRUS, etc.); an online survey is planned
- Interview faculty to check if some ETDs are recommended reading in courses
- Gather more data and analyze them in a more scientific manner (must find a student!!)
38- Develop additional functions comparing accesses with production
- Extend to other digital contents (at the moment only ETDs and online journals have access statistics)
39 Thank you! Muito obrigada!