Group Project Presentation: Database of Online Publications

About This Presentation

Title:

Group Project Presentation: Database of Online Publications

Description:

The DBLP, or Database of on-line Library Publications, is a catalogue of ... structure was fully implemented - still impossible to eradicate redundancy ... – PowerPoint PPT presentation

Slides: 39

Provided by: eionmu

Transcript and Presenter's Notes

1
Group Project Presentation: Database of Online
Publications
2
Introduction

The DBLP, or Database of on-line Library
Publications, is a catalogue of computing-related
publications from the last 50 years
It is a resource for finding source material and
references used by academics world-wide

3
Demonstration

There will now be a demonstration of both the
existing site and our new site

4
Aims

The aim of this project is twofold
Firstly, to convert data stored in an XML format
to a relational database schema
Secondly, to provide a toolset for accessing the
data that is superior to that which exists for
the XML records
The data is to be accessed using a web interface,
as is currently the case

5
The Current DBLP Site

As we do not have any access to the current DBLP
site over and above normal web access and the
ability to download the XML, we have no way of
knowing exactly how the current site operates
As far as we were able to determine, the current
site works by using forms to search against a
variety of indices of the XML files; the search
results are then passed back

6
The DBLP Files

These files contain the raw XML data. The
back-end of the current site runs off these
files, in combination with various indexing
strategies.
The DBLP records are available for download
They consist of over 235,000 small XML files,
each containing a single publication
Publications can be theses, books, journal
articles or conference papers

7
What is XML?

XML is a universal format for structured
documents and data
It uses plain text files and is designed to be
easily parsed, even by simple programs
XML is being developed by the World Wide Web
Consortium

8
Example DBLP File

<book key="books/oreilly/OReillyMG99">
<author>Tim O'Reilly</author>
<author>Troy Mott</author>
<author>Walter Glenn</author>
<title>Windows 98 in a Nutshell</title>
<publisher>O'Reilly</publisher>
<year>1999</year>
<isbn>1-56592-486-X</isbn>
</book>

9
How do we read the Data?

There were two main choices of technology to use
Java and JAXP (Java API for XML Processing)
Perl and one of the Perl-XML parsers
We chose to use Perl, as we were more familiar
with the language

10
Reading the Data

The DTD (Document Type Definition) was used to
cater for all the possible data fields in the
files
A script was created to read all the data into
Perl data structures, and to recursively process
all the files from DBLP
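
The reading step can be sketched as follows. This is a minimal illustration in Python rather than the Perl script the project actually used, and it only handles the sample record from the "Example DBLP File" slide. The names `parse_record` and `REPEATABLE` are hypothetical, and the set of repeatable fields is an assumption; the project took its field list from the DTD. Real DBLP files may also contain DTD-defined character entities, which this sketch does not resolve.

```python
import xml.etree.ElementTree as ET

# Sample record taken from the "Example DBLP File" slide.
SAMPLE = """<book key="books/oreilly/OReillyMG99">
<author>Tim O'Reilly</author>
<author>Troy Mott</author>
<author>Walter Glenn</author>
<title>Windows 98 in a Nutshell</title>
<publisher>O'Reilly</publisher>
<year>1999</year>
<isbn>1-56592-486-X</isbn>
</book>"""

# Fields that may occur more than once in a record (assumed for this sketch).
REPEATABLE = {"author", "editor"}


def parse_record(xml_text):
    """Turn one publication record into a plain dictionary."""
    root = ET.fromstring(xml_text)
    # The element name gives the publication type, the key attribute its id.
    record = {"type": root.tag, "key": root.get("key")}
    for child in root:
        text = (child.text or "").strip()
        if child.tag in REPEATABLE:
            # Collect repeated fields into a list, preserving order.
            record.setdefault(child.tag, []).append(text)
        else:
            record[child.tag] = text
    return record


record = parse_record(SAMPLE)
print(record["type"], record["title"], record["author"])
```

Processing the whole download would then be a matter of walking the directory tree and calling `parse_record` on each file.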

11
Now the Data has been Read

Once the data had been read and the database
design had been completed, it was simply a matter
of adding the appropriate routines to insert the
data into the database

12
Database Schema Design

Data was divided into separate tables:
Publications: publication data (e.g. titles)
Persons: people (authors, editors)
Joins: links between people and publications
Links: external links for publications

13
Database Schema Reasoning

To avoid redundancy, store people separately from
publications
To establish the required many-to-many relationship
between people and publications, use an
intermediary table
To store links more efficiently, use a separate
table
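
A minimal sketch of such a schema is shown below, using Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs without a server. The table names and the columns `pubid`, `personid`, `name`, `title`, `publisher` and `year` follow the queries on the search slides; the `role`, `linkid` and `url` columns and all constraints are assumptions made for illustration, not the project's actual design.

```python
import sqlite3

SCHEMA = """
CREATE TABLE publications (
    pubid     INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    publisher TEXT,
    year      INTEGER
);
CREATE TABLE persons (
    personid INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE      -- each person is stored once
);
CREATE TABLE joins (                   -- many-to-many: people <-> publications
    pubid    INTEGER NOT NULL REFERENCES publications(pubid),
    personid INTEGER NOT NULL REFERENCES persons(personid),
    role     TEXT NOT NULL DEFAULT 'author',   -- hypothetical column
    PRIMARY KEY (pubid, personid, role)
);
CREATE TABLE links (                   -- external links, kept out of publications
    linkid INTEGER PRIMARY KEY,        -- hypothetical column
    pubid  INTEGER NOT NULL REFERENCES publications(pubid),
    url    TEXT NOT NULL               -- hypothetical column
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# Load the book from the "Example DBLP File" slide.
conn.execute(
    "INSERT INTO publications (pubid, title, publisher, year) VALUES (?, ?, ?, ?)",
    (1, "Windows 98 in a Nutshell", "O'Reilly", 1999),
)
authors = ["Tim O'Reilly", "Troy Mott", "Walter Glenn"]
for personid, name in enumerate(authors, start=1):
    conn.execute("INSERT INTO persons (personid, name) VALUES (?, ?)", (personid, name))
    conn.execute("INSERT INTO joins (pubid, personid) VALUES (?, ?)", (1, personid))

author_count = conn.execute(
    "SELECT COUNT(*) FROM joins WHERE pubid = ?", (1,)
).fetchone()[0]
print(author_count)
```

With this layout a second book by one of the same authors adds one row to `publications` and one to `joins`, but nothing to `persons`.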

14
XML vs. PostgreSQL

XML is a flat-file format, so there is no way to
avoid data redundancy
Extra tools are needed to index and search
records
XML provides no built-in type-checking
Because there are no relationships, it is easy to
set up publications with multiple authors

15
XML vs. PostgreSQL (cont.)

PostgreSQL
avoids data redundancy.
has search and indexing built-in
has type-checking of records
makes it harder to set up publications with
multiple authors

16
Site Implementation

The new design of the site works with three main
components
A dynamically generated HTML front end for
displaying content and results
A Perl DBI (Database Interface) for data
retrieval and searching
A PostgreSQL back end database

17
Site Implementation (cont.)

The front end makes requests to the Perl DBI
using forms.
The Perl DBI processes form input, converts it to
SQL
The SQL runs on the PostgreSQL back end.
Results are then returned to the front end via
the DBI and displayed in an appropriate format.
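
The form-to-SQL step might look roughly like this. The sketch is in Python with `sqlite3` rather than Perl DBI and PostgreSQL, and the function `build_search`, the form field names and the sample rows are all hypothetical. It uses placeholders so that user input is passed as parameters rather than pasted into the SQL text, which is an assumption about good practice, not a description of the project's code. SQLite's `LIKE` is case-insensitive for ASCII by default and stands in for PostgreSQL's `ILIKE` here.

```python
import sqlite3


def build_search(form):
    """Translate submitted form fields into (sql, params)."""
    clauses, params = [], []
    title = form.get("title", "").strip()
    year = form.get("year", "").strip()
    if title:
        clauses.append("title LIKE ?")
        params.append("%" + title + "%")
    if year:
        clauses.append("year = ?")
        params.append(int(year))
    sql = "SELECT pubid, title FROM publications"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql + " ORDER BY pubid", params


# Tiny in-memory table with made-up rows, just to run the generated query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE publications (pubid INTEGER PRIMARY KEY, title TEXT, year INTEGER)"
)
conn.executemany(
    "INSERT INTO publications VALUES (?, ?, ?)",
    [(1, "Learning SQL", 1999), (2, "Learning Perl", 1999), (3, "SQL Revisited", 2001)],
)

sql, params = build_search({"title": "sql", "year": "1999"})
results = conn.execute(sql, params).fetchall()
print(sql)
print(results)
```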

18
The User Interface

Current DBLP site
cluttered
uses multiple pages for similar functions, which
could be on a single page
Our aim
to provide better functionality
to make the site user-friendly

19
Our Design

We decided that our design should utilise colour
and layout
Our first design used frames, but these were
removed in the third revision
We decided to have a single Basic Search page,
and a single Advanced Search page

20
Simplifying the Site

Incorporated Frequently Asked Questions into a
single page rather than spreading over many
Incorporated searches into a single section
We are using only standard HTML and CSS, to allow
any browser to be used to view the site

21
Search Strategy

Searches need to be
Correct
Efficient
Fast

22
Basic Search

Search for author
J Smith

23
Basic Search (cont.)

Step 1
SELECT personid, name
FROM persons
WHERE name ILIKE '%J%Smith%'

24
Basic Search (cont.)

Results
J. A. Smith
Harry J. Smith
J MacGregor-Smith
John Miles Smith

25
Basic Search (cont.)

Step 2: Using the personid
SELECT p.pubid, title
FROM publications p, joins j
WHERE p.pubid = j.pubid
AND j.personid = 20

26
Basic Search (cont.)

Step 3: Using the pubid
SELECT title, publisher, year
FROM publications
WHERE pubid = 1002

27
Basic Search (cont.)

Alternatives
SELECT pubid, title
FROM publications
WHERE title ILIKE '%SQL%'

28
Basic Search (cont.)

SELECT per.name, p.title, p.pubid
FROM publications p, joins j, persons per
WHERE p.pubid = j.pubid
AND j.personid = per.personid
AND per.name LIKE '%J%Smith%'
AND p.title LIKE '%SQL%'
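
The three-step search and the single combined query can be tried side by side with the sketch below. It uses Python's `sqlite3` in place of PostgreSQL, so `LIKE` (case-insensitive for ASCII in SQLite) stands in for `ILIKE`. The pattern `'%J%Smith%'` is an assumption about how the search term "J Smith" is turned into a wildcard pattern, and the publication rows, the publisher name and the person "Jane Doe" are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (personid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE publications (
    pubid INTEGER PRIMARY KEY, title TEXT, publisher TEXT, year INTEGER
);
CREATE TABLE joins (pubid INTEGER, personid INTEGER);
INSERT INTO persons VALUES
    (20, 'J. A. Smith'), (21, 'Harry J. Smith'), (22, 'Jane Doe');
INSERT INTO publications VALUES
    (1002, 'An Introduction to SQL', 'Example Press', 1999),
    (1003, 'Compiler Design', 'Example Press', 1998);
INSERT INTO joins VALUES (1002, 20), (1003, 20), (1003, 22);
""")

# Step 1: find the people whose names match the search term.
people = conn.execute(
    "SELECT personid, name FROM persons WHERE name LIKE ? ORDER BY personid",
    ("%J%Smith%",),
).fetchall()

# Step 2: list the publications of the person the user picked (personid 20).
pubs = conn.execute(
    "SELECT p.pubid, title FROM publications p, joins j "
    "WHERE p.pubid = j.pubid AND j.personid = ? ORDER BY p.pubid",
    (20,),
).fetchall()

# Step 3: fetch the details of the publication the user picked (pubid 1002).
detail = conn.execute(
    "SELECT title, publisher, year FROM publications WHERE pubid = ?",
    (1002,),
).fetchone()

# Alternative: author and title criteria in a single three-way join.
combined = conn.execute(
    "SELECT per.name, p.title, p.pubid "
    "FROM publications p, joins j, persons per "
    "WHERE p.pubid = j.pubid AND j.personid = per.personid "
    "AND per.name LIKE ? AND p.title LIKE ?",
    ("%J%Smith%", "%SQL%"),
).fetchall()

print(people, pubs, detail, combined, sep="\n")
```

The stepwise form needs one round trip per user click, while the combined form answers an author-plus-title search in one query.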

29
Advanced Search

Works in a similar way to basic search
Can now enter many authors
Can now enter more criteria to search for

30
Advanced Search (cont.)

Search for
Authors: McBrien, Poulovassilis
Title: Databases
Year: 1999

31
Advanced Search (cont.)

Find all publications for each author.
SELECT p.pubid
FROM publications p, joins j, persons s
WHERE p.pubid = j.pubid
AND j.personid = s.personid
AND s.name ILIKE '%McBrien%'

32
Advanced Search (cont.)

and
SELECT p.pubid
FROM publications p, joins j, persons s
WHERE p.pubid = j.pubid
AND j.personid = s.personid
AND s.name ILIKE '%Poulovassilis%'
then find the intersection of the two sets of
pubids

33
Advanced Search (cont.)

then check the extra constraints
SELECT title
FROM publications p, joins j, persons s
WHERE p.pubid = j.pubid
AND j.personid = s.personid
AND (p.pubid = 1 OR p.pubid = ... OR ...)
AND p.title ILIKE '%Databases%'
AND p.year = 1999
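
The advanced search steps above can be sketched end to end as follows: one query per author, an intersection of the resulting pubid sets, then a final query that applies the title and year constraints. The sketch uses Python and `sqlite3` instead of Perl and PostgreSQL (`LIKE` standing in for `ILIKE`). The helper names `pubids_for_author` and `advanced_search`, the use of `IN (...)` in place of the chain of `OR` conditions, and all sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE persons (personid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE publications (pubid INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE joins (pubid INTEGER, personid INTEGER);
INSERT INTO persons VALUES (1, 'McBrien'), (2, 'Poulovassilis');
INSERT INTO publications VALUES
    (1, 'Databases Paper A', 1999),   -- both authors, matches everything
    (2, 'Databases Paper B', 1998),   -- both authors, wrong year
    (3, 'Graphs Paper C', 1999);      -- one author only
INSERT INTO joins VALUES (1, 1), (1, 2), (2, 1), (2, 2), (3, 1);
""")


def pubids_for_author(conn, surname):
    """Return the set of pubids that have an author matching surname."""
    rows = conn.execute(
        "SELECT p.pubid FROM publications p, joins j, persons s "
        "WHERE p.pubid = j.pubid AND j.personid = s.personid "
        "AND s.name LIKE ?",
        ("%" + surname + "%",),
    )
    return {row[0] for row in rows}


def advanced_search(conn, surnames, title, year):
    """Titles written by all the given authors that match title and year."""
    if not surnames:
        raise ValueError("at least one author is required")
    # Intersect the per-author pubid sets.
    candidates = set.intersection(*[pubids_for_author(conn, s) for s in surnames])
    if not candidates:
        return []
    # Apply the extra constraints to the surviving pubids only.
    marks = ",".join("?" * len(candidates))
    rows = conn.execute(
        "SELECT title FROM publications "
        f"WHERE pubid IN ({marks}) AND title LIKE ? AND year = ? "
        "ORDER BY title",
        (*sorted(candidates), "%" + title + "%", year),
    )
    return [row[0] for row in rows]


matches = advanced_search(conn, ["McBrien", "Poulovassilis"], "Databases", 1999)
print(matches)
```

The final query here reads only the `publications` table, since the author matching has already been done when the candidate pubids were collected.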

34
Testing

Two main types of testing
User testing
Performance testing

35
User Testing

Visual Impact
Colour scheme
Tidy, less cluttered
Navigation
Simple
Links on every page

36
User Testing (cont.)

Browsing
More complete than existing site
Downloading
Speed
Different formats

37
Performance Testing

No pre-existing performance benchmarks
Site performance is time-dependent
All performance tests were comparative
New site is faster for more complex searches,
slightly slower for very simple ones
Overall performance does not suffer as a result
of extra functionality

38
Conclusions

Both main aims of the project were met
the relational database structure was fully
implemented, although it was still impossible to
eradicate redundancy entirely
extra search tools and output formats were
successfully implemented with no downsides
resulting from the new implementation
