Group Project Presentation: Database of Online Publications

Provided by: eionmu
1
Group Project Presentation: Database of Online Publications
2
Introduction
  • The DBLP, or Database of on-line Library
    Publications, is a catalogue of computing-related
    publications from the last 50 years
  • It is a resource for finding source material and
    references used by academics world-wide

3
Demonstration
  • There will now be a demonstration of both the
    existing site and our new site

4
Aims
  • The aim of this project is twofold
  • Firstly, to convert data stored in an XML format
    to a relational database schema
  • Secondly, to provide a toolset for accessing the
    data that is superior to that which exists for
    the XML records
  • The data is to be accessed using a web interface,
    as is currently the case

5
The Current DBLP Site
  • As we do not have any access to the current DBLP
    site over and above normal web access and the
    ability to download the XML, we have no way of
    knowing exactly how the current site operates
  • As far as we were able to determine, the current
    site works by using forms to search against a
    variety of indices of the XML files; search
    results are then passed back

6
The DBLP Files
  • These files contain the raw XML data. The
    back-end of the current site runs off these
    files, in combination with various indexing
    strategies.
  • The DBLP records are available for download
  • They consist of over 235,000 small XML files,
    each containing a single publication
  • Publications can be theses, books, journal
    articles or conference papers

7
What is XML?
  • XML is a universal format for structured
    documents and data
  • It uses plain text files and is designed to be
    easily parsed by simple programs
  • XML is being developed by the World Wide Web
    Consortium

8
Example DBLP File
  • <book key="books/oreilly/OReillyMG99">
  • <author>Tim O'Reilly</author>
  • <author>Troy Mott</author>
  • <author>Walter Glenn</author>
  • <title>Windows 98 in a Nutshell</title>
  • <publisher>O'Reilly</publisher>
  • <year>1999</year>
  • <isbn>1-56592-486-X</isbn>
  • </book>

9
How do we read the Data?
  • There were two main choices of technology to use
  • Java and JAXP (Java API for XML Processing)
  • Perl and one of the Perl-XML parsers
  • We chose to use Perl, as we were more familiar
    with the language

10
Reading the Data
  • The DTD (Document Type Definition) was used to
    cater for all the possible data fields in the
    files
  • A script was created to read all the data into
    Perl data structures, and to recursively process
    all the files from DBLP
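
The reading step can be sketched roughly as follows. This is only an illustration in Python rather than the Perl script used in the project: the function names, the `.xml` file suffix and the choice of which fields may repeat are all assumptions, and the sketch does not load the DTD, so records using entities defined there would need extra handling.

```python
import os
import xml.etree.ElementTree as ET

# Fields that may occur more than once in a record (an assumption
# based on the example file, not taken from the DTD).
REPEATED_FIELDS = {"author", "editor"}


def parse_record(xml_data):
    """Turn one DBLP-style XML record (str or bytes) into a dictionary."""
    root = ET.fromstring(xml_data)
    record = {"type": root.tag, "key": root.get("key")}
    for child in root:
        text = (child.text or "").strip()
        if child.tag in REPEATED_FIELDS:
            record.setdefault(child.tag, []).append(text)
        else:
            record[child.tag] = text
    return record


def read_all_records(top_dir):
    """Recursively yield one dictionary per .xml file under top_dir."""
    for dirpath, _dirnames, filenames in os.walk(top_dir):
        for name in sorted(filenames):
            if name.endswith(".xml"):
                # Read bytes so the parser can honour the file's own encoding.
                with open(os.path.join(dirpath, name), "rb") as handle:
                    yield parse_record(handle.read())


SAMPLE = """<book key="books/oreilly/OReillyMG99">
<author>Tim O'Reilly</author>
<author>Troy Mott</author>
<author>Walter Glenn</author>
<title>Windows 98 in a Nutshell</title>
<publisher>O'Reilly</publisher>
<year>1999</year>
<isbn>1-56592-486-X</isbn>
</book>"""

record = parse_record(SAMPLE)
print(record["title"], record["author"])
```

Keeping authors in a list preserves their order, which matters when the people are later linked to the publication.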

11
Now the Data has been Read
  • With the data read and the database design
    completed, it was simply a matter of adding the
    appropriate routines to insert the data into the
    database

12
Database Schema Design
  • Data was divided into separate tables
  • Publications: publication data (e.g. titles)
  • Persons: people (authors, editors)
  • Joins: links between people and publications
  • Links: external links for publications

13
Database Schema Reasoning
  • To avoid redundancy, store people separately from
    publications
  • To establish the required many-many relationship
    between people and publications use an
    intermediary table
  • To store links more efficiently, use a separate
    table
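
A minimal sketch of such a schema is given below, using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The table names and the columns pubid, personid, name, title, publisher and year come from the slides; the column types, the `url` column, the UNIQUE constraint on names and the helper `insert_publication` are assumptions made for illustration. The second publication is an invented row.

```python
import sqlite3

SCHEMA = """
CREATE TABLE publications (
    pubid     INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    publisher TEXT,
    year      INTEGER
);
CREATE TABLE persons (
    personid INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
-- Intermediary table for the many-to-many relationship.
CREATE TABLE joins (
    pubid    INTEGER NOT NULL REFERENCES publications(pubid),
    personid INTEGER NOT NULL REFERENCES persons(personid),
    PRIMARY KEY (pubid, personid)
);
-- External links, kept apart from the publication rows.
CREATE TABLE links (
    pubid INTEGER NOT NULL REFERENCES publications(pubid),
    url   TEXT NOT NULL
);
"""


def insert_publication(conn, title, publisher, year, authors):
    """Insert one publication, reusing an existing person row per name."""
    cur = conn.execute(
        "INSERT INTO publications (title, publisher, year) VALUES (?, ?, ?)",
        (title, publisher, year),
    )
    pubid = cur.lastrowid
    for name in authors:
        # Each person is stored once, however many publications they have.
        conn.execute("INSERT OR IGNORE INTO persons (name) VALUES (?)", (name,))
        (personid,) = conn.execute(
            "SELECT personid FROM persons WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT INTO joins (pubid, personid) VALUES (?, ?)", (pubid, personid)
        )
    return pubid


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
insert_publication(conn, "Windows 98 in a Nutshell", "O'Reilly", 1999,
                   ["Tim O'Reilly", "Troy Mott", "Walter Glenn"])
# Invented second row, to show that an author is not stored twice.
insert_publication(conn, "A Second Example Book", "O'Reilly", 2000,
                   ["Tim O'Reilly"])
print(conn.execute("SELECT COUNT(*) FROM persons").fetchone()[0])
```

Matching people purely by name merges distinct people who share a name, which is one reason redundancy and ambiguity cannot be removed completely.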

14
XML vs. PostgreSQL
  • XML is a flat-file format, so there is no way to
    avoid data redundancy
  • Extra tools are needed to index and search
    records
  • XML provides no built-in type-checking
  • Because there are no relationships, it is easy to
    set up publications with multiple authors

15
XML vs. PostgreSQL (cont.)
  • PostgreSQL:
  • avoids data redundancy.
  • has search and indexing built-in
  • has type-checking of records
  • makes it harder to set up publications with
    multiple authors

16
Site Implementation
  • The new design of the site works with three main
    components
  • A dynamically generated HTML front end for
    displaying content and results
  • A Perl DBI (Database Interface) for data
    retrieval and searching
  • A PostgreSQL back end database

17
Site Implementation (cont.)
  • The front end makes requests to the Perl DBI
    using forms.
  • The Perl DBI processes the form input and
    converts it to SQL.
  • The SQL runs on the PostgreSQL back end.
  • Results are then returned to the front end via
    the DBI and displayed in an appropriate format.
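
The form-to-SQL step might look something like the sketch below. It is written in Python rather than Perl, and the form field name `author` and both function names are hypothetical. The user's text is passed as a bound parameter (the `?` placeholder style, also used by Perl DBI) rather than pasted into the SQL string.

```python
def name_pattern(user_input):
    """Build an ILIKE pattern that matches the typed words in order."""
    words = user_input.split()
    if not words:
        return "%"
    return "%" + "%".join(words) + "%"


def build_person_query(form):
    """Return (sql, params) for a form dictionary with an 'author' field."""
    sql = "SELECT personid, name FROM persons WHERE name ILIKE ?"
    return sql, (name_pattern(form.get("author", "")),)


sql, params = build_person_query({"author": "J Smith"})
print(sql)
print(params)
```

A fuller version would also escape any `%` or `_` characters typed by the user, since both are wildcards in a pattern.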

18
The User Interface
  • Current DBLP site:
  • cluttered
  • uses multiple pages for similar functions, which
    could be on a single page
  • Our aim:
  • to provide better functionality
  • to make the site user-friendly

19
Our Design
  • We decided that our design should utilise colour
    and layout
  • Our first design used frames, but these were
    removed in the third revision
  • We decided to have a single Basic Search page,
    and a single Advanced Search page

20
Simplifying the Site
  • Incorporated Frequently Asked Questions into a
    single page rather than spreading them over many
  • Incorporated searches into a single section
  • We are using only standard HTML and CSS, to allow
    any browser to be used to view the site

21
Search Strategy
  • Searches need to be
  • Correct
  • Efficient
  • Fast

22
Basic Search
  • Search for author: J Smith

23
Basic Search (cont.)
  • Step 1:
  • SELECT personid, name
  • FROM persons
  • WHERE name ILIKE '%J%Smith%'

24
Basic Search (cont.)
  • Results
  • J. A. Smith
  • Harry J. Smith
  • J MacGregor-Smith
  • John Miles Smith

25
Basic Search (cont.)
  • Step 2: using the personid
  • SELECT p.pubid, title
  • FROM publications p, joins j
  • WHERE p.pubid = j.pubid
  • AND j.personid = 20

26
Basic Search (cont.)
  • Step 3: using the pubid
  • SELECT title, publisher, year
  • FROM publications
  • WHERE pubid = 1002

27
Basic Search (cont.)
  • Alternatives:
  • SELECT pubid, title
  • FROM publications
  • WHERE title ILIKE '%SQL%'

28
Basic Search (cont.)
  • SELECT per.name, p.title, p.pubid
  • FROM publications p, joins j, persons per
  • WHERE p.pubid = j.pubid
  • AND j.personid = per.personid
  • AND per.name LIKE '%J%Smith%'
  • AND p.title LIKE '%SQL%'
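
The combined author-and-title query can be tried out on a few rows as sketched below. SQLite (through Python's sqlite3 module) stands in for PostgreSQL here; it has no ILIKE, but its LIKE ignores case for ASCII letters. The two Smith names are taken from the results slide; the third person, the titles, the id values and the `%J%Smith%` and `%SQL%` patterns are invented or assumed for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE publications (pubid INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE persons (personid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE joins (pubid INTEGER, personid INTEGER);
""")
# Illustrative rows only; these are not real DBLP records.
conn.executemany("INSERT INTO persons VALUES (?, ?)", [
    (1, "J. A. Smith"),
    (2, "Harry J. Smith"),
    (3, "A. N. Other"),
])
conn.executemany("INSERT INTO publications VALUES (?, ?)", [
    (1001, "An Introduction to SQL"),
    (1002, "Windows 98 in a Nutshell"),
    (1003, "SQL Tuning Notes"),
])
conn.executemany("INSERT INTO joins VALUES (?, ?)", [
    (1001, 1), (1002, 3), (1003, 2), (1003, 3),
])

COMBINED = """
SELECT per.name, p.title, p.pubid
FROM publications p, joins j, persons per
WHERE p.pubid = j.pubid
  AND j.personid = per.personid
  AND per.name LIKE ?
  AND p.title LIKE ?
ORDER BY p.pubid
"""


def basic_search(conn, author_pattern, title_pattern):
    """Run the single-join search and return (name, title, pubid) rows."""
    return conn.execute(COMBINED, (author_pattern, title_pattern)).fetchall()


rows = basic_search(conn, "%J%Smith%", "%SQL%")
for name, title, pubid in rows:
    print(pubid, name, title)
```

The ORDER BY clause is an addition that makes the output order predictable; the query on the slide does not specify one.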

29
Advanced Search
  • Works in a similar way to basic search
  • Can now enter many authors
  • Can now enter more criteria to search for

30
Advanced Search (cont.)
  • Search for:
  • Authors: McBrien, Poulovassilis
  • Title: Databases
  • Year: 1999

31
Advanced Search (cont.)
  • Find all publications for each author:
  • SELECT p.pubid FROM publications p, joins
    j, persons s
  • WHERE p.pubid = j.pubid
  • AND j.personid = s.personid
  • AND s.name ILIKE '%McBrien%'

32
Advanced Search (cont.)
  • and:
  • SELECT p.pubid FROM publications p, joins
    j, persons s
  • WHERE p.pubid = j.pubid
  • AND j.personid = s.personid
  • AND s.name ILIKE '%Poulovassilis%'
  • then find the intersection of the two sets of
    pubids

33
Advanced Search (cont.)
  • then check the extra constraints:
  • SELECT title
  • FROM publications p, joins j, persons s
  • WHERE p.pubid = j.pubid
  • AND j.personid = s.personid
  • AND (p.pubid = 1 OR p.pubid = ... OR ...)
  • AND p.title ILIKE '%Databases%'
  • AND p.year = 1999
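
The advanced search procedure (one pubid set per author, their intersection, then the extra constraints) can be sketched as follows. This is an illustration in Python with SQLite standing in for PostgreSQL, so LIKE is used where the slides use ILIKE. The function names, the sample rows and the use of `IN (...)` in place of the chain of ORs are assumptions; the titles are invented and are not real publications.

```python
import sqlite3


def pubids_for_author(conn, name):
    """Return the set of pubids whose authors include a matching name."""
    rows = conn.execute(
        "SELECT p.pubid FROM publications p, joins j, persons s "
        "WHERE p.pubid = j.pubid AND j.personid = s.personid "
        "AND s.name LIKE ?",
        (f"%{name}%",),
    )
    return {row[0] for row in rows}


def advanced_search(conn, authors, title, year):
    """Intersect the per-author pubid sets, then apply title and year."""
    sets = [pubids_for_author(conn, author) for author in authors]
    common = set.intersection(*sets) if sets else set()
    if not common:
        return []
    marks = ", ".join("?" for _ in common)
    sql = (
        f"SELECT title FROM publications WHERE pubid IN ({marks}) "
        "AND title LIKE ? AND year = ? ORDER BY pubid"
    )
    params = [*sorted(common), f"%{title}%", year]
    return [row[0] for row in conn.execute(sql, params)]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE publications (pubid INTEGER PRIMARY KEY, title TEXT, year INTEGER);
CREATE TABLE persons (personid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE joins (pubid INTEGER, personid INTEGER);
""")
# Invented rows for illustration only.
conn.executemany("INSERT INTO publications VALUES (?, ?, ?)", [
    (1, "Sample Paper on Databases", 1999),
    (2, "Sample Paper on Databases II", 1998),
    (3, "Sample Paper on Networks", 1999),
])
conn.executemany("INSERT INTO persons VALUES (?, ?)", [
    (1, "McBrien"), (2, "Poulovassilis"),
])
conn.executemany("INSERT INTO joins VALUES (?, ?)", [
    (1, 1), (1, 2), (2, 1), (2, 2), (3, 1),
])

titles = advanced_search(conn, ["McBrien", "Poulovassilis"], "Databases", 1999)
print(titles)
```

Because the author matching is already done when the pubid sets are built, the final query in this sketch only needs the publications table.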

34
Testing
  • Two main types of testing
  • User testing
  • Performance testing

35
User Testing
  • Visual Impact
  • Colour scheme
  • Tidy, less cluttered
  • Navigation
  • Simple
  • Links on every page

36
User Testing (cont.)
  • Browsing
  • More complete than existing site
  • Downloading
  • Speed
  • Different formats

37
Performance Testing
  • No pre-existing performance benchmarks
  • Site performance is time-dependent
  • All performance tests were comparative
  • New site is faster for more complex searches,
    slightly slower for very simple ones
  • Overall performance does not suffer as a result
    of extra functionality

38
Conclusions
  • Both main aims of the project were met
  • The relational database structure was fully
    implemented, although it is still impossible to
    eradicate redundancy
  • Extra search tools and output formats were
    successfully implemented, with no downsides
    resulting from the new implementation