CS179G Team06 - PowerPoint PPT Presentation

Provided by: yunc

Transcript and Presenter's Notes


1
CS179G Team06
  • Carina Yun, Andrew Luby, Marc Sherry

2
Introduction
  • Create a database from an XML file
  • Parse the XML file using Xerces-C
  • Store the parsed data into a database
  • Add new versions by parsing a change log file
  • Retrieve previous versions (version
    reconstruction) and user-requested information
    about specific data (structural joins)

3
An example Food.xml
  <?xml version="1.0" encoding="ISO8859-1" ?>
  <breakfast-menu>This is breakfast menu
    <food>
      <name>Belgian Waffles</name>
      <price>5.95</price>
      <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
      <calories>650</calories>
    </food>
    <food>
      <name>Homestyle Breakfast</name>
      <price>6.95</price>
      <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
      <calories>950</calories>
    </food>
  </breakfast-menu>

4
Tree Numbering
  • As we parsed the XML file, we placed each element
    in a tree and assigned it a (left, right) number
    pair

breakfast-menu (100,2200): This is breakfast menu
  food (200,1100)
    name (300,400): Belgian Waffles
    price (500,600): 5.95
    description (700,800): two of our famous Belgian Waffles with plenty of real maple syrup
    calories (900,1000): 650
  food (1200,2100)
    name (1300,1400): Homestyle Breakfast
    price (1500,1600): 6.95
    description (1700,1800): two eggs, bacon or sausage, toast, and our ever-popular hash browns
    calories (1900,2000): 950
5
Numbering Scheme
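To number a new node, take 20 percent of the range
between the two neighboring nodes. For example,
to insert a node between 600 and 700:
  • range = 700 - 600 = 100, and 20% of 100 is 20
  • num1 = 600 + 20 = 620
  • num2 = num1 + 20% of the leftover range
    = 620 + (700 - 620) x 20% = 620 + 16 = 636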
[Figure: the tree from slide 4 with the new node (620,636) inserted between (500,600) and (700,800).]
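The arithmetic on this slide can be sketched as a small helper. This is only an illustration of the 20 percent rule as worked above; the function name `numberNewNode` and its parameters are hypothetical and do not come from the team's code.

```cpp
#include <utility>

// Hypothetical helper: choose (num1, num2) for a node inserted into the open
// gap between `left` and `right`, taking 20 percent of the remaining range
// at each step, as in the slide's worked example.
std::pair<int, int> numberNewNode(int left, int right) {
    const int percent = 20;
    int num1 = left + (right - left) * percent / 100;  // 600 + 20% of 100 = 620
    int num2 = num1 + (right - num1) * percent / 100;  // 620 + 20% of 80  = 636
    return std::make_pair(num1, num2);
}
```

Calling `numberNewNode(600, 700)` reproduces the pair (620,636) from the slide. Taking only a fraction of the gap leaves room on both sides for later insertions without renumbering existing nodes.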
6
STL Maps and Lists
[Figure: the parsed values held in STL maps and lists: "This is breakfast menu"; Belgian Waffles, 5.95, "two of our famous Belgian Waffles with plenty of real maple syrup", 650; Homestyle Breakfast, 6.95, "two eggs, bacon or sausage, toast, and our ever-popular hash browns", 950.]
7
ListElem
  • ListElem is the container for our XML data
  • It contains:
    • num1: the left number in the tree
    • num2: the right number in the tree
    • start: the start version of usefulness
    • end: the end version of usefulness
    • value: the characters between the start and end
      tags in the XML if they are 6 characters or
      fewer; otherwise two pointers, one to a page and
      one to an offset
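A minimal sketch of what such a container might look like in C++ follows. The field names num1, num2, start, and end come from the slide; the `DataPointer` type, the `isInline` flag, the `fitsInline` helper, and the reading of end = -1 as "still current" are assumptions made for illustration, and the integer widths follow the byte sizes given on the Database Structure slide.

```cpp
#include <cstdint>
#include <string>

// Assumed layout for the "two pointers": where a long value lives.
struct DataPointer {
    std::int32_t page;    // data page number
    std::int32_t offset;  // byte offset within that page
};

struct ListElem {
    std::int32_t num1;   // left number in the tree
    std::int32_t num2;   // right number in the tree
    std::int16_t start;  // first version in which the element is useful
    std::int16_t end;    // last useful version; -1 assumed to mean "still current"
    bool isInline;       // true: `text` holds the value; false: see `location`
    std::string text;    // short values (6 characters or fewer)
    DataPointer location;  // long values: page and offset of the data
};

// Values of 6 characters or fewer are kept inline, as the slide states.
inline bool fitsInline(const std::string& value) {
    return value.size() <= 6;
}
```

With this rule "5.95" and "650" stay inside the element, while "Belgian Waffles" (15 characters) is stored elsewhere and referenced by page and offset.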

8
Database Structure
  • Our ListElem has 16 bytes of ints: startversion,
    endversion, num1, num2, and size. Each ListElem
    also holds a string, which takes up a varying
    amount of space (for example, "Belgian Waffles"
    takes up less space than "Our delicious waffles
    covered in heavy maple syrup")
  • num1 = 300                 // left number - 4 bytes
  • num2 = 400                 // right number - 4 bytes
  • start = 0                  // start version - 2 bytes
  • end = -1                   // end version - 2 bytes
  • value = "Belgian Waffles"  // XML data - 15 bytes
  • size = 31                  // 16 bytes for ints + 15 bytes
                               // for "Belgian Waffles" - 4 bytes
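The size calculation on this slide can be checked with a few lines of C++. This is a sketch only: the names `kHeaderBytes` and `listElemSize` are hypothetical, and the fixed-width types are chosen to match the byte counts listed above.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Fixed part of a ListElem: num1 (4) + num2 (4) + start (2) + end (2) + size (4).
const std::size_t kHeaderBytes =
    2 * sizeof(std::int32_t) + 2 * sizeof(std::int16_t) + sizeof(std::int32_t);

// Total bytes for an element: the 16 bytes of ints plus the value's characters.
std::size_t listElemSize(const std::string& value) {
    return kHeaderBytes + value.size();
}
```

For "Belgian Waffles" this gives 16 + 15 = 31 bytes, the figure shown in the size field above.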

9
Database Structure
  • RECORDSIZE is a constant we set at the
    beginning that tells us how much data fits in a
    single record in the DB. We use 22 bytes, but we
    can change it to whatever we like.
  • From the last slide, we have 31 bytes in our
    ListElem. This is too big to fit into one record,
    so we split the ListElem across two types of
    pages. There is one set of pages to hold the
    ints, and one set to hold the data.

page 450 (ints): 300 0 501 400 -1, offset 0
page 501 (data): Belgian Waffles
10
Database Structure
  • The database has two types of pages: record
    pages and data pages

Record page:
  300 0 501 400 -1, offset 0
  500 0 5.95 600 -1
  700 0 501 800 -1, offset 15
  900 0 650 1000 -1

Data page 501 (values stored back to back):
  offset 0:  Belgian Waffles
  offset 15: two of our famous Belgian Waffles with plenty of real maple syrup
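The way long values share a data page can be sketched as follows. The `DataPage` type and its `append` and `read` members are hypothetical names for illustration; the slides only show that the description starts at offset 15, directly after the 15 characters of "Belgian Waffles" on page 501.

```cpp
#include <cstddef>
#include <string>

// Hypothetical data page: long values are appended back to back, and each
// record page entry remembers the page number and the returned offset.
struct DataPage {
    int pageNumber;
    std::string bytes;

    // Appends `value` and returns the offset at which it starts.
    std::size_t append(const std::string& value) {
        std::size_t offset = bytes.size();
        bytes += value;
        return offset;
    }

    // Reads `length` bytes starting at `offset`.
    std::string read(std::size_t offset, std::size_t length) const {
        return bytes.substr(offset, length);
    }
};
```

Appending "Belgian Waffles" to an empty page returns offset 0, and appending the description next returns offset 15, matching the two pointer records shown above.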

11
Version Clustering
  • Version clustering takes into consideration
    the usefulness threshold of a given page. When a
    page is no longer useful, the still-useful
    elements in the page are copied to a new page.
  • For example, if our usefulness threshold is
    set to 50%, then when a page is less than 50%
    useful its still-useful elements are copied to a
    new page

At version 5:
  Old page: (V0-V2), (V1-V4), (V5-V8), (V1-V2)
    This page is not useful because it is 25% useful, which is < 50%
  New page: (V5-V8)
    This is the new page. It is 100% useful
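The usefulness check can be sketched as below. The names `Lifespan`, `usefulness`, and `clusterIfStale` are hypothetical; the sketch also assumes version ranges are inclusive and that an end of -1 marks an element that is still current, neither of which the slides state outright.

```cpp
#include <cstddef>
#include <vector>

struct Lifespan {
    int start;  // first version in which the element is useful
    int end;    // last useful version (inclusive); -1 assumed to mean "still current"
};

bool isLive(const Lifespan& e, int version) {
    return e.start <= version && (e.end == -1 || version <= e.end);
}

// Percentage of the page's elements that are still useful at `version`.
int usefulness(const std::vector<Lifespan>& page, int version) {
    if (page.empty()) return 0;
    int live = 0;
    for (std::size_t i = 0; i < page.size(); ++i)
        if (isLive(page[i], version)) ++live;
    return live * 100 / static_cast<int>(page.size());
}

// If the page has dropped below the threshold, return a new page holding only
// the still-useful elements; otherwise return an empty vector (nothing copied).
std::vector<Lifespan> clusterIfStale(const std::vector<Lifespan>& page,
                                     int version, int thresholdPercent) {
    std::vector<Lifespan> fresh;
    if (usefulness(page, version) >= thresholdPercent) return fresh;
    for (std::size_t i = 0; i < page.size(); ++i)
        if (isLive(page[i], version)) fresh.push_back(page[i]);
    return fresh;
}
```

For the page in the figure, only (V5-V8) is live at version 5, so the page is 25% useful and that single element is copied to a new page, which is then 100% useful.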
12
Changelog Parsing
  • V1
  • delete breakfast-menu0/food0/price0

  <?xml version="1.0" encoding="ISO8859-1" ?>
  <breakfast-menu>This is breakfast menu
    <food>
      <name>Belgian Waffles</name>
      <price>5.95</price>   <!-- this line is deleted -->
      <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
      <calories>650</calories>
    </food>
    <food>
      <name>Homestyle Breakfast</name>
      <price>6.95</price>
      <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
      <calories>950</calories>
    </food>
  </breakfast-menu>
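One way to read a changelog path such as breakfast-menu0/food0/price0 is as a list of (tag, index) steps, where the trailing digits select the n-th child with that tag. The sketch below assumes exactly that format, inferred from the single example on this slide; `PathStep` and `parsePath` are hypothetical names, and tag names that themselves end in digits would be ambiguous under this reading.

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct PathStep {
    std::string tag;  // element name, e.g. "food"
    int index;        // which sibling with that tag, counted from 0
};

// Splits "breakfast-menu0/food0/price0" into steps. Each step is assumed to
// be a tag name followed by a decimal sibling index.
std::vector<PathStep> parsePath(const std::string& path) {
    std::vector<PathStep> steps;
    std::istringstream in(path);
    std::string segment;
    while (std::getline(in, segment, '/')) {
        std::size_t split = segment.size();
        while (split > 0 &&
               std::isdigit(static_cast<unsigned char>(segment[split - 1]))) {
            --split;
        }
        // Skip malformed steps that lack either a tag or an index.
        if (split == 0 || split == segment.size()) continue;
        PathStep step;
        step.tag = segment.substr(0, split);
        step.index = std::stoi(segment.substr(split));
        steps.push_back(step);
    }
    return steps;
}
```

Once the target element is located, a delete would presumably close its version range by setting its end version, rather than removing the element, so that earlier versions can still be reconstructed.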

13
Sorting
  • Used by both version reconstruction and
    structural joins
  • Sort merging: merging sorted pages together
  • We take 1024 pages per element, sort them one
    page at a time, and write them to a buffer in
    memory. Once the buffer is full, we clear the
    buffer and continue.

[Figure: unsorted pages of type food (left numbers 500, 400, 700, 800, 300, 100, 600, 200) are sorted over repeated passes (1st time, 2nd time, 3rd time, ..., Nth time) and written out to the database one page at a time. Each number represents the left number in a ListElem.]
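The sort-merge idea can be illustrated with an in-memory stand-in: sort each page on its own, then merge the sorted pages by repeatedly taking the smallest left number. This is a generic k-way merge sketch, not the team's buffered implementation; `Page` and `sortMerge` are hypothetical names, and real pages would be read from and written back to the database rather than held in vectors.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

typedef std::vector<int> Page;  // left numbers (num1) of the elements on one page

std::vector<int> sortMerge(std::vector<Page> pages) {
    // Step 1: sort each page individually.
    for (std::size_t p = 0; p < pages.size(); ++p)
        std::sort(pages[p].begin(), pages[p].end());

    // Step 2: k-way merge using a min-heap of (value, page index, position).
    typedef std::tuple<int, std::size_t, std::size_t> Entry;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry> > heap;
    for (std::size_t p = 0; p < pages.size(); ++p)
        if (!pages[p].empty()) heap.push(Entry(pages[p][0], p, 0));

    std::vector<int> merged;
    while (!heap.empty()) {
        Entry top = heap.top();
        heap.pop();
        merged.push_back(std::get<0>(top));
        std::size_t p = std::get<1>(top);
        std::size_t next = std::get<2>(top) + 1;
        if (next < pages[p].size()) heap.push(Entry(pages[p][next], p, next));
    }
    return merged;
}
```

Feeding in two pages holding the eight left numbers from the figure yields 100 through 800 in ascending order, which is the order version reconstruction and structural joins rely on.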
14
Version Reconstruction
[Figure: sorted pages in the database for the Food, Calories, and Price elements; the pages holding 100, 200, 300 and 400, 500, 600 are shown loaded into the array of pages in memory.]
Looking at the left number, find the lowest
numbered element, and print out the tag, value,
and closing tag. This is the XML output.
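A simplified sketch of this step is shown below. It filters the elements to those useful at the requested version, orders them by left number, and prints tags in that order; closing tags for enclosing elements are emitted once the next left number falls outside their range. The `Node` type and `reconstruct` function are hypothetical, a plain sort stands in for the page-by-page merge, and the sketch assumes inclusive version ranges with end = -1 meaning "still current".

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Node {
    int num1, num2;     // left and right numbers
    int start, end;     // useful versions; end == -1 assumed "still current"
    std::string tag;
    std::string value;
};

static bool byLeft(const Node& a, const Node& b) { return a.num1 < b.num1; }

std::string reconstruct(const std::vector<Node>& all, int version) {
    // Keep only the elements that are useful at the requested version.
    std::vector<Node> nodes;
    for (std::size_t i = 0; i < all.size(); ++i) {
        const Node& n = all[i];
        if (n.start <= version && (n.end == -1 || version <= n.end))
            nodes.push_back(n);
    }
    std::sort(nodes.begin(), nodes.end(), byLeft);  // lowest left number first

    std::string xml;
    std::vector<const Node*> open;  // elements whose closing tag is pending
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        // Close every open element that ends before this one begins.
        while (!open.empty() && open.back()->num2 < nodes[i].num1) {
            xml += "</" + open.back()->tag + ">";
            open.pop_back();
        }
        xml += "<" + nodes[i].tag + ">" + nodes[i].value;
        open.push_back(&nodes[i]);
    }
    while (!open.empty()) {
        xml += "</" + open.back()->tag + ">";
        open.pop_back();
    }
    return xml;
}
```

With the price element's range ended at version 0, reconstructing version 1 omits the price while version 0 still includes it.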
15
Structural Joins
  • Structural joins allow the user to request
    all the data of a given type nested under another
    type. For example, the user may want the names of
    all the foods in the breakfast menu, or the
    prices of all the foods on the menu.

  <food>
    <name>Belgian Waffles</name>
    <price>5.95</price>
    <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
    <calories>650</calories>
  </food>
  <food>
    <name>Homestyle Breakfast</name>
    <price>6.95</price>
    <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
    <calories>950</calories>
  </food>

Structural joins allow the retrieval of a given
type of data:
  • name: Belgian Waffles, Homestyle Breakfast
  • price: 5.95, 6.95
16
Structural Joins
  • Sort the elements
  • Scan the elements of the given parent type
  • For each parent found, scan the elements of the
    given descendant type
  • If the descendant's numbers fall within the range
    of the parent, then it is a descendant
17
Performance
  • Parse: 6 seconds
  • Create DB: 4 seconds
  • Version reconstruction: 24 seconds

18
Conclusion
  • We learned how to work with Berkeley DB and
    Xerces-C to parse an XML file into a database.
  • We also learned how to create a versioning system
    by reading in a change log and reconstructing the
    versions.