Title: CS179G Team06
1CS179G Team06
- Carina Yun, Andrew Luby, Marc Sherry
2Introduction
- Create a database from an XML file
- Parse the XML file using Xerces-C
- Store the parsed data into a database
- Add new versions by parsing a change log file
- Recall previous versions and user-requested
information about certain data (version
reconstruction and structural joins)
3An example Food.xml
<?xml version="1.0" encoding="ISO8859-1" ?>
<breakfast-menu>This is breakfast menu
  <food>
    <name>Belgian Waffles</name>
    <price>5.95</price>
    <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
    <calories>650</calories>
  </food>
  <food>
    <name>Homestyle Breakfast</name>
    <price>6.95</price>
    <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
    <calories>950</calories>
  </food>
</breakfast-menu>
4Tree Numbering
- As we parsed the XML file, we placed its elements in a tree and gave
each node a (left, right) pair of numbers
breakfast-menu (100,2200): This is breakfast menu
  food (200,1100)
    name (300,400): Belgian Waffles
    price (500,600): 5.95
    description (700,800): two of our famous Belgian Waffles with plenty of real maple syrup
    calories (900,1000): 650
  food (1200,2100)
    name (1300,1400): Homestyle Breakfast
    price (1500,1600): 6.95
    description (1700,1800): two eggs, bacon or sausage, toast, and our ever-popular hash browns
    calories (1900,2000): 950
5Numbering Scheme
Take the range between two nodes and take 20 percent of it.
Example: range = 700 - 600 = 100, and 20% of 100 is 20.
So the new node's num1 = 600 + 20 = 620.
num2 is 20% of the leftover range, plus num1:
num2 = (700 - 620) * 20% + 620 = 16 + 620 = 636
breakfast-menu (100,2200): This is breakfast menu
  food (200,1100)
    name (300,400): Belgian Waffles
    price (500,600): 5.95
    New node (620,636)
    description (700,800): two of our famous Belgian Waffles with plenty of real maple syrup
    calories (900,1000): 650
  food (1200,2100)
    name (1300,1400): Homestyle Breakfast
    price (1500,1600): 6.95
    description (1700,1800): two eggs, bacon or sausage, toast, and our ever-popular hash browns
    calories (1900,2000): 950
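The arithmetic on this slide can be sketched as a small C++ function. The name `numberNewNode` and the use of integer division are assumptions made for illustration; only the 20 percent step and the worked example come from the slide.

```cpp
#include <utility>

// Hypothetical helper: numbers a node inserted into the gap between
// `left` (the right number of the node before it) and `right` (the left
// number of the node after it). Assumes a fixed 20 percent step and
// integer arithmetic, matching the slide's worked example.
std::pair<int, int> numberNewNode(int left, int right) {
    const int percent = 20;
    // num1: move 20% of the gap past the left neighbor.
    int num1 = left + (right - left) * percent / 100;
    // num2: move 20% of the leftover gap past num1.
    int num2 = num1 + (right - num1) * percent / 100;
    return {num1, num2};
}
```

Calling `numberNewNode(600, 700)` reproduces the slide's new node (620,636). Taking only a fraction of the gap each time leaves room for later insertions on either side.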
6STL Maps & Lists
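(Diagram: the values from Food.xml as held in the STL maps and lists)
- Menu text: This is breakfast menu
- Names: Belgian Waffles, Homestyle Breakfast
- Prices: 5.95, 6.95
- Descriptions: two of our famous Belgian Waffles with plenty of real maple syrup; two eggs, bacon or sausage, toast, and our ever-popular hash browns
- Calories: 650, 950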
7ListElem
- ListElem is the container for our XML data
- It contains:
  - num1: the left number in the tree
  - num2: the right number in the tree
  - start: the start version of usefulness
  - end: the end version of usefulness
  - value: the characters between the start and end tags in the XML if they are 6 characters or less; otherwise two pointers, one to a page and one to an offset
8Database Structure
- Our ListElem has 16 bytes of ints: startversion, endversion, num1, num2, and size.
- Each ListElem also has a string inside it, which takes up a varying amount of space (that is, "Belgian Waffles" takes up less space than "Our delicious waffles covered in heavy maple syrup").
- num1 = 300 // left number, 4 bytes
- num2 = 400 // right number, 4 bytes
- start = 0 // start version, 2 bytes
- end = -1 // end version, 2 bytes
- value = "Belgian Waffles" // XML data, 15 bytes
- size = (16 bytes for ints + 15 bytes for "Belgian Waffles") = 31 // 4 bytes
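As a rough illustration of the byte accounting above, here is a minimal C++ sketch. The field names follow the slides, but the concrete integer types, the constant `kIntBytes`, and the function `logicalSize` are assumptions chosen to match the stated byte counts, not the team's actual declaration.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Assumed layout: types are picked so the ints add up to 16 bytes,
// counting the 4-byte size field that is stored alongside them.
struct ListElem {
    std::int32_t num1;   // left number, 4 bytes
    std::int32_t num2;   // right number, 4 bytes
    std::int16_t start;  // start version, 2 bytes
    std::int16_t end;    // end version, 2 bytes (-1 in the slide's example)
    std::string value;   // XML character data, variable length
};

// 4 (num1) + 4 (num2) + 2 (start) + 2 (end) + 4 (size) = 16
constexpr std::size_t kIntBytes = 4 + 4 + 2 + 2 + 4;

// Size as counted on the slide: the fixed ints plus the string's bytes.
std::size_t logicalSize(const ListElem& e) {
    return kIntBytes + e.value.size();
}
```

For the element `{300, 400, 0, -1, "Belgian Waffles"}` this gives 16 + 15 = 31, the figure quoted on the slide. Note that `sizeof(ListElem)` would be a different number, since `std::string` keeps its characters elsewhere.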
9Database Structure
- RECORDSIZE is a constant we set at the beginning that tells us how much data is in a single record in the DB. We use 22 bytes, but we can change it to whatever we like.
- From the last slide, we have 31 bytes in our ListElem. This is too big to fit into one record, so we split it across two types of pages: one set of pages to hold the ints, and one set to hold the data.
- Page 450 (ints): 300 0 501 400 -1, offset 0
- Page 501 (data): Belgian Waffles
10Database Structure
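- The database has two types of pages: record pages and data pages
- Short values (5.95, 650) sit in the record itself; longer values are stored on data page 501 and referenced by page and offset
Record page:
  300 0 501 400 -1 (offset 0)
  500 0 5.95 600 -1
  700 0 501 800 -1 (offset 15)
  900 0 650 1000 -1
Data page 501:
  Belgian Wafflestwo of our famous Belgian Waffles with plenty of real maple syrup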
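The split between record pages and data pages might look roughly like the following C++ sketch. The types `Elem` and `Record`, the function `buildRecords`, and the constant `kInlineLimit` are hypothetical names; the 6-character inline limit and the page and offset pointers are taken from the slides, and real Berkeley DB I/O is left out entirely.

```cpp
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kInlineLimit = 6;  // values this short stay in the record

struct Elem {            // hypothetical input element
    int num1, num2;      // left and right tree numbers
    int start, end;      // version range
    std::string value;   // XML character data
};

struct Record {          // hypothetical record-page entry
    int num1, num2, start, end;
    bool isInline;            // true: inlineValue holds the text itself
    std::string inlineValue;  // used only when isInline is true
    int page;                 // data page number when not inline
    std::size_t offset;       // byte offset into that data page
};

// Builds one record per element. Long values are appended to `dataPage`
// (a stand-in for data page `dataPageNo`) and referenced by page and offset.
std::vector<Record> buildRecords(const std::vector<Elem>& elems,
                                 int dataPageNo, std::string& dataPage) {
    std::vector<Record> records;
    for (const Elem& e : elems) {
        Record r{e.num1, e.num2, e.start, e.end, true, "", 0, 0};
        if (e.value.size() <= kInlineLimit) {
            r.inlineValue = e.value;
        } else {
            r.isInline = false;
            r.page = dataPageNo;
            r.offset = dataPage.size();  // value starts where the page currently ends
            dataPage += e.value;
        }
        records.push_back(r);
    }
    return records;
}
```

Fed the four elements of the first food with data page number 501, the sketch stores "Belgian Waffles" at offset 0 and the description at offset 15, while 5.95 and 650 stay inline, which matches the layout shown on this slide.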
11Version Clustering
- Version clustering takes into consideration the usefulness threshold of a given page. When a page is no longer useful, the still-useful elements in the page are copied to a new page.
- For example, if our usefulness threshold is set to 50%, then when a page is less than 50% useful its still-useful elements are copied to a new page
Version 5
Old page: (V0-V2), (V1-V4), (V5-V8), (V1-V2)
This page is not useful because it is 25% useful, which is < 50%
New page: (V5-V8)
This is the new page. It is 100% useful
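The usefulness check can be sketched in C++ as below. The names `Lifespan`, `usefulness`, and `clusterIfNeeded` are hypothetical, and the sketch assumes closed version ranges as drawn in the diagram; it does not handle an open-ended end version such as -1.

```cpp
#include <vector>

// Versions in which an element is useful, both ends inclusive (assumption).
struct Lifespan {
    int start;
    int end;
};

bool usefulAt(const Lifespan& e, int version) {
    return e.start <= version && version <= e.end;
}

// Fraction of the page's elements that are useful at `version`.
double usefulness(const std::vector<Lifespan>& page, int version) {
    if (page.empty()) return 0.0;
    int alive = 0;
    for (const Lifespan& e : page)
        if (usefulAt(e, version)) ++alive;
    return static_cast<double>(alive) / page.size();
}

// If the page is below the threshold, returns a new page holding only the
// still-useful elements; otherwise returns an empty page (no copy needed).
std::vector<Lifespan> clusterIfNeeded(const std::vector<Lifespan>& page,
                                      int version, double threshold) {
    std::vector<Lifespan> fresh;
    if (usefulness(page, version) >= threshold) return fresh;
    for (const Lifespan& e : page)
        if (usefulAt(e, version)) fresh.push_back(e);
    return fresh;
}
```

With the page (V0-V2), (V1-V4), (V5-V8), (V1-V2) at version 5 and a threshold of 0.5, usefulness is 0.25, so the single element (V5-V8) is copied to a new page that is 100% useful.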
12Changelog Parsing
- V1
- delete breakfast-menu0/food0/price0

<?xml version="1.0" encoding="ISO8859-1" ?>
<breakfast-menu>This is breakfast menu
  <food>
    <name>Belgian Waffles</name>
    <price>5.95</price>   -- this line is deleted
    <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
    <calories>650</calories>
  </food>
  <food>
    <name>Homestyle Breakfast</name>
    <price>6.95</price>
    <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
    <calories>950</calories>
  </food>
</breakfast-menu>
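A change log line such as the one above could be split into its parts with a small C++ parser. This is only a sketch: the names `PathStep`, `Change`, and `parseChange` are hypothetical, and it assumes each path step is a tag name followed by a zero-based sibling index, which is how the example reads (food0/price0 is the first food's price).

```cpp
#include <cctype>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

struct PathStep {
    std::string tag;  // element name, e.g. "food"
    int index;        // which sibling of that name, counted from 0 (assumption)
};

struct Change {
    std::string op;              // e.g. "delete"
    std::vector<PathStep> path;  // steps from the root to the target element
};

// Parses a line of the form "<op> <tag><n>/<tag><n>/...".
Change parseChange(const std::string& line) {
    Change c;
    std::istringstream in(line);
    std::string path;
    in >> c.op >> path;

    std::istringstream steps(path);
    std::string step;
    while (std::getline(steps, step, '/')) {
        // Peel the trailing digits off the step to get the index.
        std::size_t cut = step.size();
        while (cut > 0 && std::isdigit(static_cast<unsigned char>(step[cut - 1])))
            --cut;
        PathStep s;
        s.tag = step.substr(0, cut);
        s.index = (cut < step.size()) ? std::stoi(step.substr(cut)) : 0;
        c.path.push_back(s);
    }
    return c;
}
```

For "delete breakfast-menu0/food0/price0" this yields the operation "delete" and three steps: breakfast-menu 0, food 0, price 0. A tag name that itself ends in a digit would be misread by this sketch.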
13Sorting
- Used by both version reconstruction and structural joins
- Sort merging: merging pages together
- We take 1024 pages per element, sort them one page at a time, and write to a buffer in memory. Once the buffer is full, we clear the buffer and continue.
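(Diagram: pages being sorted and merged. Each number represents the left number in the ListElem. All pages are type food.)
First group of left numbers shown: 500 400 700 800 300 100 600 200
Passes: 1st time, 2nd time, 3rd time, ..., Nth time
Second group of left numbers shown: 100 300 500 700 200 600 800 400
Write out one page at a time to the database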
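The general sort-merge idea can be sketched in C++ as follows: sort each page, then merge the sorted pages by repeatedly taking the smallest left number. This is a generic in-memory k-way merge, not necessarily the buffering scheme the team used; the names `Page` and `sortMerge` and the two-numbers-per-page split in the usage note are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

using Page = std::vector<int>;  // each int stands for a ListElem's left number

// Sorts every page, then merges the pages into one ascending sequence.
std::vector<int> sortMerge(std::vector<Page> pages) {
    for (Page& p : pages) std::sort(p.begin(), p.end());

    // Heap entry: (left number, page index, position within that page).
    using Item = std::tuple<int, std::size_t, std::size_t>;
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (std::size_t i = 0; i < pages.size(); ++i)
        if (!pages[i].empty()) heap.emplace(pages[i][0], i, std::size_t{0});

    std::vector<int> out;
    while (!heap.empty()) {
        Item top = heap.top();
        heap.pop();
        out.push_back(std::get<0>(top));
        // Advance within the page the smallest value came from.
        std::size_t page = std::get<1>(top);
        std::size_t next = std::get<2>(top) + 1;
        if (next < pages[page].size())
            heap.emplace(pages[page][next], page, next);
    }
    return out;
}
```

Given the pages {500, 400}, {700, 800}, {300, 100}, {600, 200}, the result is 100 through 800 in ascending order. In the real system the output would be written to the database one page at a time rather than collected in a single vector.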
14Version Reconstruction
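(Diagram: sorted pages in the database for the element types Food, Calories, and Price, and the array of pages held in memory)
Sorted pages in database: 1000 1100 1200 | 700 800 900 | 502 504 506 | 400 500 600 | 100 200 300 | 501 505 503
Array of pages in memory: 400 500 600 | 100 200 300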
Looking at the left number, find the lowest
numbered element, and print out the tag, value,
and closing tag. This is the XML output.
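The step described above (pick the lowest left number, print the tag, value, and closing tag) can be sketched in C++. Several things here are assumptions rather than the team's method: the names `Node`, `aliveAt`, and `reconstruct`, the stack used to delay a parent's closing tag until its children are printed, and the reading of end = -1 as "still current".

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Node {
    std::string tag;
    int num1, num2;     // left and right numbers
    int start, end;     // version range; end == -1 means still current (assumption)
    std::string value;  // character data, may be empty
};

bool aliveAt(const Node& n, int version) {
    return n.start <= version && (n.end == -1 || version <= n.end);
}

// Emits the XML for `version` by visiting nodes in order of left number.
std::string reconstruct(std::vector<Node> nodes, int version) {
    std::sort(nodes.begin(), nodes.end(),
              [](const Node& a, const Node& b) { return a.num1 < b.num1; });
    std::string xml;
    std::vector<Node> open;  // elements whose closing tag is still pending
    for (const Node& n : nodes) {
        if (!aliveAt(n, version)) continue;
        // Close every open element whose range ends before this node starts.
        while (!open.empty() && open.back().num2 < n.num1) {
            xml += "</" + open.back().tag + ">";
            open.pop_back();
        }
        xml += "<" + n.tag + ">" + n.value;
        open.push_back(n);
    }
    while (!open.empty()) {
        xml += "</" + open.back().tag + ">";
        open.pop_back();
    }
    return xml;
}
```

As a usage example, take food (200,1100) with name, price, and calories children, and give price the version range 0 to 0 to model its deletion in V1. Reconstructing version 0 prints all three children, while version 1 prints the same food without the price element.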
15Structural Joins
- Structural joins let the user request all the data of a given type
nested under another type: for example, the names of all the foods
in the breakfast menu, or the prices of all the foods on the menu.
<food>
  <name>Belgian Waffles</name>
  <price>5.95</price>
  <description>two of our famous Belgian Waffles with plenty of real maple syrup</description>
  <calories>650</calories>
</food>
<food>
  <name>Homestyle Breakfast</name>
  <price>6.95</price>
  <description>two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
  <calories>950</calories>
</food>
Structural joins allow the retrieval of a given type of data:
- Names: Belgian Waffles, Homestyle Breakfast
- Prices: 5.95, 6.95
16Structural Joins
- Sort elements
- Scan elements of the given parent type
- For each parent found, scan the elements of the given descendant type
- If the descendant's numbers fall within the range of the parent, then it is a descendant of that parent
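The steps above can be sketched in C++ as a nested scan with a range-containment test. The names `Elem` and `structuralJoin` are hypothetical, and this is the plain nested-loop form of the idea; a production join would likely exploit the sort order to avoid rescanning.

```cpp
#include <algorithm>
#include <string>
#include <vector>

struct Elem {
    int num1, num2;     // left and right numbers from the tree numbering
    std::string value;  // character data of the element
};

// Returns the values of all descendants that lie inside some parent's range.
std::vector<std::string> structuralJoin(std::vector<Elem> parents,
                                        std::vector<Elem> descendants) {
    auto byLeft = [](const Elem& a, const Elem& b) { return a.num1 < b.num1; };
    std::sort(parents.begin(), parents.end(), byLeft);
    std::sort(descendants.begin(), descendants.end(), byLeft);

    std::vector<std::string> out;
    for (const Elem& p : parents)
        for (const Elem& d : descendants)
            // Contained: the descendant's numbers fall inside the parent's.
            if (p.num1 < d.num1 && d.num2 < p.num2)
                out.push_back(d.value);
    return out;
}
```

With the two food elements (200,1100) and (1200,2100) as parents and the price elements (500,600) and (1500,1600) as descendants, the join returns 5.95 and 6.95. An element numbered outside both food ranges is not returned.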
17Performance
- Parse: 6 seconds
- Create DB: 4 seconds
- Version reconstruction: 24 seconds
18Conclusion
- We learned how to work with Berkeley DB and Xerces-C to parse an XML file into a database.
- We learned how to create a versioning system by reading in a change log and reconstructing the versions.