Finding What We Want: From Hierarchical XML to Directories

1
Finding What We Want: From Hierarchical XML to Directories

Zachary G. Ives
University of Pennsylvania
CIS 455 / 555 Internet and Web Systems
February 10, 2009

2
Today

Reminder: HW1 Milestone 2 due
XQuery wrap-up
Content-based addressing
Directories: DNS
Flooding: Gnutella

3
Recall from Last Time: XQuery and Joins

for $i in doc("dblp.xml")/dblp/inproceedings,
    $r in $i/crossref/text(),
    $c in doc("dblp.xml")/dblp/conf,
    $n in $c/@name
where $c = $r
return ($i, $c)
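
The same join can be sketched outside XQuery. A minimal Python version follows, assuming a tiny inline document shaped like the DBLP example. The sample data and the name `join_papers_to_confs` are made up for illustration, and the sketch assumes the join is on each conference's `name` attribute. It indexes the conferences once and probes with each paper's crossref (a hash join) rather than comparing every pair.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the DBLP example.
SAMPLE = """
<dblp>
  <inproceedings key="p1"><title>Paper One</title><crossref>sigmod-97</crossref></inproceedings>
  <inproceedings key="p2"><title>Paper Two</title><crossref>vldb-98</crossref></inproceedings>
  <conf name="sigmod-97"><title>SIGMOD 1997</title></conf>
</dblp>
"""

def join_papers_to_confs(root):
    """Hash join: index conferences by name, then probe with each crossref."""
    confs_by_name = {c.get("name"): c for c in root.findall("conf")}
    pairs = []
    for paper in root.findall("inproceedings"):
        ref = paper.findtext("crossref")
        conf = confs_by_name.get(ref)
        if conf is not None:  # papers with no matching conference are dropped
            pairs.append((paper.findtext("title"), conf.findtext("title")))
    return pairs

print(join_papers_to_confs(ET.fromstring(SAMPLE)))
```

Paper Two has no matching conference, so it is absent from the result, just as in an inner join.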

4
Some Uses for Join in XML

Translation between values
SSN → PennID
Joining or combining information
Amazon invoice info + UPS tracking info
Restructuring information
..?
Here, we separate authors from books, then join
them back in upside-down fashion

5
Changing Nesting of XML Content

Re-nesting XML trees is a common operation
Simply nest the query blocks and correlate them, similar to a join
for $u in doc("dblp.xml")/dblp/university,
    $n in $u/name/text(),
    $k in $u/@key
where $u/country = "USA"
return
  ($n,
   for $mt in $u/../mastersthesis,
       $inst in $mt/school/text()
   where $mt/year/text() = "1992" and
         _______________
   return $mt/title)

6
Example XML Data
[Figure: tree view of the example DBLP document. The root holds the ?xml declaration and the dblp element; dblp contains mastersthesis, inproceedings, and university elements. Attributes shown include mdate and key; child elements include author, title, year, school, crossref, ee, name, and country, with sample values such as ms/Brown92, Kurt Brown, 1992, 1997, and USA.]
7
Collections & Aggregation in XQuery

XQuery is a functional language, with Nodes and Node Sets as types
Given a collection, we can compute an average, count, etc. of its members
for $paper in doc("dblp.xml")/dblp/inproceedings
let $pauth := $paper/author    (: $pauth is a collection :)
return ($paper/title, fn:count($pauth))
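
For comparison, the same per-paper aggregation can be sketched in Python. This is a rough equivalent only; the sample document and the name `author_counts` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data shaped like the DBLP example.
SAMPLE = """<dblp>
  <inproceedings><title>A</title><author>X</author><author>Y</author></inproceedings>
  <inproceedings><title>B</title><author>Z</author></inproceedings>
</dblp>"""

def author_counts(root):
    """Return (title, number of authors) for each inproceedings element."""
    result = []
    for paper in root.findall("inproceedings"):
        pauth = paper.findall("author")  # the collection bound by "let"
        result.append((paper.findtext("title"), len(pauth)))
    return result

print(author_counts(ET.fromstring(SAMPLE)))
```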
8
Sorting in XQuery

We can order the sequence of result tuples output by the return clause
for $x in doc("dblp.xml")/proceedings
order by $x/title/text()
return $x

9
Querying & Defining Tags

Can get a node's name by querying node-name()
for $x in doc("dblp.xml")/dblp/*
return node-name($x)
Can construct elements and attributes using computed names
for $x in doc("dblp.xml")/dblp/*,
    $year in $x/year,
    $title in $x/title/text()
return
  element { node-name($x) } {
    attribute { concat("year-", $year) } { $title }
  }

10
XQuery Summary

Very flexible and powerful language for XML
Focus is on database-style operations like joins
Performs tasks that can't be done with XPath or XSLT and that are tedious to program in Java:
  Integrating information from multiple sources
  Joins, based on correspondences of values
  Computing count, average, etc.
Today, XQuery is available:
  In RDBMSs (SQL Server, Oracle, DB2) and XML DBMS systems (MarkLogic)
  As the basis of research prototypes for XQuery full text
  As the basis of XQueryP, a Web Services/AJAX programming language based on XQuery but with programming language features
  http://2006.xmlconference.org/programme/presentations/38.html
We'll discuss data integration and middleware later in the course

11
Hierarchical Naming Schemes

Thus far, we've seen XPath as a hierarchical naming scheme
  Content-based naming: describe the structure and values of a tree structure
  Assumption: XML tree resides in (or is being sent to) one place
But hierarchy is often used for naming and location
We'll now look at some naming and location schemes, including hierarchical ones

12
How Do We Find Things on the Internet?

Generally, using one of three means:
Addresses or locations: specify where something is, assuming that we understand how to navigate
  Just like a physical address, we may still need a map!
  In the Internet, addresses are typically IP addresses; the routers know the map
Names are mapped into addresses via lookup services
  Best-known example on the Internet: DNS names
  Cell phone numbers, email addresses, etc. are becoming names
Content-based addressing/naming
  The actual data value is somehow used to find its location
  The basis of publish-subscribe systems and peer-to-peer architectures

13
The Simplest Way of Going from Names or Content → Locations

Directory-based lookup protocols are very common
Examples:
  Napster 1.0: peer-to-peer storage with central directory
  Inverted index: used to look up keywords in information retrieval
  DNS: distributed hierarchical directory
  LDAP: hierarchical Directory Information Tree

14
Napster 1.0, ca 2002

Hybrid of peer-to-peer storage with central directory showing what's currently available
What are the trade-offs implicit in this model?
Why did it fail?

[Diagram: Peer1 holds los-del-rios-macarena.mp3, Peer2 holds bspears-oops.mp3, and Peer3 holds los-del-rios-macarena.mp3. The central Directory at Napster.com lists los-del-rios-macarena and bspears-oops.]
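
The hybrid model can be sketched as a central index that maps titles to the peers holding them, while the files themselves stay on the peers. This is an illustrative sketch only, not Napster's actual protocol; the class and method names are made up.

```python
from collections import defaultdict

class CentralDirectory:
    """Toy central directory: knows who has what, stores no file content."""

    def __init__(self):
        self._peers_by_title = defaultdict(set)

    def register(self, peer, filename):
        title = filename.rsplit(".", 1)[0]  # strip the extension
        self._peers_by_title[title].add(peer)

    def unregister_peer(self, peer):
        # Called when a peer goes offline, so the listing stays current.
        for peers in self._peers_by_title.values():
            peers.discard(peer)

    def lookup(self, title):
        return sorted(self._peers_by_title.get(title, ()))

directory = CentralDirectory()
directory.register("Peer1", "los-del-rios-macarena.mp3")
directory.register("Peer2", "bspears-oops.mp3")
directory.register("Peer3", "los-del-rios-macarena.mp3")
print(directory.lookup("los-del-rios-macarena"))
```

Every lookup goes through the one directory, which makes search simple but also makes the directory a single point of failure and control.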
15
Other Services with Similar Directory + Peer Architectures

Windows Live Sync
Google Desktop Search with multiple machines
BitTorrent trackers are quite similar (we'll discuss BitTorrent more later)

16
Inverted Indices for Content Search

A forward index: documents to words
The inverted index: words to word-occurrences
The basis of most information retrieval engines, Google, etc.
Can handle positional predicates
But how can we reconstruct previews?
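
A minimal sketch of an inverted index with word positions, which is what makes positional predicates such as "word B directly follows word A" possible. The documents and function names are hypothetical, and tokenization is just lowercase whitespace splitting.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to a list of (doc_id, position) occurrences."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((doc_id, pos))
    return index

def phrase_match(index, first, second):
    """Return ids of documents where `second` occurs right after `first`."""
    follow = set(index.get(second, []))
    return sorted({d for d, p in index.get(first, []) if (d, p + 1) in follow})

# Hypothetical documents.
DOCS = {
    1: "hierarchical xml to directories",
    2: "directories to hierarchical names",
    3: "xml names",
}
INDEX = build_inverted_index(DOCS)
print(phrase_match(INDEX, "hierarchical", "xml"))
```

The index stores only occurrences, not the surrounding text, which is why building a result preview needs the original documents or a forward index.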

17
Naming People and Devices: LDAP

Lightweight Directory Access Protocol
Hierarchical naming system that can be
partitioned and replicated

18
LDAP's Schema

LDAP information has an XML-like schema
A unique name in LDAP is called a Distinguished Name (dn), and consists of a sequence of attributes representing a hierarchy, from most-specific to least-specific (as in DNS names)
  o: organization; dc: domain component
  ou: organizational unit
  uid: user ID
  cn: common name
  c: country; st: state; l: locality
Can also have objectClass: the type of entity
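
To make the most-specific-to-least-specific ordering concrete, here is a small sketch that splits a DN into attribute/value pairs and reverses them to get the path down from the top of the hierarchy. The function names are made up, and the parsing is deliberately naive: real DNs allow escaped commas and multi-valued components, which this ignores.

```python
def parse_dn(dn):
    """Split a DN into (attribute, value) pairs, most-specific first."""
    pairs = []
    for part in dn.split(","):
        attr, _, value = part.strip().partition("=")
        pairs.append((attr, value))
    return pairs

def path_from_root(dn):
    """Reverse the DN so it reads from least-specific to most-specific."""
    return list(reversed(parse_dn(dn)))

print(path_from_root("uid=zives, ou=penn, c=usa"))
```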

19
LDAP Hierarchy
Brad Marshall, LDAP Tutorial, quark.humbug.au/publications/ldap_tut.html
20
Querying LDAP

LDAP queries are mostly attribute-value predicates
uid=zives, ou=penn, c=usa
(|(cn=Susan Davidson)(cn=Zachary Ives)(cn=Val Tannen))
(objectclass=posixAccount)
(!(cn=Val Tannen))
How does this differ from XPath?
How might we process these queries?
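
One way to process such queries is to treat a filter as a small Boolean expression tree and evaluate it against each entry. The sketch below assumes filters are already parsed into nested tuples; the entries are hypothetical, and matching is exact string equality, which is simpler than the matching rules of a real LDAP server.

```python
def matches(entry, flt):
    """Evaluate a filter tree against one entry (dict: attribute -> list of values)."""
    op = flt[0]
    if op == "=":
        _, attr, value = flt
        return value in entry.get(attr, [])
    if op == "&":
        return all(matches(entry, sub) for sub in flt[1:])
    if op == "|":
        return any(matches(entry, sub) for sub in flt[1:])
    if op == "!":
        return not matches(entry, flt[1])
    raise ValueError(f"unknown filter operator: {op!r}")

# Hypothetical directory entries.
ENTRIES = [
    {"uid": ["zives"], "cn": ["Zachary Ives"], "objectclass": ["posixAccount"]},
    {"uid": ["val"], "cn": ["Val Tannen"], "objectclass": ["posixAccount"]},
    {"cn": ["printer-3"], "objectclass": ["device"]},
]

def search(entries, flt):
    """Return the common name of every entry that satisfies the filter."""
    return [e["cn"][0] for e in entries if matches(e, flt)]

# Tuple form of (&(objectclass=posixAccount)(!(cn=Val Tannen)))
QUERY = ("&", ("=", "objectclass", "posixAccount"), ("!", ("=", "cn", "Val Tannen")))
print(search(ENTRIES, QUERY))
```

This scans every entry. A real directory would use per-attribute indexes to avoid the scan, much like the inverted index earlier in the lecture.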

21
The Backbone of Internet Naming: Domain Name Service

A simple, hierarchical name system with a distributed database: each domain controls its own names

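[Diagram: the DNS name tree. Top-level domains com and edu sit under the root; below them are domains such as amazon (under com) and columbia, upenn, and berkeley (under edu); below those are hosts and subdomains such as www, cis, and sas.]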
22
Top-Level Domains (TLDs)

Mostly controlled by Network Solutions, Inc. today
.com: commercial
.edu: educational institution
.gov: US government
.mil: US military
.net: networks and ISPs (now also a number of other things)
.org: other organizations
244 2-letter country suffixes, e.g., .us, .uk, .cz, .tv, ...
and a bunch of new suffixes that are not very common, e.g., .biz, .name, .pro, ...

23
Finding the Root

13 root servers store entries for all top-level domains (TLDs)
DNS servers have a hard-coded mapping to root servers so they can get started
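
A toy sketch of how a lookup can start from a hard-coded root and follow delegations down the hierarchy. Everything here is hypothetical: the server names, the zone contents, and the address (taken from a range reserved for documentation). Real resolvers speak the DNS wire protocol and cache aggressively; this only shows the walk.

```python
# Hypothetical zone data: each server either delegates a suffix or knows an address.
SERVERS = {
    "root": {"delegations": {"edu": "edu-server", "com": "com-server"}},
    "edu-server": {"delegations": {"upenn.edu": "upenn-server"}},
    "upenn-server": {"addresses": {"www.cis.upenn.edu": "192.0.2.10"}},
    "com-server": {"delegations": {}},
}
ROOT_HINT = "root"  # stands in for the hard-coded root server list

def resolve(name):
    """Return (address or None, list of servers contacted)."""
    server = ROOT_HINT
    path = [server]
    while True:
        zone = SERVERS[server]
        addr = zone.get("addresses", {}).get(name)
        if addr is not None:
            return addr, path
        # Follow the delegation whose suffix matches the queried name, if any.
        nxt = next(
            (s for suffix, s in zone.get("delegations", {}).items()
             if name == suffix or name.endswith("." + suffix)),
            None,
        )
        if nxt is None:
            return None, path
        server = nxt
        path.append(server)

print(resolve("www.cis.upenn.edu"))
```

Each step narrows the search to the server responsible for a longer suffix of the name, so no server needs to know the whole namespace.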

24
Excerpt from DNS Root Server Entries

; This file is made available by InterNIC registration services
; under anonymous FTP as file /domain/named.root
;
; formerly NS.INTERNIC.NET
.                      3600000  IN  NS  A.ROOT-SERVERS.NET.
A.ROOT-SERVERS.NET.    3600000      A   198.41.0.4
; formerly NS1.ISI.EDU
.                      3600000      NS  B.ROOT-SERVERS.NET.
B.ROOT-SERVERS.NET.    3600000      A   128.9.0.107
; formerly C.PSI.NET
.                      3600000      NS  C.ROOT-SERVERS.NET.
C.ROOT-SERVERS.NET.    3600000      A   192.33.4.12

(13 servers in total, A through M)
25
Supposing We Were to Build DNS

How would we start? How is a lookup performed?
(Hint: what do you need to specify when you add a client to a network that doesn't do DHCP?)

26
Issues in DNS

We know that everyone wants to be my-domain.com
How does this mesh with the assumptions inherent
in our hierarchical naming system?
What happens if things move frequently?
What happens if we want to provide different
behavior to different requestors (e.g., Akamai)?

27
Directories Summarized

An efficient way of finding data, assuming:
  Data doesn't change too often, hence it can be replicated and distributed
  Hierarchy is relatively wide and flat
  Caching is present, helping with repeated queries
Directories generally rely on names at their core
  Sometimes we want to search based on other means, e.g., predicates or filters over content
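
The caching assumption can be illustrated with a small time-to-live cache wrapped around any lookup function. This is a sketch under stated assumptions: the class name, the `lookup` callable, and the injectable `clock` are made up for illustration and do not reflect how any particular directory implements caching.

```python
import time

class CachingResolver:
    """Cache lookup results for a fixed time-to-live (TTL)."""

    def __init__(self, lookup, ttl_seconds, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}  # name -> (value, expiry time)
        self.misses = 0

    def resolve(self, name):
        now = self._clock()
        hit = self._cache.get(name)
        if hit is not None and hit[1] > now:
            return hit[0]  # still fresh: answer without asking the directory
        self.misses += 1
        value = self._lookup(name)
        self._cache[name] = (value, now + self._ttl)
        return value

# Usage with a stand-in lookup function and a hypothetical address.
resolver = CachingResolver(lambda name: "192.0.2.1", ttl_seconds=60)
resolver.resolve("www.example.edu")
resolver.resolve("www.example.edu")
print(resolver.misses)
```

The TTL is the trade-off knob: a longer TTL means fewer directory queries but a longer window in which a changed entry is served stale, which is why directories work best when data changes rarely.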

28
Pushing the Search to the Network: Flooding Requests (Gnutella)

Node A wants a data item; it asks B and C
If B and C don't have it, they ask their neighbors, etc.
What are the implications of this model?

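[Diagram: peer nodes A through I connected in an overlay network; A's request floods outward through its neighbors B and C.]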
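
A rough sketch of flooding with a hop limit, to make the cost of this model visible. The graph below is hypothetical (the slide's actual edges are not known), and the function name and message accounting are simplifications: every forwarded copy is counted, including copies sent to nodes that have already seen the query.

```python
from collections import deque

def flood_search(graph, holders, start, ttl):
    """Flood a query from `start`; return (holders reached, messages sent)."""
    seen = {start}
    frontier = deque([(start, ttl)])
    found, messages = [], 0
    while frontier:
        node, hops_left = frontier.popleft()
        if node in holders:
            found.append(node)  # a node with the item answers instead of forwarding
            continue
        if hops_left == 0:
            continue            # hop limit reached: drop the query
        for neighbor in graph[node]:
            messages += 1
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, hops_left - 1))
    return found, messages

# Hypothetical overlay network.
GRAPH = {
    "A": ["B", "C"], "B": ["A", "D", "E"], "C": ["A", "F"],
    "D": ["B", "G"], "E": ["B"], "F": ["C", "H"], "G": ["D"], "H": ["F"],
}
print(flood_search(GRAPH, holders={"G"}, start="A", ttl=3))
print(flood_search(GRAPH, holders={"G"}, start="A", ttl=2))
```

With a hop limit of 3 the item at G is found at the cost of 12 messages; with a limit of 2 the search sends 7 messages and finds nothing, even though the item exists. Message count grows with the size of the neighborhood, not with the number of matches.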
29
Bringing the Data to the Router: Publish-Subscribe

Generally, too much data to store centrally, but perhaps we only need a central coordinator!
Interested parties register a profile with the system (often in a central server)
  In, for instance, XPath!
Data gets aggregated at some sort of router or by a crawler, and then gets disseminated to individuals
  Based on match between content and the profile
  Data changes often, but queries don't!

30
An Example: XML-Based Information Dissemination

Basic model (XFilter, YFilter, Xyleme):
  Users are interested in data relating to a particular topic, and know the schema
    /politics/usa//body
  A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties
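
The basic model can be sketched by checking each incoming document against every registered profile. This is a naive illustration, not the XFilter or YFilter algorithm: those systems compile profiles into automata so that many profiles are matched in one pass, whereas this sketch evaluates each profile separately. User names, the sample document, and function names are hypothetical, and only simple absolute paths with child and descendant steps are handled.

```python
import xml.etree.ElementTree as ET

def profile_matches(root, profile):
    """True if the document rooted at `root` has a node selected by `profile`."""
    if not profile.startswith("/") or profile.startswith("//"):
        raise ValueError("expected an absolute path like /a/b//c")
    first, sep, rest = profile[1:].partition("/")
    if root.tag != first:
        return False
    if not sep:
        return True
    # ElementTree evaluates paths relative to an element, so drop the root step.
    return root.find("./" + rest) is not None

def disseminate(xml_text, profiles):
    """Return the users whose profile matches the given document."""
    root = ET.fromstring(xml_text)
    return sorted(user for user, p in profiles.items() if profile_matches(root, p))

# Hypothetical subscribers and document.
PROFILES = {"alice": "/politics/usa//body", "bob": "/sports//body"}
DOC = "<politics><usa><story><body>...</body></story></usa></politics>"
print(disseminate(DOC, PROFILES))
```

Because the profiles are stable and the documents keep arriving, the work worth optimizing is the per-document matching step, which is what the filtering engines focus on.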

31
Engine for XFilter [Altinel & Franklin 00]
