Title: Finding What We Want: From Hierarchical XML to Directories
1Finding What We Want: From Hierarchical XML to Directories
- Zachary G. Ives
- University of Pennsylvania
- CIS 455 / 555 Internet and Web Systems
- February 10, 2009
2Today
- Reminder: HW1 Milestone 2 due
- XQuery wrap-up
- Content-based addressing
- Directories: DNS
- Flooding: Gnutella
3Recall from Last Time: XQuery and Joins
- for $i in doc("dblp.xml")/dblp/inproceedings,
      $r in $i/crossref/text(),
      $c in doc("dblp.xml")/dblp/conf,
      $n in $c/@name
  where $c = $r
  return ($i, $c)
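Conceptually, the join above is a nested loop over two sets of nodes with an equality test. The following is a minimal Python sketch of that idea, not the way an XQuery engine actually evaluates it. The miniature document, its values, and the function name join_papers_with_confs are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature of dblp.xml: element names follow the slide,
# all values are invented for this sketch.
DBLP = """
<dblp>
  <inproceedings key="p1"><title>Paper One</title><crossref>conf/a/2001</crossref></inproceedings>
  <inproceedings key="p2"><title>Paper Two</title><crossref>conf/b/2002</crossref></inproceedings>
  <conf name="Conf A">conf/a/2001</conf>
</dblp>
"""

def join_papers_with_confs(root):
    """Nested-loop join: pair each inproceedings with every conf whose
    string value equals the paper's crossref text (mirrors `where $c = $r`)."""
    results = []
    for paper in root.findall("inproceedings"):        # for $i in .../inproceedings
        for ref in paper.findall("crossref"):          # $r in $i/crossref/text()
            for conf in root.findall("conf"):          # $c in .../dblp/conf
                conf_value = "".join(conf.itertext()).strip()
                if conf_value == (ref.text or "").strip():
                    results.append((paper, conf))      # return ($i, $c)
    return results

root = ET.fromstring(DBLP)
pairs = join_papers_with_confs(root)
for paper, conf in pairs:
    print(paper.get("key"), "->", conf.get("name"))
```

A real engine would likely build a hash table on one side instead of scanning every conf for every paper; the nested loop is used here only because it matches the shape of the query.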
4Some Uses for Join in XML
- Translation between values
- SSN → PennID
- Joining or combining information
- Amazon invoice info and UPS tracking info
- Restructuring information
- Here, we separate authors from books, then join them back in upside-down fashion
5Changing Nesting of XML Content
- Re-nesting XML trees is a common operation
- Simply nest the query blocks and correlate them, similar to a join:
- for $u in doc("dblp.xml")/dblp/university,
      $n in $u/name/text(),
      $k in $u/@key
  where $u/country = "USA"
  return
    { $n }
    { for $mt in $u/../mastersthesis,
          $inst in $mt/school/text()
      where $mt/year/text() = 1992 and _______________
      return $mt/title }
6Example XML Data
[Figure: tree view of the example dblp.xml data. Below the Root are the ?xml declaration and the dblp element; dblp contains mastersthesis, inproceedings, and university nodes. Node labels include mdate, key, author, title, year, school, crossref, ee, name, and country; sample values include ms/Brown92, Kurt Brown, PRPL, wisc, 1992, 1997, 2002, USA, Wisconsin, and sigmod-97.]
7Collections & Aggregation in XQuery
- XQuery is a functional language, with Nodes and Node Sets as types
- Given a collection, we can compute an average, count, etc. of its members
- for $paper in doc("dblp.xml")/dblp/inproceedings
  let $pauth := $paper/author   (: $pauth is a collection :)
  return ($paper/title, fn:count($pauth))
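To make the "collection, then aggregate" step concrete, here is a small Python sketch that binds each paper's authors to a list and counts them, much as the let clause and fn:count do. The sample document and the name author_counts are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; titles and author names are invented.
DBLP = """
<dblp>
  <inproceedings><title>Paper One</title><author>A. Author</author><author>B. Author</author></inproceedings>
  <inproceedings><title>Paper Two</title><author>C. Author</author></inproceedings>
</dblp>
"""

def author_counts(root):
    """Return (title, number of authors) for every inproceedings element."""
    out = []
    for paper in root.findall("inproceedings"):   # for $paper in ...
        pauth = paper.findall("author")           # let $pauth := $paper/author
        out.append((paper.findtext("title"), len(pauth)))  # title and count
    return out

counts = author_counts(ET.fromstring(DBLP))
for title, n in counts:
    print(title, n)
```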
8Sorting in XQuery
- We can order the sequence of result tuples output by the return clause
- for $x in doc("dblp.xml")/proceedings
  order by $x/title/text()
  return $x
9Querying & Defining Tags
- Can get a node's name by querying node-name():
- for $x in doc("dblp.xml")/dblp/*
  return node-name($x)
- Can construct elements and attributes using computed names:
- for $x in doc("dblp.xml")/dblp/*,
      $year in $x/year,
      $title in $x/title/text()
  return element { node-name($x) } {
    attribute { concat("year-", $year) } { $title }
  }
10XQuery Summary
- Very flexible and powerful language for XML
- Focus is on database-style operations like joins
- Performs tasks that can't be done with XPath or XSLT and that are tedious to program in Java
- Integrating information from multiple sources
- Joins, based on correspondences of values
- Computing count, average, etc.
- Today, XQuery is available
- In RDBMSs (SQL Server, Oracle, DB2) and XML DBMS systems (MarkLogic)
- As the basis of research prototypes for XQuery full text
- As the basis of XQueryP, a Web Services/AJAX programming language based on XQuery but with programming language features
- http://2006.xmlconference.org/programme/presentations/38.html
- We'll discuss data integration and middleware later in the course
11Hierarchical Naming Schemes
- Thus far, we've seen XPath as a hierarchical naming scheme
- Content-based naming: describe the structure and values of a tree structure
- Assumption: XML tree resides in (or is being sent to) one place
- But hierarchy is often used for naming and location
- We'll now look at some naming and location schemes, including hierarchical ones
12How Do We Find Things on the Internet?
- Generally, using one of three means
- Addresses or locations specify where something is, assuming that we understand how to navigate
- Just like a physical address, we may still need a map!
- In the Internet, addresses are typically IP addresses; the routers know the map
- Names are mapped into addresses via lookup services
- Best-known example on the Internet: DNS name
- Cell phone numbers, email addresses, etc. are becoming names
- Content-based addressing/naming
- The actual data value is somehow used to find its location
- The basis of publish-subscribe systems and peer-to-peer architectures
13The Simplest Way of Going from Names or Content → Locations
- Directory-based lookup protocols are very common
- Examples:
- Napster 1.0: peer-to-peer storage with central directory
- Inverted index: used to look up keywords in information retrieval
- DNS: distributed hierarchical directory
- LDAP: hierarchical Directory Information Tree
14Napster 1.0, ca 2002
- Hybrid of peer-to-peer storage with central directory showing what's currently available
- What are the trade-offs implicit in this model? Why did it fail?
[Figure: the Napster.com directory lists los-del-rios-macarena and bspears-oops. Peer1 and Peer3 each store los-del-rios-macarena.mp3; Peer2 stores bspears-oops.mp3.]
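The hybrid model can be sketched in a few lines: peers keep the files, and a single directory only records who has what. The Python below is a minimal illustration under that assumption. The class name CentralDirectory and its methods are hypothetical and do not reflect Napster's actual protocol; only the peer and file names come from the figure.

```python
from collections import defaultdict

class CentralDirectory:
    """Toy central directory: maps a file name to the peers currently offering it."""

    def __init__(self):
        self._holders = defaultdict(set)   # file name -> set of peer ids

    def register(self, peer, files):
        """Called when a peer comes online and announces its files."""
        for name in files:
            self._holders[name].add(peer)

    def unregister(self, peer):
        """Called when a peer goes offline; its entries disappear."""
        for name in list(self._holders):
            self._holders[name].discard(peer)
            if not self._holders[name]:
                del self._holders[name]

    def lookup(self, name):
        """Return the peers to contact directly for the actual download."""
        return sorted(self._holders.get(name, ()))

directory = CentralDirectory()
directory.register("Peer1", ["los-del-rios-macarena.mp3"])
directory.register("Peer2", ["bspears-oops.mp3"])
directory.register("Peer3", ["los-del-rios-macarena.mp3"])
print(directory.lookup("los-del-rios-macarena.mp3"))
```

The sketch makes one trade-off visible: lookups are a single dictionary access, but every query and every update goes through one object, which is both a bottleneck and a single point of failure.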
15Other Services with Similar Directory + Peer Architectures
- Windows Live Sync
- Google Desktop Search with multiple machines
- BitTorrent trackers are quite similar (we'll discuss BitTorrent more later)
16Inverted Indices for Content Search
- A forward index: documents to words
- The inverted index: words to word-occurrences
- The basis of most information retrieval engines, Google, etc.
- Can handle positional predicates
- But how can we reconstruct previews?
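As a rough illustration of "words to word-occurrences" and of a positional predicate, the Python sketch below builds an inverted index with word positions and answers a phrase query from it. The tokenizer (lowercase, split on whitespace), the sample documents, and the function names are simplifying assumptions, not a description of any production engine.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to {doc_id: [positions at which the word occurs]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def phrase_search(index, phrase):
    """Positional predicate: return documents containing the words consecutively."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # word i of the phrase must occur at position start + i
            if all(start + i in index[w].get(doc_id, ()) for i, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)

# Hypothetical documents for the sketch.
docs = {
    "d1": "internet and web systems",
    "d2": "web systems on the internet",
    "d3": "systems for the web",
}
index = build_inverted_index(docs)
print(phrase_search(index, "web systems"))
```

Note that the index alone holds only positions, not the surrounding text, which is why reconstructing previews needs either the forward index or the stored documents.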
17Naming People and Devices: LDAP
- Lightweight Directory Access Protocol
- Hierarchical naming system that can be partitioned and replicated
18LDAP's Schema
- LDAP information has an XML-like schema
- A unique name in LDAP is called a Distinguished Name (dn) and consists of a sequence of attributes representing a hierarchy, from most-specific to least-specific (as in DNS names)
- o: organization; dc: domain component
- ou: organizational unit
- uid: user ID
- cn: common name
- c: country; st: state; l: locality
- Can also have objectClass: the type of entity
19LDAP Hierarchy
Brad Marshall, LDAP Tutorial, quark.humbug.au/publications/ldap_tut.html
20Querying LDAP
- LDAP queries are mostly attribute-value predicates
- uid=zives, ou=penn, c=usa
- (|(cn=Susan Davidson)(cn=Zachary Ives)(cn=Val Tannen))
- objectclass=posixAccount
- (!(cn=Val Tannen))
- How does this differ from XPath?
- How might we process these queries?
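One simple way to process such queries is to parse the filter into a tree and test every entry against it. The Python sketch below does this for a small subset of the filter syntax: AND, OR, NOT, and equality only. The function names, the entry layout, and the uid "vtannen" are assumptions for illustration; a real directory server would use indexes rather than scanning all entries.

```python
def parse_filter(text):
    """Parse a subset of LDAP filter syntax: (&..), (|..), (!..), (attr=value)."""
    text = text.strip()
    node, pos = _parse(text, 0)
    if pos != len(text):
        raise ValueError("trailing characters in filter")
    return node

def _parse(s, pos):
    if pos >= len(s) or s[pos] != "(":
        raise ValueError(f"expected '(' at position {pos}")
    pos += 1
    if pos < len(s) and s[pos] in "&|!":
        op = s[pos]
        pos += 1
        children = []
        while pos < len(s) and s[pos] == "(":
            child, pos = _parse(s, pos)
            children.append(child)
        if pos >= len(s) or s[pos] != ")":
            raise ValueError("expected ')'")
        if not children or (op == "!" and len(children) != 1):
            raise ValueError(f"wrong number of operands for '{op}'")
        return (op, children), pos + 1
    end = s.index(")", pos)                      # simple predicate: attr=value
    attr, sep, value = s[pos:end].partition("=")
    if not sep:
        raise ValueError("expected attr=value")
    return ("=", attr.strip().lower(), value.strip()), end + 1

def matches(node, entry):
    """Evaluate a parsed filter against one entry (dict: attribute -> list of values)."""
    if node[0] == "=":
        _, attr, value = node
        return any(v.lower() == value.lower() for v in entry.get(attr, []))
    op, children = node
    if op == "&":
        return all(matches(c, entry) for c in children)
    if op == "|":
        return any(matches(c, entry) for c in children)
    return not matches(children[0], entry)       # op == "!"

def search(entries, filter_text):
    tree = parse_filter(filter_text)
    return [e["uid"][0] for e in entries if matches(tree, e)]

# Hypothetical entries; the uid "vtannen" is invented for this sketch.
entries = [
    {"uid": ["zives"], "cn": ["Zachary Ives"], "objectclass": ["posixAccount"]},
    {"uid": ["vtannen"], "cn": ["Val Tannen"], "objectclass": ["posixAccount"]},
]
print(search(entries, "(!(cn=Val Tannen))"))
```

Unlike an XPath expression, the filter says nothing about where in the hierarchy a match must sit; it is a flat predicate over each entry's attributes.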
21The Backbone of Internet Naming: Domain Name System
- A simple, hierarchical name system with a distributed database: each domain controls its own names
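[Figure: DNS name tree. The top-level domains com and edu are at the top; amazon sits under com, and columbia, upenn, and berkeley sit under edu; lower levels show names such as www, cis, and sas.]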
22Top-Level Domains (TLDs)
- Mostly controlled by Network Solutions, Inc. today
- .com: commercial
- .edu: educational institution
- .gov: US government
- .mil: US military
- .net: networks and ISPs (now also a number of other things)
- .org: other organizations
- 244 two-letter country suffixes, e.g., .us, .uk, .cz, .tv, ...
- and a bunch of new suffixes that are not very common, e.g., .biz, .name, .pro, ...
23Finding the Root
- 13 root servers store entries for all top-level domains (TLDs)
- DNS servers have a hard-coded mapping to root servers so they can get started
24Excerpt from DNS Root Server Entries
; This file is made available by InterNIC registration services
; under anonymous FTP as
;     file                /domain/named.root
;
; formerly NS.INTERNIC.NET
;
.                        3600000  IN  NS    A.ROOT-SERVERS.NET.
A.ROOT-SERVERS.NET.      3600000      A     198.41.0.4
;
; formerly NS1.ISI.EDU
;
.                        3600000      NS    B.ROOT-SERVERS.NET.
B.ROOT-SERVERS.NET.      3600000      A     128.9.0.107
;
; formerly C.PSI.NET
;
.                        3600000      NS    C.ROOT-SERVERS.NET.
C.ROOT-SERVERS.NET.      3600000      A     192.33.4.12
(13 servers in total, A through M)
25Supposing We Were to Build DNS
- How would we start? How is a lookup performed?
- (Hint: what do you need to specify when you add a client to a network that doesn't do DHCP?)
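One possible answer to "how is a lookup performed?" is iterative resolution: start at a hard-coded root server and follow referrals down the hierarchy until some server knows the address. The Python sketch below simulates this with in-memory tables. All server names, the address 192.0.2.10 (from a range reserved for documentation), and the function resolve are assumptions for illustration; real DNS involves network messages, record types, TTLs, and caching that are omitted here.

```python
# Hard-coded starting point, playing the role of the root hints file.
ROOT_HINTS = ["a.root-servers.example"]

# Toy servers: each knows some addresses and some referrals (zone suffix -> server).
SERVERS = {
    "a.root-servers.example": {
        "referrals": {"edu": "edu-server.example", "com": "com-server.example"},
        "addresses": {},
    },
    "edu-server.example": {
        "referrals": {"upenn.edu": "ns.upenn.example"},
        "addresses": {},
    },
    "ns.upenn.example": {
        "referrals": {},
        "addresses": {"www.cis.upenn.edu": "192.0.2.10"},
    },
}

def resolve(name, servers=SERVERS, root_hints=ROOT_HINTS, max_steps=10):
    """Iteratively resolve `name`; return (address or None, servers contacted)."""
    server = root_hints[0]
    path = [server]
    labels = name.split(".")
    # Candidate zone suffixes, longest first: the full name, then its parents.
    suffixes = [".".join(labels[i:]) for i in range(len(labels))]
    for _ in range(max_steps):
        zone = servers[server]
        if name in zone["addresses"]:
            return zone["addresses"][name], path
        next_server = next(
            (zone["referrals"][s] for s in suffixes if s in zone["referrals"]), None
        )
        if next_server is None:
            return None, path          # nobody to ask next: the name does not resolve
        server = next_server
        path.append(server)
    return None, path

address, path = resolve("www.cis.upenn.edu")
print(address, path)
```

In this sketch every lookup starts at the root; caching referrals and answers at the local resolver is what keeps real root servers from seeing every query.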
26Issues in DNS
- We know that everyone wants to be my-domain.com
- How does this mesh with the assumptions inherent in our hierarchical naming system?
- What happens if things move frequently?
- What happens if we want to provide different behavior to different requestors (e.g., Akamai)?
27Directories Summarized
- An efficient way of finding data, assuming:
- Data doesn't change too often, hence it can be replicated and distributed
- Hierarchy is relatively wide and flat
- Caching is present, helping with repeated queries
- Directories generally rely on names at their core
- Sometimes we want to search based on other means, e.g., predicates or filters over content
28Pushing the Search to the Network: Flooding Requests (Gnutella)
- Node A wants a data item; it asks B and C
- If B and C don't have it, they ask their neighbors, etc.
- What are the implications of this model?
[Figure: overlay network of peers A through I]
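The flooding model can be explored with a short simulation. The Python sketch below floods a query breadth-first with a hop limit and counts the messages sent, which is one way to see the cost of the approach. The overlay edges, the choice of H as the peer holding the item, and the function name flood_query are assumptions for illustration: the slide fixes only that A's neighbors are B and C. As a further simplification, a peer that has the item answers and does not forward the query.

```python
from collections import deque

# Hypothetical overlay topology; only A's neighbors (B and C) come from the slide.
NEIGHBORS = {
    "A": ["B", "C"],
    "B": ["A", "D", "E"],
    "C": ["A", "F", "G"],
    "D": ["B", "H"],
    "E": ["B", "I"],
    "F": ["C"],
    "G": ["C", "H"],
    "H": ["D", "G"],
    "I": ["E"],
}
HOLDERS = {"H"}   # assumed: only H stores the requested item

def flood_query(origin, holders, neighbors, ttl):
    """Flood a query up to `ttl` hops; return (peers that answered, messages sent)."""
    seen = {origin}
    frontier = deque([(origin, ttl)])
    found, messages = set(), 0
    while frontier:
        node, hops_left = frontier.popleft()
        if hops_left == 0:
            continue                     # hop limit reached: stop forwarding
        for peer in neighbors[node]:
            messages += 1                # every forwarded copy costs a message
            if peer in seen:
                continue                 # duplicate query: the receiver drops it
            seen.add(peer)
            if peer in holders:
                found.add(peer)          # peer has the item and answers
            else:
                frontier.append((peer, hops_left - 1))
    return found, messages

for ttl in (2, 3):
    print(ttl, flood_query("A", HOLDERS, NEIGHBORS, ttl))
```

Running it with different hop limits shows the tension in the model: a small limit can miss data that exists in the network, while a larger one reaches it at the price of many more messages, several of them duplicates.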
29Bringing the Data to the Router: Publish-Subscribe
- Generally, too much data to store centrally, but perhaps we only need a central coordinator!
- Interested parties register a profile with the system (often in a central server)
- In, for instance, XPath!
- Data gets aggregated at some sort of router or by a crawler, and then gets disseminated to individuals
- Based on match between content and the profile
- Data changes often, but queries don't!
30An Example: XML-Based Information Dissemination
- Basic model (XFilter, YFilter, Xyleme):
- Users are interested in data relating to a particular topic, and know the schema
- /politics/usa//body
- A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties
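The basic model can be sketched as a broker that stores each user's path profile and, for every incoming document, reports which users should receive it. The Python below is a deliberately naive illustration that re-evaluates every profile against every document using the limited XPath support in the standard library. It is not how XFilter, YFilter, or Xyleme work internally, and the names Broker, matches_profile, and the sample subscribers and documents are assumptions.

```python
import xml.etree.ElementTree as ET

def matches_profile(document_xml, profile):
    """True if an absolute path profile such as /politics/usa//body selects
    at least one node in the document (small XPath subset only)."""
    if not profile.startswith("/") or profile.startswith("//"):
        raise ValueError("this sketch handles only profiles of the form /root/...")
    root = ET.fromstring(document_xml)
    first, _, rest = profile[1:].partition("/")
    if root.tag != first:
        return False                     # the document's root element differs
    if not rest:
        return True                      # the profile names only the root
    # ElementTree evaluates paths relative to an element, so strip the root step.
    return root.find("./" + rest) is not None

class Broker:
    """Toy dissemination point: stores profiles, matches incoming documents."""

    def __init__(self):
        self.profiles = {}               # subscriber -> list of path profiles

    def subscribe(self, subscriber, profile):
        self.profiles.setdefault(subscriber, []).append(profile)

    def publish(self, document_xml):
        """Return the subscribers whose profiles match this document."""
        return sorted(
            s for s, ps in self.profiles.items()
            if any(matches_profile(document_xml, p) for p in ps)
        )

broker = Broker()
broker.subscribe("alice", "/politics/usa//body")   # profile from the slide
broker.subscribe("bob", "/sports//score")          # hypothetical second profile

story = "<politics><usa><story><body>text</body></story></usa></politics>"
print(broker.publish(story))
```

Because the profiles change rarely while documents arrive constantly, the cost that matters is the per-document matching loop; evaluating each profile separately, as done here, is the part a dedicated filtering engine would aim to avoid.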
31Engine for XFilter [Altinel & Franklin 00]