Finding What We Want: From Hierarchical XML to Directories

1
Finding What We Want: From Hierarchical XML to
Directories
  • Zachary G. Ives
  • University of Pennsylvania
  • CIS 455 / 555 Internet and Web Systems
  • February 10, 2009

2
Today
  • Reminder: HW1 Milestone 2 due
  • XQuery wrap-up
  • Content-based addressing
  • Directories: DNS
  • Flooding: Gnutella

3
Recall from Last Time: XQuery and Joins
  • for $i in doc("dblp.xml")/dblp/inproceedings,
    $r in $i/crossref/text(), $c in
    doc("dblp.xml")/dblp/conf, $n in $c/@name
  • where $c = $r
  • return ($i, $c)

4
Some Uses for Join in XML
  • Translation between values
  • SSN → PennID
  • Joining or combining information
  • Amazon invoice info + UPS tracking info
  • Restructuring information

  • ..?
  • Here, we separate authors from books, then join
    them back in upside-down fashion

5
Changing Nesting of XML Content
  • Re-nesting XML trees is a common operation
  • Simply nest the query blocks and correlate them,
    similar to a join
  • for $u in doc("dblp.xml")/dblp/university,
    $n in $u/name/text(),
  • $k in $u/@key
  • where $u/country = "USA"
  • return
  • $n
  • for $mt in $u/../mastersthesis,
    $inst in $mt/school/text()
  • where $mt/year/text() = "1992" and
    _______________
  • return $mt/title

6
Example XML Data
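(Figure: tree view of the example dblp.xml document. The root
holds the ?xml declaration and a dblp element. The dblp element
contains mastersthesis, inproceedings, and university elements,
which carry attributes such as mdate and key and children such
as author, title, year, school, crossref, ee, name, and country.
Sample values shown include ms/Brown92, Kurt Brown, 1992, wisc,
Wisconsin, and USA.)
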
7
Collections & Aggregation in XQuery
  • XQuery is a functional language, with Nodes and
    Node Sets as types
  • Given a collection, we can compute an average,
    count, etc. of its members
  • for $paper in doc("dblp.xml")/dblp/inproceedings
  • let $pauth := $paper/author (: $pauth is a collection :)
  • return ($paper/title, fn:count($pauth))

8
Sorting in XQuery
  • We can order the sequence of result tuples
    output by the return clause
  • for $x in doc("dblp.xml")/proceedings
  • order by $x/title/text()
  • return $x

9
Querying & Defining Tags
  • Can get a node's name by querying node-name()
  • for $x in doc("dblp.xml")/dblp/*
  • return node-name($x)
  • Can construct elements and attributes using
    computed names
  • for $x in doc("dblp.xml")/dblp/*,
  • $year in $x/year,
  • $title in $x/title/text()
  • return element { node-name($x) } {
  • attribute { concat("year-", $year) } { $title } }

10
XQuery Summary
  • Very flexible and powerful language for XML
  • Focus is on database-style operations like joins
  • Performs tasks that can't be done with XPath or
    XSLT and that are tedious to program in Java
  • Integrating information from multiple sources
  • Joins, based on correspondences of values
  • Computing count, average, etc.
  • Today, XQuery is available:
  • In RDBMSs (SQL Server, Oracle, DB2) and XML DBMS
    systems (MarkLogic)
  • As the basis of research prototypes for XQuery
    full text
  • As the basis of XQueryP, a Web Services/AJAX
    programming language based on XQuery but with
    programming language features
  • http://2006.xmlconference.org/programme/presentations/38.html
  • We'll discuss data integration and middleware
    later in the course

11
Hierarchical Naming Schemes
  • Thus far, we've seen XPath as a hierarchical
    naming scheme
  • Content-based naming: describe the structure
    and values of a tree structure
  • Assumption: XML tree resides in (or is being
    sent to) one place
  • But hierarchy is often used for naming and
    location
  • We'll now look at some naming and location
    schemes, including hierarchical ones

12
How Do We Find Things on the Internet?
  • Generally, using one of three means:
  • Addresses or locations specify where something
    is, assuming that we understand how to navigate
  • Just like a physical address, we may still need a
    map!
  • In the Internet, addresses are typically IP
    addresses; the routers know the map
  • Names are mapped into addresses via lookup
    services
  • Best-known example on the Internet: DNS names
  • Cell phone numbers, email addresses, etc. are
    becoming names
  • Content-based addressing/naming
  • The actual data value is somehow used to find its
    location
  • The basis of publish-subscribe systems and
    peer-to-peer architectures

13
The Simplest Way of Going from Names or Content →
Locations
  • Directory-based lookup protocols are very common
  • Examples:
  • Napster 1.0: peer-to-peer storage with central
    directory
  • Inverted index: used to look up keywords in
    information retrieval
  • DNS: distributed hierarchical directory
  • LDAP: hierarchical Directory Information Tree

14
Napster 1.0, ca 2002
  • Hybrid of peer-to-peer storage with central
    directory showing what's currently available
  • What are the trade-offs implicit in this model?
    Why did it fail?

(Figure: Peer1 and Peer3 each hold los-del-rios-macarena.mp3,
Peer2 holds bspears-oops.mp3, and the central directory at
Napster.com lists los-del-rios-macarena and bspears-oops.)
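
The hybrid model can be sketched in a few lines of Python. This is only an illustration, not Napster's actual protocol: the class name `CentralDirectory` and its methods are hypothetical, and the sketch assumes that peers register their file names with the directory and then fetch the files from one another directly.

```python
from collections import defaultdict


class CentralDirectory:
    """Hypothetical central index: song name -> peers currently offering it."""

    def __init__(self):
        self._peers_by_song = defaultdict(set)

    def register(self, peer, filenames):
        # A peer announces the files it is willing to share.
        for name in filenames:
            song = name[:-4] if name.endswith(".mp3") else name
            self._peers_by_song[song].add(peer)

    def unregister(self, peer):
        # A peer goes offline; its entries disappear from the directory.
        for peers in self._peers_by_song.values():
            peers.discard(peer)

    def lookup(self, song):
        # Only the lookup goes through the directory; the transfer itself
        # would happen peer to peer.
        return sorted(self._peers_by_song.get(song, ()))


directory = CentralDirectory()
directory.register("Peer1", ["los-del-rios-macarena.mp3"])
directory.register("Peer2", ["bspears-oops.mp3"])
directory.register("Peer3", ["los-del-rios-macarena.mp3"])
```

Every lookup depends on the one directory, which bears on the trade-off question above: the index is simple and fast, but it is also a single point of failure and control.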
15
Other Services with Similar Directory + Peer
Architectures
  • Windows Live Sync
  • Google Desktop Search with multiple machines
  • BitTorrent trackers are quite similar (we'll
    discuss BitTorrent more later)

16
Inverted Indices for Content Search
  • A forward index: documents to words
  • The inverted index: words to word-occurrences
  • The basis of most information retrieval engines,
    Google, etc.
  • Can handle positional predicates
  • But how can we reconstruct previews?
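
A minimal Python sketch of a positional inverted index follows. The documents and the function names `build_inverted_index` and `phrase_search` are made up for illustration, and tokenization is assumed to be plain whitespace splitting. The phrase search is one simple example of a positional predicate.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """docs: {doc_id: text}. Returns {word: {doc_id: [positions]}}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index


def phrase_search(index, phrase):
    """Doc ids in which the words of `phrase` occur at consecutive positions."""
    words = phrase.lower().split()
    if not words or any(w not in index for w in words):
        return []
    hits = []
    for doc_id, starts in index[words[0]].items():
        for start in starts:
            # Word k of the phrase must appear k positions after the first.
            if all(start + k in index[w].get(doc_id, ())
                   for k, w in enumerate(words)):
                hits.append(doc_id)
                break
    return sorted(hits)


docs = {
    "d1": "the domain name system is a directory",
    "d2": "a directory maps a name to an address",
    "d3": "name the system",
}
index = build_inverted_index(docs)
```

Here `phrase_search(index, "name system")` finds only `d1`: `d3` contains both words, but not at adjacent positions.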

17
Naming People and Devices: LDAP
  • Lightweight Directory Access Protocol
  • Hierarchical naming system that can be
    partitioned and replicated

18
LDAP's Schema
  • LDAP information has an XML-like schema
  • A unique name in LDAP is called a Distinguished
    Name (dn), and consists of a sequence of
    attributes representing a hierarchy, from
    most-specific to least-specific (as in DNS
    names)
  • o: organization; dc: domain component
  • ou: organizational unit
  • uid: user ID
  • cn: common name
  • c: country; st: state; l: locality
  • Can also have objectClass: the type of entity
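
To make the most-specific-to-least-specific ordering concrete, here is a small Python sketch that splits a distinguished name into its attribute/value pairs. The example DN and the helper names `parse_dn` and `parent_dn` are hypothetical, and the parser deliberately ignores escaped commas and multi-valued components, which real DNs may contain.

```python
def parse_dn(dn):
    """Split a DN into (attribute, value) pairs, most-specific first.

    Simplifying assumption: no escaped commas and no multi-valued components.
    """
    pairs = []
    for component in dn.split(","):
        attr, _, value = component.strip().partition("=")
        pairs.append((attr.lower(), value))
    return pairs


def parent_dn(dn):
    """Drop the most-specific component to move one level up the hierarchy."""
    return dn.partition(",")[2].strip()


example_dn = "uid=zives,ou=people,dc=upenn,dc=edu"  # hypothetical entry
components = parse_dn(example_dn)
```

Reading `components` left to right walks from the entry itself up to the root of the tree, much as a DNS name reads from host to top-level domain.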

19
LDAP Hierarchy
(Figure from Brad Marshall, LDAP Tutorial,
quark.humbug.au/publications/ldap_tut.html)
20
Querying LDAP
  • LDAP queries are mostly attribute-value
    predicates
  • uid=zives, ou=penn, c=usa
  • (|(cn=Susan Davidson)(cn=Zachary Ives)(cn=Val
    Tannen))
  • objectclass=posixAccount
  • (!(cn=Val Tannen))
  • How does this differ from XPath?
  • How might we process these queries?
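
One naive way to process such predicates is a linear scan that evaluates a filter tree against each entry. The Python sketch below assumes filters are already parsed into nested tuples (a made-up representation, not LDAP's wire or string format), that attribute values compare case-insensitively, and that the sample entries are hypothetical. A real directory server would more likely use per-attribute indexes than scan everything.

```python
def matches(entry, flt):
    """entry: {attribute: [values]}.

    flt is a nested tuple: ("eq", attr, value), ("and", f1, f2, ...),
    ("or", f1, f2, ...), or ("not", f).
    """
    op = flt[0]
    if op == "eq":
        _, attr, value = flt
        values = entry.get(attr.lower(), [])
        return value.lower() in (v.lower() for v in values)
    if op == "and":
        return all(matches(entry, f) for f in flt[1:])
    if op == "or":
        return any(matches(entry, f) for f in flt[1:])
    if op == "not":
        return not matches(entry, flt[1])
    raise ValueError(f"unknown operator: {op}")


def search(entries, flt):
    """Linear scan; returns the uid of every matching entry."""
    return [e["uid"][0] for e in entries if matches(e, flt)]


# Hypothetical directory contents.
entries = [
    {"uid": ["zives"], "cn": ["Zachary Ives"], "ou": ["penn"], "c": ["usa"],
     "objectclass": ["posixAccount"]},
    {"uid": ["susan"], "cn": ["Susan Davidson"], "ou": ["penn"], "c": ["usa"],
     "objectclass": ["posixAccount"]},
    {"uid": ["val"], "cn": ["Val Tannen"], "ou": ["penn"], "c": ["usa"],
     "objectclass": ["person"]},
]
```

Unlike an XPath expression, these filters never navigate the tree: they test attributes of one entry at a time, and the hierarchy only determines where the search starts.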

21
The Backbone of Internet Naming: Domain Name
System
  • A simple, hierarchical name system with a
    distributed database; each domain controls its
    own names

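(Figure: the DNS name tree. Top-level domains such as com and
edu sit below the root; second-level domains such as columbia,
upenn, berkeley, and amazon sit below those; hosts and
subdomains such as www, cis, and sas form the lower levels.)
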
22
Top-Level Domains (TLDs)
  • Mostly controlled by Network Solutions, Inc.
    today
  • .com: commercial
  • .edu: educational institution
  • .gov: US government
  • .mil: US military
  • .net: networks and ISPs (now also a number of
    other things)
  • .org: other organizations
  • 244 two-letter country suffixes, e.g., .us, .uk,
    .cz, .tv, ...
  • and a bunch of new suffixes that are not very
    common, e.g., .biz, .name, .pro, ...

23
Finding the Root
  • 13 root servers store entries for all top-level
    domains (TLDs)
  • DNS servers have a hard-coded mapping to root
    servers so they can get started

24
Excerpt from DNS Root Server Entries
  • ; This file is made available by InterNIC
    registration services under anonymous FTP as
  • ; file /domain/named.root
  • ; formerly NS.INTERNIC.NET
  • . 3600000 IN NS A.ROOT-SERVERS.NET.
  • A.ROOT-SERVERS.NET. 3600000 A 198.41.0.4
  • ; formerly NS1.ISI.EDU
  • . 3600000 NS B.ROOT-SERVERS.NET.
  • B.ROOT-SERVERS.NET. 3600000 A 128.9.0.107
  • ; formerly C.PSI.NET
  • . 3600000 NS C.ROOT-SERVERS.NET.
  • C.ROOT-SERVERS.NET. 3600000 A 192.33.4.12

(13 servers in total, A through M)
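
A resolver could load such a hints file into a simple name-to-address table at startup. The Python sketch below is an assumption about how that might look, not how any particular DNS server does it. It uses only the B and C entries from the excerpt, treats `;` as the comment marker, and accepts an optional `IN` class field.

```python
ROOT_HINTS = """\
; formerly NS1.ISI.EDU
.                    3600000    NS  B.ROOT-SERVERS.NET.
B.ROOT-SERVERS.NET.  3600000    A   128.9.0.107
; formerly C.PSI.NET
.                    3600000    NS  C.ROOT-SERVERS.NET.
C.ROOT-SERVERS.NET.  3600000    A   192.33.4.12
"""


def parse_root_hints(text):
    """Return {root_server_name: address} from NS and A records."""
    root_servers = []
    addresses = {}
    for line in text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) > 2 and fields[2].upper() == "IN":
            del fields[2]                     # the class field is optional
        if len(fields) != 4:
            continue                          # skip anything unexpected
        name, _ttl, rtype, data = fields
        if rtype == "NS" and name == ".":
            root_servers.append(data)         # a name server for the root zone
        elif rtype == "A":
            addresses[name] = data            # that server's address
    return {s: addresses[s] for s in root_servers if s in addresses}


hints = parse_root_hints(ROOT_HINTS)
```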
25
Supposing We Were to Build DNS
  • How would we start? How is a lookup performed?
  • (Hint: what do you need to specify when you add
    a client to a network that doesn't do DHCP?)
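
As one possible answer to the lookup question, the Python sketch below simulates a resolver that starts at a root server and follows referrals down the hierarchy. Everything in it is hypothetical: the server names, the zone contents, and the address (taken from a range reserved for documentation). No network traffic is involved; each "server" is just a dictionary.

```python
# Hypothetical zone data. Each server maps a name suffix either to a
# referral (the next server to ask) or to a final address.
SERVERS = {
    "root": {
        "edu.": ("referral", "edu-server"),
        "com.": ("referral", "com-server"),
    },
    "edu-server": {"upenn.edu.": ("referral", "upenn-server")},
    "com-server": {},
    "upenn-server": {"www.cis.upenn.edu.": ("address", "192.0.2.10")},
}


def resolve(name, start="root", max_hops=10):
    """Follow referrals from `start`; returns (address or None, servers asked)."""
    server, asked = start, []
    labels = name.split(".")
    for _ in range(max_hops):
        asked.append(server)
        zone = SERVERS[server]
        match = None
        # Try the longest suffix of the name first, e.g. "upenn.edu."
        # before "edu.".
        for i in range(len(labels)):
            suffix = ".".join(labels[i:])
            if suffix in zone:
                match = zone[suffix]
                break
        if match is None:
            return None, asked       # this server knows nothing relevant
        kind, target = match
        if kind == "address":
            return target, asked
        server = target              # referral: ask the next server down
    raise RuntimeError("too many referrals")


address, asked = resolve("www.cis.upenn.edu.")
```

In this toy setup the lookup visits `root`, `edu-server`, and `upenn-server` in turn, which mirrors how each domain controls its own names.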

26
Issues in DNS
  • We know that everyone wants to be my-domain.com
  • How does this mesh with the assumptions inherent
    in our hierarchical naming system?
  • What happens if things move frequently?
  • What happens if we want to provide different
    behavior to different requestors (e.g., Akamai)?

27
Directories Summarized
  • An efficient way of finding data, assuming:
  • Data doesn't change too often, hence it can be
    replicated and distributed
  • Hierarchy is relatively wide and flat
  • Caching is present, helping with repeated queries
  • Directories generally rely on names at their core
  • Sometimes we want to search based on other means,
    e.g., predicates or filters over content
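
The caching assumption can be illustrated with a small time-to-live cache in Python. The class name `TTLCache` and the injectable clock are choices made for this sketch, and the name and address are hypothetical. The point is that a cached answer is reused until it expires, which only works well when the data changes rarely.

```python
import time


class TTLCache:
    """Cache whose entries expire after a per-entry time to live (seconds)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expiry time)

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        hit = self._entries.get(name)
        if hit is None:
            return None
        value, expires_at = hit
        if self._clock() >= expires_at:
            del self._entries[name]  # stale: the caller must look it up again
            return None
        return value


now = [0.0]  # fake clock, so the demonstration does not have to sleep
cache = TTLCache(clock=lambda: now[0])
cache.put("www.upenn.edu.", "192.0.2.10", ttl=60)
first = cache.get("www.upenn.edu.")   # within the TTL: served from cache
now[0] = 61.0
second = cache.get("www.upenn.edu.")  # after expiry: None
```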

28
Pushing the Search to the Network: Flooding
Requests (Gnutella)
  • Node A wants a data item; it asks B and C
  • If B and C don't have it, they ask their
    neighbors, etc.
  • What are the implications of this model?

(Figure: a peer-to-peer overlay network of nodes A through I.)
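
Flooding can be sketched in Python as a breadth-first search with a hop limit. The overlay edges and the file placement below are invented for illustration, since the transcript does not preserve the figure's links. The sketch also assumes that nodes drop queries they have already seen and that every forwarded copy counts as one message.

```python
from collections import deque

# Hypothetical overlay: A's neighbors are B and C, as on the slide; the
# remaining edges are made up.
NEIGHBORS = {
    "A": ["B", "C"],
    "B": ["A", "D", "E"],
    "C": ["A", "F"],
    "D": ["B", "G"],
    "E": ["B", "H"],
    "F": ["C", "I"],
    "G": ["D"],
    "H": ["E"],
    "I": ["F"],
}
HOLDINGS = {"F": {"song.mp3"}, "H": {"song.mp3"}}  # hypothetical placement


def flood_query(origin, item, ttl):
    """Return (nodes holding `item`, number of query messages sent)."""
    seen = {origin}
    frontier = deque([(origin, ttl)])
    found, messages = [], 0
    while frontier:
        node, hops_left = frontier.popleft()
        if hops_left == 0:
            continue                 # hop limit reached: do not forward
        for neighbor in NEIGHBORS[node]:
            messages += 1            # every forwarded copy costs a message
            if neighbor in seen:
                continue             # duplicate query: dropped on arrival
            seen.add(neighbor)
            if item in HOLDINGS.get(neighbor, ()):
                found.append(neighbor)
            frontier.append((neighbor, hops_left - 1))
    return sorted(found), messages
```

Raising the hop limit from 2 to 3 in this toy overlay finds a second copy of the item but nearly doubles the message count, which hints at the scaling cost the question above is pointing to.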
29
Bringing the Data to the Router:
Publish-Subscribe
  • Generally, too much data to store centrally, but
    perhaps we only need a central coordinator!
  • Interested parties register a profile with the
    system (often in a central server)
  • In, for instance, XPath!
  • Data gets aggregated at some sort of router or by
    a crawler, and then gets disseminated to
    individuals
  • Based on match between content and the profile
  • Data changes often, but queries don't!

30
An Example: XML-Based Information Dissemination
  • Basic model (XFilter, YFilter, Xyleme)
  • Users are interested in data relating to a
    particular topic, and know the schema
  • /politics/usa//body
  • A crawler-aggregator reads XML files from the web
    (or gets them from data sources) and feeds them
    to interested parties
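
The dissemination model can be sketched in Python as a set of stored path profiles matched against each incoming document. This is a deliberately naive illustration: it checks every profile independently, whereas systems such as XFilter and YFilter compile profiles into automata. The subscriber names are hypothetical, and only the `/` and `//` steps with plain element names are supported.

```python
import xml.etree.ElementTree as ET


def parse_profile(path):
    """Turn '/a/b//c' into [('child', 'a'), ('child', 'b'), ('desc', 'c')]."""
    steps, i = [], 0
    while i < len(path):
        if path.startswith("//", i):
            axis, i = "desc", i + 2
        elif path[i] == "/":
            axis, i = "child", i + 1
        else:
            raise ValueError(f"unsupported profile: {path}")
        j = path.find("/", i)
        if j == -1:
            j = len(path)
        steps.append((axis, path[i:j]))
        i = j
    return steps


def profile_matches(root, steps):
    """True if the path selects at least one element of the document."""
    context = [None]  # None stands for the document node above the root
    for axis, name in steps:
        selected = []
        for node in context:
            if node is None:
                candidates = [root] if axis == "child" else root.iter()
            elif axis == "child":
                candidates = list(node)
            else:
                candidates = (d for d in node.iter() if d is not node)
            selected.extend(c for c in candidates if c.tag == name)
        context = selected
        if not context:
            return False
    return True


# Hypothetical subscribers and their registered profiles.
PROFILES = {"alice": "/politics/usa//body", "bob": "/sports//body"}


def route(xml_text):
    """Return the subscribers whose profile matches the document."""
    root = ET.fromstring(xml_text)
    return sorted(user for user, path in PROFILES.items()
                  if profile_matches(root, parse_profile(path)))


doc = ("<politics><usa><story><headline>h</headline>"
       "<body>text</body></story></usa></politics>")
recipients = route(doc)
```

The profiles are parsed once per document here for brevity; since queries change rarely and data changes often, a real system would prepare the profiles ahead of time and stream documents past them.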

31
Engine for XFilter [Altinel & Franklin 00]