Title: DNS and Content-Based Addressing
1. DNS and Content-Based Addressing
- Zachary G. Ives
- University of Pennsylvania
- CIS 455 / 555 Internet and Web Systems
- February 12, 2009
2. Reminders and Recap
- Homework 1 Milestone 2 due 2/17
- We have been discussing schemes for finding data
- XPath: path queries over hierarchical XML
- Content-based addressing: keyword search
- Directories: Napster and LDAP
- Today we see more schemes that build upon similar ideas
- DNS: hierarchical administration, heavy caching at all levels
- Gnutella
- Can make requests based on filter conditions
- But flooding is expensive
- XFilter: find the data through a centralized crawler and XPath
3. The Backbone of Internet Naming: Domain Name Service
- A simple, hierarchical name system with a distributed database; each domain controls its own names
[Figure: the DNS name hierarchy. Top-level domains such as com and edu branch into domains such as amazon, columbia, upenn, and berkeley, which in turn branch into hosts and subdomains such as www, cis, and sas.]
4. Top-Level Domains (TLDs)
- Mostly controlled by Network Solutions, Inc. today
- .com: commercial
- .edu: educational institution
- .gov: US government
- .mil: US military
- .net: networks and ISPs (now also a number of other things)
- .org: other organizations
- 244 2-letter country suffixes, e.g., .us, .uk, .cz, .tv, …
- and a bunch of new suffixes that are not very common, e.g., .biz, .name, .pro, …
5. Finding the Root
- 13 root servers store entries for all top-level domains (TLDs)
- DNS servers have a hard-coded mapping to root servers so they can get started
- These can be updated by UDP messages (single packet)
6. Excerpt from DNS Root Server Entries
- This file is made available by InterNIC registration services under anonymous FTP as file /domain/named.root
-
- formerly NS.INTERNIC.NET
- . 3600000 IN NS A.ROOT-SERVERS.NET.
- A.ROOT-SERVERS.NET. 3600000 A 198.41.0.4
-
- formerly NS1.ISI.EDU
- . 3600000 NS B.ROOT-SERVERS.NET.
- B.ROOT-SERVERS.NET. 3600000 A 128.9.0.107
-
- formerly C.PSI.NET
- . 3600000 NS C.ROOT-SERVERS.NET.
- C.ROOT-SERVERS.NET. 3600000 A 192.33.4.12
(13 servers in total, A through M)
7. Supposing We Were to Build DNS
- How would we start? How is a lookup performed? (See the sketch below.)
- (Hint: what do you need to specify when you add a client to a network that doesn't do DHCP?)
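As an illustration from the client's side, here is a minimal sketch using Java's stub-resolver API (the host name is just an example); the locally configured DNS server performs the actual walk from the root servers down to the authoritative server and caches what it learns along the way:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Minimal sketch: ask the locally configured stub resolver to look up a name.
    // The resolver walks the hierarchy (root servers, then .edu, then upenn.edu)
    // on our behalf and caches the answers it receives.
    public class Lookup {
        public static void main(String[] args) throws UnknownHostException {
            String name = (args.length > 0) ? args[0] : "www.cis.upenn.edu";  // example host
            InetAddress addr = InetAddress.getByName(name);
            System.out.println(name + " -> " + addr.getHostAddress());
        }
    }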
8. Issues in DNS
- We know that everyone wants to be my-domain.com
- How does this mesh with the assumptions inherent in our hierarchical naming system?
- What happens if things move frequently?
- What happens if we want to provide different behavior to different requestors (e.g., Akamai)?
9. Directories Summarized
- An efficient way of finding data, assuming:
- Data doesn't change too often, hence it can be replicated and distributed
- Hierarchy is relatively wide and flat
- Caching is present, helping with repeated queries
- Directories generally rely on names at their core
- Sometimes we want to search based on other means, e.g., predicates or filters over content
10. Pushing the Search to the Network: Flooding Requests (Gnutella)
- Node A wants a data item; it asks B and C
- If B and C don't have it, they ask their neighbors, etc. (sketched in code below)
- What are the implications of this model?
[Figure: a Gnutella overlay of peers A through I; A's request is flooded to its neighbors B and C, which forward it to their neighbors, and so on.]
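Here is a toy sketch of TTL-limited flooding over an in-memory peer graph (class and field names are ours; this ignores Gnutella's actual wire protocol and the back-propagation of query hits):

    import java.util.*;

    // Illustrative sketch of Gnutella-style flooding (not the real protocol).
    // Each peer forwards a query to its neighbors until the TTL runs out,
    // remembering query IDs it has seen so it processes each query only once.
    class Peer {
        final String name;
        final List<Peer> neighbors = new ArrayList<>();
        final Set<String> items = new HashSet<>();
        final Set<UUID> seen = new HashSet<>();

        Peer(String name) { this.name = name; }

        void query(UUID id, String item, int ttl) {
            if (!seen.add(id) || ttl <= 0) return;          // drop duplicates and expired queries
            if (items.contains(item)) {
                System.out.println(name + " has " + item);  // in reality, a QueryHit travels back
                return;
            }
            for (Peer n : neighbors) n.query(id, item, ttl - 1);  // flood to all neighbors
        }
    }

    public class FloodDemo {
        public static void main(String[] args) {
            Peer a = new Peer("A"), b = new Peer("B"), c = new Peer("C"), d = new Peer("D");
            a.neighbors.addAll(List.of(b, c));
            b.neighbors.addAll(List.of(a, d));
            c.neighbors.addAll(List.of(a, d));
            d.items.add("song.mp3");
            a.query(UUID.randomUUID(), "song.mp3", 3);  // message count grows with fan-out and TTL
        }
    }

Even this tiny example shows the cost: every peer within TTL hops sees every query, whether or not it can answer it.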
11. Bringing the Data to the Router: Publish-Subscribe
- Generally, too much data to store centrally, but perhaps we only need a central coordinator!
- Interested parties register a profile with the system (often in a central server)
- In, for instance, XPath!
- Data gets aggregated at some sort of router or by a crawler, and then gets disseminated to individuals
- Based on a match between content and the profile
- Data changes often, but queries don't!
13. An Example: XML-Based Information Dissemination
- Basic model (XFilter, YFilter, Xyleme)
- Users are interested in data relating to a particular topic, and know the schema
- /politics/usa//body
- A crawler-aggregator reads XML files from the web (or gets them from data sources) and feeds them to interested parties (a naive version is sketched below)
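For contrast with XFilter, a naive dissemination engine could simply re-evaluate every registered profile against each arriving document. The sketch below does that with the standard javax.xml.xpath API (the subscriber names, the second profile, and the sample document are made up); the cost grows with documents times profiles, which is exactly what XFilter's shared machinery avoids:

    import java.io.StringReader;
    import java.util.Map;
    import javax.xml.xpath.*;
    import org.xml.sax.InputSource;

    // Naive publish-subscribe matcher: evaluate each registered XPath profile
    // against every arriving document.
    public class NaivePubSub {
        public static void main(String[] args) throws Exception {
            Map<String, String> profiles = Map.of(
                "alice", "/politics/usa//body",   // subscriber -> XPath profile
                "bob",   "//sports//score");      // made-up second profile
            String doc = "<politics topic='president'><usa><body><p>text</p></body></usa></politics>";

            XPath xpath = XPathFactory.newInstance().newXPath();
            for (Map.Entry<String, String> e : profiles.entrySet()) {
                Boolean match = (Boolean) xpath.evaluate(
                    e.getValue(), new InputSource(new StringReader(doc)), XPathConstants.BOOLEAN);
                if (match) System.out.println("deliver document to " + e.getKey());
            }
        }
    }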
14. Engine for XFilter [Altinel & Franklin '00]
15. How Does It Work?
- Each XPath segment is basically a subset of regular expressions over element tags
- Convert into finite state automata
- Parse data as it comes in, using the SAX API
- Match against finite state machines (see the sketch below)
- Most of these systems use modified FSMs because they want to match many patterns at the same time
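As a concrete illustration of the FSM idea, this sketch hard-codes a single automaton for /politics/usa//body and drives it from SAX events; XFilter shares one index across many such machines instead of building one handler per query:

    import java.io.StringReader;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.Attributes;
    import org.xml.sax.InputSource;
    import org.xml.sax.helpers.DefaultHandler;

    // Sketch: drive a hand-built FSM for /politics/usa//body from SAX events.
    // State 0 expects <politics> at depth 1, state 1 expects <usa> at depth 2,
    // state 2 accepts <body> at any deeper level (the // step).
    public class SingleQueryFsm extends DefaultHandler {
        private int state = 0;
        private int depth = 0;

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            depth++;
            if (state == 0 && depth == 1 && qName.equals("politics")) state = 1;
            else if (state == 1 && depth == 2 && qName.equals("usa")) state = 2;
            else if (state == 2 && depth > 2 && qName.equals("body")) {
                System.out.println("matched /politics/usa//body");
                state = 3;  // accepting state
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) { depth--; }

        public static void main(String[] args) throws Exception {
            String doc = "<politics><usa><body>text</body></usa></politics>";
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(doc)), new SingleQueryFsm());
        }
    }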
16. Path Nodes and FSMs
- XPath parser decomposes XPath expressions into a set of path nodes
- These nodes act as the states of the corresponding FSM
- A node in the Candidate List denotes the current state
- The rest of the states are in the corresponding Wait Lists
- Simple FSM for /politics[@topic="president"]/usa//body
[Figure: the FSM's states Q1_1, Q1_2, Q1_3 are reached on the elements politics, usa, and body, respectively.]
17. Decomposing Into Path Nodes
Q1: /politics[@topic="president"]/usa//body
- Query ID
- Position in state machine
- Relative Position (RP) in tree:
- 0 for the root node if it's not preceded by //
- -1 for any node preceded by //
- Else, the distance from the predecessor node (1 + the number of wildcard nodes in between)
- Level:
- If the current node has a fixed distance from the root, then 1 + distance
- Else, if RP = -1 then -1, else 0
- Finally, NextPathNodeSet points to the next node (see the code sketch after the tables below)
Path nodes for Q1:

  Node   Query ID   Position   RP   Level
  Q1-1   Q1         1           0    1
  Q1-2   Q1         2           1    2
  Q1-3   Q1         3          -1   -1

Q2: //usa/*/body/p

Path nodes for Q2:

  Node   Query ID   Position   RP   Level
  Q2-1   Q2         1          -1   -1
  Q2-2   Q2         2           2    0
  Q2-3   Q2         3           1    0
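One way to represent these path nodes in code is a small record (a sketch assuming a recent Java; the class and field names are ours, not the paper's), populated with the two decompositions above:

    import java.util.List;

    // Sketch of XFilter-style path nodes for the two example queries.
    // relativePos: 0 = first node of an absolute path, -1 = preceded by //,
    //              otherwise the distance from the predecessor node.
    // level: expected document level, or -1 if it is only known at runtime.
    record PathNode(String queryId, int position, int relativePos, int level) {}

    public class PathNodes {
        // Q1: /politics[@topic="president"]/usa//body
        static final List<PathNode> Q1 = List.of(
            new PathNode("Q1", 1, 0, 1),     // politics
            new PathNode("Q1", 2, 1, 2),     // usa
            new PathNode("Q1", 3, -1, -1));  // body (reached via //)

        // Q2: //usa/*/body/p
        static final List<PathNode> Q2 = List.of(
            new PathNode("Q2", 1, -1, -1),   // usa (preceded by //)
            new PathNode("Q2", 2, 2, 0),     // body (one wildcard step from usa)
            new PathNode("Q2", 3, 1, 0));    // p

        public static void main(String[] args) {
            Q1.forEach(System.out::println);
            Q2.forEach(System.out::println);
        }
    }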
18. Query Index
- Query index entry for each XML tag
- Two lists, the Candidate List (CL) and the Wait List (WL), divided across the nodes (sketched in code below)
- Live queries' states are in the CL; pending queries' states are in the WL
- Events that cause state transitions are generated by the XML parser

  Element    CL     WL
  politics   Q1-1   -
  usa        Q2-1   Q1-2
  body       -      Q1-3, Q2-2
  p          -      Q2-3
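A minimal in-memory version of this index might look like the following sketch (names are illustrative): one entry per element name, seeded with the CL and WL contents from the table above:

    import java.util.*;

    // Sketch of the XFilter query index: one entry per element name, holding a
    // Candidate List (states of live queries) and a Wait List (pending states).
    public class QueryIndex {
        static final class Entry {
            final List<String> candidateList = new ArrayList<>();
            final List<String> waitList = new ArrayList<>();
        }

        final Map<String, Entry> byElement = new HashMap<>();

        Entry entry(String element) {
            return byElement.computeIfAbsent(element, e -> new Entry());
        }

        public static void main(String[] args) {
            QueryIndex idx = new QueryIndex();
            // First path node of each query starts out live (CL); the rest wait (WL).
            idx.entry("politics").candidateList.add("Q1-1");
            idx.entry("usa").candidateList.add("Q2-1");
            idx.entry("usa").waitList.add("Q1-2");
            idx.entry("body").waitList.add("Q1-3");
            idx.entry("body").waitList.add("Q2-2");
            idx.entry("p").waitList.add("Q2-3");
            idx.byElement.forEach((el, e) ->
                System.out.println(el + "  CL=" + e.candidateList + "  WL=" + e.waitList));
        }
    }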
19. Encountering an Element
- Look up the element name in the Query Index and all nodes in the associated CL
- Validate that we actually have a match
[Figure: on a startElement(politics) event, the "politics" entry in the Query Index is consulted; its CL contains node Q1-1, with Query ID = Q1, Position = 1, Rel. Position = 0, Level = 1, plus a NextPathNodeSet pointer; its WL is empty.]
20. Validating a Match
- We first check that the current XML depth matches the level in the user query (sketched below)
- If the level in the CL node is less than 1, then ignore height
- else the level in the CL node must = height
- This ensures we're matching at the right point in the tree!
- Finally, we validate any predicates against attributes (e.g., @topic="president")
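In code, the level check reduces to a small predicate; this sketch uses our own names, following the earlier path-node example:

    // Sketch of XFilter's level check for a candidate path node.
    public class MatchValidation {
        // 'depth' is the current element's depth in the document (root = 1).
        static boolean levelMatches(int nodeLevel, int depth) {
            // level < 1 means the node is reached via //, so any depth is acceptable;
            // otherwise the stored level must equal the current depth exactly.
            return nodeLevel < 1 || nodeLevel == depth;
        }

        public static void main(String[] args) {
            System.out.println(levelMatches(1, 1));   // politics expected at depth 1, seen at 1 -> true
            System.out.println(levelMatches(2, 3));   // usa expected at depth 2, seen at 3 -> false
            System.out.println(levelMatches(-1, 5));  // body reached via //, any depth -> true
        }
    }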
21. Processing Further Elements
- Queries that don't meet validation are removed from the Candidate Lists
- For other queries, we advance to the next state (see the walkthrough below)
- We copy the next node of the query from the WL to the CL, and update the RP and level
- When we reach a final state (e.g., Q1-3), we can output the document to the subscriber
- When we encounter an end element, we must remove that element from the CL
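Putting the pieces together for Q1 alone, this walkthrough shows the WL-to-CL promotions as politics, usa, and body are encountered (a simplification: real XFilter copies states and recomputes the expected level rather than just moving node names around):

    import java.util.*;

    // Sketch of state transitions while matching Q1 = /politics[@topic="president"]/usa//body.
    public class Promote {
        public static void main(String[] args) {
            Map<String, List<String>> cl = new HashMap<>();
            Map<String, List<String>> wl = new HashMap<>();
            cl.put("politics", new ArrayList<>(List.of("Q1-1")));
            wl.put("usa", new ArrayList<>(List.of("Q1-2")));
            wl.put("body", new ArrayList<>(List.of("Q1-3")));

            // startElement(politics): Q1-1 validates, so promote Q1-2 into the CL for <usa>
            wl.get("usa").remove("Q1-2");
            cl.computeIfAbsent("usa", e -> new ArrayList<>()).add("Q1-2");

            // startElement(usa): Q1-2 validates, so promote Q1-3 into the CL for <body>
            wl.get("body").remove("Q1-3");
            cl.computeIfAbsent("body", e -> new ArrayList<>()).add("Q1-3");

            // startElement(body): Q1-3 is Q1's final state -> deliver the document to the subscriber
            System.out.println("CL after the three events: " + cl);
        }
    }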
22. Publish-Subscribe Model Summarized
- Currently not commonly used
- Partly because XML isn't that widespread
- This may change with the adoption of an XML format called RSS (Rich Site Summary or Really Simple Syndication)
- Many news sites, web logs, mailing lists, etc. use RSS to publish daily articles
- Seems like a perfect fit for publish-subscribe models!
23. Finding a Happy Medium
- We've seen two approaches:
- Do all the work at the data stores: flood the network with requests
- Do all the work via a central crawler: record profiles and disseminate matches
- An alternative: a two-step process
- Build a content index over what's out there
- Typically limited in what kinds of queries can be supported
- Most common instance: an index of document keywords (sketched below)
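As a sketch of that most common instance, an inverted keyword index maps each word to the set of documents containing it (document IDs and contents below are made up):

    import java.util.*;

    // Minimal inverted index: keyword -> set of IDs of documents containing it.
    public class InvertedIndex {
        public static void main(String[] args) {
            Map<String, String> docs = Map.of(
                "doc1", "president signs trade bill",
                "doc2", "senate debates trade policy");

            Map<String, Set<String>> index = new HashMap<>();
            docs.forEach((id, text) -> {
                for (String word : text.split("\\s+")) {
                    index.computeIfAbsent(word, w -> new TreeSet<>()).add(id);
                }
            });

            System.out.println(index.get("trade"));   // both documents
            System.out.println(index.get("senate"));  // only doc2
        }
    }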