Title: Naming
1. Naming
- Introduction to Distributed Systems
- CS 457/557, Fall 2008, Kenneth Chiu
2. Entities, Names, IDs, and Addresses
- A name is a sequence of bits that can be used to refer to an entity.
- Entity: The thing that we want to refer to.
- What properties are desired in a name?
- Location independence
- Easy to remember
- Suppose I have the name of an entity. Can I now operate on it immediately? Suppose I have your name and want to access you?
- To operate on an entity, it is necessary to access it, via an access point. Names of access points are addresses.
- Can an entity have more than one access point?
- Can an entity's access point change over time?
- Are names unique? Permanent?
- Are addresses unique? Permanent?
3.
- IDs are also a kind of name. What properties do they have?
- Refers to at most one entity.
- Each entity has at most one ID.
- An ID is permanent.
- Examples?
- What is your name? What is your address? What is your ID?
- SSN, phone number, passport number, street address, e-mail address.
- Can we substitute one for the other?
- Use your phone number as your name?
- Use your SSN as your name?
- Use your name as your SSN? Your phone number?
4. Central Question
- How do we resolve names to addresses?
5. Naming vs. Locating
- DNS works well for static situations, where addresses don't change very often.
- What if we assume that they do change?
- Suppose we want to change ftp.cs.binghamton.edu to ftp.cs.albany.edu. How?
- Change the IP address for ftp.cs.binghamton.edu.
- Make a symbolic link from ftp.cs.binghamton.edu to ftp.cs.albany.edu. In other words, put an entry in DNS that says ftp.cs.binghamton.edu has been renamed to ftp.cs.albany.edu.
- Compare and contrast?
- If the first is done over long distances, latency may be high. It could also be a bottleneck, since it is centralized.
- Adding indirection via a symbolic link can create very long chains. Chain management is an issue.
6.
- A better solution is to divide the problem into separate naming and location services.
- How many levels of naming are used in the typical Internet?
- Two-level mapping, using identities and separate naming and location services.
- Direct, single-level mapping between names and addresses.
7. Name Spaces
- The name space is the way that names in a particular system are organized. This also defines the set of all possible names.
- Examples?
- Phone numbers
- Credit card numbers
- DNS
- Human names in the US
- Files in UNIX, Windows
- URLs
8. Flat Naming
9. Flat Naming
- Given an unstructured name (ID), how do we locate the access point?
- Broadcasting
- Forwarding
- Home-based
- Distributed hash tables
- Hierarchical location service
10. Broadcast/Multicast Location
- How does Ethernet addressing work?
- MAC address.
- How does your switch/hub know the IP-to-MAC address mapping?
- ARP: send a request, get an answer.
- Disadvantages?
- Could waste bandwidth if the network is large.
- Interrupts hosts to check if they are the one being sought.
- What if we multicast instead of broadcast?
- Can also be used to find the best replica.
11. Forwarding
- Question: How does the post office deal with mobility?
- When an entity moves, leave a reference behind.
- Disadvantages?
- The chain gets too long if there is a lot of movement.
- All intermediate locations have to maintain the forwarding pointer.
- Vulnerable to failure.
- Performance is bad.
12. Forwarding Pointers in SSP
- The object is originally in P3, then moved to P4. P1 passes the object reference to P2.
- Do we always need to go through the whole chain?
13. Shortcut
- Redirecting a forwarding pointer by storing a shortcut in a client stub.
- How do we deal with broken chains?
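The chain-plus-shortcut behavior above can be sketched in a few lines. This is a toy model, not SSP's actual stub machinery: the `forward` map and the `locate` helper are invented names.

```python
# Minimal sketch of forwarding pointers with shortcutting: each process
# either holds the object or a pointer to the process it moved to.

def locate(forward, start):
    """Follow the chain from `start`; return the final location and
    shortcut every intermediate pointer directly to it."""
    chain = []
    node = start
    while node in forward:          # follow forwarding pointers
        chain.append(node)
        node = forward[node]
    for hop in chain:               # shortcut: redirect every hop to the end
        forward[hop] = node
    return node

# Object started at P3 and moved to P4, leaving a pointer; then moved again.
forward = {"P3": "P4"}
forward["P4"] = "P5"
print(locate(forward, "P3"))        # follows P3 -> P4 -> P5, prints P5
print(forward["P3"])                # now points straight at P5
```

After one lookup every intermediate location points directly at the final one, which is exactly why long chains only hurt the first traversal.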
14. Home-Based Approaches
- Let a home keep track of where the entity is. Mobile IP:
- All hosts use a fixed IP address (essentially functioning as an ID) as a home location (or home address). The home location is registered with a naming service.
- A home agent monitors this address.
- When the entity moves, it registers a foreign address as the care-of address.
- Clients send to the home location first. When the home agent receives a packet, it tunnels it to the current care-of address, and also responds back to the client with the current location.
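The home-agent behavior can be modeled very compactly. This is an illustrative sketch, not real Mobile IP: the addresses and the `move`/`send` names are made up.

```python
# Toy model of home-based location: the home agent maps a fixed home
# address to the entity's current care-of address.

care_of = {}                        # home address -> current care-of address

def move(home_addr, foreign_addr):
    """Entity registers a new care-of address with its home agent."""
    care_of[home_addr] = foreign_addr

def send(home_addr, packet):
    """Clients always send to the home address; the home agent tunnels the
    packet and reports the current location back to the client."""
    current = care_of.get(home_addr, home_addr)   # not moved: deliver at home
    return {"delivered_at": current, "update_for_client": current}

move("130.37.0.5", "192.0.2.7")     # entity moved to a foreign network
result = send("130.37.0.5", "hello")
print(result["delivered_at"])       # tunneled to 192.0.2.7
```

The `update_for_client` field mirrors the slide's point that the client learns the current location so later packets can skip the home.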
15. (No Transcript)
16.
- Disadvantages?
- The home address has to be supported as long as the entity is alive.
- The home address is fixed; what happens when the move is permanent?
- What if the entity is local, but the home is far away?
- Try a two-tiered scheme: first see if the entity is local, then try the home.
- A solution for permanent moves?
- Use a naming service to find the home.
17. Distributed Hash Tables
- Chord:
- Organize all nodes in a ring.
- Each node is assigned a random m-bit ID.
- Each entity is assigned a unique m-bit key.
- The entity with key k is managed by the node with the smallest id >= k (called the successor).
- Simple solution?
18.
- Node 1 receives a request to find key 19. What does it do?
(Figure: Chord ring with node p = 1, its predecessor pred(p), and successor succ(p+1); possible key values vs. actual nodes.)
19.
- Finger tables:
- Each node p maintains a finger table FT_p with at most m entries. FT_p[i] points to the first node succeeding p by at least 2^(i-1), i.e., FT_p[i] = succ(p + 2^(i-1)).
- To look up a key k, node p forwards the request to the node with index j satisfying FT_p[j] <= k < FT_p[j+1].
- If p < k < FT_p[1], then the request is forwarded directly to FT_p[1].
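The finger-table rules above are easiest to see running. This is a sketch for m = 5; the ring membership below is an assumption chosen for illustration, not part of the protocol.

```python
# Chord-style lookup with finger tables, for m = 5 (IDs and keys in [0, 32)).

M = 5
RING = 2 ** M
NODES = sorted([1, 4, 9, 11, 14, 18, 20, 21, 28])   # assumed ring membership

def succ(k):
    """First actual node with id >= k, wrapping around the ring."""
    k %= RING
    for n in NODES:
        if n >= k:
            return n
    return NODES[0]

def finger_table(p):
    """FT_p[i] = succ(p + 2^(i-1)), for i = 1..m."""
    return {i: succ(p + 2 ** (i - 1)) for i in range(1, M + 1)}

def in_interval(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

def lookup(p, k):
    """Forward a request for key k, starting at node p, until succ(k) is found."""
    while True:
        ft = finger_table(p)
        if in_interval(k, p, ft[1]):          # p < k <= succ(p+1): done
            return ft[1]
        nxt = p
        for i in range(M, 0, -1):             # largest finger not past k
            if in_interval(ft[i], p, k):
                nxt = ft[i]
                break
        if nxt == p:                          # shouldn't happen on a consistent ring
            return ft[1]
        p = nxt

print(lookup(1, 26))    # hops 1 -> 18 -> 20 -> 21, then returns node 28
print(lookup(28, 12))   # hops 28 -> 4 -> 9 -> 11, then returns node 14
```

These two calls reproduce the resolutions of key 26 from node 1 and key 12 from node 28 discussed on the following slides.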
20. (Figure: Chord ring of possible key values and actual nodes.)
21.
- Resolving key 26 from node 1 and key 12 from node 28 in a Chord system.
22. Lookup
- The primary task is to look up the node responsible for storing the value of a key, given that key.
- Typically, the key might be the hash of a file, for example.
- The process is that each node uses its finger table to decide which node to forward the request to.
23. (Figure: finger tables succ(p + 2^(i-1)); resolving k = 12 from node 28 and k = 26 from node 1; possible key values vs. actual nodes.)
24. Joining
25. (Figure: inserting new node 24.
1. The new node asks any node to look up succ(p+1).
2. The new node informs succ(p+1) that it is the new predecessor.)
26. (Figure, continued:
1. The new node asks any node to look up succ(p+1).
2. The new node informs succ(p+1) that it is the new predecessor.
3. The new node builds its own finger table by doing successive lookups.)
27.
- Maintaining connectivity:
- Nodes periodically contact succ(p+1) and ask it to return pred(succ(p+1)).
- If that is p itself, all is fine.
- If different, what happened?
- If different, some node q must have joined. p will set FT_p[1] to q, then contact q to ask for its predecessor.
- Nodes periodically contact pred(p) to see if it is still alive.
- If dead, pred(p) is set to null.
- If a node discovers that pred(succ(p+1)) is null, it informs succ(p+1) that its predecessor is very likely to be p.
- Maintaining finger tables:
- Nodes periodically look up succ(p + 2^(i-1)).
- As long as the table is not too far out of whack, lookups will still succeed efficiently.
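The stabilization steps above can be sketched as follows. This is a toy model under stated assumptions: real Chord exchanges messages rather than touching fields directly, and `Node`, `between`, and `stabilize` are illustrative names.

```python
# Sketch of Chord's periodic stabilization: a node asks its successor for
# the successor's predecessor; if someone new slipped in between, the node
# adopts it as its new successor and notifies it.

class Node:
    def __init__(self, ident):
        self.id = ident
        self.succ = None        # plays the role of FT_p[1]
        self.pred = None

    def stabilize(self):
        q = self.succ.pred
        if q is not None and q is not self and between(q.id, self.id, self.succ.id):
            self.succ = q       # a node joined between us and our successor
        self.succ.pred = self   # notify: "your predecessor is very likely me"

def between(x, a, b):
    """True if x lies strictly between a and b on the ring."""
    return (a < x < b) if a < b else (x > a or x < b)

# Nodes 1 and 9 form a ring; node 4 joins between them.
n1, n4, n9 = Node(1), Node(4), Node(9)
n1.succ, n9.pred = n9, n1
n4.succ = n9
n9.pred = n4                    # join step 2: node 4 told succ(4+1) about itself
n1.stabilize()                  # node 1 discovers node 4
print(n1.succ.id)               # 4
```

This is the case on the slide: after the join, the old predecessor learns about the new node purely through the periodic check.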
28.
- Locality: Chord ignores the network topology.
- Topology-based assignment: When assigning an ID to a node, make sure that nodes close in ID space are also close in the network.
- How do network failures impact this?
- Proximity routing: Maintain more than one alternative for each entry in the table. Currently, each entry is the first node in the range [p + 2^(i-1), p + 2^i - 1]. It can point to multiple nodes in that range, scattered in network space.
- Proximity neighbor selection: If there is a choice of neighbors, select the closest one.
29. Hierarchical Approaches
- Build a large-scale search tree by dividing the network into hierarchical domains.
- Each domain has a directory node.
- A generalization of:
- First try the local registry, then try the home.
- Can generalize to multiple tiers.
30.
- Hierarchical organization of a location service into domains, each having an associated directory node.
- Is this a hierarchical namespace?
- Note that the namespace is still flat!
31.
- Each entity in domain D has a location record in dir(D).
- The location record in the leaf domain contains the current address.
- The location record in a non-leaf domain contains a reference to the directory node of the correct next-lower domain.
(Figure: root R has record "E: N1"; node N1 has "E: N2"; node N2 has "E: address of E"; entity E and the client sit in the leaf domains.)
32.
- An example of storing information on an entity having two addresses in different leaf domains.
33.
- Looking up a location in a hierarchically organized location service.
34. (Repeat of the slide 31 figure: location records for E at R, N1, and N2.)
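The lookup walk through the directory tree can be sketched concretely. The node names and the `records`/`parent` tables below are invented for illustration.

```python
# Sketch of a hierarchical location-service lookup: climb from the local
# leaf until some dir(D) has a record for the entity, then follow the
# chain of records down to the address.

records = {                       # dir(D) -> {entity: child directory node or address}
    "R":  {"E": "N1"},
    "N1": {"E": "N2"},
    "N2": {"E": "addr-of-E"},
}
parent = {"N2": "N1", "N1": "R", "leaf-far-away": "R"}
directory_nodes = set(records)

def lookup(entity, start):
    node = start
    while entity not in records.get(node, {}):   # walk up: no record here
        node = parent[node]                      # (the root always has one)
    ref = records[node][entity]
    while ref in directory_nodes:                # walk down following records
        ref = records[ref][entity]
    return ref

print(lookup("E", "leaf-far-away"))   # climbs to R, then descends N1 -> N2
```

Starting near E (e.g. at N2) finds the address without ever visiting the root, which is the locality argument made on the later slides.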
35.
- Inserting a replica: the request is forwarded to the first node that knows about entity E.
36.
- A chain of forwarding pointers to the leaf node is created.
37.
- Delete operations are analogous.
- Directory node dir(D) in the leaf domain is requested to remove its entry for E.
- If dir(D) has no more entries for E (no more replicas), it then requests its parent to also remove the entry pointing to dir(D).
38. Why Is This Any Better?
- Let's play Devil's Advocate and compare this to the straightforward solutions.
- Exploits locality:
- Search expands in a ring.
- As an entity moves, the update is usually a local operation.
39. Pointer Caching
- How effective is caching addresses?
- Depends on the degree of mobility.
- If very mobile, what can we do to help?
- If D is the smallest domain that E moves around in, we can cache dir(D).
- Called pointer caching.
40.
- Caching a reference to a directory node of the lowest-level domain in which an entity will reside most of the time.
41.
- If a replica is inserted locally, caches should be updated to point to the local replica.
42. Scalability
- Where is the bottleneck here?
- The root has to store everything.
- How to address this?
- Federate/distribute the root (multiple roots).
- Each root is responsible for a subset.
- A cluster solution?
- Also distribute geographically.
43. Scalability Issues
- Locating subnodes correctly is a challenge.
44. Structured Names
45. Naming Graph
- Path, local name, absolute name.
- Should it be a tree, a DAG, or allow cycles?
46. Name Spaces (2)
- The general organization of the UNIX file system implementation on a logical disk of contiguous disk blocks.
47. Name Resolution
- Looking up a name (finding the value) is called name resolution.
- Closure mechanism (or: where to start):
- How to find the root node, for example.
- Examples: file systems, ZIP codes, DNS.
48. Aliases
- Aliases:
- Can be hard.
- Can be soft, like a forwarding address.
49.
- Naming graph for a symbolic link.
50. Merging Namespaces
- How can we merge namespaces? Are there any issues?
- Mounting:
- Can be used to merge namespaces.
- In the hierarchical case, what is needed is a special directory node that jumps to the other namespace.
51. Linking and Mounting
- Consider a collection of hierarchical namespaces distributed across different machines.
- Each namespace is implemented by a different server.
- Information required to mount a foreign name space in a distributed system:
- The name of an access protocol.
- The name of the server.
- The name of the mounting point in the foreign name space.
52.
- Mounting remote name spaces through a specific access protocol.
- How do you access steen's mbox from Machine A?
(Figure: network connecting the machines.)
53. Name Space Implementation
- Name spaces always map names to something.
- DNS maps what to what?
- Can be divided into three layers:
- Global layer: Doesn't change very often.
- Administrational layer: A single organization, like a department or division.
- Managerial layer: Changes regularly, such as a local area network.
54. (Figure: global layer, administrational layer, managerial layer.)
55. Name Server Characteristics
- A comparison between name servers for implementing nodes from a large-scale name space partitioned into a global layer, an administrational layer, and a managerial layer.
56. Name Resolution
- A name resolver looks up names.
- How about a simple hash table?
- Bottleneck if there is just one.
- Replicate?
- How do you update a record? Every single replica?
- The idea is that you want to distribute the load, but do it in the right way.
- Assume that we use a hierarchical name space.
57. Iterative Name Resolution
- Consider the name root:<nl, vu, cs, ftp, pub, globe, index.txt>.
58. Recursive Name Resolution
- Which loads the root server more?
59. Recursion and Caching
- Recursive name resolution of <nl, vu, cs, ftp>. Name servers cache intermediate results for subsequent lookups.
- Is iterative or recursive better for caching?
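The two resolution modes can be contrasted with a toy resolver. The zone data and the address below are invented for illustration, not real DNS contents.

```python
# Toy resolver for the name <nl, vu, cs, ftp>: iteratively the client
# contacts each server itself; recursively each server forwards the
# remainder of the name onward.

zones = {
    "root":  {"nl": "ns-nl"},
    "ns-nl": {"vu": "ns-vu"},
    "ns-vu": {"cs": "ns-cs"},
    "ns-cs": {"ftp": "130.37.24.11"},   # an illustrative address
}

def resolve_iterative(name):
    """Client walks the servers itself, one round trip per label."""
    server, messages = "root", 0
    for label in name:
        messages += 1                   # client <-> current server
        server = zones[server][label]
    return server, messages

def resolve_recursive(name, server="root"):
    """Each server resolves its label and forwards the rest."""
    head, rest = name[0], name[1:]
    nxt = zones[server][head]
    return nxt if not rest else resolve_recursive(rest, nxt)

print(resolve_iterative(["nl", "vu", "cs", "ftp"]))   # ('130.37.24.11', 4)
print(resolve_recursive(["nl", "vu", "cs", "ftp"]))   # 130.37.24.11
```

Note that in the recursive version every intermediate server sees the resolved result pass through it, which is what makes caching at those servers effective.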
60. Name Resolution Communication Costs
- The comparison between recursive and iterative name resolution with respect to communication costs.
61. Name Resolution Communication Costs
- A comparison of iterative vs. recursive resolution.
(Figure: a client resolving a name through the nl, vu, and cs name-server nodes; recursive messages R1-R3 and iterative messages I1-I3, with the long-distance communication marked.)
62. Example: DNS
- The name space is a tree of nodes.
- A label can be up to 63 characters.
- The maximum name length is 255 characters.
- Path names are represented two ways:
- root:<nl, vu, cs, flits>
- flits.cs.vu.nl.
- A subtree is a domain. Its path name is a domain name.
- A node contains a collection of resource records.
- A zone is the part of the tree that a nameserver is responsible for.
- A domain is made up of one or more zones.
63. Resource Records
64. DNS Implementation
- Each zone is managed by a name server.
65. Node Contents
- An excerpt from the DNS database for the zone cs.vu.nl.
66. (No Transcript)
67. DNS Subdomains
- Part of the description for the vu.nl domain, which contains the cs.vu.nl domain.
68. Decentralized DNS
- Basic idea: Take the DNS name, hash it, and use a DHT to look up the key.
- Disadvantage?
- Pastry:
- Prefixes of keys are used to route to nodes.
- Each digit is taken from base b.
- Suppose you have base 4. A node with ID 3210 is responsible for all keys with prefix 321. It keeps the following table.
69.
- Suppose it receives a lookup request for 3123? For 1000?
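One routing step of this prefix scheme can be sketched directly. The routing-table contents below are invented for illustration; only the prefix rule is the point.

```python
# Sketch of Pastry-style prefix routing (base 4, 4-digit IDs): forward to a
# node whose ID shares a strictly longer prefix with the key.

def shared_prefix_len(a, b):
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def route(node_id, key, routing_table):
    """routing_table[(row, digit)] holds a node sharing `row` digits with
    node_id and having `digit` as its next digit."""
    p = shared_prefix_len(node_id, key)
    if p == len(key):
        return node_id                       # this node is responsible
    return routing_table.get((p, key[p]), node_id)

# One step at node 3210 (responsible for prefix 321), with an invented table:
table = {(1, "1"): "3102",                   # keys starting 31...
         (0, "1"): "1023"}                   # keys starting 1...
print(route("3210", "3123", table))          # shares "3", next digit 1 -> 3102
print(route("3210", "1000", table))          # shares "",  next digit 1 -> 1023
```

Each hop fixes at least one more digit of the key, so a lookup takes at most one hop per digit.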
70. Replication
- The main problem with this is that there are going to be a lot of hops.
- Replicate to higher levels. For example, key 3211 is replicated to all nodes having prefix 321.
- What happens if you replicate everything?
- Suppose you want to guarantee that, on average, a lookup takes C hops. Which keys should be replicated?
71. Distribution
- How are queries distributed? Are some more common than others? What does the distribution look like?
- The Zipf distribution says that the frequency of the n-th ranked item is proportional to 1/n^α, with α being close to 1.
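The Zipf claim is easy to make concrete; a small sketch (the rank cutoff of 1000 is an arbitrary assumption for normalization):

```python
# Zipf frequencies: f(n) proportional to 1/n^alpha. With alpha = 1 the
# 2nd-ranked item is exactly half as frequent as the 1st.

def zipf_freq(n, alpha=1.0, total_ranks=1000):
    weights = [1 / r ** alpha for r in range(1, total_ranks + 1)]
    return (1 / n ** alpha) / sum(weights)

print(zipf_freq(1) / zipf_freq(2))   # ratio of rank 1 to rank 2: 2.0
```

That heavy skew toward a few popular keys is what makes the selective replication on the next slides pay off.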
72. Selective Replication
- Assume a Zipf distribution of queries; then the formula above gives the fraction of the most popular keys that should be replicated at level i. d is derived from α and the base b. N is the total number of nodes. C is the desired average hop count.
73. Example
- Example: Assume that you want an average of one hop, with base b = 4, α = 0.9, N = 10,000 nodes, and 1,000,000 records.
- The 61 most popular should be replicated at level 0.
- The 284 next most popular should be replicated at level 1.
- The 1,323 next most popular should be replicated at level 2.
- The 6,177 next most popular should be replicated at level 3.
- The 28,826 next most popular should be replicated at level 4.
- The 134,505 next most popular should not be replicated.
74. Attribute-Based Naming
75. Directory Services
- Sometimes you want to search for things based on some kind of description of them.
- Such services are usually known as directory services.
- Coming up with attributes can be hard.
- The Resource Description Framework (RDF) is specifically designed for this.
76. Example: LDAP
- DNS resolves a name to a node in the namespace graph.
- LDAP is a directory service, which allows more general queries.
- Consists of a set of records.
- Each record is a list of attribute-value pairs, with possibly multiple values per attribute.
77. LDAP Directory Entry
- A simple example of an LDAP directory entry using LDAP naming conventions:
- /C=NL/O=Vrije Universiteit/OU=Math. & Comp. Sc.
78.
- The collection of all entries is a directory information base (DIB).
- Each naming attribute is a relative distinguished name (RDN).
- The RDNs, in sequence, can be used to form a directory information tree (DIT).
79. Hierarchical Implementations: LDAP (2)
- Part of a directory information tree.
80. Children Nodes
- Two directory entries having Host_Name as RDN.
81. Using DHTs for Attribute-Value Searches
- So far, we have assumed that the search is centralized.
82. Lookups
- To do a lookup, represent the description as a path, then hash the path.
83. Range Queries
- Divide the key into two parts: name and value.
- Hash the name. Assume that a group of servers is responsible for that name.
- Each server in the group is responsible for a range of values.
- A resource described by two attribute-value pairs must be stored via both of them.
- Example: a movie made after 1980 with a rating of four to five stars.
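The two-part key scheme can be sketched with a toy server map. The groups, ranges, and attribute values below are invented for illustration.

```python
# Sketch of range-partitioned DHT storage: the attribute NAME picks a server
# group (here, a precomputed dict stands in for hashing); the VALUE picks the
# server within the group by range.

groups = {
    "year":   [(0, 1980, "s1"), (1980, 2100, "s2")],   # (low, high, server)
    "rating": [(0, 3, "s3"), (3, 6, "s4")],
}

def server_for(attr, value):
    for low, high, server in groups[attr]:
        if low <= value < high:
            return server
    raise ValueError("value out of range")

def replicas(attrs):
    """A resource described by several attribute-values is stored via all of them."""
    return {server_for(a, v) for a, v in attrs.items()}

# A movie made after 1980 with a four-to-five-star rating lands on both servers:
print(sorted(replicas({"year": 1995, "rating": 4.5})))   # ['s2', 's4']
```

A range query such as "year > 1980" then only needs to contact the servers whose ranges overlap the query, not the whole group.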
84. Semantic Overlay Networks
- Maintaining a semantic overlay through gossiping.
85. Garbage Collection
86. Garbage Collection
- How do you handle a server object that is unused?
- Can it be deleted by the server?
- More context is needed.
87. Unreferenced Objects
- An example of a graph representing objects containing references to each other.
88. Example
- class Class1 {
      Class2 *cls2;
  };
  Class2 *global = new Class2;
  void foo() {
      Class1 *obj = new Class1;
      obj->cls2 = new Class2;
      global = new Class2;
  }
89. Reference Counting
- How do we avoid double-counting?
90. Passing a Reference
- Copying a reference to another process and incrementing the counter too late.
- A solution.
91. Weighted Reference Counting
- Can we avoid sending increment messages?
(Figure: proxies and a skeleton X, with increment and decrement messages updating the counts.)
92. Weighted References
- Think of each reference as a token. If you give each reference multiple tokens, then it can hand those out without contacting the skeleton.
(Figure:
Step 1: Reference created with weight 2.
Step 2: A copy of the reference is made. The weight is divided by two.
Step 3: A reference is deleted. A "decrement by 1" message is sent.)
93. Weighted References
- Works correctly in the simple case.
(Figure:
Step 1: Reference created with weight 2.
Step 2: A reference is deleted. A "decrement by 2" message is sent.)
94. Total and Partial Weights
- To work in a real situation, the skeleton keeps track of the initial total weight that is available.
- Each proxy/stub then keeps track of how much weight it is carrying (the partial weight).
- When a proxy is duplicated, the partial weight is halved.
- When a proxy is deleted, a decrement message is sent.
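The total/partial-weight rules above fit in a short sketch. The class names and the initial weight of 64 are illustrative choices, not part of the scheme.

```python
# Sketch of weighted reference counting: the skeleton records the total
# weight handed out; each proxy carries a partial weight. Copying a proxy
# never contacts the skeleton; only deletion sends a (decrement) message.

class Skeleton:
    def __init__(self, total=64):
        self.total = total                 # initial total weight
        self.collected = False

    def decrement(self, weight):           # the only message type needed
        self.total -= weight
        if self.total == 0:
            self.collected = True          # no references remain

class Proxy:
    def __init__(self, skeleton, weight):
        self.skeleton, self.weight = skeleton, weight

    def copy(self):                        # no message to the skeleton!
        half = self.weight // 2
        self.weight -= half
        return Proxy(self.skeleton, half)

    def delete(self):
        self.skeleton.decrement(self.weight)

s = Skeleton(64)
p1 = Proxy(s, 64)
p2 = p1.copy()                 # weights are now 32 and 32
p1.delete(); p2.delete()
print(s.collected)             # True: 64 - 32 - 32 == 0
```

The limitation is visible in `copy`: once a proxy's weight reaches 1 it cannot split further, which is what the indirection slide below addresses.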
95. Weighted Reference Counting
- The initial assignment of weights in weighted reference counting.
- Weight assignment when creating a new reference.
96. Passing a Weighted Ref Count
- Weight assignment when copying a reference.
97. Indirection
- Creating an indirection when the partial weight of a reference has reached 1.
98. Generation Reference Counting
- Simple reference counting requires a message for incrementing and a message for decrementing.
- Can we somehow combine those into one message? Maybe delay the increment somehow?
- Generation reference counting gets rid of one of the messages (the increment message).
- The basic idea is to defer the increment until we actually decrement.
99. Delayed Incrementing
(Figure:
1. A proxy (C=0) and a skeleton with count 1.
2. The proxy creates two more proxies (each C=0). The ref count at the skeleton is not updated, but the first proxy keeps track of the fact that it created two other proxies (C=2).
3. The first proxy is deleted. It sends a message saying it created two other proxies, so increment the ref count by two before decrementing by one ("Increment 2, Decrement 1").)
100. A Problem
(Figure:
1. A proxy and a skeleton with count 1.
2. The proxy creates two more proxies (each C=0). The ref count at the skeleton is not updated, but the first proxy keeps track of the fact that it created two other proxies (C=2).
3. The third proxy is deleted. It has not created any proxies, so it just sends a message to decrement by one. This causes the object to be improperly deleted.)
101. Generations
- Proxies have generations, as in humans.
(Figure: generation 0: one proxy (G0, C=2); generation 1: proxies (G1, C=2) and (G1, C=1); generation 2: three proxies (G2, C=0).)
102. Generational Ref Counting
- Creating and copying a remote reference in generation reference counting.
103. Deleting a Proxy
- The skeleton maintains a table G[i], which denotes the references for generation i.
- When a proxy is deleted, a message is sent with the generation number k and the number of copies c.
- The skeleton decrements G[k] and increments G[k+1] by c.
- Only when the table is all 0 is the skeleton deleted.
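The table update rule above can be sketched end to end. The class names are illustrative; the logic follows the bullets: one message per deletion, carrying (generation k, copies c).

```python
# Sketch of generation reference counting. Copying a proxy sends no message;
# deleting one sends (k, c), and the skeleton does G[k] -= 1, G[k+1] += c.

from collections import defaultdict

class Skeleton:
    def __init__(self):
        self.G = defaultdict(int)
        self.G[0] = 1                    # the initial generation-0 proxy
        self.collected = False

    def proxy_deleted(self, k, c):
        self.G[k] -= 1
        self.G[k + 1] += c
        if all(v == 0 for v in self.G.values()):
            self.collected = True

class Proxy:
    def __init__(self, skeleton, generation=0):
        self.skeleton, self.generation, self.copies = skeleton, generation, 0

    def copy(self):                      # no message sent; just count it
        self.copies += 1
        return Proxy(self.skeleton, self.generation + 1)

    def delete(self):
        self.skeleton.proxy_deleted(self.generation, self.copies)

s = Skeleton()
p = Proxy(s)
a, b = p.copy(), p.copy()    # two generation-1 proxies; p records C=2
b.delete()                   # G[1] goes negative temporarily: that is fine
p.delete()                   # G[0] -= 1, G[1] += 2
a.delete()                   # now every entry is 0
print(s.collected)           # True
```

Note how the slide-100 race disappears: deleting a child before its parent only drives G[1] negative, so the skeleton is not collected until the parent's (k=0, c=2) message balances the table.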
104. Deleting a Proxy
(Figure:
1. A proxy (G0, C=2) and a skeleton with G[0] = 1.
2. The proxy creates two more proxies (G1, C=0). The count at the skeleton is not updated, but the first proxy keeps track of the fact that it created two others (C=2).
3. The third proxy is deleted. It has not created any proxies, so it just sends a message to decrement G[1] by one. Its generation number is different from the first proxy's, so the skeleton's table stays consistent.)
105. Refresh: Distributed Garbage Collection
- The problem: How do you discover when no one needs a server object?
- One solution: Develop distributed versions of garbage collection algorithms.
106. Reference Listing
- Distributed reference counting is tricky because of failures.
- If you send an increment, how do you know it arrived?
- What if the ack is lost?
- If you can design it so that it doesn't matter how many times you send a message, then it is simpler.
- This is called idempotency. An idempotent operation can be done many times without negative effect.
- Are these idempotent?
- Withdrawing $50 from your bank account.
- Cancelling a credit card account.
- Registering for a course.
107. Idempotent Reference Counting
- How can we make reference counting idempotent?
- What turns non-idempotent registration into idempotent registration?
- Keep track of which proxies have been created in the skeleton.
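The change that buys idempotency is tiny and worth seeing: track the set of proxies rather than a count. The function names below are illustrative.

```python
# Idempotent "reference counting": the skeleton keeps the SET of proxies.
# Re-sending "I hold a reference" any number of times has the same effect,
# so lost acks can simply be retried.

refs = set()                      # the skeleton's reference list

def register(proxy_id):           # idempotent: safe to retry
    refs.add(proxy_id)

def unregister(proxy_id):         # also idempotent
    refs.discard(proxy_id)

register("P1"); register("P1"); register("P2")   # duplicate retry is harmless
unregister("P1"); unregister("P1")
print(sorted(refs))               # ['P2']
```

With a plain counter, the same retries would have corrupted the count; with a set they are no-ops.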
108. Reference Listing
(Figure: a skeleton with reference list {P1, P2}; proxies P1 and P2.)
- Keep a list of proxies in the skeleton.
- Failures can be handled with heartbeats, etc.
- The main issue is scaling, if there are millions of proxies.
109. Tracing in Groups
- Garbage collect first within a group of distributed processes.
- An object is not collected if:
- It has a reference from outside the group.
- It has a reference from the root set.
- This is conservative:
- It is possible that a reference from outside the group is itself not reachable.
110. The Model
- Proxies (stubs), skeletons, and objects.
- Only one proxy per object per process.
- A root set of references. The root set is not proxies.
(Figure: proxy, skeleton, and object symbols.)
111. Basic Steps
- Find all skeletons in a group that are reachable from outside the group or from a root set.
- Mark them hard; the others are soft.
- Within a process, proxies that are reachable from hard skeletons are marked hard; those reachable from soft skeletons are marked soft. Some are marked none.
- Repeat the above until stable (no change).
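The iterate-until-stable marking can be sketched on a tiny invented graph. This simplification only propagates hard marks (the soft/none distinction is omitted); the graph, names, and initial marks are all assumptions.

```python
# Sketch of group tracing: skeletons with external or root references start
# hard; hardness then propagates through each process's local reachability
# (skeleton -> proxies the object holds -> the skeletons those proxies name)
# until no mark changes.

reach = {                # within a process: skeleton -> proxies its object holds
    "sk1": ["px2"],
    "sk2": ["px3"],
    "sk3": [],
    "sk4": [],           # nothing in the group can reach sk4
}
target = {"px2": "sk2", "px3": "sk3"}   # each proxy refers to some skeleton
marks = {"sk1": "hard",                  # sk1 has an external reference
         "sk2": "soft", "sk3": "soft", "sk4": "soft",
         "px2": "none", "px3": "none"}

changed = True
while changed:                           # repeat until stable
    changed = False
    for sk, proxies in reach.items():
        if marks[sk] != "hard":
            continue
        for px in proxies:               # proxies inherit the skeleton's mark,
            for obj in (px, target[px]): # and pass it to the skeleton they name
                if marks[obj] != "hard":
                    marks[obj] = "hard"
                    changed = True

print(sorted(o for o, m in marks.items() if m != "hard"))   # ['sk4'] can go
```

Everything transitively reachable from the externally referenced sk1 ends up hard; sk4, reachable only from inside the group, stays soft and is collectible.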
112.
- Skeletons:
- Hard:
- Reachable from a proxy outside of the group.
- Reachable from a root object inside the group.
- Soft:
- Reachable only from proxies inside the group.
- Proxies:
- Hard:
- Reachable from the root set.
- Soft:
- Reachable from a skeleton marked soft.
- None: otherwise.
113. Initial Marking of Skeletons
- All proxies in the group report to their skeleton.
- If the ref count is greater than the number of proxies, then there must be external references.
114. Local Propagation
- Propagate hard/soft marks locally.
115. Iterate Till Convergence
- Final marking.
- Anything not hard can be collected.