Introduction to Distributed Systems


Provided by: infm3


1

What is Distributed Computing?
Distribute: to divide among several or many, systematically or merely at random.
Distributed system: a collection of independent computers that appears to the users of the system as a single computer.
Distributed programming techniques allow software to take advantage of resources located on the Internet, on corporate and organizational intranets, and on other networks.
Distributed programming usually involves network programming in one form or another. That is, a program on one computer on a network needs some hardware or software resource that belongs to another computer, either on the same network or on some remote network.

2

A Distributed System

3

Examples of Distributed Systems
Network of workstations (NOW): a group of networked personal workstations connected to one or more server machines.
The Internet
An intranet: a network of computers and workstations within an organization, segregated from the Internet via a protective device (a firewall).
Actual example of a large-scale distributed system: eBay
Actual example of a small-scale distributed system: a smart home
Computers in a distributed system
- Workstations: computers used by end users to perform computing
- Server machines: computers which provide resources and services
- Personal assistance devices: handheld computers connected to the system via a wireless communication link

4

The network really is the computer.
Tim O'Reilly, in an address at JavaOne, June 2000:
By now, it's a truism that the Internet runs on
open source. Bind, the Berkeley Internet Name
Daemon, is the single most mission critical
program on the Internet, followed closely by
Sendmail and Apache, open source servers for two
of the Internet's most widely used application
protocols, SMTP and HTTP.
Early killer apps
- Usenet: a distributed bulletin board
- email
- talk
Recent killer apps
- the web
- collaborative computing

5
Centralized vs. Distributed Computing
6

Monolithic mainframe applications vs. distributed applications
The monolithic mainframe application architecture
- Separate, single-function applications, such as order entry or billing
- Applications cannot share data or other resources
- Developers must create multiple instances of the same functionality (service)
- Proprietary (user) interfaces
The distributed application architecture
- Integrated applications
- Applications can share resources
- A single instance of functionality (service) can be reused
- Common user interfaces

7

Evolution of paradigms
Client-server: Socket API, remote method invocation
Distributed objects
Object broker: CORBA
Network service: Jini
Object space: JavaSpaces
Mobile agents
Message-oriented middleware (MOM): Java Message Service
Collaborative applications

8

Cooperative distributed computing projects
Cooperative distributed computing projects (simply called distributed computing in some literature) parcel out large-scale computations to workstations, often making use of surplus CPU cycles.
Example: the SETI@home project, which scans data retrieved by a radio telescope to search for radio signals from another world.
Why distributed computing?
Economics: distributed systems allow the pooling of resources, including CPU cycles, data storage, input/output devices, and services.
Reliability: a distributed system allows replication of resources and/or services, thus reducing service outages due to failures.
The Internet has become a universal platform for distributed computing.

9

The Weaknesses and Strengths of Distributed Computing
In any form of computing, there is always a tradeoff between advantages and disadvantages.
Some of the reasons for the popularity of distributed computing:
- The affordability of computers and availability of network access
- Resource sharing
- Scalability
- Fault tolerance
The disadvantages of distributed computing:
- Multiple points of failure: the failure of one or more participating computers, or one or more network links, can spell trouble.
- Security concerns: in a distributed system, there are more opportunities for unauthorized attack.

10

The Architecture of Distributed Applications

11

Network standards and protocols
On public networks such as the Internet, a common set of rules must be specified for the exchange of data.
Such rules, called protocols, specify such matters as the formatting and semantics of data, flow control, and error correction.
Software can share data over the network using network software which supports a common set of protocols.
Protocols
In the context of communications, a protocol is a set of rules that must be observed by the participants.
In communications involving computers, protocols must be formally defined and precisely implemented. For each protocol, there must be rules that specify the following:
- How is the exchanged data encoded?
- How are events (sending, receiving) synchronized so that the participants can send and receive in a coordinated order?
The specification of a protocol does not dictate how the rules are to be implemented.
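As a minimal sketch of the two questions a protocol must answer (how data is encoded, and in what order it is exchanged), the following Python fragment defines a hypothetical length-prefixed message format. The 4-byte header and the function names are assumptions made for illustration; they are not part of any protocol named in these slides.

```python
import struct

# Hypothetical header: a 4-byte unsigned length in network byte order.
HEADER = struct.Struct("!I")

def encode_message(text: str) -> bytes:
    """Encode text as UTF-8 and prefix it with its length in bytes."""
    payload = text.encode("utf-8")
    return HEADER.pack(len(payload)) + payload

def decode_messages(stream: bytes) -> list[str]:
    """Split a byte stream back into its messages, in the order sent."""
    messages = []
    offset = 0
    while offset < len(stream):
        (length,) = HEADER.unpack_from(stream, offset)
        offset += HEADER.size
        messages.append(stream[offset:offset + length].decode("utf-8"))
        offset += length
    return messages

wire = encode_message("hello") + encode_message("world")
print(decode_messages(wire))
```

Both sides only need to agree on the rules (header size, byte order, text encoding); how each side implements them is left open, which is the point made above.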

12

The network architecture
Network hardware transfers electronic signals,
which represent a bit stream, between two
devices.
Modern day network applications require an
application programming interface (API) which
masks the underlying complexities of data
transmission.
A layered network architecture allows the
functionalities needed to mask the complexities
to be provided incrementally, layer by layer.
Actual implementation of the functionalities may
not be clearly divided by layer.

13

The OSI seven-layer network architecture

14

Network Architecture
The division of the layers is conceptual: the implementation of the functionalities need not be clearly divided as such in the hardware and software that implement the architecture.
The conceptual division serves at least two useful purposes:
- Systematic specification of protocols: it allows protocols to be specified systematically.
- Conceptual data flow: it allows programs to be written in terms of logical data flow.

15

The TCP/IP Protocol Suite
The Transmission Control Protocol/Internet
Protocol suite is a set of network protocols
which supports a four-layer network architecture.
It is currently the protocol suite employed on
the Internet.

16

The TCP/IP Protocol Suite -2
The Internet layer implements the Internet
Protocol, which provides the functionalities for
allowing data to be transmitted between any two
hosts on the Internet.
The Transport layer delivers the transmitted data
to a specific process running on an Internet
host.
The Application layer supports the programming
interface used for building a program.

17

Network Resources
Network resources are resources available to the
participants of a distributed computing
community.
Network resources include hardware such as computers and equipment, and software such as processes, email mailboxes, files, and web documents.
An important class of network resources is network services such as the World Wide Web and file transfer (FTP), which are provided by specific processes running on computers.
One of the key challenges in distributed computing is the unique identification of resources available on the network, such as email mailboxes and web documents.
- Addressing an Internet host
- Addressing a process running on a host
- Email addresses
- Addressing web content: the URL
18
The Internet Topology
19

The Internet Topology
The Internet consists of a hierarchy of networks, interconnected via a network backbone.
Each network has a unique network address.
Computers, or hosts, are connected to a network.
Each host has a unique ID within its network.
Each process running on a host is associated with
zero or more ports. A port is a logical entity
for data transmission.

20

The Internet addressing scheme
In IP version 4, each address is 32 bits long.
The address space accommodates 2^32 (about 4.3 billion) addresses in total.
Addresses are divided into 5 classes (A through E).

21

The Internet addressing scheme - 2

22

Example
Suppose the dotted-decimal notation for a particular Internet address is 129.65.24.50. The 32-bit binary expansion of the notation is as follows:
10000001.01000001.00011000.00110010
Since the leading bit sequence is 10, the address is a Class B address. Within the class, the network portion is identified by the remaining bits in the first two bytes, that is, 00000101000001, and the host portion is the value of the last two bytes, or 0001100000110010. For convenience, the binary prefix for class identification is often included as part of the network portion of the address, so that we would say that this particular address is at network 129.65 and then at host address 24.50 on that network.

23

Another example
Given the address 224.0.0.1, one can expand it as follows:
11100000.00000000.00000000.00000001
The binary prefix of 1110 signifies that this is a Class D, or multicast, address. Data packets sent to this address should therefore be delivered to the multicast group 0000000000000000000000000001.
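The two examples can be checked with a short Python sketch. The function names `ipv4_class` and `to_binary` are made up for this illustration; only the standard `ipaddress` module is used, and the class is read off the leading bits of the first byte as described above.

```python
import ipaddress

def ipv4_class(address: str) -> str:
    """Return the address class (A-E) from the leading bits of the first byte."""
    first_byte = int(ipaddress.IPv4Address(address)) >> 24
    if first_byte >> 7 == 0b0:
        return "A"
    if first_byte >> 6 == 0b10:
        return "B"
    if first_byte >> 5 == 0b110:
        return "C"
    if first_byte >> 4 == 0b1110:
        return "D"
    return "E"

def to_binary(address: str) -> str:
    """Return the 32-bit binary expansion, one 8-bit group per byte."""
    return ".".join(f"{byte:08b}" for byte in ipaddress.IPv4Address(address).packed)

for addr in ("129.65.24.50", "224.0.0.1"):
    print(addr, to_binary(addr), "class", ipv4_class(addr))
```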

24

The Internet addressing scheme - 3
For human readability, Internet addresses are written in dotted-decimal notation, nnn.nnn.nnn.nnn, where each nnn group is a decimal value in the range 0 through 255.
Internet host table (found in the /etc/hosts file):
127.0.0.1      localhost
129.65.242.5   falcon.csc.calpoly.edu falcon loghost
129.65.241.9   falcon-srv.csc.calpoly.edu falcon-srv
129.65.242.4   hornet.csc.calpoly.edu hornet
129.65.241.8   hornet-srv.csc.calpoly.edu hornet-srv
129.65.54.9    onion.csc.calpoly.edu onion
129.65.241.3   hercules.csc.calpoly.edu hercules
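To show how such a host table maps names to addresses, here is a small Python sketch that parses text in the same layout (address first, then one or more names). The `parse_hosts` helper is hypothetical, and the sample text repeats three entries from the table above rather than reading a real /etc/hosts file.

```python
HOSTS_TEXT = """\
127.0.0.1      localhost
129.65.242.5   falcon.csc.calpoly.edu falcon loghost
129.65.242.4   hornet.csc.calpoly.edu hornet
"""

def parse_hosts(text: str) -> dict[str, str]:
    """Map every host name and alias in hosts-file text to its address."""
    table = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            table[name] = address
    return table

table = parse_hosts(HOSTS_TEXT)
print(table["falcon"])   # 129.65.242.5
print(table["loghost"])  # 129.65.242.5
```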

25

IP version 6 Addressing Scheme
There are three types of addresses:
- Unicast: an identifier for a single interface.
- Anycast: an identifier for a set of interfaces (typically belonging to different nodes). A packet sent to an anycast address is delivered to one of the interfaces identified by that address.
- Multicast: an identifier for a set of interfaces (typically belonging to different nodes). A packet sent to a multicast address is delivered to all interfaces identified by that address.
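A brief, hedged illustration using Python's standard `ipaddress` module: multicast addresses can be recognized from the address itself, whereas an anycast address is taken from the unicast address space and cannot be told apart by inspection. The two sample addresses are well-known ones chosen for this sketch, not taken from the slides.

```python
import ipaddress

multicast = ipaddress.ip_address("ff02::1")  # all-nodes link-local multicast group
loopback = ipaddress.ip_address("::1")       # a unicast address (the loopback interface)

print(multicast.is_multicast)  # True
print(loopback.is_multicast)   # False
print(loopback.is_loopback)    # True
```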
The Domain Name System (DNS)
For user friendliness, each Internet address is mapped to a symbolic name, using the DNS, in the format of
<computer-name>.<subdomain hierarchy>.<organization>.<sector name>.<country code>
e.g., www.csc.calpoly.edu.us

26
27

The Domain Name System
For network applications, a domain name must be
mapped to its corresponding Internet address.
Processes known as domain name system servers
provide the mapping service, based on a
distributed database of the mapping scheme.
The mapping service is offered by thousands of DNS servers on the Internet, each responsible for a portion of the name space, called a zone. A server that has access to the DNS information (zone file) for a zone is said to have authority for that zone.
Top-Level Domain Names
- .com: for commercial entities; anyone in the world can register.
- .net: originally designated for organizations directly involved in Internet operations. It is increasingly being used by businesses when the desired name under "com" is already registered by another organization. Today anyone can register a name in the .net domain.
- .org: for miscellaneous organizations, including non-profits.
- .edu: for four-year accredited institutions of higher learning.
- .gov: for US federal government entities.
- .mil: for the US military.
- Country codes: for individual countries, based on the codes of the International Organization for Standardization (ISO). For example, ca for Canada and jp for Japan.

28

Domain Name Hierarchy

29

Name lookup and resolution
If a domain name is used to address a host, its
corresponding IP address must be obtained for the
lower-layer network software.
The mapping, or name resolution, must be
maintained in some registry.
For runtime name resolution, a network service is needed; a protocol must be defined for the naming scheme and for the service.
Examples:
- The DNS service supports DNS name lookup.
- The Java RMI registry supports RMI object lookup.
- JNDI (Java Naming and Directory Interface) provides lookup of network naming and directory services.
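From a program's point of view, runtime name resolution is usually a single library call. The sketch below uses Python's standard `socket.getaddrinfo`; the `resolve` wrapper is a hypothetical helper, and `localhost` is used only because it resolves without network access. The answer for any other name depends on the resolver configuration of the machine running the code.

```python
import socket

def resolve(host_name: str) -> set[str]:
    """Return the set of addresses the local resolver reports for host_name."""
    results = socket.getaddrinfo(host_name, None)
    # Each result is (family, type, proto, canonname, sockaddr);
    # the address string is the first element of sockaddr.
    return {sockaddr[0] for _family, _type, _proto, _canon, sockaddr in results}

print(resolve("localhost"))
```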

30

Addressing a process running on a host: logical ports

31

Well-Known Ports
Each Internet host has 2^16 (65,536) logical port numbers, 0 through 65535. Port 0 is reserved, so each usable port is identified by a number between 1 and 65535 and can be allocated to a particular process.
Port numbers between 1 and 1023 are reserved for processes which provide well-known services such as finger, FTP, HTTP, and email.

32

Choosing a port to run your program
For our programming exercises, when a port is needed, choose a random number above the well-known ports, in the range 1,024-65,535.
If you are providing a network service for the
community, then arrange to have a port assigned
to and reserved for your service.
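A minimal sketch of a process claiming a port for an exercise. Rather than picking a number by hand, this version binds to port 0, which asks the operating system for any free port; that is a substitution made here so the example cannot collide with a port already in use. The echo behavior and the `serve_once` helper are assumptions for illustration only.

```python
import socket
import threading

def serve_once(server: socket.socket) -> None:
    """Accept one connection and echo back whatever the client sends."""
    conn, _peer = server.accept()
    with conn:
        conn.sendall(conn.recv(1024))

# Port 0 lets the operating system pick a free port on the loopback interface.
server = socket.create_server(("127.0.0.1", 0))
port = server.getsockname()[1]

worker = threading.Thread(target=serve_once, args=(server,))
worker.start()

# The client addresses the server process by (host address, port).
with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"ping")
    reply = client.recv(1024)

worker.join()
server.close()
print(port, reply)
```

Binding to a port below 1,024 typically requires administrator privileges, which is one more reason to stay above the well-known range for exercises.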
The Uniform Resource Identifier (URI)
Resources to be shared on a network need to be
uniquely identifiable.
On the Internet, a URI is a character string
which allows a resource to be located.
There are two types of URIs:
- URL (Uniform Resource Locator): points to a specific resource at a specific location.
- URN (Uniform Resource Name): points to a specific resource at a nonspecific location.

33

A URL has the format of
protocol://host address:port/directory path/file name#section
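The parts of this format can be pulled apart with Python's standard `urllib.parse.urlsplit`. The URL below is a made-up example, not one from the slides.

```python
from urllib.parse import urlsplit

# Hypothetical URL, made up for illustration.
parts = urlsplit("http://www.example.com:8080/docs/guide/intro.html#setup")

print(parts.scheme)    # http
print(parts.hostname)  # www.example.com
print(parts.port)      # 8080
print(parts.path)      # /docs/guide/intro.html
print(parts.fragment)  # setup
```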

34

More on URLs
The path in a URL is relative to the document root of the server. On the CSL systems, a user's document root is /www.
A URL may appear in a document in a relative form:
<a href="another.html">
and the actual URL referred to will be another.html preceded by the protocol, hostname, and directory path of the document.
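The resolution rule for relative URLs can be tried out with the standard `urllib.parse.urljoin`. The base URL is a hypothetical document location chosen for this sketch.

```python
from urllib.parse import urljoin

# Hypothetical URL of the document that contains the relative link.
base = "http://www.example.com/docs/guide/intro.html"

resolved = urljoin(base, "another.html")
print(resolved)  # http://www.example.com/docs/guide/another.html
```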

35

Design Issues in Distributed Systems
Transparency: the most important issue in truly distributed systems is to make a group of machines appear as if it were a single old-style timesharing system.
Different types of transparency
Location transparency: users cannot tell where the resources (hardware and software resources such as CPUs, printers, files, and databases) are located.
Migration transparency: resources must be free to move from one machine to another without changing their names, e.g., moving the mount points of remote file systems such as /usr/dist on the Sun cluster.
Replication transparency: users cannot tell how many copies exist. The system may make multiple copies for reliability (e.g., against a disk failure) or improved performance (e.g., for heavily used files). As long as the users do not observe anomalous behavior (a coherency problem), it should not matter.

36

Concurrency transparency: multiple users can share resources automatically.
- Multiple readers: OK.
- Multiple writers: provide automatic mechanisms to serialize the writes to maintain correctness.
Parallelism transparency: activities may happen in parallel without the users knowing about it. This is hard to achieve, and advanced users may want to exploit the presence of multiple processors themselves, because the state of the art is not close to achieving this automatically. The end is not in sight!
Sometimes users don't want total transparency, for example when they want to:
- Use a special printer
- Use a special hardware accelerator attached to a particular machine

37

Reliability
One machine goes down -> another one performs the computation.
The user never sees the difference, except perhaps in the performance level.
E.g., 5 file servers that hold duplicate data, each with a probability of failing of 0.05. Assuming independent failures, the probability of all of them failing simultaneously is 0.05^5, or about 0.0000003, which is practically negligible. (Logical OR of the individual servers)
In practice, distributed systems depend on several pieces all working simultaneously for the system to work. (Logical AND of the components)
"A distributed system is one on which I cannot get any work done because some machine I have never heard of has crashed." (Lamport)
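The contrast between the logical OR and the logical AND can be put in numbers. This is a sketch that assumes failures are independent and that every component fails with the same probability; the function names are made up for the illustration.

```python
def all_replicas_fail(p_fail: float, replicas: int) -> float:
    """Logical OR: a replicated service is lost only if every replica fails."""
    return p_fail ** replicas

def chain_works(p_fail: float, components: int) -> float:
    """Logical AND: a chain of components works only if every one works."""
    return (1 - p_fail) ** components

print(all_replicas_fail(0.05, 5))  # about 3.1e-07
print(chain_works(0.05, 5))        # about 0.77
```

With five replicas the service is almost never lost, but a system that needs five such components at once is down roughly 23% of the time.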

38

Reliability has several facets
Availability
- Fraction of the time the system is available for use.
- Keep the number of components that must all work simultaneously as small as possible (reduce the logical AND).
- Allow redundancy (increase the logical OR): replicate key pieces of hardware and software.
- However, one has to worry about consistency as the degree of redundancy increases. This is a tradeoff.
Security
- Also a key issue in reliability.
- Authentication is easier in centralized systems: the user supplies a password once and is trusted after that.
- Distributed systems exchange messages between machines. How do you authenticate? Anybody can put any kind of message on the network.

39

Fault tolerance
How easily and transparently does the system recover from a failure of some kind? E.g., a machine goes down: what happens to the process that was running? Can it be restarted in some other place exactly at the point where the original process left off?
Important in business/banking systems.
Performance
An important aspect of distributed systems
Many different metrics can be used:
- response time/turnaround time
- system utilization
- network capacity utilization
Performance measurements depend a great deal on the type of workload, e.g., a large number of compute-bound jobs with little or no I/O vs. large database applications.

40

Granularity of computation
Fine-grained: e.g., simple operations that can be done with a few instructions. These involve lots of interaction, coordination, I/O, etc., so distributing them would incur too much overhead.
Coarse-grained: long computation times with little I/O, coordination, or interaction. Better suited for distribution.
Scalability
A system is designed for 100s of CPUs. How will it work for 100,000 CPUs?
E.g., the French PTT system
What principles to use?
- Avoid centralized components, e.g., single mail servers or single file servers.
- Avoid centralized tables, databases, etc., e.g., a telephone directory.
- Avoid centralized algorithms, e.g., an algorithm that first collects information about the whole system before computing an optimal route to send a message.

41

Characteristics of decentralized algorithms
- Lack of complete information about the whole system
- Make decisions based on local information
- If one machine is down, the algorithm should still work
- No assumption about a global clock
General Discussion
Distributed: to divide the computation among several what?
Processors/nodes
- Processor: CPU (including cache, etc.)
- Node: single processor or multiprocessor, memory, I/O, possibly a network interface
Communication: the work has to be divided and distributed, so communication is central.
- Bus (processor) or network (node)
The parameters: CPU, memory, I/O, network, system software

42

Many types of Systems
The differences are difficult to state clearly. Some believe that it is a continuum.
Centralized: a single system
Decentralized: multiple systems, but no coordination
Distributed: multiple systems with coordination
Homogeneous: all systems are the same or similar
Heterogeneous: dissimilar nodes in the system
Server: a system providing some services, e.g., file systems; usually more powerful and complex hardware
Client: a system with minimal resources that depends on servers to get tasks accomplished

43

Networked System
- High degree of autonomy of machines
- Loosely coupled hardware and loosely coupled software
- Machines run their own OS
- May have their own local disk
- Operations use explicit machine names, e.g., rlogin Eagle
- May have a file server, but the view from different machines is different
True Distributed System
- Tightly coupled software on loosely coupled hardware
- Creates the illusion of a single system image (a virtual uniprocessor)
- Uniform view of the file system, uniform protection mechanisms
- Uniform communication schemes

44

Multiprocessor Systems
- Tightly coupled software on tightly coupled hardware
- Typically a single run queue
- (Logically) shared memory
- File system is like that of a centralized system
Cluster Systems
- Parallel or distributed system
- Consists of a collection of interconnected whole computers
- Utilized as a single computing resource
- Peer relationship between the nodes in a cluster
- Nodes of a cluster do not maintain their internal anonymity

45

                        Networked      Distributed    Multiprocessor   Cluster
Number of Nodes         1000s          1000s          1000s            10s
Performance Metric      Response Time  Response Time  Turnaround Time  Turnaround Time
Virtual Processor View  No             Yes            Yes              Yes
Node Individualization  Yes            Yes            No               No
Operating Systems       Heterogeneous  Homogeneous    Homogeneous      Homogeneous
Copies of OS            N              N              1                N
Communication           Shared Files   Messages       Shared Memory    Messages
Network Protocol        Required       Required       Not Required     Not Required
Run Queue               No             No             Yes              No
Inter-node Security     None           Required       None             Required

46

Summary
We discussed the following topics:
- What is meant by distributed computing
- Rationale for distributed systems
- Centralized versus distributed systems
- Basic concepts in data communication in distributed systems
- Network architectures: the OSI model and the Internet model
- Naming schemes for network resources
- The three-layered architecture of distributed applications: the presentation layer, the application (business logic) layer, and the service layer
- Design issues in distributed systems