Title: Web Servers
1Web Servers
2Outline
- Survey many different types of software and hardware web servers.
- Describe how to write a simple diagnostic web server in Perl.
- Explain how web servers process HTTP transactions, step by step.
3Different types of web servers
- General-purpose software web server
- Web server appliances
- Embedded web servers
4Jobs of web servers
- Implement HTTP and the related TCP connection handling.
- Manage the server-side resources and provide administrative features to configure, control, and enhance the web service.
5Jobs of Operating System
- Manage the hardware details of the underlying computer system.
- Provide TCP/IP network support.
- Provide filesystems to hold web resources.
- Provide process management to control computing activities.
6General-purpose software web server
- General-purpose software web servers run on standard, network-enabled computer systems.
- Open source software (such as Apache or W3C's Jigsaw).
- Commercial software (such as Microsoft's and iPlanet's web servers).
- Web server software is available for just about every computer and operating system.
7General-Purpose Software Web Servers
[Figure: results of the Netcraft web server survey, September 2003 (http://www.netcraft.com/survey/)]
8Web server appliances
- Web server appliances are prepackaged software/hardware solutions. The vendor preinstalls a software server onto a vendor-chosen computer platform and preconfigures the software.
  - Sun/Cobalt RaQ web appliance (http://www.cobalt.com)
  - Toshiba Magnia SG10 (http://www.toshiba.com)
  - IBM Whistle web server appliance (http://www.whistle.com)
- Appliance solutions remove the need to install and configure software and often greatly simplify administration. However, the web server is often less flexible and less feature-rich, and the server hardware is not easily upgradable.
9Embedded web servers
- Embedded servers are tiny web servers intended to be embedded into consumer products (e.g., printers or home appliances).
- They allow users to administer their consumer devices using a convenient web browser interface.
- IPic match-head-sized web server
  - (http://www-ccs.cs.umass.edu/shri/iPic.html)
- NetMedia SitePlayer SP1 Ethernet web server
  - (http://www.siteplayer.com)
10A Minimal Perl Web server
- Type-o-serve: a minimal Perl web server used for HTTP debugging
- http://www.http-guide.com/tools/type-o-serve.pl
11A Minimal Perl Web Server
HTTP request message:
    GET /blah.txt HTTP/1.1
    Accept: */*
    Accept-language: en-us
    Accept-encoding: gzip, deflate
    User-agent: Mozilla/4.0
    Host: www.csie.ncnu.edu.tw:8080
    Connection: Keep-alive

Type-o-serve dialog:
    ./type-o-serve.pl 8080
    <<Request From 'www.csie.ncnu.edu.tw'>>
    GET /blah.txt HTTP/1.1
    Accept: */*
    Accept-language: en-us
    Accept-encoding: gzip, deflate
    User-agent: Mozilla/4.0
    Host: www.csie.ncnu.edu.tw:8080
    Connection: Keep-alive

    <<Type Response followed by '.'>>
    HTTP/1.0 200 OK
    Connection: close
    Content-type: text/plain

    Hi there!

HTTP response message:
    HTTP/1.0 200 OK
    Connection: close
    Content-type: text/plain

    Hi there!
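The same idea can be sketched in Python. This is not the type-o-serve Perl script itself: it is a minimal sketch that prints each request and, as a simplifying assumption, sends back a canned response instead of letting the operator type one. The names `CANNED_RESPONSE`, `serve_one`, and `main`, and the default port 8080, are illustrative choices.

```python
import socket

# Canned reply, mirroring the response typed in the dialog above (assumption:
# the real tool reads the reply from the keyboard instead).
CANNED_RESPONSE = (
    b"HTTP/1.0 200 OK\r\n"
    b"Connection: close\r\n"
    b"Content-type: text/plain\r\n"
    b"\r\n"
    b"Hi there!\n"
)

def read_request_head(conn):
    """Read from the socket until the blank line that ends the headers."""
    data = b""
    while b"\r\n\r\n" not in data:
        chunk = conn.recv(1024)
        if not chunk:  # peer closed the connection early
            break
        data += chunk
    return data

def serve_one(conn, response=CANNED_RESPONSE):
    """Print the client's request, send the canned response, and close."""
    request = read_request_head(conn)
    print(request.decode("iso-8859-1"))
    conn.sendall(response)
    conn.close()
    return request

def main(port=8080):
    """Accept connections forever on the given port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind(("", port))
    listener.listen(5)
    while True:
        conn, addr = listener.accept()
        print("<<Request From '%s'>>" % addr[0])
        serve_one(conn)
```

Calling `main()` starts the server; pointing a browser at port 8080 of the host then shows the raw request on the terminal.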
12What do web servers do?
- Set up connection
- Receive request
- Process request
- Access resource
- Construct response
- Send response
- Log transaction
13What Real Web Servers Do
[Figure: an HTTP server software process running in user space on top of the operating system, which provides the TCP/IP network stack, the network interface, and object storage. The numbered steps are: (1) set up connection, (2) receive request, (3) process request, (4) access resource, (5) create response, (6) send response, (7) log transaction.]
14Step 1: Accepting client connections
- Handling new connections
- Extracting the client IP address from a new TCP connection
- Client hostname identification
- Using reverse DNS
- Determining the client user through ident
- Some web servers support the IETF ident protocol
15Handling new connection
- When a client requests a TCP connection to the web server, the web server establishes the connection and determines which client is on the other side of the connection, extracting the IP address from the TCP connection (e.g., using the getpeername call on a UNIX socket).
- The server is free to reject and immediately close any connection, for example because the client IP address is unauthorized or belongs to a known malicious client.
- Once a new connection is established and accepted, the server adds the new connection to its list of existing connections and prepares to watch for data on the connection.
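The accept, identify, and reject-or-track sequence described above might look like the following minimal sketch. The deny list `BLOCKED_IPS` and the helper `accept_connection` are hypothetical names for illustration; a real server would load its access rules from configuration.

```python
import socket

BLOCKED_IPS = {"203.0.113.9"}  # hypothetical deny list (documentation address)

def accept_connection(listener, connections, blocked=BLOCKED_IPS):
    """Accept one connection; close it if the client is blocked, else track it."""
    conn, _addr = listener.accept()
    # getpeername() reports the address of the remote end of the connection.
    client_ip = conn.getpeername()[0]
    if client_ip in blocked:
        conn.close()  # reject and immediately close
        return None
    connections.append(conn)  # add to the list of existing connections
    return conn
```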
16Client host identification
- Most web servers can be configured to convert client IP addresses into client hostnames, using reverse DNS.
- The hostname information is used for detailed access control and logging.
- Note that hostname lookups can take a long time, slowing down web transactions. Many high-performance web servers either disable hostname resolution or enable it only for particular content.
- Example: configuring Apache to look up hostnames only for HTML and CGI resources
    HostnameLookups off
    <Files ~ "\.(html|htm|cgi)$">
        HostnameLookups on
    </Files>
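A reverse DNS lookup can be sketched with the standard `socket.gethostbyaddr` call. The wrapper name `client_hostname` and the fall-back-to-IP behavior are assumptions for illustration; the `resolver` parameter exists only so the function can be exercised without network access.

```python
import socket

def client_hostname(ip, resolver=socket.gethostbyaddr):
    """Return the hostname for ip, or the IP itself if the lookup fails."""
    try:
        hostname, _aliases, _addresses = resolver(ip)
        return hostname
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return ip
```

Because each lookup may block on a DNS round trip, a server that calls this for every request pays the delay described above.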
17Determining the client user through ident
- The ident protocol lets servers find out what username initiated an HTTP connection.
- The username information is particularly useful for logging: the second field of the popular Common Log Format contains the ident username of each HTTP request. (Ident was originally specified in RFC 931; the updated specification is RFC 1413.)
- If a client supports the ident protocol, the client listens on TCP port 113 for ident requests.
18Determining the Client User Through ident
[Figure: using ident to determine the HTTP client username]
(a) Mary establishes a new HTTP connection from her port 4236 to port 80 on the web server.
(b) The server establishes an ident connection to port 113 on Mary's machine.
(c) The server sends the request "4236, 80".
(d) The client returns the ident response "4236, 80 : USERID : UNIX : MARY".
19Ident protocol (cont.)
- Ident can work inside organizations, but it does not work well across the public Internet for the following reasons:
  - Many client PCs don't run the identd Identification Protocol daemon software.
  - The ident protocol significantly delays HTTP transactions.
  - Many firewalls won't permit incoming ident traffic.
  - The ident protocol is insecure and easy to fabricate.
  - The ident protocol doesn't support virtual IP addresses well.
  - There are privacy concerns about exposing client usernames.
- To enable ident lookups in Apache:
    IdentityCheck on
- Common Log Format log files typically contain hyphens (-) in the second field if no ident information is available.
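The ident exchange is a one-line query and a one-line reply, which a server could build and parse roughly as follows. This is a sketch of the message format only (no networking), and the function names are hypothetical.

```python
def build_ident_query(client_port, server_port):
    """Build the query sent to port 113: '<client port>, <server port>'."""
    return ("%d, %d\r\n" % (client_port, server_port)).encode("ascii")

def parse_ident_reply(line):
    """Return the username from a USERID reply, or '-' if none is available."""
    # Split on the first three colons only, since a username may contain colons.
    fields = [field.strip() for field in line.strip().split(":", 3)]
    if len(fields) == 4 and fields[1].upper() == "USERID":
        return fields[3]
    return "-"  # same placeholder the Common Log Format uses
```

For the exchange described above, `build_ident_query(4236, 80)` yields the bytes `4236, 80` followed by CRLF, and parsing the reply `4236, 80 : USERID : UNIX : MARY` yields `MARY`.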
20Step 2: Receiving request messages
- As data arrives on connections, the server reads the data and starts parsing the request message:
  - Parse the request line, looking for the request method, the specified URI, and the version number.
  - Read the message headers, each ending in CRLF.
  - Detect the end-of-headers blank line, ending in CRLF.
  - Read the request body, if any (its length is specified by the Content-Length header).
- Internal Representations of Messages
  - Some web servers also store the request message in internal data structures that make the message easy to manipulate.
21Receiving Request Messages
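[Figure: a request message being read from the network. The server has so far received "GET /specials/hychen.gif HTTP/1.0<CRLF>Accept: image/gif<CRLF>Host: www.j"; the remaining bytes, "oes-hardware.com<CRLF><CRLF>", are still crossing the Internet from the client.]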
22Internal Representations of Message
Request message:
    GET /specials/saw-blade.gif HTTP/1.0<CRLF>
    Accept: image/gif<CRLF>
    Host: www.joes-hardware.com<CRLF>
    <CRLF>

Parsed into an internal structure:
    method:       1
    version:      1.0
    uri:          -> "/specials/saw-blade.gif"
    header count: 2
    headers:      -> (Name: Host,   Value -> "www.joes-hardware.com")
                     (Name: Accept, Value -> "image/gif")
    body:         -
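A parser that produces a comparable internal representation could be sketched as below. The dictionary layout and the name `parse_request` are illustrative assumptions, not the structure any particular server uses, and the sketch assumes the whole message has already been read into memory.

```python
def parse_request(raw):
    """Parse raw request bytes into method, URI, version, headers, and body."""
    # The blank line (CRLF CRLF) separates the headers from the body.
    head, _, rest = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, uri, version = lines[0].split(" ", 2)  # the request line
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    # The body length, if any, is given by the Content-Length header.
    length = int(headers.get("content-length", 0))
    return {
        "method": method,
        "uri": uri,
        "version": version,
        "headers": headers,
        "body": rest[:length],
    }
```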
23Different web server architectures
- Single-threaded web servers
- Multi-process and multi-threaded web servers
- Multiplexed I/O web servers
- Non-blocking network accessing
- Multiplexed multi-threaded web servers
24Connection Input/Output Processing Architectures
25Step 3: Processing requests
- Once the web server has received a request, it can process the request using the method, resource, headers, and optional body.
- Some methods (e.g., POST) require entity body data in the request message. A few methods (e.g., GET) forbid entity body data in the request message.
26Step 4: Mapping and accessing resources
- Docroot
- Virtually hosted docroots
- User home directory docroots
- Directory Listings
- Dynamic content resource mapping
- Server-Side Include (SSI)
- Access Control
27Docroots
- Web servers support different kinds of resource mapping, but the simplest form of mapping uses the request URI to name a file in the web server's filesystem.
- Typically, a special folder in the web server filesystem is reserved for web content. The folder is called the document root, or docroot.
- The web server takes the URI from the request message and appends it to the document root.
- The docroot setting in Apache servers:
    DocumentRoot /usr/local/httpd/files
- Servers must be careful not to let relative URLs back up out of a document root and expose other parts of the filesystem.
  - E.g., http://www.csie.ncnu.edu.tw/../
28Docroots
[Figure: mapping a request URI to a server resource under the docroot /usr/local/httpd/files]
Request message (client to web server, across the Internet):
    GET /specials/hychen.gif HTTP/1.0
    Host: www.csie.ncnu.edu.tw
Request URI: /specials/hychen.gif
Server resource: /usr/local/httpd/files/specials/hychen.gif
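The docroot mapping, including the check against URIs that back up out of the document root, can be sketched as follows. The function name `map_uri_to_path` is hypothetical, and the sketch deliberately ignores percent-decoding and symbolic links, which a real server must also handle.

```python
import posixpath

DOCROOT = "/usr/local/httpd/files"

def map_uri_to_path(uri_path, docroot=DOCROOT):
    """Return the file path for uri_path, or None if it escapes the docroot."""
    # Append the URI to the docroot, then normalize so ".." is resolved
    # before the containment check.
    joined = posixpath.join(docroot, uri_path.lstrip("/"))
    candidate = posixpath.normpath(joined)
    if candidate != docroot and not candidate.startswith(docroot + "/"):
        return None  # the URI backed up out of the document root
    return candidate
```

With the docroot shown above, `/specials/hychen.gif` maps to `/usr/local/httpd/files/specials/hychen.gif`, while `/../` is refused.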
29Virtually hosted docroots
- Virtually hosted web servers host multiple web sites on the same web server, giving each site its own distinct document root on the server.
- A virtually hosted web server identifies the correct document root to use from the IP address or from the hostname in the Host header.
30Apache's virtual host configuration
    <VirtualHost www.joes-hardware.com>
        ServerName www.joes-hardware.com
        DocumentRoot /docs/joe
        TransferLog /logs/joe.access_log
        ErrorLog /logs/joe.error_log
    </VirtualHost>
    <VirtualHost www.marys-antiques.com>
        ServerName www.marys-antiques.com
        DocumentRoot /docs/mary
        TransferLog /logs/mary.access_log
        ErrorLog /logs/mary.error_log
    </VirtualHost>
31Virtually hosted docroots
[Figure: one web server hosting two virtual sites, www.joes-hardware.com and www.marys-antiques.com]
Request message A:
    GET /index.html HTTP/1.0
    Host: www.joes-hardware.com
Request message B:
    GET /index.html HTTP/1.0
    Host: www.marys-antiques.com
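Choosing a docroot from the Host header amounts to a table lookup. The sketch below assumes request headers are stored in a dictionary with lowercase names; the table `VIRTUAL_HOSTS` and the function `docroot_for` are hypothetical, with the two sites and their docroots taken from the example above.

```python
VIRTUAL_HOSTS = {
    "www.joes-hardware.com": "/docs/joe",
    "www.marys-antiques.com": "/docs/mary",
}

def docroot_for(headers, default=None):
    """Pick the document root that matches the request's Host header."""
    host = headers.get("host", "")
    # Drop an optional ":port" suffix and compare case-insensitively.
    hostname = host.split(":", 1)[0].strip().lower()
    return VIRTUAL_HOSTS.get(hostname, default)
```

Both requests ask for the same URI, `/index.html`, yet they resolve to different files because the lookup returns a different docroot for each hostname.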
32User home directory docroots
[Figure: user home directory docroots]
Request message A:
    GET /~bob/index.html HTTP/1.0
    (served from /home/bob/public_html)
Request message B:
    GET /~betty/index.html HTTP/1.0
    (served from /home/betty/public_html)
33User home directory docroots
- Another common use of docroots gives people private web sites on a web server.
- A typical convention maps URIs whose paths begin with a slash and tilde (/~) followed by a username to a private document root for that user.
- The private docroot is often the folder called public_html inside that user's home directory, but it can be configured differently (e.g., on the NCNU web server, we use WWW as the user's private document root).
- In Apache's configuration:
    UserDir public_html
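The tilde convention can be sketched as a small mapping function. It assumes, for illustration only, that every home directory lives under `/home`; a real server would look up each user's home directory from the system account database. The name `map_user_uri` is hypothetical.

```python
import posixpath

def map_user_uri(uri_path, home_root="/home", user_dir="public_html"):
    """Map /~user/path to <home_root>/<user>/<user_dir>/path, else None."""
    if not uri_path.startswith("/~"):
        return None  # not a user home directory URI
    username, _, rest = uri_path[2:].partition("/")
    if not username or username in (".", ".."):
        return None
    root = posixpath.join(home_root, username, user_dir)
    candidate = posixpath.normpath(posixpath.join(root, rest))
    # Refuse paths that back up out of the user's private docroot.
    if candidate != root and not candidate.startswith(root + "/"):
        return None
    return candidate
```

Passing `user_dir="WWW"` gives the variant mentioned above in which the private docroot is a folder named WWW.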
34Directory listings
- A web server can receive requests for directory URLs, where the path resolves to a directory, not a file.
- Most web servers can be configured to take a few different actions when a client requests a directory URL:
  - Return an error.
  - Return a special, default "index file" instead of the directory.
  - Scan the directory, and return an HTML page containing the contents.
- Most web servers look for a file named index.html or index.htm inside a directory to represent that directory.
- In Apache's configuration:
    DirectoryIndex index.html index.htm home.html home.htm index.cgi
- Disable the automatic generation of directory index files with the Apache directive:
    Options -Indexes
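The three possible actions for a directory URL can be sketched as one function. The index file names, the `allow_listing` flag, and the choice of a 403 status when listings are disabled are assumptions made for this example rather than a description of any specific server.

```python
import os

INDEX_NAMES = ["index.html", "index.htm", "home.html", "index.cgi"]

def resolve_directory(dir_path, index_names=INDEX_NAMES, allow_listing=False):
    """Decide what to return for a request that resolves to a directory."""
    # First choice: a default index file, tried in configured order.
    for name in index_names:
        candidate = os.path.join(dir_path, name)
        if os.path.isfile(candidate):
            return ("file", candidate)
    # Second choice: a generated listing, if the configuration allows it.
    if allow_listing:
        return ("listing", sorted(os.listdir(dir_path)))
    # Otherwise: an error.
    return ("error", 403)
```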
35Dynamic content resource mapping
- Web servers can also map URIs to dynamic resources, that is, to programs that generate content on demand.
- In fact, a whole class of web servers called application servers connect web servers to sophisticated backend applications.
- The web server needs to be able to tell when a resource is a dynamic resource, where the dynamic content generator program is located, and how to run the program.
- In Apache's configuration:
    ScriptAlias /cgi-bin/ /usr/local/etc/httpd/cgi-programs/
    AddHandler cgi-script .cgi
- CGI is an early, simple, and popular interface for executing server-side applications. Modern application servers have more powerful server-side dynamic content support, including Active Server Pages, Java servlets, and PHP.
36Dynamic Content Resource Mapping
37Server-Side Includes (SSI)
- Many web servers also provide support for server-side includes.
- If a resource is flagged as containing server-side includes, the server processes the resource contents before sending them to the client.
- The contents are scanned for certain special patterns, which can be variable names or embedded scripts. The special patterns are replaced with the values of variables or the output of executable scripts.
- This is an easy way to create dynamic content.
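The scan-and-replace step can be sketched with a regular expression. The sketch handles only a variable-echo pattern written in the style of Apache's `<!--#echo var="NAME" -->` directive; the function name `expand_includes` and the `(none)` placeholder for undefined variables are assumptions for this example, and embedded scripts are not covered.

```python
import re

# Matches directives of the form <!--#echo var="NAME" -->
SSI_ECHO = re.compile(r'<!--#echo\s+var="([^"]+)"\s*-->')

def expand_includes(content, variables):
    """Replace each echo directive with the value of the named variable."""
    def substitute(match):
        return str(variables.get(match.group(1), "(none)"))
    return SSI_ECHO.sub(substitute, content)
```

For instance, expanding `<p>Today is <!--#echo var="DATE_LOCAL" --></p>` with `DATE_LOCAL` set to `Monday` produces `<p>Today is Monday</p>`.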
38Access controls
- Web servers can also assign access controls to particular resources.
- When a request arrives for an access-controlled resource, the web server can control access based on the IP address of the client, or it can issue a password challenge to get access to the resource.
- We will see more details in a later lecture (HTTP authentication).
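One possible policy combining the two mechanisms is sketched below: clients from an allowed network get access directly, and everyone else receives a password challenge. The network in `ALLOWED_NETWORKS` is a hypothetical documentation range, `access_decision` is an illustrative name, and verifying the password itself is out of scope here.

```python
import ipaddress

ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]  # hypothetical

def access_decision(client_ip, has_valid_password, allowed=ALLOWED_NETWORKS):
    """Return 200 if access is granted, or 401 to issue a password challenge."""
    address = ipaddress.ip_address(client_ip)
    if any(address in network for network in allowed):
        return 200  # trusted by IP address
    if has_valid_password:
        return 200  # passed the password challenge
    return 401      # challenge the client for credentials
```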
39Step 5: Building responses
- Once the web server has identified the resource, it performs the action described in the request method and returns the response message, which contains a status code, response headers, and a response body if one was generated.
- Response Entities
- MIME Typing
- Redirection
40Response entities
- If the transaction generated a response body, the content is sent back with the response message, which usually contains:
  - a Content-Type header, describing the MIME type of the body
  - a Content-Length header, describing the size of the body
  - the actual message body content
41MIME typing
- The web server is responsible for determining the MIME type of the response body.
- There are many ways to configure servers to associate MIME types with resources:
  - mime.types: extension-based type association.
  - Magic typing: content-based association, scanning the content for known patterns.
  - Explicit typing: force particular files or directory contents to have a MIME type, regardless of the file extension or contents.
  - Type negotiation: the server is configured to store a resource in multiple document formats. In a client-server negotiation process, the server can determine the best format to use.
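Extension-based and explicit typing can be sketched with Python's standard `mimetypes` module, which keeps its own extension table in the spirit of a mime.types file. The function `content_type_for`, the `forced_types` override table, and the `application/octet-stream` fallback are assumptions for this example; magic typing and type negotiation are not shown.

```python
import mimetypes

def content_type_for(path, forced_types=None, default="application/octet-stream"):
    """Pick a Content-Type: explicit override first, then file extension."""
    forced_types = forced_types or {}
    if path in forced_types:  # explicit typing wins over the extension
        return forced_types[path]
    guessed, _encoding = mimetypes.guess_type(path)
    return guessed or default
```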
42MIME Typing
[Figure: the server www.csie.ncnu.edu.tw determines the MIME type of the hychen.gif file requested by the client]
The HTTP request message contains the command and the URI:
    GET /specials/hychen.gif HTTP/1.1
    Host: www.csie.ncnu.edu.tw
43Redirection
- Web servers sometimes return redirection responses (indicated by a 3XX return code) instead of success messages. The Location response header contains a URI for the new or preferred location of the content. Redirections are useful for:
  - Permanently moved resources
  - Temporarily moved resources
  - URL augmentation
  - Load balancing
  - Server affinity
  - Canonicalizing directory names
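As one concrete case, canonicalizing a directory name means redirecting a directory URL that lacks its trailing slash. The sketch below builds such a response; the function name `directory_redirect`, the use of a 301 status, and the `http://` scheme are assumptions for illustration.

```python
def directory_redirect(host, uri_path, is_directory):
    """Return a 301 response for a directory URL lacking a trailing slash."""
    if not is_directory or uri_path.endswith("/"):
        return None  # nothing to canonicalize
    # The Location header carries the preferred URI for the content.
    location = "http://%s%s/" % (host, uri_path)
    return (
        "HTTP/1.0 301 Moved Permanently\r\n"
        "Location: %s\r\n"
        "Content-Length: 0\r\n"
        "\r\n" % location
    )
```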
44Step 6: Sending responses
- A server may have many connections to many clients: some idle, some sending data to the server, and some carrying response data back to the clients.
- The server needs to keep track of connection state and handle persistent connections with special care.
- For non-persistent connections, the server is expected to close its side of the connection when the entire message has been sent.
- For persistent connections, the connection may stay open, in which case the server needs to be extra cautious to compute the Content-Length header correctly, or the client will have no way of knowing when a response ends.
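The point about Content-Length can be made concrete with a small response builder that always derives the header from the actual body bytes. The name `build_response` and its parameters are hypothetical, and only the 200 OK case is covered.

```python
def build_response(body, content_type="text/plain", keep_alive=False):
    """Build a complete 200 response whose Content-Length matches the body."""
    headers = [
        "HTTP/1.1 200 OK",
        "Content-Type: %s" % content_type,
        # Count bytes, not characters: the client uses this value to find
        # the end of the response on a connection that stays open.
        "Content-Length: %d" % len(body),
        "Connection: %s" % ("keep-alive" if keep_alive else "close"),
    ]
    return "\r\n".join(headers).encode("ascii") + b"\r\n\r\n" + body
```

The body must already be encoded as bytes, so a string with non-ASCII characters is measured after encoding rather than before.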
45Step 7: Logging
- Finally, when a transaction is complete, the web server writes an entry to a log file, describing the transaction performed.
- Most web servers provide several configurable forms of logging. (Details follow in later lectures.)
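A log entry in the Common Log Format mentioned earlier can be sketched as a single formatted line: client, ident username, authenticated username, timestamp, request line, status, and response size. The function name `common_log_line` and the sample values are illustrative; missing fields are written as hyphens.

```python
def common_log_line(client, ident, user, timestamp, request_line, status, size):
    """Format one transaction as a Common Log Format entry."""
    return '%s %s %s [%s] "%s" %d %s' % (
        client,
        ident or "-",  # second field: ident username, "-" if unavailable
        user or "-",   # third field: authenticated username, if any
        timestamp,
        request_line,
        status,
        size if size else "-",
    )
```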
46Reference Web server
- http://www.apache.org
  - The Apache web site
- http://www.w3c.org/Jigsaw
  - Jigsaw, W3C's server
- http://www.ietf.org/rfc/rfc1413.txt
  - RFC 1413, "Identification Protocol," by M. St. Johns