Title: URLs
1URLs Uniform Resource Locators
- Since web pages may contain pointers to other
pages, we will see how those pointers are
implemented.
- When the Web was first created, it was apparent
that having one page point to another required
mechanisms for naming and locating pages. In
particular, there were 3 questions that had to be
answered before a selected page could be
displayed:
- What is the page called?
- Where is the page located?
- How can the page be accessed?
2URLs
- The solution chosen identifies pages in a way
that solves all 3 problems at once.
- Each page is assigned a URL (Uniform Resource
Locator) that effectively serves as the page's
worldwide name.
3URLs
- URLs have 3 parts:
- The protocol (also called a scheme)
- The DNS name of the machine on which the page is
located, and
- A local name uniquely indicating the specific
page (usually just a file name on the machine
where it resides)
- For example, the URL for the author's department
is http://www.cs.vu.nl/welcome.html. This URL
consists of 3 parts: the protocol (http), the DNS
name of the host (www.cs.vu.nl), and the file name
(welcome.html), with certain punctuation
separating the pieces.
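The three-part split described above can be tried out with Python's standard urllib.parse module. This is only a minimal sketch; the helper name url_parts is an assumption made for illustration and does not come from the slides.

```python
from urllib.parse import urlsplit


def url_parts(url):
    """Return (protocol, DNS name, local name) for a URL.

    url_parts is a hypothetical helper name used only for this sketch.
    """
    parts = urlsplit(url)
    # scheme   -> the protocol, e.g. "http"
    # hostname -> the DNS name of the machine
    # path     -> the local name; the leading "/" is punctuation, not part of it
    return parts.scheme, parts.hostname, parts.path.lstrip("/")


print(url_parts("http://www.cs.vu.nl/welcome.html"))
# prints ('http', 'www.cs.vu.nl', 'welcome.html')
```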
4URLs
- Many sites have certain shortcuts for file names
built in. For example, ~user/ might be mapped
onto user's WWW directory, with the convention
that a reference to the directory itself implies
a certain file, say, index.html.
- Thus the author's home page can be reached at
http://www.cs.vu.nl/~ast/ even though the actual
file name is different.
- At many sites a null file name defaults to the
organization's home page.
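One way a server might implement these shortcuts is sketched below. Everything about the layout is an assumption made for illustration: the function name map_path, the directories /var/www and /home/USER/www, and the choice of ~ as the marker for a user directory. Only the default file name index.html comes from the text.

```python
import posixpath

DEFAULT_FILE = "index.html"  # the default file named in the text


def map_path(url_path, doc_root="/var/www", user_root="/home"):
    """Map the path part of a URL to a file name (hypothetical layout)."""
    if url_path.startswith("/~"):
        # "/~ast/x.html" -> user "ast", remainder "x.html"
        user, _, rest = url_path[2:].partition("/")
        base = posixpath.join(user_root, user, "www")
    else:
        rest = url_path.lstrip("/")
        base = doc_root
    # A null file name or a bare directory implies the default file.
    if rest == "" or rest.endswith("/"):
        rest += DEFAULT_FILE
    return posixpath.join(base, rest)


print(map_path("/~ast/"))        # prints /home/ast/www/index.html
print(map_path("/"))             # prints /var/www/index.html
print(map_path("/welcome.html")) # prints /var/www/welcome.html
```

A real server would also have to reject paths that escape the document root (for example via ".."); the sketch leaves that out.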
5URLs mechanism
- To make a piece of text clickable, the page writer
must provide 2 items of information:
- The clickable text to be displayed, and
- The URL of the page to go to if the text is
selected
- When the text is selected, the browser looks up
the host name using DNS. Now armed with the
host's IP address, the browser then establishes a
TCP connection to the host. Over that connection
it sends the file name using the specified
protocol. Next, back comes the page.
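The sequence just described (DNS lookup, TCP connection, send the file name, read the page) can be sketched with Python's socket module. The function names build_request and fetch are assumptions for this sketch, and it handles plain http only. It also sends a Host header, which HTTP/1.1 requires even though the slides do not show it. Calling fetch needs a working network connection.

```python
import socket
from urllib.parse import urlsplit


def build_request(url):
    """Return (host, port, request bytes) for a plain-http URL."""
    parts = urlsplit(url)
    host = parts.hostname
    port = parts.port or 80          # default HTTP port
    path = parts.path or "/"         # null file name -> "/"
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"          # mandatory in HTTP/1.1
        "Connection: close\r\n"      # ask the server to close when done
        "\r\n"
    )
    return host, port, request.encode("ascii")


def fetch(url, timeout=10.0):
    """Fetch a page the way the text describes; returns the raw reply."""
    host, port, request = build_request(url)
    ip_address = socket.gethostbyname(host)              # 1. DNS lookup
    with socket.create_connection((ip_address, port),
                                  timeout=timeout) as conn:  # 2. TCP connection
        conn.sendall(request)                            # 3. send the file name
        chunks = []
        while True:                                      # 4. back comes the page
            data = conn.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)
```

For example, fetch("http://www.cs.vu.nl/welcome.html") would return the status line, headers, and page as bytes, provided the host is still reachable and serves that file.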
6URLs - protocols
- The URL scheme is open-ended, in the sense that
it is straightforward to have protocols other
than HTTP. In fact, URLs for various other
protocols have been defined, and many browsers
understand them.
- The next table illustrates slightly simplified
forms of the more common ones.
7URLs - Protocols
8HTTP HyperText Transfer Protocol
- The standard Web transfer protocol is HTTP
(HyperText Transfer Protocol).
- The HTTP protocol consists of two fairly distinct
items:
- the set of requests from browsers to servers, and
- the set of responses going back the other way
9HTTP
- HTTP is an ASCII protocol (each interaction
consists of an ASCII request, followed by one
MIME-like response).
- MIME (Multipurpose Internet Mail Extensions): in
the early days of the ARPANET, email consisted
exclusively of text messages written in
English and expressed in ASCII. Nowadays on the
Internet this approach is no longer adequate, as
the following need to be addressed:
- Messages in languages with accents (French,
German)
- Messages in non-Latin alphabets (e.g. Hebrew,
Russian)
- Messages in languages without alphabets (e.g.
Chinese, Japanese)
- Messages not containing text at all (e.g. audio,
video)
10MIME
- The basic idea of MIME is to define encoding
rules for non-ASCII messages. MIME defines 5
message headers:
Header                      Meaning
MIME-Version                Identifies the MIME version
Content-Description         Human-readable string telling what is in the message
Content-ID                  Unique identifier
Content-Transfer-Encoding   How the body is wrapped for transmission
Content-Type                Nature of the message
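To see these five headers on a real message, one can build a small non-ASCII message with Python's standard email package. This is a sketch only: the body text, the description string, and the domain example.org are made-up values, and the transfer encoding that appears is whatever the library picks by default.

```python
from email.message import EmailMessage
from email.utils import make_msgid

msg = EmailMessage()
# set_content() fills in MIME-Version, Content-Type and
# Content-Transfer-Encoding for us; the body is deliberately non-ASCII.
msg.set_content("Café à Genève\n")

# The remaining two headers are optional and set by hand here
# (both values are invented for the example).
msg["Content-Description"] = "Short greeting with accented characters"
msg["Content-ID"] = make_msgid(domain="example.org")

for name in ("MIME-Version", "Content-Description", "Content-ID",
             "Content-Transfer-Encoding", "Content-Type"):
    print(f"{name}: {msg[name]}")
```

The two hand-set headers are added after set_content() because that call clears any existing Content-* headers before writing its own.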
11MIME Content Type
Type         Subtype        Meaning
Text         Plain          Unformatted text
             Richtext       Text including simple formatting
Image        Gif            Still picture in GIF format
             Jpeg           Still picture in JPEG format
Audio        Basic          Audible sound
Video        Mpeg           Movie in MPEG format
Application  Octet-stream   An uninterpreted byte sequence
             Postscript     A printable document in PostScript
Message      Rfc822         A MIME RFC 822 message
             Partial        Message has been split for transmission
             External-body  Message must be fetched over the net
Multipart    Mixed          Independent parts
             Alternative    Same message in different formats
             Parallel       Parts must be viewed simultaneously
             Digest         Each part is a complete RFC 822 message
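A Content-Type value is written as type/subtype, for example text/html or image/gif. As a rough illustration, Python's standard mimetypes module guesses such a value from a file name. The helper name content_type is an assumption for this sketch, as is the fallback to application/octet-stream for unknown files.

```python
import mimetypes


def content_type(filename):
    """Guess (type, subtype) for a file name; hypothetical helper."""
    mime, _encoding = mimetypes.guess_type(filename)
    if mime is None:
        # Unknown files are treated as an uninterpreted byte sequence.
        mime = "application/octet-stream"
    main_type, _, subtype = mime.partition("/")
    return main_type, subtype


print(content_type("welcome.html"))  # prints ('text', 'html')
print(content_type("logo.gif"))      # prints ('image', 'gif')
```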
12HTTP - request
- Although HTTP was designed for use in the Web, it
has been intentionally made more general than
necessary with an eye to future object-oriented
applications. For this reason the first word of a
request line is simply the name of the method
(command) to be executed on the Web page (or
general object).
- The built-in methods are as follows:
Method   Description
GET      Request to read a Web page
HEAD     Request to read a Web page's header
PUT      Request to store a Web page
POST     Append to a named resource (e.g. a Web page)
DELETE   Remove the Web page
LINK     Connects two existing resources
UNLINK   Breaks an existing connection between resources
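Since the first word of the request line is the method, a server can split the line on whitespace and check that word against the table above. The sketch below does just that; the names METHODS and parse_request_line are assumptions for illustration, and the method set is copied from the slide rather than from any current HTTP specification.

```python
# Methods listed on the slide (not the method set of modern HTTP).
METHODS = {"GET", "HEAD", "PUT", "POST", "DELETE", "LINK", "UNLINK"}


def parse_request_line(line):
    """Split 'METHOD target VERSION' into its three words."""
    fields = line.split()
    if len(fields) != 3:
        raise ValueError(f"malformed request line: {line!r}")
    method, target, version = fields
    if method not in METHODS:
        raise ValueError(f"unknown method: {method}")
    return method, target, version


print(parse_request_line("GET /hypertext/WWW/TheProject.html HTTP/1.1"))
# prints ('GET', '/hypertext/WWW/TheProject.html', 'HTTP/1.1')
```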
13HTTP request / response
- A request is just a GET line, naming the page
desired and the HTTP protocol version:
- GET /hypertext/WWW/TheProject.html HTTP/1.1
- The response is just the raw page, headers, and
MIME information.
- Because HTTP is an ASCII protocol, it is easy
for a person at a terminal (as opposed to a
browser) to talk directly to Web servers. All
that is needed is a TCP connection to port 80
on the server. The simplest way to get such a
connection is the Telnet program.
14HTTP - example
- Client: telnet www.w3.org 80
- Trying 18.23.0.23
- Connected to www.w3.org
- Client: GET /hypertext/WWW/TheProject.html HTTP/1.1
- Server: HTTP/1.1 200 Document follows
- Server: MIME-Version: 1.0
- Server: Server: CERN/3.0
- Server: Content-Type: text/html
- Server: Content-Length: 8247
- Server: <HEAD><TITLE>The World Wide Web Consortium (W3C)</TITLE></HEAD>
- Server: <BODY>
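A reply like the one in this session can be taken apart with a few string operations: the first line is the status line, the following lines are headers, and a blank line separates them from the page. The sketch below assumes the reply has already been read into a string with CRLF line endings. The names parse_response and SAMPLE are made up for illustration, and SAMPLE is modeled on the session above rather than captured from a live server.

```python
def parse_response(raw):
    """Split a raw HTTP reply into status line, headers, and body."""
    head, _, body = raw.partition("\r\n\r\n")     # blank line ends the headers
    status_line, *header_lines = head.split("\r\n")
    version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return version, int(code), reason, headers, body


# Modeled on the Telnet session in the text; not a captured reply.
SAMPLE = (
    "HTTP/1.1 200 Document follows\r\n"
    "MIME-Version: 1.0\r\n"
    "Server: CERN/3.0\r\n"
    "Content-Type: text/html\r\n"
    "Content-Length: 8247\r\n"
    "\r\n"
    "<HEAD><TITLE>The World Wide Web Consortium (W3C)</TITLE></HEAD>\r\n"
)

version, code, reason, headers, body = parse_response(SAMPLE)
print(code, reason)             # prints 200 Document follows
print(headers["content-type"])  # prints text/html
```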
15HTTP Example
- Alternatively, one could use a command-line
browser (such as WFetch) to view the same information.
17HTML HyperText Markup Language
- HTML is a markup language, a language for
describing how documents are to be formatted. The
term markup comes from the old days when
copyeditors actually marked up documents to tell
the printer (in those days a human being) which
fonts to use, and so on.
- Markup languages thus contain explicit commands
for formatting. For example, in HTML, <B> means
start boldface mode, and </B> means leave
boldface mode.
18HTML
- The advantage of a markup language over one with
no explicit markup is that writing a browser for
it is straightforward: the browser simply has to
understand the markup commands.
- By embedding the markup commands within each HTML
file and standardizing them, it becomes possible
for any Web browser to read and reformat any Web
page.
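To give a feel for how little a program needs in order to act on markup commands, here is a toy renderer built on Python's standard html.parser module. It understands only the boldface command and shows bold text between asterisks. The class name BoldRenderer, the function render, and the asterisk convention are all assumptions for this sketch. A real browser handles far more.

```python
from html.parser import HTMLParser


class BoldRenderer(HTMLParser):
    """Toy renderer: understands only the <B> ... </B> command."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag == "b":            # the parser lowercases tag names
            self.out.append("*")  # enter boldface mode

    def handle_endtag(self, tag):
        if tag == "b":
            self.out.append("*")  # leave boldface mode

    def handle_data(self, data):
        self.out.append(data)     # ordinary text passes through


def render(html_text):
    renderer = BoldRenderer()
    renderer.feed(html_text)
    renderer.close()
    return "".join(renderer.out)


print(render("plain <B>bold</B> text"))  # prints plain *bold* text
```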
19HTML
- HTTP and HTML are constantly evolving. When
Mosaic was the only browser, the language it
interpreted, HTML 1.0, was the de facto standard.
- When new browsers came along, there was a need
for a formal Internet standard, so the HTML 2.0
standard was produced. Next, HTML 3.0 was created
as a research effort to add many new features to
HTML 2.0, including tables, toolbars,
mathematical formulas, advanced style sheets (for
defining page layout and the meaning of symbols),
etc.
20HTML brief introduction
- A proper Web page consists of a head and body
enclosed by <HTML> and </HTML> tags (formatting
commands), although most browsers do not complain
if these tags are missing.
- The head is bracketed by <HEAD> </HEAD> tags, and
the body is bracketed by <BODY> </BODY> tags.
- The commands inside the tags are called
directives. Most HTML tags have this format, that
is, <SOMETHING> to mark the beginning of
something and </SOMETHING> to mark its end.
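The begin/end pairing of tags can be checked mechanically with a stack: push each opening tag, and pop it when the matching closing tag arrives. The sketch below does this with Python's standard html.parser module. The names TagChecker, is_balanced, and PAGE are assumptions for illustration, and the checker deliberately ignores tags that have no closing form (such as BR), so it would report pages using them as unbalanced.

```python
from html.parser import HTMLParser


class TagChecker(HTMLParser):
    """Checks that every opening tag is closed in the right order."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.balanced = False  # closing tag without a matching opener


def is_balanced(page):
    checker = TagChecker()
    checker.feed(page)
    checker.close()
    return checker.balanced and not checker.stack


# A minimal page with the head/body structure described above.
PAGE = "<HTML><HEAD><TITLE>Demo</TITLE></HEAD><BODY>Hello</BODY></HTML>"

print(is_balanced(PAGE))                        # prints True
print(is_balanced("<HTML><BODY>Hello</HTML>"))  # prints False
```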
21HTML brief introduction
- Numerous other examples of HTML are easily
available. Most browsers have a menu item VIEW
SOURCE or something similar. Selecting this item
for an HTML page displays the page's HTML
source instead of the formatted output.
22DNS Domain Name System
- Programs rarely refer to hosts, mailboxes, and
other resources by their binary network
addresses. Instead, they use ASCII strings, such
as tana@art.ucsb.edu.
- Nevertheless, the network itself only understands
binary addresses, so some mechanism is required
to convert the ASCII strings to network
addresses.
23DNS
- Way back in the days of the ARPANET, there was
simply a file, hosts.txt, that listed all the
hosts and their IP addresses. Every night, all
the hosts would fetch it from the site at which
it was maintained. For a network of a few hundred
large timesharing machines, this approach worked
reasonably well.
- However, when thousands of workstations were
connected to the net, everyone realized that this
approach could not continue to work forever.
24DNS
- For one thing, the file would become too large.
Even more important, host name conflicts would
occur constantly unless names were centrally
managed, something unthinkable in a huge
international network.
- To solve these problems, DNS (the Domain Name
System) was invented.
25DNS
- The essence of DNS is the invention of a
hierarchical, domain-based naming scheme and a
distributed database system for implementing this
naming scheme.
- It is primarily used for mapping host names and
email destinations to IP addresses.
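The hierarchy is visible in the name itself: reading the dot-separated labels from right to left goes from the top-level domain down to the individual host. A minimal sketch follows; the function name domain_hierarchy is an assumption for illustration.

```python
def domain_hierarchy(name):
    """List the domains a host name belongs to, from the top level down."""
    labels = name.rstrip(".").split(".")
    # Join ever-longer suffixes: "nl", "vu.nl", "cs.vu.nl", ...
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


print(domain_hierarchy("www.cs.vu.nl"))
# prints ['nl', 'vu.nl', 'cs.vu.nl', 'www.cs.vu.nl']
```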
26DNS how it is used
- To map a name onto an IP address, an application
program calls a library procedure called the
resolver, passing it the name as a parameter. The
resolver sends a UDP packet to a local DNS
server, which then looks up the name and returns
the IP address to the resolver, which then
returns it to the caller.
- Armed with the IP address, the program can then
establish a TCP connection with the destination,
or send it UDP packets.
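In Python, the role of the resolver library procedure is played by socket.getaddrinfo, which hands the name to the operating system's resolver and returns the addresses found. The sketch below restricts the lookup to IPv4 for simplicity. The function name resolve is an assumption for illustration, and the addresses returned for a real host name depend on the network the code runs on.

```python
import socket


def resolve(name):
    """Return the IPv4 addresses for a host name, sorted and de-duplicated."""
    infos = socket.getaddrinfo(name, None,
                               family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, (address, port)).
    return sorted({info[4][0] for info in infos})


# A literal address needs no DNS query and comes back unchanged.
print(resolve("127.0.0.1"))  # prints ['127.0.0.1']
```

With the address in hand, a program could go on to call socket.create_connection((address, 80)) to open the TCP connection mentioned above.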