Title: URLs and Resources
1URLs and Resources
2Outline
- Navigating the Internet's resources
- URL syntax and what the various URL components mean and do
- URL shortcuts that many web clients support, including relative URLs and expanded URLs
- URL encoding and character rules
- Common URL schemes
- The future of URLs, including URNs
3Navigating a resource by URL, which tells a web client:
- URL scheme: how to access the resource
- Server location: where the resource is hosted
- Resource path: what particular local resource on the server is being requested
Example (a web page): http://english.csie.ncnu.edu.tw/demo/index.html
- Scheme (how): http
- Host (where): english.csie.ncnu.edu.tw
- Path (what): /demo/index.html
4URLs
- URLs can direct you to resources available through protocols other than HTTP.
- An email account: mailto:hychen@csie.ncnu.edu.tw
- A file residing on an FTP server: ftp://ftp.ncnu.edu.tw/a_file.txt
- A video streamed by a video server: rtsp://www.cnn.com/headline.rm
- Most URLs have the same scheme://server location/path structure.
6URL Syntax
- <scheme>://<user>:<password>@<host>:<port>/<path>;<params>?<query>#<frag>
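As a minimal sketch of how the general syntax above maps onto its nine components, the following uses Python's standard `urllib.parse.urlparse`. The URL itself is a made-up example (the port, params, query, and fragment values are assumptions chosen only so that every component is present).

```python
from urllib.parse import urlparse

# Hypothetical URL chosen to exercise every component of the general syntax.
url = ("http://joe:joespasswd@www.joes-hardware.com:8080"
       "/tools/index.html;type=a?item=hammer#prices")

parts = urlparse(url)

# Map the parsed result onto the component names used in the syntax line.
components = {
    "scheme": parts.scheme,
    "user": parts.username,
    "password": parts.password,
    "host": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "params": parts.params,
    "query": parts.query,
    "frag": parts.fragment,
}

for name, value in components.items():
    print(f"{name:>8}: {value}")
```

Note that `urlparse` only splits off the params of the last path segment; most URLs in practice carry just a scheme, host, and path.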
7Scheme: what protocol to use
- The scheme is the main identifier of how to access a given resource.
- It must start with an alphabetic character.
- It is separated from the rest of the URL by the first ":" character.
- Scheme names are case-insensitive.
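The three scheme rules can be sketched in a few lines. The helper name `split_scheme` is hypothetical, and the check is deliberately limited to the rules listed on this slide rather than the full grammar of any RFC.

```python
def split_scheme(url: str) -> tuple[str, str]:
    """Split a URL at the first ':' and normalise the scheme to lower case."""
    scheme, sep, rest = url.partition(":")  # partition() splits at the FIRST ':'
    first_ok = bool(scheme) and scheme[0].isascii() and scheme[0].isalpha()
    if not sep or not first_ok:
        raise ValueError(f"no valid scheme in {url!r}")
    # Scheme names are case-insensitive, so compare them in lower case.
    return scheme.lower(), rest


print(split_scheme("HTTP://www.joes-hardware.com/"))
print(split_scheme("mailto:hychen@csie.ncnu.edu.tw"))
```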
8Hosts and Ports
- The host component (an IP address or a domain name) identifies the host machine on the Internet that has access to the resource.
- The port component identifies the network port on which the server is listening.
- Different services use different default ports on a machine:
- HTTP: 80
- FTP: 21
- Telnet: 23
- SMTP: 25
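A client that finds no explicit port in a URL falls back to the default for the service. The sketch below assumes a small lookup table built from the four defaults listed above; `effective_port` is a hypothetical helper name, and the port 8080 URL is an invented example.

```python
from urllib.parse import urlsplit

# Default ports taken from the list above, keyed by lower-case service name.
DEFAULT_PORTS = {"http": 80, "ftp": 21, "telnet": 23, "smtp": 25}


def effective_port(url: str):
    """Return the explicit port if present, else the default for the scheme."""
    parts = urlsplit(url)
    if parts.port is not None:
        return parts.port
    return DEFAULT_PORTS.get(parts.scheme)  # None for unknown schemes


print(effective_port("http://www.csie.ncnu.edu.tw/hychen/"))
print(effective_port("http://www.csie.ncnu.edu.tw:8080/hychen/"))
print(effective_port("ftp://ftp.ncnu.edu.tw/a_file.txt"))
```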
9Usernames and Passwords
- Many servers require a username and password before you can access data through them. Here are a few examples:
- ftp://ftp.prep.ai.mit.edu/pub/gnu
- ftp://anonymous@ftp.prep.ai.mit.edu/pub/gnu
- ftp://anonymous:my_passwd@ftp.prep.ai.mit.edu/pub/gnu
- http://joe:joespasswd@www.joes-hardware.com/sales_info.txt
- When the URL omits them, a default username and password are used:
- "anonymous" for the username
- Internet Explorer sends "IEUser" for the password, while Netscape sends "mozilla".
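The fallback to default credentials can be sketched as follows. The helper `credentials` and its default password value are assumptions made for illustration; real clients pick their own default password, as the slide notes.

```python
from urllib.parse import urlsplit


def credentials(url: str, default_user: str = "anonymous",
                default_password: str = "IEUser"):
    """Return (user, password), filling in defaults for missing parts."""
    parts = urlsplit(url)
    user = parts.username if parts.username is not None else default_user
    password = parts.password if parts.password is not None else default_password
    return user, password


print(credentials("ftp://ftp.prep.ai.mit.edu/pub/gnu"))
print(credentials("ftp://anonymous:my_passwd@ftp.prep.ai.mit.edu/pub/gnu"))
```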
10Paths
- The path component of the URL specifies where on the server machine the resource lives.
- The path often resembles a hierarchical filesystem path. For example:
- http://www.csie.ncnu.edu.tw/course/1998.html
- The path in this URL is /course/1998.html, which resembles a filesystem path on a UNIX filesystem.
- The path component for HTTP URLs can be divided into path segments separated by "/". Each path segment can have its own params component (described later).
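To make the idea of path segments concrete, here is a rough sketch that splits a path on "/" and separates each segment from any params that follow a ";". The function name `split_segments` and the second sample path are hypothetical; the standard library has no direct equivalent because `urlparse` only handles params on the last segment.

```python
def split_segments(path: str):
    """Split a URL path into (segment, params) pairs."""
    segments = []
    for raw in path.strip("/").split("/"):
        # Everything after the first ';' in a segment is its params string.
        name, _, params = raw.partition(";")
        segments.append((name, params))
    return segments


print(split_segments("/course/1998.html"))
print(split_segments("/hammers;sale=false/index.html;graphics=true"))
```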
11Parameters
- For many schemes, a simple host and path to the object just aren't enough.
- Aside from which port the server is listening on, and whether or not you have access to the resource with a username and password, many protocols require more information to work.
- For example, FTP needs to know the transfer type, which is passed as a name=value parameter after a ";":
- ftp://ftp.ncnu.edu.tw/image.gif;type=a
- ftp://ftp.ncnu.edu.tw/program.exe;type=i
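As an illustration, `urllib.parse.urlparse` separates the params from the path for the FTP examples above. The helper `params_dict`, which turns the params string into name/value pairs, is a hypothetical convenience and not part of the standard library.

```python
from urllib.parse import urlparse


def params_dict(url: str) -> dict:
    """Return the params of the last path segment as a name -> value dict."""
    params = urlparse(url).params
    pairs = (item.partition("=") for item in params.split(";") if item)
    return {name: value for name, _, value in pairs}


parts = urlparse("ftp://ftp.ncnu.edu.tw/image.gif;type=a")
print(parts.path)    # the path without the params
print(parts.params)  # the raw params string
print(params_dict("ftp://ftp.ncnu.edu.tw/program.exe;type=i"))
```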
12Query strings
- Some resources, such as databases, can be queried according to input strings. For example:
- http://www.xxx.tw/a.cgi?id=123&name=abc
- There is no requirement for the format of the query component, except that some characters are illegal. By convention, many gateways expect the query to be formatted as a series of name=value pairs, separated by & characters.
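The name=value convention is what the standard library's query helpers assume. A small sketch using the example URL above:

```python
from urllib.parse import parse_qs, urlencode, urlsplit

url = "http://www.xxx.tw/a.cgi?id=123&name=abc"

# Everything after the '?' (and before any fragment) is the query component.
query = urlsplit(url).query

# parse_qs maps each name to a LIST of values, since a name may repeat.
fields = parse_qs(query)
print(fields)

# urlencode goes the other way and builds a query string from pairs.
rebuilt = urlencode({"id": "123", "name": "abc"})
print(rebuilt)
```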
13Query Strings
[Figure: the client sends http://english.csie.ncnu.edu.tw/course/NWSMLViewer.php?lectureid=rctlee-20030909125212 across the Internet to the server, and the query lectureid=rctlee-20030909125212 is handed to the viewer gateway.]
14Fragments
- A fragment names a finer piece of a resource, such as a section in a large HTML document, so that it can be accessed directly. For example:
- http://engquiz.csie.ncnu.edu.tw/e-book/html/B001.html#page10
- Because HTTP servers generally deal only with entire objects, not with fragments of objects, clients don't pass fragments along to servers. That is, the whole object is retrieved, but only the relevant part is displayed.
- Note that with the Range Request feature of HTTP/1.1, agents may request byte ranges of objects (covered in later lectures).
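The client-side split between "what is requested" and "what is kept locally" can be sketched with `urllib.parse.urldefrag`, using the example URL above.

```python
from urllib.parse import urldefrag

link = "http://engquiz.csie.ncnu.edu.tw/e-book/html/B001.html#page10"

# urldefrag returns the URL without its fragment, plus the fragment itself.
request_url, fragment = urldefrag(link)

print(request_url)  # this is what the client sends to the server
print(fragment)     # this stays in the client and selects what to display
```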
15Fragments
[Figure: a client and the server www.csie.ncnu.edu.tw, connected through the Internet]
(a) User selects a link to http://www.csie.ncnu.edu.tw/hychen/web_tech/#Resource
(b) Browser makes a request to http://www.csie.ncnu.edu.tw/hychen/web_tech/ (the fragment is NOT sent to the server)
(c) Server returns the entire HTML page
(d) Browser scrolls down and displays the HTML page starting at the fragment named Resource
16URL shortcuts
- Web clients understand and use a few URL shortcuts.
- Many browsers also support automatic expansion of URLs, where the user types in a key (memorable) part of a URL and the browser fills in the rest.
- Relative URLs
- Base URLs
- Resolving relative references
- Expanded URLs
17Relative URLs
- URLs come in two flavors: absolute and relative.
- So far, we have looked only at absolute URLs, which contain all the information you need to access a resource.
- A relative URL, on the other hand, is incomplete. To get all the information needed to access a resource, a relative URL must be interpreted relative to another URL, called its base.
18HTML snippet with relative URL
<HTML>
<HEAD> <TITLE> Joe's Tools </TITLE> </HEAD>
<BODY>
<H1> Tools page </H1>
<H2> Hammers </H2>
<P> Joe's HARDWARE online has the largest selection of <A href="./hammers.html"> hammers </A> on earth.
</BODY>
</HTML>
19Using a base URL
- Relative URL: ./hammers.html
- Base URL: http://www.joes-hardware.com/tools.html
- New absolute URL: http://www.joes-hardware.com/hammers.html
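The same conversion can be reproduced with `urllib.parse.urljoin`, which takes the base URL first and the relative URL second. This is only a sketch of the result, not of the resolution algorithm itself.

```python
from urllib.parse import urljoin

base_url = "http://www.joes-hardware.com/tools.html"
relative_url = "./hammers.html"

# The last path segment of the base (tools.html) is dropped, and the
# relative URL is resolved against the remaining directory.
absolute_url = urljoin(base_url, relative_url)
print(absolute_url)
```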
20Base URLs
- The first step in the conversion process is to find a base URL, which can come from a few places:
- Explicitly provided in the resource: the document uses the <BASE> tag to define the base URL.
- Base URL of the encapsulating resource: when the document does not explicitly specify a base URL, the URL of the resource in which the document is embedded is used as the base, as in the example on the preceding slide.
- No base URL: in some instances there is no base URL. This often means that you have an absolute URL; however, sometimes you just have an incomplete or broken URL.
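The first two sources of a base URL can be sketched as a lookup with a fallback. The class `BaseFinder`, the function `find_base_url`, and the sample `<BASE>` value are assumptions for illustration; the sketch uses the standard `html.parser` module and ignores edge cases such as malformed markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class BaseFinder(HTMLParser):
    """Remember the href of the first <BASE> tag seen in a document."""

    def __init__(self):
        super().__init__()
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before calling this.
        if tag == "base" and self.base_href is None:
            self.base_href = dict(attrs).get("href")


def find_base_url(html: str, document_url: str) -> str:
    """Prefer an explicit <BASE>; otherwise use the document's own URL."""
    finder = BaseFinder()
    finder.feed(html)
    return finder.base_href or document_url


page_url = "http://www.joes-hardware.com/tools.html"
with_base = '<HTML><HEAD><BASE HREF="http://www.joes-hardware.com/tools/"></HEAD></HTML>'
without_base = "<HTML><HEAD><TITLE>Joe's Tools</TITLE></HEAD></HTML>"

print(urljoin(find_base_url(with_base, page_url), "./hammers.html"))
print(urljoin(find_base_url(without_base, page_url), "./hammers.html"))
```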
21Resolving relative references
22Expanded URLs
- Some browsers try to expand URLs automatically, either after you submit the URL or while you're typing. This provides users with a shortcut: they don't have to type in the complete URL.
- Hostname expansion
- Ex: yahoo -> www.yahoo.com
- History expansion
- Ex: http://www.ncnu -> http://www.ncnu.edu.tw
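Both kinds of expansion can be approximated with very simple heuristics. The sketch below is an assumption about how such a feature might work, not a description of any particular browser; the function names and the "www." + name + ".com" rule are illustrative only.

```python
def expand_hostname(typed: str) -> str:
    """Wrap a bare word in 'www.' and '.com'; leave anything else alone."""
    if "." in typed or "/" in typed or ":" in typed:
        return typed
    return f"www.{typed}.com"


def expand_from_history(typed: str, history: list):
    """Return the first previously visited URL that starts with the input."""
    for url in history:
        if url.startswith(typed):
            return url
    return None


visited = ["http://www.ncnu.edu.tw", "http://www.w3.org/Addressing/"]
print(expand_hostname("yahoo"))
print(expand_from_history("http://www.ncnu", visited))
```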
23Shady characters in URLs
- URLs were designed to be portable, so that they can uniformly name all the resources on the Internet. This means that URLs will be transmitted through various protocols.
- Because different protocols (schemes) use different mechanisms for transmitting data, it is important that URLs can be transmitted safely, that is, without losing information, through any protocol over the network.
- Some protocols, such as the Simple Mail Transfer Protocol (SMTP) for email, use a 7-bit encoding for messages; this can strip off information if the source is encoded in 8 bits or more.
- To get around this, URLs are permitted to contain only characters from a relatively small, universally safe alphabet.
- In addition to being transportable, URLs should be readable. Hence, invisible, nonprinting characters are also prohibited in URLs, even though these characters may pass through mailers.
- To complicate matters further, URLs also need to be complete. Sooner or later people would want URLs to contain binary data or characters outside the universally safe alphabet, so an escape mechanism was added.
24The URL Character Set
- US-ASCII is very portable, due to its long legacy. It uses 7 bits to represent most keys available on an English typewriter and a few nonprinting control characters for text formatting and hardware signaling. But it doesn't support the inflected characters common in European languages, or non-Romanic languages.
- In addition, some URLs may need to contain arbitrary binary data.
- Escape sequences allow the encoding of arbitrary values using a restricted subset of the US-ASCII character set, yielding both portability and completeness.
25Encoding mechanism
- The encoding simply represents an unsafe character by an escape notation, consisting of a percent sign (%) followed by two hexadecimal digits. For example:
- ~ -> 0x7E, http://www.ncnu.edu.tw/%7Ehychen
- Space -> 0x20, http://www.abc.com/web%20tools.html
- % -> 0x25, http://www.abc.com/100%25satisfaction.html
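The escape notation is easy to reproduce by hand, and the standard library offers `quote` and `unquote` for whole strings. In this sketch `percent_escape` is a hypothetical helper for a single US-ASCII character. Note that recent Python versions treat "~" as safe and leave it unescaped in `quote`, so the tilde example is shown through `unquote` instead.

```python
from urllib.parse import quote, unquote


def percent_escape(ch: str) -> str:
    """Escape one US-ASCII character as '%' plus two hexadecimal digits."""
    return f"%{ord(ch):02X}"


# The three characters from the examples above.
print(percent_escape("~"), percent_escape(" "), percent_escape("%"))

# quote() escapes unsafe characters in a path; '/' is kept by default.
print(quote("/web tools.html"))
print(quote("/100%satisfaction.html"))

# unquote() reverses the escaping.
print(unquote("http://www.ncnu.edu.tw/%7Ehychen"))
```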
26Character Restrictions
- %  Escape token
- /  Path delimiter
- .  Path component
- ..  Path component
- #  Fragment delimiter
- ?  Query-string delimiter
- ;  Params delimiter
- ,  Reserved
- @  Reserved; has special meaning in some schemes
- \  Restricted; handled unsafely by various transport agents, such as gateways
- < >  Unsafe; should be encoded because they often have meaning outside the scope of the URL
- 0x00-0x1F, 0x7F  Restricted; these fall within the nonprintable range
- >0x7F  Restricted; these do not fall within the 7-bit range of US-ASCII
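The last two rows of the list are purely numeric, so they can be checked mechanically. The sketch below flags characters in those ranges and escapes them; both function names are hypothetical, and encoding characters above 0x7F as UTF-8 bytes before escaping is an assumption that this slide does not state.

```python
def is_restricted(ch: str) -> bool:
    """True for nonprintable US-ASCII (0x00-0x1F, 0x7F) and anything > 0x7F."""
    code = ord(ch)
    return code <= 0x1F or code >= 0x7F


def escape_restricted(text: str) -> str:
    """Percent-escape only the characters that is_restricted() flags."""
    out = []
    for ch in text:
        if is_restricted(ch):
            # Assumption: non-ASCII characters are escaped as UTF-8 bytes.
            out.extend(f"%{byte:02X}" for byte in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


print(escape_restricted("a\tb"))
print(escape_restricted("caf\u00e9"))
```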
27Common scheme format
- http, https
- mailto
- ftp
- rtsp, rtspu
- file
- news
- telnet
28The Future: URN?
STEP 1: Ask the resource resolver what the Joe's Hardware URL is, and receive from the resolver the current location of the resource.
- The client sends, across the Internet, to purl.oclc.org: GET http://purl.oclc.org/jhardware/
- The resolver answers with the actual URL: http://www.joes-hardware.com/
STEP 2: Get the actual URL for the resource.
- The client sends, across the Internet, to www.joes-hardware.com: GET http://www.joes-hardware.com
29URI: Universal Resource Identifier
- URIs are defined in RFC 1630 (1994).
- URI is a superset of URL and URN.
- Full URI: proto://hostname/path (identifies the server)
- http://www.csie.ncnu.edu.tw:80/hychen/
- Partial URI: /path (no server mentioned)
- /hychen/
30URLs information
- http://www.w3.org/Addressing/
- The W3C page about naming and addressing URIs and URLs.
- http://www.ietf.org/rfc/rfc1738.txt
- RFC 1738, Uniform Resource Locators (URL), by T. Berners-Lee, L. Masinter, and M. McCahill.
- http://www.ietf.org/rfc/rfc2396.txt
- RFC 2396, Uniform Resource Identifiers (URI): Generic Syntax, by T. Berners-Lee, R. Fielding, and L. Masinter.
- http://www.ietf.org/rfc/rfc2141.txt
- RFC 2141, URN Syntax, by R. Moats.
- http://purl.oclc.org
- The persistent uniform resource locator web site.
- http://www.ietf.org/rfc/rfc1808.txt
- RFC 1808, Relative Uniform Resource Locators, by R. Fielding.