Title: Basic WWW Technologies
1Basic WWW Technologies
- Thanks to P. Smyth, Hayes, Mark Sapossnek, B.
Arms, C. Manning, P. Raghavan, H. Schutze
2What we have covered
- What is IR
- Evaluation
- Tokenization and properties of text
- Web crawling
- Query models
- Vector methods
- Measures of similarity
- Indexing
- Inverted files
- This presentation
- Basics of internet and web
- Web graphs
- Spam and SEO
3Web and Internet
- Focus
- Infrastructure
- Languages
- Graphs
- Spam
- SEO
4What Is the World Wide Web?
- The world wide web (web) is a network of
information resources. The web relies on three
mechanisms to make these resources readily
available to the widest possible audience
- 1. A uniform naming scheme for locating resources
on the web (e.g., URIs).
- 2. Protocols, for access to named resources over
the web (e.g., HTTP).
- 3. Hypertext, for easy navigation among resources
(e.g., HTML).
5Internet vs. Web
- Internet
- Internet is a more general term
- Includes physical aspect of underlying networks
and mechanisms such as email, FTP, HTTP
- Web
- HTTP
- Associated with information stored on the
Internet
- Refers to a broader class of networks, e.g. Web
of English Literature
- Both Internet and web are networks
6Networks vs Graphs Examples?
Old internet network
7Essential Components of WWW
- Resources
- Conceptual mappings to concrete or abstract
entities, which do not change in the short term
- ex: IST411 website (web pages and other kinds of
files)
- Resource identifiers (hyperlinks)
- Strings of characters represent generalized
addresses that may contain instructions for
accessing the identified resource
- http://clgiles.ist.psu.edu/IST441 is used to
identify our course homepage
- Transfer protocols
- Conventions that regulate the communication
between a browser (web user agent) and a server
8Internet Technologies: The World Wide Web
- A way to access and share information
- Technical papers, marketing materials, recipes,
...
- A huge network of computers: the Internet
- Graphical, not just textual
- Information is linked to other information
- Application development platform
- Shop from home
- Provide self-help applications for customers and
partners
- ...
9Internet Technologies: WWW Architecture
- Client/Server, Request/Response architecture
- You request a Web page
- e.g. http://www.msn.com/default.asp
- HTTP request
- The Web server responds with data in the form of
a Web page
- HTTP response
- Web page is expressed as HTML
- Pages are identified by a Uniform Resource
Locator (URL)
- Protocol: http
- Web server: www.msn.com
- Web page: default.asp
- Can also provide parameters: ?name=Leon
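The URL pieces listed on this slide can be pulled apart with Python's standard library. This is a minimal sketch: the helper name describe_url and the dictionary keys are illustrative choices, not part of the original slides.

```python
from urllib.parse import urlsplit, parse_qs

def describe_url(url):
    """Split a URL into the pieces named on the slide (hypothetical helper)."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,                # e.g. "http"
        "web_server": parts.netloc,              # e.g. "www.msn.com"
        "web_page": parts.path.lstrip("/"),      # e.g. "default.asp"
        "parameters": parse_qs(parts.query),     # e.g. {"name": ["Leon"]}
    }

info = describe_url("http://www.msn.com/default.asp?name=Leon")
print(info)
```

parse_qs returns a list for each parameter because the same name may appear more than once in a query string.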
10Internet Technologies: Web Standards
- Internet Engineering Task Force (IETF)
- http://www.ietf.org/
- Founded 1986
- Request For Comments (RFC) at http://www.ietf.org/rfc.html
- World Wide Web Consortium (W3C)
- http://www.w3.org
- Founded 1994 by Tim Berners-Lee
- Publishes technical reports and recommendations
11Internet Technologies: Web Design Principles
- Interoperability: Web languages and protocols
must be compatible with one another independent
of hardware and software.
- Evolution: The Web must be able to accommodate
future technologies. Encourages simplicity,
modularity and extensibility.
- Decentralization: Facilitates scalability and
robustness.
12Languages of the WWW
- Markup languages
- A markup language combines text and extra
information about the text. The extra
information, for example about the text's
structure or presentation, is expressed using
markup, which is intermingled with the primary
text. The best-known markup language in modern
use is HTML (Hypertext Markup Language), one of
the foundations of the World Wide Web.
Historically, markup was (and is) used in the
publishing industry in the communication of
printed work between authors, editors, and
printers.
13What is a markup language?
- Textual (i.e. person readable) language where
significant elements are indicated by markers
- <TITLE>XML</TITLE>
- Examples are RTF, HTML, XML, TeX etc.
- Easy to process and can be manipulated by a
variety of application programs
14HTML Background
- HTML was originally developed by Tim Berners-Lee
while at CERN, and popularized by the Mosaic
browser developed at NCSA.
- The Web depends on Web page authors and vendors
sharing the same conventions for HTML. This has
motivated joint work on specifications for HTML.
- HTML standards are organized by W3C:
http://www.w3.org/MarkUp/
15HTML Functionalities
- HTML gives WWW authors the means to
- Publish online documents with headings, text,
tables, lists, photos, etc
- Include spread-sheets, video clips, sound clips,
and other applications directly in their
documents
- Link information via hypertext links, at the
click of a button
- Design forms for conducting transactions with
remote services, for use in searching for
information, making reservations, ordering
products, etc
- Very robust: ignores many errors!
16HTML Versions
- HTML 4.01 is a revision of the HTML 4.0
Recommendation first released on 18th December
1997.
- Last version released.
- XHTML is the new html
- HTML 4.01 Specification
- http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt
- HTML 4.0 was first released as a W3C
Recommendation on 18 December 1997
- HTML 3.2 was W3C's first Recommendation for HTML
which represented the consensus on HTML features
for 1996
- HTML 2.0 (RFC 1866) was developed by the IETF's
HTML Working Group, which set the standard for
core HTML features based upon current practice in
1994.
17Sample Webpage
18Sample Webpage HTML Structure
- <HTML>
- <HEAD>
- <TITLE>The title of the webpage</TITLE>
- </HEAD>
- <BODY> <P>Body of the webpage
- </BODY>
- </HTML>
19HTML Structure
- An HTML document is divided into a head section
(here, between <HEAD> and </HEAD>) and a body
(here, between <BODY> and </BODY>)
- The title of the document appears in the head
(along with other information about the document)
- The content of the document appears in the body.
The body in this example contains just one
paragraph, marked up with <P>
20HTML Hyperlink
- <a href="relations/alumni">alumni</a>
- A link is a connection from one Web resource to
another
- It has two ends, called anchors, and a direction
- Destination anchor - relations/alumni
- Anchor text - alumni
- Starts at the "source" anchor and points to the
"destination" anchor, which may be any Web
resource (e.g., an image, a video clip, a sound
bite, a program, an HTML document)
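As a rough illustration of the two ends of a link, the sketch below pulls the destination anchor and anchor text out of the example hyperlink using Python's standard html.parser. The class name LinkExtractor and the base address http://www.example.edu/ are assumptions made for the example; the slides do not say which page the relative link lives on.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect (destination, anchor text) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url   # address of the page holding the source anchor
        self.links = []            # list of (destination URL, anchor text)
        self._href = None          # destination of the link being read, if any
        self._text = []            # pieces of anchor text seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                # Resolve a relative destination against the source page.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None

extractor = LinkExtractor("http://www.example.edu/")  # hypothetical source page
extractor.feed('<a href="relations/alumni">alumni</a>')
links = extractor.links
print(links)
```

A crawler would run something like this over every fetched page; the resulting (source, destination) pairs are the edges of a web graph.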
21What is XML?
- XML: eXtensible Markup Language
- designed to improve the functionality of the Web
by providing more flexible and adaptable
information and identification
- extensible because not a fixed format like HTML
- a language for describing other languages (a
meta-language)
- design your own customized markup language
22Why use XML?
- XML is written in SGML: the Standard Generalized
Markup Language, an international standard (ISO
8879)
- XML: very simple dialect of SGML
- goal: enable generic SGML to be served, received
and processed on the Web in ways not possible
with HTML
23Why use XML?
- XML is not just for Web pages
- use to store any kind of structured document
- to enclose/encapsulate information in order to
pass it between different computing systems that
are otherwise unable to communicate
24Key feature of XML
- An application is free to use XML tagged data in
many different ways, e.g.
- produce an image
- generate a formatted text listing
- display the XML document's markup in pretty
colors
- restructure the data into a format for storing in
a database, transmission over a network, input to
another program.
25XML is important because...
- Removes 2 constraints that held back Web
development
- dependence on a single, inflexible document type
(HTML): much abused
- the complexity of full SGML: many options
but hard to program
26- XML allows the flexible development of
user-defined document types.
- provides a robust, non-proprietary, persistent,
and verifiable file format for the storage and
transmission of text and data both on and off the
Web
27XML Software?
- hundreds (probably thousands) of programs are
XML ready already today.
- xml.coverpages.org covers news of new additions
to XML
28Is XML a Computer Language?
- XML is not C or C++ or like any other programming
language
- By itself, it cannot specify calculations,
actions, decisions to be carried out in any order
- XML is a markup specification language
29XML - a Markup Language
- with XML, you can design ways of describing
information (text or data), usually for storage,
transmission or processing by a program
- XML conveys no information about what should be
done with the data or text: it merely describes
it.
- By itself, XML does not do anything: it is a data
description format
30How do I run or execute an XML file?
- You can't and you don't!
- XML is not a programming language
- XML is a markup specification language
- XML files are just data (waiting for a program to
do something with them)
- XML files can be viewed with an XML editor or
XML-compatible browser
31Things to Remember
- XML does not replace HTML: it provides an
alternative which allows you to define your own
set of markup elements to a published standard
- <?xml version="1.0" standalone="yes"?>
- <conversation>
- <greeting>Hello, world!</greeting>
- <response>Stop the planet, I want to get
off!</response>
- </conversation>
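Since XML files are just data waiting for a program, a short sketch may help show what "a program doing something with them" looks like. The example below reads the conversation document with Python's standard xml.etree.ElementTree; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The conversation document from the slide, with its markup restored.
document = """<?xml version="1.0" standalone="yes"?>
<conversation>
  <greeting>Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>"""

root = ET.fromstring(document)

# XML only describes the data; deciding what to do with it is up to us.
turns = [(child.tag, child.text) for child in root]
for tag, text in turns:
    print(f"{tag}: {text}")
```

The same parsed tree could just as well be stored in a database or rendered as HTML, which is the point made under "Key feature of XML".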
32Things to Remember
- All parts of an XML document are case sEnSiTiVe
- Element type names are case sensitive, so <BODY>
... </body> is out.
- Attribute names are case sensitive
- <PIC width="7cm"/> and
- <PIC WIDTH="6cm"/>
- describe different attributes, not just
different values for the attribute width of PIC.
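Case sensitivity is easy to check with a parser. The sketch below, using Python's standard xml.etree.ElementTree, assumes a small helper named is_well_formed (not from the slides) and tries a matched pair of tags, a mismatched pair, and one element carrying both width and WIDTH.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

matched = is_well_formed("<BODY>text</BODY>")
mismatched = is_well_formed("<BODY>text</body>")   # close tag differs in case

# width and WIDTH are two distinct attribute names, so both may appear at once.
pic = ET.fromstring('<PIC width="7cm" WIDTH="6cm"/>')
attributes = dict(pic.attrib)

print(matched, mismatched, attributes)
```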
33What is XQuery?
- XQuery is the language for querying XML data
- The best way to explain XQuery is to say that
XQuery is to XML what SQL is to database
tables.
- XQuery uses XPath expressions to extract XML
data.
- XPath is a language for finding information in an
XML document.
- XPath is used to navigate through elements and
attributes in an XML document.
- XQuery is defined by the W3C.
- XQuery is supported by all the major database
engines (IBM, Oracle, Microsoft, etc.)
- XQuery 1.0 is not yet a W3C Recommendation
(XQuery is a Working Draft). Hopefully it will be
a recommendation in the near future.
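Python's standard library does not implement XQuery, but xml.etree.ElementTree supports a limited subset of XPath, which is enough to sketch the "navigate elements and attributes" idea. The catalog document below is made-up sample data, not taken from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data for illustration.
catalog = ET.fromstring("""
<catalog>
  <book year="2003"><title>Learning XML</title><price>39.95</price></book>
  <book year="2005"><title>XQuery Kick Start</title><price>49.99</price></book>
  <book year="2005"><title>Everyday Italian</title><price>30.00</price></book>
</catalog>
""")

# XPath: select the title of every book whose year attribute is 2005.
titles_2005 = [t.text for t in catalog.findall("./book[@year='2005']/title")]

# Numeric comparisons are outside ElementTree's XPath subset, so the
# "price below 40" filter is written in Python instead.
cheap = [b.findtext("title") for b in catalog.iter("book")
         if float(b.findtext("price")) < 40]

print(titles_2005)
print(cheap)
```

In the SQL analogy, the path expression plays the role of the FROM and WHERE clauses, and the list that comes back is the result set.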
34Resource Identifiers
- URI: Uniform Resource Identifiers
- URL: Uniform Resource Locators
- URN: Uniform Resource Names
- Legacy, not used
- Ex urn//isbn4322347
35Introduction to URIs
- Every resource available on the Web has an
address that may be encoded by a URI
- URIs typically consist of three pieces
- The naming scheme of the mechanism used to access
the resource. (HTTP, FTP)
- The name of the machine hosting the resource
- The name of the resource itself, given as a path
36URI Example
- http://www.w3.org/TR
- There is a document available via the HTTP
protocol
- Residing on the machines hosting www.w3.org
- Accessible via the path "/TR"
37Protocols
- Describe how messages are encoded and exchanged
- For the internet
- Different Layering Architectures
- ISO OSI 7-Layer Architecture
- TCP/IP 4-Layer Architecture
38Hypertext Transfer Protocol (HTTP)
- An application-layer protocol, carried over a
connection-oriented transport protocol (TCP), used
to carry WWW traffic between a browser and a server
- One of the application protocols supported by the
Internet
- HTTP communication is established via a TCP
connection, by default to server port 80
39GET Method in HTTP
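A GET exchange can be sketched end to end without touching the network. The example below starts a throwaway server on the local machine with Python's standard http.server, opens a TCP connection to it, sends a hand-written GET request, and reads the response. The names HelloHandler and fetch, the path /index.html, and the page body are all assumptions made for the demonstration.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class HelloHandler(BaseHTTPRequestHandler):
    """Tiny demo server: answers every GET with the same HTML page."""

    def do_GET(self):
        body = b"<HTML><BODY><P>Hello</BODY></HTML>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress per-request logging to keep the demo quiet

def fetch(host, port, path):
    """Send a raw HTTP/1.0 GET over a TCP connection; return the raw response."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:          # server closed the connection: response done
                break
            chunks.append(data)
    return b"".join(chunks)

# Port 0 asks the operating system for any free port (a real site would use 80).
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

response = fetch("127.0.0.1", server.server_port, "/index.html")
server.shutdown()
server.server_close()

status_line = response.split(b"\r\n", 1)[0]
print(status_line.decode())
```

The request is plain text: a request line naming the method and path, header lines, and a blank line. The response mirrors that shape with a status line, headers, a blank line, and then the page itself.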
40Domain Name System
- DNS (Domain Name System): mapping from domain
names to IP address
- IPv4
- IPv4 was initially deployed January 1, 1983 and
is still the most commonly used version.
- 32 bit address, a string of 4 decimal numbers
separated by dots, ranging from 0.0.0.0 to
255.255.255.255.
- IPv6
- Revision of IPv4 with 128 bit address
41 IP Addresses
All devices connected to the Internet have a
32-bit IP (IPv4) address associated with them. 2^32
total addresses? Think of the IP address as a
logical address (possibly temporary), while the
48-bit address on every NIC is the physical, or
permanent address. Computers, networks and
routers use the 32-bit binary address, but a more
readable form is the dotted decimal notation.
42IP Addresses
For example, the 32-bit binary
address 10000000 10011100 00001110 00000111 (4
octets) translates to 128.156.14.7 (called
dotted decimal notation). Range of octets is 0-255
(2^8 values). There are basically four types of IP
addresses: Classes A, B, C and D. A particular
class address has a unique network address size
and a unique host address size.
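The binary-to-dotted-decimal translation above can be checked with a few lines of Python. This is only a sketch; the helper name binary_to_dotted is an assumption, and ipaddress is the standard-library module for working with IP addresses.

```python
import ipaddress

def binary_to_dotted(bits):
    """Convert four space-separated binary octets to dotted decimal notation."""
    octets = bits.split()
    return ".".join(str(int(octet, 2)) for octet in octets)

dotted = binary_to_dotted("10000000 10011100 00001110 00000111")

# The same address viewed as a single 32-bit integer.
as_int = int(ipaddress.IPv4Address(dotted))

# Each octet has 2**8 possible values, so there are 2**32 addresses in total.
total_addresses = 2 ** 32

print(dotted, as_int, total_addresses)
```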
43DNS Lookup
- http://www.bankes.com/nslookup.htm
- nslookup - Name Server Lookup: A linux/windows
utility used to query Internet domain name
servers. An nslookup is usually used to find the
IP address corresponding to a hostname.
- whois - An Internet program which allows users to
query a database of people and other Internet
entities, such as domains, networks, and hosts,
kept at the NIC (Network Information Center). The
information for people shows a person's company
name, address, phone number and email address
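The hostname-to-address step that nslookup performs is also available from inside a program. The sketch below wraps Python's standard socket.gethostbyname in a hypothetical helper called lookup; it asks the operating system's resolver rather than querying a name server directly, so it is an approximation of nslookup, not a replacement.

```python
import socket

def lookup(hostname):
    """Return the IPv4 address for hostname, or None if it cannot be resolved."""
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None

# A dotted-decimal string is already an address and is returned unchanged,
# which makes this call safe to run without network access.
local = lookup("127.0.0.1")
print(local)
```

With network access, calling lookup on a hostname such as www.w3.org would return whichever address the resolver currently reports for it.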
44Top Level Domains (TLD)
- Top level domain names: .com, .edu, .gov and ISO
3166 country codes
- There are three types of top-level domains
- Generic domains were created for use by the
Internet public
- Country code domains were created to be used by
individual countries
- The .arpa domain (Address and Routing Parameter
Area domain) is designated to be used exclusively
for Internet-infrastructure purposes
45Registrars
- Domain names ending with .aero, .biz, .com,
.coop, .info, .museum, .name, .net, .org, or .pro
can be registered through many different
companies (known as "registrars") that compete
with one another
- InterNIC at http://www.internic.net
- Registrars Directory: http://www.internic.net/regist.html
46Server Log Files
- Server Transfer Log: transactions between a
browser and server are logged
- IP address, the time of the request
- Method of the request (GET, HEAD, POST)
- Status code, a response from the server
- Size in bytes of the transaction
- Referrer Log: where the request originated
- Agent Log: browser software making the request
(spider)
- Error Log: request resulted in errors (404)
47Server Log Analysis
- Most and least visited web pages
- Entry and exit pages
- Referrals from other sites or search engines
- What are the searched keywords
- How many clicks/page views a page received
- Error reports, like broken links
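A few of these analyses can be sketched in a handful of lines. The example assumes the server writes the widely used "combined" log format (IP, time, request line, status, size, referrer, agent); the four log lines are invented sample data, and the names LOG_PATTERN, SAMPLE_LOG and parse are illustrative.

```python
import re
from collections import Counter

# One named group per field; the referrer and agent fields are optional.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

# Hypothetical log lines for illustration only.
SAMPLE_LOG = """\
192.0.2.1 - - [10/Oct/2005:13:55:36 -0400] "GET /index.html HTTP/1.0" 200 2326 "http://www.example.com/search?q=ist441" "Mozilla/4.0"
192.0.2.7 - - [10/Oct/2005:13:56:01 -0400] "GET /syllabus.html HTTP/1.0" 200 5120 "-" "ExampleBot/1.0"
192.0.2.1 - - [10/Oct/2005:13:57:12 -0400] "GET /index.html HTTP/1.0" 200 2326 "-" "Mozilla/4.0"
192.0.2.9 - - [10/Oct/2005:13:58:40 -0400] "GET /missing.html HTTP/1.0" 404 209 "http://www.example.com/old" "Mozilla/4.0"
"""

def parse(log_text):
    """Turn each recognizable log line into a dict of its fields."""
    entries = []
    for line in log_text.splitlines():
        match = LOG_PATTERN.match(line)
        if match:
            entries.append(match.groupdict())
    return entries

entries = parse(SAMPLE_LOG)

# Page views per page (successful requests only).
page_views = Counter(e["path"] for e in entries if e["status"] == "200")

# Error report: paths that produced 404, i.e. likely broken links.
broken_links = [e["path"] for e in entries if e["status"] == "404"]

# Referrals: where requests came from ("-" means no referrer was sent).
referrers = Counter(e["referrer"] for e in entries
                    if e["referrer"] not in (None, "-"))

print(page_views.most_common(1))
print(broken_links)
print(referrers)
```

Searched keywords can be recovered the same way, by parsing the query string of referrer URLs that point at a search engine.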
48Server Log Analysis
49Search Engines
- According to Pew Internet Project Report (2002),
search engines are the most popular way to locate
information online
- About 33 million U.S. Internet users query on
search engines on a typical day.
- More than 80% have used search engines
- Search Engines are measured by coverage and
recency
50Search Engine Coverage of the WWW
- Overlap analysis used for estimating the size of
the indexable web
- $W$: set of webpages available to search
engines, of size $|W|$
- $W_a$, $W_b$: sets of pages crawled by two
independent engines a and b, of sizes $|W_a|$ and $|W_b|$
- $P(W_a)$, $P(W_b)$: probabilities that a page was
crawled by search engine a or b
- $P(W_a) = |W_a| / |W|$
- $P(W_b) = |W_b| / |W|$
- $P(W_a \cap W_b) = |W_a \cap W_b| / |W|$
51Overlap Analysis - Capture/recapture
- Bayes rule: $P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$
- $P(W_a \cap W_b) = P(W_a \mid W_b) P(W_b)$
- If a and b are independent
- $P(W_a \cap W_b) = P(W_a) P(W_b)$
- $|W_a \cap W_b| / |W| = (|W_a| / |W|) (|W_b| / |W|)$
- $|W| = |W_b| |W_a| / |W_a \cap W_b|$
- Need the search engines to tell you what they
have and what overlaps with each other.
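The capture/recapture estimate is a one-line formula once the two crawls are available as sets. The sketch below uses invented page identifiers; estimate_web_size is a hypothetical helper, and the result is only as good as the independence assumption behind it.

```python
def estimate_web_size(crawled_a, crawled_b):
    """Estimate |W| as |W_a| * |W_b| / |W_a intersect W_b|."""
    overlap = crawled_a & crawled_b
    if not overlap:
        raise ValueError("no overlap: the estimate is undefined")
    return len(crawled_a) * len(crawled_b) / len(overlap)

# Hypothetical crawls: engine a holds 60 pages, engine b holds 50,
# and 20 pages (page40 .. page59) appear in both.
engine_a = {f"page{i}" for i in range(0, 60)}
engine_b = {f"page{i}" for i in range(40, 90)}

estimate = estimate_web_size(engine_a, engine_b)
print(estimate)   # 60 * 50 / 20
```

Note that the union of the two crawls here holds only 90 distinct pages, while the estimate is 150: the method extrapolates to pages that neither engine has seen.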
52Overlap Analysis
- Researchers (Lawrence and Giles) found
- Web had at least 320 million pages in 1997
- 60% of web was covered by six major engines
- Maximum coverage of a single engine was 1/3 of
the web
- What is the overlap today? What is the size of
the web? Can it be measured?
53Dynamic HTML
- Refers to Web content that changes each time it
is viewed. For example, the same URL could result
in a different page depending on any number of
parameters, such as
- Geographic location of the reader
- Time of day
- Previous pages viewed by the reader
- Profile of the reader
- There are many technologies for producing dynamic
HTML, including CGI scripts, Server-Side Includes
(SSI), cookies, Java, JavaScript, and ActiveX.
54How to Improve the Coverage?
- Meta-search engine: dispatch the user query to
several engines at the same time, collect and merge
the results into one list for the user.
- Any suggestions?
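One possible way to do the "collect and merge" step of a meta-search engine is to interleave the ranked lists and drop duplicates. This is a sketch of a simple round-robin strategy, not a description of how any particular meta-search engine works; merge_results and the result URLs are made up for the example.

```python
from itertools import zip_longest

def merge_results(*ranked_lists):
    """Interleave ranked result lists, keeping the first copy of each URL."""
    merged, seen = [], set()
    # zip_longest yields the rank-1 results of every engine, then rank-2, ...
    for tier in zip_longest(*ranked_lists):
        for url in tier:
            if url is not None and url not in seen:
                seen.add(url)
                merged.append(url)
    return merged

# Hypothetical ranked results returned by two engines for the same query.
engine_a = ["a.example/1", "b.example/2", "c.example/3"]
engine_b = ["b.example/2", "d.example/4"]

merged = merge_results(engine_a, engine_b)
print(merged)
```

Because the engines index different parts of the web, the merged list covers more pages than either list alone, which is the coverage gain the slide refers to.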
55Graph Structure in the Web
http://www9.org/w9cdrom/160/160.html