Basic WWW Technologies - PowerPoint PPT Presentation

Title:

Basic WWW Technologies

Slides: 34
Provided by: cjen

Transcript and Presenter's Notes

Title: Basic WWW Technologies


1
Basic WWW Technologies
  • Thanks to P. Smyth, Hayes, Mark Sapossnek, B.
    Arms, C. Manning, P. Raghavan, H. Schutze

2
What we have covered
  • What is IR
  • Evaluation
  • Tokenization and properties of text
  • Web crawling
  • Query models
  • Vector methods
  • Measures of similarity
  • Indexing
  • Inverted files
  • This presentation
  • Basics of internet and web
  • Web graphs
  • Spam and SEO

3
Web and Internet
  • Focus
  • Infrastructure
  • Languages
  • Graphs
  • Spam
  • SEO

4
What Is the World Wide Web?
  • The world wide web (web) is a network of
    information resources. The web relies on three
    mechanisms to make these resources readily
    available to the widest possible audience
  • 1. A uniform naming scheme for locating resources
    on the web (e.g., URIs).
  • 2. Protocols, for access to named resources over
    the web (e.g., HTTP).
  • 3. Hypertext, for easy navigation among resources
    (e.g., HTML).

5
Internet vs. Web
  • Internet
  • Internet is a more general term
  • Includes physical aspect of underlying networks
    and mechanisms such as email, FTP, HTTP
  • Web
  • HTTP
  • Associated with information stored on the
    Internet
  • Refers to a broader class of networks, e.g. the Web
    of English Literature
  • Both Internet and web are networks

6
Networks vs Graphs Examples?
Old internet network
7
Essential Components of WWW
  • Resources
  • Conceptual mappings to concrete or abstract
    entities, which do not change in the short term
  • e.g. the IST441 website (web pages and other kinds of
    files)
  • Resource identifiers (hyperlinks)
  • Strings of characters represent generalized
    addresses that may contain instructions for
    accessing the identified resource
  • http://clgiles.ist.psu.edu/IST441 is used to
    identify our course homepage
  • Transfer protocols
  • Conventions that regulate the communication
    between a browser (web user agent) and a server

8
Internet Technologies: The World Wide Web
  • A way to access and share information
  • Technical papers, marketing materials, recipes,
    ...
  • A huge network of computers: the Internet
  • Graphical, not just textual
  • Information is linked to other information
  • Application development platform
  • Shop from home
  • Provide self-help applications for customers and
    partners
  • ...

9
Internet Technologies: WWW Architecture
  • Client/Server, Request/Response architecture
  • You request a Web page
  • e.g. http://www.msn.com/default.asp
  • HTTP request
  • The Web server responds with data in the form of
    a Web page
  • HTTP response
  • Web page is expressed as HTML
  • Pages are identified by a Uniform Resource
    Locator (URL)
  • Protocol: http
  • Web server: www.msn.com
  • Web page: default.asp
  • Can also provide parameters: ?name=Leon
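
The URL breakdown above can be sketched with Python's standard `urllib.parse` module. This is only an illustration: the full URL with the `?name=Leon` query string appended is assembled here from the pieces on the slide, and the variable names are chosen for readability.

```python
from urllib.parse import urlsplit, parse_qs

# URL assembled from the slide's pieces (protocol, server, page, parameter).
url = "http://www.msn.com/default.asp?name=Leon"

parts = urlsplit(url)
protocol = parts.scheme              # the access protocol
web_server = parts.netloc            # the machine hosting the page
web_page = parts.path.lstrip("/")    # the page requested on that server
parameters = parse_qs(parts.query)   # query parameters as a dict of lists

print(protocol, web_server, web_page, parameters)
```

`parse_qs` returns a list per parameter because a query string may repeat the same name several times.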

10
Internet Technologies: Web Standards
  • Internet Engineering Task Force (IETF)
  • http://www.ietf.org/
  • Founded 1986
  • Request For Comments (RFC) at http://www.ietf.org/rfc.html
  • World Wide Web Consortium (W3C)
  • http://www.w3.org
  • Founded 1994 by Tim Berners-Lee
  • Publishes technical reports and recommendations

11
Internet Technologies: Web Design Principles
  • Interoperability: Web languages and protocols
    must be compatible with one another independent
    of hardware and software.
  • Evolution: The Web must be able to accommodate
    future technologies. Encourages simplicity,
    modularity and extensibility.
  • Decentralization: Facilitates scalability and
    robustness.

12
Languages of the WWW
  • Markup languages
  • A markup language combines text and extra
    information about the text. The extra
    information, for example about the text's
    structure or presentation, is expressed using
    markup, which is intermingled with the primary
    text. The best-known markup language in modern
    use is HTML (Hypertext Markup Language), one of
    the foundations of the World Wide Web.
    Historically, markup was (and is) used in the
    publishing industry in the communication of
    printed work between authors, editors, and
    printers.

13
What is a markup language?
  • Textual (i.e. person readable) language where
    significant elements are indicated by markers
  • <TITLE>XML</TITLE>
  • Examples are RTF, HTML, XML, TeX etc.
  • Easy to process and can be manipulated by a
    variety of application programs

14
HTML Background
  • HTML was originally developed by Tim Berners-Lee
    while at CERN, and popularized by the Mosaic
    browser developed at NCSA.
  • The Web depends on Web page authors and vendors
    sharing the same conventions for HTML. This has
    motivated joint work on specifications for HTML.
  • HTML standards are organized by the W3C:
    http://www.w3.org/MarkUp/

15
HTML Functionalities
  • HTML gives WWW authors the means to
  • Publish online documents with headings, text,
    tables, lists, photos, etc
  • Include spread-sheets, video clips, sound clips,
    and other applications directly in their
    documents
  • Link information via hypertext links, at the
    click of a button
  • Design forms for conducting transactions with
    remote services, for use in searching for
    information, making reservations, ordering
    products, etc
  • Very robust: ignores many errors!

16
HTML Versions
  • HTML 4.01 is a revision of the HTML 4.0
    Recommendation first released on 18th December
    1997.
  • Last version released.
  • XHTML is the new HTML
  • HTML 4.01 Specification
  • http://www.w3.org/TR/1999/REC-html401-19991224/html40.txt
  • HTML 4.0 was first released as a W3C
    Recommendation on 18 December 1997
  • HTML 3.2 was W3C's first Recommendation for HTML
    which represented the consensus on HTML features
    for 1996
  • HTML 2.0 (RFC 1866) was developed by the IETF's
    HTML Working Group, which set the standard for
    core HTML features based upon current practice in
    1994.

17
Sample Webpage
18
Sample Webpage: HTML Structure
  • <HTML>
  • <HEAD>
  • <TITLE>The title of the webpage</TITLE>
  • </HEAD>
  • <BODY> <P>Body of the webpage
  • </BODY>
  • </HTML>

19
HTML Structure
  • An HTML document is divided into a head section
    (here, between <HEAD> and </HEAD>) and a body
    (here, between <BODY> and </BODY>)
  • The title of the document appears in the head
    (along with other information about the document)
  • The content of the document appears in the body.
    The body in this example contains just one
    paragraph, marked up with <P>

20
HTML Hyperlink
  • <a href="relations/alumni">alumni</a>
  • A link is a connection from one Web resource to
    another
  • It has two ends, called anchors, and a direction
  • Destination anchor - relations/alumni
  • Anchor text - alumni
  • Starts at the "source" anchor and points to the
    "destination" anchor, which may be any Web
    resource (e.g., an image, a video clip, a sound
    bite, a program, an HTML document)
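
As a rough sketch of how a program (a crawler, for instance) might recover the destination anchor and the anchor text from a link like the one above, the following uses Python's standard `html.parser`. The class and function names are illustrative, not part of the original slides.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects (destination, anchor text) pairs from <a href=...> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # destination of the link currently being read
        self._text = []     # pieces of anchor text seen so far

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <A> and <a> both arrive as "a".
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


links = extract_links('<a href="relations/alumni">alumni</a>')
print(links)
```

The source anchor is the place in the document where the `<a>` element sits; the pair returned here is the destination plus the visible anchor text.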

21
What is XML?
  • XML: eXtensible Markup Language
  • designed to improve the functionality of the Web
    by providing more flexible and adaptable
    information and identification
  • extensible because not a fixed format like HTML
  • a language for describing other languages (a
    meta-language)
  • design your own customized markup language

22
Why use XML?
  • XML is written in SGML, the Standard Generalized
    Markup Language, an international standard (ISO
    8879)
  • XML: a very simple dialect of SGML
  • Goal: enable generic SGML to be served, received
    and processed on the Web in ways not possible
    with HTML

23
Why use XML?
  • XML is not just for Web pages
  • use to store any kind of structured document
  • to enclose/encapsulate information in order to
    pass it between different computing systems that
    are otherwise unable to communicate

24
Key feature of XML
  • An application is free to use XML tagged data in
    many different ways, e.g.
  • produce an image
  • generate a formatted text listing
  • display the XML document's markup in pretty
    colors
  • restructure the data into a format for storing in
    a database, transmission over a network, input to
    another program.

25
XML is important because...
  • Removes 2 constraints that held back Web
    development
  • dependence on a single, inflexible document type
    (HTML), which was much abused
  • the complexity of full SGML, which has many options
    but is hard to program

26
  • XML allows the flexible development of
    user-defined document types.
  • provides a robust, non-proprietary, persistent,
    and verifiable file format for the storage and
    transmission of text and data both on and off the
    Web

27
XML Software?
  • hundreds (probably thousands) of programs are
    XML ready already today.
  • xml.coverpages.org covers news of new additions
    to XML

28
Is XML a Computer Language?
  • XML is not like C, C++, or any other programming
    language
  • By itself, it cannot specify calculations,
    actions, decisions to be carried out in any order
  • XML is a markup specification language

29
XML - a Markup Language
  • with XML, you can design ways of describing
    information (text or data), usually for storage,
    transmission or processing by a program
  • XML conveys no information about what should be
    done with the data or text; it merely describes
    it.
  • By itself, XML does not do anything: it is a data
    description format

30
How do I run or execute an XML file?
  • You can't, and you don't!
  • XML is not a programming language
  • XML is a markup specification language
  • XML files are just data (waiting for a program to
    do something with them)
  • XML files can be viewed with an XML editor or
    XML-compatible browser

31
Things to Remember
  • XML does not replace HTML: it provides an
    alternative which allows you to define your own
    set of markup elements to a published standard
  • <?xml version="1.0" standalone="yes"?>
  • <conversation>
  • <greeting>Hello, world!</greeting>
  • <response>Stop the planet, I want to get
    off!</response>
  • </conversation>
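
To illustrate the point that XML files are just data waiting for a program, here is a minimal sketch that reads the conversation document above with Python's standard `xml.etree.ElementTree`. What the program does with the elements (here, listing them as tag and text pairs) is an arbitrary choice made for the example.

```python
import xml.etree.ElementTree as ET

# The sample document from the slide, as a string.
document = """<?xml version="1.0" standalone="yes"?>
<conversation>
  <greeting>Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>"""

root = ET.fromstring(document)

# The XML only describes the data; this program decides what to do with it.
turns = [(child.tag, child.text) for child in root]

print(root.tag)
for tag, text in turns:
    print(f"{tag}: {text}")
```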

32
Things to Remember
  • All parts of an XML document are case sEnSiTiVe
  • Element type names are case sensitive, so <BODY>
    ...</body> is out.
  • Attribute names are case sensitive
  • <PIC width="7cm"/> and
  • <PIC WIDTH="6cm"/>
  • describe different attributes, not just
    different values for the same attribute of PIC.
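
The case-sensitivity rules above can be checked with a small sketch using Python's standard XML parser. The helper name `is_well_formed` is made up for this example. The combined `PIC` element carrying both attributes is also an invention, used only to show that the parser treats `width` and `WIDTH` as two distinct names.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Start and end tags must match exactly, including case.
same_case = is_well_formed("<BODY>text</BODY>")
mixed_case = is_well_formed("<BODY>text</body>")

# width and WIDTH are different attribute names, so both may appear at once.
pic = ET.fromstring('<PIC width="7cm" WIDTH="6cm"/>')

print(same_case, mixed_case, pic.attrib)
```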

33
What is XQuery?
  • XQuery is the language for querying XML data
  • The best way to explain XQuery is to say that
    XQuery is to XML what SQL is to database
    tables.
  • XQuery uses XPath expressions to extract XML
    data.
  • XPath is a language for finding information in an
    XML document.
  • XPath is used to navigate through elements and
    attributes in an XML document.
  • XQuery is defined by the W3C.
  • XQuery is supported by all the major database
    engines (IBM, Oracle, Microsoft, etc.)
  • XQuery 1.0 became a W3C Recommendation in
    January 2007.
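
Python's standard library has no XQuery engine, but `xml.etree.ElementTree` implements a small subset of XPath, which is enough to sketch the idea of navigating elements and attributes. The catalog document below is hypothetical data invented for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not from the original slides.
catalog = ET.fromstring("""
<catalog>
  <book year="1999"><title>HTML Basics</title></book>
  <book year="2004"><title>XML in Practice</title></book>
</catalog>""")

# Path expression: every <title> that is a child of a <book> under the root.
titles = [t.text for t in catalog.findall("./book/title")]

# Attribute predicate: only <book> elements whose year attribute is "2004".
recent = [b.find("title").text for b in catalog.findall("./book[@year='2004']")]

print(titles)
print(recent)
```

A full XQuery processor adds iteration, sorting and construction of new XML on top of path expressions like these.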

34
Resource Identifiers
  • URI: Uniform Resource Identifiers
  • URL: Uniform Resource Locators
  • URN: Uniform Resource Names
  • Legacy, not used
  • Ex urn//isbn4322347

35
Introduction to URIs
  • Every resource available on the Web has an
    address that may be encoded by a URI
  • URIs typically consist of three pieces
  • The naming scheme of the mechanism used to access
    the resource. (HTTP, FTP)
  • The name of the machine hosting the resource
  • The name of the resource itself, given as a path

36
URI Example
  • http://www.w3.org/TR
  • There is a document available via the HTTP
    protocol
  • Residing on the machines hosting www.w3.org
  • Accessible via the path "/TR"

37
Protocols
  • Describe how messages are encoded and exchanged
  • For the internet
  • Different Layering Architectures
  • ISO OSI 7-Layer Architecture
  • TCP/IP 4-Layer Architecture

38
Hypertext Transfer Protocol (HTTP)
  • An application layer protocol used to carry WWW
    traffic between a browser and a server
  • Runs on top of TCP, a connection-oriented
    transport layer protocol supported by the
    Internet
  • HTTP communication is established via a TCP
    connection, by default to server port 80
39
GET Method in HTTP
40
Domain Name System
  • DNS (Domain Name System): mapping from domain
    names to IP addresses
  • IPv4
  • IPv4 was initially deployed on January 1st, 1983 and
    is still the most commonly used version.
  • 32-bit address, written as a string of 4 decimal
    numbers separated by dots, ranging from 0.0.0.0 to
    255.255.255.255.
  • IPv6
  • Revision of IPv4 with 128-bit addresses
41
IP Addresses

Every device connected to the Internet has a
32-bit IP (IPv4) address associated with it
($2^{32}$ total addresses). Think of the IP address as a
logical address (possibly temporary), while the
48-bit address on every NIC is the physical, or
permanent, address. Computers, networks and
routers use the 32-bit binary address, but a more
readable form is the dotted decimal notation.

42
IP Addresses

For example, the 32-bit binary
address 10000000 10011100 00001110 00000111 (4
octets) translates to 128.156.14.7 (called
dotted decimal notation). The range of each octet is
0-255 ($2^8$ values). There are basically four types of IP
addresses: Classes A, B, C and D. A particular
class address has a unique network address size
and a unique host address size.
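
The conversion from the binary form to dotted decimal notation can be worked through in a few lines of Python. This is a sketch that uses the example address from the slide. The standard `ipaddress` module is used only to confirm that the result is a valid IPv4 address.

```python
import ipaddress

# The 32-bit address from the slide, written as 4 octets.
bits = "10000000 10011100 00001110 00000111"

# Each octet is an 8-bit binary number in the range 0-255.
octets = [int(octet, 2) for octet in bits.split()]
dotted = ".".join(str(o) for o in octets)

address = ipaddress.IPv4Address(dotted)
total_addresses = 2 ** 32   # size of the IPv4 address space

print(octets)
print(dotted)
print(total_addresses)
```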

43
DNS Lookup
  • http://www.bankes.com/nslookup.htm
  • nslookup - Name Server Lookup: a Linux/Windows
    utility used to query Internet domain name
    servers. An nslookup is usually used to find the
    IP address corresponding to a hostname.
  • whois - An Internet program which allows users to
    query a database of people and other Internet
    entities, such as domains, networks, and hosts,
    kept at the NIC. The information for people shows
    a person's company name, address, phone number
    and email address

44
Top Level Domains (TLD)
  • Top level domain names, .com, .edu, .gov and ISO
    3166 country codes
  • There are three types of top-level domains
  • Generic domains were created for use by the
    Internet public
  • Country code domains were created to be used by
    individual countries
  • The .arpa (Address and Routing Parameter Area)
    domain is designated to be used exclusively
    for Internet-infrastructure purposes

45
Registrars
  • Domain names ending with .aero, .biz, .com,
    .coop, .info, .museum, .name, .net, .org, or .pro
    can be registered through many different
    companies (known as "registrars") that compete
    with one another
  • InterNIC at http://www.internic.net
  • Registrars Directory: http://www.internic.net/regist.html

46
Server Log Files
  • Server Transfer Log: transactions between a
    browser and server are logged
  • IP address, the time of the request
  • Method of the request (GET, HEAD, POST)
  • Status code, a response from the server
  • Size in bytes of the transaction
  • Referrer Log: where the request originated
  • Agent Log: browser software making the request
    (spider)
  • Error Log: requests that resulted in errors (404)
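
As a sketch of how these fields might be pulled out of a log, the following assumes the server writes the widely used "combined" log format, where the referrer and agent are appended to each transfer entry. The slides do not say which format is used, and the sample line is invented for illustration.

```python
import re

# Assumed layout: IP, identity, user, [time], "request", status, size,
# optionally followed by "referrer" and "agent".
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    # A size of "-" means no body was sent.
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    return entry


# Invented sample line for illustration only.
sample = ('192.0.2.1 - - [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326 '
          '"http://www.example.com/start.html" "ExampleBot/1.0"')

entry = parse_log_line(sample)
print(entry)
```

Counting entries by `path`, `referrer` or `status` over a whole file gives the kind of reports listed under server log analysis.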

47
Server Log Analysis
  • Most and least visited web pages
  • Entry and exit pages
  • Referrals from other sites or search engines
  • What are the searched keywords
  • How many clicks/page views a page received
  • Error reports, like broken links

48
Server Log Analysis
49
Search Engines
  • According to Pew Internet Project Report (2002),
    search engines are the most popular way to locate
    information online
  • About 33 million U.S. Internet users query
    search engines on a typical day.
  • More than 80% have used search engines
  • Search Engines are measured by coverage and
    recency

50
Search Engine Coverage of the WWW
  • Overlap analysis is used for estimating the size of
    the indexable web
  • $W$: size of the set of webpages available to search
    engines
  • $W_a$, $W_b$: number of pages crawled by two
    independent engines a and b
  • $P(W_a)$, $P(W_b)$: probabilities that a page was
    crawled by search engine a or b
  • $P(W_a) = W_a / W$
  • $P(W_b) = W_b / W$
  • $P(W_a \cap W_b) = |W_a \cap W_b| / W$

51
Overlap Analysis - Capture/recapture
  • Bayes rule: $P(A \cap B) = P(A \mid B) P(B) = P(B \mid A) P(A)$
  • $P(W_a \cap W_b) = P(W_a \mid W_b) P(W_b)$
  • If a and b are independent
  • $P(W_a \cap W_b) = P(W_a) P(W_b)$
  • $|W_a \cap W_b| / W = (W_a / W)(W_b / W)$
  • $W = W_a W_b / |W_a \cap W_b|$
  • Need the search engines to tell you what they
    have and what overlaps with each other.
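
The final estimate above is simple to compute once the two crawled sets are known. The sketch below uses a synthetic "web" of 1000 numbered pages, purely for illustration, in which engine a crawls every second page and engine b every fifth page, so the two crawls are independent by construction.

```python
def estimate_web_size(pages_a, pages_b):
    """Capture/recapture estimate: W = Wa * Wb / |Wa intersect Wb|."""
    set_a, set_b = set(pages_a), set(pages_b)
    overlap = len(set_a & set_b)
    if overlap == 0:
        raise ValueError("no overlap: the estimate is undefined")
    return len(set_a) * len(set_b) / overlap


# Synthetic page identifiers 0..999 (illustrative data only).
crawled_by_a = range(0, 1000, 2)   # 500 pages
crawled_by_b = range(0, 1000, 5)   # 200 pages, 100 of them shared with a

estimate = estimate_web_size(crawled_by_a, crawled_by_b)
print(estimate)
```

With real engines the crawls are not independent (popular pages are more likely to be in both), so the overlap is larger than independence predicts and the formula tends to underestimate the size of the web.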

52
Overlap Analysis
  • Researchers (Lawrence and Giles) found
  • Web had at least 320 million pages in 1997
  • 60% of the web was covered by six major engines
  • Maximum coverage of a single engine was 1/3 of
    the web
  • What is the overlap today? What is the size of
    the web? Can it be measured?

53
Dynamic HTML
  • Refers to Web content that changes each time it
    is viewed. For example, the same URL could result
    in a different page depending on any number of
    parameters, such as
  • Geographic location of the reader
  • Time of day
  • Previous pages viewed by the reader
  • Profile of the reader
  • There are many technologies for producing dynamic
    HTML, including CGI scripts, Server-Side Includes
    (SSI), cookies, Java, JavaScript, and ActiveX.

54
How to Improve the Coverage?
  • Meta-search engine: dispatches the user query to
    several engines at the same time, then collects and
    merges the results into one list for the user.
  • Any suggestions?
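
One possible shape for such a meta-search engine is sketched below. Each engine is modelled as a plain function from a query to a ranked list of URLs. The two stand-in engines and their results are hypothetical, and round-robin interleaving is just one simple merging strategy among many.

```python
from concurrent.futures import ThreadPoolExecutor


def merge_results(result_lists):
    """Round-robin interleave ranked lists, dropping duplicate URLs."""
    merged, seen = [], set()
    longest = max((len(results) for results in result_lists), default=0)
    for rank in range(longest):
        for results in result_lists:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
    return merged


def meta_search(query, engines):
    """Send the query to all engines concurrently and merge their results."""
    if not engines:
        return []
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        # map() keeps the results in the same order as the engines.
        result_lists = list(pool.map(lambda engine: engine(query), engines))
    return merge_results(result_lists)


# Hypothetical engines returning fixed rankings, for illustration only.
def engine_a(query):
    return ["a.example/1", "b.example/2", "c.example/3"]


def engine_b(query):
    return ["b.example/2", "d.example/4"]


merged = meta_search("web graphs", [engine_a, engine_b])
print(merged)
```

Because the engines index different parts of the web, the merged list can cover pages that no single engine returns.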

55
Graph Structure in the Web
http://www9.org/w9cdrom/160/160.html