Web Information Systems

Provided by: xiaofa
1
SCHOOL OF INFORMATION TECHNOLOGY AND ELECTRICAL ENGINEERING
THE UNIVERSITY OF QUEENSLAND
Module 4: Web Search and Web Information Systems Security
2
In This Module
  • Outline
  • An introduction to other important WIS topics
  • Web search engines: how do search engines rank web
    pages?
  • WIS security: how do you make your system secure?
  • WIS performance and scalability: how do you make your
    system efficient?
  • Objectives
  • We have already covered how to develop a WIS and
    how to use XML and other technologies to enhance
    interoperability. Now we look at other important
    aspects of building a good WIS

3
Web Search
4
Finding Information on the Web
  • Browsing
  • From a starting point, navigate through
    hyperlinks to find desired documents
  • Yahoo's category hierarchy facilitates browsing
  • Searching
  • Submit a query to a search engine to find desired
    documents
  • Many well-known search engines on the Web:
    AltaVista, Google

5
Browsing Versus Searching
  • Category hierarchy is built mostly manually while
    search engine databases can be created
    automatically
  • Search engines can index many more documents than
    a category hierarchy
  • Browsing is good for finding some desired
    documents and searching is better for finding a
    lot of desired documents
  • Browsing is more accurate (less junk will be
    encountered) than searching

6
Search Engine
  • A search engine is essentially a text retrieval
    system for web pages plus a Web interface

7
Text Retrieval
  • Document representation
  • Remove stopwords (of, the, ...) and apply stemming
    (stemming → stem)
  • $d = (d_1, \ldots, d_i, \ldots, d_n)$, $d_i = tf \times idf$ (or a
    weighted formula)
  • tf: term frequency (the number of times this term
    appears in the doc)
  • idf: inverse document frequency (the total number
    of docs over the number of docs containing the
    term)
  • Query
  • $q = (q_1, \ldots, q_i, \ldots, q_n)$, $q_i$ = weight of the ith
    term in q
  • Similarity

This approach prefers long docs (many
improvements available)
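The similarity formula itself did not survive in this transcript. The sketch below assumes a plain dot product between the query and document vectors, which is consistent with the remark that the approach prefers long docs. The stopword list, the whitespace tokenizer, and the function names are illustrative assumptions; stemming is omitted, and idf is the plain ratio described above (many systems take its logarithm instead).

```python
from collections import Counter

STOPWORDS = {"of", "the", "a", "and", "is", "in", "to"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords (no stemming)."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def tfidf_vectors(docs):
    """Return one {term: tf * idf} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of docs containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # idf as on the slide: total docs over docs containing the term.
        vectors.append({t: tf[t] * (n_docs / df[t]) for t in tf})
    return vectors


def dot_similarity(query_vec, doc_vec):
    """Assumed similarity: sum of q_i * d_i over the query terms."""
    return sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())


docs = [
    "web search engine",
    "web page ranking",
    "search engine search engine web page",
]
vectors = tfidf_vectors(docs)
query = {"search": 1.0, "engine": 1.0}
scores = [dot_similarity(query, v) for v in vectors]
```

The third document simply repeats the query terms, so its term frequencies and its score double relative to the first document. That is the bias toward long docs in a concrete form.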
8
Precision and Recall
  • Retrieval effectiveness
  • Relevant documents: documents useful to the user
    of the query
  • Recall: % of relevant documents retrieved
  • Precision: % of retrieved documents that are
    relevant

[Figure: axes labelled precision and recall]
If there are 100 relevant docs in total and 80 of them are
found in the 200 docs returned, then recall =
80/100 = 80% and precision = 80/200 = 40%
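The arithmetic of this example can be checked with a short sketch. The function name and the use of integer doc ids are assumptions made for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The example above: 100 relevant docs, 200 returned, 80 in common.
relevant = set(range(100))        # doc ids 0..99 are relevant
retrieved = set(range(20, 220))   # 200 returned docs; ids 20..99 overlap
precision, recall = precision_recall(retrieved, relevant)
```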
9
Robot
  • A robot (also known as spider or crawler) is a
    program for fetching web pages from the Web
  • Main idea
  • Place some initial URLs into a URL queue
  • Repeat the steps below until the queue is empty
  • Take the next URL from the queue and fetch the
    web page using HTTP
  • Extract new URLs from the downloaded web page and
    add them to the queue
  • A centralised index is built for the docs
    retrieved
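The queue-based loop above can be sketched as follows. To keep the example self-contained, pages come from an in-memory dictionary rather than real HTTP requests; the `fetch` parameter, the `FAKE_WEB` data, and the regular expression for links are all hypothetical. The `seen` set is an addition not stated on the slide: without it the queue would never empty on a web with cycles. Relative URLs and indexing are left out.

```python
import re
from collections import deque

HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, max_pages=100):
    """Breadth-first crawl; fetch(url) returns the page's HTML or None."""
    queue = deque(seed_urls)
    seen = set(seed_urls)  # assumption: never queue the same URL twice
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()          # take the next URL from the queue
        html = fetch(url)              # stands in for an HTTP GET
        if html is None:
            continue
        pages[url] = html
        for link in HREF_RE.findall(html):   # extract new URLs
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


# A hypothetical three-page web held in memory.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a>',
    "http://b.example/": '<a href="http://a.example/">a</a> '
                         '<a href="http://c.example/">c</a>',
    "http://c.example/": "no links here",
}
pages = crawl(["http://a.example/"], FAKE_WEB.get)
```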

10
Improving Retrieval Effectiveness
  • AltaVista and Yahoo assign different
    importance to term occurrences
  • Google uses terms in anchor tags to index a
    referenced page
  • Shortcomings
  • Important information for web docs, such as tags
    and links, is not explored

11
PageRank: The Heart of Google
  • Web: $G = (V, E)$
  • V is the set of web pages (vertices)
  • E is the set of hyperlinks (directed edges)
  • Outgoing edges (forward links): a citation from
    the page
  • Incoming edges (backlinks): a citation to the
    page
  • Global web page importance is measured by backlinks
  • The importance is domain/topic independent
  • Not suitable for query-dependent importance
  • Which pages are important web browser pages?
  • Which pages are important game pages?

Page, Brin, Motwani and Winograd, "The PageRank
Citation Ranking: Bringing Order to the Web",
Stanford Digital Library Working Paper, 1999
12
PageRank Definition
  • Let u be a web page
  • $F_u$ is the set of pages u points to
  • $B_u$ is the set of pages that point to u
  • The rank (importance) of a page u can be defined
    recursively as
  • $R(u) = \sum_{v \in B_u} R(v) / |F_v|$
  • Initially all page ranks are $1/N$, where $N = |V|$
  • Issues
  • Will this computation terminate?
  • What about a sub-graph that has no outgoing links?
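A minimal sketch of the recursive definition, computed by repeated substitution. The tolerance and iteration cap are assumptions added so that the loop always stops, which is one pragmatic answer to the termination question. The second issue is not handled: a page or sub-graph with no outgoing links would drain or trap rank here, and the published algorithm adds a further term to deal with that.

```python
def pagerank(links, tol=1e-10, max_iter=1000):
    """links maps each page v to the list F_v of pages it points to.

    Every page that appears as a link target must also be a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {u: 1.0 / n for u in pages}       # initially all ranks are 1/N
    for _ in range(max_iter):
        new_rank = {u: 0.0 for u in pages}
        for v, forward in links.items():
            for u in forward:                # v is in B_u
                new_rank[u] += rank[v] / len(forward)
        change = sum(abs(new_rank[u] - rank[u]) for u in pages)
        rank = new_rank
        if change < tol:                     # assumed stopping rule
            break
    return rank


# Hypothetical three-page web: A -> B, A -> C, B -> C, C -> A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

On this small graph the ranks settle at 0.4 for A, 0.2 for B and 0.4 for C: C is pointed to by both other pages, and A inherits all of C's rank through C's single forward link.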

13
Page Ranks and Search Engine
  • The ranking score of a web page can be a weighted
    sum of its regular similarity with a query and
    its importance
  • $ranking\_score(q, p) = w \cdot sim(q, p) + (1 - w) \cdot R(p)$ if $sim(q, p) > 0$,
    and 0 otherwise
  • where $0 < w < 1$
  • Both sim(q, p) and R(p) need to be normalised to
    the range [0, 1]
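The weighted sum can be written as a small function. The default weight of 0.5 is an arbitrary assumption, and both inputs are taken to be already normalised to [0, 1].

```python
def ranking_score(sim, rank, w=0.5):
    """Combine query similarity and page importance into one score."""
    if not 0 < w < 1:
        raise ValueError("w must satisfy 0 < w < 1")
    if sim <= 0:
        # A page with no similarity to the query is never returned,
        # however important it is.
        return 0.0
    return w * sim + (1 - w) * rank


relevant_page = ranking_score(sim=0.8, rank=0.4, w=0.75)
unrelated_page = ranking_score(sim=0.0, rank=0.9, w=0.75)
```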

14
Authority and Hub
  • A page is a good authoritative page with respect
    to a given query if it is referenced (i.e.,
    pointed to) by many (good hub) pages that are
    related to the query
  • A page is a good hub page with respect to a given
    query if it points to many good authoritative
    pages with respect to the query
  • Good authoritative pages (authorities) and good
    hub pages (hubs) reinforce each other.

Kleinberg, "Authoritative Sources in a
Hyperlinked Environment", ACM-SIAM Symposium on
Discrete Algorithms, 1998
15
The Root Set and The Base Set
  • Submit q to a regular similarity-based search
    engine
  • Let S be the set of top n pages returned by the
    search engine
  • S is called the root set and n is often in the
    low hundreds
  • Expand S into a large set T (base set)
  • Add pages that are pointed to by any page in S.
  • Add pages that point to any page in S. If a page
    has too many parent pages, only the first k
    parent pages will be used for some k
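The expansion from root set to base set might look like the sketch below. The two dictionaries stand in for whatever link database the search engine keeps, and all page names are hypothetical.

```python
def build_base_set(root_set, out_links, in_links, k=50):
    """Expand root set S into base set T.

    out_links[p] lists the pages p points to; in_links[p] lists the
    pages that point to p (its parent pages).
    """
    base = set(root_set)
    for page in root_set:
        base.update(out_links.get(page, []))      # pages pointed to by S
        base.update(in_links.get(page, [])[:k])   # at most k parents each
    return base


out_links = {"s1": ["x", "y"], "s2": ["y"]}
in_links = {"s1": ["p1", "p2", "p3"], "s2": ["p4"]}
base = build_base_set(["s1", "s2"], out_links, in_links, k=2)
```

With k = 2, only the first two parents of s1 are kept, so p3 stays outside the base set.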

16
A Sub-graph Induced by T
17
Computing the Scores
  • Compute the authority score and hub score of each
    web page in T based on the subgraph SG(V, E)
  • Given a page p, let
  • a(p) be the authority score of p
  • h(p) be the hub score of p
  • (p, q) be a directed edge in E from p to q
  • Two basic operations
  • Operation I: update each a(p) as the sum of all
    the hub scores of web pages that point to p
  • Operation O: update each h(p) as the sum of all
    the authority scores of web pages pointed to by p

18
Operations I and O
  • Operation I: for each page p
  • $a(p) = \sum_{q : (q, p) \in E} h(q)$
  • Operation O: for each page p
  • $h(p) = \sum_{q : (p, q) \in E} a(q)$

19
Normalisation
  • After each iteration of applying Operations I and
    O, normalise all authority and hub scores
  • Repeat until the scores for each page converge
  • The convergence is guaranteed
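A sketch of the iteration, combining Operations I and O with normalisation. Several details are assumptions not fixed by the slides: all scores start at 1, scores are scaled so that their squares sum to 1, and a fixed number of iterations replaces an explicit convergence test.

```python
import math


def normalise(scores):
    """Scale scores so that the sum of their squares is 1."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {p: v / norm for p, v in scores.items()}


def hits(edges, iterations=50):
    """edges is a list of (p, q) pairs meaning page p links to page q."""
    pages = {p for edge in edges for p in edge}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Operation I: a(p) is the sum of h(q) over edges (q, p).
        new_auth = {p: 0.0 for p in pages}
        for q, p in edges:
            new_auth[p] += hub[q]
        # Operation O: h(p) is the sum of a(q) over edges (p, q).
        new_hub = {p: 0.0 for p in pages}
        for p, q in edges:
            new_hub[p] += new_auth[q]
        auth = normalise(new_auth)
        hub = normalise(new_hub)
    return auth, hub


# Hypothetical sub-graph: three hubs point to a1, only h1 also points to a2.
edges = [("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h3", "a1")]
auth, hub = hits(edges)
```

Here a1 ends up as the strongest authority because every hub points to it, and h1 as the strongest hub because it points to both authorities.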

20
All Links Equal?
  • Some links may be more meaningful/important than
    other links
  • Web site creators may trick the system to make
    their pages more authoritative by adding dummy
    pages pointing to their cover pages (spamming)
  • Domain name: the first level of the URL of a page
  • Transverse links are more important than
    intrinsic links
  • Transverse link: a link between pages with
    different domain names
  • Intrinsic link: a link between pages with the same
    domain name
  • Two ways to incorporate this
  • Use only transverse links and discard intrinsic
    links
  • Give lower weights to intrinsic links
  • Other ways for link weighting possible (such as
    using tag information)

other interesting problems related to web
communities
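Both ways of treating intrinsic links fit in one small function. This sketch assumes that the "domain name" is the host part of the URL, and the weight values are arbitrary: an intrinsic weight of 0 discards intrinsic links, and a value between 0 and 1 merely down-weights them.

```python
from urllib.parse import urlparse


def link_weight(source_url, target_url, intrinsic_weight=0.0):
    """Return the weight of a link between two pages."""
    same_domain = urlparse(source_url).hostname == urlparse(target_url).hostname
    # Intrinsic links (same domain) get the reduced weight;
    # transverse links (different domains) count in full.
    return intrinsic_weight if same_domain else 1.0


discarded = link_weight("http://a.example/x", "http://a.example/y")
lowered = link_weight("http://a.example/x", "http://a.example/y",
                      intrinsic_weight=0.2)
transverse = link_weight("http://a.example/x", "http://b.example/")
```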
21
Web Information System Security
22
Security
  • For a WIS, and in particular for an e-Commerce
    system, security means
  • Protecting your customer
  • Protecting yourself
  • Getting paid
  • Lack of security means
  • Reduced availability
  • Loss of confidential information
  • Loss of data integrity

23
Improving Security
  • Security can only be achieved by a combination of
  • Technical solutions
  • Legislation
  • Policies
  • You need to
  • Understand the security challenges
  • Use encryption techniques
  • Use DBMS-provided security mechanisms
  • Work with system admin to improve security
  • Secure 3-tier WIS
  • No single solution is right for every
    business
  • Security is a joint responsibility of DBA, system
    admin, application analysts, Web master

24
Challenges in Securing WIS
  • Before the Internet revolution, databases were
    not easily accessible to hackers
  • Physical security
  • Operating system security
  • Accessible by a limited number of internal
    employees only
  • Security is now a big concern because the system
    can be accessed from anywhere
  • Must balance high availability, ease of access
    and security

25
Three-Tier Architecture
[Figure: clients connect over the Internet to a web server, which connects to a database server]
Where might we have security problems?
26
Client Problems
  • Causes
  • Bugs in the browsers
  • Active components in HTML docs
  • Applet, ActiveX controls, external helpers,
    plug-ins, JavaScript and VBScript code
  • Cookies
  • Risks
  • Crash the browser and damage the user's system
  • Breach the user's privacy
  • Reveal the user's identity and activity history
  • Browsing other documents
  • Shut down services, deny legitimate users,
    create other inconvenience
  • Attack other computers
  • Use local devices (e.g., Internet connection)

27
Client-Side Active Components
  • Java Applets
  • Built-in security with the sandbox model and
    restricted client-side operations
  • No file manipulation or network connection
  • However, denial-of-service attacks are possible
  • JavaScript/VBScript
  • Interpreted and executed by the browser only,
    with very limited interaction with client system
  • Most problems are related to privacy
  • ActiveX Controls
  • No restrictions on what a control can do, but each
    ActiveX control can be digitally signed
  • ActiveX security: do you trust the certifying
    authority?
  • You can turn off all these in your browser

28
Server Problems
  • Causes
  • Access anywhere
  • Bugs in software and misconfiguration
  • Execution of arbitrary code on the server at the
    client's request
  • Risks
  • Steal confidential data
  • Execute commands on the server host machine
  • Modify the system
  • Gain information about the server-side to break
    into the system
  • Launch denial-of-service attacks
  • Services unavailable to legitimate users

29
Firewalls
  • The first line of defence
  • A Web server should never be connected to any
    in-house networks
  • Attacks from the Internet are inevitable
  • A firewall is a system to prevent unauthorised
    access to or from a private network (Intranet)
  • All messages entering or leaving the intranet
    pass through the firewall, which examines each
    message and blocks those not meeting criteria
    (configurable)

30
Password Protection
  • Application password protection
  • User types a username and password into a form
    (submitted with POST)
  • Server-side authentication (reading from a
    database)
  • Username/password is validated for every HTML/JSP
    page
  • Web server supported password protection
  • Directory-based access control
  • .htaccess in the directory (where to find
    password file)
  • htpasswd in a secured place holding user names
    and passwords
  • One of the oldest ways of protection, but for a WIS
  • Web servers allow an unlimited number of attempts!
  • Password can be intercepted during transmission
  • Password goes with every HTTP request
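The last two points are easy to see in code. The sketch below assumes that the web server's password protection uses the HTTP Basic scheme, the usual choice for .htaccess-style protection: the browser sends the credentials with every request, Base64-encoded but not encrypted, so anyone who intercepts a request can recover them. The function names are illustrative.

```python
import base64


def basic_auth_header(username, password):
    """Build the Authorization header value a browser would send."""
    raw = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


def decode_basic_auth(header_value):
    """What an eavesdropper can do: Base64 is an encoding, not encryption."""
    scheme, token = header_value.split(" ", 1)
    if scheme != "Basic":
        raise ValueError("not a Basic credential")
    username, password = base64.b64decode(token).decode("utf-8").split(":", 1)
    return username, password


header = basic_auth_header("alice", "s3cret")
recovered = decode_basic_auth(header)
```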

31
Restricted Access by IP Address
  • Most Web servers support such restrictions
  • E.g., "allow from uq.edu.au" in the configuration file
  • You can specify host names or domain names
  • Not very safe
  • You trust a machine (and so anyone who uses that
    machine)
  • An experienced hacker can spoof their IP address

32
Data Transmission Problems
  • Causes
  • Internet is not designed for security
  • HTTP uses clear text in transmission
  • Interception of data via network eavesdropping
  • Risks
  • Steal confidential data
  • Impersonate a client or a server

33
Data Transmission Security
  • Privacy: only the sender and the receiver have
    access to the data
  • Integrity: data cannot be changed during
    transmission
  • Authenticity: the data is from the sender
  • Non-fabrication: the receiver is genuine
  • Non-repudiation: the sender cannot deny sending
    the data

34
Data Encryption
  • Cryptography: a branch of mathematics dealing with
    encryption algorithms
  • Four components of an encryption system

35
Private Key Algorithms
  • Both the sender and receiver have a key they need
    to keep private
  • Also called secret key or symmetric key algorithm
  • Popular algorithm
  • DES (Data Encryption Standard)
  • 56-bit keys, US Govt standard (1977) and ANSI
    standard (1981)
  • Problem
  • How do you distribute your key?
  • This is an even bigger problem for Web
    applications
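The defining property of a private key algorithm is that one shared key both encrypts and decrypts. The toy cipher below is not DES and is not secure; it is a repeating-key XOR, chosen only because it shows that property using the standard library alone. The key and message are made up.

```python
from itertools import cycle


def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XOR each byte with the repeating key.

    Applying it twice with the same key restores the original data.
    """
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


shared_key = b"shared-secret"      # both sender and receiver must hold this
message = b"card number 1234"
cipher_text = xor_cipher(message, shared_key)       # sender encrypts
clear_text = xor_cipher(cipher_text, shared_key)    # receiver decrypts
```

Both sides need `shared_key` before any message is sent, which is the key distribution problem raised above.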

36
Public Key Algorithms
  • A result of a mathematical breakthrough in the 1970s
  • Also called asymmetric key algorithm
  • Two keys: no need to distribute a secret key
  • Public key used for encryption
  • Private key used for decryption

[Figure: clear text → encryption algorithm (recipient's public key) → cipher text → decryption algorithm (recipient's private key) → clear text]
37
Public Key Algorithms Continued
  • Most common public key algorithm
  • RSA algorithm (used in PGP)

38
More on Public Key Encryption
  • How the system works
  • Everyone generates a pair of keys, distributes
    the public key to everyone, and keeps the private
    key to themselves.
  • If A wants to send a message to B, A uses B's
    public key to encrypt the message, and sends the
    encrypted message to B.
  • Only the person who has the private key can
    decrypt the message, so only B can read the clear
    text.
  • The algorithms are public domain knowledge
  • The strength comes from the length of the key
    (breaking 128-bit symmetric key encryption by brute
    force is computationally infeasible; public key
    algorithms such as RSA need much longer keys)
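The exchange between A and B can be traced with textbook RSA on deliberately tiny numbers. The primes 61 and 53 and the exponent 17 are illustrative assumptions; keys this small are trivial to break, and real systems also add padding, which is omitted here. The modular inverse via `pow(e, -1, phi)` needs Python 3.8 or later.

```python
def make_keypair(p, q, e):
    """Return ((e, n), (d, n)): a public key and its private key."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)          # private exponent: e * d = 1 (mod phi)
    return (e, n), (d, n)


def encrypt(m, public_key):
    e, n = public_key
    return pow(m, e, n)


def decrypt(c, private_key):
    d, n = private_key
    return pow(c, d, n)


# B generates the pair and publishes only the public key.
b_public, b_private = make_keypair(61, 53, 17)

# A encrypts the number 65 with B's public key; only B can decrypt it.
cipher = encrypt(65, b_public)
plain = decrypt(cipher, b_private)
```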

39
Digital Signatures
  • Electronic counterpart of handwritten signatures
  • Legally binding now
  • Purpose
  • The receiver knows the signed message is sent by
    the sender
  • The sender cannot deny she has sent the signed
    message
  • Based on public key encryption

40
How Digital Signature Works
  • A message digest (or a fingerprint) is generated
    such that
  • It reveals nothing about the message and
  • It is computationally infeasible to find another
    message that will generate the same digest
  • The digital signature is the message digest
    encrypted using the sender's private key
  • The digital signature and the clear text message
    will be sent out
  • The receiver will use the sender's public key to
    decrypt the digital signature (and get a clear
    text message digest)
  • The receiver uses the same method (a public
    domain one) to generate a message digest from the
    clear text message
  • This digest must be the same as the decrypted
    message digest

Where do I get the sender's public key?
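The sign-and-verify steps can be sketched with SHA-256 as the digest method and textbook RSA for the private-key operation. Everything specific here is an assumption for illustration: the two primes (known Mersenne primes, far too small for real use), the exponent 65537, and reducing the digest modulo N so that it fits the toy key. Real signature schemes use much larger keys and a padding step.

```python
import hashlib

# Toy sender key pair (illustrative sizes only).
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q
E = 65537                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))      # private exponent (Python 3.8+)


def digest_int(message: bytes) -> int:
    """Message digest as an integer, reduced to fit the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Sender: encrypt the digest with the private key."""
    return pow(digest_int(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Receiver: decrypt with the public key and compare digests."""
    return pow(signature, E, N) == digest_int(message)


message = b"pay 100 dollars to B"
signature = sign(message)
genuine = verify(message, signature)
tampered = verify(b"pay 900 dollars to B", signature)
```

`genuine` is true because the receiver's recomputed digest matches the decrypted one. For the altered message the digests differ, so `tampered` is false.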
41
Digital Certificates
  • An application of digital signatures
  • There are a number of certificate authorities (CAs)
  • VeriSign, Microsoft,
  • Authoritative, independent, trustworthy third
    parties
  • CAs' public keys are well known
  • Many stored in browsers
  • If A wants to send a piece of data (including
    code), A needs to get it certified by a CA
  • A sends its certified data with the digital
    certificate
  • Once B receives such a message, B can verify
    that it is certified by the CA

42
Secure Socket Layer
  • SSL was proposed by Netscape
  • A low level encryption scheme to provide
    connection security
  • HTTP lives on top of it; TCP lives underneath it
  • Purposes
  • Allow sensitive information to be shared only
    between browser and server
  • Ensure that data exchanged is reliable (not
    changed during transmission)
  • https: this means SSL is used
  • What SSL can provide
  • Privacy for the connection, using symmetric key
    encryption
  • Integrity, using message authentication code
    (MAC)
  • Authentication of the parties, using certificates

43
More on SSL
  • SSL has two logical layers
  • The handshake layer
  • To enable the client and the server to negotiate
    with each other about the encryption methods they
    support, and to establish in a secure way the
    security parameters (key length, compression
    algorithms etc)
  • Public key methods are used to exchange secret,
    one-time, symmetric keys
  • The record layer
  • Computes a message authentication code, and uses
    a shared secret key and other established
    parameters to encrypt the data
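The integrity check performed with a message authentication code can be illustrated with HMAC-SHA256 from the Python standard library. This shows the general MAC idea only; it is not SSL's actual record format, and the key and record contents are made up.

```python
import hashlib
import hmac


def make_mac(key: bytes, record: bytes) -> bytes:
    """Sender: compute a MAC over the record with the shared secret key."""
    return hmac.new(key, record, hashlib.sha256).digest()


def check_mac(key: bytes, record: bytes, mac: bytes) -> bool:
    """Receiver: recompute the MAC and compare in constant time."""
    return hmac.compare_digest(make_mac(key, record), mac)


session_key = b"one-time key agreed in the handshake"
record = b"GET /account HTTP/1.0"
mac = make_mac(session_key, record)

intact = check_mac(session_key, record, mac)
altered = check_mac(session_key, b"GET /admin HTTP/1.0", mac)
```

Only someone holding `session_key` can produce a matching MAC, so a record changed in transit fails the check.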

44
Summary
  • We have discussed
  • Web search (PageRank and Authority-and-Hub)
  • WIS security (client, server and transport level)
  • Next week
  • An introduction to Microsoft .Net technology

45
References
  • Page, Brin, Motwani and Winograd, "The PageRank
    Citation Ranking: Bringing Order to the Web",
    Stanford Digital Library Working Paper, 1999
  • Kleinberg, "Authoritative Sources in a
    Hyperlinked Environment", ACM-SIAM Symposium on
    Discrete Algorithms, 1998
  • www.w3.org/Security