Title: Web Information Systems
SCHOOL OF INFORMATION TECHNOLOGY AND ELECTRICAL ENGINEERING
THE UNIVERSITY OF QUEENSLAND
Module 4: Web Search and Web Information Systems Security
2 In This Module
- Outline
- An introduction to other important WIS topics
- Web search engines: how do search engines rank web pages?
- WIS security: how do you make your system secure?
- WIS performance and scalability: how do you make your system efficient?
- Objectives
- We have already covered how to develop a WIS and how to use XML and other technologies to enhance interoperability. Now we look at other important aspects of building a good WIS.
3 Web Search
4 Finding Information on the Web
- Browsing
- From a starting point, navigate through hyperlinks to find desired documents
- Yahoo's category hierarchy facilitates browsing
- Searching
- Submit a query to a search engine to find desired documents
- Many well-known search engines on the Web: AltaVista, Google
5 Browsing Versus Searching
- A category hierarchy is built mostly manually, while search engine databases can be created automatically
- Search engines can index many more documents than a category hierarchy
- Browsing is good for finding some desired documents; searching is better for finding many desired documents
- Browsing is more accurate (less junk is encountered) than searching
6 Search Engine
- A search engine is essentially a text retrieval
system for web pages plus a Web interface
7 Text Retrieval
- Document representation
- Remove stopwords (of, the, ...) and apply stemming (stemming → stem)
- d = (d1, ..., di, ..., dn), where di = tf × idf (or a weighted formula)
- tf: term frequency (the number of times the term appears in the doc)
- idf: inverse document frequency (the total number of docs over the number of docs containing the term)
- Query
- q = (q1, ..., qi, ..., qn), where qi is the weight of the ith term in q
- Similarity
- sim(q, d) = Σi qi × di (the dot product; see the sketch below)
- This approach prefers long docs (many improvements available)
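A minimal Python sketch of this representation (toy stopword list and documents of my own; following the slide, idf is the raw ratio of document counts, without the logarithm many systems add):

```python
from collections import Counter

STOPWORDS = {"of", "the", "a", "an", "and", "to", "in"}  # toy list for illustration

def tf_idf_vectors(docs):
    """Build tf*idf vectors for a list of tokenised (and stemmed) documents."""
    docs = [[t for t in doc if t not in STOPWORDS] for doc in docs]
    n_docs = len(docs)
    df = Counter(t for doc in docs for t in set(doc))  # docs containing each term
    idf = {t: n_docs / df[t] for t in df}              # total docs / docs with term
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]

def sim(q, d):
    """Dot-product similarity between a query vector and a document vector."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

docs = [["web", "search", "engines", "rank", "web", "pages"],
        ["security", "of", "the", "web"]]
vecs = tf_idf_vectors(docs)
query = {"web": 1.0, "rank": 1.0}
print(max(range(len(vecs)), key=lambda i: sim(query, vecs[i])))  # doc 0 wins
```

Note how the longer document wins partly because it simply contains more term weight, which is the long-document preference the slide mentions.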
8 Precision and Recall
- Retrieval effectiveness
- Relevant documents: documents useful to the user of the query
- Recall: the percentage of relevant documents that are retrieved
- Precision: the percentage of retrieved documents that are relevant
- precision = |relevant ∩ retrieved| / |retrieved|
- recall = |relevant ∩ retrieved| / |relevant|
- Example (see the sketch below): if there are 100 relevant docs in total, and 80 of them are found in the 200 docs returned, then recall = 80/100 = 80% and precision = 80/200 = 40%
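The same computation as a tiny Python sketch, reproducing the example above (the document ids are made up):

```python
def precision_recall(retrieved, relevant):
    """Precision and recall for a set of retrieved document ids."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(retrieved), hits / len(relevant)

# 100 relevant docs (ids 120-219); 80 of them appear in the 200 docs returned
p, r = precision_recall(retrieved=range(200), relevant=range(120, 220))
print(p, r)  # 0.4 0.8
```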
9 Robot
- A robot (also known as a spider or crawler) is a program for fetching web pages from the Web
- Main idea (sketched in code below)
- Place some initial URLs into a URL queue
- Repeat the steps below until the queue is empty
- Take the next URL from the queue and fetch the web page using HTTP
- Extract new URLs from the downloaded web page and add them to the queue
- A centralised index is built for the docs retrieved
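A minimal sketch of that loop using only the Python standard library; a real crawler must also respect robots.txt, rate-limit its requests and detect duplicate content, none of which is handled here:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href values of anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)

def crawl(seed_urls, max_pages=10):
    """Queue-based crawl, the repeat-until-empty idea on the slide."""
    queue, seen, pages = deque(seed_urls), set(seed_urls), {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()                      # take the next URL
        try:
            with urlopen(url, timeout=5) as resp:  # fetch the page over HTTP
                html = resp.read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue                               # skip unreachable pages
        pages[url] = html                          # a real system would index here
        parser = LinkExtractor()
        parser.feed(html)                          # extract new URLs
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages
```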
10 Improving Retrieval Effectiveness
- AltaVista and Yahoo associate different importance with different term occurrences
- Google uses terms in anchor tags to index the referenced page
- Shortcomings
- Important information in web docs, such as tags and links, is not fully exploited
11 PageRank: The Heart of Google
- The Web as a graph G = (V, E)
- V is the set of web pages (vertices)
- E is the set of hyperlinks (directed edges)
- Outgoing edges (forward links): citations from the page
- Incoming edges (backlinks): citations to the page
- Global web page importance is measured by backlinks
- The importance is domain/topic independent
- Not suitable for query-dependent importance
- What are the important web browser pages?
- Which pages are important game pages?
Page, Brin, Motwani and Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Stanford Digital Library Working Paper, 1999
12 PageRank Definition
- Let u be a web page
- Fu: the set of pages u points to
- Bu: the set of pages that point to u
- The rank (importance) of a page u is defined recursively as
- R(u) = Σ_{v ∈ Bu} R(v) / |Fv|
- Initially all page ranks are 1/N, where N = |V|
- Issues (see the sketch below for the standard fix)
- Will this computation terminate?
- What about a sub-graph that has no outgoing links?
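A Python sketch of the iteration. The damping factor and the sink handling are the standard fixes from the PageRank paper for exactly the two issues above; the slide itself does not define them:

```python
def pagerank(links, iterations=50, damping=0.85):
    """Iterate R(u) = sum over v in Bu of R(v)/|Fv| until it settles.

    links maps each page to the list of pages it points to (Fv);
    every page is assumed to appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}           # initially all ranks are 1/N
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for v, forward in links.items():
            if not forward:                       # sink: spread its rank evenly
                for p in pages:
                    new[p] += damping * rank[v] / n
            else:
                for u in forward:
                    new[u] += damping * rank[v] / len(forward)
        rank = new
    return rank

# Toy graph: A -> B, A -> C, B -> C, C -> A
print(pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]}))
```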
13 PageRank and Search Engines
- The ranking score of a web page can be a weighted sum of its regular similarity with a query and its importance (see the sketch below)
- ranking_score(q, p) = w × sim(q, p) + (1 − w) × R(p), if sim(q, p) > 0
- ranking_score(q, p) = 0, otherwise
- where 0 < w < 1
- Both sim(q, p) and R(p) need to be normalised to [0, 1]
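The combination rule as a short Python sketch (the default w = 0.5 is an arbitrary illustrative choice):

```python
def ranking_score(sim_qp, rank_p, w=0.5):
    """Weighted sum of query similarity and page importance, both in [0, 1]."""
    return w * sim_qp + (1 - w) * rank_p if sim_qp > 0 else 0.0
```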
14 Authority and Hub
- A page is a good authoritative page with respect to a given query if it is referenced (i.e., pointed to) by many (good hub) pages that are related to the query
- A page is a good hub page with respect to a given query if it points to many good authoritative pages with respect to the query
- Good authoritative pages (authorities) and good hub pages (hubs) reinforce each other
Kleinberg, "Authoritative sources in a hyperlinked environment", ACM-SIAM Symposium on Discrete Algorithms, 1998
15 The Root Set and the Base Set
- Submit q to a regular similarity-based search engine
- Let S be the set of top n pages returned by the search engine
- S is called the root set; n is often in the low hundreds
- Expand S into a larger set T (the base set):
- Add pages that are pointed to by any page in S
- Add pages that point to any page in S; if a page has too many parent pages, only the first k parent pages are used, for some k
16 A Sub-graph Induced by T
17 Computing the Scores
- Compute the authority score and hub score of each web page in T based on the induced subgraph SG = (V, E)
- Given a page p, let
- a(p) be the authority score of p
- h(p) be the hub score of p
- (p, q) be a directed edge in E from p to q
- Two basic operations
- Operation I: update each a(p) to the sum of the hub scores of all web pages that point to p
- Operation O: update each h(p) to the sum of the authority scores of all web pages pointed to by p
18 Operations I and O
- Operation I: for each page p,
  a(p) = Σ_{q : (q, p) ∈ E} h(q)
- Operation O: for each page p,
  h(p) = Σ_{q : (p, q) ∈ E} a(q)
19 Normalisation
- After each iteration of applying Operations I and O, normalise all authority and hub scores
- Repeat until the scores for each page converge (see the sketch below)
- The convergence is guaranteed
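A compact Python sketch of the whole loop: Operations I and O followed by normalisation, repeated a fixed number of times (a convergence test could replace the fixed count; the graph is made up):

```python
from math import sqrt

def hits(edges, pages, iterations=50):
    """edges is a set of (p, q) pairs, meaning page p links to page q."""
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Operation I: a(p) = sum of h(q) over all edges (q, p)
        auth = {p: sum(hub[q] for (q, r) in edges if r == p) for p in pages}
        # Operation O: h(p) = sum of a(q) over all edges (p, q)
        hub = {p: sum(auth[r] for (q, r) in edges if q == p) for p in pages}
        # Normalise, so the scores stay bounded from iteration to iteration
        a_norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

# Toy base set: h1 and h2 both point to a1 and a2
edges = {("h1", "a1"), ("h1", "a2"), ("h2", "a1"), ("h2", "a2")}
print(hits(edges, pages={"h1", "h2", "a1", "a2"}))
```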
20 All Links Equal?
- Some links may be more meaningful/important than other links
- Web site creators may trick the system into making their pages more authoritative by adding dummy pages pointing to their cover pages (spamming)
- Domain name: the first level of the URL of a page
- Transverse links are more important than intrinsic links
- Transverse link: a link between pages with different domain names
- Intrinsic link: a link between pages with the same domain name
- Two ways to incorporate this
- Use only transverse links and discard intrinsic links
- Give lower weights to intrinsic links
- Other ways of link weighting are possible (such as using tag information)
There are other interesting problems related to web communities
21 Web Information System Security
22 Security
- For a WIS, and in particular for an e-Commerce system, security means
- Protecting your customer
- Protecting yourself
- Getting paid
- Lack of security means
- Reduced availability
- Loss of confidential information
- Loss of data integrity
23 Improving Security
- Security can only be achieved by a combination of
- Technical solutions
- Legislation
- Policies
- You need to
- Understand the security challenges
- Use encryption techniques
- Use DBMS-provided security mechanisms
- Work with the system admin to improve security
- Secure the 3-tier WIS
- There is no single solution that is right for every business
- Security is a joint responsibility of the DBA, system admin, application analysts and Web master
24 Challenges in Securing a WIS
- Before the Internet revolution, databases were not easily accessible to hackers
- Physical security
- Operating system security
- Accessible by a limited number of internal employees only
- Now security is a big concern because of access from anywhere
- Must balance high availability, ease of access and security
25 Three-Tier Architecture
(figure: Clients <-> Internet <-> Web Server <-> Database Server)
Where might we have security problems?
26 Client Problems
- Causes
- Bugs in the browsers
- Active components in HTML docs
- Applets, ActiveX controls, external helpers, plug-ins, JavaScript and VBScript code
- Cookies
- Risks
- Crash the browser and damage the user's system
- Breach the user's privacy
- Reveal the user's identity and activity history
- Browse other documents
- Shut down services, deny legitimate users, create other inconvenience
- Attack other computers
- Use local devices (e.g., the Internet connection)
27 Client-Side Active Components
- Java Applets
- Built-in security with the sandbox model and restricted client-side operations
- No file manipulation or network connections
- However, denial-of-service is possible
- JavaScript/VBScript
- Interpreted and executed by the browser only, with very limited interaction with the client system
- Most problems are mainly related to privacy
- ActiveX Controls
- No restrictions on what a control can do, but each ActiveX control can be digitally signed
- ActiveX security: do you trust the certifying authority?
- You can turn off all of these in your browser
28 Server Problems
- Causes
- Access from anywhere
- Bugs in software and misconfiguration
- Execution of arbitrary code on the server at a client's request
- Risks
- Steal confidential data
- Execute commands on the server host machine
- Modify the system
- Gain information about the server side to break into the system
- Launch denial-of-service attacks
- Services unavailable to legitimate users
29 Firewalls
- The first line of defence
- A Web server should never be connected to any in-house networks
- Attacks from the Internet are inevitable
- A firewall is a system that prevents unauthorised access to or from a private network (Intranet)
- All messages entering or leaving the intranet pass through the firewall, which examines each message and blocks those not meeting the (configurable) criteria
30 Password Protection
- Application password protection
- User types username/password into a form (sent with POST)
- Server-side authentication (checking against a database; see the sketch below)
- Username/password is validated for every HTML/JSP page
- Web server supported password protection
- Directory-based access control
- .htaccess in the directory (says where to find the password file)
- htpasswd in a secured place, holding user names and passwords
- One of the oldest ways of protection, but for a WIS:
- Web servers allow an unlimited number of trials!
- Passwords can be intercepted during transmission
- The password goes with every HTTP request
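The slide leaves the server-side check unspecified; here is a minimal sketch of one reasonable way to do it, using salted PBKDF2 from the Python standard library (the parameters are common defaults, not prescribed by the slide):

```python
import hashlib, hmac, os

def hash_password(password, salt=None, rounds=100_000):
    """Salted PBKDF2 digest; store this, never the clear password."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, rounds)
    return salt, digest

def check_password(password, salt, stored_digest, rounds=100_000):
    """Recompute the digest and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, rounds)
    return hmac.compare_digest(candidate, stored_digest)

salt, digest = hash_password("s3cret")         # done once, at registration
print(check_password("s3cret", salt, digest))  # True
print(check_password("guess", salt, digest))   # False
```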
31 Restricted Access by IP Address
- Most Web servers support such restrictions
- E.g., "allow from uq.edu.au" in the configuration file
- You can specify host names or domain names
- Not very safe
- You trust a machine (and anyone who uses that machine)
- An experienced hacker can spoof his IP address
32 Data Transmission Problems
- Causes
- The Internet was not designed for security
- HTTP uses clear text in transmission
- Interception of data via network eavesdropping
- Risks
- Steal confidential data
- Impersonate a client or a server
33 Data Transmission Security
- Privacy: only the sender and the receiver have access to the data
- Integrity: the data cannot be changed during transmission
- Authenticity: the data is from the sender
- Non-fabrication: the receiver is genuine
- Non-repudiation: the sender cannot deny sending the data
34 Data Encryption
- Cryptography: the branch of mathematics behind encryption algorithms
- Four components of an encryption system (clear text, cipher text, the algorithms and the keys)
35 Private Key Algorithms
- Both the sender and receiver have a key they need to keep private
- Also called secret key or symmetric key algorithms (see the sketch below)
- Popular algorithm
- DES (Data Encryption Standard)
- 56-bit keys, US Govt standard (1977) and ANSI standard (1981)
- Problem
- How do you distribute your key?
- This is an even bigger problem for Web applications
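A minimal symmetric-encryption sketch. It uses Fernet (AES-based) from the third-party cryptography package rather than the long-obsolete DES; note that one shared key does both jobs, which is exactly the distribution problem above:

```python
from cryptography.fernet import Fernet  # pip install cryptography

key = Fernet.generate_key()  # this single key both encrypts and decrypts,
f = Fernet(key)              # so it must somehow reach the receiver safely

token = f.encrypt(b"order: 100 widgets")  # cipher text
print(f.decrypt(token))                   # b'order: 100 widgets'
```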
36 Public Key Algorithms
- A result of mathematical breakthroughs in the 1970s
- Also called asymmetric key algorithms
- Two keys, so no need for key distribution
- Public key: used for encryption
- Private key: used for decryption
(figure: clear text + recipient's public key -> encryption algorithm -> cipher text -> decryption algorithm + recipient's private key -> clear text)
37 Public Key Algorithms Continued
- Most common public key algorithm
- RSA algorithm (used in PGP)
38 More on Public Key Encryption
- How the system works (see the sketch below)
- Everyone generates a pair of keys, distributes the public key to everyone, and keeps the private key to themselves
- If A wants to send a message to B, A uses B's public key to encrypt the message, and sends the encrypted message to B
- Only the person who has the private key can decrypt the message, so only B can read the clear text
- The algorithms are public domain knowledge
- The strength comes from the length of the key (128-bit key encryption is computationally infeasible to break)
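A sketch of the A-to-B exchange above, using RSA from the third-party cryptography package (the 2048-bit key size and OAEP padding are standard modern choices, not taken from the slide):

```python
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# B generates a key pair and publishes the public key
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

cipher_text = public_key.encrypt(b"meet at noon", oaep)  # A encrypts for B
print(private_key.decrypt(cipher_text, oaep))            # only B can decrypt
```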
39 Digital Signatures
- The electronic counterpart of handwritten signatures
- Legally binding now
- Purpose
- The receiver knows the signed message was sent by the sender
- The sender cannot deny having sent the signed message
- Based on public key encryption
40 How Digital Signatures Work
- A message digest (or fingerprint) is generated such that
- It reveals nothing about the message, and
- It is not computationally feasible to find another message that generates the same digest
- The digital signature is the message digest encrypted with the sender's private key
- The digital signature and the clear text message are sent out together
- The receiver uses the sender's public key to decrypt the digital signature (and gets a clear text message digest)
- The receiver uses the same (public domain) method to generate a message digest from the clear text message
- This digest must match the decrypted message digest (see the sketch below)
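A sketch of sign-and-verify with the same third-party cryptography package; the library computes the message digest internally, and PSS padding with SHA-256 is my choice rather than the slide's:

```python
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()
pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                  salt_length=padding.PSS.MAX_LENGTH)

message = b"I agree to pay 100 dollars"
signature = private_key.sign(message, pss, hashes.SHA256())   # sender's side

try:  # receiver's side: verify with the sender's public key
    public_key.verify(signature, message, pss, hashes.SHA256())
    print("signature valid")
except InvalidSignature:
    print("message or signature was tampered with")
```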
But where do I get the sender's public key?
41 Digital Certificates
- An application of digital signatures
- There are a number of certificate authorities (CAs)
- VeriSign, Microsoft, ...
- Authoritative, independent, trustworthy third parties
- CAs' public keys are well known
- Many are stored in browsers
- If A wants to send a piece of data (including code), A needs to get it certified by a CA
- A sends its certified data with the digital certificate
- Once B receives such a message, B can verify that it is certified by the CA
42 Secure Socket Layer
- SSL was proposed by Netscape
- A low-level encryption scheme to provide connection security
- HTTP lives on top of it; TCP lives underneath it
- Purposes
- Allow sensitive information to be shared only between the browser and the server
- Ensure that the data exchanged is reliable (not changed during transmission)
- https means SSL is in use
- What SSL can provide
- Privacy for the connection, using symmetric key encryption
- Integrity, using message authentication codes (MACs)
- Authentication of the parties, using certificates
43 More on SSL
- SSL has two logical layers (a client-side sketch follows below)
- The handshake layer
- Enables the client and the server to negotiate the encryption methods they both support, and to establish the security parameters (key length, compression algorithms, etc.) in a secure way
- Public key methods are used to exchange secret, one-time, symmetric keys
- The record layer
- Computes a message authentication code, and uses a shared secret key and other established parameters to encrypt it
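A minimal client-side sketch with Python's standard ssl module (www.example.com is just a placeholder host); the handshake described above, including certificate verification, happens inside wrap_socket:

```python
import socket
import ssl

context = ssl.create_default_context()  # loads the trusted CA certificates

with socket.create_connection(("www.example.com", 443)) as raw:
    # wrap_socket runs the SSL/TLS handshake (negotiation + key exchange)
    with context.wrap_socket(raw, server_hostname="www.example.com") as tls:
        print(tls.version())  # negotiated protocol version
        print(tls.cipher())   # negotiated cipher suite
        tls.sendall(b"GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n")
        print(tls.recv(200))  # first bytes of the reply, sent over the record layer
```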
44 Summary
- We have discussed
- Web search (PageRank and Authority-and-Hub)
- WIS security (client, server and transport level)
- Next week
- An introduction to Microsoft .Net technology
45 References
- Page, Brin, Motwani and Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Stanford Digital Library Working Paper, 1999
- Kleinberg, "Authoritative sources in a hyperlinked environment", ACM-SIAM Symposium on Discrete Algorithms, 1998
- www.w3.org/Security