Title: Data Dissemination on the Web
1Data Dissemination on the Web
Krithi Ramamritham, IIT Bombay, krithi_at_cse.iitb.ernet.in
2Web Content
- Web sites have traditionally served static content
- But dynamic content generation has come into vogue
  - generated on the fly by running dynamic scripts, e.g., Active Server Pages (ASP), Java Server Pages (JSP), Servlets
  - allows generation of different content for the same request
3Dynamic Web Pages
(Figure: example web page from a news content site)
4Generic Architecture
(Figure: data sources (sensors, servers) connected through networks to end-hosts (wired hosts, mobile hosts))
5Coherency of Dynamic Data
- Strong coherency
  - The client and source are always in sync with each other
  - Strong coherency is expensive!
- Relax strong coherency: Δ-coherency
  - Time domain: Δt-coherency
    - The client is never out of sync with the source by more than Δt time units
    - e.g., traffic data not stale by more than a minute
  - Value domain: Δv-coherency
    - The difference in the data values at the client and the source is bounded by Δv at all times
    - e.g., only interested in temperature changes larger than 1 degree
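The two relaxed coherency notions can be sketched as simple predicates. This is a minimal illustration, not part of the original slides: the `CachedCopy` type and both function names are hypothetical, and the time check uses the age of the copy as a sufficient condition for Δt-coherency.

```python
from dataclasses import dataclass


@dataclass
class CachedCopy:
    value: float       # value currently held at the client
    fetched_at: float  # time (in seconds) at which it was read from the source


def is_time_coherent(copy: CachedCopy, now: float, delta_t: float) -> bool:
    """Δt-coherency (sufficient check): the copy is at most delta_t old."""
    return (now - copy.fetched_at) <= delta_t


def is_value_coherent(copy: CachedCopy, source_value: float, delta_v: float) -> bool:
    """Δv-coherency: the copy differs from the source by at most delta_v."""
    return abs(source_value - copy.value) <= delta_v


# Traffic data fetched 45 s ago, with a one-minute bound.
traffic = CachedCopy(value=42.0, fetched_at=0.0)
traffic_ok = is_time_coherent(traffic, now=45.0, delta_t=60.0)

# Temperature has drifted by 1.5 degrees, with a 1-degree bound.
temperature = CachedCopy(value=20.0, fetched_at=0.0)
temperature_ok = is_value_coherent(temperature, source_value=21.5, delta_v=1.0)
```

Here `traffic_ok` holds because the copy is younger than a minute, while `temperature_ok` does not because the drift exceeds the 1-degree bound.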
6Generic Architecture
(Figure: the same architecture with proxies/caches placed in the network between the data sources (sensors, servers) and the end-hosts (wired host, mobile host))
7The Push Approach
- Proxy registers the data item of interest and the coherency requirement with the server
- Server pushes interesting changes
- Achieves strong consistency
- Keeps network overhead to a minimum
- -- Poor scalability (server has to maintain state and keep connections open)
- -- Low resiliency
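The push approach can be sketched as follows. This is an illustrative assumption rather than the presenter's implementation: `PushServer`, `Proxy`, and their methods are hypothetical names, and the coherency requirement is taken to be a value bound Δv. The per-proxy dictionary is exactly the state that limits scalability.

```python
class Proxy:
    """Receives pushed values and counts the messages it gets."""

    def __init__(self):
        self.value = None
        self.messages = 0

    def receive(self, value):
        self.value = value
        self.messages += 1


class PushServer:
    """Source that keeps per-proxy state and pushes only interesting changes."""

    def __init__(self, initial_value):
        self.value = initial_value
        # proxy -> [delta_v requirement, last value pushed to that proxy]
        self.registrations = {}

    def register(self, proxy, delta_v):
        self.registrations[proxy] = [delta_v, self.value]
        proxy.receive(self.value)  # send the current value on registration

    def update(self, new_value):
        self.value = new_value
        for proxy, state in self.registrations.items():
            delta_v, last_pushed = state
            # Push only when the proxy's copy would violate its requirement.
            if abs(new_value - last_pushed) > delta_v:
                proxy.receive(new_value)
                state[1] = new_value


server = PushServer(20.0)
proxy = Proxy()
server.register(proxy, delta_v=1.0)
for reading in [20.4, 20.9, 21.2, 21.5, 23.0]:
    server.update(reading)
```

Of the five updates, only 21.2 and 23.0 move more than 1 degree away from the last pushed value, so the proxy receives three messages in total (including the one at registration).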
8The Pull Approach
- Proxy pulls after
  - Time to Live (TTL)
  - Time To next Refresh (TTR / TNR)
- Can be implemented using the HTTP protocol
- Stateless, and hence generally scalable with respect to state space and computation
- Weak cache consistency
- Heavy polling for stringent coherency requirements or highly dynamic data
- Network overheads higher than for Push
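A pull-based proxy can be sketched with a fixed TTR. The class and method names below are hypothetical and the clock is passed in explicitly to keep the example deterministic; a real proxy would issue HTTP requests instead of calling `Source.get`.

```python
class Source:
    """Stateless source: it only answers requests and counts them."""

    def __init__(self, value):
        self.value = value
        self.requests = 0

    def get(self):
        self.requests += 1
        return self.value


class PullProxy:
    """Refreshes its copy from the source only when the TTR has elapsed."""

    def __init__(self, source, ttr):
        self.source = source
        self.ttr = ttr
        self.value = None
        self.next_refresh = 0.0

    def read(self, now):
        if self.value is None or now >= self.next_refresh:
            self.value = self.source.get()
            self.next_refresh = now + self.ttr
        return self.value


source = Source(10)
proxy = PullProxy(source, ttr=5.0)
at_0 = proxy.read(now=0.0)  # first read polls the source
source.value = 11           # the source changes between polls
at_3 = proxy.read(now=3.0)  # TTR not yet elapsed: stale copy is served
at_5 = proxy.read(now=5.0)  # TTR elapsed: proxy polls again
```

The read at time 3 returns the old value, which is the weak consistency noted above; shrinking the TTR tightens coherency at the cost of more polls.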
9Typical End-to-end Web Site Architecture
(Figure: a web server cluster in front of an application server cluster, backed by data stores)
10WS vs. AS
- Web servers
  - Do well-defined and quantifiable local work
  - e.g., processing HTTP headers, serving static content
- Application servers
  - Run multi-layer programs
  - e.g., scripts involving calls to backends
11Application Layer Details
Servlets
12The Problem: Page Generation Delays
- Causes of page generation delays include (in addition to pure processing overhead)
  - Remote database accesses: heavy I/O loads, network delays
  - XML-HTML transformations: extensive processing delays
  - Personalization logic: e.g., Broadvision, Vignette, etc.
  - Interaction bottlenecks: e.g., database connection pools
- => Serious performance and scalability problems for web sites, due to increased load on server-side infrastructure
13Reducing delays
- Approaches fall into 3 broad categories
- Database caching
- Page level caching
- Fragment level caching
14Alternative: CDNs
Content Distribution Networks
15Push Based Core Infrastructure
- Resilient and efficient content distribution network (CDN) for dynamic data
- Existing CDNs: Akamai, Dynamai
16Database Caching
- Two broad types
- Query result caching
- Middle tier database caching
- caching database tables in main memory
17Query result caching
- Many application server products offer this feature
- Luo et al. (2000) proposed query result caching at Web proxy caches
- -- mitigates only local database access latency
- -- only a subset of query results may be reused in page generation
- -- page fragments may not all be from databases
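Query result caching can be sketched as a lookup table keyed by the query text and its parameters. This is a minimal illustration under stated assumptions: `QueryResultCache` is a hypothetical class, an in-memory SQLite database stands in for the content database, and invalidation on updates is omitted.

```python
import sqlite3


class QueryResultCache:
    """Caches result sets keyed by (SQL text, parameters)."""

    def __init__(self, conn):
        self.conn = conn
        self.results = {}
        self.hits = 0
        self.misses = 0

    def query(self, sql, params=()):
        key = (sql, tuple(params))
        if key in self.results:
            self.hits += 1
            return self.results[key]
        self.misses += 1
        rows = self.conn.execute(sql, params).fetchall()
        self.results[key] = rows
        return rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, category TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "skates", "hockey"), (2, "stick", "hockey"), (3, "ball", "soccer")],
)

cache = QueryResultCache(conn)
sql = "SELECT name FROM products WHERE category = ? ORDER BY id"
first = cache.query(sql, ("hockey",))   # miss: goes to the database
second = cache.query(sql, ("hockey",))  # hit: served from the cache
```

Only the database access is saved here; the application still has to format the rows into HTML, which is the limitation noted in the bullets above.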
18Middle tier database caching
- Caching database tables in main memory
  - Oracle 9i Cache
  - Main-memory databases, e.g., TimesTen
- -- mitigates only database access latency
- -- caching at table granularity results in poor cache utilization
- -- main-memory databases are difficult to integrate and maintain, and can be expensive
19Page Level Caching
- Dynamically generated HTML pages are cached [Iyengar & Challenger, 1997; Zhu & Yang, 2000]
- Several commercially available products follow this approach, e.g., SpiderCache, Xcache, Dynamai
- Can completely offload work from web/app server
- Low reusability for highly personalized web pages
- URL may not uniquely identify a page
  - -- increasing the risk of delivering incorrect pages
- Often introduces excessive invalidations
  - -- e.g., the whole page is invalidated even if a single element on it changes
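The URL-identification problem can be illustrated with a toy page cache. All names here are hypothetical; `render_page` stands in for a personalized page generation script.

```python
def render_page(url, user):
    """Stand-in for a page generation script with personalized output."""
    return f"<html><body>{url} for {user}</body></html>"


class PageCache:
    """Caches whole pages under a key computed by key_fn(url, user)."""

    def __init__(self, key_fn):
        self.key_fn = key_fn
        self.pages = {}

    def get(self, url, user):
        key = self.key_fn(url, user)
        if key not in self.pages:
            self.pages[key] = render_page(url, user)
        return self.pages[key]


# Keyed by URL only: the second user is served the first user's page.
by_url = PageCache(lambda url, user: url)
by_url.get("/home", "alice")
served_to_bob = by_url.get("/home", "bob")

# Keyed by URL and user: pages are correct but never shared across users.
by_url_and_user = PageCache(lambda url, user: (url, user))
by_url_and_user.get("/home", "alice")
correct_for_bob = by_url_and_user.get("/home", "bob")
```

Adding the user to the key fixes correctness but yields one cache entry per user, which is the low reusability for personalized pages noted above.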
20Reducing page generation delays
- Approaches fall into 3 broad categories
- Database caching
- Page level caching
- Fragment level caching
21How Dynamic Scripting Works
(Figure: a page generation script made up of successive "Write to Out" steps)
22Code Blocks Perform Work
(Figure: the page generation script, with code blocks performing work between successive "Write to Out" steps)
23Code Blocks <-> Components
(Figure: code blocks in the page generation script correspond to components of the generated web page: Ad Component, Headline Components, Navigation Component, Personalized Component. Example: news content site.)
Certain components can be cached
24DCA: Our Solution
(Figure: the page generation script sends a request for a code block to the Dynamic Content Accelerator, which returns the cached code block output; the work bypassed includes application logic, database calls, and HTML formatting)
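The idea of bypassing a code block's work can be sketched as follows. This is not the DCA API, which the slides do not show: `FragmentCache`, `get_or_generate`, and the block functions are hypothetical, and the cache lives in-process here, whereas the DCA is described as a separate component.

```python
class FragmentCache:
    """Stores the output of code blocks, keyed by a fragment id."""

    def __init__(self):
        self.store = {}
        self.hits = 0

    def get_or_generate(self, fragment_id, generate):
        if fragment_id in self.store:
            self.hits += 1
            return self.store[fragment_id]  # code block work is bypassed
        output = generate()
        self.store[fragment_id] = output
        return output


stats = {"db_calls": 0}


def headlines_block():
    """Cacheable code block: database call plus HTML formatting."""
    stats["db_calls"] += 1  # stands in for a real database access
    return "<ul><li>Headline A</li><li>Headline B</li></ul>"


def personalized_block(user):
    """Personalized code block: generated on every request."""
    return f"<p>Welcome back, {user}</p>"


def generate_page(cache, user):
    out = []
    out.append(cache.get_or_generate("headlines", headlines_block))
    out.append(personalized_block(user))
    return "".join(out)


cache = FragmentCache()
page_for_alice = generate_page(cache, "alice")
page_for_bob = generate_page(cache, "bob")
```

Both pages share the cached headline fragment, so the simulated database is called once, while each user still gets a personalized greeting.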
25DCA in a Typical End-to-end Web Site Architecture
- A single instance of the DCA serves a rack of application servers
- Application servers communicate with DCA through a lightweight API
(Figure: the Dynamic Content Accelerator alongside the web server cluster, application server cluster, and data tier)
26Cache Management
- A critical aspect of any caching solution
- DCA supports novel cache management strategies
- Prediction-based cache replacement
- Observation-based cache invalidation
27Cache Replacement
- Prediction-based replacement
  - Fragments having the lowest probability of access are replaced
  - Least-Likely-to-be-Used (LLU)
- Access probabilities based on
  - Current user navigational patterns over the site graph (in the form of clickstreams)
  - Historical user navigational patterns over the site graph (in the form of association rules)
(News, Sports, Hockey) -> Schedules: 20%
(News, Sports, Hockey) -> Players: 15%
(News, Sports, Hockey) -> Teams: 10% (LLU)
(News, Sports, Hockey) -> Scores: 55%
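Least-Likely-to-be-Used replacement can be sketched with the probabilities from the example above. The `LLUCache` class is a hypothetical illustration: it takes the predicted probabilities as given (the slides derive them from clickstreams and association rules) and always admits the new fragment.

```python
class LLUCache:
    """Fixed-capacity fragment cache with Least-Likely-to-be-Used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.fragments = {}  # fragment id -> cached output

    def put(self, fragment_id, output, access_probability):
        """Insert a fragment; return the id of the evicted fragment, if any."""
        evicted = None
        is_new = fragment_id not in self.fragments
        if is_new and len(self.fragments) >= self.capacity:
            # Evict the resident fragment with the lowest predicted probability.
            evicted = min(
                self.fragments,
                key=lambda f: access_probability.get(f, 0.0),
            )
            del self.fragments[evicted]
        self.fragments[fragment_id] = output
        return evicted


# Predicted next-access probabilities after (News, Sports, Hockey).
predicted = {"Schedules": 0.20, "Players": 0.15, "Teams": 0.10, "Scores": 0.55}

cache = LLUCache(capacity=3)
for name in ("Schedules", "Players", "Teams"):
    cache.put(name, f"<div>{name}</div>", predicted)
evicted = cache.put("Scores", "<div>Scores</div>", predicted)
```

With room for three fragments, inserting Scores evicts Teams, the fragment with the lowest predicted access probability.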
28Cache Invalidation
- DCA supports common cache invalidation techniques
  - Time-based: each cache element is assigned a TTL
  - Event-based: updates to the database send an invalidation message to the cache
  - On demand: manual invalidation of selected elements
- DCA supports additional invalidation techniques
29Cache Invalidation
- Other invalidation techniques supported
  - Observation-based
    - User-initiated updates are observed in scripts; each such update sends an invalidation message to the cache
    - Most appropriate for auction sites, online trading sites
    - Invalidation does not require communication with the databases
  - Keyword-based
    - Elements can be associated with keywords, e.g., a retailer may wish to invalidate all seasonal items
  - Regular expression-based
    - Elements can be invalidated based on regular expression matching
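Several of these invalidation techniques can be sketched in one small cache. Everything below is a hypothetical illustration: the class, the key naming scheme (`item/<id>/...`), and the `place_bid` script are assumptions, not the DCA interface.

```python
import re


class InvalidatingCache:
    """Cache supporting time-, keyword-, and regex-based invalidation."""

    def __init__(self):
        self.elements = {}  # key -> (content, expires_at, keywords)

    def put(self, key, content, now, ttl, keywords=()):
        self.elements[key] = (content, now + ttl, frozenset(keywords))

    def get(self, key, now):
        entry = self.elements.get(key)
        if entry is None:
            return None
        content, expires_at, _ = entry
        if now >= expires_at:  # time-based: the TTL has run out
            del self.elements[key]
            return None
        return content

    def invalidate_keyword(self, keyword):
        doomed = [k for k, (_, _, kws) in self.elements.items() if keyword in kws]
        for key in doomed:
            del self.elements[key]
        return doomed

    def invalidate_regex(self, pattern):
        matcher = re.compile(pattern)
        doomed = [k for k in self.elements if matcher.search(k)]
        for key in doomed:
            del self.elements[key]
        return doomed


def place_bid(cache, item_id, amount):
    """Observation-based: the script that performs the update invalidates."""
    # The database write for the bid would happen here (omitted).
    cache.invalidate_regex(rf"^item/{item_id}/")


cache = InvalidatingCache()
cache.put("item/42/price", "$10", now=0, ttl=100)
cache.put("item/7/price", "$5", now=0, ttl=100)
cache.put("catalog/winter-coats", "<div>coats</div>", now=0, ttl=100,
          keywords=["seasonal"])
cache.put("ticker", "<span>quote</span>", now=0, ttl=10)

place_bid(cache, item_id=42, amount=11)
removed_seasonal = cache.invalidate_keyword("seasonal")
```

The bid on item 42 removes only that item's fragments and needs no message from the database, which is the property the observation-based bullets describe.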
30Other Fragment Level Caching
- Application servers (e.g., BEA's WebLogic, IBM's WebSphere) cache fragments produced by JSP scripts
- Can offload presentation layer tasks
- Runs in the application server process space
  - => competes for server resources
- Application server cluster
  - => multiple cache instances, duplication of content, additional synchronization overhead
31Other Fragment Level Caching.
- Weave system [VLDB 2000] caches XML fragments, as well as query results and HTML pages
- Requires use of a declarative web site specification language
32Performance Study
- Metric
- Average page generation time
- time required to construct HTML page
33Performance Study
- Test Site
  - Fictitious online retail site that allows browsing of a product catalog
  - Pages generated using JSP scripts
  - Site content stored in an Oracle database
  - Database schema based on the Dublin Core Metadata Open Standard
  - Contains 200,000 products and 44,000 categories
  - Each page consists of 3 components, each involving a database call
34Performance Study
- Test Setup
  - Content Database Server: Oracle 8.1.6
  - Web/Application Server: WebLogic 6.0 running on a cluster of 2 machines
  - Server machines have 1 GB RAM and dual Pentium III 933 MHz processors, and run Windows 2000 Advanced Server
35Testing Methodology
- DCA compared to 2 middle tier caching solutions
  - Main Memory Database: TimesTen used to cache the content database (entire database is cached; runs on the database server machine)
  - Application Server Cache: WebLogic Server JSP caching (WLS Cache)
- For both WLS and DCA, 2 (of 3) page components are cached
- Usually, DCA runs on a separate machine (512 MB RAM, Pentium III 600 MHz processor, running Windows 2000 Advanced Server)
36Testing Methodology...
- Baseline Parameters
  - Cache size, i.e., percentage of fragments that fit into cache: 75%
  - Cache replacement policy: LLU for DCA
- User load is varied by sending requests from client machines running RadView's WebLoad
- Simulated users navigate the site according to a Zipf 80-20 distribution (i.e., 80% of users follow 20% of navigation links)
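An 80-20 navigation workload can be approximated with a two-tier sampler. This is a simplification and an assumption: it is not a true Zipf distribution and not how WebLoad was configured in the study; it only reproduces the stated property that about 80% of navigations follow 20% of the links. The function name and parameters are hypothetical.

```python
import random


def make_navigator(links, hot_fraction=0.2, hot_probability=0.8, seed=0):
    """Return a function that picks the next link under an 80-20 skew."""
    rng = random.Random(seed)
    n_hot = max(1, int(len(links) * hot_fraction))
    hot, cold = links[:n_hot], links[n_hot:]

    def next_link():
        # With probability hot_probability, follow one of the popular links.
        if not cold or rng.random() < hot_probability:
            return rng.choice(hot)
        return rng.choice(cold)

    return next_link


links = [f"/page/{i}" for i in range(100)]
navigate = make_navigator(links, seed=1)
picks = [navigate() for _ in range(10000)]

hot_links = set(links[:20])
hot_share = sum(pick in hot_links for pick in picks) / len(picks)
```

Over 10,000 simulated navigations, `hot_share` should land close to 0.8, so a cache holding the fragments behind the popular links absorbs most requests.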
37Page Gen. Times vs. Number of Users
- TimesTen vs. DCA: 3x to 9x improvement with DCA
- TimesTen only mitigates local database access latency; it still requires query processing and formatting operations
38Page Generation Times...
- WLS vs. DCA: 2x to 5x improvement with DCA
- WLS runs in the application server process space and competes for server resources
- WLS utilizes multiple caches, causing redundant caching
- DCA runs as a single, standalone logical cache
39Sensitivity to Cache Size
- As expected, performance improves as cache size increases
- Since cached elements are typically quite small (e.g., a few hundred bytes), larger cache sizes are feasible in practice
40Conclusion
- Increased use of dynamic page generation technologies
  - => increases load on application servers
  - => serious performance and scalability problems for e-business sites
- DCA (Dynamic Content Acceleration)
  - => significantly reduces the load on the server-side infrastructure, allowing e-business sites to scale
  - => significantly outperforms existing middle tier caching solutions
41IIT Bombay's aAQUA Community Forum
- Farmers get information and get their questions answered
  - In the local context
  - In their local language
- Capitalizes on existing human and infrastructural resources
  - Agri-extension center: KVK, Baramati
  - NGO: Vigyan Ashram, Pabal
  - Corporate infrastructure: ITC e-chaupal
  - Government: MCIT
www.aAQUA.org
42Access over Low Bandwidth: Resource Optimization
- Resource constraints: low/unpredictable bandwidth
  - => disconnected operation/access
- Exploit caching and prefetching (through prediction of future needs)
  - Profiling by user type, location
- Data characteristics
  - Static data (text, images, land records, photos) can be cached/hoarded
  - Dynamic data (weather/price information): cached info needs to be refreshed carefully
  - Continuous media (VoIP, video data): QoS considerations