Title: VMWORLD EUROPE
1. AP02: NFS & iSCSI Performance Characterization and Best Practices in ESX 3.5
- Priti Mishra, MTS, VMware
- Bing Tsai, Sr. R&D Manager, VMware
2. Housekeeping
- Please turn off your mobile phones, BlackBerries and laptops
- Your feedback is valued: please fill in the session evaluation form (specific to that session) and hand it to the room monitor or to the materials pickup area at registration
- Each delegate who returns their completed event evaluation form to the materials pickup area will be eligible for a free evaluation copy of VMware's ESX 3i
- Please leave the room between sessions, even if your next session is in the same room, as you will need to be rescanned
3. Topics
- General Performance Data and Comparison
- Improvements in ESX 3.5 over ESX 3.0.x
- Performance Best Practices
- Troubleshooting Techniques
- Basic methodology
- Tools
- Case studies
4. Key performance improvements since ESX 3.0.x (1 of 3)
- NFS
- Accurate CPU accounting further improves load balancing among multiple VMs
- Optimized buffer and heap sizes
- Improvements in TSO support
- TSO (TCP segmentation offload) improves large writes
- H/W iSCSI (with QLogic 405x HBA)
- Improvements in PAE (large memory) support
- Results in better multi-VM performance in large systems
- Minimized NUMA performance overhead
- This overhead exists in physical systems as well
- Improved CPU cost per I/O
5. Key performance improvements since ESX 3.0.x (2 of 3)
- S/W iSCSI (S/W-based initiator in ESX)
- Improvements in CPU costs per I/O
- Accurate CPU accounting further improves load balancing among multiple VMs
- Increased maximum transfer size
- Minimizes iSCSI protocol processing cost
- Reduces network overhead for large I/Os
- Ability to handle more concurrent I/Os
- Improved multi-VM performance
6. Key performance improvements since ESX 3.0.x (3 of 3)
- S/W iSCSI (continued)
- Improvements in PAE (large memory) support
- CPU efficiency much improved for systems with >4GB memory
- Minimized NUMA performance overhead
7. Performance Experiment Setup (1 of 3)
- Workload: Iometer
- Standard set based on (sketched after this list):
- Request size: 1K, 4K, 8K, 16K, 32K, 64K, 72K, 128K, 256K, 512K
- Access mode: 50% read / 50% write
- Access pattern: 100% sequential
- 1 worker, 16 outstanding I/Os
- Cached runs
- 100MB data disks to minimize array/server disk activities
- All I/Os served from server/array cache
- Gives upper bound on performance
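As a rough illustration of the matrix above, here is a minimal Python sketch; the dictionary keys and layout are my own illustration, not Iometer's actual configuration format.

```python
# Hypothetical enumeration of the Iometer test matrix described above;
# keys and structure are illustrative, not Iometer's config file format.
request_sizes_kb = [1, 4, 8, 16, 32, 64, 72, 128, 256, 512]
access_spec = {
    "read_pct": 50,          # 50% read / 50% write
    "sequential_pct": 100,   # 100% sequential
    "workers": 1,
    "outstanding_ios": 16,
}

for size_kb in request_sizes_kb:
    run = dict(access_spec, request_size_kb=size_kb)
    print(run)  # one Iometer run per request size
```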
8. Performance Experiment Setup (2 of 3)
- VM information
- Windows 2003 Enterprise Edition
- 1 VCPU, 256 MB memory
- No file system used in VM (Iometer sees disk as a physical drive)
- No caching done in VM
- Virtual disks located on RDM device configured in physical mode
- Note: VMFS-formatted volumes are used in some tests where noted
9. Performance Experiment Setup (3 of 3)
- ESX Server
- 4-socket, 8 x 2.4GHz cores
- 32GB DRAM
- 2 x Gigabit NICs
- One for vmkernel networking, used for NFS and software iSCSI protocols
- One for general VM connectivity
- Networking Configuration
- Dedicated VLANs for data traffic, isolated from general networking
10. How to read performance comparison charts
- Throughput
- Higher is better
- Positive is better → higher throughput
- Latency
- Lower is better
- Negative is better → lower response time
- CPU cost
- Lower is better
- Negative is better → reduced CPU cost
- How does this metric matter?
11. CPU Costs
- Why is CPU cost data useful?
- Determines how much I/O traffic the system CPUs can handle
- How many I/O-intensive VMs can be consolidated in a host
- How to compute CPU cost
- Measure total physical CPU usage in ESX
- esxtop counter: Physical Cpu(_Total)
- Normalize to per I/O or per MBps
- Example (sketched below): MHz/MBps = (physical CPU usage percentage / 100) x (# of physical CPUs) x (CPU MHz rating) / (throughput in MBps)
12. Performance Data
- First set: relative to baselines in ESX 3.0.x
- Second set: comparison of storage options, using Fibre Channel data as the baseline
- Last: VMFS vs. RDM-physical
13. Software iSCSI: Throughput Comparison to 3.0.x (higher is better)
14. Software iSCSI: Latency Comparison to 3.0.x (lower is better)
15. Software iSCSI: CPU Cost Comparison to 3.0.x (lower is better)
16. Software iSCSI Performance Summary
- Lower CPU costs
- Can lead to higher throughput for small I/O sizes when CPU is pegged
- CPU costs per I/O also greatly improved for larger block sizes
- Latency is lower
- Especially for smaller data sizes
- Read operations benefit most
- Throughput levels
- Dependent on workload
- Mixed read-write patterns show most gain
- Read I/Os show gains for small data sizes
17. Hardware iSCSI: Throughput Comparison to 3.0.x (higher is better)
18. Hardware iSCSI: Latency Comparison to 3.0.x (lower is better)
19. Hardware iSCSI: CPU Cost Comparison to 3.0.x (lower is better)
20. Hardware iSCSI Performance Summary
- Lower CPU costs
- Results in higher throughput levels for small I/O sizes
- CPU costs per I/O are especially improved for larger data sizes
- Latency is better
- Smaller data sizes show the most gain
- Mixed read-write and read I/Os benefit more
- Throughput levels
- Dependent on workload
- Mixed read-write patterns show most gain for all block sizes
- Pure read and write I/Os show gains for small block sizes
21. NFS Performance Summary
- Performance also significantly improved in ESX 3.5
- Data not shown here in the interest of time
22. Protocol Comparison
- Which storage option to choose?
- IP Storage vs. Fibre Channel
- How to read the charts? (see the sketch after this list)
- All data is presented as a ratio to the corresponding 2Gb FC (Fibre Channel) data
- If the ratio is 1, the FC and IP protocol data are identical; if < 1, the FC data value is larger
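A minimal sketch of the ratio arithmetic behind these charts; the numbers below are made up for illustration, not measured results.

```python
# Ratio of an IP-storage measurement to the corresponding 2Gb FC baseline.
def ratio_to_fc(protocol_value, fc_value):
    # 1.0 -> identical to FC; < 1.0 -> the FC value is larger
    return protocol_value / fc_value

# e.g., a hypothetical 90 MBps over software iSCSI vs. 100 MBps over FC
print(ratio_to_fc(90.0, 100.0))  # 0.9 -> FC throughput is higher here
```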
23. Comparison with FC: Throughput (if < 1, the FC data value is larger)
24. Comparison with FC: Latency (lower is better)
25. VMFS vs. RDM
- Which one has better performance?
- Data shown as a ratio to RDM-physical
26. VMFS vs. RDM-physical: Throughput (higher is better)
27. VMFS vs. RDM-physical: Latency (lower is better)
28. VMFS vs. RDM-physical: CPU Cost (lower is better)
29. Topics
- General Performance Data and Comparison
- Improvements in ESX 3.5 over ESX 3.0.x
- Performance Best Practices
- Troubleshooting Techniques
- Basic methodology
- Tools
- Case studies
30. Pre-Deployment Best Practices: Overview
- Understand the performance capability of your:
- Storage server/array
- Networking hardware and configurations
- ESX host platform
- Know your workloads
- Establish performance baselines
31. Pre-Deployment Best Practices (1 of 4)
- Storage server/array: a complex system by itself
- Total spindle count
- Number of spindles allocated for use
- RAID level and stripe size
- Storage processor specifications
- Read/write cache sizes and caching policy settings
- Read-ahead, write-behind, etc.
- Useful sources of information
- Vendor documentation: manuals, best-practice guides, white papers, etc.
- Third-party benchmarking reports
- NFS-specific tuning information: SPEC SFS disclosures at http://www.spec.org
32. Pre-Deployment Best Practices (2 of 4)
- Networking
- Routing topology, path configurations, # of links in between, etc.
- Switch type, speed and capacity
- NIC brand/model, speed and features
- H/W iSCSI HBAs
- ESX host
- CPU revision, speed and core count
- Architecture basics
- SMP or NUMA?
- Disabling NUMA is not recommended
- Bus speed, I/O subsystems, etc.
- Memory configuration and size
- Note: NUMA nodes may not have equal amounts of memory
33. Pre-Deployment Best Practices (3 of 4)
- Workload characteristics
- What are the smallest, largest and most common I/O sizes?
- What is the % read? % write?
- Is the access pattern sequential? random? mixed?
- Is response time more important, or aggregate throughput?
- Is response time variance an issue or not?
- Important: know the peak resource usage, not just the average
34. Pre-Deployment Best Practices (4 of 4)
- Establish performance baselines by running standardized benchmarks
- What's the upper-bound IOps for small I/Os?
- What's the upper-bound MBps?
- What's the average/worst-case response time?
- What's the CPU cost of doing I/O?
35. Additional Considerations (1 of 3)
- NFS parameters
- # of NFS mount points
- Multiple VMs using multiple mount points may give higher aggregate throughput with slightly higher CPU cost
- Export options on the NFS server affect performance
- iSCSI protocol parameters
- Header digest processing: slight impact on performance
- Data digest processing: turning it off may result in:
- Improved CPU utilization
- Slightly lower latencies
- Minor throughput improvement
- Actual outcome highly dependent on workload
36. Additional Considerations (2 of 3)
- NUMA specific
- If only one VM is doing heavy I/O, it may be beneficial to pin the VM and its memory to node 0
- If CPU usage is not a concern, no pinning is necessary
- On each VM reboot, ESX Server will place it on the next adjacent NUMA node
- Minor performance implications for certain workloads
- To avoid this movement, the VM should be affinitized using the VI Client
- SMP VMs
- For I/O workloads within an SMP VM that migrate frequently between VCPUs:
- Pin the guest thread/process to a specific VCPU
- Some versions of Linux have a kHz timer rate and may incur high overhead
37. Additional Considerations (3 of 3)
- CPU headroom
- Software-initiated iSCSI and NFS protocols can consume a significant amount of CPU under certain I/O patterns
- Small I/O workloads require a large amount of CPU; ensure that CPU saturation does not restrict the I/O rate
- Networking
- Avoid link over-subscription
- Ensure all networking parameters, and even the basic gigabit connectivity, are consistent across the full network path
- Intelligent use of VLANs or zoning to minimize traffic interference
38. General Troubleshooting Tips (1 of 3)
- Identify
- Components in the whole I/O path
- Possible issues at each layer in the path
- Check all hardware & software configuration parameters, in particular:
- Disk configurations and cache management policies on the storage server/array
- Network settings and routing topology
- Design experiments to isolate problems, such as:
- Cached runs
- Use a small file or logical device, or a physical host configured with RAM disks, minimizing physical disk effects
- Indicates the upper-bound throughput and I/O rate achievable
39. General Troubleshooting Tips (2 of 3)
- Run tests with a single outstanding I/O
- Easier for analysis of packet traces
- Throughput entirely dependent on I/O response times (see the sketch after this list)
- Micro-benchmark each layer in the I/O path
- Compare to non-virtualized, native performance results
- Collect data
- Guest OS data (but don't trust the guest-reported CPU %)
- esxtop data
- Storage server/array data: cache hit ratio, storage processor % busy, etc.
- Packet tracing with tools like tcpdump, Ethereal, Wireshark, etc.
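The sketch below (my own illustration, not from the original deck) shows why throughput with a single outstanding I/O is determined entirely by response time: by Little's law, IOps = outstanding I/Os / latency.

```python
# With one outstanding I/O, each request must complete before the next is
# issued, so IOps = outstanding_ios / latency and MBps follows directly.
def expected_throughput(io_size_kb, latency_ms, outstanding_ios=1):
    iops = outstanding_ios / (latency_ms / 1000.0)  # Little's law
    mbps = iops * io_size_kb / 1024.0
    return iops, mbps

# e.g., 128KB reads at 10ms with 1 outstanding I/O -> 100 IOps, 12.5 MBps
print(expected_throughput(128, 10.0))
```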
40. General Troubleshooting Tips (3 of 3)
- Analyze performance data
- Do any stats, e.g., throughput or latency, change drastically over time?
- Check esxtop data for anomalies, e.g., CPU spikes or excessive queueing
- Server/array stats
- Compare array stats with ESX stats
- Is the cache hit ratio reasonable? Is the storage processor overloaded?
- Network trace analysis
- Inspect packet traces to see if:
- NFS and iSCSI requests are processed in a timely manner
- I/O sizes issued by the guest match the transfer sizes over the wire
- Block addresses are aligned to appropriate boundaries (see the sketch after this list)
41. Isolating Performance Problems: Case Study 1 (1 of 3)
- Symptoms
- Throughput can reach Gigabit wire speed doing 128KB sequential reads from a 20GB LUN on an iSCSI array with a 2GB cache
- Throughput degrades for data sizes beyond 128KB
- From esxtop data
- CPU utilization is also lower for I/O sizes larger than 128KB
- CPU cost per I/O is in the expected range for all I/O sizes
42. Isolating Performance Problems: Case Study 1 (2 of 3)
- From esxtop or benchmark output
- I/O response times in the 10 to 20ms range for the problematic I/Os
- Indicates constant physical disk activity is required to serve the reads
- From network packet traces
- No retransmissions or packet loss observed, indicating no networking issue
- Packet time stamps indicate the array takes 10ms to 20ms to respond to a read request, with no delay in the ESX host
- From cached run results
- No throughput degradation above 128KB!
- Problem exists only for file sizes exceeding cache capacity
- Array appears to have cache-management issues with large sequential reads
43. Isolating Performance Problems: Case Study 1 (3 of 3)
- From native tests to the same array
- Same problem observed
- From the administration GUI of the array
- Read-ahead policies set to highly aggressive
- Is the policy appropriate for the workload?
- Solution
- Understand the performance characteristics of the array
- Experiment with different read-ahead policies
- Try turning off read-ahead entirely to get the baseline behavior
44. Isolating Performance Problems: Case Study 2 (1 of 4)
- Symptoms
- 1KB random write throughput much lower (< 10%) than sequential writes to a 4GB vmdk file located on an NFS server
- Even after an extensive warm-up period
- But very little difference in performance between random and sequential reads
- From the NFS server spec
- 3GB read/write cache
- Most data should be in cache after warming up
45. Isolating Performance Problems: Case Study 2 (2 of 4)
- From esxtop and application/benchmark data
- CPU utilization is lower, but CPU cost per I/O is mostly the same regardless of randomness
- Not likely a client-side (i.e., ESX host) issue
- Random write latency in the 20ms range
- Sequential write: < 1ms
- From NFS server stats
- Cache hit ratio much lower for random writes, even after warm-up
46. Isolating Performance Problems: Case Study 2 (3 of 4)
- From cached runs to a 100MB vmdk
- Random write latency almost matches sequential write
- Again, suggests that the issue is not in the ESX host
- From native tests
- Random and sequential write performance is almost the same
- From network packet traces
- Server responds to random writes in 10 to 20ms, sequential writes in < 1ms
- Offset in NFS WRITE requests is not aligned to a power-of-2 boundary
- Packet traces from native runs show correct alignment
47. Isolating Performance Problems: Case Study 2 (4 of 4)
- Question
- Why are sequential writes not affected?
- NFS server file system idiosyncrasies
- Manages cache memory at 4KB granularity
- Old blocks are not updated in place; writes go to new blocks
- Each < 4KB write incurs a read from the old block (illustrated in the sketch after this list)
- Aggressive read-ahead masks the read latency associated with sequential writes
- Solution
- Use a disk alignment tool in the guest OS to align the disk partition
- Alternatively, use an unformatted partition inside the guest OS
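To illustrate the read-modify-write effect described above, a minimal Python sketch (my own, with a hypothetical helper name) counting how many 4KB cache blocks a small write touches; every partially covered block forces a read of the old block first.

```python
# Number of 4KB cache blocks a write touches; each partially written block
# incurs a read of the old block under the server's 4KB cache granularity.
def blocks_touched(offset_bytes, size_bytes, block=4096):
    first = offset_bytes // block
    last = (offset_bytes + size_bytes - 1) // block
    return last - first + 1

print(blocks_touched(0, 1024))     # aligned 1KB write -> 1 block
print(blocks_touched(3584, 1024))  # unaligned 1KB write -> 2 blocks, 2 reads
```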
48. Summary and Takeaways
- IP-based storage performance in ESX is being constantly improved; key enhancements in ESX 3.5:
- Overall storage subsystem
- Networking
- Resource scheduling and management
- Optimized NUMA, multi-core, and large memory support
- IP-based network storage technologies are maturing
- Price/performance can be excellent
- Deployment and troubleshooting can be challenging
- Knowledge is key: server/array, networking, host, etc.
- Stay tuned for further updates from VMware
49. Questions?
- NFS & iSCSI Performance Characterization and Best Practices in ESX 3.5
- Priti Mishra & Bing Tsai
- VMware