Applies to:
- Sun ZFS Storage 7120 - Version: Not Applicable
- Sun Storage 7210 Unified Storage System - Version: Not Applicable and later
- Sun Storage 7110 Unified Storage System - Version: Not Applicable and later
- Sun Storage 7410 Unified Storage System - Version: Not Applicable and later
- Sun Storage 7310 Unified Storage System - Version: Not Applicable and later
Information in this document applies to any platform.
This document provides information to help the user configure the system for NFS performance. The key configuration factors are:
- The choice of the ZFS pool RAID level
- Number of disks configured in the ZFS pool(s)
- Provision of Write-optimized SSDs (logzillas)
- Provision of Read-optimized SSDs (readzillas)
- Matching the ZFS pool blocksize to the client workload I/O size
- Size of (DRAM) Memory
- Number/speed of CPUs
In general, the biggest causes of NFS performance problems on the Series 7000 appliance when configuring/sizing the system are:
- The 'wrong' choice of RAID level
- No Log SSDs ... or too few
- Not enough disks configured in each pool
The choice of the ZFS pool RAID level
This is the most important decision when configuring the system for performance.
Choosing a RAID level:
- Double Parity RAID is the default 'Data Profile' type on the BUI storage configuration screen -> this is NOT a good choice for random workloads.
- If tuning for performance, always choose 'Mirrored'.
- For random and/or latency-sensitive workloads:
- use a mirrored pool (R1), assume ~100 IOPS per vdev, and configure sufficient Read SSDs and Log SSDs in it.
- budget for 'disk IOPS + 30%' to allow for cache warmup.
- RAIDZ2/Z3 provide excellent usable capacity for archives and filestores, but give poor performance for random workloads unless the working set can be held in ARC/L2ARC - do not use these RAID levels for random workloads.
- RAIDZ1 with narrow stripes is a reasonable compromise between mirrors and RAIDZ2/Z3, but is more vulnerable to disk failure.
- More than two storage pools per appliance are supported from release 2010.Q1, so a number of RAID levels can be configured per node.
Number of disks configured in the ZFS pool(s)
For good performance we need as many drives and Log SSDs as possible per pool, so avoid creating many small pools. If a customer wants separate pools for different projects/departments, create separate projects within one pool instead.
Multiple pools may be required if applications/usage dictates different RAID levels on the underlying storage.
- The more disks configured in a pool, the more IOPS are available -> keep the number of storage pools to a minimum (FEWER, LARGER pools better).
Provision of Write-optimized SSDs (logzillas)
Write-optimized (log) SSDs accelerate Synchronous Writes. Synchronous writes do not complete until the data is stored in non-volatile storage.
Log SSD vs DRAM:
- Sync Writes are buffered in the DRAM and the Log SSD.
- Non-sync writes are just buffered in DRAM.
- Written data is buffered in memory for up to 5 seconds.
Applications & Protocols that use Synchronous Writes:
- Email systems
- Writes over NFS are mostly asynchronous. Writes become synchronous for file attribute changes and on file closure, and NFS access can be forced to be synchronous if the file is opened with the O_DSYNC option.
- Writes over CIFS are NOT synchronous writes on the appliance unless the application requests them.
- Writes to Fibre Channel and iSCSI LUNs are ALWAYS synchronous on the appliance.
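As a minimal sketch of the O_DSYNC behaviour mentioned above: opening a file with O_DSYNC makes every write() synchronous, so it does not return until the data has reached stable storage. On an NFS mount this is what turns client writes into synchronous (Log-SSD-accelerated) operations on the appliance. The path here is a local temp file purely for illustration; O_DSYNC availability varies by platform.

```python
import os
import tempfile

# Open with O_DSYNC: each write() blocks until the data is on stable
# storage. Over NFS this forces synchronous writes on the appliance.
path = os.path.join(tempfile.mkdtemp(), "sync_demo")
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
try:
    os.write(fd, b"x" * 4096)  # committed to stable storage before returning
finally:
    os.close(fd)

print(os.path.getsize(path))  # -> 4096
```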
'Mirror' or 'Stripe' Log SSDs configuration ?
- If you have one Log SSD and it fails, there is potential data loss if you have a second failure before ZFS flushes the data to disk.
- Mirroring Log SSDs can maintain synchronous write performance after the failure of one Log SSD device.
- Striping Log SSD devices aggregates the performance of the SSDs together, but the 'stripe' will fail if one SSD device fails.
Log SSD recommendations:
- Provision one log SSD per 100 MB/s of synchronous writes, or per 3300 x 4 KB IOPS.
- Log SSDs are required when applications or protocols performing synchronous writes are going to be used - omit them and performance issues may result.
- Log SSDs are recommended when NFS or iSCSI is being used.
- Always configure a minimum of TWO log SSDs in 7210/7310/7410.
- When sizing, take into consideration whether the log SSDs are mirrored or striped.
- It is expensive and physically disruptive to 'retrofit' Log SSDs into a production system.
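The recommendations above can be sketched as a simple sizing calculation, assuming one Log SSD per 100 MB/s of synchronous writes or per 3300 x 4 KB IOPS, a minimum of two devices, and doubling the count when the Log SSDs are mirrored (the function name and headroom choices are illustrative, not an official sizing tool).

```python
import math

def log_ssds_needed(sync_mb_per_s, sync_4k_iops, mirrored=False):
    """Rough Log SSD count from the rules of thumb in this document."""
    by_bw = math.ceil(sync_mb_per_s / 100)    # 100 MB/s per Log SSD
    by_iops = math.ceil(sync_4k_iops / 3300)  # 3300 x 4 KB IOPS per Log SSD
    count = max(by_bw, by_iops, 2)            # minimum of TWO recommended
    return count * 2 if mirrored else count   # mirroring doubles devices

print(log_ssds_needed(250, 5000))                 # -> 3 (bandwidth-bound)
print(log_ssds_needed(250, 5000, mirrored=True))  # -> 6
```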
Provision of Read-optimized SSDs (readzillas)
The aim of read-optimized SSDs is to enhance read performance by accelerating ZFS caching.
Read SSD recommendations:
- No read SSDs are supported in a 7110/7210.
- Read SSDs can be added (non-disruptively) to a 7310/7410.
- Each read SSD can service ~3100 x 8 KB IOPS.
- While the read SSD is being written to (populated from the ARC), it is not available for reading -> need two read SSDs to make sure at least one is available for reads.
- Configure a minimum of two read SSDs in 7310/7410.
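The same rule-of-thumb arithmetic applies to Read SSDs: divide the expected 8 KB random read rate by the ~3100 IOPS each device can service, and never go below the two-device minimum (so one device can serve reads while the other is being populated). This helper is an illustrative sketch, not an official sizing tool.

```python
import math

def read_ssds_needed(read_8k_iops):
    """Read SSD count from the ~3100 IOPS/device figure, minimum two."""
    return max(math.ceil(read_8k_iops / 3100), 2)

print(read_ssds_needed(10000))  # -> 4
print(read_ssds_needed(1500))   # -> 2 (below one device, minimum applies)
```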
Matching the ZFS pool blocksize to the client workload I/O size
Notes on blocksize and file re-writing:
- Many desktop applications completely re-write files when saving them.
- Some applications update individual blocks in files repeatedly, eg. databases.
- In that case, if the I/O size is smaller than the filesystem blocksize, read-modify-writes can occur (regardless of RAID level). For example, to write 64 KB to a file on a filesystem with a 128 KB blocksize, the 128 KB block must be read before it is updated and re-written.
- When configuring the ZFS pool/shares blocksize, attempt to match it to the actual client workload I/O size.
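The read-modify-write penalty described above can be made concrete with a small sketch (the function name and return format are illustrative): a partial-block update costs a full-block read plus a full-block write, while an aligned full-block write costs only the write.

```python
def rmw_io_kb(write_kb, block_kb):
    """Total KB moved for one client write against a given blocksize."""
    if write_kb >= block_kb:
        return ("write only", block_kb)
    # partial-block update: read the whole block, modify, re-write it
    return ("read-modify-write", block_kb + block_kb)

print(rmw_io_kb(64, 128))   # -> ('read-modify-write', 256)
print(rmw_io_kb(128, 128))  # -> ('write only', 128)
```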
Size of (DRAM) Memory
Memory is used as a 'cache' for data blocks (ARC).
- Attempt to size the memory configuration so that the application 'working set' fits into the appliance memory (only applicable where the application has a 'working set' eg. databases, VDI).
- 64 GB should be viewed as a minimum memory configuration for production use.
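A minimal sketch of the DRAM sizing advice above, assuming a working set that should fit in the ARC and respecting the 64 GB production minimum. The 1.25 headroom factor is an assumption for illustration (the ARC also holds metadata and other cached data).

```python
def dram_gb_recommended(working_set_gb, headroom=1.25):
    """DRAM size so the working set fits in ARC, with the 64 GB floor."""
    return max(64, int(working_set_gb * headroom))

print(dram_gb_recommended(40))   # -> 64 (minimum applies)
print(dram_gb_recommended(200))  # -> 250
```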
Number/speed of CPUs
- Minimum of two CPUs recommended.
- Software features such as compression and replication make heavy use of CPU cores.
- Additional CPUs may be required to make all memory slots available.
Do we have a hardware performance 'bottleneck' ?
In some instances, the 'limiting factor' in system performance may be a component of the actual system configuration, i.e. the number/size/speed of:
- DRAM memory
- Network hardware
- Read SSDs (readzilla) configured on the server
- Write SSDs (logzilla) configured in the JBODs
Use the following tips for manually observing hardware bottlenecks in Analytics:
1. Buy more (or faster) CPUs when ...
=> multiple CPU cores are at 100% utilization for more than 15 minutes
Observe with: CPU: CPUs -> Broken down by percent utilization
CPU: Percent utilization -> Broken down by CPU identifier
- a) "More than 15 minutes" is a general guideline; your customer may want more/faster CPUs even if they peg their existing CPU cores for a shorter period if they have a frequent short-duration workload that is CPU-intensive.
- b) A single CPU core pegged at 100% utilization while the others are relatively idle is a likely indication of a single-threaded workload; encourage your customer to divide their workload among multiple clients or to investigate a multi-threaded implementation of their client application to better utilize the many CPU cores we offer in our controllers.
2. Buy/Use more network when ...
=> any network device is pushing 95% of its maximum throughput for more than 10 minutes
Observe with: NETWORK: Device bytes -> Broken down by device
- a) As with CPU, "more than 10 minutes" is a general guideline and may be adjusted if your customer is sensitive to shorter-duration workloads that peg available network bandwidth.
- b) A 1Gb device can push ~120 MBytes/sec.
- c) A 10Gb device can push ~1.20 GBytes/sec.
- d) Aggregating existing datalinks may be an option for expanding bandwidth without purchasing additional hardware (if one or more relatively idle datalinks are already available in the system).
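The 95%-of-maximum rule above can be expressed as a quick check, using the ~120 MBytes/sec (1Gb) and ~1200 MBytes/sec (10Gb) practical maxima quoted above (the device-kind keys are illustrative labels, not Analytics identifiers).

```python
# Practical device maxima from this document, in MBytes/sec.
PRACTICAL_MAX_MB_S = {"1g": 120, "10g": 1200}

def is_network_bottleneck(device_kind, observed_mb_s):
    """True if sustained throughput is at >= 95% of the device maximum."""
    return observed_mb_s >= 0.95 * PRACTICAL_MAX_MB_S[device_kind]

print(is_network_bottleneck("1g", 118))   # -> True  (118 >= 114)
print(is_network_bottleneck("10g", 900))  # -> False (900 < 1140)
```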
3. Buy more DRAM when ...
=> ARC accesses for data and/or metadata hit 75-97% (as compared to misses)
=> ARC access hits for data/metadata are significantly greater than prefetch hits
=> ARC is being accessed at least 10,000 times per second for more than 10 minutes
Observe with: CACHE: ARC accesses -> Broken down by hit/miss
- The first condition (that we're hitting more than missing) shows that the ARC is actually providing us benefit by storing data or metadata the applications want; the second shows that the majority of the ARC accesses are for real applications -- not just the prefetch mechanism; and the third shows that we're actually hitting DRAM a bit -- not just an idle system.
4. Buy your first Readzilla(s), if you have none to start, when ...
=> there are at least 1500 L2ARC-eligible ARC access misses for data and/or metadata per second for ~24 hours
=> an active filesystem or LUN has a ZFS recordsize of 32k or smaller
Observe with: CACHE: ARC accesses -> Broken down by hit/miss ... then drill down on data/metadata misses by L2ARC eligibility
- Our current Readzillas can do about 3100 8k reads per second, so the threshold of 1500 indicates a reasonable return on investment at about 50% capacity of a single Readzilla, or 25% capacity for two (a single Readzilla is not recommended).
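The arithmetic behind the 1500-misses/second threshold above is simple enough to show directly: at ~3100 8k reads per second per Readzilla, 1500 eligible misses would load one device to roughly half its capacity, or two devices (the recommended minimum) to roughly a quarter each.

```python
per_device_iops = 3100   # ~8k reads/sec one Readzilla can service
eligible_misses = 1500   # L2ARC-eligible ARC misses/sec threshold

# Utilization of one device vs. the recommended pair, as percentages.
print(round(100 * eligible_misses / per_device_iops))        # -> 48
print(round(100 * eligible_misses / (2 * per_device_iops)))  # -> 24
```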
5. Buy MORE Readzillas when ...
=> existing Readzillas are 90% utilized for more than 10 minutes
Observe with: DISK: Percent utilization -> Broken down by disk ... then drill down on a Readzilla as a raw statistic
- a) Don't draw any conclusions from the initial "percent utilization" statistic for all disks; be sure to drill down on a Readzilla as a raw statistic.
- b) You can identify which chassis and slots contain Readzillas from the Maintenance -> Hardware context, then identify those disks in analytics.
6. Buy your first Logzilla(s), if you have none to start, when ...
=> the sum of iSCSI writes, FC writes, and NFS synchronous operations is at least 1000 per second for at least 15 minutes
=> there are at least 100 NFS commits per second for 15 minutes
Observe with: PROTOCOL: <Protocol> -> Broken down by type of operation
- a) "At least 15 minutes" is a general guideline; the customer may want Logzillas even sooner if they have a performance-sensitive, short-duration synchronous write workload.
- b) Unless "Write cache enabled" is turned on (not the default), FC and iSCSI writes will all be synchronous, so they will benefit from Logzilla.
- c) Common synchronous NFS operations include 'commit', 'mkdir', 'create', and 'write' when the stable_how field is set to DATA_SYNC or FILE_SYNC (see the NFS RFCs for details).
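The two first-Logzilla triggers above can be sketched as a single check (the function and its parameter names are illustrative; the rates are per-second figures sustained for the 15-minute guideline):

```python
def needs_first_logzilla(iscsi_writes, fc_writes, nfs_sync_ops, nfs_commits):
    """True if either first-Logzilla threshold from this document is met."""
    combined_sync = iscsi_writes + fc_writes + nfs_sync_ops
    return combined_sync >= 1000 or nfs_commits >= 100

print(needs_first_logzilla(400, 300, 350, 20))  # -> True (1050 sync/s)
print(needs_first_logzilla(100, 0, 200, 50))    # -> False
```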
7. Buy MORE Logzillas when ...
=> existing Logzillas are 90% utilized for more than 10 minutes
Observe with: DISK: Percent utilization -> Broken down by disk ... then drill down on a Logzilla as a raw statistic
- a) Don't draw any conclusions from the initial "percent utilization" statistic for all disks; be sure to drill down on a Logzilla as a raw statistic.
- b) You can identify which chassis and slots contain Logzillas from the Maintenance -> Hardware context, then identify those disks in analytics.
8. Buy more spinning disks when ...
=> at least 50% of existing drives are at least 70% utilized for over 30 minutes
Observe with: DISK: Disks -> Broken down by percent utilization
- a) As with the other bottlenecks, "over 30 minutes" is a general guideline; if the customer has short-duration workloads that are bottlenecked on disk utilization, they may be interested in improving that regardless.
- b) Disks can be over-utilized by making a poor choice for RAID profile and/or ZFS record size as well. It may be possible to reduce existing disk utilization by moving from raidz to mirrored profiles (especially for random read workloads exceeding help provided by ARC and L2ARC) and/or matching ZFS record sizes to client I/O sizes (especially for small-block I/O, since we default to 128k).