About sampling and predictive cache
MSDP uses memory, up to the size configured in MaxCacheSize, to cache fingerprints for efficient deduplication lookup. A new fingerprint cache lookup scheme introduced in NetBackup release 10.1 reduces memory usage. It splits the current memory cache into two components: the sampling cache (S-cache) and the predictive cache (P-cache). S-cache caches a percentage of the fingerprints from each backup and is used to find similar data among the samples of previous backups for deduplication. P-cache caches the fingerprints that are most likely to be used in the immediate future for deduplication lookup.
At the start of a job, a small portion of the fingerprints from its last backup is loaded into P-cache as initial seeding. Fingerprint lookup is done against P-cache to find duplicates. Lookups that miss in P-cache are then searched against the S-cache samples to find possible matches in previous backup data. If a match is found, part of the matched backup's fingerprints is loaded into P-cache for future deduplication.
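The lookup flow described above can be sketched as a toy model. This is an illustration only, not the MSDP implementation: the class name, the every-Nth sampling rule, the prefetch window, and the oldest-first eviction are all assumptions made so that the example runs.

```python
from collections import OrderedDict


class TwoTierFingerprintCache:
    """Toy model of P-cache / S-cache lookup (hypothetical, not MSDP code)."""

    def __init__(self, p_capacity, sample_every=4, prefetch_window=4):
        self.p_cache = OrderedDict()  # fingerprint -> True, oldest entry first
        self.p_capacity = p_capacity
        self.s_cache = {}             # sampled fingerprint -> (backup_id, position)
        self.backups = {}             # backup_id -> ordered fingerprint list
        self.sample_every = sample_every
        self.prefetch_window = prefetch_window

    def _load(self, fingerprints):
        # Assumption: once capacity is reached, the oldest entries are dropped.
        for fp in fingerprints:
            self.p_cache[fp] = True
            self.p_cache.move_to_end(fp)
            if len(self.p_cache) > self.p_capacity:
                self.p_cache.popitem(last=False)

    def record_backup(self, backup_id, fingerprints):
        # Assumption: "a percentage of the fingerprints" is modeled as every Nth one.
        self.backups[backup_id] = list(fingerprints)
        for pos in range(0, len(fingerprints), self.sample_every):
            self.s_cache[fingerprints[pos]] = (backup_id, pos)

    def seed(self, backup_id, count):
        # Initial seeding: load a small portion of the last backup into P-cache.
        self._load(self.backups[backup_id][:count])

    def lookup(self, fp):
        if fp in self.p_cache:
            return "p-hit"
        sample = self.s_cache.get(fp)
        if sample is None:
            return "miss"
        # S-cache match: load part of the matched backup into P-cache.
        backup_id, pos = sample
        self._load(self.backups[backup_id][pos:pos + self.prefetch_window])
        return "s-hit"


cache = TwoTierFingerprintCache(p_capacity=16)
cache.record_backup("b1", [f"fp{i}" for i in range(12)])
cache.seed("b1", 2)
print(cache.lookup("fp0"), cache.lookup("fp4"), cache.lookup("fp5"))
```

In this model, `fp5` is found in P-cache only because the earlier S-cache match on `fp4` pulled its neighbors in, which is the locality effect the scheme relies on.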
The S-cache and P-cache fingerprint lookup method is enabled for local and cloud storage volumes in MSDP cluster deployments, including Flex Scale, AKS, and EKS deployments. This method is also enabled for cloud-only volumes on the MSDP non-cluster platforms: NetBackup Appliance, Flex, and BYO. On the platforms with cloud-only volume support, the local volume still uses the original cache lookup method. You can find the S-cache and P-cache configuration parameters in the Cache section of the configuration file contentrouter.cfg.
The default values for non-cluster deployments:
| Configuration | Default value |
|---|---|
| MaxCacheSize | 50% |
| MaxPredictiveCacheSize | 20% (10% in NetBackup Appliance) |
| MaxSamplingCacheSize | 5% (10% in NetBackup Appliance) |
| EnableLocalPredictiveSamplingCache in | false |
| EnableLocalPredictiveSamplingCache in | false |
The default values for cluster deployments:
| Configuration | Default value |
|---|---|
| MaxCacheSize | 512MiB |
| MaxPredictiveCacheSize | 40% |
| MaxSamplingCacheSize | 20% |
| EnableLocalPredictiveSamplingCache in | true |
| EnableLocalPredictiveSamplingCache in | true |
For MSDP cluster deployments, the local volume and the cloud volume share the same S-cache and P-cache size. For non-cluster deployments, S-cache and P-cache are used only for the cloud volume, and MaxCacheSize is still used for the local volume. If the system is not used for cloud backup, MaxPredictiveCacheSize and MaxSamplingCacheSize can be set to a small value, for example, 1% or 128MiB, and MaxCacheSize can be set to a large value, for example, 50% or 60%. Similarly, if the system is used for cloud backups only, MaxCacheSize can be set to 1% or 128MiB, and MaxPredictiveCacheSize and MaxSamplingCacheSize can be set to a larger value.
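As an illustration of the first case (a non-cluster system that is not used for cloud backup), the sketch below parses a hypothetical Cache section and prints the three sizes. The `key=value` layout of the fragment is an assumption about the file format, the helper name `read_cache_settings` is made up for this example, and the values are the examples from the paragraph above rather than recommendations.

```python
import configparser

# Hypothetical fragment: the key=value layout is assumed, and the values are
# the "not used for cloud backup" examples from the text.
SAMPLE = """
[Cache]
MaxCacheSize=60%
MaxPredictiveCacheSize=1%
MaxSamplingCacheSize=1%
"""


def read_cache_settings(text):
    """Return the Cache section of the given text as a dict of strings."""
    # interpolation=None keeps '%' literal; the default would reject "60%".
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep parameter names case-sensitive
    parser.read_string(text)
    return dict(parser["Cache"])


settings = read_cache_settings(SAMPLE)
for name, value in settings.items():
    print(f"{name} = {value}")
```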
The S-cache size is determined by the back-end MSDP capacity, that is, the number of fingerprints from the back-end data. Assuming an average segment size of 32KB, the S-cache size is about 100MB per TB of back-end capacity. The P-cache size is determined by the number of concurrent jobs and by the data locality, or working set, of the incoming data. With a working set of 250MB per stream (about 5 million fingerprints), for example, 100 concurrent streams need a minimum of 25GB of memory (100*250MB). The working set can be larger for certain applications with multiple streams and large data sets. Because P-cache is used for fingerprint deduplication lookup, and all fingerprints that are loaded into P-cache stay there until its allocated capacity is reached, a larger P-cache gives a better potential lookup hit rate at the cost of more memory. Under-sizing S-cache or P-cache leads to reduced deduplication rates, and over-sizing increases the memory cost.
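The sizing rules of thumb above reduce to two multiplications, sketched here. The function name and the 10TB back-end capacity are hypothetical; the 100MB per TB and 250MB per stream figures are the ones stated in the text and may not fit every workload.

```python
def estimate_cache_memory_mb(backend_capacity_tb, concurrent_streams,
                             s_cache_mb_per_tb=100,
                             working_set_mb_per_stream=250):
    """Return (S-cache MB, P-cache MB) from the rule-of-thumb figures."""
    # S-cache scales with back-end capacity (32KB average segment size assumed).
    s_cache_mb = backend_capacity_tb * s_cache_mb_per_tb
    # P-cache scales with the number of concurrent streams and their working set.
    p_cache_mb = concurrent_streams * working_set_mb_per_stream
    return s_cache_mb, p_cache_mb


# Hypothetical system: 10TB of back-end capacity, 100 concurrent streams.
s_mb, p_mb = estimate_cache_memory_mb(backend_capacity_tb=10,
                                      concurrent_streams=100)
print(f"S-cache: {s_mb} MB, P-cache: {p_mb} MB")  # P-cache: 25000 MB, i.e. 25GB
```

Raising `working_set_mb_per_stream` is the knob to adjust for applications with multiple streams and large data sets, where the text notes the working set can be larger.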