About the Cloud Catalyst cache
The administrator configures a local cache directory as part of configuring a Cloud Catalyst storage server. The primary function of the local cache directory (or Cloud Catalyst cache) is to allow the Cloud Catalyst to continue to deduplicate data even if the ingest rate from targeted backup and duplication jobs temporarily exceeds the available upload throughput to the destination cloud storage.
For example, if backup and duplication jobs transfer 10 TB of data per hour to the Cloud Catalyst storage server, and the Cloud Catalyst deduplicates the data at a ratio of 10:1, the resulting 1 TB of deduplicated data per hour may exceed an upload capacity of 0.7 TB per hour of writes to cloud storage. The cache allows the jobs to continue to send and process the data, assuming that at some point the incoming data rate slows. The Cloud Catalyst cache stores only the deduplicated data. Jobs are not marked as complete until all data is uploaded to the cloud.
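The arithmetic in this example can be sketched as a short calculation. The following is a rough illustration only: the function name and parameters are hypothetical, and the model assumes constant rates and that the entire cache is available for buffering, which ignores the watermark-based purging that the Cloud Catalyst actually performs.

```python
def cache_fill_hours(ingest_tb_per_hour, dedup_ratio, upload_tb_per_hour, cache_tb):
    """Estimate hours until the cache fills, or None if uploads keep pace.

    Hypothetical helper for illustration; not part of NetBackup.
    """
    # Deduplicated data written to the cache each hour (10 TB at 10:1 is 1 TB).
    deduplicated_tb_per_hour = ingest_tb_per_hour / dedup_ratio
    # Net growth is what arrives minus what is uploaded to the cloud.
    net_growth_tb_per_hour = deduplicated_tb_per_hour - upload_tb_per_hour
    if net_growth_tb_per_hour <= 0:
        return None  # Upload throughput keeps up; the cache does not fill.
    return cache_tb / net_growth_tb_per_hour


# Values from the example: 10 TB/hour ingest, 10:1 deduplication,
# 0.7 TB/hour upload, and the recommended 4 TB cache.
hours = cache_fill_hours(10, 10, 0.7, 4)
print(f"Cache absorbs the backlog for roughly {hours:.1f} hours")
```

Under these simplified assumptions, the cache grows by about 0.3 TB per hour, so the incoming data rate would need to slow within roughly half a day.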
While a Cloud Catalyst cache of 4 TB is recommended, a larger cache has the following benefits:
For restores:
If the data exists in the Cloud Catalyst cache, it is restored from the cache instead of the cloud. The larger the cache, the more deduplicated objects can reside in the cache.
For data with poor deduplication rates:
A larger cache may be required because poor deduplication ratios require larger amounts of data to be uploaded to the cloud.
For job windows that experience bursts of activity:
A larger cache can be helpful if frequent jobs are targeted to the Cloud Catalyst storage server within a narrow window of time.
While a larger cache can be beneficial, jobs are not marked as complete until all data is uploaded to the cloud. Data is uploaded from the cache to the cloud when an MSDP container file is full. This occurs soon after the backup or duplication job begins, but not immediately. Deduplication makes it possible for second and subsequent backup jobs to transfer substantially less data to the cloud, depending on the deduplication rate.
For example, 4 TB of cache is expected to manage 1 PB of data in the cloud without issue.
Note:
If you initiate a restore from Glacier or Glacier Deep Archive, NetBackup initiates a warming step. NetBackup does not proceed with the restore until all the data is available in S3 storage to be read.
The warming step is always performed with Amazon, even if the data is in the Cloud Catalyst cache. For storage classes other than Glacier and Glacier Deep Archive, the warming step is almost immediate with no meaningful delay. For Glacier and Glacier Deep Archive, the warming step may be immediate if files were previously warmed and are still in S3 Standard storage. Otherwise, it may take several minutes, hours, or days, depending on the settings in use.
The Cloud Catalyst manages the cache based on the configuration settings in the esfs.json file. Once the high watermark is reached, data is purged when the used space reaches the midpoint between HighWatermark and LowWatermark, that is, (HighWatermark + LowWatermark) / 2, and purging continues until LowWatermark is reached. If incoming data arrives faster than purging can maintain the watermark, jobs begin to fail. Administrators should not manually delete or purge the managed data in the cache storage unless directed to do so by NetBackup Technical Support.
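The midpoint rule can be illustrated with a small calculation. This is a sketch, not the Cloud Catalyst implementation: the function name is hypothetical, and the watermark values (90 and 50, treated here as percentages of cache capacity) are illustrative assumptions rather than documented defaults or units from esfs.json.

```python
def purge_thresholds(high_watermark, low_watermark):
    """Return (purge_start, purge_stop) for a pair of watermarks.

    Hypothetical helper for illustration; not part of NetBackup.
    Both watermarks must use the same unit.
    """
    if low_watermark >= high_watermark:
        raise ValueError("LowWatermark must be below HighWatermark")
    # Purging is triggered at the midpoint: (high + low) / 2.
    purge_start = (high_watermark + low_watermark) / 2
    # Purging continues until used space falls to LowWatermark.
    purge_stop = low_watermark
    return purge_start, purge_stop


# Assumed example values, not actual esfs.json defaults.
start, stop = purge_thresholds(high_watermark=90, low_watermark=50)
print(f"Purge starts at {start} and continues down to {stop}")
```

With these assumed values, the midpoint is 70, so purging would be triggered at 70 and would continue until used space falls to 50.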