Fingerprint lookup for deduplication
The SHA-2 hashing algorithm is used to generate the fingerprints of the data segments from backup streams. A unique SHA-2 fingerprint represents a unique data segment and is compared to a set of fingerprints representing data segments already in a data store. A lookup match means the data segment is already stored in the system; a lookup miss means the system does not have it and the corresponding data segment needs to be stored.
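The lookup described above can be sketched as follows. This is a minimal illustration, not the product's implementation: it assumes SHA-256 as the SHA-2 variant, and the names `fingerprint`, `store_segments`, and the in-memory `data_store` dictionary are hypothetical.

```python
import hashlib


def fingerprint(segment: bytes) -> str:
    """Return the SHA-256 digest of a data segment as a hex string.

    Assumption: SHA-256 stands in for whichever SHA-2 variant is used.
    """
    return hashlib.sha256(segment).hexdigest()


def store_segments(segments, data_store):
    """Deduplicate segments against data_store (hypothetical dict: fingerprint -> segment).

    Returns a (hits, misses) tuple.
    """
    hits = misses = 0
    for segment in segments:
        fp = fingerprint(segment)
        if fp in data_store:
            # Lookup match: the segment is already stored, so skip it.
            hits += 1
        else:
            # Lookup miss: the segment is new and must be stored.
            data_store[fp] = segment
            misses += 1
    return hits, misses
```

For example, passing three segments of which two are identical results in one hit and two misses, and only two segments end up in `data_store`.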
The set of fingerprints in memory, also known as the fingerprint cache, contains two sets of fingerprints for a given backup job:
The global fingerprint cache, which is indexed for fast queries and is maintained on the deduplication server side for as long as the deduplication service is running.
The job-based fingerprint cache, which is also indexed; it is created on the deduplication client side at the beginning of the job and released at the end of the job.
The fingerprints of the last image (by default the last full backup; optionally the last full backup plus subsequent incrementals) are fetched from the MSDP server to the OST pdplugin at the beginning of the job. Whether the deduplication happens on the OST pdplugin completely depends on whether the client-side cache is big enough to hold all the fingerprints from the last image. Any fingerprint lookup that misses in the client-side cache is forwarded to the MSDP server, even though the fingerprint may not exist on the server side either.
This two-level fingerprint cache provides high-performance lookups and reduces the memory footprint required on the server side, such as on a NetBackup appliance.
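The two-level lookup can be sketched as follows. This is a simplified model under stated assumptions, not the MSDP or pdplugin implementation: the class names `ServerCache` and `JobCache`, the `capacity` parameter (counted in fingerprints rather than bytes), and the string return values are all hypothetical.

```python
class ServerCache:
    """Hypothetical stand-in for the global fingerprint cache on the server."""

    def __init__(self, fingerprints=()):
        self.index = set(fingerprints)
        self.queries = 0  # counts lookups that reached the server

    def contains(self, fp):
        self.queries += 1
        return fp in self.index


class JobCache:
    """Hypothetical stand-in for the job-based cache on the client."""

    def __init__(self, server, last_image, capacity):
        self.server = server
        # Load the last image's fingerprints at job start. Assumption: if
        # they do not all fit, only the first `capacity` entries are kept.
        self.index = set(last_image[:capacity])

    def lookup(self, fp):
        if fp in self.index:
            return "client-hit"   # resolved without contacting the server
        if self.server.contains(fp):
            return "server-hit"   # client miss, found in the global cache
        return "miss"             # not stored anywhere; segment must be sent

    def release(self):
        # The job-based cache is released at the end of the job.
        self.index.clear()
```

In this model, a client cache that is too small for the last image pushes more lookups to the server, and a fingerprint for genuinely new data always costs one server query before it is reported as a miss.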