P
US9405643B2ActiveUtilityPatentIndex 68

Multi-level lookup architecture to facilitate failure recovery

Assignee: DROPBOX INCPriority: Nov 26, 2013Filed: Nov 26, 2013Granted: Aug 2, 2016
Est. expiryNov 26, 2033(~7.4 yrs left)· nominal 20-yr term from priority
Inventors:COWLING JAMESMODZELEWSKI KEVIN P
G06F 16/2255G06F 16/184G06F 11/2094G06F 11/1471G06F 11/0727G06F 11/2097G06F 17/30212G06F 17/3033
68
PatentIndex Score
4
Cited by
19
References
21
Claims

Abstract

The disclosed embodiments relate to a data storage system that facilitates efficiently recovering from storage device failures. Upon receiving a request to retrieve a data block from the data storage system, the system uses a hash that identifies the data block to look up a bucket and an associated cell containing the data block. Note that the bucket aggregates a large number of data blocks and is located in the associated cell that comprises a set of object storage devices (OSDs). Within the cell, the system uses the bucket to look up an OSD that contains the bucket in a local bucket database (BDB) for the cell. Within the OSD, the system uses the bucket and the hash to look up an offset and a length for the data block in a write-ahead log that stores data blocks for the bucket. Finally, the system returns the data block from the determined offset.

Claims

exact text as granted — not AI-modified
What is claimed is: 
     
       1. A computer-implemented method for operating a data storage system, comprising:
 receiving a request to retrieve a data block from the data storage system, wherein the request includes a hash that functions as a global identifier for the data block; 
 using the hash to look up a bucket and an associated cell that contains the data block, wherein the bucket aggregates a large number of data blocks, wherein the bucket is located in the associated cell that comprises a set of object storage devices (OSDs), and wherein the lookup is performed in a hash database (HDB) for the data storage system; 
 within the cell, using the bucket to look up an OSD that contains the bucket, wherein the lookup is performed in a local bucket database (BDB) for the cell; 
 within the OSD, using the bucket and the hash to determine an offset and a length for the data block in a write-ahead log that stores data blocks for the bucket; and 
 returning the data block from the determined offset in the write-ahead log; 
 wherein an OSD comprises a server that comprises one or more storage devices. 
 
     
     
       2. The computer-implemented method of  claim 1 , further comprising:
 receiving a request to write a data block to the data storage system; 
 using the data block to compute a hash that functions as a global identifier for the data block; 
 selecting a writeable bucket and an associated cell for the data block; 
 within the associated cell, using the selected bucket to look up an OSD for the data block, wherein the lookup is performed in a local BDB for the selected cell; 
 within the OSD, appending the data block to a write-ahead log that stores data blocks for the bucket; and 
 updating the HDB to include an entry that maps the hash to the selected bucket and associated cell. 
 
     
     
       3. The computer-implemented method of  claim 1 , wherein upon detecting a failure associated with a bucket in a cell, the method further comprises:
 marking the bucket as non-writable; 
 performing a fast block-copy of the bucket to a new OSD in the cell; and 
 updating the BDB for the cell to indicate that the bucket is associated with the new OSD. 
 
     
     
       4. The computer-implemented method of  claim 1 , wherein the HDB comprises a sharded database that stores hash-to-bucket mappings for multiple cells. 
     
     
       5. The computer-implemented method of  claim 1 , wherein the BDB is stored in random access memory to facilitate performing bucket-to-OSD lookups without having to access a storage device. 
     
     
       6. The computer-implemented method of  claim 1 , wherein determining the offset for the data block in a write-ahead log comprises performing an in-memory lookup in the OSD without accessing a storage device. 
     
     
       7. The computer-implemented method of  claim 1 ,
 wherein each OSD is associated with a generation number that is incremented after a failure-recovery operation completes; and 
 wherein the generation number is used to prevent an OSD associated with an old generation number, and therefore possibly containing invalid data, from servicing read operations. 
 
     
     
       8. The computer-implemented method of  claim 1 , wherein a storage device corresponds to a disk drive. 
     
     
       9. A non-transitory computer-readable storage medium storing instructions that when executed by a computer cause the computer to perform a method for operating a data storage system, the method comprising:
 receiving a request to retrieve a data block from the data storage system, wherein the request includes a hash that functions as a global identifier for the data block; 
 using the hash to look up a bucket and an associated cell that contains the data block, wherein the bucket aggregates a large number of data blocks, wherein the bucket is located in the associated cell that comprises a set of object storage devices (OSDs), and wherein the lookup is performed in a hash database (HDB) for the data storage system; 
 within the cell, using the bucket to look up an OSD that contains the bucket, wherein the lookup is performed in a local bucket database (BDB) for the cell; 
 within the OSD, using the bucket and the hash to determine an offset and a length for the data block in a write-ahead log that stores data blocks for the bucket; and 
 returning the data block from the determined offset in the write-ahead log; 
 wherein an OSD comprises a server that comprises one or more storage devices. 
 
     
     
       10. The non-transitory computer-readable storage medium of  claim 9 , wherein the method further comprises:
 receiving a request to write a data block to the data storage system; 
 using the data block to compute a hash that functions as a global identifier for the data block; 
 selecting a writeable bucket and an associated cell for the data block; 
 within the associated cell, using the selected bucket to look up an OSD for the data block, wherein the lookup is performed in a local BDB for the selected cell; 
 within the OSD, appending the data block to a write-ahead log that stores data blocks for the bucket; and 
 updating the HDB to include an entry that maps the hash to the selected bucket and associated cell. 
 
     
     
       11. The non-transitory computer-readable storage medium of  claim 9 , wherein upon detecting a failure associated with a bucket in a cell, the method further comprises:
 marking the bucket as non-writable; 
 performing a fast block-copy of the bucket to a new OSD in the cell; and 
 updating the BDB for the cell to indicate that the bucket is associated with the new OSD. 
 
     
     
       12. The non-transitory computer-readable storage medium of  claim 9 , wherein the HDB comprises a sharded database that stores hash-to-bucket mappings for multiple cells. 
     
     
       13. The non-transitory computer-readable storage medium of  claim 9 , wherein the BDB is stored in random access memory to facilitate performing bucket-to-OSD lookups without having to access a storage device. 
     
     
       14. The non-transitory computer-readable storage medium of  claim 9 , wherein determining the offset for the data block in a write-ahead log comprises performing an in-memory lookup in the OSD without accessing a storage device. 
     
     
       15. The non-transitory computer-readable storage medium of  claim 9 ,
 wherein each OSD is associated with a generation number that is incremented after a failure-recovery operation completes; and 
 wherein the generation number is used to prevent an OSD associated with an old generation number, and therefore possibly containing invalid data, from servicing read operations. 
 
     
     
       16. A data storage system, comprising:
 one or more zones, wherein each zone is associated with a data center; 
 wherein each zone is divided into a plurality of cells; 
 wherein each cell comprises a plurality of object storage devices (OSDs) for storing blocks of data, and a master for managing the OSDs and wherein the master comprises one or more processors and memory; and 
 wherein during operation, the data storage system is configured to,
 receive a request to retrieve a data block from the data storage system, wherein the request includes a hash that functions as a global identifier for the data block, 
 use the hash to look up a bucket and an associated cell that contains the data block, wherein the bucket aggregates a large number of data blocks and is located in the associated cell, and wherein the lookup is performed in a hash database (HDB) for the data storage system; 
 within the cell, use the bucket to look up an OSD that contains the bucket, wherein the lookup is performed in a local bucket database (BDB) for the cell; 
 within the OSD, use the bucket and the hash to determine an offset and a length for the data block in a write-ahead log that stores data blocks for the bucket; and 
 
 return the data block from the determined offset in the write-ahead log; 
 wherein an OSD comprises a server that comprises one or more storage devices. 
 
     
     
       17. The data storage system of  claim 16 , wherein during operation, the data storage system is additionally configured to:
 receive a request to write a data block to the data storage system; 
 use the data block to compute a hash that functions as a global identifier for the data block; 
 select a writeable bucket and an associated cell for the data block; 
 within the associated cell, use the selected bucket to look up an OSD for the data block, wherein the lookup is performed in a local BDB for the selected cell; 
 within the OSD, append the data block to a write-ahead log that stores data blocks for the bucket; and 
 update the HDB to include an entry that maps the hash to the selected bucket and associated cell. 
 
     
     
       18. The data storage system of  claim 16 , wherein upon detecting a failure associated with a bucket in a cell, the data storage system is configured to:
 mark the bucket as non-writable; 
 perform a fast block-copy of the bucket to a new OSD in the cell; and 
 update the BDB for the cell to indicate that the bucket is associated with the new OSD. 
 
     
     
       19. The data storage system of  claim 16 , wherein the HDB comprises a sharded database that stores hash-to-bucket mappings for multiple cells. 
     
     
       20. The data storage system of  claim 16 , wherein the BDB is stored in random access memory to facilitate performing bucket-to-OSD lookups without having to access a storage device. 
     
     
       21. The data storage system of  claim 16 , wherein while determining the offset for the data block in a write-ahead log, the OSD is configured to perform an in-memory lookup without accessing a storage device.

Cited by (0)

No later patents cite this yet.

References (0)

No backward citations on record.