US11403258B2 · Active · Utility · PatentIndex 62

Generating hexadecimal trees to compare file sets

Assignee: EMC IP HOLDING CO LLC · Priority: Aug 4, 2020 · Filed: Aug 4, 2020 · Granted: Aug 2, 2022
Est. expiry: Aug 4, 2040 (~14.1 yrs left) · nominal 20-yr term from priority
Inventors: SAAD YOSSEF; GLICK ITAY
G06F 16/152, G06F 16/137, G06F 16/2246, G06F 16/184
PatentIndex Score: 62 · Cited by: 0 · References: 3 · Claims: 18

Abstract

First and second trees having leaves identified by hexadecimal values are generated. First files from a first file set are allocated across the first tree based on hashes of the first files. The hashes of the first files are translated into first leaf index values. Second files from a second file set are allocated across the second tree based on hashes of the second files. The hashes of the second files are translated into second leaf index values. The first and second leaf index values are compared to identify leaves that are the same between the first and second trees. A similarity index indicating a degree of similarity between the first and second sets of files is created based on the comparison.
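
The flow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. Several details are assumptions made only for illustration: SHA-256 as the hash function, 256 leaves identified by the two-digit values "00" through "ff", allocation by the leading two hex digits of each hash, and the hypothetical function names `leaf_index_values` and `similarity_index`. The XOR combination and the percentage-style result follow the options described in the dependent claims.

```python
import hashlib

# Assumption: 256 leaves, identified by two-digit hexadecimal values "00".."ff".
LEAF_IDS = [f"{i:02x}" for i in range(256)]


def leaf_index_values(file_contents):
    """Return a dict mapping each leaf id to the XOR of the hashes allocated to it."""
    leaves = {leaf_id: 0 for leaf_id in LEAF_IDS}
    for data in file_contents:
        digest = hashlib.sha256(data).hexdigest()  # assumed hash function
        leaf_id = digest[:2]                       # assumed: match the leading two hex digits
        leaves[leaf_id] ^= int(digest, 16)         # fold this file's hash into the leaf value
    return leaves


def similarity_index(first_files, second_files):
    """Percentage of corresponding leaves whose index values are equal."""
    first = leaf_index_values(first_files)
    second = leaf_index_values(second_files)
    same = sum(1 for leaf_id in LEAF_IDS if first[leaf_id] == second[leaf_id])
    return 100.0 * same / len(LEAF_IDS)


if __name__ == "__main__":
    set_a = [b"alpha", b"beta", b"gamma"]
    set_b = [b"gamma", b"alpha", b"beta"]   # same files, different order
    print(similarity_index(set_a, set_b))   # 100.0, since XOR is order-independent
```

In this sketch a leaf that is empty in both trees counts as the same, so two small file sets score high even when they share few files; whether empty leaves should count is a design choice the sketch does not settle.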

Claims

exact text as granted — not AI-modified
What is claimed is: 
     
       1. A method of comparing first files in a first set of files with second files in a second set of files comprising:
 generating first and second trees comprising leaves identified by hexadecimal values; 
 allocating the first files across leaves of the first tree based on hashes of the first files and the hexadecimal values identifying the leaves of the first tree; 
 translating the hashes for the allocated first files into first leaf index values, each first leaf index value being associated with a respective leaf of the first tree and representing respective files from the first set of files that have been allocated to the respective leaf of the first tree; 
 allocating the second files across leaves of the second tree based on hashes of the second files and the hexadecimal values identifying the leaves of the second tree; 
 translating the hashes for the allocated second files into second leaf index values, each second leaf index value being associated with a respective leaf of the second tree and representing respective files from the second set of files that have been allocated to the respective leaf of the second tree; 
 comparing the first leaf index values associated with leaves of the first tree with the second leaf index values associated with corresponding leaves of the second tree to identify leaves that are the same between the first and second trees; and 
 creating, from the comparison, a similarity index indicating a degree of similarity between the first and second sets of files. 
 
     
     
       2. The method of  claim 1  wherein the allocating the first files further comprises:
 matching at least a part of the hashes for the first files in the first set of files to the hexadecimal values identifying the leaves of the first tree; and 
 wherein the allocating the second files further comprises: 
 matching at least a part of the hashes for the second files in the second set of files to the hexadecimal values identifying the leaves of the second tree. 
 
     
     
       3. The method of  claim 1  wherein the first leaf index values and the second leaf index values comprise a fixed-length. 
     
     
       4. The method of  claim 1  wherein the similarity index comprises a percentage value of leaves between the first and second trees having the same first and second leaf index values. 
     
     
       5. The method of  claim 1  wherein the first and second trees comprise at least an upper level having nodes and a lower level having the leaves branching from the upper level of nodes, the upper level of nodes being identified by a single-digit hexadecimal value, and the lower level of leaves being identified by a two-digit hexadecimal value. 
     
     
       6. The method of  claim 1  wherein the translating the hashes for the allocated first files into first leaf index values further comprises:
 applying an XOR function to hashes of files from the first set of files allocated to each respective leaf of the first tree; and 
 wherein the translating the hashes for the allocated second files further comprises: 
 applying the XOR function to hashes of files from the second set of files allocated to each respective leaf of the second tree. 
 
     
     
       7. A system for comparing first files in a first set of files with second files in a second set of files comprising: a processor; and memory configured to store one or more sequences of instructions which, when executed by the processor, cause the processor to carry out the steps of:
 generating first and second trees comprising leaves identified by hexadecimal values; 
 allocating the first files across leaves of the first tree based on hashes of the first files and the hexadecimal values identifying the leaves of the first tree; 
 translating the hashes for the allocated first files into first leaf index values, each first leaf index value being associated with a respective leaf of the first tree and representing respective files from the first set of files that have been allocated to the respective leaf of the first tree; 
 allocating the second files across leaves of the second tree based on hashes of the second files and the hexadecimal values identifying the leaves of the second tree; 
 translating the hashes for the allocated second files into second leaf index values, each second leaf index value being associated with a respective leaf of the second tree and representing respective files from the second set of files that have been allocated to the respective leaf of the second tree; 
 comparing the first leaf index values associated with leaves of the first tree with the second leaf index values associated with corresponding leaves of the second tree to identify leaves that are the same between the first and second trees; and 
 creating, from the comparison, a similarity index indicating a degree of similarity between the first and second sets of files. 
 
     
     
       8. The system of  claim 7  wherein the allocating the first files further comprises:
 matching at least a part of the hashes for the first files in the first set of files to the hexadecimal values identifying the leaves of the first tree; and 
 wherein the allocating the second files further comprises: 
 matching at least a part of the hashes for the second files in the second set of files to the hexadecimal values identifying the leaves of the second tree. 
 
     
     
       9. The system of  claim 7  wherein the first leaf index values and the second leaf index values comprise a fixed-length. 
     
     
       10. The system of  claim 7  wherein the similarity index comprises a percentage value of leaves between the first and second trees having the same first and second leaf index values. 
     
     
       11. The system of  claim 7  wherein the first and second trees comprise at least an upper level having nodes and a lower level having the leaves branching from the upper level of nodes, the upper level of nodes being identified by a single-digit hexadecimal value, and the lower level of leaves being identified by a two-digit hexadecimal value. 
     
     
       12. The system of  claim 7  wherein the translating the hashes for the allocated first files into first leaf index values further comprises:
 applying an XOR function to hashes of files from the first set of files allocated to each respective leaf of the first tree; and 
 wherein the translating the hashes for the allocated second files further comprises: 
 applying the XOR function to hashes of files from the second set of files allocated to each respective leaf of the second tree. 
 
     
     
       13. A computer program product, comprising a non-transitory computer-readable medium having a computer-readable program code embodied therein, the computer-readable program code adapted to be executed by one or more processors to implement a method of comparing first files in a first set of files with second files in a second set of files, the method comprising:
 generating first and second trees comprising leaves identified by hexadecimal values; 
 allocating the first files across leaves of the first tree based on hashes of the first files and the hexadecimal values identifying the leaves of the first tree; 
 translating the hashes for the allocated first files into first leaf index values, each first leaf index value being associated with a respective leaf of the first tree and representing respective files from the first set of files that have been allocated to the respective leaf of the first tree; 
 allocating the second files across leaves of the second tree based on hashes of the second files and the hexadecimal values identifying the leaves of the second tree; 
 translating the hashes for the allocated second files into second leaf index values, each second leaf index value being associated with a respective leaf of the second tree and representing respective files from the second set of files that have been allocated to the respective leaf of the second tree; 
 comparing the first leaf index values associated with leaves of the first tree with the second leaf index values associated with corresponding leaves of the second tree to identify leaves that are the same between the first and second trees; and 
 creating, from the comparison, a similarity index indicating a degree of similarity between the first and second sets of files. 
 
     
     
       14. The computer program product of  claim 13  wherein the allocating the first files further comprises:
 matching at least a part of the hashes for the first files in the first set of files to the hexadecimal values identifying the leaves of the first tree; and 
 wherein the allocating the second files further comprises: 
 matching at least a part of the hashes for the second files in the second set of files to the hexadecimal values identifying the leaves of the second tree. 
 
     
     
       15. The computer program product of  claim 13  wherein the first leaf index values and the second leaf index values comprise a fixed-length. 
     
     
       16. The computer program product of  claim 13  wherein the similarity index comprises a percentage value of leaves between the first and second trees having the same first and second leaf index values. 
     
     
       17. The computer program product of  claim 13  wherein the first and second trees comprise at least an upper level having nodes and a lower level having the leaves branching from the upper level of nodes, the upper level of nodes being identified by a single-digit hexadecimal value, and the lower level of leaves being identified by a two-digit hexadecimal value. 
     
     
       18. The computer program product of  claim 13  wherein the translating the hashes for the allocated first files into first leaf index values further comprises:
 applying an XOR function to hashes of files from the first set of files allocated to each respective leaf of the first tree; and 
 wherein the translating the hashes for the allocated second files further comprises: 
 applying the XOR function to hashes of files from the second set of files allocated to each respective leaf of the second tree.
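
As an illustration of the two-level tree, the fixed-length leaf index values, and the XOR translation recited in the dependent claims, the following Python sketch builds 16 upper-level nodes identified by a single hex digit, each with 16 leaves identified by two hex digits. It is a hedged sketch, not the claimed implementation: SHA-256, the prefix-matching rule, the 64-character index width, and the helper names `build_tree`, `allocate`, `leaf_index`, and `differing_leaves` are all assumptions introduced here.

```python
import hashlib

HEX_DIGITS = "0123456789abcdef"


def build_tree():
    """Upper level: nodes '0'..'f'. Lower level: 16 leaves per node, e.g. 'a0'..'af'."""
    return {node: {node + d: [] for d in HEX_DIGITS} for node in HEX_DIGITS}


def allocate(tree, file_contents):
    """Place each file's hash in the leaf whose id matches the hash's first two hex digits."""
    for data in file_contents:
        digest = hashlib.sha256(data).hexdigest()  # assumed hash function
        tree[digest[0]][digest[:2]].append(digest)
    return tree


def leaf_index(digests):
    """XOR the hashes in a leaf; the result is always 64 hex characters."""
    value = 0
    for digest in digests:
        value ^= int(digest, 16)
    return f"{value:064x}"


def differing_leaves(first_tree, second_tree):
    """List the ids of corresponding leaves whose index values differ."""
    return [
        leaf_id
        for node in HEX_DIGITS
        for leaf_id in first_tree[node]
        if leaf_index(first_tree[node][leaf_id]) != leaf_index(second_tree[node][leaf_id])
    ]


if __name__ == "__main__":
    first = allocate(build_tree(), [b"one", b"two"])
    second = allocate(build_tree(), [b"one"])
    print(differing_leaves(first, second))  # ids of the leaves that changed
```

Because XOR is commutative and associative, a leaf's index value does not depend on the order in which files were allocated. One consequence of this particular sketch is that two files with identical content in the same set cancel each other out in the XOR.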

Cited by (0)

No later patents cite this yet.

References (0)

No backward citations on record.