Compression of small strings
Abstract
A method for compressing a set of small strings may include calculating n-gram frequencies for a plurality of n-grams over the set of small strings, selecting a subset of n-grams from the plurality of n-grams based on the calculated n-gram frequencies, defining a mapping table that maps each n-gram of the subset of n-grams to a unique code, and compressing the set of small strings by replacing n-grams within each small string in the set of small strings with corresponding unique codes from the mapping table. The method may use linear optimization to select a subset of n-grams that achieves a maximum space saving amount over the set of small strings for inclusion in the mapping table. The unique codes may be variable-length one or two byte codes. The set of small strings may be domain names.
Claims
exact text as granted — not AI-modifiedWhat is claimed is:
1. A computer-implemented method for compressing a set of small strings, the method comprising:
calculating, by a processor, n-gram frequencies for a plurality of n-grams over the set of small strings;
selecting, by the processor, a subset of n-grams from the plurality of n-grams based on the calculated n-gram frequencies;
defining, by the processor, a mapping table that maps each n-gram of the subset of n-grams to a unique code, wherein the unique codes are variable-length one or two byte codes; and
compressing, by the processor, the set of small strings by replacing n-grams within each small string in the set of small strings with corresponding unique codes from the mapping table.
2. The method according to claim 1 , the selecting further comprising:
calculating, by the processor, a space saving amount for each n-gram of the plurality of n-grams as a product of (1) the n-gram frequency and (2) a difference between a character length of the n-gram and a length of the unique codes; and
selecting a number of the n-grams with the highest calculated space saving amount as the subset of n-grams.
3. The method according to claim 2 , wherein overlapping n-grams are removed from the selected subset of n-grams.
4. The method according to claim 1 , the selecting further comprising:
calculating, by the processor, a space saving amount for each n-gram of the plurality of n-grams as a product of (1) the n-gram frequency and (2) a difference between a character length of the n-gram and a length of the unique codes; and
using linear optimization to determine and select the subset of n-grams from the plurality of n-grams that achieves a maximum space saving amount over the set of small strings;
wherein constraints for the linear optimization include as acting only one n-gram from a set of overlapping n-grams.
5. The method according to claim 1 , the compressing further comprising:
for each small string in the set of small strings, replacing n-grams within the small string with corresponding unique codes from the mapping table starting with the longest n-gram appearing in both the small string and the mapping table first.
6. The method according to claim 1 , wherein the set of small strings is a set of domain names.
7. A system for compressing a set of strings, the system comprising:
a processor; and
a memory connected to the processor, the memory storing instructions to direct the processor to perform operations comprising:
calculating n-gram frequencies for a plurality of n-grams over the set of small strings;
selecting a subset of n-grams from the plurality of n-grams based on the calculated n-gram frequencies;
defining a mapping table that maps each n-gram of the subset of n-grams to a unique code, wherein the unique codes are variable-length one or two byte codes; and
compressing the set of small strings by replacing n-grams within each small string in the set of small strings with corresponding unique codes from the mapping table.
8. The system according to claim 7 , the selecting further comprising:
calculating a space saving amount for each n-gram of the plurality of n-grams as a product of (1) the n-gram frequency and (2) a difference between a character length of the n-gram and a length of the unique codes; and
using linear optimization to determine and select the subset of n-grams from the plurality of n-grams that achieves a maximum space saving amount over the set of small strings;
wherein constraints for the linear optimization include selecting only one n-gram from a set of overlapping n-grams.
9. The system according to claim 7 , the selecting further comprising:
calculating a space saving amount for each n-gram of the plurality of n-grams as a product of (1) the n-gram frequency and (2) a difference between a character length of the n-gram and a length of the unique codes; and
selecting a number of the n-grams with the highest calculated space saving amount as the subset of n-grams.
10. The system according to claim 9 , wherein overlapping n-grams are removed from the selected subset of n-grams.
11. A non-transitory computer-readable storage medium storing instructions for compressing a set of small strings, the instructions causing one or more computer processors to perform operations according to a method, the method comprising:
calculating n-gram frequencies for a plurality of n-grams over the set of small strings;
selecting a subset of n-grams from the plurality of n-grams based on the calculated n-gram frequencies;
defining a mapping table that maps each n-gram of the subset of n-grams to a unique code, wherein the unique codes are variable-length one or two byte codes; and
compressing the set of small strings by replacing n-grams within each small string in the set of small strings with corresponding unique codes from the mapping table.
12. The non-transitory computer-readable storage medium according to claim 11 , the selecting further comprising:
calculating a space saving amount for each n-gram of the plurality of n-grams as a product of (1) the n-gram frequency and (2) a difference between a character length of the n-gram and a length of the unique codes; and
using linear optimization to determine and select the subset of n-grams from the plurality of n-grams that achieves a maximum space saving amount over the set of small strings;
wherein constraints for the linear optimization include selecting only one n-gram from a set of overlapping n-grams.
13. The non-transitory computer-readable storage medium according to claim 11 , the selecting further comprising:
calculating a space saving amount for each n-gram of the plurality of n-grams as a product of (1) the n-gram frequency and (2) a difference between a character length of the n-gram and a length of the unique codes; and
selecting a number of the n-grams with the highest calculated space saving amount as the subset of n-grams.
14. The non-transitory computer-readable storage medium according to claim 13 , wherein overlapping n-grams are removed from the selected subset of n-grams.
15. A computer-implemented method for compressing a set of small strings, the method comprising:
calculating, by a processor, n-gram frequencies for a plurality of n-grams over the set of small strings;
selecting, by the processor, a subset of n-grams from the plurality of n-grams based on the calculated n-gram frequencies;
defining, by the processor, a mapping table that maps each n-gram of the subset of n-grams to a unique code;
determining, by the processor, an optimum length for the unique codes, the determining including:
calculating a space saving amount over a subset of small strings from the set of small strings for each of at least two different unique code lengths, wherein the at least two different unique code lengths includes: (1) fixed single byte codes, (2) fixed 2-byte codes, and (3) variable-length one or two byte codes; and
selecting as the optimum length the unique code length with the maximum space saving amount over the subset of small strings, wherein the unique codes are the optimum length; and
compressing, by the processor, the set of small strings by replacing n-grams within each small string in the set of small strings with corresponding unique codes from the mapping table.
16. The method according to claim 15 , wherein the subset of small strings from the set of small strings includes the whole set of small strings.
17. A system for compressing a set of small strings, the system comprising:
a processor; and
a memory connected to the processor, the memory storing instructions to direct the processor to perform operations comprising:
calculating n-gram frequencies for a plurality of n-grams over the set of small strings;
selecting a subset of n-grams from the plurality of n-grams based on the calculated n-gram frequencies;
defining a mapping table that maps each n-gram of the subset of n-grams to a unique code;
determining an optimum length for the unique codes, the determining including:
calculating a space saving amount over a subset of small strings from the set of small strings for each of at least two different unique code lengths, wherein the at least two different unique code lengths includes: (1) fixed single byte codes, (2) fixed 2-byte codes, and (3) variable-length one or two byte codes; and
selecting as the optimum length the unique code length with the maximum space saving amount over the subset of small strings, wherein the unique codes are the optimum length; and
compressing the set of small strings by replacing n-grams within each small string in the set of small strings with corresponding unique codes from the mapping table.
18. The system according to claim 17 , wherein the subset of small strings from the set of small strings includes the whole set of small strings.
19. A non-transitory computer-readable storage medium storing instructions for compressing a set of small strings, the instructions causing one or more computer processors to perform operations according to a method, the method comprising:
calculating n-gram frequencies for a plurality of n-grams over the set of small strings;
selecting a subset of n-grams from the plurality of n-grams based on the calculated n-gram frequencies;
defining a mapping table that maps each n-gram of the subset of n-grams to a unique code;
determining an optimum length for the unique codes, the determining including:
calculating a space saving amount over a subset of small strings from the set of small strings for each of at least two different unique code lengths, wherein the at least two different unique code lengths includes: (1) fixed single byte codes, (2) fixed 2-byte codes, and (3) variable-length one or two byte codes; and
selecting as the optimum length the unique code length with the maximum space saving amount over the subset of small strings, wherein the unique codes are the optimum length; and
compressing the set of small strings by replacing n-grams within each small string in the set of small strings with corresponding unique codes from the mapping table.
20. The non-transitory computer-readable storage medium according to claim 19 , wherein the subset of small strings from the set of small strings includes the whole set of small strings.Cited by (0)
No later patents cite this yet.
References (0)
No backward citations on record.