Matching strings in a large relational database
Abstract
A computer-implemented method searches a database for a particular string. One or more processors receive data as an input string, and then identify multiple k-grams in, unique characters in, and a length of the input string. The one or more processors perform binary locality sensitive hashing on the k-grams, the unique characters, and the length for the input string, and then sum the binary locality sensitive hashings to create a first addition vector, which is used to generate a first binary vector. The same process is performed on a particular string being requested to generate a second binary vector. The one or more processors then search the database for the particular string that was requested using the second binary vector in a large scale hamming distance query process that determines a hamming distance between the first binary code and the second binary code.
Claims
exact text as granted — not AI-modifiedWhat is claimed is:
1. A computer-implemented method comprising:
receiving, by one or more processors, data as an input string;
generating, by one or more processors, a first binary code using a binary locality sensitive hashing of k-grams in the input string, wherein features used to generate the first binary code comprise a similarity coefficient of strings of characters in the input string, unique characters of the input string, and a length of the input string, wherein the binary locality sensitive hashing on the k-grams in the input string is derived from a dice coefficient s, wherein:
s
=
2
X
⋂
Y
X
+
Y
wherein X is a first set of bi-grams in the input string, wherein Y is a second set of bi-grams in the input string, wherein |X∩Y| is a quantity of intersecting bi-grams in an intersection of X and Y, wherein |X| is a quantity of bi-grams in the first set of bi-grams in the input string, and wherein |Y| is a quantity of bi-grams in the second set of bi-grams in the input string;
storing, by one or more processors, the first binary code and the input string in a database;
in response to receiving a search request for a particular string, generating, by one or more processors, a second binary code using a binary locality sensitive hashing on the particular string;
searching, by one or more processors, the database using the second binary code in a large scale hamming distance query process; and
ranking and returning, by one or more processors, a set of similar strings, wherein strings with a minimum hamming distance between the first binary code and second binary code are a highest ranked recommendation.
2. The computer-implemented method of claim 1 , further comprising:
identifying, by one or more processors, multiple k-grams in the input string, wherein each k-gram is a string of one or more characters in the input string;
identifying, by one or more processors, the unique characters in the input string;
identifying, by one or more processors, the length of the input string;
performing, by one or more processors, binary locality sensitive hashing on the k-grams, the unique characters, and the length of the input string;
summing, by one or more processors, binary locality sensitive hashings on the k-grams, the unique characters, and the length of the input string to create a first addition vector;
generating, by one or more processors, a first binary vector from the first addition vector, wherein each element in the first binary vector represents a binary state of a value found in the first addition vector;
storing, by one or more processors, the first binary vector and the input string in a database;
receiving, by one or more processors, the search request for the particular string in the database;
identifying, by one or more processors, multiple k-grams in the particular string;
identifying, by one or more processors, unique characters in the particular string;
identifying, by one or more processors, a length of the particular string;
performing, by one or more processors, binary locality sensitive hashing on the k-grams, the unique characters, and the length for the particular string;
summing, by one or more processors, binary locality sensitive hashings on the k-grams, the unique characters, and the length of the particular string to create a second addition vector;
generating, by one or more processors, a second binary vector from the second addition vector, wherein each element in the second binary vector represents a binary state of a value found in the second addition vector;
searching, by one or more processors, the database for the particular string using the second binary vector in a large scale hamming distance query process that determines a hamming distance between the first binary code and the second binary code; and
returning, by one or more processors, the particular string based on the hamming distance between the first binary code and the second binary code.
3. The computer-implemented method of claim 2 , further comprising:
generating, by one or more processors, a unique binary vector from an addition vector for each string in the database, wherein each element in each unique binary vector represents a binary state of a value found in the addition vector for each string in the database;
establishing, by one or more processors, a hamming distance between each unique binary vector and the first binary vector;
ranking, by one or more processors, each unique binary vector according to its respective hamming distance from the first addition vector to create a ranked set of strings in the database; and
presenting, by one or more processors, the ranked set of strings in response to receiving the search request for the particular string in the database.
4. The computer-implemented method of claim 3 , further comprising:
receiving, by one or more processors, a user input of a variation of the input string; and
in response to receiving the user input of the variation of the input string, returning, by one or more processors, a highest ranked string from the ranked set of strings.
5. The computer-implemented method of claim 4 , wherein the variation of the input string is a misspelling of the input string.
6. The computer-implemented method of claim 4 , wherein the variation of the input string is an accepted alternative spelling of the input string.
7. A computer program product for searching a database for a particular string, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions readable and executable by a computer to perform a method of:
receiving data as an input string;
generating a first binary code using a binary locality sensitive hashing of k-grams in the input string, wherein features used to generate the first binary code comprise a similarity coefficient of strings of characters in the input string, unique characters of the input string, and a length of the input string, wherein the binary locality sensitive hashing on the k-grams in the input string is derived from a dice coefficient s, wherein:
s
=
2
X
⋂
Y
X
+
Y
wherein X is a first set of bi-grams in the input string, wherein Y is a second set of bi-grams in the input string, wherein |X∩Y| is a quantity of intersecting bi-grams in an intersection of X and Y, wherein |X| is a quantity of bi-grams in the first set of bi-grams in the input string, and wherein |Y| is a quantity of bi-grams in the second set of bi-grams in the input string;
storing the first binary code and the input string in a database;
in response to receiving a search request for a particular string, generating a second binary code using the binary locality sensitive hashing on the particular string;
searching the database using the second binary code in a large scale hamming distance query process; and
ranking and returning a set of similar strings, wherein strings with a minimum hamming distance between the first binary code and second binary code are a highest ranked recommendation.
8. The computer program product of claim 7 , wherein the method further comprises:
identifying multiple k-grams in the input string, wherein each k-gram is a string of one or more characters in the input string;
identifying the unique characters in the input string;
identifying the length of the input string;
performing binary locality sensitive hashing on the k-grams, the unique characters, and the length of the input string;
summing binary locality sensitive hashings on the k-grams, the unique characters, and the length of the input string to create a first addition vector;
generating a first binary vector from the first addition vector, wherein each element in the first binary vector represents a binary state of a value found in the first addition vector;
storing the first binary vector and the input string in a database;
receiving a search request for a particular string in the database;
identifying multiple k-grams in the particular string;
identifying unique characters in the particular string;
identifying a length of the particular string;
performing binary locality sensitive hashing on the k-grams, the unique characters, and the length for the particular string;
summing binary locality sensitive hashings on the k-grams, the unique characters, and the length of the particular string to create a second addition vector;
generating a second binary vector from the second addition vector, wherein each element in the second binary vector represents a binary state of a value found in the second addition vector;
searching the database for the particular string using the second binary vector in a large scale hamming distance query process that determines a hamming distance between the first binary code and the second binary code; and
returning the particular string based on the hamming distance between the first binary code and the second binary code.
9. The computer program product of claim 8 , wherein the method further comprises:
generating a unique binary vector from an addition vector for each string in the database, wherein each element in each unique binary vector represents a binary state of a value found in the addition vector for each string in the database;
establishing a hamming distance between each unique binary vector and the first binary vector;
ranking each unique binary vector according to its respective hamming distance from the first addition vector to create a ranked set of strings in the database; and
presenting the ranked set of strings in response to receiving the search request for the particular string in the database.
10. The computer program product of claim 9 , wherein the method further comprises:
receiving a user input of a variation of the input string; and
in response to receiving the user input of the variation of the input string, returning a highest ranked string from the ranked set of strings.
11. The computer program product of claim 10 , wherein the variation of the input string is a misspelling of the input string.
12. The computer program product of claim 10 , wherein the variation of the input string is an accepted alternative spelling of the input string.
13. The computer program product of claim 9 , wherein the program instructions are provided as a service in a cloud environment.
14. A computer system comprising:
one or more processors;
one or more computer readable memories, operably coupled to the one or more processors, wherein the one or more computer readable memories store program instructions for execution by at least one of the one or more processors, the stored program instructions comprising:
program instructions to receive data as an input string;
program instructions to generate a first binary code using a binary locality sensitive hashing of k-grams in the input string, wherein features used to generate the first binary code comprise a similarity coefficient of strings of characters in the input string, unique characters of the input string, and a length of the input string, wherein the binary locality sensitive hashing on the k-grams in the input string is derived from a dice coefficient s, wherein:
s
=
2
X
⋂
Y
X
+
Y
wherein X is a first set of bi-grams in the input string, wherein Y is a second set of bi-grams in the input string, wherein |X∩Y| is a quantity of intersecting bi-grams in an intersection of X and Y, wherein |X| is a quantity of bi-grams in the first set of bi-grams in the input string, and wherein |Y| is a quantity of bi-grams in the second set of bi-grams in the input string;
program instructions to store the first binary code and the input string in a database;
program instructions to, in response to receiving a search request for a particular string, generate a second binary code using the binary locality sensitive hashing on the particular string;
program instructions to search the database using the second binary code in a large scale hamming distance query process; and
program instructions to rank and return a set of similar strings, wherein strings with a minimum hamming distance between the first binary code and second binary code are a highest ranked recommendation.
15. The computer system of claim 14 , further comprising:
program instructions to identify multiple k-grams in the input string, wherein each k-gram is a string of one or more characters in the input string;
program instructions to identify the unique characters in the input string;
program instructions to identify the length of the input string;
program instructions to perform binary locality sensitive hashing on the k-grams, the unique characters, and the length for the input string;
program instructions to sum binary locality sensitive hashings on the k-grams, the unique characters, and the length of the input string to create a first addition vector;
program instructions to generate a first binary vector from the first addition vector, wherein each element in the first binary vector represents a binary state of a value found in the first addition vector;
program instructions to store the first binary vector and the input string in a database;
program instructions to receive a search request for a particular string in a database;
program instructions to identify multiple k-grams in the particular string;
program instructions to identify unique characters in the particular string;
program instructions to identify a length of the particular string;
program instructions to perform binary locality sensitive hashing on the k-grams, the unique characters, and the length for the particular string;
program instructions to sum binary locality sensitive hashings on the k-grams, the unique characters, and the length of the particular string to create a second addition vector;
program instructions to generate a second binary vector from the second addition vector, wherein each element in the second binary vector represents a binary state of a value found in the second addition vector;
program instructions to search the database for the particular string using the second binary vector in a large scale hamming distance query process that determines a hamming distance between the first binary code and the second binary code; and
program instructions to return the particular string based on the hamming distance between the first binary code and the second binary code.
16. The computer system of claim 15 , further comprising:
program instructions to generate a unique binary vector from an addition vector for each string in the database, wherein each element in each unique binary vector represents a binary state of a value found in the addition vector for each string in the database;
program instructions to establish a hamming distance between each unique binary vector and the first binary vector;
program instructions to rank each unique binary vector according to its respective hamming distance from the first addition vector to create a ranked set of strings in the database; and
program instructions to present the ranked set of strings in response to receiving the search request for the particular string in the database.
17. The computer system of claim 14 , wherein the stored program instructions are provided as a service in a cloud environment.Cited by (0)
No later patents cite this yet.
References (0)
No backward citations on record.