US8538964B2 · Active · Utility · PatentIndex 76

Using an ID domain to improve searching

Assignee: MAGDY WALID
Priority: Jul 25, 2008
Filed: Dec 8, 2011
Granted: Sep 17, 2013
Est. expiry: Jul 25, 2028 (~2.1 yrs left) · nominal 20-yr term from priority
Inventors: MAGDY WALID; EL-SABAN MOTAZ AHMED
Classifications: G06F 18/2137, G06F 16/583
Cited by: 6 · References: 34 · Claims: 20

Abstract

Methods which use an ID domain to improve searching are described. An embodiment describes an index phase in which an image of a document is converted into the ID domain. This is achieved by dividing the text in the image into elements and mapping each element to an identifier. Similar elements are mapped to the same identifier. Each element in the text is then replaced by the appropriate identifier to create a version of the document in the ID domain. This version may be indexed and searched. Another embodiment describes a query phase in which a query is converted into the ID domain and then used to search an index of identifiers which has been created from collections of documents which have been converted into the ID domain. The conversion of the query may use mappings which were created during the index phase or alternatively may use pre-existing mappings.
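
The index phase described in the abstract can be illustrated with a minimal sketch. It assumes segmentation has already happened and that each element is represented by a small feature vector; the greedy nearest-representative clustering, the `threshold` value, and the function names `to_id_domain` and `build_index` are illustrative assumptions, not the method specified in the patent.

```python
from collections import defaultdict
from math import dist

def to_id_domain(elements, cluster_table, threshold=0.5):
    """Replace each element (a feature vector) with a cluster identifier.

    cluster_table maps identifier -> representative features and grows
    whenever an element is not close enough to any existing cluster.
    """
    ids = []
    for feat in elements:
        best_id, best_d = None, threshold
        for cid, rep in cluster_table.items():
            d = dist(feat, rep)
            if d <= best_d:
                best_id, best_d = cid, d
        if best_id is None:  # no similar cluster: allocate a new identifier
            best_id = len(cluster_table)
            cluster_table[best_id] = feat
        ids.append(best_id)
    return ids

def build_index(docs_ids):
    """Inverted index: identifier -> list of (document, position)."""
    index = defaultdict(list)
    for doc_name, ids in docs_ids.items():
        for pos, cid in enumerate(ids):
            index[cid].append((doc_name, pos))
    return index

table = {}
# Hypothetical features for four glyphs: two noisy copies each of two shapes.
page = [(0.0, 0.0), (1.0, 1.0), (1.1, 0.9), (0.1, 0.0)]
ids = to_id_domain(page, table)
index = build_index({"doc1": ids})
print(ids)  # [0, 1, 1, 0]
```

Similar elements end up sharing an identifier, so the document becomes a sequence of identifiers that can be indexed like ordinary tokens.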

Claims

exact text as granted — not AI-modified
The invention claimed is: 
     
       1. A computer-implemented method comprising:
 under control of a computing device having one or more processors with executable instructions, 
 segmenting text in an image of a document into elements, each element representing a character in the text; 
 grouping similar elements into clusters and assigning each cluster an identifier; 
 replacing each element in a cluster of similar elements with the identifier allocated to the cluster of similar elements; 
 ordering the identifiers within the document according to an order of the characters in the text; 
 creating an index of identifiers in the document; 
 receiving a text query and converting the text query into an image of the text query; 
 segmenting the image of the text query into elements and matching each element to at least one cluster using a cluster table, the cluster table comprising mappings between identifiers and element characteristics, at least a first element matching to at least two clusters; 
 replacing each element in the image of the text query with at least one identifier based on the matching to formulate a query defined in terms of identifiers, replacing each element in the image of the text query including replacing the first element with at least two identifiers based on the matching; and 
 searching the index of identifiers using the query defined in terms of identifiers. 
 
     
     
       2. The computer-implemented method according to  claim 1 , further comprising ordering the identifiers within the document according to an order of a language of the characters in the text. 
     
     
       3. The computer-implemented method according to  claim 2 , wherein the language comprises English and the identifiers are ordered from left to right and top to bottom. 
     
     
       4. The computer-implemented method according to  claim 2 , wherein the language comprises Arabic and the identifiers are ordered from right to left and top to bottom. 
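
As an illustration of the ordering recited in claims 2 to 4, the following sketch sorts identifiers into reading order. It assumes each element carries a line number and a horizontal position; the tuple layout and the name `order_ids` are hypothetical.

```python
def order_ids(elements, rtl=False):
    """Return identifiers in reading order.

    elements: list of (line_no, x, identifier) tuples.
    Lines are always read top to bottom; within a line, x ascends for
    left-to-right scripts (e.g. English) and descends for right-to-left
    scripts (e.g. Arabic).
    """
    if rtl:
        key = lambda e: (e[0], -e[1])
    else:
        key = lambda e: (e[0], e[1])
    return [e[2] for e in sorted(elements, key=key)]

elems = [(0, 30, 7), (0, 10, 3), (1, 20, 5), (0, 20, 4)]
ltr = order_ids(elems)             # [3, 4, 7, 5]
rtl = order_ids(elems, rtl=True)   # [7, 4, 3, 5]
```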
     
     
       5. A computer-implemented method comprising:
 under control of a computing device having one or more processors with executable instructions,
 receiving a text query; 
 converting the text query into an image by drawing the text query using a font; 
 performing a comparison between elements in the image to a cluster table associated with the font, the cluster table defining mappings between image elements and identifiers associated with the image elements; and 
 creating a query defined in terms of identifiers associated with clusters of elements based on the comparison between elements of the image and the cluster table associated with the font, a first element in the image being matched to two clusters of elements, creating the query including replacing the first element with two identifiers associated with the two clusters of elements in the cluster table. 
 
 
     
     
       6. The computer-implemented method according to  claim 5 , further comprising:
 searching an index of identifiers created from at least one document image using the query defined in terms of identifiers based on the comparison between elements of the image and the cluster table associated with the font. 
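
The query phase of claims 5 and 6 can be sketched as follows. This is a simplified illustration under stated assumptions: a hypothetical `FONT_FEATURES` lookup stands in for drawing the query with a font and extracting element features, `CLUSTER_TABLE` holds made-up representatives, and the search scans a document's identifier sequence directly rather than consulting a real index.

```python
from math import dist

# Hypothetical stand-in for rendering each query character in a given font.
FONT_FEATURES = {"a": (0.0, 0.0), "b": (1.0, 1.0), "c": (0.6, 0.6)}

# Hypothetical cluster table for that font: identifier -> representative.
CLUSTER_TABLE = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (0.5, 0.5)}

def query_to_ids(text, max_dist=0.6):
    """Convert a text query into a list of sets of candidate identifiers.

    An element may match more than one cluster, in which case it is
    replaced by every matching identifier.
    """
    query = []
    for ch in text:
        feat = FONT_FEATURES[ch]
        matches = {cid for cid, rep in CLUSTER_TABLE.items()
                   if dist(feat, rep) <= max_dist}
        query.append(matches)
    return query

def search(doc_ids, query):
    """Return start positions where every query slot accepts the document ID."""
    hits = []
    for start in range(len(doc_ids) - len(query) + 1):
        if all(doc_ids[start + i] in alts for i, alts in enumerate(query)):
            hits.append(start)
    return hits

q = query_to_ids("ac")            # [{0}, {1, 2}]: "c" matches two clusters
hits = search([0, 2, 1, 0, 1], q) # [0, 3]
```

Allowing an element to expand into two identifiers lets the query still match when the index phase split visually similar shapes across clusters.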
 
     
     
       7. The computer-implemented method according to  claim 5 , further comprising:
 dividing text in an image of a document into a plurality of elements; 
 arranging the plurality of elements into clusters of similar elements; 
 allocating a unique one of the identifiers to each cluster; and 
 ordering the identifiers according to an order of a language of each page of the text. 
 
     
     
       8. The computer-implemented method according to  claim 7 , further comprising:
 replacing each element in the image of the document with one or more identifiers, one of the one or more identifiers comprising the identifier corresponding to the unique identifier of the cluster comprising the element; and 
 creating an index of identifiers in the image of the document. 
 
     
     
       9. The computer-implemented method according to  claim 5 , wherein the query is created by replacing each element in the image with at least one identifier based on the comparison. 
     
     
       10. The computer-implemented method according to  claim 9 , wherein the query is created by replacing each element in the image with N identifiers corresponding to the N most similar image elements in the cluster table. 
     
     
       11. The computer-implemented method according to  claim 10 , wherein each of said N identifiers has an associated weight and wherein the search of an index of identifiers uses the query defined in terms of identifiers and the weight associated with each of the identifiers. 
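
Claims 10 and 11 describe replacing each element with the N most similar identifiers, each carrying a weight. A minimal sketch follows; the weighting function 1 / (1 + distance) and the additive scoring are assumptions made for illustration, since the claims do not specify how weights are computed or combined.

```python
from math import dist

def top_n_ids(feat, cluster_table, n=2):
    """Return [(identifier, weight)] for the n clusters closest to feat."""
    ranked = sorted(cluster_table.items(), key=lambda kv: dist(feat, kv[1]))[:n]
    # Assumed weighting: closer clusters receive weights nearer to 1.
    return [(cid, 1.0 / (1.0 + dist(feat, rep))) for cid, rep in ranked]

def score(doc_ids, weighted_query, start):
    """Sum the weights of the query alternatives matched at each position."""
    total = 0.0
    for offset, alts in enumerate(weighted_query):
        total += dict(alts).get(doc_ids[start + offset], 0.0)
    return total

table = {0: (0.0, 0.0), 1: (1.0, 0.0), 2: (3.0, 0.0)}
wq = [top_n_ids((0.0, 0.0), table), top_n_ids((1.0, 0.0), table)]
exact = score([0, 1], wq, 0)    # 2.0: both positions hit the best cluster
partial = score([1, 1], wq, 0)  # 1.5: first position hits the second choice
```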
     
     
       12. The computer-implemented method according to  claim 5 , further comprising:
 converting the text query into another image by drawing the text query using another font different from the font; 
 creating another query defined in terms of identifiers based on a comparison between elements of the other image and a cluster table associated with the other font different from the font; and 
 searching another index of identifiers created from at least one document image using the other query defined in terms of identifiers based on the comparison between elements of the other image and the cluster table associated with the other font different from the font. 
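
The multi-font search of claim 12 can be pictured as running one query per font against that font's own index. In the sketch below, the `FONTS` structure, the identifier values, and the document names are all hypothetical, and the render-and-compare step is collapsed into a direct character-to-identifier lookup for brevity.

```python
# Hypothetical per-font data: a character -> identifier lookup (standing in
# for rendering plus cluster-table comparison) and ID-domain documents.
FONTS = {
    "serif": {"glyph_ids": {"a": 10, "b": 11}, "docs": {"scan1": [10, 11, 10]}},
    "sans":  {"glyph_ids": {"a": 20, "b": 21}, "docs": {"scan2": [21, 20, 21]}},
}

def contains(seq, sub):
    """True if sub occurs as a contiguous run inside seq."""
    n = len(sub)
    return any(seq[i:i + n] == sub for i in range(len(seq) - n + 1))

def search_all_fonts(text):
    """Build one identifier query per font and search that font's documents."""
    hits = []
    for font, data in FONTS.items():
        query = [data["glyph_ids"][ch] for ch in text]
        for doc, ids in data["docs"].items():
            if contains(ids, query):
                hits.append((font, doc))
    return hits

results = search_all_fonts("ab")  # [("serif", "scan1"), ("sans", "scan2")]
```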
 
     
     
       13. The computer-implemented method according to  claim 5 , wherein the query is created by replacing each element in the image with at least one identifier based on the comparison, and creating a query comprising a restricted sequence of identifiers. 
     
     
       14. One or more tangible storage media, the one or more tangible storage media being hardware, having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
 receiving a text query; 
 converting the text query into an image by drawing the text query using a font; 
 comparing elements in the image to a cluster table associated with the font, the cluster table defining mappings between image elements and identifiers associated with the image elements; and 
 creating a query defined in terms of identifiers associated with clusters of elements based on the comparison between elements of the image and the cluster table associated with the font, a first element in the image being matched to two clusters of elements, creating the query including replacing the first element with two identifiers associated with the two clusters of elements in the cluster table. 
 
     
     
       15. The one or more tangible storage media as claimed in  14  having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
 searching an index of identifiers created from at least one document image using the query defined in terms of identifiers based on the comparison between elements of the image and the cluster table associated with the font. 
 
     
     
       16. The one or more tangible storage media as claimed in  14  having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
 creating the query by replacing each element in the image with at least one identifier based on the comparison. 
 
     
     
       17. The one or more tangible storage media as claimed in  16  having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
 creating the query by replacing each element in the image with N identifiers corresponding to the N most similar image elements in the cluster table. 
 
     
     
       18. The one or more tangible storage media as claimed in  17  having device-executable instructions, wherein each of said N identifiers has an associated weight and wherein the search of an index of identifiers uses the query defined in terms of identifiers and the weight associated with each of the identifiers. 
     
     
       19. The one or more tangible storage media as claimed in  14  having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
 converting the text query into another image by drawing the text query using another font different from the font; 
 creating another query defined in terms of identifiers based on a comparison between elements of the other image and a cluster table associated with the other font different from the font; and 
 searching another index of identifiers created from at least one document image using the other query defined in terms of identifiers based on the comparison between elements of the other image and the cluster table associated with the other font different from the font. 
 
     
     
       20. The one or more tangible storage media as claimed in  14  having device-executable instructions which, when executed by one or more processors, cause the one or more processors to perform acts comprising:
 creating the query by replacing each element in the image with at least one identifier based on the comparison, and creating a query comprising a restricted sequence of identifiers.
