P
US9104972B1ActiveUtilityPatentIndex 82

Classifying documents using multiple classifiers

Assignee: GOOGLE INCPriority: Mar 13, 2009Filed: Mar 24, 2014Granted: Aug 11, 2015
Est. expiryMar 13, 2029(~2.7 yrs left)· nominal 20-yr term from priority
Inventors:KOROLEV DMITRYMAENNEL HARTMUT
G06N 5/048G06N 99/005G06F 16/353G06N 20/00
82
PatentIndex Score
8
Cited by
26
References
20
Claims

Abstract

Methods, systems, and apparatus, including computer programs encoded on computer storage media, for classifying resources using scores from multiple classifiers. In general, one aspect of the subject matter described in this specification can be embodied in methods that include the actions of receiving identifying a collection of documents to classify; receiving a plurality of classifiers for scoring a document with respect to a specified property; for each document in the collection, applying each of the plurality of classifiers, each classifier generating a score associated with a likelihood that the document has the specified property, combining the scores from each classifier including applying a multiple classifier model that uses monotonic regression to combine the plurality of classifiers, and classifying the document as having the specified property based on the combined score.

Claims

exact text as granted — not AI-modified
What is claimed is: 
     
       1. A computer-implemented method comprising:
 computing multiple respective scores for each document in a collection of documents, one score from each of a plurality of D distinct classifiers, wherein each classifier computes a respective score representing a likelihood that the document has a property P, wherein each classifier has a respective lower threshold a j , wherein documents having a score less than a j  are unlikely to have the property P, and wherein each classifier has a respective upper threshold b j , wherein documents having a score greater than b j  are likely to have the property P; 
 determining, for each respective classifier, a plurality of intervals between a j  and b j  for the classifier; 
 determining, for each document in the collection of documents, a combination of intervals I 1   1  to I D   K  according to which interval of the plurality of intervals each respective score for the document belongs; 
 determining, for each combination of intervals I j   1  to I j   K , whether any documents in the collection of documents have a corresponding combination of intervals I 1   1  to I D   K ; 
 selecting no more than M documents for each combination of intervals I j   1  to I j   K  for which at least one document in the collection has the corresponding combination of intervals; and 
 training a multiple classifier model for the D distinct classifiers using each selected document. 
 
     
     
       2. The method of  claim 1 , wherein determining, for each respective classifier, a plurality of intervals between a j  and b j  for the classifier comprises determining K intervals for each distinct classifier. 
     
     
       3. The method of  claim 2 , wherein selecting no more than M documents for each combination of intervals comprises selecting no more than N total documents, wherein N is an upper bound on a number of documents selected. 
     
     
       4. The method of  claim 3 , further comprising:
 determining K such that K is the smallest integer that satisfies K D >=N. 
 
     
     
       5. The method of  claim 4 , further comprising:
 determining, for each classifier, a middle score m j , wherein determining a plurality of intervals between a j  and b j  for each classifier comprises determining K/2 intervals between a j  and m j  and K/2 intervals between m j  and b j . 
 
     
     
       6. The method of  claim 5 , wherein determining, for each classifier, the middle score m j  comprises determining the middle score m j  such that a ratio of documents having a score greater than m j  is a ratio of documents having the property P. 
     
     
       7. The method of  claim 1 , wherein substantially half of the selected documents have the property P. 
     
     
       8. The method of  claim 1 , further comprising:
 providing each selected document to one or more human raters; and 
 receiving, from the one or more human raters, an indication of whether or not each document has the property P. 
 
     
     
       9. The method of  claim 1 , wherein training the multiple classifier model comprises training a monotonic regression model. 
     
     
       10. A system comprising:
 one or more computers and one or more storage devices storing instructions that are operable, when executed by the one or more computers, to cause the one or more computers to perform operations comprising: 
 computing multiple respective scores for each document in a collection of documents, one score from each of a plurality of D distinct classifiers, wherein each classifier computes a respective score representing a likelihood that the document has a property P, wherein each classifier has a respective lower threshold a j , wherein documents having a score less than a j  are unlikely to have the property P, and wherein each classifier has a respective upper threshold b j , wherein documents having a score greater than b j  are likely to have the property P; 
 determining, for each respective classifier, a plurality of intervals between a j  and b j  for the classifier; 
 determining, for each document in the collection of documents, a combination of intervals I 1   1  to I D   K  according to which interval of the plurality of intervals each respective score for the document belongs; 
 determining, for each combination of intervals I j   1  to I j   K , whether any documents in the collection of documents have a corresponding combination of intervals I 1   1  to I D   K ; 
 selecting no more than M documents for each combination of intervals I j   1  to I j   K  for which at least one document in the collection has the corresponding combination of intervals; and 
 training a multiple classifier model for the D distinct classifiers using each selected document. 
 
     
     
       11. The system of  claim 10 , wherein determining, for each respective classifier, a plurality of intervals between a j  and b j  for the classifier comprises determining K intervals for each distinct classifier. 
     
     
       12. The system of  claim 11 , wherein selecting no more than M documents for each combination of intervals comprises selecting no more than N total documents, wherein N is an upper bound on a number of documents selected. 
     
     
       13. The system of  claim 12 , wherein the operations further comprise:
 determining K such that K is the smallest integer that satisfies K D >=N. 
 
     
     
       14. The system of  claim 13 , wherein the operations further comprise:
 determining, for each classifier, a middle score m j , wherein determining a plurality of intervals between a j  and b j  for each classifier comprises determining K/2 intervals between a j  and m j  and K/2 intervals between m j  and b j . 
 
     
     
       15. The system of  claim 14 , wherein determining, for each classifier, the middle score m j  comprises determining the middle score m j  such that a ratio of documents having a score greater than m j  is a ratio of documents having the property P. 
     
     
       16. The system of  claim 10 , wherein substantially half of the selected documents have the property P. 
     
     
       17. The system of  claim 10 , wherein the operations further comprise:
 providing each selected document to one or more human raters; and 
 receiving, from the one or more human raters, an indication of whether or not each document has the property P. 
 
     
     
       18. The system of  claim 10 , wherein training the multiple classifier model comprises training a monotonic regression model. 
     
     
       19. A computer program product, encoded on one or more non-transitory computer storage media, comprising instructions that when executed by one or more computers cause the one or more computers to perform operations comprising:
 computing multiple respective scores for each document in a collection of documents, one score from each of a plurality of D distinct classifiers, wherein each classifier computes a respective score representing a likelihood that the document has a property P, wherein each classifier has a respective lower threshold a j , wherein documents having a score less than a j  are unlikely to have the property P, and wherein each classifier has a respective upper threshold b j , wherein documents having a score greater than b j  are likely to have the property P; 
 determining, for each respective classifier, a plurality of intervals between a j  and b j  for the classifier; 
 determining, for each document in the collection of documents, a combination of intervals I 1   1  to I D   K  according to which interval of the plurality of intervals each respective score for the document belongs; 
 determining, for each combination of intervals to I j   1  to I j   K , whether any documents in the collection of documents have a corresponding combination of intervals I 1   1  to I D   K ; 
 selecting no more than M documents for each combination of intervals I j   1  to I j   K  for which at least one document in the collection has the corresponding combination of intervals; and 
 training a multiple classifier model for the D distinct classifiers using each selected document. 
 
     
     
       20. The computer program product of  claim 19 , wherein determining, for each respective classifier, a plurality of intervals between aj and bj for the classifier comprises determining K intervals for each distinct classifier.

Cited by (0)

No later patents cite this yet.

References (0)

No backward citations on record.