US11526554B2 · Active · Utility · PatentIndex 39

Preventing the distribution of forbidden network content using automatic variant detection

Assignee: GOOGLE LLC · Priority: Dec 9, 2016 · Filed: Dec 9, 2016 · Granted: Dec 13, 2022
Est. expiry: Dec 9, 2036 (~10.4 yrs left) · nominal 20-yr term from priority
Inventors: LIU YINTAO, VAISH VAIBHAV, XU RACHEL, CHEN ZHAOFU
G06F 16/3338 · G06F 16/355 · G06F 16/328 · G06F 16/374 · G06F 16/9535 · G06F 16/3349 · G06F 16/9035
PatentIndex Score: 39 · Cited by: 0 · References: 37 · Claims: 20

Abstract

The subject matter of this specification generally relates to preventing the distribution of forbidden network content. In one aspect, a system includes a front-end server that receives content for distribution over a data communication network. The back-end server identifies, in the query log, a set of received queries for which a given forbidden term was used to identify a search result in response to the received query even though the given forbidden term was not included in queries included in the set of received queries. The back-end server classifies, as variants of the given forbidden term, a term from one or more queries in the set of received queries that caused a search engine to use the given forbidden term to identify one or more search results in response to the one or more queries and prevents distribution of content that includes a variant.
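
The flow described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: the log format (a `rewrites` mapping from the term the user typed to the term the search engine actually used), the sample terms, and the function names `candidate_variants`, `select_variants`, and `is_blocked` are all hypothetical. Frequency is simplified here to the number of qualifying log entries in which a candidate appears.

```python
from collections import Counter

# Hypothetical query-log format: the raw query plus the rewrites the search
# engine applied (term typed by the user -> term actually used for retrieval).
QUERY_LOG = [
    {"query": "best cas1no bonus", "rewrites": {"cas1no": "casino"}},
    {"query": "cas1no near me", "rewrites": {"cas1no": "casino"}},
    {"query": "kasino games", "rewrites": {"kasino": "casino"}},
    {"query": "casino hotel", "rewrites": {}},
    {"query": "garden hoses", "rewrites": {"hoses": "hose"}},
]


def candidate_variants(log, forbidden):
    """Count terms that led the engine to use `forbidden` without it being typed."""
    counts = Counter()
    for entry in log:
        if forbidden in entry["query"].split():
            continue  # the forbidden term was typed directly; no variant signal
        for original, used in entry["rewrites"].items():
            if used == forbidden:
                counts[original] += 1
    return counts


def select_variants(counts, top_k=2):
    """Rank candidates by frequency (ties broken alphabetically), keep the top k."""
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term for term, _ in ranked[:top_k]}


def is_blocked(content, blocked_terms):
    """Return True if any whitespace-separated token of the content is blocked."""
    return any(token in blocked_terms for token in content.lower().split())


counts = candidate_variants(QUERY_LOG, "casino")
blocked = select_variants(counts, top_k=1) | {"casino"}
print(is_blocked("Play CAS1NO now", blocked))
```

A real system would need proper tokenization and a tuned selection threshold; the `top_k` cutoff here is only a stand-in for selecting from the ranking.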

Claims

exact text as granted — not AI-modified
What is claimed is: 
     
       1. A system, comprising:
 one or more data storage devices that store (i) data identifying a set of forbidden terms and (ii) a query log that includes received queries; 
 one or more front-end servers that receive content for distribution over a data communication network; and 
 one or more back-end servers that communicate with the data storage device and the one or more front-end servers and performs operations including:
 identifying, in the query log, a set of received queries for which a given forbidden term was used to identify a search result that was provided in response to the received query even though the given forbidden term was not included in the received query; 
 generating a set of candidate variants of the given forbidden term based on the set of received queries identified in the query log, the generating comprising:
 for each query in the set of received queries;
 determining, based on data of the query log, that one or more terms in the query caused, when processed by a search engine in response to receiving the query, the search engine to use the forbidden term to identify one or more search results provided in response to the query, and 
 classifying each of the one or more terms as a candidate variant of the given forbidden term in response to determining that the one or more terms caused the search engine to use the given forbidden term to identify the one or more search results provided in response to the query; 
 
 
 determining a score for each candidate variant of the given forbidden term based on a frequency at which the candidate variant of the forbidden term occurs in the query log; 
 selecting, from a ranking of the candidate variants of the given forbidden term, a set of forbidden variants of the given forbidden term, wherein the ranking of the candidate terms is based on the score for each candidate variant of the forbidden term; and 
 preventing distribution of content that depicts a term included in the set of forbidden variants of the given forbidden term by the one or more front-end servers in response to the term being classified as a variant of the given forbidden term and included in the set of forbidden variants of the given forbidden term. 
 
 
     
     
       2. The system of  claim 1 , wherein identifying the set of received queries for which the given forbidden term was used to identify the search result in response to the received query even though the given forbidden term was not included in queries included in the set of received queries comprises identifying a given received query that was expanded by the search engine to include the given forbidden term. 
     
     
       3. The system of  claim 1 , wherein the one or more back-end servers perform operations comprising identifying, using a semantic network of terms, a term semantically linked to the given forbidden term as a candidate variant of the given forbidden term. 
     
     
       4. The system of  claim 1 , wherein selecting, from the ranking of the candidate variants of the given forbidden term, the set of forbidden variants of the given forbidden term comprises:
 generating the ranking of the candidate variants by ordering the candidate variants based on the score for each candidate variant; and 
 selecting, as the forbidden variants of the forbidden term, one or more of the candidate variants from the ranking of the candidate variants based on the score for each candidate variant. 
 
     
     
       5. The system of  claim 4 , wherein:
 the set of candidate variants includes a first candidate variant for which a spelling of the first candidate variant was corrected to the candidate forbidden term and a second candidate variant that was added to a received query that included the given forbidden term; 
 the score for the first candidate variant is further based on an edit distance between the first candidate variant and the given forbidden term; and 
 the score for the second candidate variant is further based on inverse document frequency score for the second candidate variant. 
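
The two scoring refinements named in claim 5 can be sketched as follows. This is an illustration only and not claim language: the claim does not say how edit distance or inverse document frequency is combined with the frequency-based score, so the formulas in `spell_variant_score` and `expansion_variant_score` are assumptions, as are the function names and the smoothed IDF definition.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def idf(term, documents):
    """Smoothed inverse document frequency of a term over whitespace-tokenized docs."""
    df = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + len(documents)) / (1 + df))


def spell_variant_score(freq, variant, forbidden):
    # Assumed combination: misspellings closer to the forbidden term score higher.
    return freq / (1 + edit_distance(variant, forbidden))


def expansion_variant_score(freq, variant, documents):
    # Assumed combination: rarer (more specific) terms score higher.
    return freq * idf(variant, documents)


print(spell_variant_score(10, "cas1no", "casino"))
```

With this weighting, a term that appears in every document gets an IDF of zero, so common words added to a query do not become forbidden variants no matter how often they occur.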
 
     
     
       6. The system of  claim 1 , wherein identifying, in the query log, the set of received queries for which the given forbidden term was used to identify the search result that was provided in response to the received query even though the given forbidden term was not included in the received query comprises using a map procedure to identify, from the query log, candidate variants of each forbidden term. 
     
     
       7. The system of  claim 6 , wherein selecting, from the ranking of the candidate variants of the given forbidden term, the set of forbidden variants of the given forbidden term comprises using a reduce procedure for the given forbidden term to select, from the candidate variants for the given forbidden term, one or more forbidden variants of the forbidden term, wherein each reduce procedure is performed on a separate back-end server. 
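
The map and reduce procedures of claims 6 and 7 can be sketched as a single-process simulation. This is an illustration only and not claim language: the log format, the sample terms, and the names `map_entry`, `shuffle`, `reduce_variants`, and `run` are hypothetical, and the per-term reduce that the claim places on separate back-end servers is simply called in a loop here.

```python
from collections import Counter, defaultdict

# Hypothetical log format: raw query plus the rewrites applied by the engine.
SAMPLE_LOG = [
    {"query": "cas1no bonus", "rewrites": {"cas1no": "casino"}},
    {"query": "cas1no map", "rewrites": {"cas1no": "casino"}},
    {"query": "kasino", "rewrites": {"kasino": "casino"}},
    {"query": "lotery tickets", "rewrites": {"lotery": "lottery"}},
    {"query": "casino hotel", "rewrites": {}},
]


def map_entry(entry, forbidden_terms):
    """Map step: emit (forbidden_term, candidate_variant) pairs for one log entry."""
    tokens = set(entry["query"].split())
    for original, used in entry["rewrites"].items():
        if used in forbidden_terms and used not in tokens:
            yield used, original


def shuffle(pairs):
    """Group emitted candidates by forbidden term, one group per reduce call."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_variants(candidates, top_k=1):
    """Reduce step for one forbidden term: rank by frequency and keep the top k."""
    ranked = sorted(Counter(candidates).items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


def run(log, forbidden_terms, top_k=1):
    pairs = (pair for entry in log for pair in map_entry(entry, forbidden_terms))
    return {term: reduce_variants(cands, top_k)
            for term, cands in shuffle(pairs).items()}


print(run(SAMPLE_LOG, {"casino", "lottery"}))
```

Keying the shuffle on the forbidden term is what lets each reduce call work independently, which is the property that allows the reduces to be spread across machines.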
     
     
       8. The system of  claim 1 , wherein selecting, from the ranking of the candidate variants of the given forbidden term, the set of forbidden variants of the given forbidden term comprises:
 for each candidate variant, identifying an additional score for the candidate variant that is based on a number of occurrences, in the query log, of the candidate variant in queries that were spell corrected or expanded to include the candidate variant; and 
 selecting, as the forbidden variants of the forbidden term, one or more candidate variants based further on the additional score for each candidate variant. 
 
     
     
       9. The system of  claim 1 , wherein:
 the score for each candidate variant is based on a data source from which the candidate variant was identified; 
 each data source has respective criteria for which the score for candidate variants identified from the data source is determined; and 
 the criteria for at least one data source is different from the criteria for one or more other data sources. 
 
     
     
       10. A method for preventing distribution of forbidden content, comprising:
 receiving, by one or more servers, content for distribution over a data communication network; 
 identifying, in a query log that includes received queries, a set of received queries for which a given forbidden term was used to identify a search result that was provided in response to the received query even though the given forbidden term was not included in the received query; 
 generating a set of candidate variants of the given forbidden term based on the set of received queries identified in the query log, the generating comprising:
 for each query in the set of received queries:
 determining, based on data of the query log, that one or more terms in the query caused, when processed by a search engine in response to receiving the query, the search engine to use the forbidden term to identify one or more search results provided in response to the query; and 
 classifying each of the one or more terms as a candidate variant of the given forbidden term in response to determining that the one or more terms that caused the search engine to use the given forbidden term to identify the one or more search results provided in response to the query, 
 
 
 determining a score for each candidate variant of the given forbidden term based on a frequency at which the candidate variant of the forbidden term occurs in the query log; 
 selecting, from a ranking of the candidate variants of the given forbidden term, a set of forbidden variants of the given forbidden term, wherein the ranking of the candidate terms is based on the score for each candidate variant of the forbidden term; and 
 preventing, by the one or more servers, distribution of content that depicts a term included in the set of forbidden variants of the given forbidden term in response to the term being classified as a variant of the given forbidden term and included in the set of forbidden variants of the given forbidden term. 
 
     
     
       11. The method of  claim 10 , wherein identifying the set of received queries for which the given forbidden term was used to identify the search result in response to the received query even though the given forbidden term was not included in queries included in the set of received queries comprises identifying a given received query that was expanded by the search engine to include the given forbidden term. 
     
     
       12. The method of  claim 10 , further comprising identifying, using a semantic network of terms, a term semantically linked to the given forbidden term as a candidate variant of the given forbidden term. 
     
     
       13. The method of  claim 10 , wherein selecting, from the ranking of the candidate variants of the given forbidden term, the set of forbidden variants of the given forbidden term comprises:
 generating the ranking of the candidate variants by ordering the candidate variants based on the score for each candidate variant; and 
 selecting, as the forbidden variants of the forbidden term, one or more of the candidate variants from the ranking of the candidate variants based on the score for each candidate variant. 
 
     
     
       14. The method of  claim 13 , wherein:
 the set of candidate variants includes a first candidate variant for which a spelling of the first candidate variant was corrected to the candidate forbidden term and a second candidate variant that was added to a received query that included the given forbidden term; 
 the score for the first candidate variant is further based on an edit distance between the first candidate variant and the given forbidden term; and 
 the score for the second candidate variant is further based on inverse document frequency score for the second candidate variant. 
 
     
     
       15. The method of  claim 10 , wherein identifying, in the query log, the set of received queries for which the given forbidden term was used to identify the search result that was provided in response to the received query even though the given forbidden term was not included in the received query comprises using a map procedure to identify, from the query log, candidate variants of each forbidden term. 
     
     
       16. The method of  claim 15 , wherein selecting, from the ranking of the candidate variants of the given forbidden term, the set of forbidden variants of the given forbidden term comprises using a reduce procedure for the given forbidden term to select, from the candidate variants for the given forbidden term, one or more forbidden variants of the forbidden term, wherein each reduce procedure is performed on a separate back-end server. 
     
     
       17. A non-transitory computer storage medium encoded with a computer program, the program comprising instructions that when executed by one or more data processing apparatus cause the data processing apparatus to perform operations comprising:
 receiving, by one or more servers, content for distribution over a data communication network; 
 identifying, in a query log that includes received queries, a set of received queries for which a given forbidden term was used to identify a search result that was provided in response to the received query even though the given forbidden term was not included in the received query; 
 generating a set of candidate variants of the given forbidden term based on the set of received queries identified in the query log, the identifying comprising:
 for each query in the set of received queries:
 determining, based on data of the query log, that one or more terms in the query caused, when processed by a search engine in response to receiving the query, the search engine to use the forbidden term to identify one or more search results provided in response to the query; and 
 classifying each of the one or more terms as a candidate variant of the given forbidden term in response to determining that the one or more terms that caused the search engine to use the given forbidden term to identify the one or more search results provided in response to the query, 
 
 
 determining a score for each candidate variant of the given forbidden term based on a frequency at which the candidate variant of the forbidden term occurs in the query log; 
 selecting, from a ranking of the candidate variants of the given forbidden term, a set of forbidden variants of the given forbidden term, wherein the ranking of the candidate terms is based on the score for each candidate variant of the forbidden term; and 
 preventing, by the one or more servers, distribution of content that depicts a term included in the set of forbidden variants of the given forbidden term in response to the term being classified as a variant of the given forbidden term and included in the set of forbidden variants of the given forbidden term. 
 
     
     
       18. The non-transitory computer storage medium of  claim 17 , wherein identifying the set of received queries for which the given forbidden term was used to identify the search result in response to the received query even though the given forbidden term was not included in queries included in the set of received queries comprises identifying a given received query that was expanded by the search engine to include the given forbidden term. 
     
     
       19. The non-transitory computer storage medium of  claim 17 , wherein selecting, from the ranking of the candidate variants of the given forbidden term, the set of forbidden variants of the given forbidden term comprises:
 generating the ranking of the candidate variants by ordering the candidate variants based on the score for each candidate variant; and 
 selecting, as the forbidden variants of the forbidden term, one or more of the candidate variants from the ranking of the candidate variants based on the score for each candidate variant. 
 
     
     
       20. The non-transitory computer storage medium of  claim 19 , wherein:
 the set of candidate variants includes a first candidate variant for which a spelling of the first candidate variant was corrected to the candidate forbidden term and a second candidate variant that was added to a received query that included the given forbidden term; 
 the score for the first candidate variant is further based on an edit distance between the first candidate variant and the given forbidden term; and 
 the score for the second candidate variant is further based on inverse document frequency score for the second candidate variant.

Cited by (0)

No later patents cite this yet.

References (0)

No backward citations on record.