US7542970B2 · Expired · Utility · PatentIndex 78

System and method for selecting a sub-domain for a specified domain of the web

Assignee: IBM · Priority: May 11, 2006 · Filed: May 11, 2006 · Granted: Jun 2, 2009
Est. expiry: May 11, 2026 (expired) · nominal 20-yr term from priority
Inventors: HOLMES SCOTT RICHARD · MI HONGCHENG · NEGI SUMIT · ZHANG ZENGYAN
Y10S707/99932 · G06F 16/951 · G06F 16/958 · Y10S707/99934 · Y10S707/99933 · Y10S707/99935
PatentIndex Score: 78 · Cited by: 8 · References: 12 · Claims: 9

Abstract

A selection system and method. The selection method comprises receiving, by a computing system, a taxonomy of data related to a specified domain of knowledge on the web. A taxonomy tree is constructed from the taxonomy. A sub-tree related to a sub-domain of the specified domain is selected from the taxonomy tree. A first list comprising user-expected universal resource locators (URLs) related to the sub-domain is received. A second list comprising topic expressions defining each node of the taxonomy sub-tree is generated. A query based on the second list is generated. The query is applied to an index of URLs generated from a web crawling process to generate a third list. A recall value is calculated based on the first list and the third list.

Claims

exact text as granted — not AI-modified
1. A selection method, comprising:
 receiving, by a computing system, a taxonomy of data related to a specified domain of knowledge on the web; 
 storing, by said computing system, said taxonomy of data; 
 constructing, by a software application within said computing system, a taxonomy tree from said taxonomy; 
 receiving, by said computing system, a user selection for a taxonomy sub-tree from said taxonomy tree, said sub tree related to a sub-domain from said specified domain; 
 receiving, by said computing system from a user, a first list comprising user expected universal resource locators (URLs) related to said sub-domain, wherein said user selection is associated with a published list of URLs; 
 generating, by said software application, a second list comprising topic expressions defining each node of said taxonomy sub-tree; 
 receiving, by said software application, a first command for removing a first topic expression of said topic expressions from said second list; 
 removing, by said software application in response to said first command, said first topic expression from said second list; 
 receiving, by said software application, a second command for adding a second topic expression to said second list; 
 adding, by said software application in response to said second command, said second topic expression to said second list; 
 after said removing and said adding, generating by said software application, a query based on said second list by applying at least one Boolean operator on said topic expressions on said second list; 
 applying, by said software application, said query on an index of URLs, said index generated from a web crawling process; 
 generating, by said query, a third list comprising actual URLs located during said query; 
 determining, by said software application, a first group (A) of URLs that are listed on and common to said third list and said first list; 
 determining, by said software application, a second group (B) of URLs that are listed on only said first list; 
 calculating, by said software application, a recall value (R) based on a number of URLs in said first group (N_A) and a number of URLs in said second group (N_B), wherein R = N_A / N_B; 
 randomly sampling, by said software application, said third list to generate a sampled list (D) of URLs from said third list; 
 sending, said sampled list (D) to said user of said computing system; 
 receiving, by said computing system, a user selected sub-list (C) of URLs from said sampled list (D), said user selected sublist based on a selection criteria; 
 calculating, by said software application, a precision value (P) based on a number of URLs on said user selected sub-list (N_C) and a number of URLs on said sampled list (N_D) wherein P = N_C / N_D; 
 comparing, by said computing system, said recall value to a predetermined recall value; 
 determining, by said computing system based on first results of said comparing said recall value to said predetermined recall value, that said recall value comprises an acceptable recall value; 
 comparing, by said computing system, said precision value to a predetermined precision value; 
 determining, by said computing system based on second results of said comparing said precision value to said predetermined precision value, that said precision value comprises an acceptable precision value; 
 and saving, on said computing system in response to said first results and said second results, said sub-list (C). 
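
The evaluation steps at the end of claim 1 (groups A and B, recall R, the sampled list D, precision P, and the threshold comparisons) can be sketched in Python as below. This is only an illustrative sketch, not the patented implementation: the function names, the handling of empty groups, and the reading of "acceptable" as "at or above the predetermined value" are all assumptions. Note that the ratio follows the claim wording, R = N_A / N_B, which differs from the conventional recall definition N_A / (N_A + N_B).

```python
import random


def recall_as_claimed(expected_urls, actual_urls):
    """Return R = N_A / N_B exactly as the claim words it.

    A = URLs on both the expected (first) and actual (third) lists.
    B = URLs found only on the expected list.
    """
    expected, actual = set(expected_urls), set(actual_urls)
    if not expected:
        return None  # assumption: no expected URLs means nothing to measure
    n_a = len(expected & actual)
    n_b = len(expected - actual)
    if n_b == 0:
        return float("inf")  # assumption: nothing was missed, ratio is unbounded
    return n_a / n_b


def sample_urls(actual_urls, sample_size, seed=None):
    """Randomly sample list D from the third list (without replacement)."""
    pool = sorted(set(actual_urls))
    return random.Random(seed).sample(pool, min(sample_size, len(pool)))


def precision(selected_sublist, sampled_list):
    """Return P = N_C / N_D, or None when the sample is empty."""
    if not sampled_list:
        return None
    return len(selected_sublist) / len(sampled_list)


def is_acceptable(value, threshold):
    """Assumption: 'acceptable' means at or above the predetermined value."""
    return value is not None and value >= threshold


expected = ["u1", "u2", "u3", "u4"]             # first list (hypothetical URLs)
actual = ["u2", "u3", "u4", "u8", "u9"]         # third list (hypothetical URLs)
r = recall_as_claimed(expected, actual)          # A = {u2, u3, u4}, B = {u1}
sampled = sample_urls(actual, 4, seed=0)         # list D
chosen = [u for u in sampled if u in expected]   # stand-in for the user's sub-list C
p = precision(chosen, sampled)
if is_acceptable(r, 2.0) and is_acceptable(p, 0.5):
    saved_sublist = chosen                       # the final "saving" step
```

In the claim the sub-list C comes from a human reviewer applying a selection criteria; the list comprehension above merely stands in for that manual step so the sketch runs end to end.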
 
   
   
     2. The method of  claim 1 , wherein said taxonomy comprises XML code. 
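
As a rough illustration of claim 2, the sketch below parses a small XML taxonomy, selects a sub-tree, and collects one topic expression per node to form the second list. The XML schema (a `topic` element with `name` and `expr` attributes), the sample domain, and the helper names are hypothetical; the patent text does not specify them.

```python
import xml.etree.ElementTree as ET

# Hypothetical taxonomy for a "sports" domain; the schema is an assumption.
TAXONOMY_XML = """
<taxonomy domain="sports">
  <topic name="ball games" expr="ball game">
    <topic name="soccer" expr="soccer OR football"/>
    <topic name="tennis" expr="tennis"/>
  </topic>
  <topic name="water sports" expr="water sport">
    <topic name="swimming" expr="swimming"/>
  </topic>
</taxonomy>
"""


def find_subtree(root, name):
    """Return the first topic node with the given name, or None."""
    for node in root.iter("topic"):
        if node.get("name") == name:
            return node
    return None


def topic_expressions(subtree):
    """Collect the expression of every node in the sub-tree, in document order."""
    return [node.get("expr") for node in subtree.iter("topic")]


root = ET.fromstring(TAXONOMY_XML.strip())      # taxonomy tree
subtree = find_subtree(root, "ball games")      # user-selected sub-tree
second_list = topic_expressions(subtree)

# The remove and add commands of claim 1 map onto plain list edits.
second_list.remove("tennis")
second_list.append("futsal")
```

`Element.iter` visits the starting element itself as well as its descendants, so the selected node's own expression is included in the list.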
   
   
     3. The method of  claim 1 , further comprising:
 receiving, by said computing system, a fourth list comprising specified topic expressions; and 
 adding, by said software application, said fourth list to said second list. 
 
   
   
     4. The method of  claim 1 , wherein said at least Boolean operator comprises a first Boolean operator encapsulated by a second Boolean operator, and wherein said first Boolean operator differs from said second Boolean operator. 
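
One way to picture the nested operators of claim 4 is an OR group enclosed by an AND. The sketch below is an assumption-laden illustration: the tuple-based query representation, the toy in-memory index standing in for the crawled URL index, and all terms and URLs are hypothetical.

```python
def render(query):
    """Render a nested (operator, operands) query as a parenthesised string."""
    if isinstance(query, str):
        return query
    operator, operands = query
    return "(" + f" {operator} ".join(render(o) for o in operands) + ")"


def evaluate(query, index):
    """Evaluate a nested Boolean query against a term -> set-of-URLs index."""
    if isinstance(query, str):
        return set(index.get(query, ()))
    operator, operands = query
    results = [evaluate(operand, index) for operand in operands]
    if not results:
        return set()
    if operator == "AND":
        return set.intersection(*results)
    if operator == "OR":
        return set.union(*results)
    raise ValueError(f"unsupported operator: {operator}")


# Toy index standing in for the one produced by the web crawl.
index = {
    "soccer": {"a.example/1", "b.example/2"},
    "football": {"c.example/3"},
    "league": {"a.example/1", "c.example/3", "d.example/4"},
}

# A first operator (OR) encapsulated by a different second operator (AND).
query = ("AND", [("OR", ["soccer", "football"]), "league"])
third_list = sorted(evaluate(query, index))
```

Here `render(query)` yields `((soccer OR football) AND league)`, and `third_list` holds the URLs matching both the OR group and the term `league`.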
   
   
     5. A computing system comprising a processor coupled to a computer-readable memory unit, said memory unit comprising a software application and instructions that when executed by the processor implement a selection method, said method comprising:
 receiving, by said computing system, a taxonomy of data related to a specified domain of knowledge on the web; 
 storing, by said computing system, said taxonomy of data; 
 constructing, by a software application within said computing system, a taxonomy tree from said taxonomy; 
 receiving, by said computing system, a user selection for a taxonomy sub-tree from said taxonomy tree, said sub tree related to a sub-domain from said specified domain; 
 receiving, by said computing system from a user, a first list comprising user expected universal resource locators (URLs) related to said sub-domain, wherein said user selection is associated with a published list of URLs; 
 generating, by said software application, a second list comprising topic expressions defining each node of said taxonomy sub-tree; 
 receiving, by said software application, a first command for removing a first topic expression of said topic expressions from said second list; 
 removing, by said software application in response to said first command, said first topic expression from said second list; 
 receiving, by said software application, a second command for adding a second topic expression to said second list; 
 adding, by said software application in response to said second command, said second topic expression to said second list; 
 after said removing and said adding, generating by said software application, a query based on said second list by applying at least one Boolean operator on said topic expressions on said second list; 
 applying, by said software application, said query on an index of URLs, said index generated from a web crawling process; 
 generating, by said query, a third list comprising actual URLs located during said query; 
 determining, by said software application, a first group (A) of URLs that are listed on and common to said third list and said first list; 
 determining, by said software application, a second group (B) of URLs that are listed on only said first list; 
 calculating, by said software application, a recall value (R) based on a number of URLs in said first group (N_A) and a number of URLs in said second group (N_B), wherein R = N_A / N_B; 
 randomly sampling, by said software application, said third list to generate a sampled list (D) of URLs from said third list; 
 sending, said sampled list (D) to said user of said computing system; 
 receiving, by said computing system, a user selected sub-list (C) of URLs from said sampled list (D), said user selected sublist based on a selection criteria; 
 calculating, by said software application, a precision value (P) based on a number of URLs on said user selected sub-list (N_C) and a number of URLs on said sampled list (N_D) wherein P = N_C / N_D; 
 comparing, by said computing system, said recall value to a predetermined recall value; 
 determining, by said computing system based on first results of said comparing said recall value to said predetermined recall value, that said recall value comprises an acceptable recall value; 
 comparing, by said computing system, said precision value to a predetermined precision value; 
 determining, by said computing system based on second results of said comparing said precision value to said predetermined precision value, that said precision value comprises an acceptable precision value; 
 and saving, on said computing system in response to said first results and said second results, said sub-list (C). 
 
   
   
     6. The computing system of  claim 5  wherein said taxonomy comprises XML code. 
   
   
     7. The computing system of  claim 5  wherein said method further comprises:
 receiving, by said computing system, a fourth list comprising specified topic expressions; and 
 adding, by said software application, said fourth list to said first list. 
 
   
   
     8. A computer program product, comprising a computer usable medium including a software application and computer readable program code embodied therein, said computer readable program code comprising an algorithm adapted to implement a selection method within a computing system, said method comprising:
 receiving, by said computing system, a taxonomy of data related to a specified domain of knowledge on the web; 
 storing, by said computing system, said taxonomy of data; 
 constructing, by a software application within said computing system, a taxonomy tree from said taxonomy; 
 receiving, by said computing system, a user selection for a taxonomy sub-tree from said taxonomy tree, said sub tree related to a sub-domain from said specified domain; 
 receiving, by said computing system from a user, a first list comprising user expected universal resource locators (URLs) related to said sub-domain, wherein said user selection is associated with a published list of URLs; 
 generating, by said software application, a second list comprising topic expressions defining each node of said taxonomy sub-tree; 
 receiving, by said software application, a first command for removing a first topic expression of said topic expressions from said second list; 
 removing, by said software application in response to said first command, said first topic expression from said second list; 
 receiving, by said software application, a second command for adding a second topic expression to said second list; 
 adding, by said software application in response to said second command, said second topic expression to said second list; 
 after said removing and said adding, generating by said software application, a query based on said second list by applying at least one Boolean operator on said topic expressions on said second list; 
 applying, by said software application, said query on an index of URLs, said index generated from a web crawling process; 
 generating, by said query, a third list comprising actual URLs located during said query; 
 determining, by said software application, a first group (A) of URLs that are listed on and common to said third list and said first list; 
 determining, by said software application, a second group (B) of URLs that are listed on only said first list; 
 calculating, by said software application, a recall value (R) based on a number of URLs in said first group (N_A) and a number of URLs in said second group (N_B), wherein R = N_A / N_B; 
 randomly sampling, by said software application, said third list to generate a sampled list (D) of URLs from said third list; 
 sending, said sampled list (D) to said user of said computing system; 
 receiving, by said computing system, a user selected sub-list (C) of URLs from said sampled list (D), said user selected sublist based on a selection criteria; 
 calculating, by said software application, a precision value (P) based on a number of URLs on said user selected sub-list (N_C) and a number of URLs on said sampled list (N_D) wherein P = N_C / N_D; 
 comparing, by said computing system, said recall value to a predetermined recall value; 
 determining, by said computing system based on first results of said comparing said recall value to said predetermined recall value, that said recall value comprises an acceptable recall value; 
 comparing, by said computing system, said precision value to a predetermined precision value; 
 determining, by said computing system based on second results of said comparing said precision value to said predetermined precision value, that said precision value comprises an acceptable precision value; 
 and saving, on said computing system in response to said first results and said second results, said sub-list (C). 
 
   
   
     9. The computer program product of  claim 8 , wherein said method further comprises:
 receiving, by said computing system, a fourth list comprising specified topic expressions; and 
 adding, by said software application, said fourth list to said first list.
