US10318650B2 · Active · Utility · PatentIndex 71

Identifying corrupted text segments

Assignee: IBM · Priority: Mar 3, 2016 · Filed: Jun 5, 2018 · Granted: Jun 11, 2019

Est. expiry: Mar 3, 2036 (~9.7 yrs left) · nominal 20-yr term from priority

Inventors: HUANG CHAO-YUAN, TSAI YI-LIN, WANG DER-JOUNG, Wu Yen-Min

G06F 40/263, G06F 40/166, G06F 16/31, G06F 16/2365, G06F 17/30371, G06F 17/24, G06F 17/275, G06F 17/30613

Abstract

A computer system for taking a corrective action upon determination of an existence of a corrupted text segment within a set of web pages. Determination includes: determining a language affinity indicator corresponding to text segments within the set of web pages; generating an indexing repository based on a set of text artifacts within the text segments; creating an occurrence table for the set of text artifacts; and determining compliance of the text artifacts and text segments based on the single language grouping on which the set of text segments are based.
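
The first steps named in the abstract (a language affinity indicator per text segment, then an occurrence table over text artifacts) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the "predefined rules" are modelled as allowed Unicode code point ranges, a "text artifact" is modelled as a character bigram, and the 0.8 affinity cutoff and all function names are hypothetical.

```python
from collections import Counter

def language_affinity(segment, allowed_ranges):
    """Fraction of non-space characters inside the allowed code point ranges.

    Assumption: the 'predefined rules' for a language grouping are
    approximated by a list of (low, high) Unicode code point ranges.
    """
    chars = [c for c in segment if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        1 for c in chars
        if any(lo <= ord(c) <= hi for lo, hi in allowed_ranges)
    )
    return hits / len(chars)

def bigrams(segment):
    """Assumption: a 'text artifact' is a character bigram, whitespace ignored."""
    chars = [c for c in segment if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]

def build_occurrence_table(segments, allowed_ranges, min_affinity=0.8):
    """Map each artifact to its relative frequency across accepted segments.

    Only segments whose affinity reaches min_affinity (a hypothetical
    cutoff) contribute artifacts to the table.
    """
    counts = Counter()
    for seg in segments:
        if language_affinity(seg, allowed_ranges) >= min_affinity:
            counts.update(bigrams(seg))
    total = sum(counts.values())
    return {a: n / total for a, n in counts.items()} if total else {}

# Basic Latin letters only, as a stand-in for one language grouping.
LATIN = [(0x41, 0x5A), (0x61, 0x7A)]
table = build_occurrence_table(["abab", "\ufffd\ufffd\ufffd"], LATIN)
```

In this sketch the second segment consists only of replacement characters, so its affinity is 0 and it contributes nothing to the table. A real system would need rules suited to the language grouping actually being checked.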

Claims

exact text as granted — not AI-modified

What is claimed is: 
     
       1. A computer system comprising:
 one or more computer processors; 
 one or more computer readable storage device; 
 program instructions stored on the one or more computer readable storage devices for execution by at least one of the one or more computer processors, the stored program instructions comprising:
 program instructions to select a set of web pages containing text associated with a single language grouping; 
 program instructions to determine a set of text segments within the set of web pages; 
 program instructions to determine a language affinity indicator corresponding to each text segment in the set of text segments, the language affinity indicator being a comparison value of a text segment with a set of predefined rules corresponding to the single language grouping; 
 program instructions to, responsive to each language affinity indicator indicating an affinity to the single language grouping, identify a set of text artefacts within the text segments; 
 program instructions to generate an indexing repository based on the set of text artefacts; 
 program instructions to create an occurrence table from the indexing repository; 
 program instructions to determine a compliance threshold value for the occurrence table; 
 program instructions to identify an individual occurrence value for each unique text artefact in the set of text artefacts, the individual occurrence value being the probability that a text artefact occurs within the occurrence table based on the single language grouping; and 
 program instructions to determine a compliance value for the set of text segments by, for each text segment in the set of text segments:
 program instructions to compute a compliance sum value for a first text segment in the set of text segments; 
 program instructions to adjust the compliance sum value according to the individual occurrence values of a subset of text artefacts occurring in the first text segment; 
 program instructions to determine a segment length value associated with the first text segment; and 
 program instructions to adjust the compliance sum value according to the segment length value; 
 program instructions to, responsive to computing a set of compliance sum values for each text segment in the set of text segments, compute the compliance value based on an average value of the set of compliance sum values; 
 program instructions to compute a compliance indicator for the set of text segments by comparing the compliance value and the compliance threshold; and 
 program instructions to, responsive to the compliance indicator indicating that the compliance value is less than the compliance threshold, take a corrective action; 
 
 wherein: 
 the corrective action is an action selected from the group consisting of:
 notifying a user of a corrupted set of text segments in the selected set of web pages; and 
 preventing the selected set of web pages from being accessed again.
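
The scoring steps recited in claim 1 (a compliance sum per segment, adjusted by occurrence values and segment length, averaged into a compliance value and compared with a threshold) can be sketched as follows. This is an illustrative reading, not the patented implementation: the claim does not say how the sum is "adjusted", so the sketch assumes occurrence values are added and the result is divided by the number of artefacts in the segment. The function names and the `notify` callback are hypothetical.

```python
from statistics import mean

def segment_compliance_sum(artefacts, occurrence):
    """Compliance sum for one segment, given its list of artefacts.

    Assumption: 'adjusting' by occurrence values means adding them up,
    and 'adjusting' by segment length means dividing by the artefact count.
    Artefacts missing from the occurrence table count as probability 0.
    """
    total = 0.0
    for artefact in artefacts:
        total += occurrence.get(artefact, 0.0)
    length = len(artefacts)
    return total / length if length else 0.0

def compliance_value(segments_artefacts, occurrence):
    """Average of the per-segment compliance sums."""
    sums = [segment_compliance_sum(a, occurrence) for a in segments_artefacts]
    return mean(sums) if sums else 0.0

def check_pages(segments_artefacts, occurrence, threshold, notify):
    """Return True if compliant; otherwise call notify and return False.

    notify stands in for the claimed corrective action of notifying a user;
    blocking further access to the pages would be the other claimed option.
    """
    value = compliance_value(segments_artefacts, occurrence)
    compliant = value >= threshold
    if not compliant:
        notify(value)
    return compliant

# Hypothetical occurrence table and two segments, one made of unseen artefacts.
occurrence = {"ab": 0.5, "ba": 0.25}
segments = [["ab", "ba"], ["zz", "zz"]]
alerts = []
ok = check_pages(segments, occurrence, threshold=0.2, notify=alerts.append)
```

Here the first segment scores 0.375 and the second 0.0, so the average of 0.1875 falls below the example threshold of 0.2 and the notification callback fires. How the threshold itself is derived from the occurrence table is left open, since the claim only states that one is determined.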
