Detecting, diagnosing, and directing solutions for source type mislabeling of machine data, including machine data that may contain PII, using machine learning
Abstract
A computerized method of diagnosing a mislabeling of a source type of a received event is disclosed. The method comprises operations of receiving an event by a source type analysis logic with a data index and query system, wherein the event includes a portion of raw machine data and is associated with a specific point in time, and obtaining an original source type assigned to the event and one or more predicted source types. The one or more predicted source types are determined by analysis of a data representation of the event in view of training data, and the training data includes a plurality of data representations corresponding to known source types. Additionally, the computerized method includes an operation of determining whether the event has been mislabeled and, in response to determining the event has been mislabeled, diagnosing a source of the mislabeling.
Claims
What is claimed is:
1. A computerized method of diagnosing a labeling of a source type of an event using machine learning techniques, the method comprising:
receiving the event by a source type analysis logic with a data index and query system, wherein the event includes a portion of raw machine data and is associated with a specific point in time;
obtaining one or more predicted source types of the event, the one or more predicted source types being determined by analyzing a data representation of the event in view of training data, wherein the training data includes a plurality of data representations corresponding to known source types;
determining whether the event has been mislabeled by determining whether an original source type of the event is one or more of empty, missing, or incorrect; and
responsive to determining the event has been mislabeled based on a discrepancy between the original source type and the predicted source type, diagnosing a source of the mislabeling.
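By way of non-limiting illustration only, the determining and diagnosing operations of claim 1 might be sketched as follows. The `Event` class, the `labeled_by` field, the `check_label` function, and the example source type names are hypothetical and are not taken from the claims; the predicted source types are assumed to have been obtained already from a trained model.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Event:
    raw: str                             # portion of raw machine data
    timestamp: float                     # the specific point in time
    original_source_type: Optional[str]  # None or "" when empty or missing
    labeled_by: str                      # hypothetical: mechanism that assigned the label


def check_label(event: Event, predicted: List[Tuple[str, float]]) -> Optional[dict]:
    """Return a diagnosis dict if the event appears mislabeled, else None.

    `predicted` is a list of (source_type, probability) pairs in any order.
    """
    top_type, top_prob = max(predicted, key=lambda pair: pair[1])
    original = (event.original_source_type or "").strip()
    if not original:
        reason = "missing"        # empty or missing original source type
    elif original != top_type:
        reason = "incorrect"      # discrepancy with the predicted source type
    else:
        return None               # label agrees with the prediction
    return {
        "reason": reason,
        "original": original or None,
        "predicted": top_type,
        "probability": top_prob,
        "suspected_source": event.labeled_by,
    }
```

In this sketch the diagnosis simply reports the mechanism recorded as having assigned the original label; a fuller diagnosis would likely inspect that mechanism itself.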
2. The computerized method of claim 1 , wherein the original source type is assigned according to one of a configuration file, one or more predefined rules, or a predetermined signature.
3. The computerized method of claim 1 , wherein each of the one or more predicted source types includes a probability, wherein a first probability of a first predicted source type indicates a likelihood that the first predicted source type is a correct source type of the event.
4. The computerized method of claim 1 , wherein the determining of whether the event has been mislabeled comprises comparing the original source type to a first predefined source type of the one or more predicted source types to determine whether a match exists.
5. The computerized method of claim 1 , wherein the determining of whether the event has been mislabeled comprises determining that the original source type was not assigned to the event, and selecting a predicted source type to assign to the event.
6. The computerized method of claim 1 , wherein the determining of whether the event has been mislabeled comprises: (i) determining whether a first probability of a first predicted source type of the one or more source types is greater than or equal to a first threshold, and (ii) responsive to determining the first probability is greater than or equal to the first threshold, comparing the first predicted source type with the original source type to determine whether a match exists, wherein the first probability of the first predicted source type indicates a likelihood that the event corresponds to the first predicted source.
7. The computerized method of claim 1 , wherein the determining of whether the event has been mislabeled comprises: (i) determining whether a first probability of a first predicted source type of the one or more source types is greater than or equal to a first threshold, and (ii) responsive to determining the first probability is greater than or equal to the first threshold, comparing the first predicted source type with the original source type to determine whether a match exists, wherein the first probability of the first predicted source type indicates a likelihood that the event corresponds to the first predicted source, and wherein the first threshold is determined by a source type of at least one of the one or more predicted source types.
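As a hedged illustration of the threshold comparison recited in claims 6 and 7, the following sketch compares the original source type with the first predicted source type only when its probability meets a threshold, and lets that threshold depend on the predicted source type. The function name, the default value of 0.8, and the returned strings are assumptions made for the example.

```python
from typing import Dict, Optional


def threshold_check(
    original: str,
    predicted: Dict[str, float],
    default_threshold: float = 0.8,                    # assumed value
    per_type_threshold: Optional[Dict[str, float]] = None,
) -> str:
    """Return "match", "mislabeled", or "inconclusive"."""
    per_type_threshold = per_type_threshold or {}
    # First predicted source type: the one with the highest probability.
    top_type, top_prob = max(predicted.items(), key=lambda kv: kv[1])
    # The threshold may be determined by the predicted source type itself.
    threshold = per_type_threshold.get(top_type, default_threshold)
    if top_prob < threshold:
        return "inconclusive"  # prediction too weak to compare against
    return "match" if original == top_type else "mislabeled"
```

For example, `threshold_check("syslog", {"access_combined": 0.85, "syslog": 0.10})` returns `"mislabeled"`, while the same call with a top probability of 0.5 returns `"inconclusive"`.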
8. The computerized method of claim 1 , wherein the determining of whether at least two predicted source types of the one or more predicted source types each correspond to probabilities that are greater than or equal to a first threshold; and
responsive to determining the at least two predicted source types of the one or more predicted source types each correspond to probabilities that are greater than or equal to the first threshold, generating and providing an alert to an analyst indicating the at least two predicted source types of the one or more predicted source types each correspond to probabilities that are greater than or equal to the first threshold.
9. The computerized method of claim 1 , wherein the determining of whether at least two predicted source types of the one or more predicted source types each correspond to probabilities that are greater than or equal to a first threshold; and
responsive to determining the at least two predicted source types of the one or more predicted source types each correspond to probabilities that are greater than or equal to the first threshold, determining the event has been mislabeled when the original source type does not match a first source type of the at least two predicted source types having a highest probability.
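The handling of two or more predicted source types that each meet the first threshold, as recited in claims 8 and 9, might look like the sketch below. It is one possible reading only: the `evaluate_candidates` name, the alert wording, and the threshold of 0.4 are assumptions, and the sketch combines the alert of claim 8 with the highest-probability comparison of claim 9 in a single function for brevity.

```python
from typing import Dict, List, Optional, Tuple


def evaluate_candidates(
    original: str, predicted: Dict[str, float], threshold: float = 0.4
) -> Tuple[Optional[str], Optional[bool]]:
    """Return (alert_text, mislabeled) for the multi-candidate case.

    Both values are None when fewer than two predicted source types
    meet the threshold, since that case is handled elsewhere.
    """
    candidates: List[Tuple[str, float]] = sorted(
        ((t, p) for t, p in predicted.items() if p >= threshold),
        key=lambda tp: tp[1],
        reverse=True,
    )
    if len(candidates) < 2:
        return None, None
    names = ", ".join(f"{t} ({p:.2f})" for t, p in candidates)
    alert = f"Multiple predicted source types meet threshold {threshold:.2f}: {names}"
    # Mislabeled when the original does not match the highest-probability candidate.
    mislabeled = original != candidates[0][0]
    return alert, mislabeled
```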
10. The computerized method of claim 1 , wherein the obtaining of the one or more predicted source types of the event comprises:
generating the data representation of the event includes, wherein the data representation of the event includes content of the event other than personally identifiable information, and wherein the computerized method further comprises:
determining, from the data representation including the content of the event other than the personally identifiable information, that the original source type is an indicator other than a known source type, the indicator representing that the original source type is not one of a plurality of known source types.
11. The computerized method of claim 1 , wherein the obtaining of the one or more predicted source types of the event comprises:
generating the data representation of the event, wherein the data representation of the event includes content of the event other than personally identifiable information, and wherein the computerized method further comprises:
assigning, based on the one or more predicted source types of the event that are determined from the data representation including the content of the event other than the personally identifiable information, a predicted source type of the event when the original source type for the event is blank or missing.
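Claims 10 and 11 recite a data representation that includes content of the event other than personally identifiable information. One hypothetical way to build such a representation is sketched below: spans resembling e-mail addresses and IPv4 addresses are dropped, and only the punctuation pattern of the remaining text is kept. The regular expressions, the choice of a punctuation pattern as the representation, and the function names are assumptions for illustration; the claims do not specify them, and these two patterns cover only a small subset of what may constitute personally identifiable information.

```python
import re
from typing import Dict, Optional

# Assumed, deliberately simple patterns for PII-like spans.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def represent(raw: str) -> str:
    """Drop PII-like spans, then keep only the punctuation 'shape' of the event."""
    without_pii = IPV4.sub("", EMAIL.sub("", raw))
    # Remove letters, digits, and whitespace; what remains is the punctuation pattern.
    return re.sub(r"[A-Za-z0-9\s]", "", without_pii)


def assign_if_missing(original: Optional[str], predicted: Dict[str, float]) -> str:
    """Assign the most probable predicted source type when the original is blank or missing."""
    if original is None or not original.strip():
        return max(predicted, key=predicted.get)
    return original
```

Under these assumptions, two events that differ only in the e-mail address and IP address they contain yield the same representation, so the prediction does not depend on those values.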
12. The computerized method of claim 1 , further comprising:
determining the original source type is an indicator other than a known source type, the indicator representing that the original source type is not one of a plurality of known source types; and
responsive to determining the original source is the indicator other than the known source type, generating and providing an alert to an analyst indicating that the original source type is not one of a plurality of known source types.
13. The computerized method of claim 1 , further comprising:
responsive to determining the event has been mislabeled, generating and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, and (iii) a first predicted source type.
14. The computerized method of claim 1 , further comprising:
responsive to determining the event has been mislabeled, generating and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, (iii) the one or more predicted source types, and (iv) probabilities corresponding to each of the one or more predicted source types, wherein a first probability of a first predicted source type indicates a likelihood that the event corresponds to the first predicted source.
15. The computerized method of claim 1 , wherein diagnosing the source of the mislabeling includes determining the source of the mislabeling, and generating and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, and (iii) the one or more predicted source types, and (iv) the source of the mislabeling.
16. The computerized method of claim 1 , wherein diagnosing the source of the mislabeling includes determining the source of the mislabeling, determining a method used in mislabeling the event, and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, and (iii) the one or more predicted source types, (iv) the source of the mislabeling, and (v) the method used in mislabeling the event.
17. The computerized method of claim 1 , further comprising:
responsive to determining the event has been mislabeled, determining a specific solution to the mislabeling, and generating and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, (iii) a first predicted source type, and (iv) the specific solution.
18. The computerized method of claim 1 , further comprising:
responsive to determining the event has been mislabeled, determining a specific solution to the mislabeling, and generating and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, (iii) a first predicted source type, and (iv) the specific solution, wherein the specific solution is one of a first instruction to update a configuration file or a second instruction to update one or more rules used in labeling the event.
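An alert of the kind recited in claims 13 through 18 could be assembled as in the following sketch. The dictionary keys, the `labeled_by` values, and the wording of each instruction are hypothetical; the sketch only shows how a specific solution might be selected according to whether a configuration file or one or more rules assigned the original source type.

```python
from typing import Dict, Optional

# Hypothetical instruction templates keyed by the mechanism that assigned the label.
SOLUTIONS = {
    "configuration file": "Update the configuration file so that this input maps to '{predicted}'.",
    "rule": "Update the labeling rule(s) so that events like this one map to '{predicted}'.",
}


def build_alert(
    event_raw: str,
    original: Optional[str],
    predicted: Dict[str, float],
    labeled_by: str,
) -> dict:
    """Assemble an analyst alert for an event already determined to be mislabeled."""
    top_type = max(predicted, key=predicted.get)
    template = SOLUTIONS.get(labeled_by)
    return {
        "event": event_raw,
        "original_source_type": original,
        "predicted_source_types": predicted,  # includes the probabilities
        "first_predicted_source_type": top_type,
        "source_of_mislabeling": labeled_by,
        # None when no instruction is known for this labeling mechanism.
        "specific_solution": template.format(predicted=top_type) if template else None,
    }
```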
19. A non-transitory computer readable storage medium having instructions stored thereon that, in response to execution by a processing device, cause the processing device to perform operations of diagnosing a labeling of a source type of an event using machine learning techniques, the operations including:
receiving the event by a source type analysis logic with a data index and query system, wherein the event includes a portion of raw machine data and is associated with a specific point in time;
obtaining one or more predicted source types of the event, the one or more predicted source types being determined by analyzing a data representation of the event in view of training data, wherein the training data includes a plurality of data representations corresponding to known source types;
determining whether the event has been mislabeled by determining whether an original source type of the event is one or more of empty, missing, or incorrect; and
responsive to determining the event has been mislabeled based on a discrepancy between the original source type and the predicted source type, diagnosing a source of the mislabeling.
20. The non-transitory computer readable storage medium of claim 19 , wherein the original source type is assigned according to one of a configuration file, one or more predefined rules, or a predetermined signature.
21. The non-transitory computer readable storage medium of claim 19 , wherein each of the one or more predicted source types includes a probability, wherein a first probability of a first predicted source type indicates a likelihood that the first predicted source type is a correct source type of the event.
22. The non-transitory computer readable storage medium of claim 19 , wherein the determining of whether the event has been mislabeled comprises: (i) determining whether a first probability of a first predicted source type of the one or more source types is greater than or equal to a first threshold, and (ii) responsive to determining the first probability is greater than or equal to the first threshold, comparing the first predicted source type with the original source type to determine whether a match exists, wherein the first probability of the first predicted source type indicates a likelihood that the event corresponds to the first predicted source.
23. The non-transitory computer readable storage medium of claim 19 , wherein the obtaining of the one or more predicted source types of the event comprises:
generating the data representation of the event includes, wherein the data representation of the event includes content of the event other than personally identifiable information, and wherein the computerized method further comprises:
determining, from the data representation including the content of the event other than the personally identifiable information, that the original source type is an indicator other than a known source type, the indicator representing that the original source type is not one of a plurality of known source types.
24. The non-transitory computer readable storage medium of claim 19 , wherein the instructions, in response to execution by a processing device, cause the processing device to perform further operations including:
responsive to determining the event has been mislabeled, generating and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, and (iii) a first predicted source type.
25. A system comprising:
a memory to store executable instructions; and
a processing device coupled with the memory, wherein the instructions, when executed by the processing device, cause operations including:
receiving the event by a source type analysis logic with a data index and query system, wherein the event includes a portion of raw machine data and is associated with a specific point in time;
obtaining one or more predicted source types of the event, the one or more predicted source types being determined by analyzing a data representation of the event in view of training data, wherein the training data includes a plurality of data representations corresponding to known source types;
determining whether the event has been mislabeled by determining whether an original source type of the event is one or more of empty, missing, or incorrect; and
responsive to determining the event has been mislabeled based on a discrepancy between the original source type and the predicted source type, diagnosing a source of the mislabeling.
26. The system of claim 25 , wherein the original source type is assigned according to one of a configuration file, one or more predefined rules, or a predetermined signature.
27. The system of claim 25 , wherein each of the one or more predicted source types includes a probability, wherein a first probability of a first predicted source type indicates a likelihood that the first predicted source type is a correct source type of the event.
28. The system of claim 25 , wherein the determining of whether the event has been mislabeled comprises: (i) determining whether a first probability of a first predicted source type of the one or more source types is greater than or equal to a first threshold, and (ii) responsive to determining the first probability is greater than or equal to the first threshold, comparing the first predicted source type with the original source type to determine whether a match exists, wherein the first probability of the first predicted source type indicates a likelihood that the event corresponds to the first predicted source.
29. The system of claim 25 , wherein the obtaining of the one or more predicted source types of the event comprises:
generating the data representation of the event includes, wherein the data representation of the event includes content of the event other than personally identifiable information, and wherein the computerized method further comprises:
determining, from the data representation including the content of the event other than the personally identifiable information, that the original source type is an indicator other than a known source type, the indicator representing that the original source type is not one of a plurality of known source types.
30. The system of claim 25 , wherein the instructions stored in the memory, when executed by the processing device, cause operations further including:
responsive to determining the event has been mislabeled, generating and providing an alert to an analyst, the alert including at least (i) the event, (ii) the original source type, and (iii) a first predicted source type.