System and method for machine learning model determination and malware identification
Abstract
A system and method for batched, supervised, in-situ machine learning classifier retraining for malware identification and model heterogeneity. The method produces a parent classifier model in one location and provides it to one or more in-situ retraining systems in one or more different locations, adjudicates the class determination of the parent classifier over the plurality of samples evaluated by the in-situ retraining system or systems, determines a minimum number of adjudicated samples required to initiate the in-situ retraining process, creates a new training and test set using samples from one or more in-situ systems, blends a feature vector representation of the in-situ training and test sets with a feature vector representation of the parent training and test sets, conducts machine learning over the blended training set, evaluates the new and parent models using the blended test set and additional unlabeled samples, and elects whether to replace the parent classifier with the retrained version.
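The blend, retrain, evaluate, and elect steps described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the nearest-centroid learner is a stand-in (the abstract does not name a learning algorithm), `MIN_ADJUDICATED`, `blend`, and `retrain_and_elect` are hypothetical names, and the election rule (replace the parent only if the retrained model is strictly more accurate on the blended test set) is one plausible choice. The evaluation over additional unlabeled samples is omitted.

```python
import random
from statistics import fmean

# A sample is (feature_vector, adjudicated_label); 1 = malicious, 0 = benign.
MIN_ADJUDICATED = 4  # hypothetical minimum number of adjudicated in-situ samples


def blend(parent, in_situ, parent_fraction, rng):
    """Mix a random portion of the parent samples with all in-situ samples."""
    keep = round(len(parent) * parent_fraction)
    return rng.sample(parent, keep) + list(in_situ)


def train_centroids(samples):
    """Stand-in learner (an assumption): one mean feature vector per class."""
    model = {}
    for label in sorted({y for _, y in samples}):
        rows = [x for x, y in samples if y == label]
        model[label] = [fmean(column) for column in zip(*rows)]
    return model


def predict(model, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(model, key=lambda label: dist2(model[label]))


def accuracy(model, samples):
    """Fraction of samples whose predicted label matches the adjudicated one."""
    return sum(predict(model, x) == y for x, y in samples) / len(samples)


def retrain_and_elect(parent_model, parent_train, parent_test,
                      site_train, site_test, parent_fraction=0.5, seed=0):
    """Return (model_to_use, replaced) after one in-situ retraining attempt."""
    if len(site_train) + len(site_test) < MIN_ADJUDICATED:
        return parent_model, False  # not enough adjudicated samples yet
    rng = random.Random(seed)
    blended_train = blend(parent_train, site_train, parent_fraction, rng)
    blended_test = blend(parent_test, site_test, 1.0, rng)
    candidate = train_centroids(blended_train)
    replaced = accuracy(candidate, blended_test) > accuracy(parent_model, blended_test)
    return (candidate if replaced else parent_model), replaced


# Toy data: the in-situ site sees malicious files the parent model calls benign.
parent_train = [([0.0, 0.0], 0), ([0.2, 0.1], 0), ([1.0, 1.0], 1), ([0.9, 1.1], 1)]
parent_test = [([0.1, 0.1], 0), ([1.0, 0.9], 1)]
site_train = [([0.4, 0.3], 1), ([0.5, 0.4], 1), ([0.0, 0.1], 0)]
site_test = [([0.45, 0.35], 1), ([0.05, 0.0], 0)]

parent_model = train_centroids(parent_train)
model, replaced = retrain_and_elect(parent_model, parent_train, parent_test,
                                    site_train, site_test, parent_fraction=1.0)
```

Only feature vectors and labels cross between the parent and in-situ data sets in this sketch; the underlying files never do. The `parent_fraction` argument plays the role of a configurable blend amount.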
Claims
exact text as granted — not AI-modified
The invention claimed is:
1. A method comprising:
receiving first information associated with a first plurality of files associated with a first organization, wherein the first information does not comprise sensitive data associated with the first plurality of files;
based on the first information and second information associated with a second plurality of files associated with a second organization, training a machine learning model usable by the second organization for classifying files as comprising malicious content or benign content;
receiving a file comprising unknown content; and
causing output of an indication that the file comprises malicious content.
2. The method of claim 1, wherein the first information comprises a first feature vector representation of the first plurality of files.
3. The method of claim 2, wherein the first feature vector representation comprises non-sensitive data associated with the first organization.
4. The method of claim 1, wherein the first plurality of files is associated with a first plurality of adjudicated classifications.
5. The method of claim 1, wherein the training uses at least a portion of each of the first information and the second information to form a training data set.
6. The method of claim 5, further comprising:
receiving an indication of an amount of each portion of each of the first information and the second information to be used in the training data set.
7. The method of claim 1, wherein the first information indicates at least one of a file header property, a component of a file, or a binary sequence.
8. A device comprising:
one or more processors; and
memory storing instructions that, when executed by the one or more processors, cause the device to:
receive first information associated with a first plurality of files associated with a first organization, wherein the first information does not comprise sensitive data associated with the first plurality of files;
based on the first information and second information associated with a second plurality of files associated with a second organization, train a machine learning model usable by the second organization for classifying files as comprising malicious content or benign content;
receive a file comprising unknown content; and
cause output of an indication that the file comprises malicious content.
9. The device of claim 8, wherein the first information comprises a first feature vector representation of the first plurality of files.
10. The device of claim 9, wherein the first feature vector representation comprises non-sensitive data associated with the first organization.
11. The device of claim 8, wherein the first plurality of files is associated with a first plurality of adjudicated classifications.
12. The device of claim 8, wherein the training uses at least a portion of each of the first information and the second information to form a training data set.
13. The device of claim 12, wherein the instructions, when executed by the one or more processors, further cause the device to:
receive an indication of an amount of each portion of each of the first information and the second information to be used in the training data set.
14. The device of claim 8, wherein the first information indicates at least one of a file header property, a component of a file, or a binary sequence.
15. A non-transitory computer-readable storage medium storing computer-readable instructions that, when executed by one or more processors, cause:
receiving first information associated with a first plurality of files associated with a first organization, wherein the first information does not comprise sensitive data associated with the first plurality of files;
based on the first information and second information associated with a second plurality of files associated with a second organization, training a machine learning model usable by the second organization for classifying files as comprising malicious content or benign content;
receiving a file comprising unknown content; and
causing output of an indication that the file comprises malicious content.
16. The non-transitory computer-readable storage medium of claim 15, wherein the first information comprises a first feature vector representation of the first plurality of files, wherein the first feature vector representation comprises non-sensitive data associated with the first organization.
17. The non-transitory computer-readable storage medium of claim 15, wherein the first plurality of files is associated with a first plurality of adjudicated classifications.
18. The non-transitory computer-readable storage medium of claim 15, wherein the training uses at least a portion of each of the first information and the second information to form a training data set.
19. The non-transitory computer-readable storage medium of claim 18, wherein the instructions, when executed, further cause:
receiving an indication of an amount of each portion of each of the first information and the second information to be used in the training data set.
20. The non-transitory computer-readable storage medium of claim 15, wherein the first information indicates at least one of a file header property, a component of a file, or a binary sequence.
21. A system comprising:
at least one first computer device configured to:
receive first information associated with a first plurality of files associated with a first organization, wherein the first information does not comprise sensitive data associated with the first plurality of files;
based on the first information and second information associated with a second plurality of files associated with a second organization, train a machine learning model usable by the second organization for classifying files as comprising malicious content or benign content;
receive a file comprising unknown content; and
cause output of an indication that the file comprises malicious content; and
at least one second computer device configured to:
send, to the first computer device, the first information.
22. The system of claim 21, wherein the first information comprises a first feature vector representation of the first plurality of files.
23. The system of claim 22, wherein the first feature vector representation comprises non-sensitive data associated with the first organization.
24. The system of claim 21, wherein the first plurality of files is associated with a first plurality of adjudicated classifications.
25. The system of claim 21, wherein the training uses at least a portion of each of the first information and the second information to form a training data set.
26. The system of claim 25, wherein the instructions, when executed by the one or more processors, further cause the device to:
receive an indication of an amount of each portion of each of the first information and the second information to be used in the training data set.
27. The system of claim 21, wherein the first information indicates at least one of a file header property, a component of a file, or a binary sequence.