Leveraging point inferences on HTTP transactions for HTTPS malware detection
Abstract
In one embodiment, a traffic analysis service receives captured traffic data regarding a Transport Layer Security (TLS) connection between a client and a server. The traffic analysis service applies a first machine learning-based classifier to TLS records from the traffic data, to identify a set of the TLS records that include Hypertext Transfer Protocol (HTTP) header information. The traffic analysis service estimates one or more HTTP transaction labels for the connection by applying a second machine learning-based classifier to the identified set of TLS records that include HTTP header information. The traffic analysis service augments the captured traffic data with the one or more HTTP transaction labels. The traffic analysis service causes performance of a network security function based on the augmented traffic data.
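As an illustration only, the two-stage flow summarized above might be sketched as follows. Everything here is an assumption and none of it comes from the patent: the names `TlsRecord`, `Connection`, `augment`, `toy_has_header`, and `toy_label` are hypothetical, and the two toy functions are hard-coded length rules standing in for trained machine learning classifiers. The sketch only shows the data flow: label each encrypted record as carrying HTTP header information or not, estimate transaction labels from the header-bearing records, and attach those labels to the captured data, all without decrypting any payload.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class TlsRecord:
    """Observable metadata of one encrypted TLS record (payload is never decrypted)."""
    length: int        # record length in bytes
    from_client: bool  # True if the record travels client -> server


@dataclass
class Connection:
    """Captured traffic data for one TLS connection, plus any estimated labels."""
    records: List[TlsRecord]
    labels: List[Dict[str, str]] = field(default_factory=list)


def augment(
    conn: Connection,
    has_header: Callable[[TlsRecord], bool],
    label_transaction: Callable[[TlsRecord], Dict[str, str]],
) -> Connection:
    """Run the two stages and attach the estimated labels to the connection."""
    # Stage 1: keep only records judged to carry HTTP header information.
    header_records = [r for r in conn.records if has_header(r)]
    # Stage 2: estimate HTTP transaction labels from those records.
    conn.labels = [label_transaction(r) for r in header_records]
    return conn


# Hypothetical stand-ins for the trained classifiers (assumed thresholds,
# chosen only so the example runs; a real system would use learned models).
def toy_has_header(record: TlsRecord) -> bool:
    return 100 <= record.length <= 1500


def toy_label(record: TlsRecord) -> Dict[str, str]:
    if record.from_client:
        return {"method": "GET" if record.length < 600 else "POST"}
    return {"status": "200"}


conn = augment(
    Connection([TlsRecord(420, True), TlsRecord(310, False), TlsRecord(16384, False)]),
    toy_has_header,
    toy_label,
)
print(conn.labels)
```

A downstream detector could then consume `conn.labels` alongside the other captured metadata when deciding whether a network security function should be triggered.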
Claims
What is claimed is:
1. A method comprising:
receiving, at a traffic analysis service, captured traffic data regarding a Transport Layer Security (TLS) connection between a client and a server;
applying, by the traffic analysis service, a first machine learning-based classifier to TLS records from the traffic data to label each of the TLS records as either a TLS record that includes Hypertext Transfer Protocol (HTTP) header information comprising one or more HTTP header fields or a TLS record that does not include the HTTP header information;
estimating, by the traffic analysis service, one or more HTTP transaction labels for the connection by applying a second machine learning-based classifier to the identified set of TLS records that include the HTTP header information;
augmenting, by the traffic analysis service, the captured traffic data with the one or more HTTP transaction labels; and
causing, by the traffic analysis service, performance of a network security function based on the augmented traffic data,
wherein the TLS records remain encrypted during classification by the first and second machine learning-based classifiers.
2. The method as in claim 1, wherein causing performance of the network security function comprises:
determining, by the traffic analysis service and based on the augmented traffic data, whether the TLS connection is malware-related, wherein the augmented traffic data comprises captured TLS metadata and data indicative of packet or record lengths from the connection.
3. The method as in claim 1, wherein the connection between the client and the server uses the Tor protocol.
4. The method as in claim 1, wherein the client executes a web browser to form the connection.
5. The method as in claim 1, wherein the one or more HTTP transaction labels is indicative of at least one of: an HTTP method, an HTTP content type, an HTTP status code, or a type associated with the server.
6. The method as in claim 1, wherein the one or more HTTP transaction labels is indicative of at least one of: a cookie, referer, upgrade-insecure-requests, via, set-cookie, origin, or etag header field.
7. The method as in claim 1, wherein estimating the one or more HTTP transaction labels for the connection comprises:
iteratively classifying transactions of the connection by classifying a particular one of the transactions based in part on classification results from one or more previously classified transactions of the connection.
8. The method as in claim 1, wherein the second machine learning-based classifier is a multi-class classifier.
9. An apparatus, comprising:
one or more network interfaces to communicate with a network;
a processor coupled to the network interfaces and configured to execute one or more processes; and
a memory configured to store a process executable by the processor, the process when executed configured to:
receive captured traffic data regarding a Transport Layer Security (TLS) connection between a client and a server;
apply a first machine learning-based classifier to TLS records from the traffic data to label each of the TLS records as either a TLS record that includes Hypertext Transfer Protocol (HTTP) header information comprising one or more HTTP header fields or a TLS record that does not include the HTTP header information;
estimate one or more HTTP transaction labels for the connection by applying a second machine learning-based classifier to the identified set of TLS records that include the HTTP header information;
augment the captured traffic data with the one or more HTTP transaction labels; and
cause performance of a network security function based on the augmented traffic data,
wherein the TLS records remain encrypted during classification by the first and second machine learning-based classifiers.
10. The apparatus as in claim 9, wherein the apparatus causes performance of the network security function by:
determining, based on the augmented traffic data, whether the TLS connection is malware-related, wherein the augmented traffic data comprises captured TLS metadata and data indicative of packet or record lengths from the connection.
11. The apparatus as in claim 9, wherein the connection between the client and the server uses the Tor protocol.
12. The apparatus as in claim 9, wherein the client executes a web browser to form the connection.
13. The apparatus as in claim 9, wherein the one or more HTTP transaction labels is indicative of at least one of: an HTTP method, an HTTP content type, an HTTP status code, or a type associated with the server.
14. The apparatus as in claim 9, wherein the one or more HTTP transaction labels is indicative of at least one of: a cookie, referer, upgrade-insecure-requests, via, set-cookie, origin, or etag header field.
15. The apparatus as in claim 9, wherein the apparatus estimates the one or more HTTP transaction labels for the connection by:
iteratively classifying transactions of the connection by classifying a particular one of the transactions based in part on classification results from one or more previously classified transactions of the connection.
16. The apparatus as in claim 9, wherein the second machine learning-based classifier is a multi-class classifier.
17. A tangible, non-transitory, computer-readable medium storing program instructions that cause a traffic analysis service to execute a process comprising:
receiving, at the traffic analysis service, captured traffic data regarding a Transport Layer Security (TLS) connection between a client and a server;
applying, by the traffic analysis service, a first machine learning-based classifier to TLS records from the traffic data to label each of the TLS records as either a TLS record that includes Hypertext Transfer Protocol (HTTP) header information comprising one or more HTTP header fields or a TLS record that does not include the HTTP header information;
estimating, by the traffic analysis service, one or more HTTP transaction labels for the connection by applying a second machine learning-based classifier to the identified set of TLS records that include the HTTP header information;
augmenting, by the traffic analysis service, the captured traffic data with the one or more HTTP transaction labels; and
causing, by the traffic analysis service, performance of a network security function based on the augmented traffic data,
wherein the TLS records remain encrypted during classification by the first and second machine learning-based classifiers.
18. The computer readable medium as in claim 17, wherein causing performance of the network security function comprises:
determining, by the traffic analysis service and based on the augmented traffic data, whether the TLS connection is malware-related, wherein the augmented traffic data comprises captured TLS metadata and data indicative of packet or record lengths from the connection.
19. A method for classifying an encrypted flow, comprising:
receiving, at a network device, a plurality of packets associated with a first encrypted flow traversing a network; collecting telemetry data from the plurality of packets associated with the first encrypted flow, wherein at least a portion of the telemetry data is collected from packets encrypted according to a cryptographic protocol, wherein the telemetry data includes cryptographic protocol data, and wherein the telemetry data further includes at least one of a source IP address, a destination IP address, a destination port, a start time of the first encrypted flow, a stop time of the first encrypted flow, a protocol associated with the first encrypted flow, a number of packets of the first encrypted flow, a number of bytes of the first encrypted flow, a sequence of TLS record lengths and types, a sequence of TCP packet lengths and flags, a number of bytes extracted from the first encrypted flow, a sequence of packet lengths and times from the first encrypted flow, a series of application lengths and times from the first encrypted flow, or byte distribution data; and sending the telemetry data to a traffic analysis process, wherein the traffic analysis process comprises: using the telemetry data in conjunction with a first machine learning classifier to make a first assessment whether the first encrypted flow is part of a set of traffic flows caused by malware in the network, using the first assessment as an input to a second machine learning process that generates a classification of the first encrypted flow, and using the classification of the first encrypted flow from the second machine learning process to modify a processing of the first encrypted flow.
20. The method of claim 19, wherein the traffic analysis process is performed at a second network device.
21. The method of claim 19, wherein the traffic analysis process receives the telemetry data from a plurality of network devices.
22. The method of claim 19, wherein using the telemetry data in conjunction with the first machine learning classifier to make a first assessment comprises using the cryptographic protocol data as input to the first machine learning classifier.
23. The method of claim 22, wherein using the cryptographic protocol data as input to the first machine learning classifier comprises using encryption certificate data as an input to the first machine learning classifier.
24. The method of claim 19, wherein the cryptographic protocol data includes encryption certificate data.
25. The method of claim 19, wherein using the telemetry data in conjunction with the first machine learning classifier to make a first assessment comprises using at least one of the source IP address, the destination IP address, or the destination port as an input to the first machine learning classifier.
26. The method of claim 19, wherein using the telemetry data in conjunction with the first machine learning classifier to make the first assessment comprises using at least one of the start time of the first encrypted flow, the stop time of the first encrypted flow, the protocol associated with the first encrypted flow, the number of packets of the first encrypted flow, or the number of bytes of the first encrypted flow as input to the first machine learning classifier.
27. The method of claim 19, wherein using the telemetry data in conjunction with a first machine learning classifier to make a first assessment comprises using at least one of the sequence of TLS record lengths and types, the sequence of TCP packet lengths and flags, or the number of bytes and packets as an input to the first machine learning classifier.
28. The method of claim 19, wherein using the first assessment as an input to a second machine learning process further comprises using a plurality of telemetry data from a plurality of flows as inputs to the second machine learning process.
29. The method of claim 28, wherein using the plurality of telemetry data decreases a false positive rate of the second machine learning process.
30. A method for creating and operating a multi-step classifier for an encrypted flow, the method comprising:
collecting telemetry data from a first plurality of encrypted flows using a plurality of traffic capture services located on network paths between clients and servers, the first plurality of encrypted flows including both malicious and benign flows, wherein at least a portion of the telemetry data is collected from packets encrypted according to a cryptographic protocol, wherein the telemetry data includes cryptographic protocol data, and wherein the telemetry data further includes at least one of a source IP address, a destination IP address, a destination port, a start time of any of the first plurality of encrypted flows, a stop time of any of the first plurality of encrypted flows, a protocol associated with any of the first plurality of encrypted flows, a number of packets of any of the first plurality of encrypted flows, a number of bytes of any of the first plurality of encrypted flows, a sequence of TLS record lengths and types, a sequence of TCP packet lengths and flags, a number of bytes extracted from any of the first plurality of encrypted flows, a sequence of packet lengths and times from any of the first plurality of encrypted flows, a series of application lengths and times from any of the first plurality of encrypted flows, or byte distribution data; using the telemetry data to create a first trained machine learning classifier to generate a classification of features present in the first plurality of encrypted flows; using the telemetry data to create a second trained machine learning classifier to classify flows as malicious or benign, based at least in part on the classification of features from the first trained machine learning classifier; receiving second telemetry data from an unclassified encrypted flow; using the first trained machine learning classifier to make an inference regarding the unclassified encrypted flow; and using the second trained machine learning classifier to classify the unclassified encrypted flow as malicious or benign 
using the second telemetry data and the inference made by the first trained machine learning classifier.
31. The method of claim 30, wherein use of a plurality of telemetry data decreases a false positive rate of the second trained machine learning classifier.
32. The method of claim 30, wherein the classification of the unclassified encrypted flow is based in part on pattern changes in flows other than the unclassified encrypted flow.