Encrypted Traffic Detection: Eliminating Blind Spots in Network Security Defense
1 Background
As enterprises digitalize, they rely more and more on the Internet, but transmitting important data and personal information over the Internet in plaintext creates security risks. Encryption is therefore necessary to protect data security and privacy. End-to-end encryption prevents a man in the middle from stealing key data such as credit card numbers and passwords. SSL and TLS are the commonly used encryption protocols, and after several protocol revisions TLS 1.2 serves as the mainstream version. NSS Labs predicts that 75% of web traffic will be encrypted in 2019. Gartner predicts that more than 80% of enterprise network traffic will be encrypted in 2019.
However, encryption is a double-edged sword: it protects privacy, but it also gives attackers an opportunity. Viruses, Trojan horses, hacker control commands, and stolen data can be hidden in encrypted traffic, where network security products cannot detect them.
To protect the information security of enterprise employees and prevent malware from entering enterprise networks through encrypted traffic, proxy decryption can be used. However, proxy decryption is in effect a man-in-the-middle (MITM) attack, and it inevitably violates user privacy. For example, when a user sends encrypted information such as a credit card number and a password to a bank, proxy decryption breaks the encryption trust chain and exposes that information. In addition, proxy decryption cannot guarantee smooth communication: it does not work for websites that strictly verify certificates, and it cannot decrypt traffic from malware that uses non-standard SSL/TLS encryption.
This document analyzes existing encrypted traffic and malicious samples, and introduces a detection mode that does not involve TLS decryption.
2 Encrypted Traffic Analysis
2.1 Detection Scenario Example
First, let's take an example of how malicious encrypted communication arises. The following figure shows a scenario where a host is infected and then uses encrypted traffic for malicious communication. The figure is from https://www.malware-traffic-analysis.net/, which periodically releases analyses of malicious traffic. The victim downloads a malicious file (WORD DOC) attached to a phishing email (MALSPAM). The malicious Word document triggers the download of malware (EMOTET and FOLLOW-UP MALWARE), which then communicates with a malicious server.
The following figure shows the traffic analysis. A Word document that contains malicious code is downloaded through malicious HTTP activity. When the Word document is opened, the Emotet malware is downloaded. Executing Emotet then triggers the download of the Zeus Panda Banker malware, which generates encrypted traffic when it runs.
2.2 TLS Negotiation Process
To detect malicious TLS flows, you need to understand the TLS negotiation process, because the parameters exchanged during negotiation are the main characteristics of TLS-encrypted traffic. While a TLS connection is being established, several packets are sent unencrypted. The encryption mode is negotiated first, starting with a handshake packet: the client (which may be a browser or malware) sends a ClientHello message to the server. The ClientHello message contains a set of parameters, such as the supported cipher suites, acceptable protocol versions, and optional extensions. After receiving the request, the server responds with a ServerHello message, which confirms the protocol version and encryption method, and with its server certificate. After receiving the response from the server, the client verifies the server certificate.
If the certificate is not issued by a trusted authority, if the domain name in the certificate differs from the actual domain name, or if the certificate has expired, a warning is displayed asking the user whether to continue the communication. If the certificate is valid, the client obtains the server's public key from it. The client then sends a random number encrypted with that public key, a cipher change notification (ChangeCipherSpec), and a client handshake end notification (Finished) to the server. The server replies with its own cipher change notification and handshake end notification. The handshake then ends, and the client and server switch to encrypted communication.
2.3 Analysis of Network Behavior Features
According to the TLS protocol, everything other than the TLS handshake information is encrypted. Features can therefore be extracted only from the TLS handshake and its context. The extracted features fall into two types. The first type is collected from the TLS flow itself, and includes the parameters of the TLS negotiation as well as statistics on the packet lengths and timing of the TCP/IP flow. The second type is extracted from the DNS and HTTP traffic that surrounds the SSL/TLS flow.
2.3.1 TLS Flow Features
The handshake is not encrypted, and its packets are transmitted in plain text. The certificate information and the encryption method selected by both parties can therefore be extracted from it. The unencrypted metadata in the TLS handshake (ClientHello and ServerHello) carries a fingerprint that an attacker cannot hide, so we extract a variety of features from the SSL/TLS negotiation process to train algorithms. The handshake information is divided into two parts: client fingerprint information and server certificate information.
For the client fingerprint information, analyze the cipher suites offered by the client during the TLS handshake.
The following figure shows the distribution of cipher suites offered by clients during the TLS handshake. Each client normally offers several cipher suites, and statistics are collected for each suite separately. Some suites account for a large proportion of black (malicious) samples, while others account for a large proportion of white (benign) samples. OWASP recommends ECDH/DH AES cipher suites, whereas cipher suites based on MD5 and RC4 are considered weak and insecure. The suites that are common in black samples are not recommended by OWASP. Presumably, malware offers them to remain compatible with as many servers as possible.
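The observation above suggests one simple client-side feature: the share of offered cipher suites that rely on MD5 or RC4. The following is a minimal sketch, not the method used in this document. The function names and the marker list are hypothetical, and it assumes the suites have already been parsed from the ClientHello into IANA-style names.

```python
# Hypothetical marker list: it only reflects the MD5/RC4 remark in the text.
WEAK_MARKERS = ("MD5", "RC4")


def is_weak(suite_name):
    """Return True if an IANA-style cipher suite name contains MD5 or RC4."""
    parts = suite_name.upper().split("_")
    return any(marker in parts for marker in WEAK_MARKERS)


def weak_suite_ratio(offered):
    """Fraction of the offered cipher suites that are flagged as weak."""
    if not offered:
        return 0.0
    return sum(is_weak(name) for name in offered) / len(offered)


# Example ClientHello offer list (illustrative, not taken from a real sample).
offered = [
    "TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256",
    "TLS_RSA_WITH_RC4_128_MD5",
    "TLS_RSA_WITH_RC4_128_SHA",
    "TLS_DHE_RSA_WITH_AES_256_CBC_SHA",
]
print(weak_suite_ratio(offered))  # 0.5
```

A ratio like this would be only one input among many. A high value alone does not prove that a client is malicious, because legacy software may also offer weak suites.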
Figure 2-1 Proportion of some encryption suites in black and white samples
For the server certificate information, compare the certificate fields of black and white samples. As shown in the preceding figures, most subject names in white sample certificates belong to formal websites, and the certificate information is specific and detailed, while the fields in a black sample certificate are usually informal and brief. Moreover, many black sample certificates are self-signed (the issuer and the subject are the same).
2.3.2 Context Features Such as DNS
The traffic surrounding a malicious encrypted flow may itself show malicious behavior, which can assist in detection and evidence collection. Some malicious samples give themselves away in this context, for example in their DNS information. In this sample, queries for domain names produced by a domain generation algorithm (DGA) return a large number of nonexistent-domain results, and HTTP requests are then sent to the domain names that do resolve. Features of the DNS and HTTP traffic can therefore assist in encrypted traffic detection.
3 Machine Learning Detection
Encrypted traffic detection is performed in bypass mode. Supervised machine learning identifies malicious file communication in encrypted traffic. The resulting supervised model is deployed on the CIS analyzer to analyze and detect traffic on the live network. The following figure shows the overall detection process.
The machine learning models are trained on a large number of samples in the cloud back end. A sandbox runs malicious files to generate traffic samples. The generated traffic and live-network traffic are then labeled to separate malicious traffic samples from normal traffic samples. The researchers studied the differences between malicious and normal traffic on SSL/TLS connections. By considering the DNS and HTTP traffic that accompanies the SSL/TLS traffic, they summarized the features of malware traffic. These extracted features can be used to establish machine learning models.
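As a rough illustration of how such features might be combined into one numeric input for a model, the sketch below flattens a single flow record into a feature vector. The record layout, the field names, and the choice of five features are assumptions made for this example only. They are not the feature set used by the analyzer. Treating a certificate as self-signed when the issuer equals the subject follows the simplification used in section 2.3.1.

```python
def build_feature_vector(flow):
    """Flatten one flow record (a hypothetical dict layout) into numeric features."""
    cert = flow["certificate"]
    dns = flow["dns"]
    queries = dns["queries"]
    # Share of DNS queries answered with "nonexistent domain" (a DGA hint).
    nx_ratio = dns["nxdomain"] / queries if queries else 0.0
    return [
        float(len(flow["cipher_suites"])),                   # suites offered in ClientHello
        float(len(flow["extensions"])),                      # ClientHello extensions
        1.0 if cert["issuer"] == cert["subject"] else 0.0,   # self-signed flag
        float(len(cert["subject"])),                         # brief subjects are suspicious
        nx_ratio,
    ]


# Illustrative record; none of these values come from a real capture.
example_flow = {
    "cipher_suites": ["TLS_RSA_WITH_RC4_128_MD5", "TLS_RSA_WITH_RC4_128_SHA"],
    "extensions": [],
    "certificate": {"issuer": "CN=abc", "subject": "CN=abc"},
    "dns": {"queries": 20, "nxdomain": 15},
}
print(build_feature_vector(example_flow))  # [2.0, 0.0, 1.0, 6.0, 0.75]
```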
After a machine learning model is delivered to the analyzer deployed in the user environment, the switch and network probe extract the traffic features of the live network and send them to the analyzer. The analyzer checks whether each TLS flow is malicious.
Figure 3-1 Establishing a model for encrypted traffic detection using the machine learning algorithm
The traffic generated by malicious samples must be examined before it is used for training. In addition to the white background traffic generated by the VMs, some samples connect to whitelisted websites, for example, search engines or portals such as google.com, yandex.ru, and ukr.net. Ransomware often connects to payment websites, such as Bitcoin payment sites. Adware often connects to advertising and shopping websites. In addition, many black samples communicate with servers hosted on cloud services, and about 10% of these servers are malicious according to threat intelligence. After the samples are analyzed and identified, cleaner black and white sample sets are obtained.
The random forest algorithm is used to model the extracted features. Random forest greatly reduces the variance of a single decision tree and makes the importance of each feature and the decision process easy to inspect. First, bootstrapping (random sampling with replacement) is used to draw training samples from the training set, and the draw is repeated for k rounds. Then, the k sample sets are used to train k decision tree models. When a tree is constructed, the best split indicator (for example, information gain) is not searched over all features. Instead, a random subset of the features is drawn and only that subset is evaluated. Finally, the probability that a sample is malicious is determined by voting among the trees. After the model is trained, the importance of each group of features is obtained. The features that contribute most to the decision are those of the TLS negotiation process and the TCP packets.
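The three training steps described above (bootstrap sampling, random feature subsets, and voting) can be sketched in plain Python as follows. This is a deliberately reduced illustration, not the model deployed on the analyzer: each tree is limited to a single split (a decision stump), the toy data and all function names are made up for this example, and labels are 1 for black samples and 0 for white samples.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def majority(labels, default=0):
    """Most common label, or the default if the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def train_stump(X, y, feature_ids):
    """Find the split with the lowest weighted child entropy, which is the
    split with the highest information gain, among the given features only."""
    fallback = majority(y)
    best = None
    for f in feature_ids:
        for t in sorted({row[f] for row in X}):
            left = [lab for row, lab in zip(X, y) if row[f] <= t]
            right = [lab for row, lab in zip(X, y) if row[f] > t]
            score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t, majority(left, fallback), majority(right, fallback))
    return best[1:]  # (feature, threshold, left_label, right_label)


def train_forest(X, y, k=25, n_features=2, seed=0):
    """Train k stumps, each on a bootstrap sample and a random feature subset."""
    rng = random.Random(seed)
    forest = []
    for _ in range(k):
        idx = [rng.randrange(len(X)) for _ in X]           # sampling with replacement
        feats = rng.sample(range(len(X[0])), n_features)   # random feature subset
        forest.append(train_stump([X[i] for i in idx], [y[i] for i in idx], feats))
    return forest


def malicious_probability(forest, row):
    """Fraction of trees that vote 'malicious' (label 1) for this row."""
    votes = [left if row[f] <= t else right for f, t, left, right in forest]
    return sum(votes) / len(votes)


# Toy data: [weak-suite ratio, self-signed flag, NXDOMAIN ratio]; made-up values.
X = [[0.0, 0.0, 0.10], [0.1, 0.0, 0.00], [0.0, 0.0, 0.05], [0.2, 0.0, 0.10],
     [0.6, 1.0, 0.70], [0.5, 1.0, 0.90], [0.8, 1.0, 0.60], [0.7, 1.0, 0.80]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

forest = train_forest(X, y)
print(malicious_probability(forest, [0.9, 1.0, 0.9]))
print(malicious_probability(forest, [0.0, 0.0, 0.0]))
```

A production model would grow full trees and be trained on far more samples. The stump keeps the sketch short while still showing where the randomness enters and how the vote yields a probability.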
We performed cross-validation on the trained model and tested it with additional white and black samples. The experimental results are promising. Malware that uses a non-standard SSL/TLS protocol may adopt a user-defined encryption mode. In this case, the packet length and packet interval statistics of the TCP/IP flow can still be used to train the machine learning model and achieve high detection accuracy. The existing method predicts on one flow at a time. If information from multiple flows over a period of time is integrated, the connection behavior and the likely type of malware may become visible, which could improve the accuracy and coverage of the algorithm.
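The packet length and interval statistics mentioned above do not depend on parsing the TLS handshake, so they remain available for non-standard encryption. The sketch below shows one way such per-flow statistics might be computed. The input format (a list of timestamp and payload length pairs), the function name, and the selected statistics are assumptions for illustration.

```python
import statistics


def flow_statistics(packets):
    """Summarize one flow.

    packets: non-empty list of (timestamp_seconds, payload_length_bytes)
    tuples in arrival order (a hypothetical input format).
    """
    lengths = [length for _, length in packets]
    times = [ts for ts, _ in packets]
    # Inter-arrival gaps between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return {
        "pkt_count": len(packets),
        "len_mean": statistics.fmean(lengths),
        "len_std": statistics.pstdev(lengths),
        "len_max": max(lengths),
        "gap_mean": statistics.fmean(gaps) if gaps else 0.0,
        "gap_max": max(gaps) if gaps else 0.0,
    }


# Illustrative flow of four packets; values are made up.
packets = [(0.0, 100), (0.5, 300), (1.0, 100), (2.0, 300)]
stats = flow_statistics(packets)
print(stats)
```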