Method for HTTP-based access point fingerprint and classification using machine learning

Patent Number

US11399288B2

Publication Date

2022-07-26

Applicant

SAMSUNG ELETRÔNICA DA AMAZÔNIA LTDA. (BR Campinas)

Inventor

Igor Jochem Sanz

IPC Classification

H04W12/122; G06N20/20; H04L9/40; H04W12/60; H04W12/79

Technical Field

http,ap,packet,html,captive,header,server,malicious,phishing,portal

Region: São Paulo

Abstract

A method for HyperText Transfer Protocol (HTTP) based fingerprinting and classification. The method includes training an HTTP-based machine-learning model using machine-learning training techniques and a historical dataset of labelled Access Point HTTP service response features. The method is useful for detecting benign or malicious classes, assessing potential trustworthiness, and detecting any type of bad behavior of an HTTP server, as well as any other threats that modify or implement an AP HTTP server or webpage. The method takes advantage of the captive portal detection packet exchange between a station and an Access Point (AP) to passively classify the AP.

Description


CROSS-REFERENCE TO RELATED APPLICATION(S)

This application is based on and claims priority under 35 U.S.C. § 119 to Brazilian Patent Application No. 10 2020 003104 0, filed on Feb. 13, 2020, in the Brazilian Intellectual Property Office, the disclosure of which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

The present invention relates to the field of wireless communications. More specifically, it describes a technique that uses machine learning to passively fingerprint and classify an Access Point (AP) based on HTTP information at the moment a device connects to the AP.

BACKGROUND OF THE INVENTION

Wireless communication has become indispensable in smart electronic devices since the advent of the IEEE 802.11 protocol family. Most wireless devices, such as smartphones, laptops, IoT devices, and Smart TVs, are designed to stay connected to the Internet most of the time for better functionality, and rely on web-based APIs (application programming interfaces) to improve user-based or service-based communications. Consequently, device users often search for public wireless Access Points to obtain Internet access for their devices, especially when travelling in an unfamiliar environment. Advances in device technology have made it easy to set up an Access Point through software-based solutions, such as Hotspot or Wi-Fi Direct, or by using a complementary Access Point antenna device (e.g., a USB Wi-Fi dongle).

Claims

What is claimed is:

1. A method for HTTP service fingerprint and classification using machine learning, the method comprising:

training an HTTP-based machine-learning model, using machine-learning training techniques and a historical dataset of labelled Access Point HTTP service response features collected by a feature extractor module;

generating the HTTP-based machine-learning model to perform classification of HTTP services;

collecting, by a collector module, HTTP service response packets from multiple HTTP servers having known classification labels;

extracting, by the feature extraction module, features from the collected HTTP service response packets;

labelling the extracted features from the HTTP service response packets, using labels from a set of classes defined according to a classification objective of the HTTP-based machine-learning model;

selecting a set of features from the labelled HTTP service response packet features, using feature selection techniques, that are the best suitable features to be used in the HTTP-based machine-learning model; and

classifying to perform classification of HTTP services by:

using the HTTP-based machine-learning model trained with labelled samples of the selected set of features from the labelled HTTP service response packet features,

selectively applying a second machine learning model that collects data present inside an HTTP body of the HTTP service response packets, and

selectively applying a third machine learning model that extracts human readable text information of a page rendered by a user browser.
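The staged classification at the end of claim 1 can be sketched as a cascade. This is a minimal illustration, not the claimed implementation: the claim does not say when the second and third models are "selectively" applied, so the confidence threshold used as the trigger here is an assumption, and `classify_cascade`, the `Model` interface, and the three toy models are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, List, Optional, Tuple

Features = Dict[str, float]
# Hypothetical model interface: features in, (label, confidence in [0, 1]) out.
Model = Callable[[Features], Tuple[str, float]]
Stage = Tuple[str, Model, Optional[Features]]


def classify_cascade(stages: List[Stage], threshold: float = 0.8) -> Tuple[str, float, str]:
    """Run stages in order and stop at the first sufficiently confident result.

    Returns (label, confidence, stage_name) of the most confident stage that ran.
    """
    best = ("unknown", 0.0, "none")
    for name, model, features in stages:
        if features is None:  # nothing was extracted for this stage
            continue
        label, confidence = model(features)
        if confidence > best[1]:
            best = (label, confidence, name)
        if confidence >= threshold:  # assumed trigger: later models run only when unsure
            break
    return best


# Toy stand-ins for trained models (illustration only, not real classifiers).
def header_model(f: Features) -> Tuple[str, float]:
    return ("malicious", 0.95) if f.get("unknown_field_count", 0) > 3 else ("benign", 0.6)


def body_model(f: Features) -> Tuple[str, float]:
    return ("malicious", 0.9) if f.get("password_inputs", 0) > 0 else ("benign", 0.7)


def text_model(f: Features) -> Tuple[str, float]:
    return ("malicious", 0.85) if f.get("category:credential", 0) > 2 else ("benign", 0.85)
```

With this arrangement the header model alone decides the clear cases, and the HTTP body and rendered-text models are consulted only when the earlier stage is not confident enough.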

2. The method of claim 1, wherein the collecting the HTTP response packets comprises:

passively sniffing a network to obtain the HTTP service redirection response packets; and

sending the obtained HTTP service redirection response packets to the feature extractor module.

3. The method of claim 1, wherein the collecting the HTTP redirection response packets comprises:

actively sending an HTTP request to the network gateway;

receiving the HTTP response packets from the network gateway HTTP server; and

sending the received HTTP response packets to the feature extractor module.

4. The method of claim 1, wherein the collecting the HTTP redirection response packets comprises:

actively sending an HTTP request to a known HTTP server that has a known HTTP response behavior;

receiving the HTTP response packets from the HTTP server;

comparing the known HTTP server response behavior with the received HTTP server response packets to verify if it consists of a redirection response type; and

based on the comparing, sending the received HTTP server response packets that present different behavior from the known HTTP response behavior, which implies that the received HTTP response is an HTTP response of redirection type, to the feature extractor module.
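The comparison step of claim 4 can be illustrated with a small check. This is a sketch under stated assumptions: the probe server is assumed to answer with status 204 and an empty body, and `HttpResponse`, `EXPECTED`, and `is_redirection` are hypothetical names that do not come from the patent.

```python
from dataclasses import dataclass, field
from typing import Dict

REDIRECT_STATUSES = {301, 302, 303, 307, 308}


@dataclass
class HttpResponse:
    status: int
    headers: Dict[str, str] = field(default_factory=dict)
    body: bytes = b""


# Assumed known behavior of the probe server: 204 No Content, empty body.
EXPECTED = HttpResponse(status=204)


def is_redirection(received: HttpResponse, expected: HttpResponse = EXPECTED) -> bool:
    """Treat any deviation from the known response as a redirection-type response."""
    if received.status in REDIRECT_STATUSES:
        return True  # explicit HTTP redirect
    if received.status != expected.status:
        return True  # e.g. a 200 page injected in place of the expected 204
    return received.body != expected.body  # same status, but rewritten content
```

Only responses for which `is_redirection` returns `True` would be forwarded to the feature extractor; responses that match the known behavior indicate that no portal intercepted the request.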

5. The method of claim 1, wherein the extracting the features from the collected HTTP redirect response packets comprises:

extracting the features from a header of the HTTP redirect response packets;

extracting the features from a body of the HTTP redirect response packets; and

extracting the features from text data contained in a body of HTTP redirect response packets.

6. The method of claim 5, wherein the extracting the features from the collected HTTP redirect responses further comprises:

separating the extracted features into sets determined by a feature selection process, to be used in the HTTP-based machine-learning model to receive a specific set of features from the HTTP redirect response packets.
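One way to picture the selection and separation of claim 6 is shown below. The claims do not name a feature selection technique, so a simple difference-of-class-means score stands in for it here. The `header.`, `body.`, and `text.` name prefixes, the two class labels, and both function names are assumptions made for this sketch.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def score_features(samples: List[Dict[str, float]], labels: List[str]) -> Dict[str, float]:
    """Score each feature by |mean over malicious - mean over benign|.

    Both classes must be present in `labels`, otherwise mean() raises.
    """
    names = sorted({name for sample in samples for name in sample})
    scores = {}
    for name in names:
        pos = [s.get(name, 0.0) for s, y in zip(samples, labels) if y == "malicious"]
        neg = [s.get(name, 0.0) for s, y in zip(samples, labels) if y == "benign"]
        scores[name] = abs(mean(pos) - mean(neg))
    return scores


def select_and_group(samples: List[Dict[str, float]], labels: List[str], k: int) -> Dict[str, List[str]]:
    """Keep the k best-scoring features and split them into sets by source prefix."""
    scores = score_features(samples, labels)
    top = sorted(scores, key=lambda n: (-scores[n], n))[:k]
    groups = defaultdict(list)
    for name in top:
        groups[name.split(".", 1)[0]].append(name)  # "header", "body", or "text"
    return dict(groups)
```

Each resulting set could then feed the model that consumes that part of the response (header, body, or rendered text).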

7. The method of claim 1, wherein the classifying comprises:

classifying, using a plurality of machine-learning models trained with labelled samples of HTTP response, the selected set of features from the HTTP response packet features; and

combining results of the plurality of machine-learning models trained with labeled samples of HTTP responses, using a model ensemble technique to obtain a final classification result to be used in external solutions to classify HTTP servers.
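Claim 7 leaves the model ensemble technique open. As one possible instance, the sketch below uses weighted soft voting, which averages the class probabilities reported by each model. The function name `soft_vote` and its input format are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def soft_vote(
    predictions: List[Dict[str, float]],
    weights: Optional[List[float]] = None,
) -> Tuple[str, Dict[str, float]]:
    """Combine per-model class probabilities by weighted averaging.

    predictions: one {label: probability} dict per model.
    Returns the winning label and the combined probabilities.
    """
    if not predictions:
        raise ValueError("at least one model prediction is required")
    if weights is None:
        weights = [1.0] * len(predictions)
    if len(weights) != len(predictions):
        raise ValueError("one weight per model is required")

    totals = defaultdict(float)
    for probs, weight in zip(predictions, weights):
        for label, p in probs.items():
            totals[label] += weight * p

    norm = sum(weights)
    combined = {label: total / norm for label, total in totals.items()}
    # Iterate labels in sorted order so ties resolve deterministically.
    best = max(sorted(combined), key=combined.get)
    return best, combined
```

Other ensemble techniques, such as hard majority voting or stacking, would fit the same claim language; soft voting is shown only because it is short.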

8. The method of claim 1, wherein the classification model trained with labelled samples of HTTP responses is represented as a binary file, object file, parameter values in case of parametric models, weights, text description, or any type or combination of data files that entirely represent a machine-learning model.

9. The method of claim 1, further comprising recognizing an HTTP server by applying one or multiple machine-learning models, previously trained with labelled samples of HTTP responses, with recognition purposes.

10. The method of claim 8, wherein the recognition purposes include at least one of:

identifying whether the HTTP Server is a known HTTP server or service;

identifying whether the HTTP Server belongs to a known class of HTTP server or services.

11. The method of claim 1, further comprising identifying characteristics of the HTTP server by applying one or multiple machine-learning models with purposes to identify the characteristics based on the HTTP server response packets.

12. The method of claim 11, wherein the identifying the characteristics includes at least one of:

identifying a type of network infrastructure facility;

identifying network properties of the communication link; and

identifying vulnerabilities that may be presented in the network.

13. The method of claim 12, wherein the HTTP server is classified between malicious and benign classes, by applying one or multiple machine-learning models with purpose of detecting HTTP servers.

14. The method of claim 13, wherein the HTTP server is labelled as benign or malicious according to suspicious activities, including at least one of:

a specific type of attack that an HTTP server may perform against a user;

a known malicious reputation that a type of HTTP server may have;

a software implementation of HTTP which is known to be used for penetration testing purposes; and

any type of bad behavior an HTTP server may have that is considered for non-legitimate purposes.

15. The method of claim 1, wherein the extracted features includes using information from the displayed text visible in user screen, which is rendered by the user browser using the data of the HTTP response content, such as HTML data, to be translated into a machine-learning model feature vector.

16. The method of claim 15, wherein the displayed text visible in user screen includes text data that is displayed based on HTTP content of a last HTTP response from the HTTP server.

17. The method of claim 16, further comprising defining a word category as a label for the samples of HTTP server response, using a set of words, from the displayed text visible in user screen, that shares a common property, meaning, or relationship including semantical, syntactical, morphological, or grammatical.

18. The method of claim 17, wherein the features extracted from the displayed text visible in user screen includes at least one of:

counting a number of times a specific word appears in the displayed text visible in user screen;

counting a number of words per word category;

counting a number of times a specific group of words appears together in the displayed text visible in user screen;

counting a number of times a specific group of words appears together in a specific order;

counting a number of times a specific group of words appears in sequence together in a specific order;

binary features representing an existence of specific words, word categories, word groups, or word sequences; and

any other combination of word, word category, or word groups.
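Several of the text features listed in claim 18 can be computed with a few lines of counting code. The sketch below is illustrative only: the word categories in `CATEGORIES`, the tokenizer, and the feature naming scheme are assumptions, not part of the patent.

```python
import re
from collections import Counter
from typing import Dict, Iterable, List, Sequence

# Hypothetical word categories (claim 17 defines a category as words sharing a property).
CATEGORIES = {
    "credential": {"password", "login", "username"},
    "payment": {"card", "cvv", "billing"},
}


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def text_features(
    text: str,
    words: Iterable[str],
    sequences: Iterable[Sequence[str]],
) -> Dict[str, int]:
    tokens = tokenize(text)
    counts = Counter(tokens)
    features = {}
    for word in words:
        features[f"count:{word}"] = counts[word]        # times a specific word appears
        features[f"has:{word}"] = int(counts[word] > 0)  # binary existence feature
    for category, vocabulary in CATEGORIES.items():
        features[f"category:{category}"] = sum(counts[w] for w in vocabulary)
    for seq in sequences:
        n = len(seq)
        hits = sum(
            1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == tuple(seq)
        )
        features["seq:" + " ".join(seq)] = hits          # words in sequence, in order
    return features
```

The resulting dictionary maps directly onto a numeric feature vector once a fixed ordering of the keys is chosen.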

19. The method of claim 1, wherein the features extracted from the collected HTTP server response packets include at least one of:

a presence of specific HTML tags;

a count of the number of HTML tags;

a count of the number of HTML tags inside an HTML tag context;

a count of the number of HTML tags with specific attributes;

a count of the number of a specific HTML tag, in which a specific attribute value matches a specific string;

a count of the number of a specific HTML tag, in which a specific attribute value contains a specific string or character;

a count of the number of a specific HTML tag, in which a specific attribute starts with or ends with a specific string or characters;

a count of the number of times a specific string, character, or sequence of characters appears in a specific attribute value;

a count of the number of specific strings of the displayed text of the HTML data;

a count of the number of times a specific tag has a null or invalid value for a specific attribute;

whether a specific HTML tag exists;

whether a specific HTML tag with a specific attribute exists;

whether a specific HTML tag with a specific attribute and a specific attribute value exists;

a count of the number of HTML comment blocks;

a count of the number of times a specific tag has a valid value for a specific attribute;

a count of the number of times a specific media file is loaded;

a count of the number of times external content is loaded;

a count of the number of patterns in script source;

a count of the number of times that a specific tag has a specific attribute with value corresponding to a specific file extension;

whether page redirection instructions exist in the data;

a count of the number of times a page redirect instruction occurs;

a count of the number of words present inside a specific HTML tag context;

a count of the number of times a specific pattern that indicates the presence of a specific element in a page occurs;

whether a specific tag is in upper case;

any other feature that represents a property, the existence of a pattern, or the number of times a pattern occurs in the entire, or part of, the HTTP content, in which the property can be translated into a numeric value; and

any of the aforementioned features but restricted to a specific HTML tag context instead of the entire HTML data.
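A handful of the HTML features in claim 19 can be extracted with Python's standard `html.parser` module, as sketched below. The particular features chosen (password inputs, externally loaded `src` content, `meta` refresh redirects, empty links) and all class, function, and key names are illustrative assumptions.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import Dict


class HtmlFeatureParser(HTMLParser):
    """Collects simple tag, attribute, and comment statistics from HTML."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.tag_counts = Counter()
        self.comment_count = 0
        self.password_inputs = 0
        self.external_loads = 0
        self.meta_refresh = 0
        self.empty_links = 0

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names; values may be None.
        attributes = dict(attrs)
        self.tag_counts[tag] += 1
        if tag == "input" and (attributes.get("type") or "").lower() == "password":
            self.password_inputs += 1  # tag with a specific attribute value
        src = attributes.get("src") or ""
        if src.startswith(("http://", "https://", "//")):
            self.external_loads += 1  # external content is loaded
        if tag == "meta" and (attributes.get("http-equiv") or "").lower() == "refresh":
            self.meta_refresh += 1  # page redirection instruction
        if tag == "a" and (attributes.get("href") or "").strip() in ("", "#"):
            self.empty_links += 1  # missing, empty, or placeholder attribute value

    def handle_comment(self, data):
        self.comment_count += 1


def html_features(html: str) -> Dict[str, int]:
    parser = HtmlFeatureParser()
    parser.feed(html)
    parser.close()
    return {
        "total_tags": sum(parser.tag_counts.values()),
        "form_tags": parser.tag_counts["form"],
        "has_script": int(parser.tag_counts["script"] > 0),
        "comment_blocks": parser.comment_count,
        "password_inputs": parser.password_inputs,
        "external_loads": parser.external_loads,
        "meta_refresh": parser.meta_refresh,
        "empty_links": parser.empty_links,
    }
```

Restricting the same counters to a specific tag context, as the last item of the claim describes, would require tracking the open-tag stack in `handle_starttag` and `handle_endtag`; that is omitted here for brevity.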

20. The method of claim 1, wherein the features extracted from the collected HTTP server response packets include at least one of:

a total number of header fields;

a total size of HTTP header;

a total size of HTTP content;

a binary feature representing the presence of specific header fields;

an order between two or more header fields when present;

a binary feature indicating whether header field names are lower or upper case;

a binary feature representing the exact match of a header field value with a known value;

a presence of specific strings in a header field value;

a presence of specific characters in a header field value;

a count of specific strings in a header field value;

a count of specific characters in a header field value;

a length of a header field value;

a numeric value of a header field value;

a number of words present in a header field value; and

a number of header fields that are unknown by the feature extraction module.
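The header features of claim 20 depend on field order and letter case, so they are easiest to compute from the raw header block rather than from a parsed, normalized mapping. The sketch below assumes the header block is available as a string with CRLF line endings; the `KNOWN_FIELDS` set, the choice of the `Server` and `Date` fields, and the feature names are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical set of field names the extractor "knows" (lowercase).
KNOWN_FIELDS = {
    "server", "date", "content-type", "content-length",
    "location", "connection", "cache-control",
}


def header_features(raw_head: str) -> Dict[str, int]:
    """Compute simple numeric features from a raw HTTP response header block."""
    lines = raw_head.split("\r\n")
    fields: List[Tuple[str, str]] = []
    for line in lines[1:]:  # skip the status line
        name, sep, value = line.partition(":")
        if sep:  # ignore blank or malformed lines
            fields.append((name, value.strip()))

    names = [name.lower() for name, _ in fields]
    values = {name.lower(): value for name, value in fields}

    def position(name: str) -> int:
        return names.index(name) if name in names else -1

    server = values.get("server", "")
    content_length = values.get("content-length", "")
    return {
        "field_count": len(fields),
        "header_size": len(raw_head.encode("utf-8")),
        "has_server": int("server" in values),
        "has_location": int("location" in values),
        "server_before_date": int(0 <= position("server") < position("date")),
        "all_names_lowercase": int(all(name == name.lower() for name, _ in fields)),
        "server_length": len(server),
        "server_word_count": len(server.split()),
        "server_slash_count": server.count("/"),
        "content_length_value": int(content_length) if content_length.isdigit() else -1,
        "unknown_field_count": sum(1 for name in names if name not in KNOWN_FIELDS),
    }
```

Features such as field order and name casing tend to differ between HTTP server implementations, which is why they can serve as a fingerprint even when the page content looks identical.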
