A couple of years ago, like many other site owners in the RuNet, I saw a sharp increase in visitors from social networks. At first this was pleasing, until I studied the behavior of these “users” in detail: they turned out to be bots. Worse, they badly skewed the behavioral factors that are critical for good ranking in Yandex, and to some extent in Google.
While studying Telegram channels devoted to manipulating behavioral factors (which is what most of these bots are used for), I reasoned that the bot developers must slip up somewhere: there must be places where they cannot fully emulate the parameters of a real browser or the behavior of a real visitor. From these two hypotheses came the idea of building two neural networks. The first would detect a bot from numerous browser parameters; the second from behavior on the site: how the user scrolls, clicks, and performs other actions.
The first thing you need to train a neural network is a sufficient number of training examples: visits known for certain to be bots and visits known for certain to be real people. Three signals were used for this selection:
reCAPTCHA v3 score.
Whether the visitor is logged into Google, Yandex, or VK.
Uniqueness of the canvas hash.
The essence of the last signal is that an image is drawn on an HTML “canvas” element, and then the MD5 hash of that image is computed. Depending on the operating system and its version, the browser version, and the device, the image differs slightly, so the hash differs too. Bots tend to add random noise to the image to make fingerprinting harder, and as a result each of them has a unique hash. Real people do not have unique hashes.
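On the server side, the uniqueness check boils down to counting how many visitors reported each hash. A minimal sketch in Python (the function name and data shape are my own illustration, not the author's code):

```python
from collections import Counter

def unique_hashes(canvas_hashes):
    """Return the set of canvas hashes seen exactly once across all visits.

    A hash shared by many visitors corresponds to a common OS/browser/device
    combination; a hash seen only once suggests the canvas image was noised,
    which is typical of bots trying to evade fingerprinting.
    """
    counts = Counter(canvas_hashes)
    return {h for h, n in counts.items() if n == 1}

# Two visitors share a hash; the third hash has never been seen before.
print(unique_hashes(["a3f5", "a3f5", "91bc"]))  # {'91bc'}
```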
So, a visitor is real if:
reCAPTCHA v3 score >= 0.9.
Logged into Yandex and into at least one of Google or VK.
Canvas hash is non-unique.
And a bot if:
reCAPTCHA v3 score <= 0.3.
Not logged in anywhere, or logged in only to Yandex (bots used to manipulate behavioral factors very often run under a Yandex profile).
Unique canvas hash.
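These selection rules can be sketched as a labeling function (field names are illustrative; visits matching neither rule are left unlabeled and excluded from training):

```python
def label_visit(score, in_google, in_yandex, in_vk, hash_unique):
    """Label a visit 'human', 'bot', or None (uncertain) per the rules above."""
    # Real: high score, logged into Yandex plus Google or VK, common canvas hash.
    if score >= 0.9 and in_yandex and (in_google or in_vk) and not hash_unique:
        return "human"
    # Bot: low score, logged in nowhere or only to Yandex, unique canvas hash.
    if score <= 0.3 and not in_google and not in_vk and hash_unique:
        return "bot"
    return None  # ambiguous visits are not used for training

print(label_visit(0.95, in_google=True, in_yandex=True, in_vk=False,
                  hash_unique=False))  # human
print(label_visit(0.10, in_google=False, in_yandex=True, in_vk=False,
                  hash_unique=True))   # bot
```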
Data was collected from three informational sites over a period of one month. A little more than forty thousand visits made it into the database, 25% of them bots and 75% real people. What exactly was collected is described below.
Bot detection by browser settings
Although the bots run on browser engines, this is far from the full-fledged Google Chrome they try to impersonate. They have to emulate many settings to look like a real user's browser. So let's try to train a neural network to find discrepancies between the emulated parameters and the real ones. To do this, we collect the maximum amount of information about the browser, namely:
OS, OS version, browser name, browser version, and device model where available.
Connection parameters: network type and speed.
Screen resolution, viewport size, presence of a scrollbar, and other display-related parameters.
WebGL parameters (video card model, memory size, etc.).
The types of media content that the browser can play.
Which fonts the browser supports (the 300 most common fonts are checked).
In total this yields several dozen parameters.
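Before training, these mixed parameters have to be turned into a fixed-length numeric vector: categorical fields one-hot encoded, numeric fields passed through, and the font checks already arriving as 0/1 flags. A rough sketch (the field names and vocabularies are assumptions for illustration):

```python
def encode_visit(params, vocab):
    """Flatten a dict of browser parameters into a numeric feature vector."""
    vec = []
    # Categorical parameters (OS, browser, ...) -> one-hot against a fixed vocabulary.
    for field, values in vocab.items():
        vec += [1.0 if params.get(field) == v else 0.0 for v in values]
    # Numeric parameters pass through as-is (hypothetical field names).
    for field in ("screen_w", "screen_h", "gpu_memory_mb"):
        vec.append(float(params.get(field, 0)))
    # The 300 font-support checks are already 0/1 flags.
    vec += [float(flag) for flag in params.get("fonts", [])]
    return vec

vocab = {"os": ["Windows", "Android", "iOS"], "browser": ["Chrome", "Safari"]}
visit = {"os": "Android", "browser": "Chrome",
         "screen_w": 412, "screen_h": 915, "gpu_memory_mb": 0,
         "fonts": [1, 0, 1]}
print(encode_visit(visit, vocab))
# [0.0, 1.0, 0.0, 1.0, 0.0, 412.0, 915.0, 0.0, 1.0, 0.0, 1.0]
```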
The next decision is which neural network architecture to use. Since the dataset is somewhat imbalanced, the first thing that came to mind was to try an autoencoder: train it on real people (the 75%) and interpret outliers as bots. The following architecture was used:
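Whatever the exact architecture, the autoencoder approach reduces to a decision rule: train on human visits only, then flag any visit whose reconstruction error exceeds a threshold derived from the human error distribution. A sketch of that step (the mean-plus-k-sigma threshold is my assumption, not necessarily what was used here):

```python
import statistics

def outlier_threshold(human_errors, k=3.0):
    """Threshold = mean + k * stdev of reconstruction errors on known humans."""
    return statistics.mean(human_errors) + k * statistics.stdev(human_errors)

def looks_like_bot(error, threshold):
    """An autoencoder trained only on humans reconstructs bot visits poorly."""
    return error > threshold

human_errors = [0.10, 0.12, 0.11, 0.09, 0.10]  # toy reconstruction errors
t = outlier_threshold(human_errors)
print(looks_like_bot(0.50, t), looks_like_bot(0.10, t))  # True False
```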
The result is this:
The overall error is large. Let's try an ordinary classifier built on fully connected layers instead. The following architecture was chosen:
The result is excellent! For screening out most of the bots used to manipulate behavioral factors, such a network is quite suitable.
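For intuition, a fully connected classifier is just stacked dense layers ending in a sigmoid that outputs the probability of a visit being a bot. A toy forward pass in pure Python (layer sizes and weights are arbitrary; the real model is trained on the full feature set):

```python
import math
import random

def dense(x, weights, biases, act):
    """One fully connected layer: act(W @ x + b)."""
    z = [sum(w * xi for w, xi in zip(row, x)) + b
         for row, b in zip(weights, biases)]
    return [act(v) for v in z]

def relu(v):
    return max(0.0, v)

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def mlp_forward(x, layers):
    """Run the input through each dense layer; the final sigmoid gives P(bot)."""
    for weights, biases, act in layers:
        x = dense(x, weights, biases, act)
    return x[0]

def random_layer(n_in, n_out, act):
    """Random untrained weights, zero biases (shape demo only)."""
    w = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    return (w, [0.0] * n_out, act)

# Toy 4 -> 3 -> 1 network on a 4-feature input.
random.seed(0)
layers = [random_layer(4, 3, relu), random_layer(3, 1, sigmoid)]
p_bot = mlp_forward([0.2, 1.0, 0.0, 0.5], layers)
assert 0.0 < p_bot < 1.0  # a probability, meaningless until trained
```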
But what if the budget allows running bots not merely on a browser engine, but on the real Google Chrome? This requires significantly more resources but is technically easy to implement. For such a case this neural network will not do. However, we can try to analyze the bot's behavior and compare it with the behavior of real people.
Bot detection by on-site behavior
Good bots emulate the behavior of a real person: they click, scroll, and move the mouse along human-like trajectories. But they probably slip up somewhere: perhaps a slightly different distribution of events, different delays, different click locations, and so on. Let's try to collect as much data as possible about visitor behavior. To do this, we analyze the following events:
For each event, the following parameters are collected:
Change along the X and Y axes.
Rate of change along the X and Y axes.
The number of elementary events received by the browser.
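Given a stream of timestamped cursor positions, the per-event deltas and velocities can be computed like this (the event format is an assumption for illustration):

```python
def event_features(samples):
    """Deltas and velocities between consecutive (t_ms, x, y) samples.

    Returns one (dx, dy, vx, vy) tuple per event; len(samples) itself
    gives the count of elementary events the browser received.
    """
    feats = []
    for (t0, x0, y0), (t1, x1, y1) in zip(samples, samples[1:]):
        dt = max(t1 - t0, 1)  # guard against zero time steps
        dx, dy = x1 - x0, y1 - y0
        feats.append((dx, dy, dx / dt, dy / dt))
    return feats

print(event_features([(0, 0, 0), (10, 5, 2), (30, 5, 2)]))
# [(5, 2, 0.5, 0.2), (0, 0, 0.0, 0.0)]
```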
The events form a time series, which means it makes sense to use one-dimensional convolutional networks. The optimal architecture turned out to be:
And the result is the following:
This result is also quite good. The drawback is that the visitor has to spend at least 20 seconds on the page for enough actions to accumulate, so this network cannot be used to filter traffic at page-load time.
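To illustrate why 1-D convolutions suit such data: a small kernel slid along the event series picks out local patterns, for example abrupt velocity jumps that a scripted trajectory may produce. A minimal valid convolution in pure Python (deep-learning libraries implement the same operation, usually as cross-correlation):

```python
def conv1d(seq, kernel):
    """Valid 1-D convolution (cross-correlation, as in most DL frameworks)."""
    k = len(kernel)
    return [sum(seq[i + j] * kernel[j] for j in range(k))
            for i in range(len(seq) - k + 1)]

# A [-1, 1] difference kernel responds to jumps in the signal.
velocities = [0, 0, 5, 5, 0]
print(conv1d(velocities, [-1, 1]))  # [0, 5, 0, -5]
```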