How Datadome collects your data for the greater good of humankind
Datadome is a global leader in web application security.
Technically speaking, Datadome uses a JavaScript file that collects data about visitors to your website and sends it to their servers. This data may include information such as the visitor's IP address, browser type, operating system, and the pages they visited on your site. It also includes information about the visitor's activity, such as mouse movements and clicks. This information is fed to Datadome's algorithms to block what they call “bad” traffic, such as bots and other automated programs.
Avoiding blocks from Datadome
There are a few ways to avoid getting blocked by Datadome. The easiest by far is to block their script altogether; I've found that this helps on some websites (maybe they had their “security” cursor set too low?). That being said, it won't work every time…
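Blocking the script comes down to one decision per outgoing request: does this URL point at the antibot vendor or not? Here is a minimal sketch of that decision. The hostname suffix is my assumption, so check the network tab on your own target before relying on it, and `should_block` is a name I made up for illustration.

```python
from urllib.parse import urlsplit

# Assumption: hostname suffixes the antibot script is served from.
# Verify these against the requests your target actually makes.
BLOCKED_HOST_SUFFIXES = ("datadome.co",)


def should_block(url: str) -> bool:
    """Return True if the request URL points at a blocked host."""
    host = (urlsplit(url).hostname or "").lower()
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in BLOCKED_HOST_SUFFIXES
    )
```

You would call a predicate like this from the request-interception hook of whatever browser automation tool you use, and abort the request when it returns True.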
Diving deep into Datadome
Crash course on antibot services
Antibot services work by first generating a technical fingerprint about your browser.
This fingerprint will usually contain the following information:
- Your User-Agent
- Your timezone
- Your browser's plugins
- Your screen resolution
- The fonts you have installed
- Your system's language
- Mouse movements
- Keyboard input data
- And much more!
They use this fingerprint to detect discrepancies.
Let's look at the following example.
As a bot creator, you edit your User-Agent to be that of a Windows machine (while you are running Linux, as every good hacker should ;)).
However, by inspecting navigator properties (especially navigator.platform), Datadome's backend engine detects that you are lying.
This discrepancy will increase the likelihood of you getting blocked.
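To make the idea concrete, here is a rough sketch of what such a consistency check could look like on the server side. This is not Datadome's actual logic, which I haven't seen; the mapping table and the function name are my own, and a real engine would cross-check many more signals.

```python
# Illustrative mapping from a User-Agent token to the prefixes that
# navigator.platform usually starts with on that OS (e.g. "Win32",
# "MacIntel", "Linux x86_64"). This table is an assumption for the sketch.
UA_TO_PLATFORM_PREFIXES = {
    "Windows NT": ("Win",),
    "Macintosh": ("Mac",),
    "Linux": ("Linux",),
}


def platform_mismatch(user_agent: str, navigator_platform: str) -> bool:
    """Return True when the claimed User-Agent and navigator.platform disagree."""
    for ua_token, prefixes in UA_TO_PLATFORM_PREFIXES.items():
        if ua_token in user_agent:
            return not navigator_platform.startswith(prefixes)
    return False  # unknown User-Agent: this check has no opinion
```

With a Windows User-Agent and a navigator.platform of "Linux x86_64", the function returns True, which is exactly the lie described above.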
This is technical fingerprinting in a nutshell. But wait. There's more!
Behavioural analysis
A human takes time to execute actions. They make spelling mistakes. They click the wrong button and go back. On the other hand, a bot is usually quite linear, fast and accurate. By analysing mouse movements and typing speed, Datadome's backend engine will be able to weed out small players who are not thinking about these variables. To counter that, one might generate mouse paths with Bézier curves. This works for now, but I don't expect it to last, as they keep improving their product to be more and more reliable.
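For the Bézier idea, here is a minimal sketch of a curved mouse path between two points. The function name and the `spread` parameter are mine, and the way I jitter the control points is just one arbitrary choice among many.

```python
import random


def bezier_mouse_path(start, end, steps=50, spread=100.0, rng=None):
    """Return steps + 1 points along a cubic Bézier curve from start to end."""
    if rng is None:
        rng = random.Random()
    (x0, y0), (x3, y3) = start, end
    # Two control points at roughly 1/3 and 2/3 of the way, randomly
    # offset so the path bends instead of being a straight line.
    x1 = x0 + (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y1 = y0 + (y3 - y0) / 3 + rng.uniform(-spread, spread)
    x2 = x0 + 2 * (x3 - x0) / 3 + rng.uniform(-spread, spread)
    y2 = y0 + 2 * (y3 - y0) / 3 + rng.uniform(-spread, spread)
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1 - t
        # Standard cubic Bézier formula.
        x = u**3 * x0 + 3 * u**2 * t * x1 + 3 * u * t**2 * x2 + t**3 * x3
        y = u**3 * y0 + 3 * u**2 * t * y1 + 3 * u * t**2 * y2 + t**3 * y3
        points.append((x, y))
    return points
```

The curve only gives you the shape of the movement. You would still have to feed the points to your browser one by one, with irregular delays, since timing is part of what gets analysed.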
Network fingerprinting
So, one might say: OK, I'm going to make raw HTTP requests then! But here's the catch: antibot companies will go as far as fingerprinting your networking stack.
HTTP Headers
HTTP headers already include a great deal of information; from the User-Agent to the order of the headers, you are revealing a lot by making a request to your target service. Datadome can fingerprint this and use it to rate-limit or block you altogether.
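To show why header order alone is a signal, here is a small sketch that hashes the sequence of header names and ignores the values. I'm not claiming that any vendor computes it this way; the function name and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib


def header_order_fingerprint(headers):
    """Hash the order of header names.

    headers: list of (name, value) tuples, in the order they were received.
    """
    names = [name.lower() for name, _value in headers]
    return hashlib.sha256(",".join(names).encode()).hexdigest()[:16]
```

Two clients sending the exact same headers in a different order end up with different fingerprints, so copying Chrome's header values is not enough if your HTTP library reorders them.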
TCP Fingerprinting
Let's say that you go as far as perfectly emulating HTTP headers. Let's say that you use the User-Agent of a Chrome browser on Windows. But obviously, you are still running Linux (because you're a good hacker). Here is the bad news for you: TCP properties are different on Windows and Linux.
Let's get classic and quote Wikipedia:
Just inspecting the Initial TTL and window size fields is often enough in order to successfully identify an operating system, which eases the task of performing manual OS fingerprinting.
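Here is a minimal sketch of the TTL half of that trick. It relies on the commonly cited defaults (64 for Linux and macOS, 128 for Windows) and on the fact that each router hop decrements the TTL by one. The table is an assumption for illustration, not an authoritative list, and I left window size out because its typical values vary a lot between versions and configurations.

```python
# Initial TTL values that operating systems commonly start from.
COMMON_INITIAL_TTLS = (32, 64, 128, 255)

# Assumption: commonly cited defaults, far from exhaustive.
INITIAL_TTL_TO_OS = {
    64: "Linux or macOS",
    128: "Windows",
}


def guess_initial_ttl(observed_ttl: int) -> int:
    """Round the observed TTL up to the nearest common initial value."""
    for ttl in COMMON_INITIAL_TTLS:
        if observed_ttl <= ttl:
            return ttl
    raise ValueError("TTL must be between 0 and 255")


def guess_os(observed_ttl: int) -> str:
    """Guess the sender's OS family from the TTL seen on an incoming packet."""
    return INITIAL_TTL_TO_OS.get(guess_initial_ttl(observed_ttl), "unknown")
```

A packet arriving with a TTL of 57 most likely started at 64, so a server running this check would guess Linux or macOS, no matter what your User-Agent says.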
TLS fingerprinting
Yet, you are not going to stop here, are you? You are going to try to make your bot's fingerprint as close to a real machine as possible. You managed to fix HTTP headers and the TCP fingerprinting issue. Everything runs fine in production until you scale up and get blocked again! The issue might lie in TLS fingerprinting.
What is TLS fingerprinting, you may ask? Well, it's simple: the way your TLS implementation works is different from that of a real browser.
Chrome uses specific TLS properties that are different from those of, let's say, Firefox, or requests (if you've used Python, you know what I'm talking about).
However, it's not the end of the world. Nice projects such as utls can help you fix this.
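To give an idea of what a TLS fingerprint can look like, here is a sketch of the publicly documented JA3 method, which summarises a ClientHello as five comma-separated fields and hashes them with MD5. I'm using JA3 purely as an illustration; I don't know which method Datadome uses internally, and the numbers in the usage note below are made up.

```python
import hashlib


def ja3_style_hash(tls_version, ciphers, extensions, curves, point_formats):
    """Build a JA3-style string from ClientHello values and return (string, md5)."""
    fields = [
        str(tls_version),
        "-".join(str(c) for c in ciphers),        # cipher suites, in offered order
        "-".join(str(e) for e in extensions),     # extension IDs, in offered order
        "-".join(str(c) for c in curves),         # supported groups
        "-".join(str(p) for p in point_formats),  # EC point formats
    ]
    ja3_string = ",".join(fields)
    return ja3_string, hashlib.md5(ja3_string.encode()).hexdigest()
```

Calling `ja3_style_hash(771, [4865, 4866], [0, 23], [29, 23], [0])` builds the string `771,4865-4866,0-23,29-23,0` and hashes it. Order matters: a client offering the same cipher suites in a different order gets a different hash, which is why a default Python or Go TLS stack does not look like Chrome.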
So, everything looks fine. You've beaten all the obvious network fingerprinting vectors. But didn't you forget something crucial? That's right, you forgot about the JS-generated fingerprint!
Faking JS fingerprints
Remember the first part of this article? We talked about how Datadome generates an in-browser fingerprint. Well, it turns out that even when using requests, you need to send a fingerprint too, or you'll get blocked. To achieve that, you need to identify the format of the fingerprint, work out how to encode it, and watch Datadome's script for updates so that you can update your own script too. This will take a lot more work than maintaining a single stealth Puppeteer stack and using it across targets.
So, that's why I think that it's better to use real browsers rather than writing manual requests. I acknowledge that this depends on your use case, though. Shoe-botting people will usually want to use requests because, for them, the benefits of developing specialized software for each antibot vendor outweigh the cost.
Conclusion
In conclusion, I believe that it's better to use real browsers rather than writing manual requests. Datadome can fingerprint you and use that fingerprint to rate-limit or block you altogether. To avoid that, you can fake some parts of your in-browser fingerprint or completely reverse-engineer their script to act like a web browser (but you'll also need to bypass the network fingerprinting issues). Data is getting more and more inaccessible, and the world is burning.
So, what's your favourite way to scrape? Please let me know in the comments!