Philipp Kern: What is happening with this “connection verification”?

You might see a verification screen pop up on more and more Debian web properties. Unfortunately, the AI world of today is meeting web hosts that use Perl CGIs and are not built as multi-tiered, scalable serving systems. The issues have been at three layers:

  1. Apache’s serving capacity runs full, with no threads left to accept new requests. This means that your connection will sit around for a long time without being accepted. In theory the capacity can be configured higher, but that only helps if requests are actually handled in time.
  2. Startup costs of request handlers are too high, because we spawn a process for every request. This currently affects the BTS and dgit’s browse interface. packages.debian.org has been fixed, which increased scalability sufficiently.
  3. Requests themselves are too expensive to be served quickly – think git blame without caching.

Ideally we would go and solve some of the scalability issues within the services themselves. However, there is also the question of how much we want to be able to serve – AI scraper demand is just a steady stream of requests whose results are never shown to humans.

How is it implemented?

DSA has now stood up some VMs with Varnish for proxying. Incoming TLS is terminated by hitch, and TLS “on-loading” towards the backends is done using haproxy. That way TLS goes in and TLS goes out. While Varnish does cache content if it is cacheable (e.g. if it does not depend on cookies), that is not the primary reason for using it: it can be used for flexible request and response rewriting.

If no cookie with a proof of work is provided, the user is redirected to a challenge page that does some WebCrypto in JavaScript – because that looked similar to what other projects do (e.g. haphash, which originally inspired the solution). However, so far it looks like scrapers generally do not run with JavaScript enabled, so this whole crypto proof-of-work business could probably be replaced with just a JavaScript-based redirect. The existing solution also has big (security) holes in it. And, as we found out, Firefox is slower at WebCrypto than Chrome. I have recently reduced the complexity, so you should notice it blocking you significantly less.
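
To give a rough idea of what such a challenge amounts to, here is a minimal sketch of a SHA-256 proof of work using the WebCrypto API. This is a simplified illustration, not the deployed code: the function name, the byte-based difficulty, and the idea of sending the nonce back to obtain the cookie are all assumptions.

    // Purely illustrative sketch: find a nonce so that
    // SHA-256(challenge + nonce) starts with `difficultyBytes` zero bytes.
    async function solveChallenge(challenge: string, difficultyBytes: number): Promise<number> {
      const encoder = new TextEncoder();
      for (let nonce = 0; ; nonce++) {
        const data = encoder.encode(challenge + nonce);
        const digest = new Uint8Array(await crypto.subtle.digest("SHA-256", data));
        if (digest.subarray(0, difficultyBytes).every((b) => b === 0)) {
          return nonce; // hypothetically sent back to the server to obtain the cookie
        }
      }
    }

The asymmetry is the point of such a scheme: the server verifies a single hash, while the client has to try many nonces – cheap for one human visitor, but adding up for a scraper requesting many URLs.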

Once you have the cookie, you can keep accessing the site for as long as the cookie is valid. Please do not make any assumptions about the cookies, or you will be broken in the future.

For legitimate scrapers that obey robots.txt, there is now an automatically generated IP allowlist in place (thanks, Marco d’Itri). It turns out that the search engines do not actually run JavaScript either and then loudly complain about the redirect to the challenge page. Other bots are generally exempt.

Conclusion 

I hope that we have now found something like the sweet spot, where the admins can stop spending human time on updating firewall rules and the services are generally available, reasonably fast, and still indexed. In case you see problems or run into a block with your own (legitimate) bots, please let me know.