Dan Jones @[email protected]

Husband, Father, Software Engineer (PHP, go, etc.). Lover of Star Trek and anime.

Looking for other things to do, such as writing, acting, voice acting, but not really finding the time for it. Maybe when my kids are a little older, I'll get back on stage.

Feeling pretty bleak about the future of the United States. #NeverTrump

Feel free to follow. I may follow back if we seem to have similar interests.

#BlackLivesMatter #TransRightsAreHumanRights #StayWoke

Other interests: #Parenting #StarTrek #Writing #Theater #anime #PHP #golang #Programming #WebDevelopment #genealogy #ScienceFiction #DadJokes

My Links

links.danielrayjones.com

Pronouns

he/him/his

XMPP

[email protected]

Mastodon account

fosstodon.org/@danjones000

LinkedIn

linkedin.com/in/danjones000

Résumé

danielrayjones.com

Codeberg's avatar
Codeberg
@[email protected]

We apologize for a period of extreme slowness today. The army of AI crawlers just leveled up and hit us very badly.

The good news: We're keeping up with the additional load of new users moving to Codeberg. Welcome aboard, we're happy to have you here. After adjusting the AI crawler protections, performance significantly improved again.

  • permalink
  • 13 days ago
Watchful Citizen's avatar
Watchful Citizen
@[email protected]

in reply to this object

@Codeberg Great job!

  • permalink
  • 10 days ago
mosher's avatar
mosher
@[email protected]

in reply to this object

@Codeberg Really need to sue them for a denial of service attack, get them banned from touching a computer for 20 years.

  • permalink
  • 11 days ago
Mania Emma's avatar
Mania Emma
@[email protected]

in reply to this object

@Codeberg #ai

  • permalink
  • 11 days ago
bit101's avatar
bit101
@[email protected]

in reply to this object

@Codeberg I've been moving my stuff to Codeberg. Glad to see you have a presence on Mastodon! Thanks for being there.

  • permalink
  • 12 days ago
A Fine Day to Return Home 🏳️‍⚧️🏳️‍🌈🇺🇦🇵🇸's avatar
A Fine Day to Return Home 🏳️‍⚧️🏳️‍🌈🇺🇦🇵🇸
@[email protected]

in reply to this object

@Codeberg
Keep up the good work!

  • permalink
  • 12 days ago
Álex Sáez's avatar
Álex Sáez
@[email protected]

in reply to this object

@Codeberg can you identify the owners? I wonder if they are famous companies or someone else (not asking for names, just wondering).

  • permalink
  • 13 days ago
Daniel Lakeland's avatar
Daniel Lakeland
@[email protected]

in reply to this object

@Codeberg

Are you guys using traffic shaping and queue management at all? For example putting something like QFQ qdisc on your routers and then marking packets from spammy sources as low-priority and putting them into a low priority queue can be a huge boost in responsiveness for your real customers.
Spammy sources could be those that open new connections too often, transfer too many bytes, or have too many open active connections. All of those kinds of things can be accounted in nftables.

  • permalink
  • 13 days ago
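The three "spammy source" signals named in the post above (new connections opened too often, too many bytes transferred, too many open connections) can be sketched as a simple classifier. This is only an illustration in Python, not nftables or qdisc configuration; `SourceStats`, `priority_for`, and all threshold values are hypothetical and would need tuning against real traffic.

```python
from dataclasses import dataclass


@dataclass
class SourceStats:
    """Per-source counters over one observation window (hypothetical shape)."""
    new_connections: int   # connections opened during the window
    bytes_sent: int        # bytes transferred to this source during the window
    open_connections: int  # connections currently open


# Illustrative thresholds only; real values depend on the site's traffic.
MAX_NEW_CONNECTIONS = 100
MAX_BYTES = 50 * 1024 * 1024
MAX_OPEN_CONNECTIONS = 20


def priority_for(stats: SourceStats) -> str:
    """Return "low" if any of the three limits is exceeded, else "normal"."""
    if (stats.new_connections > MAX_NEW_CONNECTIONS
            or stats.bytes_sent > MAX_BYTES
            or stats.open_connections > MAX_OPEN_CONNECTIONS):
        return "low"
    return "normal"
```

In a real deployment the "low" result would map to a packet mark that the router's queueing discipline schedules behind normal traffic, rather than to a string.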
lemgandi's avatar
lemgandi
@[email protected]

in reply to this object

@Codeberg

Thank You For Your Service. (I moved to Codeberg, like, yesterday, and signed up for a recurring donation.)

  • permalink
  • 13 days ago
kajer's avatar
kajer
@[email protected]

in reply to this object

@Codeberg

gzip bomb when?

  • permalink
  • 13 days ago
RyanParsley's avatar
RyanParsley
@[email protected]

in reply to this object

@Codeberg what if the new captcha was get a bug fix PR merged? That'd keep them robits out.

  • permalink
  • 13 days ago
Alex's avatar
Alex
@[email protected]

in reply to this object

@Codeberg could just set up a few traps that crash the AI crawlers or something. This is going to get really annoying, and hopefully these bastards don't interfere with some of my work in the long run with what they've been doing on the internet. Scraping is already largely frowned upon, so these pos are just making it worse.

  • permalink
  • 13 days ago
Solinvictus :mastodon:'s avatar
Solinvictus :mastodon:
@[email protected]

in reply to this object

@Codeberg

GIF: tonight show nbc GIF by The Tonight Show Starring Jimmy Fallon

  • permalink
  • 13 days ago
Bradley Kuhn's avatar
Bradley Kuhn
@[email protected]

in reply to this object

😲🤬 re: what's happened to @Codeberg today.
The AI ballyhoo *is* a real DDoS against one of the few code hosting sites that takes a stand against slurping #FOSS code into LLM training sets — in violation of #copyleft.

Deregulation/lack-of-regulation will bring more of this. ∃ plenty of blame to go around, but #Microsoft & #GitHub deserve the bulk of it; they trailblazed the idea that FOSS code-hosting sites are lucrative targets.

https://giveupgithub.org

#GiveUpGitHub #FreeSoftware #OpenSource

Give Up GitHub - Software Freedom Conservancy giveupgithub.org
  • permalink
  • 13 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

It seems like the AI crawlers learned how to solve the Anubis challenges. Anubis is a tool hosted on our infrastructure that requires browsers to do some heavy computation before accessing Codeberg again. It really saved our nerves over the past months, because it spared us from manually maintaining blocklists and gave us a working way to tell "real browsers" from "AI crawlers".

  • permalink
  • 13 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

However, we can confirm that at least Huawei networks now send the challenge responses, and they do seem to take a few seconds to compute the answers. It looks plausible, so we assume that AI crawlers leveled up their computing power to emulate more of real browser behaviour and bypass the variety of challenges that platforms have enabled to fend off the bot army.

  • permalink
  • 13 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

We have a list of explicitly blocked IP ranges. However, due to a configuration oversight on our part, these ranges were only blocked on the "normal" routes. The "anubis-protected" routes didn't consider the blocklist. That was not a problem while Anubis itself still kept the crawlers out on those routes.

However, now that they managed to break through Anubis, there was nothing stopping these armies.

It took us a while to identify and fix the config issue, but we're safe again (for now).

  • permalink
  • 13 days ago
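One way to read the oversight described above: the blocklist check lived inside one route's handling instead of in front of all of them. A minimal sketch of checking blocked ranges before any route-specific logic, assuming that reading. The ranges below are documentation placeholders, not real offenders, and `is_blocked` and `handle_request` are hypothetical names, not Codeberg's configuration.

```python
import ipaddress

# Placeholder ranges from the reserved documentation blocks.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


def is_blocked(client_ip: str) -> bool:
    """True if the address falls inside any blocked range."""
    addr = ipaddress.ip_address(client_ip)
    return any(addr in net for net in BLOCKED_RANGES)


def handle_request(client_ip: str, route_kind: str) -> str:
    """Decide what to do with a request.

    route_kind is "normal" or "challenge-protected" (assumed labels).
    """
    # Check the blocklist first, before any route-specific logic,
    # so that no route can skip it.
    if is_blocked(client_ip):
        return "deny"
    if route_kind == "challenge-protected":
        return "challenge"
    return "serve"
```

With this ordering, a blocked range is denied on both kinds of routes, whether or not the client is able to solve the challenge.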
eternalyperplxed's avatar
eternalyperplxed
@[email protected]

in reply to this object

@Codeberg Thanks for fighting the good fight!

  • permalink
  • 13 days ago
Blaise Pabón's avatar
Blaise Pabón
@[email protected]

in reply to this object

@Codeberg
#HugOps

  • permalink
  • 13 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

For the load average auction, we offer these numbers from one of our physical servers. Who can offer more?

(It was not the "wildest" moment, but the only one for which we have a screenshot)

Screenshot from htop system monitoring tool.
Load average: 5831.24 (and historical numbers: 4755.18 2710.22)
Tasks: 9537, 12566 thr ; 28 running
Uptime: 43 days, 04:50:20

  • permalink
  • 13 days ago
Mx Autumn :blobcatpumpkin:'s avatar
Mx Autumn :blobcatpumpkin:
@[email protected]

in reply to this object

@Codeberg wow :psyduck:

  • permalink
  • 13 days ago
montrak's avatar
montrak
@[email protected]

in reply to this object

@Codeberg

GIF: a cartoon of a pigeon holding a cup of coffee with a speech bubble that says is fine

  • permalink
  • 13 days ago
DamonHD's avatar
DamonHD
@[email protected]

in reply to this object

@Codeberg In the days of single CPU servers (early 90s?) and an interesting filesystem problem, I think I may have seen ~400 at a client site!

  • permalink
  • 13 days ago
Kevin's avatar
Kevin
@[email protected]

in reply to this object

@Codeberg ouch. This remains a cat-and-mouse game.

At least having them solve the Anubis challenge does cost them extra resources, but if they can do that at scale, it doesn't promise a lot of good.

  • permalink
  • 13 days ago
Askaaron's avatar
Askaaron
@[email protected]

in reply to this object

@Codeberg wow - that looks scary. Thanks for all your work ❤️

  • permalink
  • 13 days ago
Mason Loring Bliss's avatar
Mason Loring Bliss
@[email protected]

in reply to this object

@Codeberg I'm really sorry there isn't a good legal avenue to stave off the abuse. Horrifying.

  • permalink
  • 13 days ago
Xe :verified:'s avatar
Xe :verified:
@[email protected]

in reply to this object

@Codeberg I really wish you contacted me at all about this before going public.

  • permalink
  • 13 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

@cadey I'm sorry if this gave you any unwanted or negative attention. I consider crawlers emulating more of real browser features to bypass protections of websites an inevitable future, and today at least one big crawler seems to have started doing so. ~f

  • permalink
  • 13 days ago
Xe :verified:'s avatar
Xe :verified:
@[email protected]

in reply to this object

@Codeberg Can we continue this conversation over email after my panic subsides? [email protected].

  • permalink
  • 13 days ago
Bredroll's avatar
Bredroll
@[email protected]

in reply to this object

@Codeberg yeowsa. this feels like an arms race that is going to get harder :(

  • permalink
  • 13 days ago
Hakan Bayındır's avatar
Hakan Bayındır
@[email protected]

in reply to this object

@Codeberg This is a great number, but I have seen higher in my career. Unfortunately I either took no screenshots or lost the ones I had.

5831.24 is pretty good though. Congrats on hitting it, hope your head doesn't hurt. :D

  • permalink
  • 13 days ago
lindesbs #FckAFD's avatar
lindesbs #FckAFD
@[email protected]

in reply to this object

@Codeberg
How much RAM do you have in your machines?

  • permalink
  • 13 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

@lindesbs 160 GB apparently. Looked it up from https://codeberg.org/Codeberg-Infrastructure/meta/src/branch/main/hardware/achtermann.md. ~f

meta/hardware/achtermann.md at main Codeberg.org
  • permalink
  • 13 days ago
Aurora's avatar
Aurora
@[email protected]

in reply to this object

@Codeberg damn. The only time I've seen numbers like this were when a ceph server went down.

  • permalink
  • 13 days ago
Sharlatan's avatar
Sharlatan
@[email protected]

in reply to this object

@Codeberg so what is the threshold for alerting? Grafana/Zabbix/Prometheus?

  • permalink
  • 13 days ago
Jann Horn's avatar
Jann Horn
@[email protected]

in reply to this object

@Codeberg huh, that's a pretty kernel-heavy workload, so much red

  • permalink
  • 13 days ago
Stephen Foskett's avatar
Stephen Foskett
@[email protected]

in reply to this object

@Codeberg omfg that load!

  • permalink
  • 12 days ago
arialdo's avatar
arialdo
@[email protected]

in reply to this object

@Codeberg thank you for the details. Very interesting. They are worth a blog post.

  • permalink
  • 12 days ago
SKC 🏳️‍🌈's avatar
SKC 🏳️‍🌈
@[email protected]

in reply to this object

@Codeberg what if you had challenges for AI to perform that made it mine bitcoin for you and you just block them at the end anyway 🤣

  • permalink
  • 12 days ago
odo2063's avatar
odo2063
@[email protected]

in reply to this object

@Codeberg Here goes more™...

Image: tmux connected to 8 servers; in each of the 8 windows an htop is running, showing ~100% load on all 8 servers.

  • permalink
  • 12 days ago
Þór Sigurðsson's avatar
Þór Sigurðsson
@[email protected]

in reply to this object

@Codeberg How much of that load was actual I/O wait?

  • permalink
  • 11 days ago
Lenny's avatar
Lenny
@[email protected]

in reply to this object

@Codeberg Why not just block Huawei Cloud ASN prefixes?
It's easy to get them (e.g. from projectdiscovery)

  • permalink
  • 11 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

@lenny If you read the thread, you'll notice that this is exactly what we did, except that we made a mistake. ~f

  • permalink
  • 10 days ago
Michael Simons's avatar
Michael Simons
@[email protected]

in reply to this object

@Codeberg Great thread and explanation. Thank you.

  • permalink
  • 13 days ago
Ludovic  :Firefox:  :FreeBSD:'s avatar
Ludovic :Firefox: :FreeBSD:
@[email protected]

in reply to this object

@Codeberg #opshugs

  • permalink
  • 13 days ago
Stefano Zacchiroli's avatar
Stefano Zacchiroli
@[email protected]

in reply to this object

@Codeberg so, to clarify, do you have evidence that the bots were solving Anubis challenges or not, i.e., it was due to the configuration issue? (I think it's inevitably going to happen if Anubis gets traction. I'm just curious if we're already there or not.) Thanks for your work and transparency on all this.

  • permalink
  • 13 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

@zacchiro Yes, the crawlers completed the challenges. We tried to verify if they are sharing the same cookie value across machines, but that doesn't seem to be the case.

  • permalink
  • 13 days ago
Bradley Kuhn's avatar
Bradley Kuhn
@[email protected]

in reply to this object

I have a follow up question, though, @Codeberg, re: @zacchiro's question. Is it *possible* that giant human farms of Anubis challenge-solvers actually did it? Or did it all happen so fast that there is no way it could be that?

#Huawei surely could fund such a farm and the routing software needed to get the challenge to the human and back to the bot quickly enough that it might *seem* the bot did it.

  • permalink
  • 13 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

@bkuhn
Anubis challenges are not solved by humans. It's not like a captcha. It's a challenge that the browser computes, based on the assumption that crawlers don't run real browsers for performance reasons and only implement simpler crawlers.

So at least one crawler now seems to emulate enough browser behaviour to make it pass the anubis challenge. ~f
@zacchiro

  • permalink
  • 13 days ago
Bradley Kuhn's avatar
Bradley Kuhn
@[email protected]

in reply to this object

@Codeberg I get it now.

Thanks for taking the time to clue me in.

I'm lucky that I haven't needed to learn about this until now and I'm so sorry you've had to do all this work to fight this LLM training DDoS!

Cc: @zacchiro

  • permalink
  • 12 days ago
Ondřej Surý's avatar
Ondřej Surý
@[email protected]

in reply to this object

@Codeberg Is your list shared? It would be good to have a list of carefully curated AI-bot block lists.

  • permalink
  • 13 days ago
Henrý Ólson's avatar
Henrý Ólson
@[email protected]

in reply to this object

@Codeberg are the ip blocklists public?

  • permalink
  • 13 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

@nemo Currently not. We wanted to investigate the legal situation with regards to sharing such lists. They could currently contain individual's IP addresses and likely need to be cleaned up first. ~f

  • permalink
  • 13 days ago
Henrý Ólson's avatar
Henrý Ólson
@[email protected]

in reply to this object

@Codeberg no worries, ty for fighting the good fight o7

  • permalink
  • 13 days ago
Steven Sandoval's avatar
Steven Sandoval
@[email protected]

in reply to this object

@Codeberg Was the solution to increase the proof-of-work difficulty?

  • permalink
  • 13 days ago
Codeberg's avatar
Codeberg
@[email protected]

in reply to this object

@baltakatei No. We fixed our config. Now we're blocking the offending IP ranges directly. ~f

  • permalink
  • 13 days ago
altf4's avatar
altf4
@[email protected]

in reply to this object

@Codeberg Damn it. I hate AI !

  • permalink
  • 13 days ago
ec4x's avatar
ec4x
@[email protected]

in reply to this object

@Codeberg have you tried filing a criminal complaint against the "attacker"? Basically it's a breach of ToS and a DoS, right? So it might qualify as a violation of § 303b StGB (German criminal code). I mean, I am no lawyer, but at least it's worth a try?

  • permalink
  • 13 days ago
Chamomile 🐑's avatar
Chamomile 🐑
@[email protected]

in reply to this object

@Codeberg How much were they slowed down by actually solving the challenges? I was under the impression that the proof of work was the primary intent of Anubis, and the fact that most crawlers just bombed out and didn't even attempt them in the first place was a bonus.

  • permalink
  • 13 days ago
Julien Avérous – 🇫🇷🇪🇺🇺🇦's avatar
Julien Avérous – 🇫🇷🇪🇺🇺🇦
@[email protected]

in reply to this object

@Codeberg It makes me wonder: is there a public curated IP blocklist somewhere that we can all use? I searched a bit and found only weak robots.txt solutions based on User-Agent.

  • permalink
  • 13 days ago
mikeTesteLinuxQlub's avatar
mikeTesteLinuxQlub
@[email protected]

in reply to this object

@Codeberg Seems like a bad cat-and-mouse game; glad that you could stay on top of it (proves that humans can still win). Jesus Christ, those big tech companies should be held responsible for that shit and pay billions in fines. Maybe then they would think of stopping that insanity.

  • permalink
  • 13 days ago
Andreas Fink's avatar
Andreas Fink
@[email protected]

in reply to this object

@Codeberg instead of blocking, poison the content for them...

  • permalink
  • 12 days ago
NerdNextDoor :Blobhaj:'s avatar
NerdNextDoor :Blobhaj:
@[email protected]

in reply to this object

@Codeberg Good luck with fighting the bots. I recently moved my OSDev project and site to Codeberg from GitHub and so far it’s been great!

Thank you for helping the open-source community!

  • permalink
  • 13 days ago
Krzysztof Sakrejda's avatar
Krzysztof Sakrejda
@[email protected]

in reply to this object

@Codeberg this is an absurd level of waste they're introducing

  • permalink
  • 13 days ago
Dan Jones's avatar
Dan Jones
@[email protected]

in reply to this object

They need fresh new code to steal train their models, and everybody knows all the best code is on Codeberg.

@[email protected] @[email protected]

  • permalink
  • 13 days ago
  • 3 likes
  • 1 share
Dan Jones's avatar
Dan Jones
@[email protected]

in reply to this object

Pardon my ignorance, but couldn't they just be using a headless browser, which would still do everything a regular browser does? Just recently, ChatGPT beat Cloudflare's CAPTCHA using a similar system. Is there really any way around this at all? @[email protected]

ChatGPT Agent Passes CAPTCHA Test, Exposes Flaws in Bot Detection Systems Analytics Insight: Latest AI, Crypto, Tech News & Analysis
  • permalink
  • 13 days ago
  • -1 like
  • 4 replies
GNU/翠星石's avatar
GNU/翠星石
@[email protected]

in reply to this object

@danjones000 @Codeberg The only way is to give scrapers some delicious bait that humans won't follow, but the scraper will.

At the end of the bait, you can put gzip bombs or, for something more elaborate, multiple bait links, where multiple visits cause the IP to be temporarily nullrouted (a human may visit the bait once).

Trying to identify the scraper via fingerprinting and/or JavaScript is doomed to fail, as scrapers can use the same browsers as users (firefox+xdotool will do, but headless browsers tend to be more reliable and less resource-intensive).
  • permalink
  • 13 days ago
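The "multiple bait visits lead to a temporary nullroute" idea above can be sketched as a small in-memory tracker. This is a sketch under assumptions: `BaitTracker` and its parameters are hypothetical, the default of tolerating one hit follows the "a human may visit the bait once" remark, and the actual null-routing (firewall or routing changes) is out of scope here.

```python
from collections import defaultdict


class BaitTracker:
    """Counts visits to bait links per IP and bans repeat visitors for a while."""

    def __init__(self, max_hits: int = 1, ban_seconds: float = 3600.0):
        self.max_hits = max_hits        # bait visits tolerated before a ban
        self.ban_seconds = ban_seconds  # how long a ban lasts
        self._hits = defaultdict(int)
        self._banned_until = {}

    def record_bait_hit(self, ip: str, now: float) -> None:
        """Register one visit to a bait link at time `now` (in seconds)."""
        self._hits[ip] += 1
        if self._hits[ip] > self.max_hits:
            # Too many bait visits: ban temporarily and reset the counter.
            self._banned_until[ip] = now + self.ban_seconds
            self._hits[ip] = 0

    def is_banned(self, ip: str, now: float) -> bool:
        """True while the ban for this IP has not yet expired."""
        return now < self._banned_until.get(ip, 0.0)
```

Passing `now` explicitly keeps the sketch deterministic; a real service would use a monotonic clock and expire old counters.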
George B's avatar
George B
@[email protected]

in reply to this object

@danjones000 @Codeberg

The crawler is a separate system where the gains of using even less of a browser are significant.

Basically training data scraping vs serving path, the serving path will always have more resources per request.

  • permalink
  • 13 days ago
endrift 🏳️‍⚧️'s avatar
endrift 🏳️‍⚧️
@[email protected]

in reply to this object

@danjones000 @Codeberg the way Anubis works is by making it computationally prohibitive to get through the challenge. It's still possible, but it would require a significant amount of time to do so, something that crawlers don't like doing.

  • permalink
  • 13 days ago
em♡⁠'s avatar
em♡⁠
@[email protected]

in reply to this object

@danjones000 @Codeberg the whole point is to force the scrapers to run a whole browser, because running a whole browser is significantly more expensive.

  • permalink
  • 13 days ago
argv minus one's avatar
argv minus one
@[email protected]

in reply to this object

@Codeberg

These companies are evidently willing to pay an absolutely staggering cost to do their scraping.

I wonder, are they paying with their own money, or are they “borrowing” some unsuspecting strangers' compromised computers/routers/etc to do the work?

  • permalink
  • 13 days ago
Trolli Schmittlauch 🦥's avatar
Trolli Schmittlauch 🦥
@[email protected]

in reply to this object

@binaergewitter Death of the week: Anubis challenges as an effective measure

  • permalink
  • 13 days ago
Sharlatan's avatar
Sharlatan
@[email protected]

in reply to this object

@Codeberg maybe it's time to improve Anubis https://bsky.app/profile/techaro.lol

Techaro (@techaro.lol) Bluesky Social
  • permalink
  • 13 days ago
依云🦊's avatar
依云🦊
@[email protected]

in reply to this object

@Codeberg I observed them too about a month ago. I then sent the whole AS to Google's recaptcha and it worked (at least people who can solve recaptcha can still access our site while these bots can't).

  • permalink
  • 12 days ago
ozamidas's avatar
ozamidas
@[email protected]

in reply to this object

@Codeberg boy Huawei is so nasty

I wonder who the biggest offenders are on this matter...

  • permalink
  • 11 days ago
Aleksandra Fedorova :fedora:'s avatar
Aleksandra Fedorova :fedora:
@[email protected]

in reply to this object

@Codeberg

"AI crawlers learned how to solve the Anubis challenges"

Why is the EU discussing chat control and not AI crawler control, again?

  • permalink
  • 13 days ago
p̷t̵r̴a̵c̷e̶'s avatar
p̷t̵r̴a̵c̷e̶
@[email protected]

in reply to this object

@Codeberg eBPF could be more effective and easier on the CPU, since it acts on a much lower network layer. Anubis kinda has its limits and it's way too easy to circumvent (as you found out).

Maybe it's worth considering eBPF (if you haven't already).

And thanks guys for your work. I'm a proud supporter and I'll continue to support your work. Companies shouldn't control the Open Source space

  • permalink
  • 13 days ago
Akseli :quake_verified:​ :kde:'s avatar
Akseli :quake_verified:​ :kde:
@[email protected]

in reply to this object

@Codeberg It's going to be a rat race after all; I expected this to happen eventually. Surprising it took this long.

  • permalink
  • 13 days ago
Tobias Hellgren's avatar
Tobias Hellgren
@[email protected]

in reply to this object

@Codeberg Perhaps it's time to stop letting robots solve puzzles and instead feed them bombs. Do we know how well a ZIP bomb works on these crawlers?

  • permalink
  • 12 days ago
varx/tech's avatar
varx/tech
@[email protected]

in reply to this object

@Codeberg Have you looked into serving these LLM crawlers alternative versions of the site, with poisoned data? (And rate-limiting, of course.) I know it would be additional work for you to implement this, but... it might be effective.

I'm thinking you could have a precomputed set of 1000 different poison repos that get served up randomly, each of which is a Markov-chain-scrambled version of the files in a real repo.

(I wrote https://codeberg.org/timmc/marko to do something similar to the contents of my blog posts—a Markov model on either characters or words.)

  • permalink
  • 12 days ago
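The Markov-chain scrambling suggested above can be sketched with a word-level model: record which words follow each word in the source text, then walk those transitions at random. This is a minimal illustration, not the code from the linked marko project; `build_model` and `scramble` are hypothetical names.

```python
import random
from collections import defaultdict


def build_model(words, order=1):
    """Map each run of `order` words to the list of words seen right after it."""
    model = defaultdict(list)
    for i in range(len(words) - order):
        model[tuple(words[i:i + order])].append(words[i + order])
    return model


def scramble(text, length, order=1, seed=None):
    """Generate `length` words that locally resemble `text` but say nothing."""
    words = text.split()
    if len(words) <= order:
        return text
    rng = random.Random(seed)
    model = build_model(words, order)
    state = tuple(words[:order])
    out = list(state)
    while len(out) < length:
        followers = model.get(state)
        if not followers:
            # Dead end (the state only occurs at the end of the text):
            # restart from a random state that does have followers.
            state = rng.choice(list(model.keys()))
            out.extend(state)
            continue
        out.append(rng.choice(followers))
        state = tuple(out[-order:])
    return " ".join(out[:length])
```

Precomputing a fixed pool of scrambled outputs, as the post proposes, would keep this work off the request path.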