• Skull giver@popplesburger.hilciferous.nl · 4 months ago

Of course robots.txt is voluntary, but the scraper that started this round of drama did actually follow robots.txt, so in this instance the problem would be solved.
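For what it's worth, honouring robots.txt is a few lines with Python's standard library; the user agent and rules below are made up for illustration:

```python
from urllib import robotparser

# Hypothetical robots.txt an instance might serve to opt out of scraping.
ROBOTS_TXT = """\
User-agent: ExampleScraper
Disallow: /

User-agent: *
Allow: /
"""

def may_fetch(user_agent: str, url: str) -> bool:
    """Return True if the robots.txt rules permit this user agent to fetch the URL."""
    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())
    return rp.can_fetch(user_agent, url)
```

A well-behaved scraper would fetch the instance's real `/robots.txt` and check every URL against it before requesting anything else.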

For malicious actors there is no solution, except for whitelisted federation (with authorised fetch and a few other settings) or encryption (i.e. Circles, the social network built on Matrix). Anyone can pretend to run a well-meaning Mastodon server and secretly scrape data. There's little difference between someone's web browser looking through comments on a profile and a bot collecting information. Pay a few dollars and those "browsers" will come from residential ISPs as well. Even Cloudflare no longer blocks scrapers if you pay the right service money.
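The whitelisted-federation idea boils down to a deny-by-default check before any activity is accepted; here's a minimal sketch with made-up instance names (a real server would also verify HTTP signatures via authorised fetch before trusting the origin claim at all):

```python
# Instances we explicitly federate with; everything else is rejected.
# Names are placeholders, not real servers.
ALLOWED_INSTANCES = {"friendly.example", "trusted.example"}

def accept_activity(origin_domain: str) -> bool:
    """Deny-by-default: only federate with allowlisted instances."""
    return origin_domain.lower() in ALLOWED_INSTANCES
```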

I’ve considered writing my own “scraper” to generate statistics about Lemmy/Mastodon servers (most active users, voting rings, etc.), but ActivityPub is annoying enough to work with that I haven’t made time for it.
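Once the ActivityPub plumbing is out of the way, the statistics themselves are trivial; a sketch of the "most active users" part over toy data (the field names are illustrative, not the actual ActivityPub schema):

```python
from collections import Counter

# Stand-in for comments fetched from an instance.
comments = [
    {"author": "alice@a.example", "post": 1},
    {"author": "bob@b.example",   "post": 1},
    {"author": "alice@a.example", "post": 2},
]

def most_active(comments: list[dict], n: int = 3) -> list[tuple[str, int]]:
    """Rank authors by number of comments, most active first."""
    return Counter(c["author"] for c in comments).most_common(n)
```

Voting-ring detection would be similar in spirit: aggregate who upvotes whom and look for suspiciously tight clusters.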

As for the “firewall”, I’m thinking more broadly here. For example, I’d also include things like DNSBL lookups, authorised fetch for anything claiming to be Mastodon, origin detection to bypass activitypub-proxy, a WAF for detecting and reporting exploitation attempts, possibly SpamAssassin integration to reject certain messages, and maybe even a “wireshark mode” for debugging Fediverse applications. I think a well-placed, well-optimised middlebox could help reduce the load on smaller instances, or even on larger ones that see a lot of bot traffic, especially during spam waves like the one with those Japanese kids last week.
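The DNSBL part is simple to sketch: you reverse the IP's octets, prepend them to the blocklist zone, and resolve that name; an answer in 127.0.0.0/8 means "listed", NXDOMAIN means "not listed". The zone below is Spamhaus's public one, used here only as an example:

```python
import socket

def dnsbl_query_name(ip: str, zone: str = "zen.spamhaus.org") -> str:
    """Build the reversed-octet hostname a DNSBL lookup resolves."""
    return ".".join(reversed(ip.split("."))) + "." + zone

def is_listed(ip: str, zone: str = "zen.spamhaus.org") -> bool:
    """True if the DNSBL returns an answer for the IP (requires network access)."""
    try:
        socket.gethostbyname(dnsbl_query_name(ip, zone))
        return True
    except socket.gaierror:  # NXDOMAIN: not listed
        return False
```

A middlebox would run this (cached, and against several zones) on every connecting host before letting its traffic through to the instance.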