So every single repository should have to spend its time, energy, and resources accommodating a bunch of venture-funded companies that want to get all of this shit for free without contributing anything to those repositories themselves?
Was Aaron Swartz wrong to scrape those repositories? He shouldn’t have been accessing all those publicly-funded academic works? Making it easier for him to access that stuff would have been “capitulating to hackers?”
I think the problem here is that you don’t actually believe that information should be free. You want to decide who and what gets to use that “publicly-funded academic work”, and you have decided that some particular uses are allowable and others are not. Who made you that gatekeeper, though?
I think it’s reasonable that information freely posted for public viewing should be freely viewable, as in anyone can view it. If viewing all of it puts a load on the servers providing it, but there’s an alternate way of providing it that doesn’t, what’s wrong with doing that? It solves everyone’s problems.
Okay, look, the reason people are disagreeing with you is that you’re responding to the following problem:
“Private companies are preventing access to public resources due to their rapacious, selfish greed.”
And your response has been:
“By changing how we structure things to make it easier for them to take things, we can both enjoy the benefits of the public resources.”
The companies are not the same as normal patrons. They’re motivated by a desire for infinite growth and will consume anything they can access for low prices to resell for high ones. They do not contribute to these public resources, because they only wish to exploit them for the potential capital they represent.
Drawing an equivalence between these two things requires willfully disregarding this distinction so that you can act as if the underlying moral principle is being betrayed just because your rhetorical opponent didn’t define it as rigorously as possible. They left it loosely defined because they expected you to engage with this in good faith.
Yes, I know the companies are not the same as normal patrons. I don’t care that they’re not the same as normal patrons. All I’m concerned about is that the normal patrons get access to the data. The solution I proposed does that.
The problem, as I see it, is that that’s not all you’re concerned about. Your goal also includes a second aspect: you want those companies not to have access to that data. So my proposal is unacceptable because it doesn’t thwart those companies.
I’m not drawing an equivalence between companies and individual patrons; I’m just saying my goals don’t include actively obstructing those companies. If they can get what they want without interfering with what the normal patrons want, why is that a bad thing?
If someone did an Aaron-Swartz-style scrape, then published the data they scraped in a downloadable archive so that AI trainers could download it and use it, would you find that objectionable?
That suggestion is exactly the same as what I started with when I said “IMO the ideal solution would be the one Wikimedia uses, which is to make the information available in an easily-downloadable archive file.” It just cuts out the Aaron-Swartz-style external middleman, so it’s easier and more efficient to create the downloadable data.
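To make that concrete, here’s a rough sketch of what “download the dump instead of crawling the pages” looks like from the consumer’s side, modeled on Wikimedia’s public dumps. The exact dump filename below is illustrative (real dumps are listed at dumps.wikimedia.org); the point is that one bulk download replaces millions of individual page requests.

```python
# Sketch: fetch a Wikimedia-style dump archive instead of scraping pages.
# The URL below is illustrative; actual dumps are listed at https://dumps.wikimedia.org/.
import requests

DUMP_URL = "https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2"

def fetch_dump(url: str, dest: str) -> None:
    # Stream the archive to disk in chunks so the whole file is never held in memory.
    with requests.get(url, stream=True, timeout=60) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as out:
            for chunk in resp.iter_content(chunk_size=1 << 20):
                out.write(chunk)

if __name__ == "__main__":
    fetch_dump(DUMP_URL, "enwiki-latest-pages-articles.xml.bz2")
```

One request, one file, and the site’s page-serving infrastructure never gets touched.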
I don’t understand why the burden is on the victims here.
They put the website up. Load balancing, rate limiting, and the like go with the turf. It’s their responsibility to make the site easy to use and hard to break. Putting up an archive of the content the scrapers want is an easy, straightforward way to accomplish that goal.
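And the “rate limiting goes with the turf” part isn’t exotic either. Here’s a minimal sketch of a per-client token-bucket limiter; in practice sites would usually do this at the reverse proxy rather than in application code, and the rate and burst numbers are just placeholders.

```python
# Sketch: per-client token-bucket rate limiting.
# Each client starts with `burst` tokens, which refill at `rate` tokens per second.
import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate: float, burst: int) -> None:
        self.rate = rate                                  # tokens refilled per second
        self.burst = burst                                # maximum tokens a client can hold
        self.tokens = defaultdict(lambda: float(burst))   # current tokens per client
        self.last = defaultdict(time.monotonic)           # last time each client was seen

    def allow(self, client_id: str) -> bool:
        now = time.monotonic()
        elapsed = now - self.last[client_id]
        self.last[client_id] = now
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens[client_id] = min(self.burst, self.tokens[client_id] + elapsed * self.rate)
        if self.tokens[client_id] >= 1.0:
            self.tokens[client_id] -= 1.0
            return True
        return False  # caller should answer with HTTP 429 Too Many Requests

# Example: roughly 5 requests/second per client, with bursts of up to 20.
limiter = TokenBucket(rate=5.0, burst=20)
print(limiter.allow("203.0.113.7"))  # True until that client's bucket runs dry
```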
I think what’s really going on here is that your concern isn’t about ensuring that the site is up, and it’s certainly not about ensuring that the data it’s providing is readily available. It’s that there are these specific companies you don’t like and you just want to forbid them from accessing otherwise freely accessible data.
That is absolutely ridiculous. The pressure AI scraping puts on sites vastly outstrips anything they were built for, as evidenced by the fact that those systems are going down.
Yes. Which is why I’m suggesting providing an approach that doesn’t require scraping the site.