AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums

sabreW4K3@lazysoci.al · 3 days ago

AI Scraping Bots Are Breaking Open Libraries, Archives, and Museums

LandedGentry@lemmy.zip · edit-2 3 hours ago

deleted by creator

Kissaki@beehaw.org · 2 days ago

They were shouted down and called Luddites.

By whom and where?

LandedGentry@lemmy.zip · 1 day ago

You know this is such a dishonest question. Are you seriously expecting an answer?

The US just banned states from making any legislation limiting AI for a decade. That’s how dogmatic it has become.

Kissaki@beehaw.org · 1 day ago

I haven’t heard or read any such thing, and the EU passed legislation regarding AI regulation. Which seems like the opposite of those claims.

I really don’t see how it’s a dishonest question.

knightly the Sneptaur@pawb.social · 1 day ago

“In the USA” and “by large portions of the capitalist class” are the answers to your question. I think the previous commenter was just a bit too incredulous because this topic came up in the news lately, and they’d expected it to be common knowledge even outside the U.S. since much of the “AI Industry” is involved with U.S. tech companies.

LandedGentry@lemmy.zip · edit-2 1 day ago

I am mostly annoyed that he couldn’t lift a finger to do a cursory search. It’s incredibly easy to verify what I’ve said.

Almost any time I start bringing sources to AI evangelists, they either stop responding or start picking apart the source instead of actually engaging with the material or verifying it themselves.

LandedGentry@lemmy.zip · edit-2 1 day ago

Use your search engine of choice - I do not want to get bogged down in your picking at whatever source I choose. This is incredibly easy to verify. You could’ve done it in less time than it took you to write that response.

FaceDeer@fedia.io · 3 days ago

This seems contradictory. On the one hand you’re saying that these works are wrongly locked behind paywalls, but on the other you’re saying that scraping them is an “assault on the cornerstones of our public knowledge.” Is this information supposed to be freely viewable or not?

IMO the ideal solution would be the one Wikimedia uses, which is to make the information available in an easily-downloadable archive file. That lets anyone who wants the whole thing have it without having to “hammer” the servers. Meanwhile the servers can be protected by standard load-balancing and DDoS prevention systems.
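For illustration only, one common building block of that kind of protection is a per-client rate limiter. The sketch below is a generic token bucket, not any particular product's implementation; the class name `TokenBucket` and its parameters are made up for this example.

```python
import time


class TokenBucket:
    """Per-client token bucket: `rate` requests per second, bursts up to `capacity`.

    Hypothetical helper for illustration; real deployments usually do this
    in a reverse proxy or load balancer rather than in application code.
    """

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        # client id -> (tokens remaining, time of last update)
        self.state = {}

    def allow(self, client: str) -> bool:
        now = self.clock()
        # New clients start with a full bucket.
        tokens, last = self.state.get(client, (self.capacity, now))
        # Refill in proportion to elapsed time, capped at capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.state[client] = (tokens - 1, now)
            return True
        self.state[client] = (tokens, now)
        return False
```

A server would call `allow(client_id)` per request and answer with an error such as HTTP 429 when it returns `False`. How to identify a "client" (IP address, API key, something else) is left open here.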

Zaleramancer@beehaw.org · 3 days ago

There’s a difference between making information accessible to humans for the purposes of advancing our shared knowledge vs saying that public institutions should subsidize the needs of private for-profit organizations.

It’s like, you can say, “Oh yeah, people should have access to freshwater for free,” and also say, “Companies shouldn’t be allowed to pump infinite freshwater from those sources to bottle it for profit.”

Those aren’t contradictory if your actual goal is the benefit of humankind and not, like, pedantic genie logic.

FaceDeer@fedia.io · 3 days ago

Unlike water, though, data can be duplicated easily.

Zaleramancer@beehaw.org · 3 days ago

Bandwidth can’t, though.

Is it okay to hire a bunch of people to check out half a library’s books, then rent them to people for money? Is that fine, or an obvious abuse?

Rendering this service inaccessible to actual human people in order to feed your for-profit software is only different in medium from that.

Captain Beyond@linkage.ds8.zone · edit-2 2 days ago

This

I’ve been very outspoken about my non-belief in intellectual property; I don’t think reading information or making a copy of it is stealing it. On the flipside, these bots are effectively performing a denial-of-service attack on public infrastructure, wasting computing resources, bandwidth, and time that are finite. The internet is for humans first and bots second; I don’t care about bots so much as long as they are well-behaved, which these are not.

My own instance went under several weeks back, then I installed Anubis and suddenly it’s usable again.
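For anyone wondering how a challenge page like this can help: the general idea is a proof-of-work check, where the visitor's browser has to burn a little CPU before it gets the page. The sketch below is a generic hashcash-style example and is not Anubis's actual code; the function names and the "leading zero hex digits" difficulty rule are assumptions chosen for illustration.

```python
import hashlib
import secrets


def make_challenge() -> str:
    """Server side: issue a random challenge string (hypothetical helper)."""
    return secrets.token_hex(16)


def solve(challenge: str, difficulty: int) -> int:
    """Client side: brute-force a nonce whose SHA-256 digest, taken over
    challenge + nonce, starts with `difficulty` zero hex digits."""
    prefix = "0" * difficulty
    nonce = 0
    while not hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest().startswith(prefix):
        nonce += 1
    return nonce


def verify(challenge: str, nonce: int, difficulty: int) -> bool:
    """Server side: checking a solution costs a single hash."""
    digest = hashlib.sha256(f"{challenge}{nonce}".encode()).hexdigest()
    return digest.startswith("0" * difficulty)
```

The asymmetry is the point: solving takes many hashes on average, verifying takes one. A single human visitor barely notices, while a scraper requesting huge numbers of pages pays the cost every time it is challenged.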

FaceDeer@fedia.io · 3 days ago

Bandwidth can’t, though.

Bandwidth is incredibly cheap. The problem these sites are having is not running into bandwidth limits, it’s that providing the pages requires processing to generate them. That’s why Wikipedia’s solution works - they offer all the “raw” data in a single big archive, which takes just as much bandwidth to download but way fewer server resources to process (because there’s literally no processing - it’s just a big blob of data).
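A rough sketch of that idea, under stated assumptions: the records are exported once into a compressed file, and bulk consumers download that file instead of requesting every generated page. This is not Wikimedia's actual dump format (their dumps are XML and SQL files); the function names and the gzip-compressed JSON Lines layout are made up for this example.

```python
import gzip
import json


def build_dump(records, path):
    """Write all records once into a gzip-compressed JSON Lines file.

    Run offline (e.g. on a schedule), so serving the result is a plain
    static-file download with no per-request page generation.
    """
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_dump(path):
    """Consumer side: stream records back out of the downloaded file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)
```

The resulting file can sit on any static host or mirror, which keeps bulk downloads away from the application servers that generate pages for regular visitors.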

Is it okay to hire a bunch of people to check out half a library’s books, then rent them to people for money?

This analogy fails because, as I said, data can be duplicated easily. Making a copy of the data doesn’t obstruct other people from also viewing the data provided you avoid the sorts of resource bottlenecks I described above.

Is your problem really about the accessibility of this data? Or is it that you just don’t want those awful for-profit companies you hate to have access to it? I really get the impression that that’s the real problem here - people hate AI companies, and so a solution that gives everyone what they want is unacceptable because the AI companies are included in “everyone.”

Zaleramancer@beehaw.org · 3 days ago

Dude, my problem is that capitalism is going to ruin everything. It is a rotting sickness that cuts through every layer of society and creates systemic, ugly problems.

Do you know how excited I was when LLM tech was announced? Do you know how much it sucked to realize, so soon, that companies were going to do their best to use it to optimize profits?

The free access of information problem is just a manifestation of this dark specter on society.

You are acting as if we can approach this problem in the abstract, where you have to abide by simplistic, binary philosophical rules, and not as if we live in a world of constant moral compromise and complexity.

It’s not as simple as, “Oh, you say that you believe in freedom of information, but curious how you don’t want private companies to use it to make money at your expense! Guess you’re a hypocrite.”

Tell me what you actually believe, or stop cycling back to this like it’s a damning rebuttal.

FaceDeer@fedia.io · 3 days ago

It’s ironic that you’re railing against capitalism while espousing exactly the sort of scarcity mindset that capitalism is rooted in, whereas I’m the one taking the “information wants to be free” attitude that would normally be associated with anti-capitalist mindsets.

Do you know how excited I was when LLM tech was announced? Do you know how much it sucked to realize, so soon, that companies were going to do their best to use it to optimize profits?

They do that with everything. Does that mean that everything must therefore become some kind of all-or-nothing battleground wherein companies must be thwarted?

It’s not as simple as, “Oh, you say that you believe in freedom of information, but curious how you don’t want private companies to use it to make money at your expense! Guess you’re a hypocrite.”

Emphasis added. That part is where you’re in error about my view, it’s not at my expense. It doesn’t harm me any.

Tell me what you actually believe, or stop cycling back to this like it’s a damning rebuttal.

I have been.

Zaleramancer@beehaw.org · 3 days ago

Wow, you’re beginning to understand the actual arguments and debates going on. :3

Why are you taking their side, buddy?

FaceDeer@fedia.io · 3 days ago

I’m not “taking their side.” I’m just not actively trying to harm them. The world is not a zero-sum game, it’s often possible for everyone to get what they want without harming each other in the process.

LandedGentry@lemmy.zip · edit-2 3 hours ago

deleted by creator

FaceDeer@fedia.io · 3 days ago

so every single repository should have to spend their time, energy, and resources on accommodating a bunch of venture funded companies that want to get all of this shit for free without contributing to these repositories at all themselves?

Was Aaron Swartz wrong to scrape those repositories? He shouldn’t have been accessing all those publicly-funded academic works? Making it easier for him to access that stuff would have been “capitulating to hackers?”

I think the problem here is that you don’t actually believe that information should be free. You want to decide who and what gets to use that “publicly-funded academic work”, and you have decided that some particular uses are allowable and others are not. Who made you that gatekeeper, though?

I think it’s reasonable that information that’s freely posted for public viewing should be freely viewable. As in anyone can view it. If they want to view all of it and that puts a load on the servers providing it, but there’s an alternate way of providing it that doesn’t put that load on the servers, what’s wrong with doing that? It solves everyone’s problems.

Zaleramancer@beehaw.org · 3 days ago

Really?

Okay, look, the reason people are disagreeing with you is that you’re responding to the following problem:

“Private companies are preventing access to public resources due to their rapacious, selfish greed.”

And your response has been:

“By changing how we structure things to make it easier for them to take things, we can both enjoy the benefits of the public resources.”

The companies are not the same as normal patrons. They’re motivated by a desire for infinite growth and will consume anything that they can access for low prices to resell for high ones. They do not contribute to these public resources, because they only wish to abuse them for the potential capital they have.

Drawing an equivalence between these two things requires the willful disregard of this distinction so that you can act as if the underlying moral principle is being betrayed because your rhetorical opponent didn’t define it as rigorously as possible. They didn’t do that out of an expectation that you would engage with this in good faith.

Why are you doing this?

FaceDeer@fedia.io · 3 days ago

Yes, I know the companies are not the same as normal patrons. I don’t care that they’re not the same as normal patrons. All I’m concerned about is that the normal patrons get access to the data. The solution I proposed does that.

The problem, as I see it, is that’s not all that you are concerned about. Your goal also includes a second aspect; you want those companies to not have access to that data. So my proposal is not acceptable because it doesn’t thwart those companies.

I’m not drawing an equivalence between companies and individual patrons, I’m just saying my goals don’t include actively obstructing those companies. If they can get what they want without interfering with what the normal patrons want, why is that a bad thing?

LandedGentry@lemmy.zip · edit-2 3 hours ago

deleted by creator

FaceDeer@fedia.io · 3 days ago

If someone did an Aaron-Swartz-style scrape, then published the data they scraped in a downloadable archive so that AI trainers could download it and use it, would you find that objectionable?

LandedGentry@lemmy.zip · edit-2 3 hours ago

deleted by creator

FaceDeer@fedia.io · 3 days ago

That suggestion is exactly the same as what I started with when I said “IMO the ideal solution would be the one Wikimedia uses, which is to make the information available in an easily-downloadable archive file.” It just cuts out the Aaron-Swartz-style external middleman, so it’s easier and more efficient to create the downloadable data.

Lucy :3@feddit.org · 3 days ago

Use Iocaine and Anubis!

Kissaki@beehaw.org · 2 days ago

Alternative: go-away

Geodad@beehaw.org · 3 days ago

I’ve been seeing more Anubis lately. It pops up for like 5 seconds.

Lucy :3@feddit.org · 3 days ago

Action -> Reaction

Geodad@beehaw.org · 3 days ago

I use a VPN, so my traffic is automatically looked upon as suspicious.

Lucy :3@feddit.org · 3 days ago

I doubt that there are (m)any Anubis deployments that distinguish between suspicious traffic and everything else. It’s just that as more companies get aggressive with scraping, we are getting more aggressive with said tools.

Geodad@beehaw.org · 3 days ago

Yeah, I can see that. I like seeing the cute anime art pop up briefly.

sabreW4K3@lazysoci.al · 3 days ago

Anubis is what slrpnk uses and it blocks the community icon for the electric vehicles community 😭