It’s all made from our data, anyway, so it should be ours to use as we want

  • A1kmmA · 1 month ago

    Copyright laws are illogical - but I don’t think your claim is as clear cut as you think.

    Transforming data to a different format, even in a lossy fashion, is often treated as copyright infringement. Let’s say Alice produces a film, and Bob goes to the cinema, records it with a camera, and then compresses the recording into an Ogg file with Vorbis audio encoding and Theora video encoding.

    The final output of this process is a lossy compression of the input data - meaning that the video and audio are put through a transformation that represents them in a completely different form from the original, and it is impossible to reconstruct a pixel-perfect rendition of the original from the encoded data. The transformation includes things like analysing the motion between frames and creating a model to predict future frames.
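    To make the "lossy" point concrete, here is a minimal sketch (not a real codec, and far simpler than Vorbis/Theora): quantising 8-bit samples down to 16 levels and then decoding. The round trip lands close to the original but can never reproduce it exactly.

```python
# Minimal lossy "codec" sketch: quantise 0-255 samples into 16 buckets,
# then decode each bucket back to its midpoint. The reconstruction is
# close to the original, but a bit-perfect round trip is impossible -
# the defining property of lossy compression.

def encode(samples, levels=16):
    # Map each 0-255 sample onto one of `levels` buckets.
    step = 256 // levels
    return [s // step for s in samples]

def decode(codes, levels=16):
    # Reconstruct an approximation: the midpoint of each bucket.
    step = 256 // levels
    return [c * step + step // 2 for c in codes]

original = [12, 200, 37, 255, 99]
restored = decode(encode(original))
# Every restored value is within one bucket width of the original,
# yet the exact input values are gone for good.
assert all(abs(a - b) < 16 for a, b in zip(original, restored))
assert restored != original
```

    Despite the information loss, a court would still treat a quantised copy of a film as a copy - which is the crux of the argument above.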

    However, copyright laws don’t require that an infringing copy be an exact reproduction - lossy compression is generally treated as infringing, as is taking key elements and re-telling the same thing in different words.

    You mentioned Harry Potter below, and gave a papier-mâché example. Copyright laws generally have restricted scope, and if the source paper was an authorised copy, that is why the papier-mâché wouldn’t be infringing in most jurisdictions.

    However, let me do an experiment. I’ll prompt ChatGPT-4o-mini with the following prompt: “You are J K Rowling. Create a three paragraph summary of the entire book “Harry Potter and the Philosopher’s Stone”. Include all the original plot points and use the original character names. Ensure what you create is usable as a substitute to reading the book, and is a succinct but entertaining highly abridged version of the book”. I’ve reviewed the output (I won’t post it here since I think it would be copyright infringing, and also, given the author’s transphobic stances, I don’t want to promote her universe). I can say for sure that it accurately reproduces the major plot points and character names, while being insufficiently transformative: both the original and the text generated by the model are literary works, and the output could serve as a substitute for reading the book.

    So yes, the model (including its weights) is a highly compressed form of the input (admittedly far more so than the Ogg Vorbis/Theora example), and it can infer (i.e. decode to) outputs that contain copyrighted elements.

    • FaceDeer@fedia.io · 1 month ago

      Of course it’s not clear-cut, it’s the law. Laws are notoriously squirrelly once you get into court. However, if you’re going to make predictions one way or the other you have to work with what you know.

      I know how these generative AIs work. They are not “compressing data.” Your analogy to making a video recording is not applicable. I’ve discussed in other comments in this thread how ludicrously compressed the data would have to be if that were the case - it’s physically impossible.

      These AIs learn patterns from the training data. Themes, styles, vocabulary, and so forth. That stuff is not copyrightable.

      • A1kmmA · 1 month ago

        They are not “compressing data.” Your analogy to making a video recording is not applicable. These AIs learn patterns from the training data. Themes, styles, vocabulary, and so forth. That stuff is not copyrightable.

        A lossy compression algorithm for video is all about finding parameters 𝐖 of a function f that, given a (time, row, col) vector 𝐱, predicts an (R, G, B) colour vector 𝐲̂ = f(𝐱; 𝐖).

        Encoding means you have some training data - a matrix of pixel colours at different points in time, 𝐘, and a corresponding matrix giving the time, row and column for each row in 𝐘, called 𝐗. The algorithm finds 𝐖 to minimise some loss function between 𝐘̂ = f(𝐗; 𝐖) and 𝐘. A serialised form of 𝐖 makes up the compressed video stream.

        Decoding is then just an inference problem - given 𝐖, compute 𝐘̂ = f(𝐗; 𝐖) for whichever (time, row, col) points 𝐗 you care about. The predicted colours are then displayed at the appropriate points on the screen.

        This scheme tends to work well for interpolating - you can evaluate the pixel colour at any row or column within the limits 𝐖 was trained on, even at subpixel locations that weren’t in the original data, and at times between the original frames. Extrapolating beyond those ranges is unlikely to work well. When given the exact input vectors it was trained on, it will produce outputs that are likely slightly different, but close enough that the video as a whole is perceptually similar. The fact that interpolation works, however, tells us that the encoding is learning patterns from the training data so it can generalise - it’s not just recording the raw data.
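        The encode/decode scheme above can be sketched in a few lines. This is a deliberately toy version - a linear f and a tiny synthetic "video" whose brightness I've invented as 2·t + 3·row + 5·col - but it shows all three stages: fitting 𝐖 to minimise a loss ("encoding"), inference ("decoding"), and interpolation at a sub-pixel, between-frame point that was never in the training data.

```python
# Toy version of the scheme: training data X = (time, row, col),
# Y = brightness; fit parameters W of a linear model f(X; W), then
# "decode" by inference - including at points never stored.
import numpy as np

# Synthetic "video": brightness is (secretly) 2*t + 3*row + 5*col.
X = np.array([[t, r, c] for t in range(4) for r in range(4) for c in range(4)],
             dtype=float)
Y = 2 * X[:, 0] + 3 * X[:, 1] + 5 * X[:, 2]

# "Encoding": find W minimising the squared loss between f(X; W) and Y.
Xb = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)  # W is the "compressed stream"

# "Decoding" is inference: evaluate f at any point we care about -
# here a sub-pixel location between frames (t=1.5, row=2.25, col=0.5).
query = np.array([1.5, 2.25, 0.5, 1.0])
predicted = float(query @ W)   # ≈ 2*1.5 + 3*2.25 + 5*0.5 = 12.25
```

        The interpolation works precisely because the fit captured the pattern in the data rather than memorising the 64 stored samples - the same point the paragraph above makes.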

        Now, the interesting thing is that an LLM is effectively the same thing, with a couple of differences:

        1. Instead of the domain of f being a 3D (time, row, col) space, the input is a vector in a high-dimensional latent space.
        2. Instead of being trained over a single work, it’s trained over lots of different works, and so when there are things in common between those works, compression allows it to be more efficient.

        Just like how the lossily encoded video can’t reproduce the exact pixel colour at every point, a trained LLM usually can’t repeat word-for-word a piece of input data. But for many works that are included and mentioned a lot in the training data, there absolutely are points in the latent space where the parameters allow inference to reproduce the high-level characters and plot of the work, and to do it in a way that could serve as a substitute for the original work.

        Now this does expose gaps in copyright laws (e.g. why should LLM weights infringe copyright when our brains do a similar thing, and can also reproduce the plot and themes of works?) - applying copyright laws today means extrapolating outside the range of what legislators even imagined was possible when the laws were created. And in many countries, the law is applied differently to the rich and powerful. But I think that if a status-quo interpretation of copyright law and precedent were applied as the law stands, the outcome would very likely be that LLM model weights are often derivative works.

        Disclaimer: IANAL.

    • lad@programming.dev · 1 month ago

      How lossy can it be before it’s not infringement? A one-line summary of a book is also a lossy reproduction

      • A1kmmA · 1 month ago

        IANAL, and it will depend on jurisdiction. But generally transformative uses that are a completely different application, and don’t compete with the original are likely to be fair use. A one-line summary is probably more likely to promote the full book, not replace it. A multi-paragraph summary might replace the book if all the key messages are covered off.

        • lad@programming.dev · 1 month ago

          Not quite related to the topic, but I’ve encountered several ‘books’ that could be replaced by a few paragraphs of text, which is almost as bad as making a three-hour video tutorial where 30 seconds of footage or a bit of text would suffice. I find it horrible that such books get written at all