cross-posted from: https://lemmy.world/post/76533
One of the arguments made for Reddit’s API changes is that Reddit is now the go-to place for LLM training data (e.g. for ChatGPT).
I haven’t seen a whole lot of discussion around this and would like to hear people’s opinions. Are you concerned about your posts being used for LLM training? Do you not care? Do you prefer that your comments are available to train open source LLMs?
(I will post my personal opinion in a comment so it can be up/down voted separately)
I do not want my content to contribute to a proprietary LLM that will make billions for a large tech company without giving back to the community. Unfortunately, I think the fediverse will have a harder time countering large-scale data harvesting than a centralized service like Reddit.
On the other hand, I don’t mind open-source, privacy-respecting (is this a thing for LLMs?) LLMs using my content.
I am also wary of big tech companies using my comment history for their LLMs. However, I worry that the tech companies will scrape the data anyway and that Reddit’s API pricing just locks out the open-source LLMs. There are a few of those; here are a couple that I have played with:
https://github.com/nomic-ai/gpt4all
https://github.com/ggerganov/llama.cpp
Some projects even try to preserve privacy, but I think that’s more about what extra training data you give them and the queries you issue.
https://github.com/imartinez/privateGPT