It’s all made from our data, anyway, so it should be ours to use as we want
Although I’m a firm believer that most AI models should be public domain or open source by default, the premise of “illegally trained LLMs” is flawed, because there really is no assurance that the LLMs currently in use were illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world.
The idea of… well, ideas, being copyrightable should shake the boots of anyone in this discussion, especially since when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.
The underlying technology has more than enough good uses that banning it would simply cause it to flourish in places that do not ban it, which means, as usual, that everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose against these companies as it currently stands. Only by requiring the models to be made available to the public do we ensure that the playing field doesn’t tip further in their favor, to the point that AI technology only exists to benefit them.
If the model is built on the corpus of humanity, then humanity should benefit.
Another clown dick article by someone who knows fuck all about ai
To speak of AI models being “made public domain” is to presuppose that the AI models in question are covered by some branch of intellectual property. Has it been established whether AI models (even those trained on properly licensed content) even are covered by some branch of intellectual property in any particular jurisdiction(s)? Or maybe by “public domain” the author means that they should be required to publish the weights and also that they shouldn’t get any trade secret protections related to those weights?
It’s not punishment. LLMs do not belong to them; they belong to all of humanity. Tear down the enclosing fences.
This is our common heritage, not OpenAI’s private property
Correct
“Given they were trained on our data, it makes sense that it should be public commons – that way we all benefit from the processing of our data”
I wonder how many people besides the author of this article are upset solely about the profit-from-copyright-infringement aspect of automated plagiarism and bullshit generation, and thus would be satisfied by the models being made more widely available.
The inherent plagiarism aspect of LLMs seems far more offensive to me than the copyright infringement, but both of those problems pale in comparison to the effects on humanity of masses of people relying on bullshit generators with outputs that are convincingly-plausible-yet-totally-wrong (and/or subtly wrong) far more often than anyone notices.
I liked the author’s earlier very-unlikely-to-be-met-demand activism last year better:
…which at least yielded the amusingly misleading headline OpenAI ordered to delete ChatGPT over false death claims (it’s technically true - a court didn’t order it, but a guy who goes by the name “That One Privacy Guy” while blogging on linkedin did).
They’re spitting out propaganda and misinformation mostly from what I can see. If anything, it should get a refund.
-Outside of coding / debugging tasks (and that’s hit or miss)
A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.
I’m not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.
Yes, mining companies should all be nationalised for digging up the country’s ground and putting carbon in the country’s air.
You must be fun at parties.
this comment doesn’t make any sense
You must be new here.
So banks will be public domain when they’re bailed out with taxpayer funds, too, right?
They should be, but currently it depends on the type of bailout, I suppose.
For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank’s assets, and now effectively owns the bank.
At the same time, if a bank goes under, that means they owe more than they own, so “ownership” of that entity is basically worthless. In those cases, a bailout of the customers does nothing for the owners, because the owners still get wiped out.
The GM bailout in 2009 also involved wiping out all the shareholders, the government taking ownership of the new company, and the government spinning off the newly issued stock.
The AIG bailout required the company to basically issue new stock to dilute owners down to 20% of the company, while the government owned the other 80%, and the government made a big profit when it exited that transaction and sold the stock off to the public.
So it’s not super unusual. Government can take ownership of companies as a condition of a bailout. What we generally don’t necessarily want is the government owning a company long term, because there’s some conflict of interest between its role as regulator and its interest as a shareholder.
With banks this is also true if they do not have enough liquid assets to meet the legal requirements. So the bank might not be able to count all bank accounts as assets, but the FDIC can. The FDIC can also then restructure the bank and force creditors to take a haircut.
This is why investment banks should be separate from banks that have consumer accounts that are insured by the government.
Then you can just let the investment bank fail. This was the whole premise of Glass-Steagall, which was repealed under Clinton…
Public domain wouldn’t be the right term for banks being publicly owned. At least for the normal usage of Public Domain in copyright. You can copy text and data, you can’t copy a company with unique customers and physical property.
Oh good point. I’m not actually sure what the phrase would be… Publicly owned?
Banks are redundant, so is the stock market. These institutions do not need to, and should not be private. They are level playing fields in the economy, not participants trying to tilt the board for taking over the game.
I mean, that sometimes did happen.
Germany propped up the Commerzbank after 2007 by essentially buying a large part of it, and managed to sell several tranches with a healthy profit.
Same is true for Lufthansa during COVID.
No, “the banks” wouldn’t be what the AI would be trained on, it would be the private info of individuals the banks do business with.
Imaginary property has always been a tricky concept, but the law always ends up just protecting the large corporations at the expense of the people who actually create things. I assume the end result here will be large corporations getting royalties from AI model usage or measures put in place to prevent generating content infringing on their imaginary properties and everyone else can get fucked.
It’s like what happened with Spotify. The artists and the labels were unhappy with the copyright infringement of music happening with Napster, Limewire, Kazaa, etc. They wanted the music model to be the same “buy an album from a record store” model that they knew and had worked for decades. But, users liked digital music and not having to buy a whole album for just one song, etc.
Spotify’s solution was easy: cut the record labels in. Let them invest and then any profits Spotify generated were shared with them. This made the record labels happy because they got money from their investment, even though their “buy an album” business model was now gone. It was ok for big artists because they had the power to negotiate with the labels and get something out of the deal. But, it absolutely screwed the small artists because now Spotify gives them essentially nothing.
I just hope that the law that nothing created by an LLM is copyrightable proves to be enough of a speed bump to slow things down.
Bandcamp still runs on this model though, and quite well
It’s also one of the few places that have lossless audio files available for download. I’m a big fan of Bandcamp. I like having all my music local.
Same. I refuse to use Spotify; I’ve got 400 GB of MP3s and Winamp
It could also contain non-public domain data, and you can’t declare someone else’s intellectual property as public domain just like that, otherwise a malicious actor could just train a model with a bunch of misappropriated data, get caught (intentionally or not) and then force all that data into public domain.
Laws are never simple.
Forcing a bunch of neural weights into the public domain doesn’t make the data they were trained on also public domain, in fact it doesn’t even reveal what they were trained on.
LOL no. The weights encode the training data and it’s trivially easy to make AI generators spit out bits of their training data.
paper?
No, training data.
No, he’s challenging the assertion that it’s “trivially easy” to make AIs output their training data.
Older AIs have occasionally regurgitated bits of training data as a result of overfitting, which is a flaw in training that modern AI training techniques have made great strides in eliminating. It’s no longer a particularly common problem, and even if it were it only applies to those specific bits of training data that were overfit on, not on all of the training data in general.
Last time I looked it up and calculated it, these large models were trained on something like only 7x as many tokens as they have parameters. If you thought of it like compression, a 1:7 ratio for lossless text compression is perfectly possible.
I think the models can still output a lot of stuff verbatim if you try to get them to, you just hit the guardrails they put in place. Seems to work fine for public domain stuff. E.g. “Give me the first 50 lines from Romeo and Juliet.” (albeit with a TOS warning, lol). “Give me the first few paragraphs of Dune.” seems to hit a guardrail, or maybe it’s just forced through reinforcement learning.
A preprint paper was released recently that detailed how to get around RL by controlling the first few tokens of a model’s output, showing the “unsafe” data is still in there.
I’ve been working with local LLMs for over a year now. No guardrails, and many of them fine-tuned against censorship. They can’t output arbitrary training material verbatim.
Llama 3 was trained on 15 trillion tokens, for both the 8B and 70B parameter versions. So around 1:1000 (roughly 1:1900 for 8B and 1:200 for 70B), not 1:7.
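For anyone who wants to check that ratio, here is a quick back-of-the-envelope sketch. It only uses the figures quoted in this thread (15 trillion tokens, 8B and 70B parameters); the variable and function names are made up for illustration.

```python
# Figures as quoted in the thread, not independently verified here.
TRAINING_TOKENS = 15e12
PARAMETER_COUNTS = {"8B": 8e9, "70B": 70e9}


def tokens_per_parameter(tokens, parameters):
    """Return how many training tokens were seen per model parameter."""
    return tokens / parameters


ratios = {
    name: tokens_per_parameter(TRAINING_TOKENS, count)
    for name, count in PARAMETER_COUNTS.items()
}

for name, ratio in ratios.items():
    # Printed as parameters:tokens, matching the "1:7" phrasing above.
    print(f"{name}: about 1:{ratio:,.0f}")
```

Either way the ratio is a couple of orders of magnitude away from 1:7, which is the point being made.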
I thought he meant LLMs shot out bits of paper like some ticker-tape parade.
How easy are we talking about here? Also, making the model public domain doesn’t mean making the output public domain. The output of an LLM should still abide by copyright laws, as they should be.
It wouldn’t contain any public-domain data though. That’s the thing with LLMs, once they’re trained on data the data is gone and just added to the series of weights in the model somewhere. If it ingested something private like your tax data, it couldn’t re-create your tax data on command, that data is now gone, but if it’s seen enough private tax data it could give something that looked a lot like a tax return to someone with an untrained eye. But, a tax accountant would easily see flaws in it.
So what you’re saying is that there’s no way to make it legal and it simply needs to be deleted entirely.
I agree.
There’s no need to “make it legal”, things are legal by default until a law is passed to make them illegal. Or a court precedent is set that establishes that an existing law applies to the new thing under discussion.
Training an AI doesn’t involve copying the training data, the AI model doesn’t literally “contain” the stuff it’s trained on. So it’s not likely that existing copyright law makes it illegal to do without permission.
By this logic, you can copy a copyrighted image as long as you decrease the resolution, because the new image does not contain all the information in the original one.
Am I allowed to take a copyrighted image, decrease its size to 1x1 pixels and publish it? What about 2x2?
It’s very much not clear when a modification violates copyright because copyright is extremely vague to begin with.
More like reduce it to a handful of vectors that get merged with other vectors.
In the case of Stable Diffusion, they used 5 billion images to train a model 1.83 gigabytes in size. So if you reduce a copyrighted image to 3 bits (not bytes - bits), then yeah, I think you’re probably pretty safe.
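The 3-bit figure can be reproduced with a rough calculation like the one below. It takes the two numbers from the comment at face value (5 billion images, 1.83 gigabytes) and assumes decimal gigabytes; the names are illustrative only.

```python
# Figures as quoted in the thread; "gigabyte" is assumed to mean 10**9 bytes.
MODEL_SIZE_BYTES = 1.83e9
TRAINING_IMAGES = 5e9


def bits_per_training_item(model_bytes, item_count):
    """Average model capacity, in bits, available per training item."""
    return model_bytes * 8 / item_count


bits_per_image = bits_per_training_item(MODEL_SIZE_BYTES, TRAINING_IMAGES)
print(f"about {bits_per_image:.2f} bits per training image")
```

This is only an average over the whole training set. It says nothing about whether individual images that appear many times in the data end up better represented than others.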
Right, like I did. They’re safeguarding Disney and other places like that now. It’s just the little guys who get screwed.
Delete them. Wipe their databases. Make the companies start from scratch with new, ethically acquired training data.
Mmm yes so all that electricity is pure waste
Genuine question, does anyone know how much of the electricity is used for training the model vs using it to generate responses?
Not specifically, but training is pretty fucking expensive to do, while generating is kinda easy. The OpenAI models are massive, training them cost a lot. Though they also have a lot of traffic. But unless they stop training new models, I don’t think generating answers will ever catch up to training.
For perspective, all of the data centers in the US combined use 4% of total electric load.
The environmental cost of training is a bit of a meme. The details are spread around, but basically, Alibaba trained a GPT-4 level-ish model on a relatively small number of GPUs… probably on par with a steel mill running for a long time, a comparative drop in the bucket compared to industrial processes. OpenAI is extremely inefficient, probably because they don’t have much pressure to optimize GPU usage.
Inference cost is more of a concern with crazy stuff like o3, but this could dramatically change if (hopefully when) bitnet models come to fruition.
Still, I 100% agree with this. Closed LLM weights should be public domain, as many good models already are.
With current kWh/token it’s 100x that of a regular Google search query. That’s where the environmental meme came from. Also, Nvidia plans to manufacture enough chips to require global electricity production to increase by 20-30%.
Doesn’t OpenAI just have the same efficiency issue as computing in general due to hardware from older nodes?
What are bitnet models and what does that change in a nutshell?
What are bitnet models and what does that change in a nutshell?
Read the pitch here: https://github.com/ridgerchu/matmulfreellm
Basically, using ternary weights (each weight is -1, 0, or 1), the multiplications in inference-time matrix multiplication can be replaced with much simpler additions and subtractions. This is theoretically more efficient on GPUs, and astronomically more efficient on dedicated hardware (as adders take up a fraction of the silicon area that multipliers do). This would be particularly fantastic for, say, local inference on smartphones or laptop ASICs.
The catch is no one has (publicly) risked a couple of million dollars to test it with a large model, as (so far) training it isn’t more efficient than “regular” LLMs.
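To make the ternary idea concrete, here is a toy sketch of a matrix-vector product where every weight is -1, 0, or 1. It is not the code from the linked repo, and real implementations involve more than this (packed weights, scaling, custom kernels); the function names are made up for illustration. It only shows why no multiplications are needed.

```python
def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries -1, 0, 1) by a vector
    using only additions and subtractions."""
    result = []
    for row in weights:
        acc = 0.0
        for w, value in zip(row, x):
            if w == 1:
                acc += value      # +1 weight: add the input
            elif w == -1:
                acc -= value      # -1 weight: subtract the input
            # a 0 weight contributes nothing, so it is skipped
        result.append(acc)
    return result


def dense_matvec(weights, x):
    """Ordinary matrix-vector product with multiplications, for comparison."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


W = [[1, 0, -1],
     [-1, -1, 1]]
x = [0.5, 2.0, -3.0]

print(ternary_matvec(W, x))
print(dense_matvec(W, x))
```

Both functions give the same result on this input; the first one just never multiplies, which is what makes adder-only hardware attractive.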
Doesn’t OpenAI just have the same efficiency issue as computing in general due to hardware from older nodes?
No one really knows, because they’re so closed and opaque!
But it appears that their models perform relatively poorly for their “size.” Qwen is nearly matching GPT-4 in some metrics, yet is probably an order of magnitude smaller, while Google/Claude and some Chinese models are also pulling ahead.
Only if they were trained on public material.
Yes!
Doesn’t seem like this helps out all the writers / artists that the LLM stole from.