Make illegally trained LLMs public domain as punishment

🃏Joker@sh.itjust.works · 17 hours ago

Make illegally trained LLMs public domain as punishment

nutsack@lemmy.world · edit-2 2 hours ago

intellectual property doesn’t really exist in most of the world. they don’t give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore…

it’s arbitrary law that is designed to protect corporations and it’s generally unenforceable.

Echo Dot@feddit.uk · 2 hours ago

But they’re not developing AI in those countries; they’re developing it mostly in the US. In the US, copyright law is enforced.

Magnetic_dud@discuss.tchncs.de · 3 hours ago

I used Whisper to create subs for a video, and in a section with relaxing instrumental music it filled the subtitles, on repeat, with:

La scuola del Dr. Paret è una tecnologia di ipnosi non verbale che si utilizza per risultati di un’ipnosi non verbale
(English: “Dr. Paret’s school is a non-verbal hypnosis technology that is used for results of a non-verbal hypnosis”)

Clearly stolen from this Dr. Paret YouTube channel, where he sells hypnosis lessons in Italian. Probably one or more of his videos had subs stating this over the same relaxing instrumental music that I used, and the model assumed the sound corresponded to that text.

ZeroOne@lemmy.world · 2 hours ago

Nice one

ClamDrinker@lemmy.world · 8 hours ago

Although I’m a firm believer that most AI models should be public domain or open source by default, the premise of “illegally trained LLMs” is flawed, because there really is no assurance that the LLMs currently in use were illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world.

The idea of… well, ideas, being copyrightable, should shake the boots of anyone in this discussion. Especially since, when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.

The underlying technology has more than enough good uses that banning it would simply cause it to flourish wherever it isn’t banned, which means, as usual, that everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose to these companies as it currently stands. Only by requiring the models to be made available to the public do we ensure that the playing field doesn’t tip further in their favor, to the point where AI technology only exists to benefit them.

If the model is built on the corpus of humanity, then humanity should benefit.

Echo Dot@feddit.uk · edit-2 2 hours ago

Banning AI is out of the question. Even the EU accepts that and they tend to be pretty ban heavy, unlike the US.

But it’s important that we have these discussions about how copyright applies to AI so that we can actually get an answer and move on. Right now it’s a legal quagmire that no one really wants to get involved in except the big companies. If a small group of university students wants to build an AI right now, they can’t, because acquiring training data is a legal nightmare, a Twilight Zone of law.

Dkarma@lemmy.world · 8 hours ago

Another clown dick article by someone who knows fuck all about ai

interdimensionalmeme@lemmy.ml · 11 hours ago

It’s not punishment. LLMs do not belong to them, they belong to all of humanity. Tear down the enclosing fences.

This is our common heritage, not OpenAI’s private property

Echo Dot@feddit.uk · 2 hours ago

It doesn’t matter anyway, we still need the big companies to bankroll AI. So it effectively does belong to them whatever we do.

Hopefully at some point people can get the processor requirements to something sane and AI development opens up to us all.

TootSweet@lemmy.world · edit-2 7 hours ago

To speak of AI models being “made public domain” is to presuppose that the AI models in question are covered by some branch of intellectual property. Has it been established whether AI models (even those trained on properly licensed content) even are covered by some branch of intellectual property in any particular jurisdiction(s)? Or maybe by “public domain” the author means that they should be required to publish the weights and also that they shouldn’t get any trade secret protections related to those weights?

Arthur Besse@lemmy.ml · edit-2 11 hours ago

“Given they were trained on our data, it makes sense that it should be public commons – that way we all benefit from the processing of our data”

I wonder how many people besides the author of this article are upset solely about the profit-from-copyright-infringement aspect of automated plagiarism and bullshit generation, and thus would be satisfied by the models being made more widely available.

The inherent plagiarism aspect of LLMs seems far more offensive to me than the copyright infringement, but both of those problems pale in comparison to the effects on humanity of masses of people relying on bullshit generators with outputs that are convincingly-plausible-yet-totally-wrong (and/or subtly wrong) far more often than anyone notices.

I liked the author’s earlier very-unlikely-to-be-met-demand activism last year better:

I just sent @OpenAI a cease and desist demanding they delete their GPT 3.5 and GPT 4 models in their entirety and remove all of my personal data from their training data sets before re-training in order to prevent #ChatGPT telling people I am dead.

…which at least yielded the amusingly misleading headline OpenAI ordered to delete ChatGPT over false death claims (it’s technically true - a court didn’t order it, but a guy who goes by the name “That One Privacy Guy” while blogging on linkedin did).

madthumbs@lemmy.world · 7 hours ago

They’re spitting out propaganda and misinformation mostly from what I can see. If anything, it should get a refund.

-Outside of coding / debugging tasks (and that’s hit or miss)

circuitfarmer@lemmy.sdf.org · 14 hours ago

A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.

I’m not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.

Dragon Rider (drag)@lemmy.nz · 12 hours ago

Yes, mining companies should all be nationalised for digging up the country’s ground and putting carbon in the country’s air.

SaharaMaleikuhm@feddit.org · 11 hours ago

You must be fun at parties.

pyre@lemmy.world · 8 hours ago

this comment doesn’t make any sense

circuitfarmer@lemmy.sdf.org · 10 hours ago

You must be new here.

Queen HawlSera@lemm.ee · 7 hours ago

Correct

fmstrat@lemmy.nowsci.com · 14 hours ago

So banks will be public domain when they’re bailed out with taxpayer funds, too, right?

ArchRecord@lemm.ee · 14 hours ago

They should be, but currently it depends on the type of bailout, I suppose.

For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank’s assets, and now effectively owns the bank.

booly@sh.itjust.works · 13 hours ago

At the same time, if a bank goes under, that means they owe more than they own, so “ownership” of that entity is basically worthless. In those cases, a bailout of the customers does nothing for the owners, because the owners still get wiped out.

The GM bailout in 2009 also involved wiping out all the shareholders, the government taking ownership of the new company, and the government spinning off the newly issued stock.

The AIG bailout required the company to basically issue new stock to dilute owners down to 20% of the company, while the government owned the other 80%, and the government made a big profit when they exited that transaction and sold the stock off to the public.

So it’s not super unusual. Government can take ownership of companies as a condition of a bailout. What we generally don’t necessarily want is the government owning a company long term, because there’s some conflict of interest between its role as regulator and its interest as a shareholder.

RubberDuck@lemmy.world · edit-2 13 hours ago

With banks this is also true if they do not have enough liquid assets to meet the legal requirements. So the bank might not be able to count all bank accounts as assets, but the FDIC can. Also, the FDIC can then restructure the bank and force creditors to take a haircut.

This is why investment banks should be separate from banks that have consumer accounts that are insured by the government.
Then you can just let the investment bank fail. This was the whole premise of Glass-Steagall, which was repealed under Clinton…

xthexder@l.sw0.com · edit-2 13 hours ago

Public domain wouldn’t be the right term for banks being publicly owned. At least for the normal usage of Public Domain in copyright. You can copy text and data, you can’t copy a company with unique customers and physical property.

fmstrat@lemmy.nowsci.com · 12 hours ago

Oh good point. I’m not actually sure what the phrase would be… Publicly owned?

leisesprecher@feddit.org · 13 hours ago

I mean, that sometimes did happen.

Germany propped up the Commerzbank after 2007 by essentially buying a large part of it, and managed to sell several tranches with a healthy profit.

Same is true for Lufthansa during COVID.

interdimensionalmeme@lemmy.ml · 11 hours ago

Banks are redundant, so is the stock market. These institutions do not need to, and should not be private. They are level playing fields in the economy, not participants trying to tilt the board for taking over the game.

LovableSidekick@lemmy.world · edit-2 13 hours ago

No, “the banks” wouldn’t be what the AI would be trained on, it would be the private info of individuals the banks do business with.

hark@lemmy.world · 15 hours ago

Imaginary property has always been a tricky concept, but the law always ends up just protecting the large corporations at the expense of the people who actually create things. I assume the end result here will be large corporations getting royalties from AI model usage or measures put in place to prevent generating content infringing on their imaginary properties and everyone else can get fucked.

merc@sh.itjust.works · 14 hours ago

It’s like what happened with Spotify. The artists and the labels were unhappy with the copyright infringement of music happening with Napster, Limewire, Kazaa, etc. They wanted the music model to be the same “buy an album from a record store” model that they knew and had worked for decades. But, users liked digital music and not having to buy a whole album for just one song, etc.

Spotify’s solution was easy: cut the record labels in. Let them invest and then any profits Spotify generated were shared with them. This made the record labels happy because they got money from their investment, even though their “buy an album” business model was now gone. It was ok for big artists because they had the power to negotiate with the labels and get something out of the deal. But, it absolutely screwed the small artists because now Spotify gives them essentially nothing.

I just hope that the law that nothing created by an LLM is copyrightable proves to be enough of a speed bump to slow things down.

Taleya@aussie.zone · 14 hours ago

Bandcamp still runs on this model though, and quite well

xthexder@l.sw0.com · 13 hours ago

It’s also one of the few places that have lossless audio files available for download. I’m a big fan of Bandcamp. I like having all my music local.

Taleya@aussie.zone · 13 hours ago

Same. I refuse to use spotify, i’ve got 400gb of mp3s and winamp

m-p{3}@lemmy.ca · 16 hours ago

It could also contain non-public domain data, and you can’t declare someone else’s intellectual property as public domain just like that, otherwise a malicious actor could just train a model with a bunch of misappropriated data, get caught (intentionally or not) and then force all that data into public domain.

Laws are never simple.

drkt@scribe.disroot.org · 16 hours ago

Forcing a bunch of neural weights into the public domain doesn’t make the data they were trained on also public domain, in fact it doesn’t even reveal what they were trained on.

deegeese@sopuli.xyz · 16 hours ago

LOL no. The weights encode the training data and it’s trivially easy to make AI generators spit out bits of their training data.

drkt@scribe.disroot.org · 16 hours ago

paper?

SatansMaggotyCumFart@lemmy.world · 16 hours ago

No, training data.

FaceDeer@fedia.io · 14 hours ago

No, he’s challenging the assertion that it’s “trivially easy” to make AIs output their training data.

Older AIs have occasionally regurgitated bits of training data as a result of overfitting, which is a flaw in training that modern AI training techniques have made great strides in eliminating. It’s no longer a particularly common problem, and even if it were it only applies to those specific bits of training data that were overfit on, not on all of the training data in general.

31337@sh.itjust.works · 4 hours ago

Last time I looked it up and calculated it, these large models were trained on only something like 7x as many tokens as they have parameters. If you think of it like compression, a 1:7 ratio for lossless text compression is perfectly possible.

I think the models can still output a lot of stuff verbatim if you try to get them to, you just hit the guardrails they put in place. Seems to work fine for public domain stuff. E.g. “Give me the first 50 lines from Romeo and Juliette.” (albeit with a TOS warning, lol). “Give me the first few paragraphs of Dune.” seems to hit a guardrail, or maybe just forced through reinforcement learning.

A preprint paper was released recently that detailed how to get around RL by controlling the first few tokens of a model’s output, showing the “unsafe” data is still in there.

FaceDeer@fedia.io · 4 hours ago

I’ve been working with local LLMs for over a year now. No guardrails, and many of them fine-tuned against censorship. They can’t output arbitrary training material verbatim.

Llama 3 was trained on 15 trillion tokens, for both the 8B and 70B parameter versions. So around 1:1000, not 1:7.
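For anyone who wants to check the ratio themselves, here is a minimal sketch of the arithmetic. It uses only the figures quoted in this comment (15 trillion tokens, 8B and 70B parameters); the function and variable names are made up for illustration.

```python
# Tokens-per-parameter ratio, using only the figures quoted in this thread.
TRAINING_TOKENS = 15e12  # "15 trillion tokens", as stated in the comment


def tokens_per_param(tokens: float, params: float) -> float:
    """Return how many training tokens were seen per model parameter."""
    return tokens / params


# Parameter counts as quoted: 8B and 70B.
ratios = {
    "8B": tokens_per_param(TRAINING_TOKENS, 8e9),
    "70B": tokens_per_param(TRAINING_TOKENS, 70e9),
}

for name, ratio in ratios.items():
    print(f"Llama 3 {name}: about 1:{ratio:.0f} (parameters : tokens)")
```

Under those figures it works out to roughly 1:1875 for the 8B model and roughly 1:214 for the 70B model, so "around 1:1000" is an order-of-magnitude figure. Either way it is nowhere near 1:7.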

SatansMaggotyCumFart@lemmy.world · 14 hours ago

I thought he meant LLMs shot out bits of paper like some ticker-tape parade.

stephen01king@lemmy.zip · 15 hours ago

How easy are we talking about here? Also, making the model public domain doesn’t mean making the output public domain. The output of an LLM should still abide by copyright laws, as they should be.

grue@lemmy.world · 16 hours ago

So what you’re saying is that there’s no way to make it legal and it simply needs to be deleted entirely.

I agree.

FaceDeer@fedia.io · 14 hours ago

There’s no need to “make it legal”, things are legal by default until a law is passed to make them illegal. Or a court precedent is set that establishes that an existing law applies to the new thing under discussion.

Training an AI doesn’t involve copying the training data, the AI model doesn’t literally “contain” the stuff it’s trained on. So it’s not likely that existing copyright law makes it illegal to do without permission.

grue@lemmy.world · 2 hours ago

There’s no need to “make it legal”, things are legal by default until a law is passed to make them illegal.

Yes, and that’s already happened: it’s called “copyright law.” You can’t mix things with incompatible licenses into a derivative work and pretend it’s okay.

xigoi@lemmy.sdf.org · 11 hours ago

By this logic, you can copy a copyrighted image as long as you decrease the resolution, because the new image does not contain all the information in the original one.

yetAnotherUser@discuss.tchncs.de · 7 hours ago

Am I allowed to take a copyrighted image, decrease its size to 1x1 pixels and publish it? What about 2x2?

It’s very much not clear when a modification violates copyright because copyright is extremely vague to begin with.

grue@lemmy.world · 2 hours ago

Just because something is defined legally instead of technologically, that doesn’t make it vague. The modification violates copyright when the result is a derivative work; no more, no less.

Voyajer@lemmy.world · 9 hours ago

More like reduce it to a handful of vectors that get merged with other vectors.

FaceDeer@fedia.io · 8 hours ago

In the case of Stable Diffusion, they used 5 billion images to train a model 1.83 gigabytes in size. So if you reduce a copyrighted image to 3 bits (not bytes - bits), then yeah, I think you’re probably pretty safe.
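The "3 bits" figure can be reproduced with a quick back-of-the-envelope calculation. This is a rough sketch using only the numbers in this comment, and it assumes 1 gigabyte means 10^9 bytes; the names are made up for illustration.

```python
# Average model capacity per training image, from the figures quoted above.
MODEL_SIZE_BYTES = 1.83e9  # "1.83 gigabytes", assuming 1 GB = 10**9 bytes
TRAINING_IMAGES = 5e9      # "5 billion images"


def bits_per_item(model_bytes: float, n_items: float) -> float:
    """Average number of model bits available per training item."""
    return model_bytes * 8 / n_items


bits = bits_per_item(MODEL_SIZE_BYTES, TRAINING_IMAGES)
print(f"{bits:.2f} bits of model per training image")
```

That comes out just under 3 bits per image on average. It is only an average, so it says nothing about whether a few heavily repeated images could be memorized more completely.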

merc@sh.itjust.works · 14 hours ago

It wouldn’t contain any public-domain data though. That’s the thing with LLMs, once they’re trained on data the data is gone and just added to the series of weights in the model somewhere. If it ingested something private like your tax data, it couldn’t re-create your tax data on command, that data is now gone, but if it’s seen enough private tax data it could give something that looked a lot like a tax return to someone with an untrained eye. But, a tax accountant would easily see flaws in it.

pelespirit@sh.itjust.works · 16 hours ago

Right, like I did. They’re safeguarding Disney and other places like that now. It’s just the little guys who get screwed.

https://imgur.com/a/these-are-new-niki-mice-drawings-phone-company-chainsaws-merms-donut-logos-burger-mc-winfruit-computers-republunch-political-party-logos-Rhgi0OC

chiliedogg@lemmy.world · 13 hours ago

Delete them. Wipe their databases. Make the companies start from scratch with new, ethically acquired training data.

ryannathans@aussie.zone · 11 hours ago

Mmm yes so all that electricity is pure waste

humorlessrepost@lemmy.world · 9 hours ago

Genuine question, does anyone know how much of the electricity is used for training the model vs using it to generate responses?

Rikudou_Sage@lemmings.world · 9 hours ago

Not specifically, but training is pretty fucking expensive to do, while generating is kinda easy. The OpenAI models are massive, training them cost a lot. Though they also have a lot of traffic. But unless they stop training new models, I don’t think generating answers will ever catch up to training.

Hackworth@lemmy.world · 8 hours ago

For perspective, all of the data centers in the US combined use 4% of total electric load.

brucethemoose@lemmy.world · edit-2 15 hours ago

The environmental cost of training is a bit of a meme. The details are spread around, but basically, Alibaba trained a GPT-4 level-ish model on a relatively small number of GPUs… probably on par with a steel mill running for a long time, a comparative drop in the bucket compared to industrial processes. OpenAI is extremely inefficient, probably because they don’t have much pressure to optimize GPU usage.

Inference cost is more of a concern with crazy stuff like o3, but this could dramatically change if (hopefully when) bitnet models come to fruition.

Still, I 100% agree with this. Closed LLM weights should be public domain, as many good models already are.

brie@programming.dev · 11 hours ago

With current kWh/token, it’s about 100x the energy of a regular Google search query. That’s where the environmental meme came from. Also, Nvidia plans to manufacture enough chips to require global electricity production to increase by 20-30%.

j4k3@lemmy.world · 15 hours ago

Doesn’t OpenAI just have the same efficiency issue as computing in general due to hardware from older nodes?

What are bitnet models and what does that change in a nutshell?

brucethemoose@lemmy.world · edit-2 14 hours ago

What are bitnet models and what does that change in a nutshell?

Read the pitch here: https://github.com/ridgerchu/matmulfreellm

Basically, with ternary weights, the multiplications in inference-time matrix multiplication can be replaced with much simpler additions and subtractions. This is theoretically more efficient on GPUs, and astronomically more efficient on dedicated hardware (as adders take up a fraction of the silicon area that multipliers do). This would be particularly fantastic for, say, local inference on smartphones or laptop ASICs.

The catch is no one has (publicly) risked a couple of million dollars to test it with a large model, as (so far) training it isn’t more efficient than “regular” LLMs.
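To make the ternary-weights idea concrete, here is a rough sketch of a matrix-vector product where every weight is -1, 0, or +1, so each "multiply" collapses into an add, a subtract, or a skip. This is a toy illustration of the general idea in plain Python, not the linked repository's implementation; the function names and example values are made up.

```python
from typing import List


def ternary_matvec(weights: List[List[int]], x: List[float]) -> List[float]:
    """Compute W @ x for weights in {-1, 0, +1} using only add and subtract."""
    out = []
    for row in weights:
        if len(row) != len(x):
            raise ValueError("row length must match input length")
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # +1 weight: add the input
            elif w == -1:
                acc -= xi      # -1 weight: subtract the input
            elif w != 0:
                raise ValueError("weights must be -1, 0, or +1")
            # 0 weight: skip entirely, no arithmetic needed
        out.append(acc)
    return out


def reference_matvec(weights: List[List[int]], x: List[float]) -> List[float]:
    """Ordinary multiply-accumulate version, for comparison only."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Hypothetical example values.
W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -3.0]

result = ternary_matvec(W, x)
print(result)
```

Both functions give the same answer here, but the ternary one never multiplies. On real hardware the win would come from doing this with packed weights and dedicated adders, which this sketch does not attempt.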

Doesn’t OpenAI just have the same efficiency issue as computing in general due to hardware from older nodes?

No one really knows, because they’re so closed and opaque!

But it appears that their models perform relatively poorly for their “size.” Qwen is nearly matching GPT-4 in some metrics, yet is probably an order of magnitude smaller, while Google/Claude and some Chinese models are also pulling ahead.