It’s all made from our data, anyway, so it should be ours to use as we want

  • nutsack@lemmy.world
    link
    fedilink
    English
    arrow-up
    11
    ·
    edit-2
    2 hours ago

    intellectual property doesn’t really exist in most of the world. they don’t give a shit about it in india, bangladesh, vietnam, china, the philippines, malaysia, singapore…

    it’s arbitrary law that is designed to protect corporations and it’s generally unenforceable.

    • Echo Dot@feddit.uk
      link
      fedilink
      English
      arrow-up
      7
      ·
      2 hours ago

      But they’re not developing AI in those countries; they’re developing it mostly in the US. In the US, copyright law is enforced.

  • Magnetic_dud@discuss.tchncs.de
    link
    fedilink
    English
    arrow-up
    7
    ·
    3 hours ago

    I used Whisper to create subtitles for a video, and in a section with relaxing instrumental music it filled the subtitles, on repeat, with:

    La scuola del Dr. Paret è una tecnologia di ipnosi non verbale che si utilizza per risultati di un’ipnosi non verbale

    (Roughly: “Dr. Paret’s school is a non-verbal hypnosis technology that is used for the results of a non-verbal hypnosis.”)

    This was clearly lifted from Dr. Paret’s YouTube channel, where he sells hypnosis lessons in Italian. Probably one or more of his videos had subtitles stating this over the same relaxing instrumental music that I used, and the model assumed the sound corresponded to that text.

  • ClamDrinker@lemmy.world
    link
    fedilink
    English
    arrow-up
    29
    arrow-down
    2
    ·
    8 hours ago

    Although I’m a firm believer that most AI models should be public domain or open source by default, the premise of “illegally trained LLMs” is flawed, because there really is no assurance that the LLMs currently in use were illegally trained to begin with. These things are still being argued in court, but the AI companies have a pretty good defense in the fact that analyzing publicly viewable information is a pretty deep-rooted freedom that provides a lot of positives to the world.

    The idea of… well, ideas, being copyrightable, should shake the boots of anyone in this discussion. Especially since, when the laws on the books around these kinds of things become an active topic of change, they rarely shift in the direction of more freedom for the exact people we want to give it to. See: Copyright and Disney.

    The underlying technology has more than enough good uses that banning it would simply cause it to flourish wherever it is not banned, which means, as usual, that everyone but the multinational companies loses out. The same would happen with stricter copyright, as only the big companies have the means to build their own models with their own data. The general public is set up for a lose-lose against these companies as it currently stands. By requiring the models to be made available to the public, we ensure that the playing field doesn’t tip further in their favor, to the point where AI technology only exists to benefit them.

    If the model is built on the corpus of humanity, then humanity should benefit.

    • Echo Dot@feddit.uk
      link
      fedilink
      English
      arrow-up
      2
      ·
      edit-2
      2 hours ago

      Banning AI is out of the question. Even the EU accepts that and they tend to be pretty ban heavy, unlike the US.

      But it’s important that we have these discussions about how copyright applies to AI so that we can actually get an answer and move on. Right now it’s a legal quagmire that no one really wants to get involved in except the big companies. If a small group of university students want to build an AI right now, they can’t, because acquiring training data is a legal nightmare, the Twilight Zone of law.

  • interdimensionalmeme@lemmy.ml
    link
    fedilink
    English
    arrow-up
    51
    arrow-down
    3
    ·
    11 hours ago

    It’s not punishment. LLMs do not belong to them; they belong to all of humanity. Tear down the enclosing fences.

    This is our common heritage, not OpenAI’s private property

    • Echo Dot@feddit.uk
      link
      fedilink
      English
      arrow-up
      1
      ·
      2 hours ago

      It doesn’t matter anyway, we still need the big companies to bankroll AI. So it effectively does belong to them whatever we do.

      Hopefully at some point people can get the processor requirements to something sane and AI development opens up to us all.

  • TootSweet@lemmy.world
    link
    fedilink
    English
    arrow-up
    3
    ·
    edit-2
    7 hours ago

    To speak of AI models being “made public domain” is to presuppose that the AI models in question are covered by some branch of intellectual property. Has it been established whether AI models (even those trained on properly licensed content) even are covered by some branch of intellectual property in any particular jurisdiction(s)? Or maybe by “public domain” the author means that they should be required to publish the weights and also that they shouldn’t get any trade secret protections related to those weights?

  • Arthur Besse@lemmy.ml
    link
    fedilink
    English
    arrow-up
    24
    arrow-down
    3
    ·
    edit-2
    11 hours ago

    “Given they were trained on our data, it makes sense that it should be public commons – that way we all benefit from the processing of our data”

    I wonder how many people besides the author of this article are upset solely about the profit-from-copyright-infringement aspect of automated plagiarism and bullshit generation, and thus would be satisfied by the models being made more widely available.

    The inherent plagiarism aspect of LLMs seems far more offensive to me than the copyright infringement, but both of those problems pale in comparison to the effects on humanity of masses of people relying on bullshit generators with outputs that are convincingly-plausible-yet-totally-wrong (and/or subtly wrong) far more often than anyone notices.

    I liked the author’s earlier very-unlikely-to-be-met-demand activism last year better:

    I just sent @OpenAI a cease and desist demanding they delete their GPT 3.5 and GPT 4 models in their entirety and remove all of my personal data from their training data sets before re-training in order to prevent #ChatGPT telling people I am dead.

    …which at least yielded the amusingly misleading headline OpenAI ordered to delete ChatGPT over false death claims (it’s technically true - a court didn’t order it, but a guy who goes by the name “That One Privacy Guy” while blogging on LinkedIn did).

    • madthumbs@lemmy.world
      link
      fedilink
      English
      arrow-up
      1
      arrow-down
      1
      ·
      7 hours ago

      They’re spitting out propaganda and misinformation mostly from what I can see. If anything, it should get a refund.

      -Outside of coding / debugging tasks (and that’s hit or miss)

  • circuitfarmer@lemmy.sdf.org
    link
    fedilink
    English
    arrow-up
    56
    arrow-down
    1
    ·
    14 hours ago

    A similar argument can be made about nationalizing corporations which break various laws, betray public trust, etc etc.

    I’m not commenting on the virtues of such an approach, but I think it is fair to say that it is unrealistic, especially for countries like the US which fetishize profit at any cost.

    • ArchRecord@lemm.ee
      link
      fedilink
      English
      arrow-up
      52
      ·
      14 hours ago

      They should be, but currently it depends on the type of bailout, I suppose.

      For instance, if a bank completely fails and goes under, the FDIC usually is named Receiver of the bank’s assets, and now effectively owns the bank.

      • booly@sh.itjust.works
        link
        fedilink
        English
        arrow-up
        7
        ·
        13 hours ago

        At the same time, if a bank goes under, that means they owe more than they own, so “ownership” of that entity is basically worthless. In those cases, a bailout of the customers does nothing for the owners, because the owners still get wiped out.

        The GM bailout in 2009 also involved wiping out all the shareholders, the government taking ownership of the new company, and the government spinning off the newly issued stock.

        The AIG bailout required the company to issue new stock, diluting the existing owners down to 20% of the company while the government owned the other 80%, and the government made a big profit when it exited that transaction and sold the stock off to the public.

        So it’s not super unusual. Government can take ownership of companies as a condition of a bailout. What we generally don’t want is the government owning a company long term, because there’s some conflict of interest between its role as regulator and its interest as a shareholder.

        • RubberDuck@lemmy.world
          link
          fedilink
          English
          arrow-up
          4
          ·
          edit-2
          13 hours ago

          With banks this is also true if they do not have enough liquid assets to meet the legal requirements. So the bank might not be able to count all bank accounts as assets, but the FDIC can. The FDIC can also then restructure the bank and force creditors to take a haircut.

          This is why investment banks should be separate from banks that have consumer accounts that are insured by the government.
          Then you can just let the investment bank fail. This was the whole premise of Glass-Steagall, which was repealed under Clinton…

    • xthexder@l.sw0.com
      link
      fedilink
      English
      arrow-up
      10
      ·
      edit-2
      13 hours ago

      Public domain wouldn’t be the right term for banks being publicly owned, at least not in the normal copyright sense of “public domain”. You can copy text and data; you can’t copy a company with unique customers and physical property.

    • leisesprecher@feddit.org
      link
      fedilink
      English
      arrow-up
      2
      ·
      13 hours ago

      I mean, that sometimes did happen.

      Germany propped up the Commerzbank after 2007 by essentially buying a large part of it, and managed to sell several tranches with a healthy profit.

      Same is true for Lufthansa during COVID.

    • interdimensionalmeme@lemmy.ml
      link
      fedilink
      English
      arrow-up
      2
      arrow-down
      1
      ·
      11 hours ago

      Banks are redundant, and so is the stock market. These institutions do not need to be, and should not be, private. They are level playing fields in the economy, not participants trying to tilt the board to take over the game.

    • LovableSidekick@lemmy.world
      link
      fedilink
      English
      arrow-up
      1
      arrow-down
      2
      ·
      edit-2
      13 hours ago

      No, “the banks” wouldn’t be what the AI would be trained on, it would be the private info of individuals the banks do business with.

  • hark@lemmy.world
    link
    fedilink
    English
    arrow-up
    25
    arrow-down
    2
    ·
    15 hours ago

    Imaginary property has always been a tricky concept, but the law always ends up just protecting the large corporations at the expense of the people who actually create things. I assume the end result here will be large corporations getting royalties from AI model usage or measures put in place to prevent generating content infringing on their imaginary properties and everyone else can get fucked.

    • merc@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      13
      ·
      14 hours ago

      It’s like what happened with Spotify. The artists and the labels were unhappy with the copyright infringement of music happening with Napster, Limewire, Kazaa, etc. They wanted the music model to be the same “buy an album from a record store” model that they knew and had worked for decades. But, users liked digital music and not having to buy a whole album for just one song, etc.

      Spotify’s solution was easy: cut the record labels in. Let them invest and then any profits Spotify generated were shared with them. This made the record labels happy because they got money from their investment, even though their “buy an album” business model was now gone. It was ok for big artists because they had the power to negotiate with the labels and get something out of the deal. But, it absolutely screwed the small artists because now Spotify gives them essentially nothing.

      I just hope that the law that nothing created by an LLM is copyrightable proves to be enough of a speed bump to slow things down.

        • xthexder@l.sw0.com
          link
          fedilink
          English
          arrow-up
          8
          ·
          13 hours ago

          It’s also one of the few places that have lossless audio files available for download. I’m a big fan of Bandcamp. I like having all my music local.

  • m-p{3}@lemmy.ca
    link
    fedilink
    English
    arrow-up
    41
    arrow-down
    4
    ·
    16 hours ago

    It could also contain non-public domain data, and you can’t declare someone else’s intellectual property as public domain just like that, otherwise a malicious actor could just train a model with a bunch of misappropriated data, get caught (intentionally or not) and then force all that data into public domain.

    Laws are never simple.

    • drkt@scribe.disroot.org
      link
      fedilink
      English
      arrow-up
      19
      arrow-down
      2
      ·
      16 hours ago

      Forcing a bunch of neural weights into the public domain doesn’t make the data they were trained on also public domain, in fact it doesn’t even reveal what they were trained on.

      • deegeese@sopuli.xyz
        link
        fedilink
        English
        arrow-up
        10
        arrow-down
        17
        ·
        16 hours ago

        LOL no. The weights encode the training data and it’s trivially easy to make AI generators spit out bits of their training data.

            • FaceDeer@fedia.io
              link
              fedilink
              arrow-up
              8
              ·
              14 hours ago

              No, he’s challenging the assertion that it’s “trivially easy” to make AIs output their training data.

              Older AIs have occasionally regurgitated bits of training data as a result of overfitting, which is a flaw in training that modern AI training techniques have made great strides in eliminating. It’s no longer a particularly common problem, and even if it were, it would only apply to those specific bits of training data that were overfit on, not to all of the training data in general.

              • 31337@sh.itjust.works
                link
                fedilink
                English
                arrow-up
                1
                ·
                4 hours ago

                Last time I looked it up and calculated it, these large models were trained on something like only 7x as many tokens as they have parameters. If you think of it like compression, a 1:7 ratio for lossless text compression is perfectly possible.

                I think the models can still output a lot of stuff verbatim if you try to get them to; you just hit the guardrails they put in place. It seems to work fine for public domain stuff, e.g. “Give me the first 50 lines from Romeo and Juliet.” (albeit with a TOS warning, lol). “Give me the first few paragraphs of Dune.” seems to hit a guardrail, or maybe a refusal forced through reinforcement learning.

                A preprint paper was released recently that detailed how to get around RL by controlling the first few tokens of a model’s output, showing the “unsafe” data is still in there.

        • stephen01king@lemmy.zip
          link
          fedilink
          English
          arrow-up
          4
          ·
          15 hours ago

          How easy are we talking about here? Also, making the model public domain doesn’t mean making the output public domain. The output of an LLM should still abide by copyright laws, as it should.

    • grue@lemmy.world
      link
      fedilink
      English
      arrow-up
      23
      arrow-down
      10
      ·
      16 hours ago

      So what you’re saying is that there’s no way to make it legal and it simply needs to be deleted entirely.

      I agree.

      • FaceDeer@fedia.io
        link
        fedilink
        arrow-up
        7
        arrow-down
        3
        ·
        14 hours ago

        There’s no need to “make it legal”, things are legal by default until a law is passed to make them illegal. Or a court precedent is set that establishes that an existing law applies to the new thing under discussion.

        Training an AI doesn’t involve copying the training data, the AI model doesn’t literally “contain” the stuff it’s trained on. So it’s not likely that existing copyright law makes it illegal to do without permission.

        • grue@lemmy.world
          link
          fedilink
          English
          arrow-up
          1
          ·
          2 hours ago

          There’s no need to “make it legal”, things are legal by default until a law is passed to make them illegal.

          Yes, and that’s already happened: it’s called “copyright law.” You can’t mix things with incompatible licenses into a derivative work and pretend it’s okay.

        • xigoi@lemmy.sdf.org
          link
          fedilink
          English
          arrow-up
          3
          arrow-down
          2
          ·
          11 hours ago

          By this logic, you can copy a copyrighted image as long as you decrease the resolution, because the new image does not contain all the information in the original one.

          • yetAnotherUser@discuss.tchncs.de
            link
            fedilink
            English
            arrow-up
            2
            arrow-down
            1
            ·
            7 hours ago

            Am I allowed to take a copyrighted image, decrease its size to 1x1 pixels and publish it? What about 2x2?

            It’s very much not clear when a modification violates copyright because copyright is extremely vague to begin with.

            • grue@lemmy.world
              link
              fedilink
              English
              arrow-up
              1
              ·
              2 hours ago

              Just because something is defined legally instead of technologically, that doesn’t make it vague. The modification violates copyright when the result is a derivative work; no more, no less.

          • Voyajer@lemmy.world
            link
            fedilink
            English
            arrow-up
            3
            arrow-down
            1
            ·
            9 hours ago

            More like reduce it to a handful of vectors that get merged with other vectors.

          • FaceDeer@fedia.io
            link
            fedilink
            arrow-up
            1
            arrow-down
            1
            ·
            8 hours ago

            In the case of Stable Diffusion, they used 5 billion images to train a model 1.83 gigabytes in size. So if you reduce a copyrighted image to 3 bits (not bytes - bits), then yeah, I think you’re probably pretty safe.
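
For anyone who wants to check that arithmetic, here is a minimal sketch. It only uses the two round numbers quoted in the comment above (5 billion images, 1.83 gigabytes) and assumes decimal gigabytes; the function name is made up for illustration. It says nothing about how the model actually stores information, only how many bits of model exist per training image on average.

```python
# Back-of-the-envelope check of the "about 3 bits per image" figure.
# Both inputs are the round numbers quoted in the comment, not measured values.
MODEL_SIZE_GB = 1.83               # assumed to be decimal gigabytes (1 GB = 1e9 bytes)
TRAINING_IMAGES = 5_000_000_000    # "5 billion images"


def bits_per_image(model_size_gb: float, n_images: int) -> float:
    """Average number of model bits available per training image."""
    model_bits = model_size_gb * 1e9 * 8  # gigabytes -> bytes -> bits
    return model_bits / n_images


print(f"{bits_per_image(MODEL_SIZE_GB, TRAINING_IMAGES):.2f} bits per image")
```

If the 1.83 figure were binary gibibytes instead, the result would be slightly above 3 bits rather than slightly below, so "about 3 bits" holds either way.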

    • merc@sh.itjust.works
      link
      fedilink
      English
      arrow-up
      9
      arrow-down
      2
      ·
      14 hours ago

      It wouldn’t contain any public-domain data though. That’s the thing with LLMs, once they’re trained on data the data is gone and just added to the series of weights in the model somewhere. If it ingested something private like your tax data, it couldn’t re-create your tax data on command, that data is now gone, but if it’s seen enough private tax data it could give something that looked a lot like a tax return to someone with an untrained eye. But, a tax accountant would easily see flaws in it.

  • chiliedogg@lemmy.world
    link
    fedilink
    English
    arrow-up
    9
    arrow-down
    4
    ·
    13 hours ago

    Delete them. Wipe their databases. Make the companies start from scratch with new, ethically acquired training data.

      • humorlessrepost@lemmy.world
        link
        fedilink
        English
        arrow-up
        1
        ·
        9 hours ago

        Genuine question, does anyone know how much of the electricity is used for training the model vs using it to generate responses?

        • Rikudou_Sage@lemmings.world
          link
          fedilink
          English
          arrow-up
          6
          ·
          9 hours ago

          Not specifically, but training is pretty fucking expensive to do, while generating is kinda easy. The OpenAI models are massive, training them cost a lot. Though they also have a lot of traffic. But unless they stop training new models, I don’t think generating answers will ever catch up to training.

          • Hackworth@lemmy.world
            link
            fedilink
            English
            arrow-up
            1
            ·
            8 hours ago

            For perspective, all of the data centers in the US combined use 4% of total electric load.

  • brucethemoose@lemmy.world
    link
    fedilink
    English
    arrow-up
    7
    ·
    edit-2
    15 hours ago

    The environmental cost of training is a bit of a meme. The details are spread around, but basically, Alibaba trained a GPT-4 level-ish model on a relatively small number of GPUs… probably on par with a steel mill running for a long time, a drop in the bucket compared to industrial processes. OpenAI is extremely inefficient, probably because they don’t have much pressure to optimize GPU usage.

    Inference cost is more of a concern with crazy stuff like o3, but this could dramatically change if (hopefully when) bitnet models come to fruition.

    Still, I 100% agree with this. Closed LLM weights should be public domain, as many good models already are.

    • brie@programming.dev
      link
      fedilink
      English
      arrow-up
      2
      ·
      11 hours ago

      With current kWh/token, a query uses about 100x the energy of a regular Google search query. That’s where the environmental meme came from. Also, Nvidia plans to manufacture enough chips to require global electricity production to increase by 20-30%.

    • j4k3@lemmy.world
      link
      fedilink
      English
      arrow-up
      2
      ·
      15 hours ago

      Doesn’t OpenAI just have the same efficiency issue as computing in general due to hardware from older nodes?

      What are bitnet models and what does that change in a nutshell?

      • brucethemoose@lemmy.world
        link
        fedilink
        English
        arrow-up
        4
        ·
        edit-2
        14 hours ago

        What are bitnet models and what does that change in a nutshell?

        Read the pitch here: https://github.com/ridgerchu/matmulfreellm

        Basically, with ternary weights (each weight is -1, 0, or +1), all inference-time matrix multiplication can be replaced with much simpler additions and subtractions. This is theoretically more efficient on GPUs, and astronomically more efficient on dedicated hardware (as adders take up a fraction of the silicon area of multipliers). This would be particularly fantastic for, say, local inference on smartphones or laptop ASICs.

        The catch is no one has (publicly) risked a couple of million dollars to test it with a large model, as (so far) training it isn’t more efficient than “regular” LLMs.
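
To make the "no multiplication" point concrete, here is a toy sketch in plain Python. It is not code from the linked repo and not how a real bitnet kernel is written; the function names are made up for illustration. It only shows that when every weight is -1, 0, or +1, a matrix-vector product can be computed with additions and subtractions alone and gives the same answer as the ordinary multiply-based version.

```python
import random


def matvec_multiply(weights, x):
    """Reference matrix-vector product using ordinary multiplication."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def matvec_ternary(weights, x):
    """Same product for weights in {-1, 0, +1}, using only add and subtract."""
    out = []
    for row in weights:
        acc = 0
        for w, v in zip(row, x):
            if w == 1:
                acc += v      # +1 weight: add the input
            elif w == -1:
                acc -= v      # -1 weight: subtract the input
            # a 0 weight contributes nothing, so the input is skipped
        out.append(acc)
    return out


# Small random check with integer inputs so the comparison is exact.
random.seed(0)
W = [[random.choice((-1, 0, 1)) for _ in range(8)] for _ in range(4)]
x = [random.randint(-10, 10) for _ in range(8)]
assert matvec_ternary(W, x) == matvec_multiply(W, x)
```

In software this loop is slower than a tuned matrix multiply; the claimed win is in hardware, where the add/subtract/skip logic is far cheaper than a multiplier.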

        Doesn’t OpenAI just have the same efficiency issue as computing in general due to hardware from older nodes?

        No one really knows, because they’re so closed and opaque!

        But it appears that their models perform relatively poorly for their “size.” Qwen is nearly matching GPT-4 in some metrics, yet is probably an order of magnitude smaller, while Google/Claude and some Chinese models are also pulling ahead.