• Lojcs@lemm.ee · 5 months ago

    What kind of a website is that? It's super slow and doesn't work without WebAssembly. Do you really need that for a simple interface?

    • Scott@sh.itjust.works · 5 months ago

      It’s not about their frontend; they’re running custom LPUs that can process LLM tokens at 500/sec, which is insanely impressive.

      For reference, with a max context of 2k tokens, my dual Xeon Silver 4114 processors take 2-3 minutes.
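      Rough math on that gap (a sketch using only the figures quoted in this thread: 500 tok/s on the LPUs, 2k tokens in 2-3 minutes on the Xeons):

```python
# Back-of-envelope throughput comparison using the numbers from the
# thread above; these are anecdotal figures, not benchmarks.
lpu_tps = 500.0                                # claimed LPU tokens/sec

cpu_tokens = 2000                              # 2k-token generation
cpu_seconds_low, cpu_seconds_high = 2 * 60, 3 * 60
cpu_tps_high = cpu_tokens / cpu_seconds_low    # ~16.7 tok/s (best case)
cpu_tps_low = cpu_tokens / cpu_seconds_high    # ~11.1 tok/s (worst case)

speedup_low = lpu_tps / cpu_tps_high           # ~30x
speedup_high = lpu_tps / cpu_tps_low           # ~45x
print(f"CPU: {cpu_tps_low:.1f}-{cpu_tps_high:.1f} tok/s, "
      f"LPU speedup: {speedup_low:.0f}-{speedup_high:.0f}x")
```

      So even the generous reading puts the LPUs roughly 30-45x faster than that dual-Xeon box.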

      • Lojcs@lemm.ee · 5 months ago

        No, I got what you meant, but that site is weird if it’s not doing anything on its own.

      • Finadil@lemmy.world · 5 months ago

        Is that with an fp16 model? Don’t be scared to try even a 4-bit quantization; you’d be surprised at how little is lost and how much quicker it is.
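        The memory savings are a big part of why quantization speeds up CPU inference. A sketch of the approximate weight footprint (pure parameter-count arithmetic; real formats like GGUF add some per-block overhead on top of this):

```python
# Approximate weight memory for an LLM at different precisions,
# assuming footprint ~= parameter count x bytes per parameter.
# Quantized file formats add a little per-block metadata on top.
def weight_gb(n_params_billions: float, bits_per_param: int) -> float:
    return n_params_billions * 1e9 * bits_per_param / 8 / 1e9

for bits in (16, 8, 4):
    print(f"7B model @ {bits}-bit: ~{weight_gb(7, bits):.1f} GB")
# 16-bit: ~14.0 GB, 8-bit: ~7.0 GB, 4-bit: ~3.5 GB
```

        A 4x smaller working set means far less memory bandwidth per token, which is usually the bottleneck on CPUs.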

      • Amaltheamannen@lemmy.ml · 5 months ago

        Isn’t it those that cost $2,000 per 250 MB of memory? Meaning you’d need about 350 to load any half-decent model.
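        The "about 350" figure is roughly consistent as arithmetic (a sketch using the per-card number quoted in this thread, not confirmed hardware specs):

```python
import math

CARD_MB = 250  # per-card memory, as claimed in the comment above

def cards_needed(model_gb: float) -> int:
    """Ceiling of model weight size over per-card capacity (1 GB = 1000 MB)."""
    return math.ceil(model_gb * 1000 / CARD_MB)

# A 70B-parameter model at fp16 is roughly 140 GB of weights.
print(cards_needed(140))          # -> 560 cards
print(350 * CARD_MB / 1000)       # -> 87.5 GB total across 350 cards
```

        350 cards gives about 87.5 GB, enough for a mid-sized model at fp16 or a large one quantized.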

        • Scott@sh.itjust.works · 5 months ago

          Not sure how they’re doing it, but it was actually $20k, not $2k, for 250 MB of memory on the card. I suspect the models are probably cached in system memory.