Beta testing Stad.social

@vidarh@stad.social

  • 16 Posts
  • 54 Comments
Joined 9 months ago
Cake day: October 1st, 2023

  • Funny thing is it’s not a proper lake, and not very old. It’s an artificial basin, originally built to regulate the water level of a canal dug all the way from the Thames a few miles away, for transport. But it was finished not long before the railway arrived, the canal company went bankrupt, and the canal route itself was sold off to a railway company; it’s now the path of one of the main London rail lines. As a result there are roads near me, nowhere near any water, named things like Towpath Way and Canal Walk.

  • My own. My Emacs config grew over the years to several thousand lines, and it got to the point where I decided I could write an editor in fewer lines than it took to configure Emacs how I liked it. It’s … not for everyone. I’m happy with it, because it does exactly the things I want it to and nothing else, but it also means getting used to quirks you can’t be bothered to fix, and not getting to blame anyone else when you run into a bug.

    That said, writing your own editor is easier than people think, as long as you lean on libraries for whatever you don’t have a pressing need to customize (e.g. mine is written in Ruby and uses Rouge for syntax highlighting; I believe Rouge is more lines of code than the editor itself, thanks to all the lexers).
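    For a sense of how little glue that takes, here is a rough sketch of the Rouge side only (purely illustrative; the file name and theme are arbitrary choices, and a real editor would highlight per line rather than dump the whole buffer):

      require 'rouge'

      source    = File.read('example.rb')                    # arbitrary file, just for the example
      lexer     = Rouge::Lexers::Ruby.new                    # pick a lexer for the language
      theme     = Rouge::Themes::ThankfulEyes.new            # any bundled theme
      formatter = Rouge::Formatters::Terminal256.new(theme)  # emits ANSI escape sequences
      puts formatter.format(lexer.lex(source))               # tokenize the buffer, print it colorized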

  • The thing is, realistically it won’t make a difference at all, because there are vast amounts of public-domain data that remain untapped. The main “problematic” need for OpenAI is new content that reflects up-to-date language and up-to-date facts, and my point about the share price of Thomson Reuters was to illustrate that OpenAI is already getting large enough to afford to outright buy some of the largest channels of up-to-the-minute content in the world.

    As for authors, it might wipe a few works by a few famous authors from the dataset, but those contribute very little to the quality of an LLM, because the model can’t tell famous works from obscure ones during training unless you intentionally reinforce specific works. There are several million books published every year. Most of them make <$100 in royalties for their authors (the average book sells ~200 copies). Want to bet how cheap it’d be to buy a fully licensed set of a few million books? You don’t need bestsellers; you need lots of books that are merely good enough to drag the overall quality of the dataset up.
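    As a purely illustrative back-of-the-envelope (the per-book figure is an assumption on my part, not derived from the royalty numbers above):

      books    = 2_000_000   # "a few million" licensed titles
      per_book = 500         # assumed flat licence fee per title, several times typical annual royalties
      total    = books * per_book
      puts "~$#{total / 1_000_000_000.0} billion"   # => ~$1.0 billion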

    The irony is that the biggest beneficiaries of content sources taking a strict view of LLMs will be OpenAI, Google, Meta, and the few others large enough to simply buy datasets, or buy the companies that own them, because it creates a moat against everyone who can’t afford licensed datasets.

    The biggest problem won’t be for OpenAI, but for people trying to build open models on the cheap.