[IDEA] Scaling inference-time with complexity

My observation

Humans spend different amounts of time thinking about different things and concepts. Saying “and” takes less thought than saying “telephone”, because “telephone” is more context-sensitive.

Example

User: What color does an apple have?

LLM: Apples are red.

Here, generating “Apples” and “are” costs one forward pass each, so it takes essentially the same inference time as generating “red”, which should be the most difficult word to come up with. “Red” should require the most compute.

Or let’s look at it the other way around: the model thought just as hard about the far less important words “Apples” and “are” as it did about “red”.

My idea

We add roughly 1000 new tokens to the LLM’s vocabulary. These are not word tokens, but thought tokens or reasoning tokens. Then we train the AI as usual. Every time it generates one of these reasoning tokens, we don’t decode it as a word; we simply keep it in the context and let the model continue generating. This way, the AI would be able to “think” before saying a word. These thoughts are not human-interpretable, but they should be much more efficient than the pre-output reasoning tokens of o1, which fills its own context window with human language.
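To make the decoding side of this concrete, here is a minimal sketch. It is only an illustration under assumptions: the vocabulary size, the end-of-sequence id, and the `scripted_model` stand-in for a real LLM are all hypothetical, and the only number taken from the idea above is the roughly 1000 extra tokens.

```python
from typing import Callable, List, Tuple

VOCAB_SIZE = 50_000         # hypothetical size of the normal word vocabulary
NUM_THOUGHT_TOKENS = 1_000  # the ~1000 extra tokens proposed above
EOS_ID = 0                  # hypothetical end-of-sequence token id


def is_thought(token_id: int) -> bool:
    """Thought tokens occupy the id range right after the word vocabulary."""
    return VOCAB_SIZE <= token_id < VOCAB_SIZE + NUM_THOUGHT_TOKENS


def generate(next_token: Callable[[List[int]], int],
             prompt: List[int],
             max_steps: int = 64) -> Tuple[List[int], List[int]]:
    """Run the decoding loop; return (visible tokens, full context)."""
    context = list(prompt)
    visible: List[int] = []
    for _ in range(max_steps):
        tok = next_token(context)
        context.append(tok)      # thought tokens stay in the context...
        if tok == EOS_ID:
            break
        if not is_thought(tok):  # ...but are never shown to the user
            visible.append(tok)
    return visible, context


def scripted_model(script: List[int]) -> Callable[[List[int]], int]:
    """Hypothetical stand-in for an LLM: replays a fixed token script."""
    it = iter(script)
    return lambda context: next(it, EOS_ID)


# Two word tokens, three thought tokens, then one more word token.
model = scripted_model([11, 12, 50_001, 50_002, 50_003, 13, EOS_ID])
visible, context = generate(model, prompt=[5])
print(visible)  # [11, 12, 13]
```

The user only ever sees the word tokens, while the model conditions on everything, including its own thought tokens.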

Chances

My hope is that this lets the AI think about what to say next like a human would. It is reasonable to assume that early in training it won’t use the reasoning tokens all that much, but later on, when it has to solve more difficult problems, it will very likely use them to improve its chances of succeeding.
This could drastically lower the number of parameters we need to get better output from models, as less thought-heavy tasks like small talk or very commonly used sentence structures could be generated quickly, while more complex topics are allowed to take longer. It would also make better LLMs more accessible to people running models at home, as the inference time is scaled rather than the parameter count.
The model would train itself to produce useful reasoning tokens. Compared to how o1 does it, this is a much more token-friendly approach, as we allow non-human-text generation, which should suit the LLM well because it fills up less of its context.
This approach might also lead to more concise answers, as the model no longer needs to write out CoT (chain of thought) to come to good conclusions.
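As a rough way to picture how compute would be spread across the answer, the sketch below counts how many forward passes are spent before each visible word. The token stream and the `<t…>` naming for thought tokens are made up for illustration, and it assumes every generated token costs about one forward pass.

```python
from typing import Callable, List, Tuple


def passes_per_visible_token(stream: List[str],
                             is_thought: Callable[[str], bool]
                             ) -> List[Tuple[str, int]]:
    """For each visible token, count the forward passes spent on it,
    including the thought tokens generated right before it."""
    counts: List[Tuple[str, int]] = []
    pending = 0
    for tok in stream:
        pending += 1                 # one forward pass per generated token
        if not is_thought(tok):
            counts.append((tok, pending))
            pending = 0
    return counts


# Hypothetical stream for the apple example: the model "thinks" before "red".
stream = ["Apples", "are", "<t17>", "<t902>", "<t5>", "red", "."]
result = passes_per_visible_token(stream, lambda t: t.startswith("<t"))
print(result)  # [('Apples', 1), ('are', 1), ('red', 4), ('.', 1)]
```

In this made-up case the easy words cost one pass each, while “red” gets four.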

Pitfalls and potential risks

Training an AI using black-box reasoning tokens can be considered a bad idea, as its thought process is literally uninterpretable.
We would have to constrain the number of reasoning tokens, so that it doesn’t take too long to output a single normal word token. Text-only LLMs have a similar problem: they tend to generate long blocks of text for simple questions.
We are hoping that during training, the model will use these reasoning tokens in its response, even though we as humans can’t even read them. At first, this may lead to the model completely ignoring these tokens, as they don’t seem to lead to a better output. Later on in training, however, I do expect the model to use more of these tokens, as it realizes how useful it can be to have thoughts.
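One possible way to enforce the cap on reasoning tokens mentioned above is to mask out the thought-token logits once the model has produced too many of them in a row. This is only a sketch of that one option: the function names, the tiny vocabulary, and the limit of two are all hypothetical.

```python
import math
from typing import List


def mask_thought_logits(logits: List[float],
                        recent_thoughts: int,
                        max_thoughts: int,
                        vocab_size: int) -> List[float]:
    """Once the cap is reached, set every thought-token logit to -inf so
    the next token has to be a normal word token."""
    if recent_thoughts < max_thoughts:
        return logits
    return [x if i < vocab_size else -math.inf
            for i, x in enumerate(logits)]


def greedy_pick(logits: List[float]) -> int:
    """Index of the largest logit (greedy decoding)."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy setup: ids 0..2 are word tokens, ids 3..4 are thought tokens.
logits = [0.1, 0.2, 0.0, 5.0, 4.0]

# Below the cap, the model may keep thinking.
print(greedy_pick(mask_thought_logits(logits, 1, 2, vocab_size=3)))  # 3

# At the cap, it is forced to emit a word token.
print(greedy_pick(mask_thought_logits(logits, 2, 2, vocab_size=3)))  # 1
```

The counter `recent_thoughts` would be reset to zero by the decoding loop every time a word token is emitted.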

What do you think?

I like this approach because it might be able to achieve o1-like performance without the long wait before the output. While an o1-like approach is probably better for coding tasks, where planning is very important, for other tasks this way of generating reasoning tokens while writing the answer might be better.