From what I understand, the key contribution lies in its cost efficiency during training. However, does this refer to the entire training pipeline (including pre-training), or just the reinforcement learning stage?
Additionally, it seems that the cost savings primarily come from how the reward is computed. The paper gives two examples: for math problems, fixed reference answers are used to score the model's output, and for LeetCode problems, a compiler verifies the generated code.
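If it helps to make that concrete, a rule-based reward of this kind is very simple to implement. Below is a minimal sketch in Python (my own illustration, not the paper's actual code): `math_reward` assumes the model emits its final answer in a `\boxed{...}` wrapper and compares it against the fixed reference answer, and `code_reward` stands in for the compiler/judge check by running the generated solution together with its tests and rewarding a zero exit code.

```python
import re
import subprocess
import sys
import tempfile


def math_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final answer matches the fixed reference
    answer, else 0.0. Assumes the answer appears in a \\boxed{...} wrapper;
    the paper's exact extraction/matching rules may differ."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if not match:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0


def code_reward(solution_source: str, test_source: str, timeout_s: int = 10) -> float:
    """Return 1.0 if the generated solution runs and passes the bundled tests
    (exit code 0), else 0.0. This is a simplified stand-in for the
    compiler-based check described for LeetCode-style problems."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution_source + "\n\n" + test_source)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 0.0  # non-terminating solutions get no reward
    return 1.0 if result.returncode == 0 else 0.0


# Example usage (hypothetical inputs):
# math_reward("... so the result is \\boxed{42}", "42")   -> 1.0
# code_reward("def add(a, b):\n    return a + b",
#             "assert add(2, 3) == 5")                     -> 1.0
```

The point is that neither check requires training or querying a separate reward model, which is presumably where much of the claimed savings comes from.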
However, these examples cover only a narrow set of problem types. Not all logical challenges fall under math or coding. Can a model trained mainly on math and coding problems generalize well to other types of logical reasoning tasks?
So it's simply a better option: free and more efficient to run, and, to add insult to injury, it was apparently trained for far less than what OpenAI spent on o1. The training cost is mostly irrelevant to end users, though, because the model weights are available for free.