But I don't understand: there is so much data in the real world beyond the internet. Webcams. Microphones. Cars. Robots... Everything can collect multimodal data and more importantly (for robots) even get feedback loops from reality.
So isn't data functionally infinite? And isn't the only thing standing in the way the number of sensors, open datastreams, and datasets?
Please help me understand.
Scaling-law plots are drawn on log axes, so getting more juice out of naive scaling means investing exponentially more resources. We're at the point where the juice isn't worth the squeeze, so people will shift to moving the curve down instead: new architectures, better-curated datasets, and test-time compute / RL.
See:
- FineWeb: https://arxiv.org/abs/2406.17557
- Phi-4: https://arxiv.org/abs/2412.08905
- DataComp-LM (DCLM): https://arxiv.org/abs/2406.11794
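
To put rough numbers on the "exponentially more resources" point, here is a minimal sketch that assumes loss follows a simple power law in compute, loss(C) = A * C^(-ALPHA). The constants `A` and `ALPHA` are made up for illustration and are not taken from any of the papers above; only the shape of the relationship matters.

```python
# Toy power-law scaling curve: loss(C) = A * C ** (-ALPHA).
# A and ALPHA are hypothetical values chosen only for illustration.
A = 10.0
ALPHA = 0.05


def loss(compute: float) -> float:
    """Loss predicted by the toy power law for a given compute budget."""
    return A * compute ** (-ALPHA)


def compute_multiplier(loss_ratio: float) -> float:
    """How many times more compute is needed to scale the loss by loss_ratio.

    Solves loss(k * C) / loss(C) = loss_ratio for k. The answer does not
    depend on the starting budget C, only on ALPHA.
    """
    return loss_ratio ** (-1.0 / ALPHA)


if __name__ == "__main__":
    for ratio in (0.9, 0.8, 0.5):
        k = compute_multiplier(ratio)
        print(f"loss x{ratio}: needs about {k:,.0f}x the compute")
```

Under these assumed constants, even a 10% drop in loss costs several times the compute, and halving the loss costs about a million times more. A smaller exponent makes it worse, which is why lowering the curve itself (better data, better architectures) looks more attractive than sliding further along it.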
Humans get to PhD level with barely a drop of training data compared to what LLMs are trained on.
If there were infinite useful data, then scaling AI on data would make sense. Since there isn't, the way forward is getting more efficient at using the data we have.
Data from the internet can be chunked, sorted, easily processed, and has a relatively high signal-to-noise ratio. Data from a webcam or a microphone -- if even legal to access in the first place -- would be a mess. Imagine chunking and processing 5TB of that sort of data. Seems to me that the effort would far outweigh the reward.
Robots are a different problem entirely. It's darkly amusing that simple problems of motion through space are harder to replicate than painting the simulacrum of a masterpiece or acing the medical licensing exam. We'll probably have AGI before we can mimic the movement of a simple housefly.
Access to physical reality is important when negotiating with the beings that can form under this constraint. People have apparently known this instinctively for a very long time and they are not going to give in to the demands of the AI industry.
It's a great mistake to humanize everything in your consciousness.