Is the training material the new AI moat?

Question

DeepSeek R1 claims to be trained on 14.8T high-quality tokens, with no further information on their contents. Meta claims 15T tokens from public sources with undisclosed filters. Are the datasets needed to train these models accessible to any company? If one wanted to obtain such a dataset in a country that is not an AI hotspot, say India or Brazil, would that even be possible? Or is special corporate access needed, since these datasets seem to be kept private and undisclosed for all new models?

verdverm · Accepted Answer

There is a large, open dataset of copyright-free material somewhere (which I cannot relocate at the moment; perhaps another HNer will have a link handy). IIRC, Meta put it together.

Not what I was looking for, but a good link nonetheless: https://ai.meta.com/datasets/