Are the datasets needed to train these models accessible to any company? If someone wanted to obtain such a dataset in a country that is not an AI hotspot, say India or Brazil, would that even be possible? Or is special corporate access needed, since these datasets seem to be kept private and undisclosed for all new models?
Not what I was looking for, but a good link nonetheless: https://ai.meta.com/datasets/