How useful would the training data be
Open datasets are getting much better (Tulu for an instruct database/recipe is a great example), but its clear the giants still have “secret sauce” that gives them at least a small edge over open datasets.
There actually seems to be some vindication of using massively multilingual datasets as well, as the hybrid chinese/english models are turning out very good.