AI Training Data Resources

Common Corpus: The Largest Collection of Ethical Data for LLM Pre-Training
https://arxiv.org/abs/2506.01732