1/ 🧵 The depletion of training data for Large Language Models (LLMs) is a pressing concern. How imminent is this issue? Let’s explore detailed estimates. #Mozo #DecentralizedAI
2/ Llama 3, an advanced LLM, has been trained on 15 trillion tokens (roughly 11T words). That is approximately 100,000 times more text than the linguistic exposure of an average human by age 20.
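For intuition on the tokens-to-words conversion behind these figures, here is a minimal sketch using the open-source tiktoken tokenizer (note: Llama 3 uses its own tokenizer, so the exact ratio differs; roughly 1.3-1.4 tokens per word is a common rule of thumb for English web text):

```python
# Minimal sketch: tokens per word for English text using the open-source
# tiktoken BPE tokenizer. Llama 3 ships its own tokenizer, so its exact
# ratio differs; ~1.3-1.4 tokens/word is a common rule of thumb for
# general English text, which is what maps 15T tokens to ~11T words.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
sample = (
    "The depletion of training data for large language models "
    "is a pressing concern for the next generation of AI systems."
)
words = len(sample.split())
tokens = len(enc.encode(sample))
print(f"{words} words -> {tokens} tokens ({tokens / words:.2f} tokens/word)")

# At the thread's ratio, 15T tokens ~ 11T words; 11T words is roughly
# 100,000x the ~110M words that comparison implies a person encounters by age 20.
```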
3/ The predominant source for LLM data is web crawls. Common Crawl, encompassing over 100 trillion tokens, contains significant noise. However, curated subsets like FineWeb, comprising 15T tokens, provide a more refined dataset.
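To see what such a curated subset looks like in practice, here is a minimal sketch that streams a few FineWeb documents via the Hugging Face datasets library (the dataset id HuggingFaceFW/fineweb and the text column are assumed from the public release):

```python
# Minimal sketch: streaming a few documents from FineWeb on Hugging Face.
# Dataset id and the "text" column are assumed from the public release;
# streaming avoids downloading the full multi-terabyte dataset.
from datasets import load_dataset

fw = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)
for i, doc in enumerate(fw):
    print(doc["text"][:120].replace("\n", " "), "...")
    if i == 2:
        break
```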
4/ English constitutes only about 45% of web content, so including other languages could roughly double the available training data (1/0.45 ≈ 2.2x the English-only pool). While current models derive limited benefit from multilingual data, future innovations might harness it more effectively.
5/ Common Crawl captures only HTML pages, excluding valuable data in PDFs, dynamically rendered content, and login-protected areas, indicating untapped data reservoirs.
6/ Academic literature and patents collectively add another 1.2 trillion tokens. Despite extraction challenges, these sources offer high-quality, invaluable text for training LLMs.
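What those extraction challenges look like in code: a minimal sketch of pulling plain text out of a single PDF with the open-source pypdf library ("paper.pdf" is a placeholder; real pipelines need far more robust handling of multi-column layouts, equations, and scanned pages):

```python
# Minimal sketch: extracting plain text from one PDF with pypdf.
# "paper.pdf" is a placeholder; production pipelines use far more robust
# tooling for multi-column layouts, equations, figures, and scanned pages.
from pypdf import PdfReader

reader = PdfReader("paper.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(f"Extracted ~{len(text.split())} words from {len(reader.pages)} pages")
```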
7/ Books represent another substantial data source, albeit less accessible. Google Books alone provides nearly 5T tokens, though it remains exclusive to Google. Shadow libraries like Anna’s Archive offer significant additional resources.
8/ Social media platforms are vast repositories of text. Twitter holds an estimated 11T tokens, while Weibo is significantly larger at roughly 38T. These figures highlight the immense potential of user-generated content.
9/ Transcribed audio from YouTube, TikTok, podcasts, and radio archives also contributes significantly to available data. For instance, YouTube and TikTok combined provide 12T tokens of valuable spoken content.
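One way spoken content becomes training text: open-source speech-to-text models such as OpenAI's Whisper. A minimal sketch ("episode.mp3" is a placeholder; large-scale pipelines add batching across GPUs, language detection, and quality filtering):

```python
# Minimal sketch: transcribing one audio file with the open-source Whisper
# model. "episode.mp3" is a placeholder; real pipelines batch this across
# GPUs and add language detection, deduplication, and quality filtering.
import whisper

model = whisper.load_model("base")        # small checkpoint, CPU-friendly
result = model.transcribe("episode.mp3")  # dict with "text", "segments", "language"
print(result["text"][:500])
```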
10/ Source code represents a critical type of text for training models. Public repositories alone contribute 0.75T tokens, with the total potentially reaching 20T tokens when including private code.
11/ Private data vastly surpasses public sources. Emails and instant messages alone account for trillions of tokens, highlighting a massive yet ethically sensitive data reservoir.
12/ Current LLMs are trained on datasets of up to 15T tokens. With concerted effort, this could extend to 25-30T, and potentially 60T with the inclusion of non-English data. However, ethical and practical constraints must be considered.
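As a back-of-the-envelope tally (a sketch assuming the per-source figures quoted earlier in this thread, ignoring overlap between sources; the thread does not spell out exactly which sources the 25-30T figure counts):

```python
# Rough tally of the English-language estimates quoted in this thread,
# in trillions of tokens. Overlap between sources is ignored, and social
# media (Twitter, Weibo) and private data are left out as hard to access,
# so this is a gross upper bound rather than a practically reachable figure.
sources_T = {
    "curated web crawl (FineWeb-scale)": 15.0,
    "academic literature + patents":     1.2,
    "books (Google Books scale)":        5.0,
    "transcribed YouTube + TikTok":      12.0,
    "public source code":                0.75,
}
total = sum(sources_T.values())
print(f"Gross English-language total: ~{total:.0f}T tokens")
# -> ~34T before deduplication, access limits, and quality filtering,
#    broadly consistent with a practical 25-30T, and ~60T once
#    non-English web data is added on top.
```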
13/ The future of LLMs hinges on leveraging available data effectively while upholding ethical standards. At Mozo, we strive to democratize AI development through decentralized, high-quality data contributions. #Mozo #DecentralizedAI
14/ Join us in pioneering the future of AI. Together, we can ensure sustainable and ethical data use, pushing the boundaries of decentralized AI development. 🌐💡 #Mozo #DecentralizedAI
Website | Twitter | Community | Telegram Mini App | News Channel