Colossal Clean Crawled Corpus (C4), one of the most widely used datasets for training AI models, draws a measurable share of its data from crypto platforms. An analysis shows that C4 contains millions of text snippets from cryptocurrency websites and from web platforms closely tied to crypto.
According to reports, the website of the U.S. Securities and Exchange Commission (sec.gov), which now hosts a significant amount of crypto-related information, accounts for 36 million C4 tokens, or 0.02% of the dataset. It ranked 39th among the websites represented in C4.
Bitcointalk.org, the forum founded by Satoshi Nakamoto, accounted for 6.1 million C4 tokens, equivalent to 0.004% of the total. It ranked 780th among the websites in the dataset.
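As a quick sanity check on these figures, the token counts and percentage shares quoted for sec.gov and Bitcointalk.org can each be used to back out the overall size of C4 that they imply. The sketch below uses only the numbers reported above; the helper name `implied_total` is made up for illustration, and the result is a rough estimate, not an official token count for the dataset.

```python
# Figures quoted in the article: (tokens, share of C4 in percent).
reported = {
    "sec.gov": (36_000_000, 0.02),
    "bitcointalk.org": (6_100_000, 0.004),
}


def implied_total(tokens: int, share_percent: float) -> float:
    """Total dataset size implied by a site's token count and its percentage share."""
    return tokens / (share_percent / 100)


for site, (tokens, share) in reported.items():
    total_billions = implied_total(tokens, share) / 1e9
    print(f"{site}: implied total of about {total_billions:.1f} billion tokens")
```

The two estimates come out at roughly 180 billion and 152.5 billion tokens. The gap is within what rounding each percentage to a single significant figure would produce, so the reported figures are consistent with one another.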
Other crypto platforms represented in C4 include the crypto news website Cointelegraph and the token aggregation platform CoinMarketCap. These and six more related websites accounted for 0.008% of all C4 tokens, while websites dedicated to specific cryptocurrencies made up a negligible share.
IPFS (ipfs.io) and Steemit (steemit.com) also featured prominently in C4’s dataset. IPFS ranked 16th, while Steemit ranked 594th. Neither site is a crypto platform as such, but both have strong ties to the crypto industry.
The presence of crypto-related platforms in an AI training dataset such as C4 illustrates how far cryptocurrency has moved into the mainstream. Crypto websites are represented well enough that they could influence models trained on C4, even though mainstream websites like Google and Facebook outrank them significantly.
C4 has faced criticism over pirated data and hate speech, despite reports of the dataset being “cleaned”. Its list for censoring specific content contains only 400 words, which suggests controversial content could still remain in C4. The presence of crypto sites in the dataset could also affect its level of bias.