English Language Dominance

The English language will increase its dominance in an AI world

Language is itself a technology, and like many technologies, it exhibits a classic network effect: each additional speaker of a language increases that language’s utility for all other speakers. The more “users” who speak and write English, the more valuable it is to know and use English in just about all affairs.

One obvious example is software programming. Though programming languages contain a lot of symbolic and mathematical notation, most would agree that English is far more valuable to know than the second or third most popular language if you want to be a good programmer. It’s better for troubleshooting, for reading documentation, for scouring Stack Overflow for copy-paste code, and now for getting ChatGPT or Copilot to write code for you.

My belief is that as AI proliferates, English will only increase its lead. English is already in the lead with 1.4B speakers (though this number varies significantly depending on how you measure fluency), and Mandarin Chinese is second at 1.1B.

Why?

AI models need data, and English comprises a majority of the available online training data. It helps that the largest economy in the world (the US) and the most populous country in the world (India, which, depending on your reference, surpassed China’s population this year) are both English markets.

The largest content-generating internet platforms (from Google to Facebook to Twitter to Wikipedia and on and on) are dominated by English speakers. An AI model’s output quality tends to improve with the quantity of its training data, and there is simply more English data available than data in any other language, including Mandarin Chinese. Thus GPT-4, LLaMA and so forth are “smartest” in English.

There are multiple reasons why Mandarin Chinese lags behind, beyond just the fact that the breakthrough innovations in AI research and productization happened first in the US and UK. Among these reasons are the Great Firewall, the highly regulated and controlled nature of Chinese data, and China’s pervasive digital censorship. (For example, there are more than 500 words that can’t be used on many Chinese UGC websites simply because they are perceived as unfavorable nicknames for President Xi Jinping.)

Thus Chinese online training data lags English in quantity and likely in quality. There are also reasons related to the languages themselves: English tends to be a more explicit language, and Chinese a more contextual one.

English’s initial data lead feeds a self-reinforcing loop: the more that people use English to interact with services like Character.AI and ChatGPT, the more data the LLMs have to refine and improve (in English). That leaves other languages in the dust, especially long-tail ones like Icelandic or Khmer.
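To make the loop concrete, here is a toy sketch in Python. Everything in it is my own assumption for illustration: the function name simulate_share, the starting stocks of 60 and 40 units, and the bias parameter, which stands in for the idea that a better model attracts a more-than-proportional share of new usage. It is not a measurement of real training data, just a way to see how a lead can compound.

```python
def simulate_share(stocks, rounds, new_data_per_round=10.0, bias=1.5):
    """Toy model of a data feedback loop (all numbers are hypothetical).

    stocks: dict mapping language name -> current amount of data.
    bias:   exponent on each stock when splitting new data. With
            bias == 1 new data is split in proportion to existing
            data; with bias > 1 the leader gets more than its share.
    Returns a list with each language's share of total data per round.
    """
    stocks = dict(stocks)  # copy so the caller's dict is not mutated
    history = []
    for _ in range(rounds):
        weights = {lang: amount ** bias for lang, amount in stocks.items()}
        total_weight = sum(weights.values())
        for lang in stocks:
            stocks[lang] += new_data_per_round * weights[lang] / total_weight
        total = sum(stocks.values())
        history.append({lang: amount / total for lang, amount in stocks.items()})
    return history


if __name__ == "__main__":
    # Hypothetical starting point: a 60/40 split between two languages.
    for step, shares in enumerate(simulate_share({"english": 60.0, "other": 40.0}, 5), 1):
        print(step, round(shares["english"], 4))
```

With bias set to 1 the split never moves, so in this sketch the runaway effect depends entirely on the assumption that better models pull in disproportionately more usage.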

As AI agents increasingly interact with each other, I’m guessing they will develop their own unique protocols for AI-to-AI communication, not unlike how computers communicate via highly structured network requests, only more complex. AI will eventually create its own lingua franca. However, some human-readable component will probably need to be built into this AI-ese (because at a minimum, developers will want to know where to debug and fix errors). English will likely be chosen for that human-readable part of the AI-to-AI interface.
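As a rough illustration of what a human-readable component could look like, here is a minimal Python sketch of a structured agent-to-agent message that carries a plain-English summary next to its machine payload. The AgentMessage class, its field names (including summary_en), and the JSON wire format are all hypothetical choices of mine, not an existing protocol or standard.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class AgentMessage:
    """Hypothetical agent-to-agent message; every field name is an assumption."""

    sender: str
    recipient: str
    intent: str                                   # machine-oriented action name
    payload: dict = field(default_factory=dict)   # structured data for the other agent
    summary_en: str = ""                          # plain-English note for human debuggers

    def to_wire(self) -> str:
        """Serialize to JSON, refusing messages that lack the English summary."""
        if not self.summary_en.strip():
            raise ValueError("summary_en is required so humans can debug the exchange")
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_wire(cls, raw: str) -> "AgentMessage":
        """Rebuild a message from its JSON wire form."""
        return cls(**json.loads(raw))


if __name__ == "__main__":
    msg = AgentMessage(
        sender="booking-agent",
        recipient="calendar-agent",
        intent="propose_slot",
        payload={"start": "2024-01-15T10:00", "minutes": 30},
        summary_en="Proposing a 30-minute meeting on Jan 15 at 10:00.",
    )
    print(msg.to_wire())
```

The agents themselves would only need intent and payload; the summary_en field exists purely so that a developer reading the logs can tell what was supposed to happen.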

Of course, AI is an amazing and broad innovation that will benefit speakers of all languages. It will help to preserve and distribute rarer languages, and enable faster and better language-to-language education and translation. Whether you speak Vietnamese or Icelandic, there will be an AI model for you. I’m simply arguing that the models for these secondary languages won’t be anywhere NEAR as good as the leading English models. I would venture that even if English isn’t your first or even second language, you will probably still get better results interacting with ChatGPT in broken English than in, say, French.

I could be very wrong here. As with any emerging technology, second- and third-order effects are by their nature unpredictable and chaotic, and the technology still has a long way to evolve and mature. Let’s see how it all plays out. I’m especially curious about what kinds of AI-to-AI communications will emerge, whether exposed through a human-readable interface or otherwise.

Ok that’s it, over and out good sers and madams! OpenAI wow!
