The English language will increase its dominance in an AI world

Language is itself a technology, and like many technologies, it exhibits a classic network effect: each additional speaker of a language increases that language’s utility for all other speakers. The more “users” who speak and write English, the more valuable it is to know and use English in just about all affairs.
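The network effect above can be made concrete with a Metcalfe’s-law-style sketch (an assumption on my part that pairwise connections are the right proxy for a language’s utility, but it captures the intuition):

```python
# Metcalfe's-law-style sketch: a language's utility to its community grows
# roughly with the number of possible speaker-to-speaker connections.
def network_value(speakers: int) -> int:
    """Pairwise connections among n speakers: n * (n - 1) / 2."""
    return speakers * (speakers - 1) // 2

# Doubling the speaker base roughly quadruples the number of connections,
# which is why a lead in speakers compounds into a much bigger lead in utility.
print(network_value(1000))   # 499500
print(network_value(2000))   # 1999000
```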

One obvious example is in software programming. Though programming languages contain plenty of symbolic and mathematical notation, most would agree that knowing English is head-and-shoulders more valuable than knowing the second or third most popular natural language if you want to be a good programmer. It’s better for troubleshooting, for reading documentation, for scouring Stack Overflow for copy-paste code, and now for getting ChatGPT or Copilot to write code for you.

My belief is that as AI proliferates, English will only extend its lead. English is already first with roughly 1.4B speakers (though this number varies significantly depending on how you measure fluency), and Mandarin Chinese is second at about 1.1B.

Why?

AI models need data, and English comprises a majority of the available online training data. It helps that the largest economy in the world (the US) and the most populous country in the world (India, which, depending on your reference, surpassed China’s population this year) are both English-speaking markets.

The largest content-generating internet platforms — from Google to Facebook to Twitter to Wikipedia and on and on — are dominated by English speakers. An AI model’s output quality is directly correlated with the quantity of its training data, and there is simply more English data available than data in any other language, including Mandarin Chinese. Thus GPT-4 and LLaMA and so forth are “smartest” in English.

There are multiple reasons why Mandarin Chinese lags behind, beyond just the fact that the breakthrough innovations in AI research and productization happened first in the US and UK. Among these reasons are the Great Firewall, the highly regulated and controlled nature of Chinese data, and China’s pervasive digital censorship. (For example, more than 500 words alone are banned on many Chinese UGC websites because they are perceived as unfavorable nicknames for President Xi Jinping.)

Thus Chinese online training data lags English data in both quantity and, likely, quality. There are also reasons rooted in the languages themselves: English is a more explicit language, while Chinese is more contextual.

English’s initial data lead is a self-reinforcing feedback loop — the more people use English to interact with services like Character.AI and ChatGPT, the more data the LLMs have to refine and improve (in English), leaving other languages in the dust, especially long-tail ones like Icelandic or Khmer.

As AI agents increasingly interact with each other, I’m guessing they will develop their own unique protocols for AI-to-AI communication, not dissimilar to how computers communicate via highly structured network requests, only more complex and perhaps unique. AI will eventually create its own AI lingua franca. However, some human-readable component will need to be built into this AI-ese (at a minimum, developers will want to know where to debug and fix errors), and English will likely be chosen for that AI-to-AI interface.
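One way such a protocol might look: a compact machine-routable envelope plus an English field for the human-readable debugging layer. This is purely my speculation — the message shape, field names, and agents here are all hypothetical:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical agent-to-agent message: structured fields that machines route
# on, plus an English-language trace so developers can audit the exchange.
@dataclass
class AgentMessage:
    sender: str        # which agent sent this
    intent: str        # machine-routable action code
    payload: dict      # structured arguments for the receiving agent
    human_trace: str   # plain-English summary, the debugging layer

msg = AgentMessage(
    sender="scheduler-agent",
    intent="book_meeting",
    payload={"attendees": ["alice", "bob"], "duration_min": 30},
    human_trace="Scheduler asks the calendar agent for a 30-minute slot with Alice and Bob.",
)
wire = json.dumps(asdict(msg))  # what would actually travel between agents
```

The machine fields could in principle be anything, but the `human_trace` is where English’s lead would get locked in.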

Of course, AI is an amazing and broad innovation that will benefit speakers of all languages. It will help to preserve and distribute rarer languages, and enable faster and better language-to-language education and translation. Whether you speak Vietnamese or Icelandic, there will be an AI model for you. I’m simply arguing that models in these secondary languages won’t be anywhere NEAR as good as the leading English models, and I would venture that even if English isn’t your first or even second language, you will probably still get better results interacting with ChatGPT in broken English than in, say, French.

I could be very wrong here. As with any emerging technology, second and third order effects are by their nature unpredictable and chaotic. And the technology still has a long way to evolve and mature. Let’s see how it all plays out. I’m especially curious about what kinds of AI-to-AI communications will emerge, whether exposed through a human-readable interface or otherwise.

Ok that’s it, over and out good sers and madams! OpenAI wow!

Who’s in the loop? How humans create AI that then creates itself

If you think about the approximate lifecycle of AI that’s being built today, it goes something like this:

1. Write algorithms (eg, neural nets)
2. Scrape data (eg, text and images)
3. Train (1) algorithms on (2) scraped data to create models (eg, GPT-4, Stable Diffusion)
4. Use human feedback (eg, RLHF) to fine-tune (3) models – including addition of explicit rules / handicaps to prevent abuse
5. Build products using those (4) fine-tuned models – both end-user products (like Midjourney) and API endpoints (like OpenAI’s API)
6. Let users do things with the (5) products (eg, write essays, suggest code, translate languages). Inputs > Outputs
7. Users and AI owners then evaluate the (6) results against objectives like profitability, usefulness, controllability, etc. Based on these evaluations, steps (1) through (6) are further refined and rebuilt and improved
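The seven steps above can be sketched as a pipeline of stand-in functions. Every name here is illustrative — this is not any real framework’s API, just the lifecycle’s shape as code:

```python
# Toy sketch of the seven-step AI lifecycle, one stand-in function per stage.
def write_algorithm():            # (1) define the model architecture
    return "neural-net"

def scrape_data():                # (2) gather raw training documents
    return ["doc-1", "doc-2"]

def train(algorithm, data):       # (3) fit the algorithm to the data
    return {"base_model": algorithm, "seen": len(data)}

def fine_tune(model, feedback):   # (4) adjust with (human) feedback
    model["aligned"] = bool(feedback)
    return model

def build_product(model):         # (5) wrap the model in a product / API
    return lambda prompt: f"{model['base_model']} answer to: {prompt}"

def use_product(product):         # (6) users turn inputs into outputs
    return product("write an essay")

def evaluate(output):             # (7) score results against objectives
    return {"useful": "answer" in output}

# One full pass through the loop; step (7)'s report feeds the next pass.
report = evaluate(use_product(build_product(
    fine_tune(train(write_algorithm(), scrape_data()), feedback=["A > B"]))))
```

Seen this way, "removing humans" just means replacing the caller of each function with another model.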

Each of those steps initially involved humans. Many humans doing many things. Humans wrote the math and code that went into the machine learning algorithms. Humans wrote the scripts and tools that scraped the data. Etc.

And very steadily, very incrementally, very interestingly, humans are automating and removing themselves from each of those steps.

AI agents are one example of this. Self-directed AI agents can take roughly defined goals and execute multi-step action plans, removing humans from steps (6) and (7).

Data scraping (2) is mostly automated already. And I think AI and automation can already do much of the cleaning and labeling (eg, ETL) in ways that are better, cheaper, and faster than humans.
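A minimal sketch of what that automated cleaning pass might do — strip markup remnants, normalize whitespace, drop junk and duplicates. The rules here are my own toy examples, not any production ETL pipeline:

```python
import re

# Illustrative cleaning pass over scraped documents: remove leftover HTML
# tags, collapse whitespace, and drop near-empty or duplicate documents.
def clean_corpus(raw_docs):
    seen, cleaned = set(), []
    for doc in raw_docs:
        text = re.sub(r"<[^>]+>", " ", doc)        # strip HTML remnants
        text = re.sub(r"\s+", " ", text).strip()   # normalize whitespace
        if len(text) < 10 or text in seen:         # drop junk and dupes
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned

docs = ["<p>Hello   world, this is a doc.</p>",
        "Hello world, this is a doc.",   # duplicate once cleaned
        "<br>"]                          # junk, too short to keep
print(clean_corpus(docs))  # ['Hello world, this is a doc.']
```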

AI is being taught how to write and train its own algorithms (steps 1 and 3).

I’m not sure about the state of the AI art for steps (4) and (5). Step 4 (human feedback) seems hardest because, well, it requires humans by definition. But there are early signs that “human feedback” is not all that unique, whether it’s using AI to generate synthetic data, to perform tasks by “acting like humans” (eg, acting like a therapist), to label images, etc.
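What AI-supplied feedback for step (4) might look like: a judge model producing the same kind of preference label that RLHF collects from human raters. The judge below is a deliberately dumb stub heuristic standing in for what would really be another LLM call:

```python
# Sketch of AI feedback replacing step (4)'s human raters: a "judge" picks
# the better of two candidate answers, yielding an RLHF-style preference
# label. Stub heuristic only; in practice the judge would be another model.
def judge(prompt, answer_a, answer_b):
    """Return 'a' or 'b' for whichever answer the judge prefers."""
    score = lambda ans: len(ans.split())  # toy proxy: prefer fuller answers
    return "a" if score(answer_a) >= score(answer_b) else "b"

preference = judge(
    "Explain network effects.",
    "Each new user makes the network more valuable to every other user.",
    "It is good.",
)
print(preference)  # 'a'
```

Collect enough of these (prompt, preference) pairs and you have a fine-tuning dataset with no human in the loop.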

Step (5) is definitely within reach, given all the viral Twitter threads we’ve seen where AI can build websites and apps and debug code.

So eventually we’ll have AI that can do most if not all of steps 1-7. AI that can write itself, train itself, go out and do stuff in the world, evaluate how well it’s done, and then improve on all of the above. All at digital speed, scale, and with incremental costs falling to zero.

Truly something to behold. And in that world, where will humans be most useful, if anywhere?

Just a fascinating thought experiment, is all. 🧐🧐🧐

These times are only gettin’ weirder.

Using ChatGPT (GPT-4) to study Chinese song lyrics

Recently I wanted to understand the lyrics of 青花瓷 (“Blue and White Porcelain”), but I couldn’t find good translations through Google since the writing is fairly dense and symbolic. To me it reads like a Tang poem. Google Translate was nearly meaningless.

So I turned to ChatGPT (using GPT-4) and boy did it deliver! I was giddy when I saw the first reply to my simple prompt:

[Image: ChatGPT (GPT-4) response to the song-lyrics prompt]

Wow! It’s got everything I need.

I really want to use ChatGPT more. One of the downsides of being in my late 30s is that I’m so *comfortable* with my existing tech habits that it takes more consistent reminding and constant pushing to build a new one.

But this leap feels to me like it’s bigger than when internet search first became fairly good. I’m thinking back to, like, the improvement that was AltaVista, let alone Google.

Podcast notes: Sam Altman (OpenAI CEO) on Lex Fridman – “Consciousness…something very strange is going on”

// everything is paraphrased from Sam’s perspective unless otherwise noted

Base model is useful, but adding RLHF – take human feedback (eg, of two outputs, which is better) – works remarkably well with remarkably little data to make model more useful

Pre training dataset – lots of open source DBs, partnerships – a lot of work is building great dataset

“We should be in awe that we got to this level” (re GPT-4)

Eval = how to measure a model after you’ve trained it

Compressing all of the web into an organized box of human knowledge

“I suspect too much processing power is using model as database” (versus as a reasoning engine)

Every time we put out new model – outside world teaches us a lot – shape technology with us

ChatGPT bias – “not something I felt proud of”
Answer will be to give users more personalized, granular control

Hope these models bring more nuance to world

Important for progress on alignment to increase faster than progress on capabilities

GPT-4 = most capable and most aligned model they’ve done
RLHF is important component of alignment
Better alignment > better capabilities and vice-versa

Tuned GPT-4 to follow the system message (prompt) closely
There are people who spend 12 hours/day, treat it like debugging software, get a feel for model, how prompts work together

Dialogue and iterating with AI / computer as a partner tool – that’s a really big deal

Dream scenario: have a US constitutional convention for AI, agree on rules and system, democratic process, builders have this baked in, each country and user can set own rules / boundaries

Doesn’t like being scolded by a computer — “has a visceral response”

At OpenAI, we’re good at finding lots of small wins, the detail and care applied — the multiplicative impact is large

People getting caught up in parameter count race, similar to gigahertz processor race
OpenAI focuses on just doing whatever works (eg, their focus on scaling LLMs)

We need to expand on GPT paradigm to discover novel new science

If we don’t build AGI but make humans super great — still a huge win

Most programmers think GPT is amazing, makes them 10x more productive

AI can deliver extraordinary increase in quality of life
People want status, drama, people want to create, AI won’t eliminate that

Eliezer Yudkowsky’s AI criticisms – wrote a good blog post on AI alignment, despite much of the writing being hard to understand / containing logical flaws

Need a tight feedback loop – continue to learn from what we learn

Surprised a bit by ChatGPT’s reception – thought it would be, eg, the 10th fastest-growing software product, not the 1st
Knew GPT-4 would be good – remarkable that we’re even debating whether it’s AGI or not

Re: AI takeoff, believes in slow takeoff, short timelines

Lex: believes GPT4 can fake consciousness

Ilya S said that if you trained a model on data with no mention of consciousness whatsoever, yet it could immediately understand when a user described what consciousness felt like, that would be a meaningful sign

Lex on Ex Machina: consciousness is when you smile for no audience, experience for its own sake

Consciousness…something very strange is going on

// Stopped taking notes ~halfway

“Cars get better by some modest amount each year, as do most other things I buy or use. LLMs, in contrast, can make leaps.” – Tyler Cowen on AI

Must read if you’re interested in AI and its implications; Tyler’s commentary on the recent explosion of AI into the popular consciousness (driven in large part by ChatGPT) has been, in my view, the most realistic and pragmatic:

https://www.bloomberg.com/opinion/articles/2023-01-23/chatgpt-is-only-going-to-get-better-and-we-better-get-used-to-it

“I don’t have a prediction for the rate of improvement, but most analogies from the normal economy do not apply. Cars get better by some modest amount each year, as do most other things I buy or use. LLMs, in contrast, can make leaps.”

“I’ve started dividing the people I know into three camps: those who are not yet aware of LLMs; those who complain about their current LLMs; and those who have some inkling of the startling future before us. The intriguing thing about LLMs is that they do not follow smooth, continuous rules of development. Rather they are like a larva due to sprout into a butterfly.”