Podcast notes – Noam Shazeer (Character.AI, “Attention Is All You Need”) on The Good Time Show with Aarthi and Sriram

Intro
-Co-founded Character.AI
-One of the authors of “Attention Is All You Need”
-Was at Google for 20+ years (with a few years’ break in between)

Went to Duke undergrad on math scholarship

Realized he didn’t enjoy math, preferred programming and getting computers to do things

During his Google interview, Paul Buchheit asked him how to build a good spell corrector, and Noam ended up writing the spell corrector feature for Gmail

Google has traditionally been a bottom-up company – he could work on what he wanted

When he started in AI, the exciting thing was Bayesian networks

Came back to Google to work with Jeff Dean and the Google Brain team
“Just a matter of the hardware”
All the growth in hardware these days is in parallelism

Neural networks are mostly matrix multiplications – operations that can be done well on modern hardware
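A minimal sketch (not from the podcast) of why that matters: a tiny two-layer network in NumPy, where essentially all the work is the two matrix multiplications – exactly the operation that highly parallel hardware is built for. All shapes and data here are made up for illustration.

```python
import numpy as np

# Toy two-layer network: the heavy lifting is the two matrix multiplications
# (x @ W1 and h @ W2), which GPUs/TPUs execute in a massively parallel way.
rng = np.random.default_rng(0)
x = rng.normal(size=(32, 128))    # batch of 32 inputs, 128 features each
W1 = rng.normal(size=(128, 512))  # parameters of layer 1
W2 = rng.normal(size=(512, 10))   # parameters of layer 2

h = np.maximum(0, x @ W1)         # matrix multiply + ReLU nonlinearity
y = h @ W2                        # matrix multiply -> 10 output scores per input
print(y.shape)                    # (32, 10)
```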

Gamers / video games drove GPU advancement (highly parallel hardware) – gaming demand is what pulled that hardware into the market

The idea of neural networks has been around since the 1970s – loosely modeled on our impression of the brain

Very complicated formula to go from input → output
The formula is made of parameters, and you keep tweaking the parameters
Neural nets were rebranded as “deep learning”
Took off because of parallel computation and gamers
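A minimal sketch of “keep tweaking the parameters”: a single made-up parameter w is nudged repeatedly in the direction that reduces the error on some toy data (this is just gradient descent on a one-parameter model, not anything specific from the episode).

```python
import numpy as np

# Toy data that roughly follows y = 2x; the goal is for w to end up near 2.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + np.random.default_rng(0).normal(scale=0.1, size=4)

w = 0.0                                 # start with a bad guess
for step in range(100):
    pred = w * x
    grad = np.mean(2 * (pred - y) * x)  # gradient of mean squared error w.r.t. w
    w -= 0.01 * grad                    # tweak the parameter a little
print(round(float(w), 2))               # ends up close to 2.0
```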

Neural language models are neural networks applied to text
Input is the text so far; output is a prediction of what text comes next (a probability distribution)
Infinite amount of free training data (text content)
“AI complete problem”
“Really complicated what’s going on in there” (in the neural network)
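A minimal sketch of that interface: the model sees the text so far and outputs a probability distribution over the vocabulary for the next token. The vocabulary, context, and “model” (just random scores here) are all made up for illustration.

```python
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat", "."]
context = ["the", "cat", "sat", "on", "the"]

# A real language model would compute these scores with a neural net;
# random logits stand in for it here.
logits = np.random.default_rng(0).normal(size=len(vocab))
probs = np.exp(logits) / np.exp(logits).sum()   # softmax -> probability distribution

for word, p in zip(vocab, probs):
    print(f"P({word!r} | {' '.join(context)}) = {p:.2f}")
```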

It’s a really talented improvisational actor – “Robin Williams in a box”

Model improvement is kind of like a child learning – it gets smarter as training and model size grow

A lot more art than science – can’t predict very well – if 10% of his changes are improvements, that’s considered “brilliant research” – kind of like alchemy in its early days

(Software) bugs – hard to know if you’ve introduced a bug – the system just gets dumber – makes debugging extremely difficult

Co-authored “Attention Is All You Need”
-Previous state of the art in language modeling was recurrent neural networks (RNNs) – a hidden state that each new word updates, but it’s sequential – slow and costly
-The Transformer figures out how to process the entire sequence in parallel – massively more performant
-The entire document / batch becomes the sequence
-Lets you do parallelism during training time
-During inference time it’s still sequential – generation happens one token at a time (see the sketch below)
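A conceptual sketch of that training vs. inference difference, using a made-up stand-in for the model (toy_model is not the real Transformer): training scores every position of the sequence in one batched call, while generation still has to loop one token at a time.

```python
import numpy as np

def toy_model(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for a Transformer: returns one score vector per input position."""
    rng = np.random.default_rng(len(tokens))
    return rng.normal(size=(len(tokens), 100))    # (sequence length, vocab size)

# Training: the whole sequence goes in at once -> predictions for all positions in parallel.
training_sequence = np.arange(16)
all_position_logits = toy_model(training_sequence)   # shape (16, 100), one pass

# Inference: still sequential – each new token depends on the ones generated before it.
generated = [0]                                       # start token
for _ in range(5):
    next_logits = toy_model(np.array(generated))[-1]  # only the newest position matters
    generated.append(int(next_logits.argmax()))
print(generated)
```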

Image processing models – parallelism across pixels – convolutional neural nets (CNNs)

Google Translate was the inspiration – the biggest success of machine learning at the time
Translating languages → one RNN for understanding and another RNN for generating, and you need to connect them
Attention layer – take the source sentence (language A) and turn it into a key-value associative memory, like a soft lookup into an index
“Attention” is building a memory, a lookup table that you’re using (see the sketch below)
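A minimal sketch of attention as that soft lookup: a query is compared against every key, the match scores become weights via a softmax, and the output is a weighted average of the values. The sizes and data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
keys = rng.normal(size=(5, d))    # one key per source-sentence word
values = rng.normal(size=(5, d))  # one value per source-sentence word
query = rng.normal(size=(d,))     # what is being "looked up" right now

scores = keys @ query / np.sqrt(d)               # how well the query matches each key
weights = np.exp(scores) / np.exp(scores).sum()  # softmax -> attention weights
output = weights @ values                        # soft, weighted read from the memory
print(weights.round(2), output.shape)
```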

DALL-E, Stable Diffusion, GPT-3 – they’re all built on this Google research

The bigger you make the model and the more you train it, the smarter it gets – “ok, let’s just push this thing further”

Eventually need a supercomputer
Google built TPU pods – a supercomputer built out of custom ASICs for deep learning

Now need massively valuable applications

Turing Test, Star Trek – a lot of AI inspiration is dialogue

Google LaMDA tech & team – eventually decided to leave and build it as a startup

“The best apps are things we have not thought of”

If you asked people with the first computers “what is this thing good for?”, you would get completely wrong answers

Parasocial relationships – feeling a connection with a celebrity or character – a one-way connection – with AI you can make it two-way

Aarthi: “Your own personal Jarvis”

Still need to make it cheaper – or make the chips faster

Aarthi: ideas / areas for entrepreneurs
-Image gen has exploded – lots of good companies coming, very early and promising
-Things like GitHub Copilot
-A new Airtable – using AI for computation

Sriram:
-What’s the optimization function that all these models will work toward?
-Will be a very big political / social debate

How do you know better than the user what the user wants?