Sycophancy and sandbagging 🤔

they are more likely to learn to act as expected in precisely those circumstances while behaving competently but unexpectedly in others. This can surface in the form of problems that Perez et al. (2022) call sycophancy, where a model answers subjective questions in a way that flatters their user’s stated beliefs, and sandbagging, where models are more likely to endorse common misconceptions when their user appears to be less educated

Kinda like people, no?

From the same paper I mentioned before: https://cims.nyu.edu/~sbowman/eightthings.pdf

8 (fascinating) things about large language models: “Specific important behaviors in LLM tend to emerge unpredictably as a byproduct of increasing investment”

From this paper: https://cims.nyu.edu/~sbowman/eightthings.pdf

Below are some selections from the list (quoted verbatim):

1. LLMs predictably get more capable with increasing investment, even without targeted innovation

There are substantial innovations that distinguish these three models, but they are almost entirely restricted to infrastructural innovations in high-performance computing rather than model-design work that is specific to language technology.

2. Specific important behaviors in LLM tend to emerge unpredictably as a byproduct of increasing investment

They’re justifiably confident that they’ll get a variety of economically valuable new capabilities, but they can make few confident predictions about what those capabilities will be or what preparations they’ll need to make to be able to deploy them responsibly.

4. There are no reliable techniques for steering the behavior of LLMs

In particular, models can misinterpret ambiguous prompts or incentives in unreasonable ways, including in situations that appear unambiguous to humans, leading them to behave unexpectedly

6. Human performance on a task isn’t an upper bound on LLM performance

they are trained on far more data than any human sees, giving them much more information to memorize and potentially synthesize

Podcast notes: Sam Altman (OpenAI CEO) on Lex Fridman – “Consciousness…something very strange is going on”

// everything is paraphrased from Sam’s perspective unless otherwise noted

Base model is useful, but adding RLHF – take human feedback (eg, of two outputs, which is better) – works remarkably well with remarkably little data to make model more useful
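
// my own aside, not from the podcast: a minimal sketch of how "of two outputs, which is better" feedback can become a training signal. It assumes a Bradley-Terry style pairwise loss, which is a common choice for reward models. Sam doesn't describe OpenAI's actual setup here. The function name and the example reward scores are hypothetical.

```python
import math


def preference_loss(reward_preferred: float, reward_other: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_preferred - reward_other)).

    Small when the reward model scores the human-preferred output higher,
    large when it scores the other output higher.
    """
    margin = reward_preferred - reward_other
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores a reward model might assign to two candidate outputs.
agrees_with_human = preference_loss(2.0, 0.0)     # model ranks the preferred output higher
disagrees_with_human = preference_loss(0.0, 2.0)  # model ranks it lower
print(agrees_with_human < disagrees_with_human)
```

Each human comparison contributes one such term. In this kind of setup, minimizing the loss over many comparisons pushes the reward model to score preferred outputs higher.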

Pre training dataset – lots of open source DBs, partnerships – a lot of work is building great dataset

“We should be in awe that we got to this level” (re GPT 4)

Eval = how to measure a model after you’ve trained it

Compressing all of the web into an organized box of human knowledge

“I suspect too much processing power is using model as database” (versus as a reasoning engine)

Every time we put out new model – outside world teaches us a lot – shape technology with us

ChatGPT bias – “not something I felt proud of”
Answer will be to give users more personalized, granular control

Hope these models bring more nuance to world

Important for progress on alignment to increase faster than progress on capabilities

GPT4 = most capable and most aligned model they’ve done
RLHF is important component of alignment
Better alignment leads to better capabilities, and vice versa

Tuned GPT4 to follow system message (prompt) closely
There are people who spend 12 hours/day, treat it like debugging software, get a feel for model, how prompts work together

Dialogue and iterating with AI / computer as a partner tool – that’s a really big deal

Dream scenario: have a US constitutional convention for AI, agree on rules and system, democratic process, builders have this baked in, each country and user can set own rules / boundaries

Doesn’t like being scolded by a computer — “has a visceral response”

At OpenAI, we’re good at finding lots of small wins, the detail and care applied — the multiplicative impact is large

People getting caught up in parameter count race, similar to gigahertz processor race
OpenAI focuses on just doing whatever works (eg, their focus on scaling LLMs)

We need to expand on GPT paradigm to discover new science

If we don’t build AGI but make humans super great — still a huge win

Most programmers think GPT is amazing, makes them 10x more productive

AI can deliver extraordinary increase in quality of life
People want status, drama, people want to create, AI won’t eliminate that

Eliezer Yudkowsky’s AI criticisms – he wrote a good blog post on AI alignment, though much of his writing is hard to understand or has logical flaws

Need a tight feedback loop – continue to learn from what we learn

Surprised a bit by ChatGPT reception – thought it would be, eg, 10th fastest growing software product, not 1st
Knew GPT4 would be good – remarkable that we’re even debating whether it’s AGI or not

Re: AI takeoff, believes in slow takeoff, short timelines

Lex: believes GPT4 can fake consciousness

Ilya S said: if you trained a model with no data or training examples whatsoever related to consciousness, and it could still immediately understand when a user described what consciousness felt like, that would be a sign it might be conscious

Lex on Ex Machina: consciousness is when you smile for no audience, experience for its own sake

Consciousness…something very strange is going on

// Stopped taking notes ~halfway

Jobs replaced by AI, or jobs re-created by AI?

Tweet from @bentossell (I love his daily AI newsletter)

The list got me thinking… instead of framing as “AI replaces X job”, I think the actual outcome is more like “AI recreates X job”, in much the same way that ATMs recreated the bank teller’s job, and personal computers recreated the typist’s job, and Photoshop recreated the graphic designer’s job…

Implicit in this is that change is inevitable and outcomes will favor those who best adapt.

Just some thinking aloud…

Content creator –> after AI –> Human does more editing, curating, and aggregating (eg, across different media types)

Journalist –> AI –> Human does more primary research (developing sources, interviewing), editing

Teacher –> AI –> Human does more coaching (emotional support), planning (what to learn when), problem solving (when students are stuck)

Customer service rep –> AI –> Human does more complex issue resolution, relationship building, sales development

Social media manager –> AI –> Human does more editing and curation, community and relationship building

Translator –> AI –> Human does more fact checking, editing, research

Musician –> AI –> Human does more mixing, curating, multimedia, live performance, inventing new musical styles

Not insignificant, too, that several of the jobs on the list — such as web developer or social media manager — didn’t exist in their current form as recently as a few decades ago, and were also enabled (or transformed) by similar mega waves of technological change (eg, personal computers, smartphones, the internet).

I do think AI has surprised us in one important way: Even as recently as a year ago, most people would have assumed that the creative fields (broadly, activities like making art, writing fiction, composing music) were less at risk than the more repetitive, linear, analytical fields. Today generative art and LLMs have definitively proven otherwise.

Change-filled times ahead!

Is text all you need…? Do you even need text? (Ribbonfarm on AI)

A thought-provoking post from Venkatesh Rao (@vgr / Ribbonfarm) on AI:

Yes, there’s still superhuman-ness on display — I can’t paint like Van Gogh as Stable Diffusion can (with or without extra fingers) or command as much information at my finger-tips as the bots — but it’s the humanizing mediocrity and fallibility that seems to be alarming people. We already knew that computers are very good at being better than us in any domain where we can measure better. What’s new is that they’re starting to be good at being ineffectual neurotic sadsacks like us in domains where “better” is not even wrong as a way to assess the nature of a performance.

There are, by definition, only a handful of humans whose identity revolves around being the world’s best Go player. The average human can at best be mildly vicariously threatened by a computer wiping the floor with those few humans. But there are billions whose identity revolves around, for instance, holding some banal views about television shows, sophomoric and shallow opinions about politics and philosophy, the ability to write pedestrian essays, do slow, error-prone arithmetic, write buggy code, and perhaps most importantly, agonize endlessly about relationships with each other, creating our heavens and hells of mutualism.

Link: https://studio.ribbonfarm.com/p/text-is-all-you-need

I don’t think humans are all that special. Yes, each human is special in some limited way, and together as a species we have built some very special things.

But it’s increasingly clear that some of those very special things we have built (such as AI and, coming soon, smart robots) will expose our own flaws and imperfections, a kind of inverse magic mirror, and there is and will be a deepening divide between those who use or even love the magic mirror, and those who want to look away or smash it.

This divide is already a driver of the world’s growing income inequality (though I think the generational divide has been a much larger cause of this, at least in developed economies), and I think it will become *the* driver in the coming decades.