Sycophancy and sandbagging 🤔

… they are more likely to learn to act as expected in precisely those circumstances while behaving competently but unexpectedly in others. This can surface in the form of problems that Perez et al. (2022) call sycophancy, where a model answers subjective questions in a way that flatters their user’s stated beliefs, and sandbagging, where models are more likely to endorse common misconceptions when their user appears to be less educated.

Kinda like people, no?

From the same paper I mentioned before: https://cims.nyu.edu/~sbowman/eightthings.pdf
