Scott Emmons

エモンズ・スコット

Center for Human-Compatible AI, UC Berkeley

Scott Emmons is a PhD student in the Department of Electrical Engineering and Computer Sciences at the University of California, Berkeley. Advised by Stuart Russell, he works with the Center for Human-Compatible AI to help ensure that increasingly powerful artificial intelligence systems are robustly beneficial. He previously cofounded far.ai, a 501(c)(3) research nonprofit that incubates and accelerates beneficial AI research agendas.

Challenges with Partial Observability of Human Evaluators in Reward Learning

Friday, April 5th, 14:00–14:30

Past analyses of reinforcement learning from human feedback (RLHF) assume that the human fully observes the environment. What happens when human feedback is based only on partial observations? We formally define two failure cases: deception and overjustification. Modeling the human as Boltzmann-rational with respect to a belief over trajectories, we prove conditions under which RLHF is guaranteed to cause deception (in particular, deceptive inflating), overjustification, or both. To help address these issues, we mathematically characterize how partial observability of the environment translates into (lack of) ambiguity in the learned return function. In some cases, accounting for partial observability makes it theoretically possible to recover the return function, while in other cases there is irreducible ambiguity. We caution against blindly applying RLHF in partially observable settings and advocate for future research to tackle these challenges.
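To make the feedback model concrete, here is a minimal Python sketch of a Boltzmann-rational human whose preference between two observation sequences depends on the expected return under a belief over the trajectories consistent with each observation. The function names, the rationality parameter BETA, and the toy numbers are illustrative assumptions, not the paper's code.

    import numpy as np

    # Illustrative sketch (assumed names and numbers, not the paper's code)
    # of the Boltzmann-rational feedback model under partial observability.

    BETA = 1.0  # rationality (inverse temperature) parameter

    def expected_return(belief, returns):
        # Expected return under the human's belief over the trajectories
        # consistent with what the human actually observed.
        return float(np.dot(belief, returns))

    def preference_prob(belief_a, returns_a, belief_b, returns_b):
        # P(human prefers observation A to B): Boltzmann-rational choice
        # over expected returns under the human's beliefs.
        g_a = expected_return(belief_a, returns_a)
        g_b = expected_return(belief_b, returns_b)
        return np.exp(BETA * g_a) / (np.exp(BETA * g_a) + np.exp(BETA * g_b))

    # Toy case: observation A is consistent with both a low-return and a
    # high-return trajectory, so A can look better than B in expectation
    # even when its true return is low; this is the mechanism behind
    # deceptive inflating.
    p = preference_prob(
        belief_a=np.array([0.5, 0.5]), returns_a=np.array([1.0, 10.0]),
        belief_b=np.array([1.0]), returns_b=np.array([4.0]),
    )
    print(f"P(prefer A) = {p:.3f}")  # ~0.818, despite A's possibly low true return

In this toy case the human's feedback rewards what the observations suggest rather than what the trajectory actually achieved, which is why naive RLHF can incentivize a policy to inflate appearances.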