Oliver Klingefjord

クリンゲフョード・オリバー

Meaning Alignment Institute

Oliver Klingefjord is a co-founder of the Meaning Alignment Institute, a research organization that aims to ensure human flourishing in the age of AGI by applying theories of values and moral learning from philosophy and sociology to AI alignment.

What are human values, and how do we align to them?

Friday, April 5th, 16:20–16:50

There is an emerging consensus that we need to align AI systems with human values (Gabriel, 2020; Ji et al., 2024), but there is very little work on what that means and how we actually do it. We split the problem of “aligning to human values” into three parts: first, eliciting values from people; second, reconciling those values into an “alignment target” for training ML models; and third, actually training the model. In this paper, we focus on the first two parts, and ask the question: what makes for a “good” target when aligning to human values? To answer this question, we first define a set of 6 criteria that we believe must be satisfied for an alignment target to shape model behavior in accordance with human values. We then propose a process for eliciting and reconciling values called Moral Graph Elicitation (MGE), which uses a large language model to interview participants about their values in particular contexts; our approach is inspired by the philosophy of values advanced by Taylor (1977), Chang (2004a), and others. We trial MGE with a representative sample of 500 Americans, on 3 intentionally divisive prompts (e.g. advice about abortion). Our results provide evidence that MGE produces an alignment target that meets each of our 6 criteria. For example, almost all participants (98.7%) felt well represented by the process, and most (89%) thought the final moral graph was fair, even if their value wasn’t voted as the wisest. Our process often results in “expert” values (e.g. values from women who have solicited abortion advice) rising to the top of the moral graph, without defining who is considered an expert in advance.