Book: The Alignment Problem

Excerpts: “How to prevent such a catastrophic divergence–how to ensure that these models capture our norms and values, understand what we mean or intend, and, above all, do what we want–has emerged as one of the most central and most urgent scientific questions in the field of computer science. It has a name: the alignment problem” (p. 13).

“We don’t use the flashing lights on the screen as data that we can leverage to gain ‘real’ rewards in that environment. The flashing lights, and whatever reaction they provoke in us, are all the reward there is” (p. 203).

“Here diverge two different schools of moral thought: ‘possibilism’–the view that one should do the best possible thing in every situation–versus ‘actualism’–the view that one should do the best thing at the moment, given what will actually happen later (whether because of your own later deeds or some other reason)” (p. 236).

“A third fundamental challenge with imitation is that if one’s primary objective is to imitate the teacher, it will be hard to surpass them” (p. 240).
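
A minimal sketch (my illustration, not from the book) of why pure imitation caps out at the teacher's level: the objective rewards agreement with the teacher rather than success at the task, so the teacher's mistakes are faithfully reproduced. The two-armed bandit, its reward values, and the 80% teacher accuracy are all made-up numbers.

```python
# Illustrative sketch (not from the book): pure imitation caps performance
# at the teacher's level, because the objective is "agree with the teacher",
# not "maximize reward". Toy two-armed bandit; all numbers hypothetical.
import random

random.seed(0)

ARM_REWARD = {0: 0.2, 1: 1.0}          # arm 1 is strictly better

def teacher_policy():
    # An imperfect teacher: picks the good arm only 80% of the time.
    return 1 if random.random() < 0.8 else 0

# Behavioral cloning reduced to its essence: estimate the teacher's
# action distribution from demonstrations and sample from it.
demos = [teacher_policy() for _ in range(10_000)]
p_arm1 = sum(demos) / len(demos)

def clone_policy():
    return 1 if random.random() < p_arm1 else 0

def expected_reward(policy, n=10_000):
    return sum(ARM_REWARD[policy()] for _ in range(n)) / n

print("teacher:", expected_reward(teacher_policy))   # ~0.84
print("clone:  ", expected_reward(clone_policy))     # ~0.84, not 1.0
```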

“Some, for instance, worry that humans aren’t a particularly good source of moral authority. ‘We’ve talked a lot about the problem of infusing human values into machines,’ says Google’s Blaise Agüera y Arcas. ‘I actually don’t think that that’s the main problem. I think that the problem is that human values as they stand don’t cut it. They’re not good enough'” (p. 247).

“The other problem, though, apart from the lack of a ‘none of the above’ answer, is that not only do these models have to guess an existing label, they are alarmingly confident in doing so” (pp. 281–282).
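
A minimal sketch (my illustration, not from the book) of the missing ‘none of the above’ option: a softmax over a fixed label menu must spread its probability across those labels and nothing else, so even an input unlike anything the model was trained on still receives a confident-looking prediction. The label set and logit values are hypothetical.

```python
# Illustrative sketch (not from the book): a softmax over a fixed label menu
# has no "none of the above" option, so even an out-of-distribution input
# still gets a confident-looking top label. All values are made up.
import math

LABELS = ["cat", "dog", "horse"]        # hypothetical closed label set

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Logits for an input that matches none of the classes: the probabilities
# still must sum to 1 across the fixed menu.
logits = [4.1, 1.0, 0.3]
probs = softmax(logits)
top = max(zip(LABELS, probs), key=lambda kv: kv[1])
print(dict(zip(LABELS, [round(p, 3) for p in probs])))
print(f"predicted: {top[0]} with confidence {top[1]:.0%}")  # ~94% 'cat'
```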

“This isn’t always bad: if the system makes a mess of some kind, we probably want it to clean up after itself. But sometimes these ‘offsetting’ actions are problematic. We don’t want a system that cures someone’s fatal illness but then–to nullify the high impact of the cure–kills them” (p. 292).
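
A minimal sketch (my illustration, not from the book) of the offsetting problem: if low impact is enforced by penalizing any deviation from the do-nothing baseline, then undoing a beneficial effect removes the penalty, and the ‘cure then kill’ trajectory can outscore ‘cure’ alone. The actions, rewards, state encoding, and penalty weight are all made up.

```python
# Illustrative sketch (not from the book): a naive low-impact objective that
# penalizes any deviation from the "do nothing" baseline can reward
# offsetting, i.e. undoing the very effect we wanted. All numbers made up.

TASK_REWARD = {"cure": 10.0, "kill": 0.0, "noop": 0.0}
PENALTY_WEIGHT = 20.0                    # how strongly impact is penalized

def final_state(actions):
    # 1 = patient alive and healthy, 0 = not; the do-nothing baseline is 0.
    alive = 0
    for a in actions:
        if a == "cure":
            alive = 1
        elif a == "kill":
            alive = 0
    return alive

def score(actions, baseline_state=0):
    task = sum(TASK_REWARD[a] for a in actions)
    impact = abs(final_state(actions) - baseline_state)
    return task - PENALTY_WEIGHT * impact

print("cure only:     ", score(["cure"]))           # 10 - 20 = -10
print("cure then kill:", score(["cure", "kill"]))   # 10 -  0 =  10
```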

“It didn’t fit the mold of the perennial concerns in ethical philosophy: ‘What is the right thing to do, given some moral standard?’ and ‘What standard should we use to determine the right thing to do?’ This was subtly but strikingly different. It was ‘What is the right thing to do when you don’t know the right thing to do?’” (p. 304).

“Further, with any particular simple model, we may well ask where the ‘menu’ of possible features came from, not to mention what human process drove the desiderata and the creation of the tool in the first place” (p. 319).

“In fact, cognitive scientists such as Hugo Mercier and Dan Sperber have recently argued that the human capacity for reasoning evolved not because it helped us make better decisions and hold more accurate beliefs about the world but, rather, because it helped us win arguments and persuade others” (p. 319).

“Humans likewise understand that, in the words of the mindfulness teacher Jon Kabat-Zinn, ‘wherever you go, there you are,’ whereas RL agents typically don’t think of themselves as part of the world they’re modeling” (p. 321).

“University of Louisville computer scientist Roman Yampolskiy concurs, stressing, ‘We as humanity do not agree on common values, and even parts we do agree on change with time’” (p. 324).

“We are in danger of losing control of the world not to AI or to machines as such but to models. To formal, often numerical specifications for what exists and for what we want” (p. 325).

“In the National Transportation Safety Board review of the self-driving Uber car that killed pedestrian Elaine Herzberg in Tempe, Arizona, in 2018, the analysis reveals that the ‘system never classified her as a pedestrian…because she was crossing…without a crosswalk; the system design did not include a consideration for jaywalking pedestrians'” (p. 327).

Christian, Brian (2020). The Alignment Problem: Machine Learning and Human Values. New York: W.W. Norton & Company, Inc.