The Alignment Problem: Machine Learning and Human Values by Brian Christian


The Alignment Problem: Machine Learning and Human Values
Title : The Alignment Problem: Machine Learning and Human Values
Author :
Rating :
ISBN : 9780393635829
ISBN-10 : 0393635821
Format Type : Hardcover
Number of Pages : 496
Publication : First published October 6, 2020
Awards : Los Angeles Times Book Prize Science & Technology (2020)

A jaw-dropping exploration of everything that goes wrong when we build AI systems and the movement to fix them.

Today’s "machine-learning" systems, trained by data, are so effective that we’ve invited them to see and hear for us—and to make decisions on our behalf. But alarm bells are ringing. Recent years have seen an eruption of concern as the field of machine learning advances. When the systems we attempt to teach will not, in the end, do what we want or what we expect, ethical and potentially existential risks emerge. Researchers call this the alignment problem.

Systems cull résumés until, years later, we discover that they have inherent gender biases. Algorithms decide bail and parole—and appear to assess Black and white defendants differently. We can no longer assume that our mortgage application, or even our medical tests, will be seen by human eyes. And as autonomous vehicles share our streets, we are increasingly putting our lives in their hands.

The mathematical and computational models driving these changes range in complexity from something that can fit on a spreadsheet to a complex system that might credibly be called “artificial intelligence.” They are steadily replacing both human judgment and explicitly programmed software.

In best-selling author Brian Christian’s riveting account, we meet the alignment problem’s “first-responders,” and learn their ambitious plan to solve it before our hands are completely off the wheel. In a masterful blend of history and on-the-ground reporting, Christian traces the explosive growth in the field of machine learning and surveys its current, sprawling frontier. Readers encounter a discipline finding its legs amid exhilarating and sometimes terrifying progress. Whether they—and we—succeed or fail in solving the alignment problem will be a defining human story.

The Alignment Problem offers an unflinching reckoning with humanity’s biases and blind spots, our own unstated assumptions and often contradictory goals. A dazzlingly interdisciplinary work, it takes a hard look not only at our technology but at our culture—and finds a story by turns harrowing and hopeful.


The Alignment Problem: Machine Learning and Human Values Reviews


  • David Rubenstein

    The biggest problem in artificial intelligence (AI) is to devise a reward function that gives you the behavior you want, while avoiding side effects or unforeseen consequences. This book examines the alignment problem from a number of fascinating perspectives.

    This is a fascinating book, full of the implications of AI for philosophy, sociology, and psychology. The interactions between AI and these fields run in both directions: our understanding of psychology helps to improve AI in numerous ways, and AI gives researchers valuable insights into psychology and into issues in sociology. After all, we want automated algorithms to be unbiased, to be fair. But who is to say exactly what is fair? Sometimes the answer isn't easy.

    The first problem, well known to workers in AI, is the inherent bias due to small training datasets. AI algorithms demonstrate bias, and can subtly perpetuate it. It seems like many of the biases are not the fault of the algorithms, but instead are a mirror of society and culture. In the 1950s, people tried to predict, using punch-card machines, which prisoners would succeed on parole. ProPublica conducted a study of the accuracy of COMPAS (Correctional Offender Management Profiling for Alternative Sanctions). COMPAS is used to predict whether an inmate, if released, would commit a violent or a nonviolent crime within 1-3 years. The algorithm was found to be biased against blacks; it overpredicts recidivism among blacks, and underpredicts for whites. A key factor is that it actually does not predict whether a released prisoner would commit a crime. It really predicts whether a released prisoner would be arrested and convicted for a crime. Higher rates of police profiling of blacks lead to an inherent bias.

    There is a US antidiscrimination law that prohibits certain attributes--like race and gender--from being used in machine-learning models for hiring, criminal detentions, and so on. Nevertheless, other unprotected variables are correlated with race and gender, so the algorithms can still be discriminatory. In addition, blocking these attributes can prevent us from even measuring or mitigating the discrimination!

    Predicting whether or not a patient with pneumonia should be hospitalized as an inpatient is problematic. Models predict that if a patient has chest pain, heart disease, or asthma, or is over 100, then the patient is less likely to die! The reason is that patients with these conditions automatically receive more care, so they are less likely to die.

    Many problems in AI are solved by looking at psychology. For example, BF Skinner taught a pigeon how to bowl in a miniature alley through incremental steps. This led researchers to teach an algorithm to play difficult video games by rewarding incremental steps. Basically, great video games train you how to play. Similarly, neural networks learn language translation by starting with simple sentences before graduating to more difficult ones. This approach is similar to language learning by children. The book Bobby Fischer Teaches Chess uses a similar approach.

    AI is not just about automating tasks; it is also about better understanding human psychology. How can we best train ourselves? First, we should use sparse rewards. Second, we should incentivize a state, not an action. In real life, we can use gamification as an approach to reinforcement learning. Studies of toddlers show that toys that seem to violate the laws of physics are the most novel and hold children's interest the longest. Infants use violations of prior expectations as special opportunities for learning.

    Psychologists have studied overimitation in children and chimpanzees. People learning a new task will learn best through imitation. Sometimes we imitate behaviors that are not relevant to a task. A toddler might overimitate if he cannot figure out why an adult is doing something, so he does it too. As it turns out, chimpanzees do not purposely overimitate. But children can understand whether an adult is teaching or simply experimenting. If an adult is experimenting, the child does not overimitate.

    A fascinating chapter on imitation describes the problems encountered by the first researchers in autonomous driving. Teaching an autonomous car in a video game to drive with imitation is best done by randomly alternating between human and machine drivers.

    This book is fascinating on many levels. But it is not always an easy read. Some of the concepts are difficult, even subtle. It is such a pleasure to read a well-researched book that plumbs the depths of a complicated subject.

  • Dan Elton

    A well researched book on AI safety written to be enjoyed by experts and newbies alike!

    This book is the culmination of *four years* of dedicated work and interviews with over 100 world-class experts. The brilliant thing about this book is that it is so information dense and full of interesting anecdotes that people of any level of expertise stand to gain something from it. Christian has carefully tuned it so a wide variety of people can enjoy it without getting bored or overwhelmed.

    This book covers the well-known problems of bias and brittleness in machine learning, including some famous cases: Richard Caruana’s example of a pneumonia triage system that went haywire, the COMPAS parole recommendation system, the Google Photos “gorilla” tag fiasco, word2vec gender bias, and the 2018 fatal Uber car crash in Tempe, Arizona. You’d be mistaken to think of this as just another book warning about data bias, lack of robustness, and the potential for discrimination and the perpetuation of inequalities, however.

    Sprinkled between the warnings and calls for action are remarkably clear descriptions of modern machine learning techniques and how they relate and/or were inspired by recent developments in neuroscience, cognitive science, developmental psychology, and the social sciences. The author dives into the nitty gritty of how present day AI systems work and does not shy away from explaining current technical challenges.

    The way he explains reinforcement learning and links it to research on dopamine in the brain was one of the highlights of the book for me (I had forgotten how dopamine was linked to temporal difference error, and his account of the history of dopamine research was fascinating). Not all of the concepts were new to me, but in every case the way he explained each concept felt fresh and was wonderful to read. I learned new concepts too. For instance, I never understood the difference between “on policy” and “off policy” RL systems until I read his explanation. Other concepts I picked up were “cooperative reinforcement learning”, “shaping”, and various “impact metrics”. If you haven’t heard of these terms and are interested in AI safety, I heartily recommend this book.
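    (For readers curious how that on-policy/off-policy distinction looks concretely, here is a minimal sketch, not from the book: Q-learning is off-policy because its update bootstraps from the best possible next action, while SARSA is on-policy because it bootstraps from the action the current, possibly exploratory, policy actually took.)

    ```python
    # Minimal sketch of the off-policy vs. on-policy distinction (an illustration,
    # not code from the book). Both rules update Q[s][a] from experience, but they
    # bootstrap from different "next" values.
    GAMMA, ALPHA = 0.99, 0.1

    def q_learning_update(Q, s, a, r, s_next):
        # Off-policy (Q-learning): use the best action available in s_next,
        # regardless of what the behaviour policy will actually do next.
        target = r + GAMMA * max(Q[s_next].values())
        Q[s][a] += ALPHA * (target - Q[s][a])

    def sarsa_update(Q, s, a, r, s_next, a_next):
        # On-policy (SARSA): use the action the current policy actually took next.
        target = r + GAMMA * Q[s_next][a_next]
        Q[s][a] += ALPHA * (target - Q[s][a])
    ```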

    This book follows a trend, visible since the publication of Nick Bostrom’s 2014 meditation on far-future AI, “Superintelligence,” of seamlessly linking near-term and long-term AI safety concerns. The book is very “down to earth” -- you may be surprised that the standard arguments about why we should be concerned about long-term AI risk that we’ve heard from Elon Musk, Sam Harris, etc. are largely absent from this book (most notoriously, the “paperclip maximizer”). This is refreshing, because those arguments draw on assumptions (such as fast takeoff) which are very hard to defend with empirical data or the current science of AI. (I still find those arguments convincing enough to warrant serious investment of resources to prevent risk, but they aren’t necessarily the best first arguments to present to someone.) Instead the author follows an ingenious strategy: he starts with current problems in AI and some near-future concerns (for instance, driverless cars driving off the road or home robots that refuse to be turned off). Then, by providing sufficient technical background, he proceeds to explain why these are really hard problems, some of the solutions that are being worked on, and the limitations of the solutions proposed so far. The book is cautiously optimistic, showing how meaningful progress on the alignment problem is already occurring. So far the problems with AI that we are encountering *right now* appear tractable, which should motivate more people and resources to flow into AI safety rather than trying to regulate progress to a standstill, which is impossible and likely to be harmful. At the same time, however, by the end of the book the reader will have a deep appreciation of the challenges ahead and the need for extreme caution as we move towards more and more intelligent and powerful AI.

  • aPriL does feral sometimes

    'The Alignment Problem: Machine Learning and Human Values' by Brian Christian is a very interesting overview of the issues in developing useful computing machines. I found it very comprehensive and yet easy to understand. However, it does give me pause about any fantasy I may have had of the Singularity occurring.

    The main goal of machine learning is teaching the computer to see, hear and do things without human oversight, to learn to categorize and make inferences about inputs the way humans do, and to perform a job on those inputs much as the human brain would. As for the amount and types of inputs necessary to think like a human being - well, ok, computers cannot actually be fed enough inputs, because of severe limitations in current hardware. Typically, inputs have to be identified first by an actual human, too, i.e., this is a cat, this is a shadow, this is a dress. Software has to be upgraded to make inferences, judgements, decisions. Which is why scientists are exploring machine learning instead. The computer will teach itself about what/who/why/where by identifying the inputs without help, and will perform human-like brain processing on those inputs. Theoretically.

    Toddlers can do the job of learning about their environment and how to do social interaction (starting with what that is) and how to do a job and figure out actions and activities more quickly and comprehensively than any computer. Quantum computers might be the only hope of a computer thinking as well as a toddler. Meanwhile, computer scientists are making do with inventing new ways of programming machine learning on the computers we have today. The answer is having the computer program itself after starting with minimal basic programming.


    I have copied the book blurb as it is accurate:

    "Today’s “machine-learning” systems, trained by data, are so effective that we’ve invited them to see and hear for us—and to make decisions on our behalf. But alarm bells are ringing. Recent years have seen an eruption of concern as the field of machine learning advances. When the systems we attempt to teach will not, in the end, do what we want or what we expect, ethical and potentially existential risks emerge. Researchers call this the alignment problem.

    Systems cull résumés until, years later, we discover that they have inherent gender biases. Algorithms decide bail and parole—and appear to assess Black and White defendants differently. We can no longer assume that our mortgage application, or even our medical tests, will be seen by human eyes. And as autonomous vehicles share our streets, we are increasingly putting our lives in their hands.

    The mathematical and computational models driving these changes range in complexity from something that can fit on a spreadsheet to a complex system that might credibly be called “artificial intelligence.” They are steadily replacing both human judgment and explicitly programmed software.

    In best-selling author Brian Christian’s riveting account, we meet the alignment problem’s “first-responders,” and learn their ambitious plan to solve it before our hands are completely off the wheel. In a masterful blend of history and on-the-ground reporting, Christian traces the explosive growth in the field of machine learning and surveys its current, sprawling frontier. Readers encounter a discipline finding its legs amid exhilarating and sometimes terrifying progress. Whether they—and we—succeed or fail in solving the alignment problem will be a defining human story.

    The Alignment Problem offers an unflinching reckoning with humanity’s biases and blind spots, our own unstated assumptions and often contradictory goals. A dazzlingly interdisciplinary work, it takes a hard look not only at our technology but at our culture—and finds a story by turns harrowing and hopeful."



    Computer scientists and mathematicians are trying to get computers not only to be useful at doing repetitive tasks that bore people, and to do work more quickly, but to be useful the same way a human brain is useful.

    One of the first concepts I learned in studying programming thirty years ago is "Garbage In, Garbage Out." As I turned the last page of 'The Alignment Problem' I realized that that was still true of inputs. However, machine learning has added more garbage, as in output 💩.

    The book shows how computer scientists have become more cognizant that simple if-then-else modules won't do at all. For the last 70 years, the needle has moved from programming the computers to do everything by an explicitly created program for a job, to programming computers to "teach" themselves how to do a job, like that of driving a car, or flying an airplane, or face recognition, or mortgage and job applicant assessments, or judging if a convicted offender will reoffend, etc. It is too difficult to program a computer with everything necessary to perform a complex job like the ones I mentioned. But after reading this book, I think teaching a computer to teach itself is very difficult too. It amplifies our own biases, for one example, as explained in this book.

    Think about gender and race discrimination. It's not the programmers' fault computers are racists and misogynists. If most of the professional photos programmers input into computers are of white males, or of white males performing a job, like being a doctor or a scientist or a plumber, the computer will 'learn' scientists and doctors and plumbers are all white males - an obvious conclusion to a computer. Most professional photos of many workers in the professions ARE of white males, including politicians.

    First, as described in the book, most of the computer scientists didn't see the issue of discrimination at all as the computer worked (problem one). When it was pointed out, they realized the self-teaching computer was a "black box" - they didn't know WHY it was teaching itself that only white males were "good" for whatever the job was (problem two). The computer was teaching itself as it had been programmed to do, and however it was doing so had become an invisible process, with the scientists out of the loop of whatever the computer was doing to do the job (problem three).

    Another issue with photos is that until recently cameras were calibrated with a photo of a blue-eyed blonde girl. ALL CAMERAS. Darker skin colors were completely ignored by manufacturers of cameras. The history of this is described in the book.

    Another issue with self-teaching computers is that they clearly got the impression that black people who've been in prison are sure to return to prison, based on the statistics the computer was fed. Not only was the computer 'unaware' of segregated black and white neighborhoods, it didn't know that black neighborhoods generally have a hell of a lot more police officers policing them and arresting black people far more often than in white neighborhoods (white people have a lot fewer police policing them). Computers do not know about any of the other systemic issues - black people getting arrested for walking or driving simply because they are black, etc. A lot of black people get arrested and rearrested - that's all the computer knows. Once scientists became aware of how the computer was teaching itself from its inputs, they then had a new problem - how to fix it?

    Programming the computer to be blind to race and gender will not work, either. For example, women who have nine-month gaps in their work histories will be labeled as terrible employees unless there is a gender tag and the computer is given instructions to ignore such gaps in women's employment applications.

    But in trying to resolve race and gender issues, a lot of ethical and political social issues come up - fairness is hard to program into software when we humans can't get it right in the real world.

    Since computers were being taught to teach themselves, how were they coming up with their answers? What were they 'looking' at? This was often hard to discover, because once the computer began to teach itself it was a black box. But eventually programmers were sometimes able to figure it out through trial and error. For example, in one case, programmers were distressed to find the computer had decided shadows on the ground were more important than other objects in a photo, so it was giving answers based on the shadows. Or it was looking at measurement rulers as a key element in photos, because some photos had a ruler next to the object that the computer was supposed to be looking at. If the photo had a ruler, it was good, regardless of the object it had been intended to judge and regardless of any other factors.

    Computers have been giving erroneous answers to questions people thought they were answering correctly, and people didn't know they were outputting crap. These computers had taught themselves, using the beginning algorithms they had been programmed with, and were coming up with completely off-the-wall outputs. Some of these programs are still being used by many companies and government agencies and police departments today.

    Christian is much more scientific and circumspect than me, gentle reader. My own outrage colors my review. Christian writes like the educated scientist he is.

    From his Goodreads bio:

    "Born in Wilmington, Delaware, Christian holds degrees in philosophy, computer science, and poetry from Brown University and the University of Washington. A Visiting Scholar at the University of California, Berkeley, the Director of Technology at McSweeney’s Publishing, and an active open-source contributor to projects such as Ruby on Rails, he lives in San Francisco."

    To know what is necessary to train a computer to use the same skillset we humans have, it has become necessary to involve specialists in psychology, sociology and philosophy to describe what skills we humans have in our braincases. The book includes the work of psychologists' tests on babies and toddlers that show some of the ways the human brain functions. Philosophers are necessary because of the issues of morality. Sociologists are necessary to explain as best they can the how and why of human behavior. These parts of the chapters are as fascinating as those describing how scientists are translating the art of being human to a computer!

    So. Ok, then. Computer scientists are translating the work of psychologists, philosophers and sociologists on how the brain learns and other behaviors of people into machine-learning programs. This means a lot of what computer scientists are doing is translating biochemical brain responses (dopamine, serotonin) and electrical neuron-signaling into math. This is described in the book.

    Machine learning is basically about the computer "earning" a +1 if it does good, or a -1 if it effs up - "rewards" and "demerits". This requires telling the computer the parameters for earning a +1 or a -1. And of course, when, or if, to stop.

    There are, and were, a lot of funny outcomes due to the programmers' inability to foresee everything a computer needed as inputs to 'think', as well as the learning a computer had to do for itself to resolve a problem. Algorithms have had to change from checking and working with every inputted detail, into being told to look for a more generalized thing and being guided by earning a +1 if they got a solution that was right or a -1 if they got it wrong. For example, finding a photo of a bicycle out of many photos of many objects without being told "this is a photo of a bicycle".

    The chapters on game playing, which is a matter of earning points, had some hilarious outcomes because programmers neglected to program what winning the game meant. Instead, computers went into loops that never ended in order to rack up points forever! +1, +1, +1, ....

    There were other amazing challenges computer programmers conquered in teaching a computer to teach itself how to win at games, too. The book tells the story of computers winning over real human players at chess, Go, and even the Super Mario video games.

    My conclusions? I sincerely think the answer to when a computer will 'feel happy' or have any feelings is basically: it will never happen. How would we program that? We don't even know exactly what the boundaries of Life are, much less how being alive starts. Secondly, a computer is only as accurate as its inputs - garbage in, garbage out. However, today, it's also about how it has 'taught' itself - the machine's IQ.

    Omg.

    The book has extensive Acknowledgements, Notes, Bibliography and Index sections - over a hundred pages for these sections! I recommend 'The Alignment Problem', but I think nerds will enjoy it most.

  • Krzysztof

    There is a great book trapped inside this good book, waiting for a skillful editor to carve it out. The author did vast research in multiple domains, and it seems like he could neither build a cohesive narrative that could connect all of it nor leave anything out.

    This book is probably the best intro to the machine learning space for a non-engineer that I've read. It presents its history, challenges, what can be done, and what can't be done (yet). It's both accessible and substantive, presenting complex ideas in a digestible form without dumbing them down. If you want to spark an interest in ML in anyone who hasn't been paying attention to this field, give them this book. It provides a wide background connecting ML to neuroscience, cognitive science, psychology, ethics, and behavioral economics that will blow their mind.

    It's also very detailed, screaming at the reader "I did the research, I went where no one else dared to go!". It will not only present you with an intriguing ML concept but also: trace its roots to a nineteenth-century farming problem or a biology breakthrough, present all the scientists contributing to this research, explain how they met and got along, cite the author's interviews with some of them, and describe their lives after they published their masterpiece, including completely unrelated information about their substance abuse and the dark circumstances of their premature death. It's written quite well, so there might be an audience who enjoys this, but sadly I'm not a part of it.

    If this book were structured to address the subject of the alignment problem directly, it would be at least three times shorter. That doesn't mean the other two-thirds are bad - most of it is informative, some of it is entertaining, and a lot seems like ML things that the author found interesting and just added to the book without any specific connection to its premise. I really liked the first few chapters, where machine learning algorithms are presented as the first viable benchmark for the human thinking process and the mental models that we build. Spoiler alert: it very clearly shows our flaws, biases, and the lies that we tell ourselves (which are further embedded in the ML models that we create and the technology that uses them).

    Overall, I enjoyed most of this book. I just feel a bit cheated by its title and premise, which advertise a different kind of book. This is the Machine Learning omnibus, presenting the most interesting scientific concepts of this field and the scientists behind them. If this is what you expect and need, you won't be disappointed!

  • Tariq Mahmood

    My perception of AI as a superior technology that should be embraced unquestioningly, almost reverentially, was successfully challenged by the numerous examples in this book. By the end of the book, I was convinced that AI is better and will get even more efficient compared with human ingenuity, but it needs to be constantly tested and questioned: any AI system depends upon the quality of the training data and the type of algorithms employed to solve any problem.

  • Max

    Really nice introduction to AI & the alignment problem - Christian gives a great overview of some bigger trends in ML (e.g. curiosity, imitation learning, transparency) and the history of AI, often connecting it to insights from cognitive science, which really enriched the book, speaking as a human and cognitive scientist. I wonder what more refined thinkers on the future of AI think of the book*, but I found that it connects nicely to many of the looming challenges with building AI systems that are robust and whose workings will be appropriately aligned with human values. Even though similar in style and purpose, I found that it has little overlap with the recent The AI Does Not Hate You: Superintelligence, Rationality and the Race to Save the World and Human Compatible: Artificial Intelligence and the Problem of Control. I expect this triple to contribute a lot to introducing more smart cookies to this formidable challenge and to heaving AI's longer-term developments onto many agendas as a Serious Issue. So here's to hoping that the ongoing AI revolution will be less of a naively hopeful leap than I'm afraid it will be.

    *Rohin Shah from the Alignment Newsletter [liked it a lot](https://www.lesswrong.com/posts/gYfgW...)

  • Rishabh Srivastava

    Strongly recommended if you're into Machine Learning. The first third of the book is accessible to all readers, but the rest of it is more enjoyable if you have some basic idea of how ML works.

    Had some fascinating takeaways beyond machine learning that can be applied to decision making. My favorites were:

    1. Simpler models tend to be the most generalizable. For example, when modeling the self-reported happiness of a couple, a simple metric (# of times they had sex - # of times they fought) was far more generally predictive than other, more complex indicators. More complex features can help predict things in a narrow domain better, but simpler features are more generalizable

    2. Model attention and explainability is often more important than just predictive accuracy. Multitask networks with feature saliency and visualization techniques are great for understanding the features that a model considers important

    3. We should strive to reward states of the world, rather than the actions of our agent (in reinforcement learning). Reward functions that are helpful in one environment (always eat as much sugar and fat as you can is good as a hunter gatherer) are harmful in another environment (modern humans)

    4. In reinforcement learning, points have to be assigned in such a way that when you undo something, you are "fined" the same number of points as you earned when you did it. If not, your model will promote short-term decision making (see the sketch after this list)

    5. A novelty detection system that tells an agent that it is in a new situation, and hence should have weak priors, improves the generalizability and performance of an agent. Also, rewarding an agent for being wrong in surprising ways leads to better performance than just rewarding it when it's right
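    As an illustration of point 4 (my own sketch, not code from the book), potential-based reward shaping is one standard way to make "undo" refund almost exactly what "do" earned, so loophole loops yield nothing:

    ```python
    # Minimal sketch of potential-based reward shaping (an assumption about how
    # point 4 can be implemented; not code from the book). The shaping bonus for
    # moving from state s to s' is gamma * phi(s') - phi(s), so doing and then
    # undoing an action nets (almost exactly) zero extra reward.
    GAMMA = 0.99

    def phi(state):
        # Potential function: here, simply how many toys are in the box (hypothetical example).
        return float(state["toys_in_box"])

    def shaped_reward(base_reward, state, next_state):
        return base_reward + GAMMA * phi(next_state) - phi(state)

    clean = {"toys_in_box": 5}
    messy = {"toys_in_box": 0}

    print(shaped_reward(0.0, messy, clean))  # cleaning up earns +4.95 ...
    print(shaped_reward(0.0, clean, messy))  # ... but dumping it back out costs -5.0
    ```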

  • Rick Wilson

    It’s a good overview of a brief moment in technological advancement.

    There’s a common thread in machine learning (AI, I'm going to use these terms interchangeably) research that “oh man we got to be really careful and think about how we set up these machines because they may end the world as we know it.” Thankfully this seems to be counterbalanced by the actual empirical research being done, which mostly seems like a lot of fun tricks. Similar to impressing people with your ability to open a jar by smashing it on the ground.

    I love the new models coming out. As of April 2022, OpenAI's DALL-E and GPT-3 models are super cool (hell, I used their Davinci model to help me write a homework assignment last week), but computer “intelligence” is intelligence the way a stick you found on the ground is like a forest. I’m sure it represents a tiny little part of it. There’s some really cool stuff happening in the AI field right now: a phenomenal convergence between computing power and new research methods, a mind-boggling amount of funding, and a lot of brilliant people going into the field. But every time I read a book like this, I get the impression that “intelligence” is just brute force. It’s like breaking into a bank vault by unleashing a large nuclear explosive. Which is cool. But it’s not intelligence. And it’s not close to intelligence. And it always seems like the answer that these authors have is to dissect the wholeness of consciousness and human experience into constituent parts and then try to reconstruct the parts of the whole.

    And that’s what this author does, compellingly. He breaks apart a lot of parts of human consciousness and thought and problem solving and then goes on to show how those have been deconstructed into machine learning algorithms. And I’m sure we can go back-and-forth, with me saying that this isn’t intelligence, the author saying “ya ha,” and so on, but I find myself unconvinced that we are even on the right track. We are creating some really impressive tricks out of silicon chips, and the field is advancing at such a rapid rate that it’s hard to keep up. But it seems like a combination of errors, in that we don’t understand what’s happening any more than we really understand ourselves. It’s like driving down a country road that says there’s a town in 10 miles. You drive on for what feels like 20 minutes, the town should be there, and then there’s another sign saying that the town is in 10 miles.


    That said, this book was great. It’s a fascinating tour of the state of machine learning circa 2022. I feel like this field flips itself on its head every year, and in five years it will probably be quaint and mostly outdated. But for now I thought it was a great book. With the title “The Alignment Problem,” I thought it would be a little more oriented towards Nick Bostrom-type warnings about the dangers of AI. Instead it’s essentially a tour of an AI museum of modern machine learning models.

    I thought it was well told and generally stays between the lines of speculation and hyperbole. There were some times, when talking about evolutionary psychology, where I thought the author was getting a little off from what my impression of modern research is. It seems like in psychology, whenever we say “only humans can do this,” that thing is contradicted by some sort of niche exception almost immediately. Tool use, language, generosity. We think we are really special as humans and are so willing to come up with reasons why we are unique. I just haven’t typically seen that backed up in significant ways in replicable research. That doesn’t necessarily contradict the core of the book, but it’s becoming a pet peeve of mine. I do think the point the author is trying to make is that what separates us from, say, a reptile or bird is potentially what would separate us from, on the other side of the spectrum, AGI or some sort of intelligent computer. I’ll grant that, but I think there’s a better and more truthful way to portray it.

    That said, this is a good book if you’re willing to get into the weeds of how modern AI is set up, the types of different structures a system can be assembled in, who did what where, and why we’ve been using those structures. It’s a fantastic overview and a strong aggregation of what I understand to be an up-to-date tour of the field.

    Also, if you made it this far, here’s a treat (https://arxiv.org/abs/2204.06974)

  • Jessica Dai

    tldr worth a read !

    Really solid overview of the research field that is typically referred to as "responsible AI" (fairness, explainability, deep learning, language models, RL) -- this book is therefore distinct from other tech x society books in the sense that it is highly technical but also [I think] accessible, though I'm probably not the best person to judge that. I'd consider myself pretty familiar with the academic work that this book describes, but Christian packages a really nice story for the history of particular subfields/lines of inquiry, and draws connections to e.g. psych/neuro, and I feel like I learned a lot.

    My personal thought on e.g. putting a values-aligned lens on RL agents has always been that I have trouble drawing a line from the academic work to what this means in practice (as opposed to e.g. fairness or language models, where these are related to systems already in production and which are therefore already shaping/reshaping people's lives). I sort of wish this was made clearer! But also nitpicking lol.

    Reboot review (not written by me) here.

  • Karl Robert

    Brilliant reading that covers numerous aspects of learning and teaching for both humans and programs, with a bit of practical ethics and philosophy all woven together under one topic: the development of machine learning programs. It demonstrates perfectly how, in order to teach, you must first understand the subject, and how you learn more as you teach it to someone.
    If you have any interest in AI, its safety and real ethical problems, or the history of how machine learning has developed hand in hand with psychology, computer science, social sciences and neurology, this book is well worth a read.

  • Poorna Kumar

    Very nice! Superb technical writing and enjoyable (and I say this as someone who isn't particularly into science writing).

    I was somewhat familiar with part 1 of the book (on fairness and transparency) from my work and studies, and can confidently say that the author has done a fabulous job of distilling the current understanding on these topics with nuance. This is a real feat when the subject is so complex. Even though I knew about these topics from before, the book still deepened my understanding and appreciation of them and put many results in perspective.

    Parts 2 and 3 of the book, broadly around reinforcement learning, were fascinating and quite new to me. I enjoyed those parts as well, but not as deeply as Part 1, maybe because of my own ignorance/being new to the subject.

    This book is carefully and comprehensively researched, and really well explained. It's hard to find something like this. If you care about machine learning, read this book.

  • Baal Of

    There are already dozens of excellent reviews summarizing the content of this book, so there's no need for me to write anything. This book is important and useful for anyone who wants to get a fairly deep layman's understanding of the problems inherent in machine learning AI development. These problems are difficult, but it is extremely important that they be confronted head on, since they can literally be a matter of life and death. Christian has written an excellent book, one I think should be widely read.

  • Alexander Kutovoy

    This book is an excellent read for DS professionals and those just wondering about machine learning's origins, limitations, and prospects. There is nothing particularly mind-blowing or too technical. Still, some cases and stories tracing back the evolution of things one otherwise takes for granted nowadays are fabulous—many references to cognitive scientists and to human biology and anthropology studies, which I loved the most. Worth reading indeed.

  • Alex Railean

    This is an excellent book, it is like a survey paper written in very understandable terms.



    Notes for personal use:

    - word2vec example: doctor - man + woman = nurse
    - and so it went, with many examples placing women in household contexts
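    - a minimal sketch of that analogy arithmetic, assuming a locally downloaded copy of the pretrained GoogleNews word2vec vectors (an illustration, not code from the book):

    ```python
    # Minimal sketch of word2vec analogy arithmetic (illustration only).
    # Assumes the classic GoogleNews-vectors-negative300.bin file is available locally.
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True)

    # "doctor" - "man" + "woman" = ?
    print(vectors.most_similar(positive=["doctor", "woman"], negative=["man"], topn=3))
    # With these vectors, "nurse" tends to appear near the top of the list,
    # which is exactly the learned gender bias the note above describes.
    ```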

    - perceptron
    - - bias in the camera itself, color calibration [could not adequately represent black people]
    - Kodak employee and model, Shirley Page
    - - "Shirley card" - the same principle applies to any data set used for training
    - bias propagates easily now, by means of open source libraries or data sets that others reuse in their projects
    - - orchestra audition behind a screen, to avoid bias; later the candidates were also instructed to remove their shoes, because the sound of their walk could be used to infer gender, hence bias crept back in
    - redundant encoding - some trait that can be used to infer something else that we're trying NOT to use in our calculations (e.g. race, gender)

    - fairness through blindness doesn't work


    # transparency
    A mountain of unstructured data is not transparency

    - black box neural nets vs decision trees. The latter are easy to understand and follow
    - - story: asthmatic patients -> send them home, they are safe. This rule was produced by a machine learning algorithm. A human doctor would treat this as a critical problem and move the patient to the ICU. They get better care, hence they have a much higher survival rate. The machine got it completely wrong, building a model that actually endangers vulnerable patients.
    - idea: when a company uses black boxes to make judgments, the verdict must be signed by a human, who is then responsible for answering the "why so?", if needed.
    - - BOGSAT modeling technique: "bunch of guys sitting around a table"
    - animal detection vs bokeh detection, because most photos of wildlife have artistically blurred backgrounds
    - - saliency: design a neural net that shows you which part of the image contributed to the result the most
    - this is how the animal/bokeh detector was caught
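    - as an illustration of the saliency idea above (not code from the book), a minimal gradient-based saliency sketch using a hypothetical torchvision classifier:

    ```python
    # Minimal sketch of a gradient-based saliency map (one common technique;
    # an illustration, not the specific method described in the book).
    import torch
    from torchvision import models

    model = models.resnet18()   # hypothetical stand-in classifier
    model.eval()

    image = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a real photo
    score = model(image)[0].max()   # score of the top predicted class
    score.backward()                # gradient of that score w.r.t. the input pixels

    saliency = image.grad.abs().max(dim=1)[0]  # per-pixel importance, max over color channels
    # If the bright regions of `saliency` trace the blurred background rather than
    # the animal, the model has learned bokeh detection rather than animal detection.
    ```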


    - multi-tasking TODO focus not only on the inputs but also on the outputs
    - - deconvolution: visualize the intermediate layers of the neural network
    - localization of training data: fire trucks in the USA are red, but in Canberra they are neon yellow. Self-driving cars trained in the USA might not recognize fire trucks elsewhere
    - - todo: tcap method


    ## training
    Credit assignment problem: answer the question "where did I go wrong?" (instead of just giving you a pass/fail verdict in the end)


    TD learning (temporal differences): make intermediate predictions and learn from them, even before a game (or other process) ends, before the final score is available. This always converges to the optimum, if it can train long enough. The principle is to observe how predictions change over time.
    It seems that this is the role played by dopamine in our systems: track the error in the expectations of future rewards (not rewards themselves, and not just reward predictions)
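    A minimal sketch of that idea in tabular form (an illustration, not code from the book): the value of a state is nudged by the *difference* between successive predictions, before the final outcome is known.

    ```python
    # Minimal sketch of tabular TD(0) learning (illustration only).
    def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
        td_error = reward + gamma * V[next_state] - V[state]  # the "dopamine-like" prediction error
        V[state] += alpha * td_error
        return td_error

    V = {"s0": 0.0, "s1": 0.0}
    # A surprisingly good transition produces a positive error, so V["s0"] is revised upward.
    print(td0_update(V, "s0", reward=1.0, next_state="s1"), V)
    ```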


    ## reward schedules
    Skinner's variable reward schedules had the most effect: the reward will come, but only after a variable number of iterations.
    This pattern is also what keeps gamblers glued to their addiction.


    Shaping: Reward behavior that at least somehow resembles the desired one, in order to steer the subject towards the end goal. If you wait until the subject performs the desired action right away [in order to reward it], the moment might never come, or come much later. This is a "sparse reward", aka the "**sparsity problem**".

    Epsilon-greedy: be greedy [in terms of gathering points] most of the time, but occasionally try a variation for fewer points, doing something unusual.
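    A minimal sketch of epsilon-greedy action selection (an illustration, not code from the book):

    ```python
    # Mostly exploit the best-known action, occasionally explore a random one.
    import random

    def epsilon_greedy(q_values, epsilon=0.1):
        if random.random() < epsilon:
            return random.randrange(len(q_values))                     # explore
        return max(range(len(q_values)), key=lambda a: q_values[a])    # exploit

    print(epsilon_greedy([0.2, 0.5, 0.1]))  # usually 1, occasionally a random action
    ```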



    Parenting: react promptly to a child's legitimate attention requests, and more slowly to the ones that are just seeking attention.

    ### Key ingredients for good shaping:
    **a good curriculum**: start with simple problems and actions that prepare you for more complex, upcoming challenges

    Reference to the Super Mario example: you learn to avoid mushrooms because they kill you - this happens at an early stage in the game, so you learn it fast. Then you have to learn that the big mushrooms are good and should not be avoided. That type of mushroom is introduced in a moment in the game where you don't have enough room to maneuver - so you learn about the good mushrooms at an early stage too.

    Thus, a good curriculum plays a crucial role in one's learning experience. If the challenges are not properly calibrated, the learner may never stumble upon the good behavior on their own.


    **Well-chosen incentives**. If you get it wrong, you fall into the trap of "rewarding A, while hoping to get B".

    This often applies to management of companies and employees



    Reward functions: reward states, not actions. Otherwise you end up with agents that find loopholes to get easy points (example: child that cleans the room, then throws everything back on the floor, to pick it up again)


    Gamification - looks into the problem of how to find rewards for certain behaviors that bring humans closer to their goals.


    # curiosity
    This is what made it possible to make a breakthrough in "Montezuma's revenge", which is a serious case of the sparsity problem.



    Compression: a better understood world is more concisely compressible. That is, you can express the underlying principles in an elegant way that makes sense. Thus one can use compressibility as a metric for understanding
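    A minimal sketch of compressibility as a proxy for structure (an illustration, not code from the book): data generated by a simple rule compresses far better than random noise, because the rule itself can be stated concisely.

    ```python
    import random
    import zlib

    structured = ("abcd" * 2500).encode()                        # fully predictable pattern
    noise = bytes(random.randrange(256) for _ in range(10000))   # no rule to discover

    print(len(zlib.compress(structured)))  # tiny: the pattern has been "understood"
    print(len(zlib.compress(noise)))       # roughly as large as the input itself
    ```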


    ## imitation and over-imitation
    Reference to the experiment where human babies would imitate everything, including redundant moves, when opening a puzzle box. Other animals would skip the unnecessary part and get straight to the point.

    Perhaps the ability to over-imitate is what is needed to bootstrap a curious and self-driven intelligence that doesn't depend too much on external rewards?

    However, a related experiment that probes whether the child is aware of the redundancy of that action establishes that they are. Therefore we come to another potential explanation: "I know the action is unnecessary, but I assume the other human also knows it, and yet does it anyway; probably they know something I don't, so I'd better do what they do".

    In another variation of the experiment, there is an adult who uses a toy, and the baby observes. If the baby has reason to believe that the adult is unfamiliar with the toy, then the child does NOT perform the redundant action. They only do it when they are aware that the adult has seen the toy before and is more familiar with it.



    Knowing that a solution exists is sometimes a key factor in accomplishing something, or even accomplishing it more efficiently. Reference example: two climbers found a path to climb a geological formation in Yosemite Park (it is basically a flat wall). It took them 8 years to plan the path and come up with a strategy.
    After this was done, another climber was able to do it after only a week of analysis.



    **indirect normativity** - a way to align the system to our desires, without articulating every tiny detail of the expected result.


    Learning by observing - A beginner watching an expert will not get the chance to see how the expert deals with "beginner mistakes", because the expert doesn't make such mistakes anymore. Thus, this will train a model that is not able to deal with basic issues, which is a major weakness of this approach.

    **possibilism** - always do the best theoretically possible thing for the current situation. However, it might not always be feasible - for example, a beginner might know what needs to be done in principle, but have insufficient skill to do it right.

    **actualism** - do what makes sense based on what you think will actually happen.

    Example: you want someone to review your paper. You can give it to a super qualified professor, who is very busy, so you might not even get the review. But if you get it - it will be very thorough. Alternatively you can ask a less qualified colleague to look at it - you'll get feedback of a lower quality, but it will arrive in a short time.


    ### inverse reinforcement learning
    Turn the matter around and ask: what is the reward?

    Unlike a computer game, life is not easy. There is no obvious score. Suppose "walking" is a feature that was developed through reinforcement learning - in that case, what was the objective? What was being optimized?


    ### cross training
    Switch roles, the trainer becomes the trainee (like in pair programming). This enables the trainer to learn something too

    To-do: review this

    ### open-category problem
    A neural network trained to identify which of N classes a given object belongs to will always choose one of the N, without considering that the answer could also be "none of the above".

    In other words, it will give you an answer even if you provide trash at the input, and sometimes it will even be very confident in its verdict!

    **Dropout** - run the same input data through the same network multiple times, but each time turn off a random part of the network. Then compare the results provided by this "ensemble of networks". This improves the quality of the output.

    When there is no consensus, the system can say "I know that I don't know" and perhaps involve a human for further investigation.
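    A minimal sketch of that dropout-as-ensemble idea (an illustration, not code from the book): run many randomly thinned copies of a toy model and measure how much they agree.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(10, 3))  # toy one-layer "network": 10 inputs -> 3 classes

    def predict_with_dropout(x, drop_prob=0.5):
        mask = rng.random(x.shape) > drop_prob  # randomly switch off part of the input
        return int(np.argmax((x * mask) @ W))

    x = rng.normal(size=10)
    votes = [predict_with_dropout(x) for _ in range(50)]
    agreement = max(np.bincount(votes)) / len(votes)
    # Low agreement across the random "ensemble" is a cue to say
    # "I know that I don't know" and hand the case to a human.
    print(agreement)
    ```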


    **Corrigibility** - ability to intervene in the operation of an autonomous system and change parameters/goals/etc.





    ### concluding remarks

    Certain types of errors are less serious than others (for example, in Onlite, not knowing the exact number of business partners is not really a big deal; you only need a rough estimate)

  • Tommy

    The Alignment Problem was phenomenal and I would highly recommend it to anyone who is even remotely interested in machine learning, how algorithms shape modern life, or even the parallels between psychology and artificial intelligence. My main background in AI is from an extensive article on Wait But Why, which focused much more on what artificial general intelligence might eventually mean for our society. The Alignment Problem, however, goes into the nuts and bolts of both the history and the current implementation—including successes as well as the multitude of pitfalls—of machine learning. Ultimately, this book gave me hope in the future of machine learning, not because AI itself is so cool, but because there are so many people working to make it ethical, just, and amazing.

    We find ourselves at a fragile moment in history—where the power and flexibility of these models have made them irresistibly useful for a large number of commercial and public applications, and yet our standards and norms around how to use them appropriately are still nascent. (page 48)

    I read this voraciously and enjoyed it so much that I think I might buy it so that I can reread it. I must also give the caveat that most of my reading of this book occurred in somewhat of a fugue state: sleep-deprived on a Greyhound bus. Nonetheless, I still believe The Alignment Problem to be enthralling.

    I absolutely loved the way that Christian writes, equally erudite and strikingly approachable. When there is a new topic that he wants the reader to learn about, he has a unique way of bringing it up that I found to be extremely effective. First, he describes an everyday situation, then he gives a formal definition of the subject/topic/term, and finally he explains how it is relevant or how it applies in the real world. In essence, he invites the reader to build an intuition for a new topic, tells you that you kind of already know what this is—but he puts a new name to it—and then he shows you how it is quite a bit more amazing than you thought. I think more people ought to teach in this way; to me, this is near the Platonic ideal of how to teach.

    Furthermore, it was quite clear that Christian did his research for The Alignment Problem. When he says that he did hundreds of interviews, I do not doubt him at all. I must also address my earlier comment about how this book is extremely approachable in its prose. Since a lot of this book was based not only on original research, but also relied heavily on personal interviews, Christian gave direct quotes of the way that people spoke (including their dialects/mannerisms of speaking) and also used syntactical tools such as ellipses to great effect.

    I'll try not to gush too much more about this book, but I must also point out that I loved how much he integrated psychology into this book. He could almost write an entire book just on how our brains work and I would love it equally. Since this book was about machine learning and human values, Christian had to adequately address the latter portion of the subtitle, and boy did he deliver! I especially enjoyed the chapters on Imitation and Inference, where he described how we are trying to include human values in our AI either by—you guessed it—having the machines imitate us or infer what we are doing. Lengthy sections of the book spoke exclusively on neuroscience (such as how dopamine is a "reward chemical" based not on the reward itself, but actually on how reality differs from our expectation of the future).

    Finally, I'll leave you with one of my favorite justifications for why you ought to learn more about this, from the conclusion, page 327:
    Increasingly, our interaction with almost any system involves a formal model of our own behavior.... What we have seen in this book is the power of these models, the ways they go wrong, and the ways we are trying to align them with our interest.

  • Fred Oliveira

    One of the best books - if not the actual best - on AI I've ever read. Perhaps a little dense at times, and potentially challenging for people who have never come across some of the topics. However, if you are in AI or a tangential industry – and one might argue that that's every industry, right now - this almost feels like required reading. Highly recommended.

  • Jacob Williams

    "We are in danger of losing control of the world not to AI or to machines as such but to models."

    This is full of interesting historical anecdotes (like that time William James kept a bunch of chickens in his basement to help out a student) and good high-level explanations of various approaches to machine learning.

    Perhaps the most shocking issue discussed is how some US state justice systems used a model (called COMPAS) from a third-party provider for years to guide bail and sentencing decisions without doing any sort of validation of the efficacy or fairness of the model. Christian also gives compelling examples of how dangerous it can be to naively trust a model you don't understand, like the case where a pneumonia-diagnosis model was accurately predicting that some patients were less likely to die of pneumonia: it turned out the reason they were less likely to die is that they had extra health conditions which caused hospital staff to view them as higher-risk and give them additional care. So if the staff had started trusting the model's predictions instead, those patients would have likely been at even higher risk of dying than they were to begin with. Trying to act on the model's advice would have undermined the model's accuracy!

    Still, although the description of this book on goodreads calls it a "jaw-dropping exploration of everything that goes wrong when we build AI systems", I found it to take a pretty measured attitude towards the problem, especially in parts 2 and 3. The general impression it left me with is that there are very smart people working hard on making AI safe, and that they've got some good ideas. The question, I guess, is whether society will listen when they urge caution, or if overeager deployment of stuff like COMPAS will be the norm.

  • Ellison

    I learned a lot from this book about the history of AI development over the last 70 years, but more about how the author and the scientists he interviewed felt about AI and the alignment problem than about the actual problem itself. They all seemed a little naive in thinking the problem is a programming error when it is really a human error. We are not aligned or alignable but ‘crooked timber’ through and through. Computers might help us follow or reveal the grain of that timber, but the danger seems to be that they might also shape us in ways we don’t recognize or control.
    ‘One way to do this, the Berkeley group realized, is to have the system be at some level aware of how difficult it is to design an explicit reward function - to realize that the human users or programmers made their best-faith effort to craft a reward function that captured everything they wanted, but they likely did an imperfect job. In this case even the score is not the score. There is something the humans want, which the explicit objective merely and imperfectly reflects.’
    I love that phrase 'even the score is not the score.' I think I will put it on a T-shirt.

  • Vidur Kapur

    A well-researched treatment of the problem of AI alignment. The author does have a tendency, like many popular science writers, to digress and relay anecdotes or stories about historical scientific experiments that aren't really necessary to aid the reader's understanding of the topic. However, the final third of the book was focused and informative. After reading Stuart Russell's Human Compatible, I became more confident that the alignment problem can be solved. This book wasn't as big an update for me, but it did make some of the pathways that Russell discussed more concrete in my mind. (Conditional on AGI being built within 100 years, I now put ~42% credence on it not leading to existential catastrophe.)

  • KC

    A very well-written book on AI alignment that is more focused on recent research efforts and practical algorithms rather than higher-level philosophical ideas like "Life 3.0" or "Superintelligence." I thought the coverage of reinforcement learning was especially detailed but accessible. I would recommend this to anyone who is curious about current AI research. It fills a similar niche to Stuart Russell's "Human Compatible," but I would recommend this book over that one by a significant margin. (The content might become outdated relatively quickly though.)

  • R

    This fascinating book looks deeply at both the history of machine learning and the urgent challenges around ethics and safety that are being faced by those on the front lines. Well-researched and well-written, this book is approachable for people without a technical background—you will learn a lot!—while still being thought-provoking for those in the field. Highly recommended!

  • Casey Dorman

    I must admit that I was taken by surprise by the contents of Brian Christian’s recent book, The Alignment Problem. The book came out in 2020 and made quite a splash among the artificial intelligence (AI) and machine intelligence community. Much of the public, including myself, had been made aware of “the alignment problem” by Nick Bostrom’s book, Superintelligence, or the writings of people such as the MIT physicist Max Tegmark. In fact, in my case, it was the conundrum of the alignment problem that spurred me to write my science fiction novel, Ezekiel’s Brain. Simply put, the alignment problem in the AI world is the question of how you create a superintelligent AI that is “friendly,” i.e., helpful rather than dangerous, to humanity. It’s such a difficult question that, in my novel, Ezekiel’s Brain, the creators of the superintelligent AI fail, and the result is disastrous for the human race. What I was expecting from Brian Christian’s book was another description of the nightmare scenarios of the kind I wrote about in my novel, and that experts such as Bostrom and Tegmark describe in their writings. That wasn’t what The Alignment Problem was about… or at least not what it was mostly about.

    Christian gives some detailed accounts of disastrous results from applying the most sophisticated AI learning algorithms to actual human situations. Some of these are well-known, such as attempts to censor social media content, to produce an algorithm that aided judges in criminal sentencing, or to develop screening tools for employment selection. Training AIs on data about human decisions simply amplified the gender, racial, and ideological biases we humans use to make our decisions. These were instances of AIs performing in a way that was more harmful than helpful to humans, and they were results of which I had previously been only vaguely aware. Although they were not the kind of misalignment that concerned me and had prompted me to buy the book, they expanded my concept of alignment considerably.

    Instead of providing nightmare scenarios of the danger of superintelligent AIs that are not aligned with what is best for humanity, the bulk of Christian’s book provides an exquisite history, up to the present, of the efforts of the AI community to define how machines can learn, what they are learning and what they ought to be learning, and how to identify whether the progress being made is bringing AIs into closer alignment with what humans want from them. What was most surprising and gratifying to me, as a psychologist, was how much this effort is entwined with progress in understanding how people learn and what affects that learning process.

    Christian writes his book like a good mystery, but rather than following a narrow plot, the breadth of inquiry is extraordinary. Even as a psychologist, I learned about findings in psychology, learning, and child development of which I was unaware. How computer scientists who develop AI use psychological findings to open up new avenues in machine learning is fascinating to hear about. The collaborations are thrilling, and both psychologists and AI researchers who are not aware of how much is happening on this front should read Christian’s book to get an idea of how exciting and important this area of research is becoming.

    Although I have some background related to psychology, AI, and the alignment problem, this book is written for the non-expert, and the interested layperson can easily understand it and bring their knowledge of the subject up to date. I found it one of the most captivating and informative books I have read in the last several years, and I recommend it to anyone in whom this topic sparks an interest.

  • Vijai

    On March 23rd, 2016, Microsoft threw their Twitter chatbot at the world. It started all innocent, what with the chatbot claiming that humans are cool. Not 24 hours later, on March 24th, 2016, the very same chatbot was proclaiming that Hitler did nothing wrong and wishing that feminists would burn in hell. Here's the link to that saga.

    So, why did this ambitious machine learning project go tits up? Because the chatbot was modeling itself on the shit that was already in the public domain, just like it was designed to.

    Like parents, like children. Garbage in, garbage out. You get my drift? Machine learning and AI don't just become sentient from the word go. They are modelled on what real people have already done. A wonderful concept, assuming that the existing data the bots are sifting through all comes from well-meaning, ideal humans. Unfortunately, we humans aren't.

    Which is the crux of the author's main thrust in this book. What if the brilliant ML programmer just happened to be a flaming misogynist? Is there a chance, a teeny tiny one, that his ML could have a gender bias? I am not even conjecturing; it has already happened, as the author notes in this book.

    A wonderful book that only vindicated the fears I have had about this recent mad rush behind AI and ML. It's like communism: the idea is all cool on paper and may well suit a utopian, ideal world. However, does it serve a practical purpose in real life? In this nobody's opinion, no. And my bias tells me that the author is saying so too.

    An excellent read. I enjoyed this author's first book on algorithms, and I notice a certain maturity in the presentation of his material in this book. It is wonderful to see an author grow and become better. Worth a read and worth the 5 stars.

    Please buy first hand.

  • Neven

    The topic of AI has had its flare-ups and its quiet periods in the history of computing; one moment it would promise to revolutionize everything, then its limitations would put it on the back burner for a while as the field refocused on some non-“smart” technology instead.

    It’s often said that computer vision and translation were considered just about solved in the 1960s—and then they practically languished for decades. Of course, a lot of work was being done on both in the meantime, and we’re currently at a point where they’re once again almost “solved,” whatever that means.

    Christian’s book arrives at a perfect time, then, and thanks to his level-headed, clear, smart writing, it covers how we got to where we find ourselves right now. A number of advancements in how we conceptualize machine thinking synced up with fast-enough computers, resulting in DALL·E, Midjourney, and GPT-3.

    The book’s title refers to the foundational problem of how we humans get our algorithms to do “what we want”; that is the alignment we seek. The specific concerns we should care about here are many: is the algorithm understanding our words correctly, at the right mix of literal and metaphorical? Does it know why we want what we want, and what outcome would make us happy? Can it adapt if its guess is wrong? Does it care if what we asked for is ethically good?

    It would be easy to write a book celebrating the successes of AI by listing all the various impressive projects of the 21st century. It would also be easy(ish) to enumerate the failures (the ones stemming from models trained on datasets made up mostly of white men could easily fill many books) and the remaining troubling questions. This book balances the two well, proceeding with the understanding that AI is here to stay, so we need to get it right.

    Two books in, I’m a fan of Christian’s writing. Looking forward to more of it, including the myriad future works on this topic that I’m sure he’ll get opportunities to write in the coming years!

  • Tim

    My main insight from this book is that alignment is a far more concrete problem than I imagined. I assumed that alignment simply meant making sure bad people don't have control of the goals of AI, and that we're thoughtful about crafting those goals. Instead, AI safety research is about engineering systems that can understand and carefully act in environments with ambiguous or contradictory goals.

    That was a very abstract sentence, so let me try to explain that again. A big problem for AI safety is that human intentions are implicit. The famous paperclip problem - what if an AI is told to make paperclips and it stops at nothing to turn the whole world into paperclips? - is posed because the AI doesn't understand that humans care about things other than paperclips. But instead of training a model to make paperclips, you can train it to understand human intentions. Or you can train it to be cautious about taking actions that change the world in any significant way. There's real, concrete research ongoing in this field. I'm excited to go learn more about the specific math and computer science behind these ideas.
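
    As a rough flavor of that second idea, here is a toy sketch of my own (the plan names and numbers are entirely invented, not from the book or any real system): a planner that scores each option by task reward minus a penalty for how much it changes the world will prefer a modest plan over a world-converting one.

    from dataclasses import dataclass

    @dataclass
    class Plan:
        name: str
        task_reward: float   # e.g. paperclips produced
        world_change: float  # crude, invented proxy for irreversible side effects

    plans = [
        Plan("do_nothing",           task_reward=0.0,   world_change=0.0),
        Plan("run_existing_factory", task_reward=100.0, world_change=5.0),
        Plan("convert_everything",   task_reward=1e9,   world_change=1e9),
    ]

    PENALTY = 10.0  # how heavily side effects count against task reward

    def score(plan):
        # Task reward minus a penalty for moving the world away from the
        # do-nothing baseline.
        return plan.task_reward - PENALTY * plan.world_change

    best = max(plans, key=score)
    print("chosen plan:", best.name)  # run_existing_factory, not convert_everything

    The real research is about how to define that "world change" term sensibly rather than hand-picking numbers, but even this toy shows how a single penalty term changes which plan wins.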

    This also changed my perspective on AI safety, from a negative research field spurred by fear to a positive research field motivated by creation and hope. AI safety research actually makes AI more useful, rather than only telling us that we should really be careful what we build.

    I wonder if someone less familiar with machine learning (not that I'm an expert - but I've taken stats and intro to ML courses) would have trouble understanding this book. For instance, as far as I recall the book doesn't describe specifically how neural networks work. Maybe that's a strength of the book, though.