
Why aligning AI to our values may be harder than we think

Can we stop a rogue AI by teaching it ethics? That might be easier said than done.

Key Takeaways
  • One way we might prevent AI from going rogue is by teaching our machines ethics so they don’t want to cause harm in the first place.
  • The question of what we should, or even can, teach computers remains open.
  • How we choose the values an artificial intelligence follows may matter most of all.

Plenty of scientists, philosophers, and science fiction writers have wondered how to keep a potential super-human AI from destroying us all. While the obvious answer of “unplug it if it tries to kill you” has many supporters (and it worked on the HAL 9000), it isn’t too difficult to imagine that a sufficiently advanced machine would be able to prevent you from doing that. Alternatively, a very powerful AI might make decisions too rapidly for humans to review them for ethical correctness or to correct the damage those decisions cause.

The issue of keeping a potentially super-human AI from going rogue and hurting people is called the “control problem,” and there are many potential solutions to it. One of the more frequently discussed is “alignment,” which involves syncing an AI with human values, goals, and ethical standards. The idea is that an artificial intelligence designed with the proper moral system wouldn’t act in a way that is detrimental to human beings in the first place.

However, with this solution, the devil is in the details. What kind of ethics should we teach the machine, what kind of ethics can we make a machine follow, and who gets to answer those questions?

Iason Gabriel considers these questions in his new essay, “Artificial Intelligence, Values, and Alignment.” He addresses those problems while pointing out that answering them definitively is more complicated than it seems.


Humans are really good at explaining ethical problems and discussing potential solutions. Some of us are very good at teaching entire systems of ethics to other people. However, we tend to do this using language rather than code, and we teach people whose learning capabilities are similar to our own rather than machines with very different abilities. Shifting from people to machines may introduce some limitations.

Many different methods of machine learning could be applied to ethical theory. The trouble is, they may prove to be very capable of absorbing one moral stance and utterly incapable of handling another.

Reinforcement learning (RL) is a way to teach a machine to do something by having it maximize a reward signal. Through trial and error, the machine is eventually able to learn how to get as much reward as possible efficiently. With its built-in tendency to maximize what is defined as good, this system clearly lends itself to utilitarianism, with its goal of maximizing the total happiness, and other consequentialist ethical systems. How to use it to effectively teach a different ethical system remains unknown.
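
To make that point concrete, here is a minimal, hypothetical sketch of tabular Q-learning, the textbook form of reinforcement learning. The toy environment, reward function, and constants are all invented for illustration; the thing to notice is that the agent’s entire notion of “good” is whatever the scalar reward function returns, which is why the approach fits consequentialist thinking so naturally.

```python
# Minimal sketch (illustrative, not from Gabriel's paper): a tabular Q-learning
# agent whose only notion of "good" is a scalar reward. Whatever ethical content
# the designer can express must be squeezed into that single number.
import random

N_STATES, N_ACTIONS = 5, 2
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1

# Hypothetical reward table: the designer's stand-in for "total welfare".
def reward(state: int, action: int) -> float:
    return 1.0 if (state + action) % 2 == 0 else -1.0

def step(state: int, action: int) -> int:
    return (state + action + 1) % N_STATES

q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]

state = 0
for _ in range(10_000):
    # Epsilon-greedy: mostly exploit the action with the highest learned value.
    if random.random() < EPSILON:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: q[state][a])
    r = reward(state, action)
    nxt = step(state, action)
    # Standard Q-learning update: nudge the estimate toward the reward plus
    # the discounted best future value.
    q[state][action] += ALPHA * (r + GAMMA * max(q[nxt]) - q[state][action])
    state = nxt

print(q)  # The learned policy encodes only what the reward function rewards.
```

Swap in a different reward function and the agent will dutifully maximize that instead; what this framing struggles to express is any rule that isn’t reducible to a number to be maximized.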

Alternatively, apprenticeship or imitation learning allows a programmer to give a computer a long list of data or an exemplar to observe and lets the machine infer values and preferences from what it sees. Thinkers concerned with the alignment problem often argue that this approach could teach a machine our preferences and values through action rather than through idealized language. It would just require us to show the machine a moral exemplar and tell it to copy what that exemplar does. The idea has more than a few similarities to virtue ethics.
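
As an illustration, the sketch below shows behavioral cloning, the simplest form of imitation learning. The demonstrations and situation names are made up for the example; the point is that the machine never receives a stated rule, only observed choices to copy.

```python
# Minimal sketch (hypothetical data): behavioral cloning. The machine observes a
# moral exemplar's (situation, action) pairs and copies the most common choice
# for each situation it has seen, with no explicit rules anywhere.
from collections import Counter, defaultdict

# Hypothetical demonstrations from an exemplar: (situation, chosen action).
demonstrations = [
    ("stranger_drops_wallet", "return_it"),
    ("stranger_drops_wallet", "return_it"),
    ("friend_asks_for_help", "help"),
    ("friend_asks_for_help", "help"),
    ("friend_asks_for_help", "decline_politely"),
]

counts = defaultdict(Counter)
for situation, action in demonstrations:
    counts[situation][action] += 1

def imitate(situation: str) -> str:
    """Copy the exemplar's most frequent action; abstain if the situation is unseen."""
    if situation not in counts:
        return "no_demonstration_available"
    return counts[situation].most_common(1)[0][0]

print(imitate("stranger_drops_wallet"))  # -> return_it
print(imitate("trolley_problem"))        # -> no_demonstration_available
```

Even in this toy version, the two open questions from the paragraph above are visible: whose demonstrations go into the dataset, and what the system should do in situations the exemplar never faced.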

The problem of who is a moral exemplar for other people remains unsolved, and who, if anybody, we should have computers try to emulate is equally up for debate.

At the same time, there are some moral theories that we don’t know how to teach to machines. Deontological theories, which establish universal rules to follow at all times, typically rely on a moral agent applying reason, along particular lines, to the situation they find themselves in. No machine in existence is currently able to do that. Even the more limited idea of rights, and the concept that they must not be violated no matter what an optimization process recommends, might prove challenging to code into a machine, given how specific and clearly defined those rights would have to be.
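
One way the “rights as inviolable limits” idea is sometimes imagined is as hard constraints that filter the set of actions before any optimization happens. The sketch below is an assumption about how that might look in code, not an established method, and its hand-written predicates show exactly the difficulty described above: each right has to be spelled out in machine-precise terms.

```python
# Minimal sketch (an assumption, not an established method): rights treated as
# hard constraints that filter the action set before any optimization happens.
# The hard part is the predicates themselves -- "harms a person" must be made
# machine-precise, which is exactly the difficulty the article points to.
from typing import Callable, Optional

Action = dict  # e.g. {"name": ..., "expected_gain": ..., "harms_a_person": ...}

# Hypothetical, hand-written rights checks.
RIGHTS: list = [
    lambda a: not a.get("harms_a_person", False),    # right not to be harmed
    lambda a: not a.get("violates_consent", False),  # right to consent
]

def choose(actions: list) -> Optional[Action]:
    # Step 1: discard anything that violates a right, no matter the payoff.
    permitted = [a for a in actions if all(rule(a) for rule in RIGHTS)]
    # Step 2: only then optimize among what remains.
    return max(permitted, key=lambda a: a["expected_gain"], default=None)

options = [
    {"name": "profitable_but_harmful", "expected_gain": 100, "harms_a_person": True},
    {"name": "modest_and_harmless", "expected_gain": 10},
]
print(choose(options))  # -> the harmless option, despite the smaller gain
```

The constraint check itself is trivial; everything difficult lives in deciding what the flags mean and who sets them, which is the unsolved part.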

After discussing these problems, Gabriel notes that:

“In the light of these considerations, it seems possible that the methods we use to build artificial agents may influence the kind of values or principles we are able to encode.”

This is a very real problem. After all, if you have a super AI, wouldn’t you want to teach it ethics with the learning technique best suited for how you built it? What do you do if that technique can’t teach it anything besides utilitarianism very well but you’ve decided virtue ethics is the right way to go?

If philosophers can’t agree on how people should act, how are we going to figure out how a hyper-intelligent computer should function?

The important thing might not be to program a machine with the one true ethical theory, but rather to make sure that it is aligned with values and behaviors that everybody can agree to. Gabriel puts forth several ideas on how to decide what values AI should follow.

A set of values could be found through consensus, he argues. There is a fair amount of overlap in human rights theory among a cross-section of African, Western, Islamic, and Chinese philosophy. A scheme of values, with notions like “all humans have the right to not be harmed, no matter how much economic gain might result from harming them,” could be devised and endorsed by large numbers of people from all cultures.

Alternatively, philosophers might turn to the “Veil of Ignorance” to find values for an AI to follow. In this thought experiment, people are asked to choose the principles of justice they would support without knowing what their own interests and social standing would be in a world governed by those principles. The values they selected would, presumably, protect everyone from any mischief the AI could cause and ensure its benefits reach everyone.

Lastly, we could vote on the values. Instead of figuring out what people would endorse under certain circumstances or based on the philosophies they already subscribe to, people could just vote on a set of values they want any super AI to be bound to.

All of these ideas are also burdened by the present lack of a super AI. There isn’t yet a consensus opinion on AI ethics, and the current debate hasn’t been as cosmopolitan as it would need to be. The thinkers behind the Veil of Ignorance would need to know the features of the AI they are planning for when coming up with a scheme of values, since they would be unlikely to choose a value set that the AI wasn’t designed to process effectively. A democratic system faces tremendous difficulties in ensuring that an “election” of values everybody can agree on is carried out justly and legitimately.

Despite these limitations, we will need an answer to this question sooner rather than later: deciding which values to tie an AI to is something we want to do before we have a supercomputer that could cause tremendous harm without some variation of a moral compass to guide it.

While artificial intelligence powerful enough to operate outside of human control is still a long way off, the problem of how to keep such systems in line when they do arrive remains an important one. Aligning those machines with human values and interests through ethics is one possible way of doing so, but the problems of what those values should be, how to teach them to a machine, and who gets to decide the answers remain unsolved.

