The delusions of neural networks

19 Jan 2018

Let’s start with a little experiment that will give you a deeper understanding of what a neural network is. Even if you are an expert in AI, please try.

Write down on a piece of paper what you see here (up to two words):

We will be back to this experiment later, but it is very important that you keep that physical piece of paper near you.

A bit of context

This post was born after an interesting discussion on Nexa’s mailing list.

The Nexa Center for Internet for Internet & Society is an independent research center, focusing on interdisciplinary analysis of the force of the Internet and of its impact on society.

The thread was about a report from the French Commission National informatique et libertés (CNIL), and I tried to give my technical contribution by sharing obvious considerations about machine learning in general and neural networks in particular.
Such simple considerations come from hands-on experience of the matter, since I’m a programmer.

Turn out, they were not that obvious.

So here I try, in the style of Nexa, to provide useful insights to both the layman that use AI for fun or profit, and to the AI practitioner that is too deep in the field to see it from a broader, interdisciplinary perspective.
It’s lengthy, to be both clear and technically correct, but should worth a read.

Also, while we focus on artificial neural networks here, most of what is said applies, mutatis mutandis, to other machine learning techniques.
Just, with different trade-offs.

What is an artificial neural network?

We call artificial neural networks a class of algorithms that can statistically approximate any function.

Currently, they constitute the most exciting research field in Statistics.

“Function”

The first keyword in our definition is “function”.

A function, in mathematics, is a rule that assigns to each element in a set exactly one element in another.

Counting is a function from the set of “things” to the set of numbers.

Addition is a function from (the set of) numbers’ pairs to (the set of) numbers. Such is multiplication. And exponential.

The point here is that, if you have two sets and a rule that map each element of one set to exactly one element of the other, you have a function.

“Statistically”

Neural networks are statistical applications.

This should explain why they need tons of data to be calibrated.
They are just like any other statistical application (or ML technique).

Don’t agree? Try to apply them to domains with low cardinality. ;-)

If you think of neural networks (and ML in general) as statistical applications, it suddenly becomes clear that any discriminatory behavior is not due to an obscure “bias” in an inscrutable computer brain.

It’s simply a problem in the data. Or in the code. Or both.

“Approximate”

Just like any other statistical application, the results provided by the output layer of a neural network are subject to an error.

Consider the function f : ℕ → ℙ that map each natural N to the Nth prime.

We have f(0) = 2; f(1) = 3; f(2) = 5; f(3) = 7 … and so on.

You literally have infinite samples to use, so you might want to use a neural network to approximate this function.

Or, you could try to calibrate a neural network to identify prime numbers.
Again you have a function and literally infinite samples to use.

But, if you try to use their outputs for cryptography, you are doomed.
Still you can’t blame the AI, just your poor understanding of statistics.

“Algorithms”

The formal definition of algorithms is still a deep research topic.

In 1972, however, Harold Stone provided an useful informal definition that most programmers will agree with: “we define an algorithm to be a set of rules that precisely defines a sequence of operations such that each rule is effective and definite and such that the sequence terminates in a finite time.”

Since computer systems are deterministic in nature (and will continue to be, until the widespread adoption of quantum computing), all algorithms executed by computers are deterministic too.

When a race condition makes a concurrent algorithm non-deterministic, programmers call it “a bug”. We just add time to the equation and fix it.
And when true entropy is fed to a deterministic algorithm to make its results hard to predict, we can still replicate them by recording the random bits fed to the algorithm and replying its execution with those same bits.
We randomize the input, not the algorithm.

“Any”

The last word in our definition is what make neural networks “magical”.

Neural networks can statistically approximate any function.

Even unknown ones.

If you suspect that a function exists, you can try to statistically approximate it with a neural network, even if you do not know the rule that it follows.
You just need two set. And tons of data.

This is the strongest strength of neural networks. And their weakness, too.

So where is the intelligence?

Back to our little experiment. Take your piece of paper. What did you write?

Can you see the cat? Me too.

This is what we call pattern recognition: we match a stimulus with information retrieved from memory.

We, as humans, are very good at this. Very, very good.

Still, there is no cat there!

Really, there is just a screen connected to a computer. ;-)

Humans are so good at pattern recognition that we can be fooled by it.

Beauty is in the eye of the beholder

We take two sets, such as the (set of) measures of Iris and the botanic classification of Iris (a set of words).

We “suspect” that a function exists between these two sets. :-D

We look for a large data set, classified by an expert, such as a Botanist.

We calibrate a neural network to approximate that hypothetical function.

Finally we run the program, and we see that it classifies Iris “like a pro”.

And, just like a mother looks her beloved son, we say: “how smart it is!”

We see a program doing the work of a Botanist and we recognize a pattern.
We are matching the program with experiences from our own memories.

We look at the computer and we see a Botanist. We see an intelligence.

But it’s like with the cat.

Verba manent.

If you know the technology in depth, you might have noticed that I’ve never used the term “training” for the calibration phase of a neural network.
In the same way, I do not like to use the term “learning” for machines.

The words we use to describe the reality forge our understanding of it.

Talking about “deep learning”, “intelligence” and “training” is evocative, attracts investments and turns programmers into semi-gods.

It’s funny, but plain wrong.
And dangerous, as we will see later.

Enter, Big Data

All techniques known as “machine learning” requires tons of data.

We don’t have better algorithms. We just have more data.
— Peter Norwing, Chief Scientist, Google

It’s obvious, once you understand that they are just statistical applications.

Still, the amount of data required to calibrate a neural network is so large that, despite being a 70 years old technology, it became practical just recently.

Today, anybody can easily collect, buy or sell tons of data.

Why? Simply because we leak data. Precious data. Data about ourselves.

The perfect match

We could apply these tools to any system for which we have enough data.
But we have tons of cheap data about people.

With enough data, we could try to calibrate a neural network to select the resumes to consider for an interview. Or to decide the perfect salary for an employee. Or to select the best match for a transplant. Or for a love story.
A company could try to approximate the “right” cost for your insurance.

The “function” in the details

Why we need so much data?
Fine, it’s just statistics, but concretely why?

Let’s recap: a neural network can statistically approximate any function.

How many curves pass from a point in a multidimensional space?
And from two points? And from three? And from N?

Turns out that the answer to all of these questions is “infinitely many”.

Can you see the problem?

A little poetic license

Let’s explain this in a cooler way, using the common AI parlance.

After training a neural network, we do not know which knowledge it will deduce from the training samples and we do not know what reasoning it will use for its computation. It’s a like a black box.

It approximates the desired output in the range covered by our samples.
That’s all we can say.

More properly

There are infinitely many functions that a given network might approximate.
And, currently, we cannot say which function a given node from the output layer of a neural network is actually approximating.

Now, there are a few interesting researches about this issue, but I’m not much optimistic about them. My insight is that being able to deduce the target function from a generic calibrated neural network is equivalent to resolving the halting problem.
After all, DGNs are neural networks too!

BTW, we need a big data set to filter out unwanted functions.

All the headaches you get from overfitting or underfitting are just side effects of this heroic challenge: whenever you feed a sample to the network, you exclude an infinite number of functions from being approximated by it.
Nevertheless an infinite number of functions still fit your data set.
So you can not know which function your network will approximate.
It’s sad, but you can’t really win.

Headache apart, this fact has deep legal implications.

How can you prove that a neural network do not discriminate a minority?
How can you prove that it’s not calibrated to be racist? Or sexist?

Theoretically, you can not.

Brute force to the rescue!

What math can’t prove, engineers can check.

When in doubt, use brute force.
— Ken Thompson, Unix inventor

To show that your neural network is not “trained to discriminate” you simply have to declare the function you tried to approximate and

provide (and thus safely store, for years) the whole data sets used to calibrate the network, including the one used for cross-validation;
provide (and thus safely store, for years) the initial values you choose for each topology you tried, and obviously the topologies;
disclose the full source code, with documentation;
hire an independent team of experts to verify the whole application.

The experts will try to falsify a theory, a predicate about the network.
Experimentally. They will at least

verify the data sets against selection bias;
verify that the training outputs do not actually apply a discrimination;
carefully debug the neural networks during the whole calibration, to prove that the your network actually derives from that calibration;
verify that your neural network was the best performing in the cross-validation (that also means to debug all the neural networks you included in the process);
verify that no programming error affect the network behavior.

If they do not find a programming error (which, trust me, is almost sure) it might take years, since the cost of debugging always grows with complexity.

Still, while very expensive, this approach is always technically possible.
Recall? Neural networks are deterministic programs.

But when does it worth such effort?
When do you have to pay such a huge cost?

A legal perspective (in European Union)

If you calibrate a neural network to play Go, nobody will ask you to prove that it is not discriminating white stones. Nobody cares about stones’ rights.

But if you delegate to a neural network a decision about people, the decision is still on your own responsibility.
You are stating that you trust who implemented, configured and calibrated the network and you are accepting to be liable of its outputs.

It’s just a statistical application, after all!

European General Data Protection Regulation

The word “responsibility” comes from the Latin responsum abilem, which basically means “able to explain”.

This was pretty clear to people who wrote the European GDPR.

Indeed, at Nexa’s mailing list, Marco Ciurcina pointed out that article 13 and article 14 of the GDPR were relevant to the discourse.

In particular, the point (f) of Article 13(2) states that

…the controller shall (…) provide the data subject with the following further information necessary to ensure fair and transparent processing: (…) the existence of automated decision-making, including profiling, referred to in Article 22(1) and (4) and, at least in those cases, meaningful information about the logic involved, as well as the significance and the envisaged consequences of such processing for the data subject.

The point (g) of Article 14(2) is equivalent, but relates to data acquired by third parties, instead of the data subject.

So if you operate in Europe and you apply AI to people, you should be able to explain the logic that led to each of its outputs in a court.

“Meaningful information”

Which information about the logic involved in an automated decision-making are “meaningful” for the data subject?

To be “meaningful” the information about a decision-making process (automated or not) must be

pertinent (they must describe that specific process)
understandable (to the data subject and the court)

But to prove that they are pertinent, the information must be complete.

And to prove that they are complete, you have to be able to replicate that specific decision-making process using them.

So, to recap, the meaningful information about a decision-making process are all those information that are relevant to the process itself and that the data subject (or the Magistrate) can use to the replicate process itself.

And obviously, to be pertinent, such information must be up to date.

Programmers can see how Magistrates debug. :-)

Just a matter of costs

Of all machine learning techniques, neural networks are the most expensive to debug/explain. By several times the most expensive.

But you can’t simply state “we didn’t trained the network to be racist”.
Or “the neural network was simply trained so and so”.

You could be lying.

Fortunately, you can prove your statement. With brute force and debug.
It’s just another cost in the budget. Probably a huge cost, but a cost.

Still, if you can’t afford to show that your neural networks (or your MLs in general) are approximating legal functions, it’s wise to replace them.
Or your business model.

Pick out

If you are not a programmer, it is safe (and wise) to mentally replace terms like AI and ML with “statistical application”, in any discourse you listen.
You will have a deeper understanding of the matter, if you do so.

Artificial neural networks are simply deterministic algorithms that statistically approximate functions. It’s just not possible to exactly say which function they approximate.

This is not a big issue, until you apply them to people.
And it’s just expensive when you do. At least in Europe.

Incidentally, the biggest data sets are about people.

But people are not stupid.

And artificial intelligence is not allowed to discriminate on your behalf.