Tuesday, November 6, 2007

Why you can't measure everything with the Pythagorean Theorem

So I recently read this article explaining how you can measure any old thing with a generalization of the Pythagorean Theorem, and the little part of my brain which makes me say sentences starting with the word "Actually,..." said "Actually, you can't."

First, some background. The Pythagorean Theorem as we know it, in just two dimensions, is a way of assigning a length to a vector. A vector is one way of describing the qualities of a particular thing. We usually think of vectors as representing locations in physical space, but as the article points out, that isn't always the case. A vector might be made up of one coordinate representing how much you like Roy Orbison and another coordinate describing how much you like Elvis Costello (although it's unlikely such a pair of vectors would span the space of musical taste). You could still use the Pythagorean Theorem to quantify how different your tastes in the Orbison x Costello plane are from another person's, though.

Nevertheless, there are some kinds of things which are simply not vectors and can't be meaningfully treated that way. By way of example, I will draw from my own work as a neuroscientist.

Neuroscientists, perhaps not surprisingly, are frequently interested in what neurons are saying to one another. Neurons, in turn, often communicate by sending "spikes" of voltage down their axons. The axons connect to other neurons via synapses, which convert those spikes of voltage into currents in other neurons. Those currents affect the probability of these "post-synaptic cells" producing their own spikes. Somehow all these spikes ultimately represent a computation, and all these computations ultimately represent, well, your consciousness and mine.

So, the neuroscientist says to him or herself, I would like to understand what information is in the trains of voltage spikes.
Since the brain is noisy, and spikes are the only short term phenomenon which is stable over long distances, we can make progress by just measuring the times that a neuron produces spikes. The way this is generally done is to characterize a neuron's response by making a list of the times it spikes. In a typical experiment, the neuroscientist presents a particular stimulus to an animal many times and then, for each of these trials, writes down when the neuron in question spiked. We can plot such a characterization in a "rastergram" (see Figure 2).

So now the million dollar question: How similar are two different spike trains? Equivalently, what is the distance between two spike trains? We'd like to know because if a neuron responds differently to the same stimulus, it is sort of like "noise" in the neural system. The level of noise affects the possible coding/computation strategies the brain might use (this is a hot area of research, as you might imagine).

It turns out that spike trains don't make very good vectors, and, therefore, the Pythagorean Theorem does not work very well for characterizing them. Why is this? Imagine two spike trains, a and b. These aren't vectors. But we can (naively) turn them into vectors with the following rule: divide time into a set of bins, and then count the number of spikes in each bin for a given spike train. Put that number into a vector, which has as many elements as there are bins, as the entry for the corresponding dimension. So these two vectors would look like:

Train a (vector form): [ 0 1 0 1 0 0 0 0 0 1 0 ]
Train b (vector form): [ 0 1 0 1 0 0 0 0 1 0 0 ]

For the first four or so elements things look good. The trains are intuitively similar and they appear to have similar vectors. But the rub occurs at elements 9 and 10. Train b points along direction 9 one whole unit, and train a points along direction 10 one whole unit, and yet, intuitively, the two spike trains are very similar around these bins.
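
The binning rule described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the spike times, the 10 ms bin width, and the function names `bin_spike_train` and `euclidean_distance` are made up for the example, chosen so that the resulting vectors match trains a and b.

```python
import math

def bin_spike_train(spike_times, t_start, t_stop, n_bins):
    """Count spikes in n_bins equal-width bins covering [t_start, t_stop)."""
    width = (t_stop - t_start) / n_bins
    counts = [0] * n_bins
    for t in spike_times:
        if t_start <= t < t_stop:
            # Clamp guards against floating-point rounding at the last edge.
            idx = min(int((t - t_start) / width), n_bins - 1)
            counts[idx] += 1
    return counts

def euclidean_distance(u, v):
    """Pythagorean (Euclidean) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

# Hypothetical spike times in milliseconds (not real data).
# The last spikes of the two trains are only 2 ms apart.
train_a = [15.0, 35.0, 91.0]
train_b = [15.0, 35.0, 89.0]

# 11 bins of 10 ms each, covering 0 to 110 ms.
vec_a = bin_spike_train(train_a, 0.0, 110.0, 11)
vec_b = bin_spike_train(train_b, 0.0, 110.0, 11)

print(vec_a)
print(vec_b)
print(euclidean_distance(vec_a, vec_b))
```

With these made-up times, the final spikes differ by only 2 ms, yet they fall on opposite sides of the 90 ms bin edge, so the two vectors disagree in two whole coordinates.
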
A better illustration can be had if we restrict our vectors to just elements 9 and 10. Then:

Train a (vector form): [ 0 1 ]
Train b (vector form): [ 1 0 ]

The Pythagorean distance between these two vectors is just sqrt( (0-1)^2 + (1-0)^2 ) = sqrt(2). But if we nudge the spike in train b forward a little in time (making an incremental change in the nature of train b), as soon as we push it over the bin boundary, the vectors become:

Train a (vector form): [ 0 1 ]
Train b (vector form): [ 0 1 ]

And the distance between them goes to, obviously, zero. In other words, a small change in our data produces a large change in the vectors representing our data. This is not a desirable behavior. Particularly troubling is that if we want a "higher resolution" view of our data, we naturally make smaller bins, which just makes this problem worse.

The moral: not everything is a vector, and so the Pythagorean distance is not appropriate for characterizing "differences" between all types of things. This shouldn't surprise us at all. What, after all, is the vector distance between the words "cat" and "cattle"? It's obvious they are similar words (at least textually), but there is no obvious way to turn them into vectors.

There are a variety of types of spaces of objects with fewer properties than vector spaces, and it makes sense to use them. Putting some piece of data into a vector space, for example, requires that addition of the data be defined in some way. It isn't even sensible to ask what "cat" + "cattle" would be in terms of another vector of characters.

Maybe one day I will write some about the information theoretic ways distances can be calculated between data which don't behave well as vectors. Until then, don't go applying the Pythagorean Theorem all willy-nilly.
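
To make the bin-boundary jump concrete, here is a minimal Python sketch. The spike times, bin widths, and helper names (`bin_counts`, `binned_distance`) are assumptions invented for the illustration, not anything from a real experiment. It slides a single spike in train b across the 90 ms bin edge while train a keeps its spike at 91 ms, then repeats the comparison with much smaller bins.

```python
import math

def bin_counts(spike_times, bin_width, n_bins):
    """Count spikes per bin; bin i covers [i*bin_width, (i+1)*bin_width)."""
    counts = [0] * n_bins
    for t in spike_times:
        idx = int(t // bin_width)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts

def euclidean_distance(u, v):
    """Pythagorean (Euclidean) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def binned_distance(times_a, times_b, bin_width, duration):
    """Bin both trains the same way, then take the Euclidean distance."""
    n_bins = int(round(duration / bin_width))
    return euclidean_distance(
        bin_counts(times_a, bin_width, n_bins),
        bin_counts(times_b, bin_width, n_bins),
    )

# Hypothetical data: train a has one spike at 91 ms; train b's single spike
# is nudged forward in small steps across the 90 ms bin edge.
spike_a = [91.0]
nudges = (89.0, 89.9, 90.0, 90.1)

# 10 ms bins: distance is sqrt(2) before the edge and 0 after it.
coarse = [binned_distance(spike_a, [t], 10.0, 110.0) for t in nudges]

# 1 ms bins: the spikes never share a bin, so every distance is sqrt(2).
fine = [binned_distance(spike_a, [t], 1.0, 110.0) for t in nudges]

for t, dc, df in zip(nudges, coarse, fine):
    print(f"b at {t:5.1f} ms: 10 ms bins -> {dc:.3f}, 1 ms bins -> {df:.3f}")
```

A 0.1 ms nudge (from 89.9 to 90.0 ms) takes the 10 ms-bin distance from sqrt(2) straight to zero, and with 1 ms bins even spikes less than a millisecond apart are scored as maximally different for a single spike pair.
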

1 comment:

kalid said...

Hi there, this is Kalid from BetterExplained -- thanks for the detailed analysis!

I agree that vectors don't/shouldn't apply to everything -- there are other measures of distance that may be more useful, like "edit distance".

Taking your example, "cat" and "chat" have an edit distance of 1 (inserting an h), but are otherwise similar. Of course, computing edit distance is much more difficult than squaring a few numbers.

Similarly, the signals from neurons may be better suited to the concept of "edit distance" than a raw comparison of signals over time.

Appreciate the analysis,

-Kalid