Tuesday, April 15, 2008

A Delicious Mouthful of Functional Programming Tools for Matlab

Since learning Lisp writing for loops in Matlab seems even less appetizing that the concerns of vectorization dictate it would already be. Higher order functions let you do some very nice things in a very small amount of space, and I've gradually integrated some of that power into my Matlab code. I've created a group of functional style programming functions which "do the right thing" in the context of Matlab's multidimensional matrices and cell arrays - as long as "the right thing" isn't "be very fast." These functions are not meant for optimized code and are quite slow. What they are useful for is filtering lists of strings which represent data sets or doing more complicated operations on inhomogeneous inputs without writing for loops in preparation for performance critical code. They've made my life as a data analyst more fun and at least a little easier. There are five functions: map, filt, partial, ix, and ixc Here are the documentation strings (link to a zip at the bottom).

% F=PARTIAL(F,ARG) given F and ARG returns F curried on ARG
%
% F=PARTIAL(F,ARG) partially evaluates F with the first argument ARG
%  and returns a function of the remaining arguments.
%
% F=PARTIAL(F,ARG,POSN) partially evaluates F with the POSN placement
%  and value arg and returns a function of the remaining arguments, in the 
%  same order.

% RES=MAP(F,MATRIX1,...,MATRIXN) Maps F of arity N onto MATRIX1 ... N
%  When F is a function of N variables, it is applied to
%   each set of things in MATRIXs in column major order.
%
%  When all MATRIXs are numeric and the same shape, the result
%   is a numeric MATRIX of the if F returns a single number.  Otherwise
%   the result is a CELL ARRAY of the same shape as the inputs.
%
%  If the MATRIXES are different in size, the smallest one determines the
%   output, which is returned as a matrix of that shape.

% [R1,R2,...,RN] = FILT(F,MATRIX1,MATRIX2,...,MATRIXN) filters MATRIXES by F
%  R1,R2,...,RN are the values of MATRIX1 MATRIX2 ... MATRIXN where F is true.
%  These are generally returned as flat arrays unless F is always true.  Then
%  this is the identity.

% RES = IX(MATRIX,INDS1,INDS2,...,INDS3) indexes MATRIX with INDS
%  Identical to MATRIX(INDS1,INDS2,INDS3)
%  Does not support END notation

% IXC is like IX except uses cell indexing.

Archive: matlab_functional Let me know if anyone uses them.

Tuesday, April 8, 2008

On the Randomness of Pi

(Edit: As one reader has pointed out, the test performed at the end of the article is, to put it charitably, rather a poor one for determining randomness. I am already working on a more robust characterization.) As a man with a bit of a mathematical background I've always been sort of mystified by the popular obsession with the number pi. The interest is not entirely misplaced1 and can be evoked with a simple example:

Upon ejecting Adam and Eve from the Garden of Eden, God said to the Angels, all perfect in proportion and hence identical, that a boundary must be made around the Garden so that Mankind might never enter it again. He instructed Michael to grasp with one hand the Tree of Knowledge of Good and Evil, which would form the center of the forbidden Garden, and in his other grasp one of the thousand archangels, each grasping another until a chain of 100 archangels was formed. The last angel, Chamuel, grasping the Sword of Damocles, would then proceed to trace with its burning tip, a line in the lush undergrowth of the Garden, walking constrained by the angels, until he returned to his starting point, tracing out a circle around the Garden. God then instructed the archangels to guard this boundary by joining hands as before, along the line of demarcation, so that no point of the boundary could be crossed by Mankind ever again. To their dismay, no number of archangels, hands joined thusly, could form such a circle, for when 628 angels joined hands they found that there was not sufficient space for the final angel. Seeing this, God was silent, the angels dismayed, and Mankind, watching huddled from the wilderness, was filled with dread, for it now understood the world it had been expelled into.

In less metaphorical terms, pi is disturbing because it relates which ought to be two simple things, the radius of a circle, easily perceived, with its circumference, also simple enough to see, but it does so in a way which requires kinds of numbers well beyond the ken of day to day life. This unfortunately leads to all sorts of misplaced fascination with the exact digits of pi, as though a mystical secret may be encoded in them. This is not the soul province of the casually interested, either. In Contact, a novel by none other than Carl Sagan, a thoroughly trained astronomer who probably should have known better, the digits of pi have hidden deep within them a message from the creators of the Universe. So rather than try to talk up how interesting pi is, today I will focusing on how uninteresting it is. First, lets have some fun. At nersc you can search the first four billion binary digits of pi for character strings. For instance, searching for Jesus gives us:
search string = "jesus"
25-bit binary equivalent =   0101000101100111010110011

search string found at binary index =   514534284
binary pi    : 1110001001010001011001110101100110000000000000101011110110000011
binary string:         0101000101100111010110011                          
character pi    : sxgepajxpkkt;gbjesus__bwvawwmn;n:,tyjj
character string:                jesus
But a search for Muhammed, the Prophet of Islam, yields
search string = "muhammed"
40-bit binary equivalent =   0110110101010000000101101011010010100100

string does not occur in first 4 billion binary digits of pi
Which might disturb any Muslims reading except that a search for Islam gives a result at 758395516 digits but a search for Christianity fails. Why can't pi tell us whose religion is right? What is going on with these mixed signals? The answer is that, practically speaking (and probably rigorously), the digits of pi are random. The reason Islam appears and Christianity doesn't is that Christianity has twelve letters in it while Islam has only 5. If each digit of pi is drawn randomly, in a completely uncorrelated way from those around it, and we are representing the digits in base 26 (so that we have the alphabet to work with), then the probability of getting any particular string of five characters is just one twenty-sixth to the 5th power, which is approximately 8.4e-8. The probability of finding Christianity (or any other 12 letter string, for that matter) is one twenty-sixth to the power 12 which is 1e-17, nine orders of magnitude smaller. Christianity should have chosen shorter name. If you are looking for enlightenment in pi, you are quite probably just as well off doing so in the digits produced by a high quality random number generator. We can, with enough data, test the hypothesis that pi's digits are random and get a confidence interval. Of course we can never completely disprove the possibility that pi is not random (and its actually not clear that a random pattern contains no information) but we can at least give a likelihood that our hypothesis is wrong. We do this by grabbing a bunch of digits of pi and comparing the expected distribution of digits (all digits occur equally often) with the actual distribution calculated from the data. While it would be fun to count the digits ourselves, people with much greater resources than I have on hand have already counted the digits of pi out to 1200000000000 places. The results are
0 119999636735
1 120000035569
2 120000620567
3 119999716885
4 120000114112
5 119999710206
6 119999941333
7 119999740505
8 120000830484
9 119999653604
Our model for the digits of pi is a discreet probability distribution which produces one of ten results with equal probability. A standard test for whether certain data are drawn for a hypothetical distribution is the Pearson's Chi Squared test. This is given by Where Oi is the observation (in this case the number of digits with value i and Ei is the expected number given by the model. In this case, for 1200000000000 digits of pi, the expected count for each digit is just a tenth of that value, or 120000000000. The value of chi in the above expression can be used with the partial distribution function of the chi squared distribution to ask "what is the probability that we measure this data set or a more divergent one if the null hypothesis is true?" In this case, the answer is (from a digital back of the envelope calculation) p = 0.15 which gives us some sense that these numbers really are normally distributed. While its never been mathematically proven that pi is normal, there are some recent results like the Bailey-Borwein-Plouffe formula which suggest that it's very likely (although not in a probabilistic sense). Maybe we will be lucky enough to live to see the day its proven random and we can all move on to spending our valuable time obsessing over the golden ratio. Some interesting links: Pi at the Wikipedia Visualize Pi Search Pi for Strings Finally, if you like this post, you may enjoy reading the Tract Magazine blog Tractable Items. 1 ...or unwanted, for that matter, any interest in things other than NASCAR and the foibles of the Olsen Twins is welcome. Note to self: celebrity model destruction derby.