Friday, December 27, 2013

Leslie Valiant is probably British. Or old.

I got Leslie Valiant's new book, Probably Approximately Correct, for Christmas.  I'm embarrassed to admit that I was not familiar with the author, especially since he won the Turing Award in 2010.  But I wasn't, and that led to a funny sequence of thoughts, which leads to an interesting problem in Bayesian inference.

When I saw the first name "Leslie," I thought that the author was probably female, since Leslie is a primarily female name, at least for young people in the US.  But the dust jacket identifies the author as a computer scientist, and when I read that I saw blue and smelled cheese, which is the synesthetic sensation I get when I encounter high-surprisal information that causes large updates in my subjective probabilistic beliefs (or maybe it's just the title of a TV show).

Specifically, the information that the author is a computer scientist caused two immediate updates: I concluded that the author is more likely to be male and, if male, more likely to be British, or old, or both.

A quick flip to the back cover revealed that both of those conclusions were true, but it made me wonder if they were justified.  That is, was my internal Bayesian update system (IBUS) working correctly, or leaping to conclusions?

Part One: Is the author male?

To check, I will try to quantify the analysis my IBUS performed.  First let's think about the odds that the author is male.  Starting with the name "Leslie" I would have guessed that about 1/3 of Leslies are male.  So my prior odds were 1:2 against.

Now let's update with the information that Leslie is a computer scientist who writes popular non-fiction.  I have read lots of popular computer science books, and of them about 1 in 20 were written by women.  I have no idea what fraction of computer science books are actually written by women.  My estimate might be wrong because my reading habits are biased, or because my recollection is not accurate.  But remember that we are talking about my subjective probabilistic beliefs.   Feel free to plug in your own numbers.

Writing this formally, I'll define

M: the author is male
F: the author is female
B: the author is a computer scientist
L: the author's name is Leslie

then

odds(M | L, B) = odds(M | L) like(B | M) / like(B | F)

If the prior odds are 1:2 and the likelihood ratio is 20 (my rough reading of the 1-in-20 estimate above), the posterior odds are 10:1 in favor of male.  Intuitively, "Leslie" is weak evidence that the author is female, but "computer scientist" is stronger evidence that the author is male.
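For anyone who wants to plug in their own numbers, here is a minimal Python sketch of that update.  The function and variable names are mine, chosen for illustration, and the inputs are just the subjective guesses from the text (prior odds of 1:2 and a likelihood ratio of 20).

```python
from fractions import Fraction

def update_odds(prior_odds, likelihood_ratio):
    """Bayes's rule in odds form: posterior odds = prior odds * likelihood ratio."""
    return prior_odds * likelihood_ratio

def odds_to_prob(odds):
    """Convert odds in favor (o:1) to a probability."""
    return odds / (1 + odds)

# Subjective guesses from the text, not measured data.
prior_male = Fraction(1, 2)   # odds(M | L): 1:2 against male
lr_cs = 20                    # like(B | M) / like(B | F), roughly

posterior_male = update_odds(prior_male, lr_cs)
prob_male = odds_to_prob(posterior_male)

print("posterior odds (male : female):", posterior_male, ": 1")
print("posterior probability of male: %.2f" % float(prob_male))
```

Using `Fraction` keeps the odds exact, so it is easy to see that 10:1 corresponds to a probability of 10/11, a little over 90%.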

Part Two: Is the author British?

So what led me to think that the author is British?  Well, I know that "Leslie" is primarily female in the US, but closer to gender-neutral in the UK.  If someone named Leslie is more likely to be male in the UK (compared to the US), then maybe men named Leslie are more likely to be from the UK.  But not necessarily.  We need to be careful.

If the name Leslie is much more common in the US than in the UK, then the absolute number of men named Leslie might be greater in the US.  In that case, "Leslie" would be evidence in favor of the hypothesis that the author is American.

I don't know whether "Leslie" is more popular in the US.  I could do some research, but for now I will stick with my subjective update process, and assume that the number of people named Leslie is about the same in the US and the UK.

So let's see what the update looks like.  I'll define

US: the author is from the US
UK: the author is from the UK

then

odds(UK | L, B) = odds(UK | B) like(L | UK) / like(L | US)

Again thinking about my collection of popular computer science books, I guess that one author in 10 is from the UK, so my prior odds are about 1:10, that is, roughly 10:1 against UK.

To compute the likelihoods, I use the law of total probability, summing over the two cases (male and female) weighted by the probability that the author is male (which I just computed).  So:

like(L | UK) = prob(M) like(L | UK, M) + prob(F) like(L | UK, F)

and

like(L | US) = prob(M) like(L | US, M) + prob(F) like(L | US, F)

Based on my posterior odds from Part One (10:1, which is about 91%; I'll round to 90%):

prob(M) = 90%
prob(F) = 10%

Assuming that the number of people named Leslie is about the same in the US and the UK, and guessing that "Leslie" is gender neutral in the UK:

like(L | UK, M) = 50%
like(L | UK, F) = 50%

And guessing that "Leslie" is primarily female in the US:

like(L | US, M) = 10%
like(L | US, F) = 90%

Taken together, like(L | UK) comes out to 0.50 and like(L | US) to 0.18, so the likelihood ratio is about 3:1, which means that knowing L and suspecting M is evidence in favor of UK.  But not very strong evidence.
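The arithmetic for Part Two can be sketched the same way.  Again the names are my own, the inputs are the guesses listed above, and I am reading the "one author in 10" estimate as prior odds of roughly 1:10 for UK.  The last step, multiplying the prior odds by the likelihood ratio, is not spelled out in the text; I include it only to show where the update lands under these assumptions.

```python
def total_likelihood(p_male, like_given_male, like_given_female):
    """Law of total probability over the two cases, male and female."""
    return p_male * like_given_male + (1 - p_male) * like_given_female

# Subjective guesses from the text, not measured data.
p_male = 0.9                                   # rounded posterior from Part One

like_uk = total_likelihood(p_male, 0.5, 0.5)   # "Leslie" gender neutral in the UK
like_us = total_likelihood(p_male, 0.1, 0.9)   # "Leslie" primarily female in the US

likelihood_ratio = like_uk / like_us

prior_odds_uk = 1 / 10                         # assumed reading: about 1:10 for UK
posterior_odds_uk = prior_odds_uk * likelihood_ratio

print(f"like(L | UK) = {like_uk:.2f}")
print(f"like(L | US) = {like_us:.2f}")
print(f"likelihood ratio = {likelihood_ratio:.2f}")
print(f"posterior odds for UK = {posterior_odds_uk:.2f}")
```

With these inputs the likelihood ratio is a bit under 3, and the posterior odds still lean against UK, just much less than the prior did.  That is consistent with "more likely to be British" as a statement about the direction of the update rather than the final answer.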

Summary

It looks like my IBUS is functioning correctly or, at least, my analysis can be justified provided that you accept the assumptions and guesses that went into it.  Since any of those numbers could easily be off by a factor of two, or more, don't take the results too seriously.