My current side project is the Heritage Health Prize, a data mining competition with a $3 million first prize. I’ve teamed up with a much smarter friend and started playing around with new algorithms, in particular Random Forests.
I think we’re in with a chance. I believe that the difference between mathematicians and programmers is that mathematicians try to model the world with equations, whereas programmers use look-up tables, and this competition seems to favour the programmers (like me).
To prevent individual patients from being identified, the dataset has been massaged to conceal rarely-occurring values. So instead of providing the patient’s age as a number, they give us a range, accurate to the nearest decade. Instead of a continuous variable, which could be plugged into an equation, we have nine discrete values, which favours a look-up table.
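To make the look-up table idea concrete, here is a minimal sketch in Python. It is not our actual competition code. The age-range labels, the `build_lookup` and `predict` helpers, and the training rows are all made up for illustration, and the real dataset's labels and target variable may well look different.

```python
from collections import defaultdict

# Hypothetical labels for the nine age ranges; the real dataset may differ.
AGE_BUCKETS = ["0-9", "10-19", "20-29", "30-39", "40-49",
               "50-59", "60-69", "70-79", "80+"]


def build_lookup(rows):
    """Map each age bucket to the mean outcome observed for that bucket."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for bucket, outcome in rows:
        totals[bucket] += outcome
        counts[bucket] += 1
    return {bucket: totals[bucket] / counts[bucket] for bucket in counts}


def predict(lookup, bucket, default=0.0):
    """Return the stored mean for a bucket, or a default if it was never seen."""
    return lookup.get(bucket, default)


# Made-up training rows: (age bucket, outcome value).
rows = [("20-29", 0), ("20-29", 2), ("70-79", 4), ("70-79", 6), ("80+", 5)]
lookup = build_lookup(rows)

print(predict(lookup, "70-79"))  # mean outcome of the two "70-79" rows
print(predict(lookup, "0-9"))    # unseen bucket, falls back to the default
```

With only nine possible keys there is nothing to fit: the table just remembers an average per bucket, which is roughly the simplest thing a programmer would reach for here.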
For the time being we’re limiting our efforts to building generic algorithms rather than focusing on the actual data. There are some inconsistencies in the dataset, and I wouldn’t be surprised if there are changes before the release of the rest of the data on May 4th.
But it looks like a fun competition, and I’m picking up some useful skills along the way.