PhD final draft complete

The last few months were spent at conferences and sitting in front of a computer knocking out the final draft of the thesis. I was working on another project over the Christmas break, but I’ll discuss that in another post. What I’m going to talk about here is the status of our Heritage Health Prize efforts.

Basically, we’ve stopped working on it. We haven’t submitted since August last year, and we’ve slipped to 12th position (although we climbed three places due to teams ahead of us merging). There are a few reasons …

  • We ran out of ideas. Simple as that. Actually, I’ve got a few ideas I haven’t tried, but they’re a lot of effort to implement and the pay-off won’t be worth it.
  • It’s not cost-effective. The US$3 million prize won’t go off, trust me. So the best we can hope for is $500,000. Shared between two people, for two years’ work, converted to Australian dollars, I’d be better off with a real job. And that’s assuming we win.
  • I’ve learned all I wanted to learn. I’d never done data mining before, so one motivation for competing was to get up to speed on the latest techniques. Mission accomplished. I’ve now resurrected my undergrad linear regression skills and learned all about decision trees. When it comes to learning new skills, I hit the point of diminishing returns long ago.
  • It isn’t useful. Probably the greatest pleasure I gain from writing software is knowing that someone will use it. I hate wasted effort. That’s why I much prefer the business world over academia. Unfortunately, due to privacy safeguards, the data provided in the competition is nothing like real world data, so the algorithms we develop will never be used in practice. That wasn’t the competition’s intention, but that’s the way it will play out.

Continuing on the last point, consider the following example. Probably the easiest hospitalization outcome to predict is childbirth. On the day a pregnant woman gets her first medical check-up, the doctor can pretty much pencil in the date she’ll need a hospital bed. Sure, some pregnancies end in miscarriage or late-term abortion, but they often require hospitalization as well.

Unfortunately, the HHP data doesn’t contain enough information to figure out the date of conception. Or to tell for certain if the patient was pregnant. Or if they had an abortion after discovering they were pregnant. You can tell when they actually gave birth (it’s a hospitalization event with a specific code), but when I tried to predict those outcomes I was wrong almost as often as I was right.

In other words, I think the world’s best data mining software, trained on crippled data, will be less effective at predicting hospitalization than a medical professional using real data. So the software will never be used, and all the effort will be wasted.

