The New Big Problem

At a table during last week's Business of Software conference, I met a fellow whose name escapes me now. Hardly a new occurrence for me, but I do recall the topic we were discussing.

His company does big data migrations for big government agencies. So for example if you're in charge of running the standardized health insurance across Ohio, you need data from every hospital in the state. And of course, each of these hospitals has been storing its data in some insane scheme that was cobbled together by the Pascal programmer who wrote the original record-keeping software back in 1975. It probably wasn't insane, actually. It probably made perfect sense. At the time.

The real problem isn't that this particular Pascal programmer was insane. The problem is that there was more than one Pascal programmer in Ohio in 1975 (I'm guessing here, but give me a pass on that one). And so the solution that Pascal Programmer A chose for Hospital I was different (perhaps very different, perhaps a little different, doesn't matter) from the solution that Pascal Programmer B chose for Hospital II. Even if the SAME PATIENT was recorded in both hospitals, the data would look subtly different. Pascal Programmer A recorded First_Name and Last_Name as separate columns in a Patient_Data table, while Programmer B stored NameF and NameL as columns in a Personal_Info table. That's a trivial problem to solve, of course, if you're only worrying about these two databases. It gets more substantial if you're dealing with thousands of databases.
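To make the two-hospital case concrete, here is a minimal Python sketch of what a hand-written mapping might look like. The column names First_Name, Last_Name, NameF and NameL come from the example above; everything else (the hospital keys, the canonical field names, the normalize function, the sample patient) is made up for illustration and is not how any real migration tool works.

```python
# Hypothetical hand-written mappings: one per source database, each
# translating that source's column names into a shared set of field names.
FIELD_MAPS = {
    "hospital_1": {"First_Name": "first_name", "Last_Name": "last_name"},
    "hospital_2": {"NameF": "first_name", "NameL": "last_name"},
}


def normalize(source, record):
    """Rename a record's columns to the shared names for its source.

    Columns with no entry in the mapping are dropped in this sketch.
    """
    mapping = FIELD_MAPS[source]
    return {mapping[col]: value for col, value in record.items() if col in mapping}


# The same (invented) patient as each hospital might have recorded them.
from_hospital_1 = normalize("hospital_1", {"First_Name": "Ada", "Last_Name": "Jones"})
from_hospital_2 = normalize("hospital_2", {"NameF": "Ada", "NameL": "Jones"})

print(from_hospital_1 == from_hospital_2)
```

Two entries in that dictionary are easy. The trouble is that somebody has to write one entry per database, by hand, and there are thousands of databases.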

And not all problems in this space are trivial. The same data (or same types of data) can be stored in a practically unlimited number of ways, and given the oft-frustrating creativity and ingenuity of people, you can bet that there will be as many variations as there were people involved in the varying.

So this gentleman I met obviously has his work cut out for him, and from what I could tell he appears to be doing a roaring business handling this sort of thing. Because if you do want this sort of work done, there's really no solution other than to get a bunch of people to go through all your varied databases and manually map one set of data to the other. Then they perform whatever complicated trickery your chosen conversion requires, so that at the end of the whole process, you have ONE dataset that includes everything all the originating databases contained.

And I thought to myself, "Wouldn't it be amazing if somebody figured out how to automate this process? If all the data in the world could just be connected to all the other data in the world? You'd never have to transfer information around, you wouldn't have to keep track of things in various places and forget to update them here after you've updated them there. It would be a transformative thing for the world, if every piece of information we had could talk to every other piece."

I used to think the big problem right now was latency. That the next big innovation was going to appear as some way to accelerate our ability to transfer information from one place to the next. But data normalization on a global scale is BIGGER.

I don't know how it can be done. Just thinking about it, it seems like the most tedious manual process of trivial decision-making imaginable. And yet how do you devise an automated process that can look through the masses of databases around the world and understand how they connect to each other? But once you'd done it, well, that would be a world-sized oyster facing you, right there.

I wish I were smarter.

And that's hardly a new occurrence for me, either.

Photo: "network spheres" by gerard79.