Tuesday, July 9, 2013

Interesting response to Nature Article "Biology must develop its own big data systems"

The original article, found here, seems to place the blame on the database engineers. I also found the response by a commenter, Tony Berno, interesting:

Tony Berno • 2013-07-08 03:45 PM

This editorial is frustrating because it could have been written fifteen years ago, and there has been little in the way of progress since then. If "people are not going to change" and "the problem is not technical", then what opportunity is there for progress? We already have excellent systems for representing, maintaining, and querying data; conventional relational databases are a particularly mature example. The SQL language used to query them has a few counter-intuitive features, but on the whole it is difficult to imagine an easier or more comprehensive technology for asking a wide variety of questions about large, complex datasets. Yet I regularly encounter senior scientists and informatics directors who look at relational data models and proclaim, almost defiantly, that they "mean nothing" to them. Replacements for this well-established technology are sometimes more powerful or scalable, and are usually sold on their technical merits, but in reality they are motivated by magical thinking. It is assumed that if scientists can formulate questions "without programming", they will obtain answers without effort. Unfortunately, what is called "programming" is actually the easiest part of this process - the discipline required to think rigorously about data and formulate precise questions about it is much harder to learn, and it cannot be delegated or sidestepped by replacing the underlying technology with something "visual" or "intuitive". This approach actually makes the situation worse by establishing unrealistic expectations about the nature of the problem. Also, the requirement of mapping data to clean, transparent, and shared ontologies at the time of production (or before!) cannot be avoided by any current technology.
Databases that are "flexible" in accepting varying data structures simply push the data curation problem to the time of the query, at which point it must be solved independently by multiple users who are even less prepared to address it. This inevitably results in a "roach motel" database - data goes in, but doesn't come out. It is not reasonable to expect every biologist to become a software engineer, but the basics of data modelling and query construction are not especially hard to learn. The current situation is akin to construction workers that refuse to read architectural diagrams. This is simply unsustainable, and it seems to me that as a first step, an introductory course in database technology must be required of all biology graduates. It is an unfortunate truth that many biologists chose their specialty out of a desire to avoid the rigorous mathematics of physics or chemistry. This is a critical mistake. Biology has changed dramatically, becoming one of the most mathematics- and data-intensive of all the sciences. If its culture does not fully embrace the intellectual challenge presented by its own data models, it will forever fall short of its potential.
