XML in Biology

summary Flower can be useful in modern biology
folder documentation

Scientific exploration requires the handling of data and information: data, such as the results of experiments; information, such as the structure and content of scientific discourse.

Biological science, particularly that which is in any way connected with genomics, is undergoing rapid changes in its data and information archival and processing needs. These changes are driven by processes of industrialization and by changes in the nature of the kind of science being pursued. Industrialization increases pressure to share data, and advances in genomics and synthetic biology rapidly multiply the quantity of data to be shared.

A clear example can be found in the handling of sequence data. The $1,000 human genome may be a few years away, but each step closer expands the number of organisms that can be sequenced for a small amount of money. Here, for example, is a graph from GenBank.

GenBank is hardly the end of that story. Not all data winds up there: labs have their own collections. And, of course, there is more to share than sequences alone.

My understanding is that, among other uses, researchers across many biological disciplines find it useful to mine pools of such data, and thus also to exchange and share pools of such data with one another.

The Impact of Industrialization

The industrialization of biology is, in part, a greater emphasis on specialization. Rather than being confined to a single lab, a program of research or the development of a product is more likely to be an ad hoc process that pipelines the capabilities of individual, more specialized labs. A paradigmatic example is the iGEM competition, in which one group of labs prepares "parts" (aka "BioBricks") for synthetic biology, and students then use these resources to solve particular engineering tasks.

This is one source of demand for changes to how data and information are handled: data/information exchange costs are a significant part of the transaction costs in coordinating multiple labs, in integrating new equipment, etc. Customers for better tools would find themselves in paradise if all equipment and inter-lab protocols "magically" conformed, flawlessly and seamlessly, to industry standard data representation and exchange formats. Imagine, for example, that exporting to or importing from the iGEM registry took no more effort than clicking a button in the ordinary course of work.

The Impact of Changes in the Science

A century ago, the lifetime work of a prolific biologist might produce, perhaps, a few tens of thousands of pages of drawings, writings, measurements, and so forth. Today, if we measure in bits, a modest sequencing lab can top that handily in a good month's work.

Moreover, this abundance of new data appears to be highly cumulative, by which I mean that scientists obtain non-linear gains if they aggregate this data and then "mine" it. So, once again, there is (rapidly evolving) demand for data exchange formats that will lower the transaction costs of aggregating and manipulating this data.

There is an interesting trend to note: in some cases, the current substitute for fully worked-out exchange formats is centralized services. For example, quite a few businesses sell BLAST searches over various databases. I suspect that this is not a stable solution. Such services are inconvenient to labs not ready to "publish" all of their data, and so people inevitably build substitutes, usually open source. I expect that the big services will, more and more, be used just as shared hubs for raw data feeds.

Why XML? Network Effects!

There is little to no "rocket science" in devising some framework for data exchange formats and so forth. You can see perfectly honorable attempts to do so in plain text formats like FASTA or the GenBank flat file format. There is, though, a tricky strategic problem:

Once you have a data format, then you need tools that use it. You need other vendors to make formats that those tools also work with. How do you write report generators based on the format? Parsers? Input forms? Database indexing rules? Validators?

XML is a good choice of "framework" whenever the question of how to represent and exchange data and information comes up. It isn't that there's anything especially magic about XML's details — they're pedestrian, if thorough. Rather, it's that XML gets nothing especially wrong and that enough people are using it so that you get many of those "tools" more or less for free.
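To make the "tools more or less for free" point concrete, here is a minimal sketch in Python. It assumes a made-up FASTA snippet and an ad hoc XML layout (the `sequences` and `sequence` element names are my own invention, not any standard schema). The only hand-written parsing is for the plain text side; once the data is XML, serialization, parsing, and path queries all come from the standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical FASTA text; identifiers and residues are made up for illustration.
FASTA = """>seq1 example fragment A
ATGCGTACGT
TTGACA
>seq2 example fragment B
GGGCCCAT
"""

def fasta_to_xml(text):
    """Convert FASTA text into an ad hoc <sequences> tree (assumed layout)."""
    root = ET.Element("sequences")
    record = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: the first token is the id, the rest a description.
            ident, _, desc = line[1:].partition(" ")
            record = ET.SubElement(root, "sequence", id=ident, description=desc)
            record.text = ""
        elif record is not None:
            # Residue lines are concatenated into the element's text.
            record.text += line
    return root

root = fasta_to_xml(FASTA)

# From here on, no custom code is needed: stock tools serialize,
# re-parse, and query the data.
xml_text = ET.tostring(root, encoding="unicode")
reparsed = ET.fromstring(xml_text)
lengths = {s.get("id"): len(s.text) for s in reparsed.findall("sequence")}
second = reparsed.find("sequence[@id='seq2']").text
print(lengths, second)
```

The FASTA reader above is the kind of code every plain text format forces someone to write and maintain. The XML half of the sketch needed none.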

Those popular tools are a kind of intermediating technology — they make it cheap and easy to glue together most any two XML-based programs. And that means that if two distinct groups of biologists independently devise different but conceptually related data models, each using XML, then a third group will have an easy time gluing up those new data sources to produce a third. The data is cumulative.
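As a rough illustration of that gluing, the sketch below joins two hypothetical XML documents from two independent labs into a third data set. Every element and attribute name here (`parts`, `assays`, `activity`, and so on) is an assumption made up for the example, and Python's standard library stands in for whatever XML tooling a lab actually prefers.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents with different, independently devised data models.
LAB_A = """<parts>
  <part name="P1"><dna>ATGC</dna></part>
  <part name="P2"><dna>GGCC</dna></part>
</parts>"""

LAB_B = """<assays>
  <assay part="P1" activity="0.8"/>
  <assay part="P2" activity="0.3"/>
  <assay part="P3" activity="0.5"/>
</assays>"""

def join_sources(parts_xml, assays_xml):
    """Build a third data set from parts that also have an assay result."""
    parts = ET.fromstring(parts_xml)
    assays = ET.fromstring(assays_xml)
    # Index lab B's measurements by the part name they refer to.
    activity = {a.get("part"): a.get("activity") for a in assays.findall("assay")}
    combined = ET.Element("characterized-parts")
    for part in parts.findall("part"):
        name = part.get("name")
        if name in activity:
            entry = ET.SubElement(combined, "part", name=name, activity=activity[name])
            entry.text = part.findtext("dna")
    return combined

combined = join_sources(LAB_A, LAB_B)
print(ET.tostring(combined, encoding="unicode"))
```

Neither lab had to anticipate the other's format. The third group only needed to know which attribute in one document names the same thing as an attribute in the other.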

Flower for XML in Biology

Even at its present early stage of maturity, Flower 0.5 represents an opportunity for labs with modest internal IT expertise that possess either XML data sets or data sets readily converted to XML. For example, out of the box, Flower provides an interactive over-the-web XQuery environment, with a customizable user interface, for exploring those data sets.

Copyright

Copyright © 2007 Thomas Lord. Flower source code is licensed under the Open Software License version 3.0.

Copyright © 2007 Thomas Lord. This page is licensed under the Creative Commons Attribution-No Derivative Works 3.0 Unported License.

Flower includes Patent Pending technology.