Ola! After 5 years, I've abandoned this blog. If you want more, go to boscoh.com

10.05.2005

Of proteins, clusters and musical chairs

Recently, I've been reading about how computationally-intense problems in science, such as protein folding, has been driving the adoption of cutting-edge computer technologies, such as Beowolf clusters. But it works the other way as well. The adoption of Beowolf clusters has, in turn, chipped away at the foundations of protein folding so much so that theory in protein science has had to evolve in ways that accomodate the use Beowolf clusters.

I work in the area of protein folding, one of the hairy workhorses of scientific computing. One of course, would like to simulate a protein molecule in as much detail as possible. In most approaches, the lowest level of detail is the covalent bond: every covalent bond is modeled but not allowed to break. This approach is known as molecular dynamics. Modeling the covalent bonds imposes a limit to the simulation: the biggest step you can take must be smaller than a typical vibration of a covalent bond - you must use baby-steps no bigger than a femto-second, or 0.0000000001 s.

A typical protein has tens of thousands of atoms and one normally bathes a protein in another few thousand water molecules. A realistic system might contain hundreds of thousands of atoms. Given the need to model the covalent bonds, the number of time-steps needed to simulate large motions in a protein, typically a microsecond 0.0000001 s, pushes the envelope of even the largest computing systems.

In the past, only well-endowed labs, with access to restricted Pentagon number-crushers, could dream of simulating a protein using molecular dynamics. Even ten years ago, Peter Kollman got a paper in the prestigous journal Science simply for simulating a small protein (the Villin headpiece) for what seemed like an eternity, a single micro-second (0.000001s). He didn't find out anything new but the study was hailed as a tour de force computation (read: my computer is bigger than yours).

But over the last 5 years, a sea-change has swept the field of protein modeling. What has happened is that linux has now commoditized hardware, processors have reached a critical performance hump, and communication protocols between individual computers have matured (MPI - message passing). For a rather modest investment, one can buy 10 or 20 cheap PC clones (Xeon motherboards), and hook them together into a tightly coupled network, known as a beowolf cluster. In theory, a beowolf cluster provides computation that rivals even outrageously expensive super-computers.

As forward-thinking (and penny-pinching) scientists have jumped onto the beowolf-cluster bandwangon, they've collectively run into a methodological brick wall. How do you slice up a molecular dynamics computer program that was written to run on one computer? The probelm is that running a computer simulation is a little like following a recipe for baking a cake. You have to mix the eggs and flour in a batter before you put the batter in the oven. Putting the flour and eggs in the oven whilst mixing the dough will not result in a cake. However, various strategies have been devised to bake a protein in parallel.

The first is the shot-gun approach. This approach used mainly by Vijay Pande at Standford in the feel-good folding@home program. Folding@home is unique in that it farms out the computation to tens of thousands of desktop computers scattered around the world. The essence of the shot-gun approach is that running a very long molecular-dynamics of a single protein should be just like running thousands of short simulations of the same protein. In order to make sure that each simulation is different, the protein is scrambled at the begining of each simulation. This makes it easy to farm out a short simulation to one of the many desktop computers in the folding@home project. At the end of all the simulations, the results are averaged out to find the consensus solution. The prevailing philosophy is that with enough shots in the catridge, you will hit the bird, no matter what.

Another approach is the divide-and-conquer approach. Machiavelli famously advocated that, when facing a coalition of enemy forces, instead of hurtling into an full-frontal assault, one should learn to play each enemy against another. By using subterfuge and dipolacy to foster internal strife, victory is brought closer to fruition. Similarly, in a simulation of a protein and its vassal water molecules, the simulation is broken-up into little spatial grids where all the atoms in each grid-cell is simulated by a single computer in a beowolf cluster. Just as a successful divide-and-conquer approach depends on subterfuge and diplomancy, the success of the simulation will depend on the quality of the cross-talk between the grid-cells. This is a rather new technique and time will tell if it can be successfully applied to protein folding.

But by far the most popular technique is replica-exchange, or the musical-chairs approach to molecular-dynamics simulation. In replica-exchange, a modest number of copies of the protein are run on the processors of a beowolf cluster. But unlike the shotgun approach, each copy of the protein, called a replica, is run for a very long time. The replicas running on the beowolf clusters play a game of musical chairs. The replicas are mixed around so that each processor gets a chance to simulate each of the replicas. Of course, there's no point in mixing the replicas around unless each chair is different. When the music stops, each replica sits down on a chair with a different temperature - some are tepidly warmed at room-temperature, whilst others are hot enough to deep-fry a turkey. The idea is mixing-and-matching replicas at different temperatures supposedly speeds-up the simulation where only proteins that are in a good shape can sit down onto a cooler chair. This method of speeding up the simulation is wildly speculative but seems to be tailor-made for beowolf clusters - only a few number of processors are used with swaps between processors happening at regular intervals. It calls to mind afternoon cable television commericals - you really do get more for less.

But during a recent conference, my boss heard a speaker quip that replica-exchange was a hack technique that doesn't improve molecualar-dynamics simulations. The room erupted. It turned out that pretty much everyone in the room was using replica-exchange. The speaker then pointed out that, anecodotal evidence withstanding, nobody had yet shown quantitatively that replica-exchange was more efficient. But what was left unspoken was that the guys in the room had to believe in replica-exchange, if only for the luxury of using their affordably cheap beowolf-clusters.

1 comment:

Chineseblade said...

Scientists, in their unique bigotry, tend to think that science drives technology. But really technology drives science. And drive is a weak word. Technology DEFINES science. It gives tools to science and scientists can only use these tools, plus a bit of their imaginations, which are invariably oversated in importance. Newton may have discovered the laws of gravity, but it is Gallileo's telescope that ushered in modern astronomy. And astronomers had to believe in what they saw behind the lens. Fortunately, they had a good reason to.

Take a more recent example. Since Max Perutz first determined the structure of haemoglobin by X-ray crystallography, the technique has been the de facto method of structural determination of proteins and nucleic acids. Almost all the proposed mechanism of protein functions are based on the crystallographic structures. Sure these structures are only static images of dynamic proteins, and we all agree that protein dynamics are often essential to their function. But no technology has managed to determine dynamic structures of proteins (aside from limited successes of NMR spectroscopy), so crystallography is still the best we have got. After all, scientific truth is not what can be proven true, but what cannot be proven false, yet.

So run replica exchange on the amazingly affordable Beowulf clusters, and be content doing so.