
Uncovering patterns in data

Lecture notes from biophysics professor Sarah Marzen

How to build a model that respects physical and biophysical assumptions when you have no idea what is going on.


The “holy grail” of biophysics

Sarah Marzen, Ph.D.
W. M. Keck Science Department

Biophysics is “searching for principles”, and boy is it data-heavy

There has been a plethora of data-gathering advances in biophysics that let us probe life at the single-molecule, single-cell, organism, or even population level.  We can track a protein’s position as a function of time from super-resolution microscopy readouts.  We can count the number of microbes of each type as a function of time in a microbial population.  We can pin down, with the advent of artificial intelligence, the exact positioning and behavior of a worm or a fly as a function of time.  We can even record brain activity as a function of time to varying degrees of spatial and temporal precision.  Some of the best data sets in the world are available online for anyone to analyze.

What do we do now?  How do we make sense of all this data?

If we could only make sense of all this data, we would have the Holy Grail of biophysics: a quantitative theory of life.

When wrestling with the data, it immediately becomes obvious that one can build highly predictive models that impart almost no understanding.  When I was a postdoc at MIT, I gratefully worked for Nikta Fakhri on a project that involved fluorescent readouts of a particular protein in starfish eggs.  The protein made beautiful spiral patterns as it moved through the cell, and at first we wanted simply to predict the next video frames from the past ones.  We were hopeful, and somewhat convinced, that a predictive model would lead to new explanations of what was going on in the cell.

Boy, were we wrong.

We could easily build wonderful predictive models.  A combination of PCA and ARIMA models did quite well.  A combination of PCA and reservoir computing did even better.  Then, a CNN and LSTM beat even that, eking out ever smaller gains in predictive accuracy.
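To make that concrete, here is a minimal sketch of a PCA-plus-ARIMA pipeline of the sort described above, written with scikit-learn and statsmodels.  The toy array stands in for real microscopy frames, and the number of components and the ARIMA order are illustrative assumptions, not the settings we actually used.

```python
import numpy as np
from sklearn.decomposition import PCA
from statsmodels.tsa.arima.model import ARIMA

# Toy stand-in for T video frames of H x W fluorescence intensity,
# flattened into a (T, H*W) matrix.  Real frames would come from
# the microscopy readouts described above.
rng = np.random.default_rng(0)
T, H, W = 200, 32, 32
frames = rng.standard_normal((T, H * W)).cumsum(axis=0)

# Step 1: PCA compresses each frame into a handful of spatial modes.
pca = PCA(n_components=5)
modes = pca.fit_transform(frames)  # shape (T, 5)

# Step 2: fit an ARIMA model to each mode's time series and
# forecast one step ahead.  The (2, 1, 1) order is an assumption.
next_modes = [
    ARIMA(modes[:, k], order=(2, 1, 1)).fit().forecast(steps=1)[0]
    for k in range(modes.shape[1])
]

# Step 3: map the forecast modes back to pixel space to get the
# predicted next frame.
next_frame = pca.inverse_transform(np.array(next_modes).reshape(1, -1))
print(next_frame.shape)  # (1, H*W)
```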

But what had we learned?

In physics, there is an 80/20 rule: we want a principle that nails 80% of the accuracy with 20% of the effort.  When Copernicus proposed that the Earth revolved around the Sun, his theory did not produce better predictions than the geocentric theories with their epicycles of epicycles.  However, the epicycles of epicycles were so much more complicated that the simpler rule, which could basically nail the predictions, was preferable.  (Granted, Copernicus thought that the Earth’s orbit was circular, and so had to add an epicycle of his own, but still.)

So, too, with starfish oocytes.  Tzer-Han, then a graduate student in the lab, added expert knowledge to the model of proteins swirling around on the surface of the starfish oocyte.  He hypothesized that these protein concentrations obeyed a reaction-diffusion equation, so that all we had to do was fit the reaction term.  This worked, not quite as well as the complicated models from before in terms of predictive accuracy, but with some interpretability.
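As a sketch of what “fit the reaction term” can look like, suppose a single concentration field u(t, x, y) obeys ∂u/∂t = D∇²u + R(u) on a periodic grid with known diffusion constant D.  Then R(u) can be regressed from finite-difference estimates of the time derivative and the Laplacian.  The grid, time step, value of D, and cubic polynomial basis below are all illustrative assumptions, not the actual analysis from the lab.

```python
import numpy as np

# Toy stand-in for a concentration field u(t, x, y) on a grid with
# time step dt and spacing dx.  Real data would be the fluorescence
# readouts; D is assumed known for this sketch.
rng = np.random.default_rng(1)
u = rng.random((100, 64, 64))
dt, dx, D = 0.1, 1.0, 0.5

# Finite-difference estimates of du/dt and the Laplacian
# (np.roll gives periodic boundary conditions).
du_dt = (u[1:] - u[:-1]) / dt
lap = (np.roll(u, 1, axis=1) + np.roll(u, -1, axis=1) +
       np.roll(u, 1, axis=2) + np.roll(u, -1, axis=2) - 4 * u) / dx**2

# If du/dt = D * lap(u) + R(u), the residual isolates the reaction term.
residual = (du_dt - D * lap[:-1]).ravel()

# Fit R(u) as a cubic polynomial in u by least squares.
uu = u[:-1].ravel()
basis = np.vstack([np.ones_like(uu), uu, uu**2, uu**3]).T
coeffs, *_ = np.linalg.lstsq(basis, residual, rcond=None)
print("fitted coefficients of R(u):", coeffs)
```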

The lesson from this?  There is a surplus of data, but what is needed is more than just model.fit and model.predict.  Data science needs models with explanatory power if it is to reach the Holy Grail of biophysics and of many other fields.