by Sarah Marzen, Associate Professor of Physics, W. M. Keck Science Department
I get some of the best experimental collaborators! They just send me cleaned data and ask, “What do I do with it?” We always come up with something, and there’s always a cool project. From mouse obesity to neural modeling to immune systems, something always comes out.
Do you want to know my secrets? They’re not really secrets. Basically, if an experimentalist hands you good data, it’s often entirely possible to do a first pass at modeling the data in a way that produces novel insight– without thinking too hard and without spending too much time wondering how to set things up.
First, you figure out whether their scientific questions are in the realm of unsupervised learning or supervised learning. Then, you figure out whether their data are categorical or continuous. From these two things, you can figure out a range of attacks, from things easily done with SciPy to things easily done with Keras or PyTorch, and off you go. This is not to say that you shouldn’t ask your own scientific questions, which may not be your collaborators’ questions and may be in a different realm entirely! For example, if you have neural activity, you could model its relation to behavior, making it supervised learning– or you could just model the neural activity itself, making it unsupervised learning, or potentially semi-supervised learning. It’s up to you.
The goal of this piece is not to give you a complete introduction to all of these things, but rather to give you ways to figure out where to look if you want to get started on modeling something. Along those lines, the references are not necessarily the original articles– they’re whatever would most help someone starting out.
Before we start, let’s quickly introduce two key concepts: probability distributions, and functions.
A quick primer
A probability distribution is, for all practical purposes, a list of frequencies. Someone hands you a list of objects x, all representative of what you might see from the same “source”. This could be a neural system spitting out binary vectors that describe the activity of the neurons (1 for active, 0 for silent, in each time bin) [1]; this could be a list of the velocities of birds in a flock [2]. In the first case, the number of times you see a particular neural activity pattern x in N samples is, when N is large, approximately Np(x), where p(x) is a probability distribution. In the second case, with the positions and velocities of birds, the frequency with which we see a velocity between v and v+dv is p(v)dv, where p(v) is a probability density function. In either case, probabilities and frequencies are one and the same.
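To make that concrete, here is a minimal sketch in Python, using made-up binary “neural activity” rather than a real recording, of estimating p(x) as a list of frequencies:

```python
import numpy as np

# Made-up "neural activity": 1000 samples of 3 binary neurons (toy data, not a real recording).
rng = np.random.default_rng(0)
x = (rng.random((1000, 3)) < 0.4).astype(int)

# Estimate p(x) as a list of frequencies: count each distinct pattern and normalize.
patterns, counts = np.unique(x, axis=0, return_counts=True)
p = counts / counts.sum()
for pattern, prob in zip(patterns, p):
    print(pattern, round(prob, 3))
```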
Sometimes, people will use a Bayesian interpretation, in which probabilities are beliefs about what is likely. For every frequentist algorithm, there is a Bayesian interpretation.
A function will take x and turn it into f(x), which could even live in a new space. Neural activity x could get translated to an estimate of a stimulus; bird velocity v could get mapped to an estimate of the bird’s brain activity by some function. Neural networks are basically stacks of layers in which each layer is a function of the previous layer, with trainable weights and biases, resulting in some overall very complicated, very nonlinear function of the input x.
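As a toy illustration (with arbitrary made-up weights rather than a trained network), here is what “a function of the previous layer with trainable weights and biases” looks like in code:

```python
import numpy as np

def layer(x, W, b):
    """One layer: a linear map with weights W and biases b, passed through a nonlinearity."""
    return np.tanh(W @ x + b)

# A made-up input and two stacked layers give a complicated nonlinear function of x.
rng = np.random.default_rng(1)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)
f_of_x = layer(layer(x, W1, b1), W2, b2)
print(f_of_x)
```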
Unsupervised learning
Imagine someone hands you disparate information that is all from the same source– in other words, a set of data points x that all belong to the same distribution p(x). There are a number of things you might imagine doing with this set of data points, all to get an understanding of the underlying structure.
Modeling
Imagine that you have neural activity from some brain region that you are trying to model. This neural activity comes in the form of “spike trains”. Every neuron fires, really quickly, emitting what is essentially a spike. What you see in a multi-electrode recording is a series of spikes from every neuron, at times that we try to understand as neuroscientists. Some questions that you might ask about neurons are about the relationship between neurons and the sensory signals they see, or between neurons and the organism’s behavior. But some questions are just about the neural activity itself. Are the neurons connected, functionally, in some way? Or do they behave independently? If they’re connected functionally, are their connections pairwise, three-way, or higher-order?
How do you figure this out? You build some sort of model of p(x), the probability distribution. A classical way to do so is to fit a Maximum Entropy model [3], which in effect is just an exponential linear model– meaning that the probability p(x) is proportional to the exponential of a linear combination of functions of x. You then simply fit the weights so that the model matches the data, usually by minimizing what’s called the Kullback-Leibler divergence [4]. A nonparametric model that might work better could come from Parzen window estimates, for example [5].
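Here is a rough sketch of the maximum entropy idea on made-up binary data, fitting fields and pairwise couplings by gradient descent on the Kullback-Leibler divergence (equivalently, matching the data’s means and pairwise correlations). It enumerates all 2^n patterns, so it is only meant for a handful of neurons:

```python
import numpy as np
from itertools import product

# Made-up "spike" data: 1000 samples of 5 binary neurons, with one induced pairwise correlation.
rng = np.random.default_rng(0)
data = (rng.random((1000, 5)) < 0.3).astype(float)
data[:, 1] = np.where(rng.random(1000) < 0.5, data[:, 0], data[:, 1])

n = data.shape[1]
states = np.array(list(product([0, 1], repeat=n)), dtype=float)  # all 2^n possible patterns

# Sufficient statistics of the data: mean activities and pairwise correlations.
data_mean = data.mean(axis=0)
data_corr = data.T @ data / len(data)

h = np.zeros(n)        # fields (one per neuron)
J = np.zeros((n, n))   # pairwise couplings (upper triangle used)

def model_stats(h, J):
    """Exact means and correlations under p(x) proportional to exp(h.x + sum_{i<j} J_ij x_i x_j)."""
    log_p = states @ h + np.einsum("ki,ij,kj->k", states, np.triu(J, 1), states)
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    return p @ states, states.T @ (p[:, None] * states)

# Nudge the parameters until the model's statistics match the data's statistics.
for _ in range(3000):
    m, c = model_stats(h, J)
    h += 0.1 * (data_mean - m)
    J += 0.1 * np.triu(data_corr - c, 1)

print("data means: ", np.round(data_mean, 3))
print("model means:", np.round(model_stats(h, J)[0], 3))
```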
More complicated methods involve fitting a restricted Boltzmann machine or a Boltzmann machine [6], which is part of what got Geoffrey Hinton the Nobel Prize.
Dimensionality reduction
Sometimes, what you want are a few “components” of x that describe basically what’s going on with x. How does this work? Let’s turn to neural data again. There are maybe 200 or so neurons in every recording, but maybe only a few “components” of their activity that really tell you what’s going on. What you want is a lower-dimensional vector r, derived from your high-dimensional vector x, that captures most of the variance– most of the fluctuations– in your data. If you want to get fancy later, you can look for an r that captures aspects of x that tell you about some other variable y, but for unsupervised learning, we’re simply staring at x.
The best-known techniques for this are PCA [7], in which you find a linear projection of your data that captures as much variance as possible, and t-SNE [8] and UMAP [9], which try to find mappings of your data that retain a sense of distance between data points. Of later interest might also be autoencoders, which you can roughly think of as neural networks with a bottleneck layer of fewer neurons whose output must be expanded to reconstruct the original data.
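As a minimal scikit-learn sketch, on made-up data standing in for a 200-neuron recording driven by a few hidden components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Made-up "recording": 500 time bins of 200 "neurons" driven by 3 hidden components plus noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 3))
x = hidden @ rng.normal(size=(3, 200)) + 0.1 * rng.normal(size=(500, 200))

# PCA: a linear projection r that captures as much variance as possible.
pca = PCA(n_components=3)
r = pca.fit_transform(x)
print("variance explained:", np.round(pca.explained_variance_ratio_, 3))

# t-SNE: a nonlinear embedding that tries to preserve which points are near which.
r_tsne = TSNE(n_components=2, perplexity=30).fit_transform(x)
print("t-SNE embedding shape:", r_tsne.shape)
```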
Supervised learning
The main difference between supervised learning and unsupervised learning is that in unsupervised learning, there is one information source, while in supervised learning, there are two information sources. One of the information sources spits out x, while the other spits out y. Our job is to relate x to y in some way, depending on what you desire.
Modeling and dimensionality reduction
Sometimes, what we’re looking for are aspects of x that tell us about y. A simple example of why this would matter comes from neuroscience. If someone hands you neural activity, it’s all very well and good to understand the connections in the brain, but what if you want to know what about the neural activity tells you what the sensory stimulus was– whether we saw a lion or a flower? This becomes a problem in which we try to model and then compress the neural activity x so that it retains information about the stimulus y.
This can be formalized through the information bottleneck method, but the easiest way of dealing with this version of dimensionality reduction is through something called Canonical Correlation Analysis (CCA) [10]. Implicit in all these methods is an assumption that the joint probability distribution of x and y takes some particular form. CCA works perfectly for Gaussian variables. If it doesn’t work very well on your data, ask yourself– are my data too far from Gaussian? We are just starting to discover how these algorithms really work.
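A minimal scikit-learn sketch, with made-up x and y standing in for neural activity and a stimulus that share a couple of hidden components:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Made-up data: x (50-dimensional "neural activity") and y (5-dimensional "stimulus")
# that share two hidden components.
rng = np.random.default_rng(0)
shared = rng.normal(size=(1000, 2))
x = shared @ rng.normal(size=(2, 50)) + 0.5 * rng.normal(size=(1000, 50))
y = shared @ rng.normal(size=(2, 5)) + 0.5 * rng.normal(size=(1000, 5))

# CCA finds low-dimensional projections of x and y that are maximally correlated with each other.
cca = CCA(n_components=2)
x_c, y_c = cca.fit_transform(x, y)
for i in range(2):
    print("canonical correlation", i, np.corrcoef(x_c[:, i], y_c[:, i])[0, 1])
```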
Prediction
Imagine someone hands you some variable x and asks you to predict y. We’ve seen instances of this in pretty much every STEM class we’ve ever taken. In the introductory physics labs, we ask students to predict the period of a pendulum from the mass of the bob, the length of the string, and the initial angle the string makes with the vertical right before release. In more complicated systems, you’re given some reagents and asked to predict the output of the chemical reaction. In a really complicated system, someone might hand you the primary structure of a protein (the list of amino acids) and ask you to predict the protein’s three-dimensional structure– another contribution from AI that won the Nobel Prize just this year. In all of these cases, we are looking for a function of x that produces y, or something close to it.
If x and y are real-valued vectors, linear regression is a great way to start. If you don’t have enough data, try regularizing your linear regression, and take a look at my notes on the subject, also on the Data Science Hub website. If instead y is a categorical variable– meaning that y is either this or that and not just any real number– try logistic regression, in which the probability that y is a particular answer is proportional to an exponential of a linear function of x. Logistic regression can also be regularized. The key here is that these simple methods, which you learn even in basic physics classes, can be incredibly powerful for establishing basic relationships. A student of mine from CMC, Jonathan Soriano, and I used linear regression to predict the weight of mice from aspects of their behavior (the frequency with which they are active during the day, for instance) in an unpublished study that supposedly– ahem– is going to be in a book! We used regularized linear regression, not having quite enough data to do something more complicated.
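Here is a sketch of both options on made-up data, using ridge regression as one common choice of regularized linear regression and scikit-learn’s logistic regression (which is L2-regularized by default):

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

rng = np.random.default_rng(0)

# Real-valued y: regularized (ridge) linear regression on made-up behavioral features.
x = rng.normal(size=(80, 10))
y = x @ rng.normal(size=10) + 0.1 * rng.normal(size=80)
reg = Ridge(alpha=1.0).fit(x, y)          # alpha sets the strength of the regularization
print("R^2:", reg.score(x, y))

# Categorical y: regularized logistic regression.
labels = (x[:, 0] + x[:, 1] > 0).astype(int)
clf = LogisticRegression(C=1.0).fit(x, labels)   # smaller C means stronger regularization
print("accuracy:", clf.score(x, labels))
```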
If you want to take it to the next level, the easiest thing to do is to use a neural network. In Keras or PyTorch, neural networks are relatively easy to implement, as long as you can follow the tutorials available on their websites. The neural network will transform x, layer by layer, into your variable y. At first, the mapping will most likely be dismal. But then you train the neural network, changing the weights and biases of its layers so that the output is, on average, as close to the true answer (y) as possible. You can choose any loss function you wish, and categorical data will potentially call for a different loss function and a different final layer than continuous data. I can’t pretend to go into all the details of how a neural network works, but follow the tutorials on the TensorFlow (Keras) or PyTorch websites and you should be ready to go!
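A minimal PyTorch sketch on made-up data (a Keras version would look much the same), with the cross-entropy loss and a final layer sized for a categorical y:

```python
import torch
from torch import nn

# Made-up data: 500 samples with 10 features each, and a categorical y with 3 possible labels.
X = torch.randn(500, 10)
y = torch.randint(0, 3, (500,))

# A small feedforward network: each layer is a trainable function of the previous layer.
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 3),               # final layer: one score per category
)
loss_fn = nn.CrossEntropyLoss()      # a natural loss for categorical y
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Training: adjust weights and biases so the output is, on average, close to the true y.
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
print("final training loss:", loss.item())
```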
A classic case of using a neural network to relate one information stream to another comes from object recognition. If someone draws a digit for you, can you figure out the number? Most likely– but what if the digit is drawn really weirdly? And more importantly, can you get a computer to figure this out for you? This is called the MNIST task, and it’s fun to try to take the intensity pattern describing the picture of the digit x (a matrix of intensities) and classify the digit (a categorical variable that is an integer between 0 and 9)! Try logistic regression first– but a simple neural network can quite easily get up to 98% classification accuracy on a dataset that, decades ago, stumped us.
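As a quick sketch, here is logistic regression on scikit-learn’s small built-in 8x8 digits dataset, which stands in for the full MNIST images:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# scikit-learn's built-in 8x8 digits dataset, a small stand-in for full MNIST.
digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Logistic regression: class probabilities proportional to exp(a linear function of the pixels).
clf = LogisticRegression(max_iter=5000).fit(x_train, y_train)
print("test accuracy:", clf.score(x_test, y_test))
```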
Time series prediction
Sometimes, the supervision comes from part of the data itself. This is key for understanding time series, in which we see a correlated set of symbols in some order and use the past symbols to predict what happens next in the stream. This is relevant, for instance, to stock prices, and is of utmost importance to the lives of traders. How might they use past Apple stock prices, along with perhaps the past stock prices of Apple’s competitors, to predict Apple’s future stock price? There are a number of techniques, starting with autoregressive techniques [11] and moving to reservoir computers [12], and then to recurrent neural networks like the Long Short-Term Memory unit (LSTM) [13]. Transformers can also be a powerful time series modeling technique [14], and have in fact been used to model the English language! They are the underlying force behind ChatGPT. All of these models are straightforwardly implementable in Python with existing code online.
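As the simplest starting point, here is an autoregressive sketch on a made-up price-like series, fitting the next value as a linear function of the previous few values with ordinary least squares:

```python
import numpy as np

# Made-up "price" series: a noisy AR(2) process standing in for a real stock price.
rng = np.random.default_rng(0)
series = np.zeros(500)
for t in range(2, 500):
    series[t] = 0.6 * series[t - 1] + 0.3 * series[t - 2] + rng.normal(scale=0.1)

# Build the regression: predict series[t] from its previous `lags` values.
lags = 2
X = np.column_stack([series[lags - k - 1 : len(series) - k - 1] for k in range(lags)])
y = series[lags:]
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted AR coefficients:", coefs)

# One-step-ahead prediction from the most recent values.
next_value = coefs @ series[-1 : -lags - 1 : -1]
print("predicted next value:", next_value)
```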
References
[1] Schneidman, Elad, et al. “Weak pairwise correlations imply strongly correlated network states in a neural population.” Nature 440.7087 (2006): 1007-1012.
[2] Bialek, William, et al. “Statistical mechanics for natural flocks of birds.” Proceedings of the National Academy of Sciences 109.13 (2012): 4786-4791.
[3] Guiasu, Silviu, and Abe Shenitzer. “The principle of maximum entropy.” The Mathematical Intelligencer 7 (1985): 42-48.
[4] “Kullback-Leibler divergence.” Wikipedia, The Free Encyclopedia, Wikimedia Foundation, 15 October 2024, en.wikipedia.org/wiki/Kullback–Leibler_divergence
[5] Izenman, Alan Julian. “Review papers: Recent developments in nonparametric density estimation.” Journal of the American Statistical Association 86.413 (1991): 205-224.
[6] Hinton, Geoffrey E. “Boltzmann machine.” Scholarpedia 2.5 (2007): 1668.
[7] “Principal component analysis.” Wikipedia, The Free Encyclopedia, Wikimedia Foundation, 15 October 2024, en.wikipedia.org/wiki/Principal_component_analysis
[8] Hinton, Geoffrey E., and Sam Roweis. “Stochastic neighbor embedding.” Advances in Neural Information Processing Systems 15 (2002).
[9] McInnes, Leland, John Healy, and James Melville. “UMAP: Uniform manifold approximation and projection for dimension reduction.” arXiv preprint arXiv:1802.03426 (2018).
[10] Thompson, Bruce. “Canonical correlation analysis.” (2000).
[11] Box, George EP, et al. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[12] Lukoševičius, Mantas, and Herbert Jaeger. “Reservoir computing approaches to recurrent neural network training.” Computer Science Review 3.3 (2009): 127-149.
[13] Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural Computation 9.8 (1997): 1735-1780.
[14] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems 30 (2017).