by Sarah Marzen, Associate Professor of Physics, W. M. Keck Science Department
I get some of the best experimental collaborators! They just send me cleaned data and ask, “What do I do with it?” We always come up with something, and there’s always a cool project. From mouse obesity to neural modeling to immune systems, something always comes out.
Do you want to know my secrets? They’re not really secrets. Basically, if an experimentalist hands you good data, it’s often entirely possible to do a first pass at modeling the data in a way that produces novel insight– without thinking too hard and without spending too much time wondering how to set things up.
First, you figure out whether their scientific questions are in the realm of unsupervised learning or supervised learning. Then, you figure out whether their data are categorical or continuous. From these two things, you can figure out a range of attacks, from things easily done with SciPy to things easily done with Keras or PyTorch, and off you go. This is not to say that you shouldn’t ask your own scientific questions, which may not be your collaborators’ questions and may be in a different realm entirely! For example, if you have neural activity, you could model its relation to behavior, making it supervised learning– or you could just model the neural activity itself, making it unsupervised learning, or potentially semi-supervised learning. It’s up to you.
The goal of this piece is not to give you a complete introduction to all of these things, but rather to give you ways to figure out where to look if you want to get started on modeling something. Along those lines, the references are not necessarily the original articles– they’re whatever would most help someone starting out.
Before we start, let’s quickly introduce two key concepts: probability distributions, and functions.
A quick primer
A probability distribution is, for all practical purposes, a list of frequencies. Someone hands you a list of objects x, all representative of what you might see from the same “source”. This could be a neural system spitting out binary vectors that describe the activity of the neurons (1 for active, 0 for silent, in each time bin) [1]; this could be a list of the velocities of birds in a flock [2]. In the first case, the number of times you see a particular neural activity pattern x in N samples is, when N is large, approximately Np(x), where p(x) is a probability distribution. In the second case, with the positions and velocities of birds, the frequency with which we see a velocity between v and v+dv is p(v)dv, where p(v) is a probability density function. In either case, probabilities and frequencies are one and the same.
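To make that concrete, here is a minimal sketch in Python, using made-up binary “neural activity” rather than a real recording, of estimating p(x) as a list of frequencies:

```python
import numpy as np

# Made-up "neural activity": 1000 samples of 3 binary neurons (toy data, not a real recording).
rng = np.random.default_rng(0)
x = (rng.random((1000, 3)) < 0.4).astype(int)

# Estimate p(x) as a list of frequencies: count each distinct pattern and normalize.
patterns, counts = np.unique(x, axis=0, return_counts=True)
p = counts / counts.sum()
for pattern, prob in zip(patterns, p):
    print(pattern, round(prob, 3))
```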
Sometimes, people will use a Bayesian interpretation, in which probabilities are beliefs about what is likely. For every frequentist algorithm, there is a Bayesian interpretation.
A function will take x and turn it into f(x), which could even live in a new space. Neural activity x could get translated to an estimate of a stimulus; bird velocity v could get mapped to an estimate of the bird’s brain activity by some function. Neural networks are basically stacks of layers in which each layer is a function of the previous layer, with trainable weights and biases, resulting in some overall very complicated, very nonlinear function of the input x.
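As a toy illustration (with arbitrary made-up weights rather than a trained network), here is what “a function of the previous layer with trainable weights and biases” looks like in code:

```python
import numpy as np

def layer(x, W, b):
    """One layer: a linear map with weights W and biases b, passed through a nonlinearity."""
    return np.tanh(W @ x + b)

# A made-up input and two stacked layers give a complicated nonlinear function of x.
rng = np.random.default_rng(1)
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(8, 4)), rng.normal(size=8)
W2, b2 = rng.normal(size=(2, 8)), rng.normal(size=2)
f_of_x = layer(layer(x, W1, b1), W2, b2)
print(f_of_x)
```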
Unsupervised learning
Imagine someone hands you disparate information that is all from the same source– in other words, a set of data points x that all belong to the same distribution p(x). There are a number of things you might imagine doing with this set of data points, all to get an understanding of the underlying structure.
Modeling
Imagine that you have neural activity from some brain region that you are trying to model. This neural activity comes in the form of “spike trains”. Every neuron fires, really quickly, emitting what is essentially a spike. What you see in a multi-electrode recording is a series of spikes from every neuron, at times that we try to understand as neuroscientists. Some questions that you might ask about neurons are about the relationship between neurons and the sensory signals they see, or between neurons and the organism’s behavior. But some questions are just about the neural activity itself. Are the neurons connected, functionally, in some way? Or do they behave independently? If they’re connected functionally, are their connections pairwise, three-way, or higher-order?
How do you figure this out? You build some sort of model of p(x), the probability distribution. A classical way to do so is to fit a Maximum Entropy model [3], which in effect is just an exponential linear model– meaning that the probability p(x) is proportional to the exponential of a linear combination of functions of x. You then simply fit the weights so that the model matches the data, usually by minimizing what’s called the Kullback-Leibler divergence [4]. A nonparametric model that might work better could come from Parzen window estimates, for example [5].
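Here is a rough sketch of the maximum entropy idea on made-up binary data, fitting fields and pairwise couplings by gradient descent on the Kullback-Leibler divergence (equivalently, matching the data’s means and pairwise correlations). It enumerates all 2^n patterns, so it is only meant for a handful of neurons:

```python
import numpy as np
from itertools import product

# Made-up "spike" data: 1000 samples of 5 binary neurons, with one induced pairwise correlation.
rng = np.random.default_rng(0)
data = (rng.random((1000, 5)) < 0.3).astype(float)
data[:, 1] = np.where(rng.random(1000) < 0.5, data[:, 0], data[:, 1])

n = data.shape[1]
states = np.array(list(product([0, 1], repeat=n)), dtype=float)  # all 2^n possible patterns

# Sufficient statistics of the data: mean activities and pairwise correlations.
data_mean = data.mean(axis=0)
data_corr = data.T @ data / len(data)

h = np.zeros(n)        # fields (one per neuron)
J = np.zeros((n, n))   # pairwise couplings (upper triangle used)

def model_stats(h, J):
    """Exact means and correlations under p(x) proportional to exp(h.x + sum_{i<j} J_ij x_i x_j)."""
    log_p = states @ h + np.einsum("ki,ij,kj->k", states, np.triu(J, 1), states)
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    return p @ states, states.T @ (p[:, None] * states)

# Nudge the parameters until the model's statistics match the data's statistics.
for _ in range(3000):
    m, c = model_stats(h, J)
    h += 0.1 * (data_mean - m)
    J += 0.1 * np.triu(data_corr - c, 1)

print("data means: ", np.round(data_mean, 3))
print("model means:", np.round(model_stats(h, J)[0], 3))
```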
More complicated methods involve fitting a restricted Boltzmann machine or a Boltzmann machine [6], which is part of what got Geoffrey Hinton the Nobel Prize.
Dimensionality reduction
Sometimes, what you want are a few “components” of x that describe basically what’s going on with x. How does this work? Let’s turn to neural data again. There are maybe 200 or so neurons in every recording, but maybe only a few “components” of their activity that really tell you what’s going on. What you want is a lower-dimensional vector r, derived from your high-dimensional vector x, that captures most of the variance– most of the fluctuations– in your data. If you want to get fancy later, you can look for an r that captures aspects of x that tell you about some other variable y, but for unsupervised learning, we’re simply staring at x.
The best-known techniques for this are PCA [7], in which you find a linear projection of your data that captures as much variance as possible, and t-SNE [8] and UMAP [9], which try to find mappings of your data that retain a sense of distance between data points. Of later interest might also be autoencoders, which you can roughly think of as neural networks with a bottleneck layer of fewer neurons whose output must be expanded to reconstruct the original data.
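As a minimal scikit-learn sketch, on made-up data standing in for a 200-neuron recording driven by a few hidden components:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Made-up "recording": 500 time bins of 200 "neurons" driven by 3 hidden components plus noise.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(500, 3))
x = hidden @ rng.normal(size=(3, 200)) + 0.1 * rng.normal(size=(500, 200))

# PCA: a linear projection r that captures as much variance as possible.
pca = PCA(n_components=3)
r = pca.fit_transform(x)
print("variance explained:", np.round(pca.explained_variance_ratio_, 3))

# t-SNE: a nonlinear embedding that tries to preserve which points are near which.
r_tsne = TSNE(n_components=2, perplexity=30).fit_transform(x)
print("t-SNE embedding shape:", r_tsne.shape)
```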
Supervised learning
The main difference between supervised learning and unsupervised learning is that in unsupervised learning, there is one information source, while in supervised learning, there are two information sources. One of the information sources spits out x, while the other spits out y. Our job is to relate x to y in some way, depending on what you desire.
Modeling and dimensionality reduction
Sometimes, what we’re looking for are aspects of x that tell us about y. A simple example of why this would matter comes from neuroscience. If someone hands you neural activity, it’s all very well and good to understand the connections in the brain, but what if you want to know what about the neural activity tells you what the sensory stimulus was– whether we saw a lion or a flower? This becomes a problem in which we try to model and then compress the neural activity x so that it retains information about the stimulus y.
This can be formalized through the information bottleneck method, but the easiest way of dealing with this version of dimensionality reduction is through something called Canonical Correlation Analysis (CCA) [10]. Implicit in all these methods is an assumption that the joint probability distribution of x and y takes some particular form. CCA works perfectly for Gaussian variables. If it doesn’t work very well on your data, ask yourself– are my data too far from Gaussian? We are just starting to discover how these algorithms really work.
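A minimal scikit-learn sketch, with made-up x and y standing in for neural activity and a stimulus that share a couple of hidden components:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Made-up data: x (50-dimensional "neural activity") and y (5-dimensional "stimulus")
# that share two hidden components.
rng = np.random.default_rng(0)
shared = rng.normal(size=(1000, 2))
x = shared @ rng.normal(size=(2, 50)) + 0.5 * rng.normal(size=(1000, 50))
y = shared @ rng.normal(size=(2, 5)) + 0.5 * rng.normal(size=(1000, 5))

# CCA finds low-dimensional projections of x and y that are maximally correlated with each other.
cca = CCA(n_components=2)
x_c, y_c = cca.fit_transform(x, y)
for i in range(2):
    print("canonical correlation", i, np.corrcoef(x_c[:, i], y_c[:, i])[0, 1])
```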
Prediction
Imagine someone hands you some variable x and asks you to predict y. We’ve seen instances of this in pretty much every STEM class we’ve ever taken. In the introductory physics labs, we ask students to predict the period of a pendulum from the mass of the bob, the length of the string, and the initial angle the string makes with the vertical right before release. In more complicated systems, you’re given some reagents and asked to predict the output of the chemical reaction. In a really complicated system, someone might hand you the primary structure of a protein (the list of amino acids) and ask you to predict the protein’s three-dimensional structure– another contribution from AI that won the Nobel Prize just this year. In all of these cases, we are looking for a function of x that produces y, or something close to it.
If x and y are real-valued vectors, linear regression is a great way to start. If you don’t have enough data, try regularizing your linear regression, and take a look at my notes on the subject, also on the Data Science Hub website. If instead y is a categorical variable– meaning that y is either this or that and not just any real number– try logistic regression, in which the probability that y is a particular answer is proportional to an exponential of a linear function of x. Logistic regression can also be regularized. The key here is that these simple methods, which you learn even in basic physics classes, can be incredibly powerful for establishing basic relationships. A student of mine from CMC, Jonathan Soriano, and I used linear regression to predict the weight of mice from aspects of their behavior (the frequency with which they are active during the day, for instance) in an unpublished study that supposedly– ahem– is going to be in a book! We used regularized linear regression, not having quite enough data to do something more complicated.
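Here is a sketch of both options on made-up data, using ridge regression as one common choice of regularized linear regression and scikit-learn’s logistic regression (which is L2-regularized by default):

```python
import numpy as np
from sklearn.linear_model import Ridge, LogisticRegression

rng = np.random.default_rng(0)

# Real-valued y: regularized (ridge) linear regression on made-up behavioral features.
x = rng.normal(size=(80, 10))
y = x @ rng.normal(size=10) + 0.1 * rng.normal(size=80)
reg = Ridge(alpha=1.0).fit(x, y)          # alpha sets the strength of the regularization
print("R^2:", reg.score(x, y))

# Categorical y: regularized logistic regression.
labels = (x[:, 0] + x[:, 1] > 0).astype(int)
clf = LogisticRegression(C=1.0).fit(x, labels)   # smaller C means stronger regularization
print("accuracy:", clf.score(x, labels))
```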
If you want to take it to the next level, the easiest thing to do is to use a neural network. In Keras or PyTorch, neural networks are relatively easy to implement, as long as you can follow the tutorials available on their websites. The neural network will transform x, layer by layer, into your variable y. At first, the mapping will most likely be dismal. But then you train the neural network, changing the weights and biases of its layers so that the output is, on average, as close to the true answer (y) as possible. You can choose any loss function you wish, and categorical data will potentially call for a different loss function and a different final layer than continuous data. I can’t pretend to go into all the details of how a neural network works, but follow the tutorials on the TensorFlow (Keras) or PyTorch websites and you should be ready to go!
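A minimal PyTorch sketch on made-up data (a Keras version would look much the same), with the cross-entropy loss and a final layer sized for a categorical y:

```python
import torch
from torch import nn

# Made-up data: 500 samples with 10 features each, and a categorical y with 3 possible labels.
X = torch.randn(500, 10)
y = torch.randint(0, 3, (500,))

# A small feedforward network: each layer is a trainable function of the previous layer.
model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 3),               # final layer: one score per category
)
loss_fn = nn.CrossEntropyLoss()      # a natural loss for categorical y
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# Training: adjust weights and biases so the output is, on average, close to the true y.
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()
print("final training loss:", loss.item())
```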
A classic case of using a neural network to relate one information stream to another comes from object recognition. If someone draws a digit for you, can you figure out the number? Most likely– but what if the digit is drawn really weirdly? And more importantly, can you get a computer to figure this out for you? This is called the MNIST task, and it’s fun to try to take the intensity pattern describing the picture of the digit x (a matrix of intensities) and classify the digit (a categorical variable that is an integer between 0 and 9)! Try logistic regression first– but a simple neural network can quite easily get up to 98% classification accuracy on a dataset that, decades ago, stumped us.
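As a quick sketch, here is logistic regression on scikit-learn’s small built-in 8x8 digits dataset, which stands in for the full MNIST images:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# scikit-learn's built-in 8x8 digits dataset, a small stand-in for full MNIST.
digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.25, random_state=0
)

# Logistic regression: class probabilities proportional to exp(a linear function of the pixels).
clf = LogisticRegression(max_iter=5000).fit(x_train, y_train)
print("test accuracy:", clf.score(x_test, y_test))
```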
Time series prediction
Sometimes, the supervision comes from part of the data itself. This is key for understanding time series, in which we see a correlated set of symbols in some order and use the past symbols to predict what happens next in the stream. This is relevant, for instance, to stock prices, and is of utmost importance to the lives of traders. How might they use past Apple stock prices, along with perhaps the past stock prices of Apple’s competitors, to predict Apple’s future stock price? There are a number of techniques, starting with autoregressive techniques [11] and moving to reservoir computers [12], and then to recurrent neural networks like the Long Short-Term Memory unit (LSTM) [13]. Transformers can also be a powerful time series modeling technique [14], and have in fact been used to model the English language! They are the underlying force behind ChatGPT. All of these models are straightforwardly implementable in Python with existing code online.
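As the simplest starting point, here is an autoregressive sketch on a made-up price-like series, fitting the next value as a linear function of the previous few values with ordinary least squares:

```python
import numpy as np

# Made-up "price" series: a noisy AR(2) process standing in for a real stock price.
rng = np.random.default_rng(0)
series = np.zeros(500)
for t in range(2, 500):
    series[t] = 0.6 * series[t - 1] + 0.3 * series[t - 2] + rng.normal(scale=0.1)

# Build the regression: predict series[t] from its previous `lags` values.
lags = 2
X = np.column_stack([series[lags - k - 1 : len(series) - k - 1] for k in range(lags)])
y = series[lags:]
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
print("fitted AR coefficients:", coefs)

# One-step-ahead prediction from the most recent values.
next_value = coefs @ series[-1 : -lags - 1 : -1]
print("predicted next value:", next_value)
```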
References
[1] Schneidman, Elad, et al. “Weak pairwise correlations imply strongly correlated network states in a neural population.” Nature 440.7087 (2006): 1007-1012.
[2] Bialek, William, et al. “Statistical mechanics for natural flocks of birds.” Proceedings of the National Academy of Sciences 109.13 (2012): 4786-4791.
[3] Guiasu, Silviu, and Abe Shenitzer. “The principle of maximum entropy.” The Mathematical Intelligencer 7 (1985): 42-48.
[4] “Kullback-Leibler divergence.” Wikipedia, The Free Encyclopedia, Wikimedia Foundation, 15 October 2024, en.wikipedia.org/wiki/Kullback–Leibler_divergence
[5] Izenman, Alan Julian. “Review papers: Recent developments in nonparametric density estimation.” Journal of the American Statistical Association 86.413 (1991): 205-224.
[6] Hinton, Geoffrey E. “Boltzmann machine.” Scholarpedia 2.5 (2007): 1668.
[7] “Principal component analysis.” Wikipedia, The Free Encyclopedia, Wikimedia Foundation, 15 October 2024, en.wikipedia.org/wiki/Principal_component_analysis
[8] Hinton, Geoffrey E., and Sam Roweis. “Stochastic neighbor embedding.” Advances in Neural Information Processing Systems 15 (2002).
[9] McInnes, Leland, John Healy, and James Melville. “UMAP: Uniform manifold approximation and projection for dimension reduction.” arXiv preprint arXiv:1802.03426 (2018).
[10] Thompson, Bruce. “Canonical correlation analysis.” (2000).
[11] Box, George EP, et al. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
[12] Lukoševičius, Mantas, and Herbert Jaeger. “Reservoir computing approaches to recurrent neural network training.” Computer Science Review 3.3 (2009): 127-149.
[13] Hochreiter, Sepp, and Jürgen Schmidhuber. “Long short-term memory.” Neural Computation 9.8 (1997): 1735-1780.
[14] Vaswani, Ashish, et al. “Attention is all you need.” Advances in Neural Information Processing Systems 30 (2017).