Open data repositories
These repositories mostly hold structured data, including some very large datasets, suitable for secondary analysis, reproducible research, and exploring big data techniques.
- figshare contains data deposited by academic researchers in a number of disciplines.
- Kaggle hosts a large variety of datasets, primarily for training and experimentation. Most data is in .csv format, though many datasets are in JSON. There are datasets for “big data” and machine learning work, and Kaggle sponsors a variety of competitions around data analysis. The data comes from a variety of sources and is not peer-reviewed by Kaggle itself, so be cautious about using it for research.
- Zenodo is an open, global repository for research data.
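Because downloads from these repositories often arrive as .csv or JSON files, a first look at one can be sketched in Python. This is a minimal sketch using only the standard library, not a recommended workflow: the function name `load_records` is hypothetical, and it assumes a JSON file holds a top-level array of records, which will not be true of every dataset.

```python
import csv
import json
from pathlib import Path


def load_records(path):
    """Return a list of dict records from a .csv or .json file.

    Assumption: a JSON file holds a top-level array of objects.
    Real datasets may be nested differently and need extra unpacking.
    """
    path = Path(path)
    suffix = path.suffix.lower()
    if suffix == ".csv":
        # newline="" lets the csv module handle line endings itself
        with path.open(newline="", encoding="utf-8") as f:
            return list(csv.DictReader(f))
    if suffix == ".json":
        with path.open(encoding="utf-8") as f:
            data = json.load(f)
        if not isinstance(data, list):
            raise ValueError("expected a top-level JSON array of records")
        return data
    raise ValueError(f"unsupported file type: {suffix}")
```

For a downloaded file saved as, say, `menus.csv` (a placeholder name), `load_records("menus.csv")` returns one dictionary per row, keyed by the column headers. Values read from .csv files come back as text, so numeric columns need converting before analysis.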
The library has subscription-based access to some data resources that will require you to log in with your Claremont credentials.
These subscription databases primarily include government statistics, along with business and industry reports.
Take a look at the Finding Data & Statistics Guide for more information.
Social media datasets
Dataset grab bag
New and unusual datasets that might be interesting to work with (a rotating feature — suggestions welcome!)
What’s on the menu? Historic menu data from the city of New York.
Wikipedia – How often do Wikipedia editors edit?
Remittances – World Bank data on international remittance payments.