Question:
Where can I find datasets?
Answer:
Here are a few data sources to get you started.
Curated datasets:
For small, reasonably clean datasets, Wikipedia tables are excellent.
You can generate a list of CRAN packages that contain datasets using finddatasetpkgs. Use the following code:
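# Install the helper package from GitHub (requires the remotes package)
library(remotes)
install_github("datacamp/finddatasetpkgs")
# Load it and list the CRAN packages that contain datasets
library(finddatasetpkgs)
finddatasetpkgs::get_dataset_pkgs()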
Bioconductor has its own list called BiocViews.
The Dataset subreddit is a potluck of datasets.
Vanderbilt University has a page of medical datasets.
Microsoft Research's Open Data page has many datasets (mostly) related to natural language processing and computer vision.
KDnuggets has a list of datasets, mostly for machine learning.
Jo Hardin and Amelia McNamara maintain lists of datasets for teaching, and of other sources of data.
Ben Teusch's list of human resources datasets.
Various financial datasets from Aswath Damodaran.
Awesome Public Datasets lists several hundred sources of data by domain.
UEA & UCR Time Series Classification has many time-based machine learning datasets, particularly for image classification.
List of chemistry databases on Chemweb.
Marketing datasets from the Kilts Center.
IBM Watson datasets, mostly business related.
ProPublica maintains a list of datasets on economics and social issues.
Dartmouth College's list of sources of business data.
Data-driven farming has a list of agricultural datasets.
Leonardo Mauro has a list of datasets on digital games.
ABC Dataset is a collection of CAD models for geometric deep learning.
Wikipedia's list of peer-reviewed machine learning datasets.
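As a complement to the CRAN package listing earlier in this list, the datasets bundled with packages you already have installed can be listed in base R with data(). The following is a minimal sketch that looks only at the built-in "datasets" package; the variable names found and dataset_index are illustrative, not part of any package.

```r
# List the datasets bundled with one installed package (here the base
# "datasets" package, which ships with every R installation).
found <- data(package = "datasets")$results

# `results` is a character matrix with the columns
# "Package", "LibPath", "Item" and "Title".
dataset_index <- as.data.frame(
  found[, c("Package", "Item", "Title")],
  stringsAsFactors = FALSE
)

# Show the first few entries
head(dataset_index)
```

To search every installed package instead, pass package = .packages(all.available = TRUE) to data(); this can take a while on a large library.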
Data sharing platforms:
CKAN is a data sharing platform. Some popular instances include datahub.io, catalog.data.gov, and the European Data Portal.
Dataverse is a good source of datasets from academic papers. (Click the map on the home page to see specific Dataverse installations.)
data.world hosts datasets directly and contains many (typically small) datasets.
Our World in Data contains articles on the state of the world according to datasets, with links to the data used.
Wharton Research Data Services contains business-related datasets (registration required).
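CKAN instances such as the ones mentioned above can usually be searched programmatically as well as through the website. The sketch below assumes the instance exposes CKAN's standard Action API (the package_search action under /api/3/action/) and that the jsonlite package is installed for the fetch step. The function names ckan_search_url and ckan_search and their parameters are hypothetical helpers written for this example, and a given portal may restrict or change its API.

```r
# Build a CKAN Action API search URL (package_search is CKAN's dataset search).
# `base_url`, `query` and `rows` are illustrative parameter names.
ckan_search_url <- function(base_url, query, rows = 5) {
  paste0(
    sub("/+$", "", base_url),                         # drop any trailing slash
    "/api/3/action/package_search",
    "?q=", utils::URLencode(query, reserved = TRUE),  # percent-encode the query
    "&rows=", as.integer(rows)                        # number of results to return
  )
}

# Fetch and parse the search results. This needs the jsonlite package and
# network access, so it is only defined here and not called.
ckan_search <- function(base_url, query, rows = 5) {
  response <- jsonlite::fromJSON(ckan_search_url(base_url, query, rows))
  response$result$results
}

# Only the URL is built here; no request is sent.
example_url <- ckan_search_url("https://catalog.data.gov/", "air quality")
print(example_url)
```

Calling ckan_search("https://catalog.data.gov/", "air quality") would then send the request and return the matching dataset records, assuming the portal responds in the standard CKAN format.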
Governmental and NGO datasets:
World Health Organization Global Health Observatory Data Repository and Mortality Database.
Centers for Disease Control and Prevention health and epidemiology datasets.
HMFO Conservation and Science has a blog linking to datasets on that subject.
Datasets on wars from Correlates of War and the Peace Research Institute Oslo.
Economic datasets from the National Bureau of Economic Research.
Astronomy datasets from the Spitzer Heritage Archive, National Radio Astronomy Observatory, Hubble Space Telescope, and Chandra X-ray Center.
Office of Personnel Management employment and HR datasets.
Agricultural datasets from the USDA Agricultural Research Service and Montgomery County.