Where can I find datasets?
Here are a few data sources to get you started.
- DataCamp can provide access to the datasets curated by ThinkNum. Ask your Curriculum Manager for details.
- For small, reasonably clean datasets, Wikipedia tables are excellent.
- You can generate a list of CRAN packages that contain datasets using finddatasetpkgs.
- Bioconductor has its own list called BiocViews.
- UCI Machine Learning Repository.
- Kaggle datasets.
- The Dataset subreddit is a potluck of datasets.
- Vanderbilt University has a page of medical datasets.
- Microsoft Research's Open Data page has many datasets (mostly) related to natural language processing and computer vision.
- KDnuggets' list of datasets, mostly for machine learning.
- Jo Hardin and Amelia McNamara maintain lists of datasets for teaching, along with other sources of data.
- Ben Teusch's list of human resources datasets.
- Various financial datasets from Aswath Damodaran.
- Awesome Public Datasets lists several hundred sources of data by domain.
- UEA & UCR Time Series Classification has many time-based machine learning datasets, particularly for image classification.
- Various movie datasets from zanmel.com.
- List of chemistry databases on Chemweb.
- Marketing datasets from the Kilts Center.
- IBM Watson datasets, mostly business related.
- ProPublica maintains a list of datasets on economics and social issues.
- Dartmouth College's list of sources of business data.
- Data-driven farming has a list of agricultural datasets.
- Leonardo Mauro has a list of datasets on digital games.
- ABC Dataset is a collection of CAD models for geometric deep learning.
- Wikipedia's list of peer-reviewed machine learning datasets.
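
For the Wikipedia tables mentioned in the list above, the following is a minimal sketch of turning an HTML table into rows of text using only Python's standard library. The `TableParser` class, the `extract_tables` helper, and the sample HTML are illustrative assumptions, not part of any DataCamp tooling. The sketch does not handle nested tables or merged cells. In practice, `pandas.read_html` is a common alternative if pandas and an HTML parser such as lxml are installed.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the rows of every <table> in an HTML document as lists of cell text."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one list of rows per table encountered
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return every table in `html` as a list of rows (hypothetical helper)."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# Hand-written sample standing in for a downloaded Wikipedia page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Planet</th><th>Moons</th></tr>
  <tr><td>Earth</td><td>1</td></tr>
  <tr><td>Mars</td><td>2</td></tr>
</table>
"""

tables = extract_tables(SAMPLE_HTML)
header, *rows = tables[0]
print(header)
print(rows)
```

To use this on a real page, download its HTML first (for example with `urllib.request.urlopen`) and pass the decoded text to `extract_tables`. Cell values come back as strings, so numeric columns still need converting.
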
Data sharing platforms:
- CKAN is a data sharing platform. Some popular instances include datahub.io, catalog.data.gov, and the European Data Portal.
- Dataverse is a good source of datasets from academic papers. (Click the map on the home page to see specific Dataverse installations.)
- data.world hosts datasets directly and contains many (typically small) datasets.
- Our World in Data contains articles on the state of the world according to datasets, with links to the data used.
- Wharton Research Data Services contains business-related datasets (registration required).
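
CKAN instances such as those listed above expose a JSON Action API, so you can search for datasets programmatically. The following is a rough sketch, assuming the instance supports the standard `package_search` action at `/api/3/action/package_search`. The helper names and the sample response are made up for illustration, and a real response contains many more fields.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, rows=5):
    """Build a CKAN Action API package_search URL for the given instance."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def summarize_results(payload):
    """Return (title, [resource formats]) pairs from a package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [
        (
            pkg.get("title", pkg.get("name")),
            [res.get("format", "") for res in pkg.get("resources", [])],
        )
        for pkg in payload["result"]["results"]
    ]


def search_datasets(base_url, query, rows=5):
    """Query a live CKAN instance (needs network access; not called below)."""
    with urlopen(build_search_url(base_url, query, rows), timeout=30) as response:
        return summarize_results(json.load(response))


# Hand-written response in the general shape CKAN returns (illustrative only).
SAMPLE_RESPONSE = {
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {
                "name": "example-air-quality",
                "title": "Example Air Quality Measurements",
                "resources": [{"format": "CSV"}, {"format": "JSON"}],
            }
        ],
    },
}

url = build_search_url("https://catalog.data.gov", "air quality", rows=2)
summary = summarize_results(SAMPLE_RESPONSE)
print(url)
print(summary)
```

Calling `search_datasets("https://catalog.data.gov", "air quality")` would run the same steps against the live catalog. Check each dataset's license and resource format before building a course around it.
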
Governmental and NGO datasets:
- UK government data.
- World Health Organization Global Health Observatory Data Repository and Mortality Database.
- Centers for Disease Control and Prevention health and epidemiology datasets.
- Humanitarian Data Exchange.
- World Bank Open Data.
- HMFO Conservation and Science has a blog linking to datasets on that subject.
- Datasets on wars from Correlates of War and the Peace Research Institute Oslo.
- Economic datasets from the National Bureau of Economic Research.
- Astronomy datasets from the Spitzer Heritage Archive, National Radio Astronomy Observatory, Hubble Space Telescope, and Chandra X-ray Center.
- NOAA atmospheric datasets.
- Office of Personnel Management employment and HR datasets.
- Agricultural datasets from the USDA Agricultural Research Service and Montgomery County.