Where can I find datasets?
Here are a few data sources to get you started.
For small, reasonably clean datasets, Wikipedia tables are excellent.
You can generate a list of CRAN packages that contain datasets using finddatasetpkgs. Use the following code:
The Dataset subreddit is a potluck of datasets.
Vanderbilt University has a page of medical datasets.
Microsoft Research's Open Data page has many datasets (mostly) related to natural language processing and computer vision.
KDnuggets has a list of datasets, mostly for machine learning.
Ben Teusch's list of human resources datasets.
Various financial datasets from Aswath Damodaran.
Awesome Public Datasets lists several hundred sources of data by domain.
UEA & UCR Time Series Classification has many time-based machine learning datasets, particularly for image classification.
List of chemistry databases on Chemweb.
Marketing datasets from the Kilts Center.
IBM Watson datasets, mostly business related.
ProPublica maintains a list of datasets on economics and social issues.
Dartmouth College's list of sources of business data.
Data-driven farming has a list of agricultural datasets.
Leonardo Mauro has a list of datasets on digital games.
ABC Dataset is a collection of CAD models for geometric deep learning.
Wikipedia's list of peer-reviewed machine learning datasets.
Data sharing platforms:
Dataverse is a good source of datasets from academic papers. (Click the map on the home page to see specific Dataverse installations.)
data.world hosts datasets directly and contains many (typically small) datasets.
Our World in Data contains articles on the state of the world according to datasets, with links to the data used.
Wharton Research Data Services contains business-related datasets (registration required).
Governmental and NGO datasets:
Centers for Disease Control and Prevention (CDC) health and epidemiology datasets.
HMFO Conservation and Science has a blog linking to datasets on that subject.
Economic datasets from the National Bureau of Economic Research.
Office of Personnel Management employment and HR datasets.
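As an illustration of the Wikipedia tables suggestion near the top of this list, here is a minimal sketch of turning an HTML table into a list of records using only the Python standard library. The `TableParser` class, the `parse_tables` helper, and the `SAMPLE_HTML` snippet are all hypothetical names and made-up sample data for this example; a real page would first need to be downloaded, and nested tables are not handled.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the rows of every <table> in an HTML document (no nesting)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows; each row a list of cells
        self._row = None   # cells of the row currently being read, or None
        self._cell = None  # text fragments of the cell currently being read, or None

    def _flush_cell(self):
        # Finish the open cell, collapsing runs of whitespace to single spaces.
        if self._cell is not None and self._row is not None:
            self._row.append(" ".join("".join(self._cell).split()))
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th"):
            self._flush_cell()  # tolerate a missing closing tag on the previous cell
            if self._row is not None:
                self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._flush_cell()
        elif tag == "tr":
            self._flush_cell()
            if self._row is not None:
                self.tables[-1].append(self._row)
            self._row = None

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)


def parse_tables(html):
    """Return all tables found in the given HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.tables


# Made-up sample standing in for the HTML of a downloaded page.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
"""

tables = parse_tables(SAMPLE_HTML)
header, *rows = tables[0]
records = [dict(zip(header, row)) for row in rows]  # one dict per data row
print(records)
```

If pandas is available, `pandas.read_html` is a common alternative that returns each table as a DataFrame, though it relies on an additional HTML parsing library being installed.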