Where can I find datasets?
Here are a few data sources to get you started.
- DataCamp can provide access to the datasets curated by ThinkNum. Ask your Curriculum Lead for details.
- For small, reasonably clean datasets, Wikipedia tables are excellent.
- You can generate a list of CRAN packages that contain datasets using the finddatasetpkgs package.
- Bioconductor has its own list, called biocViews.
- The UCI Machine Learning Repository.
- Kaggle datasets.
- The Dataset subreddit is a potluck of datasets.
- Vanderbilt University has a page of medical datasets.
- Microsoft Research's Open Data page has many datasets, mostly related to natural language processing and computer vision.
- Jo Hardin and Amelia McNamara maintain lists of datasets for teaching, along with other sources of data.
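For the Wikipedia tables mentioned above, the following is a minimal sketch of pulling the rows out of an HTML table using only the Python standard library. The names `TableExtractor` and `extract_tables` and the sample HTML are illustrative assumptions, not part of any existing tool, and the sketch does not handle nested tables or merged cells. If pandas is available, `pandas.read_html` is a common alternative.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> in an HTML document as a list of rows.

    Hypothetical helper for illustration; nested tables are not supported.
    """

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_tables(html):
    """Return all tables found in the given HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables


# Made-up sample standing in for a downloaded Wikipedia page.
sample = """
<table>
  <tr><th>Country</th><th>Capital</th></tr>
  <tr><td>France</td><td>Paris</td></tr>
  <tr><td>Japan</td><td>Tokyo</td></tr>
</table>
"""
tables = extract_tables(sample)
print(tables[0])
```

In practice you would pass the downloaded page source to `extract_tables` and then tidy the header row and column types by hand.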
Data sharing platforms:
- CKAN is a data sharing platform. Some popular instances include datahub.io, catalog.data.gov, and the European Data Portal.
- Dataverse is a good source of datasets from academic papers. (Click the map on the home page to see specific Dataverse installations.)
- data.world hosts datasets directly and contains many (typically small) datasets.
- Our World in Data publishes articles on the state of the world, each backed by datasets, with links to the data used.
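CKAN instances can usually be searched programmatically as well as through the browser. The sketch below assumes the instance exposes CKAN's Action API at `/api/3/action/package_search`; whether a particular portal still does so, and with what limits, is something to check for that site. The helper names (`build_search_url`, `summarize_results`, `search_ckan`) and the sample payload are hypothetical and exist only for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_search_url(base_url, query, rows=5):
    """Build a package_search URL for a CKAN instance (assumed API path)."""
    params = urlencode({"q": query, "rows": rows})
    return f"{base_url.rstrip('/')}/api/3/action/package_search?{params}"


def summarize_results(payload):
    """Return (title, [resource URLs]) pairs from a package_search response."""
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    summary = []
    for dataset in payload["result"]["results"]:
        urls = [res["url"] for res in dataset.get("resources", []) if res.get("url")]
        summary.append((dataset.get("title") or dataset.get("name"), urls))
    return summary


def search_ckan(base_url, query, rows=5):
    """Query a live CKAN instance; requires network access."""
    with urlopen(build_search_url(base_url, query, rows), timeout=30) as response:
        return summarize_results(json.load(response))


# Made-up payload in the shape CKAN returns, so the parsing can be tried offline.
sample_payload = {
    "success": True,
    "result": {
        "count": 1,
        "results": [
            {
                "name": "example-dataset",
                "title": "Example Dataset",
                "resources": [{"url": "https://example.org/data.csv", "format": "CSV"}],
            }
        ],
    },
}
print(summarize_results(sample_payload))
```

Against a live portal, a call such as `search_ckan("https://catalog.data.gov", "air quality")` would return the titles and download links of the first few matching datasets, assuming the portal accepts the request.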