From a222f076e5d6951ce0d8b6ba44345577931d1c70 Mon Sep 17 00:00:00 2001
From: Omar Santos
Date: Mon, 4 Sep 2023 23:45:37 -0400
Subject: [PATCH] Create ml_ai_datasets.md

---
 ai_security/ML_Fundamentals/ml_ai_datasets.md | 62 +++++++++++++++++++
 1 file changed, 62 insertions(+)
 create mode 100644 ai_security/ML_Fundamentals/ml_ai_datasets.md

diff --git a/ai_security/ML_Fundamentals/ml_ai_datasets.md b/ai_security/ML_Fundamentals/ml_ai_datasets.md
new file mode 100644
index 0000000..1451665
--- /dev/null
+++ b/ai_security/ML_Fundamentals/ml_ai_datasets.md
@@ -0,0 +1,62 @@
+# Datasets for AI / ML Research
+
+1. **UCI Machine Learning Repository**: A collection of databases, domain theories, and data generators widely used by the machine learning community.
+ Website: [UCI ML Repository](https://archive.ics.uci.edu/ml/index.php)
+
+2. **Kaggle Datasets**: Offers a wide variety of datasets in different domains including economics, biology, computer vision, and natural language processing.
+ Website: [Kaggle](https://www.kaggle.com/datasets)
+
+3. **AWS Public Datasets**: Amazon Web Services offers a variety of public datasets that anyone can access.
+ Website: [AWS Public Datasets](https://registry.opendata.aws/)
+
+4. **Google Dataset Search**: A tool that enables the discovery of datasets stored across the web.
+ Website: [Google Dataset Search](https://datasetsearch.research.google.com/)
+
+5. **Microsoft Research Open Data**: A collection of free datasets from Microsoft Research to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences.
+ Website: [Microsoft Research Open Data](https://msropendata.com/)
+
+6. **OpenML**: An online platform for collaborative machine learning - easily share data, models, and experiments.
+ Website: [OpenML](https://www.openml.org/)
+
+7. **Data.gov**: The home of the U.S. Government’s open data, providing data, tools, and resources.
+ Website: [Data.gov](https://www.data.gov/)
+
+8. **EU Open Data Portal**: Provides access to an expanding range of data from the European Union institutions and other EU bodies.
+ Website: [EU Open Data Portal](https://data.europa.eu/euodp/en/home)
+
+9. **Awesome Public Datasets on GitHub**: A topic-organized list of high-quality open datasets.
+ GitHub Repository: [Awesome Public Datasets](https://github.com/awesomedata/awesome-public-datasets)
+
+10. **World Bank Open Data**: Free and open access to global development data.
+ Website: [World Bank Open Data](https://data.worldbank.org/)
+
+11. **CERN Open Data Portal**: Provides access to data generated by the Large Hadron Collider and other CERN experiments.
+ Website: [CERN Open Data Portal](http://opendata.cern.ch/)
+
+12. **National Aeronautics and Space Administration (NASA)**: Offers a wide range of datasets related to space and Earth sciences.
+ Website: [NASA](https://data.nasa.gov/)
+
+13. **NOAA Data Sets**: Provides access to national and global data on climate, weather, oceans, and coasts.
+ Website: [NOAA](https://www.noaa.gov/data)
+
+14. **ImageNet**: A dataset of over 15 million labeled high-resolution images across 22,000 categories.
+ Website: [ImageNet](http://www.image-net.org/)
+
+15. **COCO (Common Objects in Context)**: A dataset of more than 300,000 images of objects in complex everyday scenes, annotated for object detection, segmentation, and captioning.
+ Website: [COCO Dataset](https://cocodataset.org/)
+
+16. **Wikipedia: List of datasets for machine-learning research**: A Wikipedia article providing a comprehensive list of datasets for machine-learning research. Website: [Wikipedia List](https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research)
+
+17. **Natural Earth Data**: Offers free vector and raster map data at various scales.
+ Website: [Natural Earth Data](https://www.naturalearthdata.com/)
+
+18. **Reddit Datasets**: A subreddit where community members share and request datasets.
+ Website: [Reddit Datasets](https://www.reddit.com/r/datasets/)
+
+19. **Quandl**: Provides financial, economic, and alternative datasets; now part of Nasdaq Data Link.
+ Website: [Quandl](https://www.quandl.com/)
+
+20. **Stanford Large Network Dataset Collection**: A collection of large network datasets including social networks, web graphs, etc.
+ Website: [Stanford Network Analysis Project](http://snap.stanford.edu/data/index.html)
+
+These sources cover a wide range of domains. Explore them according to your specific machine learning requirements and interests.