Raja Marappan1* and S Bhaskaran2
1Senior Assistant Professor, School of Computing, SASTRA Deemed University, India
2Assistant Professor, School of Computing, SASTRA Deemed University, India
*Corresponding author: Raja Marappan, Senior Assistant Professor, School of Computing, SASTRA Deemed University, India
Submission: April 26, 2022; Published: June 10, 2022
ISSN: 2832-4463 Volume 2 Issue 1
Many public datasets are now available for machine learning and data science applications, including computer vision, sentiment analysis, clinical data, and natural language processing. Anyone who designs an artificial intelligence or data science project needs to implement variants of algorithms, evaluate the model's performance on the data required for training, and compare the prediction results with those of other methods. Even when no custom data collection is required, a learner can spend a large amount of time searching for a suitable dataset for a live project, because a wide variety of datasets across different categories is available in the public domain. This article aims to help learners find the best recent datasets available in the public domain for artificial intelligence, machine learning, and data science projects.
Keywords: Machine learning; Data science; Datasets; Recommendation; Ratings; Artificial intelligence
Machine Learning (ML) is often regarded as a magical model in which one can shuffle the complete information and cast it into the right predictions. To achieve this, one needs to collect, clean, and merge large amounts of information [1,2]. This article surveys the best online sites where one can search for and obtain datasets for real-world applications. Datasets serve as the rails on which ML methods ride [3-7]. Without them, ML methods cannot proceed in areas such as text mining, product categorization, and text classification. ML is a branch of AI that studies computational procedures that improve automatically as they process data. ML methods use historical information as input to produce efficient predictions. ML also plays a central role at world-leading companies such as Google, Facebook, and Amazon. Hence it is necessary to become familiar with the different public datasets before starting a project.
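The collect, clean, and merge steps mentioned above can be illustrated with a minimal sketch. It uses only the Python standard library, and the two small CSV snippets, their column names, and the helper functions are hypothetical stand-ins for files downloaded from a public source, not data from any dataset named in this article.

```python
import csv
import io

# Hypothetical raw data standing in for two downloaded CSV files.
RATINGS_CSV = """user_id,movie_id,rating
1,10,4.5
2,10,
1,11,3.0
1,11,3.0
3,12,five
"""

MOVIES_CSV = """movie_id,title
10,Alpha
11,Beta
"""


def read_rows(text):
    """Collect: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_ratings(rows):
    """Clean: drop missing or non-numeric ratings and exact duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        try:
            rating = float(row["rating"])
        except (TypeError, ValueError):
            continue  # missing or malformed rating
        key = (row["user_id"], row["movie_id"], rating)
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        cleaned.append(
            {"user_id": row["user_id"], "movie_id": row["movie_id"], "rating": rating}
        )
    return cleaned


def merge_titles(ratings, movies):
    """Merge: attach the movie title to each rating (inner join on movie_id)."""
    titles = {m["movie_id"]: m["title"] for m in movies}
    return [dict(r, title=titles[r["movie_id"]]) for r in ratings if r["movie_id"] in titles]


merged = merge_titles(clean_ratings(read_rows(RATINGS_CSV)), read_rows(MOVIES_CSV))
for row in merged:
    print(row)
```

In practice a library such as pandas would usually handle these steps; the plain-Python version is used here only to keep the sketch self-contained.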
This section lists the best public dataset finders available for data science and ML applications [8-21].
a. Google Dataset Search: Domain: https://datasetsearch.research.google.com/
b. Kaggle dataset: Domain: https://www.kaggle.com/datasets
c. Open-source datasets from the UCI ML Repository
d. High-quality datasets from CMU Libraries
e. NLP tasks: The Big Bad NLP Database
f. Wikipedia ML Datasets
g. AWS Open Data Registry
h. UCI Machine Learning Repository
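As a rough illustration of how the first two finders in this list can be queried, the sketch below builds keyword-search URLs. The base domains are those given above; the search path and the query parameter names (`query` and `search`) are assumptions about the sites' current URL schemes and may change, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

# Base domains come from the list above. The "/search" path and the
# parameter names are assumptions and are not guaranteed by either site.
SEARCH_ENDPOINTS = {
    "google": ("https://datasetsearch.research.google.com/search", "query"),
    "kaggle": ("https://www.kaggle.com/datasets", "search"),
}


def build_search_url(site, keywords):
    """Return a search URL for the given finder and free-text keywords."""
    base, param = SEARCH_ENDPOINTS[site]
    return f"{base}?{urlencode({param: keywords})}"


urls = {site: build_search_url(site, "credit card fraud") for site in SEARCH_ENDPOINTS}
for site, url in urls.items():
    print(site, url)
```

The resulting URLs can be opened in a browser; the sketch makes no network requests itself.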
This section explores the best public datasets available for data science and ML applications. The datasets are classified as follows [12-20]:
1. General Datasets: Housing Datasets & Geographic Datasets
2. Machine Learning Datasets: Mall Customers Dataset, IRIS
Dataset, MNIST Dataset, Boston Housing Dataset, Fake News
Detection Dataset, Wine quality dataset, SOCR data - Heights
and Weights Dataset, Titanic Dataset, Credit Card Fraud
Detection Dataset
3. Computer Vision Datasets: xView, ImageNet, Kinetics-700,
Google’s Open Images, Cityscapes Dataset, IMDB-Wiki dataset,
Color Detection Dataset, Stanford Dogs Dataset
4. Sentiment Analysis Datasets: Lexicoder Sentiment Dictionary,
IMDB reviews, Stanford Sentiment Treebank, Twitter US
Airline Sentiment
5. NLP Datasets: The Big Bad NLP Database, HotpotQA Dataset,
Amazon Reviews, Rotten Tomatoes Reviews, SMS Spam
Collection in English, Enron Email Dataset, Recommender
Systems Dataset, UCI Spambase Dataset, IMDB reviews
6. Self-driving (Autonomous Driving) Datasets: Waymo Open
Dataset, Berkeley DeepDrive BDD100k, Bosch Small Traffic
Light Dataset, LaRa Traffic Light Recognition, WPI datasets,
Comma.ai, MIT AgeLab, LISA: Laboratory for Intelligent & Safe
Automobiles, UC San Diego Datasets, Cityscapes Dataset
7. Clinical Datasets: MaskedFace-Net, COVID-19 Dataset, MIMIC-III
8. Datasets for Recommender Systems: MovieLens, Jester, Million
Song Dataset
9. DataPortals: meta-database with 524 data portals
10. OpenDataSoft: a map with more than 2600 data portals
11. Knoema: home to nearly 3.2 billion time series on 1040
topics from more than 1200 sources
12. Data.gov: 261,073 sets of the US open government data
13. Eurostat: open data from the EU statistical office
14. Scientific research datasets: Re3data: 2000 research data
repositories with flexible search, Harvard Dataverse: 92,839
datasets by the scientific community for the scientific
community, Academic torrents: 53.52TB research data
aggregated at one place, The Sloan Digital Sky Survey: 3D maps
of the Universe
15. Verified datasets from data science communities: DataHub:
high-quality datasets shared by data scientists for data
scientists, UCI Machine Learning Repository: one of the
oldest sources with 488 datasets, data.world: open data
community, GitHub: a list of awesome datasets made by the
software development community, Kaggle datasets: 25,144
themed datasets on “Facebook for data people”, KDnuggets:
a comprehensive list of data repositories on a famous data
science website, Reddit: datasets and requests of data on a
dedicated discussion board
16. Political and social datasets from media outlets: BuzzFeed:
datasets and related content by a media company,
FiveThirtyEight: datasets from data-driven pieces
17. Finance and economic datasets: Quandl: Alternative Financial
and Economic Data, The International Monetary Fund and The
World Bank: International Economy Stats
18. Healthcare datasets: World Health Organization: Global Health
Records from 194 Countries, The Centers for Disease Control and Prevention
(CDC): an online database that makes searching for data easy,
Medicare: data from the US health insurance program, The
Healthcare Cost and Utilization Project (HCUP): another
source with data on healthcare services
19. Travel and transportation datasets: Bureau of Transportation
Statistics: the US transportation system in over 260 data tables,
Federal Highway Administration: US road transportation data
20. Other sources: Amazon Web Services: free public datasets
and paid machine learning tools, Google Public datasets: data
analysis with the BigQuery tool in the cloud
21. Earth Dataset: Domain: https://earthdata.nasa.gov/
22. Amazon and Microsoft Datasets, Azure and AWS: Domain AWS: https://registry.opendata.aws/ Domain Azure: https://azure.microsoft.com/en-us/services/open-datasets/catalog/?q=
23. FBI Crime Data Explorer: Domain: https://crime-data-explorer.fr.cloud.gov/downloads-and-docs
24. Data World: Domain: https://data.world/
25. CERN Open Data Portal: Domain: http://opendata.cern.ch/
26. Lionbridge AI Datasets: Domain: https://lionbridge.ai/datasets/
27. UCI Machine Learning Repository: Domain: https://archive.ics.uci.edu/ml/index.php
28. Government Datasets for ML: Data USA, EU Open Data Portal,
Data.gov, US Healthcare Data, The UK Data Service, School
System Finances, The US National Center for Education
Statistics
29. Finance & Economics Datasets for ML: American Economic
Association (AEA), Quandl, IMF Data, World Bank Open Data,
Financial Times Market Data, Google Trends
30. Image Datasets for Computer Vision: VisualQA, LabelMe,
ImageNet, Indoor Scene Recognition, Visual Genome, Stanford
Dogs Dataset, Google’s Open Images, Labeled Faces in the Wild,
Home, COIL-100, CIFAR-10, Cityscapes, IMDB-Wiki, Fashion
MNIST, MS COCO, MPII Human Pose Dataset
31. Sentiment Analysis Datasets for ML: Multi-Domain Sentiment
Analysis Dataset, Amazon Product Data, Twitter US Airline
Sentiment, IMDB Sentiment, Sentiment140, Stanford
Sentiment Treebank, Paper Reviews, Lexicoder Sentiment
Dictionary, Sentiment Lexicons for 81 Languages, OpinRank
Review Dataset
32. NLP Datasets: Enron Dataset, UCI’s Spambase, Amazon
Reviews, Yelp Reviews, Google Books Ngrams, SMS Spam
Collection in English, Jeopardy, Gutenberg eBooks List, Blogger
Corpus, Wikipedia Links Data
33. Datasets for Autonomous Vehicles: Berkeley DeepDrive
BDD100K, Comma.ai, Oxford RobotCar, LISA, Cityscapes
Dataset, Baidu Apolloscapes, Landmarks, Landmarks-v2,
PandaSet, nuScenes, Open Images V5, Waymo Open Dataset
34. Recommendation and Ratings Public Data Sets For ML:
Movies Recommendation, Music Recommendation, Books
Recommendation, Food Recommendation, Merchandise
Recommendation, Healthcare Recommendation, Dating
Recommendation, Scholarly Paper Recommendation
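Because several datasets appear under more than one heading in the classification above, it can help to hold the catalog in a small lookup structure. The following is a minimal sketch that covers only a handful of entries from the list; the `CATALOG` dictionary and the `find_categories` helper are hypothetical names introduced for illustration.

```python
# A small excerpt of the classification above, keyed by category.
CATALOG = {
    "Computer Vision": [
        "xView", "ImageNet", "Kinetics-700", "Cityscapes Dataset",
        "IMDB-Wiki dataset", "Stanford Dogs Dataset",
    ],
    "Sentiment Analysis": [
        "Lexicoder Sentiment Dictionary", "IMDB reviews",
        "Stanford Sentiment Treebank", "Twitter US Airline Sentiment",
    ],
    "Autonomous Vehicles": [
        "Berkeley DeepDrive BDD100K", "Comma.ai", "LISA",
        "Cityscapes Dataset", "nuScenes", "Waymo Open Dataset",
    ],
    "Recommender Systems": ["MovieLens", "Jester", "Million Song Dataset"],
}


def find_categories(keyword):
    """Return the sorted categories with a dataset name containing the keyword."""
    needle = keyword.lower()
    return sorted(
        category
        for category, names in CATALOG.items()
        if any(needle in name.lower() for name in names)
    )


print(find_categories("cityscapes"))  # categories that list Cityscapes
print(find_categories("imdb"))        # categories with an IMDB-derived dataset
```

A keyword lookup of this kind makes the overlap between categories explicit, for example a driving-scene dataset that is also a general computer vision benchmark.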
This article gives an overview of the public datasets available for ML and data science applications. It lists the best public dataset finders and classifies the datasets available in the public domain, with several examples in each category. In the future, these public datasets can be used with soft computing and approximation algorithms to solve different real-world problems [22-41].
© 2022 Raja Marappan. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and building upon the work non-commercially.