Selected from Analytics Vidhya
Author: Pranav Dar
Translated by Machine Heart
Contributors: Chen Yunzhu, Lu
This article introduces 25 open deep learning datasets, including those for image processing, natural language processing, speech recognition, and real-world problem datasets.
Introduction
The key to deep learning (or most fields in life) is practice. You need to practice solving various problems, including image processing, speech recognition, etc. Each problem has its unique nuances and solutions.
But where can you obtain data? Many papers now use proprietary datasets, which are often not publicly available. If you want to learn and apply skills, the inability to access suitable datasets is a problem.
If you are facing this issue, this article offers a solution. It introduces a range of publicly available, high-quality datasets that every deep learning enthusiast should work with to sharpen their skills. Working with these datasets will make you a better data scientist, and what you learn will be invaluable in your career. We also include papers with state-of-the-art (SOTA) results for readers to study and use to improve their models.
How to Use These Datasets?
First, understand that these datasets are very large! Therefore, make sure you have a fast internet connection with no, or a very generous, limit on the amount of data you can download.
There are various ways to use these datasets; you can apply different deep learning techniques. You can use them to hone your skills, learn how to identify and construct various problems, think of unique use cases, and share your findings with everyone!
The datasets fall into three categories: image processing, natural language processing, and audio/speech processing. A final section adds real-world practice problems.
Let’s take a look together!
Image Processing Datasets
MNIST
Link: https://datahack.analyticsvidhya.com/contest/practice-problem-identify-the-digits/
MNIST is one of the most popular deep learning datasets. It is a handwritten digit dataset containing a training set of 60,000 examples and a test set of 10,000 examples. It is a very good database for trying learning techniques and pattern-recognition methods on real-world data while spending minimal time and effort on data preprocessing.
Size: About 50 MB
Count: 70,000 images, divided into 10 categories.
SOTA: “Dynamic Routing Between Capsules” (https://arxiv.org/abs/1710.09829)
Reference Reading:
- Finally, Geoffrey Hinton’s highly anticipated Capsule paper is public
- Analysis of Geoffrey Hinton’s recently proposed Capsule program
- Understand the CapsNet architecture first and then implement it using TensorFlow; this should be the most detailed tutorial
- After the official code for Capsule was open-sourced, Machine Heart provided a core code interpretation
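If you want a quick start, MNIST ships with loaders in common frameworks. Below is a minimal sketch using the Keras datasets API, one convenient option; the raw files can also be downloaded from the link above.

```python
# Minimal sketch: load MNIST via the Keras convenience loader.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)

# Pixel values are uint8 in [0, 255]; scale to [0, 1] before training.
x_train, x_test = x_train / 255.0, x_test / 255.0
```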
MS-COCO
Link: http://cocodataset.org/#home
COCO is a large dataset for object detection, segmentation, and caption generation. It has the following features:
- Object segmentation
- Contextual recognition
- Superpixel stuff segmentation
- 330,000 images (over 200,000 annotated)
- 1.5 million object instances
- 80 object categories
- 91 stuff categories
- 5 captions per image
- 250,000 people with keypoint annotations
Size: About 25 GB (compressed)
Count: 330,000 images, 80 object categories, 5 captions per image, 250,000 people with keypoint annotations
SOTA: “Mask R-CNN” (https://arxiv.org/abs/1703.06870)
Reference Reading:
Academia | Facebook’s new paper proposes a universal object segmentation framework Mask R-CNN: simpler, more flexible, and better performance
Deep Learning | Convolutional Neural Networks for Image Segmentation: From R-CNN to Mask R-CNN
Resource | Mask R-CNN Applications: Masking Portraits like the British TV series “Black Mirror”
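To explore the annotations programmatically, COCO provides an official Python API (pycocotools). A minimal sketch is below; the annotation file path is an assumption for illustration.

```python
# Minimal sketch: browse COCO annotations with the official pycocotools API.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")  # path is an assumption

# The 80 object categories.
cats = coco.loadCats(coco.getCatIds())
print([c["name"] for c in cats][:5])

# All object instances annotated in one image.
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
print(len(anns), "annotated instances in image", img_id)
```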
ImageNet
Link: http://www.image-net.org/
ImageNet is an image dataset organized according to the WordNet hierarchy. WordNet contains about 100,000 phrases, and ImageNet provides an average of about 1,000 descriptive images for each phrase.
Size: About 150 GB
Count: Approximately 1,500,000 images; each image has multiple bounding boxes and corresponding category labels.
SOTA: “Aggregated Residual Transformations for Deep Neural Networks” (https://arxiv.org/pdf/1611.05431.pdf)
Open Images Dataset
Link: https://github.com/openimages/dataset
Open Images is a dataset containing nearly 9 million image URLs. These images are annotated with image-level labels and bounding boxes spanning thousands of categories. The training set contains 9,011,219 images, the validation set contains 41,260 images, and the test set contains 125,436 images.
Size: 500GB (compressed)
Count: 9,011,219 images with over 5,000 labels
SOTA: Resnet 101 image classification model (trained on V2 data):
- Model checkpoint: https://storage.googleapis.com/openimages/2017_07/oidv2-resnet_v1_101.ckpt.tar.gz
- Checkpoint readme: https://storage.googleapis.com/openimages/2017_07/oidv2-resnet_v1_101.readme.txt
- Inference code: https://github.com/openimages/dataset/blob/master/tools/classify_oidv2.py
VisualQA
Link: http://www.visualqa.org/
VQA is a dataset containing open-ended questions about images. Answering these questions requires visual and linguistic understanding. The dataset has the following interesting features:
- 265,016 images (COCO and abstract scenes)
- Each image contains at least 3 questions (an average of 5.4 questions)
- Each question has 10 correct answers
- Each question has 3 seemingly reasonable (but incorrect) answers
- Automatic evaluation metrics
Size: 25GB (compressed)
Count: 265,016 images, each with at least 3 questions, and each question has 10 correct answers
SOTA: “Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge” (https://arxiv.org/abs/1708.02711)
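The questions are distributed as JSON. The sketch below assumes the published v2 file layout (a top-level "questions" list whose entries carry "image_id" and "question"); the filename is an assumption.

```python
# Minimal sketch: inspect a VQA questions file.
import json

with open("v2_OpenEnded_mscoco_val2014_questions.json") as f:
    data = json.load(f)

for q in data["questions"][:3]:
    print(q["image_id"], "->", q["question"])
```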
Street View House Numbers Dataset (SVHN)
Link: http://ufldl.stanford.edu/housenumbers/
This is a real-world dataset used for developing object detection algorithms. It requires minimal data preprocessing. It is somewhat similar to the MNIST dataset but has more annotated data (over 600,000 images). These data were collected from house numbers in Google Street View.
Size: 2.5GB
Count: 630,420 images, divided into 10 classes
SOTA: “Distributional Smoothing With Virtual Adversarial Training” (https://arxiv.org/pdf/1507.00677.pdf)
This paper from Kyoto University proposes local distributional smoothness (LDS), a new notion of smoothness of the model distribution that can be used as a regularizer. The method not only performs excellently in supervised and semi-supervised learning on MNIST, but also achieves scores of 24.63 and 9.88 on SVHN and NORB respectively, showing that it significantly outperforms the previous best results on semi-supervised learning tasks.
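The cropped-digit version of SVHN is distributed as MATLAB .mat files, so it can be read directly with SciPy. A minimal sketch, assuming the train_32x32.mat layout described on the dataset page ('X' holds images, 'y' holds labels, with digit 0 stored as label 10):

```python
# Minimal sketch: read the cropped-digits SVHN file with SciPy.
import numpy as np
from scipy.io import loadmat

data = loadmat("train_32x32.mat")
images = np.transpose(data["X"], (3, 0, 1, 2))  # (N, 32, 32, 3)
labels = data["y"].flatten()
labels[labels == 10] = 0  # the dataset stores digit 0 as label 10
print(images.shape, labels[:10])
```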
CIFAR-10
Link: http://www.cs.toronto.edu/~kriz/cifar.html
This dataset is also used for image classification. It consists of 60,000 images across 10 classes: 50,000 training images and 10,000 test images. The dataset is divided into six parts: five training batches and one test batch, each containing 10,000 images.
Size: 170MB
Count: 60,000 images, divided into 10 classes
SOTA: “ShakeDrop regularization” (https://openreview.net/pdf?id=S1NHaMW0b)
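Each batch file is a Python pickle, and the dataset page documents how to unpickle it. A minimal sketch of a reader, assuming the documented b'data'/b'labels' dictionary layout:

```python
# Minimal sketch: load one CIFAR-10 batch file.
import pickle

def load_cifar_batch(path):
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    # b'data' is a (10000, 3072) uint8 array; reshape to N x 3 x 32 x 32.
    data = batch[b"data"].reshape(-1, 3, 32, 32)
    labels = batch[b"labels"]
    return data, labels

images, labels = load_cifar_batch("cifar-10-batches-py/data_batch_1")
print(images.shape, labels[:10])
```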
Fashion-MNIST
Link: https://github.com/zalandoresearch/fashion-mnist
Fashion-MNIST contains 60,000 training images and 10,000 test images. It is a fashion-product database in the style of MNIST. The developers believe MNIST has become overused, so they built this dataset as a direct drop-in replacement for it. Each image is in grayscale and associated with one label from 10 classes.
Size: 30MB
Count: 70,000 images, divided into 10 classes
SOTA: “Random Erasing Data Augmentation” (https://arxiv.org/abs/1708.04896)
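Because it mirrors MNIST's format, the loading code is identical apart from the dataset name. A minimal sketch via Keras (the repository also provides its own loader):

```python
# Minimal sketch: Fashion-MNIST as a drop-in MNIST replacement.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)

# The 10 class names, as listed in the repository README.
classes = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
print(classes[y_train[0]])
```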
Natural Language Processing Datasets
IMDB Movie Review Dataset
Link: http://ai.stanford.edu/~amaas/data/sentiment/
This dataset is great for movie enthusiasts. It is used for binary sentiment classification and currently contains more data than other datasets in the field. In addition to the training and test review examples, there is also unlabeled data available. Raw text and a preprocessed bag-of-words format are both included.
Size: 80 MB
Count: 25,000 highly polar movie reviews each for training and testing
SOTA: “Learning Structured Text Representations” (https://arxiv.org/abs/1705.09207)
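The raw archive unpacks into an aclImdb/ directory whose pos/ and neg/ subfolders hold one review per text file. A minimal sketch of a reader based on that documented layout; the local path is an assumption.

```python
# Minimal sketch: read the raw IMDB reviews from the unpacked archive.
import os

def read_reviews(split_dir):
    reviews = []
    for label in ("pos", "neg"):
        folder = os.path.join(split_dir, label)
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                reviews.append((f.read(), 1 if label == "pos" else 0))
    return reviews

train = read_reviews("aclImdb/train")
print(len(train))  # 25,000 labeled training reviews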
Twenty Newsgroups Dataset
Link: https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
As the name suggests, this dataset covers information related to newsgroups, consisting of a compilation of 20,000 newsgroup documents collected from 20 different newsgroups (1,000 articles selected from each group). These articles have typical features such as subject lines, signatures, and quotes.
Size: 20MB
Count: 20,000 messages from 20 newsgroups
SOTA: “Very Deep Convolutional Networks for Text Classification” (https://arxiv.org/abs/1606.01781)
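scikit-learn ships a fetcher for this dataset, which saves you from parsing the raw files yourself. A minimal sketch:

```python
# Minimal sketch: fetch the 20 Newsgroups training split with scikit-learn.
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train")
print(len(train.data), "documents across", len(train.target_names), "newsgroups")
print(train.target_names[:5])
```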
Sentiment140
Link: http://help.sentiment140.com/for-students/
Sentiment140 is a dataset for sentiment analysis. This popular dataset is a perfect way to kickstart your journey in natural language processing. Emoticons have been pre-removed from the data. The final dataset has the following six features:
- Emotion polarity of the tweet
- Tweet ID
- Date of the tweet
- Query
- Twitter username
- Tweet text
Size: 80MB (compressed)
Count: 160,000 tweets
SOTA: “Assessing State-of-the-Art Sentiment Models on State-of-the-Art Sentiment Datasets” (http://www.aclweb.org/anthology/W17-5202)
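The data is a header-less CSV, so the six features above become column names you supply yourself. A minimal sketch with pandas; the filename follows the distributed archive and is an assumption, as is the latin-1 encoding commonly needed for this file.

```python
# Minimal sketch: load Sentiment140 with pandas, naming the six columns.
import pandas as pd

cols = ["polarity", "id", "date", "query", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", names=cols)
print(df[["polarity", "text"]].head())
```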
WordNet
Link: https://wordnet.princeton.edu/
As mentioned when introducing the ImageNet dataset, WordNet is a large English synset database. A synset is a group of synonyms describing different concepts. The structure of WordNet makes it a very useful tool in NLP.
Size: 10 MB
Count: 117,000 synsets, which are interconnected through a small number of “concept relationships” with other synsets
SOTA: “Wordnets: State of the Art and Perspectives” (https://aclanthology.info/pdf/R/R11/R11-1097.pdf)
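The easiest way to query WordNet from Python is NLTK's corpus reader. A minimal sketch showing synsets and one concept relation (hypernymy):

```python
# Minimal sketch: query WordNet through NLTK.
import nltk
nltk.download("wordnet", quiet=True)  # one-time corpus download
from nltk.corpus import wordnet as wn

for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())

# Synsets are linked by concept relations, e.g. hypernyms ("is-a" parents):
print(wn.synsets("dog")[0].hypernyms())
```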
Yelp Dataset
Link: https://www.yelp.com/dataset
This is an open dataset released by Yelp for learning purposes. It contains millions of user reviews, business attributes, and over 200,000 photos from multiple metropolitan areas. This dataset is very commonly used in NLP challenges worldwide.
Size: 2.66 GB JSON, 2.9 GB SQL, and 7.5 GB of photos (all compressed)
Count: 5,200,000 reviews, 174,000 business attributes, 200,000 photos, and 11 metropolitan areas
SOTA: “Attentive Convolution” (https://arxiv.org/pdf/1710.00519.pdf)
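The JSON distribution stores one object per line, so the review file can be streamed without loading it all into memory. A minimal sketch; the filename reflects the unpacked archive and is an assumption.

```python
# Minimal sketch: stream the Yelp reviews file (one JSON object per line).
import json

with open("yelp_academic_dataset_review.json", encoding="utf-8") as f:
    for i, line in enumerate(f):
        review = json.loads(line)
        print(review["stars"], review["text"][:60])
        if i == 2:  # just peek at the first few reviews
            break
```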
Wikipedia Corpus
Link: http://nlp.cs.nyu.edu/wikipedia-data/
This dataset is a collection of full Wikipedia texts, containing nearly 1.9 billion words from over 4 million articles. You can search through it word by word, phrase by phrase, and paragraph by paragraph, making it a powerful NLP dataset.
Size: 20 MB
Count: 4,400,000 articles, containing 1.9 billion words
SOTA: “Breaking The Softmax Bottleneck: A High-Rank RNN language Model” (https://arxiv.org/pdf/1711.03953.pdf)
Blog Authorship Corpus
Link: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
This dataset contains blog posts collected from thousands of bloggers on blogger.com. Each blog is provided in a separate file, and each blog contains at least 200 occurrences of commonly used English words.
Size: 300 MB
Count: 681,288 blog posts, totaling over 140 million words.
SOTA: “Character-level and Multi-channel Convolutional Neural Networks for Large-scale Authorship Attribution” (https://arxiv.org/pdf/1609.06686.pdf)
European Language Machine Translation Dataset
Link: http://statmt.org/wmt18/index.html
This dataset contains training data for four European languages aimed at improving current translation methods. You can use the following language pairs:
- French – English
- Spanish – English
- German – English
- Czech – English
Size: About 15 GB
Count: About 30,000,000 sentences and corresponding translations
SOTA: “Attention Is All You Need” (https://arxiv.org/abs/1706.03762)
Reference Reading:
Academia | New breakthroughs in machine translation: Google achieves a fully attention-based translation architecture
Resource | TensorFlow implementation of Google’s all-attention machine translation model Transformer
Audio/Speech Datasets
Free Spoken Digit Dataset
Link: https://github.com/Jakobovski/free-spoken-digit-dataset
This is another dataset inspired by MNIST! It was created to solve the task of recognizing spoken digits in audio samples. It is an open dataset, so the hope is that it will keep growing as people contribute more recordings. Currently, it has the following characteristics:
- 3 different voices
- 1,500 recordings (50 of each digit per speaker)
- English pronunciation
Size: 10 MB
Count: 1500 audio samples
SOTA: “Raw Waveform-based Audio Classification Using Sample-level CNN Architectures” (https://arxiv.org/pdf/1712.00866)
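The label of each recording is recoverable from its filename, which follows the {digit}_{speaker}_{index}.wav pattern described in the repository. A minimal sketch that reads a few clips with SciPy; the local path is an assumption.

```python
# Minimal sketch: read FSDD recordings and derive labels from filenames.
import os
from scipy.io import wavfile

recordings = "free-spoken-digit-dataset/recordings"
for name in sorted(os.listdir(recordings))[:3]:
    digit = int(name.split("_")[0])  # e.g. "0_jackson_0.wav" -> 0
    rate, samples = wavfile.read(os.path.join(recordings, name))
    print(name, "->", digit, f"({len(samples) / rate:.2f}s at {rate} Hz)")
```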
Free Music Archive (FMA)
Link: https://github.com/mdeff/fma
FMA is a dataset for music analysis, consisting of full-length, high-quality audio, pre-computed features, and track- and user-level metadata. It is an open dataset for evaluating several tasks in MIR (music information retrieval). Below are the CSV files included in the dataset and their contents:
- tracks.csv: per-track metadata such as ID, title, artist, genres, tags, and play counts, for all 106,574 tracks.
- genres.csv: the IDs and names of all 163 genres, along with their parent genres (used to infer the genre hierarchy and top-level genres).
- features.csv: common features extracted with librosa.
- echonest.csv: audio features provided by Echonest (now Spotify) for a subset of 13,129 tracks.
Size: About 1000 GB
Count: About 100,000 tracks
SOTA: “Learning to Recognize Musical Genre from Audio” (https://arxiv.org/pdf/1803.05337.pdf)
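A minimal sketch of loading the metadata with pandas, assuming the two-level column header that the repository's own loading utility uses for tracks.csv; paths are assumptions.

```python
# Minimal sketch: load FMA metadata CSVs with pandas.
import pandas as pd

# tracks.csv uses a two-level column header, e.g. ('track', 'genre_top').
tracks = pd.read_csv("fma_metadata/tracks.csv", index_col=0, header=[0, 1])
genres = pd.read_csv("fma_metadata/genres.csv", index_col=0)

print(tracks[("track", "genre_top")].value_counts().head())
print(genres[["title", "parent"]].head())
```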
Ballroom
Link: http://mtg.upf.edu/ismir2004/contest/tempoContest/node5.html
This dataset contains ballroom dance audio files. Many characteristic excerpts of various dance styles are provided in real audio format. Some characteristics of the dataset:
- Total instances: 698
- Duration of each segment: about 30 seconds
- Total duration: about 20,940 seconds
Size: 14 GB (compressed)
Count: About 700 audio samples
SOTA: “A Multi-Model Approach To Beat Tracking Considering Heterogeneous Music Styles” (https://pdfs.semanticscholar.org/0cc2/952bf70c84e0199fcf8e58a8680a7903521e.pdf)
Million Song Dataset
Link: https://labrosa.ee.columbia.edu/millionsong/
The Million Song Dataset contains audio features and metadata for one million contemporary popular songs, available for free. Its purposes include:
- Encouraging research on commercial-scale algorithms
- Providing reference datasets for evaluating research
- Serving as a shortcut for creating large datasets using APIs (e.g., The Echo Nest API)
- Helping entry-level researchers work in the MIR field
The core of the dataset is the feature analysis and metadata of one million songs. The dataset does not contain any audio; it only includes exported features. Example audio can be obtained from services like 7digital using code provided by Columbia University (https://github.com/tb2332/MSongsDB/tree/master/Tasks_Demos/Preview7digital).
Size: 280 GB
Count: One million songs!
SOTA: “Preliminary Study on a Recommender System for the Million Songs Dataset Challenge” (http://www.ke.tu-darmstadt.de/events/PL-12/papers/08-aiolli.pdf)
LibriSpeech
Link: http://www.openslr.org/12/
This dataset is a large corpus containing about 1,000 hours of English speech. The data comes from audiobooks from the LibriVox project and has been reasonably segmented and aligned. If you are still looking for a starting point, acoustic models trained on this dataset and language models suitable for evaluation are also available online.
Size: About 60 GB
Count: 1000 hours of speech
SOTA: “Letter-Based Speech Recognition with Gated ConvNets” (https://arxiv.org/abs/1712.09444)
VoxCeleb
Link: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
VoxCeleb is a large-scale speaker identification dataset. It contains about 100,000 utterances from 1,251 celebrities, extracted from YouTube videos. The data is roughly gender-balanced (55% male), and the celebrities span a range of accents, professions, and ages. There is no overlap between the development and test sets. Isolating and identifying which celebrity a voice belongs to is an intriguing task.
Size: 150 MB
Count: 100,000 utterances from 1,251 celebrities
SOTA: “VoxCeleb: a large-scale speaker identification dataset” (https://www.robots.ox.ac.uk/~vgg/publications/2017/Nagrani17/nagrani17.pdf)
Real-World Problem Datasets
To help you practice, we also provide some real-life problems and datasets for readers to try their hand at. In this section, we list the deep learning problems hosted on the DataHack platform.
Twitter Sentiment Analysis Dataset
Link: https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/
Racist and sexist remarks have become a problem on Twitter, so it is important to separate such tweets from the rest. In this real-world problem, the Twitter data provided contains both ordinary and extremist remarks. As a data scientist, your task is to identify which tweets are extremist and which are not.
Size: 3 MB
Count: 31,962 tweets
Indian Actor Age Detection Dataset
Link: https://datahack.analyticsvidhya.com/contest/practice-problem-age-detection/
For deep learning enthusiasts, this is a fascinating challenge. This dataset contains images of thousands of Indian actors, and your task is to determine their ages. All images are manually selected and cropped from video frames, resulting in high variability in scale, pose, expression, brightness, age, resolution, occlusion, and makeup.
Size: 48 MB (compressed)
Count: 19,906 images in the training set and 6,636 images in the test set
Urban Sound Classification Dataset
Link: https://datahack.analyticsvidhya.com/contest/practice-problem-urban-sound-classification/
This dataset contains over 8,000 urban sound clips from 10 categories. This real-world problem aims to introduce you to audio processing in common classification scenarios.
Size: Training set – 3 GB (compressed), Test set – 2 GB (compressed)
Count: 8,732 labeled urban sound clips from 10 categories (each audio clip duration <= 4s)
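A common first step for this problem is to turn each clip into a fixed-length feature vector. Below is a minimal sketch using librosa MFCCs; the filename is an assumption, and this is one standard approach rather than the required one.

```python
# Minimal sketch: extract a fixed-length MFCC feature vector from one clip.
import librosa
import numpy as np

samples, rate = librosa.load("Train/1.wav", sr=None)  # keep the native rate
mfcc = librosa.feature.mfcc(y=samples, sr=rate, n_mfcc=20)
features = np.mean(mfcc, axis=1)  # average over time -> one 20-dim vector
print(features.shape)  # (20,)
```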
Original link: https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/
This article is compiled by Machine Heart, please contact this public account for authorization to reprint.
✄————————————————
Join Machine Heart (Full-time reporter/intern): [email protected]
Submissions or inquiries: [email protected]
Advertising & Business Cooperation: [email protected]