Selected from Analytics Vidhya
Author: Pranav Dar
Translated by Machine Heart
Contributors: Chen Yunzhu, Lu
This article introduces 25 open deep learning datasets, including those for image processing, natural language processing, speech recognition, and real-world problem datasets.
Introduction
The key to deep learning (or most fields in life) is practice. You need to practice solving various problems, including image processing, speech recognition, etc. Each problem has its unique nuances and solutions.
But where can you obtain data? Many papers now use proprietary datasets, which are often not publicly available. If you want to learn and apply skills, the inability to access suitable datasets is a problem.
If you are facing this issue, this article offers a solution. It introduces a range of publicly available, high-quality datasets that every deep learning enthusiast should work with to sharpen their skills. Working with these datasets will make you a better data scientist, and what you learn will be invaluable in your career. We also include papers with state-of-the-art (SOTA) results for readers to study and use to improve their models.
How to Use These Datasets?
First, understand that these datasets are very large! Therefore, make sure you have a fast internet connection with no, or a very generous, limit on the amount of data you can download.
There are various ways to use these datasets; you can apply different deep learning techniques. You can use them to hone your skills, learn how to identify and construct various problems, think of unique use cases, and share your findings with everyone!
The datasets fall into three categories: image processing, natural language processing, and audio/speech processing. A final section adds real-world practice problems.
Let’s take a look together!
Image Processing Datasets
MNIST
Link: https://datahack.analyticsvidhya.com/contest/practice-problem-identify-the-digits/
MNIST is one of the most popular deep learning datasets. It is a handwritten digit dataset containing a training set of 60,000 examples and a test set of 10,000 examples. It is a very good database for trying learning techniques and pattern-recognition methods on real-world data while spending minimal time and effort on data preprocessing.
Size: About 50 MB
Count: 70,000 images, divided into 10 categories.
SOTA: “Dynamic Routing Between Capsules” (https://arxiv.org/abs/1710.09829)
Reference Reading:
- Finally, Geoffrey Hinton’s highly anticipated Capsule paper is public
- Analysis of Geoffrey Hinton’s recently proposed Capsule program
- Understand the CapsNet architecture first and then implement it using TensorFlow; this should be the most detailed tutorial
- After the official code for Capsule was open-sourced, Machine Heart provided a core code interpretation
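If you want a quick start, MNIST ships with loaders in common frameworks. Below is a minimal sketch using the Keras datasets API, one convenient option; the raw files can also be downloaded from the link above.

```python
# Minimal sketch: load MNIST via the Keras convenience loader.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
print(x_train.shape, x_test.shape)  # (60000, 28, 28) (10000, 28, 28)

# Pixel values are uint8 in [0, 255]; scale to [0, 1] before training.
x_train, x_test = x_train / 255.0, x_test / 255.0
```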
MS-COCO
Link: http://cocodataset.org/#home
COCO is a large dataset for object detection, segmentation, and caption generation. It has the following features:
- Object segmentation
- Contextual recognition
- Superpixel stuff segmentation
- 330,000 images (over 200,000 annotated)
- 1.5 million object instances
- 80 object categories
- 91 stuff categories
- 5 captions per image
- 250,000 people with keypoint annotations
Size: About 25 GB (compressed)
Count: 330,000 images, 80 object categories, 5 captions per image, 250,000 people with keypoint annotations
SOTA: “Mask R-CNN” (https://arxiv.org/abs/1703.06870)
Reference Reading:
Academia | Facebook’s new paper proposes a universal object segmentation framework Mask R-CNN: simpler, more flexible, and better performance
Deep Learning | Convolutional Neural Networks for Image Segmentation: From R-CNN to Mask R-CNN
Resource | Mask R-CNN Applications: Masking Portraits like the British TV series “Black Mirror”
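To explore the annotations programmatically, COCO provides an official Python API (pycocotools). A minimal sketch is below; the annotation file path is an assumption for illustration.

```python
# Minimal sketch: browse COCO annotations with the official pycocotools API.
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")  # path is an assumption

# The 80 object categories.
cats = coco.loadCats(coco.getCatIds())
print([c["name"] for c in cats][:5])

# All object instances annotated in one image.
img_id = coco.getImgIds()[0]
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
print(len(anns), "annotated instances in image", img_id)
```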
ImageNet
Link: http://www.image-net.org/
ImageNet is an image dataset organized according to the WordNet hierarchy. WordNet contains about 100,000 phrases, and ImageNet provides an average of about 1,000 descriptive images for each phrase.
Size: About 150 GB
Count: Approximately 1,500,000 images; each image has multiple bounding boxes and corresponding category labels.
SOTA: “Aggregated Residual Transformations for Deep Neural Networks” (https://arxiv.org/pdf/1611.05431.pdf)
Open Images Dataset
Link: https://github.com/openimages/dataset
Open Images is a dataset containing nearly 9 million image URLs. These images are annotated with image-level labels and bounding boxes spanning thousands of categories. The training set contains 9,011,219 images, the validation set contains 41,260 images, and the test set contains 125,436 images.
Size: 500GB (compressed)
Count: 9,011,219 images with over 5,000 labels
SOTA: Resnet 101 image classification model (trained on V2 data):
- Model checkpoint: https://storage.googleapis.com/openimages/2017_07/oidv2-resnet_v1_101.ckpt.tar.gz
- Checkpoint readme: https://storage.googleapis.com/openimages/2017_07/oidv2-resnet_v1_101.readme.txt
- Inference code: https://github.com/openimages/dataset/blob/master/tools/classify_oidv2.py
VisualQA
Link: http://www.visualqa.org/
VQA is a dataset containing open-ended questions about images. Answering these questions requires visual and linguistic understanding. The dataset has the following interesting features:
- 265,016 images (COCO and abstract scenes)
- Each image contains at least 3 questions (an average of 5.4 questions)
- Each question has 10 correct answers
- Each question has 3 seemingly reasonable (but incorrect) answers
- Automatic evaluation metrics
Size: 25GB (compressed)
Count: 265,016 images, each with at least 3 questions, and each question has 10 correct answers
SOTA: “Tips and Tricks for Visual Question Answering: Learnings from the 2017 Challenge” (https://arxiv.org/abs/1708.02711)
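The questions are distributed as JSON. The sketch below assumes the published v2 file layout (a top-level "questions" list whose entries carry "image_id" and "question"); the filename is an assumption.

```python
# Minimal sketch: inspect a VQA questions file.
import json

with open("v2_OpenEnded_mscoco_val2014_questions.json") as f:
    data = json.load(f)

for q in data["questions"][:3]:
    print(q["image_id"], "->", q["question"])
```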
Street View House Numbers Dataset (SVHN)
Link: http://ufldl.stanford.edu/housenumbers/
This is a real-world dataset used for developing object detection algorithms. It requires minimal data preprocessing. It is somewhat similar to the MNIST dataset but has more annotated data (over 600,000 images). These data were collected from house numbers in Google Street View.
Size: 2.5GB
Count: 630,420 images, divided into 10 classes
SOTA: “Distributional Smoothing With Virtual Adversarial Training” (https://arxiv.org/pdf/1507.00677.pdf)
This paper from Kyoto University proposes local distributional smoothness (LDS), a new notion of smoothness of the model distribution that can be used as a regularizer. The method not only performs excellently in supervised and semi-supervised learning on MNIST, but also achieves scores of 24.63 and 9.88 on SVHN and NORB respectively, showing that it significantly outperforms the previous best results on semi-supervised learning tasks.
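The cropped-digit version of SVHN is distributed as MATLAB .mat files, so it can be read directly with SciPy. A minimal sketch, assuming the train_32x32.mat layout described on the dataset page ('X' holds images, 'y' holds labels, with digit 0 stored as label 10):

```python
# Minimal sketch: read the cropped-digits SVHN file with SciPy.
import numpy as np
from scipy.io import loadmat

data = loadmat("train_32x32.mat")
images = np.transpose(data["X"], (3, 0, 1, 2))  # (N, 32, 32, 3)
labels = data["y"].flatten()
labels[labels == 10] = 0  # the dataset stores digit 0 as label 10
print(images.shape, labels[:10])
```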
CIFAR-10
Link: http://www.cs.toronto.edu/~kriz/cifar.html
This dataset is also used for image classification. It consists of 60,000 images across 10 classes: 50,000 training images and 10,000 test images. The dataset is divided into six parts: five training batches and one test batch, each containing 10,000 images.
Size: 170MB
Count: 60,000 images, divided into 10 classes
SOTA: “ShakeDrop regularization” (https://openreview.net/pdf?id=S1NHaMW0b)
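Each batch file is a Python pickle, and the dataset page documents how to unpickle it. A minimal sketch of a reader, assuming the documented b'data'/b'labels' dictionary layout:

```python
# Minimal sketch: load one CIFAR-10 batch file.
import pickle

def load_cifar_batch(path):
    with open(path, "rb") as f:
        batch = pickle.load(f, encoding="bytes")
    # b'data' is a (10000, 3072) uint8 array; reshape to N x 3 x 32 x 32.
    data = batch[b"data"].reshape(-1, 3, 32, 32)
    labels = batch[b"labels"]
    return data, labels

images, labels = load_cifar_batch("cifar-10-batches-py/data_batch_1")
print(images.shape, labels[:10])
```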
Fashion-MNIST
Link: https://github.com/zalandoresearch/fashion-mnist
Fashion-MNIST contains 60,000 training images and 10,000 test images. It is a fashion-product database in the style of MNIST. The developers believe MNIST has become overused, so they built this dataset as a direct drop-in replacement for it. Each image is in grayscale and associated with one label from 10 classes.
Size: 30MB
Count: 70,000 images, divided into 10 classes
SOTA: “Random Erasing Data Augmentation” (https://arxiv.org/abs/1708.04896)
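Because it mirrors MNIST's format, the loading code is identical apart from the dataset name. A minimal sketch via Keras (the repository also provides its own loader):

```python
# Minimal sketch: Fashion-MNIST as a drop-in MNIST replacement.
import tensorflow as tf

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()
print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)

# The 10 class names, as listed in the repository README.
classes = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
           "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
print(classes[y_train[0]])
```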
Natural Language Processing Datasets
IMDB Movie Review Dataset
Link: http://ai.stanford.edu/~amaas/data/sentiment/
This dataset is great for movie enthusiasts. It is used for binary sentiment classification and currently contains more data than other datasets in the field. In addition to the training and test review examples, there is also unlabeled data available. Raw text and a preprocessed bag-of-words format are both included.
Size: 80 MB
Count: 25,000 highly polar movie reviews each for training and testing
SOTA: “Learning Structured Text Representations” (https://arxiv.org/abs/1705.09207)
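The raw archive unpacks into an aclImdb/ directory whose pos/ and neg/ subfolders hold one review per text file. A minimal sketch of a reader based on that documented layout; the local path is an assumption.

```python
# Minimal sketch: read the raw IMDB reviews from the unpacked archive.
import os

def read_reviews(split_dir):
    reviews = []
    for label in ("pos", "neg"):
        folder = os.path.join(split_dir, label)
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), encoding="utf-8") as f:
                reviews.append((f.read(), 1 if label == "pos" else 0))
    return reviews

train = read_reviews("aclImdb/train")
print(len(train))  # 25,000 labeled training reviews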
Twenty Newsgroups Dataset
Link: https://archive.ics.uci.edu/ml/datasets/Twenty+Newsgroups
As the name suggests, this dataset covers information related to newsgroups, consisting of a compilation of 20,000 newsgroup documents collected from 20 different newsgroups (1,000 articles selected from each group). These articles have typical features such as subject lines, signatures, and quotes.
Size: 20MB
Count: 20,000 messages from 20 newsgroups
SOTA: “Very Deep Convolutional Networks for Text Classification” (https://arxiv.org/abs/1606.01781)
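scikit-learn ships a fetcher for this dataset, which saves you from parsing the raw files yourself. A minimal sketch:

```python
# Minimal sketch: fetch the 20 Newsgroups training split with scikit-learn.
from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset="train")
print(len(train.data), "documents across", len(train.target_names), "newsgroups")
print(train.target_names[:5])
```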
Sentiment140
Link: http://help.sentiment140.com/for-students/
Sentiment140 is a dataset for sentiment analysis. This popular dataset is a perfect way to kickstart your journey in natural language processing. Emoticons have been pre-removed from the data. The final dataset has the following six features:
- Emotion polarity of the tweet
- Tweet ID
- Date of the tweet
- Query
- Twitter username
- Tweet text
Size: 80MB (compressed)
Count: 160,000 tweets
SOTA: “Assessing State-of-the-Art Sentiment Models on State-of-the-Art Sentiment Datasets” (http://www.aclweb.org/anthology/W17-5202)
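The data is a header-less CSV, so the six features above become column names you supply yourself. A minimal sketch with pandas; the filename follows the distributed archive and is an assumption, as is the latin-1 encoding commonly needed for this file.

```python
# Minimal sketch: load Sentiment140 with pandas, naming the six columns.
import pandas as pd

cols = ["polarity", "id", "date", "query", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", names=cols)
print(df[["polarity", "text"]].head())
```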
WordNet
Link: https://wordnet.princeton.edu/
As mentioned when introducing the ImageNet dataset, WordNet is a large English synset database. A synset is a group of synonyms describing different concepts. The structure of WordNet makes it a very useful tool in NLP.
Size: 10 MB
Count: 117,000 synsets, which are interconnected through a small number of “concept relationships” with other synsets
SOTA: “Wordnets: State of the Art and Perspectives” (https://aclanthology.info/pdf/R/R11/R11-1097.pdf)
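The easiest way to query WordNet from Python is NLTK's corpus reader. A minimal sketch showing synsets and one concept relation (hypernymy):

```python
# Minimal sketch: query WordNet through NLTK.
import nltk
nltk.download("wordnet", quiet=True)  # one-time corpus download
from nltk.corpus import wordnet as wn

for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())

# Synsets are linked by concept relations, e.g. hypernyms ("is-a" parents):
print(wn.synsets("dog")[0].hypernyms())
```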
Yelp Dataset
Link: https://www.yelp.com/dataset
This is an open dataset released by Yelp for learning purposes. It contains millions of user reviews, business attributes, and over 200,000 photos from multiple metropolitan areas. This dataset is very commonly used in NLP challenges worldwide.
Size: 2.66 GB JSON, 2.9 GB SQL, and 7.5 GB of photos (all compressed)
Count: 5,200,000 reviews, 174,000 business attributes, 200,000 photos, and 11 metropolitan areas
SOTA: “Attentive Convolution” (https://arxiv.org/pdf/1710.00519.pdf)
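The JSON distribution stores one object per line, so the review file can be streamed without loading it all into memory. A minimal sketch; the filename reflects the unpacked archive and is an assumption.

```python
# Minimal sketch: stream the Yelp reviews file (one JSON object per line).
import json

with open("yelp_academic_dataset_review.json", encoding="utf-8") as f:
    for i, line in enumerate(f):
        review = json.loads(line)
        print(review["stars"], review["text"][:60])
        if i == 2:  # just peek at the first few reviews
            break
```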
Wikipedia Corpus
Link: http://nlp.cs.nyu.edu/wikipedia-data/
This dataset is a collection of full Wikipedia texts, containing nearly 1.9 billion words from over 4 million articles. You can search through it word by word, phrase by phrase, and paragraph by paragraph, making it a powerful NLP dataset.
Size: 20 MB
Count: 4,400,000 articles, containing 1.9 billion words
SOTA: “Breaking The Softmax Bottleneck: A High-Rank RNN language Model” (https://arxiv.org/pdf/1711.03953.pdf)
Blog Authorship Corpus
Link: http://u.cs.biu.ac.il/~koppel/BlogCorpus.htm
This dataset contains blog posts collected from thousands of bloggers on blogger.com. Each blog is provided in a separate file, and each blog contains at least 200 occurrences of commonly used English words.
Size: 300 MB
Count: 681,288 blog posts, totaling over 140 million words.
SOTA: “Character-level and Multi-channel Convolutional Neural Networks for Large-scale Authorship Attribution” (https://arxiv.org/pdf/1609.06686.pdf)
European Language Machine Translation Dataset
Link: http://statmt.org/wmt18/index.html
This dataset contains training data for four European languages aimed at improving current translation methods. You can use the following language pairs:
- French – English
- Spanish – English
- German – English
- Czech – English
Size: About 15 GB
Count: About 30,000,000 sentences and corresponding translations
SOTA: “Attention Is All You Need” (https://arxiv.org/abs/1706.03762)
Reference Reading:
Academia | New breakthroughs in machine translation: Google achieves a fully attention-based translation architecture
Resource | TensorFlow implementation of Google’s all-attention machine translation model Transformer
Audio/Speech Datasets
Free Spoken Digit Dataset
Link: https://github.com/Jakobovski/free-spoken-digit-dataset
This is another dataset inspired by MNIST! It was created to solve the task of recognizing spoken digits in audio samples. It is an open dataset, so the hope is that it will keep growing as people contribute more recordings. Currently, it has the following characteristics:
- 3 different voices
- 1,500 recordings (50 of each digit per speaker)
- English pronunciation
Size: 10 MB
Count: 1500 audio samples
SOTA: “Raw Waveform-based Audio Classification Using Sample-level CNN Architectures” (https://arxiv.org/pdf/1712.00866)
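The label of each recording is recoverable from its filename, which follows the {digit}_{speaker}_{index}.wav pattern described in the repository. A minimal sketch that reads a few clips with SciPy; the local path is an assumption.

```python
# Minimal sketch: read FSDD recordings and derive labels from filenames.
import os
from scipy.io import wavfile

recordings = "free-spoken-digit-dataset/recordings"
for name in sorted(os.listdir(recordings))[:3]:
    digit = int(name.split("_")[0])  # e.g. "0_jackson_0.wav" -> 0
    rate, samples = wavfile.read(os.path.join(recordings, name))
    print(name, "->", digit, f"({len(samples) / rate:.2f}s at {rate} Hz)")
```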
Free Music Archive (FMA)
Link: https://github.com/mdeff/fma
FMA is a dataset for music analysis, consisting of full-length, high-quality audio, pre-computed features, and track- and user-level metadata. It is an open dataset for evaluating several tasks in MIR (music information retrieval). Below are the CSV files included in the dataset and their contents:
- tracks.csv: per-track metadata such as ID, title, artist, genres, tags, and play counts, for all 106,574 tracks.
- genres.csv: the IDs and names of all 163 genres, along with their parent genres (used to infer the genre hierarchy and top-level genres).
- features.csv: common features extracted with librosa.
- echonest.csv: audio features provided by Echonest (now Spotify) for a subset of 13,129 tracks.
Size: About 1000 GB
Count: About 100,000 tracks
SOTA: “Learning to Recognize Musical Genre from Audio” (https://arxiv.org/pdf/1803.05337.pdf)
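A minimal sketch of loading the metadata with pandas, assuming the two-level column header that the repository's own loading utility uses for tracks.csv; paths are assumptions.

```python
# Minimal sketch: load FMA metadata CSVs with pandas.
import pandas as pd

# tracks.csv uses a two-level column header, e.g. ('track', 'genre_top').
tracks = pd.read_csv("fma_metadata/tracks.csv", index_col=0, header=[0, 1])
genres = pd.read_csv("fma_metadata/genres.csv", index_col=0)

print(tracks[("track", "genre_top")].value_counts().head())
print(genres[["title", "parent"]].head())
```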
Ballroom
Link: http://mtg.upf.edu/ismir2004/contest/tempoContest/node5.html
This dataset contains ballroom dance audio files. Many characteristic excerpts of various dance styles are provided in real audio format. Some characteristics of the dataset:
- Total instances: 698
- Duration of each segment: about 30 seconds
- Total duration: about 20,940 seconds
Size: 14 GB (compressed)
Count: About 700 audio samples
SOTA: “A Multi-Model Approach To Beat Tracking Considering Heterogeneous Music Styles” (https://pdfs.semanticscholar.org/0cc2/952bf70c84e0199fcf8e58a8680a7903521e.pdf)
Million Song Dataset
Link: https://labrosa.ee.columbia.edu/millionsong/
The Million Song Dataset contains audio features and metadata for one million contemporary popular songs, available for free. Its purposes include:
- Encouraging research on commercial-scale algorithms
- Providing reference datasets for evaluating research
- Serving as a shortcut for creating large datasets using APIs (e.g., The Echo Nest API)
- Helping entry-level researchers work in the MIR field
The core of the dataset is the feature analysis and metadata of one million songs. The dataset does not contain any audio; it only includes exported features. Example audio can be obtained from services like 7digital using code provided by Columbia University (https://github.com/tb2332/MSongsDB/tree/master/Tasks_Demos/Preview7digital).
Size: 280 GB
Count: One million songs!
SOTA: “Preliminary Study on a Recommender System for the Million Songs Dataset Challenge” (http://www.ke.tu-darmstadt.de/events/PL-12/papers/08-aiolli.pdf)
LibriSpeech
Link: http://www.openslr.org/12/
This dataset is a large corpus containing about 1,000 hours of English speech. The data comes from audiobooks from the LibriVox project and has been reasonably segmented and aligned. If you are still looking for a starting point, acoustic models trained on this dataset and language models suitable for evaluation are also available online.
Size: About 60 GB
Count: 1000 hours of speech
SOTA: “Letter-Based Speech Recognition with Gated ConvNets” (https://arxiv.org/abs/1712.09444)
VoxCeleb
Link: http://www.robots.ox.ac.uk/~vgg/data/voxceleb/
VoxCeleb is a large-scale speaker identification dataset. It contains about 100,000 utterances from 1,251 celebrities, extracted from YouTube videos. The data is roughly gender-balanced (55% male), and the celebrities span a range of accents, professions, and ages. There is no overlap between the development and test sets. Isolating and identifying which celebrity a voice belongs to is an intriguing task.
Size: 150 MB
Count: 100,000 utterances from 1,251 celebrities
SOTA: “VoxCeleb: a large-scale speaker identification dataset” (https://www.robots.ox.ac.uk/~vgg/publications/2017/Nagrani17/nagrani17.pdf)
Real-World Problem Datasets
To help you practice, we also provide some real-life problems and datasets for readers to try their hand at. In this section, we list the deep learning problems hosted on the DataHack platform.
Twitter Sentiment Analysis Dataset
Link: https://datahack.analyticsvidhya.com/contest/practice-problem-twitter-sentiment-analysis/
Racist and sexist remarks have become a problem on Twitter, so it is important to separate such tweets from the rest. In this real-world problem, the Twitter data provided contains both ordinary and extremist remarks. As a data scientist, your task is to identify which tweets are extremist and which are not.
Size: 3 MB
Count: 31,962 tweets
Indian Actor Age Detection Dataset
Link: https://datahack.analyticsvidhya.com/contest/practice-problem-age-detection/
For deep learning enthusiasts, this is a fascinating challenge. This dataset contains images of thousands of Indian actors, and your task is to determine their ages. All images are manually selected and cropped from video frames, resulting in high variability in scale, pose, expression, brightness, age, resolution, occlusion, and makeup.
Size: 48 MB (compressed)
Count: 19,906 images in the training set and 6,636 images in the test set
Urban Sound Classification Dataset
Link: https://datahack.analyticsvidhya.com/contest/practice-problem-urban-sound-classification/
This dataset contains over 8,000 urban sound clips from 10 categories. This real-world problem aims to introduce you to audio processing in common classification scenarios.
Size: Training set – 3 GB (compressed), Test set – 2 GB (compressed)
Count: 8,732 labeled urban sound clips from 10 categories (each audio clip duration <= 4s)
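A common first step for this problem is to turn each clip into a fixed-length feature vector. Below is a minimal sketch using librosa MFCCs; the filename is an assumption, and this is one standard approach rather than the required one.

```python
# Minimal sketch: extract a fixed-length MFCC feature vector from one clip.
import librosa
import numpy as np

samples, rate = librosa.load("Train/1.wav", sr=None)  # keep the native rate
mfcc = librosa.feature.mfcc(y=samples, sr=rate, n_mfcc=20)
features = np.mean(mfcc, axis=1)  # average over time -> one 20-dim vector
print(features.shape)  # (20,)
```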
Original link: https://www.analyticsvidhya.com/blog/2018/03/comprehensive-collection-deep-learning-datasets/
This article is compiled by Machine Heart, please contact this public account for authorization to reprint.
✄————————————————
Join Machine Heart (Full-time reporter/intern): [email protected]
Submissions or inquiries: [email protected]
Advertising & Business Cooperation: [email protected]