neural-networks.io

Datasets for deep learning

  • The MNIST database: a database of handwritten digits with a training set of 60,000 examples and a test set of 10,000 examples.
  • NIST Special Database 19: the entire corpus of training materials for handprinted document and character recognition. It contains over 800,000 images with hand-checked classifications.
  • The CIFAR-10 dataset: consists of 60,000 32x32 colour images in 10 classes (airplane, bird, cat, truck ...), with 6,000 images per class. There are 50,000 training images and 10,000 test images.
  • Caltech 101: pictures of objects belonging to 101 categories, with about 40 to 800 images per category. Most images are roughly 300 x 200 pixels.
  • Caltech 256: about 30,000 images of objects belonging to 256 categories.
  • The 20 Newsgroups data set: a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.
  • Reuters Corpora: a large collection of Reuters News stories for use in research and development of natural language processing, information retrieval, and machine learning systems.
  • English Web Treebank Propbank: provides semantic role annotation and predicate sense disambiguation for roughly 50,000 predicates, corresponding to all verbs, all adjectives in equational clauses and all nouns considered to be predicative.
  • The New York Times Annotated Corpus: contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007.
  • Web 1T 5-gram Version 1: contains English word n-grams and their observed frequency counts. The length of the n-grams ranges from unigrams (single words) to five-grams.
  • Wikipedia Database download: Wikipedia offers free copies of all available content to interested users.
  • Multi-Domain Sentiment Dataset: contains product reviews taken from Amazon.com from many product types.
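
To make the n-gram frequency counts mentioned for Web 1T 5-gram more concrete, here is a minimal sketch of counting unigrams through five-grams in a list of tokens. The function name `count_ngrams` and the toy sentence are assumptions made for illustration. They are not part of the corpus or its tooling, and the real data set ships precomputed counts built from a far larger text collection.

```python
from collections import Counter


def count_ngrams(tokens, max_n=5):
    """Count every n-gram of length 1..max_n in a list of tokens.

    Hypothetical helper for illustration only; not the method used
    to build the Web 1T 5-gram corpus.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy sentence, made up for this example (not taken from any corpus).
tokens = "the cat sat on the mat".split()
counts = count_ngrams(tokens)

print(counts[("the",)])        # frequency of the unigram "the"
print(counts[("the", "cat")])  # frequency of the bigram "the cat"
```

Each key is a tuple of words and each value is the number of times that sequence was observed, which mirrors the kind of n-gram and count pairs the data set description refers to.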