DataSets

Various datasets for machine learning, artificial intelligence, and data science. Maintained by the community: https://www.Neuromancer.kr/

Machine Learning Data

Pix2Pix

source location: https://people.eecs.berkeley.edu/~tinghuiz/projects/pix2pix/datasets/

Daily market capitalization ranking data from the Korea Exchange (KRX)

1995-05-02 to 2019-04-30 (24 years), 10 million records (CSV): https://github.com/FinanceData/marcap.git
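
As a rough illustration of how a per-day market capitalization ranking can be derived from a CSV of this kind, the sketch below ranks stocks within each date using only the Python standard library. The column names (`Date`, `Code`, `Marcap`) and the sample rows are assumptions made for the example, not the confirmed schema of the marcap repository.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample rows; the real marcap columns may differ.
SAMPLE_CSV = """Date,Code,Marcap
2019-04-29,A,300
2019-04-29,B,500
2019-04-30,A,450
2019-04-30,B,400
2019-04-30,C,100
"""


def rank_by_marcap(csv_text):
    """Return {date: [(rank, code, marcap), ...]} with rank 1 = largest cap."""
    by_date = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        by_date[row["Date"]].append((row["Code"], int(row["Marcap"])))

    rankings = {}
    for date, entries in by_date.items():
        # Sort each day's entries by market cap, largest first.
        ordered = sorted(entries, key=lambda e: e[1], reverse=True)
        rankings[date] = [
            (rank, code, cap) for rank, (code, cap) in enumerate(ordered, start=1)
        ]
    return rankings


rankings = rank_by_marcap(SAMPLE_CSV)
```

For the full 10-million-row data, a streaming or dataframe-based approach would likely be more practical than holding every row in memory.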

https://www.kaggle.com/c/aerial-cactus-identification

Pix2Pix

geo

Hyogo Prefecture: prefecture-wide digital topographic map portal (FY2010 to FY2018) https://www.geospatial.jp/ckan/dataset/2010-2018-hyogo-geo-potal

Image

Referenced from https://github.com/rudvlf0413/Dataset.git

Dog Breed Identification dataset
- The dataset is designed for a multiclass classification problem, as it covers 120 breeds of dogs.
- https://www.kaggle.com/c/dog-breed-identification/data

TTS

Dataset: http://www.openslr.org/60/

Youtube

https://research.google.com/youtube8m/index.html?fbclid=IwAR3JtSscHE1npIsYNwLpJtnSN_Oym_zO6TJTMSoVPv6u6FogzjunKVisyHI
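- A dataset from Google AI that extends part of the existing YouTube-8M with segment-level annotations
- The original YouTube-8M provided machine-generated labels at the video/frame level; this release provides human-verified labels at the segment level
- Across 1,000 classes:
- 237K labels (manually verified by humans)
- An average of 5 segments per video
- Each segment is a 5-second span sampled at random from the video
- The annotation format is similar to the original YouTube-8M (segment start and end, plus the label information for each segment)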
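
To make the segment-level annotation described above more concrete, here is a minimal sketch of how one such record might be represented and sanity-checked. The class name, field names, and the integer label encoding are hypothetical; only the 5-second segment length and the 1,000-class count come from the description above, and the actual YouTube-8M file format is not reproduced here.

```python
from dataclasses import dataclass

SEGMENT_SECONDS = 5  # segment length stated in the dataset description


@dataclass(frozen=True)
class SegmentAnnotation:
    """One human-verified segment label (field names are hypothetical)."""
    video_id: str
    start_sec: float
    end_sec: float
    label: int  # class index, assumed to lie in [0, num_classes)

    def is_valid(self, num_classes=1000):
        # A segment must span exactly 5 seconds and carry a known class.
        duration_ok = abs((self.end_sec - self.start_sec) - SEGMENT_SECONDS) < 1e-9
        return duration_ok and 0 <= self.label < num_classes


def group_by_video(annotations):
    """Group segment annotations by video id."""
    grouped = {}
    for ann in annotations:
        grouped.setdefault(ann.video_id, []).append(ann)
    return grouped


example = [
    SegmentAnnotation("vid_a", 10.0, 15.0, 3),
    SegmentAnnotation("vid_a", 40.0, 45.0, 3),
    SegmentAnnotation("vid_b", 0.0, 5.0, 999),
]
grouped = group_by_video(example)
```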

Classification or Recognition or Generative

tencent-ml-images
- https://github.com/Tencent/tencent-ml-images.git
FFHQ (Flickr-Faces-HQ)
- https://github.com/NVlabs/ffhq-dataset
Coil-20
- http://www.cs.columbia.edu/CAVE/software/softlib/coil-20.php
MS COCO
- http://mscoco.org/dataset/#overview
NVIDIA food Image classification
- https://github.com/corona10/FoodDataSet
CIFAR-10, CIFAR-100
- https://www.cs.toronto.edu/~kriz/cifar.html
Large-scale CelebFaces Attributes (CelebA) Dataset
- http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
Street View House Numbers (SVHN)
- http://ufldl.stanford.edu/housenumbers/
MNIST
- http://yann.lecun.com/exdb/mnist/
Facial Database
- http://www.face-rec.org/databases/
Simple Vector Drawing Datasets
- https://github.com/hardmaru/sketch-rnn-datasets
Places2 (Space)
- http://places2.csail.mit.edu/download.html
Yelp dataset (restaurants)
- https://www.yelp.com/dataset_challenge
DeepFashion
- http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html
Image to LaTeX (a dataset for converting images of formulas into LaTeX code)
- https://zenodo.org/record/56198#.WTpQ73XyhPN
NIST Dataset (Fingerprint, Mugshot, OCR)
- https://www.nist.gov/srd/nist-special-database-4
Biometrics Ideal Test dataset (iris, fingerprint, face, palmprint, handwriting, etc.; login required)
- http://biometrics.idealtest.org/index.jsp
PASCAL 2012 Dataset (Classification & Detection)
- http://host.robots.ox.ac.uk/pascal/VOC/voc2012/index.html#data

Medical

Lung cancer dataset
- https://luna.grand-challenge.org
- https://www.kaggle.com/c/data-science-bowl-2017
Brain tumor dataset
- http://braintumorsegmentation.org
Breast cancer dataset (kaggle)
- https://www.kaggle.com/uciml/breast-cancer-wisconsin-data
The cancer image archive
- http://www.cancerimagingarchive.net
Mammography dataset
- http://marathon.csee.usf.edu/Mammography/Database.html
Bio Image Dataset @ IIIT Delhi
- http://www.iab-rubric.org/resources.html
CAMELYON16 - metastasis detection in lymph nodes
- https://camelyon16.grand-challenge.org/
CAMELYON17 Dataset
- https://camelyon17.grand-challenge.org/

Video

YouTube-BoundingBoxes Dataset
- https://research.google.com/youtube-bb/index.html
Youtube-8M Dataset
- https://research.google.com/youtube8m/
The Kinetics Human Action Video Dataset
- https://deepmind.com/research/open-source/open-source-datasets/kinetics/

Text

Neural Network Translation

StatMT (machine translation, summarization)
UN parallel Corpus
- https://conferences.unite.un.org/UNCorpus
IWSLT Dataset (including TED Translation)
- https://sites.google.com/site/iwsltevaluation2016/
The Stacks Project
- (A paired set of an algebraic geometry book's text and its LaTeX source?)
- http://stacks.math.columbia.edu/
Google sentence compression (sentence data normalized into a standard form by Google)
- http://storage.googleapis.com/sentencecomp/compression-data.json
Annals of the Joseon Dynasty (Korean / Classical Chinese translation)
- http://sillok.history.go.kr/main/main.do

Categorical & Topic modeling

20 Newsgroups
- http://qwone.com/~jason/20Newsgroups/
Reuters dataset
- https://archive.ics.uci.edu/ml/datasets/reuters-21578+text+categorization+collection

Short text

Tweet data, a subset of TREC 2011 microblog track
- http://trec.nist.gov/data/tweets/
Title data, including news titles with class labels from some news websites
- http://www.sogou.com/al

QA

bAbI dataset (Facebook Question Answering)
- https://research.facebook.com/research/babi/
Question answering (cloze-style) pairs built from CNN/Daily Mail articles
- https://github.com/deepmind/rc-data
Stanford Question Answering Dataset
- https://rajpurkar.github.io/SQuAD-explorer/
CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning
- http://cs.stanford.edu/people/jcjohns/clevr/
WikiReading dataset
- https://github.com/google-research-datasets/wiki-reading

Word Embedding

Datasets used for Word2Vec (Wikipedia, WMT11, etc.)
- https://code.google.com/archive/p/word2vec/
Fast Text pre-trained vector set
- https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Sentiment Analysis

Stanford Sentiment Treebank(SST)
- http://nlp.stanford.edu/sentiment/

Sound

Nottingham music dataset
- https://www-labs.iro.umontreal.ca/~lisa/deep/data/
A large-scale dataset of manually annotated audio events (Google research)
- https://research.google.com/audioset/

Knowledge Base

Freebase
- https://datahub.io/ko_KR/dataset/freebase
Wordnet
- https://wordnet.princeton.edu/
Microsoft Concept Graph
- https://concept.msra.cn/Home/Download
DBPedia Dataset
- The DBpedia data set uses a large multi-domain ontology which has been derived from Wikipedia as well as localized versions of DBpedia in more than 100 l
- http://wiki.dbpedia.org/services-resources/datasets/dbpedia-datasets
Yago
- YAGO3 is a huge semantic knowledge base, derived from Wikipedia, WordNet, and GeoNames.
- https://datahub.io/ko_KR/dataset/yago
Google Knowledge graph API
- https://developers.google.com/knowledge-graph/

Social Networks & Recommendation

AMiner - Datasets for social network Analysis
- https://cn.aminer.org/data
Netflix Prize Data Set
- http://academictorrents.com/details/9b13183dc4d60676b773c9e2cd6de5e5542cee9a
Paper bibliography dataset, Author Citation Networks
Politics subreddit
- http://snap.stanford.edu/data/politics_subreddit.tar.gz
Amazon dataset
- http://snap.stanford.edu/data/amazon-meta.html
Twitter Spammer network
- http://twitter.mpi-sws.org/spam/
Twitter tweets
- http://snap.stanford.edu/data/twitter7.html
Online reviews
- http://snap.stanford.edu/data/#reviews

Pre-trained Model

Word2Vec
- https://code.google.com/archive/p/word2vec/
GloVe
- https://nlp.stanford.edu/projects/glove/
FastText
- https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

Korean Domestic Datasets

SKT Bigdata hub
- https://www.bigdatahub.co.kr/index.do

ETC.

Titanic survivors dataset
- https://goo.gl/P9CMFY
Obama’s political speeches
- https://github.com/samim23/obama-rnn
Yahoo Finance dataset
- https://finance.yahoo.com/quote/GOOG/history?ltr=1
Linux code
- https://github.com/torvalds/linux
NYC Taxi dataset
- http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml
US Census dataset
- https://www.census.gov/topics/income-poverty/income/data/datasets.html

Data Science Data

Diamond.csv
countries.csv
exprs_GSE5859.csv
movies.dat
movie_lines.txt
movie conversation
mtcars.csv
pollster_cleaned_2002_2008.csv
pollster_cleaned_2010.csv
pollster_cleaned_2012.csv
kospi_kospi.csv
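
Before working with any of the CSV files listed above, it can help to peek at the header and the first few rows. The helper below is a small sketch of that idea using only the standard library; `preview_csv` is a hypothetical name, and the demo content is made up for illustration rather than taken from the files in this repository.

```python
import csv
import os
import tempfile


def preview_csv(path, n_rows=3):
    """Return (header, first n_rows) of a CSV file without loading it all."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = []
        for row in reader:
            if len(rows) >= n_rows:
                break
            rows.append(row)
    return header, rows


# Demo on a throwaway file with made-up content (not the real mtcars.csv).
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline="", encoding="utf-8"
) as tmp:
    tmp.write("model,mpg\ncar_a,21.0\ncar_b,22.8\ncar_c,18.7\ncar_d,14.3\n")
    demo_path = tmp.name

header, rows = preview_csv(demo_path, n_rows=2)
os.remove(demo_path)
```

The same call should work on files such as `mtcars.csv` or `countries.csv`, assuming they are comma-delimited and UTF-8 encoded.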

