Datasets
Some example datasets for analysis with Weka are included in the Weka distribution and can be found in the data folder of the installed software.
Miscellaneous collections of datasets
- A jarfile containing 37 classification problems originally obtained from the UCI repository of machine learning datasets (datasets-UCI.jar, 1,190,961 Bytes).
- A jarfile containing 37 regression problems obtained from various sources (datasets-numeric.jar, 169,344 Bytes).
- A jarfile containing 6 agricultural datasets obtained from agricultural researchers in New Zealand (agridatasets.jar, 31,200 Bytes).
- A jarfile containing 30 regression datasets collected by Professor Luis Torgo (regression-datasets.jar, 10,090,266 Bytes).
- A gzip'ed tar containing UCI ML and UCI KDD datasets (uci-20070111.tar.gz, 17,952,832 Bytes).
- A gzip'ed tar containing StatLib datasets (statlib-20050214.tar.gz, 12,785,582 Bytes).
- A gzip'ed tar containing ordinal, real-world datasets donated by Professor Arie Ben David (datasets-arie_ben_david.tar.gz, 11,348 Bytes).
- A zip file containing 19 multi-class (1-of-n) text datasets donated by Dr George Forman (19MclassTextWc.zip, 14,084,828 Bytes).
- A bzip'ed tar file containing the Reuters21578 dataset split into separate files according to the ModApte split (reuters21578-ModApte.tar.bz2, 81,745,032 Bytes).
- A zip file containing 41 drug design datasets formed using the Adriana.Code software donated by Dr Mehmet Fatih Amasyali (Drug-datasets.zip, 11,376,153 Bytes)
- A zip file containing 80 artificial datasets generated from the Friedman function donated by Dr. M. Fatih Amasyali (Yildiz Technical University) (Friedman-datasets.zip, 5,802,204 Bytes).
- A zip file containing a new, image-based version of the classic iris data, with 50 images for each of the three species of iris. The images have size 600x600. Please see the ARFF file for further information (iris_reloaded.zip, 92,267,000 Bytes).

After expanding an archive into a directory using your jar utility (or, for the gzip'ed tars and zip files, an archive program that handles tar archives and zip files), these datasets may be used with Weka.
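If no jar utility or archive program is at hand, the expansion step can also be scripted. The following is a minimal sketch using only the Python standard library. The function name `extract_datasets` is hypothetical, and the sketch assumes the archives contain files with an `.arff` extension; check the extracted contents of each archive yourself.

```python
import tarfile
import zipfile
from pathlib import Path


def extract_datasets(archive, dest):
    """Unpack a dataset archive into dest and return the .arff files found there.

    Hypothetical helper for illustration; not part of Weka.
    """
    archive, dest = Path(archive), Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    if zipfile.is_zipfile(archive):
        # .jar files use the zip container format, so zipfile reads both.
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(dest)
    elif tarfile.is_tarfile(archive):
        # Mode "r:*" lets tarfile detect gzip or bzip2 compression itself.
        with tarfile.open(archive, "r:*") as tf:
            tf.extractall(dest)
    else:
        raise ValueError(f"unsupported archive: {archive}")
    # Assumption: datasets are stored as .arff files somewhere in the archive.
    return sorted(dest.rglob("*.arff"))
```

For example, `extract_datasets("datasets-UCI.jar", "datasets-UCI")` would unpack the jarfile into a `datasets-UCI` directory and return the paths of any ARFF files it finds, which can then be opened in Weka. Only extract archives from sources you trust, since extraction writes whatever paths the archive specifies.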
Bioinformatics datasets
The following bioinformatics datasets are available in Weka's ARFF format. They are quite old but still available thanks to the Internet Archive.
- Protein datasets made available by Associate Professor Shuiwang Ji when he was a PhD student at Louisiana State University.
- Kent Ridge Biomedical Data Set Repository, which was put together by Professor Jinyan Li and Dr Huiqing Liu while they were at the Institute for Infocomm Research, Singapore.
- Repository for Epitope Datasets (RED), maintained by Professor Yasser El-Manzalawy when he was at Iowa State University.