A collection of multi-label and multi-target datasets is available here. Even more datasets are available at the MULAN Website (note that MULAN indexes labels as the final attributes, whereas MEKA indexes them as the first attributes). See the MEKA Tutorial for more information.
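The difference in label position can be illustrated with a minimal sketch. It assumes each instance has already been parsed into a plain Python list of attribute values and that the number of labels is known; the function name `labels_last_to_first` is hypothetical and is not part of MEKA or MULAN. A real conversion would also need to reorder the attribute declarations in the ARFF header, which is not shown here.

```python
def labels_last_to_first(row, num_labels):
    """Move the trailing `num_labels` values of `row` to the front.

    Illustrative only: `row` is one instance as a list of attribute values,
    with the labels stored as the final attributes (MULAN-style layout).
    The result has the labels as the first attributes (MEKA-style layout).
    """
    if not 0 <= num_labels <= len(row):
        raise ValueError("num_labels must be between 0 and len(row)")
    if num_labels == 0:
        # Nothing to move; return a copy so the caller's list is untouched.
        return list(row)
    return list(row[-num_labels:]) + list(row[:-num_labels])


# Example: two feature values followed by three binary labels.
mulan_style = [0.5, 1.7, 1, 0, 1]
meka_style = labels_last_to_first(mulan_style, 3)
print(meka_style)
```

The relative order of the labels and of the features is preserved; only the block of labels changes position.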
The following text datasets have been compiled into WEKA's ARFF format using the StringToWordVector filter. Also available are train/test splits and the original raw text, as it was before filtering.
Dataset | L | N | LC | PU | Description and Original Source(s) |
---|---|---|---|---|---|
Enron | 53 | 1702 | 3.39 | 0.442 | A subset of the Enron Email Dataset, as labelled by the UC Berkeley Enron Email Analysis Project |
Slashdot | 22 | 3782 | 1.18 | 0.041 | Article titles and partial blurbs mined from Slashdot.org |
Language Log | 75 | 1460 | 1.18 | 0.208 | Articles posted on the Language Log |
IMDB (Updated) | 28 | 120919 | 2.00 | 0.037 | Movie plot text summaries sourced from the Internet Movie Database interface, labelled with genres |
Key:
- L = The number of predefined labels relevant to this dataset
- N = The number of examples (training+testing) in the dataset
- LC = Label Cardinality. Average number of labels assigned per document
- PU = Proportion of documents with unique label combinations, reported as a fraction (e.g. 0.442 corresponds to 44.2%)
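As a rough illustration of how LC and PU could be computed, the sketch below works on a binary label matrix given as a list of rows, one row per document and one column per label. The function names are hypothetical. The key above does not say exactly how "unique" is counted, so two plausible readings of PU are shown: the number of distinct label combinations divided by N, and the fraction of documents whose combination occurs exactly once. Which of the two matches the PU column in the table is an assumption left open here.

```python
from collections import Counter


def label_cardinality(Y):
    """LC: average number of labels assigned per document."""
    return sum(sum(row) for row in Y) / len(Y)


def distinct_combination_ratio(Y):
    """PU, reading 1 (assumption): distinct label combinations divided by N."""
    return len({tuple(row) for row in Y}) / len(Y)


def singleton_combination_ratio(Y):
    """PU, reading 2 (assumption): fraction of documents whose label
    combination is not shared with any other document."""
    counts = Counter(tuple(row) for row in Y)
    return sum(1 for row in Y if counts[tuple(row)] == 1) / len(Y)


# Toy data, not taken from any of the datasets above: 4 documents, 3 labels.
Y = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 0],
]

print(label_cardinality(Y))            # 1.25
print(distinct_combination_ratio(Y))   # 0.75
print(singleton_combination_ratio(Y))  # 0.5
```

On the toy data the two readings of PU differ (0.75 versus 0.5), which is why the intended definition should be checked before comparing against the table.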