Text categorization with Weka
This article describes how to use Weka for text categorization.
Import#
Weka needs the data to be present in ARFF or XRFF format in order to perform any classification tasks.
Directories#
The following tools transform text files into ARFF format (which one is available depends on the version of Weka you are using):
- TextDirectoryToArff tool (3.4.x and >= 3.5.3): this Java class transforms a directory of files into an ARFF file
- TextDirectoryLoader converter (> 3.5.3): this converter is based on the TextDirectoryToArff tool and is located in the weka.core.converters package
Example directory layout for TextDirectoryLoader (the name of each subdirectory is used as the class label for the files it contains):
...
|
+- text_example
   |
   +- class1
   |  |
   |  + file1.txt
   |  |
   |  + file2.txt
   |  |
   |  ...
   |
   +- class2
   |  |
   |  + another_file1.txt
   |  |
   |  + another_file2.txt
   |  |
   |  ...
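As a rough illustration of how such a layout maps onto an ARFF file, the following plain-Java sketch reads one subdirectory per class and writes one string attribute plus one nominal class attribute. It is not the TextDirectoryLoader implementation: the class name DirectoryToArffSketch, the attribute names text and class, the UTF-8 file encoding, and the escaping rules are assumptions made for this example, and class labels are assumed to need no quoting.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: mimics the idea of a directory-to-ARFF conversion.
// It is NOT Weka's TextDirectoryLoader; attribute names are assumptions.
class DirectoryToArffSketch {

    // Quote a document as an ARFF string value, escaping special characters.
    static String quote(String text) {
        StringBuilder sb = new StringBuilder("'");
        for (char c : text.toCharArray()) {
            switch (c) {
                case '\\': sb.append("\\\\"); break;
                case '\'': sb.append("\\'"); break;
                case '\n': sb.append("\\n"); break;
                case '\r': sb.append("\\r"); break;
                case '\t': sb.append("\\t"); break;
                default:   sb.append(c);
            }
        }
        return sb.append("'").toString();
    }

    // Read <root>/<label>/<file> into a label -> texts map (labels sorted by name).
    static Map<String, List<String>> readDirectory(Path root) throws IOException {
        Map<String, List<String>> docs = new TreeMap<>();
        try (DirectoryStream<Path> classDirs = Files.newDirectoryStream(root)) {
            for (Path classDir : classDirs) {
                if (!Files.isDirectory(classDir)) continue;
                List<String> texts = new ArrayList<>();
                try (DirectoryStream<Path> files = Files.newDirectoryStream(classDir)) {
                    for (Path file : files) {
                        if (Files.isRegularFile(file)) {
                            // Assumption: the text files are UTF-8 encoded.
                            texts.add(new String(Files.readAllBytes(file), StandardCharsets.UTF_8));
                        }
                    }
                }
                docs.put(classDir.getFileName().toString(), texts);
            }
        }
        return docs;
    }

    // Build the ARFF text: one string attribute, one nominal class attribute.
    static String toArff(String relation, Map<String, List<String>> docsByClass) {
        StringBuilder sb = new StringBuilder();
        sb.append("@relation ").append(relation).append("\n\n");
        sb.append("@attribute text string\n");
        sb.append("@attribute class {")
          .append(String.join(",", docsByClass.keySet())).append("}\n\n");
        sb.append("@data\n");
        for (Map.Entry<String, List<String>> e : docsByClass.entrySet()) {
            for (String doc : e.getValue()) {
                sb.append(quote(doc)).append(",").append(e.getKey()).append("\n");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        Map<String, List<String>> docs;
        if (args.length > 0) {
            // Real run: the first argument is the directory that holds the class folders.
            docs = readDirectory(Paths.get(args[0]));
        } else {
            // In-memory demo mirroring the layout above (file contents are made up).
            docs = new TreeMap<>();
            docs.put("class1", Arrays.asList("first text", "second text"));
            docs.put("class2", Arrays.asList("it's another text"));
        }
        System.out.print(toArff("text_example", docs));
    }
}
```

Keeping the ARFF generation separate from the directory traversal makes the mapping easy to check without touching the file system; for real data the Weka converters listed above remain the better choice.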
CSV files#
CSV files can easily be imported into Weka via the Weka Explorer or from the command line with the CSVLoader class (package weka.core.converters).
By default, non-numerical attributes get imported as NOMINAL attributes, which is not necessarily desired for textual data, especially if one wants to use the StringToWordVector filter. In order to change the attribute to STRING, one can run the NominalToString filter (package weka.filters.unsupervised.attribute) on the data, specifying the attribute index or range of indices that should be converted (NB: this filter does not exclude the class attribute from conversion!). In order to retain the attribute types, one needs to save the file in ARFF or XRFF format (or in the compressed version of these formats).
Third-party tools#
- TagHelper Tools, which allows one to transform texts into vectors of stemmed or unstemmed unigrams, bigrams, part-of-speech bigrams, and some user-defined features, and then saves this representation to ARFF. Currently processes English, German, and Chinese. Spanish and Portuguese are in progress.
Working with textual data#
Conversion#
Most classifiers in Weka cannot handle String attributes. For these learning schemes one has to process the data with appropriate filters first, e.g., the StringToWordVector filter, which converts String attributes into a set of attributes representing word occurrences and can also perform a TF/IDF transformation.
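To make the idea of a word vector with TF/IDF weighting concrete, here is a minimal plain-Java sketch. It does not reproduce the StringToWordVector filter: the class name TfIdfSketch, the tokenization (lower-casing and splitting on non-alphanumeric characters), and the weighting (raw word count times the natural logarithm of number of documents divided by number of documents containing the word) are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: a tiny bag-of-words model with TF/IDF weights, not Weka code.
class TfIdfSketch {

    // Lower-case the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^\\p{L}\\p{Nd}]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    // Returns one word -> weight map per document, where
    // weight = count(word, doc) * ln(numDocs / numDocsContaining(word)).
    static List<Map<String, Double>> tfIdf(List<String> documents) {
        List<Map<String, Integer>> counts = new ArrayList<>();
        Map<String, Integer> docFreq = new TreeMap<>();
        for (String doc : documents) {
            Map<String, Integer> c = new TreeMap<>();
            for (String token : tokenize(doc)) c.merge(token, 1, Integer::sum);
            counts.add(c);
            // Each word counts once per document for the document frequency.
            for (String word : c.keySet()) docFreq.merge(word, 1, Integer::sum);
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (Map<String, Integer> c : counts) {
            Map<String, Double> v = new TreeMap<>();
            for (Map.Entry<String, Integer> e : c.entrySet()) {
                double idf = Math.log((double) documents.size() / docFreq.get(e.getKey()));
                v.put(e.getKey(), e.getValue() * idf);
            }
            vectors.add(v);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Made-up example documents.
        List<String> docs = Arrays.asList("the cat sat", "the dog barked", "the cat purred");
        for (Map<String, Double> v : tfIdf(docs)) System.out.println(v);
    }
}
```

A word that occurs in every document (here "the") receives a weight of zero, which is why TF/IDF weighting tends to suppress very common words.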
The StringToWordVector filter places the class attribute of the generated output data at the beginning. If you would like to have it as the last attribute again, you can use the Reorder filter (package weka.filters.unsupervised.attribute) with the attribute order 2-last,first (option -R), which moves the first attribute to the end.
With the MultiFilter you can also apply both filters in one go instead of one after the other, which is more convenient in the Explorer, for instance.
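The effect of moving the class attribute from the first to the last position can be pictured with a few lines of plain Java. This is only an illustration of the resulting attribute order, not the Reorder filter itself; the class name MoveClassLastSketch and the example attribute names are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: shows the attribute order after moving the first column to the end.
class MoveClassLastSketch {

    // Returns a copy of the row with its first value moved to the end.
    static <T> List<T> firstToLast(List<T> row) {
        if (row.size() < 2) return new ArrayList<>(row);
        List<T> out = new ArrayList<>(row.subList(1, row.size()));
        out.add(row.get(0));
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical header: class attribute first, followed by word attributes.
        List<String> header = Arrays.asList("class", "cat", "dog", "sat");
        System.out.println(firstToLast(header));
    }
}
```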
Stopwords#
The StringToWordVector filter can also work with a different stopword list than the built-in one (based on the Rainbow system). One can use the -stopwords option to load an external stopwords file. The format of such a file is one stopword per line; lines starting with '#' are interpreted as comments and ignored.
Note: Weka 3.5.6, which introduced support for external stopword lists, contained a bug that caused the external list to be ignored. Versions built from 21/07/2007 onwards work correctly.
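The stopword file format described above (one stopword per line, '#' lines as comments) is simple enough to sketch in plain Java. The following is not Weka's own reader: the class name StopwordFileSketch is hypothetical, and trimming surrounding whitespace and skipping blank lines are assumptions of this example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Sketch only: parses a stopword list with one word per line and '#' comments.
class StopwordFileSketch {

    // Keeps the words in file order and drops comments and blank lines.
    static Set<String> parse(List<String> lines) {
        Set<String> stopwords = new LinkedHashSet<>();
        for (String line : lines) {
            String word = line.trim();                 // assumption: ignore surrounding whitespace
            if (word.isEmpty() || word.startsWith("#")) continue;
            stopwords.add(word);
        }
        return stopwords;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = args.length > 0
                // Real run: the first argument is the path of the stopword file.
                ? Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8)
                // In-memory demo with made-up content.
                : Arrays.asList("# my stopwords", "the", "and", "", "of");
        System.out.println(parse(lines));
    }
}
```

A quick check like this can help verify that a hand-written stopword file contains what you expect before passing it to the filter.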
UTF-8#
In case you are working with text files containing non-ASCII characters, e.g., Arabic, you might encounter some display problems under Windows. Java itself handles Unicode text, including Arabic characters, but by default it typically uses code page 1252 under Windows, which garbles the display of other characters. In order to fix this, you will have to add the -Dfile.encoding=utf-8 option to the java command line with which you start up Weka. This option tells Java to explicitly use UTF-8 encoding instead of the default CP1252.
If you are starting Weka via the start menu and you use a recent version (at least 3.5.8 or 3.4.13), then you just have to modify the fileEncoding placeholder in the RunWeka.ini file accordingly.
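To check which encoding a JVM actually uses, and to see why a mismatch garbles text, a small plain-Java sketch along the following lines may help. The class name EncodingCheckSketch is made up, and ISO-8859-1 is used here only as a stand-in for a single-byte Western encoding such as CP1252, because it is guaranteed to be available on every Java platform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Sketch only: shows the JVM default charset and the effect of a decoding mismatch.
class EncodingCheckSketch {

    // Decode UTF-8 encoded bytes using the given charset.
    static String decode(byte[] utf8Bytes, Charset charset) {
        return new String(utf8Bytes, charset);
    }

    public static void main(String[] args) {
        // Charset the JVM uses when none is specified explicitly.
        System.out.println("Default charset: " + Charset.defaultCharset());

        // An Arabic word written with Unicode escapes, so the source file stays ASCII.
        String arabic = "\u0645\u0631\u062D\u0628\u0627";
        byte[] bytes = arabic.getBytes(StandardCharsets.UTF_8);

        // Decoding with the matching charset restores the original text.
        System.out.println("UTF-8 round trip intact: "
                + decode(bytes, StandardCharsets.UTF_8).equals(arabic));
        // Decoding with a single-byte charset turns every byte into its own character.
        System.out.println("Single-byte decoding intact: "
                + decode(bytes, StandardCharsets.ISO_8859_1).equals(arabic));
    }
}
```

Running the class once with and once without -Dfile.encoding=utf-8 on the java command line shows whether the option takes effect on your installation.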
Examples#
- text_example.zip - contains a directory structure and example files that can be imported with the TextDirectoryLoader converter.
- TextCategorizationTest.java - uses the TextDirectoryLoader converter to turn a directory structure into a dataset, applies the StringToWordVector filter and builds a classifier with the filtered data.
See also#
- Batch filtering - for generating a test set with the same dictionary as the training set
- All text categorization articles