WEKA now supports XML (eXtensible Markup Language) in several places:
WEKA now allows Classifiers and Experiments to be started with the -xml option followed by a filename, so that the options are read from the XML file instead of the command line.
For simple classifiers such as J48 this looks like overkill, but as soon as one uses Meta-Classifiers or Meta-Meta-Classifiers the handling gets tricky and one spends a lot of time looking for missing quotes. With the hierarchical structure of XML files it is simple to plug in other classifiers by just exchanging tags.
The DTD for the XML options is quite simple:
The simplest option takes no arguments, e.g. the -V flag for inverting a selection.
The option takes exactly one parameter, directly following the option, e.g. for specifying the training file with -t somefile.arff. Here the parameter value is simply put between the opening and closing tag. Since single is the default value for the type attribute, we don't need to specify it explicitly.
Meta-Classifiers like AdaBoostM1 take another classifier as an option with the -W option, where the options for the base classifier follow after the --. And here is where the fun starts: where do the parameters for the base classifier go if the Meta-Classifier itself is a base classifier for another Meta-Classifier? E.g., does -W weka.classifiers.trees.J48 -- -C 0.001 become this:
Internally, all the options enclosed by the options tag are pushed to the end after the -- if one transforms the XML into a command line string.
A Meta-Classifier like Stacking can take several -B options, where each one encloses other options in quotes (and these can themselves contain a Meta-Classifier!). From -B "weka.classifiers.trees.J48" we then get this XML:
With the XML representation one no longer has to worry about the level of quotes in use, and therefore about the correct escaping (i.e. " ... \" ... \" ..."), since this is done automatically.
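To illustrate the bookkeeping that the XML form takes over, the following Python sketch nests a classifier specification inside -B options by hand. The helper name `quote_level` is hypothetical, and the escaping rule (backslash-escape existing backslashes and double quotes, then wrap in double quotes) is an assumption modelled on the escaping pattern shown above, not WEKA's actual implementation.

```python
def quote_level(spec: str) -> str:
    """Wrap a classifier specification in double quotes for use after -B.

    Assumption: backslashes and double quotes already present in `spec`
    are backslash-escaped first, so every nesting level adds escapes.
    """
    escaped = spec.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


# Innermost base classifier, no quoting needed yet.
inner = "weka.classifiers.trees.J48"

# One level: Stacking with J48 as base classifier.
level1 = "weka.classifiers.meta.Stacking -B " + quote_level(inner)

# Two levels: that Stacking is itself a base classifier of another Stacking.
level2 = "weka.classifiers.meta.Stacking -B " + quote_level(level1)

print(level1)
print(level2)
```

Every further level of nesting multiplies the backslashes, which is exactly the part that is easy to get wrong when typing such a command line manually.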
And if we now put it all together, we can transform this more complicated command line (java and the CLASSPATH omitted):
weka.classifiers.meta.Stacking -B "weka.classifiers.meta.AdaBoostM1 -W weka.classifiers.trees.J48 -- -C 0.001" -B "weka.classifiers.meta.Bagging -W weka.classifiers.meta.AdaBoostM1 -- -W weka.classifiers.trees.J48" -B "weka.classifiers.meta.Stacking -B \"weka.classifiers.trees.J48\"" -t test/datasets/hepatitis.arff
into XML:
<options type="class" value="weka.classifiers.meta.Stacking">
  <option name="B" type="quotes">
    <options type="classifier" value="weka.classifiers.meta.AdaBoostM1">
      <option name="W" type="hyphens">
        <options type="classifier" value="weka.classifiers.trees.J48">
          <option name="C">0.001</option>
        </options>
      </option>
    </options>
  </option>
  <option name="B" type="quotes">
    <options type="classifier" value="weka.classifiers.meta.Bagging">
      <option name="W" type="hyphens">
        <options type="classifier" value="weka.classifiers.meta.AdaBoostM1">
          <option name="W" type="hyphens">
            <options type="classifier" value="weka.classifiers.trees.J48"/>
          </option>
        </options>
      </option>
    </options>
  </option>
  <option name="B" type="quotes">
    <options type="classifier" value="weka.classifiers.meta.Stacking">
      <option name="B" type="quotes">
        <options type="classifier" value="weka.classifiers.trees.J48"/>
      </option>
    </options>
  </option>
  <option name="t">test/datasets/hepatitis.arff</option>
</options>
The value attribute of the outermost options tag is not used while reading the parameters. It is merely for documentation purposes, so that one knows which class was actually started from the command line.
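The transformation from the XML form back into a command line string can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not WEKA's own code: the function names `quote`, `tokens` and `to_command_line` are hypothetical, the escaping rule is assumed, the type name flag for argument-less options is an assumption since it does not appear in the example above, and the sketch emits the value attribute of the outermost options tag only so that the result can be compared with the full command line.

```python
import xml.etree.ElementTree as ET


def quote(spec):
    # Assumed escaping rule: escape backslashes and double quotes, then wrap.
    return '"' + spec.replace("\\", "\\\\").replace('"', '\\"') + '"'


def tokens(options, with_class=True):
    """Return the command-line tokens for one <options> element."""
    head = [options.get("value")] if with_class and options.get("value") else []
    tail = []  # options of "hyphens" base classifiers, emitted after "--"
    for opt in options.findall("option"):
        flag = "-" + opt.get("name")
        kind = opt.get("type", "single")  # "single" is the default type
        nested = opt.find("options")
        if kind == "flag":
            head.append(flag)
        elif kind == "single":
            head += [flag, (opt.text or "").strip()]
        elif kind == "hyphens":
            # The class name follows the option; its own options go to the end.
            head += [flag, nested.get("value")]
            tail += tokens(nested, with_class=False)
        elif kind == "quotes":
            # The whole nested specification becomes one quoted argument.
            head += [flag, quote(" ".join(tokens(nested)))]
        else:
            raise ValueError("unknown option type: " + kind)
    return head + (["--"] + tail if tail else [])


def to_command_line(xml_text):
    return " ".join(tokens(ET.fromstring(xml_text)))


# A shortened variant of the example above.
SAMPLE = """
<options type="class" value="weka.classifiers.meta.Stacking">
  <option name="B" type="quotes">
    <options type="classifier" value="weka.classifiers.meta.AdaBoostM1">
      <option name="W" type="hyphens">
        <options type="classifier" value="weka.classifiers.trees.J48">
          <option name="C">0.001</option>
        </options>
      </option>
    </options>
  </option>
  <option name="t">test/datasets/hepatitis.arff</option>
</options>
"""

print(to_command_line(SAMPLE))
```

Collecting the base classifier's options in a separate list and appending them after a single -- at the end mirrors the "pushed to the end" behaviour described earlier; the recursion takes care of Meta-Classifiers nested inside Meta-Classifiers.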
Serialization of Experiments
It is now possible to serialize the Experiments from the WEKA Experimenter not only in the proprietary binary format that Java serialization offers (with which one runs into problems when trying to read old experiments with a newer WEKA version, due to different SerialUIDs), but also in XML. There are currently two different ways to do this:
The built-in serialization captures only the necessary information of an experiment and doesn't serialize anything else. Its sole purpose is to save the setup of a specific experiment, and it can therefore not store any built models. Thanks to this limitation we'll never run into problems with mismatching SerialUIDs.
This kind of serialization is always available and can be selected via a Filter (*.xml) in the Save/Open-Dialog of the Experimenter.
The DTD is very simple and looks like this (for version 3.4.5):
<!DOCTYPE object[
   <!ELEMENT object (#PCDATA | object)*>
   <!ATTLIST object name CDATA #REQUIRED>
   <!ATTLIST object class CDATA #REQUIRED>
   <!ATTLIST object primitive CDATA "no">
   <!ATTLIST object array CDATA "no"> <!-- the dimensions of the array; no=0, yes=1 -->
   <!ATTLIST object null CDATA "no">
   <!ATTLIST object version CDATA "3.4.5">
]>
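Because every element in this format is an object that carries a name and a class, such a file can be inspected with any generic XML parser. The Python sketch below walks a small document and lists each object. The sample document is hypothetical and only written to fit the DTD; the member names in it are made up for illustration and are not taken from a real experiment file.

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped after the DTD; contents are illustrative only.
SAMPLE = """
<object name="__root__" class="weka.experiment.Experiment" version="3.4.5">
  <object name="runLower" class="int" primitive="yes">1</object>
  <object name="runUpper" class="int" primitive="yes">10</object>
</object>
"""


def walk(obj, depth=0):
    """Yield (depth, name, class, value) for every <object> element."""
    # ElementTree does not apply DTD defaults, so supply "no" ourselves.
    primitive = obj.get("primitive", "no") == "yes"
    value = (obj.text or "").strip() if primitive else None
    yield depth, obj.get("name"), obj.get("class"), value
    for child in obj.findall("object"):
        yield from walk(child, depth + 1)


rows = list(walk(ET.fromstring(SAMPLE)))
for depth, name, cls, value in rows:
    suffix = "" if value is None else " = " + value
    print("  " * depth + name + ": " + cls + suffix)
```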
Prior to versions 3.4.5 and 3.5.0 it looked like this:
for general Serialization:
The Koala Object Markup Language (KOML) is published under the LGPL and is an alternative way of serializing and deserializing Java Objects in an XML file. Like the normal serialization it serializes everything into XML via an ObjectOutputStream, including the SerialUID of each class. Even though we have the same problems with mismatching SerialUIDs, it is at least possible to edit the XML files by hand and replace the offending IDs with the new ones.
In order to use KOML one only has to ensure that the KOML classes are in the CLASSPATH with which the Experimenter is launched. As soon as KOML is present another Filter (*.koml) will show up in the Save/Open-Dialog.
The DTD for KOML can be found here.
The experiment class can of course read those XML files if passed as input or output file (see options of
Serialization of Classifiers
The options for models of a classifier, -l for the input model and -d for the output model, now also support XML serialized files. Here we have to differentiate between two different formats:
The built-in serialization captures only the options of a classifier but not the built model. With the -l option one still has to provide a training file, since only the options are retrieved from the XML file. It is possible to add more options on the command line, but no check is performed whether they collide with the ones stored in the XML file. The file is expected to end with
Since the KOML serialization captures everything of a Java Object we can use it just like the normal Java serialization. The file is expected to end with
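Since no collision check is performed for the built-in format, one may want to compare the two option lists before starting a run. The following Python sketch shows one possible pre-check. It is not part of WEKA; the function names and the sample option lists are hypothetical, and the heuristic assumes that an option is any token that starts with a hyphen followed by a letter.

```python
def option_flags(args):
    """Collect the option names of the top level of an argument list."""
    flags = set()
    for token in args:
        if token == "--":  # everything after "--" belongs to a base classifier
            break
        # Heuristic: "-C" is an option, "-0.5" is a (negative) value.
        if len(token) > 1 and token[0] == "-" and token[1].isalpha():
            flags.add(token)
    return flags


def collisions(xml_args, cli_args):
    """Return the options that occur in both lists, sorted by name."""
    return sorted(option_flags(xml_args) & option_flags(cli_args))


from_xml = ["-C", "0.25", "-M", "2"]          # hypothetical options read from the XML file
from_cli = ["-t", "train.arff", "-C", "0.1"]  # hypothetical extra command-line options

print(collisions(from_xml, from_cli))
```

The heuristic is deliberately simple: an option value that itself starts with a hyphen and a letter would be mistaken for an option, so the result should be read as a hint rather than a guarantee.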
The built-in serialization can be used in the Experimenter for loading/saving options from algorithms that have been added to a Simple Experiment. Unfortunately it is not possible to create a hierarchical structure like the one mentioned in Command Line. This is because of the loss of information caused by the getOptions() method of classifiers: it returns only a flat String array and not a tree structure.
The GraphVisualizer (weka.gui.graphvisualizer.GraphVisualizer) can save graphs into the Interchange Format for Bayesian Networks (BIF). If started from the command line with an XML filename as first parameter, and not from the Explorer, it can display the given file directly.
The DTD for BIF is this:
<!DOCTYPE BIF [
   <!ELEMENT BIF ( NETWORK )*>
   <!ATTLIST BIF VERSION CDATA #REQUIRED>
   <!ELEMENT NETWORK ( NAME, ( PROPERTY | VARIABLE | DEFINITION )* )>
   <!ELEMENT NAME (#PCDATA)>
   <!ELEMENT VARIABLE ( NAME, ( OUTCOME | PROPERTY )* ) >
   <!ATTLIST VARIABLE TYPE (nature|decision|utility) "nature">
   <!ELEMENT OUTCOME (#PCDATA)>
   <!ELEMENT DEFINITION ( FOR | GIVEN | TABLE | PROPERTY )* >
   <!ELEMENT FOR (#PCDATA)>
   <!ELEMENT GIVEN (#PCDATA)>
   <!ELEMENT TABLE (#PCDATA)>
   <!ELEMENT PROPERTY (#PCDATA)>
]>
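A file that follows this DTD can be read with a generic XML parser. The Python sketch below extracts the variables with their outcomes and the probability tables from a small network. The two-node network, its names and its probabilities are hypothetical and only written to match the DTD; the sketch says nothing about how the GraphVisualizer itself reads such files.

```python
import xml.etree.ElementTree as ET

# Hypothetical two-node network, written to match the DTD.
SAMPLE = """
<BIF VERSION="0.3">
  <NETWORK>
    <NAME>demo</NAME>
    <VARIABLE TYPE="nature">
      <NAME>rain</NAME>
      <OUTCOME>yes</OUTCOME>
      <OUTCOME>no</OUTCOME>
    </VARIABLE>
    <VARIABLE TYPE="nature">
      <NAME>wet</NAME>
      <OUTCOME>yes</OUTCOME>
      <OUTCOME>no</OUTCOME>
    </VARIABLE>
    <DEFINITION>
      <FOR>rain</FOR>
      <TABLE>0.2 0.8</TABLE>
    </DEFINITION>
    <DEFINITION>
      <FOR>wet</FOR>
      <GIVEN>rain</GIVEN>
      <TABLE>0.9 0.1 0.1 0.9</TABLE>
    </DEFINITION>
  </NETWORK>
</BIF>
"""

network = ET.fromstring(SAMPLE).find("NETWORK")

# Map each variable name to its list of outcomes.
outcomes = {
    var.findtext("NAME"): [o.text for o in var.findall("OUTCOME")]
    for var in network.findall("VARIABLE")
}

# Map each variable to its parents (GIVEN) and its flat probability table.
tables = {}
for definition in network.findall("DEFINITION"):
    target = definition.findtext("FOR")
    parents = [g.text for g in definition.findall("GIVEN")]
    probs = [float(p) for p in definition.findtext("TABLE").split()]
    tables[target] = (parents, probs)

print(outcomes)
print(tables)
```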
The XSLT script options.xsl parses an XML file for the experimenter and outputs the options in two ways: