A common misconception is that the Weka machine learning software
cannot be applied to large datasets. When considering large datasets,
it is important to distinguish between training of machine learning
models and deploying such models for prediction. Weka is being used to
make predictions in real time in very demanding real-world
applications. This can be done with almost all Weka models once they
have been trained. However, training classifiers on large datasets can
be challenging, particularly using Weka's popular graphical Explorer
user interface. The Explorer always loads the entire training dataset
into the computer's main memory and also incurs significant overhead
from visualisation and other interface features. Moreover, the amount
of memory usable by the Explorer is limited by the "heap space"
available to Java, which, by default, is less than the physical amount
of memory in the computer. (It is possible to increase this heap space
by configuring the Java environment for Weka appropriately, for example
with the JVM's -Xmx option.) Fortunately, there are
alternatives: the Knowledge Flow interface for Weka, the command-line interface
(e.g., Weka's SimpleCLI), or programmatic application of
Weka with Java or a Java-based scripting language such as Groovy or
Jython. These alternatives make it possible to process datasets that are too big to
fit into the computer's main memory. For example, any so-called
"UpdateableClassifier" in Weka can be trained incrementally by loading
and processing each instance in a dataset separately. (The
massiveOnlineAnalysis package for Weka provides access to the MOA data
stream software containing state-of-the-art incremental algorithms for
large datasets or data streams.) Additionally, non-incremental
learning algorithms can be applied to large datasets by subsampling
the data. (Reservoir sampling is an incremental sampling method that
can be used for this purpose.) Weka also has optional support for
distributed data mining with Hadoop and Spark. The distributedWekaBase
package provides base "map" and "reduce" tasks that are not tied to
any specific distributed platform. The distributedWekaHadoop package
provides Hadoop-specific wrappers and jobs for these base tasks. The
distributedWekaSpark package provides Spark-specific wrappers.
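
To make the subsampling idea mentioned above more concrete, the following is a minimal sketch of reservoir sampling in plain Java. It is not Weka's own implementation: the class name ReservoirSampler and the method sample are hypothetical names chosen for illustration, and the sketch assumes the data arrives as an Iterator with fewer than 2^31 items. It keeps at most k items in memory while reading the stream once, so that each item seen ends up in the sample with equal probability.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration only; not part of Weka.
final class ReservoirSampler {

    // Returns a uniform random sample of up to k items from the stream,
    // reading each item exactly once and storing at most k items.
    static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
        if (k < 0) {
            throw new IllegalArgumentException("k must be non-negative");
        }
        List<T> reservoir = new ArrayList<>(k);
        int seen = 0; // number of items read so far
        while (stream.hasNext()) {
            T item = stream.next();
            // Throws ArithmeticException instead of silently overflowing.
            seen = Math.incrementExact(seen);
            if (reservoir.size() < k) {
                // Fill phase: the first k items are always kept.
                reservoir.add(item);
            } else {
                // Replacement phase: keep the new item with probability
                // k / seen by drawing a slot index uniformly from [0, seen).
                int j = rng.nextInt(seen);
                if (j < k) {
                    reservoir.set(j, item);
                }
            }
        }
        return reservoir;
    }

    public static void main(String[] args) {
        // Usage example: draw 3 items from a small in-memory list.
        List<String> rows = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        List<String> picked = sample(rows.iterator(), 3, new Random(1L));
        System.out.println("Sampled rows: " + picked);
    }
}
```

In practice the iterator would wrap whatever reads the dataset one instance at a time, and the resulting sample, which fits in memory, could then be handed to a non-incremental learning algorithm.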