Abstract: |
Text classification is an important tool for many applications, arising
in supervised, semi-supervised, and unsupervised scenarios. In order
to be processed by machine learning methods, a text (document)
is usually represented as a bag-of-words (BoW).
A BoW is a large vector of features (usually stored as
floating-point values), each of which represents the relative frequency of occurrence of a
given word/term in the document. Typically, there is a large number
of features, many of which may be non-informative for classification
tasks; thus the need for feature transformation, reduction,
and selection arises.
In this paper, we propose two efficient
algorithms for feature transformation and reduction for BoW-like
representations. The proposed algorithms rely on simple statistical
analysis of the input patterns, exploiting both the BoW and its
binary version. The algorithms are evaluated with
support vector machine (SVM) and AdaBoost classifiers on standard
benchmark datasets. The experimental results show the adequacy of
the reduced/transformed binary features for text classification
problems, as well as the improvement in the test set error rate
achieved by the proposed methods. |
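For illustration, a minimal sketch of how a relative-frequency BoW vector and its binary version might be computed for a toy corpus is given below; the tokenizer, corpus, and helper names are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (illustrative, not the authors' code): a relative-frequency
# BoW vector and its binary version for a small toy corpus.
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
]

# Vocabulary over the whole corpus: one feature per word/term.
vocab = sorted({w for d in docs for w in d.split()})

def bow_vector(doc, vocab):
    """Relative frequency of each vocabulary term in one document."""
    counts = Counter(doc.split())
    total = sum(counts.values())
    return [counts[w] / total for w in vocab]

def binary_vector(doc, vocab):
    """Binary BoW: 1 if the term occurs in the document, 0 otherwise."""
    present = set(doc.split())
    return [1 if w in present else 0 for w in vocab]

bow = [bow_vector(d, vocab) for d in docs]        # floating-point features
binary = [binary_vector(d, vocab) for d in docs]  # binary features
```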