Classification Suite

Classification assigns individual items to discrete groups pre-specified by the user, based on features of the items. The classification module within the PAL Framework provides a unified API for three classification algorithms: Transformed Weight-Normalized Complement Naive Bayes, Maximum Entropy, and Decision Trees.

Overview

The classification module within the PAL Framework provides a unified API for classifiers, along with implementations of three specific classification algorithms:

  • Transformed Weight-Normalized Complement Naive Bayes (TWCNB)
  • Maximum Entropy (MaxEnt)
  • Decision Tree

A classifier performs a mapping from a feature space X to a discrete set of labels Y; in other words, it assigns a pre-defined class label to a sample. A classifier takes as input an object or situation described by a set of attributes and returns a “decision”, the predicted label. For example, a spam classifier labels an email as “Spam” or “Non-Spam” based on attributes of the email (sender, body, etc.).
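To make the mapping concrete, the sketch below reduces a classifier to a function from attribute values to a label. All names here are hypothetical and are not the PAL Framework API; the fixed toy rule stands in for a learned model.

    import java.util.Map;

    // Hypothetical names for illustration only; not the PAL Framework API.
    // A classifier maps a feature space X (here, named attribute values)
    // to a discrete label set Y (here, strings such as "Spam").
    interface Classifier {
        String classify(Map<String, Double> features);
    }

    // A toy spam classifier: a hand-written rule over one attribute
    // stands in for a learned model.
    class ToySpamClassifier implements Classifier {
        public String classify(Map<String, Double> features) {
            Double count = features.get("exclamationCount");
            double exclamations = (count == null) ? 0.0 : count;
            return exclamations > 3.0 ? "Spam" : "Non-Spam";
        }
    }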

Classifiers are used in many different areas including computer vision (medical image analysis, optical character recognition), speech recognition, natural language processing, drug discovery, document classification, internet search, etc. See the Limitations section below regarding the currently supported algorithms.

Different classification algorithms have different characteristics that can impact their suitability for a given problem. The classification framework provides a flexible means of accessing different classification algorithms through a common API, as the snippet below illustrates.
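Continuing the hypothetical interface from the sketch above, code written against the common interface is unchanged when one classifier implementation is substituted for another. Again, these names are illustrative, not the PAL Framework API.

    import java.util.HashMap;
    import java.util.Map;

    class ClassifierDemo {
        public static void main(String[] args) {
            // Any implementation of Classifier could be assigned here.
            Classifier classifier = new ToySpamClassifier();
            Map<String, Double> email = new HashMap<String, Double>();
            email.put("exclamationCount", 5.0);
            System.out.println(classifier.classify(email)); // prints "Spam"
        }
    }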

Prerequisites

  • Java 1.5 or above

Limitations

  • The TWCNB and MaxEnt classifiers only support textual data.

Available Classifiers

TWCNB

  Description:
    • Based on the Naive Bayes classifier, one of the most popular classification algorithms due to its simplicity, computational efficiency, and good performance.
    • Naive Bayes assumes that all features are independent, which is rarely true; in practice, it can work well even when the independence assumption does not hold.
    • TWCNB incorporates the transforms proposed by Jason Rennie et al. to address violations of the independence assumption (see the sketch following this entry).
  Advantages:
    • Well suited to problems with high-dimensional inputs
    • Fast to train and to evaluate, even with more than 10,000 attributes
    • Can be built with real-valued inputs
    • Requires only a small amount of training data to estimate the parameters needed for classification
    • Reasonably effective for real-world problems
  Disadvantages:
    • Less successful on more complex classification problems
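The complement scoring step at the heart of TWCNB can be sketched briefly. The following is a minimal illustration with hypothetical names, not the PAL Framework implementation: it assumes term counts have already been aggregated per class, uses simple additive smoothing, and omits the class prior as well as the TF, IDF, length, and weight-normalization transforms that the full TWCNB algorithm applies.

    // Minimal complement Naive Bayes scoring sketch (after Rennie et al.).
    class ComplementNB {
        // complementCounts[c][i]: count of term i across all classes EXCEPT c
        private final double[][] complementCounts;
        // complementTotals[c]: count of all terms across all classes EXCEPT c
        private final double[] complementTotals;
        private final int vocabularySize;
        private static final double ALPHA = 1.0; // additive smoothing

        ComplementNB(double[][] complementCounts, double[] complementTotals) {
            this.complementCounts = complementCounts;
            this.complementTotals = complementTotals;
            this.vocabularySize = complementCounts[0].length;
        }

        // Log of the smoothed complement estimate for term i and class c.
        private double weight(int c, int i) {
            return Math.log((complementCounts[c][i] + ALPHA)
                    / (complementTotals[c] + ALPHA * vocabularySize));
        }

        // Pick the class whose complement weights match the document LEAST
        // (class prior omitted for brevity):
        //   argmin over c of  sum_i termCounts[i] * weight(c, i)
        int classify(double[] termCounts) {
            int best = 0;
            double bestScore = Double.POSITIVE_INFINITY;
            for (int c = 0; c < complementTotals.length; c++) {
                double score = 0.0;
                for (int i = 0; i < termCounts.length; i++) {
                    score += termCounts[i] * weight(c, i);
                }
                if (score < bestScore) {
                    bestScore = score;
                    best = c;
                }
            }
            return best;
        }
    }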
MaxEnt

  Description:
    • Based on the principle of maximum entropy, which states that, given some testable information about a probability distribution, the distribution that best represents the available information is the one that maximizes information entropy. In other words, absent other knowledge, distributions closer to uniform should be preferred (see the sketch following this entry).
  Advantages:
    • Very general
    • Widely used for a variety of natural language tasks (e.g., language modeling, part-of-speech tagging, text segmentation)
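As a concrete illustration, a conditional maximum-entropy classifier takes a log-linear form: p(y | x) is proportional to exp(sum over i of lambda[y][i] * f[i]). The sketch below shows only the scoring step, with hypothetical names; it is not the PAL Framework implementation, and fitting the lambda weights (e.g., by iterative scaling or gradient-based optimization) is a separate training procedure omitted here.

    // Scoring step of a conditional maximum-entropy (log-linear) classifier.
    class MaxEntScorer {
        // lambda[y][i]: learned weight of feature i for label y
        private final double[][] lambda;

        MaxEntScorer(double[][] lambda) {
            this.lambda = lambda;
        }

        // p(y | x) = exp(sum_i lambda[y][i] * f[i]) / Z(x), where Z(x)
        // normalizes over all labels y.
        double[] probabilities(double[] features) {
            double[] p = new double[lambda.length];
            double z = 0.0;
            for (int y = 0; y < lambda.length; y++) {
                double s = 0.0;
                for (int i = 0; i < features.length; i++) {
                    s += lambda[y][i] * features[i];
                }
                p[y] = Math.exp(s);
                z += p[y];
            }
            for (int y = 0; y < p.length; y++) {
                p[y] /= z;
            }
            return p;
        }
    }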
Decision Tree

  Description:
    • A decision tree reaches a decision by performing a sequence of tests arranged in a tree structure. Each internal node tests the value of one of the attributes of the input, and each leaf node specifies the label to be returned if that leaf is reached (see the sketch following this entry).
  Advantages:
    • Simple to understand and interpret
    • Can handle both numerical and categorical data
    • Models can be validated using statistical tests
    • Robust performance with large amounts of data
  Disadvantages:
    • Learning an optimal decision tree is NP-complete
    • Risk of overfitting the training data
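The evaluation procedure described above is simple to sketch. The following illustrative binary-threshold tree uses hypothetical names and is not the PAL Framework implementation; inducing the tree from training data (e.g., ID3/C4.5-style splitting) is omitted.

    import java.util.Map;

    // Each internal node tests one attribute of the input; the leaf that
    // is reached supplies the returned label.
    class TreeNode {
        private final String label;      // non-null only at a leaf
        private final String attribute;  // attribute tested at an internal node
        private final double threshold;  // go left if value <= threshold
        private final TreeNode left, right;

        // Leaf node.
        TreeNode(String label) {
            this.label = label;
            this.attribute = null;
            this.threshold = 0.0;
            this.left = null;
            this.right = null;
        }

        // Internal node.
        TreeNode(String attribute, double threshold,
                 TreeNode left, TreeNode right) {
            this.label = null;
            this.attribute = attribute;
            this.threshold = threshold;
            this.left = left;
            this.right = right;
        }

        // Example: new TreeNode("exclamationCount", 3.0,
        //     new TreeNode("Non-Spam"), new TreeNode("Spam")).decide(email)
        String decide(Map<String, Double> input) {
            if (label != null) {
                return label;
            }
            double value = input.get(attribute);
            return value <= threshold ? left.decide(input) : right.decide(input);
        }
    }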

Next: Classification API

Overview: DISTAR 14982 – Approved for Public Release, Distribution Unlimited
API and Example: DISTAR 15075 – Approved for Public Release, Distribution Unlimited