Clustering Suite

Clustering organizes objects into groups based on their similarity, as evaluated by comparison of designated object attributes. The PAL Framework provides a unified API for clustering for three state-of-the-art clustering algorithms: Latent Dirichlet Allocation, Lingo, and Katz.

Overview

Clustering is a learning algorithm that organizes objects into groups based on their similarity, as evaluated by comparison of designated object attributes. As an example, a collection of emails could be clustered into sets of emails that contain like topics. The PAL Framework provides both a unified API for generic clustering as well as implementations of three state-of-the-art clustering algorithms: Latent Dirichlet Allocation, Lingo, and Katz.

Different clustering algorithms have different characteristics and one may be more appropriate than another for a given problem space. The clustering suite attempts to provide a flexible means of accessing different clustering algorithms though a common API.

Prerequisites

  • Java 1.5 or above

Limitations

The clustering algorithms only support clustering of textual data.

Supported Algorithms

 

Name Description Advantages Disdvantages
LDA Latent Dirichlet Allocation
See http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf
Uses Gregor Heinrich’s lda-j, a Java port of David Blei’s lda-c.LDA facilitates a visual representation of the clustered text, enabling seeing why a document was selected for a specific cluster.The most relevant term for each cluster is returned with the results. Each document is assigned to a single cluster.
  • Facilitates a visual representation of clustered text.
  • Requires less memory than Lingo.
  • The term assigned as a cluster’s description is often not as meaningful as Lingo’s.
Lingo The default algorithm used in Carrot2 (http://project.carrot2.org/algorithms.html).”Unlike most other algorithms, it first attempts to discover descriptive names for future clusters and only then proceeds to
assigning each cluster with matching documents.”
(http://www.cs.put.poznan.pl/dweiss/site/publications/download/iipwm-osinski-weiss-2004-lingoeval.pdf)In many cases, the descriptive name returned with a cluster will be a meaningful phrase.A document may get assigned to more than one cluster.
  • Often returns a meaningful phrase as a cluster description.
  • Many tuning options.
  • Documents may get assigned to multiple clusters.
  • Requires more memory than either LDA or Katz (only if all centroids are specified with the input).
Katz Uses a Linear Programming model designed by Talbot Katz to determine cluster centroids.The model uses a distance matrix built using a document similarity algorithm on the tf-idf values for the documents.The model is used only if all centriods have not been identified within the input criteria.Each document is assigned to a single cluster.

The entity id for the cluster’s centroid is returned with the cluster results.

The memory and CPU requirements for linear programming model processing may rule out its use for all but relatively small samples. A user-configurable time-out value is used for the model processing portion of the algorithm.

  • Can determine cluster centroids.
  • Can cluster around known centroids.
  • Resource requirements are relatively small when all centroids are identified with the input.
  • A cluster’s description is the ID of the cluster’s centroid, rather than a more meaningful label.
  • Considerable memory and CPU requirements for all but small samples when the algorithm computes the centroids.

Next to Clustering API
Overview: DISTAR 14982 – Approved for Public Release, Distribution Unlimited
API and Example: DISTAR 15261 – Approved for Public Release, Distribution Unlimited