Clustering organizes objects into groups based on their similarity, as evaluated by comparison of designated object attributes. The PAL Framework provides a unified API for clustering for three state-of-the-art clustering algorithms: Latent Dirichlet Allocation, Lingo, and Katz.
Overview
Clustering is a learning algorithm that organizes objects into groups based on their similarity, as evaluated by comparison of designated object attributes. As an example, a collection of emails could be clustered into sets of emails that contain like topics. The PAL Framework provides both a unified API for generic clustering as well as implementations of three state-of-the-art clustering algorithms: Latent Dirichlet Allocation, Lingo, and Katz.
Different clustering algorithms have different characteristics and one may be more appropriate than another for a given problem space. The clustering suite attempts to provide a flexible means of accessing different clustering algorithms though a common API.
Prerequisites
- Java 1.5 or above
Limitations
The clustering algorithms only support clustering of textual data.
Supported Algorithms
Name | Description | Advantages | Disdvantages |
---|---|---|---|
LDA | Latent Dirichlet Allocation See http://www.cs.princeton.edu/~blei/papers/BleiNgJordan2003.pdf Uses Gregor Heinrich’s lda-j, a Java port of David Blei’s lda-c.LDA facilitates a visual representation of the clustered text, enabling seeing why a document was selected for a specific cluster.The most relevant term for each cluster is returned with the results. Each document is assigned to a single cluster. |
|
|
Lingo | The default algorithm used in Carrot2 (http://project.carrot2.org/algorithms.html).”Unlike most other algorithms, it first attempts to discover descriptive names for future clusters and only then proceeds to assigning each cluster with matching documents.” (http://www.cs.put.poznan.pl/dweiss/site/publications/download/iipwm-osinski-weiss-2004-lingoeval.pdf)In many cases, the descriptive name returned with a cluster will be a meaningful phrase.A document may get assigned to more than one cluster. |
|
|
Katz | Uses a Linear Programming model designed by Talbot Katz to determine cluster centroids.The model uses a distance matrix built using a document similarity algorithm on the tf-idf values for the documents.The model is used only if all centriods have not been identified within the input criteria.Each document is assigned to a single cluster.
The entity id for the cluster’s centroid is returned with the cluster results. The memory and CPU requirements for linear programming model processing may rule out its use for all but relatively small samples. A user-configurable time-out value is used for the model processing portion of the algorithm. |
|
|
Next to Clustering API
Overview: DISTAR 14982 – Approved for Public Release, Distribution Unlimited
API and Example: DISTAR 15261 – Approved for Public Release, Distribution Unlimited