PAL Semantic Extraction

Semantic Extraction learns from training examples to recognize entities and semantic meta-structures, such as names, addresses, and structured phrases, in a body of text.


PAL Semantic Extraction is a general-purpose, lightweight extraction engine that employs machine learning to recognize semantically meaningful entities in text. It generalizes from training examples to recognize entities such as names, addresses, geographic locations (e.g., latitude/longitude) and email signatures. The learned entities can be either atomic (e.g., a name) or aggregate (e.g., a person with a name, address, and phone number). Semantic Extraction also supports the identification of acronyms and their expansions, and can extract quotes and associate them with a relevant person.
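The distinction between atomic and aggregate entities can be pictured with a small data model. The following is only a sketch: the class `EntitySketch`, the `Entity` type, its fields, and the type labels are hypothetical names chosen for illustration and are not taken from the PAL API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical model for illustration only; not the PAL Semantic Extraction API.
class EntitySketch {

    /** An extracted entity: a type label, its text, and (for aggregates) its parts. */
    static final class Entity {
        final String type;
        final String text;
        final List<Entity> parts; // empty for atomic entities

        private Entity(String type, String text, List<Entity> parts) {
            this.type = type;
            this.text = text;
            this.parts = Collections.unmodifiableList(new ArrayList<Entity>(parts));
        }

        /** An atomic entity, e.g. a single name. */
        static Entity atomic(String type, String text) {
            return new Entity(type, text, Collections.<Entity>emptyList());
        }

        /** An aggregate entity, e.g. a person made of a name, address, and phone number. */
        static Entity aggregate(String type, Entity... parts) {
            StringBuilder text = new StringBuilder();
            for (Entity p : parts) {
                if (text.length() > 0) {
                    text.append(", ");
                }
                text.append(p.text);
            }
            return new Entity(type, text.toString(), Arrays.asList(parts));
        }

        boolean isAtomic() {
            return parts.isEmpty();
        }
    }

    public static void main(String[] args) {
        Entity name = Entity.atomic("PersonName", "Jane Doe");
        Entity phone = Entity.atomic("Telephone", "555-123-4567");
        Entity person = Entity.aggregate("Person", name, phone);

        System.out.println(person.type + ": " + person.text);
        for (Entity part : person.parts) {
            System.out.println("  " + part.type + ": " + part.text);
        }
    }
}
```

Modeling an aggregate as a list of child entities keeps one type for both cases, so a consumer can walk a person record the same way it reads a single name.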

Semantic Extraction is pre-trained on a corpus of documents for a variety of extraction types, which enables it to work out of the box. The API also enables the user to both train and untrain through examples. As the user trains with new examples, Semantic Extraction generalizes its descriptors for use in future extraction tasks.
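To make the train/untrain idea concrete, here is a toy sketch. It does not reproduce the engine's learning method, which this overview does not describe. As a stand-in, it generalizes each example to a character "shape" (digits become 9, letters become A or a) and recognizes any candidate with a learned shape. The class `TrainUntrainSketch` and the methods `train`, `untrain`, and `recognizes` are hypothetical names, not the PAL API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy stand-in for example-driven training; not the PAL learning algorithm or API.
class TrainUntrainSketch {

    // Entity type label -> generalized shapes learned from examples.
    private final Map<String, Set<String>> shapesByType = new HashMap<String, Set<String>>();

    /** Generalizes an example: digits become '9', letters 'A' or 'a', other characters are kept. */
    static String shape(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isDigit(c)) {
                sb.append('9');
            } else if (Character.isUpperCase(c)) {
                sb.append('A');
            } else if (Character.isLowerCase(c)) {
                sb.append('a');
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Adds the generalized form of an example to the given type. */
    void train(String type, String example) {
        Set<String> shapes = shapesByType.get(type);
        if (shapes == null) {
            shapes = new HashSet<String>();
            shapesByType.put(type, shapes);
        }
        shapes.add(shape(example));
    }

    /** Removes the generalized form of an example from the given type. */
    void untrain(String type, String example) {
        Set<String> shapes = shapesByType.get(type);
        if (shapes != null) {
            shapes.remove(shape(example));
        }
    }

    /** True if the candidate has a shape previously learned for the type. */
    boolean recognizes(String type, String candidate) {
        Set<String> shapes = shapesByType.get(type);
        return shapes != null && shapes.contains(shape(candidate));
    }

    public static void main(String[] args) {
        TrainUntrainSketch model = new TrainUntrainSketch();
        model.train("Telephone", "555-123-4567");
        System.out.println(model.recognizes("Telephone", "202-555-0199"));
        model.untrain("Telephone", "555-123-4567");
        System.out.println(model.recognizes("Telephone", "202-555-0199"));
    }
}
```

The point of the sketch is only that one example can cover unseen inputs once it is generalized, and that untraining withdraws that generalization.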

Semantic Extraction accepts plain text and HTML documents as input and provides a list of data structures representing extracted entities as output.
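The input/output contract described above might look roughly like the following. This is a sketch under stated assumptions: the `Extraction` record shape (type, character offsets, matched text) is hypothetical, the tag stripping is deliberately crude, and a simple regular expression stands in for the learned recognizer so that the example is self-contained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: the names and the regex recognizer are assumptions, not the PAL API.
class ExtractionOutputSketch {

    /** Hypothetical output record: entity type plus character offsets into the plain text. */
    static final class Extraction {
        final String type;
        final int start; // inclusive
        final int end;   // exclusive
        final String text;

        Extraction(String type, int start, int end, String text) {
            this.type = type;
            this.start = start;
            this.end = end;
            this.text = text;
        }
    }

    // Stand-in recognizer: a plain regex, not a learned model.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Crude tag removal for HTML input; a real HTML parser would be more robust. */
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", " ");
    }

    /** Returns one Extraction per email address found in the plain text. */
    static List<Extraction> extractEmails(String plainText) {
        List<Extraction> out = new ArrayList<Extraction>();
        Matcher m = EMAIL.matcher(plainText);
        while (m.find()) {
            out.add(new Extraction("EmailAddress", m.start(), m.end(), m.group()));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<p>Contact <b>Jane</b> at jane.doe@example.com.</p>";
        String plain = stripTags(html);
        for (Extraction e : extractEmails(plain)) {
            System.out.println(e.type + " [" + e.start + "," + e.end + "): " + e.text);
        }
    }
}
```

Carrying offsets alongside the matched text lets a caller map each extraction back to its position in the input, which is a common design for extraction output.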


Requires Java 1.5 or later.


The Semantic Extraction system is a fast extraction system with a small footprint. It comes pre-trained but can be trained further by the user to expand and refine its capabilities. It is not a comprehensive, ready-made solution for all extraction tasks.

Extractions Supported Out of the Box

  • Person names
  • Geographic locations (e.g., latitude/longitude, Military Grid Reference System (MGRS))
  • Time, Date
  • Street address
  • City, State, ZIP code
  • Some foreign addresses
  • Telephone numbers (US and foreign)
  • HTTP address
  • Email address
  • Acronyms (and definitions)
  • Questions in text
  • Quotations
  • Money

Overview: DISTAR 15554 – Approved for Public Release, Distribution Unlimited
API and Example: DISTAR 15554 – Approved for Public Release, Distribution Unlimited
Source and Object Code: DISTAR 15554 – Approved for Public Release, Distribution Unlimited