C2RSS

C2RSS helps a user monitor information streams (e.g., RSS feeds) by identifying documents that are of potential interest to that user. The system automatically sorts information by topic and relevance, leveraging user feedback to refine suggestions.

Overview

In this information age, information floods in from every possible source. There is too much to read it all, yet missing important information can be very costly, as overlooked information about terrorists has most dramatically illustrated. Using learning technology, C2RSS helps people track their interests by monitoring information sources and identifying relevant documents.

C2RSS is a learning-based application that processes generic raw RSS* feeds to generate its own user-personalized RSS feeds. RSS feeds are pervasive information sources continually published by virtually every news organization and business. Each RSS feed includes articles that are relevant to the feed but are not necessarily relevant to the C2RSS user. A user would normally have to scan all the articles to find those important to him. If the feed is sparse in relevant information, the user may waste a lot of time or may stop following the feed. C2RSS can improve a user’s efficiency by authoring digested feeds containing only articles relevant to him; he never even sees the irrelevant ones. These digested feeds can be organized by his own topics rather than by the original feeds.

[Figure: C2RSS concept of operations]

A C2RSS user chooses a set of feeds as sources and optionally indicates a set of topics he is interested in. As he reads the articles, C2RSS learns a model of his interests by observing which articles he chooses to read. By reading an article, the user provides implicit feedback on the article’s relevance to him. C2RSS also sorts articles according to his topics, and the user can always correct C2RSS when it gets something wrong. Feedback, both implicit and explicit, allows C2RSS to improve its learned model of his topics. Relying on implicit feedback minimizes the user’s cognitive load while still delivering the benefit of the learning technology.
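The interplay between implicit and explicit feedback can be sketched as follows. This is a minimal illustration, not the C2RSS implementation: the `FeedbackLog` class, its method names, and the rule that an explicit correction always overrides a label inferred from reading are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FeedbackLog:
    """Hypothetical store of relevance labels, keyed by article id."""

    # article_id -> (relevant, source), where source is "implicit" or "explicit"
    labels: dict = field(default_factory=dict)

    def record_read(self, article_id):
        # Implicit feedback: opening an article is taken as a sign of relevance,
        # unless the user has already labeled the article explicitly.
        _, source = self.labels.get(article_id, (None, None))
        if source != "explicit":
            self.labels[article_id] = (True, "implicit")

    def record_correction(self, article_id, relevant):
        # Explicit feedback: a user correction replaces any earlier label.
        self.labels[article_id] = (bool(relevant), "explicit")

    def training_examples(self):
        # Flatten to article_id -> relevant for use as training data.
        return {aid: relevant for aid, (relevant, _) in self.labels.items()}


log = FeedbackLog()
log.record_read("a1")               # implicit: relevant
log.record_read("a2")               # implicit: relevant
log.record_correction("a2", False)  # explicit: user says not relevant
log.record_read("a2")               # later read does not undo the correction
print(log.training_examples())
```

In this sketch the user does nothing beyond reading for most articles, and only the occasional correction requires a deliberate action.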

C2RSS aggregates the articles from the user’s feeds. Using the models and the PAL classification framework, C2RSS predicts the topic and relevance of new articles. The user can subscribe to C2RSS-generated RSS feeds of the articles based on topic or relevance. For instance, one feed might contain all relevant articles on the topic “Airline Industry” no matter which source RSS feed contained the article originally.
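As a rough sketch of this aggregation step, the example below pools articles from several source feeds, keeps the relevant ones, groups them by predicted topic, and writes one topic out as a minimal RSS 2.0 document. The `predict` callable is a stand-in for the learned models (the PAL classification framework is not modeled here), and the article fields, the keyword rule, and the `example.invalid` URLs are assumptions chosen for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_topic_feeds(articles, predict):
    """Group relevant articles by predicted topic, ignoring their source feed.

    `predict` is a hypothetical callable returning (topic, is_relevant).
    """
    by_topic = defaultdict(list)
    for article in articles:
        topic, relevant = predict(article)
        if relevant:
            by_topic[topic].append(article)
    return dict(by_topic)


def to_rss(topic, articles):
    """Serialize one topic's articles as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = topic
    ET.SubElement(channel, "link").text = "http://example.invalid/feeds"
    ET.SubElement(channel, "description").text = "Relevant articles on " + topic
    for article in articles:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = article["title"]
        ET.SubElement(item, "link").text = article["link"]
    return ET.tostring(rss, encoding="unicode")


def toy_predict(article):
    # Stand-in for the learned models: a fixed keyword rule.
    if "airline" in article["title"].lower():
        return ("Airline Industry", True)
    return ("Other", False)


articles = [
    {"title": "Airline adds new routes", "link": "http://example.invalid/1", "source": "news-feed"},
    {"title": "Local bakery wins award", "link": "http://example.invalid/2", "source": "news-feed"},
    {"title": "Airline fuel costs rise", "link": "http://example.invalid/3", "source": "business-feed"},
]

feeds = build_topic_feeds(articles, toy_predict)
airline_xml = to_rss("Airline Industry", feeds["Airline Industry"])
print(airline_xml)
```

The two airline articles come from different source feeds but end up in the same generated feed, while the irrelevant article appears in none.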

In C2RSS, the user is the sole authority as to the meaning of a topic. One person’s “Airline Industry” might be about transportation economics while another’s might be about airline safety. Consequently, C2RSS has to learn from the user how to recognize the topic. C2RSS bootstraps this learning through a process called Quick Learning. The user provides some keywords, which C2RSS uses to find previously seen documents that may match the topic. The user scans these documents (or at least their titles) and indicates which are within the topic and relevant to his interests. C2RSS then trains on the documents to build a classifier that recognizes the topic better. As new documents arrive and the user’s feedback confirms that they are on topic, C2RSS adds them to the training set.
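The Quick Learning loop can be sketched in a few lines. The source does not say which learning algorithm C2RSS uses, so this example swaps in a small two-class naive Bayes classifier purely for illustration; the function and class names, the single-word keyword matching, and the sample documents are likewise assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def keyword_candidates(documents, keywords):
    """Return previously seen documents that mention any (single-word) keyword."""
    wanted = {k.lower() for k in keywords}
    return [d for d in documents if wanted & set(tokenize(d))]


class TopicClassifier:
    """Two-class multinomial naive Bayes with add-one smoothing (illustrative)."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def add_example(self, text, on_topic):
        self.word_counts[on_topic].update(tokenize(text))
        self.doc_counts[on_topic] += 1

    def _log_score(self, tokens, label):
        vocab = set(self.word_counts[True]) | set(self.word_counts[False])
        total = sum(self.word_counts[label].values())
        prior = (self.doc_counts[label] + 1) / (sum(self.doc_counts.values()) + 2)
        score = math.log(prior)
        for token in tokens:
            # The extra +1 in the denominator reserves mass for unseen words.
            count = self.word_counts[label][token]
            score += math.log((count + 1) / (total + len(vocab) + 1))
        return score

    def is_on_topic(self, text):
        tokens = tokenize(text)
        return self._log_score(tokens, True) > self._log_score(tokens, False)


seen = [
    "Airline safety inspectors ground aging jets",
    "Airline ticket prices rise as fuel costs climb",
    "Local bakery wins pastry award",
    "Jet engine maintenance rules tightened after inspection",
]

# Step 1: keywords pull up candidate documents.
candidates = keyword_candidates(seen, ["airline", "jet"])

# Step 2: the user marks which candidates match his meaning of the topic
# (here, a user whose "Airline Industry" is about safety, not economics).
user_labels = [True, False, True]

# Step 3: train on the labeled candidates.
clf = TopicClassifier()
for doc, on_topic in zip(candidates, user_labels):
    clf.add_example(doc, on_topic)

# Step 4: a new document confirmed on topic by the user joins the training set.
clf.add_example("Regulators order safety inspection of regional jets", True)

print(clf.is_on_topic("Safety inspectors review jet maintenance"))
print(clf.is_on_topic("Airline ticket prices fall"))
```

With only a handful of labeled documents, the classifier already separates this user's safety-oriented reading of the topic from the pricing story that shares the word "airline".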

[Figure: C2RSS learning slide]

The technology behind C2RSS is not limited to RSS feeds. It can be applied to any set of documents, especially one that continually receives new documents. Application areas might include email, bug reports, press releases, legislation, chat feeds, and tweets.

Prerequisites

  • Users:
    • RSS Reader
    • Web browser
  • Server machine:
    • OS: Windows XP, Windows Server 2003, Mac OS X, or Linux
    • Java 1.5 or newer
    • Apache Tomcat 6 or newer
    • PostgreSQL 8.3 or newer

Limitations

  • C2RSS currently classifies documents into a single topic.
  • The quality of results is typically good, but not perfect. Increased usage provides more training data, which improves results. However, some cases, such as documents that span two topics, will remain problematic.
  • Needle-in-a-haystack topics, where even a large corpus contains very few matching documents, will typically be difficult to detect.
  • As currently implemented, C2RSS does not provide a mechanism for extracting features from formatted data. It is best suited for normal free-flow textual (versus tabular or formatted) documents.

* RSS = Really Simple Syndication. A feed is similar to a magazine in that it consists of a number of articles chosen by the publisher. Users can point an RSS reader at a feed (similar to a web page address) to read the articles. The publisher will update the feed periodically, adding new articles and removing old ones. Daily, the user sees a stream of new articles.

Overview: DISTAR 16088 – Approved for Public Release, Distribution Unlimited