Wednesday, August 17, 2005

AutoSummary Alpha Release

It pleases me to announce that I have just released the first alpha testing version of the AutoSummary Semantic Analysis Engine under the BSD License.

When completed, AutoSummary will generate contextually relevant summaries of plain text documents using various statistical and rule-based methods of Natural Language Processing. First, the part-of-speech and specific word-sense (meaning) are determined for each word. Next, each sentence is deconstructed and its subject, predicate, and object are identified. A map of relationships between words is then created. From this, specific themes are identified and memes (general ideas) are generated. Finally, the memes are used to create summaries of the original document of varying length and detail.
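To make the "map of relationships" step a little more concrete, here is a minimal sketch of one way such a map could be built from subject/predicate/object triples. None of this is AutoSummary code: the class RelationMapSketch, the Triple type, and the buildMap method are hypothetical names invented for illustration, and the sample sentences are made up.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of the AutoSummary release.
public class RelationMapSketch {

    /** One deconstructed sentence: subject, predicate, object. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /**
     * Groups relations by subject, preserving the order in which
     * subjects first appear. Each relation is stored as "predicate -> object".
     */
    static Map<String, List<String>> buildMap(List<Triple> triples) {
        Map<String, List<String>> relations = new LinkedHashMap<>();
        for (Triple t : triples) {
            relations.computeIfAbsent(t.subject, k -> new ArrayList<>())
                     .add(t.predicate + " -> " + t.object);
        }
        return relations;
    }

    public static void main(String[] args) {
        List<Triple> triples = new ArrayList<>();
        // Invented sample data, assumed to come from an earlier parsing step.
        triples.add(new Triple("dog", "chase", "cat"));
        triples.add(new Triple("dog", "eat", "bone"));
        triples.add(new Triple("cat", "climb", "tree"));

        System.out.println(buildMap(triples));
    }
}
```

A word that appears as the subject of many relations would be a natural candidate for a theme, though how AutoSummary will actually identify themes is not yet decided here.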

Currently, only the part-of-speech tagger has been implemented. A JPhrase object is created which contains semantic information about a given phrase of words, including part-of-speech scores and possible word-sense combinations. The part-of-speech tagging method takes the JPhrase and returns a marked-up string in which each word in the phrase is paired with a tag for the part-of-speech (noun, verb, adjective, adverb) it most likely has in this particular instance. To make this determination, usage statistics from the WordNet semantic concordance texts are used.
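The general idea of picking a tag from usage statistics can be sketched as follows. This is a simplified stand-in, not the AutoSummary tagger: it scores each word independently and picks its most frequent part-of-speech, whereas JPhrase also tracks word-sense combinations across the phrase. The class FrequencyTaggerSketch, its methods, the word/TAG output format, and all of the counts are assumptions made up for this example; real counts would come from the WordNet semantic concordance texts.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical illustration only; not the JPhrase API.
public class FrequencyTaggerSketch {

    /** The four parts of speech named in the post. */
    enum Pos { NOUN, VERB, ADJECTIVE, ADVERB }

    // Invented usage counts standing in for real concordance statistics.
    private static final Map<String, Map<Pos, Integer>> COUNTS = new HashMap<>();

    private static void put(String word, Pos pos, int count) {
        COUNTS.computeIfAbsent(word, k -> new EnumMap<>(Pos.class)).put(pos, count);
    }

    static {
        put("dog", Pos.NOUN, 42);
        put("dog", Pos.VERB, 2);
        put("run", Pos.VERB, 60);
        put("run", Pos.NOUN, 15);
        put("fast", Pos.ADVERB, 20);
        put("fast", Pos.ADJECTIVE, 18);
        put("fast", Pos.NOUN, 1);
    }

    /** Returns the tag with the highest count, or null for an unknown word. */
    static Pos mostLikelyTag(String word) {
        Map<Pos, Integer> byPos = COUNTS.get(word.toLowerCase());
        if (byPos == null) {
            return null;
        }
        Pos best = null;
        int bestCount = -1;
        for (Map.Entry<Pos, Integer> entry : byPos.entrySet()) {
            if (entry.getValue() > bestCount) {
                best = entry.getKey();
                bestCount = entry.getValue();
            }
        }
        return best;
    }

    /** Marks up each word as word/TAG; words without statistics get UNKNOWN. */
    static String tag(String phrase) {
        StringJoiner out = new StringJoiner(" ");
        for (String word : phrase.trim().split("\\s+")) {
            Pos pos = mostLikelyTag(word);
            out.add(word + "/" + (pos == null ? "UNKNOWN" : pos.name()));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: dog/NOUN run/VERB fast/ADVERB
        System.out.println(tag("dog run fast"));
    }
}
```

Returning an UNKNOWN tag for words without statistics is a choice made for this sketch so that it degrades gracefully on unsupported input.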

AutoSummary uses the WordNet lexical reference system via the JWords Java Interface for WordNet as its source of lexicographical information. In order to run AutoSummary, you must first install WordNet 2.0 and edit the JWords configuration file.

This alpha test release is EXTREMELY limited. Support for articles (a, an, the), plurals, tenses, pronouns, and the verb 'to be' has not yet been implemented. If these types of words are entered into the current version, the program will likely crash. The simple demo included in the release is merely intended to show how the part-of-speech tagging system determines the part-of-speech for each word in a given phrase.

I've been pushing pretty hard over the last week or so to get this initial release out. Once I take a much-deserved rest, I will get a publicly available prototype up and running for my wildcard-based general question answering script, go back and add WordNet 2.1 support to JWords, and also take a serious look at setting up a web version of the AutoSummary demo program.

AutoSummary is available for download at SourceForge.net: http://sourceforge.net/projects/autosummary

Javadoc documentation for AutoSummary is available here.
