A Language-Based Approach to Categorical Analysis

Media Arts and Sciences, Massachusetts Institute of Technology, Cambridge, MA (2001)




With the digitization of media, computers can be employed to help us with the process of classification,
both by learning from our behavior to perform the task for us and by exposing new ways for us to think
about our information. Given that most of our media comes in the form of electronic text, research in this
area focuses on building automatic text classification systems. The standard representation employed by
these systems, known as the bag-of-words approach to information retrieval, represents documents as
collections of words. As a byproduct of this model, automatic classifiers have difficulty distinguishing
between different meanings of a single word.
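To make the limitation concrete, the following is a minimal sketch of a bag-of-words representation. The tokenizer, the function name `bag_of_words`, and the two example sentences are illustrative assumptions, not taken from the thesis.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Return word counts for a text, discarding word order and structure."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Two hypothetical sentences that use "bank" in different senses.
river = bag_of_words("The boat drifted to the bank of the river.")
money = bag_of_words("She deposited the check at the bank.")

# The feature for "bank" is the same count in both documents, so a
# classifier that sees only these counts cannot tell the senses apart.
print(river["bank"], money["bank"])

# Word order is lost as well: these two texts get identical representations.
print(bag_of_words("dog bites man") == bag_of_words("man bites dog"))
```

Because each document is reduced to independent word counts, any information about which words relate to which is gone before classification begins.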

This research presents a new computational model of electronic text, called a synchronic imprint, which
uses structural information to contextualize the meaning of words. Every concept in the body of a text is
described by its relationships with other concepts in the same text, allowing classification systems to
distinguish between alternative meanings of the same word. This representation is applied both to the
standard problem of text classification and to the task of helping people better identify large
bodies of text. The latter is achieved through a visualization tool named flux, which
models synchronic imprints as a spring network.
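
The idea of describing a word by its relationships to other words in the same text can be sketched as follows. This abstract does not specify how structural relationships are extracted, so the sketch substitutes plain sentence-level co-occurrence as a stand-in; the function name `relation_features`, the stopword list, and the example sentences are all assumptions made for illustration and should not be read as the thesis's method.

```python
import re
from collections import defaultdict
from itertools import combinations

# A tiny, hypothetical stopword list, just large enough for the examples.
STOPWORDS = {"the", "a", "of", "to", "at", "she", "and"}


def relation_features(text):
    """Map each word to the set of words that share a sentence with it."""
    related = defaultdict(set)
    for sentence in re.split(r"[.!?]+", text.lower()):
        words = {w for w in re.findall(r"[a-z]+", sentence) if w not in STOPWORDS}
        # Link every pair of distinct words in the sentence, in both directions.
        for a, b in combinations(sorted(words), 2):
            related[a].add(b)
            related[b].add(a)
    return related


river = relation_features("The boat drifted to the bank of the river.")
money = relation_features("She deposited the check at the bank.")

print(sorted(river["bank"]))  # ['boat', 'drifted', 'river']
print(sorted(money["bank"]))  # ['check', 'deposited']
```

Under this simplification, the two uses of "bank" are no longer identical features: each is characterized by the words it is connected to, which is the kind of context a classifier could use to separate the senses.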