Tuesday, July 04, 2006

A Brief on Unstructured Data & Text Analytics - The Next Gen Analytics Niche

(From "Patterns for Success – Options for Analyzing Unstructured Information"; Dr. Fern Halper; Hurwitz & Associates; 6/21/06)
Text analytics is the process of extracting unstructured text and transforming it into structured information that can then be mined and analyzed in various ways. This transformed information can be combined with other structured data a company owns (e.g. sales or demographic data) and analyzed using various predictive and automated discovery techniques. Alternatively, the text can be extracted, transformed, and then analyzed interactively to determine relationships and trends, look for clusters, and so on. The actual extraction is accomplished via techniques from computational linguistics, statistics, and other computer science disciplines. For example, computational linguistic algorithms can parse sentences to extract the who, what, where, when, and why in text.
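
As a rough illustration of the extract, transform, and combine flow described above, here is a minimal Python sketch. The customer comments, sales figures, and word lists are all made up for the example, and the simple keyword matching stands in for the far richer linguistic extraction that real products perform.

```python
import re
from collections import Counter

# Hypothetical free-text customer comments, keyed by customer id.
comments = {
    101: "The battery died after two weeks. Very disappointed.",
    102: "Great screen, but the battery drains fast.",
    103: "Love the screen and the keyboard.",
}

# Hypothetical structured data the company already owns.
sales = {101: 250.0, 102: 900.0, 103: 1200.0}

# Toy vocabularies (assumptions): product features and negative cues.
FEATURES = {"battery", "screen", "keyboard"}
NEGATIVE = {"died", "disappointed", "drains"}


def extract(text):
    """Turn one comment into a structured record."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {
        "features": sorted(tokens & FEATURES),
        "negative": bool(tokens & NEGATIVE),
    }


# Transform: one structured row per customer, joined with the sales figure.
rows = [
    {"customer": cid, "sales": sales[cid], **extract(text)}
    for cid, text in comments.items()
]

# Analyze: which features appear in negative comments, and how much
# revenue is attached to the customers who wrote them?
negative_features = Counter(
    feature for row in rows if row["negative"] for feature in row["features"]
)
revenue_at_risk = sum(row["sales"] for row in rows if row["negative"])

print(negative_features.most_common())
print(revenue_at_risk)
```

Once the comments have been reduced to rows, they can be counted, grouped, and joined like any other table, which is the point of the transformation step.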

Text analytics differs from search, although it can be used to augment search. In basic search technologies, end users know what they are looking for; text analytics, by contrast, is aimed at surfacing relationships and trends the user did not know to ask about. Interestingly, search is now evolving and converging with business intelligence to provide applications that might, for example, monitor news feeds to understand what competitors are doing.

While the field is still evolving, a number of vendors are worth noting.

  • Business intelligence powerhouses SPSS and SAS both offer solutions in this space tied to their data mining and predictive analysis products. The SPSS Predictive Text Analytics solution combines the linguistic technologies of its LexiQuest text mining products with the data mining capabilities of Clementine. SAS Text Miner is integrated with the company's Enterprise Miner product and lets users mine both structured and unstructured information. SAS also has technologies for finding relationships between documents.
  • Other companies such as Attensity, Inxight, ClearForest and nStein provide information extraction technologies that can be leveraged in various analytical activities. For example, Attensity offers a number of different extraction techniques together with a series of its own applications that allow users to interactively explore and analyze information found in text. Attensity also works with third-party software. Inxight provides text extraction software that can be used with its visualization technologies to determine relationships and trends in text data. It also has applications to augment the capabilities of search engines.
  • Companies such as Clarabridge Inc. deal with the preprocessing of text data in order to make it more useful in business intelligence packages. Its product, the Clarabridge Content Mining Platform, provides connectors to source information, transforms the information using various extraction techniques, performs data quality and staging work on the data, and provides a schema that can serve the information up to various BI packages.
  • Even the big players like IBM, Oracle, and Microsoft are making moves to offer solutions in the text analytics space. IBM has developed the Unstructured Information Management Architecture (UIMA), an open-source framework that defines a common set of interfaces for integrating different text analytic components and applications.
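
To make the preprocessing flow mentioned above (connect to sources, extract, apply data quality checks, stage into a schema) a little more concrete, here is a small Python sketch in which every step shares one common interface so that steps can be swapped or reordered. It is purely illustrative: the `Document` class, the function names, and the sample text are all hypothetical and are not taken from Clarabridge, UIMA, or any other product named here.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Document:
    """Source text plus whatever structured fields the steps attach."""
    source: str
    text: str
    fields: Dict[str, list] = field(default_factory=dict)


# Common interface (an assumption of this sketch): each step takes a
# Document and returns a Document.
Step = Callable[[Document], Document]


def connect(raw: Dict[str, str]) -> List[Document]:
    """Connector: wrap raw source text (here an in-memory dict)."""
    return [Document(source=name, text=text) for name, text in raw.items()]


def extract_amounts(doc: Document) -> Document:
    """Extraction: pull dollar amounts out of the raw text."""
    matches = re.findall(r"\$([\d,]+(?:\.\d+)?)", doc.text)
    doc.fields["amounts"] = [float(m.replace(",", "")) for m in matches]
    return doc


def clean(doc: Document) -> Document:
    """Data quality: drop duplicate and non-positive amounts, then sort."""
    amounts = doc.fields.get("amounts", [])
    doc.fields["amounts"] = sorted({a for a in amounts if a > 0})
    return doc


def run(docs: List[Document], steps: List[Step]) -> List[Document]:
    """Apply each step to every document, in order."""
    for step in steps:
        docs = [step(d) for d in docs]
    return docs


def stage(docs: List[Document]) -> List[Tuple[str, float]]:
    """Staging: flatten to rows with a fixed (source, amount) schema."""
    return [(d.source, a) for d in docs for a in d.fields["amounts"]]


# Hypothetical source text.
raw = {
    "email-1": "Customer paid $1,200 for the upgrade and $45.50 for shipping.",
    "note-2": "Refund of $45.50 issued; refund of $45.50 logged twice.",
}

rows = stage(run(connect(raw), [extract_amounts, clean]))
for row in rows:
    print(row)
```

The staged rows have a fixed shape, so a reporting or BI tool could load them without knowing anything about the original text.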