With the tremendous increase in the volume of electronic documents and web content, automatic categorization of documents has become a key method for organizing information and for knowledge discovery. Proper classification of e-documents, emails, office documents, blogs, online news, and digital content requires techniques such as machine learning and natural language processing. Organizations need solutions that can capture high volumes of unstructured content, provide a repository to store it in native formats, text-mine it using various techniques, and extract entities, concepts, and categories to classify the content properly.
Many vendors promise solutions that claim to provide insights into and categorization of unstructured content. But getting insights from unstructured content (big data) is not a straightforward task. It requires a text analytics solution that is accurate and detailed and that produces results that are transparent and clear. Most solutions on the market use a statistical method for natural language processing that simply finds an appropriate set of text fragments classified according to the researcher’s goals. Although this can often produce reasonably good results in a relatively short time, the method is generalized and carries sampling bias. It becomes challenging when the actual results aren’t what was expected. When the input used in real projects differs from the data used during training, it becomes too cumbersome to maintain and recreate the corpus and retrain the algorithm to get the expected results, because there are no clear links between the results and the steps needed to improve them.
On the other hand, a Computational Linguistics approach is transparent. Every piece of linguistic knowledge is explicit and can be easily fine-tuned to produce quality results. This is true even if the software needs to be customized for a specialized task (such as processing legal or medical texts). Over time, the incremental improvements inherent in a knowledge-based system yield far greater gains than the statistical method, where real-world input data soon deviates from the original training set.
Moreover, Computational Linguistics is detailed: instead of working only on keywords and word frequency, it applies explicit rules and makes use of dictionaries and additional analysis techniques, including the following (illustrated in the sketch after this list):
- Grammar-driven tokenization: how a document is split into sentences and a sentence into words.
- Morphological analysis: how words are modified to express features such as gender, number and tense.
- Syntactic analysis: how a sentence is split into phrases and how those phrases are assigned functions in the sentence, e.g. “The can goes on the shelf” vs. “It can go on the shelf”.
- Semantic analysis: how words and phrases are interpreted to give meaning to a sentence (“pretty awful” is negative, “pretty case” is positive).
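As a rough illustration of these layers, the sketch below uses the open-source spaCy library as a stand-in for a full linguistic engine; the library, model name, and exact output are assumptions for illustration, not any particular vendor’s pipeline.

```python
# A minimal sketch of the analysis layers above, using open-source spaCy as a
# stand-in (an assumption for illustration; not any specific vendor's engine).
# Setup: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

for text in ["The can goes on the shelf.", "It can go on the shelf."]:
    doc = nlp(text)                      # tokenization: text -> tokens
    for token in doc:
        # lemma_ -> morphological analysis (base form, strips tense/number)
        # pos_   -> part of speech ("can" comes out NOUN vs AUX here)
        # dep_   -> syntactic function of the word within the sentence
        print(f"{token.text:8} {token.lemma_:8} {token.pos_:6} {token.dep_}")
    print()
```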
Overall, several key features of the Computational Linguistics approach make it much more powerful than the traditional statistical method. Below are some real examples of how Computational Linguistics provides insight into the real meaning of unstructured content.
Analyzing every word in every sentence
In a language, small words can make all the difference: it’s important to know the difference between “great” and “not so great”, and between “I like this cake” and “it looks like rain”. Important linguistic information for each word is extracted from dictionaries, and comprehensive rules defined in multilingual grammars are used to break the text down into phrases. The next step is to assign roles to those phrases within the sentence and to use semantic technology to extract meaning from the unstructured text.
Working with phrases rather than just keywords
To extract insights from text, it is important to know what the author intended to say. As humans, we structure the world according to the detailed attributes of the objects and ideas we deal with, and we use phrases to express this: we know that a “fire engine” is not the same kind of thing as an “engine”. A system that works only with keywords can never express this level of detail. Another example is “Ministry of Finance has taken steps against the slow economy”: here “Ministry of Finance” is a single entity, and “of” plays a very important role in linking “Ministry” and “Finance”. Dropping “of” during extraction will lead to different results.
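A minimal sketch of phrase-level extraction, again assuming spaCy: noun chunks keep compounds like “fire engine” together, and entity recognition (model permitting) keeps “Ministry of Finance” as one unit.

```python
# Phrase-level (not keyword-level) extraction sketch using spaCy; exact
# chunk and entity boundaries depend on the model and are not guaranteed.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("A fire engine raced past the Ministry of Finance.")

for chunk in doc.noun_chunks:                # whole phrases, not keywords
    print("phrase:", chunk.text)             # e.g. "A fire engine"

for ent in doc.ents:                         # multiword entities stay intact
    print("entity:", ent.text, ent.label_)   # e.g. "the Ministry of Finance" ORG
```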
Analyzing every opinion in every sentence
Documents generally contain multiple opinions representing several closely related but subtly different views. By reducing these to a single score, many systems hide the best insights and prevent people from drilling down to extract really useful information from the text. For example, “I hate vodka, but love beer and I enjoy red wine” expresses three opinions, and a separate score is returned for each one.
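The toy sketch below shows the idea of per-opinion scoring; the clause splitting and the lexicon are deliberately crude assumptions, where a real engine would rely on a full syntactic parse.

```python
import re

# Hypothetical opinion lexicon; a real system derives this from dictionaries.
OPINION_WORDS = {"hate": -1.0, "love": 1.0, "enjoy": 0.5}

def score_clauses(sentence):
    """Split on crude clause boundaries and score each opinion separately."""
    clauses = re.split(r",\s*|\bbut\b|\band\b", sentence.lower())
    results = []
    for clause in (c.strip() for c in clauses):
        if not clause:
            continue
        score = sum(v for w, v in OPINION_WORDS.items() if w in clause.split())
        results.append((clause, score))
    return results

print(score_clauses("I hate vodka, but love beer and I enjoy red wine"))
# [('i hate vodka', -1.0), ('love beer', 1.0), ('i enjoy red wine', 0.5)]
```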
To benefit from automated text processing, it is necessary first to extract useful items and then to aggregate them. Many models can be used to add structure to flat lists of items. One is extracting semantic structures from syntactic features in the text, for example capturing the relationship between a product or service and its components or features (“the screen of this iPad”, “the telco signal”). Another is categorizing relevant items according to a taxonomy (“signal” and “reception” both belong to the category SERVICE). In this way a flat list of items can be turned into a useful hierarchy (Telco → Service → Quality).
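A minimal sketch of the taxonomy step, with a hypothetical hand-built mapping standing in for a real curated taxonomy:

```python
# Hypothetical taxonomy: extracted item -> path in the category hierarchy.
TAXONOMY = {
    "signal":    ("Telco", "Service", "Quality"),
    "reception": ("Telco", "Service", "Quality"),
    "screen":    ("Electronics", "Hardware", "Display"),
}

def build_hierarchy(items):
    """Turn a flat list of extracted items into a nested category tree."""
    tree = {}
    for item in items:
        path = TAXONOMY.get(item)
        if path is None:
            continue                      # unmapped items need a curator
        node = tree
        for level in path:
            node = node.setdefault(level, {})
        node.setdefault("_items", []).append(item)
    return tree

print(build_hierarchy(["signal", "reception", "screen"]))
# {'Telco': {'Service': {'Quality': {'_items': ['signal', 'reception']}}}, ...}
```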
How the results are calculated
Every opinion in the sentence is scored, and the system also shows exactly which words express the opinion and which name the entities (people, brands, products, etc.). This is true even for adverbs such as “really”, adjectives such as “excellent”, and negations such as “not”.
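A toy sketch of how such modifiers shift a score; the words and multipliers here are illustrative assumptions, not any product’s actual weights.

```python
# Hypothetical base scores and modifier multipliers, for illustration only.
BASE = {"excellent": 1.0, "good": 0.5, "awful": -1.0}
MODIFIERS = {"really": 1.5, "pretty": 1.2, "not": -1.0}

def score_phrase(words):
    """Apply any preceding modifiers to the base score of an opinion word."""
    score, factor = 0.0, 1.0
    for w in words:
        if w in MODIFIERS:
            factor *= MODIFIERS[w]
        elif w in BASE:
            score, factor = BASE[w] * factor, 1.0
    return score

print(score_phrase(["really", "excellent"]))  # 1.5  (intensified)
print(score_phrase(["not", "good"]))          # -0.5 (negation flips sign)
print(score_phrase(["pretty", "awful"]))      # -1.2 ("pretty awful" is negative)
```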
Document categorization improves how data is stored, accessed, and modified. Many categorization techniques and algorithms have been developed, and no single one is sufficient on its own. Organizations must seek solutions that best fit their specific challenges: solutions capable of dealing with large volumes and diverse formats of electronic content.
Rajan Sharma currently holds the position of Regional Sales Engineer, APAC, at Actuate and has over 11 years of experience in the IT industry, mainly as a solution and pre-sales consultant. He has a proven record of working closely with sales and product teams to provide comprehensive pre-sales support. Rajan specializes in business requirements discovery, architecture and solution consulting, and contributes to the CCM Insights blog on these topics.