Text analytics, sometimes called text data mining, is the process of uncovering insightful and actionable information, trends, or patterns from text.
The extracted, structured data is far easier to work with than the original text, which makes it simpler to assess the information’s quality and usefulness. Developers and data scientists can then use the mined data in downstream data visualisations, analytics, machine learning, and applications.
Text analytics aims to identify facts, relationships, sentiments, or other contextual information. Extraction often starts with tagging entities such as people’s names, places, and products, and can advance to assigning topics, determining categories, and discovering sentiments.
When measures such as currencies, dates, or quantities are extracted, establishing their relationship to other entities (and any qualifiers) is a key text analytics capability.
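As a rough illustration of linking a measure to an entity and a qualifier, here is a minimal Python sketch. The regular expression, the list of qualifiers, the `extract_money` helper, and the "nearest preceding entity" rule are all assumptions made for this example; production tools use far richer linguistic models.

```python
import re

# Illustrative pattern (an assumption, not a standard): an optional qualifier,
# a currency symbol, an amount, and an optional scale word.
MONEY = re.compile(
    r"(?:\b(?P<qualifier>about|approximately|over|under|up to)\s+)?"
    r"(?P<currency>[$€£])"
    r"(?P<amount>\d+(?:,\d{3})*(?:\.\d+)?)"
    r"(?:\s*(?P<scale>thousand|million|billion)\b)?",
    re.IGNORECASE,
)
SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9}


def extract_money(text, known_entities):
    """Return each monetary amount with its qualifier and nearest preceding entity."""
    results = []
    for match in MONEY.finditer(text):
        value = float(match["amount"].replace(",", ""))
        if match["scale"]:
            value *= SCALES[match["scale"].lower()]
        # Naive linking rule: pick the known entity mentioned closest before the amount.
        preceding = text[: match.start()]
        owner, owner_pos = None, -1
        for entity in known_entities:
            pos = preceding.rfind(entity)
            if pos > owner_pos:
                owner, owner_pos = entity, pos
        results.append({
            "entity": owner,
            "qualifier": match["qualifier"].lower() if match["qualifier"] else None,
            "currency": match["currency"],
            "value": value,
        })
    return results
```

Calling `extract_money(sentence, ["Acme", "Globex"])` returns one dictionary per amount found. The linking rule is deliberately simple and will mislabel sentences where the entity follows the amount.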
Extracting data from documents versus form fields
The hardest challenges in text analytics are processing enterprise repositories and large documents such as aggregated news from websites, corporate SEC filings, electronic health records, and other unstructured or semistructured documents.
Parsing documents has some unique challenges as the document’s size and structure often dictate domain-specific preprocessing rules and NLP (natural language processing) algorithms. For example, categorising a 1,000-word blog post is a lot easier than ranking all of the topics found in a book collection.
Also, larger documents often require validating the extracted information based on context; for instance, the medical conditions of a patient should be categorised independently from the conditions listed in their family history.
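To make the family history example concrete, the sketch below groups matched terms by the section they appear in. It assumes, purely for illustration, that section headings are lines ending in a colon and that a small vocabulary of condition names is available; the `conditions_by_section` helper is hypothetical, and real clinical notes are far less tidy.

```python
def conditions_by_section(note, vocabulary):
    """Group vocabulary terms by the section heading they appear under."""
    found = {}
    section = "unknown"
    for line in note.splitlines():
        stripped = line.strip()
        # Assumption: a line ending with ":" starts a new section.
        if stripped.endswith(":"):
            section = stripped[:-1].lower()
            continue
        lowered = stripped.lower()
        for term in vocabulary:
            # Plain substring matching; it ignores negation such as "no asthma".
            if term in lowered:
                found.setdefault(section, set()).add(term)
    return found


note = """Patient history:
Diagnosed with asthma in 2015.
Family history:
Mother has diabetes and hypertension.
"""
sections = conditions_by_section(note, {"asthma", "diabetes", "hypertension"})
```

Here `sections` keeps the patient’s asthma apart from the diabetes and hypertension recorded for a relative, which is the kind of context check larger documents tend to need.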
But what if you want to perform a potentially simpler task of extracting information from a form field or other short text snippet? Consider these possible scenarios:
- Quantify feedback from an employee survey’s open-ended responses
- Process social media posts for their sentiments about brands or products
- Categorise different types of chatbot interactions
- Assign topics to user stories on an agile backlog
- Route service desk requests based on the problem details
- Parse information submitted to marketing on your website
These problems call for simpler algorithms than document parsing because the text fields are identifiable, short, and often carry a specific type of information.
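For a sense of how lightweight such an algorithm can be, here is a minimal keyword-scoring sketch for routing service desk requests. The queue names, keyword sets, and the `route_request` helper are hypothetical; a real deployment would more likely use a trained classifier or one of the services described later.

```python
import re

# Hypothetical queues and keywords, for illustration only.
ROUTES = {
    "network": {"vpn", "wifi", "network", "connection"},
    "accounts": {"password", "login", "locked", "account"},
    "hardware": {"laptop", "monitor", "keyboard", "printer"},
}


def route_request(text, routes=ROUTES, default="triage"):
    """Send a short request to the queue whose keywords it mentions most often."""
    tokens = re.findall(r"[a-z]+", text.lower())
    scores = {
        queue: sum(token in keywords for token in tokens)
        for queue, keywords in routes.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to a human triage queue when no keyword matches at all.
    return best if scores[best] > 0 else default
```

For example, `route_request("I forgot my password and my account is locked")` selects the accounts queue, while a request with no known keywords falls through to the default.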
Let’s say you need to leverage unstructured field data in an application or are asked to include insightful information extracted from text in a data visualisation. Text analytics is an important first step, and agile data science teams often use spikes to conduct discovery work. The team needs tools, skills, and methodologies to perform text analytics. Here are three different approaches.
1 - Use a public cloud’s NLP and cognitive services
The major public clouds offer natural language processing and other cognitive services, so teams already working in these environments and comfortable with these services should research these options.
- Azure Cognitive Services offers several related services. Form Recognizer can extract key/value pairs from text fields and documents, and Text Analytics can identify entities, sentiment, and key phrases. The more advanced Language Understanding capability can be used for developing NLP models in chatbot, mobile, and IoT applications.
- Google Cloud Platform has two separate natural language offerings. Developers can use the Natural Language API to analyse basic entities, extract sentiment, and categorise content into 700 predefined categories. The more advanced AutoML Natural Language creates custom categorisation and sentiment models.
- Amazon Comprehend has similar text analytics and NLP features, with APIs for detecting entities, events, key phrases, topics, sentiments, and personally identifiable information. Developers and data scientists can also use Amazon SageMaker to train, test, and deploy NLP models built with tools such as BlazingText, BERT (Bidirectional Encoder Representations from Transformers), or spaCy.
- IBM Watson Natural Language Understanding can extract entities, sentiment, categories, and concepts but also has more sophisticated features for identifying relations, emotions, and semantic roles.
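As one hedged example of coding to these APIs, the sketch below tallies sentiment labels for a batch of survey responses using the `detect_sentiment` operation of Amazon Comprehend. The `tally_sentiment` helper is hypothetical, and it takes the client as a parameter so that it can be exercised without credentials. Creating a real client assumes the boto3 library is installed and that AWS credentials and a region are configured.

```python
from collections import Counter


def tally_sentiment(client, responses, language_code="en"):
    """Count the sentiment labels returned for a list of short texts."""
    tally = Counter()
    for text in responses:
        if not text.strip():
            continue  # skip blank responses rather than sending them to the service
        result = client.detect_sentiment(Text=text, LanguageCode=language_code)
        # Comprehend reports one label per text: POSITIVE, NEGATIVE, NEUTRAL, or MIXED.
        tally[result["Sentiment"]] += 1
    return tally


# With AWS access configured, a real client could be created like this:
#   import boto3
#   client = boto3.client("comprehend")
#   print(tally_sentiment(client, ["Great team", "The tools are too slow"]))
```

Passing the client in also keeps the counting logic separate from the provider, so a different cloud service could be wrapped to offer the same method.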
2 - Use text analytics tools in data integration and machine learning platforms
If your organisation has invested in data integration, machine learning, or analytics platforms, it’s likely that one of them has some text analytics and NLP capabilities. Using these platforms may be an easier and faster way to perform lightweight text analytics than coding to APIs or working in data science notebooks. Here are some examples:
- Alteryx Designer has text mining functions for preprocessing, topic modelling, and sentiment analysis.
- IBM SPSS Modeler Text Analytics can be used for categorisation and is a common tool in market research for processing survey responses.
- SAS Visual Text Analytics is a visual tool and open platform for parsing, information extraction, NLP modelling, sentiment analysis, and trend analysis.
Other data science platforms such as RapidMiner, KNIME, and Dataiku offer text mining functions natively, through plug-ins, or through integrations with public cloud services.
3 - Use specialised text analytics tools
If coding on public cloud platforms is too complex, and your organisation does not already have an analytics, data science, or machine learning platform with text mining capabilities, then you’re probably seeking a third option. Specialised text analytics tools may be the answer. Take a look at Keatext, Lexalytics, MeaningCloud, MonkeyLearn, NetOwl, Provalis Research, Rosette Text Analytics, and other platforms that offer text analytics capabilities.
Text analytics is also common in customer experience, marketing automation, market research, social listening, chatbot, and other platforms that capture qualitative information around customers and sales prospects.
It’s no surprise that many tools have text analytics capabilities. Some offer simple on-ramps with prebuilt models based on standardised entities, categories, and topics, whereas others enable robust model building. The platforms also differ by target use cases, with some focusing on specific industries, document types, integration requirements, or technology use cases.
If you’re just getting started with text analytics, there are a few best practices. Begin any data and analytics discovery exercise by defining questions and target outcomes that potentially deliver business value. From there, consider the overall complexity of the document, content, and text fields that require processing, and examine the details around the target entities, topics, and semantics.
Understanding the problem’s complexity helps determine whether an agile spike using a lightweight approach is viable or whether a more extensive agile proof of concept, co-constructed with text mining experts, is needed.
Most importantly, recognise that text analytics and natural language processing are forms of machine learning. Arriving at robust solutions requires experimenting with different algorithms, improving models, adding new data sources, and validating the quality of the results. For organisations trying to improve customer experiences, text analytics is an important capability to develop.
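One simple way to validate quality during a spike is to compare a tool’s output against a small hand-labelled sample. The sketch below computes precision and recall for a single label; the `precision_recall` helper and the sample labels are made up for illustration.

```python
def precision_recall(expected, predicted, label):
    """Compute precision and recall for one label from paired label lists."""
    pairs = list(zip(expected, predicted))
    true_pos = sum(1 for e, p in pairs if e == label and p == label)
    false_pos = sum(1 for e, p in pairs if e != label and p == label)
    false_neg = sum(1 for e, p in pairs if e == label and p != label)
    # Guard against division by zero when a label is never predicted or never expected.
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Hypothetical hand labels versus a model's output for four survey responses.
expected = ["positive", "negative", "positive", "neutral"]
predicted = ["positive", "positive", "negative", "neutral"]
scores = precision_recall(expected, predicted, "positive")
```

Tracking these two numbers per label across experiments gives a team a consistent basis for comparing algorithms or services.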