Data wrangling, dataops, data prep, data integration—whatever your organisation calls it, managing the operations to integrate and cleanse data is labor intensive. Many businesses struggle to integrate new data sets efficiently, improve data quality, centralise master data records, and create cleansed customer data profiles.
Dataops isn’t a new challenge, but the stakes are higher as more companies want to become data-driven organisations and leverage analytics as a competitive advantage. Digital trailblazers are also extending dataops into unstructured data sources to build AI search capabilities and prep data for use in large language models.
Leveraging AI and ML for data transformation
Dataops must become more efficient, deliver better quality results, scale to handle large data volumes and velocities, work with more disparate data sources, and improve the reliability of data pipelines.
“Data needs to undergo transformation and refinement to unlock its true potential, and dataops is the vital discipline that revolutionises data management and maximises its value through efficient processes and automation,” says Newgen Software’s head of AI, Rajan Nagina. “Dataops involves integrating people, technology, and workflows to ensure that data is handled efficiently, with a focus on improving data quality, accessibility, and reliability.”
The tools for automating data pipelines are improving, and many leverage machine learning and artificial intelligence capabilities. AI and machine learning dataops techniques shift data operations from manual- and rule-based approaches toward intelligent automation.
Sunil Senan, senior vice president and global head of data, analytics, and AI at Infosys, adds several competitive benefits when enterprises leverage machine learning and AI in dataops.
“Enterprises can deploy AI for quick data discovery, cataloging, and rapid data profiling, while ML can detect anomalies, identify inconsistencies, and enrich data. Together, AI, ML, and automation can help generate improved data quality, harmonise master data, and create the fabric for building data products and effective data teams.”
Where can dataops teams extend automation and use machine learning and AI as game-changing capabilities? Here are five examples.
1. Reduce data prep for new data sets
“Advanced AI/ML capabilities enable a paradigm shift for data integration, transformation, and observability,” says Crux’s CEO Will Freiberg. “By using automated solutions, dataops teams can flip the ratio from 70% of their time spent on data preparation to 70% of their time spent on high-value analytics.”
Here are two key questions for dataops teams to consider regarding the impact of manual efforts:
- What’s the cycle time measured from the initial discovery of a new data set to when it’s loaded, cleaned, and joined in the organisation’s data lake and listed in the data catalog?
- Once there’s a data pipeline, are you using monitoring and automation to detect and adjust to changes in the data format?
When manual data processing steps are needed to load and support data pipelines, dataops teams can take the opportunity to improve cycle times for new data sources and the mean time to recover from data pipeline issues.
Freiberg continues, “Once data teams define standards for data quality and program them into AI, the technology can detect and manage schema changes and data profile anomalies when onboarding external datasets—preventing broken data pipelines and the need for manual intervention.”
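One way to picture the schema-change detection Freiberg describes is a validation step that compares an incoming dataset’s header against an expected contract before the pipeline runs. This is a minimal illustrative sketch; the schema and field names are assumptions, not any vendor’s actual implementation.

```python
# Hypothetical sketch: detect schema drift when onboarding an external dataset,
# so the pipeline can halt or adapt instead of silently breaking downstream.
import csv
import io

EXPECTED_SCHEMA = {"customer_id", "email", "signup_date"}  # assumed data contract

def check_schema(csv_text: str) -> dict:
    """Compare an incoming CSV header against the expected schema."""
    reader = csv.reader(io.StringIO(csv_text))
    incoming = set(next(reader))
    return {
        "missing": sorted(EXPECTED_SCHEMA - incoming),     # columns the pipeline needs
        "unexpected": sorted(incoming - EXPECTED_SCHEMA),  # new columns to review
        "ok": incoming == EXPECTED_SCHEMA,
    }

# A supplier renames a column: the check surfaces it before load, not after.
feed = "customer_id,email,signup_dt\n42,a@example.com,2023-01-01\n"
report = check_schema(feed)
```

In practice, a tool with AI capabilities could go further and propose a mapping from `signup_dt` to `signup_date` automatically; the point here is that the contract check runs before ingestion.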
2. Scale data observability and continuous monitoring
Broken data pipelines occur when dataops engineers don’t use monitoring, alerts, and automation to identify issues and implement fixes quickly. Proactive remediations include dataops observability tools and practices for logging data integration events and monitoring data pipelines.
“Manually finding and fixing problems is time-consuming, given the volume of data organisations must deal with today,” says Emily Washington, senior vice president of product management at Precisely. “An effective approach to ensuring data quality is to validate data as it enters the organisation’s ecosystem and ensure continuous monitoring by adopting data observability as part of an overall data integrity strategy.”
Data observability aims to provide consistent and reliable data pipelines for real-time decision-making, updating dashboards, and feeding machine learning models. It’s one way for dataops teams to manage service-level objectives, a principle introduced in site reliability engineering that equally applies to data pipelines.
“Data observability helps organisations proactively identify and manage data quality at scale, resulting in healthier data pipelines, more productive teams, and happier customers,” says Washington.
Looking forward, as generative AI capabilities for dataops become mainstream, they have the potential to enable data observability at scale by:
- Identifying data issue patterns and recommending remediations or triggering automated cleansing
- Recommending code fixes and suggestions to data pipelines
- Documenting data pipelines and improving the information captured for data observation
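The service-level-objective framing above can be made concrete with a small batch check: validate each load against volume and quality thresholds and raise alerts when an SLO is breached. The thresholds and field names below are illustrative assumptions.

```python
# Minimal sketch of data observability checks: validate each batch against
# service-level objectives (row volume, null rate) and collect alerts that
# monitoring or automation could act on. Thresholds are assumed examples.
SLO = {"min_rows": 100, "max_null_rate": 0.05}

def observe_batch(rows: list[dict], key: str) -> list[str]:
    """Return a list of SLO violations for one batch of records."""
    alerts = []
    if len(rows) < SLO["min_rows"]:
        alerts.append(f"volume: {len(rows)} rows < {SLO['min_rows']} expected")
    nulls = sum(1 for r in rows if r.get(key) in (None, ""))
    null_rate = nulls / len(rows) if rows else 1.0
    if null_rate > SLO["max_null_rate"]:
        alerts.append(f"quality: {null_rate:.0%} null '{key}' values")
    return alerts

# An undersized batch with a spike in missing emails trips both checks.
batch = [{"email": "a@example.com"}] * 40 + [{"email": None}] * 10
alerts = observe_batch(batch, "email")
```

A production observability tool would track these metrics over time and learn normal ranges rather than use fixed thresholds, but the alerting contract is the same.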
3. Improve data analysis and classification
Dataops teams can also use AI and machine learning to analyse and classify data as it streams through data pipelines.
“AI-driven data capture enhances the quality of data flowing into the system early by doing anomaly detection, relevance assessment, and data matching,” says Hillary Ashton, chief product officer at Teradata. “ML models can be leveraged to find hidden patterns in data, clean and harmonise to conform to standards, and classify sensitive data to ensure appropriate governance.”
Basic classifications include identifying personally identifiable information (PII) and other sensitive data in datasets that aren’t flagged as containing this type of information. Once identified, data governance teams can define automation rules to reclassify the source and trigger other business rules.
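A basic version of this PII scan can be sketched as pattern matching over field values, tagging records for governance review. The patterns below are simplified assumptions for illustration, not production-grade detectors.

```python
# Illustrative sketch: scan free-text field values for common PII patterns
# (email addresses, US SSN-style numbers) so records in unmarked datasets
# can be reclassified and routed to governance rules.
import re

PII_PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # simplified, not exhaustive
}

def classify_pii(text: str) -> set[str]:
    """Return the set of PII categories detected in a text value."""
    return {name for name, pattern in PII_PATTERNS.items() if pattern.search(text)}

tags = classify_pii("Contact jane@example.com, SSN 123-45-6789")
```

ML-based classifiers extend this idea to patterns regexes can’t express, such as names and addresses in free text, but the governance workflow triggered by the tags is the same.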
Ashton believes generative AI will drive more powerful data quality and governance tools and says, “Dataops teams will look at leveraging business domain knowledge and data from collaboration platforms to provide richer context and patterns to the data.”
Another data-compliance use case is in security. I spoke with Tyler Johnson, co-founder and CTO of PrivOps, about how identity and access management is an often overlooked area where dataops can provide value with automation and AI.
“Automation can minimise the risk of bad actors using stale permissions to penetrate the organisation, but it does nothing to address threats from authorised users,” he says. “By extending data pipeline workflows to aggregate and integrate user access logging data with AI, dataops partnered with infosec can minimise threats from outside and inside the organisation. The AI identifies suspicious access patterns and alerts the security operations center (SOC) when detected.”
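A toy version of the suspicious-access detection Johnson describes is an outlier check on per-user access counts against that user’s own baseline. A real SOC integration would use richer features and learned models; the data and threshold here are assumptions.

```python
# Sketch of access-pattern anomaly detection: flag a user's daily record-access
# count when it deviates sharply from their historical baseline (z-score test).
from statistics import mean, stdev

def suspicious(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    """Flag today's access count if it falls far outside the historical pattern."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

baseline = [12, 9, 11, 10, 13, 12, 10]  # typical daily record accesses for one user
alert = suspicious(baseline, today=95)  # a sudden bulk-access spike
```

The design choice worth noting is per-user baselines: an access volume that is normal for a batch service account can be highly anomalous for an individual analyst.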
4. Provide faster access to cleansed data
Identifying sensitive information in a data stream and other anomalies is a fundamental data governance use case, but what business teams really want is faster access to cleansed data. A primary use case for marketing, sales, and customer service teams is real-time updates to customer data records, and streaming data into a customer data platform (CDP) is one approach to centralising customer records.
“Applying the right tools to detect and address data quality issues throughout the data processing pipeline is critical, starting with scheduling automated exploratory data analysis, data cleansing, and deterministic and probabilistic user ID matching tools to run during data ingestion,” says Karl Wirth, CEO of Treasure Data.
“Real-time user ID stitching can be combined with automated segmentation (using clustering and other machine learning models) to enable insights and personalisation to be constantly refreshed as data accumulates. Finally, automated prediction and anomaly detection algorithms, combined with data drift detection, complete the picture by ensuring that quality remains intact over time.”
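The two matching stages Wirth mentions can be sketched side by side: a deterministic pass on exact identifiers, then a probabilistic fallback on similarity scoring. The similarity cutoff and record fields below are illustrative assumptions.

```python
# Hedged sketch of deterministic and probabilistic user ID matching: exact
# email match links records outright; otherwise a fuzzy name-similarity score
# suggests a likely match for stitching. The 0.85 cutoff is an assumption.
from difflib import SequenceMatcher

def match(rec_a: dict, rec_b: dict) -> str:
    # Deterministic: identical normalised email means the same user.
    if rec_a["email"].lower() == rec_b["email"].lower():
        return "deterministic"
    # Probabilistic: high name similarity suggests a likely match.
    score = SequenceMatcher(None, rec_a["name"].lower(), rec_b["name"].lower()).ratio()
    return "probabilistic" if score > 0.85 else "no-match"

a = {"name": "Jonathan Smith", "email": "jsmith@example.com"}
b = {"name": "Jonathon Smith", "email": "jon.smith@example.com"}
result = match(a, b)
```

Production ID stitching weighs many signals (device IDs, addresses, behaviour) with trained models rather than a single string ratio, but the deterministic-first, probabilistic-fallback pattern is the common shape.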
A second approach to managing customer data is master data management (MDM), where dataops defines the rules for identifying the primary customer records and fields from multiple data sources.
Manish Sood, CEO, founder, and chairman of Reltio, says machine learning helps combine information from multiple sources. “Modern approaches utilise automation and ML-based techniques to swiftly unify data from multiple sources, departing from the limited scope of traditional MDM systems,” he says.
Machine learning also helps reduce the number and complexity of business rules in MDM systems. “Automation has long been used by dataops to improve master data management, particularly data quality, for example, by hard-coding rules about metadata,” says David Cox, outbound product manager at Semarchy. “Artificial Intelligence and machine learning can help automate data quality at scale, as an infinite number of rules may be needed to control the quality of large, high velocity, complex data.”
Anthony Deighton, data products general manager at Tamr, shares an example of where machine learning can replace hard-to-maintain business rules. He says, “AI and machine learning are powerful tools that can make a real difference in dataops. For example, duplicate customer records can be merged into a single comprehensive record, resulting in greater data accuracy and better insights.”
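Once duplicates are identified, merging them into the single comprehensive record Deighton describes comes down to survivorship rules. This sketch uses two simple assumed rules, prefer the most recent value and never overwrite with blanks; real MDM survivorship logic is richer.

```python
# Illustrative sketch of merging matched duplicate customer records into one
# "golden" record using simple survivorship rules. Fields are assumed examples.
def merge_records(dups: list[dict]) -> dict:
    golden: dict = {}
    # Walk records oldest-to-newest so fresher values win...
    for rec in sorted(dups, key=lambda r: r["updated"]):
        for field, value in rec.items():
            if value not in (None, ""):  # ...but never overwrite with blanks.
                golden[field] = value
    return golden

dups = [
    {"name": "J. Smith", "phone": "555-0100", "email": "", "updated": "2022-03-01"},
    {"name": "Jane Smith", "phone": "", "email": "js@example.com", "updated": "2023-06-15"},
]
golden = merge_records(dups)
```

The merged record keeps the newer name and email while preserving the phone number only the older record had, which is what "comprehensive" means in practice.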
Expect more generative AI capabilities in CDP and MDM solutions, especially around enriching customer records with information extracted from documents and other unstructured data sources.
5. Lower the cost and increase benefits of data cleansing
Dataops teams have the opportunity to use AI and machine learning to shift their primary responsibilities from data cleansing and pipeline fixing to providing value-added services such as data enrichment.
“As data volumes and complexity grow, manually establishing data quality rules no longer proves scalable, and AI/ML offers a promising approach to tackling scalability,” says Satish Jayanthi, co-founder and CTO of Coalesce. “These technologies can efficiently identify and rectify erroneous data by leveraging automation, thereby mitigating the negative consequences.”
Ashwin Rajeeva, co-founder and CTO of Acceldata, shares examples of how ML can enable continuous data quality improvements by learning through patterns. “Learnings can be applied to correct errors, fill in missing data, add labels, perform smart categorisation, and de-duplicate data.”
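Rajeeva’s example of filling in missing data by learning through patterns can be sketched as mode-based imputation: learn the most common value for similar records, then apply it to incomplete ones. Field names and data here are illustrative assumptions.

```python
# Minimal sketch of pattern-based cleansing: fill a missing categorical value
# with the most frequent value observed among similar (same-group) records.
from collections import Counter

def fill_missing(rows: list[dict], group_key: str, target: str) -> list[dict]:
    # Learn the most frequent target value per group from complete rows.
    modes: dict = {}
    for r in rows:
        if r[target] is not None:
            modes.setdefault(r[group_key], Counter())[r[target]] += 1
    # Apply the learned pattern to rows where the value is missing.
    for r in rows:
        if r[target] is None and r[group_key] in modes:
            r[target] = modes[r[group_key]].most_common(1)[0][0]
    return rows

rows = [
    {"country": "FR", "currency": "EUR"},
    {"country": "FR", "currency": "EUR"},
    {"country": "FR", "currency": None},  # filled from the learned pattern
]
cleaned = fill_missing(rows, "country", "currency")
```

ML-driven cleansing generalises this from a single grouping column to patterns learned across many features, but the principle, infer the missing value from similar records, is the same.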
Eswar Nagireddy, senior product manager of data science at Exasol, notes the importance of driving efficiencies in dataops. “Today, most data and analytics teams don’t have the time and resources to keep up with the needs of data health and monitoring, especially as pressure grows to reduce operational costs and headcount. Data teams that take advantage of automated machine learning (AutoML), no-code, and low-code can more quickly realise the value of applied ML to business while ensuring the health of their data.”
Dataops teams can reduce workload, improve data quality, and increase data pipeline reliability by using AI and machine learning techniques and relying less on manual efforts or hard-coded business rules. Once those changes are in place, teams can use AI and machine learning to drive competitive business value by accelerating the time to integrate new data sets, as well as enriching customer records and improving data governance.