For enterprises looking to wrest the most value from their data, especially in real-time, the “data lakehouse” concept is starting to catch on.
The idea behind the data lakehouse is to merge together the best of what data lakes and data warehouses have to offer, says Gartner analyst Adam Ronthal.
Data warehouses, for their part, enable companies to store large amounts of structured data with well-defined schemas. They are designed to support a large number of simultaneous queries and to deliver the results quickly to many simultaneous users.
Data lakes, on the other hand, enable companies to collect raw, unstructured data in many formats for data analysts to hunt through. These vast pools of data have grown in prominence of late thanks to the flexibility they provide enterprises to store vast streams of data without first having to define the purpose of doing so.
The market for these two types of big data repositories is “converging in the middle, at the lakehouse concept,” Ronthal says, with established data warehouse vendors adding the ability to manage unstructured data, and data lake vendors adding structure to their offerings.
For example, on Amazon Web Services (AWS), enterprises can now pair Amazon Redshift, a data warehouse, with Amazon Redshift Spectrum, which enables Redshift to reach into Amazon’s unstructured S3 data lakes. Meanwhile, data lake Snowflake can now support unstructured data with external tables, Ronthal says.
When companies have separate lakes and warehouses, and data needs to move from one to the other, it introduces latency and costs time and money, Ronthal adds. Combining the two in one platform reduces effort and data movement, thereby accelerating the pace of uncovering data insights.
And, depending on the platform, a data lakehouse can also offer other features, such as support for data streaming, machine learning, and collaboration, giving enterprises additional tools for making the most of their data.
Here is a look at at the benefits of data lakehouses and how several leading organisations are making good on their promise as part of their analytics strategies.
Enhancing the video game experience
Sega Europe’s use of data repositories in support of its video games has evolved considerably in the past several years.
In 2016, the company began using the Amazon Redshift data warehouse to collect event data from its Football Manager video game. At first this event data consisted simply of players opening and closing games. The company had two staff members looking into this data, which streamed into Redshift at a rate of ten events per second.
“But there was so much more data we could be collecting,” says Felix Baker, the company’s head of data services. “Like what teams people were managing, or how much money they were spending.”
By 2017, Sega Europe was collecting 800 events a second, with five staff working on the platform. By 2020, the company’s system was capturing 7,000 events per second from a portfolio of 30 Sega games, with 25 staff involved.
At that point, the system was starting to hit its limits, Baker says. Because of the data structures needed for inclusion in the data warehouse, data was coming in batches and it took half an hour to an hour to analyse it, he says.
“We wanted to analyse the data in real-time,” he adds, but this functionality wasn’t available in Redshift at the time.
After performing proofs of concept with three platforms — Redshift, Snowflake, and Databricks — Sega Europe settled on using Databricks, one of the pioneers of the data lakehouse industry.
“Databricks offered an out-of-the-box managed services solution that did what we needed without us having to develop anything,” he says. That included not just real-time streaming but machine learning and collaborative workspaces.
In addition, the data lakehouse architecture enabled Sega Europe to ingest unstructured data, such as social media feeds, as well.
“With Redshift, we had to concentrate on schema design,” Baker says. “Every table had to have a set structure before we could start ingesting data. That made it clunky in many ways. With the data lakehouse, it’s been easier.”
Sega Europe’s Databricks platform went live into production in the summer of 2020. Two or three consultants from Databricks worked alongside six or seven people from Sega Europe to get the streaming solution up and running, matching what the company had in place previously with Redshift. The new lakehouse is built in three layers, the base layer of which is just one large table that everything gets dumped into.
“If developers create new events, they don’t have to tell us to expect new fields — they can literally send us everything,” Baker says. “And we can then build jobs on top of that layer and stream out the data we acquired.”
The transition to Databricks, which is built on top of Apache Spark, was smooth for Sega Europe, thanks to prior experience with the open source engine for large-scale data processing.
“Within our team, we had quite a bit of expertise already with Apache Spark,” Baker says. “That meant that we could set up streams very quickly based on the skills we already had.”
Today, the company processes 25,000 events per second, with more than 30 data staffers and 100 game titles in the system. Instead of taking 30 minutes to an hour to process, the data is ready within a minute.
“The volume of data collected has grown exponentially,” Baker says. In fact, after the pandemic hit, usage of some games doubled.
The new platform has also opened up new possibilities. For example, Sega Europe’s partnership with Twitch, a streaming platform where people watch other people play video games, has been enhanced to include a data stream for its Humankind game, so that viewers can get a player’s history, including the levels they completed, the battles they won, and the civilisations they conquered.
“The overlay on Twitch is updating as they play the game,” Baker says. “That is a use case that we wouldn’t have been able to achieve before Databricks.”
The company has also begun leveraging the lakehouse’s machine learning capabilities. For example, Sega Europe data scientists have designed models to figure out why players stop playing games and to make suggestions for how to increase retention.
“The speed at which these models can be built has been amazing, really,” Baker says. “They’re just cranking out these models, it seems, every couple of weeks.”
The business benefits of data lakehouses
The flexibility and catch-all nature of data lakehouses is fast proving attractive to organisations looking to capitalise on their data assets, especially as part of digital initiatives that hinge quick access to a wide array of data.
“The primary value driver is the cost efficiencies enabled by providing a source for all of an organisation’s structured and unstructured data,” says Steven Karan, vice president and head of insights and data at consulting company Capgemini Canada, which has helped implement data lakehouses at leading organisations in financial services, telecom, and retail.
Moreover, data lakehouses store data in such a way that it is readily available for use by a wide array of technologies, from traditional business intelligence and reporting systems to machine learning and artificial Intelligence, Karan adds. “Other benefits include reduced data redundancy, simplified IT operations, a simplified data schema to manage, and easier to enable data governance.”
One particularly valuable use case for data lakehouses is in helping companies get value from data previously trapped in legacy or siloed systems. For example, one Capgemini enterprise customer, which had grown through acquisitions over a decade, couldn’t access valuable data related to resellers of their products.
“By migrating the siloed data from legacy data warehouses into a centralised data lakehouse, the client was able to understand at an enterprise level which of their reseller partners were most effective, and how changes such as referral programs and structures drove revenue,” he says.
Putting data into a single data lakehouse makes it easier to manage, says Meera Viswanathan, senior product manager at Fivetran, a data pipeline company. Companies that have traditionally used both data lakes and data warehouses often have separate teams to manage them, making it confusing for the business units that needed to consume the data, she says.
In addition to Databricks, Amazon Redshift Spectrum, and Snowflake, other vendors in the data lakehouse space include Microsoft, with its lakehouse platform Azure Synapse, and Google, with its BigLake on Google Cloud Platform, as well as data lakehouse platform Starburst.
Accelerating data processing for better health outcomes
One company capitalising on these and other benefits of data lakehouses is life sciences analytics and services company IQVIA.
Before the pandemic, pharmaceutical companies running drug trials used to send employees to hospitals and other sites to collect data about things such adverse effects, says Wendy Morahan, senior director of clinical data analytics at IQVIA. “That is how they make sure the patient is safe.”
Once the pandemic hit and sites were locked down, however, pharmaceutical companies had to scramble to figure out how to get the data they needed — and to get it in a way that was compliant with regulations and fast enough to enable them to spot potential problems as quickly as possible.
Moreover, with the rise of wearable devices in healthcare, “you’re now collecting hundreds of thousands of data points,” Morahan adds.
IQVIA has been building technology to do just that for the past 20 years, says her colleague Suhas Joshi, also a senior director of clinical data analytics at the company. About four years ago, the company began using data lakehouses for this purpose, including Databricks and the data lakehouse functionality now available with Snowflake.
“With Snowflake and Databricks you have the ability to store the raw data, in any format,” Joshi says. “We get a lot of images and audio. We get all this data and use it for monitoring. In the past, it would have involved manual steps, going to different systems. It would have taken time and effort. Today, we’re able to do it all in one single platform.”
The data collection process is also faster, he says. In the past, the company would have to write code to acquire data. Now, the data can even be analysed without having to be processed first to fit a database format.
Take the example of a patient in a drug trial who gets a lab result that shows she’s pregnant, but the pregnancy form wasn’t filled out properly, and the drug is harmful during pregnancy.
Or a patient who has an adverse event and needs blood pressure medication, but the medication was not prescribed. Not catching these problems quickly can have drastic consequences. “You might be risking a patient’s safety,” says Joshi.