Big data brings new challenges to the world of analytics in terms of unstructured data types, very high volumes of data and streaming data, and events that need to be used as triggers for analysis of that data.
The need to handle so much data and make sense out of it, combined with enterprise system data, led to the creation of the data lake concept. A data lake brings together data from different sources to clean the data, identify its source, ensure that it follows common business semantics for an organization, and make it accessible to the right users for further analytics, often in a self-service mode.
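To make this concrete, here is a minimal Python sketch of the two steps just described: landing raw records with lineage metadata, then curating them into common business semantics. Everything here (the BUSINESS_TERMS glossary, the land_record and curate functions, the zone name) is an illustrative assumption for the example, not part of any SAP product.

```python
import json
import hashlib
from datetime import datetime, timezone

# Illustrative mapping from source-specific field names to common
# business semantics; in practice this comes from a governed glossary.
BUSINESS_TERMS = {"cust_no": "customer_id", "amt": "order_amount"}

def land_record(raw: dict, source_system: str) -> dict:
    """Store a raw record in the lake's landing zone, tagged with
    lineage metadata so its origin stays identifiable downstream."""
    payload = json.dumps(raw, sort_keys=True)
    return {
        "record_id": hashlib.sha256(payload.encode()).hexdigest()[:16],
        "source_system": source_system,  # lineage: where the data came from
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "payload": raw,
    }

def curate(landed: dict) -> dict:
    """Rename fields to the organization's common business semantics
    so every consumer sees the same vocabulary."""
    curated = {BUSINESS_TERMS.get(k, k): v for k, v in landed["payload"].items()}
    return {**landed, "payload": curated, "zone": "curated"}

record = land_record({"cust_no": "C-1001", "amt": 250.0}, source_system="ERP")
print(curate(record))
```

The point of the two-step split is that raw data is kept intact with its lineage, while the semantic alignment happens in a separate, repeatable curation step.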
The data lake concept has evolved along with advances in big data, precisely because of these new demands on analytics.
The following are some of the key motivations behind the data lake concept:

- Handling unstructured data types and very high volumes of data, including streaming data
- Bringing together data from many different sources while keeping its origin identifiable
- Ensuring that data follows common business semantics across the organization
- Making trusted data accessible to the right users for further analytics, often in a self-service mode
The figure below shows how the data lake sits in an application landscape, how it interacts with other systems, and how users interact with it. The outer edge of the figure shows the various users of the data lake.
Each organization must formulate the data management and data governance principles it wants to follow, including its desired metadata model, which should be easy to extend and should use a set of semantics that is common across the business.
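As an illustration of what such an extensible metadata model might look like, the following Python sketch defines a small dataset-level metadata record with a free-form extensions field. The specific fields (owner, business_terms, retention_days) are assumptions made for the example; every organization would define its own model.

```python
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    """A deliberately small, extensible metadata record (illustrative)."""
    name: str                 # common business name, not a technical table name
    owner: str                # accountable data owner, for governance
    source_system: str        # lineage back to the producing system
    business_terms: list[str]  # links into the shared semantic glossary
    extensions: dict = field(default_factory=dict)  # room to grow without breaking the model

orders = DatasetMetadata(
    name="Sales Orders",
    owner="sales-data-office",
    source_system="ERP",
    business_terms=["customer_id", "order_amount"],
)
orders.extensions["retention_days"] = 365  # extension added without changing the model
print(orders)
```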
The data lake must be connected to data sources throughout the enterprise under suitable governance, thus ensuring the consistency and controllability of the data.
All hyperscalers, such as AWS and Microsoft Azure, have developed their own data lake architectures and positioned products aimed not only at storage but also at governance and the various support functions described previously.
One of the major challenges with data lakes is the lack of proper governance in terms of data ownership, data quality standards, data lineage, and reusability. IT teams often have to massage data into consumable information, while business teams find it difficult to leverage data from the data lake to the extent required, because business context is lost when data is pulled out of the business systems. These challenges led to the advent of the data fabric: an approach that automates data management across multiple sources at any point in time, streamlining data and enriching it by cleansing it, unifying it, and securing it in complex distributed architectures, making it ready for consumption by analytics and AI.
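The sketch below illustrates, under simplified assumptions, the kind of cleansing-and-unification step a data fabric automates: two systems describe the same customer differently, and the records are normalized and merged with lineage retained. The source names, field names, and merge rule are hypothetical examples, not any product's actual behavior.

```python
def cleanse(record: dict) -> dict:
    """Normalize obvious quality issues: trim whitespace, unify casing."""
    return {k: v.strip().title() if isinstance(v, str) else v
            for k, v in record.items()}

def unify(crm: dict, erp: dict) -> dict:
    """Merge two source views of one entity; where both sources overlap,
    the CRM's values win (an illustrative precedence rule)."""
    merged = {**cleanse(erp), **cleanse(crm)}
    merged["sources"] = ["CRM", "ERP"]  # keep lineage for governance
    return merged

crm_view = {"customer_id": "C-1001", "name": "  jane doe "}
erp_view = {"customer_id": "C-1001", "name": "JANE DOE", "credit_limit": 5000}
print(unify(crm_view, erp_view))
```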
SAP Datasphere is a solution with capabilities in this area. It provides a platform for leveraging business data, and it uses AI tools from SAP BTP to bring major advantages to the business through AI-enabled analytics. The blog at the following link explains how SAP Datasphere acts as a data fabric: http://s-prs.co/v597313.
In the next few years, data fabrics will work together with data lakes to provide the best possible data experience, including self-service analytics on huge volumes of trusted, semantically aligned data from multiple sources.
As big data has grown more beneficial to businesses, companies such as SAP have needed to provide solutions for making sense of large volumes of data. With tools such as SAP Datasphere and SAP HANA, data analysts can use the popular data lake approach to storage to sort through big data in SAP landscapes.
Editor’s note: This post has been adapted from a section of the book SAP S/4HANA: An Introduction by Devraj Bardhan, Axel Baumgartl, Madalina Dascalescu, Mark Dudgeon, Piotr Górecki, Asidhara Lahiri, Richard Maund, Bert Meijerink, and Andrew Worsley-Tonks. They are a multinational author team working for IBM, SAP, and Accenture, and they have been working with SAP S/4HANA since its first release.
This post was originally published 11/2019 and updated 3/2025.