Big data brings new challenges to the world of analytics in terms of unstructured data types, very high volumes of data/streaming data, and events that need to be used as triggers for some analysis of that data.
Over the past decade, SAP has forayed deep into the world of big data with solutions such as the SAP HANA in-memory database and SAP Data Intelligence. With these tools, businesses can gather and analyze terabytes of data and make better-informed business decisions.
For some, like those in the C-suite, the knowledge that SAP can handle big data with ease is enough. But for those in charge of pooling data and structuring it in a way that makes sense, more information is needed. Two such pieces of information that are good to know are the concept of a data lake and the capabilities of SAP Vora.
What is a Data Lake?
The term data lake has gained a lot of popularity and is used to mean different things. Generally, however, a data lake can be defined as a store of information from different sources, brought together in such a way that it is cleaned, its source is identified, it follows an organization's common business semantics, and it's made accessible to the right users for further analytics, often in a self-service mode.
The following are some of the key motivations behind the data lake concept:
- Data needs to be analyzed, but huge volumes of data make it difficult to store everything in high-performance, advanced analytical systems, such as SAP HANA; the storage cost would be extremely high, and not all the data is important enough to store in this way. This requires data storage beyond the data warehousing systems, such as SAP BW on SAP HANA or SAP BW/4HANA, in more cost-efficient yet easy-to-integrate systems, such as Hadoop clusters.
- An infrastructure is required that can store these analytical repositories without dependency on a specific data format. In other words, you need an infrastructure that can handle structured, unstructured, or semistructured data.
- The data lineage from the source system should be traceable. This is also useful for regulatory compliance.
- You need to enable business users to use the data for self-service analytics.
- Application rationalization needs data to be in a well-architected format in a few systems, rather than scattered across the landscape among several different data warehousing solutions, visualization tools, and even more custom-built solutions, which make maintenance a nightmare for IT and ease of use a nightmare for business users. A structured approach also makes it possible to consume the data in a publish-subscribe mode.
- The data will eventually move into the transactional systems and hence needs to be of guaranteed quality; thus, the governance framework plays an important part.
- Proper metadata management capability is important for analysts to understand the data they’re consuming and a must-have for the data governance processes and data management processes.
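To make the metadata and lineage points above concrete, here is a minimal sketch in plain Python of what a data catalog entry for one data lake asset might record. The class and field names are illustrative assumptions, not any SAP or standard catalog API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class CatalogEntry:
    """Illustrative data-catalog record for one data lake asset."""
    name: str              # business-friendly name (common semantics)
    source_system: str     # originating system (supports lineage/compliance)
    data_format: str       # structured / semistructured / unstructured
    owner: str             # accountable steward under the governance framework
    lineage: List[str] = field(default_factory=list)  # upstream hops, in order

entry = CatalogEntry(
    name="customer_orders",
    source_system="ERP",
    data_format="structured",
    owner="finance_data_steward",
    lineage=["ERP", "staging_area", "data_lake"],
)

# An analyst can trace where the data came from before consuming it.
print(" -> ".join(entry.lineage))
```

In a real data lake, records like this would live in the central catalog maintained by the information curator, so that self-service users can discover assets and auditors can trace their origins.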
The figure below shows how the data lake sits in an application landscape and interacts with other systems and how users interact with it.
Now you know why you need data lakes. Let’s discuss how to make them.
How to Create a Data Lake
The theory behind the data lake concept is that the data repositories at the core of the data lake should be designed to meet the criteria defined above. Each organization must formulate the data management and data governance principles it wants to follow, including the metadata model to use, which should be easy to extend and carry a set of semantics that is common across the business.
The data repositories include the traditional data warehouse systems and have options to store unstructured or semistructured data. In an SAP big data environment, this means including an existing (or migrated) SAP BW on SAP HANA system or a newly implemented SAP BW/4HANA system along with Hadoop clusters.
Thus, the data repositories not only provide extended storage for our enterprise data warehouse (EDW) systems but can also cater to IoT data, social media data, events, and so on. By using SAP Data Intelligence, which can handle several data types and apply machine learning, you can run advanced analytics on top of this data.
The data lake needs to be connected to data sources throughout the enterprise under suitable governance, ensuring consistency and controllability of the data. This can be achieved through integrated services and through workflows that enforce governance and data quality as and when required, depending on the source and type of data. SAP HANA has these integration services, and many of these functions are also provided by SAP Data Services.
In addition, the data lake needs to have access control, monitoring, and auditing capabilities to ensure proper governance and compliance.
The outer edge of the above figure shows the various users of the data lake:
- The analytics team is a group of users, including data scientists, responsible for carrying out the advanced analytics across the data lake.
- The information curator is responsible for the management of the data catalog, which will be used by users to find the relevant data elements within the data lake.
- The governance, risk, and compliance team is responsible for defining the overall governance program of the data lake and any associated reporting functions to demonstrate compliance.
- The data lake operators are responsible for the day-to-day operations of the data lake.
- The line of business (LoB) users might have roles such as the manufacturing line users, finance users, sales team, and so on.
All hyperscalers, such as AWS and Microsoft Azure, have come up with their own data lake architectures and positioned their products to help not only with storage but also with governance and the various support functions described previously. For example, you can see what the AWS architecture for a data lake solution looks like here: http://s-prs.co/v523211.
What is SAP Vora?
SAP Vora is a distributed computing solution deployed on Apache Hadoop and Spark clusters. It provides a semantic layer on top of your big data stored in Hadoop and integrates with the SAP HANA platform, so you can run combined analytics across enterprise and Hadoop data. It doesn't require any additional hardware.
SAP Vora can also handle hierarchies, enterprise-ready calculations, and currency conversion, and it provides support for units of measure. It has a simple web-based UI and supports SQL for querying data on Hadoop.
Advanced users, such as data scientists, can leverage programming languages such as SQL, Python, Scala, C++, and Java to create mashups from different data sources.
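As an illustration of the kind of mashup such users build, the following plain-Python sketch joins enterprise master data (as it might live in SAP HANA) with sensor events (as they might land in a Hadoop cluster) and aggregates the result. All data, names, and fields here are made up for illustration; this is the concept of a combined query, not Vora's actual API:

```python
from collections import defaultdict

# Hypothetical enterprise master data, e.g. from SAP HANA.
materials = {
    "M100": {"description": "Pump", "plant": "DE01"},
    "M200": {"description": "Valve", "plant": "US02"},
}

# Hypothetical sensor events, e.g. landed in a Hadoop cluster.
events = [
    {"material": "M100", "temp_c": 71.5},
    {"material": "M100", "temp_c": 68.0},
    {"material": "M200", "temp_c": 40.2},
]

# The mashup: enrich each event with master data, then aggregate --
# conceptually what one combined SQL query across both stores would do.
totals = defaultdict(lambda: {"sum": 0.0, "count": 0})
for ev in events:
    description = materials[ev["material"]]["description"]
    totals[description]["sum"] += ev["temp_c"]
    totals[description]["count"] += 1

avg_temp = {k: v["sum"] / v["count"] for k, v in totals.items()}
print(avg_temp)  # {'Pump': 69.75, 'Valve': 40.2}
```

In a real landscape, the join and aggregation would be pushed down to the Hadoop/Spark cluster rather than pulled into application code, which is exactly the optimization described next.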
From the user’s perspective, SAP HANA and SAP Vora act as a single system with joint query optimization and automatic storage decisions that caters to big data analytical needs.
As big data has grown more beneficial to businesses, companies such as SAP have needed to provide solutions for making sense of large volumes of data. With tools such as SAP Vora and SAP HANA, data analysts can use the popular data lake model of data storage to sort through big data with SAP.
Editor’s note: This post has been adapted from a section of the book SAP S/4HANA: An Introduction by Devraj Bardhan, Axel Baumgartl, Nga-Sze Choi, Mark Dudgeon, Piotr Górecki, Asidhara Lahiri, Bert Meijerink, and Andrew Worsley-Tonks.