In order to perform data preparation with SAP, there are a few steps to follow so it’s ready for use in your system, such as populating data lakes or inputting information into a machine learning program.

The first step is data cleansing and data preprocessing, which includes finding missing data entries, populating missing entries with appropriate statistical measures such as mean or median, removing records marked for deletion, and removing duplicate entries and outliers.

Let’s say you are looking for goods receipts linked to purchase orders, and you find that, for some purchase order line items, receipts have not been recorded but a follow-on invoice document has been created. In this case, you may need to preprocess the entries from the invoice document for data preparation.

Similarly, you may need to preprocess entries from multiple relationally associated tables that may have linked fields such as purchase order header, item, history, receipt header, and item. Some preprocessing steps may include discarding deleted entries and defining unique identifiers, such as a combination of purchase order number and item number to link a purchase order with one or more goods receipts and/or invoices.

## Data Transformation Techniques

Data transformation includes the processing of data that may include normalization or binning.

### Data Normalization

Data normalization adjusts the values measured on different scales to a notionally common scale, often prior to the application of a mathematical model. For example, you may have to scale quality units in purchase order data to a common unit of measure (for example, kilograms) from multiple unit of measurements used in purchase orders (such as grams, tons, etc.).

Data Binning

Another method is binning that leads to assignment of values to groups to create a smaller set of discrete ranges. Binning is also useful to reduce the influence of outliers or extreme values on the model. Binning thus helps to mitigate model bias because the numerical quantities can be transformed to frequency distributed bins using quantiles (or the equal-height method). As a result, outliers won’t impact the model when trained.

Bins can be prepared for numerical quantities, such as the per unit price, the quantity of purchase, or the amount of purchase (quantity × price).

One example of binning is simple discretization using equal-width partitioning. For example, let’s say you have a list of purchase orders with a certain supplier in which the following quantities (measured in kilograms) for a material group (i.e., raw materials) are ordered: 5, 9, 10, 12, 16, 37, 51, 53, 74, 94, 202, and 211. The formula for binning into equal-width partitions is width = max − min) ÷ *N*, where *N *is the width of each interval. Let’s divide our 12 purchases into 3 groups using the formula: Width = (211 − 5) ÷ 3 = 70.

Therefore, the values in our three bins will be as follows:

- Bin 1 will have 9 values between 5 and 75 = {5, 9, 10, 12, 16, 37, 51, 53, 74}
- Bin 2 will have 1 value between 76 and 145 = {94}
- Bin 3 will have 2 values between 146 and 215 = {202, 211}

However, note that equal-width binning may dominate the calculation by skewing the data by including outliers.

### Equal-Depth Partitioning

Another data preparation method, equal-depth (frequency) partitioning, divides the range into *N *intervals with each interval containing approximately the same number of samples. Our purchase order data, when represented using equal-depth partitioning, will result in three bins with the following values:

- Bin 1 will have 4 values = {5, 9, 10, 12}
- Bin 2 will have 4 values = {16, 37, 51, 53}
- Bin 3 will have 4 values = {74, 94, 202, 211}

Rather than absolute values, these categorical bins have a smoothing effect on the data, thus contributing to training of the machine learning model.

## Aggregating Data

Once transformation is completed, data is aggregated using unique key combinations. For example, to calculate the goods receipt for a purchase order line item, you may have to aggregate all goods receipts for a purchase order line item with the correct movement types to compare the purchase order quantity with the goods receipt. For example, goods movement 101 can record the receipt of an item whereas goods movement 102 is the reversal of a goods receipt. Similarly, goods movement 103 and its reversal may highlight the quality of the supply and can be used to build a machine learning model for supplier selection in which quality is a criteria.

## Conclusion

Normalization and binning are two phases of preprocessing necessary to check for the various outliers and inconsistencies that occur in real world data. This is especially important when preparing data for use with machine learning tools such as those introduced with SAP Leonardo. Learn more about data management with SAP here.

*Editor’s note:* This post has been adapted from a section of the e-book *Introducing Machine Learning with SAP Leonardo* by Manu Kohli.

## Comments