Behind The Pretty Visuals

Written by Scott Kennedy

The Crucial Role of Data Cleaning and Enrichment for Retail POS Analytics


“The trophy is earned in the hours that no one is watching” (author unknown). 

For every retailer presentation a consumer brand analyst provides, the beautifully crafted visuals capture everyone’s attention. The bright green bars of the waterfall chart showing weekly level sales comp growth draw the eye of the buyer. Or, the bubble chart displaying sales versus profit by store opens everyone’s eyes to an unexpected cluster of stores outperforming expectations. However, behind every pretty visual, a deeply technical data engineering process must happen first. If, and only if, the data cleanup and modeling steps go off without a hitch, can the analyst even be in the position to shoot the theoretical buzzer beater.

The article will explore the following concepts every brand needs to understand to build a successful retail point of sale analytics capability.

  • P.O.S. data cleaning examples
  • P.O.S. data enrichment/modeling considerations to enable last-mile reporting requirements
Data Cleaning

Retail P.O.S. data comes in all shapes and sizes. Despite modern advances by forward-thinking retailers to create API’s that enable cleaner data transfer processes for brands, a significant amount of data cleanup remains.  

Converting Text Data Types

Extracting P.O.S. data from vendor portals involves web crawling. Given the unpredictable nature of the retailers reporting portals, web crawlers load the raw data output to a database as text values, even if the field should be numeric. Loading data as text data types creates a flexible pipeline, robust to a wide range of characters.

The analyst can’t perform basic mathematical operations over text data types. In the image below, Excel counted column B instead of summing. That’s a result of the data type being text. Humans can see numbers, but a computer requires the correct data type to perform the function.




Converting the data type from text to numeric isn’t a trivial problem. Let’s walk through two ways to transition the sales field from above to a numeric value. 

  1. Database level transformation – The SQL function CONVERT or CAST can clean up the data type issue before hitting the reporting layer. SaaS companies like Krunchbox can handle this transformation before the data reaches the brand.
  2. Reporting layer transformation – Inside Excel and Power BI, a powerful transformation engine exists called Power Query. Within PQ, the function ‘Number.FromText’ returns a number value from a given text value.
Both data conversion solutions become infinitely more complex when characters outside 0-9 creep into the process. For example, the currency symbol might appear in a sales data field. In this instance, a function to find and replace special characters must happen before the data type conversion occurs. This messy, complicated process requires a combination of retail domain and technical knowledge.

 Identifying missing values

Retail vendor portals produce reports with missing values, commonly represented with a zero value. The reports are connected to multiple upstream systems with varying update/refresh schedules. The simplified illustration below shows a system lineage going from left to right until it reaches the finished report. If level 2 kicks off before level 1 finishes, crucial data elements won’t make the final report, leaving the analyst with zero values in the final report. 

There are a few ways to identify and correct this situation:


  1. Develop simple business logic driven, if-then conditional statements that identify zeros and make a second attempt at re-running the report. This type of work falls on 3rd party companies handling automated data extraction processes. The brand never notices the issue. 
  2. Run an anomaly detection algorithm to determine if a zero value is, in fact, “probably” inaccurate. Again, this type of work falls on the 3rd party companies helping out with the data engineering leg of the race. 
 Identifying duplicated values


Retailers ask brands for new items each season, even if the product change is only a color change. This change request forces the brand to create a new UPC and item number. However, the retailer will ask to keep their id (i.e., DPCI, Walmart prime ID, etc.) consistent to help their replenishment system accurately project inventory requirements. Brands must understand this raw data level relationship when merging P.O.S. data against internal systems. Without careful consideration, the P.O.S. data could be duplicated since the retailer ID joins back against two different brand SKU’s.



Data Enrichment/Modeling

 After the data are clean, a series of data enrichment activities begin.  

  1.  Each retailer handles store file classifications differently. Target provides all locations in a single file. Walmart splits the values out of different report exports. All location types need a common language to see a unified, all-retailer view of the business
  2. Maintaining an accurate cross-reference table that aligns a brand’s item with current department classifications requires scanning the time-phased item master exports and dynamically selecting the most recent version every week. This step enables the brand to create a proper star-schema data model since item classifications are deduplicated.
  3. Tagging ‘top N’ fields for enhanced reporting can happen at various levels – database or reporting. For example, a brand might want to see top stores for all retailers and top stores for just Walmart. This step is easily handled at the database level, creating a simple flag on the location master file per use case. 
  4. Target and Walmart call “sales” different things. For example, Target uses the term “enterprise sales retail” while Walmart uses ‘sales retail $’. In order to roll up multi-retailer reporting, a consistent naming convention must happen to ensure the sales metrics all add up correctly. 
  5. Retailers employ different logic to seemingly universal fields. For example, let’s take a store number column. Retailer A might use five digits, while retailer B uses nine digits. When building a database to house this information, the store number values column is set up by data type and length. To create uniform values across retailers, a process of “padding” each store number value between retailers creates consistency in the reporting layer while satisfying the database requirements. 
Retail buyers expect brands to bring insights to the table at every meeting. Outsourcing the data extraction, enrichment, and modeling considerations reduces the need to hire in-house engineers. Hiring and maintaining this type of skillset uniquely differs from traditional FTE’s a brand is familiar with. For this reason, partnering with a third-party service like Krunchbox allows you to offload the complexity of building a P.O.S. analytics infrastructure and focus solely on driving value in the next meeting.