
Pride, Not Prejudice: Introducing The Data Processing Pipeline

Data integrity problems led to many inconvenient product issues during our online show. To catch these problems ahead of time, we needed a way of programmatically inspecting and processing vendor listing data.

[Image: Pride, Not Prejudice logo with an electrician's panel opened, exposing gears and wiring.]

Making Better Sense Of The Data

Leveraging the WooCommerce data importer was a good first step. However, we implicitly assumed we would receive consistent, machine-ready data, and that was an unreasonably high bar. Humans make mistakes. Depending on how the data was wrong, WooCommerce would still run the import, with unwanted results. We discovered at the last hour, or even during the show in some cases, that products were incorrectly listed: unavailable products, missing variations, incorrect stock and prices, and swapped descriptions, among other things.

We needed to provide a data format that was clearer to follow, and have a means of validating the data prior to attempting an import.

Our first draft of the spreadsheet format used the WooCommerce column names and expected values verbatim. However, many of our vendors did not have a technical background or were not familiar with the WooCommerce platform. The first improvement we made was changing some of the column names to use more recognizable vocabulary. We then took a stab at simplifying the variation product format.

Simplifying variation products

WooCommerce supports simple products and variation products. A simple product fits on a single row and is exactly that, a product with no options. A variation product has selection options, such as a shirt with sizes – these require a main row, and then related rows for each of the selection options.

Setting up variation products in the WooCommerce data format involved listing all of the possible values for a variation type in the main row. With the shirt size example, you would need to list sm, md, lg, xl, xxl. I understand why it's done this way: it allows the data importer to generate the internal metadata up front, without first needing to process the related variation rows that follow.

We did away with this list of variation values, so vendors only need to specify the respective variation value in each of the variation rows. Our first draft had also asked vendors to label each row as simple, variable, or variation. This was very much a WooCommerce thing, and we got rid of it in our revamped data format.
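As a rough illustration of what this simplification asks of the processing side, the list of variation values and the row labels can both be derived from the variation rows themselves. The sketch below is not the actual pipeline code. It assumes hypothetical column names (name, size, price) and assumes that rows belonging to the same product share the same name.

```python
def expand_variations(rows):
    """Infer row types and the parent's value list from simplified rows.

    Assumption: each row is a dict with hypothetical keys "name", "size",
    and "price"; rows of one product share the same "name".
    """
    # Group rows by product name, keeping the order in which names appear.
    groups = {}
    for row in rows:
        groups.setdefault(row["name"], []).append(row)

    expanded = []
    for name, members in groups.items():
        sizes = [m["size"] for m in members if m.get("size")]
        if not sizes:
            # No variation value anywhere: this is a simple product.
            expanded.append(
                {"type": "simple", "name": name, "price": members[0]["price"]}
            )
            continue
        # Main row: collect every distinct value, in first-seen order.
        unique_sizes = list(dict.fromkeys(sizes))
        expanded.append(
            {"type": "variable", "name": name, "sizes": ", ".join(unique_sizes)}
        )
        # One related row per selection option.
        for m in members:
            if m.get("size"):
                expanded.append(
                    {
                        "type": "variation",
                        "parent": name,
                        "size": m["size"],
                        "price": m["price"],
                    }
                )
    return expanded
```

With this kind of step in place, the vendor never has to type the full list of sizes or the simple, variable, and variation labels by hand.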

Data processing pipeline

To accept data in a format that differs from what WooCommerce expects, and to validate that data, we needed to build a pipeline that validates and transforms the vendor data.

This processing pipeline is a computer program that reads in the vendor listings Excel spreadsheet row by row, generates a data model for the vendor listings, and finally generates a WooCommerce-compliant CSV file ready for upload. This approach allowed me to process data that followed my simplified format, validate the data integrity at multiple stages, and generate a WooCommerce CSV with any of the options configured to my needs.

If any of the data wasn’t what was expected, the pipeline failed and notified me before an import to our catalog was even attempted.
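A minimal sketch of this validate-then-generate flow might look like the following. It is not the real pipeline: the Excel reading step is left out (rows are assumed to be already loaded as dicts), the input keys (name, price, stock) and the ListingError class are hypothetical, and the output header names are illustrative and should be checked against the importer's CSV schema.

```python
import csv
import io
from decimal import Decimal, InvalidOperation


class ListingError(Exception):
    """Raised when vendor rows fail validation; carries every problem found."""

    def __init__(self, problems):
        super().__init__("; ".join(problems))
        self.problems = problems


def validate_rows(rows):
    """Collect all problems first, then fail once with the full list."""
    problems = []
    # Start at 2 on the assumption that row 1 of the sheet is the header.
    for number, row in enumerate(rows, start=2):
        if not str(row.get("name", "")).strip():
            problems.append(f"row {number}: missing name")
        try:
            if Decimal(str(row.get("price", "")).strip()) < 0:
                problems.append(f"row {number}: price is negative")
        except InvalidOperation:
            problems.append(f"row {number}: price is not a number")
        if not str(row.get("stock", "")).strip().isdigit():
            problems.append(f"row {number}: stock must be a whole number")
    if problems:
        raise ListingError(problems)


def rows_to_csv(rows):
    """Validate the rows, then render them as CSV text."""
    validate_rows(rows)  # nothing is written if validation fails
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Name", "Regular price", "Stock"])
    writer.writeheader()
    for row in rows:
        writer.writerow(
            {
                "Name": str(row["name"]).strip(),
                "Regular price": str(row["price"]).strip(),
                "Stock": str(row["stock"]).strip(),
            }
        )
    return buffer.getvalue()
```

Collecting every problem before raising means a vendor can be sent one complete list of fixes, rather than discovering issues one failed run at a time.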

Most importantly, this data processing pipeline paved the road for future improvements toward automating the catalog population phase of Pride, not Prejudice. Up next, I turned my attention to flagging images that may not have been included in the vendor listing bundle, and to processing images that were much too big for websites.

Next in the series, Processing Your Product Photos