ETL and Dataclasses

Introduction

In Mastering Structural Pattern Matching I walked you through the theory of Structural Pattern Matching, so now it’s time to apply that knowledge and build something practical.

Let’s say you need to process data from one system (a JSON-based REST API) into another (a CSV file for use in Excel). A common task. Extracting, Transforming, and Loading (ETL) data is one of the things Python does especially well, and with pattern matching you can simplify and organize your business logic in such a way that it remains maintainable and understandable.

Let’s get some test data. For this you’ll need the requests library.

>>> import requests
>>> resp = requests.get('https://demo.inspiredpython.com/invoices/')
>>> assert resp.ok
>>> data = resp.json()
>>> data[0]
{'recipient': {'company': 'Trommler',
               'address': 'Annette-Döring-Allee 5\n01231 Grafenau',
               'country_code': 'DE'},
 'invoice_id': 15134,
 'currency': 'JPY',
 'amount': 945.57,
 'sku': 'PROPANE-ACCESSORIES'}

Objectives

The data – feel free to use the demo URL provided in the example above – is a list of invoices for our fictional company that sells propane (and propane accessories.)

As part of any serious ETL process, you must consider the quality of the data. For this, I want to flag entries that may require human intervention:

  1. Find mismatched payment currencies and country codes. For instance, the example above lists the payment currency as JPY, but the country code is German (DE).

  2. Ensure the invoice IDs are unique and that they are all integers less than 50000.

  3. Map each invoice to a dedicated Invoice dataclass, and each invoice recipient to a Company dataclass.
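Objective 3 could be sketched with two plain dataclasses whose fields mirror the JSON keys shown above. This is only a sketch: the `Invoice` and `Company` names come from the objectives, and `parse_invoice` is a hypothetical helper for turning one raw record into dataclass instances.

```python
from dataclasses import dataclass


@dataclass
class Company:
    company: str
    address: str
    country_code: str


@dataclass
class Invoice:
    invoice_id: int
    currency: str
    amount: float
    sku: str
    recipient: Company


def parse_invoice(record: dict) -> Invoice:
    # Pull the nested recipient dict out into its own dataclass first.
    recipient = Company(**record["recipient"])
    return Invoice(
        invoice_id=record["invoice_id"],
        currency=record["currency"],
        amount=record["amount"],
        sku=record["sku"],
        recipient=recipient,
    )
```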

Then,

  1. Write the quality-assured invoices to a CSV file.

  2. Everything that fails those tests is flagged and written to a separate CSV file for manual review.

An important note though.

In a real application there would be a validation layer that checks the input data for obvious errors, like integers in a string field, or missing fields. For brevity I won’t include that part, but you should use a package like marshmallow or pydantic to formalize the contract you (the consumer) have with the data producer(s) you interface with, so you can catch (and act on) these mistakes.

For the sake of argument, let’s assume the input data meets these basic standards. But it is not the job of a library like marshmallow to validate that, say, the country code matches the payment currency.
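To make the contract idea concrete, here is a minimal hand-rolled sketch of the kind of structural checks such a validation layer performs. In practice marshmallow or pydantic would do this declaratively; the field names mirror the JSON shown earlier, and `check_contract` is a hypothetical helper.

```python
# The expected shape of one raw invoice record. The field names and
# types mirror the JSON from the demo API above.
REQUIRED_FIELDS = {
    "invoice_id": int,
    "currency": str,
    "amount": float,
    "sku": str,
    "recipient": dict,
}


def check_contract(record: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"{field} is not a {expected_type.__name__}")
    return problems
```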