Medallion Architecture + Meroxa: Easily Work with Massive Amounts of Data

In today's data-driven world, the challenges of processing and analyzing large amounts of data continue to grow. Traditional data architectures take time to implement and don’t meet the needs of on-demand analytics. Many organizations have created their own ways of logically organizing data as it is processed to help address these ever-increasing challenges; one such solution is the Medallion Architecture from Databricks.

The Medallion Architecture is a design pattern used to logically organize data in a Data Lakehouse, with the goal of progressively improving the overall quality of the data. It uses the Delta Lake framework to logically organize the data into three layers: Bronze, Silver, and Gold. At each layer, the data is refined to make the curated business-level tables more accessible, accurate, and actionable.

In this blog post, we demonstrate how you can implement the Medallion Architecture using Meroxa and Turbine to streamline analytics and make it easier to work with large amounts of data.

💡 You can read the Databricks primer here to learn more about the Medallion Architecture and how it improves the data.

What is Meroxa?

Meroxa is a Stream Processing Application Platform as a Service (SAPaaS) where developers can run and scale their Turbine apps using cloud-native best practices. Turbine is Meroxa’s stream processing application framework for building event-driven apps that respond to data in real time.

Meroxa handles the underlying streaming infrastructure so developers can focus on building their applications. Turbine applications start with an upstream resource. Once that upstream resource is connected, Meroxa will take care of streaming the data into the Turbine application so it can be run. 

Since Meroxa is a developer-first platform, engineers can ingest, orchestrate, transform, and stream data to and from anywhere using languages they already know, such as Go, JavaScript, Python, or Ruby. Support for Java and C# is on the way.

💡 Meroxa has support for many source and destination resources. You can see which resources are supported here. If there's a resource not listed, you can request it by joining our community or by writing to support@meroxa.com. Meroxa is capable of supporting **any** data resource as a connector.

Overview

We want to implement the Medallion Architecture using Meroxa and Turbine, moving our data from Bronze, to Silver, and ultimately to business-level Gold data stores. To accomplish this we will use the following resources, all managed by the Meroxa platform:

  • Bronze: PostgreSQL will hold our raw, unfiltered data ingested from various sources

  • Silver: Snowflake will store our intermediate cleaned and enriched data; valuable but not yet 100% business-ready

  • Gold: Amazon Web Services (AWS) S3 will be where our business-ready data lives, normalized and stored in the Delta Table format

As a bonus, we will also set up logging using Sentry, an error tracking and monitoring platform, to catch and report any exceptions that come up when writing our data. 

Visually, our application will look like this:

[Diagram: PostgreSQL (Bronze) → Turbine app with enrichment function → Snowflake (Silver) and S3-backed Delta Table (Gold)]

Let’s get to the code

But first…

Before we can begin we will need to set up a few things. 

First, on the Meroxa Platform we will need both a PostgreSQL resource and a Snowflake resource. Using the documentation we can set up our Bronze PostgreSQL and Silver Snowflake resources. 

Second, we will need to set up the S3 bucket that will serve as our Delta Table resource. Although we will not need to add our S3 bucket to the Meroxa Platform in this particular example, we will still need to set up access permissions as though we were. We can find those permissions in our documentation.

In addition to setting up our resources we will also need to gather a few extra bits of information. We need to set up the following environment variables:

| Environment Variable | Description |
| --- | --- |
| `AWS_ACCESS_KEY_ID` | AWS access key for the user accessing the buckets |
| `AWS_SECRET_ACCESS_KEY` | AWS secret for the user accessing the buckets |
| `AWS_REGION` | Region the bucket was created in |
| `AWS_URI` | The URI of the bucket (e.g., `s3://bucket-name/key-name`) |
| `SENTRY_DSN` | Sentry Data Source Name (DSN) used to upload logs and errors |
| `GOOGLE_API_KEY` | API key used to access the Google Address Validation API |

We will address each one of these variables as we walk through our example. 

Writing our Turbine App

Our Turbine application has three main tasks: read our raw data from our Bronze data source, transform it for our Silver intermediate data store, and ultimately write to our Delta Table in our Gold destination.
To start, we first need to retrieve our raw data. Turbine enables us to stream rows, referred to as Records in Turbine, from our Bronze database: in this case, the “employees” table of our PostgreSQL database.
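Here is a minimal sketch of that read, modeled on the turbine-py app template; the resource name `pg_db` is a placeholder for whatever name you gave your PostgreSQL resource when you created it:

```python
from turbine.runtime import Runtime


class App:
    @staticmethod
    async def run(turbine: Runtime):
        # Connect to the Bronze PostgreSQL resource registered on the Meroxa Platform
        source = await turbine.resources("pg_db")

        # Stream rows from the "employees" table as Turbine Records
        records = await source.records("employees")
```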

Taking a quick look at our raw data, we can see that Turbine has formatted it as JSON. 
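A single Record’s payload might look something like this (the employee fields here are illustrative, not from a real schema):

```json
{
  "id": 42,
  "name": "Jane Doe",
  "postcode": "94107"
}
```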
 
With our raw data in hand, our next step is to enrich said data. For this example we want to translate the postcode on each record to latitude and longitude using the Google Address Validation API. This can be accomplished by using the Requests library to call the Address Validation API and obtain our enhancement.

Notice that nothing in this code is specific to Turbine; it can be run on its own, as is.
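A sketch of that enrichment logic, assuming the Address Validation API’s documented request and response shape (the hard-coded `regionCode` of `"US"` and the function name are our own):

```python
import os

import requests

GOOGLE_API_KEY = os.environ["GOOGLE_API_KEY"]
VALIDATION_URL = "https://addressvalidation.googleapis.com/v1:validateAddress"


def postcode_to_lat_long(postcode: str) -> dict:
    # Ask the Address Validation API to geocode the given postcode
    response = requests.post(
        VALIDATION_URL,
        params={"key": GOOGLE_API_KEY},
        json={"address": {"regionCode": "US", "postalCode": postcode}},
        timeout=10,
    )
    response.raise_for_status()

    # The validated address comes back with a geocode containing lat/long
    location = response.json()["result"]["geocode"]["location"]
    return {"latitude": location["latitude"], "longitude": location["longitude"]}
```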
Another great feature of Turbine is that it allows us to abstract any logic we want into a separate module so we can keep our code neat and organized. Here we have chosen to move our enrichment logic into a module called enrich.py.

We extract the postcode from each Record, a JSON representation of a row in our Bronze database, and enrich it using our previously created logic. Once we’ve set up our processing logic, we can use Turbine to execute it and write to our Silver database in only three lines of code.
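Something like the following, continuing inside `App.run` from our earlier sketch (the `enrich_records` helper in enrich.py, which applies `postcode_to_lat_long` to each Record, and the resource name `snowflake_db` are our own):

```python
        # Run our enrichment logic against every Record streamed from Bronze
        enriched = await turbine.process(records, enrich.enrich_records)

        # Connect to the Silver Snowflake resource and write out the enriched Records
        silver = await turbine.resources("snowflake_db")
        await silver.write(enriched, "employees_enriched")
```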
But we still need to write to our Gold destination: our Delta Lake. Here we are using delta-rs, a Rust library with Python bindings, to initialize and write to our Delta Table. Like our enrichment logic from before, this code contains nothing Turbine-specific and can be run on its own. In addition to our Delta writing logic, we also use the Sentry SDK to log any errors we may encounter.

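Here is a sketch of that writer, assuming delta-rs’s Python package (`deltalake`) and its `write_deltalake` helper; note that delta-rs needs the `AWS_S3_ALLOW_UNSAFE_RENAME` option for S3 writes unless a locking provider is configured:

```python
import os

import pandas as pd
import sentry_sdk
from deltalake.writer import write_deltalake


def write_to_delta(rows: list) -> None:
    # Credentials and region for the S3 bucket backing our Delta Table
    storage_options = {
        "AWS_ACCESS_KEY_ID": os.environ["AWS_ACCESS_KEY_ID"],
        "AWS_SECRET_ACCESS_KEY": os.environ["AWS_SECRET_ACCESS_KEY"],
        "AWS_REGION": os.environ["AWS_REGION"],
        # Required for S3 writes unless a locking provider is configured
        "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
    }

    try:
        # Append the enriched rows to the Delta Table at AWS_URI
        write_deltalake(
            os.environ["AWS_URI"],
            pd.DataFrame(rows),
            mode="append",
            storage_options=storage_options,
        )
    except Exception as exc:
        # Report the failure to Sentry before re-raising
        sentry_sdk.capture_exception(exc)
        raise
```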

Putting It All Together

All that’s left now is to piece it all together in our `main.py` module. Here we see how the Turbine framework helps us connect our source, our enrichment logic, and finally our destinations. Remember the bonus logging we mentioned before? In our complete Turbine application we’ve added a call to the sentry_sdk initialization function. Although Turbine handles execution of your code for you, you are more than welcome to bring your own logging tools for that extra bit of insight into how your code is performing.
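A condensed sketch of the finished `main.py` (module, function, and resource names are our own; since S3 is not registered as a Meroxa resource here, the Gold write runs inside a second process step):

```python
import os

import sentry_sdk
from turbine.runtime import Runtime

import delta_writer  # our delta-rs logic for the Gold Delta Table
import enrich        # our postcode -> latitude/longitude enrichment

# Bonus logging: initialize Sentry before any processing runs
sentry_sdk.init(dsn=os.environ["SENTRY_DSN"])


class App:
    @staticmethod
    async def run(turbine: Runtime):
        # Bronze: stream raw rows from the "employees" table in PostgreSQL
        source = await turbine.resources("pg_db")
        records = await source.records("employees")

        # Enrich each Record, then write the result to the Gold Delta Table
        enriched = await turbine.process(records, enrich.enrich_records)
        finished = await turbine.process(enriched, delta_writer.write_records)

        # Silver: write the enriched Records to Snowflake
        silver = await turbine.resources("snowflake_db")
        await silver.write(finished, "employees_enriched")
```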

Deploying our Application

Now let's get our application up and running. Using git and the Meroxa CLI, we will run the following commands:

```shell
$ git add .
$ git commit -m "Initial Commit"
$ meroxa apps deploy
```
💡 For more information on deployment, you can refer to the Meroxa Docs here.

While we wait, the Meroxa Platform is hard at work wiring all of our resources together: connecting our Bronze source to our function, and our function to our Silver and Gold destinations.

Once our application is deployed, we will see that every record already in our Bronze source has been written to our Silver and Gold destinations, enriched along the way. The running Turbine application will continue to process new records as they are written to our Bronze source.

Meroxa sets up the complex connections and polling logic and lets us focus on the real fun part: writing code.

Final Thoughts

The Medallion Architecture and the Delta Lake framework combine into an incredibly powerful way to organize and augment our data. However, a lot of time and effort is often spent setting up the infrastructure needed to even begin making use of this powerful combination.

With Meroxa and Turbine we no longer need to concern ourselves with this complex overhead and instead we can focus on the logic that does the heavy lifting. 

We’ve seen that with Meroxa and Turbine we are able to:
  • Stream raw data from our Bronze PostgreSQL data source
  • Augment our data using whatever logic we want and any libraries or APIs we may need
  • Warehouse our augmented intermediate data in Snowflake
  • Ultimately write our data into an AWS S3-backed Delta Table, ready to be consumed

And we did it all without having to set up any extra infrastructure or streaming logic. 

If you're interested in learning more about Meroxa, be sure to check out our documentation and Discord community. We support a wide range of source and destination resources, and you can use languages you already know to ingest, orchestrate, transform, and stream data to and from anywhere.

Thanks for reading, and we hope this post was helpful in your data-driven journey!


Topics: Meroxa, Turbine, Medallion Architecture, Databricks