The Future of the Modern Data Stack looks excellent for data engineers. However, as a software engineer, I kind of feel left out. Where is the modern data stack for software engineers?
Marketing teams and data engineers need data to answer questions; software engineers need data to build features. This difference is why you’ll find that tools like Segment don’t have connectors for tools like Elasticsearch (a search engine) or Redis (a cache).
A business may use the modern data stack to ask better questions about what’s happening in its business, applications, and more. A modern data stack is critical today if you want to succeed, and the space is filling fast with new SaaS data products and tools.
Here, I’d like to present a slightly different data problem for a different audience: software engineers.
Software engineers leverage data infrastructure in a very different way. The tools aren’t Google Analytics and Clearbit, but Upstash and Supabase.
Engineers need to move data back and forth to build features and infrastructure that adds customer value.
Where are my tools to help me use code to move, process, or manipulate data between the systems in my application infrastructure? Today, I see a lot of one-off scripts, custom microservices, or tools that require me to scale a JVM.
The Data Integration Problem
I want to tell you about a problem that every software engineer experiences: the data integration problem.
Thanks to amazing tools like Heroku, Render, PlanetScale, Upstash, and Supabase, it’s getting easier than ever to acquire new data infrastructure.
data infrastructure — any system that generates or stores data.
Keep this definition in mind; it’s crucial.
In general, writing software is becoming more data-centric every day. Engineers commonly pull data from all sorts of places within (or outside) our infrastructure to build data-intensive applications.
Data-intensive applications are complex and made up of many systems like:
- multiple microservices
- caches
- databases
- event brokers
- data warehouses
- search engines
- log aggregation systems
- CRM
- analytics platforms
- … and third-party tools.
Our software systems contain many specialized tools that accelerate development and growth. These additional tools and platforms solve real problems and help teams move fast. But there is one catch.
If you zoom out a bit, we are slowly acquiring more and more specialized data infrastructure. A distributed data infrastructure means that our systems generate and consume data from more and more data stores.
If not appropriately managed, the number of “data tasks” will continue to increase, which means we will spend less time building features and more time integrating data.
I’m not sure this is what we want.
I keep asking myself: Is spending tons of time moving data around a valuable activity for software engineers?
Today, there are production-grade tools that software engineers can use to solve this problem, like Apache Kafka and Airflow. But deploying and managing these systems isn’t the greatest experience, and it requires people on your team whose only job is to keep them running.
I’d argue that “easy data movement for developers” is still a largely unsolved problem.
The data-centric developer mindset
I’m not sure this is even a problem that will go away. We will continue to use specialized tools that accelerate development and growth. In most cases:
Elasticsearch will always offer a better developer experience for search than MySQL.
Snowflake will always offer a better developer experience for data warehousing than PostgreSQL.
There will be no magic data store 🪄. We will forever be in a data ecosystem that won’t consolidate much, because data infrastructure will always involve design decisions that are good for one use case and poor for others.
With that being said, the data-centric mindset is becoming more common when building software.
With data at the forefront of system design, engineers who used to ask themselves, “What database will I use for this application?” will now ask, “How will this new application integrate with my data infrastructure?”
The next generation of applications will be built with a data-first mindset.
What is the data integration problem?
Now, we can look at this problem from a data-centric mindset. Data integration problems are tasks that take one of the following forms:
- Data in system A needs to get to system B.
- Data changes in A need to be continuously replicated into B.
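Stated in code, every task of this form reduces to the same tiny shape. Here is a minimal, hypothetical sketch in Go; the names are illustrative, not any real library’s API:

```go
package integration

import "context"

// Record is a generic unit of data moving between systems.
type Record struct {
	Key     []byte
	Payload []byte
}

// Source is anything we can read records from: a table, a topic, a file.
type Source interface {
	Read(ctx context.Context) (Record, error)
}

// Destination is anything we can write records to: a cache, an index, a warehouse.
type Destination interface {
	Write(ctx context.Context, rec Record) error
}

// Pipe moves records from src to dst until the context is cancelled or a
// read/write fails. A one-off copy and continuous replication differ only
// in whether Read ever stops producing records.
func Pipe(ctx context.Context, src Source, dst Destination) error {
	for {
		rec, err := src.Read(ctx)
		if err != nil {
			return err
		}
		if err := dst.Write(ctx, rec); err != nil {
			return err
		}
	}
}
```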
We can map a vast landscape of problems to these. For example:
- Log aggregation.
- Syncing data from PostgreSQL to Redis for caching (sketched in code after this list).
- Listening to changes from a PostgreSQL table and writing them to a data warehouse.
- Watching a file for changes and writing the changes to a database.
- Consuming data from a Kafka topic and writing it somewhere else.
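To make one of these concrete, here is roughly what the PostgreSQL-to-Redis example looks like as the kind of one-off script teams usually write. This is a sketch under assumptions: the connection strings, the `users` table, and its columns are all placeholders, and it uses the pgx and go-redis client libraries.

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/go-redis/redis/v8"
	"github.com/jackc/pgx/v4"
)

func main() {
	ctx := context.Background()

	// Placeholder connection string.
	pg, err := pgx.Connect(ctx, "postgres://user:pass@localhost:5432/app")
	if err != nil {
		log.Fatal(err)
	}
	defer pg.Close(ctx)

	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	defer rdb.Close()

	// Read every user and cache it by primary key.
	rows, err := pg.Query(ctx, "SELECT id, email FROM users")
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var email string
		if err := rows.Scan(&id, &email); err != nil {
			log.Fatal(err)
		}
		// No expiry: this cache is only as fresh as the last run of the script.
		if err := rdb.Set(ctx, fmt.Sprintf("user:%d", id), email, 0).Err(); err != nil {
			log.Fatal(err)
		}
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```

Easy to write, but notice everything it doesn’t handle: schema changes, deletes, retries, and keeping the cache fresh after the script exits.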
If you squint and tilt your head to the side, you’ll notice that all of these problems are moving data from one place to another. These problems aren’t specific to any particular industry; they apply to software engineering as a whole.
Some problems, such as the need for data warehousing, only show up as you scale; others, like streaming data from a log, are ubiquitous among software engineers.
We always code first, think later.
These problems all boil down to moving data from one place to another, yet we typically reach for a different tool, or build a custom one, for each. Moving data looks simple on the surface, mainly because it’s so convenient to write a small service that does exactly the data task you need.
But most will eventually find that:
- Datastores and schemas change and evolve over time.
- Managing real-time syncing between systems is 🥲.
- Relying on external data infrastructure (SaaS tools, external APIs) means depending on systems you can’t control.
Then, some may discover The Log and adopt Kafka. Kafka is an outstanding event-streaming broker, but it’s a massive addition to your infrastructure just to move data from one place to another.
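To be fair, the application code on the consuming side is pleasant. A minimal sketch using the segmentio/kafka-go client (the broker address, topic, and group name are placeholders) shows how little of it there is; the cost lives in deploying and operating the brokers behind it:

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

func main() {
	ctx := context.Background()

	// Placeholder broker, topic, and consumer group.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		GroupID: "sync-service",
		Topic:   "user-changes",
	})
	defer r.Close()

	for {
		msg, err := r.ReadMessage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		// “Writing it somewhere else” goes here: a cache, an index, a warehouse.
		log.Printf("key=%s value=%s", msg.Key, msg.Value)
	}
}
```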
What Now?
This is why we are working on a project called Conduit at Meroxa. We hope to change the experience software engineers have with data.
At a high level, Conduit is a data streaming tool written in Go. It aims to provide the best developer experience for building and running real-time data pipelines.
I’d love to know what you think, and I’d love to see more data tools for software engineers.
Thank you for reading. Have a beautiful day ☀️