Designing sustainable data pipelines

Reading time: 5 minutes

Published: 06.07.2022

Data&AI

Green software engineering has become increasingly popular in the last couple of years. However, there has been little discussion about designing and building sustainable data pipelines. In this blog post, I will share my thoughts and experiences with designing and developing data pipelines in a more environmentally friendly way.

But is it necessary to think about the environmental impact of your code? By some estimates, as of 2021 cloud computing has a greater environmental impact than the whole airline industry. As the computing power in the world’s data centers keeps expanding, and its environmental impact with it, it is becoming more and more critical to design your data processing efficiently. More efficient data processing pipelines may also bring additional benefits, such as reduced cloud infrastructure costs.

Do you process the same data every day?

Designing data pipelines that process all your data every night is easy. And there might not be any performance issues as you have all night to process the data. This is often the approach used in traditional data warehouses that always build your fact and dimension tables from scratch whenever you are updating your data. Most of the time you are processing the same data again and again. That certainly won’t be the most efficient usage of computing resources.

By spending a little more effort to build incremental data pipelines that process only the changed data instead of all data, you can save a lot of the time and resources used to process the data. Making that effort up front may also bring other benefits, such as faster load times and better performance.
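One common way to build such an incremental load is a high-watermark pattern: remember the latest modification timestamp you have processed, and on the next run pick up only the rows changed after it. The sketch below is a minimal illustration in plain Python, not a production implementation. The `modified_at` field, the sample rows, and the `select_changed_rows` helper are hypothetical names chosen for this example, and it assumes your source reliably records a modification timestamp.

```python
from datetime import datetime


def select_changed_rows(rows, last_watermark):
    """Return the rows modified after last_watermark and the new watermark."""
    changed = [row for row in rows if row["modified_at"] > last_watermark]
    # If nothing changed, keep the old watermark so the next run starts from the same point.
    new_watermark = max((row["modified_at"] for row in changed), default=last_watermark)
    return changed, new_watermark


# Hypothetical source data: in practice this would be a table or a set of files.
source_rows = [
    {"id": 1, "modified_at": datetime(2022, 7, 1, 8, 0)},
    {"id": 2, "modified_at": datetime(2022, 7, 5, 9, 30)},
    {"id": 3, "modified_at": datetime(2022, 7, 6, 2, 15)},
]

# Watermark stored by the previous run (hypothetical value).
watermark = datetime(2022, 7, 5, 0, 0)

changed, watermark = select_changed_rows(source_rows, watermark)
print([row["id"] for row in changed])  # only rows 2 and 3 are processed
```

In a real pipeline the watermark would be persisted between runs (for example in a small control table), and the filter would be pushed down to the source query so that unchanged rows are never read at all.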

Use your resources efficiently

When choosing between Infrastructure as a Service (IaaS) and Platform as a Service (PaaS), you could also assess the environmental impact. With IaaS you dedicate a set of resources to be used only by you, but with most PaaS solutions the same resources can be shared by multiple end-users, resulting in more efficient usage of resources. For example, you might have a Databricks cluster that processes the same data hourly. Even if you turn the cluster off when no jobs are running, you spend additional time starting up the cluster and installing packages each time the cluster starts.
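To get a feel for how much that startup overhead can matter, here is a small back-of-the-envelope calculation. All numbers are made up for illustration (a 5-minute startup, a 10-minute job, 24 runs per day); measure your own cluster before drawing conclusions. The `startup_overhead` helper is a hypothetical name for this sketch.

```python
def startup_overhead(startup_minutes, job_minutes, runs_per_day):
    """Return the share of cluster time spent on startup and the daily startup minutes."""
    share = startup_minutes / (startup_minutes + job_minutes)
    daily_startup_minutes = startup_minutes * runs_per_day
    return share, daily_startup_minutes


# Hypothetical figures: replace them with measurements from your own environment.
share, daily_minutes = startup_overhead(startup_minutes=5, job_minutes=10, runs_per_day=24)
print(f"{share:.0%} of cluster time is startup, {daily_minutes} minutes per day")
```

With these assumed figures, a third of the cluster's running time does no useful work, which is the kind of overhead a shared or serverless service can avoid.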

Sometimes running your data processing with dedicated resources is mandatory because of data volumes, regulation, or security considerations. If you need to run clusters, virtual machines, or other dedicated resources, sizing and scaling those resources correctly is the key.

Don’t use a cannon to kill a fly

There are many ways to implement data pipelines in the cloud. You could use virtual machines, SQL servers, serverless functions, Spark clusters, or any other computing options. One thing that I see often is that companies choose to use methods that are popular even when there is no real need for them.

One common choice is cluster computing, usually in the form of Spark/Databricks, even for light workloads. What you are doing there is generating a lot of overhead in terms of excess capacity and inefficient usage of resources.

These kinds of light workloads could be processed more efficiently with serverless and/or shared resources such as Azure Functions, Synapse Serverless SQL, or even simple Azure SQL databases. The same thinking can be applied to your data: consider whether you really need the finest granularity of data and whether you need to store all of the historical data forever.
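For the historical data question, a simple age-based retention rule is often a reasonable starting point. The sketch below classifies datasets by how long ago they were last accessed. The thresholds (90 days and 730 days), the dataset names, and the `storage_tier` helper are all hypothetical, and your own retention rules may be dictated by regulation rather than by usage.

```python
from datetime import date


def storage_tier(last_accessed, today, cold_after_days=90, delete_after_days=730):
    """Suggest a storage tier based on the number of days since last access."""
    age_days = (today - last_accessed).days
    if age_days >= delete_after_days:
        return "delete"
    if age_days >= cold_after_days:
        return "cold"
    return "hot"


# Hypothetical datasets with the date each one was last read.
last_access = {
    "sales_2022": date(2022, 7, 1),
    "sales_2021": date(2021, 12, 1),
    "sales_2018": date(2019, 1, 15),
}

today = date(2022, 7, 6)
tiers = {name: storage_tier(accessed, today) for name, accessed in last_access.items()}
print(tiers)
```

Cloud storage services usually offer lifecycle policies that apply this kind of rule automatically, so in practice you would configure the policy rather than run your own script.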

Think efficiency when choosing your programming language

Did you know that there are huge differences in the energy efficiency of programming languages? For example, Python and C# have roughly a 20x difference in energy efficiency. This is not entirely black and white, as data processing with Python is usually done with libraries that are written in more efficient languages like C++, but it is still an important thing to factor in.

How to minimize the impact

After reading this long rant about things that you might be doing wrong, I want you to focus your attention on what you can do to minimize the impact. I think architecture design is a key area that is often overlooked when starting to build data platforms and pipelines in the cloud. Often there is one single solution that is used to tackle all the problems instead of building a flexible architecture that supports multiple different types of data processing solutions.

Luckily there are solutions in the cloud that help to store data in a way that is easily accessible by a variety of tools and design patterns that can lead to efficient processes. Methodologies such as Data Mesh that impose ownership and responsibilities for data and efficient data storage solutions like data lakes with compressed and columnar file types (e.g. Parquet) will help organizations to avoid siloing of data. These, if applied correctly, will help to maintain a single efficient dataset and avoid the processing of the same data again and again in different parts of the organization.

In Azure, Synapse Analytics offers multiple different types of processing options and can be an easy starting point that you can expand from. For example, you could use Synapse Pipelines (shared resources) to land your data in a data lake and process it with Serverless SQL. This can then be expanded with other processing methods such as Synapse Spark, Azure Functions, and others. Choosing architecture patterns that are not limited to a single solution will help to minimize your costs and, at the same time, your environmental impact.

To make it clearer, I have summarized my thoughts in a couple of bullet points:

Don’t process the same data over and over again; build incremental processing instead
Use Spark when there is a good use case for it; don’t use a cannon to kill a fly
PaaS services often have better utilization of resources than IaaS solutions
Move unused data into cold storage or remove it altogether

Can you measure the impact?

Did you know that you can use the Azure emissions impact dashboard to calculate the greenhouse gas emissions related to your cloud usage? With the dashboard, you can see which resource types have the most impact on your Azure environment as well as how much emissions you are saving compared to on-premises alternatives.

If you’d like to assess usage in a multi-cloud environment, or you are not using Azure, you can also use the Cloud Carbon Footprint tool, an open-source tool for measuring and visualizing CO2 impact.
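If you just want a rough order of magnitude before setting up either tool, a simplified estimate multiplies the energy used by the data center's power usage effectiveness (PUE) and the carbon intensity of the local electricity grid. The sketch below is an assumption-laden illustration of that general idea, not the actual calculation used by the Azure dashboard or by Cloud Carbon Footprint. The input values and the `estimate_emissions_kg` helper are hypothetical.

```python
def estimate_emissions_kg(energy_kwh, pue, grid_kg_co2e_per_kwh):
    """Rough operational emissions estimate: energy x PUE x grid carbon intensity."""
    return energy_kwh * pue * grid_kg_co2e_per_kwh


# Hypothetical inputs: look up real figures for your provider and region.
monthly_kg = estimate_emissions_kg(energy_kwh=100, pue=1.2, grid_kg_co2e_per_kwh=0.3)
print(f"Estimated emissions: {round(monthly_kg, 1)} kg CO2e")
```

An estimate like this ignores embodied emissions from hardware manufacturing and depends heavily on the assumed coefficients, so treat it as a way to compare design options rather than as a reported figure.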

Data center efficiency is often related to scale, and the hyperscale data centers operated by Microsoft and other cloud vendors are considered the most efficient. Efficiency can often be improved further by utilizing the waste heat of the data center. Future Azure data centers in Espoo and Kirkkonummi will be pioneers in utilizing waste heat at scale. Transparency about emissions is critical; I have linked the Wired article that discusses this in more detail.