Reduce Your Cloud Storage Costs by Storing Files and Metadata in Parquet Files

Ever since the Parquet format came out over a decade ago, it has been very popular for analytics workloads. Its columnar layout enables massive-scale analytics while delivering strong, lossless compression. Various engines including Snowflake, Databricks, Synapse, SQL Server, and other databases I'm likely ignoring can all interact with Parquet. In its newer incarnations like Delta parquet, you can also update those files.


There is a notion of a transaction log for each Delta parquet table–it exists in the form of JSON files, and isn't as efficient as a single transaction log, especially for multi-table (or multi-file) transactions. It's not a replacement for an OLTP database, but for an analytics workload where you occasionally have to update something, it works.
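To illustrate, the Delta transaction log lives alongside the data as numbered JSON files in a `_delta_log/` directory, one per commit, with one JSON action per line. A sketch of what a commit file looks like (all values here are hypothetical, and real entries carry more fields):

```json
{"commitInfo": {"timestamp": 1700000000000, "operation": "WRITE"}}
{"add": {"path": "part-00000-abc.snappy.parquet", "partitionValues": {}, "size": 1024, "modificationTime": 1700000000000, "dataChange": true}}
```

Replaying these JSON actions in order is how a reader reconstructs the current state of the table, which is why the format is simple but comparatively chatty next to a database's single transaction log.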

What I'm writing about today has nothing to do with analytics, per se. It has everything to do with cloud storage, and the way operations there are priced. Specifically, metadata operations–in the demo code I've shared we're going from five files to one, but you can imagine going from a much larger number of files to a much smaller number. You may ask, "Joey, that sounds dumb–why are you reinventing zip and ISO files?" Well, the main reason is that many cloud operations are priced by the number of objects–for example, if you had to calculate a checksum across a number of files on S3 (for files/objects that were created before S3 computed checksums automatically).
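The arithmetic behind this is simple. Here's a sketch using a hypothetical per-request price (check your provider's current rate card–the numbers below are illustrative, not quoted from any price list), comparing a per-object operation across a million small files versus the same files packed five-to-one:

```python
# Hypothetical per-request price for illustration only.
price_per_1000_requests = 0.0004  # USD, e.g. a GET-class request tier

n_small_files = 1_000_000
n_packed_files = n_small_files // 5  # five originals per parquet, as in the demo

# One API request per object means the bill scales with object count.
cost_small = n_small_files / 1000 * price_per_1000_requests
cost_packed = n_packed_files / 1000 * price_per_1000_requests

print(f"${cost_small:.2f} vs ${cost_packed:.2f}")  # -> $0.40 vs $0.08
```

The absolute dollars look tiny at this tier, but the ratio is what matters: any operation billed per object gets five times cheaper with a five-to-one packing, and the savings scale with how aggressively you pack.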

So the notion of this code I wanted to play with was storing files within a parquet file. At first, I loaded five text files into a single parquet file. Then I added an index to the parquet file; thinking ahead, I also added a mapping parquet file, in order to support multiple parquet files with five files each. You can see the demo in this GitHub repo. This is pretty basic code, but the notion is clear: if you have a very large number of small files that you need to store in object storage, and you want to reduce that number (and potentially the storage volume), you can use parquet to do it.
