Ever since the Parquet format came out over a decade ago, it has been popular for analytics workloads. Because it's columnar, it allows for massive-scale analytics while delivering strong, lossless compression. Various engines, including Snowflake, Databricks, Synapse, SQL Server, and other databases I'm likely forgetting, can all interact with Parquet. In newer incarnations like Delta Lake, you can also update those files.

Each Delta table has a transaction log; it exists in the form of JSON files, and isn't as efficient as a database's single transaction log, especially for multi-table (or file) transactions. It's not a replacement for an OLTP database, but for an analytics workload where you occasionally have to update something, it works.
What I'm writing about today has nothing to do with analytics, per se. It has everything to do with cloud storage and the way operations there are priced, specifically metadata operations. In the demo code I've shared we're going from five files to one, but you can imagine going from a much larger number of files to a much smaller number. You may ask, "Joey, that sounds dumb, why are you reinventing zip and ISO files?" Well, the main reason is that many cloud operations are priced on the number of objects: for example, if you had to calculate a checksum across a number of files on S3 (for files/objects that were created before S3 automatically did checksums).
So the idea I wanted to play with was storing files within a parquet file. First, I loaded five text files into a single parquet file. Then I added an index to that file, and, thinking ahead, a mapping parquet file to support multiple parquet files with five files each. You can see the demo in this GitHub repo. This is pretty basic code, but the idea is clear: if you have a very large number of small files that you need to store in object storage, and you want to reduce that number (and potentially the storage volume), you can use parquet to do it.
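If you want to see the shape of it without opening the repo, here's a minimal sketch of the pack-and-index idea using pyarrow. The column names, the bundle and mapping file names, and the helper functions are my own illustrative choices, not the repo's actual code:

```python
import pyarrow as pa
import pyarrow.parquet as pq
from pathlib import Path

def pack_files(paths, bundle_path):
    """Store each small file as one row: (name, content)."""
    table = pa.table({
        "name": [str(p) for p in paths],
        "content": [Path(p).read_bytes() for p in paths],  # binary column
    })
    pq.write_table(table, bundle_path)

# Pack five text files into one parquet "bundle" (illustrative names).
files = ["a.txt", "b.txt", "c.txt", "d.txt", "e.txt"]
bundle = "bundle-000.parquet"
pack_files(files, bundle)

# The mapping file records which bundle holds which original file,
# so the scheme scales past a single bundle.
pq.write_table(
    pa.table({"name": files, "bundle": [bundle] * len(files)}),
    "mapping.parquet",
)

def read_file(name):
    """Look up the file's bundle in the mapping, then pull just its row."""
    mapping = pq.read_table("mapping.parquet").to_pylist()
    bundle_path = next(r["bundle"] for r in mapping if r["name"] == name)
    row = pq.read_table(bundle_path, filters=[("name", "==", name)])
    return row.column("content")[0].as_py()

print(read_file("b.txt").decode())
```

A nice side effect of doing it this way: since parquet readers can push the name filter down to row-group statistics, pulling one file back out doesn't necessarily mean scanning the whole bundle, and columnar compression on the content column can shrink the total storage too.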