Petastorm: A Simple Approach to Deep Learning Models in Apache Parquet Format

Jan 18, 2021

Learn how to generate a Petastorm dataset that is compatible with different machine learning frameworks, analyze and manipulate the dataset, and more.

Petastorm is an open-source data access library that enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and from datasets already loaded as Apache Spark DataFrames. As Andrey, a U.S.-based Python engineer, notes, it supports popular Python-based machine learning (ML) frameworks including TensorFlow, PyTorch, and PySpark. For more information about Petastorm, refer to the Petastorm GitHub page and Petastorm API documentation.

Because a Petastorm reader can also be used from pure Python code, where decoded fields arrive as NumPy values, the same dataset can feed TensorFlow, PyTorch, and PySpark pipelines on a single machine or on a cluster. This makes it a practical choice for training and evaluating deep learning models on Parquet-formatted datasets.

The article will take you through:

Generating a Petastorm dataset that is compatible with different ML frameworks
Analyzing and manipulating the dataset
Parallelizing data loading and decoding operations

What Are Some Petastorm Features?

To support different training scenarios for autonomous driving algorithms, Petastorm provides efficient implementations of data sharding, row filtering, shuffling, and access to a subset of fields. It also supports time-series data through n-grams, which are windows of consecutive rows.
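To make these terms concrete, here is a minimal sketch in plain Python of what sharding, field selection, and n-gram windows mean. It is only a conceptual illustration on an in-memory list: the function names, the row fields (`ts`, `speed`, `label`), and the threshold are assumptions made for this example, and Petastorm itself performs these operations on Parquet data rather than on Python lists.

```python
from typing import Any, Dict, Iterator, List, Sequence

Row = Dict[str, Any]


def shard(rows: Sequence[Row], cur_shard: int, shard_count: int) -> List[Row]:
    """Keep every shard_count-th row, starting at index cur_shard."""
    return [row for i, row in enumerate(rows) if i % shard_count == cur_shard]


def select_fields(row: Row, fields: Sequence[str]) -> Row:
    """Return only the requested fields of a row."""
    return {name: row[name] for name in fields}


def ngrams(rows: Sequence[Row], length: int, timestamp_field: str,
           delta_threshold: float) -> Iterator[List[Row]]:
    """Yield windows of `length` consecutive rows whose neighbouring
    timestamps are at most `delta_threshold` apart."""
    for start in range(len(rows) - length + 1):
        window = list(rows[start:start + length])
        gaps = (b[timestamp_field] - a[timestamp_field]
                for a, b in zip(window, window[1:]))
        if all(gap <= delta_threshold for gap in gaps):
            yield window


# Hypothetical rows: a timestamp, a sensor reading, and a label.
rows = [{"ts": t, "speed": 10 + t, "label": t % 2} for t in (0, 1, 2, 10, 11)]

# Worker 0 of 2 sees only its own share of the rows.
worker_rows = shard(rows, cur_shard=0, shard_count=2)
# Read a subset of fields instead of whole rows.
speeds_only = [select_fields(r, ["ts", "speed"]) for r in rows]
# Pairs of consecutive rows; the jump from ts=2 to ts=10 breaks the sequence.
windows = list(ngrams(rows, length=2, timestamp_field="ts", delta_threshold=1))
```

The gap check is what keeps a window from spanning two unrelated stretches of a recording, which matters when rows are grouped by test run.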

What Is the Structure of a Typical Dataset?

A typical dataset contains:

Multiple columns containing signals acquired by sensors, such as cameras, radars, and lidars, during autonomous vehicle test runs.
Manually generated labels that are stored as fields in a row.

The rows in a typical dataset are sorted in chronological order and grouped by test run. A typical row size ranges between 30 and 100.

Generating a Petastorm Dataset That Is Compatible With Different ML Frameworks

To generate a dataset with Petastorm, you first define a Unischema, which is simply a data schema. This is the only place where you need to define the schema, since Petastorm translates it into all supported framework formats, including TensorFlow, pure Python, and PySpark.
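As a rough sketch of what this can look like, the example below follows the pattern shown in the Petastorm README: a Unischema is declared once and then used to write a small Parquet dataset with Spark. The schema name, field names, shapes, output URL, and row contents are all hypothetical, and the code assumes that petastorm, pyspark, and numpy are installed.

```python
# Hypothetical shapes; None marks a dimension whose size varies per row.
CAMERA_SHAPE = (128, 256, 3)
LIDAR_SHAPE = (None, 4)


def build_schema():
    """Build an illustrative Unischema (all field names are hypothetical)."""
    # Imported lazily so this module loads even without the heavy dependencies.
    import numpy as np
    from petastorm.codecs import CompressedImageCodec, NdarrayCodec, ScalarCodec
    from petastorm.unischema import Unischema, UnischemaField
    from pyspark.sql.types import IntegerType

    return Unischema("DrivingRowSchema", [
        UnischemaField("id", np.int32, (), ScalarCodec(IntegerType()), False),
        UnischemaField("camera", np.uint8, CAMERA_SHAPE, CompressedImageCodec("png"), False),
        UnischemaField("lidar", np.float32, LIDAR_SHAPE, NdarrayCodec(), False),
    ])


def write_dataset(output_url="file:///tmp/driving_dataset", rows_count=10):
    """Write a few random rows as a Petastorm dataset (local Spark session)."""
    import numpy as np
    from petastorm.etl.dataset_metadata import materialize_dataset
    from petastorm.unischema import dict_to_spark_row
    from pyspark.sql import SparkSession

    schema = build_schema()
    spark = SparkSession.builder.master("local[2]").getOrCreate()

    def make_row(i):
        # Random stand-ins for real sensor data.
        return {
            "id": i,
            "camera": np.random.randint(0, 255, dtype=np.uint8, size=CAMERA_SHAPE),
            "lidar": np.random.rand(100, 4).astype(np.float32),
        }

    # materialize_dataset adds the Petastorm metadata, including the schema,
    # to the Parquet store when the block exits. 256 is the row group size in MB.
    with materialize_dataset(spark, output_url, schema, 256):
        rows_rdd = (spark.sparkContext.parallelize(range(rows_count))
                    .map(make_row)
                    .map(lambda row: dict_to_spark_row(schema, row)))
        (spark.createDataFrame(rows_rdd, schema.as_spark_schema())
         .write.mode("overwrite").parquet(output_url))
```

Calling `write_dataset()` would create the dataset under the hypothetical `/tmp/driving_dataset` directory. The codec attached to each field controls how that field is encoded on disk, for example PNG compression for the image field.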

A path to the dataset is all you need to read it back: the Unischema instance is serialized as a custom field in the Parquet store metadata, so a reader can recover the schema from the dataset itself.
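The sketch below illustrates this with Petastorm's `make_reader`, which takes only a dataset URL and yields rows as named tuples. The helper names, the `id` field, and the dataset location are assumptions for this example; the dataset is expected to have been written with Petastorm beforehand.

```python
from itertools import islice
from pathlib import Path


def local_dataset_url(path):
    """Turn a local directory path into a file:// dataset URL."""
    return Path(path).absolute().as_uri()


def read_ids(dataset_url, limit=5):
    """Return the `id` field of the first few rows (field name is hypothetical)."""
    # Imported lazily so the helper above works without Petastorm installed.
    from petastorm import make_reader

    # No schema object is passed in: the reader loads the Unischema that was
    # stored with the dataset. schema_fields restricts decoding to one field.
    with make_reader(dataset_url, schema_fields=["id"]) as reader:
        return [row.id for row in islice(reader, limit)]
```

A call such as `read_ids(local_dataset_url("/tmp/driving_dataset"))` would then return up to five ids. Their order may vary between runs, since the reader shuffles row groups by default.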

Analyzing and Manipulating the Dataset

Because the dataset is stored in the Parquet format, which Spark supports natively, you can analyze and manipulate it with the standard Spark tools.
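For instance, a dataset directory can be opened as an ordinary Spark DataFrame and queried with the usual DataFrame operations. The sketch below assumes pyspark is installed and that the dataset has a scalar column named `label`, which is a hypothetical name. Fields stored through a codec, such as compressed images, would likely appear here in their encoded binary form, so this kind of query is most useful on scalar columns.

```python
def count_rows_per_label(dataset_path, label_column="label"):
    """Count rows for each value of a scalar column using plain Spark."""
    # Imported lazily so this module loads even without pyspark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.read.parquet(dataset_path)

    # groupBy(...).count() adds a column named "count" to the result.
    counts = df.groupBy(label_column).count().collect()
    return {row[label_column]: row["count"] for row in counts}
```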

Parallelizing Data Loading and Decoding Operations

Petastorm offers two strategies for parallelizing data loading and decoding operations:

Thread pool implementation
Process pool implementation

The right choice depends on the kind of data you want to read.

In a typical scenario, as Andrey illustrates in his project, “Machine Learning Model: Python Sklearn & Keras,” the thread pool strategy is used when rows contain encoded, high-resolution images. In this case, most of the processing time is spent decoding the images in C++ code, which runs without holding the Python Global Interpreter Lock (GIL), so the threads can work in parallel.

The process pool strategy, on the other hand, is more appropriate when row sizes are small. In this case, most of the processing is done in pure Python code, so multiple processes must run in parallel to overcome the serialization of execution imposed by the GIL.
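In code, the choice comes down to the `reader_pool_type` argument of `make_reader`, which accepts `'thread'` or `'process'`. The sketch below encodes the rule of thumb described above; the helper names and the boolean flag are assumptions for this example, not part of Petastorm.

```python
def choose_pool_type(rows_hold_encoded_images: bool) -> str:
    """Pick a reader pool type following the rule of thumb above."""
    # Image decoding happens in native code that releases the GIL, so threads
    # scale well. Small rows handled in pure Python need separate processes.
    return "thread" if rows_hold_encoded_images else "process"


def open_reader(dataset_url, rows_hold_encoded_images, workers_count=4):
    """Open a Petastorm reader with the matching pool type."""
    # Imported lazily so choose_pool_type works without Petastorm installed.
    from petastorm import make_reader

    return make_reader(
        dataset_url,
        reader_pool_type=choose_pool_type(rows_hold_encoded_images),
        workers_count=workers_count,
    )
```

The returned reader can be used as a context manager, which stops the worker threads or processes when the block exits.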

Summary

Petastorm, an open-source data access library developed by Uber ATG, enables both single-machine and distributed training and evaluation of deep learning models directly from datasets in the Apache Parquet format.

This article discussed why Petastorm is a go-to approach: it lets you define one dataset and use it across frameworks. It also reviewed the features that help with training and evaluating deep learning models.

Petastorm supports popular Python-based machine learning frameworks, such as PyTorch, PySpark, and TensorFlow.