Petastorm: A Simple Approach to Deep Learning Models in Apache Parquet Format


Learn how to generate a Petastorm dataset that is compatible with different machine learning frameworks, analyze and manipulate the dataset, and more.

Petastorm, an open-source data access library, enables single-node or distributed training and evaluation of deep learning models directly from datasets in Apache Parquet format and from datasets that are already loaded as Apache Spark DataFrames. As Andrey, a U.S.-based Python engineer, notes, it supports popular Python-based machine learning (ML) frameworks, including TensorFlow, PyTorch, and PySpark. For more information about Petastorm, refer to the Petastorm GitHub page and Petastorm API documentation.

This article covers the following topics, among others:

  • Analyzing and manipulating the dataset
  • Parallelizing data loading and decoding operations

What Are Some Petastorm Features?

To support different training scenarios for autonomous driving algorithms, Petastorm incorporates features such as efficient data sharding, row filtering, shuffling, access to a subset of fields, and support for time-series data, which it reads as sequences of consecutive rows called n-grams.
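As a rough illustration of how several of these features map onto reader arguments, here is a minimal sketch built around Petastorm's make_reader function. The helper names (reader_options, open_reader), the keep_row parameter, and the assumption that the dataset has an integer id field are all made up for this example; treat it as one possible arrangement rather than the recommended one.

```python
def reader_options(rank, world_size, fields=None, shuffle=True):
    """Keyword arguments for petastorm.make_reader for one of `world_size` trainers."""
    if not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    return {
        "cur_shard": rank,              # data sharding: which slice this trainer reads
        "shard_count": world_size,      # data sharding: total number of slices
        "schema_fields": fields,        # subset of fields (None reads every field)
        "shuffle_row_groups": shuffle,  # shuffling at row-group granularity
    }


def open_reader(dataset_url, rank, world_size, fields=None, keep_row=None):
    """Open a reader; `keep_row` is an optional function of the (assumed) `id` field."""
    # Imported lazily so reader_options can be used without Petastorm installed.
    from petastorm import make_reader
    from petastorm.predicates import in_lambda

    # Row filtering: in_lambda applies `keep_row` to the listed field.
    predicate = in_lambda(["id"], keep_row) if keep_row is not None else None
    return make_reader(dataset_url, predicate=predicate,
                       **reader_options(rank, world_size, fields))
```

With Petastorm installed and a dataset available at a URL such as file:///tmp/example_dataset (a placeholder), a trainer with rank 0 out of 2 could iterate over rows with "with open_reader(url, 0, 2, fields=['id']) as reader: for row in reader: ...".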

What Is the Structure of a Typical Dataset?

A typical dataset contains:

  • Multiple columns holding signals from sensors such as cameras, radars, and lidar, collected during autonomous vehicle test runs.
  • Manually generated labels, stored as fields in a row.

Generating a Petastorm Dataset That Is Compatible With Different ML Frameworks

To generate a dataset with Petastorm, you first define a Unischema, which is simply a data schema. You only need to define the schema once, at this step, because Petastorm translates it into the formats of all supported frameworks, including TensorFlow, pure Python, and PySpark.
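To make this step more concrete, the sketch below defines a small Unischema and writes a few random rows with a local Spark session, loosely following the pattern in the Petastorm README. The schema name, the two fields (id and image), their shapes, and the helper names are assumptions chosen for illustration, not part of any real dataset.

```python
from pathlib import Path


def local_dataset_url(path):
    """Petastorm addresses datasets by URL, so a local path needs a file:// scheme."""
    return Path(path).absolute().as_uri()


def build_schema():
    """Define the Unischema once; field names and shapes here are illustrative."""
    # Third-party imports are kept inside the functions that need them.
    import numpy as np
    from petastorm.codecs import CompressedImageCodec, ScalarCodec
    from petastorm.unischema import Unischema, UnischemaField
    from pyspark.sql.types import IntegerType

    return Unischema("ExampleSchema", [
        # (name, numpy dtype, shape, codec used for storage, nullable)
        UnischemaField("id", np.int32, (), ScalarCodec(IntegerType()), False),
        UnischemaField("image", np.uint8, (128, 256, 3), CompressedImageCodec("png"), False),
    ])


def write_dataset(output_url, rows_count=10, row_group_size_mb=256):
    """Write `rows_count` random rows as a Petastorm dataset using local Spark."""
    import numpy as np
    from petastorm.etl.dataset_metadata import materialize_dataset
    from petastorm.unischema import dict_to_spark_row
    from pyspark.sql import SparkSession

    schema = build_schema()
    spark = SparkSession.builder.master("local[2]").getOrCreate()

    def make_row(i):
        # Random pixels stand in for a real camera frame.
        image = np.random.randint(0, 255, dtype=np.uint8, size=(128, 256, 3))
        return dict_to_spark_row(schema, {"id": i, "image": image})

    # materialize_dataset adds the Petastorm metadata when the block exits.
    with materialize_dataset(spark, output_url, schema, row_group_size_mb):
        rows = spark.sparkContext.parallelize(range(rows_count)).map(make_row)
        spark.createDataFrame(rows, schema.as_spark_schema()) \
            .write.mode("overwrite").parquet(output_url)
```

Assuming Petastorm and PySpark are installed, calling write_dataset(local_dataset_url("/tmp/example_dataset")) would produce a small dataset to experiment with. The schema is written once here; readers for the other frameworks derive their types from it.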

Analyzing and Manipulating the Dataset

Because the dataset is stored in the Parquet format, which Spark supports natively, you can analyze and manipulate it with standard Spark tools.
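For instance, a dataset could be inspected with ordinary PySpark calls, as in the sketch below. The helper names and the id column are assumptions for illustration. Note that fields stored through a codec, such as compressed images, appear as raw binary columns when read this way, so this kind of analysis is most convenient for scalar fields.

```python
def count_query(dataset_url, column="id"):
    """Spark SQL that counts non-null `column` values straight from the Parquet files."""
    return "SELECT count({}) FROM parquet.`{}`".format(column, dataset_url)


def summarize(dataset_url, column="id"):
    """Print the schema and a few values, then return two row counts for comparison."""
    # Imported lazily so count_query can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[2]").getOrCreate()
    df = spark.read.parquet(dataset_url)   # a Petastorm dataset is plain Parquet on disk
    df.printSchema()
    df.select(column).show(5)
    sql_count = spark.sql(count_query(dataset_url, column)).collect()[0][0]
    return df.count(), sql_count
```

If you need the decoded values rather than the stored bytes, Petastorm also provides dataset_as_rdd in petastorm.spark_utils, which returns an RDD of decoded rows.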

Parallelizing Data Loading and Decoding Operations

Petastorm provides two strategies for parallelizing data loading and decoding operations:

  1. Thread pool implementation
  2. Process pool implementation
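In practice, the strategy is selected when the reader is created. The sketch below assumes the reader_pool_type and workers_count arguments of make_reader, which accept 'thread', 'process', or 'dummy' (the last one runs everything in the calling thread and is mainly useful for debugging). The helper names and the id field are hypothetical.

```python
POOL_TYPES = ("thread", "process", "dummy")


def pool_options(pool_type="thread", workers=4):
    """Validate the pool choice and return keyword arguments for make_reader."""
    if pool_type not in POOL_TYPES:
        raise ValueError("pool_type must be one of {}".format(POOL_TYPES))
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return {"reader_pool_type": pool_type, "workers_count": workers}


def read_ids(dataset_url, pool_type="process", workers=4):
    """Read the (assumed) `id` field of every row using the chosen worker pool."""
    # Imported lazily so pool_options can be used without Petastorm installed.
    from petastorm import make_reader

    with make_reader(dataset_url, schema_fields=["id"],
                     **pool_options(pool_type, workers)) as reader:
        return sorted(row.id for row in reader)
```

As a rule of thumb, a thread pool tends to work well when most decoding time is spent in native code that releases the Python GIL, such as image decompression, while a process pool may help when decoding involves a lot of pure Python work. It is worth measuring both on your own data.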


Petastorm, as we have seen, is an open-source data access library developed by Uber ATG. It enables both single-machine and distributed training and evaluation of deep learning models directly from datasets in the Apache Parquet format.
