Running LBANN

Sanity check with a simple test using LeNet

In the Quick Start guide there is an example of how to run a standard LeNet model using LBANN.

Anatomy of an LBANN experiment

Parallelism

LBANN is run under MPI, i.e. with multiple processes that communicate with message passing. This set of processes is subdivided into one or more “trainers.” Conceptually, a trainer owns parallel objects, like models and data readers, and generally operates independently of other trainers.

Comments:

LBANN targets HPC systems with homogeneous compute nodes and GPU accelerators, which motivates some simplifying assumptions:
- All trainers have the same number of processes.
- If GPU acceleration is enabled, each MPI process corresponds to one GPU.
Processors are block assigned to trainers based on MPI rank.
- In order to minimize the cost of intra-trainer communication, make sure to map processes to the hardware and network topologies. Typically, this just means choosing a sensible number of processes per trainer, e.g. a multiple of the number of GPUs per compute node.
Generally, increasing the number of processes per trainer will accelerate computation but require more intra-trainer communication. There is typically a sweet spot where run time is minimized, but it is complicated and sensitive to the nature of the computation, the mini-batch size, the data partitioning scheme, hardware and network properties, the communication algorithms, and myriad other factors.
- Rule-of-thumb: Configure experiments so that the bulk of run time is taken by compute-bound operations (e.g. convolution or matrix multiplication) and so that each process has enough work to achieve a large fraction of peak performance (e.g. by making the mini-batch size sufficiently large).
Most HPC systems are managed with job schedulers like Slurm. Typically, users can not immediately access compute nodes but must request them from login nodes. The login nodes can be accessed directly (e.g. via ssh), but users are discouraged from doing heavy computation on them.
- For debugging and quick testing, it’s convenient to request an interactive session (salloc or sxterm with Slurm).
- If you need to run multiple experiments or if experiments are not time-sensitive, it’s best to submit a batch job (sbatch with Slurm).
- When running an experiment, make sure you know what scheduler account to charge (used by the scheduler for billing and determining priority) and what scheduler partition to run on (compute nodes on a system are typically subdivided into multiple groups, e.g. for batch jobs and for debugging).
  - With salloc, specify the partition using the --partition command-line argument and specify the account using --account.
- Familiarize yourself with the rules for the systems you use (e.g. the expected work for each partition, time limits, job submission limits) and be a good neighbor.

Model components

Layer: A tensor operation, arranged within a directed acyclic graph.
- During evaluation (“forward prop”), a layer receives input tensors from its parents and sends an output tensor to each child.
- During automatic differentiation (“backprop”), a layer receives “input error signals” (objective function gradients w.r.t. output tensors) from its children and sends “output error signals” (objective function gradients w.r.t. input tensors) to its parents. If the layer has any associated weight tensors, it will also compute objective function gradients w.r.t. the weight tensors.
- Most layers require a specific number of parents and children, but LBANN will insert layers into the graph if there is a mismatch and the intention is obvious. For example, if a layer expects one child but has multiple, then a split layer (with multiple output tensors all identical to the input tensor) is inserted. Similarly, if a layer has fewer children than expected, dummy layers will be inserted. However, this does not work if there is any ambiguity. In such cases (common with input and slice layers), it is recommended to manually insert identity layers so that the parent/child relationships are absolutely unambiguous.
- See lbann/src/proto/layers.proto for a full list of supported layers.
Weights: A tensor consisting of trainable parameters, typically associated with one or more layers. A weight tensor owns an initializer to initially populate its values and an optimizer to find values that minimize the objective function.
- A weight tensor without a specified initializer will use a zero initializer.
- A weight tensor without a specified optimizer will use the model’s default optimizer.
- If a layer requires weight tensors and none are specified, it will create the needed weight tensors. The layer will pick sensible initializers and optimizers for the weight tensors. For example, a convolution layer will initialize its kernel tensor with He normal initialization and with the model’s default optimizer.
- The dimensions of a weight tensor is determined by their associated layers. The user can not set it directly.
Objective function: Mathematical expression that the optimizers will attempt to minimize. It is made up of multiple terms that are added together (possibly with scaling factors).
- An objective function term can get its value from a scalar-valued layer, i.e. a layer with an output tensor with one entry.
Metric: Mathematical expression that will be reported to the user. This typically does not affect training, but is helpful for evaluating the progress of training. A canonical example for classification problems is classification accuracy.
Callback: Function that is performed at various points during an experiment. Callbacks are helpful for reporting, debugging, and performing advanced training techniques. Please consult the Callbacks documentation for detailed descriptions of the callbacks.
- This is the natural home for experimental training techniques.
- A common use-case is to export values with the “dump outputs” callback so that the user can perform data post-processing or visualization.

Data readers

Warning

The core infrastructure for data readers is slated for significant refactoring, so expect major changes in the future.

Data readers are responsible for managing a data set and providing data samples to models. A data set is comprised of independent data samples, each of which is made up of multiple tensors. For example, a data sample for a labeled image classification problem consists of an image tensor and a one-hot label vector.

Note

The data readers are currently hard-coded to assume this simple classification paradigm. Hacks are needed if your data does not match it exactly, e.g. if a data sample is comprised of more than two tensors. The most basic approach is to flatten all tensors and concatenate them into one large vector. The model is then responsible for slicing this vector into the appropriate chunks and resizing the chunks into the appropriate dimensions. Done correctly, this should not impose any additional overhead.

Specifically, data readers and models interact via input layers. Each model must have exactly one input layer and its output tensors are populated by a data reader every mini-batch step. This is typically performed by a background thread pool, so data ingestion will efficiently overlap with other computation, especially if the data reader’s work is IO-bound or if the computation is largely on GPUs.

Note

An input layer has an output tensor for each data sample tensor. Since each data sample has two tensors (one for the data and one for the label), it follows that every input layer should have two child layers. To make parent/child relationships unambiguous, we recommend manually creating identity layers as children of the input layer.

Note that layers within a model treat the data for a mini-batch as a single tensor where the leading dimension is the mini-batch size. Thus, corresponding tensors in all data samples must have the same dimensions. The data dimensions must be known from the beginning of the experiment and can not change. However, real data is rarely so consistent and some preprocessing is typically required. See lbann/src/proto/transforms.proto for a list of available preprocessing transforms.

Warning

The Python data reader will trigger some process forking that doesn’t interact with InfiniBand all that well by default. Users may encounter hangs on clusters that use InfiniBand. To avoid this, ensure that IBV_FORK_SAFE=1 is exported into the environment when running LBANN.

Python frontend

LBANN provides a Python frontend with syntax reminiscent of PyTorch. See a simple implementation of LeNet.

Comments:

Under-the-hood, the Python frontend is actually a convenience wrapper around the Protobuf frontend. The core infrastructure allows users to configure an experiment and “compiles” it to a Prototext text file.
The Python interface can only configure and launch experiments. It is not active during an experiment and it does not allow for any dynamic control flow.
Only Python 3 is supported.

Setup

The lbann Python package is installed as part of the LBANN build process. If you used the build_lbann.sh script for installation or installed in a Spack environment, you will need to activate the Spack LBANN environment:

spack env activate -p lbann

Basic usage

A typical workflow involves the following steps:

Configuring a Trainer.
Configuring LBANN model components (like the graph of Layer s) and creating a Model.

Classes for model components are automatically generated from the LBANN Protobuf specifications in lbann/src/proto. These files are currently the best source of documentation. Message fields in the Protobuf specification are optional keyword arguments for the corresponding Python class constructor. If a keyword argument is not provided, it is logically zero (e.g. false for Boolean fields and empty for string fields)

Configuring the default Optimizer to be used by the Weights objects.
Loading in a Protobuf text file describing the data reader.
- The Python frontend currently does not have good support for specifying data readers. If any data reader properties need to be set programmatically, the user must do it directly via the Protobuf Python API.
Launching LBANN by calling run.
- lbann.run should be run from a compute node. If a node allocation is not available, the batch_job option can be set to submit a batch job to the scheduler.
- A timestamped work directory will be created each time LBANN is run. The default location of these work directories can be set with the environment variable LBANN_EXPERIMENT_DIR.
- Supported job managers are Slurm and LSF.
- LLNL users and collaborators may prefer to use lbann.contrib.launcher.run. This is similar to lbann.run, with defaults and optimizations for certain systems.

A simple example

import lbann

# ----------------------------------
# Construct layer graph
# ----------------------------------

# Input data
image = lbann.Input(data_field="samples")
label = lbann.Input(data_field="labels")

# Softmax classifier
y = lbann.FullyConnected(image, num_neurons = 10, has_bias = True)
pred = lbann.Softmax(y)

# Loss function and accuracy
loss = lbann.CrossEntropy([pred, label])
acc = lbann.CrossEntropy([pred, label])

# ----------------------------------
# Setup experiment
# ----------------------------------

# Setup trainer
mini_batch_size = 64
trainer = lbann.Trainer(mini_batch_size)

# Setup model
num_epochs = 5
model = lbann.Model(num_epochs,
                    layers=lbann.traverse_layer_graph([image,label]),
                    objective_function=loss,
                    metrics=[lbann.Metric(acc, name='accuracy', unit='%')],
                    callbacks=[lbann.CallbackPrint(), lbann.CallbackTimer()])

# Setup optimizer
opt = lbann.SGD(learn_rate=0.01, momentum=0.9)

# Load data reader from prototext
import google.protobuf.text_format
data_reader_proto = lbann.lbann_pb2.LbannPB()
with open('path/to/lbann/model_zoo/data_readers/data_reader_mnist.prototext', 'r') as f:
    google.protobuf.text_format.Merge(f.read(), data_reader_proto)
data_reader_proto = data_reader_proto.data_reader

# ----------------------------------
# Run experiment
# ----------------------------------

lbann.run(trainer, model, data_reader_proto, opt)

Useful submodules

`lbann.modules`

A Module is a pattern of layers that can be applied multiple times in a neural network. Once created, a Module is callable, taking a layer as input and returning a layer as output. They will create and manage Weights es internally, so they are convenient for weight sharing between different layers. They are also useful for complicated patterns like RNN cells.

A possible note of confusion: “Modules” in LBANN are similar to “layers” in PyTorch, TensorFlow, and Keras. LBANN uses “layer” to refer to tensor operations, in a similar manner as Caffe.

`lbann.models`

Several common and influential neural network models are implemented as Module s. They can be used as building blocks within more complicated models.

`lbann.proto`

The save_prototext function will export a Protobuf text file, which can be fed into the Protobuf frontend.

`lbann.onnx`

This contains functionality to convert between LBANN and ONNX models. See python/docs/onnx/README.md for full documentation.

Protobuf frontend (advanced)

The main LBANN driver uses Protobuf text files (sometimes called prototext files) to specify experiments. The Python frontend operates by “compiling” an experiment configuration into a Protobuf text file and passing it into the LBANN driver. Aside from quick debugging, there are very few situations where directly manipulating Protobuf text files is superior to using the Python frontend. In fact, it is possible to use Protobuf’s Python API to programmatically manipulate Protobuf messages, if such fine control is necessary.

In order to fully specify an experiment, the user must provide Protobuf text files for the model, default optimizer, and data reader. These can be provided as three separate files or one unified file. The basic template for running LBANN is

<mpi-launcher> <mpi-options> \
    lbann --prototext=experiment.prototext

The LBANN Protobuf format is defined in src/proto/lbann.proto. It is important to remember that the default value of a Protobuf field is logically zero (e.g. false for Boolean fields and empty for string fields).

Setup of LBANN for manual CMake build (advanced)

If LBANN was compiled and installed directly using CMake there is a bit more work that is required to ensure that the Python front end will work. The lbann Python package is installed as part of the LBANN build process. However, it is necessary to update the PYTHONPATH environment variable to make sure Python detect it. There are several ways to do this:

If LBANN has been built with the Spack user build process, loading LBANN will automatically update PYTHONPATH:

module load lbann

Warning

The above will not work if LBANN has been built with scripts/build_lbann_lc.sh or with the Spack developer build process.

LBANN includes a modulefile that updates PYTHONPATH:

module use <install directory>/etc/modulefiles
module load lbann/<version>

Directly manipulate PYTHONPATH:

export PYTHONPATH=<install directory>/lib/python<version>/site-packages:${PYTHONPATH}

Note that LBANN depends on the Protobuf Python package, which can be installed with:

pip install protobuf

If the user does not own the site-packages directory, then it may be necessary to pass the --user flag to pip.

Running LBANN

Sanity check with a simple test using LeNet

Anatomy of an LBANN experiment

Parallelism

Model components

Data readers

Python frontend

Setup

Basic usage

A simple example

Useful submodules

lbann.modules

lbann.models

lbann.proto

lbann.onnx

Protobuf frontend (advanced)

Setup of LBANN for manual CMake build (advanced)

`lbann.modules`

`lbann.models`

`lbann.proto`

`lbann.onnx`