Learning Layers

ChannelwiseFullyConnected: Apply affine transformation to tensor channels
ChannelwiseScaleBias: Apply per-channel scale and bias
Convolution: Convolution
Deconvolution: Deconvolution
Embedding: Lookup table to embedding vectors
EntrywiseScaleBias: Apply entry-wise scale and bias
FullyConnected: Affine transformation
GRU: Stacked gated recurrent unit


ChannelwiseFullyConnected

The ChannelwiseFullyConnected layer applies an affine transformation to tensor channels.

The input tensor is sliced along the first tensor dimension (the “channel” dimension for image data in CHW format) and the same affine transformation is applied to each slice. Following a row-vector convention:

\[y(i,*) = \text{vec}( x(i,*) ) W^T + b\]

Two weights are required if bias is applied: the linearity and the bias. Only the linearity weights are required if bias is not applied. If weights aren’t provided, the linearity weights are initialized with He normal initialization and the bias weights are initialized to zero.
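The per-channel affine map above can be sketched in NumPy. This is an illustrative sketch, not the LBANN implementation; the function name and variable names are invented:

```python
import numpy as np

def channelwise_fully_connected(x, W, b=None):
    """Apply the same affine map to every channel slice of x.

    x : (num_channels, *in_dims) input tensor
    W : (out_size, in_size) shared linearity, with in_size = prod(in_dims)
    b : (out_size,) optional shared bias
    """
    num_channels = x.shape[0]
    flat = x.reshape(num_channels, -1)  # vec(x(i,*)) as row vectors
    y = flat @ W.T                      # y(i,*) = vec(x(i,*)) W^T
    if b is not None:
        y = y + b                       # bias broadcast to every channel
    return y

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3, 3))      # 4 channels of 3x3 data
W = rng.standard_normal((5, 9))
b = rng.standard_normal(5)
y = channelwise_fully_connected(x, W, b)  # shape (4, 5)
```

Note that every channel shares the same `W` and `b`; only the channel dimension survives unchanged.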

Arguments:

output_channel_dims

(repeated uint64) Output tensor dimensions, excluding the channel dimension

bias

(google.protobuf.BoolValue, optional) Whether to apply bias

Default: True

transpose

(google.protobuf.BoolValue, optional) Whether to apply the transpose of the weights matrix


Default: False



ChannelwiseScaleBias

The ChannelwiseScaleBias layer applies per-channel scale and bias. The input tensor is sliced along the first tensor dimension (the “channel” dimension, assuming image data in CHW format) and scale and bias terms are applied independently to each slice. More precisely, given input and output tensors \(X,Y\in\mathbb{R}^{d_1\times\cdots\times d_n}\) and scale and bias vectors \(a,b\in\mathbb{R}^{d_1}\):

\[Y_{i,j,\cdots} = a_i X_{i,j,\cdots} + b_i\]

The scale and bias vectors are fused into a single weights tensor to reduce the number of gradient allreduces during backprop. In particular, the weights tensor is a \(\text{num_channels} \times 2\) matrix, where the first column corresponds to scale terms and the second column to bias terms.
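The fused weights layout can be sketched in NumPy. A minimal illustration under the layout described above (not the LBANN implementation; names are invented):

```python
import numpy as np

def channelwise_scale_bias(x, weights):
    """Per-channel scale/bias with the fused (num_channels x 2) weights layout."""
    a = weights[:, 0]  # first column: scale terms
    b = weights[:, 1]  # second column: bias terms
    # Reshape so a_i and b_i broadcast over the non-channel dimensions.
    shape = (-1,) + (1,) * (x.ndim - 1)
    return a.reshape(shape) * x + b.reshape(shape)

x = np.ones((3, 2, 2))                  # 3 channels of 2x2 data
weights = np.array([[2.0, 1.0],         # channel 0: scale 2, bias 1
                    [0.5, 0.0],         # channel 1: scale 0.5, bias 0
                    [1.0, -1.0]])       # channel 2: scale 1, bias -1
y = channelwise_scale_bias(x, weights)
```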

Arguments: None



Convolution

The Convolution layer applies convolution (more precisely, cross-correlation) to the input tensor. This is primarily optimized for image data in CHW format.

Two weights are required if bias is applied: a kernel tensor (in KCHW format) and per-channel biases. Only the kernel weights are required if bias is not applied. If weights aren’t provided, the kernel weights are initialized with He normal initialization and the bias weights are initialized to zero.
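The operation can be illustrated with a naive NumPy cross-correlation over a CHW input and KCHW kernel. This is a deliberately slow sketch (no groups or dilation, scalar stride/padding), not the LBANN implementation:

```python
import numpy as np

def conv2d(x, kernel, bias=None, stride=1, padding=0):
    """Naive 2D cross-correlation: CHW input, KCHW kernel, per-channel bias."""
    C, H, W = x.shape
    K, Ck, kh, kw = kernel.shape
    assert Ck == C
    xp = np.pad(x, ((0, 0), (padding, padding), (padding, padding)))
    out_h = (H + 2 * padding - kh) // stride + 1
    out_w = (W + 2 * padding - kw) // stride + 1
    y = np.zeros((K, out_h, out_w))
    for k in range(K):
        for i in range(out_h):
            for j in range(out_w):
                patch = xp[:, i * stride:i * stride + kh,
                              j * stride:j * stride + kw]
                y[k, i, j] = np.sum(patch * kernel[k])  # no kernel flip
    if bias is not None:
        y += bias.reshape(-1, 1, 1)  # per-channel bias
    return y

x = np.arange(16.0).reshape(1, 4, 4)   # 1 channel, 4x4
kernel = np.ones((2, 1, 3, 3))         # 2 output channels, 3x3 kernels
y = conv2d(x, kernel, bias=np.array([0.0, 1.0]), padding=1)  # shape (2, 4, 4)
```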

Arguments:

num_dims

(int64) Number of spatial dimensions

The first data dimension is treated as the channel dimension, and all others are treated as spatial dimensions (recall that the mini-batch dimension is implicit).

out_channels

(int64) Channel dimension of output tensor

kernel_size

(list[int64] or int64) Convolution kernel dimensions

List of integers, one for each spatial dimension.

padding

(list[int64] or int64) Convolution padding

List of integers, one for each spatial dimension.

stride

(list[int64] or int64) Convolution strides

List of integers, one for each spatial dimension. Used when has_vectors is enabled.

dilation

(list[int64] or int64) Convolution dilations

List of integers, one for each spatial dimension. Defaults to dilations of 1, i.e. undilated convolution.

has_bias

(bool) Whether to apply per-channel bias

groups

(int64, optional) Number of channel groups for grouped convolution

Default: 1

conv_tensor_op_mode

(ConvTensorOpsMode) Special behavior with FP16 tensor cores

Ignored for non-GPU layers.



Deconvolution

The Deconvolution layer is the transpose of standard deep learning convolution.

Pedantic comments: this operation is commonly called “deconvolution” in the deep learning community, but it is not a true deconvolution. Also, the “convolution” operation commonly used in deep learning is actually cross-correlation.
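The transpose relationship is easiest to see in 1D: a transposed convolution scatters each input entry through the kernel, and it is the adjoint of the corresponding cross-correlation. A hedged NumPy sketch (names invented, stride only, no padding):

```python
import numpy as np

def conv1d(x, k, stride=1):
    """Plain 1D cross-correlation (valid, no padding)."""
    n = (len(x) - len(k)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(k)], k)
                     for i in range(n)])

def deconv1d(y, k, stride=1):
    """Transposed 1D convolution: scatter each entry of y through the kernel."""
    out = np.zeros(stride * (len(y) - 1) + len(k))
    for i, v in enumerate(y):
        out[i * stride:i * stride + len(k)] += v * k
    return out

k = np.array([1.0, 2.0, 3.0])
fwd = conv1d(np.arange(6.0), k)   # length 6 -> length 4
back = deconv1d(fwd, k)           # length 4 -> length 6
```

The adjoint property (`<conv(x), y> == <x, deconv(y)>`) is exactly what makes this the transpose of standard convolution, and it is why the operation appears in convolution backprop.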

Arguments:

num_dims

(int): Number of spatial dimensions

out_channels

(int): Channel dimension of output tensor

kernel_size

(list[int] or int): Convolution kernel dimensions

stride

(list[int] or int): Convolution stride

padding

(list[int] or int): Convolution padding

output_padding

(list[int] or int): Padding for output tensor. The output tensor size is ambiguous when the convolution is strided. If this is not set, then we will output the smallest valid output tensor.

groups

(int): Number of convolution groups (default: 1)

has_bias

(bool): Whether to apply channel-wise bias (default: True)

dilation

(list[int] or int): Convolution dilation (default: 1)

conv_tensor_op_mode

(ConvTensorOpsMode) Special behavior with FP16 tensor cores

Ignored for non-GPU layers.



Embedding

The Embedding layer is a lookup table to embedding vectors.

Takes a scalar input, interprets it as an index, and outputs the corresponding vector. The number of embedding vectors and the size of vectors are fixed. If the index is out-of-range, then the output is a vector of zeros.

The embedding vectors are stored in an \(\text{embedding_dim} \times \text{num_embeddings}\) weights matrix. Note that this is the transpose of the weights in the PyTorch embedding layer.
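The lookup, including the transposed weights layout and the out-of-range behavior, can be sketched in NumPy (an illustration only, not the LBANN implementation):

```python
import numpy as np

def embedding_lookup(index, weights):
    """Look up one embedding from an (embedding_dim x num_embeddings) matrix.

    Note the layout is transposed relative to PyTorch, as described above.
    Out-of-range indices produce a zero vector.
    """
    embedding_dim, num_embeddings = weights.shape
    idx = int(index)
    if not 0 <= idx < num_embeddings:
        return np.zeros(embedding_dim)
    return weights[:, idx]  # column idx is the embedding vector

weights = np.array([[1.0, 2.0, 3.0],
                    [4.0, 5.0, 6.0]])  # embedding_dim=2, num_embeddings=3
v = embedding_lookup(1, weights)       # -> [2., 5.]
z = embedding_lookup(7, weights)       # out of range -> [0., 0.]
```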

Arguments:

num_embeddings

(int64) Size of dictionary of embeddings

embedding_dim

(int64) Size of embedding vectors

padding_idx

(google.protobuf.Int64Value, optional) If set, then the corresponding embedding vector is initialized with zeros and the function gradient w.r.t. this embedding vector is always zero.



EntrywiseScaleBias

The EntrywiseScaleBias layer applies entry-wise scale and bias.

Scale and bias terms are applied independently to each tensor entry. More precisely, given input, output, scale, and bias tensors \(X,Y,A,B\in\mathbb{R}^{d_1\times\cdots\times d_n}\):

\[Y = A \circ X + B\]

The scale and bias terms are fused into a single weights tensor to reduce the number of gradient allreduces during backprop. In particular, the weights tensor is a \(\text{size} \times 2\) matrix, where the first column corresponds to scale terms and the second column to bias terms.
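The entry-wise form with the fused \(\text{size} \times 2\) layout can be sketched in NumPy (illustrative names, not the LBANN implementation):

```python
import numpy as np

def entrywise_scale_bias(x, weights):
    """Entry-wise Y = A ∘ X + B with the fused (size x 2) weights layout."""
    a = weights[:, 0].reshape(x.shape)  # first column: scale terms
    b = weights[:, 1].reshape(x.shape)  # second column: bias terms
    return a * x + b

x = np.array([[1.0, 2.0],
              [3.0, 4.0]])
# Fused weights: scale every entry by 2, then add 1.
weights = np.stack([np.full(4, 2.0), np.full(4, 1.0)], axis=1)  # shape (4, 2)
y = entrywise_scale_bias(x, weights)   # -> 2*x + 1
```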

Arguments: None



FullyConnected

The FullyConnected layer is an affine transformation.

Flattens the input tensor, multiplies with a weights matrix, and optionally applies an entry-wise bias. Following a row-vector convention:

\[y = \text{vec}(x) W^T + b\]

Two weights are required if bias is applied: the linearity and the bias. Only the linearity weights are required if bias is not applied. If weights aren’t provided, the linearity weights are initialized with He normal initialization and the bias weights are initialized to zero.

For flat data, this layer is similar to Keras’ dense layer or PyTorch’s linear operation. However, it implicitly flattens multi-dimensional data. To avoid this flattening, consider the channel-wise fully-connected layer.
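The flatten-then-affine behavior can be sketched in NumPy. A minimal sketch, not the LBANN implementation; the reading of `transpose` as "use W instead of W^T" is an assumption here:

```python
import numpy as np

def fully_connected(x, W, b=None, transpose=False):
    """y = vec(x) W^T + b (row-vector convention), flattening any input shape."""
    flat = x.reshape(-1)                   # implicit flattening of N-D input
    # Assumption: transpose=True means the stored weights are already W^T.
    y = flat @ (W if transpose else W.T)
    if b is not None:
        y = y + b                          # entry-wise bias
    return y

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 3))   # multi-dimensional input is flattened to size 6
W = rng.standard_normal((4, 6))
b = rng.standard_normal(4)
y = fully_connected(x, W, b)      # shape (4,)
```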

Arguments:

num_neurons

(int64) Output tensor size

has_bias

(bool) Whether to apply entry-wise bias

transpose

(bool) Whether to apply the transpose of the weights matrix



GRU

The GRU layer is a stacked gated recurrent unit.

Expects two inputs: a 2D input sequence (\(\text{sequence_length}\times\text{input_size}\)) and a 2D initial hidden state (\(\text{num_layers}\times\text{hidden_size}\)).

Uses four weights per GRU cell:

- “ih_matrix”: \(3\,\text{hidden_size}\times\text{input_size}\) for layer 0, \(3\,\text{hidden_size}\times\text{hidden_size}\) for other layers
- “hh_matrix”: \(3\,\text{hidden_size}\times\text{hidden_size}\)
- “ih_bias”: \(3\,\text{hidden_size}\)
- “hh_bias”: \(3\,\text{hidden_size}\)

Support is experimental and requires either cuDNN (on GPU) or oneDNN (on CPU).
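A single GRU step with these weight shapes can be sketched in NumPy. The gate ordering [reset; update; new] within each stacked block follows the cuDNN/PyTorch convention and is an assumption here; this is an illustration, not the LBANN implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_step(x, h, ih_matrix, hh_matrix, ih_bias, hh_bias):
    """One GRU step; gate blocks stacked as [reset; update; new] (assumed)."""
    gi = ih_matrix @ x + ih_bias          # shape (3*hidden_size,)
    gh = hh_matrix @ h + hh_bias
    i_r, i_z, i_n = np.split(gi, 3)
    h_r, h_z, h_n = np.split(gh, 3)
    r = sigmoid(i_r + h_r)                # reset gate
    z = sigmoid(i_z + h_z)                # update gate
    n = np.tanh(i_n + r * h_n)            # candidate hidden state
    return (1.0 - z) * n + z * h          # new hidden state

rng = np.random.default_rng(2)
input_size, hidden_size = 5, 4
x = rng.standard_normal(input_size)
h = rng.standard_normal(hidden_size)
ih_matrix = rng.standard_normal((3 * hidden_size, input_size))
hh_matrix = rng.standard_normal((3 * hidden_size, hidden_size))
ih_bias = rng.standard_normal(3 * hidden_size)
hh_bias = rng.standard_normal(3 * hidden_size)
h_next = gru_cell_step(x, h, ih_matrix, hh_matrix, ih_bias, hh_bias)
```

A stacked GRU repeats this step per layer, feeding each layer's output sequence to the next as its input.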

Todo

Support bidirectional RNNs

Arguments:

hidden_size

(uint64) Size of each hidden state and output vector

num_layers

(google.protobuf.UInt64Value, optional) Number of stacked GRU cells

Default: 1
