Regularization Layers
| Layer | Description |
|---|---|
| BatchNormalization | Channel-wise batch normalization |
| Dropout | Probabilistically drop tensor entries |
| EntrywiseBatchNormalization | Entry-wise batch normalization |
| InstanceNorm | Normalize over data channels |
| LayerNorm | Normalize over data samples |
| LocalResponseNormalization | Local-response normalization |
| SeluDropout | Scaled dropout for use with SELU activations |
BatchNormalization
The BatchNormalization layer performs channel-wise batch normalization.
Each input channel is normalized across the mini-batch to have zero mean and unit standard deviation. Learned scaling factors and biases are then applied. This uses the standard approach of maintaining the running mean and standard deviation (with exponential decay) for use at test time.
This layer maintains four weights: scales, biases, running means, and running variances. Each has a size equal to the number of channels. In order to disable the affine operation, manually construct weights without optimizers.
See:
Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” In International Conference on Machine Learning, pp. 448-456. 2015.
Arguments:
- decay (double, optional): Decay factor for running statistics. Default: 0.9
- epsilon (double, optional): Small number for numerical stability. Default: 1e-5
- statistics_group_size (int64, optional): Size of process group for computing statistics. Default: 1
  A group size of 1 implies purely local statistics. A negative group size indicates global statistics (i.e. statistics over the entire mini-batch).
Deprecated arguments:
- stats_aggregation (string)
Deprecated and unused arguments:
- scale_init (double)
- bias_init (double)
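As an illustration of the training-time computation described above, the following is a minimal NumPy sketch of channel-wise batch normalization with exponentially decayed running statistics. The function and variable names are illustrative only, not LBANN's implementation.

```python
import numpy as np

def batch_norm_forward(x, scale, bias, running_mean, running_var,
                       decay=0.9, epsilon=1e-5):
    """Channel-wise batch normalization (training mode).

    x has shape (batch, channels, ...); statistics are computed per
    channel over the batch and any spatial dimensions.
    """
    # Reduce over every axis except the channel axis (axis 1).
    reduce_axes = (0,) + tuple(range(2, x.ndim))
    mean = x.mean(axis=reduce_axes, keepdims=True)
    var = x.var(axis=reduce_axes, keepdims=True)

    # Update running statistics with exponential decay for use at test time.
    running_mean[:] = decay * running_mean + (1 - decay) * mean.reshape(-1)
    running_var[:] = decay * running_var + (1 - decay) * var.reshape(-1)

    # Normalize, then apply the learned per-channel scale and bias.
    x_hat = (x - mean) / np.sqrt(var + epsilon)
    shape = (1, -1) + (1,) * (x.ndim - 2)
    return scale.reshape(shape) * x_hat + bias.reshape(shape)

# Example: 8 samples, 4 channels, 16x16 spatial dimensions.
x = np.random.randn(8, 4, 16, 16)
scale, bias = np.ones(4), np.zeros(4)
running_mean, running_var = np.zeros(4), np.ones(4)
y = batch_norm_forward(x, scale, bias, running_mean, running_var)
```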
Dropout
The Dropout layer probabilistically drops tensor entries.
Retained entries are scaled by 1/(keep probability) at training time. Keep probabilities of 0.5 for fully-connected layers and 0.8 for input layers are good starting points. See:
Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. “Dropout: a simple way to prevent neural networks from overfitting.” The Journal of Machine Learning Research 15, no. 1 (2014): 1929-1958.
Arguments:
- keep_prob (double): Probability of keeping each tensor entry. Recommendation: 0.5
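A minimal NumPy sketch of this behavior (inverted dropout, with names chosen for illustration rather than taken from the library) looks like:

```python
import numpy as np

def dropout_forward(x, keep_prob=0.5, training=True, rng=np.random.default_rng()):
    """Inverted dropout: retained entries are scaled by 1/keep_prob at
    training time, so no rescaling is needed at test time."""
    if not training:
        return x
    mask = rng.random(x.shape) < keep_prob
    return x * mask / keep_prob

# Example: drop roughly half the entries of a fully-connected activation.
y = dropout_forward(np.random.randn(32, 128), keep_prob=0.5)
```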
EntrywiseBatchNormalization
The EntrywiseBatchNormalization layer performs entry-wise batch normalization.
Each input entry is normalized across the mini-batch to have zero mean and unit standard deviation. This uses the standard approach of maintaining the running mean and standard deviation (with exponential decay) for use at test time.
This layer maintains two weights: running means and running variances. Each has a shape identical to the data tensor. It is common to apply an affine operation after this layer, e.g. with the entry-wise scale/bias layer.
See:
Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” In International Conference on Machine Learning, pp. 448-456. 2015.
Arguments:
- decay (double): Decay factor for running statistics. Recommendation: 0.9
- epsilon (double): Small number for numerical stability. Recommendation: 1e-5
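The following NumPy sketch illustrates the entry-wise variant: every tensor entry is normalized independently across the mini-batch, and no affine transform is applied by the layer itself. Names are illustrative, not LBANN's implementation.

```python
import numpy as np

def entrywise_batch_norm_forward(x, running_mean, running_var,
                                 decay=0.9, epsilon=1e-5):
    """Entry-wise batch normalization (training mode).

    Each entry is normalized across the mini-batch dimension (axis 0);
    running statistics have the same shape as a single data sample.
    """
    mean = x.mean(axis=0)
    var = x.var(axis=0)

    # Update running statistics (one per tensor entry) with exponential decay.
    running_mean[:] = decay * running_mean + (1 - decay) * mean
    running_var[:] = decay * running_var + (1 - decay) * var

    return (x - mean) / np.sqrt(var + epsilon)

# Example: 8 samples of a 4x16x16 tensor.
x = np.random.randn(8, 4, 16, 16)
running_mean = np.zeros(x.shape[1:])
running_var = np.ones(x.shape[1:])
y = entrywise_batch_norm_forward(x, running_mean, running_var)
```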
InstanceNorm
The InstanceNorm layer normalizes data samples over data channels.
Each channel within a data sample is normalized to have zero mean and unit standard deviation. See:
Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. “Instance normalization: The missing ingredient for fast stylization.” arXiv preprint arXiv:1607.08022 (2016).
This is equivalent to applying layer normalization independently to each channel. It is common to apply an affine operation after this layer, e.g. with the channel-wise scale/bias layer.
Arguments:
- epsilon (google.protobuf.DoubleValue, optional): Small number to avoid division by zero. Default: 1e-5
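For concreteness, a minimal NumPy sketch of the computation (illustrative only, not the library's code):

```python
import numpy as np

def instance_norm_forward(x, epsilon=1e-5):
    """Instance normalization: each channel of each data sample is
    normalized over its spatial dimensions to zero mean and unit variance."""
    spatial_axes = tuple(range(2, x.ndim))   # x has shape (batch, channels, ...)
    mean = x.mean(axis=spatial_axes, keepdims=True)
    var = x.var(axis=spatial_axes, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

# Example: normalize each of the 4 channels of each of the 8 samples.
y = instance_norm_forward(np.random.randn(8, 4, 16, 16))
```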
LayerNorm
The LayerNorm layer normalizes data samples.
Each data sample is normalized to have zero mean and unit standard deviation. See:
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. “Layer normalization.” arXiv preprint arXiv:1607.06450 (2016).
It is common to apply an affine operation after this layer, e.g. with the entry-wise scale/bias layer.
Arguments:
- epsilon (google.protobuf.DoubleValue, optional): Small number to avoid division by zero. Default: 1e-5
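A minimal NumPy sketch of the computation (illustrative only):

```python
import numpy as np

def layer_norm_forward(x, epsilon=1e-5):
    """Layer normalization: each data sample is normalized over all of
    its entries to zero mean and unit variance."""
    sample_axes = tuple(range(1, x.ndim))    # all axes except the batch axis
    mean = x.mean(axis=sample_axes, keepdims=True)
    var = x.var(axis=sample_axes, keepdims=True)
    return (x - mean) / np.sqrt(var + epsilon)

# Example: normalize each of the 8 samples independently.
y = layer_norm_forward(np.random.randn(8, 4, 16, 16))
```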
LocalResponseNormalization
The LocalResponseNormalization layer normalizes values within a local neighborhood.
See:
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks.” In Advances in Neural Information Processing Systems, pp. 1097-1105. 2012.
Arguments:
- window_width (int64)
- lrn_alpha (double)
- lrn_beta (double)
- lrn_k (double)
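The sketch below follows the cross-channel formula from the cited paper, y_i = x_i / (k + alpha * sum_j x_j^2)^beta, where the sum runs over a window of window_width channels centered on channel i. Note that some implementations scale alpha by the window width; this NumPy sketch is illustrative and may not match the layer's exact convention.

```python
import numpy as np

def local_response_norm(x, window_width=5, lrn_alpha=1e-4,
                        lrn_beta=0.75, lrn_k=2.0):
    """Cross-channel local response normalization (Krizhevsky et al., 2012).

    x has shape (batch, channels, ...); each channel is divided by a term
    computed from the squared activations of nearby channels.
    """
    num_channels = x.shape[1]
    half = window_width // 2
    squared = x ** 2
    y = np.empty_like(x)
    for i in range(num_channels):
        lo, hi = max(0, i - half), min(num_channels, i + half + 1)
        denom = lrn_k + lrn_alpha * squared[:, lo:hi].sum(axis=1)
        y[:, i] = x[:, i] / denom ** lrn_beta
    return y

# Example: normalize across the channels of a convolutional activation.
y = local_response_norm(np.random.randn(8, 16, 32, 32))
```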
SeluDropout
The SeluDropout layer is a scaled dropout for use with SELU activations.
A default keep probability of 0.95 is recommended. See:
Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. “Self-normalizing neural networks.” In Advances in Neural Information Processing Systems, pp. 971-980. 2017.
Arguments:
- keep_prob (double): Recommendation: 0.95
- alpha (double, optional): Default: 1.6732632423543772848170429916717
- scale (double, optional): Default: 1.0507009873554804934193349852946
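For reference, the following NumPy sketch shows the alpha-dropout scheme from the cited paper, in which dropped entries are set to the SELU saturation value (-scale * alpha) and an affine correction restores zero mean and unit variance. This is the paper's formulation, shown for illustration; it is not necessarily a line-for-line description of the layer's implementation.

```python
import numpy as np

def selu_dropout_forward(x, keep_prob=0.95,
                         alpha=1.6732632423543772848170429916717,
                         scale=1.0507009873554804934193349852946,
                         rng=np.random.default_rng()):
    """Alpha dropout (Klambauer et al., 2017).

    Dropped entries are set to the negative SELU saturation value
    (-scale * alpha) rather than zero, and an affine correction keeps
    the output at zero mean and unit variance.
    """
    q = keep_prob
    alpha_prime = -scale * alpha
    keep_mask = rng.random(x.shape) < q
    dropped = np.where(keep_mask, x, alpha_prime)

    # Affine correction from the paper: a = (q + alpha'^2 q (1-q))^(-1/2),
    # b = -a (1-q) alpha'.
    a = (q + alpha_prime ** 2 * q * (1 - q)) ** -0.5
    b = -a * (1 - q) * alpha_prime
    return a * dropped + b

# Example: apply after a SELU activation.
y = selu_dropout_forward(np.random.randn(32, 128), keep_prob=0.95)
```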