Hyperscience serves Global 2000 companies and government agencies. Our clients care deeply about the accuracy of their data and the end result, which has real consequences for their end customers. This is why we built a system that consists of a Deep Learning model and human annotators/reviewers where the model can ask for help when it is uncertain – the classical human-in-the-loop (HITL) setup that is known as “Supervision” in the Hyperscience Platform. The more accurate the model is, the less it needs to resort to human intervention to review and resolve these exception/edge cases.

Recently, a portion of the Hyperscience ML team embarked on a project to achieve significant throughput improvement of the overall model and annotators system without sacrificing model accuracy. We’ve made the annotators faster with practical UX, but human throughput can only be optimized to a certain point. Hence our only lever is to speed up the model while retaining the exact same accuracy!

## Possible Solutions

There are a number of possible solutions for speeding up Deep Learning models, and one of the quickest and safest options is using a better library for the underlying computations. We implemented that two years ago when we started using an MKL build of TensorFlow. That gave us a solid 20% increase in speed, which was good, but still less than what we hoped for this time around. (Fortunately, it barely changed the output of the model so we retained the exact same level of accuracy.)

Our system is deployed on-premise and due to restrictions within our customers’ environments, we can’t use a GPU to speed up our model. This is why we optimize our Deep Learning computations for CPU inference, and for Intel CPUs, we haven’t found a library that beats Intel’s MKL.

As a result, we decided to go back to research mode in order to achieve a higher throughput model. The first thing we tried was using smaller architectures, but almost all of them resulted in worse accuracy than our current solution, which was unacceptable. We tried knowledge distillation, but that didn’t hit our desired quality either. **We then decided to investigate model quantization.** Our current architecture did not lend itself to quantization, so we tried a model architecture of similar size that was quantizable, which turned out to be the desired solution.

But enabling quantization was not as simple as changing the library for the underlying computations, which brings us to this blog. In this post, we’ll introduce you to model quantization and outline the quantization tools present in PyTorch.

## Quantization

### Introduction

In the context of Deep Learning, quantization is running the same computations but using integer math instead of floating point. This reduces the amount of time and memory needed to do inference, but if you’re not careful, it can reduce the model accuracy as well. When applied in the right way, however, we can get a significant boost in speed, while retaining the same model accuracy.

Posts by PyTorch report a 2x speed improvement for a ResNet-50, and even though we only quantized a part of our model architecture, we also observed a 2x speed-up.

Let’s start by applying quantization to a single number. Simply casting the number from a floating point type to an integer one won’t do the trick. For most use cases, we use floating point values for a reason, and we need the fractional part.

To quantize a number, we define boxes like these:

The axis below is the number line, and the number in each box is the representation of all numbers falling in that box (e.g., 0.25 is represented as 5 when quantized in this scheme). Note that 0.3 falls in the 5 box as well and will be represented as 5. Quantization adds some rounding noise that **WILL** change the model’s output.

Mathematically quantization looks like this:

As you can see, it’s just linear transformations and rounding.

scale and zero_point are the parameters of the quantization. Let’s assume for now that they are known beforehand. In the graphic above, their values are scale = 0.25 and zero_point = 4. We need scale to represent the fractional part of numbers without losing accuracy, and zero_point helps us represent negative numbers in unsigned types (such as uint8).
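Written out, the mapping can be sketched as follows. This form is inferred from the examples in this post (with scale = 0.25 and zero_point = 4, the value 0.25 maps to 5); the symbol $\hat{x}$ for the value recovered from the quantized representation is notation introduced here.

$$
Q(x) = \operatorname{round}\left(\frac{x}{\text{scale}} + \text{zero\_point}\right),
\qquad
\hat{x} = \left(Q(x) - \text{zero\_point}\right) \cdot \text{scale}
$$

The difference between $x$ and $\hat{x}$ is the rounding noise, which is at most half a box width (scale / 2) as long as $Q(x)$ fits in the target integer type.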

Let’s quantize the number 0.7 with the parameters above:
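A worked version of this step, assuming the affine form $Q(x) = \operatorname{round}(x / \text{scale} + \text{zero\_point})$ implied by the parameters above:

$$
Q(0.7) = \operatorname{round}\left(\frac{0.7}{0.25} + 4\right) = \operatorname{round}(6.8) = 7
$$

Converting back gives $(7 - 4) \cdot 0.25 = 0.75$, so the rounding noise for this value is $0.05$.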

To quantize a number, we need to provide a target type in addition to scale and zero_point. Quantization types are based on the integer types used to store the representation (e.g., quint8, qint8, etc.). To understand the difference between them, let’s see how we quantize the number -1.5 with the above scale and zero_point.

Plugging -1.5 into our quantization equation we get -2, but we can’t store negative numbers in uint8, so instead we store it as 0. We do the same when storing numbers bigger than 255, clipping them to 255. Clipping to the smallest or biggest number instead of overflowing is called **saturation**. This minimizes the error from quantization. Note that all quantized operations are designed to saturate, including addition and convolution.
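The rounding and saturation steps can be sketched in a few lines of plain Python. This is a minimal illustration, not PyTorch's implementation: the helper names `quantize` and `dequantize` and the `qmin`/`qmax` parameters are made up for this example, and the affine form round(x / scale + zero_point) is assumed from the examples above.

```python
def quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Affine-quantize one float into [qmin, qmax] with saturation."""
    q = round(x / scale + zero_point)  # linear transform, then rounding
    return max(qmin, min(qmax, q))     # saturate instead of overflowing


def dequantize(q, scale, zero_point):
    """Map a quantized integer back to the float it represents."""
    return (q - zero_point) * scale


scale, zero_point = 0.25, 4
for x in (0.7, -1.5, 100.0):
    q = quantize(x, scale, zero_point)
    print(x, "->", q, "->", dequantize(q, scale, zero_point))
```

With the unsigned range 0..255, -1.5 saturates to 0 and 100.0 saturates to 255. Passing `qmin=-128, qmax=127` mimics a signed 8-bit type, where -1.5 can be stored as -2 without clipping.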

In PyTorch, the framework we used for our quantized model, there are 3 quantized dtypes: quint8, qint8 and qint32.

quint8 is used to store layer activations and qint8 is used to store model weights. This is an implementation detail specific to PyTorch. qint32 can be used for auxiliary computations, but we only use it during initialization.

### Quantizing tensors in PyTorch

To use quantization, we must first quantize a floating point tensor. The easiest way to do this is via the torch.quantize_per_tensor(input, scale, zero_point, dtype) function. It quantizes the input tensor and stores the result in the quantized type dtype. All values of the tensor are quantized with the same scalar scale and zero_point.

In some cases, different channels of the same tensor have different ranges, and then it makes sense to have a different scale and zero_point for each channel. We can achieve this by using torch.quantize_per_channel(input, scales, zero_points, axis, dtype). Now scales and zero_points are 1d tensors instead of scalars. axis specifies the axis that defines the parameters that are used (e.g., if input is 4d and axis=0, then input[i, j, k, m] is quantized using scales[i] and zero_points[i]). Note that PyTorch supports per channel quantized tensors only in the weights of convolutional and linear layers. To convert a quantized tensor q to a float one, just call q.dequantize().

Here’s an example of quantize_per_tensor usage:

```python
>>> import torch
>>> qt = torch.quantize_per_tensor(
...     input=torch.tensor([0.7, 0.5, -0.7]),
...     scale=0.25,
...     zero_point=4,
...     dtype=torch.quint8
... )
>>> qt
tensor([ 0.7500,  0.5000, -0.7500], size=(3,), dtype=torch.quint8,
       quantization_scheme=torch.per_tensor_affine, scale=0.25, zero_point=4)
>>> qt.int_repr()
tensor([7, 6, 1], dtype=torch.uint8)
```
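The per-channel variant can be sketched in the same way. The tensor values, scales and zero_points below are illustrative choices made for this example; they are picked so that every value lands exactly on a quantization step.

```python
import torch

# Two rows (channels along axis 0) with very different value ranges.
x = torch.tensor([[-1.0, 0.0, 0.5],
                  [10.0, 20.0, 40.0]])

# One scale and one zero_point per channel (illustrative values).
scales = torch.tensor([0.25, 0.5])
zero_points = torch.tensor([4, 0])

q = torch.quantize_per_channel(x, scales, zero_points, axis=0, dtype=torch.quint8)

print(q.int_repr())    # integer representation, quantized row by row
print(q.dequantize())  # back to float
```

Row 0 is quantized with scale 0.25 and zero_point 4, row 1 with scale 0.5 and zero_point 0, so the second row keeps a usable resolution even though its values are much larger.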

### Quantization modules in PyTorch

The easiest way to use models in PyTorch is to use modules. There are a number of quantization-related modules:

- Quantize and DeQuantize: Modules that convert their input from float to a quantized representation and vice versa. You can use them in a torch.nn.Sequential to quantize only part of the model.
- Conv1d, Conv2d and Conv3d: Quantized convolutions with most of the convolution bells and whistles – options for kernel_size, stride, dilation and groups. Using the groups option you can create depthwise convolutions, which actually achieve better speed-ups than full convolutions.
- ConvReLU*d: Since convolutions are frequently followed by ReLUs, and quantization is all about speed, we have fused ConvReLU layers.
- QFunctional: Class used to implement quantized arithmetic operations (e.g., the addition in the residual layers).

Interesting additions are QuantStub, DeQuantStub and FloatFunctional. They are helper modules that allow us to build a floating point model that can be quantized automatically. QuantStub and DeQuantStub simply output their input, and FloatFunctional has the same methods as QFunctional but for floating point numbers. When the model is quantized, those modules are converted to Quantize, DeQuantize and QFunctional.

With this non-exhaustive list of quantized operations we can build state-of-the-art image processing models that are core to the Hyperscience Platform.

It’s worth noting that all of these layers (except DeQuantize) need to have an output scale and zero_point. This is why QFunctional is used to wrap arithmetic operations.
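As a rough sketch of why the wrapper matters, here is a residual block whose addition goes through FloatFunctional instead of the `+` operator. `ResidualBlock` is a hypothetical module written for this example, and the import fallback is only there because the class has moved between namespaces across PyTorch releases.

```python
import torch

try:  # newer PyTorch releases expose the class under torch.ao
    from torch.ao.nn.quantized import FloatFunctional
except ImportError:  # older releases
    from torch.nn.quantized import FloatFunctional


class ResidualBlock(torch.nn.Module):
    """Hypothetical residual block written so that it can be quantized later."""

    def __init__(self, channels):
        super().__init__()
        self.conv = torch.nn.Conv2d(channels, channels, 3, padding=1)
        # Wrapping the addition in a module gives it a place to store
        # its own output scale and zero_point after quantization.
        self.skip_add = FloatFunctional()

    def forward(self, x):
        return self.skip_add.add(x, self.conv(x))


block = ResidualBlock(8).eval()
x = torch.randn(1, 8, 16, 16)
out = block(x)
print(out.shape)
```

In floating point mode the wrapper behaves like a plain `x + self.conv(x)`; the difference only shows up once the model is converted.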

Here’s an example of using the floating point layers and stubs to create a model that will be quantized:

```python
>>> float_model = torch.nn.Sequential(
...     torch.quantization.QuantStub(),
...     torch.nn.Conv2d(3, 16, 3, padding=1),
...     torch.nn.BatchNorm2d(16),
...     torch.nn.ReLU(),
...     torch.nn.Conv2d(16, 24, 3, padding=1),
...     torch.nn.BatchNorm2d(24),
...     torch.nn.ReLU(),
...     torch.quantization.DeQuantStub(),
...     torch.nn.Flatten(),
...     torch.nn.Linear(24 * 28 * 28, 10),
...     torch.nn.Softmax(dim=-1),
... )
```

One important class of modules missing from the above list is BatchNorm*d. They actually have a quantized counterpart, but this isn’t the most efficient use of batch normalization. During inference, batch normalization is just multiplication and subtraction with predefined parameters – basically a linear operation. The layer right before it, convolution, is a linear operation, too. We can replace the Convolution + Batch Normalization pair with a single Convolution in a process known as **fusion**.

To do that fusion, you need to know the parameters of both layers, but it’s just a few arithmetic operations. The resulting convolution is quantized as normal.
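Those few arithmetic operations can be sketched by hand. This is an illustration of the idea under the usual inference-time batch norm formula, not the code PyTorch runs internally; the layer sizes and the random statistics are arbitrary choices for the example.

```python
import torch

torch.manual_seed(0)
conv = torch.nn.Conv2d(3, 16, 3, padding=1)
bn = torch.nn.BatchNorm2d(16)

# Give the batch norm non-trivial statistics and affine parameters.
with torch.no_grad():
    bn.running_mean.uniform_(-1.0, 1.0)
    bn.running_var.uniform_(0.5, 2.0)
    bn.weight.uniform_(0.5, 1.5)
    bn.bias.uniform_(-1.0, 1.0)
conv.eval()
bn.eval()

# Per-output-channel factor applied by batch norm at inference time.
factor = bn.weight / torch.sqrt(bn.running_var + bn.eps)

# Fold the factor into the convolution weights and bias.
fused = torch.nn.Conv2d(3, 16, 3, padding=1)
with torch.no_grad():
    fused.weight.copy_(conv.weight * factor.reshape(-1, 1, 1, 1))
    fused.bias.copy_((conv.bias - bn.running_mean) * factor + bn.bias)
fused.eval()

x = torch.randn(2, 3, 8, 8)
with torch.no_grad():
    max_diff = (fused(x) - bn(conv(x))).abs().max().item()
print(max_diff)  # expected to be at floating point noise level
```

The single fused convolution reproduces the Convolution + Batch Normalization pair up to floating point error, which is why it can replace both layers before quantization.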

The fusion is implemented in the torch.quantization.fuse_modules function, which requires the names of the modules that will be fused.

Here’s an example:

```python
>>> float_model.eval()
>>> # ['1', '2', '3'] are the names of the modules that will be fused into the first layer
>>> fused_float_model = torch.quantization.fuse_modules(float_model, [['1', '2', '3'], ['4', '5', '6']])
>>> fused_float_model
Sequential(
  (0): QuantStub()
  (1): ConvReLU2d(
    (0): Conv2d(3, 16, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
  )
  (2): Identity()
  (3): Identity()
  (4): ConvReLU2d(
    (0): Conv2d(16, 24, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
    (1): ReLU()
  )
  (5): Identity()
  (6): Identity()
  (7): DeQuantStub()
  (8): Flatten()
  (9): Linear(in_features=18816, out_features=10, bias=True)
  (10): Softmax(dim=-1)
)
```

The Conv + batch norm + ReLU layers are replaced by a fused ConvReLU2d layer + two Identity layers, so the model has the same number of layers.

### Quantized Model Lifecycle

Now that we have most of the components for quantization, let’s take a step back and describe the whole process of building a quantized model:

1. **Train the floating point model as usual.** Most of the time, no special handling is required. After the final checkpoint is chosen, no training with backpropagation will be used.
2. **Fuse Batch Normalizations** as described in the paragraph above.
3. **Compute scales and zero_points.** Most operations require concrete values for those parameters. The process of selecting the values for them will be described in the following paragraphs.
4. **Quantization.** Quantize the model weights and replace the floating point modules with the quantized ones. There are helper functions in PyTorch to help with this task.

### Selecting scales and zero_points

Selecting scales and zero_points is the same as selecting the minimal and maximal values representable by a given tensor. (For example, if the type of the tensor is quint8, then Q(min) = 0 and Q(max) = 255.) Solving these equations gives us the values of scale and zero_point.
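Spelled out for quint8, and assuming the affine form $Q(x) = \operatorname{round}(x / \text{scale} + \text{zero\_point})$ with the rounding ignored for the two endpoints, the conditions form a small linear system. Here $x_{\min}$ and $x_{\max}$ stand for the chosen minimal and maximal representable values.

$$
\frac{x_{\min}}{\text{scale}} + \text{zero\_point} = 0,
\qquad
\frac{x_{\max}}{\text{scale}} + \text{zero\_point} = 255
$$

Subtracting the first equation from the second and substituting back gives:

$$
\text{scale} = \frac{x_{\max} - x_{\min}}{255},
\qquad
\text{zero\_point} = \operatorname{round}\left(-\frac{x_{\min}}{\text{scale}}\right)
$$

For instance, $x_{\min} = -1$ and $x_{\max} = 62.75$ give scale = 0.25 and zero_point = 4, the parameters used in the earlier examples.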

The process of selecting these parameters is quite critical. If done correctly, you should get mostly the same accuracy as the floating point model. If you’re not careful, however, the model can produce the same output for all inputs. If the min-max range is too narrow, we will saturate most of the values in the tensor and work with capped values, but if the range is too wide, we will lose resolution by introducing rounding errors and significant differences between values will be lost. It’s a delicate balance!
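The trade-off can be made concrete with a small plain-Python experiment. The helper `roundtrip_error` and the three ranges are made up for this illustration; the data is simply 201 evenly spaced values in [-1, 1].

```python
def roundtrip_error(values, range_min, range_max, qmin=0, qmax=255):
    """Mean absolute error after a quantize/dequantize round trip."""
    scale = (range_max - range_min) / (qmax - qmin)
    zero_point = round(qmin - range_min / scale)
    total = 0.0
    for x in values:
        q = max(qmin, min(qmax, round(x / scale + zero_point)))
        total += abs((q - zero_point) * scale - x)
    return total / len(values)


values = [i / 100 for i in range(-100, 101)]        # data spread over [-1, 1]
too_narrow = roundtrip_error(values, -0.1, 0.1)     # most values saturate
well_fit = roundtrip_error(values, -1.0, 1.0)       # range matches the data
too_wide = roundtrip_error(values, -100.0, 100.0)   # coarse rounding steps
print(too_narrow, well_fit, too_wide)
```

Both the too-narrow and the too-wide range produce errors that are orders of magnitude larger than the well-fitted one, for the two different reasons described above: saturation in the first case and coarse rounding in the second.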

### Observers

The classic approach to finding the optimal range for a quantized tensor is to gather statistics of actual values. This is achieved using **Observers**, or objects that gather sufficient statistics from the tensors they “observe.” The scales and zero_points are computed based on those statistics.

How we use the observer object depends on the usage of the tensor we’re quantizing. If we’re quantizing a model weight, then we create a separate observer that will only observe that tensor weight.

Finding the ranges for the outputs of the quantized layers, however, is a bit more complex.

In that case, we create an observer object for each layer and then run the model on a sample of the training or the validation sets. Each observer looks at the corresponding layer’s outputs for each example and updates its statistics. There are multiple observer algorithms.

It’s worth stressing the importance of a good dataset for the above step. If the dataset is not diverse enough, examples from other distributions might suffer from excessive clipping. If the dataset contains examples outside of the distribution, they might stretch the range between min and max too much, which will result in a bigger rounding error for in-distribution examples.

### MinMaxObserver

The simplest observer is the MinMaxObserver. As its name conveys, it keeps track of the minimal and maximal tensor value it sees. It’s implemented in PyTorch and is probably the first observer you should try. One notable downside is that it’s *really* sensitive to outliers, so if you have even one extremely big or small value, your range will be too wide and you’ll lose resolution. Outliers in the model weights are rare, so this observer does a good job for them. If the ranges of your channels are different, you can still use per channel quantization and PerChannelMinMaxObserver.
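To make the idea tangible, here is a toy min/max observer in plain Python. `SimpleMinMaxObserver` is a stand-in written for this post, not PyTorch's MinMaxObserver; it assumes an unsigned 8-bit target, extends the range to include 0.0 (a common practice so that zero is exactly representable), and skips degenerate cases such as an all-zero tensor.

```python
class SimpleMinMaxObserver:
    """Toy stand-in for a min/max observer; not PyTorch's implementation."""

    def __init__(self, qmin=0, qmax=255):
        self.qmin, self.qmax = qmin, qmax
        self.min_val = float("inf")
        self.max_val = float("-inf")

    def observe(self, values):
        # Update the running minimum and maximum with a new batch of values.
        self.min_val = min(self.min_val, min(values))
        self.max_val = max(self.max_val, max(values))

    def calculate_qparams(self):
        # Make sure 0.0 is inside the range so it is exactly representable.
        lo, hi = min(self.min_val, 0.0), max(self.max_val, 0.0)
        scale = (hi - lo) / (self.qmax - self.qmin)
        zero_point = round(self.qmin - lo / scale)
        return scale, zero_point


observer = SimpleMinMaxObserver()
observer.observe([-1.0, 0.3, 0.7])
observer.observe([0.1, 62.75])
scale, zero_point = observer.calculate_qparams()
print(scale, zero_point)

# A single outlier stretches the range and makes every step coarser.
observer.observe([1000.0])
wide_scale, wide_zero_point = observer.calculate_qparams()
print(wide_scale, wide_zero_point)
```

After the outlier is observed, the step size grows from 0.25 to almost 4, which illustrates the sensitivity to outliers mentioned above.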

Here’s an example of the process of attaching observers and finishing the quantization process:

```python
>>> # QConfig defines how we quantize the model's activations and weights.
>>> # For the weights PyTorch requires that we use the torch.qint8 dtype.
>>> qconfig = torch.quantization.QConfig(
...     activation=torch.quantization.MinMaxObserver,
...     weight=torch.quantization.MinMaxObserver.with_args(dtype=torch.qint8),
... )
>>> # We attach the qconfigs to the layers that will be quantized.
>>> fused_float_model[0].qconfig = qconfig
>>> fused_float_model[1].qconfig = qconfig
>>> fused_float_model[4].qconfig = qconfig
>>> fused_float_model[7].qconfig = qconfig
>>> # We prepare the model for statistics gathering
>>> prepared_float_model = torch.quantization.prepare(fused_float_model)
>>> # We gather the statistics on a sampling dataset
>>> # (run_on_dataset stands for your own evaluation loop; it is not a PyTorch function)
>>> run_on_dataset(prepared_float_model, '')
>>> # After we've gathered the statistics we quantize the modules
>>> quantized_model = torch.quantization.convert(prepared_float_model)
>>> quantized_model
Sequential(
  (0): Quantize(scale=tensor([0.0039]), zero_point=tensor([0]), dtype=torch.quint8)
  (1): QuantizedConvReLU2d(3, 16, kernel_size=(3, 3), stride=(1, 1), scale=0.003506328444927931, zero_point=0, padding=(1, 1))
  (2): Identity()
  (3): Identity()
  (4): QuantizedConvReLU2d(16, 24, kernel_size=(3, 3), stride=(1, 1), scale=0.001739502651616931, zero_point=0, padding=(1, 1))
  (5): Identity()
  (6): Identity()
  (7): DeQuantize()
  (8): Flatten()
  (9): Linear(in_features=18816, out_features=10, bias=True)
  (10): Softmax(dim=-1)
)
```

As you can see, the stubs and convolution layers are converted to their quantized counterparts, and we can thus do quantized inference with this model.

## More On Quantization

### Quantization-Aware Training

The quantization workflow described in this post is one of the simplest quantization workflows. One common addition to it is the so-called Quantization-Aware Training (QAT), which is applied if quantizing the model introduces too much degradation. A possible cause might be that when the model is trained, there’s no rounding error and values are not clipped to the minimal/maximal representable value.

In QAT, we start with the normal floating point training, but then we add fake quantization layers and continue the training process. The fake quantization layers simulate the rounding and saturation errors, but their result is still a floating point value. After the model is trained with QAT and the checkpoint is selected, the quantization process is resumed as described above.
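What a fake quantization layer computes in its forward pass can be sketched on a single number. The function name `fake_quantize` and its parameters are chosen for this example; real implementations also have to define a gradient for the rounding step, which this sketch leaves out.

```python
def fake_quantize(x, scale, zero_point, qmin=0, qmax=255):
    """Quantize and immediately dequantize, returning a float."""
    q = round(x / scale + zero_point)
    q = max(qmin, min(qmax, q))      # simulated saturation
    return (q - zero_point) * scale  # still a floating point value


rounded = fake_quantize(0.7, scale=0.25, zero_point=4)     # rounding error
saturated = fake_quantize(-1.5, scale=0.25, zero_point=4)  # saturation error
print(rounded, saturated)
```

The outputs stay floats (0.75 and -1.0 here), but they already carry the rounding and saturation errors the quantized model will see, so the remaining training steps can adapt the weights to them.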

### Dynamic Quantization

The methods described above use the so-called “static quantization” where the quantization parameters are not influenced by the input of the model. If you quantize LSTMs, that might not be optimal because their output ranges are more volatile. In that case, you can use dynamic quantization to compute quantization parameters on the fly.
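A minimal sketch of dynamic quantization, assuming a CPU build of PyTorch with a quantized backend available. `Tagger` is a hypothetical model written for this example, and the layer sizes are arbitrary.

```python
import torch


class Tagger(torch.nn.Module):
    """Hypothetical sequence model used only for illustration."""

    def __init__(self):
        super().__init__()
        self.lstm = torch.nn.LSTM(16, 32, batch_first=True)
        self.head = torch.nn.Linear(32, 5)

    def forward(self, x):
        output, _ = self.lstm(x)
        return self.head(output)


float_model = Tagger().eval()

# Weights are quantized ahead of time; the quantization parameters of the
# activations are computed from each input at inference time.
dynamic_model = torch.quantization.quantize_dynamic(
    float_model, {torch.nn.LSTM, torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(4, 10, 16)  # (batch, sequence, features)
with torch.no_grad():
    out = dynamic_model(x)
print(out.shape)
```

No observers or calibration dataset are needed here, which makes this variant easier to apply, at the cost of computing the activation ranges on every call.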

If you’re interested in learning more about quantization, a good place to start is the PyTorch quantization page. The math behind the quantization operations is described in *Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference*, linked here. This also provides a good list of quantization-related papers and resources.

*Author’s Note: Thank you to Romain Sauvestre on the ML team for his feedback, revisions and recommendations, as well as Annie Christian on our Marketing team.*


*Daniel Balchev is a Staff Machine Learning Engineer at Hyperscience based out of our Sofia office. He is a regular contributor to this blog. Connect with him on LinkedIn.*