A Few Updates on Small-Model Quantization

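A quick look at where TensorRT-based quantization stands in 2024.
Before TensorRT 10.x, there were generally two ways to quantize:
Implicit quantization: calibrate with the tooling TensorRT provides (trtexec or the calibrator API) to obtain the scales and then build the quantized engine; or, if you already have the scales, set them directly through the Python API and build the engine.
Explicit quantization: quantize through QDQ nodes, which carry the scales; TensorRT's quantize and dequantize nodes give explicit control over where quantization happens.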
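To make the role of the scale concrete, here is a minimal sketch of symmetric per-tensor INT8 quantize and dequantize in plain Python. The function names and the scale of 0.1 are made up for illustration; real kernels work on whole tensors, not one float at a time.

```python
# Illustrative only: symmetric INT8 quantize/dequantize (a Q node and a DQ node).
INT8_MIN, INT8_MAX = -128, 127


def quantize(x: float, scale: float) -> int:
    """Map a float to INT8: divide by the scale, round, clamp to [-128, 127]."""
    q = round(x / scale)  # Python's round() rounds halves to even
    return max(INT8_MIN, min(INT8_MAX, q))


def dequantize(q: int, scale: float) -> float:
    """Map an INT8 value back to a float."""
    return q * scale


def fake_quantize(x: float, scale: float) -> float:
    """Q followed by DQ: the float value after a round trip through INT8."""
    return dequantize(quantize(x, scale), scale)


if __name__ == "__main__":
    scale = 0.1  # hypothetical scale; in practice it comes from calibration
    for x in (0.234, 1.0, 100.0):
        print(x, quantize(x, scale), fake_quantize(x, scale))
```

Values beyond 127 * scale saturate, which is why picking a good scale matters so much for accuracy.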
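Implicit quantization first. Most of you have probably run a command like trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 --int8. This performs implicit quantization (IQ): TensorRT calibrates and, at the same time, picks the fastest implementation for each op (which may be INT8 or FP16). Note that trtexec itself expects a calibration cache passed with --calib; without one it fills in placeholder dynamic ranges, so the resulting engine is only useful for measuring speed. IQ only works well for CNN models (transformers do not fare well), the resulting accuracy is hard to control and hard to reproduce, and this mode supports INT8 only.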
Besides the command line, we usually use the API:
# DataLoader, Calibrator and build_engine_onnx are helper code not shown here
calibration_table = 'temp_cal.cache'
calibration_stream = DataLoader(image_path="/data/homework/code/data/cal_data")
calibrator = Calibrator(calibration_stream, calibration_table)

# Parse the ONNX graph through TensorRT and build the engine
trt_engine = build_engine_onnx(args.onnx_fp32, args.verbose, calibrator=calibrator)
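What calibration ultimately produces is a scale per tensor. As a rough illustration, the sketch below derives one from the largest absolute activation seen across calibration batches. This max-abs rule is the simplest possible stand-in and is an assumption for the example; TensorRT's own calibrators (the entropy calibrator, for instance) choose the range differently. The function name and the sample values are made up.

```python
from typing import Iterable, Sequence


def max_calibrate(batches: Iterable[Sequence[float]]) -> float:
    """Return a symmetric INT8 scale from the largest |activation| observed."""
    amax = 0.0
    for batch in batches:
        for value in batch:
            amax = max(amax, abs(value))
    if amax == 0.0:
        raise ValueError("calibration data is all zeros; scale is undefined")
    return amax / 127.0  # the largest magnitude maps to INT8 value 127


# Made-up activations standing in for real calibration batches.
calibration_batches = [[0.5, -2.54, 1.2], [0.1, 2.0]]
scale = max_calibrate(calibration_batches)
print(f"scale = {scale:.4f}")
```

A single outlier in the calibration data inflates the scale and wastes INT8 resolution on values that rarely occur, which is one reason more careful range-selection methods exist.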
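Now for explicit quantization:
Explicit quantization was added to TensorRT later. It is more flexible and more controllable. The final performance of the quantized engine may fall short of implicit quantization, but it can get arbitrarily close: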

Once we have an ONNX graph with QDQ nodes, like the one on the right of the figure above, TensorRT can read it and perform explicit quantization.
ModelOpt[1]
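"Starting from version 0.11, nvidia-ammo is renamed as nvidia-modelopt."
ModelOpt is another library, released alongside TRT-LLM's large-model quantization. It grew out of ammo and later absorbed pytorch-quantization[2]; NVIDIA now officially recommends it for quantizing both large and small models.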
"Assuming this question is related to ONNX quantization, certain models can achieve faster speeds with the default TensorRT 10 implicit quantization (IQ), but ModelOpt's explicit quantization (EQ) will generally yield faster speeds for most network types, especially for Vision Transformers (ViTs). If users find that their network is not faster with ModelOpt quantization, they should file a bug report with a reproducible example."
That said, most CNN models still perform better with implicit quantization, whereas for Transformer-type models explicit quantization should in theory give better performance and accuracy.
"TensorRT will deprecate IQ in future releases and recommend EQ for users seeking better speed improvements and accuracy control."
Implicit quantization will eventually be dropped, though TensorRT 10.3 still supports it.
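ModelOpt offers two ways to quantize:
modelopt.onnx: internally calls onnxruntime, inserts QDQ nodes into the ONNX model you provide, and writes out a quantized ONNX file.
modelopt.torch: operates on the PyTorch model, likewise inserting Quantizer layers; after quantization it is still the same model, much like the earlier pytorch-quantization. It can then be exported to ONNX, or taken straight down the pt2trt route.
A simple pair of commands: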

# This step produces a quantized ONNX model containing QDQ nodes
python -m modelopt.onnx.quantization --onnx_path=model.onnx --quantize_mode=int8 --calibration_data=calib.npy --output_path=model.quant.onnx

# Read the ONNX model and run explicit quantization
trtexec --onnx=model.quant.onnx --saveEngine=model.engine --fp16 --int8
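The role of onnxruntime above is to simulate the quantization process before TensorRT actually builds the engine, so you can check whether the ONNX model with QDQ nodes inserted has accuracy problems. Once it looks fine, convert it to TensorRT.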
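The idea of simulating quantization before building an engine can be illustrated without any framework. The sketch below pushes activations and weights through a quantize/dequantize round trip and compares a tiny matrix-vector product against its FP32 result. All values, the helper names, and the max-abs choice of scale are assumptions for the example; this is not what ModelOpt runs internally.

```python
import math
from typing import List


def amax_scale(values: List[float]) -> float:
    """Symmetric INT8 scale from the largest magnitude (illustrative choice)."""
    return max(abs(v) for v in values) / 127.0


def fake_quant(values: List[float], scale: float) -> List[float]:
    """Quantize to INT8 and dequantize again, as a Q/DQ pair would."""
    return [max(-128, min(127, round(v / scale))) * scale for v in values]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def sqnr_db(reference: List[float], approx: List[float]) -> float:
    """Signal-to-quantization-noise ratio in dB; higher means less error."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)


# Made-up activations and a 2x4 weight matrix.
activations = [0.12, -0.8, 0.45, 1.3]
weights = [[0.5, -0.25, 0.1, 0.7], [-0.3, 0.9, 0.05, -0.6]]

fp32_out = [dot(activations, row) for row in weights]
act_q = fake_quant(activations, amax_scale(activations))
sim_out = [dot(act_q, fake_quant(row, amax_scale(row))) for row in weights]

print("fp32:", fp32_out)
print("simulated int8:", sim_out)
print(f"SQNR: {sqnr_db(fp32_out, sim_out):.1f} dB")
```

On a real model the same comparison would be made on task metrics (accuracy, mAP and so on) over a validation set, not on a single product.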
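Mixed precision
Quantizing the whole model to INT8 can wreck accuracy, so mixed precision lets us trade off speed against accuracy. Most of the time we can get a quantized model whose accuracy matches FP16 while being noticeably faster.
The simplest approach is to control the precision of each layer: assign each layer INT8 or FP16 and look for the most suitable combination. As for how to search, you can rely on experience, heuristics, bisection, and so on.
This can be set through the API: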
# layer_precisions is assumed to map layer name -> trt.DataType (e.g. trt.int8, trt.float16)
for layer in network:
    if layer.name in layer_precisions:
        layer.precision = layer_precisions[layer.name]
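As one possible way to find such a per-layer assignment by bisection, the sketch below keeps the k most sensitive layers in FP16 and binary-searches for the smallest k that still meets an accuracy target. It assumes the layers are already ranked by sensitivity and that accuracy never drops when more layers stay in FP16. The evaluate callback, the layer names, and the toy penalties are all hypothetical.

```python
from typing import Callable, Dict, List


def search_layer_precisions(
    layers_by_sensitivity: List[str],
    evaluate: Callable[[Dict[str, str]], float],
    target_accuracy: float,
) -> Dict[str, str]:
    """Find the fewest FP16 layers that still reach target_accuracy.

    Assumes layers are sorted most-sensitive first and that accuracy is
    non-decreasing as more layers are kept in FP16.
    """

    def assign(num_fp16: int) -> Dict[str, str]:
        return {
            name: ("fp16" if i < num_fp16 else "int8")
            for i, name in enumerate(layers_by_sensitivity)
        }

    lo, hi = 0, len(layers_by_sensitivity)
    if evaluate(assign(hi)) < target_accuracy:
        raise ValueError("target not reachable even with every layer in FP16")
    while lo < hi:
        mid = (lo + hi) // 2
        if evaluate(assign(mid)) >= target_accuracy:
            hi = mid  # good enough; try keeping fewer layers in FP16
        else:
            lo = mid + 1  # too lossy; keep more layers in FP16
    return assign(lo)


# Toy stand-in for a real validation run: each INT8 layer costs some accuracy.
toy_penalty = {"conv1": 0.03, "attn": 0.02, "conv2": 0.004, "fc": 0.001}


def toy_evaluate(precisions: Dict[str, str]) -> float:
    lost = sum(toy_penalty[n] for n, p in precisions.items() if p == "int8")
    return 0.80 - lost


result = search_layer_precisions(list(toy_penalty), toy_evaluate, target_accuracy=0.79)
print(result)
```

In a real pipeline evaluate would build an engine with the given assignment and run validation, and the "fp16"/"int8" strings would be mapped to TensorRT data types such as trt.float16 and trt.int8 before being assigned to layer.precision.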
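For explicit quantization, we only need to insert QDQ nodes where required. By default we can treat any op wrapped between QDQ operations as an INT8 op, that is, an op we want quantized. For the details, see one of my earlier posts.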
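The rule of thumb that an op fed by QDQ is an INT8 op can be sketched as a small graph walk. The Node class below is a hypothetical stand-in for real ONNX nodes (scale and zero-point inputs are left out), and the rule is deliberately simplified: TensorRT's actual fusion decisions involve more conditions than this.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical, stripped-down graph node (not the onnx NodeProto API)."""
    name: str
    op_type: str
    inputs: List[str]
    outputs: List[str]


def find_int8_candidates(nodes: List[Node]) -> List[str]:
    """Return compute ops whose every input is produced by a DequantizeLinear."""
    producer: Dict[str, Node] = {out: n for n in nodes for out in n.outputs}
    qdq_ops = {"QuantizeLinear", "DequantizeLinear"}
    candidates = []
    for node in nodes:
        if node.op_type in qdq_ops or not node.inputs:
            continue
        fed_by_dq = all(
            tensor in producer and producer[tensor].op_type == "DequantizeLinear"
            for tensor in node.inputs
        )
        if fed_by_dq:
            candidates.append(node.name)
    return candidates


# Conv has Q/DQ on both activation and weight; Softmax has none.
graph = [
    Node("q_x", "QuantizeLinear", ["x"], ["x_q"]),
    Node("dq_x", "DequantizeLinear", ["x_q"], ["x_dq"]),
    Node("q_w", "QuantizeLinear", ["w"], ["w_q"]),
    Node("dq_w", "DequantizeLinear", ["w_q"], ["w_dq"]),
    Node("conv", "Conv", ["x_dq", "w_dq"], ["y"]),
    Node("softmax", "Softmax", ["y"], ["z"]),
]
print(find_int8_candidates(graph))
```

Leaving an op such as Softmax without QDQ around it is how a layer is kept in higher precision under explicit quantization.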

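With ModelOpt, NVIDIA recommends using explicit quantization only. At present ModelOpt's ONNX support is the more complete one (you can control which nodes are quantized and which are not), while its torch support is still somewhat rough: you have to modify the source code to control where QDQ nodes are placed.
There is a way around this, though: Torch-TensorRT[3], which the next post will look at.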
References
https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/5[4]
https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/25[5]
https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/11[6]
https://github.com/NVIDIA/TensorRT/issues/3701[7]
https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/54[8]
https://github.com/NVIDIA/TensorRT/issues/3987[9]
Reference links
[1] ModelOpt: https://nvidia.github.io/TensorRT-Model-Optimizer/
[2] pytorch-quantization: https://github.com/NVIDIA/TensorRT/tree/release/8.6/tools/pytorch-quantization
[3] Torch-TensorRT: https://github.com/pytorch/TensorRT
[4] https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/5
[5] https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/25
[6] https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/11
[7] https://github.com/NVIDIA/TensorRT/issues/3701
[8] https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/54
[9] https://github.com/NVIDIA/TensorRT/issues/3987


Author's column: oldpan博客