TensorRT FP16 slower than FP32 on the RTX 5070 Ti

Verdict

On an RTX 5070 Ti, converting to TensorRT FP16 does not always beat TensorRT FP32. For a wide set of detection models, FP16 runs slower: D-FINE-N drops from 98.7 to 79.7 FPS (about 19% lower), and ECDet-S drops from 63.4 to 43.5 FPS (about 31% lower). The effect spans nano through extra-large variants.

The table below covers every model on the RTX 5070 Ti where TensorRT FP16 throughput came in under TensorRT FP32, with both FPS figures and the mAP change side by side. The mAP change is given in percentage points. Some gaps are small: DEIMv2-X moves from 38.1 to 37.9 FPS. Others are wide: D-FINE-S moves from 98.7 to 86.5 FPS, and DEIM-S from 97.6 to 82.4 FPS.

Model	TensorRT FP32 FPS	TensorRT FP16 FPS	Speedup	mAP delta (pts)
YOLOX-X	48.4	72.1	1.49x	-1.0
RT-DETR-R101	64.4	88.0	1.37x	-2.0
RT-DETRv2-R101	64.8	87.4	1.35x	+1.0
RT-DETRv2-R34	102.5	132.9	1.30x	+1.0
RT-DETR-L	94.2	122.0	1.29x	+5.0
YOLOX-L	61.8	79.3	1.28x	-1.0
D-FINE-X	57.9	73.0	1.26x	+4.0
PicoDet-M	53.5	65.7	1.23x	+126.0
RT-DETRv4-X	58.3	71.0	1.22x	-6.0
DEIM-X	58.6	70.7	1.21x	-12.0
DEIM-N	67.5	81.3	1.20x	-14.0
RT-DETR-X	68.5	82.3	1.20x	-9.0
PicoDet-L	32.0	38.0	1.19x	-9.0
YOLOX-M	70.1	83.0	1.18x	-9.0
RT-DETRv2-R50	88.5	103.6	1.17x	-1.0

TensorRT FP16 vs TensorRT FP32 on NVIDIA RTX 5070 Ti, batch 1. mAP delta in percentage points.
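
The Speedup column appears to be FP16 FPS divided by FP32 FPS; that reading reproduces the 1.49x shown for YOLOX-X. Under that assumption, a ratio below 1.00 marks a regression. A minimal Python sketch using figures quoted on this page (the function names are illustrative and not part of any tool):

```python
def speedup(fp32_fps: float, fp16_fps: float) -> float:
    """FP16 throughput relative to FP32; below 1.0 means FP16 is slower."""
    if fp32_fps <= 0:
        raise ValueError("fp32_fps must be positive")
    return fp16_fps / fp32_fps


def is_regression(fp32_fps: float, fp16_fps: float) -> bool:
    """True when FP16 measured fewer frames per second than FP32."""
    return speedup(fp32_fps, fp16_fps) < 1.0


# (model, FP32 FPS, FP16 FPS) pairs quoted in the text and table of this page.
PAIRS = [
    ("YOLOX-X", 48.4, 72.1),
    ("D-FINE-N", 98.7, 79.7),
    ("ECDet-S", 63.4, 43.5),
    ("DEIMv2-X", 38.1, 37.9),
]

for model, fp32, fp16 in PAIRS:
    ratio = speedup(fp32, fp16)
    label = "slower" if is_regression(fp32, fp16) else "faster"
    print(f"{model}: {ratio:.2f}x (FP16 {label})")
```

Comparing the ratio against 1.0 rather than looking at the raw FPS difference keeps small and large models on the same scale.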

Which models and where

The regression appears across the YOLOv9, RF-DETR, D-FINE, DEIM, DEIMv2, RT-DETR, RT-DETRv4, ECDet, and YOLOX families. It is not tied to model size: YOLOv9-T at 59.0 FPS and DEIMv2-X at 37.9 FPS both regress. Every row here is on the RTX 5070 Ti. This finding does not extend to other hardware.

What we did not determine

We did not determine why FP16 lands slower. We did not vary TensorRT version, driver, batch size, or input resolution beyond the fixed protocol. We did not measure whether the gap reproduces on a second card of the same model. The claim is narrow: on this GPU, with batch size and input resolution held fixed, FP16 measured slower than FP32 for these models.
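
For anyone who wants to check the gap on a second card, a throughput measurement could look roughly like the sketch below. It is not the timing code behind these numbers, which this page does not describe. `infer` is a hypothetical callable that runs one batch-1 inference, and the warmup and iteration counts are arbitrary assumptions.

```python
import time
from typing import Callable


def measure_fps(infer: Callable[[], object], warmup: int = 20, iters: int = 200) -> float:
    """Return calls per second for a callable that runs one batch-1 inference.

    `infer` is a hypothetical stand-in. It must block until the GPU has
    finished its work, otherwise the loop times only the kernel launches.
    """
    if iters <= 0:
        raise ValueError("iters must be positive")
    for _ in range(warmup):  # untimed runs so lazy initialization is excluded
        infer()
    start = time.perf_counter()
    for _ in range(iters):
        infer()
    elapsed = time.perf_counter() - start
    return iters / elapsed
```

Running it once with the FP32 engine and once with the FP16 engine, using the same warmup and iteration counts, gives two figures that can be compared directly.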

Every number on this page comes from the verified dataset: same 500-image COCO val2017 slice, conf 0.001, IoU 0.6, max 300 detections, pycocotools mAP, identical protocol across all hardware and runtimes. The full protocol is on the methodology page. To rerun this comparison with your own filters, open compare. Accuracy is measured on LibreYOLO retrained checkpoints; other weight sources can yield different values.
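
The fixed protocol values above can be written down as a small configuration object. The sketch below is illustrative only: the class, field, and function names are hypothetical, the page does not say which step the IoU 0.6 threshold applies to, and applying the 300-detection cap per image is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (image_id, category_id, score, (x, y, w, h)); layout chosen for this sketch.
Detection = Tuple[int, int, float, Tuple[float, float, float, float]]


@dataclass(frozen=True)
class EvalProtocol:
    num_images: int = 500          # size of the COCO val2017 slice
    conf_threshold: float = 0.001  # minimum score kept for evaluation
    iou_threshold: float = 0.6     # stated on the page; its exact role is not
    max_detections: int = 300      # assumed here to be a per-image cap


def filter_detections(
    dets: List[Detection], protocol: EvalProtocol = EvalProtocol()
) -> List[Detection]:
    """Drop boxes below the confidence threshold, then keep the top-scoring
    `max_detections` boxes for each image."""
    by_image: Dict[int, List[Detection]] = {}
    for det in dets:
        if det[2] >= protocol.conf_threshold:
            by_image.setdefault(det[0], []).append(det)
    kept: List[Detection] = []
    for image_id in sorted(by_image):
        ranked = sorted(by_image[image_id], key=lambda d: d[2], reverse=True)
        kept.extend(ranked[: protocol.max_detections])
    return kept
```

Freezing the dataclass makes it harder to change a threshold by accident between runs, which matters when the same settings are meant to hold across hardware and runtimes.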
