Verdict

Across 51 models on the NVIDIA RTX 5070 Ti, TensorRT FP16 gives a median 2.22x speedup over PyTorch FP32, ranging from 0.96x to 4.72x. The biggest gain is DEIM-X at 4.72x, from 15.0 to 70.7 FPS. FP16 is not free here: four models lose half a point of mAP or more, and one model, PicoDet-L, runs slower after conversion. Check your specific model before committing. All mAP values on this page are given in percent.

PyTorch FP32 is the reference runtime. TensorRT FP16 is the usual deployment path on desktop NVIDIA GPUs. This page compares the two on the same 51 models, with the same COCO protocol, on the same device: an NVIDIA RTX 5070 Ti.

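| Model | PyTorch FP32 FPS | TensorRT FP16 FPS | Speedup | mAP delta (hundredths of a pt) |
|---|---|---|---|---|
| DEIM-X | 15.0 | 70.7 | 4.72x | -11.0 |
| RT-DETR-L | 27.7 | 122.0 | 4.40x | +2.0 |
| RT-DETRv4-X | 16.6 | 71.0 | 4.27x | -5.0 |
| D-FINE-X | 18.1 | 73.0 | 4.02x | +5.0 |
| RT-DETR-X | 21.8 | 82.3 | 3.78x | -9.0 |
| RT-DETR-R101 | 23.6 | 88.0 | 3.74x | -3.0 |
| RT-DETRv2-R101 | 25.7 | 87.4 | 3.40x | -1.0 |
| RT-DETRv2-R34 | 40.4 | 132.9 | 3.29x | +1.0 |
| RT-DETR-R50 | 29.8 | 97.4 | 3.27x | -13.0 |
| DEIMv2-Atto | 40.4 | 132.1 | 3.27x | -167.0 |
| RT-DETRv2-R50 | 32.1 | 103.6 | 3.23x | -4.0 |
| D-FINE-L | 21.9 | 69.3 | 3.17x | +6.0 |
| RT-DETRv4-L | 24.7 | 78.0 | 3.15x | -6.0 |
| RT-DETR-R50m | 35.3 | 110.1 | 3.12x | +3.0 |
| DEIM-M | 28.4 | 83.3 | 2.94x | -11.0 |

Per-model FPS under PyTorch FP32 and TensorRT FP16 on RTX 5070 Ti, with speedup and mAP delta. Top 15 by speedup. The mAP delta column appears to be in hundredths of a point: DEIMv2-Atto's -167.0 matches its drop from 27.5 to 25.8 mAP, about 1.7 points.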
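The speedup column can be checked by hand. The sketch below assumes speedup is simply TensorRT FPS divided by PyTorch FPS, which is consistent with the rows above. The function name `speedup` is illustrative. Because the table's FPS values are rounded to one decimal, a recomputed ratio can differ from the published one in the last digit.

```python
# FPS pairs copied from the table: (PyTorch FP32 FPS, TensorRT FP16 FPS).
fps = {
    "DEIM-X": (15.0, 70.7),
    "RT-DETR-L": (27.7, 122.0),
    "DEIM-M": (28.4, 83.3),
}


def speedup(pytorch_fps: float, tensorrt_fps: float) -> float:
    """Ratio of TensorRT FPS to PyTorch FPS; a value below 1.0 is a slowdown."""
    return tensorrt_fps / pytorch_fps


for model, (pt_fps, trt_fps) in fps.items():
    # Printed to two decimals, the same precision the table uses.
    print(f"{model}: {speedup(pt_fps, trt_fps):.2f}x")
```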

Speedup varies by family

Conversion gain depends on the model family. RT-DETR gains a median 3.27x, the most of any family here. PicoDet gains only 1.46x, the least. Set your speedup expectation by family, not by the single global median.
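A per-family median is straightforward to compute once each model is tagged with a family. The following is a minimal sketch, not the page's own pipeline. The speedups are copied from the top-15 table, the family labels are an assumption inferred from the model names, and `median_speedup_by_family` is a hypothetical helper. Since it only sees a subset of the 51 models, it will not reproduce the family medians quoted above.

```python
from collections import defaultdict
from statistics import median

# (model, family, speedup). Family labels are assumed from the model names.
rows = [
    ("RT-DETR-L", "RT-DETR", 4.40),
    ("RT-DETR-X", "RT-DETR", 3.78),
    ("RT-DETR-R101", "RT-DETR", 3.74),
    ("RT-DETR-R50", "RT-DETR", 3.27),
    ("RT-DETR-R50m", "RT-DETR", 3.12),
    ("D-FINE-X", "D-FINE", 4.02),
    ("D-FINE-L", "D-FINE", 3.17),
]


def median_speedup_by_family(rows):
    """Group speedups by family and return the median of each group."""
    by_family = defaultdict(list)
    for _model, family, speedup in rows:
        by_family[family].append(speedup)
    return {family: median(values) for family, values in by_family.items()}


family_medians = median_speedup_by_family(rows)
for family, value in family_medians.items():
    print(f"{family}: median {value:.2f}x over this subset")
```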

Accuracy cost

Most models hold accuracy, but four of the 51 lose half a point of mAP or more under FP16. DEIMv2-Atto falls from 27.5 to 25.8, the largest drop. YOLOX-S falls from 44.3 to 43.4, PicoDet-M from 37.9 to 37.3, and DEIMv2-X from 61.3 to 60.8. If you run one of these four, measure mAP after conversion rather than assuming parity.
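The half-point rule is easy to apply to your own before and after numbers. This sketch uses the four mAP pairs from the paragraph above. The 0.5-point threshold comes from the text, while the helper names `map_drop` and `flag_regressions` are illustrative. The drop is rounded to two decimals so that floating-point noise does not push a value that sits exactly on the threshold to the wrong side.

```python
# (PyTorch FP32 mAP, TensorRT FP16 mAP) in percent, from the paragraph above.
map_scores = {
    "DEIMv2-Atto": (27.5, 25.8),
    "YOLOX-S": (44.3, 43.4),
    "PicoDet-M": (37.9, 37.3),
    "DEIMv2-X": (61.3, 60.8),
}


def map_drop(fp32_map: float, fp16_map: float) -> float:
    """Points of mAP lost after conversion; rounded to absorb float noise."""
    return round(fp32_map - fp16_map, 2)


def flag_regressions(scores, threshold=0.5):
    """Return the models whose mAP drop meets or exceeds the threshold."""
    flagged = {}
    for model, (fp32_map, fp16_map) in scores.items():
        drop = map_drop(fp32_map, fp16_map)
        if drop >= threshold:
            flagged[model] = drop
    return flagged


flagged = flag_regressions(map_scores)
for model, drop in sorted(flagged.items(), key=lambda item: -item[1]):
    print(f"{model}: -{drop:.1f} pts")
```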

One model runs slower

TensorRT FP16 is not a guaranteed win on this card. PicoDet-L regresses to 0.96x, dropping from 39.5 to 38.0 FPS after conversion. It is the only model in the set that ends up slower than its PyTorch baseline. The rest all gain, but this is the reason to benchmark your own model instead of trusting the median.
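One way to benchmark your own model is a small timing harness that runs the same inference call under both runtimes. The sketch below is a generic pattern, not the protocol used for this page. `measure_fps`, `is_slower`, and the warmup and iteration counts are assumptions chosen for illustration. It takes any zero-argument callable, so it does not depend on a particular framework.

```python
import time
from typing import Callable


def measure_fps(run_once: Callable[[], object], warmup: int = 10, iters: int = 100) -> float:
    """Time `iters` calls after `warmup` untimed calls and return calls per second."""
    for _ in range(warmup):
        run_once()  # untimed: lets caches, allocators, and lazy init settle
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def is_slower(baseline_fps: float, candidate_fps: float) -> bool:
    """True when the converted model runs below its baseline FPS."""
    return candidate_fps < baseline_fps


# Stand-in workload so the sketch runs anywhere; replace with one inference call.
demo_fps = measure_fps(lambda: sum(range(10_000)), warmup=2, iters=20)
print(f"demo workload: {demo_fps:.1f} calls/s")
print("PicoDet-L slower after conversion:", is_slower(39.5, 38.0))
```

For GPU inference, the callable should block until the result is actually ready; in PyTorch that usually means calling `torch.cuda.synchronize()` inside it. Otherwise the timer measures only how quickly work is queued.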

TensorRT FP16 is a strong default on this GPU, but not automatic. One model got slower and four gave up accuracy. Verify both speed and mAP for your specific model before shipping.

Every number on this page comes from the verified dataset, measured with an identical protocol across all hardware and runtimes: the same 500-image COCO val2017 slice, conf 0.001, IoU 0.6, max 300 detections, and pycocotools mAP. The full protocol is on the methodology page. To rerun this comparison with your own filters, open the compare page. Accuracy is measured on LibreYOLO retrained checkpoints; other weight sources can yield different values.