ONNX FP32 vs PyTorch FP32 on RTX 5070 Ti: 55 models

Verdict

Across 55 models on the NVIDIA RTX 5070 Ti, ONNX Runtime FP32 delivers a median 1.92x speedup over PyTorch FP32, with per-model results ranging from 0.87x to 2.71x. The biggest gain is DEIM-L at 2.71x, from 18.6 to 50.5 FPS. Because precision does not change, accuracy stays within measurement noise of the PyTorch baseline, and only one model runs slower than that baseline. It is a low-risk speedup when you do not want to trade away mAP. mAP is shown in percent form.

PyTorch FP32 is the reference runtime. ONNX Runtime FP32 keeps the same precision and runs the exported graph. This page compares the two on the same 55 models, the same COCO protocol, and the same device: an NVIDIA RTX 5070 Ti.

Model	PyTorch FP32 FPS	ONNX Runtime FP32 FPS	Speedup	mAP delta (pts)
DEIM-L	18.6	50.5	2.71x	+1.0
DEIM-X	15.0	39.8	2.66x	+0.0
DEIMv2-Atto	40.4	102.0	2.52x	+8.0
D-FINE-N	32.5	78.9	2.43x	+1.0
D-FINE-L	21.9	50.9	2.33x	+3.0
RT-DETR-L	27.7	64.5	2.33x	-1.0
RT-DETRv4-X	16.6	38.3	2.30x	+0.0
D-FINE-X	18.1	41.5	2.29x	+1.0
RT-DETR-X	21.8	48.4	2.22x	+1.0
DEIMv2-N	35.1	76.3	2.18x	+0.0
RT-DETR-R50m	35.3	76.4	2.17x	+0.0
RT-DETR-R50	29.8	62.8	2.11x	+0.0
D-FINE-M	28.8	59.8	2.08x	-6.0
D-FINE-S	34.2	70.9	2.07x	-1.0
RT-DETRv2-R50	32.1	66.0	2.05x	-6.0

Per-model FPS under PyTorch FP32 and ONNX Runtime FP32 on RTX 5070 Ti, with speedup and mAP delta. Top 15 by speedup.
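The Speedup column is the ONNX Runtime FPS divided by the PyTorch FPS. The following is a minimal sketch that recomputes it for three rows copied from the table; the names `rows` and `speedup` are illustrative and not part of any tool referenced on this page. Because the published FPS values are rounded to one decimal, a recomputed ratio can differ from the published one in the last digit for some rows.

```python
# Rows copied from the table above: (model, PyTorch FP32 FPS, ONNX Runtime FP32 FPS).
rows = [
    ("DEIMv2-Atto", 40.4, 102.0),
    ("D-FINE-N", 32.5, 78.9),
    ("RT-DETR-L", 27.7, 64.5),
]


def speedup(pytorch_fps: float, onnx_fps: float) -> float:
    """Speedup is ONNX Runtime throughput divided by PyTorch throughput."""
    return onnx_fps / pytorch_fps


for model, pt_fps, ort_fps in rows:
    print(f"{model}: {speedup(pt_fps, ort_fps):.2f}x")
```

A value above 1.0 means the ONNX path is faster; a value below 1.0 means it is slower than the PyTorch baseline.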

Speedup varies by family

Conversion gain depends on the model family. D-FINE gains a median 2.29x, the most here. PicoDet gains only 1.02x, effectively flat. The DETR-style families cluster around 2x, while the YOLOX models gain the least outside PicoDet. Set your speedup expectation by family, not by the single global median.
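A per-family median can be reproduced by grouping speedups by family, sketched here in Python. The speedups are copied from the top-15 table on this page; the `family_of` helper is a hypothetical convention (model name minus its final size suffix), not something this page defines. Only families whose models all appear in the top 15 can be checked this way, so the DEIM rows below may be a partial view of that family.

```python
from collections import defaultdict
from statistics import median

# (model, speedup) pairs copied from the top-15 table on this page.
speedups = [
    ("DEIM-L", 2.71), ("DEIM-X", 2.66),
    ("D-FINE-N", 2.43), ("D-FINE-L", 2.33), ("D-FINE-X", 2.29),
    ("D-FINE-M", 2.08), ("D-FINE-S", 2.07),
]


def family_of(model: str) -> str:
    # Assumption: the family is the model name without its last "-<size>" part.
    return model.rsplit("-", 1)[0]


by_family = defaultdict(list)
for model, value in speedups:
    by_family[family_of(model)].append(value)

family_median = {fam: median(vals) for fam, vals in by_family.items()}
for fam, med in sorted(family_median.items()):
    print(f"{fam}: median {med:.2f}x over {len(by_family[fam])} models")
```

For the five D-FINE rows this yields 2.29x, which matches the family median quoted above.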

Accuracy holds

Because ONNX FP32 keeps full precision, no model loses more than half a point of mAP through the export. The accuracy numbers match the PyTorch baseline within measurement noise. That is the appeal on this card: you get most of a 2x speedup without spending any accuracy to get it.

One model runs slower

The single exception is PicoDet-L, which regresses to 0.87x, dropping from 39.5 to 34.2 FPS. It is the only model in the set that ends up slower than its PyTorch baseline. Every other model gains, but this one regression is reason enough to benchmark your own model before shipping the ONNX path.
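One way to confirm your own model is to time both paths with the same harness. The sketch below is a hedged illustration using only the standard library: `measure_fps`, `is_regression`, and the two placeholder workloads are hypothetical names, and it is not the harness that produced the numbers on this page. In a real check the two callables would wrap your PyTorch forward pass and your ONNX Runtime session call, and on a GPU you would also need to synchronize the device before reading the clock, since CUDA work is launched asynchronously.

```python
import time
from typing import Callable


def measure_fps(run_once: Callable[[], object], warmup: int = 10, iters: int = 50) -> float:
    """Return calls per second for run_once, after discarding warmup calls."""
    for _ in range(warmup):
        run_once()
    start = time.perf_counter()
    for _ in range(iters):
        run_once()
    elapsed = time.perf_counter() - start
    return iters / elapsed


def is_regression(pytorch_fps: float, onnx_fps: float) -> bool:
    """True when the ONNX path is slower than the PyTorch baseline."""
    return onnx_fps / pytorch_fps < 1.0


# Placeholder workloads standing in for one inference call on each runtime.
def fake_pytorch_forward() -> int:
    return sum(range(20_000))


def fake_onnx_forward() -> int:
    return sum(range(10_000))


pt_fps = measure_fps(fake_pytorch_forward)
ort_fps = measure_fps(fake_onnx_forward)
print(f"speedup: {ort_fps / pt_fps:.2f}x, regression: {is_regression(pt_fps, ort_fps)}")
```

Applied to the PicoDet-L figures above, `is_regression(39.5, 34.2)` returns `True`.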

Every number on this page comes from the verified dataset: the same 500-image COCO val2017 slice, conf 0.001, IoU 0.6, max 300 detections, and pycocotools mAP, with an identical protocol across all hardware and runtimes. The full protocol is on the methodology page. To rerun this comparison with your own filters, open the compare page. Accuracy is measured on LibreYOLO retrained checkpoints; other weight sources can yield different values.
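The protocol settings can be captured as a small configuration object, sketched below. The class and field names are hypothetical, and the inclusive confidence comparison is an assumption. This page does not say how the IoU value is used, so the sketch treats it as an NMS threshold and does not apply it; the helper only illustrates the confidence cut and the per-image detection cap.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class EvalProtocol:
    num_images: int = 500          # size of the COCO val2017 slice
    conf_threshold: float = 0.001  # minimum detection confidence
    iou_threshold: float = 0.6     # assumed to be the NMS IoU; unused in this sketch
    max_detections: int = 300      # per-image cap on detections


def keep_detections(scores: Sequence[float], protocol: EvalProtocol) -> List[int]:
    """Return indices of detections to keep, highest score first."""
    kept = [i for i, s in enumerate(scores) if s >= protocol.conf_threshold]
    kept.sort(key=lambda i: scores[i], reverse=True)
    return kept[: protocol.max_detections]


protocol = EvalProtocol()
print(keep_detections([0.9, 0.0005, 0.5], protocol))
```

Holding these values fixed across runtimes is what makes the FPS and mAP columns comparable between the PyTorch and ONNX paths.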
