# Methodology

How we benchmark object detection models fairly and reproducibly.
## Why End-to-End Latency

Most published benchmarks report only inference time: the time the neural network takes to process a tensor. This hides two critical costs:
- Preprocessing: Loading images, resizing, normalizing, and converting to tensors (typically 1-5ms)
- Postprocessing: Non-Maximum Suppression (NMS), decoding boxes, and formatting output (typically 1-5ms for YOLO models)
For real-world applications, you pay the full cost. A model that reports "8ms inference" might actually take 15ms end-to-end.
This matters especially when comparing NMS-free architectures such as YOLOv10 against traditional YOLO models that require NMS postprocessing.
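To make that concrete, here is the arithmetic with assumed stage timings (3 ms preprocess, 8 ms inference, 4 ms postprocess; illustrative numbers, not measurements):

```python
# Illustrative (assumed) stage timings in milliseconds, not measured results.
pre_ms, infer_ms, post_ms = 3.0, 8.0, 4.0

total_ms = pre_ms + infer_ms + post_ms  # 15.0 ms end-to-end
fps_inference_only = 1000 / infer_ms    # 125 FPS (what many benchmarks report)
fps_end_to_end = 1000 / total_ms        # ~66.7 FPS (what your application sees)

print(f"inference-only: {fps_inference_only:.0f} FPS, "
      f"end-to-end: {fps_end_to_end:.1f} FPS")
```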
## Accuracy Metrics
- mAP@50-95: Primary metric, COCO standard (IoU 0.5 to 0.95)
- mAP@50: VOC-style metric (IoU 0.5)
- mAP@75: Stricter localization (IoU 0.75)
- mAP S/M/L: Per-size breakdown (small/medium/large objects)
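All of these come directly from the standard COCO evaluation protocol. A minimal sketch using pycocotools (the file paths are placeholders; detections are assumed to already be exported in COCO-format JSON):

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Paths are assumptions for illustration.
coco_gt = COCO("annotations/instances_val2017.json")  # ground truth
coco_dt = coco_gt.loadRes("detections.json")          # model detections

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()  # prints mAP@50-95, mAP@50, mAP@75, and the S/M/L breakdown

# stats[0] = mAP@[.50:.95], stats[1] = mAP@50, stats[2] = mAP@75,
# stats[3..5] = mAP for small / medium / large objects
map_50_95 = evaluator.stats[0]
```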
## Speed Metrics
- Preprocess: Image loading, resize, normalization, conversion to tensor
- Inference: Neural network forward pass only
- Postprocess: NMS, box decoding, output formatting
- Total: Full end-to-end latency
- FPS: Throughput (1000 / total_ms)
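A minimal sketch of how these stages can be timed separately on a CUDA device (the `preprocess`, `model`, and `postprocess` callables are placeholders for whatever pipeline is under test):

```python
import time
import torch

@torch.no_grad()
def benchmark_stages(model, preprocess, postprocess, image,
                     n_warmup=50, n_runs=100):
    """Time each pipeline stage separately; returns mean ms per stage."""
    for _ in range(n_warmup):  # warm-up: cuDNN autotuning, memory pools, caches
        postprocess(model(preprocess(image)))
    torch.cuda.synchronize()

    totals = {"preprocess": 0.0, "inference": 0.0, "postprocess": 0.0}
    for _ in range(n_runs):
        t0 = time.perf_counter()
        x = preprocess(image)
        torch.cuda.synchronize()  # make sure host-to-device copies have finished
        t1 = time.perf_counter()
        y = model(x)
        torch.cuda.synchronize()  # wait for the forward pass to actually complete
        t2 = time.perf_counter()
        postprocess(y)
        torch.cuda.synchronize()
        t3 = time.perf_counter()
        totals["preprocess"] += t1 - t0
        totals["inference"] += t2 - t1
        totals["postprocess"] += t3 - t2

    means = {k: 1000.0 * v / n_runs for k, v in totals.items()}
    means["total"] = sum(means.values())
    means["fps"] = 1000.0 / means["total"]
    return means
```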
## Dataset
COCO val2017 (5,000 images, 80 classes). This is the industry standard for object detection evaluation.
## Input Processing
- Letterbox resize to 640x640 (maintains aspect ratio with padding)
- ImageNet normalization (mean: 0.485, 0.456, 0.406; std: 0.229, 0.224, 0.225)
- Identical preprocessing for all models
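A sketch of the letterbox-plus-normalize step. The function names and the gray padding value 114 (a common YOLO convention) are our assumptions; the normalization constants match the list above:

```python
import cv2
import numpy as np
import torch

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def letterbox(img: np.ndarray, size: int = 640, pad_value: int = 114) -> np.ndarray:
    """Resize keeping aspect ratio, then pad to a square size x size canvas."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    top = (size - new_h) // 2
    left = (size - new_w) // 2
    return cv2.copyMakeBorder(
        resized, top, size - new_h - top, left, size - new_w - left,
        cv2.BORDER_CONSTANT, value=(pad_value,) * 3,
    )

def to_tensor(img_bgr: np.ndarray) -> torch.Tensor:
    """BGR uint8 HWC image -> normalized float32 NCHW tensor."""
    rgb = cv2.cvtColor(letterbox(img_bgr), cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    rgb = (rgb - IMAGENET_MEAN) / IMAGENET_STD
    return torch.from_numpy(rgb.transpose(2, 0, 1)).unsqueeze(0)  # add batch dim
```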
## Inference Settings
- Batch size: 1 (for latency measurements)
- Warm-up: 50 iterations (critical for GPU benchmarks)
- Timing runs: 100 iterations (for statistical significance)
- Confidence threshold: 0.001 (for mAP), 0.25 (for speed)
- NMS IoU threshold: 0.7
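A sketch of how these thresholds are applied during postprocessing, using torchvision's NMS (the raw `boxes` and `scores` tensors are assumed to come from the model's box-decoding step):

```python
import torch
from torchvision.ops import nms

def filter_detections(boxes, scores, conf_thresh=0.25, iou_thresh=0.7):
    """Apply a confidence threshold, then class-agnostic NMS.

    boxes:  (N, 4) tensor in (x1, y1, x2, y2) format
    scores: (N,) tensor of per-box confidences
    """
    keep = scores >= conf_thresh  # 0.001 for mAP runs, 0.25 for speed runs
    boxes, scores = boxes[keep], scores[keep]
    keep_idx = nms(boxes, scores, iou_thresh)
    return boxes[keep_idx], scores[keep_idx]
```

Real pipelines typically run NMS per class (e.g., via torchvision's `batched_nms`); the class-agnostic version above keeps the sketch short.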
## GPU Synchronization
We call `torch.cuda.synchronize()` around timing measurements to ensure accurate GPU timing. CUDA kernel launches are asynchronous, so without synchronization a host-side timer measures only the launch overhead, not the actual GPU work.
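A minimal sketch of the correct pattern:

```python
import time
import torch

def timed_forward(model, x):
    """Time one GPU forward pass correctly."""
    torch.cuda.synchronize()  # drain any pending work before starting the clock
    t0 = time.perf_counter()
    with torch.no_grad():
        y = model(x)
    torch.cuda.synchronize()  # wait for the forward pass to actually complete
    return y, (time.perf_counter() - t0) * 1000  # latency in ms
```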
## How We Differ

| Aspect | Ultralytics / Others | Vision Analysis |
|---|---|---|
| Timing | Inference only | Full end-to-end |
| NMS | Often excluded from timing | Always included |
| Model Coverage | Favors own models | All major architectures |
| Hardware | Cherry-picked | Comprehensive matrix |
| Reproducibility | Varies | Docker commands provided |
## Reproducibility

Every benchmark can be reproduced with a single Docker command:
```bash
docker run --gpus all \
  -v /data/coco:/data/coco \
  ghcr.io/vision-analysis/benchmark-runner:latest \
  benchmark \
    --model yolov8x \
    --dataset coco_val2017 \
    --hardware a100 \
    --format tensorrt_fp16
```

All benchmark configurations, model weights, and environment specifications are version-controlled in our GitHub repository.
## Limitations

- Single GPU only: Multi-GPU throughput not measured
- Batch size 1: Batch processing can improve throughput but adds latency
- Fixed input size: Models may perform differently at other resolutions
- COCO only: Performance on domain-specific datasets may vary
- Mock data: Current benchmarks use simulated data; real benchmarks are coming soon