Methodology

How we benchmark object detection models fairly and reproducibly

Why End-to-End Benchmarking?

Most published benchmarks only report inference time — the time the neural network takes to process a tensor. This hides two critical costs:

  • Preprocessing: Loading images, resizing, normalizing, and converting to tensors (typically 1-5ms)
  • Postprocessing: Non-Maximum Suppression (NMS), decoding boxes, and formatting output (typically 1-5ms for YOLO models)

For real-world applications, you pay the full cost. A model that reports "8ms inference" might actually take 15ms end-to-end.

This matters especially when comparing NMS-free architectures such as YOLOv10 against traditional YOLO models that require NMS postprocessing: timing only the forward pass hides exactly the cost that separates them.
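
As a rough sketch of what "end-to-end" means in practice, the per-image measurement can be split into the three stages above (the preprocess, model, and postprocess callables here are placeholders for model-specific code, not our actual harness):

import time

def time_stage(fn, *args):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - start) * 1000.0

def benchmark_image(image, preprocess, model, postprocess):
    """Measure each stage separately so no cost hides inside 'inference time'."""
    tensor, t_pre = time_stage(preprocess, image)
    raw, t_inf = time_stage(model, tensor)
    detections, t_post = time_stage(postprocess, raw)
    return {
        "preprocess_ms": t_pre,
        "inference_ms": t_inf,
        "postprocess_ms": t_post,
        "total_ms": t_pre + t_inf + t_post,
    }

On GPU, each stage also needs synchronization before the timer is read; see GPU Synchronization below.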

What We Measure

Accuracy Metrics

  • mAP@50-95: Primary metric, COCO standard (IoU 0.5 to 0.95)
  • mAP@50: VOC-style metric (IoU 0.5)
  • mAP@75: Stricter localization (IoU 0.75)
  • mAP S/M/L: Per-size breakdown (small/medium/large objects)
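
For reference, these are the standard COCO metrics as computed by pycocotools; a minimal evaluation sketch, assuming detections have already been exported to COCO-format JSON (the file paths are illustrative):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations and model detections in COCO JSON format.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# evaluator.stats[0] is mAP@50-95, stats[1] is mAP@50, stats[2] is mAP@75,
# and stats[3:6] are the small/medium/large breakdowns.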

Speed Metrics

  • Preprocess: Image loading, resize, normalize, to tensor
  • Inference: Neural network forward pass only
  • Postprocess: NMS, decoding, formatting
  • Total: Full end-to-end latency
  • FPS: Throughput (1000 / total_ms)
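
The reported speed numbers are simple aggregates of the per-image stage timings; a minimal sketch of that aggregation (the summarize_speed helper and dictionary keys are illustrative):

import statistics

def summarize_speed(per_image_results):
    """Average per-image timings (ms) into the reported speed metrics."""
    def mean_of(key):
        return statistics.mean(r[key] for r in per_image_results)

    total_ms = mean_of("total_ms")
    return {
        "preprocess_ms": mean_of("preprocess_ms"),
        "inference_ms": mean_of("inference_ms"),
        "postprocess_ms": mean_of("postprocess_ms"),
        "total_ms": total_ms,
        "fps": 1000.0 / total_ms,  # throughput at batch size 1
    }
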
Testing Protocol

Dataset

COCO val2017 (5,000 images, 80 classes). This is the industry standard for object detection evaluation.

Input Processing

  • Letterbox resize to 640x640 (maintains aspect ratio with padding)
  • ImageNet normalization (mean: 0.485, 0.456, 0.406; std: 0.229, 0.224, 0.225)
  • Identical preprocessing for all models
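
A sketch of this preprocessing, using OpenCV and NumPy (the letterbox and to_tensor helpers are illustrative; the 114 padding value is a common YOLO convention rather than a requirement):

import cv2
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def letterbox(image_bgr, size=640, pad_value=114):
    """Resize while keeping aspect ratio, then pad to a size x size canvas."""
    h, w = image_bgr.shape[:2]
    scale = min(size / h, size / w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(image_bgr, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas, scale, (left, top)

def to_tensor(image_bgr):
    """BGR uint8 HWC -> ImageNet-normalized float32 CHW."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    rgb = (rgb - IMAGENET_MEAN) / IMAGENET_STD
    return np.ascontiguousarray(rgb.transpose(2, 0, 1))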

Inference Settings

  • Batch size: 1 (for latency measurements)
  • Warm-up: 50 iterations (critical for GPU benchmarks)
  • Timing runs: 100 iterations (to average out run-to-run variance)
  • Confidence threshold: 0.001 (for mAP), 0.25 (for speed)
  • NMS IoU threshold: 0.7
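
These values can be pinned in one configuration object so every model is run with identical settings; a hypothetical sketch:

from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkConfig:
    """Inference settings shared by every benchmark run."""
    batch_size: int = 1            # latency measured one image at a time
    warmup_iters: int = 50         # discarded iterations while the GPU stabilizes
    timed_iters: int = 100         # iterations used for the reported statistics
    conf_threshold_map: float = 0.001   # keep nearly all detections for mAP
    conf_threshold_speed: float = 0.25  # realistic threshold for speed runs
    nms_iou_threshold: float = 0.7

CONFIG = BenchmarkConfig()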

GPU Synchronization

We call torch.cuda.synchronize() before reading each timer. CUDA kernels launch asynchronously, so without explicit synchronization the CPU clock would stop before the GPU has actually finished, understating latency.
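
A minimal sketch of a synchronized timing loop under the settings above (model and tensor stand in for a PyTorch module and a preprocessed input already on the GPU):

import time
import torch

def timed_inference(model, tensor, warmup_iters=50, timed_iters=100):
    """Measure GPU inference latency with explicit synchronization."""
    model.eval()
    with torch.no_grad():
        # Warm-up: let kernels compile, caches fill, and clocks stabilize.
        for _ in range(warmup_iters):
            model(tensor)
        torch.cuda.synchronize()

        latencies_ms = []
        for _ in range(timed_iters):
            start = time.perf_counter()
            model(tensor)
            torch.cuda.synchronize()  # wait for the GPU before reading the clock
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms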

How We Differ from Other Benchmarks

  Aspect            Ultralytics / Others          Vision Analysis
  Timing            Inference only                Full end-to-end
  NMS               Often excluded from timing    Always included
  Model Coverage    Favors own models             All major architectures
  Hardware          Cherry-picked                 Comprehensive matrix
  Reproducibility   Varies                        Docker commands provided

Reproducibility

Every benchmark can be reproduced with a single Docker command:

docker run --gpus all \
  -v /data/coco:/data/coco \
  ghcr.io/vision-analysis/benchmark-runner:latest \
  benchmark \
  --model yolov8x \
  --dataset coco_val2017 \
  --hardware a100 \
  --format tensorrt_fp16

All benchmark configurations, model weights, and environment specifications are version-controlled in our GitHub repository.

Known Limitations

  • Single GPU only: Multi-GPU throughput not measured
  • Batch size 1: Batch processing can improve throughput but adds latency
  • Fixed input size: Models may perform differently at other resolutions
  • COCO only: Performance on domain-specific datasets may vary
  • Mock data: Current benchmarks use simulated data — real benchmarks coming soon