Methodology

How we benchmark object detection models fairly and reproducibly

Why End-to-End Benchmarking?

Most published benchmarks only report inference time — the time the neural network takes to process a tensor. This hides two critical costs:

  • Preprocessing: Loading images, resizing, normalizing, and converting to tensors (typically 1-5ms)
  • Postprocessing: Non-Maximum Suppression (NMS), decoding boxes, and formatting output (typically 1-5ms for YOLO models)

For real-world applications, you pay the full cost. A model that reports "8ms inference" might actually take 15ms end-to-end.

This matters especially when comparing NMS-free architectures such as YOLOv10 against traditional YOLO models that require NMS postprocessing: timing only the forward pass hides exactly the cost that separates them.
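
As a rough sketch of what "end-to-end" means in practice, the per-image measurement can be split into the three stages above (the preprocess, model, and postprocess callables here are placeholders for model-specific code, not our actual harness):

import time

def time_stage(fn, *args):
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    out = fn(*args)
    return out, (time.perf_counter() - start) * 1000.0

def benchmark_image(image, preprocess, model, postprocess):
    """Measure each stage separately so no cost hides inside 'inference time'."""
    tensor, t_pre = time_stage(preprocess, image)
    raw, t_inf = time_stage(model, tensor)
    detections, t_post = time_stage(postprocess, raw)
    return {
        "preprocess_ms": t_pre,
        "inference_ms": t_inf,
        "postprocess_ms": t_post,
        "total_ms": t_pre + t_inf + t_post,
    }

On GPU, each stage also needs synchronization before the timer is read; see GPU Synchronization below.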

What We Measure

Accuracy Metrics

  • mAP@50-95: Primary metric, COCO standard (IoU 0.5 to 0.95)
  • mAP@50: VOC-style metric (IoU 0.5)
  • mAP@75: Stricter localization (IoU 0.75)
  • mAP S/M/L: Per-size breakdown (small/medium/large objects)
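
For reference, these are the standard COCO metrics as computed by pycocotools; a minimal evaluation sketch, assuming detections have already been exported to COCO-format JSON (the file paths are illustrative):

from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Ground-truth annotations and model detections in COCO JSON format.
coco_gt = COCO("annotations/instances_val2017.json")
coco_dt = coco_gt.loadRes("predictions.json")

evaluator = COCOeval(coco_gt, coco_dt, iouType="bbox")
evaluator.evaluate()
evaluator.accumulate()
evaluator.summarize()

# evaluator.stats[0] is mAP@50-95, stats[1] is mAP@50, stats[2] is mAP@75,
# and stats[3:6] are the small/medium/large breakdowns.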

Speed Metrics

  • Preprocess: Image loading, resize, normalize, to tensor
  • Inference: Neural network forward pass only
  • Postprocess: NMS, decoding, formatting
  • Total: Full end-to-end latency
  • FPS: Throughput (1000 / total_ms)
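
The reported speed numbers are simple aggregates of the per-image stage timings; a minimal sketch of that aggregation (the summarize_speed helper and dictionary keys are illustrative):

import statistics

def summarize_speed(per_image_results):
    """Average per-image timings (ms) into the reported speed metrics."""
    def mean_of(key):
        return statistics.mean(r[key] for r in per_image_results)

    total_ms = mean_of("total_ms")
    return {
        "preprocess_ms": mean_of("preprocess_ms"),
        "inference_ms": mean_of("inference_ms"),
        "postprocess_ms": mean_of("postprocess_ms"),
        "total_ms": total_ms,
        "fps": 1000.0 / total_ms,  # throughput at batch size 1
    }
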
Testing Protocol

Dataset

COCO val2017 (5,000 images, 80 classes). This is the industry standard for object detection evaluation.

Input Processing

  • Letterbox resize to 640x640 (maintains aspect ratio with padding)
  • ImageNet normalization (mean: 0.485, 0.456, 0.406; std: 0.229, 0.224, 0.225)
  • Identical preprocessing for all models
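
A sketch of this preprocessing, using OpenCV and NumPy (the letterbox and to_tensor helpers are illustrative; the 114 padding value is a common YOLO convention rather than a requirement):

import cv2
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def letterbox(image_bgr, size=640, pad_value=114):
    """Resize while keeping aspect ratio, then pad to a size x size canvas."""
    h, w = image_bgr.shape[:2]
    scale = min(size / h, size / w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    resized = cv2.resize(image_bgr, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas, scale, (left, top)

def to_tensor(image_bgr):
    """BGR uint8 HWC -> ImageNet-normalized float32 CHW."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    rgb = (rgb - IMAGENET_MEAN) / IMAGENET_STD
    return np.ascontiguousarray(rgb.transpose(2, 0, 1))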

Inference Settings

  • Batch size: 1 (for latency measurements)
  • Warm-up: 50 iterations (critical for GPU benchmarks)
  • Timing runs: 100 iterations (to average out run-to-run variance)
  • Confidence threshold: 0.001 (for mAP), 0.25 (for speed)
  • NMS IoU threshold: 0.7
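
These values can be pinned in one configuration object so every model is run with identical settings; a hypothetical sketch:

from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkConfig:
    """Inference settings shared by every benchmark run."""
    batch_size: int = 1            # latency measured one image at a time
    warmup_iters: int = 50         # discarded iterations while the GPU stabilizes
    timed_iters: int = 100         # iterations used for the reported statistics
    conf_threshold_map: float = 0.001   # keep nearly all detections for mAP
    conf_threshold_speed: float = 0.25  # realistic threshold for speed runs
    nms_iou_threshold: float = 0.7

CONFIG = BenchmarkConfig()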

GPU Synchronization

We call torch.cuda.synchronize() before reading each timer. CUDA kernels launch asynchronously, so without explicit synchronization the CPU clock would stop before the GPU has actually finished, understating latency.
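
A minimal sketch of a synchronized timing loop under the settings above (model and tensor stand in for a PyTorch module and a preprocessed input already on the GPU):

import time
import torch

def timed_inference(model, tensor, warmup_iters=50, timed_iters=100):
    """Measure GPU inference latency with explicit synchronization."""
    model.eval()
    with torch.no_grad():
        # Warm-up: let kernels compile, caches fill, and clocks stabilize.
        for _ in range(warmup_iters):
            model(tensor)
        torch.cuda.synchronize()

        latencies_ms = []
        for _ in range(timed_iters):
            start = time.perf_counter()
            model(tensor)
            torch.cuda.synchronize()  # wait for the GPU before reading the clock
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms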

How We Differ from Other Benchmarks

  Aspect            Ultralytics / Others          Vision Analysis
  Timing            Inference only                Full end-to-end
  NMS               Often excluded from timing    Always included
  Model Coverage    Favors own models             All major architectures
  Hardware          Cherry-picked                 Comprehensive matrix
  Reproducibility   Varies                        Docker commands provided

Reproducibility

Every benchmark can be reproduced with a single Docker command:

docker run --gpus all \
  -v /data/coco:/data/coco \
  ghcr.io/vision-analysis/benchmark-runner:latest \
  benchmark \
  --model yolov8x \
  --dataset coco_val2017 \
  --hardware a100 \
  --format tensorrt_fp16

All benchmark configurations, model weights, and environment specifications are version-controlled in our GitHub repository.

Known Limitations

  • Single GPU only: Multi-GPU throughput not measured
  • Batch size 1: Batch processing can improve throughput but adds latency
  • Fixed input size: Models may perform differently at other resolutions
  • COCO only: Performance on domain-specific datasets may vary
  • Mock data: Current benchmarks use simulated data — real benchmarks coming soon