Tuning guide

Optimizing Proxmox Backup Server performance

Improve backup, verify, and restore throughput with the right storage layout, network tuning, concurrency, and scheduling patterns.

Quick wins

Tune throughput fast.

  1) Dial in storage + ARC sizing
  2) Fix MTU/offloads for clean links
  3) Set sensible concurrency + limits
  4) Stagger prune/GC/verify schedules

1) Storage layout and ARC sizing

  • Use dedicated datastore disks. Prefer ZFS mirrors/RAID10 on SAS/NVMe for predictable IOPS and latency.
  • ARC: allocate enough RAM for metadata. Start with 8–16 GB of ARC for small repos, or 25–50% of RAM for larger ones, and watch memory pressure (sizing sketch after this list).
  • Add an L2ARC or special device for metadata if verify/restore is slow on large datasets.
  • Separate OS disk from datastore; avoid shared boot/datastore devices.
  • Use sync=standard unless a workload needs stricter semantics; ensure a good SLOG if you enable sync=always.
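
A minimal sketch of the ARC and vdev changes above, assuming a ZFS pool named datastore and placeholder device paths; adjust sizes and devices to your hardware before running anything on a production host:

    # Cap ARC at 16 GiB (value in bytes); apply at runtime and persist via modprobe config
    echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
    echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
    update-initramfs -u

    # Add a mirrored special vdev for metadata (placeholder NVMe devices)
    zpool add datastore special mirror /dev/nvme0n1 /dev/nvme1n1
    zfs set special_small_blocks=16K datastore   # optional: small blocks also land on the special vdev

    # Keep sync=standard; only add a SLOG if a workload really needs sync=always
    zfs set sync=standard datastore
    zpool add datastore log /dev/nvme2n1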

2) Network tuning for PBS and PVE

  • Dedicated backup VLAN/VRF; keep management separate.
  • Match MTU end-to-end (e.g., 9000). Validate with ping -M do -s 8972 <pbs>.
  • Check NIC offloads: if you see high CPU or drops, test disabling problematic offloads (ethtool -K <nic> gro off lro off) and measure again (see the sketch after this list).
  • Pin PBS NICs to backup traffic; use LACP bonds where possible.
  • On PVE, avoid running live migration and backup traffic on the same interface without QoS.
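
A quick validation pass for the MTU and offload points above, assuming the backup NIC is eno1 and the PBS host resolves as pbs.example.lan (both placeholders):

    # Set jumbo frames on the backup interface (persist it in /etc/network/interfaces as well)
    ip link set dev eno1 mtu 9000

    # 8972-byte payload + 28 bytes of ICMP/IP headers = 9000; -M do forbids fragmentation
    ping -M do -s 8972 -c 4 pbs.example.lan

    # Inspect offloads, then test with GRO/LRO off if you see drops or high CPU
    ethtool -k eno1
    ethtool -K eno1 gro off lro off

    # Watch errors and drops during a backup window
    ethtool -S eno1 | grep -iE 'err|drop'
    ip -s link show dev eno1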

3) Concurrency and bandwidth controls

  • Datastore concurrency: start with 4–6 concurrent tasks per datastore; raise slowly while watching latency and CPU.
  • Client-side concurrency: avoid scheduling all VMs at the same minute; stagger jobs by cluster.
  • Bandwidth limits: set per-job caps during business hours and loosen or remove them overnight (example after this list).
  • Compression: keep zstd unless CPU-bound. If CPU is saturated, tune compression level or add cores.
  • CPU pinning: ensure PBS VM (if virtualized) has dedicated vCPUs and NUMA alignment to avoid contention.
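
Two example bandwidth caps, one on the PVE side and one on the PBS side. The subnet, rates, and timeframe below are placeholders, and the PBS traffic-control options (available from PBS 2.1 on) may differ by release, so check proxmox-backup-manager traffic-control create --help on your version first:

    # PVE: cluster-wide vzdump cap in KiB/s (~100 MiB/s), set in /etc/vzdump.conf
    #   bwlimit: 102400

    # PBS: limit ingress/egress from the PVE subnet during business hours
    proxmox-backup-manager traffic-control create office-hours \
        --network 10.0.10.0/24 \
        --rate-in 200MB --rate-out 200MB \
        --timeframe 'mon..fri 8:00-18:00'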

4) Verify, prune, and GC timing

  • Prune nightly after backups; keep the prune schedule aligned with your retention policy (schedule sketch after this list).
  • Run GC after prune, away from backup peaks. Weekly is typical; nightly works on fast storage with heavy churn.
  • Verify weekly (daily for critical data). If storage is the bottleneck, add a faster cache/special device or stagger verify across namespaces.
  • Do not overlap heavy verify with GC on the same datastore; serialize for consistent latency.
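
A schedule sketch using PBS calendar events; the store and job IDs are placeholders, and exact flag names vary between PBS releases (prune was a datastore option before dedicated prune jobs existed), so confirm with --help first:

    # Garbage collection on the datastore, Saturday 03:00
    proxmox-backup-manager datastore update store1 --gc-schedule 'sat 03:00'

    # Nightly prune job at 01:30 with an explicit retention policy
    proxmox-backup-manager prune-job create prune-store1 --store store1 \
        --schedule '01:30' --keep-daily 14 --keep-weekly 8 --keep-monthly 6

    # Weekly verify, skipping snapshots verified within the last 30 days
    proxmox-backup-manager verify-job create verify-store1 --store store1 \
        --schedule 'sun 05:00' --ignore-verified true --outdated-after 30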

5) Speeding up restores

  • Keep recent data hot: larger ARC and/or special device helps file-level and full restores.
  • Ensure restore targets (PVE storage) are as fast or faster than backup storage; avoid restoring to slow tiers.
  • Network: restore paths should match backup paths (MTU, VLAN). Avoid cross-DC restores without sufficient bandwidth.
  • Test restores monthly; measure throughput and adjust concurrency or bandwidth caps accordingly (see the sketch after this list).
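
A simple way to put numbers on restore speed, assuming a repository of backup@pbs@pbs.example.lan:store1 and a host-backup snapshot path you would replace with a real one (list your snapshots first if unsure):

    # Client <-> server throughput plus compression/crypto speed
    proxmox-backup-client benchmark --repository 'backup@pbs@pbs.example.lan:store1'

    # Time a real restore of a pxar archive to scratch space (snapshot path is a placeholder)
    time proxmox-backup-client restore \
        'host/backup-host/2024-01-15T01:30:00Z' root.pxar /tmp/restore-test \
        --repository 'backup@pbs@pbs.example.lan:store1'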

6) Monitoring and bottleneck hunting

  • Watch PBS task logs for chunk errors, namespace warnings, and slow tasks.
  • Monitor disk latency (iostat, zpool iostat), ARC hit ratio, and CPU steal (if virtualized); see the commands after this list.
  • Track network drops/retransmits; check switch counters during backup windows.
  • Alert on verify/prune/GC failures; correlate with storage/CPU/network telemetry.
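
Commands worth keeping on hand while hunting bottlenecks; eno1 and datastore are placeholders for your backup NIC and pool:

    # Per-device latency and per-vdev utilization during a backup window
    iostat -x 5
    zpool iostat -v datastore 5

    # ARC size and hit ratio over time
    arcstat 5        # or: arc_summary

    # Drops and errors on the backup NIC
    ip -s link show dev eno1

    # Recent PBS tasks and a specific task log (UPID comes from the list output)
    proxmox-backup-manager task list
    proxmox-backup-manager task log <UPID>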

Example tuning playbooks

10 Gbps, mixed workloads

  • Datastore on SAS mirrors; ARC ~16 GB; no L2ARC.
  • Concurrency: 4–6 tasks; stagger cluster jobs by 15 minutes.
  • Prune nightly; GC weekly; verify weekly.
  • Bandwidth caps during business hours; uncapped overnight.

25 Gbps, large datasets

  • Datastore on NVMe or SAS with special device for metadata; ARC 32–64 GB + L2ARC.
  • Concurrency: 6–10 tasks; monitor latency to adjust.
  • Prune nightly; GC nightly; verify staggered per namespace.
  • Ensure jumbo frames are clean end-to-end; validate with ping tests.

Virtualized PBS (no dedicated hardware)

  • Reserve vCPUs and RAM and avoid CPU steal; use virtio for NIC and disk (VM config sketch after this list).
  • Keep datastore on dedicated virtual disk backed by fast storage; avoid shared OS/datastore virtual disks.
  • Lower concurrency (3–4) to avoid noisy neighbors; watch latency closely.
  • Verify less frequently if IO is a bottleneck; prioritize prune + GC consistency.
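
One way to set the guest-side knobs above from the PVE host; VMID 110, the core range, and the bridge are placeholders, and --affinity requires PVE 7.3 or newer:

    # Fixed RAM (no ballooning), NUMA awareness, dedicated virtio-scsi controller per disk
    qm set 110 --cores 4 --memory 16384 --balloon 0
    qm set 110 --numa 1
    qm set 110 --scsihw virtio-scsi-single

    # Dedicated backup bridge with jumbo frames (placeholder bridge name)
    qm set 110 --net0 virtio,bridge=vmbr1,mtu=9000

    # Pin vCPUs to host cores 8-11 to avoid noisy neighbors (PVE 7.3+)
    qm set 110 --affinity 8-11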

Need faster backup or restore windows?

We tune PBS storage, networking, and schedules to hit tight RPO/RTO targets.