1) Storage layout and ARC sizing
- Use dedicated datastore disks. Prefer ZFS mirrors/RAID10 on SAS/NVMe for predictable IOPS and latency.
- ARC: allocate enough RAM for metadata. Start with 8–16 GB of ARC for small repos and 25–50% of RAM for larger datastores, watching memory pressure.
- Add an L2ARC or special device for metadata if verify/restore is slow on large datasets.
- Separate OS disk from datastore; avoid shared boot/datastore devices.
- Use sync=standard unless a workload needs stricter semantics; ensure a good SLOG if you enable sync=always (see the sketch after this list).
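A minimal sketch of the ZFS side, assuming ZFS on Linux with a pool named backup and a datastore dataset backup/pbs; the device names and the 16 GiB ARC cap are placeholders to adapt:

  # Cap the ARC (value in bytes; 16 GiB shown). Takes effect at module load after an initramfs rebuild.
  echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs-arc.conf
  update-initramfs -u
  # Or change it on the fly without a reboot:
  echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max

  # Optional mirrored special vdev for metadata (device names are examples).
  zpool add backup special mirror /dev/nvme0n1 /dev/nvme1n1

  # Keep default sync semantics on the datastore; only switch to sync=always with a good SLOG.
  zfs set sync=standard backup/pbs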
2) Network tuning for PBS and PVE
- Dedicated backup VLAN/VRF; keep management separate.
- Match MTU end-to-end (e.g., 9000). Validate with ping -M do -s 8972 <pbs> (see the sketch after this list).
- Check NIC offloads: if you see high CPU or drops, test disabling problematic offloads (ethtool -K <nic> gro off lro off) and measure.
- Pin PBS NICs to backup traffic; use LACP bonds where possible.
- On PVE, avoid running live migration and backups on the same interface without QoS.
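A quick validation sequence, assuming the backup interface is called ens18 and the PBS host resolves as pbs.example.lan (both placeholders):

  # Jumbo frames on the backup NIC; every switch port and host in the path must match.
  ip link set dev ens18 mtu 9000

  # End-to-end check: 8972 = 9000 - 20 (IP header) - 8 (ICMP header); -M do forbids fragmentation.
  ping -M do -s 8972 -c 4 pbs.example.lan

  # If CPU is high or drops appear, test with GRO/LRO off and measure before making it permanent.
  ethtool -K ens18 gro off lro off
  ethtool -S ens18 | grep -iE 'drop|err'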
3) Concurrency and bandwidth controls
- Datastore concurrency: start with 4–6 concurrent tasks per datastore; raise slowly while watching latency and CPU.
- Client-side concurrency: avoid scheduling all VMs at the same minute; stagger jobs by cluster.
- Bandwidth limits: set per-job caps during business hours; remove or loosen limits overnight (see the sketch after this list).
- Compression: keep zstd unless CPU-bound. If CPU is saturated, tune the compression level or add cores.
- CPU pinning: ensure the PBS VM (if virtualized) has dedicated vCPUs and NUMA alignment to avoid contention.
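A sketch of server-side rate limiting, assuming a PBS version with traffic-control rules (2.1 or later); the rule name, subnet, rates, and timeframe are examples:

  # Cap traffic from the PVE cluster subnet during business hours.
  proxmox-backup-manager traffic-control create business-hours \
      --network 10.10.20.0/24 \
      --rate-in 200MB --rate-out 200MB \
      --timeframe "mon..fri 8:00-18:00"
  proxmox-backup-manager traffic-control list

  # On the PVE side, a per-node fallback cap for vzdump jobs (KiB/s) can go in /etc/vzdump.conf:
  #   bwlimit: 204800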
4) Verify, prune, and GC timing
- Prune nightly after backups; mirror retention policy.
- GC after prune, but away from backup peaks. Weekly is typical; nightly on fast storage with heavy churn.
- Verify weekly (daily for critical data). If storage is the bottleneck, add a faster cache/special device or stagger verify across namespaces.
- Do not overlap heavy verify with GC on the same datastore; serialize them for consistent latency (a command sketch follows this list).
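A sketch of one-off runs, useful for timing these tasks outside the backup window; the datastore name store1, the backup group, and the repository string are placeholders. Scheduled jobs use the same calendar-event syntax as the GUI (e.g. 02:30 for nightly, sat 03:00 for weekly).

  # Prune one backup group against the intended retention policy.
  proxmox-backup-client prune vm/100 --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
      --repository root@pam@pbs.example.lan:store1

  # Start garbage collection, then a verification, away from backup peaks.
  proxmox-backup-manager garbage-collection start store1
  proxmox-backup-manager verify store1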
5) Speeding up restores
- Keep recent data hot: a larger ARC and/or a special device helps both file-level and full restores.
- Ensure restore targets (PVE storage) are as fast or faster than backup storage; avoid restoring to slow tiers.
- Network: restore paths should match backup paths (MTU, VLAN). Avoid cross-DC restores without sufficient bandwidth.
- Test restores monthly; measure throughput and adjust concurrency or bandwidth caps accordingly (a timed-restore example follows this list).
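A simple way to measure restore throughput, assuming a host-type backup with a root.pxar archive; the repository, snapshot, and target path are placeholders (VM disk archives restore the same way, to a raw image file):

  export PBS_REPOSITORY='root@pam@pbs.example.lan:store1'

  # Time a full restore to scratch space and note the resulting MB/s.
  time proxmox-backup-client restore "host/pve1/2024-01-15T02:00:00Z" root.pxar /mnt/scratch/restore-test

  # File-level sanity check without pulling the whole archive.
  proxmox-backup-client catalog dump "host/pve1/2024-01-15T02:00:00Z"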
6) Monitoring and bottleneck hunting
- Watch PBS task logs for chunk errors, namespace warnings, and slow tasks.
- Monitor disk latency (iostat, zpool iostat), ARC hit ratios, and CPU steal (if virtualized); example commands follow this list.
- Track network drops/retransmits; check switch counters during backup windows.
- Alert on verify/prune/GC failures; correlate with storage/CPU/network telemetry.
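A few commands worth running during a backup window, assuming ZFS tooling is installed and ens18 is the backup NIC (placeholder):

  # Per-vdev throughput and latency (the latency columns need a reasonably recent ZFS).
  zpool iostat -v -l 5
  iostat -x 5

  # ARC size and hit ratio.
  arc_summary | less
  arcstat 5

  # Drops and retransmits on the backup path.
  ip -s link show dev ens18
  ss -ti | grep -c retrans     # rough count of sockets reporting retransmissions

  # Recent PBS tasks; look for failures and unusually long runtimes.
  proxmox-backup-manager task list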
Example tuning playbooks
10 Gbps, mixed workloads
- Datastore on SAS mirrors; ARC ~16 GB; no L2ARC.
- Concurrency: 4–6 tasks; stagger cluster jobs by 15 minutes.
- Prune nightly; GC weekly; verify weekly.
- Bandwidth caps during business hours; uncapped overnight.
25 Gbps, large datasets
- Datastore on NVMe or SAS with special device for metadata; ARC 32–64 GB + L2ARC.
- Concurrency: 6–10 tasks; monitor latency to adjust.
- Prune nightly; GC nightly; verify staggered per namespace.
- Ensure jumbo frames are clean end-to-end; validate with ping tests.
Virtualized PBS (no dedicated hardware)
- Reserve vCPUs and RAM; avoid CPU steal. Use virtio for NIC/disk (an example VM config follows this list).
- Keep datastore on dedicated virtual disk backed by fast storage; avoid shared OS/datastore virtual disks.
- Lower concurrency (3–4) to avoid noisy neighbors; watch latency closely.
- Verify less frequently if IO is a bottleneck; prioritize prune + GC consistency.
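A sketch of PVE-side settings for a virtualized PBS guest; the VMID, core and RAM counts, and bridge name are placeholders to size against the host:

  qm set 110 --cores 8 --numa 1                   # keep vCPUs and memory on one NUMA node
  qm set 110 --memory 32768 --balloon 0           # fixed RAM; ballooning fights the ARC
  qm set 110 --scsihw virtio-scsi-single          # dedicated virtio SCSI controller per disk (enables iothread)
  qm set 110 --net0 virtio,bridge=vmbr1,mtu=9000  # virtio NIC on the backup bridge, jumbo frames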
Need faster backup or restore windows?
We tune PBS storage, networking, and schedules to hit tight RPO/RTO targets.