1) Storage layout and ARC sizing
- Use dedicated datastore disks. Prefer ZFS mirrors/RAID10 on SAS/NVMe for predictable IOPS and latency.
- ARC: allocate enough RAM for metadata. Start with 8–16 GB of ARC for small repos and 25–50% of RAM for larger datastores, watching memory pressure.
- Add an L2ARC or special device for metadata if verify/restore is slow on large datasets.
- Separate OS disk from datastore; avoid shared boot/datastore devices.
- Use sync=standard unless a workload needs stricter semantics; ensure a good SLOG if you enable sync=always (see the sketch after this list).
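A minimal sketch of the ZFS side, assuming ZFS on Linux with a pool named backup and a datastore dataset backup/pbs; the device names and the 16 GiB ARC cap are placeholders to adapt:

  # Cap the ARC (value in bytes; 16 GiB shown). Takes effect at module load after an initramfs rebuild.
  echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs-arc.conf
  update-initramfs -u
  # Or change it on the fly without a reboot:
  echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max

  # Optional mirrored special vdev for metadata (device names are examples).
  zpool add backup special mirror /dev/nvme0n1 /dev/nvme1n1

  # Keep default sync semantics on the datastore; only switch to sync=always with a good SLOG.
  zfs set sync=standard backup/pbs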
2) Network tuning for PBS and PVE
- Dedicated backup VLAN/VRF; keep management separate.
- Match MTU end-to-end (e.g., 9000). Validate with ping -M do -s 8972 <pbs> (see the sketch after this list).
- Check NIC offloads: if you see high CPU or drops, test disabling problematic offloads (ethtool -K <nic> gro off lro off) and measure.
- Pin PBS NICs to backup traffic; use LACP bonds where possible.
- On PVE, avoid running live migration and backups on the same interface without QoS.
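A quick validation sequence, assuming the backup interface is called ens18 and the PBS host resolves as pbs.example.lan (both placeholders):

  # Jumbo frames on the backup NIC; every switch port and host in the path must match.
  ip link set dev ens18 mtu 9000

  # End-to-end check: 8972 = 9000 - 20 (IP header) - 8 (ICMP header); -M do forbids fragmentation.
  ping -M do -s 8972 -c 4 pbs.example.lan

  # If CPU is high or drops appear, test with GRO/LRO off and measure before making it permanent.
  ethtool -K ens18 gro off lro off
  ethtool -S ens18 | grep -iE 'drop|err'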
3) Concurrency and bandwidth controls
- Datastore concurrency: start with 4–6 concurrent tasks per datastore; raise slowly while watching latency and CPU.
- Client-side concurrency: avoid scheduling all VMs at the same minute; stagger jobs by cluster.
- Bandwidth limits: set per-job caps during business hours; remove or loosen limits overnight (see the sketch after this list).
- Compression: keep zstd unless CPU-bound. If CPU is saturated, tune the compression level or add cores.
- CPU pinning: ensure the PBS VM (if virtualized) has dedicated vCPUs and NUMA alignment to avoid contention.
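A sketch of server-side rate limiting, assuming a PBS version with traffic-control rules (2.1 or later); the rule name, subnet, rates, and timeframe are examples:

  # Cap traffic from the PVE cluster subnet during business hours.
  proxmox-backup-manager traffic-control create business-hours \
      --network 10.10.20.0/24 \
      --rate-in 200MB --rate-out 200MB \
      --timeframe "mon..fri 8:00-18:00"
  proxmox-backup-manager traffic-control list

  # On the PVE side, a per-node fallback cap for vzdump jobs (KiB/s) can go in /etc/vzdump.conf:
  #   bwlimit: 204800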
4) Verify, prune, and GC timing
- Prune nightly after backups; mirror retention policy.
- GC after prune, but away from backup peaks. Weekly is typical; nightly on fast storage with heavy churn.
- Verify weekly (daily for critical data). If storage is the bottleneck, add a faster cache/special device or stagger verify across namespaces.
- Do not overlap heavy verify with GC on the same datastore; serialize them for consistent latency (a command sketch follows this list).
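A sketch of one-off runs, useful for timing these tasks outside the backup window; the datastore name store1, the backup group, and the repository string are placeholders. Scheduled jobs use the same calendar-event syntax as the GUI (e.g. 02:30 for nightly, sat 03:00 for weekly).

  # Prune one backup group against the intended retention policy.
  proxmox-backup-client prune vm/100 --keep-daily 7 --keep-weekly 4 --keep-monthly 6 \
      --repository root@pam@pbs.example.lan:store1

  # Start garbage collection, then a verification, away from backup peaks.
  proxmox-backup-manager garbage-collection start store1
  proxmox-backup-manager verify store1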
5) Speeding up restores
- Keep recent data hot: a larger ARC and/or a special device helps both file-level and full restores.
- Ensure restore targets (PVE storage) are as fast or faster than backup storage; avoid restoring to slow tiers.
- Network: restore paths should match backup paths (MTU, VLAN). Avoid cross-DC restores without sufficient bandwidth.
- Test restores monthly; measure throughput and adjust concurrency or bandwidth caps accordingly (a timed-restore example follows this list).
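A simple way to measure restore throughput, assuming a host-type backup with a root.pxar archive; the repository, snapshot, and target path are placeholders (VM disk archives restore the same way, to a raw image file):

  export PBS_REPOSITORY='root@pam@pbs.example.lan:store1'

  # Time a full restore to scratch space and note the resulting MB/s.
  time proxmox-backup-client restore "host/pve1/2024-01-15T02:00:00Z" root.pxar /mnt/scratch/restore-test

  # File-level sanity check without pulling the whole archive.
  proxmox-backup-client catalog dump "host/pve1/2024-01-15T02:00:00Z"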
6) Monitoring and bottleneck hunting
- Watch PBS task logs for chunk errors, namespace warnings, and slow tasks.
- Monitor disk latency (iostat, zpool iostat), ARC hit ratios, and CPU steal (if virtualized); example commands follow this list.
- Track network drops/retransmits; check switch counters during backup windows.
- Alert on verify/prune/GC failures; correlate with storage/CPU/network telemetry.
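A few commands worth running during a backup window, assuming ZFS tooling is installed and ens18 is the backup NIC (placeholder):

  # Per-vdev throughput and latency (the latency columns need a reasonably recent ZFS).
  zpool iostat -v -l 5
  iostat -x 5

  # ARC size and hit ratio.
  arc_summary | less
  arcstat 5

  # Drops and retransmits on the backup path.
  ip -s link show dev ens18
  ss -ti | grep -c retrans     # rough count of sockets reporting retransmissions

  # Recent PBS tasks; look for failures and unusually long runtimes.
  proxmox-backup-manager task list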
Example tuning playbooks
10 Gbps, mixed workloads
- Datastore on SAS mirrors; ARC ~16 GB; no L2ARC.
- Concurrency: 4–6 tasks; stagger cluster jobs by 15 minutes.
- Prune nightly; GC weekly; verify weekly.
- Bandwidth caps during business hours; uncapped overnight.
25 Gbps, large datasets
- Datastore on NVMe or SAS with special device for metadata; ARC 32–64 GB + L2ARC.
- Concurrency: 6–10 tasks; monitor latency to adjust.
- Prune nightly; GC nightly; verify staggered per namespace.
- Ensure jumbo frames are clean end-to-end; validate with ping tests.
Virtualized PBS (no dedicated hardware)
- Reserve vCPUs and RAM; avoid CPU steal. Use virtio for NIC/disk (an example VM config follows this list).
- Keep datastore on dedicated virtual disk backed by fast storage; avoid shared OS/datastore virtual disks.
- Lower concurrency (3–4) to avoid noisy neighbors; watch latency closely.
- Verify less frequently if IO is a bottleneck; prioritize prune + GC consistency.
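A sketch of PVE-side settings for a virtualized PBS guest; the VMID, core and RAM counts, and bridge name are placeholders to size against the host:

  qm set 110 --cores 8 --numa 1                   # keep vCPUs and memory on one NUMA node
  qm set 110 --memory 32768 --balloon 0           # fixed RAM; ballooning fights the ARC
  qm set 110 --scsihw virtio-scsi-single          # dedicated virtio SCSI controller per disk (enables iothread)
  qm set 110 --net0 virtio,bridge=vmbr1,mtu=9000  # virtio NIC on the backup bridge, jumbo frames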
Need faster backup or restore windows?
We tune PBS storage, networking, and schedules to hit tight RPO/RTO targets.