Documentation of my hybrid storage infrastructure: Linstor DRBD distributed storage for VMs and containers, and active-passive ZFS replication for cold data, served by a highly available NFS server.
## Context and Problem Statement
### Hybrid Storage Architecture
My Proxmox cluster uses two types of storage with different needs and constraints:
#### High-Performance Storage for VM/LXC: Linstor DRBD
- **Usage**: System disks for virtual machines and containers
- **Requirements**: Synchronous replication, live migration, RPO ~0
- **Backing storage**: NVMe SSDs on the Proxmox nodes
- **Technology**: Linstor DRBD (see [blog post on distributed storage](/blog/stockage-distribue-proxmox-ha))
#### Cold Data Storage: Replicated ZFS
- **Usage**: Media, user files, Proxmox Backup Server backups
- **Requirements**: Large capacity, data integrity, high availability but live migration not required
- **Backing storage**: USB drives on the Proxmox nodes (independent ZFS pools)
- **Technology**: Active-passive ZFS replication with Sanoid/Syncoid
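To give an idea of the mechanism, here is a minimal sketch of a replication pass with the Sanoid/Syncoid pair mentioned above. The pool, dataset and standby hostname are hypothetical placeholders, not my actual names:

```bash
# Hypothetical dataset and host names. Sanoid takes the periodic snapshots
# (configured in /etc/sanoid/sanoid.conf); syncoid then sends the newest
# snapshots to the standby node without creating an extra sync snapshot.
syncoid --no-sync-snap --recursive tank/nfs root@node2:tank/nfs
```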
### Why Not Use Linstor DRBD for Everything?
Synchronous distributed storage like Linstor DRBD has several constraints for cold data:
- **Write Performance**: Every write must be confirmed on multiple nodes, penalizing large file transfers
- **Network Consumption**: Synchronous replication would saturate the 1 Gbps network during massive transfers
- **Unnecessary Complexity**: Cold data doesn't need live migration or near-zero RPO
- **Cost/Benefit**: Resource over-consumption for a need that can be satisfied by asynchronous replication
### The Solution: Active-Passive ZFS Replication
For cold data, **asynchronous snapshot-based replication** offers the best compromise:
| Criteria | Linstor DRBD | Replicated ZFS |
|---------|--------------|--------------|
| Replication Type | Synchronous | Asynchronous (snapshots) |
| Network Overhead | High (continuous) | Low (periodic) |
| RPO | ~0 | Snapshot interval (10 min) |
| Live Migration | Yes | No (not needed for cold data) |
| Data Integrity | Good | Excellent (ZFS checksums) |
| Suited for | VM/LXC system disks | Large volumes of cold data |
An RPO of 10 minutes is **perfectly acceptable** for media and user files: in case of node failure, only changes from the last 10 minutes could be lost.
The `shared=1` option is **mandatory** for the ZFS dataset bind mount. It tells Proxmox VE that this storage is available on every cluster node, so High Availability (HA) can relocate the container instead of being blocked by what it would otherwise treat as a local-only mount point.
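As a sketch of what this looks like on the Proxmox side, with a hypothetical container ID, dataset path and mount point:

```bash
# Bind-mount the ZFS dataset into the NFS container and mark it as shared,
# so Proxmox HA accepts relocating the container to another node.
# 105, /tank/nfs and /srv/nfs are placeholders for the actual IDs and paths.
pct set 105 -mp0 /tank/nfs,mp=/srv/nfs,shared=1
```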
The NFS container rootfs is stored on Linstor DRBD to benefit from **Proxmox high availability**. This allows the LXC to automatically fail over to the other node in case of failure, with only about **60 seconds** of downtime.
Without shared/distributed storage, Proxmox HA couldn't automatically migrate the container, requiring manual intervention.
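Enabling HA for the container is a one-liner; the sketch below assumes container ID 105 and an HA group named `nfs-ha` (both placeholders):

```bash
# Because the rootfs lives on Linstor DRBD, any node can start the container,
# so the HA manager can restart it elsewhere after a node failure.
ha-manager add ct:105 --state started --group nfs-ha --max_restart 2 --max_relocate 2
```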
#### Automatic Replication Script
The [`zfs-nfs-replica.sh`](https://forgejo.tellserv.fr/Tellsanguis/zfs-sync-nfs-ha) script runs every **10 minutes** via a systemd timer: it determines which node currently hosts the NFS container (the master) and replicates the cold-data ZFS datasets to the standby node with Syncoid.
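The unit pair driving it looks roughly like the following sketch. Unit names, the install path and the intervals shown here are assumptions; the real units ship alongside the script:

```bash
# Hypothetical unit files; adjust the path to wherever zfs-nfs-replica.sh lives.
cat >/etc/systemd/system/zfs-nfs-replica.service <<'EOF'
[Unit]
Description=Active-passive ZFS replication for the HA NFS datasets

[Service]
Type=oneshot
ExecStart=/usr/local/bin/zfs-nfs-replica.sh
EOF

cat >/etc/systemd/system/zfs-nfs-replica.timer <<'EOF'
[Unit]
Description=Run zfs-nfs-replica.sh every 10 minutes

[Timer]
OnBootSec=5min
OnUnitActiveSec=10min

[Install]
WantedBy=timers.target
EOF

systemctl daemon-reload
systemctl enable --now zfs-nfs-replica.timer
```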
The `no_root_squash` option allows NFS clients to perform operations as root. This is acceptable in a trusted home network (192.168.100.0/24), but would constitute a **major security risk** on an untrusted network.
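For context, a typical `/etc/exports` entry using this option could look like the following; the export path is a placeholder, and the network matches the trusted LAN mentioned above:

```bash
# /etc/exports on the NFS LXC -- /srv/nfs is an example path.
# no_root_squash is acceptable only because 192.168.100.0/24 is a trusted network.
echo '/srv/nfs 192.168.100.0/24(rw,sync,no_subtree_check,no_root_squash)' >> /etc/exports
exportfs -ra   # reload the export table
```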
### Systemd Services
Active NFS services on LXC:
```bash
nfs-server.service enabled # Main NFS server
nfs-blkmap.service enabled # pNFS block layout support
nfs-client.target enabled # Target for NFS clients
```
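To confirm the server side is healthy, a few standard commands are enough (run inside the LXC):

```bash
systemctl status nfs-server.service   # main NFS server state
exportfs -v                           # currently exported paths and their options
showmount -e localhost                # what clients would see
```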
The NFS client configuration described below is used on my [Docker Compose & Ansible production VM](/docs/homelab-actuel/docker-compose), which hosts all my containerized services.
### Mount Options Explained
| Option | Description |
|--------|-------------|
| `hard` | In case of NFS server unavailability, I/O operations are **blocked waiting** rather than failing (ensures integrity) |
| `intr` | Allows interrupting blocked I/O operations with a signal (kept for compatibility; modern kernels ignore it and always allow `SIGKILL`) |
| `timeo=100` | 10-second timeout (100 tenths of a second) before a retransmission |
| `retrans=30` | Number of retransmissions before a major timeout is reported (roughly 30 × 10 s ≈ 5 minutes); with `hard`, the client then logs "server not responding" and keeps retrying instead of returning an error |
| `_netdev` | Indicates mount requires network (systemd waits for network connectivity) |
| `nofail` | Doesn't prevent boot if mount fails (avoids boot blocking) |
| `x-systemd.automount` | Automatic mount on first use (avoids blocking boot) |
| `0 0` | No dump or fsck (not applicable for NFS) |
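Putting the options from the table together, the client-side mount can be declared as follows. The server address, export path and mount point are placeholders; the options are exactly those described above:

```bash
# Add the /etc/fstab entry on the client VM -- IP and paths are examples.
echo '192.168.100.10:/srv/nfs /mnt/nfs nfs hard,intr,timeo=100,retrans=30,_netdev,nofail,x-systemd.automount 0 0' >> /etc/fstab
systemctl daemon-reload   # let systemd generate the mount/automount units
ls /mnt/nfs               # first access triggers the automount
```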
### Behavior During NFS Failover
Thanks to the `hard` and `retrans=30` options, during an NFS server failover (~60 seconds):
1. **During Failover**: ongoing I/O operations are **suspended** (hard mount), not failed
2. **After Failover**: suspended operations resume automatically once the NFS server responds again
For reference, here is the journal output of a replication run where the target is already up to date:
```bash
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3534180]: NEWEST SNAPSHOT: autosnap_2025-12-18_16:30:10_frequently
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3534180]: INFO: no snapshots on source newer than autosnap_2025-12-18_16:30:10_frequently on target. Nothing to do.
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3534221]: NEWEST SNAPSHOT: autosnap_2025-12-18_16:30:10_frequently
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3534221]: INFO: no snapshots on source newer than autosnap_2025-12-18_16:30:10_frequently on target. Nothing to do.
```
### 10-Minute RPO
Unlike Linstor DRBD, which offers near-zero RPO, ZFS replication every 10 minutes means that in case of master node failure, **changes from the last 10 minutes** could be lost.
For cold data (media, files), this is acceptable. For critical data requiring RPO ~0, Linstor DRBD remains the appropriate solution.
### ~60 Second Downtime During Failover
Automatic LXC failover takes approximately **60 seconds**. During this time, the NFS server is inaccessible.
NFS clients will see their I/O operations blocked, then automatically resume once the server is available again (thanks to NFS retry mechanisms).
### Unidirectional Replication
At any time T, replication always occurs **from master to standby**. There is no simultaneous bidirectional replication.
If modifications are made on the standby (which shouldn't happen in normal use), they will be **overwritten** during the next replication.
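One optional safeguard against that scenario (not necessarily part of the replication script) is to keep the replicated datasets read-only on the standby: `zfs receive` still applies incoming snapshots, only local writes are blocked. The dataset name below is a placeholder:

```bash
# Block accidental local writes on the standby copy; replication streams
# received with zfs receive / syncoid are not affected by this property.
zfs set readonly=on tank/nfs
```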
### Network Dependency
Replication requires network connectivity between nodes. In case of network partition (split-brain), each node could believe itself to be master.
The script implements checks to minimize this risk, but in a prolonged split-brain scenario, manual intervention may be necessary.
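As an illustration of the kind of guard such a script can use (a sketch with an assumed container ID, not the actual script logic):

```bash
# Only replicate from the node that currently owns the NFS container, and only
# when the cluster is quorate. CTID 105 is a placeholder.
CTID=105
NODE=$(hostname)

# In a Proxmox cluster, a CT's config lives under the node that owns it.
if [ ! -f "/etc/pve/nodes/${NODE}/lxc/${CTID}.conf" ]; then
    echo "CT ${CTID} is not on this node: acting as standby, skipping replication."
    exit 0
fi

# Without quorum we might be on the wrong side of a partition: do nothing.
if ! pvecm status | grep -q 'Quorate:.*Yes'; then
    echo "Cluster not quorate: skipping replication."
    exit 1
fi
```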
## Conclusion
The **hybrid storage** architecture combining Linstor DRBD and replicated ZFS offers the best of both worlds:
- **Linstor DRBD** for VM/LXC: synchronous replication, live migration, RPO ~0
- **Replicated ZFS** for cold data: large capacity, excellent integrity, minimal overhead
The highly available NFS server, with its **rootfs on DRBD** and **automatic ZFS replication**, ensures:
- Failover time of **~60 seconds** in case of failure
- Automatic adaptation to Proxmox HA failover
- **Maximum data loss of 10 minutes** (RPO)
- No manual intervention required
This solution is **perfectly suited** for a homelab requiring high availability for a cold data NFS server, while preserving resources (CPU, RAM, network) for critical services.