vault backup: 2025-12-18 19:19:50
---
sidebar_position: 1
tags: [zfs, nfs, high-availability, replication, proxmox, homelab, sanoid, syncoid]
last_update:
  date: 2025-12-18
---
|
||||
|
||||
# ZFS Replication and Highly Available NFS Server
|
||||
|
||||
Documentation of my hybrid storage infrastructure: Linstor DRBD distributed storage for VMs, and active-passive ZFS replication for cold data with a highly available NFS server.
|
||||
|
||||
## Context and Problem Statement
|
||||
|
||||
### Hybrid Storage Architecture
|
||||
|
||||
My Proxmox cluster uses two types of storage with different needs and constraints:
|
||||
|
||||
#### High-Performance Storage for VM/LXC: Linstor DRBD
|
||||
|
||||
- **Usage**: System disks for virtual machines and containers
|
||||
- **Requirements**: Synchronous replication, live migration, RPO ~0
|
||||
- **Hardware**: NVMe SSDs on Proxmox nodes
|
||||
- **Technology**: Linstor DRBD (see [blog post on distributed storage](/blog/stockage-distribue-proxmox-ha))
|
||||
|
||||
#### Cold Data Storage: Replicated ZFS
|
||||
|
||||
- **Usage**: Media, user files, Proxmox Backup Server backups
|
||||
- **Requirements**: Large capacity, data integrity, high availability but live migration not required
|
||||
- **Hardware**: USB drives on Proxmox nodes (independent ZFS pools)
|
||||
- **Technology**: Active-passive ZFS replication with Sanoid/Syncoid
|
||||
|
||||
### Why Not Use Linstor DRBD for Everything?
|
||||
|
||||
Synchronous distributed storage like Linstor DRBD has several constraints for cold data:
|
||||
|
||||
- **Write Performance**: Every write must be confirmed on multiple nodes, penalizing large file transfers
|
||||
- **Network Consumption**: Synchronous replication would saturate the 1 Gbps network during massive transfers
|
||||
- **Unnecessary Complexity**: Cold data doesn't need live migration or near-zero RPO
|
||||
- **Cost/Benefit**: Resource over-consumption for a need that can be satisfied by asynchronous replication
|
||||
|
||||
### The Solution: Active-Passive ZFS Replication
|
||||
|
||||
For cold data, **asynchronous snapshot-based replication** offers the best compromise:
|
||||
|
||||
| Criteria | Linstor DRBD | Replicated ZFS |
|---------|--------------|--------------|
| Replication Type | Synchronous | Asynchronous (snapshots) |
| Network Overhead | High (continuous) | Low (periodic) |
| RPO | ~0 | Snapshot interval (10 min) |
| Live Migration | Yes | Not necessary |
| Data Integrity | Good | Excellent (ZFS checksums) |
| Suited for | VM/LXC system | Large cold data |
|
||||
|
||||
An RPO of 10 minutes is **perfectly acceptable** for media and user files: in case of node failure, only changes from the last 10 minutes could be lost.
|
||||
|
||||
## Architecture
|
||||
|
||||
### Overview
|
||||
|
||||
```
┌─────────────────────────────────────────────────────────────┐
│                     Proxmox HA Cluster                      │
│                                                             │
│  ┌──────────────────┐              ┌──────────────────┐     │
│  │   acemagician    │              │    elitedesk     │     │
│  │                  │◄────────────►│                  │     │
│  │ - zpool1 (10TB)  │ Replication  │ - zpool1 (10TB)  │     │
│  │ - zpool2 (2TB)   │   Sanoid/    │ - zpool2 (2TB)   │     │
│  │                  │   Syncoid    │                  │     │
│  └────────┬─────────┘              └─────────┬────────┘     │
│           │                                  │              │
│           │         ┌──────────────┐         │              │
│           └────────►│   LXC 103    │◄────────┘              │
│                     │  NFS Server  │                        │
│                     │ (rootfs on   │                        │
│                     │    DRBD)     │                        │
│                     └──────┬───────┘                        │
└────────────────────────────┼────────────────────────────────┘
                             │
                             ▼
                     NFS Clients (VMs)
                     192.168.100.0/24
```
|
||||
|
||||
### Components
|
||||
|
||||
#### ZFS Pools on Proxmox Nodes
|
||||
|
||||
Each node has **two independent ZFS pools**:
|
||||
|
||||
**zpool1** (~10 TB): Large data
|
||||
- `zpool1/data-nfs-share`: Main NFS share (6.83 TB used)
|
||||
- `zpool1/pbs-backups`: Proxmox Backup Server backups
|
||||
|
||||
**zpool2** (~2 TB): Media and files
|
||||
- `zpool2/photos`: Photo library (14.7 GB)
|
||||
- `zpool2/storage`: Miscellaneous files (19.1 GB)
|
||||
|
||||
**Pool status on nodes**:
|
||||
|
||||
```bash
# Node acemagician
root@acemagician:~# zfs list
NAME                    USED  AVAIL  REFER  MOUNTPOINT
zpool1                 7.83T  2.95T   104K  /zpool1
zpool1/data-nfs-share  6.83T  2.95T  6.79T  /zpool1/data-nfs-share
zpool1/pbs-backups       96K  1024G    96K  /zpool1/pbs-backups
zpool2                 33.9G  1.72T   104K  /zpool2
zpool2/photos          14.7G  1.72T  12.7G  /zpool2/photos
zpool2/storage         19.1G  1.72T  19.1G  /zpool2/storage

# Node elitedesk
root@elitedesk:~# zfs list
NAME                    USED  AVAIL  REFER  MOUNTPOINT
zpool1                 7.83T  2.97T    96K  /zpool1
zpool1/data-nfs-share  6.83T  2.97T  6.79T  /zpool1/data-nfs-share
zpool1/pbs-backups       96K  1024G    96K  /zpool1/pbs-backups
zpool2                 33.9G  1.72T   112K  /zpool2
zpool2/photos          14.7G  1.72T  12.7G  /zpool2/photos
zpool2/storage         19.1G  1.72T  19.1G  /zpool2/storage
```
|
||||
|
||||
Both nodes report the same `USED` and `REFER` values for every dataset, which shows that the standby is in sync with the master.
|
||||
|
||||
:::info
|
||||
Pools are kept **identical on both nodes** by automatic replication, which always runs from the active node to the standby and reverses direction after a failover. The active node (hosting the LXC) is always the master.
|
||||
:::
|
||||
|
||||
#### LXC 103: Highly Available NFS Server
|
||||
|
||||
The LXC 103 container acts as an NFS server with the following characteristics:
|
||||
|
||||
- **Rootfs on Linstor DRBD**: Enables high availability via Proxmox HA
|
||||
- **ZFS Dataset Mounting**: Direct access to host node pools via bind mount
|
||||
- **NFS Service**: Exposes datasets via NFS to network clients
|
||||
- **Automatic Failover**: In case of failure, Proxmox HA restarts the LXC on the other node (~60s downtime)
|
||||
|
||||

|
||||
|
||||
**Detailed configuration**:
|
||||
- **CPU**: 2 cores
|
||||
- **RAM**: 1 GB (+ 512 MB swap)
|
||||
- **Rootfs**: 8 GB on `linstor_storage` (DRBD distributed storage)
|
||||
- **Mount Point (mp0)**: `/zpool1/data-nfs-share,mp=/data-nfs-share,shared=1`
|
||||
|
||||
:::warning Important: shared=1 option
|
||||
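The `shared=1` option is **required** for HA with a bind mount. It tells Proxmox that the path `/zpool1/data-nfs-share` is available on every node, so the container can be migrated or restarted on the other node. The option does not share or replicate anything by itself: the dataset must already exist on both nodes, which is what the ZFS replication provides.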
|
||||
:::
|
||||
|
||||
:::tip Why Linstor DRBD for LXC rootfs?
|
||||
The NFS container rootfs is stored on Linstor DRBD to benefit from **Proxmox high availability**. This allows the LXC to automatically fail over to the other node in case of failure, with only about **60 seconds** of downtime.
|
||||
|
||||
Without shared/distributed storage, Proxmox HA couldn't automatically migrate the container, requiring manual intervention.
|
||||
:::
|
||||
|
||||
#### Automatic Replication Script
|
||||
|
||||
The [`zfs-nfs-replica.sh`](https://forgejo.tellserv.fr/Tellsanguis/zfs-sync-nfs-ha) script runs every **10 minutes** via a systemd timer and implements the following logic:
|
||||
|
||||

|
||||
|
||||
1. **Automatic Master Detection**: The node hosting LXC 103 automatically becomes the master
2. **Dynamic Sanoid Configuration**:
   - **Master Node**: `autosnap=yes`, `autoprune=yes` (snapshot creation)
   - **Standby Node**: `autosnap=no`, `autoprune=yes` (receive only)
3. **Replication via Syncoid**: Incremental transfer of snapshots from master to standby
4. **Security Checks**:
   - Triple verification that the correct node is master
   - Size comparison to detect empty replacement disk
   - Size history to prevent accidental overwrites
|
||||
|
||||
## Technical Operation
|
||||
|
||||
### Automatic Master Node Detection
|
||||
|
||||
The script determines which node hosts LXC 103:
|
||||
|
||||
```bash
# Detect active node
ACTIVE_NODE=$(pvesh get /cluster/resources --type vm --output-format json | \
    jq -r '.[] | select(.vmid==103) | .node')

# Compare with local node
CURRENT_NODE=$(hostname)

if [ "$ACTIVE_NODE" = "$CURRENT_NODE" ]; then
    # This node is the master
    configure_as_master
else
    # This node is on standby
    configure_as_standby
fi
```
|
||||
|
||||
This detection ensures the system automatically adapts to LXC migrations, whether **planned** (maintenance) or **automatic** (Proxmox HA failover).
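
The role decision can also be written as a small standalone function. This is a minimal sketch and not code from the actual script: the name `decide_role` and the `unknown` result are assumptions made for illustration. It shows why an empty answer from the cluster query is worth treating separately, so that a failed query is not silently read as "standby".

```bash
#!/usr/bin/env bash
# Hypothetical helper: maps (active node, local node) to a role.
decide_role() {
    local active_node="$1" local_node="$2"
    if [ -z "$active_node" ]; then
        # The cluster query returned nothing (API error, LXC not found)
        echo "unknown"
    elif [ "$active_node" = "$local_node" ]; then
        echo "master"
    else
        echo "standby"
    fi
}

# Example with sample values instead of a live cluster query
decide_role "elitedesk" "elitedesk"    # prints: master
decide_role "elitedesk" "acemagician"  # prints: standby
decide_role "" "acemagician"           # prints: unknown
```

In a real run the first argument would come from the `pvesh` query and the second from `hostname`.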
|
||||
|
||||
### Dynamic Sanoid Configuration
|
||||
|
||||
Sanoid is configured differently based on node role:
|
||||
|
||||
#### Master Node (hosts LXC 103)
|
||||
|
||||
```ini
|
||||
[zpool1/data-nfs-share]
|
||||
use_template = production
|
||||
recursive = yes
|
||||
autosnap = yes # Automatic snapshot creation
|
||||
autoprune = yes # Old snapshot cleanup
|
||||
|
||||
[zpool2/photos]
|
||||
use_template = production
|
||||
recursive = yes
|
||||
autosnap = yes
|
||||
autoprune = yes
|
||||
|
||||
[zpool2/storage]
|
||||
use_template = production
|
||||
recursive = yes
|
||||
autosnap = yes
|
||||
autoprune = yes
|
||||
```
|
||||
|
||||
#### Standby Node
|
||||
|
||||
```ini
|
||||
[zpool1/data-nfs-share]
|
||||
use_template = production
|
||||
recursive = yes
|
||||
autosnap = no # No snapshot creation
|
||||
autoprune = yes # Old snapshot cleanup
|
||||
|
||||
[zpool2/photos]
|
||||
use_template = production
|
||||
recursive = yes
|
||||
autosnap = no
|
||||
autoprune = yes
|
||||
|
||||
[zpool2/storage]
|
||||
use_template = production
|
||||
recursive = yes
|
||||
autosnap = no
|
||||
autoprune = yes
|
||||
```
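
One way to switch between the two configurations is to rewrite only the `autosnap` lines of the dataset sections and leave templates untouched. The sketch below is an assumption about how this could be done, not the script's actual implementation; the function name `set_autosnap` is hypothetical, and inline comments on the rewritten lines are dropped.

```bash
#!/usr/bin/env bash
# Hypothetical helper: set autosnap to "yes" or "no" in every
# non-template section of a Sanoid-style config file.
set_autosnap() {
    local conf="$1" value="$2" tmp
    tmp=$(mktemp) || return 1
    awk -v value="$value" '
        # Track whether the current section is a [template_*] section
        /^\[/ { in_template = ($0 ~ /^\[template_/) }
        # Rewrite autosnap lines outside of templates
        !in_template && /^[[:space:]]*autosnap[[:space:]]*=/ {
            print "autosnap = " value
            next
        }
        { print }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    local status=$?
    rm -f "$tmp"
    return "$status"
}
```

On the standby this would be called as `set_autosnap /etc/sanoid/sanoid.conf no`, and with `yes` on the master.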
|
||||
|
||||
### Replication with Syncoid
|
||||
|
||||
Syncoid performs incremental snapshot replication from master to standby:
|
||||
|
||||
```bash
# Replicate each dataset
syncoid --no-sync-snap --recursive \
    root@master:zpool1/data-nfs-share \
    zpool1/data-nfs-share

syncoid --no-sync-snap --recursive \
    root@master:zpool2/photos \
    zpool2/photos

syncoid --no-sync-snap --recursive \
    root@master:zpool2/storage \
    zpool2/storage
```
|
||||
|
||||
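The `--no-sync-snap` option stops Syncoid from creating its own temporary sync snapshot, so only existing Sanoid snapshots are transferred. As a result, the standby can never be more recent than the newest Sanoid snapshot on the master.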
|
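
The three near-identical commands can be generated from a dataset list. The following is a dry-run sketch that only prints the commands; the function name `print_syncoid_commands` and the `DATASETS` array are illustrative assumptions, and the flags are the ones already used in this section.

```bash
#!/usr/bin/env bash
# Datasets taken from the examples in this section
DATASETS=("zpool1/data-nfs-share" "zpool2/photos" "zpool2/storage")

# Hypothetical helper: print one syncoid command per dataset (dry run,
# nothing is executed).
print_syncoid_commands() {
    local remote="$1" dataset
    shift
    for dataset in "$@"; do
        echo "syncoid --no-sync-snap --recursive ${remote}:${dataset} ${dataset}"
    done
}

print_syncoid_commands "root@master" "${DATASETS[@]}"
```

Printing first makes it easy to review the replication direction before anything is run.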
||||
|
||||
### Security Mechanisms
|
||||
|
||||
The script implements several checks to prevent data loss:
|
||||
|
||||
#### Triple Replication Direction Verification
|
||||
|
||||
Before each replication, the script verifies **three times** that:
|
||||
1. LXC 103 is on the local node
|
||||
2. The local node is the master
|
||||
3. Sanoid configuration is in master mode
|
||||
|
||||
If any of these checks fails, replication is **aborted** to prevent replication in the wrong direction.
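
A repeated check of this kind can be sketched as a loop that runs the same test several times with a pause in between and stops at the first failure. This is an illustration under assumptions: the helper name `verify_repeatedly` is hypothetical, and the check command passed to it stands in for whatever the script really runs.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a check command N times, DELAY seconds apart.
# Returns non-zero as soon as one attempt fails.
verify_repeatedly() {
    local attempts="$1" delay="$2" i
    shift 2
    for ((i = 1; i <= attempts; i++)); do
        if ! "$@"; then
            echo "Verification #$i/$attempts failed, aborting" >&2
            return 1
        fi
        # No need to wait after the last attempt
        if (( i < attempts )); then
            sleep "$delay"
        fi
    done
}

# Example with a check that always succeeds
verify_repeatedly 3 0 true && echo "all 3 checks passed"
```

Spacing the attempts a few seconds apart guards against acting on a stale answer while the container is in the middle of a migration.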
|
||||
|
||||
#### Empty Disk Protection
|
||||
|
||||
Before replicating, the script compares dataset sizes:
|
||||
|
||||
```bash
# Get sizes in bytes (-p prints exact numeric values)
SOURCE_SIZE=$(ssh root@master "zfs get -Hp -o value used zpool1/data-nfs-share")
TARGET_SIZE=$(zfs get -Hp -o value used zpool1/data-nfs-share)

# Abort if a size could not be read (e.g. an SSH failure returns an empty string)
if ! [[ "$SOURCE_SIZE" =~ ^[0-9]+$ && "$TARGET_SIZE" =~ ^[0-9]+$ ]]; then
    echo "ERROR: could not read dataset sizes" >&2
    exit 1
fi

# If the source is less than half the size of the target
if [ "$SOURCE_SIZE" -lt $((TARGET_SIZE / 2)) ]; then
    echo "ERROR: Suspicious source size, empty replacement disk?" >&2
    exit 1
fi
```
|
||||
|
||||
This prevents an empty replacement disk from overwriting standby data.
|
||||
|
||||
#### Size History
|
||||
|
||||
The script maintains a dataset size history to detect abnormal variations (sudden size drop indicating a problem).
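
The comparison behind such a history check can be sketched as follows. The 50% threshold, the function name `size_dropped_too_much`, and the sample byte values are assumptions for illustration; the real script may store and compare sizes differently.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when CURRENT has dropped by more
# than MAX_DROP_PCT percent compared with PREVIOUS. Sizes are in bytes.
size_dropped_too_much() {
    local previous="$1" current="$2" max_drop_pct="$3"
    # Without a previous measurement there is nothing to compare against
    (( previous > 0 )) || return 1
    (( current * 100 < previous * (100 - max_drop_pct) ))
}

# Example: previous run recorded 7.5 TB, current run reports 1 GB
if size_dropped_too_much 7500000000000 1000000000 50; then
    echo "Abnormal size drop detected, skipping replication"
fi
```

The previous value would be read from the history kept between runs, and the current one from `zfs get -Hp -o value used`.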
|
||||
|
||||
## NFS Configuration
|
||||
|
||||
### NFS Exports on LXC 103
|
||||
|
||||
The `/etc/exports` file defines NFS shares:
|
||||
|
||||
```bash
|
||||
# zpool2 pools exposed to specific VM (192.168.100.250)
|
||||
/zpool2 192.168.100.250(sync,wdelay,hide,crossmnt,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
|
||||
/zpool2/photos 192.168.100.250(sync,wdelay,hide,crossmnt,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
|
||||
/zpool2/storage 192.168.100.250(sync,wdelay,hide,crossmnt,no_subtree_check,sec=sys,rw,secure,no_root_squash,no_all_squash)
|
||||
|
||||
# Main share accessible to entire network
|
||||
/data-nfs-share 192.168.100.0/24(sync,wdelay,hide,no_subtree_check,sec=sys,rw,insecure,no_root_squash,no_all_squash)
|
||||
```
|
||||
|
||||
### NFS Options Explained
|
||||
|
||||
| Option | Description |
|--------|-------------|
| `sync` | Confirms writes only after disk commit (integrity) |
| `wdelay` | Groups writes to improve performance |
| `hide` | Default behavior: a filesystem mounted below the export stays hidden unless it is exported and mounted explicitly |
| `crossmnt` | Allows crossing mounts (useful with ZFS datasets) |
| `no_subtree_check` | Disables subtree checking (performance) |
| `sec=sys` | Uses standard UNIX UID/GID authentication |
| `rw` | Read/write |
| `secure` | Requires requests to come from a privileged port (below 1024) |
| `insecure` | Also accepts requests from ports 1024 and above (necessary for some clients) |
| `no_root_squash` | Preserves root permissions (avoids mapping to nobody) |
| `no_all_squash` | Preserves user UIDs/GIDs |
|
||||
|
||||
:::warning Security: no_root_squash
|
||||
The `no_root_squash` option allows NFS clients to perform operations as root. This is acceptable in a trusted home network (192.168.100.0/24), but would constitute a **major security risk** on an untrusted network.
|
||||
:::
|
||||
|
||||
### Systemd Services
|
||||
|
||||
Active NFS services on LXC:
|
||||
|
||||
```bash
|
||||
nfs-server.service enabled # Main NFS server
|
||||
nfs-blkmap.service enabled # pNFS block layout support
|
||||
nfs-client.target enabled # Target for NFS clients
|
||||
nfs-exports-update.timer enabled # Automatic export updates
|
||||
```
|
||||
|
||||
### Network Ports
|
||||
|
||||
NFS listening ports:
|
||||
|
||||
```
|
||||
2049/tcp # NFSv4 (main)
|
||||
111/tcp # Portmapper (rpcbind)
|
||||
```
|
||||
|
||||
## Client-Side NFS Mounting
|
||||
|
||||
### /etc/fstab Configuration
|
||||
|
||||
To automatically mount the NFS share on VM or container boot, add the following entry to `/etc/fstab`:
|
||||
|
||||
```fstab
|
||||
192.168.100.150:/data-nfs-share /mnt/storage nfs hard,intr,timeo=100,retrans=30,_netdev,nofail,x-systemd.automount 0 0
|
||||
```
|
||||
|
||||
:::tip Real Example
|
||||
This configuration is used on my [Docker Compose & Ansible production VM](/docs/homelab-actuel/docker-compose) which hosts all my containerized services.
|
||||
:::
|
||||
|
||||
### Mount Options Explained
|
||||
|
||||
| Option | Description |
|--------|-------------|
| `hard` | In case of NFS server unavailability, I/O operations are **blocked waiting** rather than failing (ensures integrity) |
| `intr` | Historically allowed blocked I/O to be interrupted with Ctrl+C; ignored by Linux kernels after 2.6.25 and kept only for backward compatibility |
| `timeo=100` | 10-second timeout (100 tenths of a second) before a request is retransmitted |
| `retrans=30` | Number of retries before the client reports the server as not responding (30 × 10 s = at least 5 minutes); with `hard`, the client keeps retrying after that |
| `_netdev` | Indicates mount requires network (systemd waits for network connectivity) |
| `nofail` | Doesn't prevent boot if mount fails (avoids boot blocking) |
| `x-systemd.automount` | Automatic mount on first use (avoids blocking boot) |
| `0 0` | No dump or fsck (not applicable for NFS) |
|
||||
|
||||
### Behavior During NFS Failover
|
||||
|
||||
Thanks to the `hard` option, clients ride out an NFS server failover (~60 seconds) without errors:

1. **During Failover**: Ongoing I/O operations are **suspended** (hard mount)
2. **Automatic Retry**: The NFS client keeps retransmitting; with `timeo=100` and `retrans=30` it retries for at least 5 minutes (30 × 10 s) before reporting a timeout, and a hard mount continues retrying after that
3. **Transparent Recovery**: As soon as the NFS server restarts on the other node, I/O operations resume **automatically**
4. **No Intervention**: Applications don't need to restart or remount the share
|
||||
|
||||
:::info Fault Tolerance
|
||||
The retry window before the first reported timeout (at least 5 minutes) is well above the NFS server RTO (~60 seconds), and a `hard` mount keeps retrying beyond it, so clients survive failover without errors.
|
||||
:::
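
The timing margin discussed in this section can be checked with quick arithmetic. The sketch below uses the simplified fixed-interval estimate (`timeo` × `retrans`); real clients may lengthen the interval between retries, so the result is best read as a lower bound. The function name `retry_window_seconds` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate the retry window in seconds.
# timeo is expressed in tenths of a second, as in the mount options.
retry_window_seconds() {
    local timeo_tenths="$1" retrans="$2"
    echo $(( timeo_tenths * retrans / 10 ))
}

window=$(retry_window_seconds 100 30)
rto=60   # approximate failover time from this document

echo "retry window: ${window}s, failover: ${rto}s"   # prints: retry window: 300s, failover: 60s
if (( window > rto )); then
    echo "margin: $(( window - rto ))s"              # prints: margin: 240s
fi
```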
|
||||
|
||||
### Manual Mounting
|
||||
|
||||
To temporarily mount the NFS share:
|
||||
|
||||
```bash
|
||||
# Create mount point
|
||||
mkdir -p /mnt/storage
|
||||
|
||||
# Manual mount
|
||||
mount -t nfs -o hard,intr,timeo=100,retrans=30 \
|
||||
192.168.100.150:/data-nfs-share /mnt/storage
|
||||
|
||||
# Verify mount
|
||||
df -h /mnt/storage
|
||||
mount | grep nfs
|
||||
```
|
||||
|
||||
### Verify Automatic Mounting
|
||||
|
||||
```bash
|
||||
# Reload systemd to account for fstab
|
||||
systemctl daemon-reload
|
||||
|
||||
# Test mount without reboot
|
||||
mount -a
|
||||
|
||||
# Verify mount is active
|
||||
systemctl status mnt-storage.mount
|
||||
|
||||
# Display NFS statistics
|
||||
nfsstat -m
|
||||
```
|
||||
|
||||
## High Availability and Failover Time
|
||||
|
||||
### HA Architecture Thanks to Linstor DRBD
|
||||
|
||||
The NFS server benefits from Proxmox high availability thanks to **LXC 103 rootfs stored on Linstor DRBD**:
|
||||
|
||||

|
||||
|
||||
The screenshot above shows the Proxmox HA configuration of the NFS server:
|
||||
- **LXC 103 (nfs-server)**: HA resource with Max. Restart = 2, currently hosted on `elitedesk` node
|
||||
- The LXC can automatically restart on the other node in case of failure, thanks to its rootfs on shared DRBD storage
|
||||
|
||||
### Failure Scenario: Automatic Failover
|
||||
|
||||
In case of failure of a node hosting LXC 103:
|
||||
|
||||
1. **Detection** (5-10s): Proxmox HA Manager detects node failure via quorum
|
||||
2. **Decision** (1-2s): HA Manager decides to restart LXC on surviving node
|
||||
3. **Storage Migration** (0s): DRBD rootfs is already replicated and accessible on the other node
|
||||
4. **LXC Startup** (40-50s): LXC starts on new node
|
||||
5. **ZFS Mount and NFS Start** (5-10s): Local ZFS datasets are mounted and NFS service starts
|
||||
|
||||
**Total failover time: ~60 seconds**
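
The ~60-second total can be cross-checked by adding up the per-phase ranges listed above. This is only a worked sum of the figures given in this section; the function name `sum_phase_ranges` is made up for the example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: each argument is a "min max" duration in seconds.
# Prints the summed minimum and maximum.
sum_phase_ranges() {
    local min_total=0 max_total=0 range lo hi
    for range in "$@"; do
        read -r lo hi <<< "$range"
        min_total=$(( min_total + lo ))
        max_total=$(( max_total + hi ))
    done
    echo "$min_total $max_total"
}

# Detection, decision, storage migration, LXC startup, ZFS mount + NFS start
sum_phase_ranges "5 10" "1 2" "0 0" "40 50" "5 10"   # prints: 51 72
```

The result, 51 to 72 seconds, brackets the ~60-second figure.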
|
||||
|
||||
:::info RPO and RTO
|
||||
- **RPO (Recovery Point Objective)**: 10 minutes (ZFS replication interval)
|
||||
- **RTO (Recovery Time Objective)**: ~60 seconds (LXC failover time)
|
||||
|
||||
These values are **well within acceptable limits** for a cold data NFS server in a homelab context.
|
||||
:::
|
||||
|
||||
### Automatic Replication Adaptation
|
||||
|
||||
After LXC failover to the other node:
|
||||
|
||||
1. The replication script detects LXC is now on the new node
2. Sanoid configuration is automatically reversed:
   - The **former master** becomes **standby** (autosnap=no)
   - The **new master** becomes active (autosnap=yes)
3. Replication now occurs in the **opposite direction**
|
||||
|
||||
No manual intervention required.
|
||||
|
||||
## Installation and Deployment
|
||||
|
||||
### Prerequisites
|
||||
|
||||
- Proxmox cluster with at least 2 nodes
|
||||
- Identical ZFS pools on each node
|
||||
- LXC with rootfs on shared/distributed storage (Linstor DRBD)
|
||||
- Sanoid and Syncoid installed on Proxmox nodes
|
||||
- SSH access between nodes (SSH keys configured)
|
||||
|
||||
### Script Installation
|
||||
|
||||
```bash
|
||||
# On each Proxmox node
|
||||
|
||||
# 1. Clone Git repository
|
||||
cd /tmp
|
||||
git clone https://forgejo.tellserv.fr/Tellsanguis/zfs-sync-nfs-ha.git
|
||||
cd zfs-sync-nfs-ha
|
||||
|
||||
# 2. Install script
|
||||
cp zfs-nfs-replica.sh /usr/local/bin/
|
||||
chmod +x /usr/local/bin/zfs-nfs-replica.sh
|
||||
|
||||
# 3. Install systemd services
|
||||
cp zfs-nfs-replica.service /etc/systemd/system/
|
||||
cp zfs-nfs-replica.timer /etc/systemd/system/
|
||||
|
||||
# 4. Enable and start timer
|
||||
systemctl daemon-reload
|
||||
systemctl enable --now zfs-nfs-replica.timer
|
||||
|
||||
# 5. Cleanup
|
||||
cd ..
|
||||
rm -rf zfs-sync-nfs-ha
|
||||
```
|
||||
|
||||
### Basic Sanoid Configuration
|
||||
|
||||
Create `/etc/sanoid/sanoid.conf` with production template:
|
||||
|
||||
```ini
|
||||
[template_production]
|
||||
frequently = 0
|
||||
hourly = 24
|
||||
daily = 7
|
||||
weekly = 4
|
||||
monthly = 6
|
||||
yearly = 0
|
||||
autosnap = yes
|
||||
autoprune = yes
|
||||
```
|
||||
|
||||
The script will automatically modify `autosnap` parameters based on node role.
|
||||
|
||||
### Verify Operation
|
||||
|
||||
```bash
|
||||
# Check timer status
|
||||
systemctl status zfs-nfs-replica.timer
|
||||
|
||||
# Display logs from last execution
|
||||
journalctl -u zfs-nfs-replica.service -n 50
|
||||
|
||||
# List Sanoid snapshots
|
||||
sanoid --monitor-snapshots
|
||||
|
||||
# Verify replication on standby
|
||||
zfs list -t snapshot | grep zpool1/data-nfs-share
|
||||
```
|
||||
|
||||
#### Snapshot Status
|
||||
|
||||
```bash
|
||||
root@elitedesk:~# sanoid --monitor-snapshots | grep -E "(zpool1|zpool2)"
|
||||
OK: all monitored datasets (zpool1, zpool1/data-nfs-share, zpool1/pbs-backups,
|
||||
zpool2, zpool2/photos, zpool2/storage) have fresh snapshots
|
||||
```
|
||||
|
||||
All configured datasets have up-to-date snapshots, confirming Sanoid is working correctly.
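
Snapshot freshness can also be checked by hand, because Sanoid encodes the creation time in the snapshot name. The sketch below parses a name of the form `autosnap_YYYY-MM-DD_HH:MM:SS_type` and returns its age in seconds. It assumes GNU `date` (for `-d`) and, as a simplification, interprets the timestamp as UTC; the function name `snapshot_age_seconds` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: age of a Sanoid snapshot in seconds, derived from
# its name. The timestamp is treated as UTC (an assumption).
snapshot_age_seconds() {
    local name="$1" now_epoch="$2" stamp snap_epoch
    stamp="${name#autosnap_}"   # 2025-12-18_16:30:10_frequently
    stamp="${stamp%_*}"         # 2025-12-18_16:30:10
    # Split into date and time and convert to epoch seconds (GNU date)
    snap_epoch=$(date -u -d "${stamp%%_*} ${stamp#*_}" +%s) || return 1
    echo $(( now_epoch - snap_epoch ))
}

# Example: age of the snapshot seen in the logs, relative to now
snapshot_age_seconds "autosnap_2025-12-18_16:30:10_frequently" "$(date -u +%s)"
```

Comparing the result with the expected snapshot interval gives a quick indication of whether the effective RPO is being met.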
|
||||
|
||||
#### Replication Service Logs
|
||||
|
||||
Example logs during successful execution on master node (elitedesk):
|
||||
|
||||
```bash
|
||||
root@elitedesk:~# journalctl -u zfs-nfs-replica.service --since "1 hour ago"
|
||||
Dec 18 17:44:35 elitedesk systemd[1]: Starting zfs-nfs-replica.service - ZFS NFS HA Replication Service...
|
||||
Dec 18 17:44:35 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:35] [info] [global] ========================================
|
||||
Dec 18 17:44:35 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:35] [info] [global] Starting script version 2.0.1
|
||||
Dec 18 17:44:35 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:35] [info] [global] Node: elitedesk
|
||||
Dec 18 17:44:35 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:35] [info] [global] ========================================
|
||||
Dec 18 17:44:35 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:35] [info] [global] Checking updates from https://forgejo.tellserv.fr
|
||||
Dec 18 17:44:35 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:35] [info] [global] ✓ Script up to date (version 2.0.1)
|
||||
Dec 18 17:44:35 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:35] [info] [global] Remote node configured: acemagician (192.168.100.11)
|
||||
Dec 18 17:44:35 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:35] [info] [global] Configured pools: zpool1 zpool2
|
||||
Dec 18 17:44:35 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:35] [info] [global] Verification #1/3 of LXC 103 status
|
||||
Dec 18 17:44:37 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:37] [info] [global] Verification #1/3 successful: LXC 103 is active on this node
|
||||
Dec 18 17:44:39 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:39] [info] [global] Verification #2/3 of LXC 103 status
|
||||
Dec 18 17:44:41 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:41] [info] [global] Verification #2/3 successful: LXC 103 is active on this node
|
||||
Dec 18 17:44:43 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:43] [info] [global] Verification #3/3 of LXC 103 status
|
||||
Dec 18 17:44:45 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:45] [info] [global] Verification #3/3 successful: LXC 103 is active on this node
|
||||
Dec 18 17:44:45 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:45] [info] [global] ✓ Triple verification successful: LXC 103 is on this node
|
||||
Dec 18 17:44:45 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:45] [info] [global] Configuring Sanoid in ACTIVE mode (autosnap=yes)
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [global] SSH connection to acemagician (192.168.100.11)
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [global] ========================================
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [global] Starting replication of 2 pool(s)
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [global] ========================================
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] ========================================
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] Starting pool replication: zpool1
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] ========================================
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] Lock acquired for zpool1
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] Starting recursive replication: zpool1
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] Checking common snapshots between master and standby
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] ✓ 209 common snapshot(s) found
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] Mode: Incremental replication (most recent common snapshot)
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] Datasets to replicate:
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] - zpool1/data-nfs-share
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] - zpool1/pbs-backups
|
||||
Dec 18 17:44:46 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:46] [info] [zpool1] === Replicating zpool1/data-nfs-share (recursive) ===
|
||||
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3534180]: NEWEST SNAPSHOT: autosnap_2025-12-18_16:30:10_frequently
|
||||
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3534180]: INFO: no snapshots on source newer than autosnap_2025-12-18_16:30:10_frequently on target. Nothing to do.
|
||||
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:47] [info] [zpool1] ✓ zpool1/data-nfs-share replicated successfully
|
||||
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:47] [info] [zpool1] === Replicating zpool1/pbs-backups (recursive) ===
|
||||
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3534221]: NEWEST SNAPSHOT: autosnap_2025-12-18_16:30:10_frequently
|
||||
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3534221]: INFO: no snapshots on source newer than autosnap_2025-12-18_16:30:10_frequently on target. Nothing to do.
|
||||
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:47] [info] [zpool1] ✓ zpool1/pbs-backups replicated successfully
|
||||
Dec 18 17:44:47 elitedesk zfs-nfs-replica[3533956]: [2025-12-18 17:44:47] [info] [zpool1] Number of datasets processed: 2
|
||||
```
|
||||
|
||||
Key points visible in logs:
|
||||
- **Triple verification** that LXC 103 is on local node before any replication
|
||||
- **Automatic configuration** of Sanoid in ACTIVE mode (autosnap=yes)
|
||||
- **Incremental replication** based on 209 common snapshots
|
||||
- **No transfer needed**: datasets are already synchronized (newest snapshot taken at 16:30, replication run at 17:44)
|
||||
- **Recursive processing** of all child datasets
|
||||
|
||||
### Restore from Snapshot
|
||||
|
||||
```bash
# List available snapshots
zfs list -t snapshot zpool1/data-nfs-share

# Roll back to a specific snapshot.
# Without -r this only works for the most recent snapshot; with -r,
# ZFS destroys every snapshot newer than the target.
zfs rollback zpool1/data-nfs-share@autosnap_2025-12-18_12:00:00_hourly

# Or clone the snapshot for inspection without touching the live dataset
zfs clone zpool1/data-nfs-share@autosnap_2025-12-18_12:00:00_hourly \
    zpool1/data-nfs-share-restore
```
|
||||
|
||||
### Script Update
|
||||
|
||||
The script includes auto-update functionality:
|
||||
|
||||
```bash
|
||||
# Script automatically checks for updates
|
||||
# Force update check
|
||||
/usr/local/bin/zfs-nfs-replica.sh --check-update
|
||||
```
|
||||
|
||||
## Limitations and Considerations
|
||||
|
||||
### 10-Minute RPO
|
||||
|
||||
Unlike Linstor DRBD which offers near-zero RPO, ZFS replication every 10 minutes means that in case of master node failure, **changes from the last 10 minutes** could be lost.
|
||||
|
||||
For cold data (media, files), this is acceptable. For critical data requiring RPO ~0, Linstor DRBD remains the appropriate solution.
|
||||
|
||||
### ~60 Second Downtime During Failover
|
||||
|
||||
Automatic LXC failover takes approximately **60 seconds**. During this time, the NFS server is inaccessible.
|
||||
|
||||
NFS clients will see their I/O operations blocked, then automatically resume once the server is available again (thanks to NFS retry mechanisms).
|
||||
|
||||
### Unidirectional Replication
|
||||
|
||||
At any time T, replication always occurs **from master to standby**. There is no simultaneous bidirectional replication.
|
||||
|
||||
If modifications are made on the standby (which shouldn't happen in normal use), they will be **overwritten** during the next replication.
|
||||
|
||||
### Network Dependency
|
||||
|
||||
Replication requires network connectivity between nodes. In case of network partition (split-brain), each node could believe itself to be master.
|
||||
|
||||
The script implements checks to minimize this risk, but in a prolonged split-brain scenario, manual intervention may be necessary.
|
||||
|
||||
## Conclusion
|
||||
|
||||
The **hybrid storage** architecture combining Linstor DRBD and replicated ZFS offers the best of both worlds:
|
||||
|
||||
- **Linstor DRBD** for VM/LXC: synchronous replication, live migration, RPO ~0
|
||||
- **Replicated ZFS** for cold data: large capacity, excellent integrity, minimal overhead
|
||||
|
||||
The highly available NFS server, with its **rootfs on DRBD** and **automatic ZFS replication**, ensures:
|
||||
- Failover time of **~60 seconds** in case of failure
|
||||
- Automatic adaptation to Proxmox HA failover
|
||||
- **Maximum data loss of 10 minutes** (RPO)
|
||||
- No manual intervention required during a normal failover
|
||||
|
||||
This solution is **perfectly suited** for a homelab requiring high availability for a cold data NFS server, while preserving resources (CPU, RAM, network) for critical services.
|
||||
|
||||
## Resources
|
||||
|
||||
- [zfs-sync-nfs-ha Script](https://forgejo.tellserv.fr/Tellsanguis/zfs-sync-nfs-ha)
|
||||
- [Sanoid Documentation](https://github.com/jimsalterjrs/sanoid)
|
||||
- [Article: Distributed Storage on Proxmox](/blog/stockage-distribue-proxmox-ha)
|
||||
- [Proxmox HA Manager](https://pve.proxmox.com/wiki/High_Availability)
|
||||