We will enable CephFS mirroring on the primary cluster and add a secondary (destination) cluster as a peer. Resources on the primary cluster are synchronized to the secondary at scheduled times by taking a snapshot of the primary filesystem. The result is a set of regular backups that can be restored whenever needed.
Once a peer is successfully added, any volume created on the primary is synchronized and backed up on the secondary according to the configured schedule.
Based on your data, you can then configure a backup plan. A backup schedule should incrementally back up the filesystem's resources. Additionally, we advise checking the integrity of the backups periodically.
You need at least two Rook Ceph clusters with at least 3 nodes each (for testing, you can also use the Rook Ceph test cluster manifest). The first cluster is used as the source, and the second cluster is used for mirroring CephFS data from the first one.
Make sure that both of your Rook Ceph clusters can reach each other. One recommended way to get networking set up correctly is to enable host networking before you deploy the cluster.
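As a minimal sketch, host networking can be enabled in the CephCluster manifest (the names below are examples; adjust them to your deployment):

apiVersion: ceph.rook.io/v1
kind: CephCluster
metadata:
  name: rook-ceph
  namespace: rook-ceph
spec:
  # Use the host's network namespace so the Ceph daemons are reachable
  # from the other cluster without cross-cluster pod routing.
  network:
    provider: host
  # ... remaining cluster settings unchanged ...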
apiVersion: ceph.rook.io/v1
kind: CephFilesystem
metadata:
  name: myfs
  namespace: rook-ceph
spec:
  metadataPool:
    failureDomain: host
    replicated:
      size: 3
  dataPools:
    - name: replicated
      failureDomain: host
      replicated:
        size: 3
  preserveFilesystemOnDelete: true
  metadataServer:
    activeCount: 1
    activeStandby: true
  mirroring:
    enabled: true
    # list of Kubernetes Secrets containing the peer token
    # for more details see: https://docs.ceph.com/en/latest/dev/cephfs-mirroring/#bootstrap-peers
    # Add the secret name if it already exists else specify the empty list here.
    peers:
      secretNames:
      #- secondary-cluster-peer
    # specify the schedule(s) on which snapshots should be taken
    # see the official syntax here https://docs.ceph.com/en/latest/cephfs/snap-schedule/#add-and-remove-schedules
    snapshotSchedules:
      - path: /
        interval: 24h # daily snapshots
        # The startTime should be mentioned in the format YYYY-MM-DDTHH:MM:SS
        # If startTime is not specified, then by default the start time is considered as midnight UTC.
        # see usage here https://docs.ceph.com/en/latest/cephfs/snap-schedule/#usage
        # startTime: 2022-07-15T11:55:00
    # manage retention policies
    # see syntax duration here https://docs.ceph.com/en/latest/cephfs/snap-schedule/#add-and-remove-retention-policies
    snapshotRetention:
      - path: /
        duration: "h 24"
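Assuming the manifest above is saved as filesystem-mirror.yaml (the filename is arbitrary), apply it and confirm that the operator starts a cephfs-mirror daemon:

kubectl apply -f filesystem-mirror.yaml
# Once mirroring is enabled, Rook should deploy a cephfs-mirror daemon:
kubectl -n rook-ceph get pod -l app=rook-ceph-fs-mirror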
Note: If you are using a single-node cluster for testing, be sure to change the replicated size to 1.
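To wire the two clusters together, a bootstrap peer token created on the secondary cluster has to be stored as a Secret on the primary and referenced in peers.secretNames. A rough sketch follows; the Secret names are examples, and the exact status field may differ between Rook versions:

# On the secondary cluster: find the Secret holding the bootstrap peer token
kubectl -n rook-ceph get cephfilesystem myfs \
  -o jsonpath='{.status.info.fsMirrorBootstrapPeerSecretName}'
# Read the token from that Secret
kubectl -n rook-ceph get secret <secret-name-from-above> \
  -o jsonpath='{.data.token}' | base64 -d

# On the primary cluster: store the token in the Secret referenced
# by peers.secretNames in the CephFilesystem manifest above
kubectl -n rook-ceph create secret generic secondary-cluster-peer \
  --from-literal=token=<token-from-above>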
We uploaded a 10 GB test data file to this filesystem and recorded its checksum on the primary cluster; we will use this checksum later to verify the integrity of the file on the secondary cluster.
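For example, from a pod that mounts the filesystem (the pod and file names are hypothetical):

kubectl exec -it test-pod -- sha256sum /mnt/cephfs/testdata.bin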
We can see that the snapshot sync count has been incremented, and a Persistent Volume has been created on the secondary cluster.
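One way to watch these counters is from the toolbox on the primary cluster, via the cephfs-mirror admin socket (the socket name, filesystem ID, and peer UUID below are placeholders):

# List the filesystems and peers handled by the mirror daemon
ceph fs snapshot mirror daemon status
# Query per-peer counters such as snaps_synced from inside the
# cephfs-mirror pod, using its admin socket
ceph --admin-daemon /var/run/ceph/ceph-client.cephfs-mirror.<id>.asok \
  fs mirror peer status myfs@<filesystem-id> <peer-uuid>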
Now, on the secondary (destination) cluster, we mount the volume and check the file's checksum to verify that the snapshot taken after the file was written has been synchronized.
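A quick check, again with hypothetical pod and path names, is to compare the checksum against the one recorded on the primary:

kubectl exec -it verify-pod -- sha256sum /mnt/cephfs/testdata.bin
# The output should match the checksum recorded on the primary cluster.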
In case of a failure (disaster), since CephFS mirroring is one-way, you need to remove the existing peer and reverse the roles: the destination cluster becomes the new primary, and the old primary becomes the destination, so that the snapshots can be synchronized back. This sounds tedious, but it is a simple process that can be carried out at restoration time.
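A sketch of the role reversal, assuming both clusters still run the CephFilesystem named myfs (Secret names are examples):

# On the old primary (cluster1): drop the peer by removing the Secret
# reference from peers.secretNames in the CephFilesystem spec
kubectl -n rook-ceph edit cephfilesystem myfs

# On the new primary (cluster2): enable mirroring in its CephFilesystem
# spec, then reference a bootstrap peer token obtained from cluster1
kubectl -n rook-ceph create secret generic primary-cluster-peer \
  --from-literal=token=<token-from-cluster1>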
After the peer is configured, the snapshots are synchronized back to the previous primary (cluster1), and you should be able to see Persistent Volumes created on cluster1.
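For example, listing the Persistent Volumes on cluster1 (the names will vary):

kubectl get pv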
Once you are sure all the resources are synchronized back to cluster1 (the current secondary), you can remove the peer, make cluster1 the primary again, and peer cluster2 back to it to resume backups. Before performing the recovery, ensure that the primary cluster to which you are restoring backups is in a healthy state.
We ran Rook Ceph on hardware with 16 GB of RAM on each of the 3 nodes. The result was an effortless, successful synchronization of the CephFS Persistent Volumes with the sample data's integrity intact; it took around 89 seconds for a snapshot to be synchronized.
Note: The results may vary depending on the hardware used and network bandwidth of the cluster, along with other factors.