Saturday, March 8, 2014

Compute Engine Persistent Disk Backups using Snapshots

Persistent Disks in Google Compute Engine allow you to take point-in-time snapshots of your disks. Once a snapshot is taken, it’s globally available and can be restored to any zone, even if the zone where the disk was originated is hit by a meteor. On top of that, it’s easy to create and manage snapshots. These two features make snapshots very appealing as a backup solution. However, there are some details you need to take into account in order to use snapshots safely for backup.



First, let’s take a peek on how snapshots work. A snapshot is an exact copy of your disk at a point in time. A snapshot can be taken at the same time reads and writes are happening to the disk. In physical terms, it’s like stopping the disk head while the disk is in use and copying all disk sectors out, except that, thanks to Persistent Disk design, it takes only a fraction of a second to create the snapshot. The rest of the time is spent uploading the bits to Google Cloud Storage and there is no performance impact to your disks.



Snapshots are guaranteed to be consistent at the block level (block size=4KiB). This means that snapshots never contain a partially written block (no torn writes). However, if application is making a write operation that spans multiple blocks, it’s possible that only the first portion of the write operation is stored in the snapshot. In addition, writes made to the file system cache will not be present in the snapshot, meaning a snapshot is equivalent to a disk that underwent an unclean shutdown. While most file systems and server applications can handle this just fine (due to journaled file systems, redo logs, etc.), it may require time-consuming steps to restore the disk back to a consistent state. Certain applications may not be able to handle this situation gracefully.



There are a few simple steps you can take to ensure your snapshot is consistent. First, we strongly recommend using mounted volumes for application data (not boot disk). Among other things, it’s easier and safer to take snapshots in a running instance. These are the steps you need to follow before snapshots are taken:

  1. Pause IOs coming from the application if possible. Certain applications have explicit commands to pause and resume IOs, e.g. Oracle, MySQL. If your application doesn’t support pause/resume, make sure that your application can survive unclean shutdowns (e.g. MongoDB).

  2. Flush the file system cache (sync), and

  3. Pause IOs coming from file system. You can use fsfreeze command for Ext3/4, ReiserFS, JFS, and XFS.

Once there are no more IOs being sent to disk, it’s safe to start the snapshot. Note that IOs need to be paused for just the fraction of second that it takes to create the snapshot. IO operations can then be resumed while snapshot data is copied to Cloud Storage in the background.



Here is an example where we snapshot the disk “my-disk” using gcutil tool:

> # freeze IOs in application
> sudo sync
> sudo fsfreeze -f /mnt/my-disk
> gcutil addsnapshot --source_disk=my-disk my-snapshot --nosynchronous_mode


After issuing those commands, we wait until the snapshot operation transitions to either UPLOADING or READY state:

> gcutil getsnapshot my-snapshot
+----------------------+-------------------------------+
| name | my-snapshot |
| description | |
| creation-time | 2013-09-30T19:34:24.354-07:00 |
| status | UPLOADING |
| disk-size-gb | 10 |
| storage-bytes | 3802018 |
| storage-bytes-status | UP_TO_DATE |
| source-disk | disks/my-disk |
+----------------------+-------------------------------+


We can now safely resume IOs to the disk while the upload process completes in the background:

> sudo fsfreeze -u /mnt/my-disk
> # resume IOs in application


To restore the snapshot, wait until snapshot operation completes, and execute:

> gcutil adddisk --source_snapshot=my-snapshot my-disk-copy


Mount your disk and verify that it is working as expected. I can’t stress enough how important it is to test your backups regularly and to ensure your backup procedure is correct.



Once your application data is successfully backed up, you may still want to backup the root volume to save machine configuration and tools that you have installed. It’s not safe to use fsfreeze against your root volume because any command that requires write access to disk will be blocked. For example, logging on to a machine requires write access to disk, therefore you might get locked out if your ssh session disconnects while root is frozen. The safest way to snapshot root volume is to shut down your instance and then take the snapshot. If that is not possible, reduce the number of writes to root as much as possible to reduce recovery time, and take the snapshot. Your journaled file system will be able to recover safely after your disk is restored.



-Posted by Fabricio Voznika, Software Engineer

Related Posts by Categories

0 comments:

Post a Comment