Sometimes, there is a VM disk image whose contents you want to manipulate
without booting the VM. For raw images, that process is usually fairly simple,
because most Linux systems ship with tools for the job, e.g.:
- dd to just copy data to and from given offsets,
- parted to manipulate the partition table,
- kpartx to present all partitions as block devices,
- mount to access filesystems’ contents.
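What makes raw images so easy to work with is that byte N of the file is byte N of the virtual disk. As a minimal sketch of that idea (the helper names and the demo image are made up for illustration), the following reads the two-byte boot signature that ends the first sector of an MBR or GPT-partitioned disk, much like dd would with a skip offset:

```python
import os
import tempfile

SECTOR_SIZE = 512  # assumed logical sector size


def read_at(image_path, offset, length):
    """Return `length` bytes starting at byte `offset` of a raw image."""
    with open(image_path, "rb") as image:
        image.seek(offset)
        return image.read(length)


def has_boot_signature(image_path):
    """Check for the 0x55 0xAA signature at the end of the first sector."""
    return read_at(image_path, SECTOR_SIZE - 2, 2) == b"\x55\xaa"


def make_demo_image(path, size=1024 * 1024):
    """Create a sparse, hypothetical raw image that only has the signature."""
    with open(path, "wb") as image:
        image.truncate(size)
        image.seek(SECTOR_SIZE - 2)
        image.write(b"\x55\xaa")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.raw")
        make_demo_image(demo)
        print(has_boot_signature(demo))
```

The same read on a non-raw image would return format metadata instead of disk contents, which is the problem the rest of this post deals with.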
Sadly, but naturally, such tools only work for raw images, and not for images
in other formats such as QEMU’s qcow2. To access such an image’s content, the
format has to be translated so that a raw image is presented, for example by:
- Exporting the image file with qemu-nbd -c as an NBD block device file,
- Converting between image formats using qemu-img convert,
- Accessing the image from a guest, where it appears as a normal block device.
Unfortunately, none of these methods is perfect: qemu-nbd -c generally requires
root rights, converting to a temporary raw copy requires additional disk space
and the conversion process takes time, and accessing the image from a guest is
quite cumbersome in general (and also specifically something that we set out to
avoid in the first sentence of this blog post).
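To illustrate why raw-image tools cannot simply be pointed at a qcow2 file, here is a small sketch that parses the first 32 bytes of a qcow2 header. The field layout follows the qcow2 specification as I understand it (all fields big-endian); the function name is made up, and real code should rather ask qemu-img info.

```python
import struct

QCOW2_MAGIC = b"QFI\xfb"

# Big-endian fields at the start of a qcow2 file:
# magic, version, backing_file_offset, backing_file_size,
# cluster_bits, size (virtual disk size in bytes)
HEADER_PREFIX = struct.Struct(">4sIQIIQ")


def parse_qcow2_prefix(data):
    """Return (version, cluster_bits, virtual_size), or None if not qcow2."""
    if len(data) < HEADER_PREFIX.size:
        return None
    magic, version, _bf_offset, _bf_size, cluster_bits, size = (
        HEADER_PREFIX.unpack_from(data)
    )
    if magic != QCOW2_MAGIC:
        return None
    return version, cluster_bits, size


if __name__ == "__main__":
    # A hand-built header prefix for a 20 GiB image with 64 KiB clusters
    sample = HEADER_PREFIX.pack(QCOW2_MAGIC, 3, 0, 0, 16, 20 * 1024**3)
    print(parse_qcow2_prefix(sample))
    # A raw image starts with disk contents instead, so this yields None
    print(parse_qcow2_prefix(bytes(512)))
```

Where a raw image has the first disk sector, a qcow2 file has this header, followed by tables that map guest offsets to clusters in the file. A tool like parted has no way of knowing that.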
As of QEMU 6.0, there is another method, namely FUSE block exports.
Conceptually, these are rather similar to using qemu-nbd -c, but they do not
require root rights.
Note: FUSE block exports are a feature that can be enabled or disabled
during the build process with --enable-fuse or --disable-fuse, respectively;
omitting either configure option will enable the feature if and only if libfuse3
is present. It is possible that the QEMU build you are using does not have FUSE
block export support, because it was not compiled in.
FUSE (Filesystem in Userspace) is a technology to let userspace processes
provide filesystem drivers. For example, sshfs is a program that allows
mounting remote directories from a machine accessible via SSH.
QEMU can use FUSE to make a virtual block device appear as a normal file on the
host, so that tools like kpartx can interact with it regardless of the image
format.
File mounts
A perhaps little-known fact is that, on Linux, filesystems do not need to have
a root directory; they only need to have a root node. A filesystem that only
provides a single regular file is perfectly valid.
Conceptually, every filesystem is a tree, and mounting works by replacing one
subtree of the global VFS tree by the mounted filesystem’s tree. Normally, a
filesystem’s root node is a directory, like in the following example:
Fig. 1: Mounting a regular filesystem with a directory as its root node
Here, the directory /foo and its content (the files /foo/a and /foo/b) are
shadowed by the new filesystem (showing /foo/x and /foo/y).
Note that a filesystem’s root node generally has no name. After mounting, the
filesystem’s root directory’s name is determined by the original name of the
mount point.
Because a tree does not need to have multiple nodes but may consist of just a
single leaf, a filesystem with a file for its root node works just as well,
though:
Fig. 2: Mounting a filesystem with a regular (unnamed) file as its root node
Here, FS B only consists of a single node, a regular file with no name. (As
above, a filesystem’s root node is generally unnamed.) Consequently, the mount
point for it must also be a regular file (/foo/a in our example), and just
like before, the content of /foo/a is shadowed, and when opening it, one will
instead see the contents of FS B’s unnamed root node.
QEMU block exports
QEMU allows exporting block nodes via various protocols (as of 6.0: NBD,
vhost-user, FUSE). A block node is an element of QEMU’s block graph (see e.g.
Managing the New Block Layer,
a talk given at KVM Forum 2017), which can for example be attached to guest
devices. Here is a very simple example:
Fig. 3: A simple block graph for attaching a qcow2 image to a virtio-blk guest device
This is the simplest example for a block graph that connects a virtio-blk
guest device to a qcow2 image file. The file block driver, instanced in the
form of a block node named prot-node, accesses the actual file and provides
the node above it access to the raw content. This node above, named fmt-node,
is handled by the qcow2 block driver, which is capable of interpreting the
qcow2 format. Parents of this node will therefore see the actual content of the
virtual disk that is represented by the qcow2 image. There is only one parent
here, which is the virtio-blk guest device, which will thus see the virtual
disk.
The command line to achieve the above could look something like this:
$ qemu-system-x86_64 \
-blockdev node-name=prot-node,driver=file,filename=$image_path \
-blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
-device virtio-blk,drive=fmt-node
Besides attaching guest devices to block nodes, you can also export them for
users outside of QEMU, for example via NBD. If you have a QMP channel open for
the QEMU instance above, you could do this:
{
  "execute": "nbd-server-start",
  "arguments": {
    "addr": {
      "type": "inet",
      "data": {
        "host": "localhost",
        "port": "10809"
      }
    }
  }
}
{
  "execute": "block-export-add",
  "arguments": {
    "type": "nbd",
    "id": "fmt-node-export",
    "node-name": "fmt-node",
    "name": "guest-disk"
  }
}
This opens an NBD server on localhost:10809, which exports fmt-node (under
the NBD export name guest-disk). The block graph looks as follows:
Fig. 4: Block graph extended by an NBD server
NBD clients connecting to this server will see the raw disk as seen by the
guest – we have exported the guest disk:
$ qemu-img info nbd://localhost/guest-disk
image: nbd://localhost:10809/guest-disk
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: unavailable
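The two QMP commands used for the NBD export can also be sent from a script. The following is only a sketch: it assumes QEMU was started with a QMP monitor on a UNIX socket (the socket path is up to you), and the helper names are made up. Note that a QMP session starts with a greeting from the server, and the client has to issue qmp_capabilities before any other command.

```python
import json
import socket


def nbd_export_commands(node_name, export_name, host="localhost", port=10809):
    """Build the QMP commands for an NBD export as Python dicts."""
    return [
        {"execute": "qmp_capabilities"},
        {
            "execute": "nbd-server-start",
            "arguments": {
                "addr": {
                    "type": "inet",
                    # QMP expects the port as a string here
                    "data": {"host": host, "port": str(port)},
                }
            },
        },
        {
            "execute": "block-export-add",
            "arguments": {
                "type": "nbd",
                "id": f"{node_name}-export",
                "node-name": node_name,
                "name": export_name,
            },
        },
    ]


def run_qmp(socket_path, commands):
    """Send commands over a QMP UNIX socket and return the replies."""
    replies = []
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        stream = sock.makefile("rw", encoding="utf-8")
        json.loads(stream.readline())  # consume the server greeting
        for command in commands:
            stream.write(json.dumps(command) + "\n")
            stream.flush()
            reply = json.loads(stream.readline())
            while "event" in reply:  # skip asynchronous events
                reply = json.loads(stream.readline())
            replies.append(reply)
    return replies


if __name__ == "__main__":
    for command in nbd_export_commands("fmt-node", "guest-disk"):
        print(json.dumps(command))
```

With a hypothetical socket at /tmp/qmp.sock, calling run_qmp("/tmp/qmp.sock", nbd_export_commands("fmt-node", "guest-disk")) should set up the same export as above.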
QEMU storage daemon
If you are not running a guest, and so do not need guest devices, but all you
want is to use the QEMU block layer (for example to interpret the qcow2 format)
and export nodes from the block graph, then you can use the more lightweight
QEMU storage daemon instead of a full-blown QEMU process:
$ qemu-storage-daemon \
--blockdev node-name=prot-node,driver=file,filename=$image_path \
--blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
--nbd-server addr.type=inet,addr.host=localhost,addr.port=10809 \
--export type=nbd,id=fmt-node-export,node-name=fmt-node,name=guest-disk
This creates the following block graph:
Fig. 5: Exporting a qcow2 image over NBD
FUSE block exports
Besides NBD exports, QEMU also supports vhost-user and FUSE exports. With a
FUSE block export, QEMU acts as a FUSE driver that provides a filesystem
consisting of only a single node, namely a regular file that has the raw contents of the
exported block node. QEMU will automatically mount this filesystem on a given
existing regular file (which acts as the mount point, as described in the
“File mounts” section).
Thus, FUSE exports can be used like this:
$ touch mount-point
$ qemu-storage-daemon \
--blockdev node-name=prot-node,driver=file,filename=$image_path \
--blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
--export type=fuse,id=fmt-node-export,node-name=fmt-node,mountpoint=mount-point
The mount point now appears as the raw VM disk that is stored in the qcow2
image:
$ qemu-img info mount-point
image: mount-point
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: 196 KiB
And mount tells us that this is indeed its own filesystem:
$ mount | grep mount-point
/dev/fuse on /tmp/mount-point type fuse (rw,nosuid,nodev,relatime,user_id=1000,
group_id=100,default_permissions,allow_other,max_read=67108864)
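The same check can be scripted by looking at the kernel’s mount table. Here is a rough sketch that parses text in the layout of /proc/self/mounts (source, target, filesystem type, options, and two numeric fields). The function name and the sample text are made up for illustration; the sample mirrors the mount output above.

```python
def find_mount(mounts_text, path):
    """Return (source, fstype, options) for a mount at `path`, else None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        source, target, fstype, options = fields[:4]
        # Spaces in mount points are escaped as \040 in this file
        target = target.replace("\\040", " ")
        if target == path:
            # Keep going: a later mount on the same path shadows earlier ones
            found = (source, fstype, options.split(","))
    return found


SAMPLE_MOUNTS = (
    "tmpfs /tmp tmpfs rw,nosuid,nodev 0 0\n"
    "/dev/fuse /tmp/mount-point fuse rw,nosuid,nodev,relatime,user_id=1000,"
    "group_id=100,default_permissions,allow_other,max_read=67108864 0 0\n"
)

if __name__ == "__main__":
    print(find_mount(SAMPLE_MOUNTS, "/tmp/mount-point"))
```

On a live system, you would pass the contents of /proc/self/mounts and the absolute path of the mount point instead of the sample text.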
The block graph looks like this:
Fig. 6: Exporting a qcow2 image over FUSE
Closing the storage daemon (e.g. with Ctrl-C) automatically unmounts the export,
turning the mount point back into an empty normal file:
$ mount | grep -c mount-point
0
$ qemu-img info mount-point
image: mount-point
file format: raw
virtual size: 0 B (0 bytes)
disk size: 0 B
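If you want to drive the storage daemon from a script instead of the shell, the invocation above could be wrapped roughly like this. Treat it as a sketch under a few assumptions: the helper names are made up, paths containing commas would need escaping for QEMU’s option syntax, and the context manager does not wait for the export to be fully set up before yielding.

```python
import signal
import subprocess
from contextlib import contextmanager


def storage_daemon_argv(image_path, mount_point, image_format="qcow2"):
    """Build the qemu-storage-daemon command line for a FUSE export."""
    return [
        "qemu-storage-daemon",
        "--blockdev",
        f"node-name=prot-node,driver=file,filename={image_path}",
        "--blockdev",
        f"node-name=fmt-node,driver={image_format},file=prot-node",
        "--export",
        "type=fuse,id=fmt-node-export,node-name=fmt-node,"
        f"mountpoint={mount_point}",
    ]


@contextmanager
def fuse_export(image_path, mount_point, image_format="qcow2"):
    """Run the storage daemon for the duration of a with-block."""
    daemon = subprocess.Popen(
        storage_daemon_argv(image_path, mount_point, image_format)
    )
    try:
        # Caveat: the export may not be mounted yet at this point
        yield daemon
    finally:
        daemon.send_signal(signal.SIGINT)  # same effect as Ctrl-C
        daemon.wait()


if __name__ == "__main__":
    print(" ".join(storage_daemon_argv("foo.qcow2", "mount-point")))
```

Sending SIGINT on exit relies on the behaviour described above: the daemon unmounts the export when it is closed, which turns the mount point back into an empty file.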
Mounting an image on itself
So far, we have seen what FUSE exports are, how they work, and how they can be
used. Now let’s add an interesting twist.
What happens to the old tree under a mount point?
Mounting a filesystem only shadows the mount point’s original content; it does
not remove it. The original content can no longer be looked up via its
(absolute) path, but it is still there, much like a file that has been unlinked
but is still open in some process. Here is an example:
First, create some file in some directory, and have some process keep it open:
$ mkdir foo
$ echo 'Is anyone there?' > foo/bar
$ irb
irb(main):001:0> f = File.open('foo/bar', 'r+')
=> #<File:foo/bar>
irb(main):002:0> ^Z
[1] + 35494 suspended irb
Next, mount something on the directory:
$ sudo mount -t tmpfs tmpfs foo
The file cannot be found anymore (because foo’s content is shadowed by the
mounted filesystem), but the process that kept it open can still read from it,
and write to it:
$ ls foo
$ cat foo/bar
cat: foo/bar: No such file or directory
$ fg
f.read
irb(main):002:0> f.read
=> "Is anyone there?\n"
irb(main):003:0> f.puts('Hello from the shadows!')
=> nil
irb(main):004:0> exit
$ ls foo
$ cat foo/bar
cat: foo/bar: No such file or directory
Unmounting the filesystem lets us see our file again, with its updated content:
$ sudo umount foo
$ ls foo
bar
$ cat foo/bar
Is anyone there?
Hello from the shadows!
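The comparison with unlinked-but-open files can be tried out without root rights. The following sketch (POSIX only, with a made-up function name) opens a file, unlinks it, and shows that the open handle still reads from and writes to the now nameless inode:

```python
import os
import tempfile


def shadow_demo():
    """Return (path_is_gone, first_read, final_content) for an unlinked file."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "bar")
        with open(path, "w") as new_file:
            new_file.write("Is anyone there?\n")

        handle = open(path, "r+")  # keep the inode open
        os.unlink(path)            # the name is gone now
        path_is_gone = not os.path.exists(path)

        first_read = handle.read()
        handle.write("Hello from the shadows!\n")
        handle.seek(0)
        final_content = handle.read()
        handle.close()
        return path_is_gone, first_read, final_content


if __name__ == "__main__":
    print(shadow_demo())
```

The difference from the mount case is that an unlinked file’s name never comes back, whereas unmounting makes the shadowed content reachable by path again.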
Letting a FUSE export shadow its image file
The same principle applies to file mounts: The original inode is shadowed (along
with its content), but it is still there for any process that opened it before
the mount occurred. Because QEMU (or the storage daemon) opens the image file
before mounting the FUSE export, you can therefore specify an image’s path as
the mount point for its corresponding export:
$ qemu-img create -f qcow2 foo.qcow2 20G
Formatting 'foo.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off
compression_type=zlib size=21474836480 lazy_refcounts=off refcount_bits=16
$ qemu-img info foo.qcow2
image: foo.qcow2
file format: qcow2
virtual size: 20 GiB (21474836480 bytes)
disk size: 196 KiB
cluster_size: 65536
Format specific information:
compat: 1.1
compression type: zlib
lazy refcounts: false
refcount bits: 16
corrupt: false
extended l2: false
$ qemu-storage-daemon --blockdev \
node-name=node0,driver=qcow2,file.driver=file,file.filename=foo.qcow2 \
--export type=fuse,id=node0-export,node-name=node0,mountpoint=foo.qcow2 &
[1] 40843
$ qemu-img info foo.qcow2
image: foo.qcow2
file format: raw
virtual size: 20 GiB (21474836480 bytes)
disk size: 196 KiB
$ kill %1
[1] + 40843 done qemu-storage-daemon --blockdev --export
In graph form, that looks like this:
Fig. 7: Exporting a qcow2 image via FUSE on its own path
QEMU (or the storage daemon in this case) keeps the original (qcow2) file open,
and so it keeps access to it, even after the mount. However, any other process
that opens the image by name (i.e. open("foo.qcow2")) will open the raw disk
image exported by QEMU. Therefore, it looks like the qcow2 image is in raw
format now.
qemu-fuse-disk-export.py
Because the QEMU storage daemon command line tends to become kind of long, I’ve
written a script to facilitate the process: qemu-fuse-disk-export.py
(direct download link).
This script automatically detects the image format, and its --daemonize option
allows safe use in scripts, where it is important that the process blocks until
the export is fully set up.
Using qemu-fuse-disk-export.py, the above example looks like this:
$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2
$ qemu-fuse-disk-export.py foo.qcow2 &
[1] 13339
All exports set up, ^C to revert
$ qemu-img info foo.qcow2 | grep 'file format'
file format: raw
$ kill -SIGINT %1
[1] + 13339 done qemu-fuse-disk-export.py foo.qcow2
$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2
Or, with --daemonize/-d:
$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2
$ qemu-fuse-disk-export.py -dp qfde.pid foo.qcow2
$ qemu-img info foo.qcow2 | grep 'file format'
file format: raw
$ kill -SIGINT $(cat qfde.pid)
$ qemu-img info foo.qcow2 | grep 'file format'
file format: qcow2
Bringing it all together
Now we know how to make disk images in any format understood by QEMU appear as
raw images. We can thus run any application on them that works with such raw
disk images:
$ qemu-fuse-disk-export.py \
-dp qfde.pid \
Arch-Linux-x86_64-basic-20210711.28787.qcow2
$ parted Arch-Linux-x86_64-basic-20210711.28787.qcow2 p
WARNING: You are not superuser. Watch out for permissions.
Model: (file)
Disk /tmp/Arch-Linux-x86_64-basic-20210711.28787.qcow2: 42.9GB
Sector size (logical/physical): 512B/512B
Partition Table: gpt
Disk Flags:
Number Start End Size File system Name Flags
1 1049kB 2097kB 1049kB bios_grub
2 2097kB 42.9GB 42.9GB btrfs
$ sudo kpartx -av Arch-Linux-x86_64-basic-20210711.28787.qcow2
add map loop0p1 (254:0): 0 2048 linear 7:0 2048
add map loop0p2 (254:1): 0 83881951 linear 7:0 4096
$ sudo mount /dev/mapper/loop0p2 /mnt/tmp
$ ls /mnt/tmp
bin boot dev etc home lib lib64 mnt opt proc root run sbin srv
swap sys tmp usr var
$ echo 'Hello, qcow2 image!' > /mnt/tmp/home/arch/hello
$ sudo umount /mnt/tmp
$ sudo kpartx -d Arch-Linux-x86_64-basic-20210711.28787.qcow2
loop deleted : /dev/loop0
$ kill -SIGINT $(cat qfde.pid)
After launching the image, we see this in the guest:
[arch@archlinux ~] cat hello
Hello, qcow2 image!
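As a cross-check, the mappings printed by kpartx line up with the partition table printed by parted. Device-mapper tables count in 512-byte sectors, so a little arithmetic converts one into the other. This is just an illustrative sketch; the regular expression only covers the output format shown in the session above.

```python
import re

SECTOR_SIZE = 512  # device-mapper tables always use 512-byte sectors

KPARTX_LINE = re.compile(
    r"add map (?P<name>\S+) \(\d+:\d+\): "
    r"\d+ (?P<sectors>\d+) linear \S+ (?P<offset>\d+)"
)


def partition_extent(line):
    """Return (name, start_byte, size_in_bytes) for one kpartx -av line."""
    match = KPARTX_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognized kpartx output: {line!r}")
    start = int(match["offset"]) * SECTOR_SIZE
    size = int(match["sectors"]) * SECTOR_SIZE
    return match["name"], start, size


if __name__ == "__main__":
    for line in (
        "add map loop0p1 (254:0): 0 2048 linear 7:0 2048",
        "add map loop0p2 (254:1): 0 83881951 linear 7:0 4096",
    ):
        print(partition_extent(line))
```

For loop0p2, this gives a start of 2097152 bytes and a size of 42947558912 bytes, which parted rounds to 2097kB and 42.9GB.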
A note on allow_other
In the example presented in the above section, we access the exported image with
a different user than the one who exported it (to be specific, we export it as a
normal user, and then access it as root). This does not work prior to QEMU 6.1:
$ qemu-fuse-disk-export.py -dp qfde.pid foo.qcow2
$ sudo stat foo.qcow2
stat: cannot statx 'foo.qcow2': Permission denied
QEMU 6.1 has introduced support for FUSE’s allow_other mount option. Without
that option, only the user who exported the image has access to it. By default,
if the system allows for non-root users to add allow_other to FUSE mount
options, QEMU will add it, and otherwise omit it. It does so by simply
attempting to mount the export with allow_other first, and if that fails, it
will try again without. (You can also force the behavior with the
allow_other=(on|off|auto) export parameter.)
Non-root users can pass allow_other if and only if /etc/fuse.conf contains
the user_allow_other option.
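To find out in advance whether your system permits this, you can look for that option in the configuration file. The sketch below makes some assumptions: the function name is made up, and it treats everything after a # as a comment, which may be stricter or more lenient than what fusermount itself does.

```python
def user_allow_other_enabled(conf_text):
    """Return True if the text contains an uncommented user_allow_other."""
    for raw_line in conf_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if line == "user_allow_other":
            return True
    return False


if __name__ == "__main__":
    try:
        with open("/etc/fuse.conf") as conf:
            print(user_allow_other_enabled(conf.read()))
    except FileNotFoundError:
        # Without the file, the option cannot be set
        print(False)
```

If this reports False, exports made by a non-root user will be accessible to that user only, and the sudo commands from the previous section will fail with a permission error.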
Conclusion
As shown in this blog post, FUSE block exports are a relatively simple way to
access images in any format understood by QEMU as if they were raw images.
Any tool that can manipulate raw disk images can thus manipulate images in any
format, simply by having the QEMU storage daemon provide a translation layer.
By mounting the FUSE export on the original image path, this translation layer
will effectively be invisible, and the original image will look like it is in
raw format, so it can directly be accessed by those tools.
The current main disadvantage of FUSE exports is that they offer relatively bad
performance. That should be fine as long as your use case is just light
manipulation of some VM images, like manually modifying some files on them.
However, we have not yet really tried to optimize performance, so if more
serious use cases appear that require better performance, we can look into it.