Exporting block devices as raw image files with FUSE



  • Sometimes, there is a VM disk image whose contents you want to manipulate without booting the VM. For raw images, that process is usually fairly simple, because most Linux systems bring tools for the job, e.g.:

    • dd to just copy data to and from given offsets,
    • parted to manipulate the partition table,
    • kpartx to present all partitions as block devices,
    • mount to access filesystems’ contents.

    Sadly, but naturally, such tools only work for raw images, and not for images e.g. in QEMU’s qcow2 format. To access such an image’s content, the format has to be translated to create a raw image, for example by:

    • Exporting the image file with qemu-nbd -c as an NBD block device file,
    • Converting between image formats using qemu-img convert,
    • Accessing the image from a guest, where it appears as a normal block device.

    Unfortunately, none of these methods is perfect: qemu-nbd -c generally requires root rights, converting to a temporary raw copy requires additional disk space and the conversion process takes time, and accessing the image from a guest is just quite cumbersome in general (and also specifically something that we set out to avoid in the first sentence of this blog post).

    As of QEMU 6.0, there is another method, namely FUSE block exports. Conceptually, these are rather similar to using qemu-nbd -c, but they do not require root rights.

    Note: FUSE block exports are a feature that can be enabled or disabled during the build process with --enable-fuse or --disable-fuse, respectively; omitting either configure option will enable the feature if and only if libfuse3 is present. It is possible that the QEMU build you are using does not have FUSE block export support, because it was not compiled in.

    FUSE (Filesystem in Userspace) is a technology to let userspace processes provide filesystem drivers. For example, sshfs is a program that allows mounting remote directories from a machine accessible via SSH.

    QEMU can use FUSE to make a virtual block device appear as a normal file on the host, so that tools like kpartx can interact with it regardless of the image format.

    Background information

    File mounts

    A perhaps little-known fact is that, on Linux, filesystems do not need to have a root directory, they only need to have a root node. A filesystem that only provides a single regular file is perfectly valid.

    Conceptually, every filesystem is a tree, and mounting works by replacing one subtree of the global VFS tree by the mounted filesystem’s tree. Normally, a filesystem’s root node is a directory, like in the following example:

    Regular filesystem: Root directory is mounted to a directory mount point
    Fig. 1: Mounting a regular filesystem with a directory as its root node

    Here, the directory /foo and its content (the files /foo/a and /foo/b) are shadowed by the new filesystem (showing /foo/x and /foo/y).

    Note that a filesystem’s root node generally has no name. After mounting, the filesystem’s root directory’s name is determined by the original name of the mount point.

    Because a tree does not need to have multiple nodes but may consist of just a single leaf, a filesystem with a file for its root node works just as well, though:

    Mounting a file root node to a regular file mount point
    Fig. 2: Mounting a filesystem with a regular (unnamed) file as its root node

    Here, FS B only consists of a single node, a regular file with no name. (As above, a filesystem’s root node is generally unnamed.) Consequently, the mount point for it must also be a regular file (/foo/a in our example), and just like before, the content of /foo/a is shadowed, and when opening it, one will instead see the contents of FS B’s unnamed root node.

    QEMU block exports

    QEMU allows exporting block nodes via various protocols (as of 6.0: NBD, vhost-user, FUSE). A block node is an element of QEMU’s block graph (see e.g. Managing the New Block Layer, a talk given at KVM Forum 2017), which can for example be attached to guest devices. Here is a very simple example:

    Block graph: image file <-> file node (label: prot-node) <-> qcow2 node (label: fmt-node) <-> virtio-blk guest device
    Fig. 3: A simple block graph for attaching a qcow2 image to a virtio-blk guest device

    This is the simplest example for a block graph that connects a virtio-blk guest device to a qcow2 image file. The file block driver, instanced in the form of a block node named prot-node, accesses the actual file and provides the node above it access to the raw content. This node above, named fmt-node, is handled by the qcow2 block driver, which is capable of interpreting the qcow2 format. Parents of this node will therefore see the actual content of the virtual disk that is represented by the qcow2 image. There is only one parent here, which is the virtio-blk guest device, which will thus see the virtual disk.

    The command line to achieve the above could look something like this:

    $ qemu-system-x86_64 \
        -blockdev node-name=prot-node,driver=file,filename=$image_path \
        -blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
        -device virtio-blk,drive=fmt-node
    

    Besides attaching guest devices to block nodes, you can also export them for users outside of qemu, for example via NBD. Say you have a QMP channel open for the QEMU instance above, then you could do this:

    {
        "execute": "nbd-server-start",
        "arguments": {
            "addr": {
                "type": "inet",
                "data": {
                    "host": "localhost",
                    "port": "10809"
                }
            }
        }
    }
    {
        "execute": "block-export-add",
        "arguments": {
            "type": "nbd",
            "id": "fmt-node-export",
            "node-name": "fmt-node",
            "name": "guest-disk"
        }
    }
    

    This opens an NBD server on localhost:10809, which exports fmt-node (under the NBD export name guest-disk). The block graph looks as follows:

    Same block graph as fig. 3, but with an NBD server attached to fmt-node
    Fig. 4: Block graph extended by an NBD server

    NBD clients connecting to this server will see the raw disk as seen by the guest – we have exported the guest disk:

    $ qemu-img info nbd://localhost/guest-disk
    image: nbd://localhost:10809/guest-disk
    file format: raw
    virtual size: 20 GiB (21474836480 bytes)
    disk size: unavailable
    

    QEMU storage daemon

    If you are not running a guest, and so do not need guest devices, but all you want is to use the QEMU block layer (for example to interpret the qcow2 format) and export nodes from the block graph, then you can use the more lightweight QEMU storage daemon instead of a full-blown QEMU process:

    $ qemu-storage-daemon \
        --blockdev node-name=prot-node,driver=file,filename=$image_path \
        --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
        --nbd-server addr.type=inet,addr.host=localhost,addr.port=10809 \
        --export type=nbd,id=fmt-node-export,node-name=fmt-node,name=guest-disk
    

    Which creates the following block graph:

    Block graph: image file <-> file node (label: prot-node) <-> qcow2 node (label: fmt-node) <-> NBD server
    Fig. 5: Exporting a qcow2 image over NBD

    FUSE block exports

    Besides NBD exports, QEMU also supports vhost-user and FUSE exports. FUSE block exports make QEMU become a FUSE driver that provides a filesystem that consists of only a single node, namely a regular file that has the raw contents of the exported block node. QEMU will automatically mount this filesystem on a given existing regular file (which acts as the mount point, as described in the “File mounts” section).

    Thus, FUSE exports can be used like this:

    $ touch mount-point
    
    $ qemu-storage-daemon \
      --blockdev node-name=prot-node,driver=file,filename=$image_path \
      --blockdev node-name=fmt-node,driver=qcow2,file=prot-node \
      --export type=fuse,id=fmt-node-export,node-name=fmt-node,mountpoint=mount-point
    

    The mount point now appears as the raw VM disk that is stored in the qcow2 image:

    $ qemu-img info mount-point
    image: mount-point
    file format: raw
    virtual size: 20 GiB (21474836480 bytes)
    disk size: 196 KiB
    

    And mount tells us that this is indeed its own filesystem:

    $ mount | grep mount-point
    /dev/fuse on /tmp/mount-point type fuse (rw,nosuid,nodev,relatime,user_id=1000,
    group_id=100,default_permissions,allow_other,max_read=67108864)
    

    The block graph looks like this:

    Block graph: image file <-> file node (label: prot-node) <-> qcow2 node (label: fmt-node) <-> FUSE server <-> exported file
    Fig. 6: Exporting a qcow2 image over FUSE

    Closing the storage daemon (e.g. with Ctrl-C) automatically unmounts the export, turning the mount point back into an empty normal file:

    $ mount | grep -c mount-point
    0
    
    $ qemu-img info mount-point
    image: mount-point
    file format: raw
    virtual size: 0 B (0 bytes)
    disk size: 0 B
    

    Mounting an image on itself

    So far, we have seen what FUSE exports are, how they work, and how they can be used. Now let’s add an interesting twist.

    What happens to the old tree under a mount point?

    Mounting a filesystem only shadows the mount point’s original content, it does not remove it. The original content can no longer be looked up via its (absolute) path, but it is still there, much like a file that has been unlinked but is still open in some process. Here is an example:

    First, create some file in some directory, and have some process keep it open:

    $ mkdir foo
    
    $ echo 'Is anyone there?' > foo/bar
    
    $ irb
    irb(main):001:0> f = File.open('foo/bar', 'r+')
    => #<File:foo/bar>
    irb(main):002:0> ^Z
    [1]  + 35494 suspended  irb
    

    Next, mount something on the directory:

    $ sudo mount -t tmpfs tmpfs foo
    

    The file cannot be found anymore (because foo’s content is shadowed by the mounted filesystem), but the process who kept it open can still read from it, and write to it:

    $ ls foo
    
    $ cat foo/bar
    cat: foo/bar: No such file or directory
    
    $ fg
    f.read
    irb(main):002:0> f.read
    => "Is anyone there?\n"
    irb(main):003:0> f.puts('Hello from the shadows!')
    => nil
    irb(main):004:0> exit
    
    $ ls foo
    
    $ cat foo/bar
    cat: foo/bar: No such file or directory
    

    Unmounting the filesystem lets us see our file again, with its updated content:

    $ sudo umount foo
    
    $ ls foo
    bar
    
    $ cat foo/bar
    Is anyone there?
    Hello from the shadows!
    

    Letting a FUSE export shadow its image file

    The same principle applies to file mounts: The original inode is shadowed (along with its content), but it is still there for any process that opened it before the mount occurred. Because QEMU (or the storage daemon) opens the image file before mounting the FUSE export, you can therefore specify an image’s path as the mount point for its corresponding export:

    $ qemu-img create -f qcow2 foo.qcow2 20G
    Formatting 'foo.qcow2', fmt=qcow2 cluster_size=65536 extended_l2=off
     compression_type=zlib size=21474836480 lazy_refcounts=off refcount_bits=16
    
    $ qemu-img info foo.qcow2
    image: foo.qcow2
    file format: qcow2
    virtual size: 20 GiB (21474836480 bytes)
    disk size: 196 KiB
    cluster_size: 65536
    Format specific information:
        compat: 1.1
        compression type: zlib
        lazy refcounts: false
        refcount bits: 16
        corrupt: false
        extended l2: false
    
    $ qemu-storage-daemon --blockdev \
       node-name=node0,driver=qcow2,file.driver=file,file.filename=foo.qcow2 \
       --export type=fuse,id=node0-export,node-name=node0,mountpoint=foo.qcow2 &
    [1] 40843
    
    $ qemu-img info foo.qcow2
    image: foo.qcow2
    file format: raw
    virtual size: 20 GiB (21474836480 bytes)
    disk size: 196 KiB
    
    $ kill %1
    [1]  + 40843 done       qemu-storage-daemon --blockdev  --export
    

    In graph form, that looks like this:

    Two graphs: First, foo.qcow2 is opened by QEMU; second, a FUSE server exports the raw disk under foo.qcow2, thus shadowing the original foo.qcow2
    Fig. 6: Exporting a qcow2 image via FUSE on its own path

    QEMU (or the storage daemon in this case) keeps the original (qcow2) file open, and so it keeps access to it, even after the mount. However, any other process that opens the image by name (i.e. open("foo.qcow2")) will open the raw disk image exported by QEMU. Therefore, it looks like the qcow2 image is in raw format now.

    qemu-fuse-disk-export.py

    Because the QEMU storage daemon command line tends to become kind of long, I’ve written a script to facilitate the process: qemu-fuse-disk-export.py (direct download link). This script automatically detects the image format, and its --daemonize option allows safe use in scripts, where it is important that the process blocks until the export is fully set up.

    Using qemu-fuse-disk-export.py, the above example looks like this:

    $ qemu-img info foo.qcow2 | grep 'file format'
    file format: qcow2
    
    $ qemu-fuse-disk-export.py foo.qcow2 &
    [1] 13339
    All exports set up, ^C to revert
    
    $ qemu-img info foo.qcow2 | grep 'file format'
    file format: raw
    
    $ kill -SIGINT %1
    [1]  + 13339 done       qemu-fuse-disk-export.py foo.qcow2
    
    $ qemu-img info foo.qcow2 | grep 'file format'
    file format: qcow2
    

    Or, with --daemonize/-d:

    $ qemu-img info foo.qcow2 | grep 'file format'
    file format: qcow2
    
    $ qemu-fuse-disk-export.py -dp qfde.pid foo.qcow2
    
    $ qemu-img info foo.qcow2 | grep 'file format'
    file format: raw
    
    $ kill -SIGINT $(cat qfde.pid)
    
    $ qemu-img info foo.qcow2 | grep 'file format'
    file format: qcow2
    

    Bringing it all together

    Now we know how to make disk images in any format understood by QEMU appear as raw images. We can thus run any application on them that works with such raw disk images:

    $ qemu-fuse-disk-export.py \
        -dp qfde.pid \
        Arch-Linux-x86_64-basic-20210711.28787.qcow2
    
    $ parted Arch-Linux-x86_64-basic-20210711.28787.qcow2 p
    WARNING: You are not superuser.  Watch out for permissions.
    Model:  (file)
    Disk /tmp/Arch-Linux-x86_64-basic-20210711.28787.qcow2: 42.9GB
    Sector size (logical/physical): 512B/512B
    Partition Table: gpt
    Disk Flags:
    
    Number  Start   End     Size    File system  Name  Flags
     1      1049kB  2097kB  1049kB                     bios_grub
     2      2097kB  42.9GB  42.9GB  btrfs
    
    $ sudo kpartx -av Arch-Linux-x86_64-basic-20210711.28787.qcow2
    add map loop0p1 (254:0): 0 2048 linear 7:0 2048
    add map loop0p2 (254:1): 0 83881951 linear 7:0 4096
    
    $ sudo mount /dev/mapper/loop0p2 /mnt/tmp
    
    $ ls /mnt/tmp
    bin   boot  dev  etc  home  lib  lib64  mnt  opt  proc  root  run  sbin  srv
    swap  sys   tmp  usr  var
    
    $ echo 'Hello, qcow2 image!' > /mnt/tmp/home/arch/hello
    
    $ sudo umount /mnt/tmp
    
    $ sudo kpartx -d Arch-Linux-x86_64-basic-20210711.28787.qcow2
    loop deleted : /dev/loop0
    
    $ kill -SIGINT $(cat qfde.pid)
    

    And launching the image, in the guest we see:

    [arch@archlinux ~] cat hello
    Hello, qcow2 image!
    

    A note on allow_other

    In the example presented in the above section, we access the exported image with a different user than the one who exported it (to be specific, we export it as a normal user, and then access it as root). This does not work prior to QEMU 6.1:

    $ qemu-fuse-disk-export.py -dp qfde.pid foo.qcow2
    
    $ sudo stat foo.qcow2
    stat: cannot statx 'foo.qcow2': Permission denied
    

    QEMU 6.1 has introduced support for FUSE’s allow_other mount option. Without that option, only the user who exported the image has access to it. By default, if the system allows for non-root users to add allow_other to FUSE mount options, QEMU will add it, and otherwise omit it. It does so by simply attempting to mount the export with allow_other first, and if that fails, it will try again without. (You can also force the behavior with the allow_other=(on|off|auto) export parameter.)

    Non-root users can pass allow_other if and only if /etc/fuse.conf contains the user_allow_other option.

    Conclusion

    As shown in this blog post, FUSE block exports are a relatively simple way to access images in any format understood by QEMU as if they were raw images. Any tool that can manipulate raw disk images can thus manipulate images in any format, simply by having the QEMU storage daemon provide a translation layer. By mounting the FUSE export on the original image path, this translation layer will effectively be invisible, and the original image will look like it is in raw format, so it can directly be accessed by those tools.

    The current main disadvantage of FUSE exports is that they offer relatively bad performance. That should be fine as long as your use case is just light manipulation of some VM images, like manually modifying some files on them. However, we did not yet really try to optimize performance, so if more serious use cases appear that would require better performance, we can try.



    /2021/08/22/fuse-blkexport/

Log in to reply
 

© Lightnetics 2024