ZFS at CLUG: the morning after
28 Mar 2007
The Cape Town Linux User Group was lucky to get both a behind-the-scenes and a front-end explanation of ZFS from Johan Hartzenburg. ZFS is Sun Microsystems' new, advanced, all-singing, all-dancing filesystem, which is also a volume manager and, I'm sure, will eventually be able to send email before becoming Emacs.
Johan explained to us how ZFS manages to always be consistent - by never editing existing metadata entries in place. Instead it copies the entry to a new entry, edits the new entry, and then points the original entry's parent at the new entry. But, of course, because it never edits an entry directly, the parent goes through the same process, and so on up the tree until it reaches the uberblock. The uberblock is never copied to a fresh location in this way, but there are multiple copies of it, and updating the uberblock is an atomic operation. Even if things go awry while this is half-complete, any of the uberblocks is consistent (and, I think, each has a timestamp to fall back on).
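To make the copy-up-the-tree idea concrete, here is a minimal Python sketch. It is a toy model, not ZFS code: the `Pool` class, the `generation` counter and the nested-dict "metadata tree" are all hypothetical names invented for illustration, and a single Python assignment stands in for the atomic uberblock update.

```python
def cow_write(node, path, data):
    """Return a new tree with `data` stored at `path`; `node` is never modified."""
    if not path:
        return data  # the new data block takes the place of the old one
    name, rest = path[0], path[1:]
    new_node = dict(node)  # copy the entry instead of editing it in place
    new_node[name] = cow_write(node.get(name, {}), rest, data)
    return new_node


class Pool:
    """Hypothetical stand-in for the uberblock: one pointer to the current root."""

    def __init__(self):
        self.root = {}
        self.generation = 0  # assumed counter, playing the role of a timestamp

    def write(self, path, data):
        new_root = cow_write(self.root, path, data)
        # Switching the root pointer models the atomic uberblock update:
        # until this line runs, readers still see the old, consistent tree.
        self.root = new_root
        self.generation += 1


pool = Pool()
pool.write(("home", "a.txt"), b"one")
pool.write(("home", "b.txt"), b"bee")
old_root = pool.root
pool.write(("home", "a.txt"), b"two")
```

After the last write, `old_root` still describes the tree as it was, while `pool.root` sees the new contents. The untouched `b.txt` entry is shared between the two trees rather than copied.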
This all sounds really inefficient, but ends up not being so. The new data and metadata blocks are generally all written near each other, so what would otherwise be a whole bunch of random writes often ends up being more efficient.
Unused metadata and data blocks are then removed.
Using this design makes snapshots pretty trivial - since all you need to do is not delete the original metadata and data blocks used in the snapshot. Everything speeds on ahead, and the scrubber just doesn't free those blocks.
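A rough way to picture this in Python: treat a snapshot as nothing more than an old root that is still being held on to. The sketch below is a simple mark-and-sweep model with made-up block names, written only to illustrate the idea. It is not a claim about how ZFS actually tracks which blocks are free.

```python
def reachable(blocks, root_id):
    """Collect the ids of every block reachable from `root_id`."""
    seen, stack = set(), [root_id]
    while stack:
        block_id = stack.pop()
        if block_id not in seen:
            seen.add(block_id)
            stack.extend(blocks[block_id])
    return seen


def free_unused(blocks, live_root, snapshot_roots=()):
    """Delete blocks that neither the live tree nor any snapshot refers to."""
    keep = reachable(blocks, live_root)
    for root_id in snapshot_roots:
        keep |= reachable(blocks, root_id)
    freed = sorted(set(blocks) - keep)
    for block_id in freed:
        del blocks[block_id]
    return freed


def example_blocks():
    """Hypothetical pool after one edit: block id -> ids of its children."""
    return {
        "root_v1": ["meta", "data_old"],  # tree before the edit
        "root_v2": ["meta", "data_new"],  # tree after the edit
        "meta": [],
        "data_old": [],
        "data_new": [],
    }


without_snapshot = example_blocks()
freed_without = free_unused(without_snapshot, "root_v2")

with_snapshot = example_blocks()
freed_with = free_unused(with_snapshot, "root_v2", snapshot_roots=["root_v1"])
```

Without a snapshot, the old root and the old data block get freed. With `root_v1` held as a snapshot, nothing is freed, and no extra copying was needed to take the snapshot.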
Also, using this design makes changing on-disk options pretty simple. This includes, for example, how ZFS can efficiently handle different byte orders. On read, ZFS can handle either order, but on write will always use the most efficient byte order. Similarly, compression can be used on a data block level - every time a change happens to a file, it can compress the new data block that is created.
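One common way to read either byte order is to store a known magic number and see which interpretation reproduces it. The Python sketch below shows that general technique. The header layout and the `MAGIC` constant are assumptions for illustration, and I am not asserting that this is the exact on-disk format ZFS uses.

```python
import struct

MAGIC = 0x00BAB10C  # illustrative magic constant for this sketch


def write_header(value, byteorder):
    """Pack (magic, value) as two 64-bit unsigned ints in the given byte order."""
    prefix = {"little": "<", "big": ">"}[byteorder]
    return struct.pack(prefix + "QQ", MAGIC, value)


def read_header(raw):
    """Try both byte orders and return the value from the one whose magic matches."""
    for prefix in ("<", ">"):
        magic, value = struct.unpack(prefix + "QQ", raw)
        if magic == MAGIC:
            return value
    raise ValueError("unrecognised magic number")


little = write_header(12345, "little")
big = write_header(12345, "big")
```

The writer is free to use whichever order is cheapest on its own hardware, and `read_header` recovers the same value from both `little` and `big`.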
This also includes how to bring more members into the pool and harness the increased I/O bandwidth. The "allocator" just needs to place the new blocks created by edits on the new member. As time goes by, all members of the pool naturally tend towards holding equal amounts of data, which maximises bandwidth for concurrent read or write requests.
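The balancing effect is easy to simulate. The Python sketch below assumes a deliberately naive policy, "always write to the member holding the least data", which is my simplification and not necessarily what the real allocator does. The disk names and sizes are made up.

```python
def allocate(used, size):
    """Place a new block on the pool member currently holding the least data."""
    member = min(used, key=used.get)
    used[member] += size
    return member


# Two members already hold 100 units each; disk2 has just been added.
used = {"disk0": 100, "disk1": 100, "disk2": 0}

# Every edit creates new blocks, so ordinary write traffic is all it takes.
placements = [allocate(used, 1) for _ in range(400)]
```

The first 100 blocks all land on `disk2`. After that the writes rotate across all three members, and once the 400 writes are done each member holds 200 units.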
The command line tools are incredibly simple and powerful, and with ZFS you don't have to worry about device renaming, as it records on the disks all the information necessary to find out where in the ZFS hierarchy that disk lives. Easy to use, and hard to screw up? How can it possibly succeed?
Solaris servers acting as Xen Dom0 (which seems to be progressing well), with a massive ZFS storage pool and multiple virtual machines, just sounds so much like a winning plan. Or FreeBSD, once ZFS-on-FreeBSD (going well, I see) and Xen-Dom0-on-FreeBSD (not quite as encouraging) are available in stable forms.
5 old-style comments
wjv — March 28, 2007 at 12:11 PM.
Neil Blakey-Milner — March 28, 2007 at 12:18 PM.
wjv — March 28, 2007 at 12:25 PM.
In general, though, I'd say the experience from the desktop translates: You get a thoroughly BSD-like system with a very well thought out and slick front-end which allows you to ignore the finer technical details of certain parts of the system if and when you need to. It combines the traditional "internet server" feeling of a UNIX box with the "desktop support server" feeling of a Windows-based server, saving you having to split those responsibilities between two environments. And it's very, very sexy. :-)
All else being equal (which is rarely the case) I would happily specify an Xserve-based network if I ever again had to administer one.
Oscar Reitsma — March 28, 2007 at 12:40 PM.
ZFS eats a lot of CPU when doing small writes (for example, a single byte). There are two root causes, currently being solved: a) translating from znode to dnode is slower than necessary because ZFS doesn't use translation information it already has, and b) the current partial-block update code is very inefficient.
I think it's still a bit young, but I will be looking at it in more detail in the future.
jerith — March 28, 2007 at 04:29 PM.