The Cape Town Linux User Group was lucky to get both a behind-the-scenes and a front-end explanation of ZFS from Johan Hartzenburg. ZFS is Sun Microsystems' new, advanced, all-singing, all-dancing filesystem, which is also a volume manager and, I'm sure, will eventually be able to send email before becoming Emacs.

Johan explained to us how ZFS manages to always be consistent: it never edits an existing metadata entry in place. Instead it copies the entry to a new location, edits the copy, and then points the original entry's parent at the copy.  But, of course, because it never edits an entry directly, the parent goes through the same process, and so on up the tree until it reaches the uberblock.  The uberblock never has a new copy created of itself, but there are multiple copies of it, and updating the uberblock is an atomic operation.  Even if things go awry while this is half-complete, any of the uberblocks is consistent (and, I think, each has a timestamp to fall back on).
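To make the copy-up-the-tree idea concrete, here is a minimal Python sketch. It is only a toy model of the behaviour described in the talk, not ZFS's actual on-disk structures: `Node`, `cow_write`, `read` and `ToyPool` are names made up for illustration, and the "uberblock" is reduced to a single (transaction number, root) pair whose reassignment stands in for the atomic update.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Node:
    """An immutable block: once written it is never edited in place."""
    children: Tuple[Tuple[str, "Node"], ...] = ()
    data: Optional[bytes] = None


def cow_write(node, path, data):
    """Return a new node reflecting the write; `node` itself is untouched."""
    if not path:
        return Node(data=data)
    kids = dict(node.children) if node is not None else {}
    # Only the child on the path is rebuilt; siblings are shared as-is.
    kids[path[0]] = cow_write(kids.get(path[0]), path[1:], data)
    return Node(children=tuple(sorted(kids.items(), key=lambda kv: kv[0])))


def read(node, path):
    """Walk from a root down to the data stored at `path`."""
    for name in path:
        node = dict(node.children)[name]
    return node.data


class ToyPool:
    """Hypothetical pool whose 'uberblock' is a (transaction, root) pair."""

    def __init__(self):
        self.uberblock = (0, Node())

    def write(self, path, data):
        txg, root = self.uberblock
        new_root = cow_write(root, path, data)
        # The only in-place change: one reference switch to the new tree.
        self.uberblock = (txg + 1, new_root)
```

Anyone still holding the old root keeps seeing a complete, consistent tree, because nothing reachable from it was modified. Only the nodes on the path from the changed entry up to the root are new.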

This all sounds really inefficient, but ends up not being so.  The new blocks are generally all written near each other, so what would have been a whole bunch of random writes often ends up more efficient, because the new data and metadata land close together on disk.

Metadata and data blocks that are no longer referenced are then freed.

Using this design makes snapshots pretty trivial - since all you need to do is not delete the original metadata and data blocks used in the snapshot.  Everything speeds on ahead, and the scrubber just doesn't free those blocks.
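A rough sketch of why that is so cheap, in Python. This is a toy under my own assumptions rather than how ZFS tracks blocks: `ToySnapshotPool` and its methods are invented for illustration, and a "snapshot" is simply a saved copy of the name-to-block table that pins the blocks it mentions.

```python
class ToySnapshotPool:
    """Toy copy-on-write store: a snapshot is just a kept set of pointers."""

    def __init__(self):
        self.blocks = {}     # block id -> bytes
        self.live = {}       # file name -> block id (the current view)
        self.snapshots = {}  # snapshot name -> saved {file name: block id}
        self._next_id = 0

    def write(self, name, data):
        # Always write a fresh block; never overwrite the old one.
        block_id = self._next_id
        self._next_id += 1
        self.blocks[block_id] = data
        old = self.live.get(name)
        self.live[name] = block_id
        if old is not None:
            self._free_if_unreferenced(old)

    def snapshot(self, snap_name):
        # No data is copied: only the pointer table is remembered.
        self.snapshots[snap_name] = dict(self.live)

    def read(self, name, snap_name=None):
        table = self.live if snap_name is None else self.snapshots[snap_name]
        return self.blocks[table[name]]

    def _free_if_unreferenced(self, block_id):
        pinned = any(block_id in table.values()
                     for table in self.snapshots.values())
        if not pinned and block_id not in self.live.values():
            del self.blocks[block_id]
```

Taking the snapshot costs one small table copy. After that, the only difference in behaviour is that superseded blocks mentioned by a snapshot are skipped when freeing.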

Also, using this design makes changing on-disk options pretty simple.  This includes, for example, how ZFS can efficiently handle different byte orders.  On read, ZFS can handle either order, but on write it will always use the byte order native to the machine doing the writing, which is the most efficient one for it.  Similarly, compression can be applied at the data block level: every time a change happens to a file, ZFS can compress the new data block that is created.
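The read-either, write-native idea can be sketched in a few lines of Python. This is an illustration under my own assumptions, not ZFS's block format: the layout (a 4-byte marker followed by an 8-byte value), the `MAGIC` constant, and the `write_block`/`read_block` helpers are all made up for the example.

```python
import struct
import sys

MAGIC = 0x00BAB10C  # arbitrary marker value chosen for this sketch


def write_block(value, byteorder=sys.byteorder):
    """Pack a block in the writer's byte order (native by default)."""
    prefix = "<" if byteorder == "little" else ">"
    return struct.pack(prefix + "IQ", MAGIC, value)


def read_block(raw):
    """Accept either byte order: the marker reveals which one was used."""
    for prefix in ("<", ">"):
        magic, value = struct.unpack(prefix + "IQ", raw)
        if magic == MAGIC:
            return value
    raise ValueError("unrecognised block")
```

A block written on a big-endian machine and one written on a little-endian machine have different bytes on disk, but `read_block` returns the same value for both. Because every change creates a new block anyway, a disk that moves between architectures gradually ends up in its new host's native order without any conversion pass.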

This also includes how to bring more members into the pool and harness the increased I/O bandwidth.  The "allocator" just needs to place the new blocks created by edits on the new member. As time goes by, all members of the pool naturally tend towards holding equal amounts of data, which maximises bandwidth for concurrent read or write requests.
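Here is a small Python simulation of that rebalancing effect. The policy in it is my own assumption, since the talk didn't go into how the real allocator chooses: `allocate` simply picks the member with the fewest blocks in use, and `rewrite_all` models every block being edited once (new copy allocated, old copy freed).

```python
def allocate(used):
    """Pick the pool member with the fewest blocks in use (assumed policy)."""
    return min(used, key=used.get)


def rewrite_all(placement, used):
    """Copy-on-write edit of every block: allocate a new copy, free the old."""
    for block, old_dev in list(placement.items()):
        new_dev = allocate(used)
        used[new_dev] += 1
        used[old_dev] -= 1
        placement[block] = new_dev


# Two full-ish disks, then an empty third one joins the pool.
placement = {i: ("disk0" if i < 50 else "disk1") for i in range(100)}
used = {"disk0": 50, "disk1": 50, "disk2": 0}
rewrite_all(placement, used)
```

No explicit "rebalance" step is run here. The data evens out across the three members purely as a side effect of ordinary edits landing wherever there is the most room.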

The command line tools are incredibly simple and powerful, and with ZFS you don't have to worry about device renaming, as it records on the disks all the information necessary to find out where in the ZFS hierarchy that disk lives. Easy to use, and hard to screw up?  How can it possibly succeed?

Solaris servers running as a Xen Dom0 (which seems to be progressing well), with a massive ZFS storage pool and multiple virtual machines, just so sounds like a winning plan.  Or FreeBSD, once ZFS-on-FreeBSD (going well, I see) and Xen-Dom0-on-FreeBSD (not quite as encouraging) are available in stable forms.

5 old-style comments

  1. wjv, March 28, 2007 at 12:11 PM.

    Can't wait for Leopard!
  2. Neil Blakey-Milner, March 28, 2007 at 12:18 PM.

    That's a good point - I forgot about OS X. Have you ever used it as a server? How does it handle?
  3. wjv, March 28, 2007 at 12:25 PM.

    I once had a G5 Xserve for a short time as an evaluation unit (and I have collaborators in Shanghai who run an all-Mac shop, including HPC) so my experience is limited.

    In general, though, I'd say the experience from the desktop translates: You get a thoroughly BSD-like system with a very well thought out and slick front-end which allows you to ignore the finer technical details of certain parts of the system if and when you need to. It combines the traditional "internet server" feeling of a UNIX box with the "desktop support server" feeling of a Windows-based server, saving you having to split those responsibilities between two environments. And it's very, very sexy. :-)

    All else being equal (which is rarely the case) I would happily specify an Xserve-based network if I ever again had to administer one.
  4. Oscar Reitsma, March 28, 2007 at 12:40 PM.

    I've kept a distant eye on ZFS after reading about it a while ago. I might be mistaken, but it was originally spec'd to be a cluster file system, which then fell by the wayside. If that is the case it's a bit disappointing. I'm also a bit worried about what its performance will be on higher-end DBMSs such as Oracle due to this note:

    ZFS eats a lot of CPU when doing small writes (for example, a single byte). There are two root causes, currently being solved: a) Translating from znode to dnode is slower than necessary because ZFS doesn't use translation information it already has, and b) Current partial-block update code is very inefficient.

    I think it's still a bit young, but I will be looking at it in more detail in the future.
  5. jerith, March 28, 2007 at 04:29 PM.

    Oscar: I read a blog post a while back by some Sun people about performance testing Oracle on ZFS. As I recall, they were getting something like 80-90% of the performance of UFS by tuning various parameters. Then again, I could be mistaken with the numbers (it was several months ago) and the post was dealing with beta versions of ZFS.