So what is the size of that file? Sparse Files on Linux

I had a very interesting conversation with my mate Tomas yesterday, and it all turned into an interesting disk usage puzzle for a Saturday night. 

The Question

So he says, here’s a rebus to solve. Take a look at these:

$ ls -l /mnt/vbox/centos.img 
-rw------- 1 alc alc 8589935104 Mar 15 20:32 /mnt/vbox/centos.img
$ du -sk /mnt/vbox/centos.img 
1002388	/mnt/vbox/centos.img
$ df -hT | egrep "File|vbox"
Filesystem     Type    Size  Used Avail Use% Mounted on
/dev/sda5      ext4     20G  5.8G   13G  31% /mnt/vbox
# dumpe2fs /dev/sda5 | grep "Block size"
dumpe2fs 1.42.5 (29-Jul-2012)
Block size:               4096

And tell me, what’s the size of the centos.img file?

The Quest

So what do we know?

According to the ls listing, the file size is ~8GB.

The du command’s output shows a disk usage of 1002388KB, which is roughly 1GB.

According to df, 5.8GB of space is in use and 13GB is available out of a total of 20GB. The filesystem is ext4 with a block size of 4096B, or 4KB.
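To compare the reports, it helps to put them into the same units first. The sketch below only reuses the two figures from the listings above; the variable names are mine and nothing here touches the real file.

```shell
#!/bin/sh
# Figures copied from the ls -l and du -sk output above.
ls_bytes=8589935104     # size reported by ls -l, in bytes
du_kib=1002388          # usage reported by du -sk, in units of 1024 bytes

ls_kib=$(( ls_bytes / 1024 ))                 # ls size converted to KB
gap_kib=$(( ls_kib - du_kib ))                # space ls counts but du does not
pct=$(( du_kib * 1024 * 100 / ls_bytes ))     # integer division, truncates

echo "ls -l size : $ls_kib KB"
echo "du usage   : $du_kib KB"
echo "difference : $gap_kib KB"
echo "du is about $pct% of the ls size"
```

So roughly 7GB of what ls counts is simply not there as far as du is concerned, and that gap is what the rest of the puzzle is about.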

Exploring Info Pages

Assuming the centos.img file size is 8GB, how is it possible that df shows only 5.8GB in use and du reports that the file takes up just 1GB?

We need some more info. Let’s check the output of ls -ls:

$ ls -ls /mnt/vbox/centos.img
1002388 -rw------- 1 alc alc 8589935104 Mar 15 20:32 /mnt/vbox/centos.img

According to ls info page:

-s, --size : print the disk allocation of each file to the left of the filename. This is the amount of disk space used by the file, which is usually a bit more than the file's size, but it can be less if the file has holes. Normally the disk allocation is printed in units of 1024 bytes, but this can be overridden.

As we can see above, the -s option reports the amount of disk space that is in use by a file, in kilobytes (1KB = 1024B). For our centos.img file, it’s 1002388KB, or ~1GB. It is the same amount of space that was reported by the du -sk command earlier.

The ls -l long listing format reports the apparent file size (the distance between the beginning-of-file and the end-of-file), while ls -s shows the real amount of disk space allocated to the file. In our particular case it means that the centos.img file most likely contains holes: its apparent size is 8GB, but only about 1GB of it is actually allocated (written data) on disk.

The du -sk command also shows the space allocated on disk, so its output is the same as for ls -s.
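To see both numbers side by side without an 8GB image at hand, here is a minimal sketch that creates a throwaway file consisting of nothing but a hole. The name demo.img is made up, and I'm assuming GNU coreutils (truncate, stat, du with --apparent-size) and a filesystem that supports sparse files.

```shell
#!/bin/sh
# Sketch: apparent size versus allocated size for a file that is one big hole.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
f="$tmpdir/demo.img"          # hypothetical file name

# Set the length to 64MB without writing any data.
truncate -s 64M "$f"

apparent_bytes=$(stat -c %s "$f")                       # what ls -l shows
apparent_kib=$(du -k --apparent-size "$f" | cut -f1)    # same, in KB
alloc_kib=$(du -k "$f" | cut -f1)                       # what du -sk and ls -s show

echo "apparent size: $apparent_bytes bytes ($apparent_kib KB)"
echo "allocated    : $alloc_kib KB"
```

On a filesystem like ext4 the allocated figure should come out at or near zero, while the apparent size stays at 64MB.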

The dd Case

Now, let’s create a copy of the centos.img file with dd command:

$ dd < /mnt/vbox/centos.img > /mnt/vbox/centos.img_dd
16777217+0 records in
16777217+0 records out
8589935104 bytes (8.6 GB) copied, 87.2762 s, 98.4 MB/s

And check the size of the new centos.img_dd file:

$ ls -ls /mnt/vbox/centos.img*
1002388 -rw------- 1 alc alc 8589935104 Mar 15 20:32 /mnt/vbox/centos.img 
8388616 -rw-r--r-- 1 alc alc 8589935104 Mar 15 20:39 /mnt/vbox/centos.img_dd
$ du -sk /mnt/vbox/centos.img*
1002388 /mnt/vbox/centos.img
8388616 /mnt/vbox/centos.img_dd

What do we see here? The real size of the centos.img_dd file, or, in other words, its disk usage, is 8388616KB (~8GB). This is because dd simply reads its input from start to end and writes out every byte it gets. When a hole is read, the filesystem returns zeros, and dd writes those zeros to the output as ordinary data. The file content itself doesn’t actually change, but the holes are no longer unallocated disk space: they are now real blocks filled with zeros.
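GNU dd can be told to keep the holes with conv=sparse, which seeks over all-zero output blocks instead of writing them. The following is only a sketch on a small scratch file, not the command we ran on centos.img; the file names are hypothetical and a reasonably recent GNU coreutils is assumed.

```shell
#!/bin/sh
# Sketch: plain dd versus dd conv=sparse on a file with a hole in the middle.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
src="$tmpdir/src.img"         # hypothetical file names throughout

# A few bytes of data at each end, a hole of about 8MB in between.
printf 'head' > "$src"
truncate -s 8M "$src"
printf 'tail' >> "$src"

# Plain copy: the hole is read back as zeros and written out as real blocks.
dd if="$src" of="$tmpdir/full.img" bs=4096 2>/dev/null
# conv=sparse: all-zero blocks are skipped with a seek, leaving holes behind.
dd if="$src" of="$tmpdir/sparse.img" bs=4096 conv=sparse 2>/dev/null

full_blocks=$(stat -c %b "$tmpdir/full.img")
sparse_blocks=$(stat -c %b "$tmpdir/sparse.img")

# %s = apparent size, %b = allocated blocks, %B = size of one such block
stat -c '%n: %s bytes, %b blocks of %B bytes' "$tmpdir"/*.img
```

All three files should have the same apparent size and identical content; only the number of allocated blocks should differ.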

The cp Case

The most interesting case we found was to copy the image file with the cp command:

$ cp -p /mnt/vbox/centos.img /mnt/vbox/centos.img_cp

If we check the size of the new centos.img_cp file, we see the following:

$ ls -ls /mnt/vbox/centos.img*
1002388 -rw------- 1 alc alc 8589935104 Mar 15 20:32 /mnt/vbox/centos.img
 813036 -rw------- 1 alc alc 8589935104 Mar 15 20:51 /mnt/vbox/centos.img_cp
8388616 -rw-r--r-- 1 alc alc 8589935104 Mar 15 20:39 /mnt/vbox/centos.img_dd
$ du -sk /mnt/vbox/centos.img*
1002388 /mnt/vbox/centos.img
 813036 /mnt/vbox/centos.img_cp
8388616 /mnt/vbox/centos.img_dd

So, the disk usage of the centos.img_cp file, 813036KB, is smaller than the original centos.img file’s, 1002388KB. How did that happen? Let’s get back to info pages. The info page for cp says:

--sparse=WHEN : a "sparse file" contains "holes"--a sequence of zero bytes that does not occupy any physical disk blocks. By default, "cp" detects holes in input source files via a crude heuristic and makes the corresponding output file sparse as well. Only regular files may be sparse.

The WHEN value can be one of the following:

auto : the default behavior: if the input file is sparse, attempt to make the output file sparse, too. However, if an output file exists but refers to a non-regular file, then do not attempt to make it sparse.

always : for each sufficiently long sequence of zero bytes in the input file, attempt to create a corresponding hole in the output file, even if the input file does not appear to be sparse.

never : never make the output file sparse.

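This sheds some light on the case: by default, the cp command uses a crude heuristic to detect holes in an input file, and produces a sparse output file if possible. Therefore our centos.img file has to be a sparse file. The copy most likely came out even smaller than the original because some of the blocks allocated to centos.img contain nothing but zeros, and cp turned those into holes as well.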
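The effect of the WHEN values is easy to try out on a scratch file. This is a minimal sketch, assuming GNU cp; the file names are made up, and the source file is deliberately written with real zeros so that it is not sparse to begin with.

```shell
#!/bin/sh
# Sketch: cp --sparse=never versus cp --sparse=always on a zero-filled file.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
src="$tmpdir/zeros.img"       # hypothetical file names throughout

# 4MB of zeros actually written to disk: allocated, but all zero bytes.
dd if=/dev/zero of="$src" bs=1M count=4 2>/dev/null

cp --sparse=never  "$src" "$tmpdir/never.img"    # write every zero out
cp --sparse=always "$src" "$tmpdir/always.img"   # turn zero runs into holes

never_kib=$(du -k "$tmpdir/never.img" | cut -f1)
always_kib=$(du -k "$tmpdir/always.img" | cut -f1)

echo "never : $never_kib KB on disk"
echo "always: $always_kib KB on disk"
```

With --sparse=always the copy should need far less disk space than the source, even though cmp will report both files as identical.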

Disk Usage and File Size: Some More Stuff

To get a better idea of disk usage versus file size, let’s create a one-byte file with dd:

$ dd if=/dev/zero of=filex bs=1 count=1
1+0 records in
1+0 records out
1 byte (1 B) copied, 5.4057e-05 s, 18.5 kB/s

We know the file size is 1B, we can check that with the ls command:

$ ls -ls filex 
4 -rw-r--r-- 1 alc alc 1 Mar 15 22:07 filex

The file size is 1B, but the disk usage is 4KB. More and more curious, isn’t it? Not really. This is because our filesystem is configured to use a block size of 4096B, or 4KB. Even if we create a file of 1B, it will still consume 4K of space on a disk. This means that we have wasted 4095B of disk space by creating a 1B file.

Let’s create a 4K file now:

$ dd if=/dev/zero of=filex bs=1 count=4096
4096+0 records in
4096+0 records out
4096 bytes (4.1 kB) copied, 0.0121268 s, 338 kB/s

Here’s the ls output:

$ ls -ls filex
4 -rw-r--r-- 1 alc alc 4096 Mar 15 22:08 filex

The file size is 4K now, the disk usage is the same as before, 4K. How about creating a 4097B file? This one should now use 8KB of the disk space.

$ dd if=/dev/zero of=filex bs=1 count=4097
4097+0 records in
4097+0 records out
4097 bytes (4.1 kB) copied, 0.0155807 s, 263 kB/s

List sizes again:

$ ls -ls filex
8 -rw-r--r-- 1 alc alc 4097 Mar 15 22:09 filex

Voilà, all exactly as we expected. The disk usage increased to 8KB, while the file’s size is slightly more than 4KB.

This example, and the one before, showed disk usage equal to or bigger than the file size. What if we do the opposite and create a file which is bigger than its disk usage?
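The rounding in these three experiments is plain arithmetic: the size is rounded up to the next multiple of the block size. Here is a small sketch of it; expected_kib is a helper name I made up, and it assumes a file without holes on a filesystem that stores even tiny files in regular data blocks and counts no extra metadata blocks.

```shell
#!/bin/sh
# expected_kib SIZE [BLOCK_SIZE]
# Disk usage in KB for a dense file of SIZE bytes, rounded up to whole blocks.
expected_kib() {
    size=$1
    bs=${2:-4096}             # default: the 4096B block size of our ext4
    echo $(( (size + bs - 1) / bs * bs / 1024 ))
}

for n in 1 4096 4097; do
    echo "$n bytes -> $(expected_kib "$n") KB on disk"
done
```

For 1, 4096 and 4097 bytes this prints 4, 4 and 8KB, matching what ls -ls reported for filex.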

Let’s extend the size of the existing “filex” file to 16KB:

$ truncate -s 16K filex

If the file is shorter than the requested size, truncate extends it, and the extended part (a hole) reads as zero bytes. This should not affect the disk usage; it should only increase the apparent file size.

$ ls -ls filex 
8 -rw-r--r-- 1 alc alc 16384 Mar 15 22:11 filex

As we may notice above, the disk usage is still the same, 8KB. The file size, however, has increased to 16KB. We learn something new every day.
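The same comparison can be wrapped into a quick check for whether a file looks sparse: multiply the allocated block count by the block unit and compare it with the apparent size. This is a rough sketch, assuming GNU stat; is_sparse is my own helper name, and the test is only a heuristic, since filesystems with compression or inline data can report fewer blocks for files that have no holes at all.

```shell
#!/bin/sh
# is_sparse FILE: succeed if FILE has fewer bytes allocated than its length.
is_sparse() {
    # %b = allocated blocks, %B = bytes per block, %s = apparent size
    alloc=$(( $(stat -c '%b * %B' "$1") ))
    size=$(stat -c %s "$1")
    [ "$alloc" -lt "$size" ]
}

tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT

# A dense 16KB file of random data, and a 1MB file that is all hole.
dd if=/dev/urandom of="$tmpdir/dense" bs=4096 count=4 2>/dev/null
truncate -s 1M "$tmpdir/hole"

for f in "$tmpdir/dense" "$tmpdir/hole"; do
    if is_sparse "$f"; then
        echo "$f: looks sparse"
    else
        echo "$f: looks dense"
    fi
done
```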

One thought on “So what is the size of that file? Sparse Files on Linux

  1. Thanks for the write up! I recently had a bizarre situation where a NetApp system seemed to think there was more free space available than Windows. It turned out to be a SQL Instant File Initialisation feature, which reclaimed used disk space without filling that space with zeros. I had some empty database files which didn’t occupy much space on the storage controller until.
