12 minute read

I was poking around the mount command one day to mount an NFS remote on my local machine, and stumbled upon many other filesystems listed in the mount output. Some of them were totally unknown to me. Each of them has its own use cases and characteristics. I wanted to learn them at least at a high level, so I know which filesystem is used when and for what purposes in the world of Unix and Unix-like systems. I often use Google Cloud Shell (a g1-small instance type VM based on Debian in GCP) for quick checks on Linux commands and features. It is an ephemeral instance, so contents stored outside the home directory are lost upon deactivation. Also, it is free and automatically reclaimed after a period of inactivity, so I don’t have to worry about billing or maintaining the instance.

In the list of mounted filesystems on this Cloud Shell instance, I could see 10 different filesystems.

$ mount | grep -o "type [a-z,0-9].* "  | sort | uniq | nl
     1  type cgroup
     2  type devpts
     3  type ext4
     4  type mqueue
     5  type nsfs
     6  type overlay
     7  type proc
     8  type securityfs
     9  type sysfs
    10  type tmpfs

Looking into the list of filesystems supported by this machine, there were 28 of them. Only a few of them require a block device as the backend (indicated by the absence of nodev).

~ $ cat /proc/filesystems | sort -k 2 | nl
     1          ext2
     2          ext3
     3          ext4
     4  nodev   autofs
     5  nodev   bdev
     6  nodev   binfmt_misc
     7  nodev   bpf
     8  nodev   cgroup
     9  nodev   cgroup2
    10  nodev   cpuset
    11  nodev   dax
    12  nodev   debugfs
    13  nodev   devpts
    14  nodev   devtmpfs
    15  nodev   efivarfs
    16  nodev   hugetlbfs
    17  nodev   mqueue
    18  nodev   overlay
    19  nodev   pipefs
    20  nodev   proc
    21  nodev   pstore
    22  nodev   ramfs
    23  nodev   rootfs
    24  nodev   securityfs
    25  nodev   sockfs
    26  nodev   sysfs
    27  nodev   tmpfs
    28  nodev   tracefs

I’m trying to learn the basics of each of these filesystems and post a summary as I go along. It would be too long to cover all of them in one post, so I’m breaking it down into 3 or 4 parts.

autofs

Autofs is a program/service used to mount filesystems on demand. The typical use cases are remote filesystems like NFS mounts and CIFS shares, and removable media like USB drives, CD drives, etc. The specified filesystem is automatically mounted upon access and unmounted after a period of inactivity. The traditional approach is to specify all the necessary mounts in /etc/fstab so that the OS can mount them during boot. With remote filesystems, keeping an always-connected mount may consume significant network bandwidth. Autofs conserves bandwidth by mounting the filesystems only when needed. The operations are controlled through its config and map files.
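
The master map points a directory at a map file, and each entry in the map file describes one on-demand mount. A minimal sketch for an NFS share (the server name, export path, and timeout below are made up for illustration):

~ $ cat /etc/auto.master
/misc   /etc/auto.misc  --timeout=60
~ $ cat /etc/auto.misc
data    -rw,soft        nfs.example.com:/export/data
~ $ # accessing /misc/data triggers the NFS mount; it is unmounted again after 60s of inactivity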

autofs-references

cgroup

This is a virtual filesystem used by the Linux cgroups kernel feature to manage cgroup operations. There is no cgroup-specific system call; all cgroup actions are handled through operations on the files inside this virtual filesystem. In a typical configuration, a tmpfs filesystem is mounted at the top of the cgroup hierarchy, usually at /sys/fs/cgroup. Each cgroup controller (e.g. memory, cpu, cpuset, blkio, etc.) then has its own cgroup filesystem under this hierarchy. A new cgroup can be created simply by creating a directory under the desired controller (note: there are other ways too, mkdir is just one of them). The kernel then automatically populates the directory with files specific to that controller.

A quick grep for cgroup in the list of mounted filesystems shows that every cgroup controller has its own cgroup filesystem.

~ $ grep "cgroup" /proc/mounts
tmpfs /sys/fs/cgroup tmpfs rw,nosuid,nodev,noexec,relatime,mode=755 0 0
cgroup /sys/fs/cgroup/systemd cgroup rw,nosuid,nodev,noexec,relatime,xattr,name=systemd 0 0
cgroup /sys/fs/cgroup/net_cls,net_prio cgroup rw,nosuid,nodev,noexec,relatime,net_cls,net_prio 0 0
cgroup /sys/fs/cgroup/hugetlb cgroup rw,nosuid,nodev,noexec,relatime,hugetlb 0 0
cgroup /sys/fs/cgroup/devices cgroup rw,nosuid,nodev,noexec,relatime,devices 0 0
cgroup /sys/fs/cgroup/cpuset cgroup rw,nosuid,nodev,noexec,relatime,cpuset 0 0
cgroup /sys/fs/cgroup/memory cgroup rw,nosuid,nodev,noexec,relatime,memory 0 0
cgroup /sys/fs/cgroup/pids cgroup rw,nosuid,nodev,noexec,relatime,pids 0 0
cgroup /sys/fs/cgroup/freezer cgroup rw,nosuid,nodev,noexec,relatime,freezer 0 0
cgroup /sys/fs/cgroup/cpu,cpuacct cgroup rw,nosuid,nodev,noexec,relatime,cpu,cpuacct 0 0
cgroup /sys/fs/cgroup/blkio cgroup rw,nosuid,nodev,noexec,relatime,blkio 0 0
cgroup /sys/fs/cgroup/perf_event cgroup rw,nosuid,nodev,noexec,relatime,perf_event 0 0
cgroup /sys/fs/cgroup/rdma cgroup rw,nosuid,nodev,noexec,relatime,rdma 0 0

The files inside each controller directory are the interfaces to the cgroup operations:

~ $ cd /sys/fs/cgroup/blkio/
blkio $ ls
blkio.io_merged            blkio.io_service_bytes            blkio.io_service_time            blkio.leaf_weight         blkio.sectors_recursive                    blkio.throttle.io_serviced_recursive  blkio.throttle.write_iops_device  blkio.weight_device    tasks
blkio.io_merged_recursive  blkio.io_service_bytes_recursive  blkio.io_service_time_recursive  blkio.leaf_weight_device  blkio.throttle.io_service_bytes            blkio.throttle.read_bps_device        blkio.time                        cgroup.clone_children
blkio.io_queued            blkio.io_serviced                 blkio.io_wait_time               blkio.reset_stats         blkio.throttle.io_service_bytes_recursive  blkio.throttle.read_iops_device       blkio.time_recursive              cgroup.procs
blkio.io_queued_recursive  blkio.io_serviced_recursive       blkio.io_wait_time_recursive     blkio.sectors             blkio.throttle.io_serviced                 blkio.throttle.write_bps_device       blkio.weight                      notify_on_release

~ $ cd /sys/fs/cgroup/memory/
memory $ ls
cgroup.clone_children  memory.force_empty              memory.kmem.slabinfo                memory.kmem.tcp.usage_in_bytes  memory.memsw.failcnt             memory.move_charge_at_immigrate  memory.soft_limit_in_bytes  memory.use_hierarchy
cgroup.event_control   memory.kmem.failcnt             memory.kmem.tcp.failcnt             memory.kmem.usage_in_bytes      memory.memsw.limit_in_bytes      memory.numa_stat                 memory.stat                 notify_on_release
cgroup.procs           memory.kmem.limit_in_bytes      memory.kmem.tcp.limit_in_bytes      memory.limit_in_bytes           memory.memsw.max_usage_in_bytes  memory.oom_control               memory.swappiness           tasks
memory.failcnt         memory.kmem.max_usage_in_bytes  memory.kmem.tcp.max_usage_in_bytes  memory.max_usage_in_bytes       memory.memsw.usage_in_bytes      memory.pressure_level            memory.usage_in_bytes

Say we want to create a new cgroup to limit the memory consumed by processes initiated by product-developers within a particular group. We can create a directory within /sys/fs/cgroup/memory/ first, and then set the desired values in the corresponding files.

memory $ sudo mkdir prod_devs
memory $ cd prod_devs/
prod_devs $ ls
cgroup.clone_children  memory.force_empty              memory.kmem.slabinfo                memory.kmem.tcp.usage_in_bytes  memory.memsw.failcnt             memory.move_charge_at_immigrate  memory.soft_limit_in_bytes  memory.use_hierarchy
cgroup.event_control   memory.kmem.failcnt             memory.kmem.tcp.failcnt             memory.kmem.usage_in_bytes      memory.memsw.limit_in_bytes      memory.numa_stat                 memory.stat                 notify_on_release
cgroup.procs           memory.kmem.limit_in_bytes      memory.kmem.tcp.limit_in_bytes      memory.limit_in_bytes           memory.memsw.max_usage_in_bytes  memory.oom_control               memory.swappiness           tasks
memory.failcnt         memory.kmem.max_usage_in_bytes  memory.kmem.tcp.max_usage_in_bytes  memory.max_usage_in_bytes       memory.memsw.usage_in_bytes      memory.pressure_level            memory.usage_in_bytes

memory $ sudo sh -c 'echo 4G > /sys/fs/cgroup/memory/prod_devs/memory.limit_in_bytes'

memory $ cat /sys/fs/cgroup/memory/prod_devs/memory.limit_in_bytes
4294967296
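
The limit only applies to processes that are members of the cgroup; a process is moved in by writing its PID into cgroup.procs. A quick sketch using the current shell (moving it back to the root cgroup afterwards, since a cgroup directory cannot be removed while it still has member processes):

memory $ echo $$ | sudo tee /sys/fs/cgroup/memory/prod_devs/cgroup.procs > /dev/null
memory $ grep "memory" /proc/self/cgroup    # children inherit the cgroup, so this should show .../prod_devs
memory $ echo $$ | sudo tee /sys/fs/cgroup/memory/cgroup.procs > /dev/null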

memory $ sudo rmdir prod_devs/

cgroup version 2 uses a unified hierarchy instead of a separate filesystem for each controller; it uses the cgroup2 filesystem. It is possible to have both version 1 and version 2 running on the same system, although there are some restrictions on interoperability. There is so much to learn about cgroups, their operations, and their applications in the container world. That would be an endless path, so I’m stopping with just an overview of the filesystem for now. On a system with the unified hierarchy mounted, it shows up like this:

~ $ grep "cgroup" /proc/mounts
tmpfs /sys/fs/cgroup tmpfs ro,nosuid,nodev,noexec,mode=755 0 0
cgroup /sys/fs/cgroup/unified cgroup2 rw,nosuid,nodev,noexec,relatime,nsdelegate 0 0
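
In the unified hierarchy, the knobs for every controller live in the same directory tree: cgroup.controllers lists what is available, controllers are enabled for child cgroups via cgroup.subtree_control, and the per-controller files (e.g. memory.max) then appear inside each child directory. A rough sketch, assuming the memory controller is attached to the v2 hierarchy (on hybrid setups like the one above it is usually still claimed by v1):

~ $ cat /sys/fs/cgroup/unified/cgroup.controllers
~ $ sudo sh -c 'echo "+memory" > /sys/fs/cgroup/unified/cgroup.subtree_control'
~ $ sudo mkdir /sys/fs/cgroup/unified/prod_devs
~ $ sudo sh -c 'echo 4G > /sys/fs/cgroup/unified/prod_devs/memory.max'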

cgroup-references

devpts

devpts is a virtual filesystem used to manage pseudo-terminal devices, typically mounted at /dev/pts. These days we mostly deal with pseudo terminals, through applications like xterm, iTerm, etc. that emulate a hardware terminal. Pseudo terminals work in a master-slave relationship. There used to be an individual master-slave device pair for each pseudo terminal; that has been replaced with a terminal multiplexer (/dev/ptmx) in later releases. When a process (like xterm or any other terminal application) tries to open a new pseudo terminal, the multiplexer allocates a slave pseudo terminal in /dev/pts/ and returns the file descriptor to the calling process. There is quite a bit of history in the development of pseudo terminals.

~ $ mount -t devpts
devpts on /dev/pts type devpts (rw,nosuid,noexec,relatime,gid=5,mode=620,ptmxmode=000)

~ $ ls /dev/pts/
0  ptmx
~ $ tty
/dev/pts/0

~ $ #opening another terminal in another window.
~ $ ls /dev/pts/
0  1  ptmx

~ $ #send message to the new terminal
~ $ echo "hello, from pty0" >> /dev/pts/1

~ $ #sending a self message
~ $ echo "hello, from pty0" >> /dev/pts/0
hello, from pty0
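
The role of the multiplexer can also be seen directly: merely opening /dev/ptmx makes a new slave node appear under /dev/pts, and closing it removes the node again. A quick sketch from a bash shell (the slave number will vary):

~ $ exec 3<>/dev/ptmx    # open the multiplexer; the kernel allocates a new slave
~ $ ls /dev/pts/         # an extra entry (e.g. 2) shows up
~ $ exec 3>&-            # close it; the slave node disappears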

devpts-references

devtmpfs

devtmpfs is an advancement over the now-deprecated devfs filesystem. Both are meant for dynamic identification and management of system devices. devfs ran in kernel space and had some inherent issues that led to its deprecation over time. devfs was both a filesystem and a device manager; with devtmpfs, the filesystem operations are handled by devtmpfs while udev takes care of the actual device management. The udev daemon mounts the devtmpfs filesystem at /dev/ during boot (see /etc/rc.d/udev or /etc/init.d/udev if using init as the init daemon). All devices visible to the system have an entry in /dev/. New device nodes are created in the devtmpfs filesystem by the kernel, and the udev daemon is notified so it can do additional processing based on udev rules.

~ $ mount -t devtmpfs
udev on /dev type devtmpfs (rw,nosuid,relatime,size=1885344k,nr_inodes=471336,mode=755)

~ $ cd /dev
dev $ ls
autofs           disk       kmsg              network_latency     rtc       stdin   tty14  tty22  tty30  tty39  tty47  tty55  tty63  uinput   vcsa         vhci
block            fd         log               network_throughput  rtc0      stdout  tty15  tty23  tty31  tty4   tty48  tty56  tty7   urandom  vcsa1        vhost-net
bsg              full       loop-control      null                sda       tty     tty16  tty24  tty32  tty40  tty49  tty57  tty8   vcs      vcsa2        zero
btrfs-control    fuse       mapper            port                sda1      tty0    tty17  tty25  tty33  tty41  tty5   tty58  tty9   vcs1     vcsa3
char             hpet       mcelog            ppp                 sg0       tty1    tty18  tty26  tty34  tty42  tty50  tty59  ttyS0  vcs2     vcsa4
console          hugepages  mem               psaux               shm       tty10   tty19  tty27  tty35  tty43  tty51  tty6   ttyS1  vcs3     vcsa5
core             hwrng      memory_bandwidth  ptmx                snapshot  tty11   tty2   tty28  tty36  tty44  tty52  tty60  ttyS2  vcs4     vcsa6
cpu_dma_latency  initctl    mqueue            pts                 snd       tty12   tty20  tty29  tty37  tty45  tty53  tty61  ttyS3  vcs5     vfio
cuse             input      net               random              stderr    tty13   tty21  tty3   tty38  tty46  tty54  tty62  uhid   vcs6     vga_arbiter
dev $ ls /dev/block/
8:0  8:1
dev $ ls -l /dev/block/
total 0
lrwxrwxrwx 1 root root 6 Jan 16 21:29 8:0 -> ../sda
lrwxrwxrwx 1 root root 7 Jan 16 21:29 8:1 -> ../sda1
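
The split of responsibilities is visible with udevadm: the node itself lives in devtmpfs, while udev holds the metadata (symlinks, properties, matching rules) for it. A quick way to inspect that, output omitted here:

~ $ udevadm info --query=all --name=/dev/sda1       # device path, symlinks and properties as seen by udev
~ $ udevadm info --query=symlink --name=/dev/sda1   # just the extra symlinks udev maintains for the node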

macOS seems to use devfs instead of devtmpfs. I need to dig into this further.

~ $ mount -t devtmpfs
~ $ mount -t devfs
devfs on /dev (devfs, local, nobrowse)

devtmpfs-references

ramfs

ramfs allows us to use physical memory as the backend of a filesystem with the help of the Linux page caching mechanism. Since the backend is volatile storage, all contents stored in a ramfs are lost after unmounting. All file read and write operations typically hit the page cache first (unless it is direct I/O). For reads, data is returned from the cache if present; if not, data is read from the backing store (e.g. disk), cached, and then returned. For writes, blocks are first written to the cache and marked dirty. Once the dirty blocks are flushed (i.e. written permanently) to the backing store, they are marked clean and kept in the cache to serve future operations. Only clean blocks are allowed to be freed when there is a need for eviction; dirty blocks are never freed.

When we write into a file on a ramfs filesystem, the blocks in the page cache are allocated as usual but are never marked clean, because there is no backing store to flush to. Therefore one can grow a ramfs as big as the physical memory itself, at which point the system will stop responding because there is no memory left to operate with. For such reasons, only the root user is allowed to mount a ramfs.
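
Mounting one is a one-liner; notably there is no effective size option, which is exactly the problem described above (a quick sketch, /mnt/ramtest being an arbitrary mount point):

~ $ sudo mkdir /mnt/ramtest
~ $ sudo mount -t ramfs ramfs /mnt/ramtest    # df will not report meaningful size/usage for it
~ $ sudo umount /mnt/ramtest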

Prior to ramfs, a concept called ramdisk was used. A ramdisk simulates a fake block device backed by RAM; we can then create the desired filesystem (e.g. zfs) on top of the ramdisk. This is certainly expensive, as there is a lot of overhead and duplicated caching involved in managing a regular filesystem on top of a volatile block device.
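
For comparison, the ramdisk route goes through a RAM-backed block device plus a regular filesystem on top. A rough sketch, assuming the brd module provides the /dev/ram* devices (rd_size is in KiB):

~ $ sudo modprobe brd rd_nr=1 rd_size=102400    # one 100 MiB ramdisk at /dev/ram0
~ $ sudo mkfs.ext2 /dev/ram0
~ $ sudo mkdir /mnt/ramdisk
~ $ sudo mount /dev/ram0 /mnt/ramdisk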

Some of the shortcomings of ramfs are addressed in tmpfs.

rootfs

rootfs is a special form of tmpfs (or ramfs, if tmpfs is not available) that is always mounted and cannot be unmounted. It is mainly used during system bootup to run the init process. The kernel docs have only a minimal explanation of it. rootfs doesn’t show up in mount or df output, which makes me wonder whether rootfs is hidden or whether something else is used in place of rootfs. I found a few links (so thread, so thread2), but I’m still not very clear about the answers.

tmpfs

tmpfs was built on top of ramfs with the following additional capabilities:

  • size limiting
  • swap space usage

Because blocks written to a ramfs filesystem are never freed, we can keep writing to a ramfs until the full capacity of memory is exhausted. tmpfs provides an option to limit the maximum size of the filesystem, and the limit can be changed dynamically too. It also supports page swapping, where unneeded blocks are swapped out of memory into swap space backed by persistent storage like disk; when those blocks are accessed later on, they are paged back into memory. This allows better memory management and finer control for applications using memory for quick file-based operations. When a cache miss occurs in a tmpfs filesystem, blocks have to be loaded from swap space. That incurs disk I/O, so it is possible for a read/write to a tmpfs filesystem to wait for a disk I/O operation to complete. Like ramfs, the contents of tmpfs are volatile.

The size limit can be specified as a raw size in k/m/g units or as a percentage of physical memory. The default is 50% of physical memory if no size is specified.

Mounting a tmpfs filesystem

The general syntax for the mount command is mount [OPTIONS] DEVICE MOUNTPOINT. The device identifier doesn’t matter for tmpfs. In the example below, the device identifier is appdata and the mount point is /mnt/appdata. The device identifier could be anything here, but the mount point must exist. I think the general convention is to use tmpfs as the device name when mounting a tmpfs filesystem (e.g. mount -t tmpfs -o size=10M tmpfs /mnt/mymountpoint).

~ $ sudo mkdir /mnt/appdata
~ $ sudo mount -t tmpfs -o size=100M appdata /mnt/appdata
~ $ df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          41G   34G  6.4G  85% /
tmpfs            64M     0   64M   0% /dev
tmpfs           847M     0  847M   0% /sys/fs/cgroup
/dev/sdb1       4.8G   13M  4.6G   1% /home
/dev/sda1        41G   34G  6.4G  85% /root
shm              64M     0   64M   0% /dev/shm
overlayfs       1.0M  164K  860K  17% /etc/ssh/ssh_host_dsa_key
overlayfs       1.0M  164K  860K  17% /etc/ssh/keys
tmpfs           847M  704K  846M   1% /run/metrics
tmpfs           847M     0  847M   0% /run/google/devshell
appdata         100M     0  100M   0% /mnt/appdata

~ $ mount -l -t tmpfs
tmpfs on /dev type tmpfs (rw,nosuid,size=65536k,mode=755)
tmpfs on /sys/fs/cgroup type tmpfs (rw,nosuid,nodev,noexec,relatime,mode=755)
shm on /dev/shm type tmpfs (rw,nosuid,nodev,noexec,relatime,size=65536k)
tmpfs on /run/metrics type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /google/host/var/run type tmpfs (rw,nosuid,nodev,mode=755)
tmpfs on /run/google/devshell type tmpfs (rw,relatime)
appdata on /mnt/appdata type tmpfs (rw,relatime,size=102400k)
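
Since the limit is just a mount option, it can be changed on the fly with a remount (a quick sketch; growing is always safe, shrinking below the data already stored in the mount is not):

~ $ sudo mount -o remount,size=200M /mnt/appdata
~ $ df -h | grep appdata      # size should now report 200M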

The space used by tmpfs mounts is counted towards shared memory consumption, so it shows up under Shmem in the output of /proc/meminfo. Let’s create some dummy files in the tmpfs path and check the space consumption using df -h and /proc/meminfo.

~ $ df -h | grep appdata
appdata         100M     0  100M   0% /mnt/appdata
~ $ dd if=/dev/zero of=/mnt/appdata/dummy bs=1M count=50
50+0 records in
50+0 records out
52428800 bytes (52 MB, 50 MiB) copied, 0.0720446 s, 728 MB/s
~ $ df -h | grep appdata
appdata         100M   50M   50M  50% /mnt/appdata
~ $ grep -i "shmem" /proc/meminfo
Shmem:             52008 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
~ $ dd if=/dev/zero of=/mnt/appdata/dummy1 bs=1M count=20
20+0 records in
20+0 records out
20971520 bytes (21 MB, 20 MiB) copied, 0.0129131 s, 1.6 GB/s
~ $ grep -i "shmem" /proc/meminfo
Shmem:             72488 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
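
Cleaning up releases the pages again; the Shmem figure drops back once the files are removed and the filesystem is unmounted (commands only, since the exact numbers depend on what else is using shared memory):

~ $ rm /mnt/appdata/dummy /mnt/appdata/dummy1
~ $ grep -i "^shmem:" /proc/meminfo    # should drop by roughly 70 MB
~ $ sudo umount /mnt/appdata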

ramfs-roots-tmpfs-references

Many thanks to those who contributed to the reference pages; they were immensely helpful. If you find any misunderstanding or misconception in these notes, please let me know.
