Title: Linux Virtual File System
1Linux Virtual File System
2Aims
- Present the data structures in Linux VFS
- Provide information about flow of control
- Describe methods and invariants needed to implement a new file system
- Illustrate with some examples
3History
File access
- BSD implemented VFS for NFS; the aim was to dispatch to different filesystems
- VMS had an elaborate filesystem
- NT/Win95 have VFS type interfaces
- Newer systems integrate VM with buffer cache.
[Figure labels: VFS; nfs, ufs, Coda; disk, Venus, udp]
4Linux Filesystems
- Media based
- ext2 - Linux native
- ufs - BSD
- fat - DOS FS
- vfat - win 95
- hpfs - OS/2
- minix - well.
- isofs - CDROM
- sysv - Sysv Unix
- hfs - Macintosh
- affs - Amiga Fast FS
- NTFS - NT's FS
- adfs - Acorn-strongarm
- Network
- nfs
- Coda
- AFS - Andrew FS
- smbfs - LanManager
- ncpfs - Novell
- Special ones
- procfs - /proc
- umsdos - Unix in DOS
- userfs - redirector to user
5Linux Filesystems (ctd)
- Forthcoming
- devfs - device file system
- DFS - DCE distributed FS
- Varia
- cfs - crypt filesystem
- cfs - cache filesystem
- ftpfs - ftp filesystem
- mailfs - mail filesystem
- pgfs - Postgres versioning file system
- Linux serves (unrelated to the VFS!)
- NFS - user and kernel
- Coda
- AppleShare - netatalk/CAP
- SMB - samba
- NCP - Novell
6Linux is Obsolete
Usefulness
7Linux VFS
- Multiple interfaces build up VFS
- files
- dentries
- inodes
- superblock
- quota
- VFS can do all caching and provides utility functions to FS
- FS provides methods to VFS; many are optional
[Figure: File access. Labels: VFS; nfs, ext2fs, Coda FS; disk, udp, Venus]
8User level file access
- Typical user level types and code
- pathnames: "/myfile"
- file descriptors: fd = open("/myfile", ...)
- attributes in struct stat: stat("/myfile", &mybuf), chmod, chown...
- offsets: write, read, lseek
- directory handles: DIR *dh = opendir("/mydir")
- directory entries: struct dirent *ent = readdir(dh)
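The user level types listed above can be exercised with a few lines of C. This is a minimal sketch using the standard POSIX calls (stat, open, lseek, read, opendir, readdir); the helper names file_size, first_byte and count_dir_entries are made up for illustration and are not part of any API.

```c
#include <assert.h>
#include <dirent.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Attributes: return the size reported by stat(), or -1 if stat() fails. */
long file_size(const char *path)
{
    struct stat mybuf;

    assert(path != NULL);
    if (stat(path, &mybuf) != 0)
        return -1;
    return (long)mybuf.st_size;
}

/* File descriptors and offsets: return the first byte of a file, or -1. */
int first_byte(const char *path)
{
    unsigned char c;
    int fd;

    assert(path != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    if (lseek(fd, 0, SEEK_SET) < 0 || read(fd, &c, 1) != 1) {
        close(fd);
        return -1;
    }
    close(fd);
    return c;
}

/* Directory handles and entries: count entries (including "." and ".."),
 * or return -1 if the directory cannot be opened. */
int count_dir_entries(const char *path)
{
    DIR *dh;
    struct dirent *ent;
    int n = 0;

    assert(path != NULL);
    dh = opendir(path);
    if (dh == NULL)
        return -1;
    while ((ent = readdir(dh)) != NULL)
        n++;
    closedir(dh);
    return n;
}
```

Each of these calls ends up as a request to the VFS, which is why the same code works on any mounted file system.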
9VFS
- Manages kernel level file abstractions in one format for all file systems
- Receives system call requests from user level (e.g. write, open, stat, link)
- Interacts with a specific file system based on mount point traversal
- Receives requests from other parts of the kernel, mostly from memory management
10File system level
- Individual File Systems
- responsible for managing file and directory data
- responsible for managing meta-data: timestamps, owners, protection etc
- translates data between
- particular FS data: e.g. disk data, NFS data, Coda/AFS data
- VFS data: attributes etc in standard format
- e.g. nfs_getattr(.) returns attributes in VFS format, acquires attributes in NFS format to do so.
11Anatomy of stat system call
sys_stat(path, buf) {
    dentry = namei(path);                        /* establish VFS data */
    if ( dentry == NULL ) return -ENOENT;
    inode = dentry->d_inode;
    rc = inode->i_op->i_permission(inode);       /* call into inode layer of filesystem */
    if ( rc ) { dput(dentry); return -EPERM; }   /* release the dentry on the error path too */
    rc = inode->i_op->i_getattr(inode, buf);     /* call into inode layer of filesystem */
    dput(dentry);
    return rc;
}
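The stat flow above is essentially dispatch through a per-filesystem method table. The following user-space sketch models that idea only; it is not kernel code. All toy_ names, the two-entry name table and the permission rule are assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

struct inode;

/* Per-filesystem method table, modelled on the slide's i_op. */
struct inode_operations {
    int (*permission)(struct inode *);               /* 0 if access is allowed */
    int (*getattr)(struct inode *, long *size_out);  /* fill in attributes */
};

struct inode  { long i_size; int i_mode; const struct inode_operations *i_op; };
struct dentry { const char *d_name; struct inode *d_inode; int d_count; };

/* A toy "filesystem": access is allowed only if the owner-read bit is set. */
int toy_permission(struct inode *inode) { return (inode->i_mode & 0400) ? 0 : -1; }
int toy_getattr(struct inode *inode, long *size_out) { *size_out = inode->i_size; return 0; }

const struct inode_operations toy_iops = { toy_permission, toy_getattr };

struct inode  toy_inodes[2]   = { { 42, 0644, &toy_iops }, { 7, 0000, &toy_iops } };
struct dentry toy_dentries[2] = { { "/myfile", &toy_inodes[0], 0 },
                                  { "/secret", &toy_inodes[1], 0 } };

/* Stand-in for namei(): linear search that takes a reference on a hit. */
struct dentry *toy_namei(const char *path)
{
    size_t i;

    for (i = 0; i < sizeof toy_dentries / sizeof toy_dentries[0]; i++) {
        if (strcmp(toy_dentries[i].d_name, path) == 0) {
            toy_dentries[i].d_count++;
            return &toy_dentries[i];
        }
    }
    return NULL;
}

/* Stand-in for dput(): drop the reference taken by toy_namei(). */
void toy_dput(struct dentry *dentry)
{
    assert(dentry->d_count > 0);
    dentry->d_count--;
}

/* The VFS side: generic code that only ever calls through i_op. */
int toy_stat(const char *path, long *size_out)
{
    struct dentry *dentry = toy_namei(path);
    struct inode *inode;
    int rc;

    if (dentry == NULL)
        return -ENOENT;
    inode = dentry->d_inode;
    rc = inode->i_op->permission(inode);
    if (rc) {
        toy_dput(dentry);
        return -EPERM;
    }
    rc = inode->i_op->getattr(inode, size_out);
    toy_dput(dentry);
    return rc;
}
```

Note that toy_stat never looks inside the filesystem's data; swapping in a different method table changes the behaviour without touching the generic code.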
12Anatomy of fstatfs system call
sys_fstatfs(fd, buf) {                           /* for things like df */
    file = fget(fd);                             /* translate fd to VFS data structure */
    if ( file == NULL ) return -EBADF;
    superb = file->f_dentry->d_inode->i_sb;
    rc = superb->sb_op->sb_statfs(superb, buf);  /* call into superblock layer of filesystem */
    return rc;
}
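The pointer chain from a file descriptor to the superblock can be modelled in user space as follows. This is a sketch under stated assumptions: the fixed-size descriptor table, the toy_ names and the single "free blocks" attribute are all hypothetical, chosen only to show the fd -> file -> dentry -> inode -> superblock walk.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct super_block;

struct super_operations {
    int (*statfs)(struct super_block *, long *free_blocks);
};

struct super_block { long s_free; const struct super_operations *s_op; };
struct inode       { struct super_block *i_sb; };
struct dentry      { struct inode *d_inode; };
struct file        { struct dentry *f_dentry; };

#define TOY_NR_OPEN 4
struct file *toy_fd_table[TOY_NR_OPEN];   /* slot is NULL when the fd is closed */

/* Stand-in for fget(): translate a descriptor into a struct file. */
struct file *toy_fget(int fd)
{
    if (fd < 0 || fd >= TOY_NR_OPEN)
        return NULL;
    return toy_fd_table[fd];
}

/* The toy filesystem's statfs method. */
int toy_statfs(struct super_block *sb, long *free_blocks)
{
    *free_blocks = sb->s_free;
    return 0;
}

const struct super_operations toy_sops = { toy_statfs };
struct super_block toy_sb     = { 1234, &toy_sops };
struct inode       toy_inode  = { &toy_sb };
struct dentry      toy_dentry = { &toy_inode };
struct file        toy_file   = { &toy_dentry };

/* Generic side: follow the pointers, then call into the superblock layer. */
int toy_fstatfs(int fd, long *free_blocks)
{
    struct file *file = toy_fget(fd);
    struct super_block *sb;

    if (file == NULL)
        return -EBADF;
    sb = file->f_dentry->d_inode->i_sb;
    assert(sb->s_op != NULL && sb->s_op->statfs != NULL);
    return sb->s_op->statfs(sb, free_blocks);
}
```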
13Data structures
- VFS data structures for
- VFS handle to the file: inode (BSD: vnode)
- User instantiated file handle: file (BSD: file)
- The whole filesystem: superblock (BSD: vfs)
- A name to inode translation: dentry
14Shorthand method notation
- super block methods: sss_methodname
- inode methods: iii_methodname
- dentry methods: ddd_methodname
- file methods: fff_methodname
- instead of
- inode->i_op->lookup we write iii_lookup
15namei
VFS (with the FS methods it calls shown as comments):
struct dentry *namei(parent, name) {
    if (dentry = d_lookup(parent, name))    /* FS: ddd_hash(parent, name), ddd_revalidate(dentry) */
        ;
    else
        ;                                   /* FS: iii_lookup(parent, name) */
}
struct inode *iget(ino, dev) {
    /* try cache else .. */                 /* FS: sss_read_inode() */
}
16Superblocks
- Handle metadata only (attributes etc)
- Responsible for retrieving and storing metadata from the FS media or peers
- Struct superblocks hold things like
- device, blocksize, dirty flags, list of dirty inodes
- super operations
- wait queue
- pointer to the root inode of this FS
17Super Operations (sss_)
- Ops on Inodes
- read_inode
- put_inode
- write_inode
- delete_inode
- clear_inode
- notify_change
- Superblock manips
- read_super (mount)
- put_super (unmount)
- write_super (unmount)
- statfs (attributes)
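Since many of the methods a file system provides are optional, generic code has to check for a missing method before calling it. The sketch below shows that pattern with a cut-down operations table; the two toy file systems (rofs, diskfs), the s_writes counter and vfs_sync_inode are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

struct super_block;

struct inode { long i_ino; int i_dirty; struct super_block *i_sb; };

/* A cut-down super operations table; any entry may be NULL. */
struct super_operations {
    void (*read_inode)(struct inode *);
    void (*write_inode)(struct inode *);
    void (*put_inode)(struct inode *);
};

struct super_block { const struct super_operations *s_op; int s_writes; };

/* A read-only toy FS: it only knows how to read inodes. */
void rofs_read_inode(struct inode *inode) { inode->i_dirty = 0; }
const struct super_operations rofs_sops = { rofs_read_inode, NULL, NULL };

/* A media-based toy FS: it can also write inodes back. */
void diskfs_read_inode(struct inode *inode)  { inode->i_dirty = 0; }
void diskfs_write_inode(struct inode *inode) { inode->i_sb->s_writes++; }
const struct super_operations diskfs_sops = { diskfs_read_inode, diskfs_write_inode, NULL };

/* Generic side: write back a dirty inode if the FS supplies write_inode.
 * Returns 1 if the method was called, 0 otherwise. */
int vfs_sync_inode(struct inode *inode)
{
    const struct super_operations *ops = inode->i_sb->s_op;
    int called = 0;

    assert(ops != NULL);
    if (inode->i_dirty && ops->write_inode != NULL) {
        ops->write_inode(inode);
        called = 1;
    }
    inode->i_dirty = 0;
    return called;
}
```

The design choice is that a simple file system fills in only the entries it needs, and the generic layer supplies the default behaviour (here: do nothing) for the rest.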
18Inodes
- Inodes are the VFS abstraction for the file
- Inode has operations (iii_methods)
- VFS maintains an inode cache, NOT the individual FSs (compare NT, BSD etc)
- Inodes contain an FS specific area where
- ext2 stores disk block numbers etc
- AFS would store the FID
- Extraordinary inode ops are good for dealing with stale NFS file handles etc.
19What's inside an inode - 1
- caching: list_head i_hash, list_head i_list, list_head i_dentry, int i_count
- identifies file: long i_ino, int i_dev
- usual stuff: {m,a,c}time, {u,g}id, mode, size, n_link
20What's inside an inode - 2
- which FS: superblock i_sb, inode_ops i_op
- for mmap, networking, waiting: wait objects, semaphore lock, vm_area_struct, pipe/socket info, page information
- FS specific info (block nos, fids etc): union { ext2fs_inode_info i_ext2; nfs_inode_info i_nfs; coda_inode_info i_coda; .. } u
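The FS specific area is a plain C union, so every inode is the same size regardless of which file system owns it. A minimal sketch follows; the member layouts are invented for illustration (only the type and member names come from the slide), and the i_fs tag is an assumption added so the toy can check which member is live.

```c
#include <assert.h>

/* Hypothetical private data for three file systems. */
struct ext2fs_inode_info { unsigned long i_data[15]; };           /* block numbers */
struct nfs_inode_info    { unsigned char fhandle[32]; };          /* file handle */
struct coda_inode_info   { unsigned long volume, vnode, unique; };/* fid */

/* Assumption: an explicit tag; the slide's inode instead reaches the
 * owning file system through i_sb and i_op. */
enum toy_fs_type { TOY_EXT2, TOY_NFS, TOY_CODA };

struct toy_inode {
    unsigned long i_ino;
    enum toy_fs_type i_fs;
    union {
        struct ext2fs_inode_info i_ext2;
        struct nfs_inode_info    i_nfs;
        struct coda_inode_info   i_coda;
    } u;                                  /* one slot shared by all FS types */
};

/* Only ext2 code should touch u.i_ext2; the assert documents that rule. */
unsigned long toy_first_block(const struct toy_inode *inode)
{
    assert(inode->i_fs == TOY_EXT2);
    return inode->u.i_ext2.i_data[0];
}
```

The union is as large as its largest member, which is the price paid for keeping all inodes in one cache.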
21Inode state
- Inode can be on one or two lists
- (hash and in_use) or (hash and dirty) or unused
- inode has a use count i_count
- Transitions
- unused -> hash: iget calls sss_read_inode
- dirty -> in_use: sss_write_inode
- hash -> unused: call on sss_clear_inode, but if i_nlink == 0, iput calls sss_delete_inode when i_count falls to 0
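The transitions above can be written down as a small state machine. This is a deliberately simplified user-space model: it clears the inode as soon as the last reference is dropped, whereas a real cache may keep unused inodes around for reuse. The toy_ names and the call counters are assumptions made so the behaviour can be observed.

```c
#include <assert.h>

enum toy_state { TOY_UNUSED, TOY_IN_USE, TOY_DIRTY };

struct toy_inode {
    enum toy_state state;
    int i_count;                         /* use count */
    int i_nlink;                         /* 0 means the file was unlinked */
    int reads, writes, clears, deletes;  /* how often each FS method ran */
};

/* unused -> in use: the first iget reads the inode from the FS. */
void toy_iget(struct toy_inode *inode)
{
    if (inode->state == TOY_UNUSED) {
        inode->reads++;                  /* stands in for sss_read_inode */
        inode->state = TOY_IN_USE;
    }
    inode->i_count++;
}

/* in use -> dirty: only a referenced inode can be modified. */
void toy_mark_dirty(struct toy_inode *inode)
{
    assert(inode->i_count > 0);
    inode->state = TOY_DIRTY;
}

/* dirty -> in use: write the inode back. */
void toy_sync(struct toy_inode *inode)
{
    if (inode->state == TOY_DIRTY) {
        inode->writes++;                 /* stands in for sss_write_inode */
        inode->state = TOY_IN_USE;
    }
}

/* Drop a reference; the last iput decides between delete and clear. */
void toy_iput(struct toy_inode *inode)
{
    assert(inode->i_count > 0);
    if (--inode->i_count > 0)
        return;
    if (inode->i_nlink == 0) {
        inode->deletes++;                /* stands in for sss_delete_inode */
    } else {
        toy_sync(inode);                 /* write back before letting go */
        inode->clears++;                 /* stands in for sss_clear_inode */
    }
    inode->state = TOY_UNUSED;
}
```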
22Inode Cache
Players
1. iget: if i_count > 0, ++
2. iput: if i_count > 1, --
3. free_inodes
4. syncing inodes
[Figure: inode cache. Lists: Inode_hashtable, used inodes, dirty inodes, unused inodes. Transition labels: sss_read_inode (iget); sss_write_inode (sync one); mark_inode_dirty (media fs only); sss_clear_inode (freeing inos) or sss_delete_inode (iput)]
23Sales
Red Hat Software sold 240,000 copies of Red Hat Linux in 1997 and expects to reach 400,000 in 1998.
Estimates of installed servers (InfoWorld)
- Linux 7 million
- OS/2 5 million
- Macintosh 1 million
24Inode operations (iii_)
- symbolic links
- readlink
- follow_link
- pages
- readpage, writepage, updatepage - read or write page. Generic for mediafs.
- bmap - return disk block number of logical block
- special operations
- revalidate - see dentry section
- truncate
- permission
- lookup - return inode
- calls iget
- creation/removal
- create
- link
- unlink
- symlink
- mkdir
- rmdir
- mknod
- rename
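Of the operations listed, bmap is the easiest to illustrate in isolation: it maps a logical block of a file to a block number on disk. The sketch below assumes a hypothetical layout with a few direct block pointers and one indirect block, loosely in the style of classic Unix file systems; the sizes and toy_ names are invented, and 0 is used to mean "no block".

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NDIRECT      4   /* block pointers stored in the inode itself */
#define TOY_PER_INDIRECT 8   /* block pointers held by the indirect block */

struct toy_inode {
    unsigned long direct[TOY_NDIRECT];
    const unsigned long *indirect;   /* NULL if no indirect block is allocated */
};

/* Return the disk block holding logical block `block` of the file,
 * or 0 if that block is a hole or lies beyond what the inode can map. */
unsigned long toy_bmap(const struct toy_inode *inode, unsigned long block)
{
    assert(inode != NULL);
    if (block < TOY_NDIRECT)
        return inode->direct[block];
    block -= TOY_NDIRECT;
    if (block < TOY_PER_INDIRECT && inode->indirect != NULL)
        return inode->indirect[block];
    return 0;
}
```

With a mapping like this in place, generic page read and write code can be shared by all media-based file systems, since only the block lookup is specific to the on-disk format.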
25Dentry world
- Dentry is a name to inode translation structure
- Cached aggressively by VFS
- Eliminates lookups by FS private caches
- timing on Coda FS: ls -lR 1000 files after priming cache
- linux 2.0.32: 7.2 secs
- linux 2.1.92: 0.6 secs
- disk fs less benefit, NFS even more
- Negative entries!
- Namei is dramatically simplified
26Inside dentries
- name
- pointer to inode
- pointer to parent dentry
- list head of children
- chains for lots of lists
- use count
27Dentry associated lists
[Figure: dentry and inode relationships. Legend: inode; dentry; dentry inode relationship; dentry tree relationship. Labels: inode i_dentry list head; d_inode pointer; d_parent pointer]
- d_child chains: place d_alloc; remove d_prune, d_invalidate, d_put
- d_alias chains: place d_instantiate; remove dentry_iput
28Dcache
- namei tries cache: d_lookup
- ddd_compare
- Success: ddd_revalidate
- d_invalidate if fails
- proceed if success
- Failure: iii_lookup
- find inode
- iget
- sss_read_inode
- finish
- d_add
- can give negative entry in dcache
[Figure: dcache. dentry_hashtable (d_hash chains) with list heads at dhash(parent, name); labels: namei, iii_lookup, d_add; prune, d_invalidate, d_drop; unused dentries (d_lru chains)]
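The lookup flow above, including negative entries, can be modelled with a tiny name cache. This is a sketch only: it uses a linear scan instead of a hash table keyed on (parent, name), has no revalidation, and evicts crudely. The toy_ names, the two-file "filesystem" and the convention that inode number 0 marks a negative entry are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_DCACHE_SIZE 8
#define TOY_NAME_MAX    32

/* ino == 0 marks a negative entry: "this name is known not to exist". */
struct toy_dentry { char name[TOY_NAME_MAX]; long ino; int valid; };

struct toy_dentry toy_dcache[TOY_DCACHE_SIZE];
int toy_fs_lookups;                      /* counts calls into the "FS" */

/* The toy filesystem: it knows two names. Stands in for iii_lookup. */
long toy_fs_lookup(const char *name)
{
    toy_fs_lookups++;
    if (strcmp(name, "a") == 0) return 11;
    if (strcmp(name, "b") == 0) return 12;
    return 0;
}

/* Stands in for d_lookup: search the cache, return NULL on a miss. */
struct toy_dentry *toy_d_lookup(const char *name)
{
    size_t i;

    for (i = 0; i < TOY_DCACHE_SIZE; i++)
        if (toy_dcache[i].valid && strcmp(toy_dcache[i].name, name) == 0)
            return &toy_dcache[i];
    return NULL;
}

/* Stands in for d_add: remember the result, positive or negative. */
struct toy_dentry *toy_d_add(const char *name, long ino)
{
    size_t i, slot = 0;                  /* crude eviction: reuse slot 0 if full */

    assert(strlen(name) < TOY_NAME_MAX);
    for (i = 0; i < TOY_DCACHE_SIZE; i++)
        if (!toy_dcache[i].valid) { slot = i; break; }
    strcpy(toy_dcache[slot].name, name);
    toy_dcache[slot].ino = ino;
    toy_dcache[slot].valid = 1;
    return &toy_dcache[slot];
}

/* Try the cache first; only on a miss ask the filesystem. */
long toy_namei(const char *name)
{
    struct toy_dentry *dentry = toy_d_lookup(name);

    if (dentry == NULL)
        dentry = toy_d_add(name, toy_fs_lookup(name));
    return dentry->ino;
}
```

Looking up a missing name twice reaches the filesystem only once, which is the benefit of caching negative entries.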
29Dentry methods
- ddd_revalidate: can force new lookup
- ddd_hash: compute hash value of name
- ddd_compare: are names equal?
- ddd_delete, ddd_put, ddd_iput: FS cleanup opportunity
30Dentry particulars
- ddd_hash and ddd_compare have to deal with extraordinary cases for msdos/vfat
- case insensitive
- long and short filename pleasantries
- ddd_revalidate -- can force new lookup if inode not in use
- used for NFS/SMBfs aging
- used for Coda/AFS callbacks
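For a case-insensitive file system, the hash and compare methods must agree: two names that compare equal have to land in the same hash chain. A minimal sketch of such a pair is shown below. The function names, the multiplier 31 and the memcmp-style return convention (0 means equal) are choices made for this example, and only ASCII case folding via tolower is handled.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hash a name after folding it to lower case, so that "README.TXT"
 * and "readme.txt" end up with the same hash value. */
unsigned long toy_hash_ci(const char *name, size_t len)
{
    unsigned long h = 0;
    size_t i;

    assert(name != NULL);
    for (i = 0; i < len; i++)
        h = h * 31 + (unsigned long)tolower((unsigned char)name[i]);
    return h;
}

/* Compare two names ignoring case.
 * Returns 0 if they are equal, nonzero otherwise. */
int toy_compare_ci(const char *a, size_t alen, const char *b, size_t blen)
{
    size_t i;

    assert(a != NULL && b != NULL);
    if (alen != blen)
        return 1;
    for (i = 0; i < alen; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 1;
    return 0;
}
```

If only the compare were case-insensitive, a lookup for "readme.txt" would search the wrong chain and never find the cached "README.TXT".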
31Style
"Dijkstra probably hates me" - Linus Torvalds
32Memory mapping
- vm_area structure has
- vm_operations
- inode, addresses etc.
- vm_operations
- map, unmap
- swapin, swapout
- nopage -- read when page isn't in VM
- mmap
- calls on iii_readpage
- keeps a use count on the inode until unmap
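From user level, the effect of the mapping holding its own reference can be seen by closing the file descriptor before touching the mapped memory. The sketch below uses the standard POSIX calls mkstemp, write, mmap and munmap; the helper name mapped_byte and the temporary file path are assumptions for this example.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700   /* for mkstemp() under strict -std= modes */
#endif
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Write `text` to a temporary file, map it read-only, and return the
 * byte at `offset` read through the mapping, or -1 on any error. */
int mapped_byte(const char *text, size_t offset)
{
    char path[] = "/tmp/vfs_mmap_XXXXXX";
    size_t len;
    char *map;
    int fd, c;

    assert(text != NULL);
    len = strlen(text);
    if (offset >= len)
        return -1;
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                        /* the file lives on while it is in use */
    if (write(fd, text, len) != (ssize_t)len) {
        close(fd);
        return -1;
    }
    map = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                           /* the mapping stays valid after close */
    if (map == MAP_FAILED)
        return -1;
    c = (unsigned char)map[offset];      /* access goes through the mapping */
    munmap(map, len);                    /* the last reference is dropped here */
    return c;
}
```

Here the file has been both unlinked and closed before the read, yet the data is still reachable until munmap, which is the user-visible side of the use count described above.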