








                AA FFaasstt FFiillee SSyysstteemm ffoorr UUNNIIXX**


          _M_a_r_s_h_a_l_l _K_i_r_k _M_c_K_u_s_i_c_k_, _W_i_l_l_i_a_m _N_. _J_o_y_,
             _S_a_m_u_e_l _J_. _L_e_f_f_l_e_r_, _R_o_b_e_r_t _S_. _F_a_b_r_y
              Computer Systems Research Group
                 Computer Science Division
 Department of Electrical Engineering and Computer Science
             University of California, Berkeley
                    Berkeley, CA  94720


                          _A_B_S_T_R_A_C_T

          A reimplementation of the UNIX file system is
     described.  The reimplementation provides substan-
     tially higher throughput rates by using more flex-
     ible allocation policies that allow better  local-
     ity  of  reference  and  can  be adapted to a wide
     range of peripheral and processor characteristics.
     The  new file system clusters data that is sequen-
     tially accessed and provides two  block  sizes  to
     allow fast access to large files while not wasting
     large amounts of space for small files.  File access
     rates up to ten times faster than those of the tradi-
     tional UNIX file system are experienced.
     Long   needed  enhancements  to  the  programmers'
     interface are discussed.  These include  a  mecha-
     nism  to place advisory locks on files, extensions
     of the name space across file systems, the ability
     to  use long file names, and provisions for admin-
     istrative control of resource usage.


     Revised February 18, 1984



-----------
* UNIX is a trademark of Bell Laboratories.
  William  N.  Joy  is  currently employed by: Sun
Microsystems, Inc, 2550  Garcia  Avenue,  Mountain
View, CA 94043
  Samuel  J.  Leffler  is  currently  employed by:
Lucasfilm Ltd., PO Box 2009, San Rafael, CA 94912
This work was done under grants from the  National
Science  Foundation  under  grant MCS80-05144, and
the Defense Advanced Research Projects Agency (DoD)
under ARPA Order No. 4031 monitored by Naval Elec-
tronic   System   Command   under   Contract   No.
N00039-82-C-0235.









SMM:05-2                         A Fast File System for UNIX


CR Categories and Subject Descriptors: D.4.3 [Operating
Systems]: File Systems Management - file organization,
directory structures, access methods; D.4.2 [Operating
Systems]: Storage Management - allocation/deallocation
strategies, secondary storage devices; D.4.8 [Operating
Systems]: Performance - measurements, operational analysis;
H.3.2 [Information Systems]: Information Storage - file
organization

Additional Keywords and Phrases: UNIX, file system organiza-
tion,  file system performance, file system design, applica-
tion program interface.

General Terms: file system, measurement, performance.

                     TTAABBLLEE OOFF CCOONNTTEENNTTSS


11..  IInnttrroodduuccttiioonn

22..  OOlldd ffiillee ssyysstteemm

33..  NNeeww ffiillee ssyysstteemm oorrggaanniizzaattiioonn
3.1.    Optimizing storage utilization
3.2.    File system parameterization
3.3.    Layout policies

44..  PPeerrffoorrmmaannccee

55..  FFiillee ssyysstteemm ffuunnccttiioonnaall eennhhaanncceemmeennttss
5.1.     Long file names
5.2.     File locking
5.3.     Symbolic links
5.4.     Rename
5.5.     Quotas

AAcckknnoowwlleeddggeemmeennttss

RReeffeerreenncceess

11..  IInnttrroodduuccttiioonn

     This paper describes the changes from the original  512
byte  UNIX  file system to the new one released with the 4.2
Berkeley Software Distribution.  It presents the motivations
for  the  changes, the methods used to effect these changes,
the rationale behind the design decisions, and a description
of the new implementation.  This discussion is followed by a
summary of the results that have been  obtained,  directions
for  future  work,  and  the additions and changes that have
been made to the facilities that are available  to  program-
mers.

     The original UNIX system that runs on  the  PDP-11  has
simple  and  elegant  file  system  facilities.  File system
input/output is buffered by the kernel; there are no  align-
ment  constraints  on  data transfers and all operations are
made to appear synchronous.  All transfers to the  disk  are
in  512  byte blocks, which can be placed arbitrarily within
the data area of the file system.  Virtually no  constraints
other  than  available  disk space are placed on file growth
[Ritchie74], [Thompson78].*


-----------
 DEC, PDP, VAX, MASSBUS, and UNIBUS are trademarks
of Digital Equipment Corporation.
* In practice, a file's size is constrained to  be
less than about one gigabyte.

     When  used  on  the  VAX-11  together  with  other UNIX
enhancements, the original 512  byte  UNIX  file  system  is
incapable  of  providing the data throughput rates that many
applications require.  For  example,  applications  such  as
VLSI design and image processing do a small amount of
processing on large quantities of data and need to have a
high throughput from the file system.  High throughput rates
are also needed by programs that map  files  from  the  file
system  into  large  virtual address spaces.  Paging data in
and out of the file system is  likely  to  occur  frequently
[Ferrin82b].  This requires a file system providing higher
bandwidth than the original 512 byte UNIX file system,
which provides only about two percent of the maximum disk
bandwidth, or about 20 kilobytes per second per arm
[White80], [Smith81b].

     Modifications have been made to the UNIX file system to
improve its performance.  Since the UNIX file system  inter-
face is well understood and not inherently slow, this devel-
opment retained  the  abstraction  and  simply  changed  the
underlying  implementation to increase its throughput.  Con-
sequently, users of the system have not been faced with mas-
sive software conversion.

     Problems  with  file system performance have been dealt
with extensively in the literature;  see  [Smith81a]  for  a
survey.   Previous work to improve the UNIX file system per-
formance has been done by [Ferrin82a].  The  UNIX  operating
system  drew  many  of its ideas from Multics, a large, high
performance  operating  system  [Feiertag71].   Other   work
includes  Hydra  [Almes78],  Spice  [Thompson80], and a file
system for a LISP environment [Symbolics81].  A good  intro-
duction  to  the physical latencies of disks is described in
[Pechura83].



22..  OOlldd FFiillee SSyysstteemm

     In the file system developed at Bell Laboratories  (the
``traditional''  file  system),  each  disk drive is divided
into one or more partitions.  Each of these disk  partitions
may contain one file system.  A file system never spans mul-
tiple  partitions.  A file system is described by its super-
block, which contains the basic parameters of the file  sys-
tem.   These  include  the number of data blocks in the file
-----------
 By ``partition'' here we refer to the subdivision
of  physical space on a disk drive.  In the tradi-
tional file system, as in  the  new  file  system,
file  systems  are  really located in logical disk
partitions that may overlap.  This overlapping  is
made  available, for example, to allow programs to
copy entire disk drives containing  multiple  file
systems.

system, a count of  the  maximum  number  of  files,  and  a
pointer to the free list, a linked list of all the free
blocks in the file system.

     Within the file system are files.   Certain  files  are
distinguished  as  directories and contain pointers to files
that may  themselves  be  directories.   Every  file  has  a
descriptor associated with it called an inode.  An inode
contains information describing ownership of the file,  time
stamps  marking  last  modification and access times for the
file, and an array of indices that point to the data  blocks
for  the  file.  For the purposes of this section, we assume
that the first 8 blocks of the file are directly  referenced
by values stored in an inode itself*.   An  inode  may  also
contain  references  to  indirect  blocks containing further
data block indices.  In a file system with a 512 byte  block
size,  a  singly  indirect  block contains 128 further block
addresses, a doubly indirect block contains 128 addresses of
further  singly indirect blocks, and a triply indirect block
contains 128 addresses of further doubly indirect blocks.

     A 150 megabyte traditional UNIX file system consists of
4  megabytes  of  inodes  followed by 146 megabytes of data.
This organization segregates the inode information from  the
data; thus accessing a file normally incurs a long seek from
the file's inode to its data.  Files in a  single  directory
are  not  typically  allocated  consecutive  slots  in the 4
megabytes of inodes, causing many non-consecutive blocks  of
inodes  to  be  accessed  when  executing  operations on the
inodes of several files in a directory.

     The allocation of data blocks to files is also subopti-
mum.   The traditional file system never transfers more than
512 bytes per disk transaction and often finds that the next
sequential  data  block is not on the same cylinder, forcing
seeks between 512 byte transfers.  The  combination  of  the
small block size, limited read-ahead in the system, and many
seeks severely limits file system throughput.

     The first work at Berkeley  on  the  UNIX  file  system
attempted  to  improve both reliability and throughput.  The
reliability was improved by staging modifications to  criti-
cal  file  system  information  so that they could either be
completed or repaired cleanly by a  program  after  a  crash
[Kowalski78].  The file system performance was improved by a
factor of more than two by changing  the  basic  block  size
from  512  to  1024  bytes.  The increase was because of two
factors: each disk transfer accessed twice as much data, and
most  files  could be described without need to access indi-
rect blocks since the direct blocks contained twice as  much
data.  The file system with these changes will henceforth be
-----------
*  The  actual number may vary from system to sys-
tem, but is usually in the range 5-13.

referred to as the old file system.

     This performance improvement gave a  strong  indication
that increasing the block size was a good method for improv-
ing throughput.  Although the throughput  had  doubled,  the
old  file  system was still using only about four percent of
the disk bandwidth.  The main problem was that although  the
free  list  was  initially  ordered  for  optimal access, it
quickly became scrambled as files were created and  removed.
Eventually  the  free  list  became entirely random, causing
files to have their blocks allocated randomly over the disk.
This  forced a seek before every block access.  Although old
file systems provided transfer rates of up to 175  kilobytes
per  second when they were first created, this rate deterio-
rated to 30 kilobytes per second after a few weeks of moder-
ate  use  because of this randomization of data block place-
ment.  There was no way of restoring the performance  of  an
old  file  system  except  to dump, rebuild, and restore the
file  system.   Another   possibility,   as   suggested   by
[Maruyama76],  would  be to have a process that periodically
reorganized the data on the disk to restore locality.



33..  NNeeww ffiillee ssyysstteemm oorrggaanniizzaattiioonn

     In the new file system organization (as in the old file
system  organization),  each disk drive contains one or more
file systems.  A file system  is  described  by  its  super-
block,  located  at  the beginning of the file system's disk
partition.  Because the super-block contains critical  data,
it is replicated to protect against catastrophic loss.  This
is done when the file system is created;  since  the  super-
block  data  does  not change, the copies need not be refer-
enced unless a head crash or other hard  disk  error  causes
the default super-block to be unusable.

     To ensure that it is possible to create files as large
as 2^32 bytes with only two levels of indirection, the mini-
mum  size of a file system block is 4096 bytes.  The size of
file system blocks can be any power of two greater  than  or
equal  to 4096.  The block size of a file system is recorded
in the file system's super-block so it is possible for  file
systems  with  different  block  sizes  to be simultaneously
accessible on the same  system.   The  block  size  must  be
decided at the time that the file system is created; it can-
not be subsequently changed without rebuilding the file sys-
tem.

     The  new file system organization divides a disk parti-
tion into one or more areas called cylinder groups.  A
cylinder  group  is  comprised  of  one  or more consecutive
cylinders on a disk.  Associated with each cylinder group is
some  bookkeeping information that includes a redundant copy

of the super-block, space for inodes, a bit  map  describing
available blocks in the cylinder group, and summary informa-
tion describing the usage of data blocks within the cylinder
group.   The  bit  map  of  available blocks in the cylinder
group replaces the traditional file system's free list.  For
each  cylinder  group a static number of inodes is allocated
at file system creation time.   The  default  policy  is  to
allocate  one  inode  for  each  2048  bytes of space in the
cylinder group, expecting this to be far more than will ever
be needed.

     All the cylinder group bookkeeping information could be
placed at the beginning of each cylinder group.  However  if
this approach were used, all the redundant information would
be on the top  platter.   A  single  hardware  failure  that
destroyed the top platter could cause the loss of all redun-
dant copies of the super-block.   Thus  the  cylinder  group
bookkeeping  information begins at a varying offset from the
beginning of the cylinder group.  The offset for  each  suc-
cessive  cylinder  group is calculated to be about one track
further from the beginning of the cylinder  group  than  the
preceding  cylinder group.  In this way the redundant infor-
mation spirals down into the pack so that any single  track,
cylinder,  or  platter can be lost without losing all copies
of the super-block.  Except for the  first  cylinder  group,
the  space  between  the beginning of the cylinder group and
the beginning of the cylinder group information is used  for
data blocks.

33..11..  OOppttiimmiizziinngg ssttoorraaggee uuttiilliizzaattiioonn

     Data  is  laid  out so that larger blocks can be trans-
ferred in a single disk transaction, greatly increasing file
system  throughput.   As  an example, consider a file in the
new file system composed of 4096 byte data blocks.   In  the
old  file  system  this  file would be composed of 1024 byte
-----------
  While  it  appears that the first cylinder group
could be laid out  with  its  super-block  at  the
``known''  location,  this would not work for file
systems with block sizes of 16 kilobytes or
greater.   This  is  because of a requirement that
the first 8 kilobytes of the disk be reserved  for
a  bootstrap  program  and  a separate requirement
that the cylinder group  information  begin  on  a
file system block boundary.  To start the cylinder
group on a file system block boundary,  file  sys-
tems  with  block  sizes  larger  than 8 kilobytes
would have to leave an empty space between the end
of  the boot block and the beginning of the cylin-
der group.  Without knowing the size of  the  file
system  blocks,  the  system  would  not know what
roundup function to use to find the  beginning  of
the first cylinder group.

blocks.  By increasing the block size, disk accesses in  the
new file system may transfer up to four times as much infor-
mation per disk transaction.  In large files,  several  4096
byte  blocks may be allocated from the same cylinder so that
even larger data transfers are possible before  requiring  a
seek.

     The  main  problem with larger blocks is that most UNIX
file systems are composed of many small files.  A  uniformly
large  block size wastes space.  Table 1 shows the effect of
file system block size on the amount of wasted space in  the
file  system.   The  files  measured to obtain these figures
reside on one of our time sharing systems that  has  roughly
1.2  gigabytes  of  on-line  storage.   The measurements are
based on the active user file systems containing  about  920
megabytes of formatted space.


+------------+---------+--------------------------------------------------+
|Space used  | % waste | Organization                                     |
+------------+---------+--------------------------------------------------+
| 775.2 Mb   |   0.0   | Data only, no separation between files           |
| 807.8 Mb   |   4.2   | Data only, each file starts on 512 byte boundary |
| 828.7 Mb   |   6.9   | Data + inodes, 512 byte block UNIX file system   |
| 866.5 Mb   |  11.8   | Data + inodes, 1024 byte block UNIX file system  |
| 948.5 Mb   |  22.4   | Data + inodes, 2048 byte block UNIX file system  |
| 1128.3 Mb  |  45.6   | Data + inodes, 4096 byte block UNIX file system  |
+------------+---------+--------------------------------------------------+
Table 1 - Amount of wasted space as a function of block size.

The space wasted is calculated to be the percentage of space
on the disk not containing user data.  As the block size  on
the disk increases, the waste rises quickly, to an intolera-
ble 45.6% waste with 4096 byte file system blocks.

     To be able to use large  blocks  without  undue  waste,
small files must be stored in a more efficient way.  The new
file system accomplishes this goal by allowing the  division
of a single file system block into one or more fragments.
The file system fragment size is specified at the time  that
the  file  system  is  created;  each  file system block can
optionally be broken into 2, 4,  or  8  fragments,  each  of
which  is addressable.  The lower bound on the size of these
fragments is constrained by the disk sector size,  typically
512  bytes.   The  block  map  associated with each cylinder
group records the space available in a cylinder group at the
fragment  level;  to  determine  if  a  block  is available,
aligned fragments are examined.  Figure 1 shows a piece of a
map  from  a  4096/1024  file  system.   Each bit in the map
records the status of a fragment; an ``X''  shows  that  the
fragment is in use, while a ``O'' shows that the fragment is
available for allocation.  In this example,  fragments  0-5,
10,  and  11  are in use, while fragments 6-9, and 12-15 are
free.  Fragments of adjoining blocks cannot  be  used  as  a

+-----------------+----------------------------+
|Bits in map      | XXXX   XXOO   OOXX   OOOO  |
|Fragment numbers | 0-3    4-7    8-11   12-15 |
|Block numbers    |  0      1      2       3   |
+-----------------+----------------------------+
Figure 1 - Example layout of blocks and fragments in a 4096/1024 file system.

full block, even if they are large enough.  In this example,
fragments 6-9 cannot be allocated  as  a  full  block;  only
fragments 12-15 can be coalesced into a full block.

     On  a file system with a block size of 4096 bytes and a
fragment size of 1024 bytes, a file is represented  by  zero
or  more  4096  byte  blocks  of data, and possibly a single
fragmented block.  If a file system block must be fragmented
to  obtain  space  for a small amount of data, the remaining
fragments of the block are made available for allocation  to
other  files.   As  an  example  consider an 11000 byte file
stored on a 4096/1024 byte file system.  This file would
use two full size blocks and one three fragment portion of
another block.  If no block with three aligned fragments  is
available at the time the file is created, a full size block
is split yielding  the  necessary  fragments  and  a  single
unused  fragment.   This remaining fragment can be allocated
to another file as needed.

     Space is allocated to a file  when  a  program  does  a
_w_r_i_t_e system call.  Each time data is written to a file, the
system checks to see if the size of the file has increased*.
If  the  file needs to be expanded to hold the new data, one
of three conditions exists:

1)   There is enough space  left  in  an  already  allocated
     block  or  fragment to hold the new data.  The new data
     is written into the available space.

2)   The file contains no fragmented blocks  (and  the  last
     block  in  the file contains insufficient space to hold
     the new data).  If space  exists  in  a  block  already
     allocated,  the  space is filled with new data.  If the
     remainder of the new data contains  more  than  a  full
     block  of data, a full block is allocated and the first
     full block of new data is written there.  This  process
     is  repeated  until  less than a full block of new data
     remains.  If the remaining new data to be written  will
     fit  in less than a full block, a block with the neces-
     sary fragments is located, otherwise a  full  block  is
     located.   The  remaining  new data is written into the
-----------
* A program may be overwriting data in the  middle
of  an  existing  file  in  which case space would
already have been allocated.

     located space.

3)   The file contains one or more fragments (and the  frag-
     ments contain insufficient space to hold the new data).
     If the size of the new data plus the size of  the  data
     already  in  the  fragments  exceeds the size of a full
     block, a new block is allocated.  The contents  of  the
     fragments  are copied to the beginning of the block and
     the remainder of the block is  filled  with  new  data.
     The process then continues as in (2) above.  Otherwise,
     if the new data to be written will fit in less  than  a
     full  block,  a  block  with the necessary fragments is
     located, otherwise a full block is located.   The  con-
     tents  of  the existing fragments appended with the new
     data are written into the allocated space.

     The problem with expanding a file one fragment at a
time is that data may be copied many times as a fragmented
block expands to a full block.  Fragment reallocation can be
minimized if the user program writes a full block at a time,
except for a partial block at the end of  the  file.   Since
file  systems  with  different block sizes may reside on the
same system, the file system interface has been extended  to
provide  application programs the optimal size for a read or
write.  For files the optimal size is the block size of  the
file  system on which the file is being accessed.  For other
objects, such as pipes and sockets, the optimal size is  the
underlying  buffer  size.  This feature is used by the Stan-
dard Input/Output Library, a package used by most user  pro-
grams.   This  feature is also used by certain system utili-
ties such as archivers and loaders that do their  own  input
and  output  management  and  need the highest possible file
system bandwidth.

     The amount of wasted space in the  4096/1024  byte  new
file system organization is empirically observed to be about
the same as in the 1024 byte old file  system  organization.
A  file  system with 4096 byte blocks and 512 byte fragments
has about the same amount of wasted space as  the  512  byte
block UNIX file system.  The new file system uses less space
than the 512 byte or 1024 byte  file  systems  for  indexing
information for large files and the same amount of space for
small files.  These savings are offset by the  need  to  use
more  space for keeping track of available free blocks.  The
net result is about the same disk  utilization  when  a  new
file  system's  fragment  size  equals  an old file system's
block size.

     In order for the layout policies  to  be  effective,  a
file  system  cannot be kept completely full.  For each file
system there is a parameter, termed the free space  reserve,
that  gives the minimum acceptable percentage of file system
blocks that should be free.  If the number  of  free  blocks
drops  below  this  level  only the system administrator can

continue to allocate blocks.  The value  of  this  parameter
may  be  changed  at  any time, even when the file system is
mounted and active.  The transfer rates that appear in  sec-
tion 4 were measured on file systems kept less than 90% full
(a reserve of 10%).  If the number of free blocks  falls  to
zero,  the  file  system throughput tends to be cut in half,
because of the inability of  the  file  system  to  localize
blocks  in  a file.  If a file system's performance degrades
because of overfilling, it may be restored by removing files
until  the amount of free space once again reaches the mini-
mum acceptable level.  Access rates for files created during
periods of little free space may be restored by moving their
data once enough space is available.  The free space reserve
must  be added to the percentage of waste when comparing the
organizations given in Table 1.   Thus,  the  percentage  of
waste in an old 1024 byte UNIX file system is roughly compa-
rable to a new 4096/512 byte file system with the free space
reserve  set at 5%.  (Compare 11.8% wasted with the old file
system to 6.9% waste + 5% reserved space  in  the  new  file
system.)

33..22..  FFiillee ssyysstteemm ppaarraammeetteerriizzaattiioonn

     Except  for  the initial creation of the free list, the
old file system ignores the  parameters  of  the  underlying
hardware.   It  has no information about either the physical
characteristics of the mass storage device, or the  hardware
that interacts with it.  A goal of the new file system is to
parameterize the processor  capabilities  and  mass  storage
characteristics  so that blocks can be allocated in an opti-
mum configuration-dependent way.   Parameters  used  include
the  speed  of  the processor, the hardware support for mass
storage transfers, and the characteristics of the mass stor-
age  devices.  Disk technology is constantly improving and a
given installation can have several different disk technolo-
gies  running  on  a  single processor.  Each file system is
parameterized so that it can be adapted to the  characteris-
tics of the disk on which it is placed.

     For  mass  storage  devices such as disks, the new file
system tries to allocate new blocks on the same cylinder  as
the  previous  block in the same file.  Optimally, these new
blocks will also be rotationally well positioned.  The  dis-
tance   between   ``rotationally   optimal''  blocks  varies
greatly; it can be a consecutive  block  or  a  rotationally
delayed  block  depending  on  system characteristics.  On a
processor with an input/output channel that does not require
any  processor  intervention  between  mass storage transfer
requests, two consecutive disk blocks can often be  accessed
without  suffering  lost time because of an intervening disk
revolution.  For processors without  input/output  channels,
the main processor must field an interrupt and prepare for a
new disk transfer.  The expected time to service this inter-
rupt  and  schedule a new disk transfer depends on the speed

of the main processor.

     The physical characteristics of each disk  include  the
number  of  blocks  per track and the rate at which the disk
spins.  The allocation routines use this information to cal-
culate  the  number  of milliseconds required to skip over a
block.  The characteristics of  the  processor  include  the
expected  time  to  service  an interrupt and schedule a new
disk transfer.  Given a block allocated to a file, the allo-
cation  routines calculate the number of blocks to skip over
so that the next block in the file will come  into  position
under  the  disk head in the expected amount of time that it
takes to start a new disk transfer operation.  For  programs
that  sequentially access large amounts of data, this strat-
egy minimizes the amount of time spent waiting for the  disk
to position itself.

     To ease the task of finding rotationally optimal
blocks, the cylinder group summary information includes a
count of the available blocks in a cylinder group at differ-
ent rotational positions.  Eight  rotational  positions  are
distinguished,  so the resolution of the summary information
is 2 milliseconds for a typical 3600 revolution  per  minute
drive.   The  super-block  contains a vector of lists called
_r_o_t_a_t_i_o_n_a_l _l_a_y_o_u_t _t_a_b_l_e_s.  The vector is  indexed  by  rota-
tional  position.   Each  component  of the vector lists the
index into the block map for every data block  contained  in
its  rotational  position.   When looking for an allocatable
block, the system first looks through the summary counts for
a  rotational position with a non-zero block count.  It then
uses the index of the rotational position to find the appro-
priate  list to use to index through only the relevant parts
of the block map to find a free block.

     The parameter that defines the minimum number  of  mil-
liseconds  between the completion of a data transfer and the
initiation of another data transfer on the same cylinder can
be changed at any time, even when the file system is mounted
and active.  If a file system is parameterized  to  lay  out
blocks  with  a rotational separation of 2 milliseconds, and
the disk pack is then moved to a system that has a processor
requiring  4  milliseconds to schedule a disk operation, the
throughput will drop precipitously because of lost disk rev-
olutions  on  nearly  every  block.   If the eventual target
machine is known, the file system can be  parameterized  for
it  even  though it is initially created on a different pro-
cessor.  Even if the move is not known in advance, the rota-
tional  layout  delay  can be reconfigured after the disk is
moved so that all further allocation is done  based  on  the
characteristics of the new host.


33..33..  LLaayyoouutt ppoolliicciieess

     The  file  system  layout policies are divided into two
distinct parts.  At the top level are global  policies  that
use  file  system wide summary information to make decisions
regarding the placement  of  new  inodes  and  data  blocks.
These routines are responsible for deciding the placement of
new directories and files.  They also calculate rotationally
optimal  block layouts, and decide when to force a long seek
to a new  cylinder  group  because  there  are  insufficient
blocks  left  in the current cylinder group to do reasonable
layouts.  Below the global policy  routines  are  the  local
allocation routines that use a locally optimal scheme to lay
out data blocks.

     Two methods for improving file system  performance  are
to  increase  the  locality  of  reference  to minimize seek
latency as described by [Trivedi80], and to improve the lay-
out  of  data to make larger transfers possible as described
by  [Nevalainen77].   The  global  layout  policies  try  to
improve performance by clustering related information.  They
cannot attempt to localize all  data  references,  but  must
also  try  to spread unrelated data among different cylinder
groups.  If too much localization is  attempted,  the  local
cylinder  group  may run out of space forcing the data to be
scattered  to  non-local  cylinder  groups.   Taken  to   an
extreme,  total  localization  can  result  in a single huge
cluster of data resembling the old file system.  The  global
policies  try to balance the two conflicting goals of local-
izing data that is concurrently accessed while spreading out
unrelated data.

     One allocatable resource is inodes.  Inodes are used to
describe both files and directories.  Inodes of files in the
same  directory are frequently accessed together.  For exam-
ple, the ``list directory'' command often accesses the inode
for  each  file  in a directory.  The layout policy tries to
place all the inodes of files in a  directory  in  the  same
cylinder  group.   To  ensure  that  files  are  distributed
throughout the disk, a different policy is used  for  direc-
tory  allocation.   A  new directory is placed in a cylinder
group that has a greater than average number of free inodes,
and  the  smallest number of directories already in it.  The
intent of this policy is to allow the inode clustering  pol-
icy  to  succeed most of the time.  The allocation of inodes
within a cylinder group is done using a next free  strategy.
Although  this allocates the inodes randomly within a cylin-
der group, all the inodes for a  particular  cylinder  group
can  be  read with 8 to 16 disk transfers.  (At most 16 disk
transfers are required because a cylinder group may have  no
more  than  2048  inodes.)   This  puts a small and constant
upper bound on the number  of  disk  transfers  required  to
access the inodes for all the files in a directory.  In con-
trast, the old  file  system  typically  requires  one  disk
transfer to fetch the inode for each file in a directory.
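The directory-placement rule described above can be sketched
as follows; the tie-breaking and the exact form of the average
are assumptions, and the kernel's actual routine differs in
detail:

```python
def choose_cg_for_directory(groups):
    """groups: one (free_inodes, directory_count) pair per
    cylinder group.  Pick a group with at least the average
    number of free inodes and the fewest directories in it."""
    average_free = sum(free for free, _ in groups) / len(groups)
    candidates = [cg for cg, (free, _) in enumerate(groups)
                  if free >= average_free]
    # Among the above-average groups, fewest directories wins.
    return min(candidates, key=lambda cg: groups[cg][1])
```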

     The  other  major  resource is data blocks.  Since data
blocks for a file are typically accessed together, the  pol-
icy  routines try to place all data blocks for a file in the
same cylinder  group,  preferably  at  rotationally  optimal
positions in the same cylinder.  The problem with allocating
all the data blocks in the same cylinder group is that large
files  will  quickly  use up available space in the cylinder
group, forcing a spill over to other areas.  Further,  using
all  the space in a cylinder group causes future allocations
for any file in the cylinder group to also  spill  to  other
areas.   Ideally  none  of  the  cylinder groups should ever
become completely full.  The heuristic solution chosen is to
redirect block allocation to a different cylinder group when
a file exceeds 48 kilobytes, and at  every  megabyte  there-
after.*  The newly chosen cylinder group  is  selected  from
those  cylinder groups that have a greater than average num-
ber of free blocks left.  Although  big  files  tend  to  be
spread  out  over  the disk, a megabyte of data is typically
accessible before a long seek must  be  performed,  and  the
cost of one long seek per megabyte is small.
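The spill-over points can be modeled as a simple predicate.
This is a sketch of the stated policy, not the kernel's code:

```python
FIRST_SPILL = 48 * 1024      # a 4096 byte block file first needs
                             # a single indirect block here
SPILL_STRIDE = 1024 * 1024   # redirect again at every megabyte

def should_redirect(file_size):
    """True when a file that has grown to file_size bytes should
    have its next block allocated in a new cylinder group."""
    if file_size < FIRST_SPILL:
        return False
    return (file_size - FIRST_SPILL) % SPILL_STRIDE == 0
```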

     The  global  policy routines call local allocation rou-
tines with requests for specific blocks.  The local  alloca-
tion routines will always allocate the requested block if it
is  free,  otherwise  it  allocates  a  free  block  of  the
requested size that is rotationally closest to the requested
block.  If the global layout policies had complete  informa-
tion,  they could always request unused blocks and the allo-
cation routines would  be  reduced  to  simple  bookkeeping.
However,  maintaining  complete  information is costly; thus
the implementation of the global layout policy uses  heuris-
tics that employ only partial information.

     If  a requested block is not available, the local allo-
cator uses a four level allocation strategy:

1)   Use the next available block  rotationally  closest  to
     the requested block on the same cylinder.  It is
     assumed here that head switching time is zero.  On disk
     controllers where this is not the case, it may be pos-
     sible to incorporate the time required to switch
     between disk platters when constructing the rotational
     layout tables.  This, however, has not yet been tried.
-----------
*  The first spill over point at 48 kilobytes is the point
at which a file on a 4096 byte block file system first
requires a single indirect block.  This appears to be a nat-
ural first point at which to redirect block allocation.  The
other spillover points are chosen with the intent of forcing
block allocation to be redirected when a file has used about
25% of the data blocks in a cylinder group.  In observing
the new file system in day to day use, the heuristics appear
to work well in minimizing the number of completely filled
cylinder groups.

2)   If  there are no blocks available on the same cylinder,
     use a block within the same cylinder group.

3)   If that cylinder group is entirely full,  quadratically
     hash the cylinder group number to choose another cylin-
     der group to look for a free block.

4)   Finally if the hash fails, apply an  exhaustive  search
     to all cylinder groups.

     Quadratic  hash is used because of its speed in finding
unused slots in nearly full  hash  tables  [Knuth75].   File
systems that are parameterized to maintain at least 10% free
space rarely use this strategy.  File systems that  are  run
without  maintaining  any  free  space typically have so few
free blocks that almost any allocation is random;  the  most
important  characteristic  of  the  strategy used under such
conditions is that the strategy be fast.
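Strategies 3 and 4 can be sketched together.  The doubling
step used here is one common form of the quadratic rehash;
the real allocator's arithmetic may differ:

```python
def find_cylinder_group(start, ncg, has_free):
    """Search for a cylinder group with a free block: quadratic
    hash from `start`, then an exhaustive scan if that fails."""
    cg, step = start, 1
    while step < ncg:
        cg = (cg + step) % ncg
        if has_free(cg):
            return cg
        step *= 2                 # quadratic rehash
    for cg in range(ncg):         # exhaustive search of all groups
        if has_free(cg):
            return cg
    return None                   # file system is completely full
```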



44..  PPeerrffoorrmmaannccee

     Ultimately, the proof of the effectiveness of the algo-
rithms  described  in  the previous section is the long term
performance of the new file system.

     Our empirical studies have shown that the inode  layout
policy  has  been effective.  When running the ``list direc-
tory'' command on a large  directory  that  itself  contains
many  directories  (to  force the system to access inodes in
multiple cylinder groups), the number of disk  accesses  for
inodes is cut by a factor of two.  The improvements are even
more dramatic for large directories containing  only  files,
disk  accesses  for  inodes  being cut by a factor of eight.
This is most encouraging for programs such as spooling  dae-
mons that access many small files, since these programs tend
to flood the disk request queue on the old file system.

     Table 2 summarizes the measured throughput of  the  new
file  system.   Several  comments  need to be made about the
conditions under which these tests were run.  The test  pro-
grams  measure  the rate at which user programs can transfer
data to or from a file without performing any processing  on
it.   These  programs  must  read  and  write enough data to
insure that buffering  in  the  operating  system  does  not
affect  the results.  They are also run at least three times
in succession; the first to get  the  system  into  a  known
state  and  the second two to insure that the experiment has
stabilized and is repeatable.   The  tests  used  and  their
results  are discussed in detail in [Kridle83].  The systems
were running multi-user but were otherwise quiescent.  There
was  no  contention for either the CPU or the disk arm.  The
only difference between the UNIBUS and MASSBUS tests was the
controller.   All tests used an AMPEX Capricorn 330 megabyte
Winchester disk.  As Table 2 shows,  all  file  system  test
runs  were  on  a  VAX 11/750.  All file systems had been in
production use for at least a month before  being  measured.
The same number of system calls was performed in all tests;
the basic system call overhead was a negligible  portion  of
the total running time of the tests.


+------------------------------+--------------------------------------+
|   Type of      Processor and |                 Read                 |
| File System    Bus Measured  |     Speed         Bandwidth    % CPU |
+------------------------------+--------------------------------------+
|  old 1024       750/UNIBUS   | 29 Kbytes/sec     29/983 3%     11%  |
|new 4096/1024    750/UNIBUS   | 221 Kbytes/sec   221/983 22%    43%  |
|new 8192/1024    750/UNIBUS   | 233 Kbytes/sec   233/983 24%    29%  |
|new 4096/1024    750/MASSBUS  | 466 Kbytes/sec   466/983 47%    73%  |
|new 8192/1024    750/MASSBUS  | 466 Kbytes/sec   466/983 47%    54%  |
+------------------------------+--------------------------------------+
Table 2a - Reading rates of the old and new UNIX file systems.

+------------------------------+--------------------------------------+
|   Type of      Processor and |                Write                 |
| File System    Bus Measured  |     Speed         Bandwidth    % CPU |
+------------------------------+--------------------------------------+
|  old 1024       750/UNIBUS   | 48 Kbytes/sec     48/983 5%     29%  |
|new 4096/1024    750/UNIBUS   | 142 Kbytes/sec   142/983 14%    43%  |
|new 8192/1024    750/UNIBUS   | 215 Kbytes/sec   215/983 22%    46%  |
|new 4096/1024    750/MASSBUS  | 323 Kbytes/sec   323/983 33%    94%  |
|new 8192/1024    750/MASSBUS  | 466 Kbytes/sec   466/983 47%    95%  |
+------------------------------+--------------------------------------+
Table 2b - Writing rates of the old and new UNIX file systems.


     Unlike  the old file system, the transfer rates for the
new file system do not appear  to  change  over  time.   The
throughput  rate is tied much more strongly to the amount of
free space that is maintained.  The measurements in Table  2
were  based  on a file system with a 10% free space reserve.
Synthetic work loads suggest that throughput deteriorates to
about  half the rates given in Table 2 when the file systems
are full.

     The percentage of bandwidth given in Table 2 is a mea-
sure of the effective utilization of the disk by the file
system.  An upper bound on the transfer rate from the disk
is calculated by multiplying the number of bytes on a track
by the number of revolutions of the disk per second.  The
bandwidth is calculated by comparing the data rates the file
system is able to achieve as a percentage of this rate.
Using this metric, the old file system is only able to use
about 3-5% of the disk bandwidth, while the new file system
uses up to 47% of the bandwidth.
-----------
  A UNIX command that is similar to the reading test that we
used is ``cp file /dev/null'', where ``file'' is eight mega-
bytes long.
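For the drive used in these tests, the bound works out as
below.  The 32 sector per track figure is an assumption,
chosen because it reproduces the 983 Kbytes/sec denominator
that appears in Table 2:

```python
def transfer_rate_limit(sectors_per_track, sector_size=512, rpm=3600):
    """Upper bound on the disk transfer rate: bytes per track
    multiplied by revolutions per second."""
    return sectors_per_track * sector_size * rpm // 60

limit = transfer_rate_limit(32)   # 983040 bytes/sec, i.e. 983 Kbytes/sec
old_fs = round(100 * 29 / 983)    # old file system: about 3%
new_fs = round(100 * 466 / 983)   # best new file system rate: 47%
```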

     Both reads and writes are faster in the new system than
in the old system.  The biggest factor in this speedup is
the larger block size used by the new file system.  The
overhead of allocating blocks in the new system is greater
than the overhead of allocating blocks in the old system;
however, fewer blocks need to be allocated in the new sys-
tem because they are bigger.  The net effect is that the
cost per byte allocated is about the same for both systems.

     In  the  new file system, the reading rate is always at
least as fast as the writing rate.  This is to  be  expected
since  the  kernel  must do more work when allocating blocks
than when simply reading them.  Note that  the  write  rates
are  about the same as the read rates in the 8192 byte block
file system; the write rates are slower than the read  rates
in  the 4096 byte block file system.  The slower write rates
occur because the kernel has to do twice as many disk  allo-
cations  per  second, making the processor unable to keep up
with the disk transfer rate.

     In contrast, the old file system is about 50% faster at
writing  files than reading them.  This is because the write
system call is asynchronous and the kernel can generate disk
transfer  requests  much  faster  than they can be serviced,
hence disk transfers queue up  in  the  disk  buffer  cache.
Because the disk buffer cache is sorted by minimum seek dis-
tance, the average seek between the scheduled disk writes is
much  less  than it would be if the data blocks were written
out in the random disk order in which  they  are  generated.
However  when the file is read, the read system call is pro-
cessed synchronously so the disk blocks  must  be  retrieved
from  the  disk  in the non-optimal seek order in which they
are requested.  This forces the disk scheduler  to  do  long
seeks resulting in a lower throughput rate.
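A toy model shows why the seek-sorted write queue helps.  The
cache actually sorts by seek distance from the current arm
position; plain sorting by cylinder number is used here for
brevity:

```python
def total_seek_distance(cylinders, start=0):
    """Cylinders crossed when servicing requests in the given order."""
    position, total = start, 0
    for cylinder in cylinders:
        total += abs(cylinder - position)
        position = cylinder
    return total

arrival_order = [70, 10, 55, 20, 90]   # random order of generation
sorted_order = sorted(arrival_order)   # order taken from the sorted cache
```

Servicing the example queue in sorted order crosses 90
cylinders instead of 280.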

     In  the  new system the blocks of a file are more opti-
mally ordered on the disk.  Even though reads are still syn-
chronous,  the  requests are presented to the disk in a much
better order.  Even though the writes  are  still  asynchro-
nous, they are already presented to the disk in minimum seek
order so there is no gain to  be  had  by  reordering  them.
Hence the disk seek latencies that limited the old file sys-
tem have little effect in the new file system.  The cost  of
allocation  is  the  factor  in  the  new system that causes
writes to be slower than reads.


     The performance of the new  file  system  is  currently
limited by memory to memory copy operations required to move
data from disk buffers in the system's address space to data
buffers  in the user's address space.  These copy operations
account for about  40%  of  the  time  spent  performing  an
input/output operation.  If the buffers in both address spa-
ces were properly aligned, this transfer could be  performed
without  copying  by using the VAX virtual memory management
hardware.  This would be especially desirable when transfer-
ring  large  amounts  of  data.   We  did not implement this
because it would change the user interface to the file  sys-
tem  in  two  major ways: user programs would be required to
allocate buffers on page boundaries, and data  would  disap-
pear from buffers after being written.

     Greater  disk throughput could be achieved by rewriting
the disk drivers to chain  together  kernel  buffers.   This
would  allow  contiguous  disk blocks to be read in a single
disk transaction.  Many disks used with UNIX systems contain
either  32  or  48  512  byte sectors per track.  Each track
holds exactly two or three 8192 byte file system blocks,  or
four  or six 4096 byte file system blocks.  The inability to
use contiguous disk blocks effectively  limits  the  perfor-
mance on these disks to less than 50% of the available band-
width.  If the next block for a file cannot be laid out con-
tiguously,  then the minimum spacing to the next allocatable
block on any platter is between a sixth and a half a revolu-
tion.   The  implication  of  this is that the best possible
layout without contiguous blocks uses only half of the band-
width  of  any  given  track.  If each track contains an odd
number of sectors, then it is possible to resolve the  rota-
tional  delay  to  any  number of sectors by finding a block
that begins at the desired rotational  position  on  another
track.   The  reason that block chaining has not been imple-
mented is because it would require rewriting  all  the  disk
drivers  in the system, and the current throughput rates are
already limited by the speed of the available processors.

     Currently only one block is allocated to a  file  at  a
time.   A  technique  used  by the DEMOS file system when it
finds that a file is growing rapidly, is to preallocate sev-
eral  blocks at once, releasing them when the file is closed
if they remain unused.  By batching up allocations, the sys-
tem can reduce the overhead of allocating at each write, and
it can cut down on the number of disk writes needed to  keep
the  block  pointers on the disk synchronized with the block
allocation [Powell79].   This  technique  was  not  included
because  block  allocation  currently accounts for less than
10% of the time spent in  a  write  system  call  and,  once
again,  the  current throughput rates are already limited by
the speed of the available processors.


55..  FFiillee ssyysstteemm ffuunnccttiioonnaall eennhhaanncceemmeennttss

     The performance enhancements to the  UNIX  file  system
did  not require any changes to the semantics or data struc-
tures visible to  application  programs.   However,  several
changes had been generally desired for some time but had not
been introduced because they would require users to dump and
restore  all  their file systems.  Since the new file system
already required all existing file systems to be dumped  and
restored,  these  functional enhancements were introduced at
this time.

55..11..  LLoonngg ffiillee nnaammeess

     File names can now be of nearly arbitrary length.  Only
programs  that read directories are affected by this change.
To promote portability to UNIX systems that are not  running
the new file system, a set of directory access routines has
been introduced to provide a consistent interface to  direc-
tories on both old and new systems.

     Directories  are  allocated  in  512  byte units called
chunks.  This size is chosen so that each allocation can  be
transferred  to disk in a single operation.  Chunks are bro-
ken  up  into  variable  length  records  termed   directory
entries.   A directory entry contains the information neces-
sary to map the name of a file to its associated inode.   No
directory  entry  is  allowed  to span multiple chunks.  The
first three fields of a directory entry are fixed length and
contain:  an  inode  number,  the size of the entry, and the
length of the file name contained in the entry.  The remain-
der  of an entry is variable length and contains a null ter-
minated file name, padded to a 4 byte boundary.  The maximum
length  of a file name in a directory is currently 255 char-
acters.

     Available space in a directory is  recorded  by  having
one or more entries accumulate the free space in their entry
size fields.  This results in  directory  entries  that  are
larger  than  required  to  hold  the  entry name plus fixed
length fields.  Space allocated to a directory should always
be  completely accounted for by totaling up the sizes of its
entries.  When an entry is deleted  from  a  directory,  its
space  is returned to a previous entry in the same directory
chunk by increasing the size of the previous  entry  by  the
size  of  the deleted entry.  If the first entry of a direc-
tory chunk is free, then the entry's inode number is set  to
zero to indicate that it is unallocated.
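The entry format and free-space convention described above
can be sketched as follows.  The field widths are assumptions
for illustration; the real layout is defined by the system's
directory header file:

```python
import struct

# Fixed-length header: inode number, entry size, name length.
HEADER = struct.Struct("<IHH")   # assumed widths, for illustration

def pack_entry(inode, name, entry_size=None):
    """Build one directory entry: fixed header, then a null
    terminated name padded to a 4 byte boundary.  An entry_size
    larger than needed records free space in the chunk."""
    needed = HEADER.size + ((len(name) + 1 + 3) // 4) * 4
    entry_size = entry_size or needed
    body = name.encode() + b"\0"
    padding = entry_size - HEADER.size - len(body)
    return HEADER.pack(inode, entry_size, len(name)) + body + b"\0" * padding

def scan_chunk(chunk):
    """Walk a directory chunk by entry size, skipping entries
    whose inode number is zero (unallocated)."""
    offset, names = 0, []
    while offset < len(chunk):
        inode, entry_size, name_length = HEADER.unpack_from(chunk, offset)
        if inode != 0:
            start = offset + HEADER.size
            names.append((inode, chunk[start:start + name_length].decode()))
        offset += entry_size
    return names
```

The second entry below carries 20 bytes of free space in its
size field, yet the chunk still accounts for every byte.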

55..22..  FFiillee lloocckkiinngg

     The old file system had no provision for locking files.
Processes that needed to synchronize the updates of  a  file
had to use a separate ``lock'' file.  A process would try to
create a ``lock'' file.  If the creation succeeded, then the
process  could  proceed  with  its  update;  if the creation
failed, then the process would wait  and  try  again.   This
mechanism  had three drawbacks.  Processes consumed CPU time
by looping over attempts to create locks.  Locks left  lying
around  because of system crashes had to be manually removed
(normally in a system  startup  command  script).   Finally,
processes running as the system administrator are always
permitted to create files, so they were forced to use a dif-
ferent mechanism.  While it is possible to get around all
these problems, the solutions are not straightforward, so a
mechanism
for locking files has been added.

     The  most  general  schemes allow multiple processes to
concurrently update a file.  Several of these techniques are
discussed  in [Peterson83].  A simpler technique is to seri-
alize access to a file with  locks.   To  attain  reasonable
efficiency, certain applications require the ability to lock
pieces of a file.  Locking down to the byte level  has  been
implemented  in  the Onyx file system by [Bass81].  However,
for the standard system applications, a mechanism that locks
at the granularity of a file is sufficient.

     Locking schemes fall into two classes, those using hard
locks and those using advisory locks.  The  primary  differ-
ence  between advisory locks and hard locks is the extent of
enforcement.  A hard lock is always enforced when a  program
tries  to  access  a  file; an advisory lock is only applied
when it is requested by a program.  Thus advisory locks  are
only  effective  when  all programs accessing a file use the
locking scheme.  With hard locks there must be some override
policy  implemented  in the kernel.  With advisory locks the
policy is left to the user programs.  In  the  UNIX  system,
programs with system administrator privilege are allowed to
override any protection scheme.  Because many of the pro-
grams that need to use locks must also run as the system
administrator, we chose to implement advisory  locks  rather
than  create an additional protection scheme that was incon-
sistent with the UNIX philosophy or could  not  be  used  by
system administration programs.

     The  file locking facilities allow cooperating programs
to apply advisory shared or exclusive locks on files.  Only
one process may have an exclusive lock on a file while mul-
tiple shared locks may be present.  Shared and exclusive
locks cannot both be present on a file at the same time.  If
any lock is requested when another process holds  an  exclu-
sive  lock,  or  an exclusive lock is requested when another
process holds any lock, the lock request  will  block  until
the  lock  can  be  obtained.   Because shared and exclusive
locks are advisory only, even if a process  has  obtained  a
lock on a file, another process may access the file.


     Locks  are applied or removed only on open files.  This
means that locks can be manipulated without needing to close
and  reopen  a  file.   This  is useful, for example, when a
process wishes to apply a shared lock, read some information
and  determine  whether an update is required, then apply an
exclusive lock and update the file.

     A request for a lock will cause a process to  block  if
the  lock  can  not  be  immediately  obtained.   In certain
instances this is unsatisfactory.  For  example,  a  process
that  wants only to check if a lock is present would require
a separate mechanism to find out this  information.   Conse-
quently,  a  process  may  specify  that its locking request
should return with an error if a lock can not be immediately
obtained.   Being  able  to  conditionally request a lock is
useful to ``daemon'' processes that wish to service a spool-
ing  area.   If  the  first instance of the daemon locks the
directory where spooling takes place, later daemon processes
can  easily  check to see if an active daemon exists.  Since
locks exist only while the  locking  processes  exist,  lock
files  can  never be left active after the processes exit or
if the system crashes.
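The daemon idiom can be sketched with the advisory locking
interface, shown here through Python's fcntl wrapper around
the 4.2BSD flock call; the path handling is illustrative:

```python
import fcntl
import os

def become_active_daemon(spool_path):
    """Try to take an exclusive advisory lock on the spooling
    area without blocking.  Returns the open descriptor (which
    holds the lock) or None if another daemon is active."""
    fd = os.open(spool_path, os.O_RDONLY)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return fd             # this process is now the active daemon
    except OSError:
        os.close(fd)          # an active daemon already holds the lock
        return None
```

Because the lock exists only while the descriptor is open, a
crash or exit releases it automatically, unlike the old lock
files.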

     Almost no deadlock detection is  attempted.   The  only
deadlock  detection  done  by the system is that the file to
which a lock is applied must not already have a lock of  the
same  type (i.e. the second of two successive calls to apply
a lock of the same type will fail).

55..33..  SSyymmbboolliicc lliinnkkss

     The traditional UNIX file system allows multiple direc-
tory  entries  in the same file system to reference a single
file.  Each directory entry ``links'' a file's  name  to  an
inode  and  its  contents.  The link concept is fundamental;
inodes do not reside in directories,  but  exist  separately
and are referenced by links.  When all the links to an inode
are removed, the inode is deallocated.  This style of refer-
encing  an  inode  does not allow references across physical
file systems, nor does it support inter-machine linkage.  To
avoid these limitations symbolic links similar to the scheme
used by Multics [Feiertag71] have been added.

     A symbolic link is implemented as a file that  contains
a  pathname.   When  the  system  encounters a symbolic link
while interpreting a component of a pathname,  the  contents
of  the  symbolic link is prepended to the rest of the path-
name, and this name is interpreted to  yield  the  resulting
pathname.   In UNIX, pathnames are specified relative to the
root  of  the  file  system  hierarchy,  or  relative  to  a
process's  current  working  directory.  Pathnames specified
relative to the root are called absolute  pathnames.   Path-
names  specified  relative  to the current working directory
are termed relative pathnames.  If a symbolic link  contains

an  absolute pathname, the absolute pathname is used, other-
wise the contents of the symbolic link is evaluated relative
to the location of the link in the file hierarchy.
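The substitution rule can be sketched in a few lines.  This
is a model of pathname interpretation, not the kernel's
lookup routine, and the names are invented:

```python
import posixpath

def splice_link(link_dir, link_contents, remainder):
    """Form the pathname that results when a symbolic link is
    encountered: link_dir is the directory holding the link and
    remainder is the unconsumed rest of the original pathname."""
    if link_contents.startswith("/"):
        base = link_contents                            # absolute target
    else:
        base = posixpath.join(link_dir, link_contents)  # relative to the link
    return posixpath.normpath(posixpath.join(base, remainder)
                              if remainder else base)
```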

     Normally programs do not want to be aware that there is
a symbolic link in a pathname that they are using.   However
certain  system utilities must be able to detect and manipu-
late symbolic links.  Three new  system  calls  provide  the
ability  to  detect,  read,  and write symbolic links; seven
system utilities required changes to use these calls.

     In future Berkeley software  distributions  it  may  be
possible   to  reference  file  systems  located  on  remote
machines using pathnames.  When this occurs, it will be pos-
sible to create symbolic links that span machines.

55..44..  RReennaammee

     Programs  that create a new version of an existing file
typically create the new version as  a  temporary  file  and
then  rename  the temporary file with the name of the target
file.  In the old UNIX file system renaming  required  three
calls  to  the system.  If a program were interrupted or the
system crashed between these calls, the target file could be
left with only its temporary name.  To eliminate this possi-
bility, the rename system call has been added.  The rename
call  does the rename operation in a fashion that guarantees
the existence of the target name.

     Rename works both on data files and directories.   When
renaming  directories, the system must do special validation
checks to insure that the directory tree  structure  is  not
corrupted  by the creation of loops or inaccessible directo-
ries.  Such corruption would occur  if  a  parent  directory
were  moved  into  one  of  its descendants.  The validation
check requires tracing the descendants of the target direc-
tory to insure that it does not include the directory being
moved.
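The validation check can be modeled on pathnames; the kernel
performs the equivalent trace over inodes, so this string
version is only a sketch:

```python
def is_descendant(candidate, ancestor):
    """True when candidate lies within the subtree rooted at
    ancestor.  Renaming a directory onto such a target would
    detach part of the tree into an inaccessible loop."""
    return (candidate == ancestor or
            candidate.startswith(ancestor.rstrip("/") + "/"))

def directory_rename_allowed(source, target):
    return not is_descendant(target, source)
```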

55..55..  QQuuoottaass

     The UNIX system has traditionally  attempted  to  share
all  available  resources  to  the greatest extent possible.
Thus any single user can allocate all the available space in
the  file system.  In certain environments this is unaccept-
able.  Consequently, a quota mechanism has  been  added  for
restricting  the amount of file system resources that a user
can obtain.  The quota mechanism sets  limits  on  both  the
number  of  inodes and the number of disk blocks that a user
may allocate.  A separate quota can be set for each user  on
each  file  system.   Resources  are given both a hard and a
soft limit.  When a program exceeds a soft limit, a  warning
is printed on the user's terminal; the offending program is
not terminated unless it exceeds its hard limit.   The  idea
is  that  users  should  stay below their soft limit between
login sessions, but they may use more resources  while  they
are actively working.  To encourage this behavior, users are
warned when logging in if they are over any  of  their  soft
limits.  If users fail to correct the problem for too many
login sessions, they are eventually  reprimanded  by  having
their soft limit enforced as their hard limit.
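The soft and hard limit behavior can be modeled as below.
The three-session grace period and the per-check counting
are assumptions for illustration; the actual enforcement
details belong to the quota implementation:

```python
GRACE = 3   # assumed number of times a user may be found over the soft limit

def quota_allows(usage, soft, hard, times_over):
    """Decide whether an allocation bringing a user's usage to
    `usage` blocks (or inodes) is permitted, and track how many
    times the user has been found over the soft limit."""
    if usage > hard:
        return False, times_over        # hard limit is always enforced
    if usage > soft:
        if times_over >= GRACE:         # soft limit now enforced as hard
            return False, times_over
        return True, times_over + 1     # allowed, but the user is warned
    return True, 0                      # back under the soft limit
```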



AAcckknnoowwlleeddggeemmeennttss

     We thank Robert Elz for his ongoing interest in the new
file system, and for adding disk quotas in  a  rational  and
efficient  manner.   We  also acknowledge Dennis Ritchie for
his suggestions on the appropriate modifications to the user
interface.   We  appreciate Michael Powell's explanations on
how the DEMOS file system worked; many  of  his  ideas  were
used  in  this implementation.  Special commendation goes to
Peter Kessler and Robert Henry for acting  like  real  users
during the early debugging stage when file systems were less
stable than they should have been.  The criticisms and sug-
gestions by the reviewers contributed significantly to the
coherence of the paper.  Finally we thank our sponsors,  the
National Science Foundation under grant MCS80-05144, and the
Defense Advanced Research Projects Agency (DoD) under ARPA
Order  No. 4031 monitored by Naval Electronic System Command
under Contract No. N00039-82-C-0235.



RReeffeerreenncceess


[Almes78]           Almes, G., and Robertson, G.  "An Exten-
                    sible File System for Hydra" Proceedings
                    of the Third International Conference on
                    Software Engineering, IEEE, May 1978.

[Bass81]            Bass,  J.   "Implementation  Description
                    for File Locking", Onyx Systems Inc,  73
                    E.  Trimble  Rd,  San Jose, CA 95131 Jan
                    1981.

[Feiertag71]        Feiertag, R. J.  and  Organick,  E.  I.,
                    "The  Multics Input-Output System", Pro-
                    ceedings of the Third Symposium on Oper-
                    ating Systems Principles, ACM, Oct 1971.
                    pp 35-41

[Ferrin82a]         Ferrin, T.E., "Performance  and  Robust-
                    ness  Improvements  in  Version 7 UNIX",
                    Computer Graphics  Laboratory  Technical
                    Report 2, School of Pharmacy, University
                    of California,  San  Francisco,  January
                    1982.   Presented  at  the  1982  Winter
                    Usenix Conference, Santa  Monica,  Cali-
                    fornia.

[Ferrin82b]         Ferrin, T.E., "Performance Issues of
                    VMUNIX Revisited", ;login:  (The  Usenix
                    Association   Newsletter),  Vol  7,  #5,
                    November 1982. pp 3-6

[Knuth75]           Knuth, D.  "The Art of Computer Program-
                    ming", Volume 3 - Sorting and Searching,
                    Addison-Wesley  Publishing  Company Inc,
                    Reading, Mass, 1975. pp 506-549

[Kowalski78]        Kowalski,  T.   "FSCK  - The UNIX System
                    Check Program", Bell Laboratories,  Mur-
                    ray Hill, NJ 07974. March 1978

[Kridle83]          Kridle, R., and McKusick,  M.,  "Perfor-
                    mance  Effects of Disk Subsystem Choices
                    for VAX Systems  Running  4.2BSD  UNIX",
                    Computer Systems Research Group, Dept of
                    EECS,  Berkeley,  CA  94720,   Technical
                    Report #8.

[Maruyama76]        Maruyama, K., and  Smith,  S.   "Optimal
                    reorganization of Distributed Space Disk
                    Files",  CACM,  19,  11.  Nov  1976.  pp
                    634-642

[Nevalainen77]      Nevalainen,  O., Vesterinen, M.  "Deter-
                    mining Blocking Factors  for  Sequential
                    Files  by  Heuristic  Methods", The Com-
                    puter  Journal,  20,  3.  Aug  1977.  pp
                    245-247

[Pechura83]         Pechura,  M., and Schoeffler, J.  "Esti-
                    mating  File  Access  Time   of   Floppy
                    Disks",  CACM,  26,  10.  Oct  1983.  pp
                    754-763

[Peterson83]        Peterson, G.  "Concurrent Reading  While
                    Writing",  ACM  Transactions on Program-
                    ming Languages and Systems, ACM,  5,  1.
                    Jan 1983. pp 46-55

[Powell79]          Powell,  M.   "The  DEMOS  File System",
                    Proceedings of the  Sixth  Symposium  on
                    Operating  Systems  Principles, ACM, Nov
                    1977. pp 33-42

[Ritchie74]         Ritchie, D. M. and  Thompson,  K.,  "The
                    UNIX  Time-Sharing  System", CACM 17, 7.
                    July 1974. pp 365-375

[Smith81a]          Smith,  A.   "Input/Output  Optimization
                    and  Disk Architectures: A Survey", Per-
                    formance Evaluation,  1.  Jan  1981.  pp
                    104-117

[Smith81b]          Smith, A.  "Bibliography on File and I/O
                    System Optimization and Related Topics",
                    Operating  Systems  Review,  15,  4. Oct
                    1981. pp 39-54

[Symbolics81]       "Symbolics File System", Symbolics  Inc,
                    9600  DeSoto  Ave,  Chatsworth, CA 91311
                    Aug 1981.

[Thompson78]        Thompson,  K.   "UNIX   Implementation",
                    Bell  System  Technical  Journal, 57, 6,
                    part 2. pp 1931-1946 July-August 1978.

[Thompson80]        Thompson,  M.   "Spice   File   System",
                    Carnegie-Mellon  University,  Department
                    of  Computer  Science,  Pittsburgh,   PA
                    15213, #CMU-CS-80, Sept 1980.

[Trivedi80]         Trivedi,  K.   "Optimal Selection of CPU
                    Speed,  Device  Capabilities,  and  File
                    Assignments", Journal of the ACM, 27, 3.
                    July 1980. pp 457-473

[White80]           White, R. M.  "Disk Storage Technology",
                    Scientific   American,   243(2),  August
                    1980.
