           BBeerrkkeelleeyy SSooffttwwaarree AArrcchhiitteeccttuurree MMaannuuaall
                       44..44BBSSDD EEddiittiioonn


              _M_. _K_i_r_k _M_c_K_u_s_i_c_k_, _M_i_c_h_a_e_l _K_a_r_e_l_s
                _S_a_m_u_e_l _L_e_f_f_l_e_r_, _W_i_l_l_i_a_m _J_o_y
                        _R_o_b_e_r_t _F_a_b_r_y
              Computer Systems Research Group
                 Computer Science Division
 Department of Electrical Engineering and Computer Science
             University of California, Berkeley
                    Berkeley, CA  94720


                          _A_B_S_T_R_A_C_T

          This  document  summarizes  the  system calls
     provided by the 4.4BSD operating system.  It  does
     not  attempt  to  act as a tutorial for use of the
     system, nor does it attempt to explain or  justify
     the  design  of  the  system facilities.  It gives
     neither motivation nor implementation details,  in
     favor of brevity.

          The  first section describes the basic kernel
     functions provided to a  process:  process  naming
     and protection, memory management, software inter-
     rupts, time and statistics functions, object  ref-
     erences   (descriptors),  and  resource  controls.
     These facilities, as well as facilities for  boot-
     strap,  shutdown  and process accounting, are pro-
     vided solely by the kernel.

          The second  section  describes  the  standard
     system  abstractions  for  files  and filesystems,
     communication, terminal handling, and process con-
     trol  and  debugging.  These facilities are imple-
     mented by  the  operating  system  or  by  network
     server processes.

PSD:5-4                           4.4BSD Architecture Manual


NNoottaattiioonn aanndd TTyyppeess

     The notation used to describe system calls is a variant
of a C language function call,  consisting  of  a  prototype
call  followed by the declaration of parameters and results.
An additional keyword rreessuulltt, not part of the normal C  lan-
guage,  is  used  to indicate which of the declared entities
receive results.  As an example, consider the _r_e_a_d call,  as
described in section 2.1.1:

     cc = read(fd, buf, nbytes);
     result ssize_t cc; int fd; result void *buf; size_t nbytes;

The first line shows how the read routine is called, with
three parameters.  As shown on the second line, the return
value cc is an ssize_t, and read also returns information in
the parameter buf.

     The descriptions of error conditions arising from each
system call are not provided here; they appear in section 2
of the Programmer's Reference Manual.  In particular, when
accessed from the C language, many calls return a
characteristic -1 value when an error occurs, returning the
error code in the global variable errno.  Other languages
may present errors in different ways.
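
The -1 and errno convention can be sketched as follows.  This is
an illustration only: the helper name read_some is hypothetical,
and the include files are those of a current POSIX system rather
than anything specified in this manual.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: read up to nbytes from fd into buf.
 * Returns the count from read, or -1 with errno describing the error.
 */
ssize_t
read_some(int fd, void *buf, size_t nbytes)
{
    ssize_t cc;
    int saved;

    cc = read(fd, buf, nbytes);    /* results: cc and the contents of buf */
    if (cc == -1) {
        saved = errno;             /* fprintf may itself change errno */
        fprintf(stderr, "read: %s\n", strerror(saved));
        errno = saved;
    }
    return (cc);
}
```

A caller compares the result against -1 before using buf, and
consults errno only after a failure has been reported.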

     A number of system standard types are defined by the
include file <sys/types.h> and used in the specifications
here and in many C programs.


Type       Value
--------------------------------------------------------------
caddr_t    char *               /* a memory address */
clock_t    unsigned long        /* count of CLK_TCK's */
gid_t      unsigned long        /* group ID */
int16_t    short                /* 16-bit integer */
int32_t    int                  /* 32-bit integer */
int64_t    long long            /* 64-bit integer */
int8_t     signed char          /* 8-bit integer */
mode_t     unsigned short       /* file permissions */
off_t      quad_t               /* file offset */
pid_t      long                 /* process ID */
qaddr_t    quad_t *
quad_t     long long
size_t     unsigned int         /* count of bytes */
ssize_t    int                  /* signed size_t */
time_t     long                 /* seconds since the Epoch */
u_char     unsigned char
u_int      unsigned int
u_long     unsigned long
u_quad_t   unsigned long long
u_short    unsigned short
uid_t      unsigned long        /* user ID */
uint       unsigned int         /* System V compatibility */
uint16_t   unsigned short       /* unsigned 16-bit integer */
uint32_t   unsigned int         /* unsigned 32-bit integer */
uint64_t   unsigned long long   /* unsigned 64-bit integer */
uint8_t    unsigned char        /* unsigned 8-bit integer */
ushort     unsigned short       /* System V compatibility */


11..  KKeerrnneell pprriimmiittiivveess


     The facilities available to a user  process  are  logi-
cally  divided  into  two  parts: kernel facilities directly
implemented by code running in  the  operating  system,  and
system  facilities  implemented  either by the system, or in
cooperation with a _s_e_r_v_e_r _p_r_o_c_e_s_s.   The  kernel  facilities
are described in section 1.

     The facilities implemented in the kernel are those
that define the 4.4BSD virtual machine in which each
process runs.  Like many real machines, this virtual machine
has memory management hardware, an interrupt facility,
timers, and counters.  The 4.4BSD virtual machine allows
access to files and other objects through a set of
descriptors.  Each descriptor resembles a device controller
and supports a set of operations.  Like devices on real
machines, some of which are internal to the machine and some
of which are external, parts of the descriptor machinery are
built into the operating system, while other parts are
implemented in server processes on other machines.  The
facilities provided through the descriptor machinery are
described in section 2.

11..11..  PPrroocceesssseess aanndd pprrootteeccttiioonn


11..11..11..  HHoosstt iiddeennttiiffiieerrss


     Each host has associated with it an  integer  host  ID,
and a host name of up to MAXHOSTNAMELEN (256) characters (as
defined in _<_s_y_s_/_p_a_r_a_m_._h_>).  These identifiers are set (by  a
privileged  user)  and  retrieved using the _s_y_s_c_t_l interface
described in section 1.7.1.  The host ID is seldom used  (or
set),  and is deprecated.  For convenience and backward com-
patibility, the following library routines are provided:

     sethostid(hostid);
     long hostid;


     hostid = gethostid();
     result long hostid;


     sethostname(name, len);
     char *name; int len;


     len = gethostname(buf, buflen);
     result int len; result char *buf; int buflen;
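
As a hedged sketch, the host name might be fetched as shown
below.  The helper name host_name is hypothetical.  The sketch
assumes the behavior of current systems, where gethostname
returns 0 on success and may leave a truncated name without a
terminating NUL.

```c
#include <stddef.h>
#include <sys/param.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the host name into buf and make sure it is
 * NUL-terminated.  Returns 0 on success, -1 on error.
 */
int
host_name(char *buf, size_t buflen)
{
    if (buflen == 0 || gethostname(buf, buflen) == -1)
        return (-1);
    buf[buflen - 1] = '\0';    /* a truncated name may lack the NUL */
    return (0);
}
```

A buffer of MAXHOSTNAMELEN + 1 bytes is large enough for any
name the system accepts.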


11..11..22..  PPrroocceessss iiddeennttiiffiieerrss

Each host runs a set of _p_r_o_c_e_s_s_e_s.  Each process is  largely
independent  of  other  processes, having its own protection
domain, address space, timers, and  an  independent  set  of
references to system or user implemented objects.

     Each  process  in  a host is named by an integer called
the _p_r_o_c_e_s_s _I_D.  This number is in the range 1-30000 and  is
returned by the _g_e_t_p_i_d routine:

     pid = getpid();
     result pid_t pid;

On each host this identifier is guaranteed to be unique; in
a multi-host environment, the (hostid, process ID) pairs are
guaranteed unique.  The parent process identifier can be
obtained using the getppid routine:

     pid = getppid();
     result pid_t pid;


11..11..33..  PPrroocceessss ccrreeaattiioonn aanndd tteerrmmiinnaattiioonn


A new process is created by making a logical duplicate of an
existing process:

     pid = fork();
     result pid_t pid;

The  _f_o_r_k  call  returns  twice, once in the parent process,
where _p_i_d is the process identifier of the child,  and  once
in the child process where _p_i_d is 0.  The parent-child rela-
tionship imposes a hierarchical structure on the set of pro-
cesses in the system.

     For  processes  that are forking solely for the purpose
of _e_x_e_c_v_e'ing another program, the _v_f_o_r_k  system  call  pro-
vides a faster interface:

     pid = vfork();
     result pid_t pid;

Like  _f_o_r_k, the _v_f_o_r_k call returns twice, once in the parent
process, where _p_i_d is the process identifier of  the  child,
and  once  in  the child process where _p_i_d is 0.  The parent
process is suspended until the child  process  calls  either
_e_x_e_c_v_e or _e_x_i_t.

A process may terminate by executing an _e_x_i_t call:

     exit(status);
     int status;

The lower 8 bits of exit status are available to its parent.

     When a child process exits  or  terminates  abnormally,
the  parent  process  receives  information  about the event
which caused termination of the child process.   The  inter-
face  allows  the  parent  to wait for a particular process,
process group, or any  direct  descendent  and  to  retrieve
information  about  resources consumed by the process during
its lifetime.  The request may be done either  synchronously
(waiting  for  one  of  the requested processes to exit), or
asynchronously (polling to see if any of the requested  pro-
cesses have exited):

     pid = wait4(wpid, astatus, options, arusage);
     result pid_t pid; pid_t wpid; result int *astatus;
     int options; result struct rusage *arusage;
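
The interplay of fork, exit, and wait4 can be sketched as
follows.  The helper name child_exit_status is hypothetical, and
the WIFEXITED and WEXITSTATUS macros come from <sys/wait.h> on
current systems; they are not described in this section.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that exits with `code`, wait for it
 * synchronously, and return the exit status seen by the parent
 * (-1 on error).
 */
int
child_exit_status(int code)
{
    struct rusage ru;
    pid_t pid, done;
    int status;

    pid = fork();
    if (pid == -1)
        return (-1);
    if (pid == 0)
        _exit(code);           /* child: fork returned 0 */

    /* parent: pid is the child's process ID; options 0 means block */
    done = wait4(pid, &status, 0, &ru);
    if (done != pid || !WIFEXITED(status))
        return (-1);
    return (WEXITSTATUS(status));    /* only the lower 8 bits survive */
}
```

The child uses _exit rather than exit so that buffered output
inherited from the parent is not flushed a second time.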


     A  process  can overlay itself with the memory image of
another process, passing the newly created process a set  of
parameters, using the call:

     execve(name, argv, envp);
     char *name, *argv[], *envp[];

The specified name must be a file that is in a format
recognized by the system, either a binary executable file or
a file that causes the execution of a specified interpreter
program to process its contents.  If the set-user-ID mode
bit is set, the effective user ID is set to the owner of the
file; if the set-group-ID mode bit is set, the effective
group ID is set to the group of the file.  Whether changed
or not, the effective user ID is then copied to the saved
user ID, and the effective group ID is copied to the saved
group ID.
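
A common pattern, sketched here under stated assumptions,
combines fork, execve, and wait4 to run another program and
collect its status.  The helper name run_program and the use of
status 127 to report a failed execve are illustrative choices,
not part of the interface described above.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: run the program argv[0] with the given arguments
 * and environment.  Returns its exit status, 127 if execve failed in
 * the child, or -1 if the child could not be created or waited for.
 */
int
run_program(char *const argv[], char *const envp[])
{
    pid_t pid;
    int status;

    pid = fork();
    if (pid == -1)
        return (-1);
    if (pid == 0) {
        execve(argv[0], argv, envp);
        _exit(127);            /* reached only if execve failed */
    }
    if (wait4(pid, &status, 0, NULL) != pid || !WIFEXITED(status))
        return (-1);
    return (WEXITSTATUS(status));
}
```

Because execve returns only on failure, no test of its return
value is needed before the call to _exit.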

11..11..44..  UUsseerr aanndd ggrroouupp IIDDss


     Each process in the system has associated with it three
user IDs: a real user ID, an effective user ID, and a saved
user ID, all unsigned integral types (uid_t).  Each process
has a real group ID and a set of access group IDs, the first
of which is the effective group ID.  The group IDs are
unsigned integral types (gid_t).  Each process may be in
multiple access groups.  The maximum concurrent number of
access groups is a system compilation parameter, represented
by the constant NGROUPS in the file <sys/param.h>.  It is
guaranteed to be at least 16.

The real group ID is used in process accounting and in test-
ing whether the effective group ID may be changed; it is not
otherwise  used  for  access  control.   The  members of the
access group ID set are used for  access  control.   Because
the first member of the set is the effective group ID, which
is changed when executing a set-group-ID program, that  ele-
ment is normally duplicated in the set so that access privi-
leges for the original group are not lost when using a  set-
group-ID program.

The  real  and  effective user IDs associated with a process
are returned by:

     ruid = getuid();
     result uid_t ruid;


     euid = geteuid();
     result uid_t euid;

the real and effective group IDs by:

     rgid = getgid();
     result gid_t rgid;


     egid = getegid();
     result gid_t egid;

The access group ID set is returned by a getgroups call:

     ngroups = getgroups(gidsetsize, gidset);
     result int ngroups; int gidsetsize; result gid_t gidset[gidsetsize];
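
As an illustration, the access group set can be searched for a
given group ID.  The helper name in_access_groups is
hypothetical.  The sketch assumes the POSIX behavior in which a
gidsetsize of 0 returns only the number of groups; that
behavior is not stated in the text above.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if gid is a member of the access group
 * set, 0 if it is not, and -1 on error.
 */
int
in_access_groups(gid_t gid)
{
    gid_t *set;
    int i, n, found = 0;

    n = getgroups(0, NULL);    /* assumed: size 0 asks for the count */
    if (n <= 0)
        return (n);            /* error (-1) or an empty set (0) */
    set = malloc((size_t)n * sizeof(*set));
    if (set == NULL)
        return (-1);
    n = getgroups(n, set);
    for (i = 0; i < n; i++)
        if (set[i] == gid)
            found = 1;
    free(set);
    return (n == -1 ? -1 : found);
}
```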


The user and group IDs are assigned at login time using the
setuid, setgid, and setgroups calls:

     setuid(uid);
     uid_t uid;


     setgid(gid);
     gid_t gid;


     setgroups(gidsetsize, gidset);
     int gidsetsize; gid_t gidset[gidsetsize];

The  _s_e_t_u_i_d  call  sets  the real, effective, and saved user
IDs, and is permitted only if the specified _u_i_d is the  cur-
rent  real  user ID or if the caller is the super-user.  The
_s_e_t_g_i_d call sets the real, effective, and saved  group  IDs;
it  is  permitted  only  if the specified _g_i_d is the current
real group ID or if the caller is the super-user.  The  _s_e_t_-
_g_r_o_u_p_s  call sets the access group ID set, and is restricted
to the super-user.

The _s_e_t_e_u_i_d routine allows any process to set its  effective
user ID to either its real or saved user ID:

     seteuid(uid);
     uid_t uid;

The  _s_e_t_e_g_i_d routine allows any process to set its effective
group ID to either its real or saved group ID:

     setegid(gid);
     gid_t gid;
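
One use of seteuid, sketched below, is for a set-user-ID program
to drop its privileges temporarily and regain them later.  The
names with_real_uid and euid_is_real are hypothetical.  The
sketch assumes the previous effective user ID equals the real or
saved user ID, so that the second seteuid is permitted.

```c
#include <sys/types.h>
#include <unistd.h>

/* Example callback: reports whether the effective and real IDs match. */
int
euid_is_real(void)
{
    return (geteuid() == getuid());
}

/*
 * Hypothetical helper: call fn with the effective user ID set to the
 * real user ID, then restore the previous effective user ID.
 * Returns the value of fn, or -1 if either seteuid call fails.
 */
int
with_real_uid(int (*fn)(void))
{
    uid_t previous = geteuid();
    int rv;

    if (seteuid(getuid()) == -1)
        return (-1);
    rv = fn();
    if (seteuid(previous) == -1)    /* allowed if previous is the saved ID */
        return (-1);
    return (rv);
}
```

In a program that is not set-user-ID the two IDs are already
equal, and both calls succeed without changing anything.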


11..11..55..  SSeessssiioonnss


     When a user first logs onto the system,  they  are  put
into a session with a controlling process (usually a shell).
The session is created with the call:

     pid = setsid();
     result pid_t pid;

All subsequent processes created by the user (that do not
call setsid) will be part of the session.  The session also
has a login name associated with it, which is set using the
privileged call:

     setlogin(name);
     char *name;

The login name can be retrieved using the call:

     name = getlogin();
     result char *name;

Unlike historic systems, the value returned by getlogin is
stored in the kernel and can be trusted.

11..11..66..  PPrroocceessss ggrroouuppss


     Each process in the system is also  associated  with  a
_p_r_o_c_e_s_s _g_r_o_u_p.  The group of processes in a process group is
sometimes referred to as a _j_o_b and manipulated by high-level
system  software  (such  as  the  shell).   All members of a
process group are members of the same session.  The  current
process group of a process is returned by the _g_e_t_p_g_r_p call:

     pgrp = getpgrp();
     result pid_t pgrp;

When a process is in a specific process group it may receive
software interrupts affecting the group, causing  the  group
to  suspend or resume execution or to be interrupted or ter-
minated.  In particular, a system  terminal  has  a  process
group  and  only processes which are in the process group of
the terminal may read from the terminal,  allowing  arbitra-
tion of a terminal among several different jobs.

The process group associated with a process may be changed
by the setpgid call:

     setpgid(pid, pgrp);
     pid_t pid, pgrp;

Newly created processes are assigned  process  IDs  distinct
from  all processes and process groups, and the same process
group as their parent.  Any  process  may  set  its  process
group equal to its process ID or to the value of any process
group within its session.
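
The rule that a process may set its process group equal to its
own process ID can be sketched as follows, much as a shell does
when it starts a new job.  The helper name
child_makes_own_group is hypothetical, and the exit codes used
by the child are an illustrative convention.

```c
#include <sys/types.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: fork a child that moves itself into a new
 * process group named by its own process ID.  Returns 1 if the child
 * ended up in that group, 0 if not, and -1 on error.
 */
int
child_makes_own_group(void)
{
    pid_t pid;
    int status;

    pid = fork();
    if (pid == -1)
        return (-1);
    if (pid == 0) {
        /* the child starts in the process group of its parent */
        if (setpgid(getpid(), getpid()) == -1)
            _exit(2);
        _exit(getpgrp() == getpid() ? 0 : 1);
    }
    if (wait4(pid, &status, 0, NULL) != pid || !WIFEXITED(status))
        return (-1);
    return (WEXITSTATUS(status) == 0);
}
```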

11..22..  MMeemmoorryy mmaannaaggeemmeenntt


11..22..11..  TTeexxtt,, ddaattaa,, aanndd ssttaacckk


     Each process begins execution with three logical  areas
of  memory  called  text, data, and stack.  The text area is
read-only and shared, while the data  and  stack  areas  are
writable  and  private  to  the  process.  Both the data and
stack areas  may  be  extended  and  contracted  on  program
request.  The call:

     brk(addr);
     caddr_t addr;

sets the end of the data segment to the specified address.
More conveniently, the end can be extended by incr bytes,
and the base of the new area returned, with the call:

     addr = sbrk(incr);
     result caddr_t addr; int incr;

Application programs normally use the library routines
malloc and free, which provide a more convenient interface
than brk and sbrk.

There is no call for extending the stack, as it is automati-
cally extended as needed.

11..22..22..  MMaappppiinngg ppaaggeess


     The system supports sharing of data  between  processes
by  allowing  pages  to be mapped into memory.  These mapped
pages may be _s_h_a_r_e_d with other processes or _p_r_i_v_a_t_e  to  the
process.   Protection  and  sharing  options  are defined in
_<_s_y_s_/_m_m_a_n_._h_> as:


     Protections are chosen from these bits, or-ed together:

     PROT_READ           /* pages can be read */
     PROT_WRITE          /* pages can be written */
     PROT_EXEC           /* pages can be executed */




     Flags contain sharing type and options.  Sharing options, choose one:

     MAP_SHARED                   /* share changes */
     MAP_PRIVATE                  /* changes are private */




     Option flags[+]:

     MAP_ANON           /* allocated from virtual memory; fd ignored */
     MAP_FIXED          /* map addr must be exactly as requested */
     MAP_NORESERVE      /* don't reserve needed swap area */
     MAP_INHERIT        /* region is retained after exec */
     MAP_HASSEMAPHORE   /* region may contain semaphores */

[+] In 4.4BSD, only MAP_ANON and MAP_FIXED are implemented.

The size of a page is CPU-dependent, and is returned by the
sysctl interface described in section 1.7.1.  The getpagesize
library routine is provided for convenience and backward
compatibility:

     pagesize = getpagesize();
     result int pagesize;


The call:

     maddr = mmap(addr, len, prot, flags, fd, pos);
     result caddr_t maddr; caddr_t addr; size_t len; int prot, flags, fd; off_t pos;

causes the pages starting at addr and continuing for at most
len bytes to be mapped from the object represented by
descriptor fd, starting at byte offset pos.  If addr is
NULL, the system picks an unused address for the region.
The starting address of the region is returned; for the
convenience of the system, it may differ from that supplied
unless the MAP_FIXED flag is given, in which case the exact
address will be used or the call will fail.  If MAP_FIXED is
given, the addr parameter must be a multiple of the page
size.  If pos and len are not multiples of the page size,
they will be rounded (down and up respectively) to a page
boundary by the system; the rounding will cause the mapped
region to extend past the specified range.  A successful
mmap will delete any previous mapping in the allocated
address range.

The parameter prot specifies the accessibility of the mapped
pages.  The parameter flags specifies the type of object to
be mapped, mapping options, and whether modifications made
to this mapped copy of the page are to be kept private, or
are to be shared with other references.  Possible types
include MAP_SHARED or MAP_PRIVATE, which map a regular file
or character-special device memory, and MAP_ANON, which maps
memory not associated with any specific file.  The file
descriptor used when creating MAP_ANON regions is not used
and should be -1.

The MAP_INHERIT flag allows a region to be inherited after
an execve.  The MAP_HASSEMAPHORE flag allows special
handling for regions that may contain semaphores.  The
MAP_NORESERVE flag allows processes to allocate regions
whose virtual address space, if fully allocated, would
exceed the available memory plus swap resources.  Processes
using such regions may get a SIGSEGV signal if they page
fault and resources are not available to service their
request; typically they would free up some resources via
munmap so that when they return from the signal the page
fault could be completed successfully.

A  facility  is provided to synchronize a mapped region with
the file it maps; the call:

     msync(addr, len);
     caddr_t addr; size_t len;

causes any modified pages in the specified region to be
synchronized with their source and other mappings.  If
necessary, it writes any modified pages back to the
filesystem, and updates the file modification time.  If len
is 0, all modified pages within the region containing addr
will be flushed; this usage is provisional, and may be
withdrawn.  If len is non-zero, only the pages containing
addr and len succeeding locations will be examined.  Any
required synchronization of memory caches will also take
place at this time.

Filesystem operations on a file that is mapped for shared
modifications are currently unpredictable except after an
msync.

A mapping can be removed by the call:

     munmap(addr, len);
     caddr_t addr; size_t len;

This  call  deletes  the  mappings for the specified address
range, and causes further references to addresses within the
range to generate invalid memory references.
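
The sequence of mapping a file, modifying it through the
mapping, synchronizing, and unmapping can be sketched as
follows.  The helper name store_via_mapping is hypothetical.
Note that current systems declare msync with a third flags
argument (MS_SYNC here), which differs from the two-argument
form documented above; ftruncate and MAP_FAILED are likewise
taken from current systems.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or truncate) the file at path and store
 * msg in it through a shared mapping.  Returns 0 on success, -1 on error.
 */
int
store_via_mapping(const char *path, const char *msg)
{
    size_t len = strlen(msg);
    char *p;
    int fd, rv;

    if (len == 0)
        return (-1);
    fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd == -1)
        return (-1);
    if (ftruncate(fd, (off_t)len) == -1) {    /* give the file a length */
        close(fd);
        return (-1);
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return (-1);
    }
    memcpy(p, msg, len);               /* modify the file through memory */
    rv = msync(p, len, MS_SYNC);       /* modern three-argument form */
    if (munmap(p, len) == -1)
        rv = -1;
    if (close(fd) == -1)
        rv = -1;
    return (rv);
}
```

After the call returns successfully, an ordinary read of the
file yields the bytes stored through the mapping.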

11..22..33..  PPaaggee pprrootteeccttiioonn ccoonnttrrooll


A  process  can  control  the  protection of pages using the
call:

     mprotect(addr, len, prot);
     caddr_t addr; size_t len; int prot;

This call changes the specified pages to have protection
prot.  Not all implementations will guarantee protection on
a page basis; the granularity of protection changes may be
as large as an entire region.

11..22..44..  GGiivviinngg aanndd ggeettttiinngg aaddvviiccee


A  process that has knowledge of its memory behavior may use
the _m_a_d_v_i_s_e[+] call:

     madvise(addr, len, behav);
     caddr_t addr; size_t len; int behav;

_B_e_h_a_v describes expected behavior, as given in _<_s_y_s_/_m_m_a_n_._h_>:





-----------
[+] The  entry  point  for  this  system  call  is
defined,  but  is  not  implemented,  so currently
always returns with the error ``Operation not sup-
ported.''









PSD:5-14                          4.4BSD Architecture Manual


     MADV_NORMAL       /* no further special treatment */
     MADV_RANDOM       /* expect random page references */
     MADV_SEQUENTIAL   /* expect sequential references */
     MADV_WILLNEED     /* will need these pages */
     MADV_DONTNEED     /* don't need these pages */


The _m_i_n_c_o_r_e[+] function allows a process to obtain  informa-
tion about whether pages are memory resident:

     mincore(addr, len, vec);
     caddr_t addr; size_t len; result char *vec;

Here the current memory residency of the pages is returned
in the character array vec, with a value of 1 meaning that
the page is in-memory.  Mincore provides only transient
information about page residency.  Real-time processes that
need guaranteed residence over time can use the call:

     mlock(addr, len);
     caddr_t addr; size_t len;

This call locks the pages for the specified address range
into memory (paging them in if necessary), ensuring that
further references to addresses within the range will never
generate page faults.  The amount of memory that may be
locked is controlled by a resource limit (see section
1.6.3).  When the memory is no longer critical it can be
unlocked using:

     munlock(addr, len);
     caddr_t addr; size_t len;

After  the  _m_u_n_l_o_c_k call, the pages in the specified address
range are still accessible but may be paged out if memory is
needed and they are not accessed.

11..22..55..  SSyynncchhrroonniizzaattiioonn pprriimmiittiivveess

Primitives are provided for synchronization using semaphores
in  shared  memory.[++]  These primitives are expected to be
superseded by the semaphore interface being specified by the
POSIX  1003 Pthread standard.  They are provided as an effi-
cient interim solution.  Application programmers are encour-
aged to use the Pthread interface when it becomes available.

     Semaphores must lie within a MAP_SHARED region with  at
least  modes PROT_READ and PROT_WRITE.  The MAP_HASSEMAPHORE
flag must have been specified when the region  was  created.
To acquire a lock a process calls:

-----------
[++] All currently unimplemented, no entry  points
exists.









4.4BSD Architecture Manual                          PSD:5-15


     value = mset(sem, wait);
     result int value; semaphore *sem; int wait;

_M_s_e_t  indivisibly  tests and sets the semaphore _s_e_m.  If the
previous value is zero, the process has  acquired  the  lock
and  _m_s_e_t  returns true immediately.  Otherwise, if the _w_a_i_t
flag is zero, failure is returned.  If _w_a_i_t is true and  the
previous  value is non-zero, _m_s_e_t relinquishes the processor
until notified that it should retry.

To release a lock a process calls:

     mclear(sem);
     semaphore *sem;

_M_c_l_e_a_r indivisibly tests and clears the semaphore  _s_e_m.   If
the  ``WANT''  flag  is  zero  in the previous value, _m_c_l_e_a_r
returns immediately.  If the ``WANT'' flag  is  non-zero  in
the previous value, _m_c_l_e_a_r arranges for waiting processes to
retry before returning.

     Two routines provide services analogous to  the  kernel
_s_l_e_e_p  and  _w_a_k_e_u_p  functions  interpreted  in the domain of
shared memory.  A process may relinquish  the  processor  by
calling _m_s_l_e_e_p with a set semaphore:

     msleep(sem);
     semaphore *sem;

If the semaphore is still set when it is checked by the
kernel, the process will be put in a sleeping state until
some other process issues an mwakeup for the same semaphore
within the region using the call:

     mwakeup(sem);
     semaphore *sem;

An _m_w_a_k_e_u_p may awaken all sleepers on the semaphore, or  may
awaken only the next sleeper on a queue.

11..33..  SSiiggnnaallss



11..33..11..  OOvveerrvviieeww


     The  system defines a set of _s_i_g_n_a_l_s that may be deliv-
ered to a process.  Signal delivery resembles the occurrence
of  a hardware interrupt: the signal is blocked from further
occurrence, the current process context is saved, and a  new
one  is  built.   A process may specify a _h_a_n_d_l_e_r to which a
signal is delivered, or specify that the  signal  is  to  be
_b_l_o_c_k_e_d  or  _i_g_n_o_r_e_d.   A  process  may  also specify that a









PSD:5-16                          4.4BSD Architecture Manual


_d_e_f_a_u_l_t action is to be taken when signals occur.

     Some signals will cause a process to exit if they are
not caught.  This may be accompanied by creation of a core
image file, containing the current memory image of the
process for use in post-mortem debugging.  A process may
also choose to have signals delivered on a special stack, so
that sophisticated software stack manipulations are
possible.

     All signals have the same priority.  If multiple
signals are pending, signals that may be generated by the
program's action are delivered first; the order in which
other signals are delivered to a process is not specified.
Signal routines execute with the signal that caused their
invocation blocked, but other signals may occur.  Multiple
signals may be delivered on a single entry to the system, as
if signal handlers were interrupted by other signal
handlers.  Mechanisms are provided whereby critical sections
of code may protect themselves against the occurrence of
specified signals.

11..33..22..  SSiiggnnaall ttyyppeess


     The signals defined by the system fall into one of five
classes:    hardware    conditions,   software   conditions,
input/output notification, process control, or resource con-
trol.  The set of signals is defined by the file _<_s_i_g_n_a_l_._h_>.

     Hardware signals are derived  from  exceptional  condi-
tions  which  may  occur  during  execution.   Such  signals
include SIGFPE representing floating point and other  arith-
metic  exceptions, SIGILL for illegal instruction execution,
SIGSEGV for attempts to access addresses  outside  the  cur-
rently assigned area of memory, and SIGBUS for accesses that
violate memory access constraints.

     Software signals reflect interrupts generated by user
request: SIGINT for the normal interrupt signal; SIGQUIT for
the more powerful quit signal, which normally causes a core
image to be generated; SIGHUP and SIGTERM, which cause
graceful process termination, either because a user has
``hung up'', or by user or program request; and SIGKILL, a
more powerful termination signal that a process cannot catch
or ignore.  Programs may define their own asynchronous
events using SIGUSR1 and SIGUSR2.  Other software signals
(SIGALRM, SIGVTALRM, SIGPROF) indicate the expiration of
interval timers.  When a window changes size, a SIGWINCH is
sent to the controlling terminal process group.

     A process can request notification via a SIGIO signal
when input or output is possible on a descriptor, or when a
non-blocking operation completes.  A process may request to
receive a SIGURG signal when an urgent condition arises.

     A process may be _s_t_o_p_p_e_d by a signal sent to it or  the
members  of its process group.  The SIGSTOP signal is a pow-
erful stop signal, because it cannot be caught.  Other  stop
signals  SIGTSTP,  SIGTTIN, and SIGTTOU are used when a user
request, input request, or output  request  respectively  is
the  reason  for  stopping the process.  A SIGCONT signal is
sent to a process when it is continued from a stopped state.
Processes  may  receive  notification  with a SIGCHLD signal
when a child process changes state, either by stopping or by
terminating.

     Exceeding  resource limits may cause signals to be gen-
erated.  SIGXCPU occurs when a process nears  its  CPU  time
limit  and  SIGXFSZ when a process reaches the limit on file
size.

11..33..33..  SSiiggnnaall hhaannddlleerrss


     A process has a handler associated  with  each  signal.
The  handler  controls the way the signal is delivered.  The
call:


     struct sigaction {
          void       (*sa_handler)();
          sigset_t   sa_mask;
          int        sa_flags;
     };



     sigaction(signo, sa, osa);
     int signo; struct sigaction *sa; result struct sigaction *osa;

assigns interrupt handler address sa_handler to signal
signo.  Each handler address specifies either an interrupt
routine for the signal, that the signal is to be ignored, or
that a default action (usually process termination) is to
occur if the signal occurs.  The constants SIG_IGN and
SIG_DFL used as values for sa_handler cause ignoring or
defaulting of a condition, respectively.  The sa_mask value
specifies the signal mask to be used when the handler is
invoked; it implicitly includes the signal which invoked the
handler.  Signal masks include one bit for each signal.  The
following macros, defined in <signal.h>, create an empty
mask, then put signo into it:

     sigemptyset(set);
     sigaddset(set, signo);
     result sigset_t *set; int signo;

Sa_flags specifies whether pending system calls should be
restarted if the signal handler returns (SA_RESTART) and
whether the handler should operate on the normal run-time
stack or a special signal stack (SA_ONSTACK; see below).  If
osa is non-zero, the previous signal handler information is
returned.
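
As a minimal sketch of the sigaction interface described
above, the following installs a handler with one extra signal
masked and with SA_RESTART set.  It uses the modern POSIX
prototype (a one-argument handler) rather than the notation of
this manual; the names install_handler, raise_and_report and
on_signal are hypothetical.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written by the handler; sig_atomic_t is safe to store from a handler. */
static volatile sig_atomic_t got_signal = 0;

static void on_signal(int signo)
{
    got_signal = signo;
}

/*
 * Install on_signal for signo.  While the handler runs, also_block is
 * masked in addition to signo itself, and interrupted system calls are
 * restarted.  If osa is non-NULL the previous disposition is returned.
 * Returns 0 on success, -1 on error.
 */
int install_handler(int signo, int also_block, struct sigaction *osa)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);            /* create an empty mask ... */
    sigaddset(&sa.sa_mask, also_block);  /* ... then put a signal into it */
    sa.sa_flags = SA_RESTART;
    return sigaction(signo, &sa, osa);
}

/* Send signo to the calling process and report what the handler saw. */
int raise_and_report(int signo)
{
    got_signal = 0;
    raise(signo);
    return got_signal;
}
```

Passing SIG_IGN or SIG_DFL in sa_handler instead of a function
would ignore the signal or restore its default action.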

     When a signal condition arises for a process, the
signal is added to a set of signals pending for the process.
If the signal is not currently blocked by the process, it
will then be delivered.  The process of signal delivery adds
the signal to be delivered and those signals specified in
the associated signal handler's sa_mask to a set of those
masked for the process, saves the current process context,
and places the process in the context of the signal handling
routine.  The call is arranged so that if the signal
handling routine returns normally, the signal mask will be
restored and the process will resume execution in the
original context.

     The mask of blocked signals is independent of handlers
for signals.  It delays signals from being delivered much as
a  raised  hardware interrupt priority level delays hardware
interrupts.   Preventing  an  interrupt  from  occurring  by
changing the handler is analogous to disabling a device from
further interrupts.

The signal handling routine sa_handler is called by a C call
of the form:

     (*sa_handler)(signo, code, scp);
     int signo; long code; struct sigcontext *scp;

The  _s_i_g_n_o gives the number of the signal that occurred, and
the _c_o_d_e, a word of signal-specific information supplied  by
the  hardware.  The _s_c_p parameter is a pointer to a machine-
dependent structure containing the information for restoring
the  context  before the signal.  Normally this context will
be restored when the signal  handler  returns.   However,  a
process may do so at any time by using the call:

     sigreturn(scp);
     struct sigcontext *scp;

If the signal handler makes a call to longjmp, the signal
mask at the time of the corresponding setjmp is restored.

11..33..44..  SSeennddiinngg ssiiggnnaallss


A process can send a signal to another process or process
group with the call:

     kill(pid, signo);
     pid_t pid; int signo;

For compatibility with old systems, a routine is provided to
send a signal to a process group:

     killpg(pgrp, signo);
     pid_t pgrp; int signo;

Unless the process sending the signal is privileged, it must
have the same effective user id as the process receiving the
signal.

     Signals also are sent implicitly from a terminal device
to  the process group associated with the terminal when cer-
tain input characters are typed.

11..33..55..  PPrrootteeccttiinngg ccrriittiiccaall sseeccttiioonnss


The _s_i_g_p_r_o_c_m_a_s_k system call is used to manipulate  the  mask
of blocked signals:

     sigprocmask(how, newmask, oldmask);
     int how; sigset_t *newmask; result sigset_t *oldmask;

The actions done by sigprocmask are to add to the list of
masked signals (SIG_BLOCK), delete from the list of masked
signals (SIG_UNBLOCK), and replace the list with a specific
set of signals (SIG_SETMASK).  The sigprocmask call can be
used to read the current mask by specifying SIG_BLOCK with an
empty newmask.

     It is possible to check conditions with some signals
blocked, and then to restore the mask and pause waiting for a
signal in a single step, by using:

     sigsuspend(mask);
     sigset_t *mask;

It is also possible to find out which  blocked  signals  are
pending delivery using the call:

     sigpending(mask);
     result sigset_t *mask;


11..33..66..  SSiiggnnaall ssttaacckkss


Applications that maintain complex or fixed size stacks can
use the call:

     struct sigaltstack {
          caddr_t   ss_sp;
          long      ss_size;
          int       ss_flags;
     };



     sigaltstack(ss, oss)
     struct sigaltstack *ss; result struct sigaltstack *oss;

to provide the system with a stack based at ss_sp of size
ss_size for delivery of signals.  The value ss_flags
indicates whether the process is currently on the signal
stack, a notion maintained in software by the system.

     When a signal is to be delivered to a handler for which
the SA_ONSTACK flag was set, the system checks whether the
process is on a signal stack.  If not, then the process is
switched to the signal stack for delivery, with the return
from the signal doing a sigreturn to restore the previous
stack.  If the process takes a non-local exit from the
signal routine, longjmp will do a sigreturn call to switch
back to the run-time stack.
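
A hedged sketch of signal stack use follows.  Current POSIX
systems spell the structure type stack_t rather than struct
sigaltstack, and that spelling is assumed here.  The stack
size ALT_STACK_SIZE and the function names are illustrative
choices, not values taken from this manual.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>

#define ALT_STACK_SIZE (64 * 1024)   /* assumed size for illustration */

static char *alt_base;
static volatile sig_atomic_t ran_on_alt = 0;

static void on_alt(int signo)
{
    char probe;                      /* lives on whichever stack is active */
    uintptr_t here = (uintptr_t)&probe;

    (void)signo;
    ran_on_alt = here >= (uintptr_t)alt_base &&
                 here < (uintptr_t)alt_base + ALT_STACK_SIZE;
}

/*
 * Register a signal stack, install a handler with SA_ONSTACK, and
 * deliver signo.  Returns 1 if the handler ran on the signal stack,
 * 0 if it did not, -1 on error.
 */
int handler_uses_altstack(int signo)
{
    stack_t ss;
    struct sigaction sa;

    alt_base = malloc(ALT_STACK_SIZE);
    if (alt_base == NULL)
        return -1;
    ss.ss_sp = alt_base;
    ss.ss_size = ALT_STACK_SIZE;
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) == -1)
        return -1;

    sa.sa_handler = on_alt;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_ONSTACK;        /* deliver on the signal stack */
    if (sigaction(signo, &sa, NULL) == -1)
        return -1;

    ran_on_alt = 0;
    raise(signo);
    return ran_on_alt;
}
```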

11..44..  TTiimmeerrss


11..44..11..  RReeaall ttiimmee


     The system's notion of the current time is  in  Coordi-
nated  Universal  Time (UTC, previously GMT) and the current
time zone is set and returned by the calls:

     settimeofday(tp, tzp);
     struct timeval *tp;
     struct timezone *tzp;


     gettimeofday(tp, tzp);
     result struct timeval *tp;
     result struct timezone *tzp;

where the structures are defined in <sys/time.h> as:

     struct timeval {
          long   tv_sec;           /* seconds since Jan 1, 1970 */
          long   tv_usec;          /* and microseconds */
     };
     struct timezone {
          int    tz_minuteswest;   /* of Greenwich */
          int    tz_dsttime;       /* type of dst correction to apply */
     };


The timezone information is present only for historical rea-
sons and is unused by the current system.

The  precision  of  the  system clock is hardware dependent.
Earlier versions of UNIX contained only a  1-second  resolu-
tion  version  of this call, which remains as a library rou-
tine:

     time(tvsec);
     result time_t *tvsec;

returning only the tv_sec field from the gettimeofday call.
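
For illustration, a small sketch that reads the clock and
computes the difference between two timeval values.  The
helper names read_clock and elapsed_usec are hypothetical; the
timezone argument is passed as NULL since it is unused.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Read the current time; returns 0 on success, -1 on error. */
int read_clock(struct timeval *tp)
{
    return gettimeofday(tp, NULL);
}

/*
 * Microseconds from start to end.  tv_usec may be smaller in end than
 * in start; the signed subtraction below accounts for that.
 */
long elapsed_usec(const struct timeval *start, const struct timeval *end)
{
    return (long)(end->tv_sec - start->tv_sec) * 1000000L
         + (long)(end->tv_usec - start->tv_usec);
}
```

A caller would typically bracket the code being timed with two
read_clock calls and pass both results to elapsed_usec.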

The  _a_d_j_t_i_m_e  system  calls allows for small changes in time
without abrupt changes by skewing the  rate  at  which  time
advances:

     adjtime(delta, olddelta);
     struct timeval *delta; result struct timeval *olddelta;


11..44..22..  IInntteerrvvaall ttiimmee


The system provides each process with three interval timers,
defined in <sys/time.h>:


     ITIMER_REAL      /* real time intervals */
     ITIMER_VIRTUAL   /* virtual time intervals */
     ITIMER_PROF      /* user and system virtual time */


The ITIMER_REAL timer decrements in real time.  It could  be
used  by  a  library  routine  to  maintain a wakeup service
queue.  A  SIGALRM  signal  is  delivered  when  this  timer
expires.

     The ITIMER_VIRTUAL timer decrements in process virtual
time.  It runs only when the process is executing.  A
SIGVTALRM signal is delivered when it expires.

     The ITIMER_PROF timer decrements both in process
virtual time and when the system is running on behalf of the
process.  It is designed to be used by processes to
statistically profile their execution.  A SIGPROF signal is
delivered when it expires.

A timer value is defined by the itimerval structure:


     struct itimerval {
          struct   timeval it_interval;   /* timer interval */
          struct   timeval it_value;      /* current value */
     };


and a timer is set or read by the call:

     setitimer(which, value, ovalue);
     int which; struct itimerval *value; result struct itimerval *ovalue;


     getitimer(which, value);
     int which; result struct itimerval *value;

The  _i_t___v_a_l_u_e  specifies the time until the next signal; the
_i_t___i_n_t_e_r_v_a_l specifies a new interval that should  be  loaded
into  the  timer  on each expiration.  The third argument to
_s_e_t_i_t_i_m_e_r specifies an optional  structure  to  receive  the
previous  contents  of  the  interval timer.  A timer can be
disabled by setting _i_t___v_a_l_u_e and _i_t___i_n_t_e_r_v_a_l to 0.

     The system rounds argument timer intervals  to  be  not
less  than  the resolution of its clock.  This clock resolu-
tion can be determined by loading a very small value into  a
timer and reading the timer back to see what value resulted.
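
The sketch below arms ITIMER_REAL as a one-shot timer and
waits for the resulting SIGALRM.  It is an illustration under
stated assumptions: the function name wait_one_shot is
hypothetical, and SIGALRM is kept blocked until sigsuspend so
that an early expiration cannot be missed.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>

static volatile sig_atomic_t alarms = 0;

static void on_alarm(int signo)
{
    (void)signo;
    alarms = 1;
}

/*
 * Wait for a one-shot real-time interval of usec microseconds
 * (usec must be positive; zero would disable the timer).
 * Returns 1 once SIGALRM has been delivered, -1 on error.
 */
int wait_one_shot(long usec)
{
    struct sigaction sa;
    struct itimerval it;
    sigset_t block, old, wait_mask;

    if (usec <= 0)
        return -1;
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    if (sigaction(SIGALRM, &sa, NULL) == -1)
        return -1;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;
    wait_mask = old;
    sigdelset(&wait_mask, SIGALRM);  /* mask used while waiting */

    alarms = 0;
    memset(&it, 0, sizeof(it));      /* it_interval of 0: no reload */
    it.it_value.tv_sec = usec / 1000000L;
    it.it_value.tv_usec = usec % 1000000L;
    if (setitimer(ITIMER_REAL, &it, NULL) == -1)
        return -1;

    while (alarms == 0)
        sigsuspend(&wait_mask);      /* unblock SIGALRM and pause */

    if (sigprocmask(SIG_SETMASK, &old, NULL) == -1)
        return -1;
    return alarms;
}
```

Setting it_interval to a non-zero value instead would make the
timer reload itself and deliver SIGALRM periodically.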

     The _a_l_a_r_m system call of earlier versions  of  UNIX  is
provided as a library routine using the ITIMER_REAL timer.

     The process profiling facilities of earlier versions of
UNIX remain because it is not always possible to guarantee
the automatic restart of system calls after receipt of a
signal.  The profil call arranges for the kernel to begin
gathering execution statistics for a process:

     profil(samples, size, offset, scale);
     result char *samples; int size, offset, scale;

This  call begins sampling the program counter, with statis-
tics maintained in the user-provided buffer.

11..55..  DDeessccrriippttoorrss













4.4BSD Architecture Manual                          PSD:5-23


11..55..11..  TThhee rreeffeerreennccee ttaabbllee


     Each process has access to resources through
descriptors.  Each descriptor is a handle allowing processes
to reference objects such as files, devices and
communications links.

     Rather than allowing processes direct access to
descriptors, the system introduces a level of indirection,
so that descriptors may be shared between processes.  Each
process has a descriptor reference table, containing
pointers to the actual descriptors.  The descriptors
themselves therefore may have multiple references, and are
reference counted by the system.

     Each process has a limited size descriptor reference
table, where the current size is returned by the
getdtablesize call:

     nds = getdtablesize();
     result int nds;

and  guaranteed  to  be  at least 64.  The maximum number of
descriptors is a resource limit (see  section  1.6.3).   The
entries in the descriptor reference table are referred to by
small integers; for example if there are 64 slots  they  are
numbered 0 to 63.

11..55..22..  DDeessccrriippttoorr pprrooppeerrttiieess


     Each descriptor has a logical set of properties
maintained by the system and defined by its type.  Each type
supports a set of operations; some operations, such as read-
ing and writing, are common to several  abstractions,  while
others  are  unique.   For  those  types that support random
access, the current file offset is stored in the descriptor.
The  generic  operations applying to many of these types are
described in section 2.1.  Naming contexts, files and direc-
tories  are described in section 2.2.  Section 2.3 describes
communications domains and sockets.  Terminals  and  (struc-
tured  and  unstructured)  devices  are described in section
2.4.

11..55..33..  MMaannaaggiinngg ddeessccrriippttoorr rreeffeerreenncceess


A duplicate of a descriptor reference may be made by doing:

     new = dup(old);
     result int new; int old;

returning a copy of descriptor reference old which is
indistinguishable from the original.  The value of new
chosen by the system will be the smallest unused descriptor
reference slot.  A copy of a descriptor reference may be
made in a specific slot by doing:

     dup2(old, new);
     int old, new;

The _d_u_p_2 call causes the system to deallocate the descriptor
reference  current  occupying slot _n_e_w, if any, replacing it
with a reference to the same descriptor as old.

Descriptors are deallocated by:

     close(old);
     int old;
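
The following sketch illustrates that a duplicated reference
names the same underlying object: after the original reference
is closed, data written through the copy still reaches the
pipe.  The function name roundtrip_via_copy is hypothetical,
and the sketch assumes that a requested slot does not coincide
with either descriptor returned by pipe.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <assert.h>
#include <unistd.h>

/*
 * Send one byte through a duplicate of a pipe's write end.
 * If slot is negative the copy is made with dup (smallest unused
 * slot); otherwise it is placed in the given slot with dup2.
 * Returns the byte read back, or -1 on error.
 */
int roundtrip_via_copy(int slot)
{
    int p[2], copy, result = -1;
    char out = 'x', in = 0;

    if (pipe(p) == -1)
        return -1;
    copy = (slot < 0) ? dup(p[1]) : dup2(p[1], slot);
    close(p[1]);                 /* drop the original reference */
    if (copy != -1) {
        /* The pipe is still open for writing through the copy. */
        if (write(copy, &out, 1) == 1 && read(p[0], &in, 1) == 1)
            result = in;
        close(copy);
    }
    close(p[0]);
    return result;
}
```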


11..55..44..  MMuullttiipplleexxiinngg rreeqquueessttss


     The system provides a standard way to do synchronous
and asynchronous multiplexing of operations.  Synchronous
multiplexing is performed by using the select call to
examine the state of multiple descriptors simultaneously, and
to wait for state changes on those descriptors.  Sets of
descriptors of interest are specified as bit masks, as
follows:

     nds = select(nd, in, out, except, tvp);
     result int nds; int nd; result fd_set *in, *out, *except;
     struct timeval *tvp;

     FD_CLR(fd, &fdset);
     FD_COPY(&fdset, &fdset2);
     FD_ISSET(fd, &fdset);
     FD_SET(fd, &fdset);
     FD_ZERO(&fdset);
     int fd; fd_set fdset, fdset2;

The _s_e_l_e_c_t call examines the descriptors  specified  by  the
sets  _i_n,  _o_u_t and _e_x_c_e_p_t, replacing the specified bit masks
by the subsets that  select  true  for  input,  output,  and
exceptional conditions respectively (_n_d indicates the number
of file descriptors specified by the  bit  masks).   If  any
descriptors  meet the following criteria, then the number of
such descriptors is returned in _n_d_s and the  bit  masks  are
updated.

*    A descriptor selects for input if an input oriented
     operation such as read or receive is possible, or if a
     connection request may be accepted (see sections 2.1.3
     and 2.3.1.4).

*    A descriptor selects for output if an output oriented
     operation such as write or send is possible, or if an
     operation that was ``in progress'', such as connection
     establishment, has completed (see sections 2.1.3 and
     2.3.1.5).

*    A descriptor selects for an exceptional condition if  a
     condition that would cause a SIGURG signal to be gener-
     ated exists (see section 1.3.2), or  other  device-spe-
     cific events have occurred.

For  these  tests, an operation is considered to be possible
if a call to the operation  would  return  without  blocking
(even  if the O_NONBLOCK flag were not set).  For example, a
descriptor would test as ready for reading if  a  read  call
would  return  immediately with data, an end-of-file indica-
tion, or an error other than EWOULDBLOCK.

If none of the specified conditions is true, the operation
waits for one of the conditions to arise, blocking at most
the amount of time specified by tvp.  If tvp is given as
NULL, the select waits indefinitely.
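
A minimal sketch of synchronous multiplexing on a single
descriptor with a timeout is given below.  The helper name
wait_readable is hypothetical; a real program would usually
place several descriptors in the set.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/*
 * Wait up to timeout_ms milliseconds for fd to select for input.
 * Returns 1 if fd is readable, 0 on timeout, -1 on error.
 */
int wait_readable(int fd, long timeout_ms)
{
    fd_set in;
    struct timeval tv;
    int nds;

    FD_ZERO(&in);
    FD_SET(fd, &in);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    /* nd is one more than the highest descriptor in any set. */
    nds = select(fd + 1, &in, NULL, NULL, &tv);
    if (nds == -1)
        return -1;
    return nds > 0 && FD_ISSET(fd, &in);
}
```

Because select overwrites the sets it is given, the set is
rebuilt with FD_ZERO and FD_SET on every call.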

Options affecting I/O on a descriptor may be read and set by
the call:

     dopt = fcntl(d, cmd, arg);
     result int dopt; int d, cmd, arg;



     /* command values */

     F_DUPFD    /* return a new descriptor */
     F_GETFD    /* get file descriptor flags */
     F_SETFD    /* set file descriptor flags */
     F_GETFL    /* get file status flags */
     F_SETFL    /* set file status flags */
     F_GETOWN   /* get SIGIO/SIGURG proc/pgrp */
     F_SETOWN   /* set SIGIO/SIGURG proc/pgrp */
     F_GETLK    /* get blocking lock */
     F_SETLK    /* set or clear lock */
     F_SETLKW   /* set lock with wait */


The F_DUPFD cmd provides functionality similar to dup2: it
returns the lowest-numbered unused descriptor greater than
or equal to arg rather than reusing slot arg itself; it is
provided solely for POSIX compatibility.  The F_SETFD cmd
can be used to set the close-on-exec flag for a file
descriptor.  The F_SETFL cmd may be used to set a descriptor
in non-blocking I/O mode and/or enable signaling when I/O is
possible.  F_SETOWN may be used to specify a process or
process group to be signaled when using the latter mode of
operation or when urgent indications arise.  The fcntl
system call also provides POSIX-compliant byte-range locking
on files.  However, the semantics of unlocking on every close
rather than last close make them useless.  Much better
semantics and faster locking are provided by the flock
system call (see section 2.2.7).  The fcntl and flock locks
can be used concurrently; they will serialize against each
other properly.

     Operations on non-blocking descriptors will either
complete immediately, return the error EWOULDBLOCK, partially
complete an input or output operation returning a partial
count, or return an error EINPROGRESS noting that the
requested operation is in progress.  A descriptor which has
signalling enabled will cause the specified process and/or
process group to be signaled, with a SIGIO for input, output,
or in-progress operation complete, or a SIGURG for
exceptional conditions.

     For example, when writing to a terminal using
non-blocking output, the system will accept only as much data
as there is buffer space, then return.  When making a
connection on a socket, the operation may return indicating
that the connection establishment is ``in progress''.  The
select facility can be used to determine when further output
is possible on the terminal, or when the connection
establishment attempt is complete.
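
As an illustration of non-blocking mode, the sketch below sets
O_NONBLOCK with F_GETFL and F_SETFL and shows how a read that
would otherwise block is reported.  The helper names are
hypothetical.  EAGAIN is checked alongside EWOULDBLOCK because
some systems define them as distinct values.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Add O_NONBLOCK to the status flags of fd, keeping the other flags. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * Try to read one byte without blocking.
 * Returns 1 if a byte was read, 0 if the read would have blocked,
 * and -1 on end-of-file or any other error.
 */
int try_read_byte(int fd, char *c)
{
    ssize_t n = read(fd, c, 1);

    if (n == 1)
        return 1;
    if (n == -1 && (errno == EWOULDBLOCK || errno == EAGAIN))
        return 0;
    return -1;
}
```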

11..66..  RReessoouurrccee ccoonnttrroollss


11..66..11..  PPrroocceessss pprriioorriittiieess


     The  system  gives CPU scheduling priority to processes
that have not used CPU time recently.  This tends  to  favor
interactive  processes  and  processes that execute only for
short periods.  The instantaneous scheduling priority  is  a
function  of CPU usage and a settable priority value used in
adjusting the instantaneous priority with CPU usage or inac-
tivity.   It  is possible to determine the settable priority
factor  currently  assigned  to  a  process  (PRIO_PROCESS),
process  group  (PRIO_PGRP), or the processes of a specified
user (PRIO_USER), or to alter this priority using the calls:

     prio = getpriority(which, who);
     result int prio; int which, who;


     setpriority(which, who, prio);
     int which, who, prio;

The value prio is in the range -20 to 20.  The default
priority is 0; lower priorities cause more favorable
execution.  The getpriority call returns the highest priority
(lowest numerical value) enjoyed by any of the specified
processes.  The setpriority call sets the priorities of all
the specified processes to the specified value.  Only the
super-user may lower priority values (that is, request more
favorable scheduling).
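
A short sketch of reading and adjusting the priority of the
calling process follows.  Since -1 is a legitimate priority,
the sketch clears errno before calling getpriority and checks
it afterwards.  The helper names are hypothetical; a who value
of 0 denotes the calling process.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <assert.h>
#include <errno.h>
#include <sys/resource.h>

/* Store the caller's priority in *prio; returns 0, or -1 on error. */
int own_priority(int *prio)
{
    int value;

    errno = 0;
    value = getpriority(PRIO_PROCESS, 0);
    if (value == -1 && errno != 0)
        return -1;
    *prio = value;
    return 0;
}

/*
 * Raise the caller's priority value by delta (delta >= 0), making it
 * less favored.  This direction needs no privilege.
 */
int be_nicer(int delta)
{
    int prio;

    if (delta < 0 || own_priority(&prio) == -1)
        return -1;
    return setpriority(PRIO_PROCESS, 0, prio + delta);
}
```

The system may clamp the requested value to the top of its
supported range, so the result should be read back rather than
assumed.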

11..66..22..  RReessoouurrccee uuttiilliizzaattiioonn


     The  _g_e_t_r_u_s_a_g_e  call returns information describing the
resources used by the current process (RUSAGE_SELF), or  all
its terminated descendent processes (RUSAGE_CHILDREN):

     getrusage(who, rusage);
     int who; result struct rusage *rusage;

The information is returned in a structure defined in
<sys/resource.h>:


     struct rusage {
          struct   timeval ru_utime;   /* user time used */
          struct   timeval ru_stime;   /* system time used */
          int      ru_maxrss;          /* maximum core resident set size: kbytes */
          int      ru_ixrss;           /* integral shared memory size (kbytes*sec) */
          int      ru_idrss;           /* unshared data memory size */
          int      ru_isrss;           /* unshared stack memory size */
          int      ru_minflt;          /* page-reclaims */
          int      ru_majflt;          /* page faults */
          int      ru_nswap;           /* swaps */
          int      ru_inblock;         /* block input operations */
          int      ru_oublock;         /* block output operations */
          int      ru_msgsnd;          /* messages sent */
          int      ru_msgrcv;          /* messages received */
          int      ru_nsignals;        /* signals received */
          int      ru_nvcsw;           /* voluntary context switches */
          int      ru_nivcsw;          /* involuntary context switches */
     };
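
For example, the total CPU time consumed so far can be derived
from the ru_utime and ru_stime fields, as in this sketch.  The
function name cpu_time_usec is hypothetical.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <assert.h>
#include <sys/resource.h>
#include <sys/time.h>

/*
 * User plus system CPU time used by the calling process, in
 * microseconds, or -1 on error.
 */
long long cpu_time_usec(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) == -1)
        return -1;
    return ((long long)ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000LL
         + ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
}
```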



11..66..33..  RReessoouurrccee lliimmiittss


     The resources of a process for which limits are
controlled by the kernel are defined in <sys/resource.h>, and
controlled by the getrlimit and setrlimit calls:

     getrlimit(resource, rlp);
     int resource; result struct rlimit *rlp;


     setrlimit(resource, rlp);
     int resource; struct rlimit *rlp;

The resources that may currently be controlled include:

     RLIMIT_CPU       /* cpu time in seconds */
     RLIMIT_FSIZE     /* maximum file size */
     RLIMIT_DATA      /* data size */
     RLIMIT_STACK     /* stack size */
     RLIMIT_CORE      /* core file size */
     RLIMIT_RSS       /* resident set size */
     RLIMIT_MEMLOCK   /* locked-in-memory address space */
     RLIMIT_NPROC     /* number of processes */
     RLIMIT_NOFILE    /* number of open files */
     RLIMIT_SBSIZE    /* maximum size of all socket buffers */
     RLIMIT_AS        /* virtual process size (inclusive of mmap) */
     RLIMIT_VMEM      /* alias of RLIMIT_AS */
     RLIMIT_NTHR      /* number of threads */


Each limit has a current value and a maximum defined by the
rlimit structure:


     struct rlimit {
          quad_t   rlim_cur;   /* current (soft) limit */
          quad_t   rlim_max;   /* hard limit */
     };



     Only the super-user can raise the maximum limits.
Other users may only alter rlim_cur within the range from 0
to rlim_max or (irreversibly) lower rlim_max.  To remove a
limit on a resource, the value is set to RLIM_INFINITY.
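
The sketch below lowers only the current (soft) limit on open
files and leaves rlim_max untouched, so that the change can
later be undone without privilege.  The function name
cap_open_files is hypothetical, and the sketch uses the POSIX
type rlim_t for the limit fields.

```c
#define _XOPEN_SOURCE 700   /* request POSIX/XSI declarations */
#include <assert.h>
#include <sys/resource.h>

/*
 * Ensure the soft limit on open files is no greater than want.
 * Returns the resulting soft limit, or -1 on error.
 */
long cap_open_files(long want)
{
    struct rlimit rl;

    if (want < 0 || getrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY || (rlim_t)want < rl.rlim_cur)
        rl.rlim_cur = (rlim_t)want;      /* rlim_max is left alone */
    if (setrlimit(RLIMIT_NOFILE, &rl) == -1)
        return -1;
    return (long)rl.rlim_cur;
}
```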

11..77..  SSyysstteemm ooppeerraattiioonn ssuuppppoorrtt


Unless noted otherwise, the calls in this section  are  per-
mitted only to a privileged user.

11..77..11..  MMoonniittoorriinngg ssyysstteemm ooppeerraattiioonn


     The _s_y_s_c_t_l function allows any process to retrieve sys-
tem information and allows processes with appropriate privi-
leges to set system configurations.

     sysctl(name, namelen, oldp, oldlenp, newp, newlen);
     int *name; u_int namelen; result void *oldp; result size_t *oldlenp;
     void *newp; size_t newlen;

The information available from sysctl consists of integers,
strings, and tables.  Sysctl returns a consistent snapshot
of the data requested.  Consistency is obtained by locking
the destination buffer into memory so that the data may be
copied out without blocking.  Calls to sysctl are serialized
to avoid deadlock.

     The object to be interrogated or set is named using a
``Management Information Base'' (MIB) style name, listed in
name, which is a namelen length array of integers.  This
name is from a hierarchical name space, with the most
significant component in the first element of the array.  It
is analogous to a file pathname, but with integers as
components rather than slash-separated strings.

     The information is copied into the buffer specified by
oldp.  The size of the buffer is given by the location
specified by oldlenp before the call, and that location is
filled in with the amount of data copied after a successful
call.  If the amount of data available is greater than the
size of the buffer supplied, the call supplies as much data
as fits in the buffer provided and returns an error.

     To set a new value, newp is set to point to a buffer of
length newlen from which the requested value is to be taken.
If a new value is not to be set, newp should be set to NULL
and newlen set to 0.

     The top level names (those used in the first element of
the name array) are defined with a CTL_ prefix in
<sys/sysctl.h>, and are as follows.  The next and subsequent
levels down are found in the include files listed here:


     Name          Next Level Names   Description
     ----------------------------------------------------
     CTL_DEBUG     sys/sysctl.h       Debugging
     CTL_FS        sys/sysctl.h       Filesystem
     CTL_HW        sys/sysctl.h       Generic CPU, I/O
     CTL_KERN      sys/sysctl.h       High kernel limits
     CTL_MACHDEP   sys/sysctl.h       Machine dependent
     CTL_NET       sys/socket.h       Networking
     CTL_USER      sys/sysctl.h       User-level
     CTL_VM        vm/vm_param.h      Virtual memory



11..77..22..  BBoooottssttrraapp ooppeerraattiioonnss


The call:

     mount(type, dir, flags, data);
     int type; char *dir; int flags; caddr_t data;

extends the name space.  The mount call grafts a filesystem
object onto the system file tree at the point specified in
dir.  The argument type specifies the type of filesystem to
be mounted.  The argument data describes the filesystem
object to be mounted according to the type.  The contents of
the filesystem become available through the new mount point
dir.  Any files in or below dir at the time of a successful
mount disappear from the name space until the filesystem is
unmounted.  The flags value specifies generic properties,
such as a request to mount the filesystem read-only.

Information  about  all  mounted filesystems can be obtained
with the call:

     getfsstat(buf, bufsize, flags);
     result struct statfs *buf; long bufsize; int flags;


The call:

     swapon(blkdev);
     char *blkdev;

specifies a device to be made available for paging and swap-
ping.

11..77..33..  SShhuuttddoowwnn ooppeerraattiioonnss


The call:

     unmount(dir, flags);
     char *dir; int flags;

unmounts the filesystem mounted on dir.  This call will
succeed only if the filesystem is not currently being used or
if the MNT_FORCE flag is specified.

The call:

     sync();

schedules I/O to flush all modified disk blocks resident in
the kernel.  (This call does not require privileged status.)
Files can be selectively flushed to disk using the fsync
call (see section 2.2.6).

The call:

     reboot(how);
     int how;

causes a machine halt or reboot.  The call may request a
reboot by specifying how as RB_AUTOBOOT, or that the machine
be halted with RB_HALT, among other options.  These
constants are defined in <sys/reboot.h>.

1.7.4.  Accounting


     The system optionally keeps an accounting record in a
file for each process that exits on the system.  The format
of this record is beyond the scope of this document.  The
accounting may be enabled to a file path by doing:

     acct(path);
     char *path;

If _p_a_t_h is NULL, then accounting  is  disabled.   Otherwise,
the named file becomes the accounting file.

22..  SSyysstteemm ffaacciilliittiieess


The system abstractions described are:

Directory contexts
     A  directory  context  is  a position in the filesystem
     name  space.   Operations  on  files  and  other  named
     objects  in  a filesystem are always specified relative
     to such a context.

Files
     Files are used to store uninterpreted sequences of
     bytes, which may be read and written randomly.  Pages
     from files may also be mapped into the process address
     space.  A directory may be read as a file if permitted
     by the underlying storage facility, though it is
     usually accessed using getdirentries (see section
     2.2.3.1).  (Local filesystems permit directories to be
     read, although most NFS implementations do not allow
     reading of directories.)

Communications domains
     A communications domain represents an interprocess com-
     munications  environment,  such  as  the communications
     facilities of the 4.4BSD system, communications in  the
     INTERNET,  or the resource sharing protocols and access
     rights of a resource sharing system on a local network.

Sockets
     A  socket is an endpoint of communication and the focal
     point for IPC in a communications domain.  Sockets  may
     be  created  in  pairs, or given names and used to ren-
     dezvous with other sockets in a communications  domain,
     accepting  connections from these sockets or exchanging
     messages with them.  These operations model  a  labeled
     or unlabeled communications graph, and can be used in a
     wide variety of communications  domains.   Sockets  can
     have different types to provide different semantics of
     communication, increasing the flexibility of the model.

Terminals and other devices
     Devices include terminals (providing input editing,
     interrupt generation, output flow control, and
     editing), magnetic tapes, disks, and other peripherals.
     They normally support the generic read and write
     operations as well as a number of ioctls.

Processes
     Process  descriptors provide facilities for control and
     debugging of other processes.

22..11..  GGeenneerriicc ooppeerraattiioonnss


     Many system abstractions support the read, write, and
ioctl operations.  We describe the basics of these common
primitives here.  Similarly, the mechanisms whereby normally
synchronous operations may occur in a non-blocking or
asynchronous fashion are common to all system-defined
abstractions and are described here.

22..11..11..  RReeaadd aanndd wwrriittee


     The  _r_e_a_d and _w_r_i_t_e system calls can be applied to com-
munications channels, files, terminals  and  devices.   They
have the form:

     cc = read(fd, buf, nbytes);
     result ssize_t cc; int fd; result void *buf; size_t nbytes;


     cc = write(fd, buf, nbytes);
     result ssize_t cc; int fd; void *buf; size_t nbytes;

The  _r_e_a_d  call  transfers as much data as possible from the
object defined by _f_d to the buffer at address  _b_u_f  of  size
_n_b_y_t_e_s.   The number of bytes transferred is returned in _c_c,
which is -1 if a return occurred before any data was  trans-
ferred  because  of  an  error or use of non-blocking opera-
tions.  A return value of 0 is used to indicate  an  end-of-
file condition.

     The  _w_r_i_t_e  call  transfers data from the buffer to the
object defined by _f_d.  Depending on the type of  _f_d,  it  is
possible  that  the _w_r_i_t_e call will accept only a portion of
the provided bytes; the user should resubmit the other bytes
in a later request.  Error returns because of interrupted or
otherwise incomplete operations are possible, in which  case
no data will have been transferred.
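
The resubmission of unwritten bytes described above can be
sketched as a short loop.  This is a minimal sketch in modern
C, not code taken from the system; the helper name write_all
is hypothetical, and retrying after EINTR is an assumption
about how a caller would choose to treat an interrupted call.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * write_all (hypothetical helper): call write repeatedly until all
 * nbytes have been accepted.  Returns 0 on success, -1 on error.
 */
int
write_all(int fd, const void *buf, size_t nbytes)
{
    const char *p = buf;    /* char pointer so arithmetic is legal */

    while (nbytes > 0) {
        ssize_t cc = write(fd, p, nbytes);

        if (cc == -1) {
            if (errno == EINTR)
                continue;   /* interrupted; nothing was transferred */
            return -1;
        }
        p += cc;            /* resubmit only the unwritten tail */
        nbytes -= (size_t)cc;
    }
    return 0;
}
```

A caller would use write_all(fd, buf, nbytes) wherever a short
count from a single write would otherwise need special handling.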

     Scattering  of  data on input, or gathering of data for
output is also possible using an array of input/output  vec-
tor descriptors.  The type for the descriptors is defined in
_<_s_y_s_/_u_i_o_._h_> as:


     struct iovec {
          char     *iov_base;   /* base of a component */
          size_t   iov_len;     /* length of a component */
     };



The _i_o_v___b_a_s_e field should be treated as  if  its  type  were
``void  *'' as POSIX and other versions of the structure may
use that type.  Thus, pointer arithmetic should not use this
value without a cast.

The calls using an array of _i_o_v_e_c structures are:

     cc = readv(fd, iov, iovlen);
     result ssize_t cc; int fd; struct iovec *iov; int iovlen;


     cc = writev(fd, iov, iovlen);
     result ssize_t cc; int fd; struct iovec *iov; int iovlen;

Here _i_o_v_l_e_n is the count of elements in the _i_o_v array.

22..11..22..  IInnppuutt//oouuttppuutt ccoonnttrrooll


Control  operations  on an object are performed by the _i_o_c_t_l
operation:

     ioctl(fd, request, buffer);
     int fd; u_long request; caddr_t buffer;

This operation causes the specified _r_e_q_u_e_s_t to be  performed
on  the  object _f_d.  The _r_e_q_u_e_s_t parameter specifies whether
the argument buffer is to be read, written, read  and  writ-
ten,  or  is  not  used, and also the size of the buffer, as
well as the request.  Different descriptor  types  and  sub-
types   within  descriptor  types  may  use  distinct  _i_o_c_t_l
requests.  For  example,  operations  on  terminals  control
flushing  of input and output queues and setting of terminal
parameters; operations on disks cause formatting  operations
to occur; operations on tapes control tape positioning.  The
names  for  basic  control   operations   are   defined   by
_<_s_y_s_/_i_o_c_t_l_._h_>, or more specifically by files it includes.

22..11..33..  NNoonn--bblloocckkiinngg aanndd aassyynncchhrroonnoouuss ooppeerraattiioonnss


     A  process that wishes to do non-blocking operations on
one of its descriptors sets the descriptor  in  non-blocking
mode  as  described  in  section 1.5.4.  Thereafter the _r_e_a_d
call will return a specific EWOULDBLOCK error indication  if
there  is  no  data  to be _r_e_a_d.  The process may _s_e_l_e_c_t the
associated descriptor to determine when a read is  possible.

     Output attempted when a descriptor can accept less than
is requested will either accept some of the  provided  data,
returning  a  shorter than normal length, or return an error
indicating that the operation would block.  More output  can
be  performed  as soon as a _s_e_l_e_c_t call indicates the object
is writable.

     Operations other than data input or output may be  per-
formed  on  a  descriptor  in a non-blocking fashion.  These
operations will return with a characteristic error  indicat-
ing  that they are in progress if they cannot complete imme-
diately.  The descriptor may then be _s_e_l_e_c_t'ed for _w_r_i_t_e  to
find out when the operation has been completed.  When _s_e_l_e_c_t
indicates the descriptor is writable, the operation has com-
pleted.   Depending  on the nature of the descriptor and the
operation, additional activity may be  started  or  the  new
state may be tested.
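
The read case can be sketched as follows.  This is an
illustration under stated assumptions rather than a reference
implementation: the helper names are hypothetical, non-blocking
mode is set with fcntl and O_NONBLOCK, and both EWOULDBLOCK and
EAGAIN are accepted because some systems define them as
distinct values.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <sys/types.h>
#include <unistd.h>

/* set_nonblocking (hypothetical helper): add O_NONBLOCK to fd. */
int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/*
 * read_when_ready (hypothetical helper): attempt a non-blocking read;
 * if it would block, wait in select until fd is readable and retry.
 */
ssize_t
read_when_ready(int fd, void *buf, size_t nbytes)
{
    for (;;) {
        fd_set rfds;
        ssize_t cc = read(fd, buf, nbytes);

        if (cc != -1 || (errno != EWOULDBLOCK && errno != EAGAIN))
            return cc;      /* data, end-of-file, or a real error */

        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        if (select(fd + 1, &rfds, NULL, NULL, NULL) == -1 &&
            errno != EINTR)
            return -1;
    }
}
```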

22..22..  FFiilleessyysstteemm


22..22..11..  OOvveerrvviieeww


     The filesystem abstraction provides access to a hierar-
chical filesystem structure.  The filesystem contains direc-
tories  (each  of which may contain sub-directories) as well
as files and references to other objects such as devices and
inter-process communications sockets.

     Each  file is organized as a linear array of bytes.  No
record boundaries or system related information  is  present
in a file.  Files may be read and written in a random-access
fashion.  If permitted by the underlying storage  mechanism,
the  user may read the data in a directory as though it were
an ordinary file to determine the  names  of  the  contained
files, but only the system may write into the directories.

22..22..22..  NNaammiinngg


     The  filesystem  calls take _p_a_t_h _n_a_m_e arguments.  These
consist of zero or more component _f_i_l_e _n_a_m_e_s separated  by
``/''  characters,  where  each  file name is up to NAME_MAX
(255) characters excluding null and ``/''.  Each pathname is
up to PATH_MAX (1024) characters excluding null.

     Each  process  always  has two naming contexts: one for
the root directory of the filesystem and one for the current
working  directory.   These  are  used  by the system in the
filename translation process.  If a path name begins with  a
``/'',  it  is called a full path name and interpreted rela-
tive to the root directory context.  If the path  name  does
not begin with a ``/'' it is called a relative path name and
interpreted relative to the current directory context.

     The file name ``.'' in each directory  refers  to  that
directory.  The file name ``..'' in each directory refers to
the parent directory of that directory.  The  parent  direc-
tory of the root of the filesystem is itself.

The calls:

     chdir(path);
     char *path;


     fchdir(fd);
     int fd;


     chroot(path);
     char *path;

change  the current working directory or root directory con-
text of a process.  Only the super-user can change the  root
directory context of a process.
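
One common use of _f_c_h_d_i_r is to return to a directory after
working elsewhere.  The sketch below holds a descriptor on the
starting directory for that purpose; the helper names save_cwd
and restore_cwd are hypothetical.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * save_cwd (hypothetical helper): open the current working directory
 * so that it can be returned to later.  Returns a descriptor or -1.
 */
int
save_cwd(void)
{
    return open(".", O_RDONLY);
}

/*
 * restore_cwd (hypothetical helper): make the saved directory current
 * again and release the descriptor.  Returns 0 on success, -1 on error.
 */
int
restore_cwd(int saved)
{
    int rc = fchdir(saved);

    close(saved);
    return rc;
}
```

Because the descriptor refers to the directory itself rather
than to a path name, the return works even if the directory was
renamed in the meantime.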

Information  about  a  filesystem that contains a particular
file can be obtained using the calls:

     statfs(path, buf);
     char *path; struct statfs *buf;


     fstatfs(fd, buf);
     int fd; struct statfs *buf;


22..22..33..  CCrreeaattiioonn aanndd rreemmoovvaall


     The  filesystem  allows  directories,  files,   special
devices,  and  fifos  to  be  created  and  removed from the
filesystem.

22..22..33..11..  DDiirreeccttoorryy ccrreeaattiioonn aanndd rreemmoovvaall


A directory is created with the _m_k_d_i_r system call:

     mkdir(path, mode);
     char *path; mode_t mode;

where  the  mode  is  defined  as  for  files  (see  section
2.2.3.2).   Directories  are  removed  with the _r_m_d_i_r system
call:

     rmdir(path);
     char *path;

A directory must be empty (other than the entries ``.''  and
``..'')  if it is to be deleted.

Although  directories can be read as files, the usual inter-
face is to use the call:

     getdirentries(fd, buf, nbytes, basep);
     int fd; char *buf; int nbytes; long *basep;

The _g_e_t_d_i_r_e_n_t_r_i_e_s system call returns a canonical  array  of
directory  entries  in  the  filesystem  independent  format
described in _<_d_i_r_e_n_t_._h_>.  Application programs  usually  use
the  library  routines  _o_p_e_n_d_i_r, _r_e_a_d_d_i_r, and _c_l_o_s_e_d_i_r which
provide a more convenient interface than _g_e_t_d_i_r_e_n_t_r_i_e_s.  The
_f_t_s package is provided for recursive directory traversal.

22..22..33..22..  FFiillee ccrreeaattiioonn


Files are opened and/or created with the _o_p_e_n system call:

     fd = open(path, oflag, mode);
     result int fd; char *path; int oflag; mode_t mode;

The  _p_a_t_h  parameter  specifies  the  name of the file to be
opened.  The _o_f_l_a_g parameter must include O_CREAT  to  cause
the  file  to  be  created.   Bits  for _o_f_l_a_g are defined in
_<_f_c_n_t_l_._h_>:


     O_RDONLY     /* open for reading only */
     O_WRONLY     /* open for writing only */
     O_RDWR       /* open for reading and writing */
     O_NONBLOCK   /* no delay */
     O_APPEND     /* set append mode */
     O_SHLOCK     /* open with shared file lock */
     O_EXLOCK     /* open with exclusive file lock */
     O_ASYNC      /* signal pgrp when data ready */
     O_FSYNC      /* synchronous writes */
     O_CREAT      /* create if nonexistent */
     O_TRUNC      /* truncate to zero length */
     O_EXCL       /* error if already exists */



     One of O_RDONLY, O_WRONLY and O_RDWR should  be  speci-
fied,  indicating what types of operations are desired to be
done on the open  file.   The  operations  will  be  checked
against the user's access rights to the file before allowing
the _o_p_e_n to succeed.  Specifying O_APPEND causes all  writes
to  be  appended to the file.  Specifying O_TRUNC causes the
file to be truncated when opened.  The flag  O_CREAT  causes
the  file  to  be created if it does not exist, owned by the
current user and the group of the containing directory.  The
permissions for the new file are specified in _m_o_d_e as the OR
of the appropriate permissions as defined in _<_s_y_s_/_s_t_a_t_._h_>:


     S_IRWXU                      /* RWX for owner */
     S_IRUSR                      /* R for owner */
     S_IWUSR                      /* W for owner */
     S_IXUSR                      /* X for owner */
     S_IRWXG                      /* RWX for group */
     S_IRGRP                      /* R for group */
     S_IWGRP                      /* W for group */
     S_IXGRP                      /* X for group */
     S_IRWXO                      /* RWX for other */
     S_IROTH                      /* R for other */
     S_IWOTH                      /* W for other */
     S_IXOTH                      /* X for other */
     S_ISUID                      /* set user id */
     S_ISGID                      /* set group id */
     S_ISTXT                      /* sticky bit */



Historically, the file mode has been used as  a  four  digit
octal number.  The bottom three digits encode read access as
4, write  access  as  2  and  execute  access  as  1,  or'ed
together.  The 0700 bits describe owner access, the 070 bits
describe the access rights for processes in the  same  group
as  the file, and the 07 bits describe the access rights for
other processes.  The 7000 bits encode set user ID as  4000,
set  group ID as 2000, and the sticky bit as 1000.  The mode
specified to _o_p_e_n is modified by the process _u_m_a_s_k;  permis-
sions  specified in the _u_m_a_s_k are cleared in the mode of the
created file.  The _u_m_a_s_k can be changed with the call:

     oldmask = umask(newmask);
     result mode_t oldmask; mode_t newmask;
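
The effect of the _u_m_a_s_k on a requested mode is simple bit
arithmetic, sketched below.  The helper name effective_mode is
hypothetical, and the sketch assumes that only the nine
permission bits of the mask are significant.

```c
#include <sys/types.h>
#include <sys/stat.h>

/*
 * effective_mode (hypothetical helper): the mode a new file receives
 * when created with the requested mode under the given umask.
 */
mode_t
effective_mode(mode_t requested, mode_t mask)
{
    /* Permissions named in the mask are cleared from the request. */
    return requested & ~(mask & 0777);
}
```

For example, a request of 0666 under a mask of 022 yields 0644:
read and write for the owner, read only for group and other.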


     If the O_EXCL flag is set, and the file already exists,
then  the  _o_p_e_n  will fail without affecting the file in any
way.  This mechanism  provides  a  simple  exclusive  access
facility.   For  security reasons, if the O_EXCL flag is set
and the file is a symbolic link, the open will fail  regard-
less  of  the  existence of the file referenced by the link.
The O_SHLOCK and O_EXLOCK flags allow the file to be  atomi-
cally _o_p_e_n'ed and _f_l_o_c_k'ed; see section 2.2.7 for the seman-
tics of _f_l_o_c_k style locks.  The  O_ASYNC  flag  enables  the
SIGIO  signal to be sent to the process group of the opening
process when I/O is possible,  e.g.,  upon  availability  of
data to be read.
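
The exclusive access facility provided by O_EXCL can be
sketched as a lock file.  The helper names and the 0600 mode
are assumptions made for illustration only.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * acquire_lockfile (hypothetical helper): create path exclusively.
 * Returns a descriptor if this caller created the file; returns -1
 * with errno set to EEXIST if the name already exists.
 */
int
acquire_lockfile(const char *path)
{
    return open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
}

/* release_lockfile (hypothetical helper): close and remove the file. */
int
release_lockfile(const char *path, int fd)
{
    close(fd);
    return unlink(path);
}
```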

22..22..33..33..  CCrreeaattiinngg rreeffeerreenncceess ttoo ddeevviicceess


     The  filesystem  allows entries which reference periph-
eral devices.  Peripherals are  distinguished  as  _b_l_o_c_k  or
_c_h_a_r_a_c_t_e_r  devices  according  by  their  ability to support
block-oriented operations.  Devices are identified by  their
``major''  and  ``minor''  device numbers.  The major device
number determines the kind of peripheral it  is,  while  the
minor  device  number  indicates either one of possibly many
peripherals of that kind, or special characteristics of  the
peripheral.   Structured  devices  have  all operations done
internally  in  ``block''  quantities   while   unstructured
devices may have input and output done in varying units, and
may act as a non-seekable communications channel rather than
a  random-access  device.   The  _m_k_n_o_d  call creates special
entries:

     mknod(path, mode, dev);
     char *path; mode_t mode; dev_t dev;

where _m_o_d_e is formed from the object type and access permis-
sions.   The  parameter  _d_e_v  is  a  configuration dependent
parameter used to identify specific character or  block  I/O
devices.

Fifos can be created in the filesystem using the call:

     mkfifo(path, mode);
     char *path; mode_t mode;

The _m_o_d_e parameter is used solely to specify the access per-
missions of the newly created fifo.

22..22..33..44..  LLiinnkkss aanndd rreennaammiinngg


     Links allow multiple names for a file to exist.   Links
exist independently of the file to which they are linked.

     Two  types  of  links  exist,  _h_a_r_d  links and _s_y_m_b_o_l_i_c
links.  A hard link is a reference counting  mechanism  that
allows  a  file  to  have  multiple  names  within  the same
filesystem.  Each link to a file is equivalent, referring to
the  file  independently  of any other name.  Symbolic links
cause string substitution during the pathname interpretation
process,  and  refer  to  a  file name rather than referring
directly to a file.

     Hard links and symbolic links  have  different  proper-
ties.   A hard link ensures that the target file will always
be accessible, even after its original  directory  entry  is
removed;  no  such  guarantee  exists  for  a symbolic link.
Unlike hard links, symbolic links can reference  directories
and  span  filesystem  boundaries.   An  _l_s_t_a_t (see section
2.2.4) call on a hard link will return the information about
the  file  (or  directory) that the link references while an
_l_s_t_a_t call on a symbolic link will return information  about
the  link  itself.   A symbolic link does not have an owner,
group, permissions, access and modification times, etc.  The
only  attributes  returned  from  an _l_s_t_a_t that refer to the
symbolic link itself are  the  file  type  (S_IFLNK),  size,
blocks, and link count (always 1).  The other attributes are
filled in from the directory that contains the link.

The following calls create  a  new  link,  named  _p_a_t_h_2,  to
_p_a_t_h_1:

     link(path1, path2);
     char *path1, *path2;


     symlink(path1, path2);
     char *path1, *path2;

The  _u_n_l_i_n_k  primitive  may be used to remove either type of
link.

If a file is a symbolic link, the ``value'' of the link  may
be read with the _r_e_a_d_l_i_n_k call:

     len = readlink(path, buf, bufsize);
     result int len; char *path; result char *buf; int bufsize;

This call returns, in _b_u_f, the string substituted into
pathnames passing through _p_a_t_h.  The string is not
null-terminated; its length is returned in _l_e_n.
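
Because the returned string carries no terminator, a caller
that wants a C string must add one.  A minimal sketch follows;
the helper name read_link_string is hypothetical.

```c
#include <sys/types.h>
#include <unistd.h>

/*
 * read_link_string (hypothetical helper): fetch the value of a
 * symbolic link as a C string.  Returns the length, or -1 on error.
 * A value longer than bufsize - 1 bytes is silently truncated.
 */
ssize_t
read_link_string(const char *path, char *buf, size_t bufsize)
{
    ssize_t len;

    if (bufsize == 0)
        return -1;
    len = readlink(path, buf, bufsize - 1);  /* room for '\0' */
    if (len == -1)
        return -1;
    buf[len] = '\0';    /* readlink does not terminate the string */
    return len;
}
```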

Atomic  renaming  of filesystem resident objects is possible
with the _r_e_n_a_m_e call:

     rename(oldname, newname);
     char *oldname, *newname;

where both _o_l_d_n_a_m_e and _n_e_w_n_a_m_e must be in the same  filesys-
tem.   If either _o_l_d_n_a_m_e or _n_e_w_n_a_m_e is a directory, then the
other also must be a directory for the  _r_e_n_a_m_e  to  succeed.
If _n_e_w_n_a_m_e exists and is a directory, then it must be empty.

22..22..33..55..  FFiillee,, ddeevviiccee,, aanndd ffiiffoo rreemmoovvaall












PSD:5-40                          4.4BSD Architecture Manual


A reference to a file, special device or fifo may be removed
with the _u_n_l_i_n_k call:

     unlink(path);
     char *path;

The  caller must have write access to the directory in which
the file is located for this call to  be  successful.   When
the  last  name for a file has been removed, the file may no
longer be opened; the file itself is removed once any exist-
ing references have been closed.

All current access to a file can be revoked using the call:

     revoke(path);
     char *path;

Subsequent operations on any descriptors open at the time of
the _r_e_v_o_k_e fail, with the exceptions that a _c_l_o_s_e call  will
succeed,  and  a _r_e_a_d from a character device file which has
been revoked returns a count of zero (end of file).  If  the
file  is  a  special  file  for  a device which is open, the
device close function is called as if all open references to
the  file had been closed.  _O_p_e_n's done after the _r_e_v_o_k_e may
succeed.  This call is most useful for revoking access to  a
terminal  line  after a hangup in preparation for reuse by a
new login session.  Access  to  a  controlling  terminal  is
automatically  revoked  when the session leader for the ses-
sion exits.

22..22..44..  RReeaaddiinngg aanndd mmooddiiffyyiinngg ffiillee aattttrriibbuutteess


Detailed information about the attributes of a file  may  be
obtained with the calls:

     stat(path, stb);
     char *path; result struct stat *stb;


     fstat(fd, stb);
     int fd; result struct stat *stb;

The  _s_t_a_t structure includes the file type, protection, own-
ership, access times, size, and a count of hard  links.   If
the  file  is  a  symbolic link, then the status of the link
itself (rather than the file the  link  references)  may  be
obtained using the _l_s_t_a_t call:

     lstat(path, stb);
     char *path; result struct stat *stb;


     Newly  created  files  are  assigned the user ID of the
process that created them and the group ID of the  directory
in  which they were created.  The ownership of a file may be
changed by either of the calls:

     chown(path, owner, group);
     char *path; uid_t owner; gid_t group;


     fchown(fd, owner, group);
     int fd; uid_t owner; gid_t group;


     In addition to ownership, each file has three levels of
access  protection  associated  with  it.   These levels are
owner relative, group relative, and other.   Each  level  of
access  has  separate  indicators for read permission, write
permission, and execute  permission.   The  protection  bits
associated with a file may be set by either of the calls:

     chmod(path, mode);
     char *path; mode_t mode;


     fchmod(fd, mode);
     int fd; mode_t mode;

where  _m_o_d_e  is a value indicating the new protection of the
file, as listed in section 2.2.3.2.

     Each file has a set of flags stored as a bit mask asso-
ciated with it.  These flags are returned in the _s_t_a_t struc-
ture and are set using the calls:

     chflags(path, flags);
     char *path; u_long flags;


     fchflags(fd, flags);
     int fd; u_long flags;

The flags specified are formed by or'ing the following  val-
ues:


     UF_NODUMP      Do not dump the file.
     UF_IMMUTABLE   The file may not be changed.
     UF_APPEND      The file may only be appended to.
     SF_IMMUTABLE   The file may not be changed.
     SF_APPEND      The file may only be appended to.


The  UF_NODUMP,  UF_IMMUTABLE and UF_APPEND flags may be set
or unset by either the owner of a file  or  the  super-user.
The  SF_IMMUTABLE  and  SF_APPEND  flags  may only be set or
unset by the super-user.  They may be set at any  time,  but
normally may only be unset when the system is in single-user
mode.

Finally, the access and modify times on a file may be set by
the call:

     utimes(path, tvp);
     char *path; struct timeval tvp[2];

This is particularly useful when moving files between media,
to preserve file access and modification times.

22..22..55..  CChheecckkiinngg aacccceessssiibbiilliittyy


     A process running with  different  real  and  effective
user-ids  may interrogate the accessibility of a file to the
real user by using the _a_c_c_e_s_s call:

     accessible = access(path, how);
     result int accessible; char *path; int how;

_H_o_w is constructed by OR'ing the following bits, defined  in
_<_u_n_i_s_t_d_._h_>:


     F_OK   /* file exists */
     X_OK   /* file is executable/searchable */
     W_OK   /* file is writable */
     R_OK   /* file is readable */


The  presence  or  absence of advisory locks does not affect
the result of _a_c_c_e_s_s.

     The _p_a_t_h_c_o_n_f and _f_p_a_t_h_c_o_n_f functions provide  a  method
for applications to determine the current value of a config-
urable system limit or option  variable  associated  with  a
pathname or file descriptor:

     ans = pathconf(path, name);
     result long ans; char *path; int name;


     ans = fpathconf(fd, name);
     result long ans; int fd, name;

For  _p_a_t_h_c_o_n_f,  the  _p_a_t_h  argument is the name of a file or
directory.  For _f_p_a_t_h_c_o_n_f, the _f_d argument is an  open  file
descriptor.  The _n_a_m_e argument specifies the system variable
to be queried.  Symbolic constants for each name  value  are
found in the include file _<_u_n_i_s_t_d_._h_>.

22..22..66..  EExxtteennssiioonn aanndd ttrruunnccaattiioonn


     Files  are created with zero length and may be extended
simply by writing or appending to them.   While  a  file  is
open the system maintains a pointer into the file indicating
the  current  location  in  the  file  associated  with  the
descriptor.   This pointer may be moved about in the file in
a random access fashion.  To set the current offset  into  a
file, the _l_s_e_e_k call may be used:

     newoffset = lseek(fd, offset, type);
     result off_t newoffset; int fd; off_t offset; int type;

where _t_y_p_e is defined by _<_u_n_i_s_t_d_._h_> as one of:


     SEEK_SET   /* set file offset to offset */
     SEEK_CUR   /* set file offset to current plus offset */
     SEEK_END   /* set file offset to EOF plus offset */


The call ``lseek(fd, 0, SEEK_CUR)'' returns the current off-
set into the file.

     Files may have ``holes'' in them.  Holes are  areas  in
the  linear  extent  of  the  file where data has never been
written.  These may be created by seeking to a location in a
file  past  the  current end-of-file and writing.  Holes are
treated by the system as zero valued bytes.

A file may be extended  or  truncated  with  either  of  the
calls:

     truncate(path, length);
     char *path; off_t length;


     ftruncate(fd, length);
     int fd; off_t length;

changing the size of the specified file to _l_e_n_g_t_h bytes.

     Unless  opened  with  the O_FSYNC flag, writes to files
are held for an indeterminate period of time in  the  system
buffer cache.  The call:

     fsync(fd);
     int fd;

ensures  that  the  contents of a file are committed to disk
before returning.  This feature is used by applications such
as  editors  that want to ensure the integrity of a new file
before continuing.

22..22..77..  LLoocckkiinngg


     The filesystem provides  basic  facilities  that  allow
cooperating  processes to synchronize their access to shared
files.  A process may place an advisory _r_e_a_d or  _w_r_i_t_e  lock
on  a  file,  so  that other cooperating processes may avoid
interfering with the process' access.  This simple mechanism
provides  locking with file granularity.  Byte range locking
is available with _f_c_n_t_l; see section 1.5.4.  The system does
not  force processes to obey the locks; they are of an advi-
sory nature only.

Locking can be done as part of the _o_p_e_n  call  (see  section
2.2.3.2)  or after an _o_p_e_n call by applying the _f_l_o_c_k primi-
tive:

     flock(fd, how);
     int fd, how;

where the _h_o_w parameter  is  formed  from  bits  defined  in
_<_f_c_n_t_l_._h_>:


     LOCK_SH   /* shared file lock */
     LOCK_EX   /* exclusive file lock */
     LOCK_NB   /* don't block when locking */
     LOCK_UN   /* unlock file */


Successive  lock  calls  may be used to increase or decrease
the level of locking.  If an object is currently  locked  by
another  process  when a _f_l_o_c_k call is made, the caller will
be blocked until the current lock owner releases  the  lock;
this  may be avoided by including LOCK_NB in the _h_o_w parame-
ter.  Specifying LOCK_UN removes all locks  associated  with
the  descriptor.  Advisory locks held by a process are auto-
matically deleted when the process terminates.
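
A non-blocking lock attempt can be sketched as follows.  The
helper name try_exclusive is hypothetical, and the sketch
includes <sys/file.h> on the assumption that many current
systems declare _f_l_o_c_k and the LOCK_ constants there.

```c
#include <errno.h>
#include <sys/file.h>

/*
 * try_exclusive (hypothetical helper): attempt an exclusive lock
 * without blocking.  Returns 1 if the lock was obtained, 0 if a
 * conflicting lock is already held, and -1 on any other error.
 */
int
try_exclusive(int fd)
{
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
        return 1;
    return (errno == EWOULDBLOCK) ? 0 : -1;
}
```

A caller that receives 0 can do other work and try again later
instead of sleeping until the current owner releases the lock.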

22..22..88..  DDiisskk qquuoottaass


     As an optional  facility,  each  local  filesystem  can
impose  limits on a user's or group's disk usage.  Two quan-
tities are limited: the total amount of disk space  which  a
user  or  group  may  allocate in a filesystem and the total
number of files a user or group may create in a  filesystem.
Quotas are expressed as _h_a_r_d limits and _s_o_f_t limits.  A hard
limit is always imposed; if a user or group would  exceed  a
hard  limit, the operation which caused the resource request
will fail.  A soft  limit  results  in  the  user  or  group
receiving a warning message, but with allocation succeeding.
Facilities are provided to turn soft limits into hard limits
if  a  user  or  group  has  exceeded  a  soft  limit for an
unreasonable period of time.

The _q_u_o_t_a_c_t_l call enables, disables and manipulates filesys-
tem quotas:

     quotactl(path, cmd, id, addr);
     char *path; int cmd; int id; char *addr;

A  quota  control command given by cmd operates on the given
filename path for the given  user  ID.  The  address  of  an
optional  command  specific  data  structure,  addr,  may be
given.  The supported commands include:


     Q_QUOTAON    /* enable quotas */
     Q_QUOTAOFF   /* disable quotas */
     Q_GETQUOTA   /* get limits and usage */
     Q_SETQUOTA   /* set limits and usage */
     Q_SETUSE     /* set usage */
     Q_SYNC       /* sync disk copy of a filesystem's quotas */



22..22..99..  RReemmoottee ffiilleessyysstteemmss


There are two system calls  intended  to  help  support  the
remote filesystem implementation.  The call:

     nfssvc(flags, argstructp);
     int flags; void *argstructp;

is  used by the NFS daemons to pass information into and out
of the kernel and also to enter the kernel as a server  dae-
mon.   The flags argument consists of several bits that show
what action is to be taken once in the kernel and _a_r_g_s_t_r_u_c_t_p
points  to  one  of three structures depending on which bits
are set in flags.

The call:

     getfh(path, fhp);
     char *path; result fhandle_t *fhp;

returns a file handle for the specified file or directory in
the  file  handle  pointed  to by fhp.  This file handle can
then be used in future calls to NFS to access the file with-
out  the need to repeat the pathname translation.  This sys-
tem call is restricted to the superuser.

22..22..1100..  OOtthheerr ffiilleessyysstteemmss












PSD:5-46                          4.4BSD Architecture Manual


The kernel supports many other filesystems.  These include:

+o    The log-structured filesystem.  Its disk layout differs
     from that of the fast filesystem and is optimized for
     writing rather than reading.  For further information
     see the mount_lfs(8) manual page.

+o    The ISO-standard 9660 filesystem with Rock Ridge exten-
     sions used for CD-ROMs.  For  further  information  see
     the mount_cd9660(8) manual page.

+o    The  file  descriptor  mapping filesystem.  For further
     information see the mount_fdesc(8) manual page.

+o    The /proc filesystem as an alternative  for  debuggers.
     For  further  information  see  section  2.5.1  and the
     mount_procfs(8) manual page.

+o    The memory-based filesystem, used primarily for fast
     but ephemeral uses such as /tmp.  For further
     information see the mount_mfs(8) manual page.

+o    The kernel variable filesystem, used as an  alternative
     to  _s_y_s_c_t_l.   For further information see section 1.7.1
     and the mount_kernfs(8) manual page.

+o    The portal filesystem, used to mount processes  in  the
     filesystem.  For further information see the mount_por-
     tal(8) manual page.

+o    The uid/gid remapping filesystem, usually layered above
     NFS  filesystems  exported to an outside administrative
     domain.  For further information see the  mount_umap(8)
     manual page.

+o    The union filesystem, used to place a writable filesys-
     tem above a read-only filesystem.  This  filesystem  is
     useful for compiling sources on a CD-ROM without having
     to copy the CD-ROM contents to writable disk.  For fur-
     ther information see the mount_union(8) manual page.

22..33..  IInntteerrpprroocceessss ccoommmmuunniiccaattiioonnss


22..33..11..  IInntteerrpprroocceessss ccoommmmuunniiccaattiioonn pprriimmiittiivveess


22..33..11..11..  CCoommmmuunniiccaattiioonn ddoommaaiinnss


     The system provides access to an extensible set of com-
munication _d_o_m_a_i_n_s.  A  communication  domain  (or  protocol
family)  is identified by a manifest constant defined in the
file _<_s_y_s_/_s_o_c_k_e_t_._h_>.  Important standard  domains  supported









4.4BSD Architecture Manual                          PSD:5-47


by  the  system are the local (``UNIX'') domain (PF_LOCAL or
PF_UNIX) for communication within the system,  the  ``Inter-
net'' domain (PF_INET) for communication in the DARPA Inter-
net, the ISO family of protocols (PF_ISO and  PF_CCITT)  for
providing  a  check-off box on the list of your system capa-
bilities, and the ``NS'' domain  (PF_NS)  for  communication
using  the  Xerox  Network Systems protocols.  Other domains
can be added to the system.

22..33..11..22..  SSoocckkeett ttyyppeess aanndd pprroottooccoollss


     Within a domain, communication takes place between com-
munication  endpoints known as _s_o_c_k_e_t_s.  Each socket has the
potential to exchange information with other sockets  of  an
appropriate type within the domain.

     Each socket has an associated abstract type, which
describes the semantics of communication using that socket.
Properties such as reliability, ordering, and prevention of
duplication of messages are determined by the type.  The
basic set of socket types is defined in <sys/socket.h>:


     Standard socket types
     --------------------------------------------------
     SOCK_DGRAM       /* datagram */
     SOCK_STREAM      /* virtual circuit */
     SOCK_RAW         /* raw socket */
     SOCK_RDM         /* reliably-delivered message */
     SOCK_SEQPACKET   /* sequenced packets */


The SOCK_DGRAM type models the semantics of datagrams in
network communication: messages may be lost or duplicated
and may arrive out of order.  A datagram socket may send
messages to and receive messages from multiple peers.  The
SOCK_RDM type models the semantics of reliable datagrams:
messages arrive unduplicated and in order, and the sender is
notified if messages are lost.  The send and receive
operations (described below) generate reliable or unreliable
datagrams.  The SOCK_STREAM type models connection-based
virtual circuits: two-way byte streams with no record
boundaries.  Connection setup is required before data
communication may begin.  The SOCK_SEQPACKET type models a
connection-based, full-duplex, reliable exchange that
preserves message boundaries; the sender is notified if
messages are lost, and messages are never duplicated or
presented out of order.  Users of the last two abstractions
may use the facilities for out-of-band transmission to send
out-of-band data.
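
The difference between datagram and stream semantics can be
observed directly.  The following C fragment is a minimal
sketch, not part of the system interface: the helper name
first_receive_size is hypothetical, and it assumes a
connected PF_LOCAL pair obtained with socketpair (section
2.3.1.5).  Two three-byte messages are sent before anything
is received.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send "abc" and then "def" across a connected PF_LOCAL pair of
 * the given type and return the byte count of the first receive,
 * or -1 on failure.  (Hypothetical helper, for illustration only.)
 */
int
first_receive_size(int type)
{
	char buf[16];
	ssize_t n = -1;
	int sv[2];

	assert(type == SOCK_DGRAM || type == SOCK_STREAM);
	if (socketpair(PF_LOCAL, type, 0, sv) < 0)
		return -1;
	if (send(sv[0], "abc", 3, 0) == 3 && send(sv[0], "def", 3, 0) == 3)
		n = recv(sv[1], buf, sizeof(buf), 0);
	close(sv[0]);
	close(sv[1]);
	return (int)n;
}
```

With SOCK_DGRAM the first receive yields exactly one
message.  With SOCK_STREAM there are no record boundaries,
so the first receive may yield anything from the first
message up to both messages run together.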

     SOCK_RAW is used for unprocessed access to internal
network layers and interfaces; it has no specific semantics.
Other socket types can be defined.

     Each socket may have a specific protocol associated
with it.  This protocol is used within the domain to provide
the semantics required by the socket type.  Not all socket
types are supported by each domain; support depends on the
existence and the implementation of a suitable protocol
within the domain.  For example, within the ``Internet''
domain, the SOCK_DGRAM type may be implemented by the UDP
user datagram protocol, and the SOCK_STREAM type may be
implemented by the TCP transmission control protocol, while
no standard protocols to provide SOCK_RDM or SOCK_SEQPACKET
sockets exist.

22..33..11..33..  SSoocckkeett ccrreeaattiioonn,, nnaammiinngg aanndd sseerrvviiccee eessttaabblliisshhmmeenntt


     Sockets may be _c_o_n_n_e_c_t_e_d  or  _u_n_c_o_n_n_e_c_t_e_d.   An  uncon-
nected socket descriptor is obtained by the _s_o_c_k_e_t call:

     s = socket(domain, type, protocol);
     result int s; int domain, type, protocol;

The socket domain and type are as described above, and are
specified using the definitions from <sys/socket.h>.  The
protocol may be given as 0, meaning any suitable protocol.
One of several possible protocols may be selected using
identifiers obtained from a library routine, getprotobyname.
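
As a minimal sketch of protocol selection, the fragment
below looks a protocol up by name and falls back to 0 when
no name is given or the name is unknown.  The helper
open_socket is hypothetical; only socket and getprotobyname
come from the system.

```c
#include <assert.h>
#include <netdb.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create an unconnected socket.  If proto_name is non-null and
 * known to getprotobyname(), its protocol number is used;
 * otherwise 0 is passed, meaning "any suitable protocol".
 * Returns a descriptor, or -1 with errno set by socket().
 */
int
open_socket(int domain, int type, const char *proto_name)
{
	struct protoent *pe = NULL;
	int protocol = 0;

	if (proto_name != NULL)
		pe = getprotobyname(proto_name);
	if (pe != NULL)
		protocol = pe->p_proto;
	return socket(domain, type, protocol);
}
```

A caller might write open_socket(PF_INET, SOCK_STREAM,
"tcp") to ask for TCP by name, or pass a null name to let
the domain choose.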

     An unconnected socket descriptor of a
connection-oriented type may yield a connected socket
descriptor in one of two ways: either by actively connecting
to another socket, or by becoming associated with a name in
the communications domain and accepting a connection from
another socket.  Datagram sockets need not establish
connections before use.

     To accept connections or to receive datagrams, a socket
must first have a binding to a name (or address) within the
communications domain.  Such a binding may be established by
a bind call:

     bind(s, name, namelen);
     int s; struct sockaddr *name; int namelen;

Datagram sockets may have default bindings established when
first sending data if not explicitly bound earlier.  In
either case, a socket's bound name may be retrieved with a
getsockname call:

     getsockname(s, name, namelen);
     int s; result struct sockaddr *name; result int *namelen;

while the peer's name can be retrieved with getpeername:

     getpeername(s, name, namelen);
     int s; result struct sockaddr *name; result int *namelen;

Domains may support sockets with several names.
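
The naming calls can be sketched as follows for the PF_LOCAL
domain, where names are filesystem paths.  This is an
illustration under stated assumptions rather than a
definitive implementation: bind_local and bound_path are
hypothetical helpers, struct sockaddr_un comes from
<sys/un.h>, and the length arguments use the modern
socklen_t type where the synopses above show int.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/*
 * Create a PF_LOCAL datagram socket and bind it to the filesystem
 * name `path'.  Returns the descriptor, or -1 on failure.
 */
int
bind_local(const char *path)
{
	struct sockaddr_un name;
	size_t len;
	int s;

	assert(path != NULL);
	len = strlen(path);
	if (len >= sizeof(name.sun_path))
		return -1;		/* name would not fit */
	memset(&name, 0, sizeof(name));
	name.sun_family = AF_LOCAL;
	memcpy(name.sun_path, path, len + 1);

	if ((s = socket(PF_LOCAL, SOCK_DGRAM, 0)) < 0)
		return -1;
	(void)unlink(path);		/* discard a stale name, if any */
	if (bind(s, (struct sockaddr *)&name, sizeof(name)) < 0) {
		close(s);
		return -1;
	}
	return s;
}

/*
 * Copy the name bound to socket `s' into buf.  Returns 0 on
 * success, -1 on failure.
 */
int
bound_path(int s, char *buf, size_t buflen)
{
	struct sockaddr_un name;
	socklen_t namelen = sizeof(name);	/* in: size of buffer */

	memset(&name, 0, sizeof(name));
	if (getsockname(s, (struct sockaddr *)&name, &namelen) < 0)
		return -1;			/* out: true length */
	if (name.sun_family != AF_LOCAL || buflen == 0)
		return -1;
	snprintf(buf, buflen, "%.*s", (int)sizeof(name.sun_path),
	    name.sun_path);
	return 0;
}
```

The name remains in the filesystem after the socket is
closed; a caller is expected to unlink it when finished.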

22..33..11..44..  AAcccceeppttiinngg ccoonnnneeccttiioonnss


Once a binding is made to a connection-oriented  socket,  it
is possible to _l_i_s_t_e_n for connections:

     listen(s, backlog);
     int s, backlog;

The backlog specifies the maximum count of connections that
can be simultaneously queued awaiting acceptance.

An accept call:

     t = accept(s, name, anamelen);
     result int t; int s; result struct sockaddr *name; result int *anamelen;

returns a descriptor for a new, connected socket from the
queue of pending connections on s.  If no new connections
are queued for acceptance, the call will wait for a
connection unless non-blocking I/O has been enabled (see
section 1.5.4).
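
The passive sequence of bind, listen and accept can be
sketched in a single process using the PF_LOCAL domain.  All
helper names below are hypothetical, the /tmp path is an
assumption, and the active side uses connect, described in
section 2.3.1.5.  The sketch relies on a local stream
connect completing as soon as the request is queued on the
listening socket.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill in a PF_LOCAL name; returns -1 if `path' does not fit. */
static int
local_name(struct sockaddr_un *name, const char *path)
{
	size_t len = strlen(path);

	if (len >= sizeof(name->sun_path))
		return -1;
	memset(name, 0, sizeof(*name));
	name->sun_family = AF_LOCAL;
	memcpy(name->sun_path, path, len + 1);
	return 0;
}

/* Passive side: create, bind and listen. */
int
local_listener(const char *path, int backlog)
{
	struct sockaddr_un name;
	int s;

	assert(path != NULL);
	if (local_name(&name, path) < 0)
		return -1;
	if ((s = socket(PF_LOCAL, SOCK_STREAM, 0)) < 0)
		return -1;
	(void)unlink(path);
	if (bind(s, (struct sockaddr *)&name, sizeof(name)) < 0 ||
	    listen(s, backlog) < 0) {
		close(s);
		return -1;
	}
	return s;
}

/* Active side: connect to the name the listener is bound to. */
int
local_connect(const char *path)
{
	struct sockaddr_un name;
	int s;

	if (local_name(&name, path) < 0)
		return -1;
	if ((s = socket(PF_LOCAL, SOCK_STREAM, 0)) < 0)
		return -1;
	if (connect(s, (struct sockaddr *)&name, sizeof(name)) < 0) {
		close(s);
		return -1;
	}
	return s;
}

/*
 * Queue one connection, accept it, and pass one byte across it.
 * Returns the byte read on the accepted socket, or -1 on failure.
 */
int
accept_demo(void)
{
	char path[64], got = 0;
	int l, client, t = -1, ok = 0;

	snprintf(path, sizeof(path), "/tmp/psd5_accept_%ld", (long)getpid());
	if ((l = local_listener(path, 5)) < 0)
		return -1;
	/* The request waits in the listen queue ... */
	if ((client = local_connect(path)) >= 0) {
		/* ... until accept returns a new, connected descriptor. */
		t = accept(l, NULL, NULL);
		ok = t >= 0 && write(client, "x", 1) == 1 &&
		    read(t, &got, 1) == 1;
		close(client);
	}
	if (t >= 0)
		close(t);
	close(l);
	unlink(path);
	return ok ? got : -1;
}
```

The listening descriptor l is unaffected by accept and can
be used to accept further connections.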

22..33..11..55..  MMaakkiinngg ccoonnnneeccttiioonnss


An active connection to a named socket is made by  the  _c_o_n_-
_n_e_c_t call:

     connect(s, name, namelen);
     int s; struct sockaddr *name; int namelen;

Although datagram sockets do not establish connections, the
connect call may be used with such sockets to create an
association with the foreign address.  The address is
recorded for use in future send calls, which then need not
supply destination addresses.  Datagrams will be received
only from that peer, and asynchronous error reports may be
received.
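
A datagram association can be sketched with two bound
PF_LOCAL sockets.  The helper names and the /tmp paths are
assumptions made for illustration; send and recvfrom are
described in section 2.3.1.6.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Create a PF_LOCAL datagram socket bound to `path'. */
static int
dgram_bound(const char *path, struct sockaddr_un *name)
{
	size_t len = strlen(path);
	int s;

	if (len >= sizeof(name->sun_path))
		return -1;
	memset(name, 0, sizeof(*name));
	name->sun_family = AF_LOCAL;
	memcpy(name->sun_path, path, len + 1);
	(void)unlink(path);
	if ((s = socket(PF_LOCAL, SOCK_DGRAM, 0)) < 0)
		return -1;
	if (bind(s, (struct sockaddr *)name, sizeof(*name)) < 0) {
		close(s);
		return -1;
	}
	return s;
}

/*
 * Associate socket a with socket b, then send without naming a
 * destination.  Returns 1 if b receives the datagram and sees
 * a's name as its source, 0 otherwise.
 */
int
dgram_association_demo(void)
{
	struct sockaddr_un an, bn, from;
	socklen_t fromlen = sizeof(from);
	char apath[64], bpath[64], buf[16];
	int a, b, ok = 0;

	snprintf(apath, sizeof(apath), "/tmp/psd5_a_%ld", (long)getpid());
	snprintf(bpath, sizeof(bpath), "/tmp/psd5_b_%ld", (long)getpid());
	a = dgram_bound(apath, &an);
	b = dgram_bound(bpath, &bn);
	memset(&from, 0, sizeof(from));

	if (a >= 0 && b >= 0 &&
	    connect(a, (struct sockaddr *)&bn, sizeof(bn)) == 0 &&
	    send(a, "ping", 4, 0) == 4 &&	/* no destination given */
	    recvfrom(b, buf, sizeof(buf), 0,
		(struct sockaddr *)&from, &fromlen) == 4)
		ok = memcmp(buf, "ping", 4) == 0 &&
		    strcmp(from.sun_path, apath) == 0;

	assert(fromlen <= sizeof(from));
	if (a >= 0)
		close(a);
	if (b >= 0)
		close(b);
	unlink(apath);
	unlink(bpath);
	return ok;
}
```

Socket b is left unassociated here, so it could still
receive from other senders; only a is restricted to its
peer.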

     It is also possible to create connected pairs of
sockets without using the domain's name space to rendezvous;
this is done with the socketpair call[+]:

     socketpair(domain, type, protocol, sv);
     int domain, type, protocol; result int sv[2];

Here the returned sv descriptors correspond to those
obtained with accept and connect.

-----------
[+]  4.4BSD supports socketpair creation only in the
PF_LOCAL communication domain.

The call:

     pipe(pv);
     result int pv[2];

creates a pair of SOCK_STREAM sockets in the PF_LOCAL
domain, with pv[0] only readable and pv[1] only writable.
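
The socketpair call described above can be exercised with a
short round trip.  This is a minimal sketch; the helper name
socketpair_roundtrip is hypothetical.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Send `msg' in each direction across a connected PF_LOCAL pair.
 * Returns 1 if both copies arrive intact, 0 otherwise.
 */
int
socketpair_roundtrip(const char *msg)
{
	char buf[64];
	size_t len;
	int sv[2], ok;

	assert(msg != NULL);
	len = strlen(msg);
	if (len == 0 || len > sizeof(buf))
		return 0;
	if (socketpair(PF_LOCAL, SOCK_STREAM, 0, sv) < 0)
		return 0;
	/* Either descriptor may be written; the other one reads. */
	ok = write(sv[0], msg, len) == (ssize_t)len &&
	    read(sv[1], buf, sizeof(buf)) == (ssize_t)len &&
	    memcmp(buf, msg, len) == 0 &&
	    write(sv[1], msg, len) == (ssize_t)len &&
	    read(sv[0], buf, sizeof(buf)) == (ssize_t)len &&
	    memcmp(buf, msg, len) == 0;
	close(sv[0]);
	close(sv[1]);
	return ok;
}
```

Both descriptors of a socket pair are full-duplex, which is
what the two-way exchange above demonstrates.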

22..33..11..66..  SSeennddiinngg aanndd rreecceeiivviinngg ddaattaa


Messages may be sent from a socket by:

     cc = sendto(s, msg, len, flags, to, tolen);
     result int cc; int s; void *msg; size_t len;
     int flags; struct sockaddr *to; int tolen;

if the socket is not connected or:

     cc = send(s, msg, len, flags);
     result int cc; int s; void *msg; size_t len; int flags;

if the socket is connected.  The corresponding receive prim-
itives are:

     msglen = recvfrom(s, buf, len, flags, from, fromlenaddr);
     result int msglen; int s; result void *buf; size_t len; int flags;
     result struct sockaddr *from; result int *fromlenaddr;

and:

     msglen = recv(s, buf, len, flags);
     result int msglen; int s; result void *buf; size_t len; int flags;


     In the unconnected case, the parameters to and tolen
specify the destination of the message, while the from
parameter stores the source of the message, and *fromlenaddr
initially gives the size of the from buffer and is updated
to reflect the true length of the from address.

     All calls cause the message to be received in or sent
from the message buffer of length len bytes, starting at
address buf (msg for the send calls).  The flags specify
peeking at a message without reading it, sending or
receiving high-priority out-of-band messages, or other
special requests as follows:

     MSG_OOB         /* process out-of-band data */
     MSG_PEEK        /* peek at incoming message */
     MSG_DONTROUTE   /* send without using routing tables */
     MSG_EOR         /* data completes record */
     MSG_TRUNC       /* data discarded before delivery */
     MSG_CTRUNC      /* control data lost before delivery */
     MSG_WAITALL     /* wait for full request or error */
     MSG_DONTWAIT    /* this message should be nonblocking */
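
Two of these flags can be illustrated with a short sketch.
The helper name peek_then_read is hypothetical, and a
connected PF_LOCAL datagram pair from socketpair is assumed
so that no addresses are needed.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Queue one datagram, look at it with MSG_PEEK, read it, and then
 * confirm with MSG_DONTWAIT that nothing further is queued.
 * Returns 1 if every step behaves as described, 0 otherwise.
 */
int
peek_then_read(const char *msg)
{
	char peeked[64], taken[64], extra;
	size_t len;
	ssize_t n1, n2, n3;
	int sv[2], ok = 0;

	assert(msg != NULL);
	len = strlen(msg);
	if (len == 0 || len > sizeof(peeked))
		return 0;
	if (socketpair(PF_LOCAL, SOCK_DGRAM, 0, sv) < 0)
		return 0;
	if (send(sv[0], msg, len, 0) == (ssize_t)len) {
		/* MSG_PEEK leaves the message queued on the socket. */
		n1 = recv(sv[1], peeked, sizeof(peeked), MSG_PEEK);
		/* A plain receive then consumes the same message. */
		n2 = recv(sv[1], taken, sizeof(taken), 0);
		/* The queue is now empty; do not wait for more. */
		n3 = recv(sv[1], &extra, 1, MSG_DONTWAIT);
		ok = n3 < 0 && (errno == EAGAIN || errno == EWOULDBLOCK) &&
		    n1 == (ssize_t)len && n2 == (ssize_t)len &&
		    memcmp(peeked, msg, len) == 0 &&
		    memcmp(taken, msg, len) == 0;
	}
	close(sv[0]);
	close(sv[1]);
	return ok;
}
```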



22..33..11..77..  SSccaatttteerr//ggaatthheerr aanndd eexxcchhaannggiinngg aacccceessss rriigghhttss


     It  is  possible  to  scatter  and  gather  data and to
exchange access rights with messages.  When either of  these
operations is involved, the number of parameters to the call
becomes large.  Thus, the system defines  a  message  header
structure,  in  _<_s_y_s_/_s_o_c_k_e_t_._h_>,  which can be used to conve-
niently contain the parameters to the calls:


     struct msghdr {
          caddr_t   msg_name;         /* optional address */
          u_int     msg_namelen;      /* size of address */
          struct    iovec *msg_iov;   /* scatter/gather array */
          u_int     msg_iovlen;       /* # elements in msg_iov */
          caddr_t   msg_control;      /* ancillary data */
          u_int     msg_controllen;   /* ancillary data buffer len */
          int       msg_flags;        /* flags on received message */
     };


Here msg_name and msg_namelen specify the source or
destination address if the socket is unconnected; msg_name
may be given as a null pointer if no names are desired or
required.  The msg_iov and msg_iovlen describe the
scatter/gather locations, as described in section 2.1.1.
The data in the msg_control buffer is composed of an array
of variable-length messages used for additional information
with or about a datagram not expressible by flags.  The
format is a sequence of message elements headed by cmsghdr
structures:


     struct cmsghdr {
          u_int    cmsg_len;      /* data byte count, including hdr */
          int      cmsg_level;    /* originating protocol */
          int      cmsg_type;     /* protocol-specific type */
          u_char   cmsg_data[];   /* variable length type specific data */
     };


The following macros are provided for use with the
msg_control buffer:

     CMSG_FIRSTHDR(mhdr)       /* given msghdr, return first cmsghdr */
     CMSG_NXTHDR(mhdr, cmsg)   /* given msghdr and cmsghdr, return next cmsghdr */
     CMSG_DATA(cmsg)           /* given cmsghdr, return associated data pointer */


Access rights to be sent along with the message are
specified in one of these cmsghdr structures, with level
SOL_SOCKET and type SCM_RIGHTS.  In the PF_LOCAL domain
these are an array of integer descriptors, copied from the
sending process and duplicated in the receiver.

The msghdr structure is used in the operations sendmsg and
recvmsg:

     sendmsg(s, msg, flags);
     int s; struct msghdr *msg; int flags;


     msglen = recvmsg(s, msg, flags);
     result int msglen; int s; result struct msghdr *msg; int flags;
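
Passing a single descriptor as access rights might look like
the following sketch.  The helpers send_fd and recv_fd are
hypothetical.  The CMSG_SPACE and CMSG_LEN macros used to
size the control buffer are not among the macros listed
above; they are assumed to be available, as they are on
current POSIX systems.  One byte of ordinary data
accompanies the rights.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send descriptor `fd' over PF_LOCAL socket `s'.  Returns 0 or -1. */
int
send_fd(int s, int fd)
{
	union {
		struct cmsghdr hdr;		/* forces alignment */
		char buf[CMSG_SPACE(sizeof(int))];
	} control;
	char byte = 0;
	struct iovec iov = { &byte, 1 };	/* one byte of data */
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	memset(&control, 0, sizeof(control));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control.buf;
	msg.msg_controllen = sizeof(control.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	assert(cmsg != NULL);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(s, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor from socket `s'.  Returns it, or -1. */
int
recv_fd(int s)
{
	union {
		struct cmsghdr hdr;
		char buf[CMSG_SPACE(sizeof(int))];
	} control;
	char byte;
	struct iovec iov = { &byte, 1 };
	struct msghdr msg;
	struct cmsghdr *cmsg;
	int fd = -1;

	memset(&msg, 0, sizeof(msg));
	memset(&control, 0, sizeof(control));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control.buf;
	msg.msg_controllen = sizeof(control.buf);

	if (recvmsg(s, &msg, 0) != 1)
		return -1;
	/* Walk the control messages looking for the rights. */
	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg != NULL;
	    cmsg = CMSG_NXTHDR(&msg, cmsg)) {
		if (cmsg->cmsg_level == SOL_SOCKET &&
		    cmsg->cmsg_type == SCM_RIGHTS &&
		    cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
			memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	}
	return fd;
}
```

The descriptor returned by recv_fd is a new descriptor in
the receiver that refers to the same open object as the one
sent; its numeric value generally differs.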


22..33..11..88..  UUssiinngg rreeaadd aanndd wwrriittee wwiitthh ssoocckkeettss


     The normal _r_e_a_d and _w_r_i_t_e calls may be applied to  con-
nected  sockets  and  translated into _s_e_n_d and _r_e_c_e_i_v_e calls
from or to a single area of memory and discarding any rights
received.   A  process  may  operate  on  a  virtual circuit
socket, a terminal or a file with blocking  or  non-blocking
input/output  operations without distinguishing the descrip-
tor type.
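
Because read and write do not distinguish descriptor types,
one pair of routines can serve sockets, pipes, terminals and
files alike.  The helpers below are a hypothetical sketch of
such routines for blocking descriptors; they loop because a
single call may transfer fewer bytes than requested.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Write all `len' bytes of `buf' to descriptor `fd'.
 * Returns 0 on success, -1 on error.
 */
int
write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	ssize_t n;

	assert(buf != NULL);
	while (len > 0) {
		n = write(fd, p, len);
		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted; retry */
			return -1;
		}
		p += n;
		len -= (size_t)n;
	}
	return 0;
}

/*
 * Read exactly `len' bytes from `fd' into `buf'.  Returns 0 on
 * success, or -1 on error or if end-of-file arrives first.
 */
int
read_all(int fd, void *buf, size_t len)
{
	char *p = buf;
	ssize_t n;

	assert(buf != NULL);
	while (len > 0) {
		n = read(fd, p, len);
		if (n < 0 && errno == EINTR)
			continue;
		if (n <= 0)
			return -1;
		p += n;
		len -= (size_t)n;
	}
	return 0;
}
```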

22..33..11..99..  SShhuuttttiinngg ddoowwnn hhaallvveess ooff ffuullll--dduupplleexx ccoonnnneeccttiioonnss


     A process that has a full-duplex socket such as a  vir-
tual  circuit  and no longer wishes to read from or write to
this socket can give the call:

     shutdown(s, direction);
     int s, direction;

where direction is 0 to not read further, 1 to not write
further, or 2 to completely shut the connection down.  If
the underlying protocol supports unidirectional or
bidirectional shutdown, this indication will be passed to
the peer.  For example, a shutdown for writing might produce
an end-of-file condition at the remote end.
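
A half-close can be sketched on a connected PF_LOCAL stream
pair.  The helper name half_close_demo is hypothetical, and
the symbolic names SHUT_RD, SHUT_WR and SHUT_RDWR are
assumed to be defined in <sys/socket.h> with the values 0, 1
and 2 given above.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Shut down the writing half of one end of a connected pair and
 * observe the effect at the other end.  Returns 1 if the peer
 * sees end-of-file yet can still send data back, 0 otherwise.
 */
int
half_close_demo(void)
{
	char c = 0;
	int sv[2], ok;

	/* The symbolic names match the numeric directions. */
	assert(SHUT_RD == 0 && SHUT_WR == 1 && SHUT_RDWR == 2);

	if (socketpair(PF_LOCAL, SOCK_STREAM, 0, sv) < 0)
		return 0;
	ok = write(sv[0], "a", 1) == 1 &&
	    shutdown(sv[0], SHUT_WR) == 0 &&	/* write no further */
	    read(sv[1], &c, 1) == 1 && c == 'a' &&	/* queued data */
	    read(sv[1], &c, 1) == 0 &&		/* then end-of-file */
	    write(sv[1], "b", 1) == 1 &&	/* other half still open */
	    read(sv[0], &c, 1) == 1 && c == 'b';
	close(sv[0]);
	close(sv[1]);
	return ok;
}
```

Data written before the shutdown is still delivered; the
end-of-file indication follows it.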

22..33..11..1100..  SSoocckkeett aanndd pprroottooccooll ooppttiioonnss












4.4BSD Architecture Manual                          PSD:5-53


     Sockets,  and their underlying communication protocols,
may support _o_p_t_i_o_n_s.  These options may be used  to  manipu-
late  implementation-  or protocol-specific facilities.  The
_g_e_t_s_o_c_k_o_p_t and _s_e_t_s_o_c_k_o_p_t calls are used to control options:

     getsockopt(s, level, optname, optval, optlen);
     int s, level, optname; result void *optval; result int *optlen;


     setsockopt(s, level, optname, optval, optlen);
     int s, level, optname; void *optval; int optlen;

The option optname is interpreted at the indicated protocol
level for socket s.  If a value is specified with optval and
optlen, it is interpreted by the software operating at the
specified level.  The level SOL_SOCKET is reserved to
indicate options maintained by the socket facilities.  Other
level values indicate a particular protocol which is to act
on the option request; these values are normally interpreted
as a ``protocol number'' within the protocol family.
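
The following sketch shows both calls at the SOL_SOCKET
level.  The helper names are hypothetical; SO_TYPE and
SO_REUSEADDR are assumed to be available as socket-level
options, and the length arguments use socklen_t where the
synopses above show int.

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return the type of socket `s' (SOCK_STREAM, ...), or -1. */
int
socket_type(int s)
{
	int type = -1;
	socklen_t len = sizeof(type);	/* in: buffer size; out: used */

	if (getsockopt(s, SOL_SOCKET, SO_TYPE, &type, &len) < 0)
		return -1;
	assert(len == sizeof(type));
	return type;
}

/* Turn the boolean socket-level option `optname' on or off. */
int
set_flag(int s, int optname, int on)
{
	return setsockopt(s, SOL_SOCKET, optname, &on, sizeof(on));
}

/* Return 1 if `optname' is set, 0 if clear, -1 on error. */
int
get_flag(int s, int optname)
{
	int value = 0;
	socklen_t len = sizeof(value);

	if (getsockopt(s, SOL_SOCKET, optname, &value, &len) < 0)
		return -1;
	return value != 0;
}
```

Options at other levels are handled the same way, with the
protocol number (for example one obtained from
getprotobyname) given as the level.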

22..33..22..  PPFF__LLOOCCAALL ddoommaaiinn


     This section describes briefly the  properties  of  the
PF_LOCAL (``UNIX'') communications domain.

22..33..22..11..  TTyyppeess ooff ssoocckkeettss


     In  the  local domain, the SOCK_STREAM abstraction pro-
vides pipe-like facilities, while SOCK_DGRAM provides  (usu-
ally) reliable message-style communications.

22..33..22..22..  NNaammiinngg


     Socket names are strings and may appear in the filesys-
tem name space.

22..33..22..33..  AAcccceessss rriigghhttss ttrraannssmmiissssiioonn


     The ability to pass descriptors with messages  in  this
domain  allows  migration  of  service within the system and
allows user processes to be used in building system  facili-
ties.

22..33..33..  IINNTTEERRNNEETT ddoommaaiinn


     This  section describes briefly how the Internet domain
is mapped to the model  described  in  this  section.   More
information  will  be  found  in the document describing the









PSD:5-54                          4.4BSD Architecture Manual


network implementation in 4.4BSD (SMM:18).

22..33..33..11..  SSoocckkeett ttyyppeess aanndd pprroottooccoollss


     SOCK_STREAM is supported by the Internet TCP  protocol;
SOCK_DGRAM  by  the  UDP protocol.  Each is layered atop the
transport-level Internet Protocol (IP).  The  Internet  Con-
trol  Message  Protocol is implemented atop/beside IP and is
accessible via a raw  socket.   The  SOCK_SEQPACKET  has  no
direct  Internet  family  analogue;  a protocol based on one
from the XEROX NS family and layered on top of IP  could  be
implemented to fill this gap.

22..33..33..22..  SSoocckkeett nnaammiinngg


     Sockets in the Internet domain have names composed of a
32-bit Internet address and a 16-bit port  number.   Options
may  be  used  to  provide  IP  source  routing  or security
options.  The 32-bit address is composed of network and host
parts; the network part is variable in size and is frequency
encoded.  The host part may optionally be interpreted  as  a
subnet field plus the host on the subnet; this is enabled by
setting a network address mask at boot time.
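
The composition of an Internet name can be sketched as
follows.  The helper names are hypothetical.  The sketch
assumes that ``frequency encoded'' refers to the historical
class A, B and C scheme, in which the leading bits of the
address select an 8-, 16- or 24-bit network part; addresses
are given in host byte order and converted when stored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Build an Internet-domain socket name from host-order values. */
struct sockaddr_in
inet_name(uint32_t host_addr, uint16_t port)
{
	struct sockaddr_in sin;

	memset(&sin, 0, sizeof(sin));
	sin.sin_family = AF_INET;
	sin.sin_addr.s_addr = htonl(host_addr);	/* 32-bit address */
	sin.sin_port = htons(port);		/* 16-bit port */
	return sin;
}

/*
 * Width in bits of the network part under the assumed class
 * encoding.  Returns 0 for addresses outside classes A, B and C.
 */
int
network_bits(uint32_t host_addr)
{
	if ((host_addr & 0x80000000u) == 0)
		return 8;			/* class A: 0... */
	if ((host_addr & 0xc0000000u) == 0x80000000u)
		return 16;			/* class B: 10... */
	if ((host_addr & 0xe0000000u) == 0xc0000000u)
		return 24;			/* class C: 110... */
	return 0;
}

/* Host part left after masking off network and subnet bits. */
uint32_t
host_part(uint32_t host_addr, uint32_t netmask)
{
	assert(netmask != 0);
	return host_addr & ~netmask;
}
```

For example, 128.32.0.1 (0x80200001) begins with the bits 10
and so has a 16-bit network part under this scheme; with a
network address mask of 0xffffff00 the third byte is treated
as a subnet field and only the last byte names the host.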

22..33..33..33..  AAcccceessss rriigghhttss ttrraannssmmiissssiioonn


     No access rights transmission facilities  are  provided
in the Internet domain.

22..33..33..44..  RRaaww aacccceessss


     The Internet domain allows the super-user access to the
raw facilities of  IP.   These  interfaces  are  modeled  as
SOCK_RAW sockets.  Each raw socket is associated with one IP
protocol number, and receives all traffic received for  that
protocol.  This approach allows administrative and debugging
functions to occur, and enables  user-level  implementations
of  special-purpose  protocols such as inter-gateway routing
protocols.

22..44..  TTeerrmmiinnaallss aanndd DDeevviicceess


22..44..11..  TTeerrmmiinnaallss


     Terminals support _r_e_a_d and  _w_r_i_t_e  I/O  operations,  as
well  as a collection of terminal specific _i_o_c_t_l operations,
to control input character interpretation and  editing,  and
output format and delays.









4.4BSD Architecture Manual                          PSD:5-55


     A terminal may be used as a controlling terminal (login
terminal) for a login session.  A  controlling  terminal  is
associated  with  a session (see section 1.1.4).  A control-
ling terminal has a foreground process group, which must  be
a  member  of the session with which the terminal is associ-
ated (see section 1.1.5).  Members of the foreground process
group are allowed to read from and write to the terminal and
change the terminal settings; other process groups from  the
session may be stopped upon attempts to do these operations.

     A session leader allocates a terminal as the
controlling terminal for its session using the ioctl:

     ioctl(fd, TIOCSCTTY, NULL);
     int fd;

Only a session leader may acquire a controlling terminal.

22..44..11..11..  TTeerrmmiinnaall iinnppuutt


     Terminals  are handled according to the underlying com-
munication characteristics such as baud  rate  and  required
delays,  and a set of software parameters.  These parameters
are described in the _t_e_r_m_i_o_s  structure  maintained  by  the
kernel for each terminal line:


     struct termios {
          tcflag_t   c_iflag;      /* input flags */
          tcflag_t   c_oflag;      /* output flags */
          tcflag_t   c_cflag;      /* control flags */
          tcflag_t   c_lflag;      /* local flags */
          cc_t       c_cc[NCCS];   /* control chars */
          long       c_ispeed;     /* input speed */
          long       c_ospeed;     /* output speed */
     };


The termios structure is set and retrieved using the
tcsetattr and tcgetattr functions.

     Two general kinds of input processing are available,
determined by whether the terminal device file is in
canonical mode or noncanonical mode.  Additionally, input
characters are processed according to the c_iflag and
c_lflag fields.  Such processing can include echoing, which
in general means transmitting input characters immediately
back to the terminal when they are received from the
terminal.  Non-graphic ASCII input characters may be echoed
as a two-character printable representation, ``^character.''

     In canonical mode input processing, terminal input is
processed in units of lines.  A line is delimited by a
newline character (NL), an end-of-file (EOF) character, or
an end-of-line (EOL) character.  Input is presented on a
line-by-line basis.  Using this mode means that a read
request will not return until an entire line has been typed,
or a signal has been received.  Also, no matter how many
bytes are requested in the read call, at most one line is
returned.  It is not, however, necessary to read a whole
line at once; any number of bytes, even one, may be
requested in a read without losing information.

     When the terminal is in canonical mode, editing of an
input line is performed.  Editing facilities allow deletion
of the previous character or word, or deletion of the
current input line.  In addition, a special character may be
used to reprint the current input line.  Certain other
characters are also interpreted specially.  Flow control is
provided by the stop output and start output control
characters.  Output may be flushed with the flush output
character; and the literal character may be used to force
the following character into the input line, regardless of
any special meaning it may have.

     In noncanonical mode input processing, input bytes  are
not assembled into lines, and erase and kill processing does
not occur.  All input  is  passed  through  to  the  reading
process immediately and without interpretation.  Signals and
flow control may be enabled;  here  the  handler  interprets
input  only  by looking for characters that cause interrupts
or output flow control; all other characters are made avail-
able.
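
Switching a terminal between the two modes is done by
editing its termios structure.  The following is a minimal
sketch with hypothetical helper names.  The ICANON and ECHO
flags and the VMIN and VTIME entries of c_cc are taken from
the POSIX termios interface and are not described in this
document; TCSAFLUSH is likewise an assumption.

```c
#include <assert.h>
#include <termios.h>
#include <unistd.h>

/*
 * Edit a termios structure for noncanonical input: no line
 * assembly, no echo, and a read that returns as soon as one
 * byte is available.  ISIG is left alone, so characters that
 * cause interrupts are still interpreted.
 */
void
make_noncanonical(struct termios *t)
{
	assert(t != NULL);
	t->c_lflag &= ~(tcflag_t)(ICANON | ECHO);
	t->c_cc[VMIN] = 1;	/* at least one byte per read */
	t->c_cc[VTIME] = 0;	/* no inter-byte timer */
}

/*
 * Apply the change to terminal `fd', saving the previous
 * settings in `saved' so that they can be restored later with
 * tcsetattr.  Returns 0 on success, or -1 if `fd' is not a
 * terminal or a call fails.
 */
int
set_noncanonical(int fd, struct termios *saved)
{
	struct termios t;

	if (!isatty(fd) || tcgetattr(fd, saved) < 0)
		return -1;
	t = *saved;
	make_noncanonical(&t);
	return tcsetattr(fd, TCSAFLUSH, &t);
}
```

A program would normally restore the saved settings before
exiting, since the settings belong to the terminal line and
outlive the process.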

     When  interrupt characters are being interpreted by the
terminal handler they cause a software interrupt to be  sent
to  all  processes  in the process group associated with the
terminal.  Interrupt characters exist  to  send  SIGINT  and
SIGQUIT  signals, and to stop a process group with the SIGT-
STP signal either immediately, or when all input up  to  the
stop character has been read.

22..44..11..22..  TTeerrmmiinnaall oouuttppuutt


     On  output,  the  terminal handler provides some simple
formatting services.  These include converting the  carriage
return   character  to  the  two  character  return-linefeed
sequence, inserting delays after  certain  standard  control
characters, and expanding tabs.

22..44..22..  SSttrruuccttuurreedd ddeevviicceess


     Structured  devices  are typified by disks and magnetic
tapes, but may represent any random-access device.  The sys-
tem  performs  read-modify-write  type  buffering actions on
block devices to allow them to be read and written in random









4.4BSD Architecture Manual                          PSD:5-57


access  fashion  like  ordinary files.  Filesystems are nor-
mally mounted on block devices.

22..44..33..  UUnnssttrruuccttuurreedd ddeevviicceess


     Unstructured devices are those  devices  which  do  not
support  block structure.  Familiar unstructured devices are
raw communications lines (with no terminal handler),  raster
plotters,  magnetic  tape  and disks unfettered by buffering
and permitting large block input/output and positioning  and
formatting commands.

22..55..  PPrroocceessss ddeebbuuggggiinngg


22..55..11..  TTrraaddiittiioonnaall ddeebbuuggggiinngg


Debuggers traditionally use the _p_t_r_a_c_e interface:

     ptrace(request, pid, addr, data);
     int request, pid, *addr, data;

This interface provides a means by which a parent process
may control the execution of a child process, and examine
and change its core image.  Its primary use is for the
implementation of breakpoint debugging.  There are four
arguments; the interpretation of the other three depends on
the request argument.  A process being traced behaves
normally until it encounters a signal (whether internally
generated like ``illegal instruction'' or externally
generated like ``interrupt'').  Then the traced process
enters a stopped state and its parent is notified via wait.
When the child is in the stopped state, its core image can
be examined and modified using ptrace.  Another ptrace
request can then cause the child either to terminate or to
continue, possibly ignoring the signal.

     A more general interface is also provided in 4.4BSD;
the process filesystem, mounted with mount_procfs, attaches
an instance of the process name space to the global
filesystem name space.  The conventional mount point is
/proc.  The root of the process filesystem contains an entry
for each active process.  These processes are visible as
directories named by the process ID.  In addition, the
special entry curproc references the current process.  Each
directory contains several files, including a ctl file.  The
debugger finds (or creates) the process that it wants to
debug and then issues an attach command via the ctl file.
Further interaction can then be done with the process
through the other files provided by the /proc filesystem.

2.5.2.  Kernel tracing


Another facility for debugging programs is provided by the
ktrace interface:

     ktrace(tracefile, ops, trpoints, pid);
     char *tracefile; int ops, trpoints, pid;

_K_t_r_a_c_e  does  kernel  trace  logging  for the specified pro-
cesses.  The kernel operations that are traced include  sys-
tem  calls,  pathname  translations,  signal processing, and
I/O.  This facility can be particularly useful to debug pro-
grams for which you do not have the source.

















































4.4BSD Architecture Manual                          PSD:5-59


33..  SSuummmmaarryy ooff ffaacciilliittiieess


1    KKeerrnneell pprriimmiittiivveess
1.1  PPrroocceesssseess aanndd pprrootteeccttiioonn
       sethostid     set host identifier
       gethostid     get host identifier
       sethostname   set host name
       gethostname   get host name
       getpid        get process identifier
       getppid       get parent process identifier
       fork          create a new process
       vfork         create a new process
       exit          terminate a process
       wait4         collect exit status of child
       execve        execute a new program
       getuid        get real user identifier
       geteuid       get effective user identifier
       getgid        get real group identifier
       getegid       get effective group identifier
       getgroups     get access group set
       setuid        set  real,  effective,  and  saved user
identifiers
       setgid        set real, effective,  and  saved  group
identifiers
       setgroups     set access group set
       seteuid       set effective user identifier
       setegid       set effective group identifier
       setsid        create a new session
       setlogin      set login name
       getlogin      get login name
       getpgrp       get process group
       setpgid       set process group
1.2  MMeemmoorryy mmaannaaggeemmeenntt
       brk           set data section size
       sbrk          change data section size
       getpagesize   get system page size
       mmap          map files or devices into memory
       msync         synchronize a mapped region
       munmap        remove a mapping
       mprotect      control the protection of pages
       madvise       give advise about use of memory
       mincore       get advise about use of memory
       mlock         lock physical pages in memory
       munlock       unlock physical pages in memory
       mset          acquire and set a semaphore
       mclear        release  a semaphore and awaken waiting
processes
       msleep        wait for a semaphore
       mwakeup       awaken process(es) sleeping on a  sema-
phore
1.3  SSiiggnnaallss
       sigaction     setup software signal handler
       sigreturn     return from a signal









PSD:5-60                          4.4BSD Architecture Manual


       kill          send signal to a process
       killpg        send signal to a process group
       sigprocmask   manipulate current signal mask
       sigsuspend    atomically release blocked signals and
                     wait for interrupt
       sigpending    get pending signals
       sigaltstack   set and/or get signal stack context
1.4  Timers
       settimeofday  set date and time
       gettimeofday  get date and time
       adjtime       synchronization of the system clock
       setitimer     set value of interval timer
       getitimer     get value of interval timer
       profil        control process profiling
1.5  Descriptors
       getdtablesize get descriptor table size
       dup           duplicate an existing file descriptor
       dup2          duplicate an existing file descriptor
       close         delete a descriptor
       select        synchronous I/O multiplexing
       fcntl         file control
1.6  Resource controls
       getpriority   get program scheduling priority
       setpriority   set program scheduling priority
       getrusage     get information about resource
                     utilization
       getrlimit     get maximum system resource consumption
       setrlimit     set maximum system resource consumption
1.7  System operation support
       sysctl        get or set system information
       mount         mount a filesystem
       getfsstat     get list of all mounted filesystems
       swapon        add a swap device for interleaved
                     paging/swapping
       unmount       dismount a filesystem
       sync          force completion of pending disk writes
                     (flush cache)
       reboot        reboot system or halt processor
       acct          enable or disable process accounting
2    System facilities
2.1  Generic operations
       read          read input
       write         write output
       readv         read scattered input
       writev        write gathered output
       ioctl         control device
2.2  Filesystem
       chdir         change current working directory
       fchdir        change current working directory
       chroot        change root directory
       statfs        get file system statistics
       fstatfs       get file system statistics
       mkdir         make a directory file
       rmdir         remove a directory file
       getdirentries get  directory  entries in a filesystem
independent format
       open          open or create a file  for  reading  or
writing
       umask         set file creation mode mask
       mknod         make a special file node
       mkfifo        make a fifo file
       link          make a hard file link
       symlink       make a symbolic link to a file
       readlink      read value of a symbolic link
       rename        change the name of a file
       unlink        remove directory entry
       revoke        revoke file access
       stat          get file status
       fstat         get file status
       lstat         get file status
       chown         change owner and group of a file
       fchown        change owner and group of a file
       chmod         change mode of file
       fchmod        change mode of file
       chflags       set file flags
       fchflags      set file flags
       utimes        set file access and modification times
       access        check access permissions of a file or pathname
       pathconf      get configurable pathname variables
       fpathconf     get configurable pathname variables
       lseek         reposition read/write file offset
       truncate      truncate a file to a specified length
       ftruncate     truncate a file to a specified length
       fsync         synchronize in-core state of a file with that on disk
       flock         apply or remove an advisory lock on an open file
       quotactl      manipulate filesystem quotas
       nfssvc        NFS services
       getfh         get file handle
2.3  IInntteerrpprroocceessss ccoommmmuunniiccaattiioonnss
       socket        create an endpoint for communication
       bind          bind a name to a socket
       getsockname   get socket name
       getpeername   get name of connected peer
       listen        listen for connections on a socket
       accept        accept a connection on a socket
       connect       initiate a connection on a socket
       socketpair    create a pair of connected sockets
       pipe          create descriptor pair for interprocess communication
       sendto        send a message from a socket
       send          send a message from a socket
       recvfrom      receive a message from a socket
       recv          receive a message from a socket
       sendmsg       send a message from a socket
       recvmsg       receive a message from a socket
       shutdown      shut down part of a full-duplex connection
       getsockopt    get options on socket
       setsockopt    set options on socket
2.4  TTeerrmmiinnaallss aanndd DDeevviicceess
2.5  PPrroocceessss ddeebbuuggggiinngg
       ptrace        process trace
       ktrace        process tracing
3    SSuummmmaarryy ooff ffaacciilliittiieess

                          CCoonntteennttss


             NNoottaattiioonn aanndd TTyyppeess                            4
         1   KKeerrnneell pprriimmiittiivveess                             4
       1.1   PPrroocceesssseess aanndd pprrootteeccttiioonn                      5
     1.1.1   Host identifiers                              5
     1.1.2   Process identifiers                           5
     1.1.3   Process creation and termination              5
     1.1.4   User and group IDs                            6
     1.1.5   Sessions                                      7
     1.1.6   Process groups                                7
       1.2   MMeemmoorryy mmaannaaggeemmeenntt                             8
     1.2.1   Text, data, and stack                         8
     1.2.2   Mapping pages                                 8
     1.2.3   Page protection control                      10
     1.2.4   Giving and getting advice                    10
     1.2.5   Synchronization primitives                   10
       1.3   SSiiggnnaallss                                      11
     1.3.1   Overview                                     11
     1.3.2   Signal types                                 11
     1.3.3   Signal handlers                              12
     1.3.4   Sending signals                              13
     1.3.5   Protecting critical sections                 13
     1.3.6   Signal stacks                                14
       1.4   TTiimmeerrss                                       14
     1.4.1   Real time                                    14
     1.4.2   Interval time                                15
       1.5   DDeessccrriippttoorrss                                  16
     1.5.1   The reference table                          16
     1.5.2   Descriptor properties                        16
     1.5.3   Managing descriptor references               16
     1.5.4   Multiplexing requests                        16
       1.6   RReessoouurrccee ccoonnttrroollss                            18
     1.6.1   Process priorities                           18
     1.6.2   Resource utilization                         18
     1.6.3   Resource limits                              19
       1.7   SSyysstteemm ooppeerraattiioonn ssuuppppoorrtt                     19
     1.7.1   Monitoring system operation                  20
     1.7.2   Bootstrap operations                         20
     1.7.3   Shutdown operations                          21
     1.7.4   Accounting                                   21
         2   SSyysstteemm ffaacciilliittiieess                            21
       2.1   GGeenneerriicc ooppeerraattiioonnss                           22
     2.1.1   Read and write                               22
     2.1.2   Input/output control                         23
     2.1.3   Non-blocking and asynchronous operations     23
       2.2   FFiilleessyysstteemm                                   23
     2.2.1   Overview                                     23
     2.2.2   Naming                                       23
     2.2.3   Creation and removal                         24
   2.2.3.1   Directory creation and removal               24
   2.2.3.2   File creation                                24
   2.2.3.3   Creating references to devices               26
   2.2.3.4   Links and renaming                           26
   2.2.3.5   File, device, and fifo removal               27
     2.2.4   Reading and modifying file attributes        27
     2.2.5   Checking accessibility                       28
     2.2.6   Extension and truncation                     29
     2.2.7   Locking                                      29
     2.2.8   Disk quotas                                  30
     2.2.9   Remote filesystems                           30
    2.2.10   Other filesystems                            31
       2.3   IInntteerrpprroocceessss ccoommmmuunniiccaattiioonnss                  31
     2.3.1   Interprocess communication primitives        31
   2.3.1.1   Communication domains                        31
   2.3.1.2   Socket types and protocols                   31
   2.3.1.3   Socket creation, naming and service establishment  32
   2.3.1.4   Accepting connections                        33
   2.3.1.5   Making connections                           33
   2.3.1.6   Sending and receiving data                   33
   2.3.1.7   Scatter/gather and exchanging access rights  34
   2.3.1.8   Using read and write with sockets            35
   2.3.1.9   Shutting down halves of full-duplex connections  35
  2.3.1.10   Socket and protocol options                  35
     2.3.2   PF_LOCAL domain                              36
   2.3.2.1   Types of sockets                             36
   2.3.2.2   Naming                                       36
   2.3.2.3   Access rights transmission                   36
     2.3.3   INTERNET domain                              36
   2.3.3.1   Socket types and protocols                   36
   2.3.3.2   Socket naming                                36
   2.3.3.3   Access rights transmission                   36
   2.3.3.4   Raw access                                   36
       2.4   TTeerrmmiinnaallss aanndd DDeevviicceess                        37
     2.4.1   Terminals                                    37
   2.4.1.1   Terminal input                               37
   2.4.1.2   Terminal output                              38
     2.4.2   Structured devices                           38
     2.4.3   Unstructured devices                         38
       2.5   PPrroocceessss ddeebbuuggggiinngg                            38
     2.5.1   Traditional debugging                        38
     2.5.2   Kernel tracing                               38
         3   SSuummmmaarryy ooff ffaacciilliittiieess                        40
