








              NNeettwwoorrkkiinngg IImmpplleemmeennttaattiioonn NNootteess
                       44..44BBSSDD EEddiittiioonn


_S_a_m_u_e_l _J_. _L_e_f_f_l_e_r_, _W_i_l_l_i_a_m _N_. _J_o_y_, _R_o_b_e_r_t _S_. _F_a_b_r_y_, _a_n_d _M_i_c_h_a_e_l _J_. _K_a_r_e_l_s
              Computer Systems Research Group
                 Computer Science Division
 Department of Electrical Engineering and Computer Science
             University of California, Berkeley
                    Berkeley, CA  94720


                          _A_B_S_T_R_A_C_T

          This  report describes the internal structure
     of the networking  facilities  developed  for  the
     4.4BSD  version  of the UNIX* operating system for
     the VAX[+].  These facilities are based on several
     central abstractions which structure the  external
     (user)  view  of  network communication as well as
     the internal (system) implementation.

          The report documents the  internal  structure
     of the networking system.  The ``Berkeley Software
     Architecture Manual, 4.4BSD Edition'' (PSD:5) pro-
     vides  a  description of the user interface to the
     networking facilities.


     Revised June 10, 1993

-----------
* UNIX is a trademark of Bell Laboratories.
[+] DEC, VAX, DECnet, and UNIBUS are trademarks of
Digital Equipment Corporation.









SMM:18-2                     Networking Implementation Notes


                     TTAABBLLEE OOFF CCOONNTTEENNTTSS


11..  IInnttrroodduuccttiioonn

22..  OOvveerrvviieeww

33..  GGooaallss

44..  IInntteerrnnaall aaddddrreessss rreepprreesseennttaattiioonn

55..  MMeemmoorryy mmaannaaggeemmeenntt

66..  IInntteerrnnaall llaayyeerriinngg
6.1.    Socket layer
6.1.1.    Socket state
6.1.2.    Socket data queues
6.1.3.    Socket connection queuing
6.2.    Protocol layer(s)
6.3.    Network-interface layer
6.3.1.    UNIBUS interfaces

77..  SSoocckkeett//pprroottooccooll iinntteerrffaaccee

88..  PPrroottooccooll//pprroottooccooll iinntteerrffaaccee
8.1.     pr_output
8.2.     pr_input
8.3.     pr_ctlinput
8.4.     pr_ctloutput

99..  PPrroottooccooll//nneettwwoorrkk--iinntteerrffaaccee iinntteerrffaaccee
9.1.     Packet transmission
9.2.     Packet reception

1100.. GGaatteewwaayyss aanndd rroouuttiinngg iissssuueess
10.1.     Routing tables
10.2.     Routing table interface
10.3.     User level routing policies

1111.. RRaaww ssoocckkeettss
11.1.     Control blocks
11.2.     Input processing
11.3.     Output processing

1122.. BBuuffffeerriinngg aanndd ccoonnggeessttiioonn ccoonnttrrooll
12.1.     Memory management
12.2.     Protocol buffering policies
12.3.     Queue limiting
12.4.     Packet forwarding

1133.. OOuutt ooff bbaanndd ddaattaa

1144.. TTrraaiilleerr pprroottooccoollss










Networking Implementation Notes                     SMM:18-3


AAcckknnoowwlleeddggeemmeennttss

RReeffeerreenncceess




























































1.  Introduction

     This report describes the internal structure of the
facilities added to the 4.2BSD version of the UNIX operating
system for the VAX, as modified in the 4.4BSD release.  The
system facilities provide a uniform user interface to
networking within UNIX.  In addition, the implementation
introduces a structure for network communications which may
be used by system implementors in adding new networking
facilities.  The internal structure is not visible to the
user; rather, it is intended to aid implementors of
communication protocols and network services by providing a
framework which promotes code sharing and minimizes
implementation effort.

     The reader is expected to be familiar with the C
programming language and system interface, as described in
the Berkeley Software Architecture Manual, 4.4BSD Edition
[Joy86].  A basic understanding of network communication
concepts is assumed; additional ideas are introduced where
required.

     The remainder of this document provides  a  description
of the system internals, avoiding, when possible, those por-
tions which are used only by the interprocess  communication
facilities.

22..  OOvveerrvviieeww

     If we consider the International Standards
Organization's (ISO) Open System Interconnection (OSI) model
of network communication [ISO81] [Zimmermann80], the
networking facilities described here correspond to a portion
of the session layer (layer 5) and all of the transport and
network layers (layers 4 and 3, respectively).

     The network  layer  provides  possibly  imperfect  data
transport   services   with  minimal  addressing  structure.
Addressing at this level is  normally  host  to  host,  with
implicit  or  explicit  routing  optionally supported by the
communicating agents.

     At the transport layer the notions of reliable
transfer, data sequencing, flow control, and service
addressing are normally included.  Reliability is usually
managed by explicit acknowledgement of data delivered.
Failure to acknowledge a transfer results in retransmission
of the data.  Sequencing may be handled by tagging each
message handed to the network layer with a sequence number
and maintaining state at the endpoints of communication to
use received sequence numbers in reordering data which
arrives out of order.

     The  session  layer  facilities  may  provide  forms of
addressing which are mapped into  formats  required  by  the
transport layer, service authentication and client authenti-
cation, etc.  Various systems also provide services such  as
data encryption and address and protocol translation.

     The  following sections begin by describing some of the
common data structures and utility  routines,  then  examine
the  internal  layering.  The contents of each layer and its
interface are considered.  Certain  of  the  interfaces  are
protocol  implementation specific.  For these cases examples
have been drawn from the Internet [Cerf78] protocol  family.
Later  sections  cover routing issues, the design of the raw
socket interface and other miscellaneous topics.

33..  GGooaallss

     The networking system was designed with the goal of
supporting multiple protocol families and addressing styles.
This required information to be ``hidden'' in common data
structures which could be manipulated by all the pieces of
the system, but which required interpretation only by the
protocols which ``controlled'' it.  The system described
here attempts to limit the use of shared data structures to
those kept by a suite of protocols (a protocol family), and
those used for rendezvous between ``synchronous'' and
``asynchronous'' portions of the system (e.g. queues of data
packets are filled at interrupt time and emptied based on
user requests).

     A major goal of the system was to provide a framework
within which new protocols and hardware could be easily
supported.  To this end, a great deal of effort has been
expended to create utility routines which hide many of the
more complex and/or hardware dependent chores of networking.
Later sections describe the utility routines and the
underlying data structures they manipulate.

44..  IInntteerrnnaall aaddddrreessss rreepprreesseennttaattiioonn

     Common to all portions of the system are two data
structures.  These structures are used to represent
addresses and various data objects.  Internally, addresses
are described by the sockaddr structure,

     struct sockaddr {
            short     sa_family;           /* data format identifier */
            char      sa_data[14];         /* address */
     };

All addresses belong to one or more address families which
define their format and interpretation.  The sa_family field
indicates the address family to which the address belongs,
and the sa_data field contains the actual data value.  The
size of the data field, 14 bytes, was selected based on a
study of current address formats.*  Specific address formats
use private structure definitions that define the format of
the data field.  The system interface supports larger
address structures, although address-family-independent
support facilities, for example routing and raw socket
interfaces, provide only 14 bytes for address storage.
Protocols that do not use those facilities (e.g., the
current UNIX domain) may use larger data areas.
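
     The relationship between the generic sockaddr and a
private, family-specific format can be sketched as follows.
This is only an illustration: the family number AF_EXAMPLE,
the structure sockaddr_ex, and the helper functions are
hypothetical and do not appear in the system.

```c
#include <assert.h>
#include <string.h>

/* Generic address, as given in the text. */
struct sockaddr {
    short sa_family;          /* data format identifier */
    char  sa_data[14];        /* address */
};

#define AF_EXAMPLE 99         /* hypothetical family number */

/* Hypothetical private format; it must fit the 14-byte data field. */
struct sockaddr_ex {
    short          sx_family; /* always AF_EXAMPLE */
    unsigned short sx_port;
    unsigned char  sx_host[4];
    char           sx_zero[8];/* pad out to 14 bytes of data */
};

/* Build a generic address from the family-specific parts. */
struct sockaddr make_example_addr(unsigned short port,
                                  const unsigned char host[4])
{
    struct sockaddr_ex ex;
    struct sockaddr sa;

    assert(sizeof ex == sizeof sa);
    memset(&ex, 0, sizeof ex);
    ex.sx_family = AF_EXAMPLE;
    ex.sx_port = port;
    memcpy(ex.sx_host, host, 4);
    memcpy(&sa, &ex, sizeof sa);
    return sa;
}

/* Family-independent code needs to look only at sa_family. */
int is_example_addr(const struct sockaddr *sa)
{
    return sa->sa_family == AF_EXAMPLE;
}

/* Family-specific code interprets the data field. */
unsigned short example_port(const struct sockaddr *sa)
{
    struct sockaddr_ex ex;

    memcpy(&ex, sa, sizeof ex);
    return ex.sx_port;
}
```

Code that is independent of the address family, such as
routing, can carry the 16-byte structure around unchanged;
only the owning protocol family interprets the data bytes.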

55..  MMeemmoorryy mmaannaaggeemmeenntt

     A single mechanism is used for data storage: memory
buffers, or mbufs.  An mbuf is a structure of the form:

     struct mbuf {
            struct    mbuf *m_next;        /* next buffer in chain */
            u_long    m_off;               /* offset of data */
            short     m_len;               /* amount of data in this mbuf */
            short     m_type;              /* mbuf type (accounting) */
            u_char    m_dat[MLEN];         /* data storage */
            struct    mbuf *m_act;         /* link in higher-level mbuf list */
     };

The _m___n_e_x_t field is used to chain mbufs together  on  linked
lists,  while the _m___a_c_t field allows lists of mbuf chains to
be accumulated.  By convention, the mbufs common to a single
object (for example, a packet) are chained together with the
_m___n_e_x_t field, while groups of objects  are  linked  via  the
_m___a_c_t field (possibly when in a queue).

     Each mbuf has a small data area for storing
information, m_dat.  The m_len field indicates the amount of
data, while the m_off field is an offset to the beginning of
the data from the base of the mbuf.  Thus, for example, the
macro mtod, which converts a pointer to an mbuf to a pointer
to the data stored in the mbuf, has the form

     #define mtod(x,t)         ((t)((int)(x) + (x)->m_off))

(note the t parameter, a C type cast, which is used to cast
the resultant pointer for proper assignment).
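
     The way m_off and m_len together locate the data can be
sketched as follows.  The mtod variant shown uses byte
pointer arithmetic instead of the int cast above so that it
also works where pointers are wider than int; the helper
routines mbuf_fill and mbuf_trim_front are hypothetical and
serve only to exercise the macro.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MLEN 112                    /* size of the internal data area */

struct mbuf {
    struct mbuf *m_next;            /* next buffer in chain */
    unsigned long m_off;            /* offset of data from base of mbuf */
    short m_len;                    /* amount of data in this mbuf */
    short m_type;                   /* mbuf type (accounting) */
    unsigned char m_dat[MLEN];      /* data storage */
    struct mbuf *m_act;             /* link in higher-level mbuf list */
};

/* Portable variant of mtod: byte arithmetic instead of an int cast. */
#define mtod(x, t)  ((t)((char *)(x) + (x)->m_off))

/* Hypothetical helper: store len bytes at the start of the data area. */
void mbuf_fill(struct mbuf *m, const void *src, short len)
{
    assert(len >= 0 && len <= MLEN);
    memset(m, 0, sizeof *m);
    m->m_off = offsetof(struct mbuf, m_dat);
    m->m_len = len;
    memcpy(mtod(m, char *), src, (size_t)len);
}

/* Hypothetical helper: drop n leading bytes without copying. */
void mbuf_trim_front(struct mbuf *m, short n)
{
    assert(n >= 0 && n <= m->m_len);
    m->m_off += (unsigned long)n;   /* data now starts n bytes later */
    m->m_len = (short)(m->m_len - n);
}
```

Because the data pointer is always derived from m_off,
stripping a header costs two field updates and no copying.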

     In addition to storing data directly in the mbuf's data
area, data of page size may also be stored in a separate
area of memory.  The mbuf utility routines maintain a pool
of pages for this purpose and manipulate a private page map
for such pages.  An mbuf with an external data area may be
recognized by the larger offset to the data area; this is
formalized by the macro M_HASCL(m), which is true if the
mbuf whose address is m has an external page cluster.  An
array of reference counts on pages is also maintained so
that copies of pages may be made without core to core
copying (copies are created simply by duplicating the
reference to the data and incrementing the associated
reference counts for the pages).  Separate data pages are
currently used only when copying data from a user process
into the kernel, and when bringing data in at the hardware
level.  Routines which manipulate mbufs are not normally
aware whether data is stored directly in the mbuf data
array, or if it is kept in separate pages.

-----------
* Later versions of the system may support variable length
addresses.

     The following may be used to allocate and free mbufs:

m = m_get(wait, type);
MGET(m, wait, type);
     The subroutine m_get and the macro MGET each allocate
     an mbuf, placing its address in m.  The argument wait
     is either M_WAIT or M_DONTWAIT according to whether
     allocation should block or fail if no mbuf is
     available.  The type is one of the predefined mbuf
     types for use in accounting of mbuf allocation.

MCLGET(m);
     This macro attempts to allocate an mbuf page cluster to
     associate with the mbuf m.  If successful, the length
     of the mbuf is set to CLSIZE, the size of the page
     cluster.

n = m_free(m);
MFREE(m,n);
     The routine m_free and the macro MFREE each free a
     single mbuf, m, and any associated external storage
     area, placing a pointer to its successor in the chain
     it heads, if any, in n.

m_freem(m);
     This routine frees an mbuf chain headed by m.

     The following utility routines are available for
manipulating mbuf chains:

m = m_copy(m0, off, len);
     The m_copy routine creates a copy of all, or part, of
     the list of mbufs in m0.  Len bytes of data, starting
     off bytes from the front of the chain, are copied.
     Where possible, reference counts on pages are used
     instead of core to core copies.  The original mbuf
     chain must have at least off + len bytes of data.  If
     len is specified as M_COPYALL, all the data present,
     offset as before, is copied.

m_cat(m, n);
     The mbuf chain, n, is appended to the end of m.  Where
     possible, compaction is performed.

m_adj(m, diff);
     The mbuf chain, m, is adjusted in size by diff bytes.
     If diff is non-negative, diff bytes are shaved off the
     front of the mbuf chain.  If diff is negative, the
     alteration is performed from back to front.  No space
     is reclaimed in this operation; alterations are
     accomplished by changing the m_len and m_off fields of
     mbufs.

m = m_pullup(m0, size);
     After a successful call to m_pullup, the mbuf at the
     head of the returned list, m, is guaranteed to have at
     least size bytes of data in contiguous memory within
     the data area of the mbuf (allowing access via a
     pointer, obtained using the mtod macro, and allowing
     the mbuf to be located from a pointer to the data area
     using dtom, defined below).  If the original data was
     less than size bytes long, size was greater than the
     size of an mbuf data area (112 bytes), or required
     resources were unavailable, m is 0 and the original
     mbuf chain is deallocated.

     This routine is particularly useful when verifying
     packet header lengths on reception.  For example, if a
     packet is received and only 8 of the necessary 16 bytes
     required for a valid packet header are present at the
     head of the list of mbufs representing the packet, the
     remaining 8 bytes may be ``pulled up'' with a single
     m_pullup call.  If the call fails the invalid packet
     will have been discarded.
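
     The behavior described for m_adj can be sketched as
follows.  This is a minimal illustration over a reduced mbuf
structure, not the kernel routine; the names adj_sketch and
chain_len are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced mbuf: only the fields that the adjustment touches. */
struct mbuf {
    struct mbuf *m_next;   /* next buffer in chain */
    unsigned long m_off;   /* offset of data */
    short m_len;           /* amount of data in this mbuf */
};

/* Total number of data bytes in a chain. */
int chain_len(const struct mbuf *m)
{
    int total = 0;

    for (; m != NULL; m = m->m_next) {
        assert(m->m_len >= 0);
        total += m->m_len;
    }
    return total;
}

/*
 * Trim diff bytes from the front of the chain (diff >= 0) or
 * -diff bytes from the back (diff < 0).  Only m_len and m_off
 * change; emptied mbufs stay on the chain and no space is
 * reclaimed.
 */
void adj_sketch(struct mbuf *m, int diff)
{
    if (diff >= 0) {
        for (; m != NULL && diff > 0; m = m->m_next) {
            int n = diff < m->m_len ? diff : m->m_len;

            m->m_off += (unsigned long)n;
            m->m_len = (short)(m->m_len - n);
            diff -= n;
        }
    } else {
        int keep = chain_len(m) + diff;   /* bytes that survive */

        if (keep < 0)
            keep = 0;
        for (; m != NULL; m = m->m_next) {
            if (m->m_len > keep)
                m->m_len = (short)keep;
            keep -= m->m_len;
        }
    }
}
```

A protocol can use the front trim to strip a header it has
finished with, and the back trim to discard trailing padding.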

     By ensuring that mbufs always reside on 128 byte
boundaries, it is always possible to locate the mbuf
associated with a data area by masking off the low bits of
the virtual address.  This allows modules to store data
structures in mbufs and pass them around without concern for
locating the original mbuf when it comes time to free the
structure.  Note that this works only with objects stored in
the internal data buffer of the mbuf.  The dtom macro is
used to convert a pointer into an mbuf's data area to a
pointer to the mbuf,

     #define   dtom(x)   ((struct mbuf *)((int)x & ~(MSIZE-1)))
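
     The masking trick can be tried outside the kernel with a
small sketch.  It assumes a C11 library providing
aligned_alloc, uses uintptr_t in place of the int cast, and
uses a reduced mbuf with an illustrative MLEN; mbuf_alloc is
a hypothetical stand-in for the real allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MSIZE 128          /* each mbuf starts on an MSIZE boundary */
#define MLEN  96           /* internal data area (illustrative size) */

struct mbuf {
    struct mbuf *m_next;
    unsigned long m_off;
    short m_len;
    unsigned char m_dat[MLEN];
};

/* Mask off the low bits of a data pointer to find its mbuf. */
#define dtom(x) ((struct mbuf *)((uintptr_t)(x) & ~(uintptr_t)(MSIZE - 1)))

/* Hypothetical allocator: one MSIZE-aligned, MSIZE-byte buffer. */
struct mbuf *mbuf_alloc(void)
{
    struct mbuf *m;

    assert(sizeof(struct mbuf) <= MSIZE);
    m = aligned_alloc(MSIZE, MSIZE);
    if (m != NULL) {
        m->m_next = NULL;
        m->m_off = offsetof(struct mbuf, m_dat);
        m->m_len = 0;
    }
    return m;
}
```

Any address inside m_dat maps back to the owning mbuf, which
is why the technique fails for data held in external pages.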


     Mbufs are used for dynamically allocated data
structures such as sockets as well as memory allocated for
packets and headers.  Statistics are maintained on mbuf
usage and can be viewed by users using the netstat(1)
program.

6.  Internal layering

     The internal structure of the network system is divided
into three layers.  These layers correspond to the  services
provided  by  the  socket abstraction, those provided by the
communication protocols, and those provided by the  hardware
interfaces.   The  communication protocols are normally lay-
ered into two or more individual cooperating layers,  though
they are collectively viewed in the system as one layer pro-
viding  services  supportive  of  the   appropriate   socket
abstraction.

     The  following sections describe the properties of each
layer in the system and the interfaces to  which  each  must
conform.

66..11..  SSoocckkeett llaayyeerr

     The socket layer deals with the interprocess communica-
tion facilities provided by the system.  A socket is a bidi-
rectional  endpoint  of  communication which is ``typed'' by
the semantics of  communication  it  supports.   The  system
calls described in the _B_e_r_k_e_l_e_y _S_o_f_t_w_a_r_e _A_r_c_h_i_t_e_c_t_u_r_e _M_a_n_u_a_l
[Joy86] are used to manipulate sockets.

     A socket consists of the following data structure:

     struct socket {
            short     so_type;             /* generic type */
            short     so_options;          /* from socket call */
            short     so_linger;           /* time to linger while closing */
            short     so_state;            /* internal state flags */
            caddr_t   so_pcb;              /* protocol control block */
            struct    protosw *so_proto;   /* protocol handle */
            struct    socket *so_head;     /* back pointer to accept socket */
            struct    socket *so_q0;       /* queue of partial connections */
            short     so_q0len;            /* partials on so_q0 */
            struct    socket *so_q;        /* queue of incoming connections */
            short     so_qlen;             /* number of connections on so_q */
            short     so_qlimit;           /* max number queued connections */
            struct    sockbuf so_rcv;      /* receive queue */
            struct    sockbuf so_snd;      /* send queue */
            short     so_timeo;            /* connection timeout */
            u_short   so_error;            /* error affecting connection */
            u_short   so_oobmark;          /* chars to oob mark */
            short     so_pgrp;             /* pgrp for signals */
     };


     Each socket contains two data queues, so_rcv and
so_snd, and a pointer to routines which provide supporting
services.  The type of the socket, so_type, is defined at
socket creation time and used in selecting those services
which are appropriate to support it.  The supporting
protocol is selected at socket creation time and recorded in
the socket data structure for later use.  Protocols are
defined by a table of procedures, the protosw structure,
which will be described in detail later.  A pointer to a
protocol-specific data structure, the ``protocol control
block,'' is also present in the socket structure.  Protocols
control this data structure, which normally includes a back
pointer to the parent socket structure to allow easy lookup
when returning information to a user (for example, placing
an error number in the so_error field).  The other entries
in the socket structure are used in queuing connection
requests, validating user requests, storing socket
characteristics (e.g. options supplied at the time a socket
is created), and maintaining a socket's state.

     Processes ``rendezvous at a socket'' in many instances.
For example, when a process wishes to extract data from a
socket's receive queue and it is empty, or lacks sufficient
data to satisfy the request, the process blocks, supplying
the address of the receive queue as a ``wait channel'' to be
used in notification.  When data arrives for the process and
is placed in the socket's queue, the blocked process is
identified by the fact that it is waiting ``on the queue.''

66..11..11..  SSoocckkeett ssttaattee

     A socket's state is defined from the following:

     #define SS_NOFDREF            0x001     /* no file table ref any more */
     #define SS_ISCONNECTED        0x002     /* socket connected to a peer */
     #define SS_ISCONNECTING       0x004     /* in process of connecting to peer */
     #define SS_ISDISCONNECTING    0x008     /* in process of disconnecting */
     #define SS_CANTSENDMORE       0x010     /* can't send more data to peer */
     #define SS_CANTRCVMORE        0x020     /* can't receive more data from peer */
     #define SS_RCVATMARK          0x040     /* at mark on input */

     #define SS_PRIV               0x080     /* privileged */
     #define SS_NBIO               0x100     /* non-blocking ops */
     #define SS_ASYNC              0x200     /* async i/o notify */


     The state of a socket is manipulated both by the
protocols and the user (through system calls).  When a
socket is created, the state is defined based on the type of
socket.  It may change as control actions are performed, for
example connection establishment.  It may also change
according to the type of input/output the user wishes to
perform, as indicated by options set with fcntl.
``Non-blocking'' I/O implies that a process should never be
blocked to await resources.  Instead, any call which would
block returns prematurely with the error EWOULDBLOCK, or the
service request may be partially fulfilled, e.g. a request
for more data than is present.

     If a process requested ``asynchronous'' notification of
events  related to the socket, the SIGIO signal is posted to
the process when such events occur.  An event is a change in
the  socket's state; examples of such occurrences are: space
becoming available in the send queue, new data available  in
the receive queue, connection establishment or disestablish-
ment, etc.

     A socket may be marked ``privileged'' if it was created
by   the  super-user.   Only  privileged  sockets  may  bind
addresses in privileged portions of an address space or  use
``raw'' sockets to access lower levels of the network.
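
     How the state bits steer a request can be sketched as
follows.  The flag values are those listed earlier in this
section; the structure socket_sketch, the result codes, and
the treatment of SS_CANTRCVMORE as end of file are
assumptions made for illustration and are not the system's
own receive logic.

```c
#include <assert.h>

#define SS_ISCONNECTED   0x002   /* socket connected to a peer */
#define SS_CANTRCVMORE   0x020   /* can't receive more data from peer */
#define SS_NBIO          0x100   /* non-blocking ops */

/* Hypothetical reduced socket: state bits and queued byte count. */
struct socket_sketch {
    short so_state;
    unsigned short rcv_cc;
};

enum recv_result {
    RECV_READY,        /* data can be returned now */
    RECV_EOF,          /* nothing queued and no more will arrive */
    RECV_WOULDBLOCK,   /* non-blocking: fail with EWOULDBLOCK */
    RECV_SLEEP         /* blocking: wait on the receive queue */
};

/* Decide how a receive request on an empty or filled queue proceeds. */
enum recv_result recv_disposition(const struct socket_sketch *so)
{
    if (so->rcv_cc > 0)
        return RECV_READY;
    if (so->so_state & SS_CANTRCVMORE)
        return RECV_EOF;
    if (so->so_state & SS_NBIO)
        return RECV_WOULDBLOCK;
    return RECV_SLEEP;
}
```

Because each property is a separate bit, protocols and
system calls can set or clear one without disturbing the
others.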

66..11..22..  SSoocckkeett ddaattaa qquueeuueess

     A  socket's  data  queue contains a pointer to the data
stored in the queue and other entries related to the manage-
ment  of  the  data.  The following structure defines a data
queue:

     struct sockbuf {
            u_short   sb_cc;               /* actual chars in buffer */
            u_short   sb_hiwat;            /* max actual char count */
            u_short   sb_mbcnt;            /* chars of mbufs used */
            u_short   sb_mbmax;            /* max chars of mbufs to use */
            u_short   sb_lowat;            /* low water mark */
            short     sb_timeo;            /* timeout */
            struct    mbuf *sb_mb;         /* the mbuf chain */
            struct    proc *sb_sel;        /* process selecting read/write */
            short     sb_flags;            /* flags, see below */
     };


     Data is stored in a queue as a  chain  of  mbufs.   The
actual  count  of  data  characters  as well as high and low
water marks are used by the  protocols  in  controlling  the
flow  of  data.   The  amount of buffer space (characters of
mbufs and associated data pages) is also recorded along with
the limit on buffer allocation.  The socket routines cooper-
ate in implementing the flow control policy  by  blocking  a
process  when  it  requests  to send data and the high water
mark has been reached, or when it requests to  receive  data
and  less  than the low water mark is present (assuming non-
blocking I/O has not been specified).*
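
     The two limits described above can be combined into a
single ``space available'' figure, sketched below.  The
structure is reduced to its counters, and the function names
and exact blocking rules are assumptions chosen to mirror
the text rather than the socket routines themselves.

```c
#include <assert.h>

/* Reduced socket queue: only the accounting fields. */
struct sockbuf_sketch {
    unsigned short sb_cc;      /* actual chars in buffer */
    unsigned short sb_hiwat;   /* max actual char count */
    unsigned short sb_mbcnt;   /* chars of mbufs used */
    unsigned short sb_mbmax;   /* max chars of mbufs to use */
    unsigned short sb_lowat;   /* low water mark */
};

/* Space left: the tighter of the data limit and the buffer limit. */
int sb_space(const struct sockbuf_sketch *sb)
{
    int by_chars = (int)sb->sb_hiwat - (int)sb->sb_cc;
    int by_mbufs = (int)sb->sb_mbmax - (int)sb->sb_mbcnt;

    return by_chars < by_mbufs ? by_chars : by_mbufs;
}

/* A blocking sender waits once the high water mark is reached. */
int sender_must_wait(const struct sockbuf_sketch *sb)
{
    return sb_space(sb) <= 0;
}

/* A blocking receiver waits while too little data is queued. */
int receiver_must_wait(const struct sockbuf_sketch *sb)
{
    return sb->sb_cc == 0 || sb->sb_cc < sb->sb_lowat;
}
```

Tracking mbuf storage separately from data characters keeps
many small, mostly empty buffers from exhausting memory while
the character count is still under its limit.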

-----------
* The low-water mark is always presumed to be 0 in the
current implementation.

     When a socket is created, the supporting protocol
``reserves'' space for the send and receive queues of the
socket.  The limit on buffer allocation is set somewhat
higher than the limit on data characters to account for the
granularity of buffer allocation.  The actual storage
associated with a socket queue may fluctuate during a
socket's lifetime, but it is assumed that this reservation
will always allow a protocol to acquire enough memory to
satisfy the high water marks.

     The  timeout  and  select values are manipulated by the
socket routines in  implementing  various  portions  of  the
interprocess  communications  facilities  and  will  not  be
described here.

     Data queued at a socket is stored in one of two styles.
Stream-oriented sockets queue data with no addresses,
headers or record boundaries.  The data are in mbufs linked
through the m_next field.  Buffers containing access rights
may be present within the chain if the underlying protocol
supports passage of access rights.  Record-oriented sockets,
including datagram sockets, queue data as a list of packets;
the sections of packets are distinguished by the types of
the mbufs containing them.  The mbufs which comprise a
record are linked through the m_next field; records are
linked from the m_act field of the first mbuf of one packet
to the first mbuf of the next.  Each packet begins with an
mbuf containing the ``from'' address if the protocol
provides it, then any buffers containing access rights, and
finally any buffers containing data.  If a record contains
no data, no data buffers are required unless neither address
nor access rights are present.
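
     The two-level linkage of a record-oriented queue can be
sketched as follows.  The mbuf is reduced to the fields
involved, and the type codes are hypothetical placeholders,
since the actual mbuf type values are not listed in this
report.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type codes for the sections of a record. */
enum { MT_DATA_X = 1, MT_SONAME_X = 2, MT_RIGHTS_X = 3 };

struct mbuf {
    struct mbuf *m_next;   /* next buffer of the same record */
    struct mbuf *m_act;    /* first mbuf of the next record */
    short m_len;           /* amount of data in this mbuf */
    short m_type;          /* which section this mbuf holds */
};

/* Count the records on a queue by following m_act. */
int count_records(const struct mbuf *q)
{
    int n = 0;

    for (; q != NULL; q = q->m_act)
        n++;
    return n;
}

/* Bytes of user data in one record, skipping address and rights. */
int record_data_len(const struct mbuf *rec)
{
    int len = 0;

    assert(rec != NULL);
    for (; rec != NULL; rec = rec->m_next)
        if (rec->m_type == MT_DATA_X)
            len += rec->m_len;
    return len;
}
```

Only the first mbuf of each record carries a meaningful m_act
link, so a whole record can be dequeued by unlinking its head.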

     A socket queue has a number of flags used  in  synchro-
nizing access to the data and in acquiring resources:

     #define SB_LOCK           0x01   /* lock on data queue (so_rcv only) */
     #define SB_WANT           0x02   /* someone is waiting to lock */
     #define SB_WAIT           0x04   /* someone is waiting for data/space */
     #define SB_SEL            0x08   /* buffer is selected */
     #define SB_COLL           0x10   /* collision selecting */

The  last  two flags are manipulated by the system in imple-
menting the select mechanism.

66..11..33..  SSoocckkeett ccoonnnneeccttiioonn qquueeuuiinngg

     In  dealing  with  connection  oriented  sockets  (e.g.
SOCK_STREAM)  the two ends are considered distinct.  One end
is termed _a_c_t_i_v_e, and generates  connection  requests.   The
other end is called _p_a_s_s_i_v_e and accepts connection requests.

     From the passive side, a socket is marked with
SO_ACCEPTCONN when a listen call is made, creating two
queues of sockets: so_q0 for connections in progress and
so_q for connections already made and awaiting user
acceptance.  As a protocol is preparing incoming
connections, it creates a socket structure queued on so_q0
by calling the routine sonewconn().  When the connection is
established, the socket structure is then transferred to
so_q, making it available for an accept.

     If an SO_ACCEPTCONN socket is closed with sockets on
either so_q0 or so_q, these sockets are dropped, with
notification to the peers as appropriate.
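
     The movement of a socket from so_q0 to so_q can be
sketched as follows.  The reduced structure, the extra next
link, the function names, and the simple queue limit test
are all assumptions made for illustration; the kernel's own
queue manipulation and limit rule may differ.

```c
#include <assert.h>
#include <stddef.h>

struct sock {
    struct sock *so_head;   /* back pointer to accept socket */
    struct sock *so_q0;     /* queue of partial connections */
    struct sock *so_q;      /* queue of incoming connections */
    short so_q0len;         /* partials on so_q0 */
    short so_qlen;          /* number of connections on so_q */
    short so_qlimit;        /* max number queued connections */
    struct sock *next;      /* hypothetical link used by both queues */
};

static void q_append(struct sock **q, struct sock *so)
{
    while (*q != NULL)
        q = &(*q)->next;
    so->next = NULL;
    *q = so;
}

static int q_remove(struct sock **q, struct sock *so)
{
    for (; *q != NULL; q = &(*q)->next)
        if (*q == so) {
            *q = so->next;
            so->next = NULL;
            return 1;
        }
    return 0;
}

/* In the spirit of sonewconn(): queue a partial connection. */
int newconn_sketch(struct sock *head, struct sock *so)
{
    if (head->so_q0len + head->so_qlen >= head->so_qlimit)
        return 0;                       /* queues full: refuse */
    so->so_head = head;
    q_append(&head->so_q0, so);
    head->so_q0len++;
    return 1;
}

/* Connection established: move so from so_q0 to so_q. */
void connected_sketch(struct sock *so)
{
    struct sock *head = so->so_head;

    if (head == NULL || !q_remove(&head->so_q0, so))
        return;
    head->so_q0len--;
    q_append(&head->so_q, so);
    head->so_qlen++;
}

/* accept: take the oldest completed connection, or NULL if none. */
struct sock *accept_sketch(struct sock *head)
{
    struct sock *so = head->so_q;

    if (so == NULL)
        return NULL;
    q_remove(&head->so_q, so);
    head->so_qlen--;
    so->so_head = NULL;
    return so;
}
```

Keeping partial connections on a separate queue means that
an accept never returns a socket whose connection is still
being set up.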

66..22..  PPrroottooccooll llaayyeerr((ss))

     Each socket is  created  in  a  communications  domain,
which  usually implies both an addressing structure (address
family) and a  set  of  protocols  which  implement  various
socket  types  within  the  domain  (protocol family).  Each
domain is defined by the following structure:

     struct       domain {
          int     dom_family;             /* PF_xxx */
          char    *dom_name;
          int     (*dom_init)();          /* initialize domain data structures */
          int     (*dom_externalize)();   /* externalize access rights */
          int     (*dom_dispose)();       /* dispose of internalized rights */
          struct  protosw *dom_protosw, *dom_protoswNPROTOSW;
          struct  domain *dom_next;
     };


     At boot time, each domain configured into the kernel is
added to a linked list of domains.  The initialization
procedure of each domain is then called.  After that time,
the domain structure is used to locate protocols within the
protocol family.  It may also contain procedure references
for externalization of access rights at the receiving socket
and the disposal of access rights that are not received.

     Protocols are described by a set of  entry  points  and
certain  socket-visible  characteristics,  some of which are
used in deciding which socket type(s) they may support.

     An entry in the ``protocol switch''  table  exists  for
each protocol module configured into the system.  It has the
following form:

     struct protosw {
          short   pr_type;              /* socket type used for */
          struct  domain *pr_domain;    /* domain protocol a member of */
          short   pr_protocol;          /* protocol number */
          short   pr_flags;             /* socket visible attributes */
     /* protocol-protocol hooks */
          int     (*pr_input)();        /* input to protocol (from below) */
          int     (*pr_output)();       /* output to protocol (from above) */
          int     (*pr_ctlinput)();     /* control input (from below) */
          int     (*pr_ctloutput)();    /* control output (from above) */
     /* user-protocol hook */
          int     (*pr_usrreq)();       /* user request */
     /* utility hooks */
          int     (*pr_init)();         /* initialization routine */
          int     (*pr_fasttimo)();     /* fast timeout (200ms) */
          int     (*pr_slowtimo)();     /* slow timeout (500ms) */
          int     (*pr_drain)();        /* flush any excess space possible */
     };


     A protocol is called through the pr_init entry before
any other.  Thereafter it is called every 200 milliseconds
through the pr_fasttimo entry and every 500 milliseconds
through the pr_slowtimo entry for timer based actions.  The
system will call the pr_drain entry if it is low on space;
the protocol should then throw away any non-critical data.

     Protocols pass data between themselves as chains of
mbufs using the pr_input and pr_output routines.  Pr_input
passes data up (towards the user) and pr_output passes it
down (towards the network); control information passes up
and down on pr_ctlinput and pr_ctloutput.  The protocol is
responsible for the space occupied by any of the arguments
to these entries and must either pass it onward or dispose
of it.  (On output, the lowest level reached must free
buffers storing the arguments; on input, the highest level
is responsible for freeing buffers.)

     The pr_usrreq routine interfaces protocols to the
socket code and is described below.

     The pr_flags field is constructed from the following
values:

     #define PR_ATOMIC         0x01    /* exchange atomic messages only */
     #define PR_ADDR           0x02    /* addresses given with messages */
     #define PR_CONNREQUIRED   0x04    /* connection required by protocol */
     #define PR_WANTRCVD       0x08    /* want PRU_RCVD calls */
     #define PR_RIGHTS         0x10    /* passes capabilities */

Protocols which are connection-based specify the  PR_CONNRE-
QUIRED  flag  so that the socket routines will never attempt
to send data before a connection has been  established.   If
the PR_WANTRCVD flag is set, the socket routines will notify
the protocol  when  the  user  has  removed  data  from  the
socket's  receive queue.  This allows the protocol to imple-
ment acknowledgement on user receipt, and also  update  win-
dowing information based on the amount of space available in
the receive queue.  The PR_ADDR  flag  indicates  that  any
data  placed  in the socket's receive queue will be preceded
by the address of the sender.  The PR_ATOMIC flag  specifies
that  each  _u_s_e_r request to send data must be performed in a
single _p_r_o_t_o_c_o_l send request; it is the protocol's responsi-
bility  to  maintain  record  boundaries on data to be sent.
The PR_RIGHTS flag indicates that the protocol supports  the
passing of capabilities;  this is currently used only by the
protocols in the UNIX protocol family.
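
     As an illustration of how the socket routines might
consult these flags before handing data to a protocol, consider
the sketch below.  The function sketch_send_check, its
parameters, and the choice of ENOTCONN and EMSGSIZE as error
numbers are assumptions made for the example; only the flag
values come from the table above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PR_ATOMIC        0x01   /* exchange atomic messages only */
#define PR_ADDR          0x02   /* addresses given with messages */
#define PR_CONNREQUIRED  0x04   /* connection required by protocol */
#define PR_WANTRCVD      0x08   /* want PRU_RCVD calls */

/* Hypothetical stand-in holding only the flags word. */
struct protosw_sketch {
    short pr_flags;
};

/*
 * Decide whether a user send of `len` bytes may go ahead.
 * `connected` is non-zero once a connection exists; `space` is the
 * room left in the send queue.  Returns 0 or an error number.
 */
int sketch_send_check(const struct protosw_sketch *pr, int connected,
                      size_t len, size_t space)
{
    assert(pr != NULL);
    /* Connection-based protocols never see data before connecting. */
    if ((pr->pr_flags & PR_CONNREQUIRED) && !connected)
        return ENOTCONN;
    /* An atomic message must fit in one protocol send request. */
    if ((pr->pr_flags & PR_ATOMIC) && len > space)
        return EMSGSIZE;
    return 0;
}
```

A stream protocol would typically set PR_CONNREQUIRED and
PR_WANTRCVD, while a datagram protocol would set PR_ATOMIC and
PR_ADDR.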

     When a socket is created, the socket routines scan  the
protocol  table  for  the  domain looking for an appropriate
protocol to support the type of socket being  created.   The
_p_r___t_y_p_e  field  contains  one  of  the possible socket types
(e.g. SOCK_STREAM), while the _p_r___d_o_m_a_i_n is a back pointer to
the  domain  structure.   The _p_r___p_r_o_t_o_c_o_l field contains the
protocol number  of  the  protocol,  normally  a  well-known
value.
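
     The scan of a domain's protocol table can be sketched as
below.  The names sketch_findtype and sketch_findproto, the
two-field structure, and the SKETCH_SOCK_ constants are
hypothetical stand-ins chosen for the example; the protocol
numbers 6 and 17 are the well-known values for TCP and UDP.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the socket types (values are illustrative). */
enum { SKETCH_SOCK_STREAM = 1, SKETCH_SOCK_DGRAM = 2, SKETCH_SOCK_RAW = 3 };

/* Hypothetical, cut-down protocol switch entry. */
struct protosw_sketch {
    short pr_type;       /* socket type used for */
    short pr_protocol;   /* protocol number */
};

/* Return the first entry supporting `type`, or NULL if none does. */
const struct protosw_sketch *
sketch_findtype(const struct protosw_sketch *tab, size_t n, int type)
{
    assert(tab != NULL);
    for (size_t i = 0; i < n; i++)
        if (tab[i].pr_type == type)
            return &tab[i];
    return NULL;
}

/* Return the entry matching both type and protocol number, or NULL. */
const struct protosw_sketch *
sketch_findproto(const struct protosw_sketch *tab, size_t n,
                 int type, int protocol)
{
    assert(tab != NULL);
    for (size_t i = 0; i < n; i++)
        if (tab[i].pr_type == type && tab[i].pr_protocol == protocol)
            return &tab[i];
    return NULL;
}

/* A toy table for one domain. */
const struct protosw_sketch toy_domain_protosw[] = {
    { SKETCH_SOCK_DGRAM,  17 },
    { SKETCH_SOCK_STREAM,  6 },
    { SKETCH_SOCK_RAW,     0 },
};
```

The first lookup serves a caller that names only a socket type;
the second serves a caller that also names a protocol number.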

66..33..  NNeettwwoorrkk--iinntteerrffaaccee llaayyeerr

     Each network-interface configured into a system defines
a path through which packets may be sent and received.  Nor-
mally  a  hardware device is associated with this interface,
though there is no requirement for this  (for  example,  all
systems  have  a  software  ``loopback''  interface used for
debugging and performance analysis).  In addition to manipu-
lating the hardware device, an interface module is responsi-
ble for encapsulation and decapsulation  of  any  link-layer
header information required to deliver a message to its des-
tination.  The selection of which interface to use in deliv-
ering  packets is a routing decision carried out at a higher
level than the network-interface layer.   An  interface  may
have addresses in one or more address families.  The address
is set at boot time using an _i_o_c_t_l on a socket in the appro-
priate domain; this operation is implemented by the protocol
family, after verifying the  operation  through  the  device
_i_o_c_t_l entry.

     An interface is defined by the following structure,

     struct ifnet {
          char     *if_name;              /* name, e.g. ``en'' or ``lo'' */
          short    if_unit;               /* sub-unit for lower level driver */
          short    if_mtu;                /* maximum transmission unit */
          short    if_flags;              /* up/down, broadcast, etc. */
          short    if_timer;              /* time 'til if_watchdog called */
          struct   ifaddr *if_addrlist;   /* list of addresses of interface */
          struct   ifqueue if_snd;        /* output queue */
          int      (*if_init)();          /* init routine */
          int      (*if_output)();        /* output routine */
          int      (*if_ioctl)();         /* ioctl routine */
          int      (*if_reset)();         /* bus reset routine */
          int      (*if_watchdog)();      /* timer routine */
          int      if_ipackets;           /* packets received on interface */
          int      if_ierrors;            /* input errors on interface */
          int      if_opackets;           /* packets sent on interface */
          int      if_oerrors;            /* output errors on interface */
          int      if_collisions;         /* collisions on csma interfaces */
          struct   ifnet *if_next;
     };

Each interface address has the following form:

     struct ifaddr {
             struct   sockaddr ifa_addr;   /* address of interface */
             union {
                      struct   sockaddr ifu_broadaddr;
                      struct   sockaddr ifu_dstaddr;
             } ifa_ifu;
             struct   ifnet *ifa_ifp;      /* back-pointer to interface */
             struct   ifaddr *ifa_next;    /* next address for interface */
     };
     #define ifa_broadaddr   ifa_ifu.ifu_broadaddr        /* broadcast address */
     #define ifa_dstaddr     ifa_ifu.ifu_dstaddr          /* other end of p-to-p link */

The protocol generally maintains this structure as part of a
larger structure containing additional information  concern-
ing the address.

     Each  interface  has a send queue and routines used for
initialization, _i_f___i_n_i_t,  and  output,  _i_f___o_u_t_p_u_t.   If  the
interface resides on a system bus, the routine _i_f___r_e_s_e_t will
be called after a bus reset has been performed.   An  inter-
face  may  also  specify  a  timer  routine, _i_f___w_a_t_c_h_d_o_g; if
_i_f___t_i_m_e_r is non-zero, it  is  decremented  once  per  second
until it reaches zero, at which time the watchdog routine is
called.
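
     The watchdog countdown can be sketched as a routine run
once per second over the list of interfaces.  The structure
ifnet_sketch, the routine sketch_if_tick, and the call counter
are assumptions made for the example; they keep only the fields
needed to show the countdown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, cut-down interface structure. */
struct ifnet_sketch {
    short if_timer;                               /* time 'til watchdog */
    void (*if_watchdog)(struct ifnet_sketch *);   /* timer routine */
    int watchdog_calls;                           /* illustration only */
    struct ifnet_sketch *if_next;
};

/* A toy watchdog that just counts how often it was called. */
void sketch_count_watchdog(struct ifnet_sketch *ifp)
{
    assert(ifp != NULL);
    ifp->watchdog_calls++;
}

/* Run once per second: count down armed timers, fire at zero. */
void sketch_if_tick(struct ifnet_sketch *list)
{
    for (struct ifnet_sketch *ifp = list; ifp != NULL; ifp = ifp->if_next) {
        if (ifp->if_timer == 0)
            continue;                 /* timer not armed */
        if (--ifp->if_timer > 0)
            continue;                 /* still counting down */
        if (ifp->if_watchdog != NULL)
            ifp->if_watchdog(ifp);    /* reached zero: call watchdog */
    }
}
```

A driver that sets if_timer to 3 sees its watchdog called on the
third tick and not again until it rearms the timer.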

     The state of an interface and  certain  characteristics
are  stored in the _i_f___f_l_a_g_s field.  The following values are
possible:

     #define IFF_UP            0x0001 /* interface is up */
     #define IFF_BROADCAST     0x0002 /* broadcast address valid */
     #define IFF_DEBUG         0x0004 /* turn on debugging */
     #define IFF_LOOPBACK      0x0008 /* is a loopback net */
     #define IFF_POINTOPOINT   0x0010 /* interface is point-to-point link */
     #define IFF_RUNNING       0x0040 /* resources allocated */
     #define IFF_NOARP         0x0080 /* no address resolution protocol */
     #define IFF_PROMISC       0x0100 /* receive all packets */
     #define IFF_ALLMULTI      0x0200 /* receive all multicast packets */
     #define IFF_OACTIVE       0x0400 /* transmission in progress */
     #define IFF_SIMPLEX       0x0800 /* can't hear own transmissions */
     #define IFF_LINK0         0x1000 /* per link layer defined bit */
     #define IFF_LINK1         0x2000 /* per link layer defined bit */
     #define IFF_LINK2         0x4000 /* per link layer defined bit */
     #define IFF_MULTICAST     0x8000 /* supports multicast */

If the interface is connected to a  network  which  supports
transmission  of  _b_r_o_a_d_c_a_s_t  packets, the IFF_BROADCAST flag
will be set and the _i_f_a___b_r_o_a_d_a_d_d_r  field  will  contain  the
address  to  be  used  in  sending  or accepting a broadcast
packet.  If the interface is  associated  with  a  point-to-
point  hardware  link  (for  example,  a  DEC  DMR-11),  the
IFF_POINTOPOINT flag will be set and _i_f_a___d_s_t_a_d_d_r  will  con-
tain  the  address of the host on the other side of the con-
nection.  These addresses  and  the  local  address  of  the
interface,  _i_f___a_d_d_r, are used in filtering incoming packets.
The interface sets IFF_RUNNING after it has allocated system
resources  and  posted an initial read on the device it man-
ages.  This state bit is used to avoid  multiple  allocation
requests  when  an  interface's  address  is  changed.   The
IFF_NOARP flag indicates the interface  should  not  use  an
``address  resolution  protocol''  in  mapping  internetwork
addresses to local network addresses.

     Various statistics are also  stored  in  the  interface
structure.   These  may  be  viewed  by users using the _n_e_t_-
_s_t_a_t(1) program.

     The interface address and flags may  be  set  with  the
SIOCSIFADDR  and  SIOCSIFFLAGS  _i_o_c_t_ls.  SIOCSIFADDR is used
initially to define each interface's  address;  SIOCSIFFLAGS
can  be used to mark an interface down and perform site-spe-
cific configuration.  The destination address of a point-to-
point link is set with SIOCSIFDSTADDR.  Corresponding opera-
tions exist to read each value.  Protocol families may  also
support  operations  to  set and read the broadcast address.
In addition, the  SIOCGIFCONF  _i_o_c_t_l  retrieves  a  list  of
interface  names and addresses for all interfaces and proto-
cols on the host.

66..33..11..  UUNNIIBBUUSS iinntteerrffaacceess

     All hardware related interfaces currently reside on the
UNIBUS.   Consequently  a common set of utility routines for
dealing with the UNIBUS has  been  developed.   Each  UNIBUS
interface uses a structure of the following form:

     struct  ifubinfo {
             short       iff_uban;                      /* uba number */
             short       iff_hlen;                      /* local net header length */
             struct      uba_regs *iff_uba;             /* uba regs, in vm */
             short       iff_flags;                     /* used during uballoc's */
     };

Additional  structures  are associated with each receive and
transmit buffer, normally one each per interface; for read,

     struct  ifrw {
             caddr_t     ifrw_addr;                     /* virt addr of header */
             short       ifrw_bdp;                      /* unibus bdp */
             short       ifrw_flags;                    /* type, etc. */
     #define IFRW_W      0x01                           /* is a transmit buffer */
             int         ifrw_info;                     /* value from ubaalloc */
             int         ifrw_proto;                    /* map register prototype */
             struct      pte *ifrw_mr;                  /* base of map registers */
     };

and for write,

     struct  ifxmt {
             struct      ifrw ifrw;
             caddr_t     ifw_base;                      /* virt addr of buffer */
             struct      pte ifw_wmap[IF_MAXNUBAMR];    /* base pages for output */
             struct      mbuf *ifw_xtofree;             /* pages being DMA'd out */
             short       ifw_xswapd;                    /* mask of clusters swapped */
             short       ifw_nmr;                       /* number of entries in wmap */
     };
     #define ifw_addr    ifrw.ifrw_addr
     #define ifw_bdp     ifrw.ifrw_bdp
     #define ifw_flags   ifrw.ifrw_flags
     #define ifw_info    ifrw.ifrw_info
     #define ifw_proto   ifrw.ifrw_proto
     #define ifw_mr      ifrw.ifrw_mr

One of each of these structures is conveniently packaged for
interfaces  with  single buffers for each direction, as fol-
lows:

     struct  ifuba {
             struct      ifubinfo ifu_info;
             struct      ifrw ifu_r;
             struct      ifxmt ifu_xmt;
     };
     #define ifu_uban    ifu_info.iff_uban
     #define ifu_hlen    ifu_info.iff_hlen
     #define ifu_uba     ifu_info.iff_uba
     #define ifu_flags   ifu_info.iff_flags
     #define ifu_w       ifu_xmt.ifrw
     #define ifu_xtofree ifu_xmt.ifw_xtofree


     The _i_f___u_b_i_n_f_o structure contains the  general  informa-
tion  needed  to characterize the I/O-mapped buffers for the
device.  In addition, there is a structure  describing  each
buffer,  including  UNIBUS  resources held by the interface.
Sufficient memory pages and bus map registers are  allocated
to  each buffer upon initialization according to the maximum
packet size and header length.  The kernel  virtual  address
of  the  buffer  is held in _i_f_r_w___a_d_d_r, and the map registers
begin at _i_f_r_w___m_r.  UNIBUS map register _i_f_r_w___m_r[-1] maps  the
local network header ending on a page boundary.  UNIBUS data
paths  are  reserved  for  read  and  for  write,  given  by
_i_f_r_w___b_d_p.   The  prototype of the map registers for read and
for write is saved in _i_f_r_w___p_r_o_t_o.

     When write transfers are not at least  half-full  pages
on  page boundaries, the data are just copied into the pages
mapped on the UNIBUS and the  transfer  is  started.   If  a
write  transfer  is  at least half a page long and on a page
boundary, UNIBUS page table entries are swapped to reference
the  pages,  and  then  the  initial pages are remapped from
_i_f_w___w_m_a_p when the transfer completes.  The mbufs  containing
the  mapped  pages are placed on the _i_f_w___x_t_o_f_r_e_e queue to be
freed after transmission.
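
     The choice between copying and remapping on output can be
sketched as a small predicate.  The page size constant, the
enumeration and the function name are assumptions made for the
example; the rule itself (at least half a page, starting on a
page boundary) is the one stated above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 1024u   /* hypothetical page size in bytes */

enum xfer_mode {
    XFER_COPY,    /* copy data into the pages already mapped */
    XFER_REMAP    /* swap page table entries to reference the data */
};

/* Pick the transfer method for `len` bytes starting at `addr`. */
enum xfer_mode sketch_write_mode(uintptr_t addr, size_t len)
{
    int on_boundary = (addr % SKETCH_PAGE_SIZE) == 0;
    int half_page   = len >= SKETCH_PAGE_SIZE / 2;

    assert(len > 0);
    return (on_boundary && half_page) ? XFER_REMAP : XFER_COPY;
}
```

Remapping avoids a copy but is worthwhile only when enough data
sits on a page boundary; everything else takes the copy path.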

     When read transfers give at least half a page  of  data
to  be  input, page frames are allocated from a network page
list and traded with the pages already containing the  data,
mapping  the  allocated pages to replace the input pages for
the next UNIBUS data input.

     The following utility routines are available for use in
writing  network  interface  drivers; all use the structures
described above.

if_ubaminit(ifubinfo, uban, hlen, nmr, ifr, nr, ifx, nx);
if_ubainit(ifuba, uban, hlen, nmr);

     _i_f___u_b_a_m_i_n_i_t allocates resources on UNIBUS adapter _u_b_a_n,
     storing the information in the _i_f_u_b_i_n_f_o, _i_f_r_w and _i_f_x_m_t
     structures referenced.  The _i_f_r and _i_f_x parameters  are
     pointers  to  arrays of _i_f_r_w and _i_f_x_m_t structures whose
     dimensions are _n_r and _n_x, respectively.  _i_f___u_b_a_i_n_i_t  is
     a  simpler,  backwards-compatible  interface  used  for
     hardware with single buffers of each  type.   They  are
     called  only at boot time or after a UNIBUS reset.  One
     data path (buffered or  unbuffered,  depending  on  the
     _i_f_u___f_l_a_g_s field) is allocated for each buffer.  The _n_m_r
     parameter indicates the number of UNIBUS mapping regis-
     ters  required  to  map a maximal sized packet onto the
     UNIBUS, while _h_l_e_n specifies the size of a  local  net-
     work  header, if any, which should be mapped separately
     from the data (see the description of trailer protocols
     in  chapter  14).   Sufficient UNIBUS mapping registers
     and pages of memory are  allocated  to  initialize  the
     input  data  path  for an initial read.  For the output
     data path, mapping registers and pages  of  memory  are
     also  allocated  and mapped onto the UNIBUS.  The pages
     associated with  the  output  data  path  are  held  in
     reserve in the event a write requires copying non-page-
     aligned data (see _i_f___w_u_b_a_p_u_t below).  If _i_f___u_b_a_i_n_i_t  is
     called  with  memory pages already allocated, they will
     be used instead of allocating new ones  (this  normally
     occurs  after  a  UNIBUS  reset).  A 1 is returned when
     allocation and initialization are successful, 0  other-
     wise.

m = if_ubaget(ifubinfo, ifr, totlen, off0, ifp);
m = if_rubaget(ifuba, totlen, off0, ifp);

     _i_f___u_b_a_g_e_t  and  _i_f___r_u_b_a_g_e_t  pull  input  data out of an
     interface receive buffer and into an mbuf  chain.   The
     first  interface passes pointers to the _i_f_u_b_i_n_f_o struc-
     ture for the interface and the _i_f_r_w structure  for  the
     receive buffer; the second call may be used for single-
     buffered devices.  _t_o_t_l_e_n specifies the length of  data
     to  be obtained, not counting the local network header.
     If _o_f_f_0 is non-zero, it indicates a byte  offset  to  a
     trailing  local  network  header which should be copied
     into a separate mbuf and prepended to the front of  the
     resultant mbuf chain.  When the data amount to at least
     half a page,  the  previously  mapped  data  pages  are
     remapped  into  the mbufs and swapped with fresh pages,
     thus avoiding any copy.   The  receiving  interface  is
     recorded  as  _i_f_p, a pointer to an _i_f_n_e_t structure, for
     the use of the receiving network protocol.  A 0  return
     value indicates a failure to allocate resources.

if_ubaput(ifubinfo, ifx, m);
if_wubaput(ifuba, m);

     _i_f___u_b_a_p_u_t  and  _i_f___w_u_b_a_p_u_t  map a chain of mbufs onto a
     network interface in preparation for output.  The first
     interface  is  used  by  devices with multiple transmit
     buffers.  The chain includes any local network  header,
     which  is  copied  so that it resides in the mapped and
     aligned I/O space.  Page-aligned data  that  are  page-
     aligned  in  the output buffer are mapped to the UNIBUS
     in place of the normal buffer page, and the correspond-
     ing  mbuf is placed on a queue to be freed after trans-
     mission.  Any other  mbufs  which  contained  non-page-
     sized  data  portions  are  copied to the I/O space and
     then freed.  Pages mapped from a previous output opera-
     tion (no longer needed) are unmapped.

77..  SSoocckkeett//pprroottooccooll iinntteerrffaaccee

     The  interface between the socket routines and the com-
munication  protocols  is  through  the  _p_r___u_s_r_r_e_q   routine
defined   in  the  protocol  switch  table.   The  following
requests to a protocol module are possible:

     #define PRU_ATTACH        0      /* attach protocol */
     #define PRU_DETACH        1      /* detach protocol */
     #define PRU_BIND          2      /* bind socket to address */
     #define PRU_LISTEN        3      /* listen for connection */
     #define PRU_CONNECT       4      /* establish connection to peer */
     #define PRU_ACCEPT        5      /* accept connection from peer */
     #define PRU_DISCONNECT    6      /* disconnect from peer */
     #define PRU_SHUTDOWN      7      /* won't send any more data */
     #define PRU_RCVD          8      /* have taken data; more room now */
     #define PRU_SEND          9      /* send this data */
     #define PRU_ABORT         10     /* abort (fast DISCONNECT, DETACH) */
     #define PRU_CONTROL       11     /* control operations on protocol */
     #define PRU_SENSE         12     /* return status into m */
     #define PRU_RCVOOB        13     /* retrieve out of band data */
     #define PRU_SENDOOB       14     /* send out of band data */
     #define PRU_SOCKADDR      15     /* fetch socket's address */
     #define PRU_PEERADDR      16     /* fetch peer's address */
     #define PRU_CONNECT2      17     /* connect two sockets */
     /* begin for protocols internal use */
     #define PRU_FASTTIMO      18     /* 200ms timeout */
     #define PRU_SLOWTIMO      19     /* 500ms timeout */
     #define PRU_PROTORCV      20     /* receive from below */
     #define PRU_PROTOSEND     21     /* send to below */

A call on the user request routine is of the form,

     error = (*protosw[].pr_usrreq)(so, req, m, addr, rights);
     int error; struct socket *so; int req; struct mbuf *m, *addr, *rights;

The mbuf data chain _m is supplied for output operations  and
for  certain  other  operations  where  it  is  to receive a
result.  The address _a_d_d_r is supplied  for  address-oriented
requests  such  as  PRU_BIND  and  PRU_CONNECT.   The _r_i_g_h_t_s
parameter is an optional pointer to an mbuf chain containing
user-specified  capabilities  (see  the  _s_e_n_d_m_s_g and _r_e_c_v_m_s_g
system calls).  The protocol is responsible for disposal  of
the  data  mbuf  chains  on  output  operations.  A non-zero
return value gives a  UNIX  error  number  which  should  be
passed  to  higher level software.  The following paragraphs
describe each of the requests possible.

PRU_ATTACH
     When a protocol is bound to a socket (with  the  _s_o_c_k_e_t
     system  call)  the  protocol module is called with this
     request.  It is the responsibility of the protocol mod-
     ule   to   allocate   any   resources  necessary.   The
     ``attach'' request will always precede any of the other
     requests, and should not occur more than once.

PRU_DETACH
     This  is  the  antithesis of the attach request, and is
     used at the time a socket  is  deleted.   The  protocol
     module  may  deallocate  any  resources assigned to the
     socket.

PRU_BIND
     When a socket is initially created it  has  no  address
     bound  to  it.   This request indicates that an address
     should be bound to an existing  socket.   The  protocol
     module  must verify that the requested address is valid
     and available for use.

PRU_LISTEN
     The ``listen'' request indicates  the  user  wishes  to
     listen  for incoming connection requests on the associ-
     ated socket.  The protocol module  should  perform  any
     state changes needed to carry out this request (if pos-
     sible).  A ``listen'' request always precedes a request
     to accept a connection.

PRU_CONNECT
     The ``connect'' request indicates the  user  wants  to
     establish an association.  The _a_d_d_r parameter  supplied
     describes the peer to be connected to.  The effect of a
     connect request may vary  depending  on  the  protocol.
     Virtual circuit protocols, such as TCP [Postel81b], use
     this request to initiate establishment of a TCP connec-
     tion.  Datagram protocols, such as UDP [Postel80], sim-
     ply record the peer's address in a private data  struc-
     ture and use it to tag all outgoing packets.  There are
     no restrictions on how many times a connect request may
     be  used  after  an attach.  If a protocol supports the
     notion of _m_u_l_t_i_-_c_a_s_t_i_n_g, it is possible to use multiple
     connects  to  establish  a  multi-cast group.  Alterna-
     tively, an association may be broken by  a  PRU_DISCON-
     NECT request, and a new association created with a sub-
     sequent connect request;  all  without  destroying  and
     creating a new socket.

PRU_ACCEPT
     Following  a  successful  PRU_LISTEN  request  and  the
     arrival of one or more  connections,  this  request  is
     made  to  indicate the user has accepted the first con-
     nection on the queue of pending connections.  The  pro-
     tocol module should fill in the supplied address buffer
     with the address of the connected party.

PRU_DISCONNECT
     Eliminate an association  created  with  a  PRU_CONNECT
     request.

PRU_SHUTDOWN
     This call is used to indicate no more data will be sent
     and/or  received  (the  _a_d_d_r  parameter  indicates  the
     direction of the shutdown, as encoded in  the  _s_h_u_t_d_o_w_n
     system call).  The protocol  may,  at  its  discretion,
     deallocate  any data structures related to the shutdown
     and/or notify a connected peer of the shutdown.

PRU_RCVD
     This request is made only if the protocol entry in  the
     protocol  switch  table  includes the PR_WANTRCVD flag.
     When a user removes data from the  receive  queue  this
     request will be sent to the protocol module.  It may be
     used to  trigger  acknowledgements,  refresh  windowing
     information, initiate data transfer, etc.

PRU_SEND
     Each  user  request to send data is translated into one
     or more PRU_SEND requests (a protocol may indicate that
     a  single  user  send request must be translated into a
     single PRU_SEND request  by  specifying  the  PR_ATOMIC
     flag in its protocol description).  The data to be sent
     is presented to the protocol as a list of mbufs and  an
     address is, optionally, supplied in the _a_d_d_r parameter.
     The protocol is responsible for preserving the data  in
     the  socket's  send  queue if it is not able to send it
     immediately, or if it may need it at  some  later  time
     (e.g. for retransmission).

PRU_ABORT
     This  request indicates an abnormal termination of ser-
     vice.  The protocol should delete any existing associa-
     tion(s).

PRU_CONTROL
     The  ``control''  request is generated when a user per-
     forms a UNIX _i_o_c_t_l system call on  a  socket  (and  the
     ioctl  is  not intercepted by the socket routines).  It
     allows protocol-specific operations to be provided out-
     side  the  scope  of  the common socket interface.  The
     _a_d_d_r parameter contains a pointer to  a  static  kernel
     data area where relevant information may be obtained or
     returned.  The _m parameter contains  the  actual  _i_o_c_t_l
     request  code  (note  the  non-standard calling conven-
     tion).  The _r_i_g_h_t_s parameter contains a pointer  to  an









SMM:18-24                    Networking Implementation Notes


     _i_f_n_e_t  structure  if  the _i_o_c_t_l operation pertains to a
     particular network interface.

PRU_SENSE
     The ``sense'' request is generated when the user  makes
     an _f_s_t_a_t system call on a socket; it requests status of
     the associated socket.  This currently returns a  stan-
     dard  _s_t_a_t  structure.   It typically contains only the
     optimal transfer size  for  the  connection  (based  on
     buffer  size,  windowing information and maximum packet
     size).  The _m parameter contains a pointer to a  static
     kernel  data  area  where  the  status buffer should be
     placed.

PRU_RCVOOB
     Any ``out-of-band'' data presently available is  to  be
     returned.   An  mbuf  is passed to the protocol module,
     and the protocol should either place data in  the  mbuf
     or  attach  new  mbufs  to the one supplied if there is
     insufficient space in the single mbuf.  An error may be
     returned  if out-of-band data is not (yet) available or
     has already been consumed.  The _a_d_d_r parameter contains
     any  options  such  as MSG_PEEK to examine data without
     consuming it.

PRU_SENDOOB
     Like PRU_SEND, but for out-of-band data.

PRU_SOCKADDR
     The local address of the socket is returned, if any  is
     currently bound to it.  The address (with protocol spe-
     cific format) is returned in the _a_d_d_r parameter.

PRU_PEERADDR
     The address of the peer to which  the  socket  is  con-
     nected  is returned.  The socket must be in a SS_ISCON-
     NECTED state for this request to be made to the  proto-
     col.    The   address  format  (protocol  specific)  is
     returned in the _a_d_d_r parameter.

PRU_CONNECT2
     The  protocol  module  is  supplied  two  sockets   and
     requested  to  establish  a  connection between the two
     without binding any addresses, if possible.  This  call
     is  used in implementing the _s_o_c_k_e_t_p_a_i_r(2) system call.

     The following requests are used internally by the  pro-
tocol  modules  and  are  never generated by the socket rou-
tines.  In certain instances, they are handed to the _p_r___u_s_r_-
_r_e_q  routine  solely for convenience in tracing a protocol's
operation (e.g. PRU_SLOWTIMO).

PRU_FASTTIMO
     A ``fast timeout'' has occurred.  This request is  made
     when a timeout occurs  in  the  protocol's  _p_r___f_a_s_t_t_i_m_o
     routine.   The  _a_d_d_r  parameter  indicates  which timer
     expired.

PRU_SLOWTIMO
     A  ``slow timeout'' has occurred.  This request is made
     when a timeout occurs  in  the  protocol's  _p_r___s_l_o_w_t_i_m_o
     routine.   The  _a_d_d_r  parameter  indicates  which timer
     expired.

PRU_PROTORCV
     This request is used in the protocol-protocol interface,
     not by the socket routines.  It requests  reception  of
     data destined for the protocol and not  the  user.   No
     protocols currently use this facility.

PRU_PROTOSEND
     This  request  allows  a protocol to send data destined
     for another protocol module, not a user.   The  details
     of how data is marked ``addressed to protocol'' instead
     of ``addressed to user'' are left to the protocol  mod-
     ules.  No protocols currently use this facility.
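
     The shape of a user request routine can be sketched as a
switch on the request code.  The toy datagram protocol below is
an illustration only: the structures socket_sketch and toy_pcb,
the use of a plain integer in place of an address mbuf, the
three-argument signature, and the particular error numbers
returned are all assumptions made for the example.  Only the
PRU_ request values are taken from the table in this section.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

#define PRU_ATTACH      0
#define PRU_DETACH      1
#define PRU_BIND        2
#define PRU_CONNECT     4
#define PRU_DISCONNECT  6

/* Hypothetical per-socket protocol state. */
struct toy_pcb {
    int bound, laddr;        /* local address, if bound */
    int connected, faddr;    /* peer address, if connected */
};

/* Hypothetical, cut-down socket. */
struct socket_sketch {
    struct toy_pcb *so_pcb;
};

/* Simplified user request routine for a datagram-style protocol. */
int toy_usrreq(struct socket_sketch *so, int req, const int *addr)
{
    struct toy_pcb *pcb;

    assert(so != NULL);
    pcb = so->so_pcb;
    if (pcb == NULL && req != PRU_ATTACH)
        return EINVAL;               /* attach must come first */

    switch (req) {
    case PRU_ATTACH:                 /* allocate resources, once only */
        if (pcb != NULL)
            return EINVAL;
        so->so_pcb = calloc(1, sizeof(*pcb));
        return so->so_pcb != NULL ? 0 : ENOBUFS;
    case PRU_DETACH:                 /* release resources */
        free(pcb);
        so->so_pcb = NULL;
        return 0;
    case PRU_BIND:                   /* record the local address */
        if (addr == NULL || pcb->bound)
            return EINVAL;
        pcb->laddr = *addr;
        pcb->bound = 1;
        return 0;
    case PRU_CONNECT:                /* datagram: just note the peer */
        if (addr == NULL)
            return EINVAL;
        pcb->faddr = *addr;
        pcb->connected = 1;
        return 0;
    case PRU_DISCONNECT:             /* break the association */
        if (!pcb->connected)
            return ENOTCONN;
        pcb->connected = 0;
        return 0;
    default:
        return EOPNOTSUPP;           /* request not handled here */
    }
}
```

Because the toy protocol is datagram-style, PRU_CONNECT may be
issued repeatedly after a single attach; each call simply
replaces the recorded peer.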

88..  PPrroottooccooll//pprroottooccooll iinntteerrffaaccee

     The  interface  between protocol modules is through the
_p_r___u_s_r_r_e_q, _p_r___i_n_p_u_t, _p_r___o_u_t_p_u_t, _p_r___c_t_l_i_n_p_u_t, and  _p_r___c_t_l_o_u_t_-
_p_u_t  routines.   The  calling  conventions  for  all but the
_p_r___u_s_r_r_e_q routine are expected to be specific to the  proto-
col  modules  and are not guaranteed to be consistent across
protocol families.  We will examine the conventions used for
some  of  the Internet protocols in this section as an exam-
ple.

88..11..  pprr__oouuttppuutt

     The Internet protocol UDP uses the convention,

     error = udp_output(inp, m);
     int error; struct inpcb *inp; struct mbuf *m;

where the _i_n_p, ``_i_nternet _protocol _control  _block'',  passed
between  modules  conveys  per connection state information,
and the mbuf chain contains the data to be sent.   UDP  per-
forms  consistency  checks, appends its header, calculates a
checksum, etc. before passing the packet on.  UDP  is  based
on  the Internet Protocol, IP [Postel81a], as its transport.
UDP passes a packet to the IP module for output as follows:

     error = ip_output(m, opt, ro, flags);
     int error; struct mbuf *m, *opt; struct route *ro; int flags;

     The call to IP's output  routine  is  more  complicated
than that for UDP, as befits the additional work the IP mod-
ule must do.  The _m parameter is the data to  be  sent,  and
the  _o_p_t  parameter  is an optional list of IP options which
should be placed in the IP packet header.  The _r_o  parameter
is used in making  routing  decisions  (and  passing  them
back to the caller for use in subsequent calls).  The  final
parameter, _f_l_a_g_s, contains flags indicating whether the user
is allowed to transmit a broadcast packet and if routing  is
to  be performed.  The broadcast flag may be inconsequential
if the underlying hardware does not support  the  notion  of
broadcasting.

     All  output  routines  return  0  on success and a UNIX
error number if a failure occurred which could  be  detected
immediately (no buffer space available, no route to destina-
tion, etc.).
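
     The layering of output routines and their error returns
can be sketched with two toy routines.  Everything here is
illustrative: pkt_sketch is a flat buffer standing in for an
mbuf chain, toy_udp_output and toy_ip_output are simplified
stand-ins with invented signatures, and the header length and
error numbers are assumptions.  The sketch also shows the rule
that the lowest level reached on output disposes of the buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define TOY_HDRLEN   8    /* hypothetical transport header size */
#define TOY_MAXDATA 64    /* hypothetical buffer capacity */

/* Flat buffer standing in for an mbuf chain. */
struct pkt_sketch {
    size_t len;
    unsigned char data[TOY_MAXDATA];
};

int pkts_freed;   /* illustration only: counts disposed buffers */

/* Lowest level: report an immediate failure, then free the buffer. */
int toy_ip_output(struct pkt_sketch *p, int have_route)
{
    int error = 0;

    assert(p != NULL);
    if (!have_route)
        error = EHOSTUNREACH;        /* no route to destination */
    /* A real routine would pass the packet to an interface here. */
    free(p);
    pkts_freed++;
    return error;
}

/* Upper level: build the packet, then pass it down. */
int toy_udp_output(const void *payload, size_t len, int have_route)
{
    struct pkt_sketch *p;

    if (len > TOY_MAXDATA - TOY_HDRLEN)
        return EMSGSIZE;             /* consistency check failed */
    p = malloc(sizeof(*p));
    if (p == NULL)
        return ENOBUFS;              /* no buffer space available */
    memset(p->data, 0, TOY_HDRLEN);  /* placeholder header */
    memcpy(p->data + TOY_HDRLEN, payload, len);
    p->len = TOY_HDRLEN + len;
    return toy_ip_output(p, have_route);
}
```

The upper routine returns whatever the lower routine reports, so
a caller sees 0 on success and an error number otherwise.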

88..22..  pprr__iinnppuutt

     Both UDP and TCP use the following calling convention,

     (void) (*protosw[].pr_input)(m, ifp);
     struct mbuf *m; struct ifnet *ifp;

Each mbuf list passed is a single packet to be processed  by
the  protocol  module.   The interface from which the packet
was received is passed as the second parameter.

     The IP input routine is a VAX software interrupt  level
routine,  and  so  is  not  called  with any parameters.  It
instead  communicates  with  network  interfaces  through  a
queue, ipintrq, which is identical in structure to the
queues used by the network interfaces  for  storing  packets
awaiting transmission.  The software interrupt is enabled by
the network interfaces when they place  input  data  on  the
input queue.

88..33..  pprr__ccttlliinnppuutt

     This  routine is used to convey ``control'' information
to a protocol module (i.e. information which might be passed
to the user, but is not data).

     The common calling convention for this routine is,

     (void) (*protosw[].pr_ctlinput)(req, addr);
     int req; struct sockaddr *addr;

The _r_e_q parameter is one of the following,













Networking Implementation Notes                    SMM:18-27


     #define  PRC_IFDOWN             0       /* interface transition */
     #define  PRC_ROUTEDEAD          1       /* select new route if possible */
     #define  PRC_QUENCH             4       /* someone said to slow down */
     #define  PRC_MSGSIZE            5       /* message size forced drop */
     #define  PRC_HOSTDEAD           6       /* normally from IMP */
     #define  PRC_HOSTUNREACH        7       /* ditto */
     #define  PRC_UNREACH_NET        8       /* no route to network */
     #define  PRC_UNREACH_HOST       9       /* no route to host */
     #define  PRC_UNREACH_PROTOCOL   10      /* dst says bad protocol */
     #define  PRC_UNREACH_PORT       11      /* bad port # */
     #define  PRC_UNREACH_NEEDFRAG   12      /* IP_DF caused drop */
     #define  PRC_UNREACH_SRCFAIL    13      /* source route failed */
     #define  PRC_REDIRECT_NET       14      /* net routing redirect */
     #define  PRC_REDIRECT_HOST      15      /* host routing redirect */
     #define  PRC_REDIRECT_TOSNET    16      /* redirect for type of service & net */
     #define  PRC_REDIRECT_TOSHOST   17      /* redirect for tos & host */
     #define  PRC_TIMXCEED_INTRANS   18      /* packet lifetime expired in transit */
     #define  PRC_TIMXCEED_REASS     19      /* lifetime expired on reass q */
     #define  PRC_PARAMPROB          20      /* header incorrect */

while the addr parameter is the address to which the condi-
tion applies.  Many of the requests have obviously been
derived  from  ICMP  (the  Internet Control Message Protocol
[Postel81c]), and from error messages defined  in  the  1822
host/IMP  convention  [BBN78].  Mapping tables exist to con-
vert control requests to UNIX error codes which  are  deliv-
ered to a user.
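
     As a rough illustration of such a mapping table, the
sketch below indexes an array by request code and yields a
UNIX error number, or zero when there is nothing to report.
The names ctl_errmap and prc_to_errno, the value of PRC_NCMDS,
and the particular error numbers chosen are assumptions made
for this example; they are not taken from the kernel source.

```c
#include <errno.h>

/* Request codes copied from the list above (subset only). */
#define PRC_MSGSIZE           5
#define PRC_HOSTDEAD          6
#define PRC_HOSTUNREACH       7
#define PRC_UNREACH_NET       8
#define PRC_UNREACH_HOST      9
#define PRC_UNREACH_PROTOCOL  10
#define PRC_UNREACH_PORT      11
#define PRC_UNREACH_NEEDFRAG  12
#define PRC_UNREACH_SRCFAIL   13
#define PRC_PARAMPROB         20
#define PRC_NCMDS             21   /* assumed: one past the largest code */

/*
 * Illustrative mapping table, indexed by request code.  Entries left
 * out are zero, meaning "nothing to report to the user".  The error
 * numbers chosen here are assumptions, not the kernel's own table.
 */
static const int ctl_errmap[PRC_NCMDS] = {
    [PRC_MSGSIZE]          = EMSGSIZE,
    [PRC_HOSTDEAD]         = EHOSTUNREACH,
    [PRC_HOSTUNREACH]      = EHOSTUNREACH,
    [PRC_UNREACH_NET]      = ENETUNREACH,
    [PRC_UNREACH_HOST]     = EHOSTUNREACH,
    [PRC_UNREACH_PROTOCOL] = ECONNREFUSED,
    [PRC_UNREACH_PORT]     = ECONNREFUSED,
    [PRC_UNREACH_NEEDFRAG] = EMSGSIZE,
    [PRC_UNREACH_SRCFAIL]  = EHOSTUNREACH,
    [PRC_PARAMPROB]        = ENOPROTOOPT,
};

/* Hypothetical helper: translate a control request into an error number. */
int
prc_to_errno(int req)
{
    if (req < 0 || req >= PRC_NCMDS)
        return 0;               /* unknown request: report nothing */
    return ctl_errmap[req];
}
```

A protocol's pr_ctlinput routine could consult such a table
and, when the result is non-zero, record it as the pending
error on the sockets affected by the address in question.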

88..44..  pprr__ccttlloouuttppuutt

     This  is the routine that implements per-socket options
at the protocol level for getsockopt and setsockopt.  The
calling convention is,

     error = (*protosw[].pr_ctloutput)(op, so, level, optname, mp);
     int op; struct socket *so; int level, optname; struct mbuf **mp;

where op is one of PRCO_SETOPT or PRCO_GETOPT, so is the
socket from which the call originated, and level and optname
are the protocol level and option name supplied by the
user.  The results of a PRCO_GETOPT call are returned in an
mbuf whose address is placed in mp before return.  On a
PRCO_SETOPT call, mp contains the address of an mbuf con-
taining the option data; the mbuf should be freed before
return.

99..  PPrroottooccooll//nneettwwoorrkk--iinntteerrffaaccee iinntteerrffaaccee

     The lowest layer in the set of protocols which comprise
a  protocol family must interface itself to one or more net-
work interfaces in order to transmit  and  receive  packets.
It is assumed that any routing decisions have been made
before handing a packet to a network interface; in fact, this
is absolutely necessary in order to locate any interface at
all (unless, of course, one uses a single ``hardwired''
interface).  There are two cases with which to be concerned,
transmission of a packet and receipt of a packet; each  will
be considered separately.

99..11..  PPaacckkeett ttrraannssmmiissssiioonn

     Assuming a protocol has a handle on an interface, ifp,
a (struct ifnet *), it transmits a  fully  formatted  packet
with the following call,

     error = (*ifp->if_output)(ifp, m, dst);
     int error; struct ifnet *ifp; struct mbuf *m; struct sockaddr *dst;

The  output  routine for the network interface transmits the
packet _m to the _d_s_t address, or returns an error  indication
(a  UNIX  error number).  In reality transmission may not be
immediate or successful; normally the output routine  simply
queues  the packet on its send queue and primes an interrupt
driven routine to actually transmit the packet.  For unreli-
able  media,  such as the Ethernet, ``successful'' transmis-
sion simply means that the packet has  been  placed  on  the
cable  without  a  collision.   On  the  other hand, an 1822
interface guarantees proper delivery or an error  indication
for  each  message  transmitted.   The model employed in the
networking system attaches no promises of  delivery  to  the
packets  handed to a network interface, and thus corresponds
more closely to the Ethernet.  Errors returned by the output
routine are only those that can be detected immediately, and
are normally trivial in nature  (no  buffer  space,  address
format  not  handled,  etc.).   No indication is received if
errors are detected after the call has returned.

99..22..  PPaacckkeett rreecceeppttiioonn

     Each protocol family must have  one  or  more  ``lowest
level''  protocols.   These protocols deal with internetwork
addressing and are responsible for the delivery of  incoming
packets  to  the proper protocol processing modules.  In the
PUP model [Boggs78] these protocols are termed Level 1 pro-
tocols; in the ISO model, network layer protocols.  In this
system each such protocol module has an input  packet  queue
assigned  to  it.   Incoming  packets  received by a network
interface are queued for the  protocol  module,  and  a  VAX
software interrupt is posted to initiate processing.

     Four macros are available for queuing and dequeuing
packets:

IF_ENQUEUE(ifq, m)
     This places the packet _m at the tail of the queue  _i_f_q.

IF_DEQUEUE(ifq, m)
     This  places  a  pointer  to  the packet at the head of
     queue _i_f_q in _m and removes the packet from  the  queue.
     A  zero  value  will  be  returned in _m if the queue is
     empty.

IF_DEQUEUEIF(ifq, m, ifp)
     Like IF_DEQUEUE, this removes the next packet from  the
     head  of a queue and returns it in _m.  A pointer to the
     interface on which the packet was received is placed in
     _i_f_p, a (struct ifnet *).

IF_PREPEND(ifq, m)
     This  places the packet _m at the head of the queue _i_f_q.

     Each queue has a maximum length associated with it as a
simple  form of congestion control.  The macro IF_QFULL(ifq)
returns 1 if the queue is filled, in which  case  the  macro
IF_DROP(ifq)  should  be  used to increment the count of the
number of packets  dropped,  and  the  offending  packet  is
dropped.   For  example, the following code fragment is com-
monly found in a network interface's input routine,

     if (IF_QFULL(inq)) {
            IF_DROP(inq);
            m_freem(m);
     } else
            IF_ENQUEUE(inq, m);
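
     To make the queue discipline concrete, the following is
a minimal sketch of a bounded packet queue with the same
operations, written as ordinary functions rather than macros.
The types toy_pkt and toy_ifqueue and all toy_ names are
hypothetical stand-ins for the mbuf and interface queue
structures; only the behavior described above is modeled.

```c
#include <stddef.h>

/* Toy packet: only the per-packet queue link is modeled. */
struct toy_pkt {
    struct toy_pkt *nextpkt;    /* next packet on the queue */
    int             id;         /* stand-in for the packet contents */
};

struct toy_ifqueue {
    struct toy_pkt *head;
    struct toy_pkt *tail;
    int             len;        /* packets currently queued */
    int             maxlen;     /* bound used for congestion control */
    int             drops;      /* packets discarded because queue was full */
};

#define TOY_QFULL(q)  ((q)->len >= (q)->maxlen)
#define TOY_DROP(q)   ((q)->drops++)

/* Place m at the tail of the queue (the IF_ENQUEUE operation). */
void
toy_enqueue(struct toy_ifqueue *q, struct toy_pkt *m)
{
    m->nextpkt = NULL;
    if (q->tail == NULL)
        q->head = m;
    else
        q->tail->nextpkt = m;
    q->tail = m;
    q->len++;
}

/* Place m at the head of the queue (the IF_PREPEND operation). */
void
toy_prepend(struct toy_ifqueue *q, struct toy_pkt *m)
{
    m->nextpkt = q->head;
    if (q->tail == NULL)
        q->tail = m;
    q->head = m;
    q->len++;
}

/* Remove and return the head packet, or NULL if the queue is empty. */
struct toy_pkt *
toy_dequeue(struct toy_ifqueue *q)
{
    struct toy_pkt *m = q->head;

    if (m != NULL) {
        q->head = m->nextpkt;
        if (q->head == NULL)
            q->tail = NULL;
        m->nextpkt = NULL;
        q->len--;
    }
    return m;
}

/* The input-routine pattern from the text: 1 if queued, 0 if dropped. */
int
toy_input(struct toy_ifqueue *q, struct toy_pkt *m)
{
    if (TOY_QFULL(q)) {
        TOY_DROP(q);            /* caller is expected to free m */
        return 0;
    }
    toy_enqueue(q, m);
    return 1;
}
```

With maxlen set to 2, a third call to toy_input is refused
and counted in drops, while the two queued packets are later
returned by toy_dequeue in arrival order.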


1100..  GGaatteewwaayyss aanndd rroouuttiinngg iissssuueess

     The system has been designed with the expectation  that
it  will  be  used  in  an  internetwork  environment.   The
``canonical'' environment was envisioned to be a  collection
of  local  area  networks  connected  at  one or more points
through hosts with multiple network interfaces (one on  each
local  area  network),  and  possibly a connection to a long
haul network (for example, the ARPANET).  In such  an  envi-
ronment, issues of gatewaying and packet routing become very
important.  Certain of these issues, such as congestion con-
trol,  have  been handled in a simplistic manner or specifi-
cally not addressed.  Instead, where possible,  the  network
system attempts to provide simple mechanisms upon which more
involved policies may be  implemented.   As  some  of  these
problems  become  better understood, the solutions developed
will be incorporated into the system.

     This section will describe the facilities provided  for
packet routing.  The simplistic mechanisms provided for con-
gestion control are described in chapter 12.

1100..11..  RRoouuttiinngg ttaabblleess

     The network system maintains a set  of  routing  tables
for  selecting  a  network  interface to use in delivering a
packet to its destination.  These tables are of the form:

     struct rtentry {
              u_long   rt_hash;                /* hash key for lookups */
              struct   sockaddr rt_dst;        /* destination net or host */
              struct   sockaddr rt_gateway;    /* forwarding agent */
              short    rt_flags;               /* see below */
              short    rt_refcnt;              /* no. of references to structure */
              u_long   rt_use;                 /* packets sent using route */
              struct   ifnet *rt_ifp;          /* interface to give packet to */
     };


     The routing information is organized  in  two  separate
tables,  one  for  routes  to a host and one for routes to a
network.  The distinction between hosts and networks is nec-
essary  so  that  a  single  mechanism  may be used for both
broadcast and multi-drop type networks, and also for net-
works built from point-to-point links (e.g., DECnet [DEC80]).

     Each table is organized  as  a  hashed  set  of  linked
lists.   Two  32-bit  hash values are calculated by routines
defined for each address family; one based on  the  destina-
tion  being  a host, and one assuming the target is the net-
work portion of the address.  Each hash  value  is  used  to
locate  a  hash  chain to search (by taking the value modulo
the hash table size) and the entire  32-bit  value  is  then
used  as  a key in scanning the list of routes.  Lookups are
applied first to the routing table for hosts,  then  to  the
routing  table  for networks.  If both lookups fail, a final
lookup is made for a ``wildcard'' route (by convention, net-
work  0).   The  first appropriate route discovered is used.
By doing this, routes to a specific host on a network may be
present  as well as routes to the network.  This also allows
a ``fall back'' network route to be defined to  a  ``smart''
gateway which may then perform more intelligent routing.
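
     The lookup order just described can be sketched as
follows.  This is a simplified model, not the kernel code: it
assumes 32-bit addresses whose upper 24 bits form the network
number, uses trivial hash functions, and the names toy_rt,
toy_add_host, toy_add_net and toy_lookup are invented for the
example.

```c
#include <stddef.h>
#include <stdint.h>

#define RTHASHSIZ 8                 /* assumed hash table size */

struct toy_rt {
    uint32_t       hash;            /* full 32-bit hash value */
    uint32_t       dst;             /* host address or network number */
    uint32_t       gateway;         /* forwarding agent */
    struct toy_rt *next;            /* next entry on the hash chain */
};

static struct toy_rt *rthost[RTHASHSIZ];    /* routes to hosts */
static struct toy_rt *rtnet[RTHASHSIZ];     /* routes to networks */

/* Assumption: the network number is the upper 24 bits of an address. */
static uint32_t netof(uint32_t addr) { return addr >> 8; }

/* Trivial stand-ins for the per-address-family hash routines. */
static uint32_t hash_host(uint32_t addr) { return addr; }
static uint32_t hash_net(uint32_t addr)  { return netof(addr); }

static void
chain_insert(struct toy_rt **table, struct toy_rt *rt)
{
    /* Newest entry goes first, so it shadows any older duplicate. */
    rt->next = table[rt->hash % RTHASHSIZ];
    table[rt->hash % RTHASHSIZ] = rt;
}

static struct toy_rt *
chain_scan(struct toy_rt **table, uint32_t hash, uint32_t key)
{
    struct toy_rt *rt;

    /* The modulus picks the chain; the full hash is the scan key. */
    for (rt = table[hash % RTHASHSIZ]; rt != NULL; rt = rt->next)
        if (rt->hash == hash && rt->dst == key)
            return rt;
    return NULL;
}

void
toy_add_host(struct toy_rt *rt, uint32_t host, uint32_t gw)
{
    rt->hash = hash_host(host);
    rt->dst = host;
    rt->gateway = gw;
    chain_insert(rthost, rt);
}

void
toy_add_net(struct toy_rt *rt, uint32_t net, uint32_t gw)
{
    rt->hash = net;             /* equals hash_net() of any address on net */
    rt->dst = net;
    rt->gateway = gw;
    chain_insert(rtnet, rt);
}

/* Host table first, then network table, then the wildcard route. */
struct toy_rt *
toy_lookup(uint32_t dst)
{
    struct toy_rt *rt;

    if ((rt = chain_scan(rthost, hash_host(dst), dst)) != NULL)
        return rt;
    if ((rt = chain_scan(rtnet, hash_net(dst), netof(dst))) != NULL)
        return rt;
    return chain_scan(rtnet, 0, 0);     /* wildcard: network 0 */
}
```

Installing the wildcard with toy_add_net(rt, 0, gw) gives
every otherwise unmatched destination a ``fall back'' route,
while a host route added with toy_add_host takes precedence
over the route to that host's network.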

     Each  routing  table  entry contains a destination (the
desired final destination), a gateway to which to  send  the
packet,  and various flags which indicate the route's status
and type (host or network).  A count of the number of  pack-
ets  sent  using  the  route  is kept, along with a count of
``held references'' to the dynamically  allocated  structure
to ensure that memory reclamation occurs only when the route
is not in use.  Finally, a pointer to a network inter-
face is kept; packets sent using the route should be handed
to this interface.

     Routes are typed in two ways: either as  host  or  net-
work,  and  as ``direct'' or ``indirect''.  The host/network
distinction determines how to compare the rt_dst field dur-
ing lookup.  If the route is to a network, only a packet's
destination network is compared to the rt_dst entry stored
in the table.  If the route is to a host, the addresses must
match bit for bit.

     The distinction  between  ``direct''  and  ``indirect''
routes  indicates  whether  the destination is directly con-
nected to the source.  This is needed when performing  local
network  encapsulation.   If a packet is destined for a peer
at a host or network which is not directly connected to  the
source,  the  internetwork  packet  header  will contain the
address of the eventual destination, while the local network
header  will  address  the  intervening gateway.  Should the
destination  be  directly  connected,  these  addresses  are
likely to be identical, or a mapping between the two exists.
The RTF_GATEWAY flag indicates  that  the  route  is  to  an
``indirect''  gateway  agent,  and  that  the  local network
header should be filled in from the rt_gateway field instead
of from the final internetwork destination address.
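
     The effect of the RTF_GATEWAY flag on local network
encapsulation can be shown in a few lines.  In this sketch
addresses are plain 32-bit integers, the structure
toy_rtentry is a cut-down stand-in for the routing entry,
and both the flag value and the function name link_level_dst
are assumptions made for illustration.

```c
#include <stdint.h>

#define RTF_GATEWAY 0x2     /* flag bit; the value is assumed here */

/* Cut-down routing entry with integer addresses. */
struct toy_rtentry {
    uint32_t rt_dst;        /* destination net or host */
    uint32_t rt_gateway;    /* forwarding agent */
    short    rt_flags;
};

/*
 * Choose the address for the local network header.  The internetwork
 * header always carries final_dst; only the local header changes.
 */
uint32_t
link_level_dst(const struct toy_rtentry *rt, uint32_t final_dst)
{
    if (rt->rt_flags & RTF_GATEWAY)
        return rt->rt_gateway;      /* indirect: hand to the gateway */
    return final_dst;               /* direct: deliver on this network */
}
```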

     It is assumed that multiple routes to the same destina-
tion will not be present; only one of multiple routes,  that
most recently installed, will be used.

     Routing  redirect  control messages are used to dynami-
cally modify existing  routing  table  entries  as  well  as
dynamically  create  new  routing  table  entries.  On hosts
where exhaustive routing information  is  too  expensive  to
maintain (e.g., workstations), the combination of wildcard
routing entries and routing redirect messages can be used to
provide a simple routing management scheme without the use
of a higher level policy process.  Current connections may
be rerouted after notification of the protocols by means of
their pr_ctlinput entries.  Statistics are kept by the rout-
ing table routines on the use of routing redirect messages
and their effect on the routing tables.  These statistics
may be viewed using netstat(1).

     Status  information other than routing redirect control
messages may be used in the future, but at present they  are
ignored.  Likewise, more intelligent ``metrics'' may be used
to describe routes in the future, possibly  based  on  band-
width and monetary costs.

1100..22..  RRoouuttiinngg ttaabbllee iinntteerrffaaccee

     A  protocol  accesses  the routing tables through three
routines, one to allocate a route, one to free a route,  and
one to process a routing redirect control message.  The
routine rtalloc performs route allocation; it is called with
a  pointer to the following structure containing the desired
destination:

     struct route {
            struct    rtentry *ro_rt;
            struct    sockaddr ro_dst;
     };

The route returned is assumed ``held'' by the  caller  until
released with an rtfree call.  Protocols which implement
virtual circuits, such as TCP,  hold  onto  routes  for  the
duration  of  the  circuit's lifetime, while connection-less
protocols, such as UDP, allocate and  free  routes  whenever
their destination address changes.
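
     The ``held'' route convention amounts to reference
counting.  The sketch below models it with a two-entry table
and integer addresses; hold_route, drop_route, set_dst and
the structures are hypothetical names, and the real routines
take different arguments.  It shows the connectionless usage
pattern: the cached route is kept while the destination is
unchanged and exchanged when it changes.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_rtentry {
    uint32_t dst;
    int      refcnt;            /* count of held references */
};

struct toy_route {
    struct toy_rtentry *ro_rt;  /* cached route, or NULL */
    uint32_t            ro_dst; /* destination the cache is for */
};

/* Two-entry stand-in for the routing tables. */
static struct toy_rtentry table[2] = { { 1, 0 }, { 2, 0 } };

static struct toy_rtentry *
find_entry(uint32_t dst)
{
    size_t i;

    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (table[i].dst == dst)
            return &table[i];
    return NULL;
}

/* Allocate: look up ro_dst and take a reference on the result. */
void
hold_route(struct toy_route *ro)
{
    if (ro->ro_rt != NULL)
        return;                 /* already holding a route */
    ro->ro_rt = find_entry(ro->ro_dst);
    if (ro->ro_rt != NULL)
        ro->ro_rt->refcnt++;
}

/* Free: give the reference back and forget the cached route. */
void
drop_route(struct toy_route *ro)
{
    if (ro->ro_rt != NULL) {
        ro->ro_rt->refcnt--;
        ro->ro_rt = NULL;
    }
}

/* Datagram-style use: exchange routes only when dst changes. */
void
set_dst(struct toy_route *ro, uint32_t dst)
{
    if (ro->ro_rt != NULL && ro->ro_dst == dst)
        return;
    drop_route(ro);
    ro->ro_dst = dst;
    hold_route(ro);
}

/* Reference count for dst, or -1 if there is no such entry. */
int
route_refcnt(uint32_t dst)
{
    struct toy_rtentry *rt = find_entry(dst);

    return rt != NULL ? rt->refcnt : -1;
}
```

A virtual circuit protocol would instead call hold_route once
when the circuit is set up and drop_route when it is torn
down.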

     The routine rtredirect is called to process a routing
redirect control message.  It is called with  a  destination
address, the new gateway to that destination, and the source
of the redirect.  Redirects are accepted only from the  cur-
rent  router  for  the destination.  If a non-wildcard route
exists to the destination, the gateway entry in the route is
modified to point at the new gateway supplied.  Otherwise, a
new routing table entry is inserted reflecting the  informa-
tion  supplied.  Routes to interfaces and routes to gateways
which are not directly accessible from the host are ignored.

1100..33..  UUsseerr lleevveell rroouuttiinngg ppoolliicciieess

     Routing policies implemented in user processes manipu-
late the kernel routing tables through two ioctl calls.  The
commands  SIOCADDRT  and  SIOCDELRT  add  and delete routing
entries, respectively;  the  tables  are  read  through  the
/dev/kmem device.  The choice to place policy decisions in
a user process implies that routing table updates may lag a
bit behind the identification of new routes, or the failure
of existing routes, but this period of instability is nor-
mally very short with proper implementation of the routing
process.  Advisory information, such as ICMP error  messages
and  IMP  diagnostic  messages, may be read from raw sockets
(described in the next section).

     Several routing  policy  processes  have  already  been
implemented.   The system standard ``routing daemon'' uses a
variant  of  the  Xerox  NS  Routing  Information   Protocol
[Xerox82] to maintain up-to-date routing tables in our local
environment.  Interaction with other existing routing proto-
cols,  such as the Internet EGP (Exterior Gateway Protocol),
has been accomplished using a similar process.

1111..  RRaaww ssoocckkeettss

     A raw socket is an object  which  allows  users  direct
access  to a lower-level protocol.  Raw sockets are intended
for knowledgeable processes which wish to take advantage  of
some  protocol  feature  not directly accessible through the
normal interface, or for the development  of  new  protocols
built  atop  existing lower level protocols.  For example, a
new version of TCP might be developed at the user  level  by
using  a  raw IP socket for delivery of packets.  The raw IP
socket interface attempts to provide an identical  interface
to  the one a protocol would have if it were resident in the
kernel.

     The raw socket support is built around  a  generic  raw
socket  interface, (possibly) augmented by protocol-specific
processing routines.  This section will describe the core of
the raw socket interface.

1111..11..  CCoonnttrrooll bblloocckkss

     Every  raw  socket  has a protocol control block of the
following form:

     struct rawcb {
             struct   rawcb *rcb_next;        /* doubly linked list */
             struct   rawcb *rcb_prev;
             struct   socket *rcb_socket;     /* back pointer to socket */
             struct   sockaddr rcb_faddr;     /* destination address */
             struct   sockaddr rcb_laddr;     /* socket's address */
             struct   sockproto rcb_proto;    /* protocol family, protocol */
             caddr_t  rcb_pcb;                /* protocol specific stuff */
             struct   mbuf *rcb_options;      /* protocol specific options */
             struct   route rcb_route;        /* routing information */
             short    rcb_flags;
     };

All the control blocks are kept on a doubly linked list  for
performing lookups during packet dispatch.  Associations may
be recorded in the control block and used by the output rou-
tine in preparing packets for transmission.  The rcb_proto
structure contains the protocol family and protocol number
with which the raw socket is associated.  The protocol, fam-
ily and addresses are used to filter packets on input; this
will be described in more detail shortly.  If any protocol-
specific information is required, it may be attached to the
control block using the rcb_pcb field.  Protocol-specific
options for transmission in outgoing packets may be stored
in rcb_options.

     A  raw socket interface is datagram oriented.  That is,
each send or receive on the socket  requires  a  destination
address.  This address may be supplied by the user or stored
in the control block and automatically installed in the
outgoing packet by the output routine.  Since it is not pos-
sible to determine whether an address is present or not in
the control block, two flags, RAW_LADDR and RAW_FADDR, indi-
cate if a local and foreign address are present.  Routing is
expected  to be performed by the underlying protocol if nec-
essary.

1111..22..  IInnppuutt pprroocceessssiinngg

     Input packets are ``assigned'' to raw sockets based  on
a simple pattern matching scheme.  Each network interface or
protocol gives unassigned packets to the raw  input  routine
with the call:

     raw_input(m, proto, src, dst);
     struct mbuf *m; struct sockproto *proto; struct sockaddr *src, *dst;

The data packet then has a generic header prepended to it of
the form

     struct raw_header {
            struct    sockproto raw_proto;
            struct    sockaddr raw_dst;
            struct    sockaddr raw_src;
     };

and it is placed in a packet queue for the ``raw input  pro-
tocol''  module.   Packets  taken from this queue are copied
into any raw sockets that match the header according to  the
following rules,

1)   The protocol family of the socket and header agree.

2)   If  the protocol number in the socket is non-zero, then
     it agrees with that found in the packet header.

3)   If a local address  is  defined  for  the  socket,  the
     address  format of the local address is the same as the
     destination address's and the two addresses  agree  bit
     for bit.

4)   The  rules  of  3)  are applied to the socket's foreign
     address and the packet's source address.

A basic assumption is that addresses present in the  control
block  and  packet  header  (as  constructed  by the network
interface and any raw input protocol module) are in a canon-
ical form which may be ``block compared''.
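
     The four matching rules translate almost directly into
code.  The following sketch uses simplified stand-ins for
the socket address, control block and raw header structures
(all prefixed toy_), and the flag values shown for RAW_LADDR
and RAW_FADDR are assumed for the example.  The function name
raw_match is likewise hypothetical.

```c
#include <string.h>

#define RAW_LADDR 01    /* local address present (value assumed) */
#define RAW_FADDR 02    /* foreign address present (value assumed) */

struct toy_sockaddr {
    unsigned short sa_family;       /* address format */
    char           sa_data[14];     /* address in canonical form */
};

struct toy_sockproto {
    unsigned short sp_family;       /* protocol family */
    unsigned short sp_protocol;     /* protocol number, 0 = any */
};

struct toy_rawcb {
    struct toy_sockproto proto;
    struct toy_sockaddr  laddr;     /* socket's address */
    struct toy_sockaddr  faddr;     /* destination address */
    int                  flags;
};

struct toy_raw_header {
    struct toy_sockproto proto;
    struct toy_sockaddr  dst;
    struct toy_sockaddr  src;
};

/* Same address format, and the address bytes agree bit for bit. */
static int
addr_equal(const struct toy_sockaddr *a, const struct toy_sockaddr *b)
{
    return a->sa_family == b->sa_family &&
        memcmp(a->sa_data, b->sa_data, sizeof a->sa_data) == 0;
}

/* Return 1 if the packet described by rh should be copied to rp. */
int
raw_match(const struct toy_rawcb *rp, const struct toy_raw_header *rh)
{
    /* 1) protocol families agree */
    if (rp->proto.sp_family != rh->proto.sp_family)
        return 0;
    /* 2) a non-zero protocol number in the socket must agree */
    if (rp->proto.sp_protocol != 0 &&
        rp->proto.sp_protocol != rh->proto.sp_protocol)
        return 0;
    /* 3) local address, if defined, against the destination */
    if ((rp->flags & RAW_LADDR) && !addr_equal(&rp->laddr, &rh->dst))
        return 0;
    /* 4) foreign address, if defined, against the source */
    if ((rp->flags & RAW_FADDR) && !addr_equal(&rp->faddr, &rh->src))
        return 0;
    return 1;
}
```

Because the comparison is a block compare, the sketch relies
on the same assumption as the text: unused bytes of each
address must be in a canonical (here, zeroed) form.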

1111..33..  OOuuttppuutt pprroocceessssiinngg

     On output the raw pr_usrreq routine passes the packet
and a pointer to the raw control block to the raw protocol
output routine for any processing required before it is
delivered to the appropriate network interface.  The output
routine  is  normally  the only code required to implement a
raw socket interface.

1122..  BBuuffffeerriinngg aanndd ccoonnggeessttiioonn ccoonnttrrooll

     One of the major factors in the performance of a proto-
col  is the buffering policy used.  Lack of a proper buffer-
ing policy can force packets to be dropped, cause  falsified
windowing  information  to be emitted by protocols, fragment
host memory, degrade the overall host performance, etc.  Due
to  problems  such  as  these, most systems allocate a fixed
pool of memory to the networking system and impose a  policy
optimized for ``normal'' network operation.

     The networking system developed for UNIX is little dif-
ferent in this respect.  At boot time a fixed amount of mem-
ory  is  allocated by the networking system.  At later times
more system memory may be requested as the need arises,  but
at  no  time  is  memory ever returned to the system.  It is
possible to garbage collect memory  from  the  network,  but
difficult.   In  order  to  perform  this garbage collection
properly, some portion  of  the  network  will  have  to  be
``turned off'' as data structures are updated.  The interval
over which this occurs must be kept small compared to the aver-
age inter-packet arrival time, or too much traffic may be
lost, impacting other hosts on the network, as well as
increasing load on the interconnecting media.  In our
environment we have not experienced a  need  for  such  com-
paction, and thus have left the problem unresolved.

     The  mbuf  structure  was  introduced in chapter 5.  In
this section a brief description will be given of the  allo-
cation  mechanisms,  and  policies  used by the protocols in
performing connection level buffering.

1122..11..  MMeemmoorryy mmaannaaggeemmeenntt

     The basic memory allocation routines manage  a  private
page map, the size of which determines the maximum amount of
memory that may be allocated by the network.  A small amount
of  memory  is allocated at boot time to initialize the mbuf
and mbuf page cluster free lists.  When the free  lists  are
exhausted,  more  memory is requested from the system memory
allocator if space remains in the map.  If memory cannot  be
allocated,  callers  may  block awaiting free memory, or the
failure may be reflected to  the  caller  immediately.   The
allocator will not block awaiting free map entries, however,
as exhaustion of the page map usually indicates that buffers
have been lost due to a ``leak.''  The private page table is
used by the network buffer management routines in  remapping
pages  to  be  logically  contiguous as the need arises.  In
addition, an array of reference counts  parallels  the  page
table and is used when multiple references to a page are
present.

     Mbufs are 128 byte structures, 8 fitting  in  a  1Kbyte
page  of memory.  When data is placed in mbufs, it is copied
or remapped into logically contiguous pages of  memory  from
the  network  page pool if possible.  Data smaller than half
of the size of a page is copied into one or  more  112  byte
mbuf data areas.
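
     The sizes quoted above lead to some simple arithmetic,
sketched below.  The constant and function names are invented
for the example; only the figures (128 byte mbufs, 112 byte
data areas, 1Kbyte pages, and the half-page threshold) come
from the text.

```c
#define TOY_MBUF_SIZE   128     /* bytes per mbuf */
#define TOY_MBUF_DATA   112     /* bytes of data area per mbuf */
#define TOY_PAGE_BYTES  1024    /* bytes per page */

/* Number of mbufs carved from one page. */
unsigned
mbufs_per_page(void)
{
    return TOY_PAGE_BYTES / TOY_MBUF_SIZE;
}

/* Data smaller than half a page is copied into small mbufs. */
int
uses_small_mbufs(unsigned len)
{
    return len < TOY_PAGE_BYTES / 2;
}

/* Small mbufs needed to hold len bytes, rounding up. */
unsigned
small_mbufs_needed(unsigned len)
{
    return (len + TOY_MBUF_DATA - 1) / TOY_MBUF_DATA;
}
```

For instance, 500 bytes of data falls under the half-page
threshold and would occupy five data areas, the last one
only partly filled.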

1122..22..  PPrroottooccooll bbuuffffeerriinngg ppoolliicciieess

     Protocols  reserve  fixed amounts of buffering for send
and receive queues at socket creation time.   These  amounts
define  the high and low water marks used by the socket rou-
tines in deciding when to block and unblock a process.   The
reservation of space does not currently result in any action
by the memory management routines.

     Protocols which provide connection level  flow  control
do  this  based  on  the  amount  of space in the associated
socket queues.  That is, send windows are  calculated  based
on  the  amount of free space in the socket's receive queue,
while receive windows are adjusted based on  the  amount  of
data awaiting transmission in the send queue.  Care has been
taken to avoid the ``silly window  syndrome''  described  in
[Clark82] at both the sending and receiving ends.

1122..33..  QQuueeuuee lliimmiittiinngg

     Incoming  packets  from the network are always received
unless memory allocation fails.  However, each Level 1  pro-
tocol  input queue has an upper bound on the queue's length,
and any packets exceeding that bound are discarded.   It  is
possible  for  a host to be overwhelmed by excessive network
traffic (for instance a host acting as a gateway from a high
bandwidth  network  to  a  low  bandwidth  network).   As  a
``defensive'' mechanism the queue limits may be adjusted  to
throttle  network  traffic  load on a host.  Consider a host
willing to devote some percentage of its machine to handling
network traffic.  If the cost of handling an incoming packet
can be calculated so that an  acceptable  ``packet  handling
rate''  can  be  determined, then input queue lengths may be
dynamically adjusted based on a host's network load and  the
number  of packets awaiting processing.  Obviously, discard-
ing packets is not a satisfactory solution to a problem such
as  this  (simply dropping packets is likely to increase the
load on a network);  the  queue  lengths  were  incorporated
mainly as a safeguard mechanism.

1122..44..  PPaacckkeett ffoorrwwaarrddiinngg

     When packets cannot be forwarded because of memory
limitations, the system attempts to generate a ``source
quench'' message.  In addition, any other problems
encountered during packet forwarding are also reflected back
to the sender in the form of ICMP packets.  This helps hosts
avoid unneeded retransmissions.

     Broadcast packets are never forwarded due  to  possible
dire  consequences.   In  an early stage of network develop-
ment, broadcast  packets  were  forwarded  and  a  ``routing
loop''  resulted in network saturation and every host on the
network crashing.

1133..  OOuutt ooff bbaanndd ddaattaa

     Out of band data is a facility peculiar to  the  stream
socket  abstraction  defined.   Little  agreement appears to
exist as to what its semantics should be.  TCP  defines  the
notion  of  ``urgent data'' as in-line, while the NBS proto-
cols [Burruss81] and numerous others provide a  fully  inde-
pendent logical transmission channel along which out of band
data is to be sent.  In addition, the  amount  of  the  data
which may be sent as an out of band message varies from pro-
tocol to protocol; everything from 1  bit  to  16  bytes  or
more.

     A  stream  socket's notion of out of band data has been
defined as the  lowest  reasonable  common  denominator  (at
least  reasonable  in our minds); clearly this is subject to
debate.  Out of band data is expected to be transmitted  out
of the normal sequencing and flow control constraints of the
data stream.  A minimum of 1 byte of out of  band  data  and
one  outstanding out of band message are expected to be sup-
ported by the protocol supporting a stream socket.  It is  a
protocol's  prerogative to support larger-sized messages, or
more than one outstanding out of band message at a time.

     Out of band data is maintained by the protocol  and  is
usually not stored in the socket's receive queue.  A socket-
level option, SO_OOBINLINE, is provided to force out-of-band
data  to  be  placed in the normal receive queue when urgent
data is received; this sometimes ameliorates problems due
to loss of data when multiple out-of-band segments are
received before the first has been passed to the user.  The
PRU_SENDOOB and PRU_RCVOOB requests to the pr_usrreq routine
are used in sending and receiving out of band data.

1144..  TTrraaiilleerr pprroottooccoollss

     Core to core copies can be expensive.  Consequently,  a
great  deal  of  effort  was spent in minimizing such opera-
tions.  The VAX architecture provides virtual  memory  hard-
ware  organized  in  page units.  To cut down on copy opera-
tions, data is kept  in  page-sized  units  on  page-aligned
boundaries  whenever possible.  This allows data to be moved
in memory simply by remapping the page instead  of  copying.
The  mbuf  and network interface routines perform page table
manipulations where needed, hiding the complexities  of  the
VAX virtual memory hardware from higher level code.

     Data  enters  the system in two ways: from the user, or
from the network (hardware interface).  When data is  copied
from  the  user's  address  space  into  the  system  it  is
deposited in pages (if sufficient data  is  present).   This
encourages  the  user  to  transmit  information in messages
which are a multiple of the system page size.

     Unfortunately, performing a similar operation when tak-
ing  data  from the network is very difficult.  Consider the
format of an incoming packet.  A packet usually  contains  a
local network header followed by one or more headers used by
the high level protocols.  Finally, the data, if  any,  fol-
lows  these  headers.   Since  the header information may be
variable length, DMA'ing the eventual data for the user into
a page aligned area of memory is impossible without a priori
knowledge of the format (e.g., by supporting only  a  single
protocol header format).

     To allow variable-length header information to be present
and still ensure page alignment of data, a special local
network encapsulation may be used.  This encapsulation, termed
a trailer protocol [Leffler84], places the variable-length
header information after the data.  A fixed-size local network
header is then prepended to the resulting packet.  The local
network header contains the size of the data portion (in units
of 512 bytes), and a new trailer protocol header, inserted
before the variable-length information, contains the size of
the variable-length header information.  The following trailer
protocol header is used to store information regarding the
variable-length protocol header:

     struct {
            short     protocol;            /* original protocol no. */
            short     length;              /* length of trailer */
     };


     The  processing of the trailer protocol is very simple.
On output, the local network header indicates that a trailer
encapsulation  is  being  used.  The header also includes an
indication of the number of data pages  present  before  the
trailer  protocol  header.   The  trailer protocol header is
initialized to contain the actual  protocol  identifier  and
the variable length header size, and is appended to the data
along with the variable length header information.
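
     The output steps above can be sketched in C as follows.
This is an illustration only, not the kernel code: the names
trailer_encap and TRAILER_PAGE are hypothetical, a flat buffer
stands in for an mbuf chain, the fields are stored in host byte
order, and the length field is assumed to count only the
variable-length header bytes, as the text describes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRAILER_PAGE 512        /* data unit named in the text */

struct trailer_hdr {
    short protocol;             /* original protocol no. */
    short length;               /* length of trailer */
};

/*
 * Hypothetical helper: lay out [data][trailer header][protocol headers]
 * in `out` and store the resulting size in *total.  Returns the number
 * of 512-byte data pages (the value the local network header would
 * carry), or -1 if the packet cannot use the trailer encapsulation.
 */
static int
trailer_encap(unsigned char *out, size_t outsize, short protocol,
              const unsigned char *hdr, size_t hlen,
              const unsigned char *data, size_t dlen, size_t *total)
{
    struct trailer_hdr th;

    assert(out != NULL && hdr != NULL && data != NULL && total != NULL);

    /* Only whole pages of data are eligible. */
    if (dlen == 0 || dlen % TRAILER_PAGE != 0)
        return -1;
    /* The header size must fit in a short and the packet in `out`. */
    if (hlen > 0x7fff || dlen + sizeof(th) + hlen > outsize)
        return -1;

    th.protocol = protocol;
    th.length = (short)hlen;

    memcpy(out, data, dlen);                        /* data pages first */
    memcpy(out + dlen, &th, sizeof(th));            /* trailer header */
    memcpy(out + dlen + sizeof(th), hdr, hlen);     /* headers last */

    *total = dlen + sizeof(th) + hlen;
    return (int)(dlen / TRAILER_PAGE);
}
```

     A caller would place the returned page count in the fixed
size local network header, together with the indication that a
trailer encapsulation is in use.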

     On input, the interface routines identify  the  trailer
encapsulation  by the protocol type stored in the local net-
work header, then calculate the number of pages of  data  to
find the beginning of the trailer.  The trailing information
is copied into a separate mbuf and linked to  the  front  of
the resultant packet.
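
     The input steps above can be sketched in C as follows.
This is an illustration under stated assumptions, not the
kernel code: struct seg is a hypothetical stand-in for an mbuf,
trailer_decap, TRAILER_PAGE and MAXHDR are invented names, and
the trailer header is assumed to be in host byte order with a
length field that counts only the variable-length header bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRAILER_PAGE 512        /* data unit named in the text */
#define MAXHDR       128        /* hypothetical cap on header bytes */

struct trailer_hdr {
    short protocol;             /* original protocol no. */
    short length;               /* length of trailer */
};

struct seg {                    /* hypothetical stand-in for an mbuf */
    const unsigned char *base;
    size_t len;
    struct seg *next;
};

struct hdr_seg {                /* segment with storage for the headers */
    struct seg seg;
    unsigned char buf[MAXHDR];
};

/*
 * `pkt` holds what follows the local network header; `npages` is the
 * page count taken from that header.  On success the headers are
 * copied into `front`, which is linked ahead of the data left in
 * place, and 0 is returned.  Returns -1 on a malformed packet.
 */
static int
trailer_decap(const unsigned char *pkt, size_t pktlen, int npages,
              struct hdr_seg *front, struct seg *data, short *protocol)
{
    struct trailer_hdr th;
    size_t off, hlen;

    assert(pkt != NULL && front != NULL && data != NULL && protocol != NULL);

    if (npages <= 0)
        return -1;
    off = (size_t)npages * TRAILER_PAGE;    /* start of the trailer */
    if (pktlen < off + sizeof(th))
        return -1;
    memcpy(&th, pkt + off, sizeof(th));
    if (th.length < 0)
        return -1;
    hlen = (size_t)th.length;
    if (hlen > MAXHDR || pktlen - off - sizeof(th) < hlen)
        return -1;

    /* Copy the trailing headers and link them to the front. */
    memcpy(front->buf, pkt + off + sizeof(th), hlen);
    front->seg.base = front->buf;
    front->seg.len = hlen;
    front->seg.next = data;

    /* The page-aligned data stays where it was received. */
    data->base = pkt;
    data->len = off;
    data->next = NULL;

    *protocol = th.protocol;
    return 0;
}
```

     Only the short header region is copied; the data pages are
referenced in place, which is the point of the encapsulation.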

     Clearly, trailer protocols require cooperation between
source and destination.  In addition, they are normally
cost-effective only when sizable packets are used.  The current
scheme works because the local network encapsulation header is
a fixed size, allowing DMA operations to be performed at a
known offset from the first data page being received.  Should
the local network header be variable length, this scheme fails.

     Statistics collected indicate that as much as 200Kb/s can
be gained by using a trailer protocol with 1-Kbyte packets.
The average size of the variable-length header was 40 bytes
(the size of a minimal TCP/IP packet header).  If the hardware
supports larger packets, even greater gains may be realized.

AAcckknnoowwlleeddggeemmeennttss

     The internal structure of the system is patterned after
the Xerox  PUP  architecture  [Boggs79],  while  in  certain
places  the Internet protocol family has had a great deal of
influence in the design.  The use of software interrupts for
process  invocation  is based on similar facilities found in
the VMS operating system.  Many of the ideas related to pro-
tocol  modularity, memory management, and network interfaces
are based on Rob Gurwitz's  TCP/IP  implementation  for  the
4.1BSD version of UNIX on the VAX [Gurwitz81].  Greg Chesson
explained his use  of  trailer  encapsulations  in  Datakit,
instigating their use in our system.

RReeffeerreenncceess


[Boggs79]           Boggs, D. R., J. F. Shoch, E. A. Taft,
                    and R. M. Metcalfe; PUP: An Internetwork
                    Architecture.  Report CSL-79-10.  XEROX
                    Palo Alto Research Center, July 1979.

[BBN78]             Bolt Beranek and  Newman;  Specification
                    for the Interconnection of Host and IMP.
                    BBN Technical Report 1822.  May 1978.

[Cerf78]            Cerf, V.  G.;   The  Catenet  Model  for
                    Internetworking.     Internet    Working
                    Group, IEN 48.  July 1978.

[Clark82]           Clark, D. D.;  Window  and  Acknowledge-
                    ment  Strategy in TCP, RFC-813.  Network
                    Information Center,  SRI  International.
                    July 1982.

[DEC80]             Digital Equipment Corporation; DECnet
                    DIGITAL Network Architecture - General
                    Description.  Order No. AA-K179A-TK.
                    October 1980.

[Gurwitz81]         Gurwitz,  R.  F.;   VAX-UNIX  Networking
                    Support    Project    -   Implementation
                    Description.     Internetwork    Working
                    Group, IEN 168.  January 1981.

[ISO81]             International Organization for
                    Standardization.  ISO Open Systems
                    Interconnection - Basic Reference Model.
                    ISO/TC 97/SC 16 N 719.  August 1981.

[Joy86]             Joy, W.; Fabry, R.; Leffler, S.;
                    McKusick, M.; and Karels, M.; Berkeley
                    Software Architecture Manual, 4.4BSD
                    Edition.  UNIX Programmer's Supplementary
                    Documents, Vol. 1 (PSD:5).  Computer
                    Systems Research Group, University of
                    California, Berkeley.  May, 1986.

[Leffler84]         Leffler, S.J. and Karels, M.J.;  Trailer
                    Encapsulations, RFC-893.  Network Infor-
                    mation Center, SRI International.  April
                    1984.

[Postel80]          Postel,   J.   User  Datagram  Protocol,
                    RFC-768.   Network  Information  Center,
                    SRI International.  May 1980.

[Postel81a]         Postel,   J.,  ed.   Internet  Protocol,
                    RFC-791.   Network  Information  Center,
                    SRI International.  September 1981.

[Postel81b]         Postel,  J.,  ed.   Transmission Control
                    Protocol, RFC-793.  Network  Information
                    Center,  SRI  International.   September
                    1981.

[Postel81c]         Postel,  J.   Internet  Control  Message
                    Protocol,  RFC-792.  Network Information
                    Center,  SRI  International.   September
                    1981.

[Xerox81]           Xerox Corporation.  Internet Transport
                    Protocols.  Xerox System Integration
                    Standard 028112.  December 1981.

[Zimmermann80]      Zimmermann, H.  OSI Reference Model -
                    The ISO Model of Architecture for Open
                    Systems Interconnection.  IEEE
                    Transactions on Communications.
                    Com-28(4); 425-432.  April 1980.
