








              NNeettwwoorrkkiinngg IImmpplleemmeennttaattiioonn NNootteess
                       44..44BBSSDD EEddiittiioonn


_S_a_m_u_e_l _J_. _L_e_f_f_l_e_r_, _W_i_l_l_i_a_m _N_. _J_o_y_, _R_o_b_e_r_t _S_. _F_a_b_r_y_, _a_n_d _M_i_c_h_a_e_l _J_. _K_a_r_e_l_s
              Computer Systems Research Group
                 Computer Science Division
 Department of Electrical Engineering and Computer Science
             University of California, Berkeley
                    Berkeley, CA  94720


                          _A_B_S_T_R_A_C_T

          This  report describes the internal structure
     of the networking  facilities  developed  for  the
     4.4BSD  version  of the UNIX* operating system for
     the  VAX.   These  facilities are based on several
     central abstractions which structure the  external
     (user)  view  of  network communication as well as
     the internal (system) implementation.

          The report documents the  internal  structure
     of the networking system.  The ``Berkeley Software
     Architecture Manual, 4.4BSD Edition'' (PSD:5) pro-
     vides  a  description of the user interface to the
     networking facilities.


     Revised June 10, 1993

-----------
* UNIX is a trademark of Bell Laboratories.
DEC, VAX, DECnet, and UNIBUS are trademarks of
Digital Equipment Corporation.

SMM:18-2                     Networking Implementation Notes


                     TTAABBLLEE OOFF CCOONNTTEENNTTSS


11..  IInnttrroodduuccttiioonn

22..  OOvveerrvviieeww

33..  GGooaallss

44..  IInntteerrnnaall aaddddrreessss rreepprreesseennttaattiioonn

55..  MMeemmoorryy mmaannaaggeemmeenntt

66..  IInntteerrnnaall llaayyeerriinngg
6.1.    Socket layer
6.1.1.    Socket state
6.1.2.    Socket data queues
6.1.3.    Socket connection queuing
6.2.    Protocol layer(s)
6.3.    Network-interface layer
6.3.1.    UNIBUS interfaces

77..  SSoocckkeett//pprroottooccooll iinntteerrffaaccee

88..  PPrroottooccooll//pprroottooccooll iinntteerrffaaccee
8.1.     pr_output
8.2.     pr_input
8.3.     pr_ctlinput
8.4.     pr_ctloutput

99..  PPrroottooccooll//nneettwwoorrkk--iinntteerrffaaccee iinntteerrffaaccee
9.1.     Packet transmission
9.2.     Packet reception

1100.. GGaatteewwaayyss aanndd rroouuttiinngg iissssuueess
10.1.     Routing tables
10.2.     Routing table interface
10.3.     User level routing policies

1111.. RRaaww ssoocckkeettss
11.1.     Control blocks
11.2.     Input processing
11.3.     Output processing

1122.. BBuuffffeerriinngg aanndd ccoonnggeessttiioonn ccoonnttrrooll
12.1.     Memory management
12.2.     Protocol buffering policies
12.3.     Queue limiting
12.4.     Packet forwarding

1133.. OOuutt ooff bbaanndd ddaattaa

1144.. TTrraaiilleerr pprroottooccoollss










Networking Implementation Notes                     SMM:18-3


AAcckknnoowwlleeddggeemmeennttss

RReeffeerreenncceess




























































SMM:18-4                     Networking Implementation Notes


11..  IInnttrroodduuccttiioonn

     This report describes the internal structure of the
facilities added to the 4.2BSD version of the UNIX operating
system for the VAX, as modified in the 4.4BSD release.  The
system facilities provide a uniform user interface to
networking within UNIX.  In addition, the implementation
introduces a structure for network communications which may
be used by system implementors in adding new networking
facilities.  The internal structure is not visible to the
user; rather, it is intended to aid implementors of
communication protocols and network services by providing a
framework which promotes code sharing and minimizes
implementation effort.

     The reader is expected to be familiar with the C
programming language and system interface, as described in
the Berkeley Software Architecture Manual, 4.4BSD Edition
[Joy86].  A basic understanding of network communication
concepts is assumed; additional ideas are introduced where
required.

     The remainder of this document provides  a  description
of the system internals, avoiding, when possible, those por-
tions which are used only by the interprocess  communication
facilities.

22..  OOvveerrvviieeww

     If we consider the International Standards
Organization's (ISO) Open System Interconnection (OSI) model
of network communication [ISO81] [Zimmermann80], the
networking facilities described here correspond to a portion
of the session layer (layer 5) and all of the transport and
network layers (layers 4 and 3, respectively).

     The network  layer  provides  possibly  imperfect  data
transport   services   with  minimal  addressing  structure.
Addressing at this level is  normally  host  to  host,  with
implicit  or  explicit  routing  optionally supported by the
communicating agents.

     At the transport layer the notions of reliable
transfer, data sequencing, flow control, and service
addressing are normally included.  Reliability is usually
managed by explicit acknowledgement of data delivered.
Failure to acknowledge a transfer results in retransmission
of the data.  Sequencing may be handled by tagging each
message handed to the network layer with a sequence number
and maintaining state at the endpoints of communication to
use received sequence numbers in reordering data which
arrives out of order.

     The session layer facilities may provide forms of
addressing which are mapped into the formats required by the
transport layer, as well as service authentication, client
authentication, and so on.  Various systems also provide
services such as data encryption and address and protocol
translation.

     The  following sections begin by describing some of the
common data structures and utility  routines,  then  examine
the  internal  layering.  The contents of each layer and its
interface are considered.  Certain  of  the  interfaces  are
protocol  implementation specific.  For these cases examples
have been drawn from the Internet [Cerf78] protocol  family.
Later  sections  cover routing issues, the design of the raw
socket interface and other miscellaneous topics.

33..  GGooaallss

     The networking system was designed with the goal of
supporting multiple protocol families and addressing styles.
This required information to be ``hidden'' in common data
structures which could be manipulated by all the pieces of
the system, but which required interpretation only by the
protocols which ``controlled'' it.  The system described
here attempts to limit the use of shared data structures
to those kept by a suite of protocols (a protocol family),
and those used for rendezvous between ``synchronous'' and
``asynchronous'' portions of the system (e.g., queues of
data packets are filled at interrupt time and emptied based
on user requests).

     A major goal of the system was to provide a framework
within which new protocols and hardware could easily be
supported.  To this end, a great deal of effort has been
expended to create utility routines which hide many of the
more complex and/or hardware dependent chores of networking.
Later sections describe the utility routines and the
underlying data structures they manipulate.

44..  IInntteerrnnaall aaddddrreessss rreepprreesseennttaattiioonn

     Common to all portions of the system are two data
structures.  These structures are used to represent
addresses and various data objects.  Internally, addresses
are described by the sockaddr structure:

     struct sockaddr {
            short     sa_family;           /* data format identifier */
            char      sa_data[14];         /* address */
     };

All addresses belong to one or more address families which
define their format and interpretation.  The sa_family field
indicates the address family to which the address belongs,
and the sa_data field contains the actual data value.  The
size of the data field, 14 bytes, was selected based on a
study of current address formats.*  Specific address formats
use private structure definitions that define the format of
the data field.  The system interface supports larger
address structures, although address-family-independent
support facilities, for example routing and raw socket
interfaces, provide only 14 bytes for address storage.
Protocols that do not use those facilities (e.g., the
current UNIX domain) may use larger data areas.
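
     The relationship between the generic address and a
private, family-specific layout can be sketched as follows.
This is only an illustration: AF_DEMO, struct sockaddr_demo
and the two conversion helpers are hypothetical names
invented for the example and belong to no real protocol
family.

```c
#include <assert.h>
#include <string.h>

/* Generic address, as given in the text. */
struct sockaddr {
    short sa_family;            /* data format identifier */
    char  sa_data[14];          /* address */
};

#define AF_DEMO 99              /* hypothetical family number */

/* Hypothetical private layout: the family field first, then
 * 14 bytes that only this family knows how to interpret. */
struct sockaddr_demo {
    short          sd_family;   /* always AF_DEMO */
    unsigned short sd_port;     /* service on the host */
    unsigned char  sd_host[4];  /* host identifier */
    char           sd_zero[8];  /* pad to the generic size */
};

/* Compile-time check that the private layout fits the generic one. */
typedef char demo_size_check[
    sizeof(struct sockaddr_demo) == sizeof(struct sockaddr) ? 1 : -1];

/* Hand a family-specific address to family-independent code. */
void demo_to_generic(const struct sockaddr_demo *sd, struct sockaddr *sa)
{
    assert(sd->sd_family == AF_DEMO);
    memcpy(sa, sd, sizeof(*sa));
}

/* Recover the private view; returns -1 for a foreign family. */
int generic_to_demo(const struct sockaddr *sa, struct sockaddr_demo *sd)
{
    if (sa->sa_family != AF_DEMO)
        return -1;
    memcpy(sd, sa, sizeof(*sd));
    return 0;
}
```

Family-independent code examines only sa_family; the copy
through memcpy is used here in place of the traditional
pointer cast so that the sketch stays within portable C.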

55..  MMeemmoorryy mmaannaaggeemmeenntt

     A single mechanism is used for data storage: memory
buffers, or mbufs.  An mbuf is a structure of the form:

     struct mbuf {
            struct    mbuf *m_next;        /* next buffer in chain */
            u_long    m_off;               /* offset of data */
            short     m_len;               /* amount of data in this mbuf */
            short     m_type;              /* mbuf type (accounting) */
            u_char    m_dat[MLEN];         /* data storage */
            struct    mbuf *m_act;         /* link in higher-level mbuf list */
     };

The _m___n_e_x_t field is used to chain mbufs together  on  linked
lists,  while the _m___a_c_t field allows lists of mbuf chains to
be accumulated.  By convention, the mbufs common to a single
object (for example, a packet) are chained together with the
_m___n_e_x_t field, while groups of objects  are  linked  via  the
_m___a_c_t field (possibly when in a queue).

     Each mbuf has a small data area for storing
information, m_dat.  The m_len field indicates the amount of
data, while the m_off field is an offset to the beginning of
the data from the base of the mbuf.  Thus, for example, the
macro mtod, which converts a pointer to an mbuf to a pointer
to the data stored in the mbuf, has the form

     #define mtod(x,t)           ((t)((int)(x) + (x)->m_off))

(note the t parameter, a C type cast, which is used to cast
the resultant pointer for proper assignment).
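
     The use of m_off and mtod can be sketched with a
simplified mbuf.  The helpers mbuf_store and mbuf_trim_front
are hypothetical, and the macro below uses char * arithmetic
rather than the int cast shown above so that the example
also runs where pointers are wider than an int.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MLEN 112                /* data bytes per mbuf, as quoted in the text */

struct mbuf {
    struct mbuf  *m_next;       /* next buffer in chain */
    unsigned long m_off;        /* offset of data */
    short         m_len;        /* amount of data in this mbuf */
    short         m_type;       /* mbuf type (accounting) */
    unsigned char m_dat[MLEN];  /* data storage */
    struct mbuf  *m_act;        /* link in higher-level mbuf list */
};

/* Portable variant of the macro in the text. */
#define mtod(x, t) ((t)((char *)(x) + (x)->m_off))

/* Hypothetical helper: place len bytes at the start of the data area. */
void mbuf_store(struct mbuf *m, const void *src, short len)
{
    assert(len >= 0 && len <= MLEN);
    memset(m, 0, sizeof(*m));
    m->m_off = offsetof(struct mbuf, m_dat);
    m->m_len = len;
    memcpy(mtod(m, char *), src, (size_t)len);
}

/* Hypothetical helper: drop n leading bytes without copying,
 * by advancing the offset and shrinking the length. */
void mbuf_trim_front(struct mbuf *m, short n)
{
    assert(n >= 0 && n <= m->m_len);
    m->m_off += (unsigned long)n;
    m->m_len = (short)(m->m_len - n);
}
```

Because the data pointer is always derived from m_off,
removing a header from the front of a buffer costs two field
updates and no data movement.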

     In addition to storing data directly in the mbuf's data
area, data of page size may also be stored in a separate
area of memory.  The mbuf utility routines maintain a pool
of pages for this purpose and manipulate a private page map
for such pages.  An mbuf with an external data area may be
recognized by the larger offset to the data area; this is
formalized by the macro M_HASCL(m), which is true if the
mbuf whose address is m has an external page cluster.  An
array of reference counts on pages is also maintained so
that copies of pages may be made without core to core
copying (copies are created simply by duplicating the
reference to the data and incrementing the associated
reference counts for the pages).  Separate data pages are
currently used only when copying data from a user process
into the kernel, and when bringing data in at the hardware
level.  Routines which manipulate mbufs are not normally
aware whether data is stored directly in the mbuf data
array, or if it is kept in separate pages.
-----------
* Later versions of the system may support variable
length addresses.

     The following may be used to allocate and free mbufs:

m = m_get(wait, type);
MGET(m, wait, type);

     The subroutine m_get and the macro MGET each allocate
     an mbuf, placing its address in m.  The argument wait
     is either M_WAIT or M_DONTWAIT according to whether
     allocation should block or fail if no mbuf is
     available.  The type is one of the predefined mbuf
     types for use in accounting of mbuf allocation.

MCLGET(m);
     This macro attempts to allocate an mbuf page cluster to
     associate with the mbuf m.  If successful, the length
     of the mbuf is set to CLSIZE, the size of the page
     cluster.

n = m_free(m);
MFREE(m,n);

     The routine m_free and the macro MFREE each free a
     single mbuf, m, and any associated external storage
     area, placing a pointer to its successor in the chain
     it heads, if any, in n.

m_freem(m);
     This routine frees an mbuf chain headed by m.

     The following utility routines are available for
manipulating mbuf chains:

m = m_copy(m0, off, len);
     The m_copy routine creates a copy of all, or part, of
     the list of mbufs in m0.  Len bytes of data, starting
     off bytes from the front of the chain, are copied.
     Where possible, reference counts on pages are used
     instead of core to core copies.  The original mbuf
     chain must have at least off + len bytes of data.  If
     len is specified as M_COPYALL, all the data present,
     offset as before, is copied.

m_cat(m, n);
     The mbuf chain, n, is appended to the end of m.  Where
     possible, compaction is performed.

m_adj(m, diff);
     The mbuf chain, m, is adjusted in size by diff bytes.
     If diff is non-negative, diff bytes are shaved off the
     front of the mbuf chain.  If diff is negative, the
     alteration is performed from back to front.  No space
     is reclaimed in this operation; alterations are
     accomplished by changing the m_len and m_off fields of
     mbufs.

m = m_pullup(m0, size);
     After a successful call to m_pullup, the mbuf at the
     head of the returned list, m, is guaranteed to have at
     least size bytes of data in contiguous memory within
     the data area of the mbuf (allowing access via a
     pointer, obtained using the mtod macro, and allowing
     the mbuf to be located from a pointer to the data area
     using dtom, defined below).  If the original data was
     less than size bytes long, size was greater than the
     size of an mbuf data area (112 bytes), or required
     resources were unavailable, m is 0 and the original
     mbuf chain is deallocated.

     This routine is particularly useful when verifying
     packet header lengths on reception.  For example, if a
     packet is received and only 8 of the necessary 16 bytes
     required for a valid packet header are present at the
     head of the list of mbufs representing the packet, the
     remaining 8 bytes may be ``pulled up'' with a single
     m_pullup call.  If the call fails the invalid packet
     will have been discarded.
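
     The behavior described for m_adj can be sketched on a
reduced mbuf that carries only the fields involved.  The
routine adj_sketch below is a hypothetical illustration
written from the description above, not the kernel's m_adj;
chain_len is a helper added for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced mbuf: only the fields that size adjustment touches. */
struct mbuf {
    struct mbuf  *m_next;       /* next buffer in chain */
    unsigned long m_off;        /* offset of data */
    short         m_len;        /* amount of data in this mbuf */
};

/* Total number of data bytes in a chain. */
int chain_len(const struct mbuf *m)
{
    int n = 0;

    for (; m != NULL; m = m->m_next)
        n += m->m_len;
    return n;
}

/* Trim diff bytes from the front (diff >= 0) or -diff bytes from
 * the back (diff < 0).  No mbuf is freed; only m_len and m_off
 * change, so emptied mbufs stay in the chain with m_len == 0. */
void adj_sketch(struct mbuf *m, int diff)
{
    if (diff >= 0) {
        while (m != NULL && diff > 0) {
            if (m->m_len <= diff) {         /* consume whole mbuf */
                diff -= m->m_len;
                m->m_len = 0;
                m = m->m_next;
            } else {                        /* consume part of it */
                m->m_off += (unsigned long)diff;
                m->m_len = (short)(m->m_len - diff);
                diff = 0;
            }
        }
    } else {
        int keep = chain_len(m) + diff;     /* bytes that survive */

        if (keep < 0)
            keep = 0;
        for (; m != NULL; m = m->m_next) {
            if (m->m_len > keep)
                m->m_len = (short)keep;
            keep -= m->m_len;
        }
    }
}
```

Leaving empty mbufs in place mirrors the statement that no
space is reclaimed by the operation.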

     By ensuring that mbufs always reside on 128 byte
boundaries, it is always possible to locate the mbuf
associated with a data area by masking off the low bits of
the virtual address.  This allows modules to store data
structures in mbufs and pass them around without concern for
locating the original mbuf when it comes time to free the
structure.  Note that this works only with objects stored in
the internal data buffer of the mbuf.  The dtom macro is
used to convert a pointer into an mbuf's data area to a
pointer to the mbuf,

     #define   dtom(x)   ((struct mbuf *)((int)(x) & ~(MSIZE-1)))
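
     The masking technique can be tried outside the kernel
with a small aligned pool.  In this sketch the mbuf layout
is reduced so that it occupies exactly MSIZE bytes on common
machines, the pool and pool_mbuf are hypothetical, and
uintptr_t replaces the int cast so that the macro works with
64-bit pointers.  The _Alignas specifier requires C11.

```c
#include <assert.h>
#include <stdint.h>

#define MSIZE 128               /* mbuf size and alignment, as in the text */

/* Reduced mbuf whose data area fills the rest of MSIZE bytes. */
struct mbuf {
    struct mbuf  *m_next;
    unsigned long m_off;
    short         m_len;
    short         m_type;
    unsigned char m_dat[MSIZE - sizeof(struct mbuf *)
                        - sizeof(unsigned long) - 2 * sizeof(short)];
};

/* Compile-time check that the layout really is MSIZE bytes. */
typedef char mbuf_size_check[sizeof(struct mbuf) == MSIZE ? 1 : -1];

/* Mask off the low bits of an address inside the data area. */
#define dtom(x) \
    ((struct mbuf *)((uintptr_t)(x) & ~(uintptr_t)(MSIZE - 1)))

/* Hypothetical pool: every element starts on an MSIZE boundary. */
static _Alignas(MSIZE) struct mbuf mbuf_pool[4];

struct mbuf *pool_mbuf(int i)
{
    assert(i >= 0 && i < 4);
    return &mbuf_pool[i];
}
```

The mask recovers the owning mbuf only because each mbuf is
both MSIZE bytes long and MSIZE aligned; data held in an
external page cluster would not satisfy this.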


     Mbufs are used for dynamically allocated data
structures such as sockets as well as memory allocated for
packets and headers.  Statistics are maintained on mbuf
usage and can be viewed by users using the netstat(1)
program.

66..  IInntteerrnnaall llaayyeerriinngg

     The internal structure of the network system is divided
into three layers.  These layers correspond to the  services
provided  by  the  socket abstraction, those provided by the
communication protocols, and those provided by the  hardware
interfaces.   The  communication protocols are normally lay-
ered into two or more individual cooperating layers,  though
they are collectively viewed in the system as one layer pro-
viding  services  supportive  of  the   appropriate   socket
abstraction.

     The  following sections describe the properties of each
layer in the system and the interfaces to  which  each  must
conform.

66..11..  SSoocckkeett llaayyeerr

     The socket layer deals with the interprocess
communication facilities provided by the system.  A socket
is a bidirectional endpoint of communication which is
``typed'' by the semantics of communication it supports.
The system calls described in the Berkeley Software
Architecture Manual [Joy86] are used to manipulate sockets.

     A socket consists of the following data structure:

     struct socket {
            short     so_type;             /* generic type */
            short     so_options;          /* from socket call */
            short     so_linger;           /* time to linger while closing */
            short     so_state;            /* internal state flags */
            caddr_t   so_pcb;              /* protocol control block */
            struct    protosw *so_proto;   /* protocol handle */
            struct    socket *so_head;     /* back pointer to accept socket */
            struct    socket *so_q0;       /* queue of partial connections */
            short     so_q0len;            /* partials on so_q0 */
            struct    socket *so_q;        /* queue of incoming connections */
            short     so_qlen;             /* number of connections on so_q */
            short     so_qlimit;           /* max number queued connections */
            struct    sockbuf so_rcv;      /* receive queue */
            struct    sockbuf so_snd;      /* send queue */
            short     so_timeo;            /* connection timeout */
            u_short   so_error;            /* error affecting connection */
            u_short   so_oobmark;          /* chars to oob mark */
            short     so_pgrp;             /* pgrp for signals */
     };


     Each socket contains two data queues, so_rcv and
so_snd, and a pointer to routines which provide supporting
services.  The type of the socket, so_type, is defined at
socket creation time and used in selecting those services
which are appropriate to support it.  The supporting
protocol is selected at socket creation time and recorded in
the socket data structure for later use.  Protocols are
defined by a table of procedures, the protosw structure,
which will be described in detail later.  A pointer to a
protocol-specific data structure, the ``protocol control
block,'' is also present in the socket structure.  Protocols
control this data structure, which normally includes a back
pointer to the parent socket structure to allow easy lookup
when returning information to a user (for example, placing
an error number in the so_error field).  The other entries
in the socket structure are used in queuing connection
requests, validating user requests, storing socket
characteristics (e.g., options supplied at the time a socket
is created), and maintaining a socket's state.

     Processes ``rendezvous at a socket'' in many instances.
For example, when a process wishes to extract data from a
socket's receive queue and it is empty, or lacks sufficient
data to satisfy the request, the process blocks, supplying
the address of the receive queue as a ``wait channel'' to be
used in notification.  When data arrives for the process and
is placed in the socket's queue, the blocked process is
identified by the fact it is waiting ``on the queue.''

66..11..11..  SSoocckkeett ssttaattee

     A socket's state is defined from the following:

     #define SS_NOFDREF            0x001     /* no file table ref any more */
     #define SS_ISCONNECTED        0x002     /* socket connected to a peer */
     #define SS_ISCONNECTING       0x004     /* in process of connecting to peer */
     #define SS_ISDISCONNECTING    0x008     /* in process of disconnecting */
     #define SS_CANTSENDMORE       0x010     /* can't send more data to peer */
     #define SS_CANTRCVMORE        0x020     /* can't receive more data from peer */
     #define SS_RCVATMARK          0x040     /* at mark on input */

     #define SS_PRIV               0x080     /* privileged */
     #define SS_NBIO               0x100     /* non-blocking ops */
     #define SS_ASYNC              0x200     /* async i/o notify */


     The state of a socket is manipulated both by the
protocols and the user (through system calls).  When a
socket is created, the state is defined based on the type of
socket.  It may change as control actions are performed, for
example connection establishment.  It may also change
according to the type of input/output the user wishes to
perform, as indicated by options set with fcntl.
``Non-blocking'' I/O implies that a process should never be
blocked to await resources.  Instead, any call which would
block returns prematurely with the error EWOULDBLOCK, or the
service request may be partially fulfilled, e.g. a request
for more data than is present.

     If a process requested ``asynchronous'' notification of
events  related to the socket, the SIGIO signal is posted to
the process when such events occur.  An event is a change in
the  socket's state; examples of such occurrences are: space
becoming available in the send queue, new data available  in
the receive queue, connection establishment or disestablish-
ment, etc.

     A socket may be marked ``privileged'' if it was created
by   the  super-user.   Only  privileged  sockets  may  bind
addresses in privileged portions of an address space or  use
``raw'' sockets to access lower levels of the network.
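
     The effect of the state bits on a receive request can
be sketched as a small decision function.  This is a
deliberately simplified, hypothetical model of the behavior
described above, not the kernel's receive path; rcv_decide
and enum rcv_action are names invented for the example, and
only two of the state flags are consulted.

```c
#include <assert.h>
#include <errno.h>

#define SS_CANTRCVMORE 0x020    /* can't receive more data from peer */
#define SS_NBIO        0x100    /* non-blocking ops */

enum rcv_action {
    RCV_COPY,                   /* hand over whatever data is queued */
    RCV_EOF,                    /* nothing queued and none will arrive */
    RCV_SLEEP,                  /* block on the receive queue */
    RCV_ERROR                   /* fail at once; *error holds the reason */
};

/* Decide how a receive request proceeds, given the socket state
 * and the number of bytes currently queued. */
enum rcv_action rcv_decide(short so_state, unsigned avail, int *error)
{
    *error = 0;
    if (avail > 0)
        return RCV_COPY;        /* request may be partially fulfilled */
    if (so_state & SS_CANTRCVMORE)
        return RCV_EOF;
    if (so_state & SS_NBIO) {
        *error = EWOULDBLOCK;   /* never block a non-blocking socket */
        return RCV_ERROR;
    }
    return RCV_SLEEP;
}
```

A request for more data than is queued is satisfied with
what is present, which is the partial fulfilment mentioned
in the text.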

66..11..22..  SSoocckkeett ddaattaa qquueeuueess

     A  socket's  data  queue contains a pointer to the data
stored in the queue and other entries related to the manage-
ment  of  the  data.  The following structure defines a data
queue:

     struct sockbuf {
            u_short   sb_cc;               /* actual chars in buffer */
            u_short   sb_hiwat;            /* max actual char count */
            u_short   sb_mbcnt;            /* chars of mbufs used */
            u_short   sb_mbmax;            /* max chars of mbufs to use */
            u_short   sb_lowat;            /* low water mark */
            short     sb_timeo;            /* timeout */
            struct    mbuf *sb_mb;         /* the mbuf chain */
            struct    proc *sb_sel;        /* process selecting read/write */
            short     sb_flags;            /* flags, see below */
     };


     Data is stored in a queue as a  chain  of  mbufs.   The
actual  count  of  data  characters  as well as high and low
water marks are used by the  protocols  in  controlling  the
flow  of  data.   The  amount of buffer space (characters of
mbufs and associated data pages) is also recorded along with
the limit on buffer allocation.  The socket routines cooper-
ate in implementing the flow control policy  by  blocking  a
process  when  it  requests  to send data and the high water
mark has been reached, or when it requests to  receive  data
and  less  than the low water mark is present (assuming non-
blocking I/O has not been specified).*

-----------
* The low-water mark is always presumed to be 0 in
the current implementation.

     When a socket is created, the supporting protocol
``reserves'' space for the send and receive queues of the
socket.  The limit on buffer allocation is set somewhat
higher than the limit on data characters to account for the
granularity of buffer allocation.  The actual storage
associated with a socket queue may fluctuate during a
socket's lifetime, but it is assumed that this reservation
will always allow a protocol to acquire enough memory to
satisfy the high water marks.
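
     The two limits kept in a sockbuf can be combined into a
single measure of free space, as sketched below.  The
functions sb_reserve_sketch, sb_space and sb_sender_must_wait
are hypothetical, and the factor of two used for the buffer
limit is an assumption of the example; the text says only
that the limit is ``somewhat higher.''

```c
#include <assert.h>

/* The accounting fields of the sockbuf structure shown earlier. */
struct sockbuf {
    unsigned short sb_cc;       /* actual chars in buffer */
    unsigned short sb_hiwat;    /* max actual char count */
    unsigned short sb_mbcnt;    /* chars of mbufs used */
    unsigned short sb_mbmax;    /* max chars of mbufs to use */
    unsigned short sb_lowat;    /* low water mark */
};

/* Reserve cc characters of queue space for an empty queue. */
void sb_reserve_sketch(struct sockbuf *sb, unsigned short cc)
{
    sb->sb_cc = 0;
    sb->sb_mbcnt = 0;
    sb->sb_lowat = 0;
    sb->sb_hiwat = cc;
    /* Assumed policy: allow twice as much buffer storage as data,
     * clamped to what an unsigned short can hold. */
    sb->sb_mbmax = (unsigned short)(cc > 0x7fff ? 0xffff : cc * 2);
}

/* Free space is bounded by both the data and the storage limit. */
int sb_space(const struct sockbuf *sb)
{
    int by_data = (int)sb->sb_hiwat - (int)sb->sb_cc;
    int by_mbufs = (int)sb->sb_mbmax - (int)sb->sb_mbcnt;

    return by_data < by_mbufs ? by_data : by_mbufs;
}

/* A sender waits once the high water mark has been reached. */
int sb_sender_must_wait(const struct sockbuf *sb)
{
    return sb_space(sb) <= 0;
}
```

Tracking storage separately from data matters because many
small messages can exhaust buffer memory long before the
character count reaches the high water mark.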

     The  timeout  and  select values are manipulated by the
socket routines in  implementing  various  portions  of  the
interprocess  communications  facilities  and  will  not  be
described here.

     Data queued at a socket is stored in one of two styles.
Stream-oriented sockets queue data with no addresses,
headers or record boundaries.  The data are in mbufs linked
through the m_next field.  Buffers containing access rights
may be present within the chain if the underlying protocol
supports passage of access rights.  Record-oriented sockets,
including datagram sockets, queue data as a list of packets;
the sections of packets are distinguished by the types of
the mbufs containing them.  The mbufs which comprise a
record are linked through the m_next field; records are
linked from the m_act field of the first mbuf of one packet
to the first mbuf of the next.  Each packet begins with an
mbuf containing the ``from'' address if the protocol
provides it, then any buffers containing access rights, and
finally any buffers containing data.  If a record contains
no data, no data buffers are required unless neither address
nor access rights are present.
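
     The record-oriented layout can be traversed as
sketched below.  The mbuf is reduced to the fields involved,
the two functions are hypothetical, and the numeric values
given to the MT_ type names are assumptions made so that the
example is self-contained.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for the mbuf types used in this sketch. */
#define MT_DATA    1            /* user data */
#define MT_SONAME  8            /* ``from'' address */
#define MT_RIGHTS 12            /* access rights */

/* Reduced mbuf: links, length and type only. */
struct mbuf {
    struct mbuf *m_next;        /* next buffer in this record */
    struct mbuf *m_act;         /* first buffer of the next record */
    short        m_len;         /* amount of data in this mbuf */
    short        m_type;        /* what this buffer holds */
};

/* Count the records on a queue by following m_act from the
 * first mbuf of each record. */
int count_records(const struct mbuf *q)
{
    int n = 0;

    for (; q != NULL; q = q->m_act)
        n++;
    return n;
}

/* Sum the data bytes of one record, skipping the address and
 * access rights buffers that may precede the data. */
int record_data_len(const struct mbuf *rec)
{
    int n = 0;

    for (; rec != NULL; rec = rec->m_next)
        if (rec->m_type == MT_DATA)
            n += rec->m_len;
    return n;
}
```

Only the first mbuf of a record carries a meaningful m_act
link, so the outer walk never descends into a record.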

     A socket queue has a number of flags used  in  synchro-
nizing access to the data and in acquiring resources:

     #define SB_LOCK           0x01   /* lock on data queue (so_rcv only) */
     #define SB_WANT           0x02   /* someone is waiting to lock */
     #define SB_WAIT           0x04   /* someone is waiting for data/space */
     #define SB_SEL            0x08   /* buffer is selected */
     #define SB_COLL           0x10   /* collision selecting */

The  last  two flags are manipulated by the system in imple-
menting the select mechanism.

66..11..33..  SSoocckkeett ccoonnnneeccttiioonn qquueeuuiinngg

     In dealing with connection oriented sockets (e.g.
SOCK_STREAM) the two ends are considered distinct.  One end
is termed active, and generates connection requests.  The
other end is called passive and accepts connection requests.

     From the passive side, a socket is marked with
SO_ACCEPTCONN when a listen call is made, creating two
queues of sockets: so_q0 for connections in progress and
so_q for connections already made and awaiting user
acceptance.  As a protocol is preparing incoming
connections, it creates a socket structure queued on so_q0
by calling the routine sonewconn().  When the connection is
established, the socket structure is then transferred to
so_q, making it available for an accept.

     If an SO_ACCEPTCONN socket is closed with sockets on
either so_q0 or so_q, these sockets are dropped, with
notification to the peers as appropriate.
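
     The movement of a socket from so_q0 to so_q can be
sketched with the queue fields of the socket structure.
Everything below is a hypothetical model: the queues are
kept as simple null-terminated lists, the admission test
against so_qlimit is an assumed policy, and the function
names are invented rather than taken from the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* The connection-queuing fields of the socket structure. */
struct socket {
    struct socket *so_head;     /* back pointer to accept socket */
    struct socket *so_q0;       /* queue of partial connections */
    short          so_q0len;    /* partials on so_q0 */
    struct socket *so_q;        /* queue of incoming connections */
    short          so_qlen;     /* number of connections on so_q */
    short          so_qlimit;   /* max number queued connections */
};

/* Queue q (0 or 1) is threaded through so_q0 or so_q. */
static struct socket **link_of(struct socket *so, int q)
{
    return q == 0 ? &so->so_q0 : &so->so_q;
}

static short *len_of(struct socket *head, int q)
{
    return q == 0 ? &head->so_q0len : &head->so_qlen;
}

/* Append so to the tail of queue q of head. */
static void q_insert(struct socket *head, struct socket *so, int q)
{
    struct socket **pp = link_of(head, q);

    while (*pp != NULL)
        pp = link_of(*pp, q);
    *pp = so;
    *link_of(so, q) = NULL;
    so->so_head = head;
    (*len_of(head, q))++;
}

/* Unlink so from queue q of head; returns 0 if it was not there. */
static int q_remove(struct socket *head, struct socket *so, int q)
{
    struct socket **pp = link_of(head, q);

    while (*pp != NULL && *pp != so)
        pp = link_of(*pp, q);
    if (*pp == NULL)
        return 0;
    *pp = *link_of(so, q);
    *link_of(so, q) = NULL;
    (*len_of(head, q))--;
    return 1;
}

/* A connection request arrives: queue so as a partial connection. */
int newconn_sketch(struct socket *head, struct socket *so)
{
    if (head->so_q0len + head->so_qlen >= head->so_qlimit)
        return -1;              /* assumed policy: refuse when full */
    q_insert(head, so, 0);
    return 0;
}

/* The connection completes: move so from so_q0 to so_q. */
int connected_sketch(struct socket *so)
{
    struct socket *head = so->so_head;

    if (head == NULL || !q_remove(head, so, 0))
        return -1;
    q_insert(head, so, 1);
    return 0;
}

/* The user accepts: take the oldest completed connection. */
struct socket *accept_sketch(struct socket *head)
{
    struct socket *so = head->so_q;

    if (so == NULL)
        return NULL;
    q_remove(head, so, 1);
    so->so_head = NULL;
    return so;
}
```

Keeping partial and completed connections on separate queues
lets an accept return only sockets that are ready for use.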

66..22..  PPrroottooccooll llaayyeerr((ss))

     Each socket is  created  in  a  communications  domain,
which  usually implies both an addressing structure (address
family) and a  set  of  protocols  which  implement  various
socket  types  within  the  domain  (protocol family).  Each
domain is defined by the following structure:

     struct       domain {
          int     dom_family;             /* PF_xxx */
          char    *dom_name;
          int     (*dom_init)();          /* initialize domain data structures */
          int     (*dom_externalize)();   /* externalize access rights */
          int     (*dom_dispose)();       /* dispose of internalized rights */
          struct  protosw *dom_protosw, *dom_protoswNPROTOSW;
          struct  domain *dom_next;
     };


     At boot time, each domain configured into the kernel is
added to a linked list of domains.  The initialization
procedure of each domain is then called.  After that time,
the domain structure is used to locate protocols within the
protocol family.  It may also contain procedure references
for externalization of access rights at the receiving socket
and the disposal of access rights that are not received.

     Protocols are described by a set of  entry  points  and
certain  socket-visible  characteristics,  some of which are
used in deciding which socket type(s) they may support.

     An entry in the ``protocol switch'' table exists for
each protocol module configured into the system.  It has the
following form:

     struct protosw {
          short   pr_type;              /* socket type used for */
          struct  domain *pr_domain;    /* domain protocol a member of */
          short   pr_protocol;          /* protocol number */
          short   pr_flags;             /* socket visible attributes */
     /* protocol-protocol hooks */
          int     (*pr_input)();        /* input to protocol (from below) */
          int     (*pr_output)();       /* output to protocol (from above) */
          int     (*pr_ctlinput)();     /* control input (from below) */
          int     (*pr_ctloutput)();    /* control output (from above) */
     /* user-protocol hook */
          int     (*pr_usrreq)();       /* user request */
     /* utility hooks */
          int     (*pr_init)();         /* initialization routine */
          int     (*pr_fasttimo)();     /* fast timeout (200ms) */
          int     (*pr_slowtimo)();     /* slow timeout (500ms) */
          int     (*pr_drain)();        /* flush any excess space possible */
     };


     A protocol is called through the pr_init entry before
any other.  Thereafter it is called every 200 milliseconds
through the pr_fasttimo entry and every 500 milliseconds
through the pr_slowtimo entry for timer based actions.  The
system will call the pr_drain entry if it is low on space;
this entry should throw away any non-critical data.

     Protocols pass data between themselves as chains of
mbufs using the pr_input and pr_output routines.  Pr_input
passes data up (towards the user) and pr_output passes it
down (towards the network); control information passes up
and down on pr_ctlinput and pr_ctloutput.  The protocol is
responsible for the space occupied by any of the arguments
to these entries and must either pass it onward or dispose
of it.  (On output, the lowest level reached must free
buffers storing the arguments; on input, the highest level
is responsible for freeing buffers.)

     The  _p_r___u_s_r_r_e_q  routine  interfaces  protocols  to  the
socket code and is described below.

     The  _p_r___f_l_a_g_s  field  is constructed from the following
values:

     #define PR_ATOMIC         0x01    /* exchange atomic messages only */
     #define PR_ADDR           0x02    /* addresses given with messages */
     #define PR_CONNREQUIRED   0x04    /* connection required by protocol */
     #define PR_WANTRCVD       0x08    /* want PRU_RCVD calls */
     #define PR_RIGHTS         0x10    /* passes capabilities */

Protocols which are connection-based specify the  PR_CONNRE-
QUIRED  flag  so that the socket routines will never attempt
to send data before a connection has been  established.   If
the PR_WANTRCVD flag is set, the socket routines will notify
the protocol  when  the  user  has  removed  data  from  the
socket's  receive queue.  This allows the protocol to imple-
ment acknowledgement on user receipt, and also  update  win-
dowing information based on the amount of space available in
the receive queue.  The PR_ADDR  flag  indicates  that  any
data  placed  in the socket's receive queue will be preceded
by the address of the sender.  The PR_ATOMIC flag  specifies
that  each  _u_s_e_r request to send data must be performed in a
single _p_r_o_t_o_c_o_l send request; it is the protocol's responsi-
bility  to  maintain  record  boundaries on data to be sent.
The PR_RIGHTS flag indicates that the protocol supports  the
passing of capabilities;  this is currently used only by the
protocols in the UNIX protocol family.
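
     A minimal sketch of how a socket-level send path might
consult these flags is given below.  The flag values are those
listed above; the function names, the argument list, and the
choice of ENOTCONN and EMSGSIZE as error returns are
assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>

#define PR_ATOMIC       0x01    /* exchange atomic messages only */
#define PR_ADDR         0x02    /* addresses given with messages */
#define PR_CONNREQUIRED 0x04    /* connection required by protocol */
#define PR_WANTRCVD     0x08    /* want PRU_RCVD calls */
#define PR_RIGHTS       0x10    /* passes capabilities */

/*
 * Return 0 if a user send of len bytes may be handed to the protocol,
 * or an error number if the flags forbid it.  sndbuf is the size of
 * the socket's send buffer (hypothetical parameter).
 */
int sketch_send_check(int pr_flags, int connected, long len, long sndbuf)
{
    assert(len >= 0 && sndbuf >= 0);
    /* Connection-based protocols never see data before a connection. */
    if ((pr_flags & PR_CONNREQUIRED) && !connected)
        return ENOTCONN;
    /* An atomic message must fit in a single protocol send request. */
    if ((pr_flags & PR_ATOMIC) && len > sndbuf)
        return EMSGSIZE;
    return 0;
}

/* Nonzero if the protocol asked to be told when the user removes data. */
int sketch_wants_rcvd(int pr_flags)
{
    return (pr_flags & PR_WANTRCVD) != 0;
}
```

A stream protocol would typically combine PR_CONNREQUIRED with
PR_WANTRCVD, while a datagram protocol would combine PR_ATOMIC
with PR_ADDR.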

     When a socket is created, the socket routines scan  the
protocol  table  for  the  domain looking for an appropriate
protocol to support the type of socket being  created.   The
_p_r___t_y_p_e  field  contains  one  of  the possible socket types
(e.g. SOCK_STREAM), while the _p_r___d_o_m_a_i_n is a back pointer to
the  domain  structure.   The _p_r___p_r_o_t_o_c_o_l field contains the
protocol number  of  the  protocol,  normally  a  well-known
value.
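
     The table scan performed at socket creation might look
roughly like the sketch below.  Only the four fields discussed
here are kept in the protocol switch entry; the reduced domain
structure, the SK_ constants, the demo table and the sketch_
routine names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for socket types and protocol numbers. */
enum { SK_STREAM = 1, SK_DGRAM = 2, SK_RAW = 3 };
enum { SK_PROTO_TCP = 6, SK_PROTO_UDP = 17 };

struct domain;
struct protosw {
    short pr_type;              /* socket type used for */
    struct domain *pr_domain;   /* domain protocol a member of */
    short pr_protocol;          /* protocol number */
    short pr_flags;             /* socket visible attributes */
};
struct domain {
    int dom_family;
    struct protosw *dom_protosw;        /* first entry for this domain */
    struct protosw *dom_protosw_end;    /* one past the last entry */
};

struct domain demo_domain;              /* tentative; defined below */
struct protosw demo_sw[] = {
    { SK_DGRAM,  &demo_domain, SK_PROTO_UDP, 0 },
    { SK_STREAM, &demo_domain, SK_PROTO_TCP, 0 },
    { SK_RAW,    &demo_domain, 0,            0 },
};
struct domain demo_domain = { 2, demo_sw, demo_sw + 3 };

/* Find the first protocol in the domain that supports a socket type. */
struct protosw *sketch_findtype(struct domain *dp, int type)
{
    struct protosw *pr;

    assert(dp != NULL);
    for (pr = dp->dom_protosw; pr < dp->dom_protosw_end; pr++)
        if (pr->pr_type == type)
            return pr;
    return NULL;
}

/* Find the entry matching both a protocol number and a socket type. */
struct protosw *sketch_findproto(struct domain *dp, int protocol, int type)
{
    struct protosw *pr;

    assert(dp != NULL);
    for (pr = dp->dom_protosw; pr < dp->dom_protosw_end; pr++)
        if (pr->pr_protocol == protocol && pr->pr_type == type)
            return pr;
    return NULL;
}
```

The first routine corresponds to a request that names only a
socket type; the second to a request that also names a specific
protocol.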

66..33..  NNeettwwoorrkk--iinntteerrffaaccee llaayyeerr

     Each network-interface configured into a system defines
a path through which packets may be sent and received.  Nor-
mally  a  hardware device is associated with this interface,
though there is no requirement for this  (for  example,  all
systems  have  a  software  ``loopback''  interface used for
debugging and performance analysis).  In addition to manipu-
lating the hardware device, an interface module is responsi-
ble for encapsulation and decapsulation  of  any  link-layer
header information required to deliver a message to its des-
tination.  The selection of which interface to use in deliv-
ering  packets is a routing decision carried out at a higher
level than the network-interface layer.   An  interface  may
have addresses in one or more address families.  The address
is set at boot time using an _i_o_c_t_l on a socket in the appro-
priate domain; this operation is implemented by the protocol
family, after verifying the  operation  through  the  device
_i_o_c_t_l entry.

     An interface is defined by the following structure,

     struct ifnet {
          char     *if_name;              /* name, e.g. ``en'' or ``lo'' */
          short    if_unit;               /* sub-unit for lower level driver */
          short    if_mtu;                /* maximum transmission unit */
          short    if_flags;              /* up/down, broadcast, etc. */
          short    if_timer;              /* time 'til if_watchdog called */
          struct   ifaddr *if_addrlist;   /* list of addresses of interface */
          struct   ifqueue if_snd;        /* output queue */
          int      (*if_init)();          /* init routine */
          int      (*if_output)();        /* output routine */
          int      (*if_ioctl)();         /* ioctl routine */
          int      (*if_reset)();         /* bus reset routine */
          int      (*if_watchdog)();      /* timer routine */
          int      if_ipackets;           /* packets received on interface */
          int      if_ierrors;            /* input errors on interface */
          int      if_opackets;           /* packets sent on interface */
          int      if_oerrors;            /* output errors on interface */
          int      if_collisions;         /* collisions on csma interfaces */
          struct   ifnet *if_next;
     };

Each interface address has the following form:

     struct ifaddr {
             struct   sockaddr ifa_addr;   /* address of interface */
             union {
                      struct   sockaddr ifu_broadaddr;
                      struct   sockaddr ifu_dstaddr;
             } ifa_ifu;
             struct   ifnet *ifa_ifp;      /* back-pointer to interface */
             struct   ifaddr *ifa_next;    /* next address for interface */
     };
     #define ifa_broadaddr   ifa_ifu.ifu_broadaddr        /* broadcast address */
     #define ifa_dstaddr     ifa_ifu.ifu_dstaddr          /* other end of p-to-p link */

The protocol generally maintains this structure as part of a
larger structure containing additional information  concern-
ing the address.
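
     One way a protocol might embed the interface address in a
larger structure is sketched below.  The local sockaddr
definition is a reduced copy, and the demo_ifaddr structure,
its da_ fields and the sketch_ routines are hypothetical; the
only property relied upon is that the ifaddr is the first
member, so a pointer to one may be converted to a pointer to
the other.

```c
#include <assert.h>
#include <stddef.h>

struct ifnet;                           /* opaque in this sketch */
struct sockaddr {                       /* reduced local copy */
    unsigned short sa_family;
    char sa_data[14];
};
struct ifaddr {
    struct sockaddr ifa_addr;           /* address of interface */
    union {
        struct sockaddr ifu_broadaddr;
        struct sockaddr ifu_dstaddr;
    } ifa_ifu;
    struct ifnet *ifa_ifp;              /* back-pointer to interface */
    struct ifaddr *ifa_next;            /* next address for interface */
};

/* Hypothetical per-protocol wrapper; the ifaddr must come first. */
struct demo_ifaddr {
    struct ifaddr da_ifa;
    unsigned long da_netmask;           /* protocol-private information */
};

/* Walk an interface's address list for the first address in a family. */
struct ifaddr *sketch_find_family(struct ifaddr *list, int family)
{
    struct ifaddr *ifa;

    for (ifa = list; ifa != NULL; ifa = ifa->ifa_next)
        if (ifa->ifa_addr.sa_family == family)
            return ifa;
    return NULL;
}

/* Recover the protocol's larger structure from the generic one. */
struct demo_ifaddr *sketch_to_demo(struct ifaddr *ifa)
{
    assert(ifa != NULL);
    return (struct demo_ifaddr *)ifa;
}
```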

     Each  interface  has a send queue and routines used for
initialization, _i_f___i_n_i_t,  and  output,  _i_f___o_u_t_p_u_t.   If  the
interface resides on a system bus, the routine _i_f___r_e_s_e_t will
be called after a bus reset has been performed.   An  inter-
face  may  also  specify  a  timer  routine, _i_f___w_a_t_c_h_d_o_g; if
_i_f___t_i_m_e_r is non-zero, it  is  decremented  once  per  second
until it reaches zero, at which time the watchdog routine is
called.
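
     The watchdog countdown can be sketched as a routine run
once per second over the list of interfaces.  The ifnet
structure is reduced to the fields involved; passing the unit
number to the watchdog routine, and the demo_ and sketch_
names, are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct ifnet {
    const char *if_name;                /* name, e.g. "en" or "lo" */
    short if_unit;                      /* sub-unit for lower level driver */
    short if_timer;                     /* time 'til if_watchdog called */
    void (*if_watchdog)(int unit);      /* timer routine */
    struct ifnet *if_next;
};

/* Hypothetical watchdog routine that only counts its invocations. */
int demo_watchdog_calls;
void demo_watchdog(int unit)
{
    (void)unit;
    demo_watchdog_calls++;
}

/* Run once per second: count down active timers, fire those that expire. */
void sketch_if_slowtimo(struct ifnet *list)
{
    struct ifnet *ifp;

    for (ifp = list; ifp != NULL; ifp = ifp->if_next) {
        assert(ifp->if_timer >= 0);
        /* A zero timer is inactive; a nonzero one has not expired yet. */
        if (ifp->if_timer == 0 || --ifp->if_timer != 0)
            continue;
        if (ifp->if_watchdog != NULL)
            (*ifp->if_watchdog)(ifp->if_unit);
    }
}
```

A driver that sets if_timer to 3 when it starts a transfer
would have its watchdog called on the third pass unless it
clears the timer first.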

     The state of an interface and  certain  characteristics
are  stored in the _i_f___f_l_a_g_s field.  The following values are
possible:

     #define IFF_UP            0x1    /* interface is up */
     #define IFF_BROADCAST     0x2    /* broadcast is possible */
     #define IFF_DEBUG         0x4    /* turn on debugging */
     #define IFF_LOOPBACK      0x8    /* is a loopback net */
     #define IFF_POINTOPOINT   0x10   /* interface is point-to-point link */
     #define IFF_NOTRAILERS    0x20   /* avoid use of trailers */
     #define IFF_RUNNING       0x40   /* resources allocated */
     #define IFF_NOARP         0x80   /* no address resolution protocol */

If the interface is connected to a  network  which  supports
transmission  of  _b_r_o_a_d_c_a_s_t  packets, the IFF_BROADCAST flag
will be set and the _i_f_a___b_r_o_a_d_a_d_d_r  field  will  contain  the
address  to  be  used  in  sending  or accepting a broadcast
packet.  If the interface is  associated  with  a  point-to-
point  hardware  link  (for  example,  a  DEC  DMR-11),  the
IFF_POINTOPOINT flag will be set and _i_f_a___d_s_t_a_d_d_r  will  con-
tain  the  address of the host on the other side of the con-
nection.  These addresses  and  the  local  address  of  the
interface, _i_f_a___a_d_d_r, are used in filtering incoming packets.
The interface sets IFF_RUNNING after it has allocated system
resources  and  posted an initial read on the device it man-
ages.  This state bit is used to avoid  multiple  allocation
requests  when  an  interface's  address  is  changed.   The
IFF_NOTRAILERS flag indicates the interface  should  refrain
from  using  a _t_r_a_i_l_e_r encapsulation on outgoing packets, or
(where per-host negotiation of trailers  is  possible)  that
trailer encapsulations should not be requested; _t_r_a_i_l_e_r pro-
tocols are described in  section  14.   The  IFF_NOARP  flag
indicates  the interface should not use an ``address resolu-
tion protocol'' in mapping internetwork addresses  to  local
network addresses.
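
     As an illustration of how the flags and addresses might
be combined when filtering input, consider the sketch below.
Addresses are reduced to plain integers, and the demo_if
structure and sketch_accept routine are hypothetical; a real
interface compares addresses in the format of its own address
family.

```c
#include <assert.h>
#include <stddef.h>

#define IFF_UP          0x1     /* interface is up */
#define IFF_BROADCAST   0x2     /* broadcast is possible */

/* Hypothetical reduction of an interface and one of its addresses. */
struct demo_if {
    short if_flags;
    unsigned long local;        /* stands in for ifa_addr */
    unsigned long broadaddr;    /* stands in for ifa_broadaddr */
};

/* Return 1 if a packet addressed to dst should be accepted, else 0. */
int sketch_accept(const struct demo_if *ifp, unsigned long dst)
{
    assert(ifp != NULL);
    if (!(ifp->if_flags & IFF_UP))
        return 0;               /* interface is marked down */
    if (dst == ifp->local)
        return 1;               /* addressed to this host */
    if ((ifp->if_flags & IFF_BROADCAST) && dst == ifp->broadaddr)
        return 1;               /* broadcast on a broadcast network */
    return 0;
}
```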

     Various  statistics  are  also  stored in the interface
structure.  These may be viewed  by  users  using  the  _n_e_t_-
_s_t_a_t(1) program.

     The  interface  address  and  flags may be set with the
SIOCSIFADDR and SIOCSIFFLAGS _i_o_c_t_ls.   SIOCSIFADDR  is  used
initially  to  define each interface's address; SIOCSIFFLAGS
can be used to mark an interface down and perform  site-spe-
cific configuration.  The destination address of a point-to-
point link is set with SIOCSIFDSTADDR.  Corresponding opera-
tions  exist to read each value.  Protocol families may also
support operations to set and read  the  broadcast  address.
In  addition,  the  SIOCGIFCONF  _i_o_c_t_l  retrieves  a list of
interface names and addresses for all interfaces and  proto-
cols on the host.

66..33..11..  UUNNIIBBUUSS iinntteerrffaacceess

     All hardware related interfaces currently reside on the
UNIBUS.  Consequently a common set of utility  routines  for
dealing  with  the  UNIBUS  has been developed.  Each UNIBUS
interface uses a structure of the following form:

     struct  ifubinfo {
             short       iff_uban;                      /* uba number */
             short       iff_hlen;                      /* local net header length */
             struct      uba_regs *iff_uba;             /* uba regs, in vm */
             short       iff_flags;                     /* used during uballoc's */
     };

Additional structures are associated with each  receive  and
transmit buffer, normally one each per interface; for read,

     struct  ifrw {
             caddr_t     ifrw_addr;                     /* virt addr of header */
             short       ifrw_bdp;                      /* unibus bdp */
             short       ifrw_flags;                    /* type, etc. */
     #define IFRW_W      0x01                           /* is a transmit buffer */
             int         ifrw_info;                     /* value from uballoc */
             int         ifrw_proto;                    /* map register prototype */
             struct      pte *ifrw_mr;                  /* base of map registers */
     };

and for write,

     struct  ifxmt {
             struct      ifrw ifrw;
             caddr_t     ifw_base;                      /* virt addr of buffer */
             struct      pte ifw_wmap[IF_MAXNUBAMR];    /* base pages for output */
             struct      mbuf *ifw_xtofree;             /* pages being DMA'd out */
             short       ifw_xswapd;                    /* mask of clusters swapped */
             short       ifw_nmr;                       /* number of entries in wmap */
     };
     #define ifw_addr    ifrw.ifrw_addr
     #define ifw_bdp     ifrw.ifrw_bdp
     #define ifw_flags   ifrw.ifrw_flags
     #define ifw_info    ifrw.ifrw_info
     #define ifw_proto   ifrw.ifrw_proto
     #define ifw_mr      ifrw.ifrw_mr

One of each of these structures is conveniently packaged for
interfaces with single buffers for each direction,  as  fol-
lows:

     struct  ifuba {
             struct      ifubinfo ifu_info;
             struct      ifrw ifu_r;
             struct      ifxmt ifu_xmt;
     };
     #define ifu_uban    ifu_info.iff_uban
     #define ifu_hlen    ifu_info.iff_hlen
     #define ifu_uba     ifu_info.iff_uba
     #define ifu_flags   ifu_info.iff_flags
     #define ifu_w       ifu_xmt.ifrw
     #define ifu_xtofree ifu_xmt.ifw_xtofree

     The  _i_f___u_b_i_n_f_o  structure contains the general informa-
tion needed to characterize the I/O-mapped buffers  for  the
device.   In  addition, there is a structure describing each
buffer, including UNIBUS resources held  by  the  interface.
Sufficient  memory pages and bus map registers are allocated
to each buffer upon initialization according to the  maximum
packet  size  and header length.  The kernel virtual address
of the buffer is held in _i_f_r_w___a_d_d_r, and  the  map  registers
begin  at _i_f_r_w___m_r.  UNIBUS map register _i_f_r_w___m_r[-1] maps the
local network header ending on a page boundary.  UNIBUS data
paths  are  reserved  for  read  and  for  write,  given  by
_i_f_r_w___b_d_p.  The prototype of the map registers for  read  and
for write is saved in _i_f_r_w___p_r_o_t_o.

     When  write  transfers are not at least half-full pages
on page boundaries, the data are just copied into the  pages
mapped  on  the  UNIBUS  and  the transfer is started.  If a
write transfer is at least half a page long and  on  a  page
boundary, UNIBUS page table entries are swapped to reference
the pages, and then the  initial  pages  are  remapped  from
_i_f_w___w_m_a_p  when the transfer completes.  The mbufs containing
the mapped pages are placed on the _i_f_w___x_t_o_f_r_e_e queue  to  be
freed after transmission.

     When  read  transfers give at least half a page of data
to be input, page frames are allocated from a  network  page
list  and traded with the pages already containing the data,
mapping the allocated pages to replace the input  pages  for
the next UNIBUS data input.
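
     The copy-versus-remap decisions described in the two
paragraphs above can be sketched as follows.  The page size,
the enumeration and the routine names are hypothetical, and
the actual utility routines may apply further conditions that
are not stated in the text.

```c
#include <assert.h>

#define SK_PAGE 1024UL          /* hypothetical page size in bytes */

enum sk_xfer { SK_COPY, SK_REMAP };

/*
 * Output: remap only data that start on a page boundary and fill
 * at least half a page; everything else is copied into the pages
 * already mapped on the bus.
 */
enum sk_xfer sketch_write_strategy(unsigned long addr, unsigned long len)
{
    assert(len > 0);
    if (addr % SK_PAGE == 0 && len >= SK_PAGE / 2)
        return SK_REMAP;
    return SK_COPY;
}

/*
 * Input: trade pages when at least half a page of data arrived;
 * otherwise copy the data out of the receive buffer.
 */
enum sk_xfer sketch_read_strategy(unsigned long len)
{
    assert(len > 0);
    return len >= SK_PAGE / 2 ? SK_REMAP : SK_COPY;
}
```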

     The following utility routines are available for use in
writing network interface drivers; all  use  the  structures
described above.

if_ubaminit(ifubinfo, uban, hlen, nmr, ifr, nr, ifx, nx);
if_ubainit(ifuba, uban, hlen, nmr);

     _i_f___u_b_a_m_i_n_i_t allocates resources on UNIBUS adapter _u_b_a_n,
     storing the information in the _i_f_u_b_i_n_f_o, _i_f_r_w and _i_f_x_m_t
     structures  referenced.  The _i_f_r and _i_f_x parameters are
     pointers to arrays of _i_f_r_w and _i_f_x_m_t  structures  whose
     dimensions  are _n_r and _n_x, respectively.  _i_f___u_b_a_i_n_i_t is
     a  simpler,  backwards-compatible  interface  used  for
     hardware  with  single  buffers of each type.  They are
     called only at boot time or after a UNIBUS reset.   One
     data  path  (buffered  or  unbuffered, depending on the
     _i_f_u___f_l_a_g_s field) is allocated for each buffer.  The _n_m_r
     parameter indicates the number of UNIBUS mapping regis-
     ters required to map a maximal sized  packet  onto  the
     UNIBUS,  while  _h_l_e_n specifies the size of a local net-
     work header, if any, which should be mapped  separately
     from the data (see the description of trailer protocols
     in section 14).  Sufficient  UNIBUS  mapping  registers
     and  pages  of  memory  are allocated to initialize the
     input data path for an initial read.   For  the  output
     data  path,  mapping  registers and pages of memory are
     also allocated and mapped onto the UNIBUS.   The  pages
     associated  with  the  output  data  path  are  held in
     reserve in the event a write requires copying non-page-
     aligned  data (see _i_f___w_u_b_a_p_u_t below).  If _i_f___u_b_a_i_n_i_t is
     called with memory pages already allocated,  they  will
     be  used  instead of allocating new ones (this normally
     occurs after a UNIBUS reset).  A  1  is  returned  when
     allocation  and initialization are successful, 0 other-
     wise.

m = if_ubaget(ifubinfo, ifr, totlen, off0, ifp);
m = if_rubaget(ifuba, totlen, off0, ifp);

     _i_f___u_b_a_g_e_t and _i_f___r_u_b_a_g_e_t pull  input  data  out  of  an
     interface  receive  buffer and into an mbuf chain.  The
     first interface passes pointers to the _i_f_u_b_i_n_f_o  struc-
     ture  for  the interface and the _i_f_r_w structure for the
     receive buffer; the second call may be used for single-
     buffered  devices.  _t_o_t_l_e_n specifies the length of data
     to be obtained, not counting the local network  header.
     If  _o_f_f_0  is  non-zero, it indicates a byte offset to a
     trailing local network header which  should  be  copied
     into  a separate mbuf and prepended to the front of the
     resultant mbuf chain.  When the data amount to at least
     half  a  page,  the  previously  mapped  data pages are
     remapped into the mbufs and swapped with  fresh  pages,
     thus  avoiding  any  copy.   The receiving interface is
     recorded as _i_f_p, a pointer to an _i_f_n_e_t  structure,  for
     the  use of the receiving network protocol.  A 0 return
     value indicates a failure to allocate resources.

if_ubaput(ifubinfo, ifx, m);
if_wubaput(ifuba, m);

     _i_f___u_b_a_p_u_t and _i_f___w_u_b_a_p_u_t map a chain of  mbufs  onto  a
     network interface in preparation for output.  The first
     interface is used by  devices  with  multiple  transmit
     buffers.   The chain includes any local network header,
     which is copied so that it resides in  the  mapped  and
     aligned  I/O  space.   Page-sized  data that are page-
     aligned in the output buffer are mapped to  the  UNIBUS
     in place of the normal buffer page, and the correspond-
     ing mbuf is placed on a queue to be freed after  trans-
     mission.   Any  other  mbufs  which contained non-page-
     sized data portions are copied to  the  I/O  space  and
     then freed.  Pages mapped from a previous output opera-
     tion (no longer needed) are unmapped.

77..  SSoocckkeett//pprroottooccooll iinntteerrffaaccee

     The interface between the socket routines and the  com-
munication   protocols  is  through  the  _p_r___u_s_r_r_e_q  routine
defined  in  the  protocol  switch  table.   The   following
requests to a protocol module are possible:

     #define PRU_ATTACH        0      /* attach protocol */
     #define PRU_DETACH        1      /* detach protocol */
     #define PRU_BIND          2      /* bind socket to address */
     #define PRU_LISTEN        3      /* listen for connection */
     #define PRU_CONNECT       4      /* establish connection to peer */
     #define PRU_ACCEPT        5      /* accept connection from peer */
     #define PRU_DISCONNECT    6      /* disconnect from peer */
     #define PRU_SHUTDOWN      7      /* won't send any more data */
     #define PRU_RCVD          8      /* have taken data; more room now */
     #define PRU_SEND          9      /* send this data */
     #define PRU_ABORT         10     /* abort (fast DISCONNECT, DETACH) */
     #define PRU_CONTROL       11     /* control operations on protocol */
     #define PRU_SENSE         12     /* return status into m */
     #define PRU_RCVOOB        13     /* retrieve out of band data */
     #define PRU_SENDOOB       14     /* send out of band data */
     #define PRU_SOCKADDR      15     /* fetch socket's address */
     #define PRU_PEERADDR      16     /* fetch peer's address */
     #define PRU_CONNECT2      17     /* connect two sockets */
     /* begin for protocols internal use */
     #define PRU_FASTTIMO      18     /* 200ms timeout */
     #define PRU_SLOWTIMO      19     /* 500ms timeout */
     #define PRU_PROTORCV      20     /* receive from below */
     #define PRU_PROTOSEND     21     /* send to below */

A call on the user request routine is of the form,

     error = (*protosw[].pr_usrreq)(so, req, m, addr, rights);
     int error; struct socket *so; int req; struct mbuf *m, *addr, *rights;

The  mbuf data chain _m is supplied for output operations and
for certain other  operations  where  it  is  to  receive  a
result.   The  address _a_d_d_r is supplied for address-oriented
requests such  as  PRU_BIND  and  PRU_CONNECT.   The  _r_i_g_h_t_s
parameter is an optional pointer to an mbuf chain containing
user-specified capabilities (see  the  _s_e_n_d_m_s_g  and  _r_e_c_v_m_s_g
system  calls).  The protocol is responsible for disposal of
the data mbuf  chains  on  output  operations.   A  non-zero
return  value  gives  a  UNIX  error  number which should be
passed to higher level software.  The  following  paragraphs
describe each of the requests possible.

PRU_ATTACH
     When  a  protocol is bound to a socket (with the _s_o_c_k_e_t
     system call) the protocol module is  called  with  this
     request.  It is the responsibility of the protocol mod-
     ule  to  allocate   any   resources   necessary.    The
     ``attach'' request will always precede any of the other
     requests, and should not occur more than once.

PRU_DETACH
     This is the antithesis of the attach  request,  and  is
     used  at  the  time  a socket is deleted.  The protocol
     module may deallocate any  resources  assigned  to  the
     socket.

PRU_BIND
     When  a  socket  is initially created it has no address
     bound to it.  This request indicates  that  an  address
     should  be  bound  to an existing socket.  The protocol
     module must verify that the requested address is  valid
     and available for use.

PRU_LISTEN
     The  ``listen''  request  indicates  the user wishes to
     listen for incoming connection requests on the  associ-
     ated  socket.   The  protocol module should perform any
     state changes needed to carry out this request (if pos-
     sible).  A ``listen'' request always precedes a request
     to accept a connection.

PRU_CONNECT
     The ``connect'' request indicates the  user  wants  to
     establish  an association.  The _a_d_d_r parameter supplied
     describes the peer to be connected to.  The effect of a
     connect  request  may  vary  depending on the protocol.
     Virtual circuit protocols, such as TCP [Postel81b], use
     this request to initiate establishment of a TCP connec-
     tion.  Datagram protocols, such as UDP [Postel80], sim-
     ply  record the peer's address in a private data struc-
     ture and use it to tag all outgoing packets.  There are
     no restrictions on how many times a connect request may
     be used after an attach.  If a  protocol  supports  the
     notion of _m_u_l_t_i_-_c_a_s_t_i_n_g, it is possible to use multiple
     connects to establish  a  multi-cast  group.   Alterna-
     tively,  an  association may be broken by a PRU_DISCON-
     NECT request, and a new association created with a sub-
     sequent  connect  request;  all  without destroying and
     creating a new socket.

PRU_ACCEPT
     Following  a  successful  PRU_LISTEN  request  and  the
     arrival  of  one  or  more connections, this request is
     made to indicate the user has accepted the  first  con-
     nection  on the queue of pending connections.  The pro-
     tocol module should fill in the supplied address buffer
     with the address of the connected party.

PRU_DISCONNECT
     Eliminate  an  association  created  with a PRU_CONNECT
     request.

PRU_SHUTDOWN
     This call is used to indicate no more data will be sent
     and/or  received  (the  _a_d_d_r  parameter  indicates  the
     direction of the shutdown, as encoded in  the  _s_h_u_t_d_o_w_n
     system  call).   The  protocol  may, at its discretion,
     deallocate any data structures related to the  shutdown
     and/or notify a connected peer of the shutdown.

PRU_RCVD
     This  request is made only if the protocol entry in the
     protocol switch table includes  the  PR_WANTRCVD  flag.
     When  a  user  removes data from the receive queue this
     request will be sent to the protocol module.  It may be
     used  to  trigger  acknowledgements,  refresh windowing
     information, initiate data transfer, etc.

PRU_SEND
     Each user request to send data is translated  into  one
     or more PRU_SEND requests (a protocol may indicate that
     a single user send request must be  translated  into  a
     single  PRU_SEND  request  by  specifying the PR_ATOMIC
     flag in its protocol description).  The data to be sent
     is  presented to the protocol as a list of mbufs and an
     address is, optionally, supplied in the _a_d_d_r parameter.
     The  protocol is responsible for preserving the data in
     the socket's send queue if it is not able  to  send  it
     immediately,  or  if  it may need it at some later time
     (e.g. for retransmission).

PRU_ABORT
     This request indicates an abnormal termination of  ser-
     vice.  The protocol should delete any existing associa-
     tion(s).

PRU_CONTROL
     The ``control'' request is generated when a  user  per-
     forms  a  UNIX  _i_o_c_t_l  system call on a socket (and the
     ioctl is not intercepted by the socket  routines).   It
     allows protocol-specific operations to be provided out-
     side the scope of the  common  socket  interface.   The
     _a_d_d_r  parameter  contains  a pointer to a static kernel
     data area where relevant information may be obtained or
     returned.   The  _m  parameter contains the actual _i_o_c_t_l
     request code (note  the  non-standard  calling  conven-
     tion).   The  _r_i_g_h_t_s parameter contains a pointer to an
     _i_f_n_e_t structure if the _i_o_c_t_l operation  pertains  to  a
     particular network interface.

PRU_SENSE
     The  ``sense'' request is generated when the user makes
     an _f_s_t_a_t system call on a socket; it requests status of
     the  associated socket.  This currently returns a stan-
     dard _s_t_a_t structure.  It typically  contains  only  the
     optimal  transfer  size  for  the  connection (based on
     buffer size, windowing information and  maximum  packet
     size).   The _m parameter contains a pointer to a static
     kernel data area where  the  status  buffer  should  be
     placed.

PRU_RCVOOB
     Any  ``out-of-band''  data presently available is to be
     returned.  An mbuf is passed to  the  protocol  module,
     and  the  protocol should either place data in the mbuf
     or attach new mbufs to the one  supplied  if  there  is
     insufficient space in the single mbuf.  An error may be
     returned if out-of-band data is not (yet) available  or
     has already been consumed.  The _a_d_d_r parameter contains
     any options such as MSG_PEEK to  examine  data  without
     consuming it.

PRU_SENDOOB
     Like PRU_SEND, but for out-of-band data.

PRU_SOCKADDR
     The  local address of the socket is returned, if any is
     currently bound to it.  The address (with protocol spe-
     cific format) is returned in the _a_d_d_r parameter.

PRU_PEERADDR
     The  address  of  the  peer to which the socket is con-
     nected is returned.  The socket must be in the SS_ISCON-
     NECTED  state for this request to be made to the proto-
     col.  The address (with protocol specific  format)  is
     returned in the _a_d_d_r parameter.

PRU_CONNECT2
     The   protocol  module  is  supplied  two  sockets  and
     requested to establish a  connection  between  the  two
     without  binding any addresses, if possible.  This call
     is used in implementing the _s_o_c_k_e_t_p_a_i_r(2) system  call.

     The  following requests are used internally by the pro-
tocol modules and are never generated  by  the  socket  rou-
tines.  In certain instances, they are handed to the _p_r___u_s_r_-
_r_e_q routine solely for convenience in tracing  a  protocol's
operation (e.g. PRU_SLOWTIMO).

PRU_FASTTIMO
     A  ``fast timeout'' has occurred.  This request is made
     when  a  timeout  occurs  in the protocol's _p_r___f_a_s_t_t_i_m_o
     routine.  The  _a_d_d_r  parameter  indicates  which  timer
     expired.

PRU_SLOWTIMO
     A ``slow timeout'' has occurred.  This request is  made
     when  a  timeout  occurs  in the protocol's _p_r___s_l_o_w_t_i_m_o
     routine.  The  _a_d_d_r  parameter  indicates  which  timer
     expired.

PRU_PROTORCV
     This  request  is  used in the protocol-protocol inter-
     face, not by the socket routines.  It requests reception of
     data  destined  for  the protocol and not the user.  No
     protocols currently use this facility.

PRU_PROTOSEND
     This request allows a protocol to  send  data  destined
     for  another  protocol module, not a user.  The details
     of how data is marked ``addressed to protocol'' instead
     of  ``addressed to user'' are left to the protocol mod-
     ules.  No protocols currently use this facility.
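
     The request handling described in this section might be
organized as a single switch on the request code, as in the
sketch below.  It follows the calling convention given above
but handles only a few requests.  The reduced socket and mbuf
structures, the value of SS_ISCONNECTED, the particular error
numbers and the demo_ names are assumptions made for
illustration, not a description of any existing protocol.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define PRU_ATTACH      0       /* attach protocol */
#define PRU_DETACH      1       /* detach protocol */
#define PRU_LISTEN      3       /* listen for connection */
#define PRU_CONNECT     4       /* establish connection to peer */
#define PRU_DISCONNECT  6       /* disconnect from peer */
#define SS_ISCONNECTED  0x002   /* assumed value of the state bit */

struct mbuf { int m_len; };                     /* reduced */
struct socket { void *so_pcb; int so_state; };  /* reduced */

static int demo_pcb;            /* stands in for per-socket protocol state */

int demo_usrreq(struct socket *so, int req, struct mbuf *m,
    struct mbuf *addr, struct mbuf *rights)
{
    (void)m;
    (void)rights;
    assert(so != NULL);
    /* Attach must precede every other request. */
    if (so->so_pcb == NULL && req != PRU_ATTACH)
        return EINVAL;

    switch (req) {
    case PRU_ATTACH:
        if (so->so_pcb != NULL)
            return EINVAL;      /* attach may occur only once */
        so->so_pcb = &demo_pcb;
        return 0;
    case PRU_DETACH:
        so->so_pcb = NULL;
        return 0;
    case PRU_CONNECT:
        if (addr == NULL)
            return EINVAL;      /* the peer's address is required */
        so->so_state |= SS_ISCONNECTED;
        return 0;
    case PRU_DISCONNECT:
        if (!(so->so_state & SS_ISCONNECTED))
            return ENOTCONN;
        so->so_state &= ~SS_ISCONNECTED;
        return 0;
    default:
        return EOPNOTSUPP;      /* request not supported in this sketch */
    }
}
```

A datagram-style protocol could connect, disconnect and
connect again on the same socket through this routine without
an intervening detach.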

88..  PPrroottooccooll//pprroottooccooll iinntteerrffaaccee

     The interface between protocol modules is  through  the
_p_r___u_s_r_r_e_q,  _p_r___i_n_p_u_t, _p_r___o_u_t_p_u_t, _p_r___c_t_l_i_n_p_u_t, and _p_r___c_t_l_o_u_t_-
_p_u_t routines.  The  calling  conventions  for  all  but  the
_p_r___u_s_r_r_e_q  routine are expected to be specific to the proto-
col modules and are not guaranteed to be  consistent  across
protocol families.  We will examine the conventions used for
some of the Internet protocols in this section as  an  exam-
ple.

88..11..  pprr__oouuttppuutt

     The Internet protocol UDP uses the convention,

     error = udp_output(inp, m);
     int error; struct inpcb *inp; struct mbuf *m;

where  the  _i_n_p, ``internet protocol control block'', passed
between modules conveys per-connection  state  information,
and  the  mbuf chain contains the data to be sent.  UDP per-
forms consistency checks, appends its header,  calculates  a
checksum,  etc.  before passing the packet on.  UDP is based
on the Internet Protocol, IP [Postel81a], as its  transport.
UDP passes a packet to the IP module for output as follows:

     error = ip_output(m, opt, ro, flags);
     int error; struct mbuf *m, *opt; struct route *ro; int flags;


     The  call  to  IP's  output routine is more complicated
than that for UDP, as befits the additional work the IP mod-
ule  must  do.   The _m parameter is the data to be sent, and
the _o_p_t parameter is an optional list of  IP  options  which
should  be placed in the IP packet header.  The _r_o parameter
is  used  in  making  routing  decisions (and passing them
back  to the caller for use in subsequent calls).  The final
parameter, _f_l_a_g_s, contains flags indicating whether the user
is  allowed to transmit a broadcast packet and if routing is
to be performed.  The broadcast flag may be  inconsequential
if  the  underlying  hardware does not support the notion of
broadcasting.

     All output routines return 0  on  success  and  a  UNIX
error  number  if a failure occurred which could be detected
immediately (no buffer space available, no route to destina-
tion, etc.).
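
     The way an error detected at a lower level travels back
through the output routines can be sketched as follows.  The
argument lists follow the conventions shown above; the reduced
structures, the ro_known field, the use of ENETUNREACH for a
missing route, and the sketch_ and demo_ names are assumptions.
The sketch counts mbufs instead of releasing them so that it
can be run on statically allocated data.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct mbuf { struct mbuf *m_next; int m_len; };  /* reduced */
struct route { int ro_known; };         /* nonzero if a route is cached */
struct inpcb { struct route inp_route; };         /* reduced */

int demo_freed;                         /* mbufs disposed of so far */

/* Stand-in for freeing an mbuf chain: only counts the mbufs. */
static void sketch_m_freem(struct mbuf *m)
{
    for (; m != NULL; m = m->m_next)
        demo_freed++;
}

int sketch_ip_output(struct mbuf *m, struct mbuf *opt, struct route *ro,
    int flags)
{
    (void)opt;
    (void)flags;
    if (ro == NULL || !ro->ro_known) {
        sketch_m_freem(m);      /* lowest level reached disposes of data */
        return ENETUNREACH;     /* failure detected immediately */
    }
    /* A real module would hand the packet to an interface here. */
    sketch_m_freem(m);
    return 0;
}

int sketch_udp_output(struct inpcb *inp, struct mbuf *m)
{
    assert(inp != NULL && m != NULL);
    /* Header construction and checksum calculation are omitted. */
    return sketch_ip_output(m, NULL, &inp->inp_route, 0);
}
```

In both the success and the failure case the caller of
sketch_udp_output no longer owns the mbuf chain on return.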

88..22..  pprr__iinnppuutt

     Both UDP and TCP use the following calling convention,

     (void) (*protosw[].pr_input)(m, ifp);
     struct mbuf *m; struct ifnet *ifp;

Each  mbuf list passed is a single packet to be processed by
the protocol module.  The interface from  which  the  packet
was received is passed as the second parameter.

     The  IP input routine is a VAX software interrupt level
routine, and so is  not  called  with  any  parameters.   It
instead  communicates  with  network  interfaces  through  a
queue, _i_p_i_n_t_r_q, which  is  identical  in  structure  to  the
queues  used  by  the network interfaces for storing packets
awaiting transmission.  The software interrupt is enabled by
the  network  interfaces  when  they place input data on the
input queue.

88..33..  pprr__ccttlliinnppuutt

     This routine is used to convey ``control''  information
to a protocol module (i.e. information which might be passed
to the user, but is not data).

     The common calling convention for this routine is,

     (void) (*protosw[].pr_ctlinput)(req, addr);
     int req; struct sockaddr *addr;

The _r_e_q parameter is one of the following,
























Networking Implementation Notes                    SMM:18-27


     #define  PRC_IFDOWN             0       /* interface transition */
     #define  PRC_ROUTEDEAD          1       /* select new route if possible */
     #define  PRC_QUENCH             4       /* someone said to slow down */
     #define  PRC_MSGSIZE            5       /* message size forced drop */
     #define  PRC_HOSTDEAD           6       /* normally from IMP */
     #define  PRC_HOSTUNREACH        7       /* ditto */
     #define  PRC_UNREACH_NET        8       /* no route to network */
     #define  PRC_UNREACH_HOST       9       /* no route to host */
     #define  PRC_UNREACH_PROTOCOL   10      /* dst says bad protocol */
     #define  PRC_UNREACH_PORT       11      /* bad port # */
     #define  PRC_UNREACH_NEEDFRAG   12      /* IP_DF caused drop */
     #define  PRC_UNREACH_SRCFAIL    13      /* source route failed */
     #define  PRC_REDIRECT_NET       14      /* net routing redirect */
     #define  PRC_REDIRECT_HOST      15      /* host routing redirect */
     #define  PRC_REDIRECT_TOSNET    16      /* redirect for type of service & net */
     #define  PRC_REDIRECT_TOSHOST   17      /* redirect for tos & host */
     #define  PRC_TIMXCEED_INTRANS   18      /* packet lifetime expired in transit */
     #define  PRC_TIMXCEED_REASS     19      /* lifetime expired on reass q */
     #define  PRC_PARAMPROB          20      /* header incorrect */

while the _a_d_d_r parameter is the address to which the  condi-
tion  applies.   Many  of  the  requests have obviously been
derived from ICMP (the  Internet  Control  Message  Protocol
[Postel81c]),  and  from  error messages defined in the 1822
host/IMP convention [BBN78].  Mapping tables exist  to  con-
vert  control  requests to UNIX error codes which are deliv-
ered to a user.
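
     The mapping from control requests to error codes can be
pictured as a table indexed by request number.  The following
is a minimal sketch, not the kernel's own table: the names
ctl_errmap and ctl_to_errno are hypothetical, PRC_NCMDS is
assumed to be one more than the largest request, and the
error chosen for each request is illustrative only.

```c
#include <assert.h>
#include <errno.h>

/* A few of the request codes listed above. */
#define PRC_QUENCH        4
#define PRC_MSGSIZE       5
#define PRC_HOSTDEAD      6
#define PRC_UNREACH_NET   8
#define PRC_UNREACH_HOST  9
#define PRC_UNREACH_PORT  11
#define PRC_PARAMPROB     20
#define PRC_NCMDS         21    /* assumed: one past the largest request */

/* Hypothetical table; 0 means "do not report an error to the user". */
static const int ctl_errmap[PRC_NCMDS] = {
    [PRC_MSGSIZE]      = EMSGSIZE,
    [PRC_HOSTDEAD]     = EHOSTDOWN,
    [PRC_UNREACH_NET]  = ENETUNREACH,
    [PRC_UNREACH_HOST] = EHOSTUNREACH,
    [PRC_UNREACH_PORT] = ECONNREFUSED,
    [PRC_PARAMPROB]    = ENOPROTOOPT,
};

/* Convert a control request to a UNIX error number (0 if none). */
int
ctl_to_errno(int req)
{
    if (req < 0 || req >= PRC_NCMDS)
        return 0;
    return ctl_errmap[req];
}
```

A request that maps to 0, such as PRC_QUENCH in this sketch,
would not be reported to the user as an error.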

88..44..  pprr__ccttlloouuttppuutt

     This is the routine that implements per-socket  options
at the protocol level for getsockopt and setsockopt.  The
calling convention is,

     error = (*protosw[].pr_ctloutput)(op, so, level, optname, mp);
     int op; struct socket *so; int level, optname; struct mbuf **mp;

where _o_p is one of PRCO_SETOPT or  PRCO_GETOPT,  _s_o  is  the
socket  from  whence the call originated, and _l_e_v_e_l and _o_p_t_-
_n_a_m_e are the protocol level and option name supplied by  the
user.   The results of a PRCO_GETOPT call are returned in an
mbuf whose address is placed in  _m_p  before  return.   On  a
PRCO_SETOPT  call,  _m_p  contains the address of an mbuf con-
taining the option data; the mbuf  should  be  freed  before
return.
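
     A protocol's pr_ctloutput routine following this
convention might be organized as below.  This is a minimal
sketch under stated assumptions: struct mbuf and struct socket
are simplified stand-ins, m_get and m_free here are wrappers
around calloc and free, and the XP_LEVEL level, the XP_TTL
option and the so_ttl field are hypothetical.  The numeric
values given for PRCO_GETOPT and PRCO_SETOPT are likewise
assumed.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define PRCO_GETOPT 0           /* values assumed for this sketch */
#define PRCO_SETOPT 1
#define XP_LEVEL    250         /* hypothetical protocol level */
#define XP_TTL      1           /* hypothetical option name */

/* Simplified stand-ins for the kernel structures. */
struct mbuf   { int m_len; char m_dat[112]; };
struct socket { int so_ttl; };  /* hypothetical option storage */

static struct mbuf *m_get(void) { return calloc(1, sizeof(struct mbuf)); }
static void m_free(struct mbuf *m) { free(m); }

int
xp_ctloutput(int op, struct socket *so, int level, int optname,
    struct mbuf **mp)
{
    int known = (level == XP_LEVEL && optname == XP_TTL);
    int error = 0;

    assert(so != NULL && mp != NULL);
    switch (op) {
    case PRCO_SETOPT:
        if (!known)
            error = ENOPROTOOPT;
        else if (*mp == NULL || (size_t)(*mp)->m_len != sizeof(int))
            error = EINVAL;
        else
            memcpy(&so->so_ttl, (*mp)->m_dat, sizeof(int));
        if (*mp != NULL) {      /* option mbuf is freed in every case */
            m_free(*mp);
            *mp = NULL;
        }
        break;
    case PRCO_GETOPT:
        if (!known) {
            error = ENOPROTOOPT;
            break;
        }
        if ((*mp = m_get()) == NULL) {
            error = ENOBUFS;
            break;
        }
        (*mp)->m_len = sizeof(int);     /* result returned in a new mbuf */
        memcpy((*mp)->m_dat, &so->so_ttl, sizeof(int));
        break;
    default:
        error = EINVAL;
    }
    return error;
}
```

In this sketch the caller of a successful PRCO_GETOPT owns the
returned mbuf and is responsible for freeing it.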

99..  PPrroottooccooll//nneettwwoorrkk--iinntteerrffaaccee iinntteerrffaaccee

     The lowest layer in the set of protocols which comprise
a protocol family must interface itself to one or more  net-
work  interfaces  in  order to transmit and receive packets.
It is assumed that any  routing  decisions  have  been  made
before handing a packet to a network interface; in fact, this
is absolutely necessary in order to locate any interface  at
all  (unless,  of  course,  one  uses a single ``hardwired''
interface).  There are two cases with which to be concerned,
transmission  of a packet and receipt of a packet; each will
be considered separately.

99..11..  PPaacckkeett ttrraannssmmiissssiioonn

     Assuming a protocol has a handle on an interface, ifp,
a  (struct  ifnet *),  it transmits a fully formatted packet
with the following call,

     error = (*ifp->if_output)(ifp, m, dst);
     int error; struct ifnet *ifp; struct mbuf *m; struct sockaddr *dst;

The output routine for the network interface  transmits  the
packet  _m to the _d_s_t address, or returns an error indication
(a UNIX error number).  In reality transmission may  not  be
immediate  or successful; normally the output routine simply
queues the packet on its send queue and primes an  interrupt
driven routine to actually transmit the packet.  For unreli-
able media, such as the Ethernet,  ``successful''  transmis-
sion  simply  means  that  the packet has been placed on the
cable without a collision.   On  the  other  hand,  an  1822
interface  guarantees proper delivery or an error indication
for each message transmitted.  The  model  employed  in  the
networking  system  attaches  no promises of delivery to the
packets handed to a network interface, and thus  corresponds
more closely to the Ethernet.  Errors returned by the output
routine are only those that can be detected immediately, and
are  normally  trivial  in  nature (no buffer space, address
format not handled, etc.).  No  indication  is  received  if
errors are detected after the call has returned.

99..22..  PPaacckkeett rreecceeppttiioonn

     Each  protocol  family  must  have one or more ``lowest
level'' protocols.  These protocols deal  with  internetwork
addressing  and are responsible for the delivery of incoming
packets to the proper protocol processing modules.   In  the
PUP  model [Boggs78] these protocols are termed Level 1 pro-
tocols, in the ISO model, network layer protocols.  In  this
system  each  such protocol module has an input packet queue
assigned to it.  Incoming  packets  received  by  a  network
interface  are  queued  for  the  protocol module, and a VAX
software interrupt is posted to initiate processing.

     Four macros are available for queuing and dequeuing
packets:

IF_ENQUEUE(ifq, m)
     This places the packet m at the tail of the queue ifq.

IF_DEQUEUE(ifq, m)
     This places a pointer to the packet at the head of
     queue ifq in m and removes the packet from the queue.
     A zero value will be returned in m if the queue is
     empty.

IF_DEQUEUEIF(ifq, m, ifp)
     Like IF_DEQUEUE, this removes the next packet from the
     head of a queue and returns it in m.  A pointer to the
     interface on which the packet was received is placed in
     ifp, a (struct ifnet *).

IF_PREPEND(ifq, m)
     This places the packet m at the head of the queue ifq.

     Each queue has a maximum length associated with it as a
simple form of congestion control.  The macro  IF_QFULL(ifq)
returns  1  if  the queue is filled, in which case the macro
IF_DROP(ifq) should be used to increment the  count  of  the
number  of  packets  dropped,  and  the  offending packet is
dropped.  For example, the following code fragment  is  com-
monly found in a network interface's input routine,

     if (IF_QFULL(inq)) {
            IF_DROP(inq);
            m_freem(m);
     } else
            IF_ENQUEUE(inq, m);
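
     The queue macros can be modelled outside the kernel as
follows.  This is a sketch only: the structure layout and the
field names ifq_head, ifq_tail, ifq_len, ifq_maxlen, ifq_drops
and m_nextpkt are assumptions made for illustration,
IF_DEQUEUEIF is omitted, and enqueue_or_drop is a hypothetical
helper that wraps the fragment above without freeing the
packet.

```c
#include <assert.h>
#include <stddef.h>

struct mbuf {
    struct mbuf *m_nextpkt;     /* next packet on the queue */
    int          m_id;          /* payload stand-in (hypothetical) */
};

struct ifqueue {
    struct mbuf *ifq_head;
    struct mbuf *ifq_tail;
    int          ifq_len;       /* packets currently queued */
    int          ifq_maxlen;    /* upper bound on ifq_len */
    int          ifq_drops;     /* packets dropped on a full queue */
};

#define IF_QFULL(ifq)   ((ifq)->ifq_len >= (ifq)->ifq_maxlen)
#define IF_DROP(ifq)    ((ifq)->ifq_drops++)

#define IF_ENQUEUE(ifq, m) do {                 \
    (m)->m_nextpkt = NULL;                      \
    if ((ifq)->ifq_tail == NULL)                \
        (ifq)->ifq_head = (m);                  \
    else                                        \
        (ifq)->ifq_tail->m_nextpkt = (m);       \
    (ifq)->ifq_tail = (m);                      \
    (ifq)->ifq_len++;                           \
} while (0)

#define IF_PREPEND(ifq, m) do {                 \
    (m)->m_nextpkt = (ifq)->ifq_head;           \
    if ((ifq)->ifq_tail == NULL)                \
        (ifq)->ifq_tail = (m);                  \
    (ifq)->ifq_head = (m);                      \
    (ifq)->ifq_len++;                           \
} while (0)

#define IF_DEQUEUE(ifq, m) do {                 \
    (m) = (ifq)->ifq_head;                      \
    if ((m) != NULL) {                          \
        (ifq)->ifq_head = (m)->m_nextpkt;       \
        if ((ifq)->ifq_head == NULL)            \
            (ifq)->ifq_tail = NULL;             \
        (m)->m_nextpkt = NULL;                  \
        (ifq)->ifq_len--;                       \
    }                                           \
} while (0)

/* Returns 1 if the packet was queued, 0 if it was dropped. */
int
enqueue_or_drop(struct ifqueue *inq, struct mbuf *m)
{
    assert(inq != NULL && m != NULL);
    if (IF_QFULL(inq)) {
        IF_DROP(inq);
        return 0;       /* the caller would free the packet here */
    }
    IF_ENQUEUE(inq, m);
    return 1;
}
```

With ifq_maxlen set to 2, a third packet offered to the queue
is counted in ifq_drops and the two queued packets are later
dequeued in arrival order.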


1100..  GGaatteewwaayyss aanndd rroouuttiinngg iissssuueess

     The  system has been designed with the expectation that
it  will  be  used  in  an  internetwork  environment.   The
``canonical''  environment was envisioned to be a collection
of local area networks  connected  at  one  or  more  points
through  hosts with multiple network interfaces (one on each
local area network), and possibly a  connection  to  a  long
haul  network  (for example, the ARPANET).  In such an envi-
ronment, issues of gatewaying and packet routing become very
important.  Certain of these issues, such as congestion con-
trol, have been handled in a simplistic manner  or  specifi-
cally  not  addressed.  Instead, where possible, the network
system attempts to provide simple mechanisms upon which more
involved  policies  may  be  implemented.   As some of these
problems become better understood, the  solutions  developed
will be incorporated into the system.

     This  section will describe the facilities provided for
packet routing.  The simplistic mechanisms provided for con-
gestion control are described in chapter 12.

1100..11..  RRoouuttiinngg ttaabblleess

     The  network  system  maintains a set of routing tables
for selecting a network interface to  use  in  delivering  a
packet to its destination.  These tables are of the form:

     struct rtentry {
              u_long   rt_hash;                /* hash key for lookups */
              struct   sockaddr rt_dst;        /* destination net or host */
              struct   sockaddr rt_gateway;    /* forwarding agent */
              short    rt_flags;               /* see below */
              short    rt_refcnt;              /* no. of references to structure */
              u_long   rt_use;                 /* packets sent using route */
              struct   ifnet *rt_ifp;          /* interface to give packet to */
     };


     The  routing  information  is organized in two separate
tables, one for routes to a host and one  for  routes  to  a
network.  The distinction between hosts and networks is nec-
essary so that a single  mechanism  may  be  used  for  both
broadcast and multi-drop type networks, and also for networks
built from point-to-point links (e.g., DECnet [DEC80]).

     Each  table  is  organized  as  a  hashed set of linked
lists.  Two 32-bit hash values are  calculated  by  routines
defined  for  each address family; one based on the destina-
tion being a host, and one assuming the target is  the  net-
work  portion  of  the  address.  Each hash value is used to
locate a hash chain to search (by taking  the  value  modulo
the  hash  table  size)  and the entire 32-bit value is then
used as a key in scanning the list of routes.   Lookups  are
applied  first  to  the routing table for hosts, then to the
routing table for networks.  If both lookups fail,  a  final
lookup is made for a ``wildcard'' route (by convention, net-
work 0).  The first appropriate route  discovered  is  used.
By doing this, routes to a specific host on a network may be
present as well as routes to the network.  This also  allows
a  ``fall  back'' network route to be defined to a ``smart''
gateway which may then perform more intelligent routing.
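
     The lookup order described above (host table, then
network table, then wildcard) can be sketched as follows.
The sketch makes several simplifying assumptions that are not
part of the system: addresses are plain 32-bit integers whose
upper 24 bits form the network part, the hash routines
hash_host and hash_net are hypothetical, entries are chained
through an rt_next field, and the table size and the value of
RTF_HOST are chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RTHASHSIZ 8             /* table size assumed for this sketch */
#define RTF_HOST  0x4           /* flag value assumed */

struct rtentry {
    uint32_t        rt_hash;    /* full 32-bit hash key */
    uint32_t        rt_dst;     /* simplified: address as an integer */
    short           rt_flags;
    struct rtentry *rt_next;    /* hash chain link (hypothetical) */
};

static struct rtentry *rthost[RTHASHSIZ];   /* routes to hosts */
static struct rtentry *rtnet[RTHASHSIZ];    /* routes to networks */

/* Hypothetical per-family routines. */
static uint32_t net_of(uint32_t a)    { return a & 0xffffff00u; }
static uint32_t hash_host(uint32_t a) { return a; }
static uint32_t hash_net(uint32_t a)  { return net_of(a); }

/* Pick a chain by hash modulo table size, then match on the full key. */
static struct rtentry *
rt_scan(struct rtentry **table, uint32_t hash, uint32_t key)
{
    struct rtentry *rt;

    for (rt = table[hash % RTHASHSIZ]; rt != NULL; rt = rt->rt_next)
        if (rt->rt_hash == hash && rt->rt_dst == key)
            return rt;
    return NULL;
}

void
rt_insert(struct rtentry *rt)
{
    int host = (rt->rt_flags & RTF_HOST) != 0;
    struct rtentry **table = host ? rthost : rtnet;

    assert(rt != NULL);
    rt->rt_hash = host ? hash_host(rt->rt_dst) : hash_net(rt->rt_dst);
    rt->rt_next = table[rt->rt_hash % RTHASHSIZ];
    table[rt->rt_hash % RTHASHSIZ] = rt;
}

struct rtentry *
rt_lookup(uint32_t dst)
{
    struct rtentry *rt;

    if ((rt = rt_scan(rthost, hash_host(dst), dst)) != NULL)
        return rt;                          /* route to this host */
    if ((rt = rt_scan(rtnet, hash_net(dst), net_of(dst))) != NULL)
        return rt;                          /* route to its network */
    return rt_scan(rtnet, hash_net(0), 0);  /* wildcard: network 0 */
}
```

A host route therefore takes precedence over a route to the
same host's network, and the wildcard entry is consulted only
when both of the other lookups fail.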

     Each routing table entry contains  a  destination  (the
desired  final  destination), a gateway to which to send the
packet, and various flags which indicate the route's  status
and  type (host or network).  A count of the number of pack-
ets sent using the route is kept,  along  with  a  count  of
``held  references''  to the dynamically allocated structure
to ensure that memory reclamation occurs only when the route
is not in use.  Finally, a pointer to a network interface
is kept; packets sent using the route should be handed
to this interface.

     Routes  are  typed  in two ways: either as host or net-
work, and as ``direct'' or ``indirect''.   The  host/network
distinction determines how to compare the rt_dst field during
lookup.  If the route is to a network, only a packet's
destination network is compared to the rt_dst entry stored
in the table.  If the route is to a host, the addresses must
match bit for bit.

     The  distinction  between  ``direct''  and ``indirect''
routes indicates whether the destination  is  directly  con-
nected  to the source.  This is needed when performing local
network encapsulation.  If a packet is destined for  a  peer
at  a host or network which is not directly connected to the
source, the internetwork  packet  header  will  contain  the
address of the eventual destination, while the local network
header will address the  intervening  gateway.   Should  the
destination  be  directly  connected,  these  addresses  are
likely to be identical, or a mapping between the two exists.
The  RTF_GATEWAY  flag  indicates  that  the  route is to an
``indirect'' gateway  agent,  and  that  the  local  network
header should be filled in from the rt_gateway field instead
of from the final internetwork destination address.
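
     The effect of the RTF_GATEWAY flag on local network
encapsulation can be sketched in a few lines.  The function
link_dest is hypothetical, addresses are simplified to 32-bit
integers, and the flag value is assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RTF_GATEWAY 0x2         /* flag value assumed for this sketch */

struct rtentry {
    uint32_t rt_dst;            /* destination net or host */
    uint32_t rt_gateway;        /* forwarding agent */
    short    rt_flags;
};

/*
 * Return the address to place in the local network header for a
 * packet whose internetwork destination is dst.
 */
uint32_t
link_dest(const struct rtentry *rt, uint32_t dst)
{
    assert(rt != NULL);
    if (rt->rt_flags & RTF_GATEWAY)
        return rt->rt_gateway;  /* indirect: address the gateway */
    return dst;                 /* direct: address the destination */
}
```

In both cases the internetwork header still carries the
eventual destination; only the local network header differs.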

     It is assumed that multiple routes to the same destina-
tion  will not be present; only one of multiple routes, that
most recently installed, will be used.

     Routing redirect control messages are used  to  dynami-
cally  modify  existing  routing  table  entries  as well as
dynamically create new  routing  table  entries.   On  hosts
where  exhaustive  routing  information  is too expensive to
maintain (e.g. work stations), the combination  of  wildcard
routing entries and routing redirect messages can be used to
provide a simple routing management scheme without  the  use
of  a  higher level policy process.  Current connections may
be rerouted after notification of the protocols by means  of
their _p_r___c_t_l_i_n_p_u_t entries.  Statistics are kept by the rout-
ing table routines on the use of routing  redirect  messages
and  their  affect  on the routing tables.  These statistics
may be viewed using _n_e_t_s_t_a_t(1).

     Status information other than routing redirect  control
messages may be used in the future, but at present it is
ignored.  Likewise, more intelligent ``metrics'' may be used
to  describe  routes  in the future, possibly based on band-
width and monetary costs.

1100..22..  RRoouuttiinngg ttaabbllee iinntteerrffaaccee

     A protocol accesses the routing  tables  through  three
routines,  one to allocate a route, one to free a route, and
one to process a  routing  redirect  control  message.   The
routine _r_t_a_l_l_o_c performs route allocation; it is called with
a pointer to the following structure containing the  desired
destination:

     struct route {
            struct    rtentry *ro_rt;
            struct    sockaddr ro_dst;
     };

The  route  returned is assumed ``held'' by the caller until
released with an rtfree call.  Protocols which implement
virtual  circuits,  such  as  TCP,  hold onto routes for the
duration of the circuit's  lifetime,  while  connection-less
protocols,  such  as  UDP, allocate and free routes whenever
their destination address changes.
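
     The ``held'' reference discipline can be illustrated
with a small model.  This is a sketch, not the kernel code:
rtalloc_sketch and rtfree_sketch are hypothetical names, the
routing table is a fixed two-entry array, and the rt_deleted
and rt_reclaimed fields are invented here to stand in for
removal from the tables and for freeing the structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct rtentry {
    uint32_t rt_dst;        /* simplified destination */
    short    rt_refcnt;     /* no. of held references */
    int      rt_deleted;    /* hypothetical: removed from the tables */
    int      rt_reclaimed;  /* hypothetical: memory would be freed */
};

struct route {
    struct rtentry *ro_rt;
    uint32_t        ro_dst;
};

static struct rtentry table[2] = {
    { 0x0a000001u, 0, 0, 0 },
    { 0x0a000002u, 0, 0, 0 },
};

/* Find a route for ro->ro_dst and hold a reference to it. */
void
rtalloc_sketch(struct route *ro)
{
    size_t i;

    assert(ro != NULL);
    ro->ro_rt = NULL;
    for (i = 0; i < sizeof table / sizeof table[0]; i++)
        if (!table[i].rt_deleted && table[i].rt_dst == ro->ro_dst) {
            table[i].rt_refcnt++;
            ro->ro_rt = &table[i];
            return;
        }
}

/* Release a reference; reclaim only when no holder remains. */
void
rtfree_sketch(struct rtentry *rt)
{
    assert(rt != NULL && rt->rt_refcnt > 0);
    if (--rt->rt_refcnt == 0 && rt->rt_deleted)
        rt->rt_reclaimed = 1;
}
```

A route deleted while two callers hold it is reclaimed only
after the second caller releases its reference.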

     The routine _r_t_r_e_d_i_r_e_c_t is called to process  a  routing
redirect  control  message.  It is called with a destination
address, the new gateway to that destination, and the source
of  the redirect.  Redirects are accepted only from the cur-
rent router for the destination.  If  a  non-wildcard  route
exists to the destination, the gateway entry in the route is
modified to point at the new gateway supplied.  Otherwise, a
new  routing table entry is inserted reflecting the informa-
tion supplied.  Routes to interfaces and routes to  gateways
which are not directly accessible from the host are ignored.

1100..33..  UUsseerr lleevveell rroouuttiinngg ppoolliicciieess

     Routing policies implemented in user processes manipulate
the kernel routing tables through two ioctl calls.  The
commands SIOCADDRT and  SIOCDELRT  add  and  delete  routing
entries,  respectively;  the  tables  are  read  through the
/dev/kmem device.  The decision to place policy decisions in
a  user process implies that routing table updates may lag a
bit behind the identification of new routes, or the  failure
of  existing  routes, but this period of instability is nor-
mally very small with proper implementation of  the  routing
process.   Advisory information, such as ICMP error messages
and IMP diagnostic messages, may be read  from  raw  sockets
(described in the next section).

     Several  routing  policy  processes  have  already been
implemented.  The system standard ``routing daemon'' uses  a
variant   of  the  Xerox  NS  Routing  Information  Protocol
[Xerox82] to maintain up-to-date routing tables in our local
environment.  Interaction with other existing routing proto-
cols, such as the Internet EGP (Exterior Gateway  Protocol),
has been accomplished using a similar process.

1111..  RRaaww ssoocckkeettss

     A  raw  socket  is  an object which allows users direct
access to a lower-level protocol.  Raw sockets are  intended
for  knowledgeable processes which wish to take advantage of
some protocol feature not directly  accessible  through  the
normal  interface,  or  for the development of new protocols
built atop existing lower level protocols.  For  example,  a
new  version  of TCP might be developed at the user level by
using a raw IP socket for delivery of packets.  The  raw  IP
socket  interface attempts to provide an identical interface
to the one a protocol would have if it were resident in  the
kernel.

     The  raw  socket  support is built around a generic raw
socket interface, (possibly) augmented by  protocol-specific
processing routines.  This section will describe the core of
the raw socket interface.

1111..11..  CCoonnttrrooll bblloocckkss

     Every raw socket has a protocol control  block  of  the
following form:

     struct rawcb {
             struct   rawcb *rcb_next;        /* doubly linked list */
             struct   rawcb *rcb_prev;
             struct   socket *rcb_socket;     /* back pointer to socket */
             struct   sockaddr rcb_faddr;     /* destination address */
             struct   sockaddr rcb_laddr;     /* socket's address */
             struct   sockproto rcb_proto;    /* protocol family, protocol */
             caddr_t  rcb_pcb;                /* protocol specific stuff */
             struct   mbuf *rcb_options;      /* protocol specific options */
             struct   route rcb_route;        /* routing information */
             short    rcb_flags;
     };

All  the control blocks are kept on a doubly linked list for
performing lookups during packet dispatch.  Associations may
be recorded in the control block and used by the output routine
in preparing packets for transmission.  The rcb_proto
structure contains the protocol family and protocol number
with which the raw socket is associated.  The protocol, family
and addresses are used to filter packets on input; this
will be described in more detail shortly.  If any protocol-specific
information is required, it may be attached to the
control block using the rcb_pcb field.  Protocol-specific
options for transmission in outgoing packets may be stored
in rcb_options.

     A raw socket interface is datagram oriented.  That  is,
each  send  or  receive on the socket requires a destination
address.  This address may be supplied by the user or stored
in  the  control  block  and  automatically installed in the
outgoing packet by the output routine.  Since it is not pos-
sible  to  determine whether an address is present or not in
the control block, two flags, RAW_LADDR and RAW_FADDR, indi-
cate if a local and foreign address are present.  Routing is
expected to be performed by the underlying protocol if  nec-
essary.

1111..22..  IInnppuutt pprroocceessssiinngg

     Input  packets are ``assigned'' to raw sockets based on
a simple pattern matching scheme.  Each network interface or
protocol  gives  unassigned packets to the raw input routine
with the call:

     raw_input(m, proto, src, dst);
     struct mbuf *m; struct sockproto *proto; struct sockaddr *src, *dst;

The data packet then has a generic header prepended to it of
the form

     struct raw_header {
            struct    sockproto raw_proto;
            struct    sockaddr raw_dst;
            struct    sockaddr raw_src;
     };

and  it is placed in a packet queue for the ``raw input pro-
tocol'' module.  Packets taken from this  queue  are  copied
into  any raw sockets that match the header according to the
following rules,

1)   The protocol family of the socket and header agree.

2)   If the protocol number in the socket is non-zero,  then
     it agrees with that found in the packet header.

3)   If  a  local  address  is  defined  for the socket, the
     address format of the local address is the same as  the
     destination  address's  and the two addresses agree bit
     for bit.

4)   The rules of 3) are applied  to  the  socket's  foreign
     address and the packet's source address.

A  basic assumption is that addresses present in the control
block and packet  header  (as  constructed  by  the  network
interface and any raw input protocol module) are in a canon-
ical form which may be ``block compared''.
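
     The four matching rules can be written as a single
predicate.  The following is a sketch under stated
assumptions: raw_match and addr_equal are hypothetical names,
struct sockaddr_s is a simplified stand-in for struct
sockaddr, only the fields needed for matching are kept in the
control block, and the values of RAW_LADDR and RAW_FADDR are
assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define RAW_LADDR 0x1           /* flag values assumed for this sketch */
#define RAW_FADDR 0x2

/* Simplified stand-ins for the kernel structures. */
struct sockaddr_s { unsigned short sa_family; char sa_data[14]; };
struct sockproto  { unsigned short sp_family, sp_protocol; };

struct raw_header {
    struct sockproto  raw_proto;
    struct sockaddr_s raw_dst;
    struct sockaddr_s raw_src;
};

struct rawcb {
    struct sockaddr_s rcb_faddr;    /* destination address */
    struct sockaddr_s rcb_laddr;    /* socket's address */
    struct sockproto  rcb_proto;    /* protocol family, protocol */
    short             rcb_flags;
};

/* "Block compare": same address format and identical bytes. */
static int
addr_equal(const struct sockaddr_s *a, const struct sockaddr_s *b)
{
    return a->sa_family == b->sa_family &&
        memcmp(a->sa_data, b->sa_data, sizeof a->sa_data) == 0;
}

/* Nonzero if the packet described by rh should be copied to rp. */
int
raw_match(const struct rawcb *rp, const struct raw_header *rh)
{
    assert(rp != NULL && rh != NULL);
    if (rp->rcb_proto.sp_family != rh->raw_proto.sp_family)
        return 0;                                       /* rule 1 */
    if (rp->rcb_proto.sp_protocol != 0 &&
        rp->rcb_proto.sp_protocol != rh->raw_proto.sp_protocol)
        return 0;                                       /* rule 2 */
    if ((rp->rcb_flags & RAW_LADDR) &&
        !addr_equal(&rp->rcb_laddr, &rh->raw_dst))
        return 0;                                       /* rule 3 */
    if ((rp->rcb_flags & RAW_FADDR) &&
        !addr_equal(&rp->rcb_faddr, &rh->raw_src))
        return 0;                                       /* rule 4 */
    return 1;
}
```

A socket with a zero protocol number and neither address flag
set matches every packet of its protocol family.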

1111..33..  OOuuttppuutt pprroocceessssiinngg

     On output the raw pr_usrreq routine passes the packet
and  a  pointer to the raw control block to the raw protocol
output routine for any  processing  required  before  it  is
delivered  to the appropriate network interface.  The output
routine is normally the only code required  to  implement  a
raw socket interface.

1122..  BBuuffffeerriinngg aanndd ccoonnggeessttiioonn ccoonnttrrooll

     One of the major factors in the performance of a proto-
col is the buffering policy used.  Lack of a proper  buffer-
ing  policy can force packets to be dropped, cause falsified
windowing information to be emitted by  protocols,  fragment
host memory, degrade the overall host performance, etc.  Due
to problems such as these, most  systems  allocate  a  fixed
pool  of memory to the networking system and impose a policy
optimized for ``normal'' network operation.

     The networking system developed for UNIX is little dif-
ferent in this respect.  At boot time a fixed amount of mem-
ory is allocated by the networking system.  At  later  times
more  system memory may be requested as the need arises, but
at no time is memory ever returned to  the  system.   It  is
possible  to  garbage  collect  memory from the network, but
difficult.  In order  to  perform  this  garbage  collection
properly,  some  portion  of  the  network  will  have to be
``turned off'' as data structures are updated.  The interval
over which this occurs must be kept small compared to the average
inter-packet arrival time, or too much traffic may be
lost,  impacting  other  hosts  on  the  network, as well as
increasing load on  the  interconnecting  mediums.   In  our
environment  we  have  not  experienced a need for such com-
paction, and thus have left the problem unresolved.

     The mbuf structure was introduced  in  chapter  5.   In
this  section a brief description will be given of the allo-
cation mechanisms, and policies used  by  the  protocols  in
performing connection level buffering.

1122..11..  MMeemmoorryy mmaannaaggeemmeenntt

     The  basic  memory allocation routines manage a private
page map, the size of which determines the maximum amount of
memory that may be allocated by the network.  A small amount
of memory is allocated at boot time to initialize  the  mbuf
and  mbuf  page cluster free lists.  When the free lists are
exhausted, more memory is requested from the  system  memory
allocator  if space remains in the map.  If memory cannot be
allocated, callers may block awaiting free  memory,  or  the
failure  may  be  reflected  to the caller immediately.  The
allocator will not block awaiting free map entries, however,
as exhaustion of the page map usually indicates that buffers
have been lost due to a ``leak.''  The private page table is
used  by the network buffer management routines in remapping
pages to be logically contiguous as  the  need  arises.   In
addition,  an  array  of reference counts parallels the page
table and is used when multiple references  to  a  page  are
present.

     Mbufs  are  128  byte structures, 8 fitting in a 1Kbyte
page of memory.  When data is placed in mbufs, it is  copied
or  remapped  into logically contiguous pages of memory from
the network page pool if possible.  Data smaller  than  half
of  the  size  of a page is copied into one or more 112 byte
mbuf data areas.
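
     The sizes quoted above imply some simple arithmetic,
sketched below.  The constant names MSIZE, MLEN and CLBYTES
and the two helper functions are assumptions made for this
illustration; only the values 128, 112 and 1024 and the
half-page rule come from the text.

```c
#include <assert.h>
#include <stddef.h>

#define MSIZE   128     /* size of an mbuf, in bytes */
#define MLEN    112     /* data bytes held in one mbuf */
#define CLBYTES 1024    /* size of a page, in bytes */

/* Number of small mbufs needed to hold len bytes of data. */
size_t
mbufs_needed(size_t len)
{
    return (len + MLEN - 1) / MLEN;
}

/*
 * Nonzero if len bytes fall under the half-page rule and would be
 * copied into small mbufs rather than placed in pages.
 */
int
use_small_mbufs(size_t len)
{
    return len < CLBYTES / 2;
}
```

Under these figures each mbuf carries 16 bytes of overhead,
and 500 bytes of data would occupy five mbufs.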

1122..22..  PPrroottooccooll bbuuffffeerriinngg ppoolliicciieess

     Protocols reserve fixed amounts of buffering  for  send
and  receive  queues at socket creation time.  These amounts
define the high and low water marks used by the socket  rou-
tines  in deciding when to block and unblock a process.  The
reservation of space does not currently result in any action
by the memory management routines.

     Protocols  which  provide connection level flow control
do this based on the  amount  of  space  in  the  associated
socket  queues.   That is, send windows are calculated based
on the amount of free space in the socket's  receive  queue,
while  receive  windows  are adjusted based on the amount of
data awaiting transmission in the send queue.  Care has been
taken  to  avoid  the ``silly window syndrome'' described in
[Clark82] at both the sending and receiving ends.

1122..33..  QQuueeuuee lliimmiittiinngg

     Incoming packets from the network are  always  received
unless  memory allocation fails.  However, each Level 1 pro-
tocol input queue has an upper bound on the queue's  length,
and  any  packets exceeding that bound are discarded.  It is
possible for a host to be overwhelmed by  excessive  network
traffic (for instance a host acting as a gateway from a high
bandwidth  network  to  a  low  bandwidth  network).   As  a
``defensive''  mechanism the queue limits may be adjusted to
throttle network traffic load on a host.   Consider  a  host
willing to devote some percentage of its machine to handling
network traffic.  If the cost of handling an incoming packet
can  be  calculated  so that an acceptable ``packet handling
rate'' can be determined, then input queue  lengths  may  be
dynamically  adjusted based on a host's network load and the
number of packets awaiting processing.  Obviously,  discard-
ing packets is not a satisfactory solution to a problem such
as this (simply dropping packets is likely to  increase  the
load  on  a  network);  the  queue lengths were incorporated
mainly as a safeguard mechanism.

1122..44..  PPaacckkeett ffoorrwwaarrddiinngg

     When packets cannot be forwarded because of memory
limitations,  the  system  attempts  to  generate a ``source
quench''  message.   In   addition,   any   other   problems
encountered during packet forwarding are also reflected back
to the sender in the form of ICMP packets.  This helps hosts
avoid unneeded retransmissions.

     Broadcast  packets  are never forwarded due to possible
dire consequences.  In an early stage  of  network  develop-
ment,  broadcast  packets  were  forwarded  and  a ``routing
loop'' resulted in network saturation and every host on  the
network crashing.

1133..  OOuutt ooff bbaanndd ddaattaa

     Out  of  band data is a facility peculiar to the stream
socket abstraction defined.   Little  agreement  appears  to
exist  as  to what its semantics should be.  TCP defines the
notion of ``urgent data'' as in-line, while the  NBS  proto-
cols  [Burruss81]  and numerous others provide a fully inde-
pendent logical transmission channel along which out of band
data  is  to  be  sent.  In addition, the amount of the data
which may be sent as an out of band message varies from pro-
tocol  to  protocol;  everything  from  1 bit to 16 bytes or
more.

     A stream socket's notion of out of band data  has  been
defined  as  the  lowest  reasonable  common denominator (at
least reasonable in our minds); clearly this is  subject  to
debate.   Out of band data is expected to be transmitted out
of the normal sequencing and flow control constraints of the
data  stream.   A  minimum of 1 byte of out of band data and
one outstanding out of band message are expected to be  sup-
ported  by the protocol supporting a stream socket.  It is a
protocol's prerogative to support larger-sized messages,  or
more than one outstanding out of band message at a time.

     Out  of  band data is maintained by the protocol and is
usually not stored in the socket's receive queue.  A socket-
level option, SO_OOBINLINE, is provided to force out-of-band
data to be placed in the normal receive  queue  when  urgent
data is received; this sometimes ameliorates problems due
to loss of  data  when  multiple  out-of-band  segments  are
received  before the first has been passed to the user.  The
PRU_SENDOOB and PRU_RCVOOB requests to the pr_usrreq routine
are used in sending and receiving data.

1144..  TTrraaiilleerr pprroottooccoollss

     Core  to core copies can be expensive.  Consequently, a
great deal of effort was spent  in  minimizing  such  opera-
tions.   The  VAX architecture provides virtual memory hard-
ware organized in page units.  To cut down  on  copy  opera-
tions,  data  is  kept  in  page-sized units on page-aligned
boundaries whenever possible.  This allows data to be  moved
in  memory  simply by remapping the page instead of copying.
The mbuf and network interface routines perform  page  table
manipulations  where  needed, hiding the complexities of the
VAX virtual memory hardware from higher level code.

     Data enters the system in two ways: from the  user,  or
from  the network (hardware interface).  When data is copied
from  the  user's  address  space  into  the  system  it  is
deposited  in  pages  (if sufficient data is present).  This
encourages the user  to  transmit  information  in  messages
which are a multiple of the system page size.

     Unfortunately, performing a similar operation when tak-
ing data from the network is very difficult.   Consider  the
format  of  an incoming packet.  A packet usually contains a
local network header followed by one or more headers used by
the  high  level protocols.  Finally, the data, if any, fol-
lows these headers.  Since the  header  information  may  be
variable length, DMA'ing the eventual data for the user into
a page aligned area of memory is impossible without a priori
knowledge  of  the format (e.g., by supporting only a single
protocol header format).

     To allow  variable  length  header  information  to  be
present  and  still ensure page alignment of data, a special
local network encapsulation may be used.  This encapsulation,
termed a trailer protocol [Leffler84], places the
variable length header information after the data.  A fixed
size local network header is then prepended to the resultant
packet.  The local network header contains the size of the
data portion (in units of 512 bytes), and a new trailer protocol
header, inserted before the variable length information,
contains the size of the variable length header information.
The following trailer protocol header is used to
store  information  regarding  the  variable length protocol
header:

     struct {
            short     protocol;            /* original protocol no. */
            short     length;              /* length of trailer */
     };


     The processing of the trailer protocol is very  simple.
On output, the local network header indicates that a trailer
encapsulation is being used.  The header  also  includes  an
indication  of  the  number of data pages present before the
trailer protocol header.  The  trailer  protocol  header  is
initialized  to  contain  the actual protocol identifier and
the variable length header size, and is appended to the data
along with the variable length header information.
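
     The output steps can be sketched in C as follows.  This
is only an illustration under stated assumptions, not the
kernel code: the names trailer_encap, trailer_hdr, and
TRAILER_UNIT are hypothetical, flat buffers stand in for mbuf
chains, fields are left in host byte order, and the length
field is assumed to count only the variable length header
bytes.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define TRAILER_UNIT 512        /* data granularity named in the text */

/* Trailer protocol header from the text (the tag name is hypothetical). */
struct trailer_hdr {
        short   protocol;       /* original protocol no. */
        short   length;         /* length of trailer */
};

/*
 * Hypothetical helper: lay out the body of a trailer packet in `out'
 * as data, then trailer protocol header, then variable length header.
 * The count of 512-byte units, which the local network header would
 * carry, is stored in *nunits.  Returns the number of bytes written,
 * or 0 if the arguments cannot form a trailer packet.
 */
size_t
trailer_encap(unsigned char *out, size_t outsize, short protocol,
    const unsigned char *hdr, size_t hdrlen,
    const unsigned char *data, size_t datalen, unsigned *nunits)
{
        struct trailer_hdr th;
        size_t total;

        assert(out != NULL && hdr != NULL && data != NULL && nunits != NULL);

        /* Trailers apply only to a whole number of 512-byte units. */
        if (datalen == 0 || datalen % TRAILER_UNIT != 0)
                return 0;
        if (hdrlen > (size_t)SHRT_MAX)
                return 0;
        total = datalen + sizeof th + hdrlen;
        if (total > outsize)
                return 0;

        th.protocol = protocol;
        th.length = (short)hdrlen;      /* assumption: header bytes only */

        memcpy(out, data, datalen);                     /* data first */
        memcpy(out + datalen, &th, sizeof th);          /* trailer header */
        memcpy(out + datalen + sizeof th, hdr, hdrlen); /* protocol header */

        *nunits = (unsigned)(datalen / TRAILER_UNIT);
        return total;
}
```

     Because the data is written at offset zero of the
packet body, a receiver that knows the fixed local network
header size can place that data on a page boundary.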

     On  input,  the interface routines identify the trailer
encapsulation by the protocol type stored in the local  net-
work  header,  then calculate the number of pages of data to
find the beginning of the trailer.  The trailing information
is  copied  into  a separate mbuf and linked to the front of
the resultant packet.
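
     The input steps can be sketched in the same spirit.
Again this is a hedged illustration rather than the actual
interface code: trailer_decap, trailer_hdr, and TRAILER_UNIT
are hypothetical names, the count of 512-byte units is assumed
to have been extracted from the local network header already,
and the data is copied into a flat buffer where the system
would instead link mbufs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TRAILER_UNIT 512        /* data granularity named in the text */

/* Trailer protocol header from the text (the tag name is hypothetical). */
struct trailer_hdr {
        short   protocol;       /* original protocol no. */
        short   length;         /* length of trailer */
};

/*
 * Hypothetical helper: given the body of a trailer packet and the
 * number of 512-byte data units announced by the local network
 * header, rebuild the packet in conventional order (protocol header
 * first, data second) in `out'.  The original protocol number is
 * stored in *protocol.  Returns the number of bytes written, or 0
 * if the packet is malformed or `out' is too small.
 */
size_t
trailer_decap(const unsigned char *body, size_t bodylen, unsigned nunits,
    unsigned char *out, size_t outsize, short *protocol)
{
        struct trailer_hdr th;
        size_t datalen = (size_t)nunits * TRAILER_UNIT;
        size_t hdrlen;

        assert(body != NULL && out != NULL && protocol != NULL);

        /* The trailer header sits immediately after the data units. */
        if (nunits == 0 || bodylen < datalen + sizeof th)
                return 0;
        memcpy(&th, body + datalen, sizeof th);

        if (th.length < 0)
                return 0;
        hdrlen = (size_t)th.length;     /* assumption: header bytes only */
        if (bodylen - datalen - sizeof th < hdrlen)
                return 0;
        if (outsize < hdrlen + datalen)
                return 0;

        /* Move the trailing header information to the front. */
        memcpy(out, body + datalen + sizeof th, hdrlen);
        memcpy(out + hdrlen, body, datalen);

        *protocol = th.protocol;
        return hdrlen + datalen;
}
```

     Only the short trailing header is moved in the real
system; the copy of the data here exists solely to keep the
sketch self-contained.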

     Clearly, trailer protocols require cooperation between
source and destination.  In addition, they are normally
cost-effective only when sizable packets are used.  The
current scheme works because the local network encapsulation
header is a fixed size, allowing DMA operations to be
performed at a known offset from the first data page being
received.  If the local network header were of variable
length, this scheme would fail.

     Statistics collected indicate that as much as 200Kb/s
can be gained by using a trailer protocol with 1Kbyte
packets.  The average size of the variable-length header was
40 bytes (the size of a minimal TCP/IP packet header).  If
the hardware supports larger packets, even greater gains may
be realized.

AAcckknnoowwlleeddggeemmeennttss

     The internal structure of the system is patterned after
the Xerox PUP architecture [Boggs79], while in certain places
the Internet protocol family has had a great deal of
influence on the design.  The use of software interrupts for
process invocation is based on similar facilities  found  in
the VMS operating system.  Many of the ideas related to pro-
tocol modularity, memory management, and network  interfaces
are  based  on  Rob  Gurwitz's TCP/IP implementation for the
4.1BSD version of UNIX on the VAX [Gurwitz81].  Greg Chesson
explained  his  use  of  trailer  encapsulations in Datakit,
instigating their use in our system.


RReeffeerreenncceess


[Boggs79]           Boggs, D. R., J. F. Shoch, E. A. Taft, and
                    R. M. Metcalfe; PUP: An Internetwork
                    Architecture.  Report CSL-79-10.  XEROX
                    Palo Alto Research Center, July 1979.

[BBN78]             Bolt Beranek and Newman; Specification for
                    the Interconnection of Host and IMP.  BBN
                    Technical Report 1822.  May 1978.

[Cerf78]            Cerf, V. G.; The Catenet Model for
                    Internetworking.  Internet Working Group,
                    IEN 48.  July 1978.

[Clark82]           Clark, D. D.; Window and Acknowledgement
                    Strategy in TCP, RFC-813.  Network
                    Information Center, SRI International.
                    July 1982.

[DEC80]             Digital Equipment Corporation; DECnet
                    DIGITAL Network Architecture - General
                    Description.  Order No. AA-K179A-TK.
                    October 1980.

[Gurwitz81]         Gurwitz, R. F.; VAX-UNIX Networking
                    Support Project - Implementation
                    Description.  Internetwork Working Group,
                    IEN 168.  January 1981.

[ISO81]             International Organization for
                    Standardization; ISO Open Systems
                    Interconnection - Basic Reference Model.
                    ISO/TC 97/SC 16 N 719.  August 1981.

[Joy86]             Joy, W.; Fabry, R.; Leffler, S.;
                    McKusick, M.; and Karels, M.; Berkeley
                    Software Architecture Manual, 4.4BSD
                    Edition.  UNIX Programmer's Supplementary
                    Documents, Vol. 1 (PSD:5).  Computer
                    Systems Research Group, University of
                    California, Berkeley.  May, 1986.

[Leffler84]         Leffler, S. J. and Karels, M. J.; Trailer
                    Encapsulations, RFC-893.  Network
                    Information Center, SRI International.
                    April 1984.

[Postel80]          Postel, J.; User Datagram Protocol,
                    RFC-768.  Network Information Center,
                    SRI International.  August 1980.

[Postel81a]         Postel, J., ed.; Internet Protocol,
                    RFC-791.  Network Information Center,
                    SRI International.  September 1981.

[Postel81b]         Postel, J., ed.; Transmission Control
                    Protocol, RFC-793.  Network Information
                    Center, SRI International.  September
                    1981.

[Postel81c]         Postel, J.; Internet Control Message
                    Protocol, RFC-792.  Network Information
                    Center, SRI International.  September
                    1981.

[Xerox81]           Xerox Corporation; Internet Transport
                    Protocols.  Xerox System Integration
                    Standard 028112.  December 1981.

[Zimmermann80]      Zimmermann, H.; OSI Reference Model - The
                    ISO Model of Architecture for Open
                    Systems Interconnection.  IEEE
                    Transactions on Communications.
                    Com-28(4); 425-432.  April 1980.