Title: Our recv1000.c driver
1. Our recv1000.c driver
- Implementing a packet-receive capability with
the Intel 82573L network interface controller
2. Similarities
- There exist quite a few similarities between
implementing the transmit-capability and the
receive-capability in a device-driver for
Intel's 82573L ethernet controller:
- Identical device-discovery and ioremap steps
- Same steps for global reset of the hardware
- Comparable data-structure initializations
- Parallel setups for the TX and RX registers
- But there also are a few fundamental differences
(such as active versus passive roles for the
driver)
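3. 'push' versus 'pull'
[Diagram: host memory holds the transmit packet
buffer and the receive packet buffer; the Ethernet
controller holds the transmit-FIFO and the
receive-FIFO, which connect to/from the LAN.
Outgoing data is "pushed" from the transmit packet
buffer into the transmit-FIFO; incoming data is
"pulled" from the receive-FIFO into the receive
packet buffer.]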
The write() routine in our xmit1000.c driver
could transfer data at any time, but the
read() routine in our recv1000.c driver has
to wait for data to arrive. So to avoid doing
any wasteful busy-waiting, our recv1000.c
driver can use the Linux kernel's sleep/wakeup
mechanism if it enables the NIC's interrupts!
4. Sleep/wakeup
- We will need to employ a wait-queue, we will need
to enable device-interrupts, and we will need to
write and install the code for an interrupt
service routine (ISR)
- So our recv1000.c driver will have a few
additional code and data components that were
absent in our xmit1000.c driver
5. Driver's components
- my_isr(): This function will awaken any sleeping
reader-task
- wait_queue_head: the wait-queue on which a
reader-task sleeps
- my_fops: This struct holds one function-pointer
(read)
- my_read(): This function will program the actual
data-transfer
- my_get_info(): This function will allow us to
inspect the receive-descriptors
- module_init(): This function will detect and
configure the hardware, define page-mappings,
allocate and initialize the descriptors, install
our ISR and enable interrupts, start the receive
engine, create the pseudo-file and register
my_fops
- module_exit(): This function will do needed
cleanup when it's time to unload our driver: turn
off the receive engine, disable interrupts and
remove our ISR, free memory, delete page-table
entries, the pseudo-file, and the my_fops
6. How the NIC's interrupts work
- There are four interrupt-related registers which
are essential for us to understand
    Register  Offset  Name
    ICR       0x00C0  Interrupt Cause Read
    ICS       0x00C8  Interrupt Cause Set
    IMS       0x00D0  Interrupt Mask Set/Read
    IMC       0x00D8  Interrupt Mask Clear
7. Interrupt event-types
Bit assignments in the 82573L interrupt registers
(bits not listed are reserved):
    31  INT_ASSERTED (1=yes, 0=no)
    17  ACK (Rx-ACK Frame detected)
    16  SRPD (Small Rx-Packet detected)
    15  TXD_LOW (Tx-Descr Low Thresh hit)
     9  MDAC (MDI/O Access Completed)
     7  RXT0 (Receiver Timer expired)
     6  RXO (Receiver Overrun)
     4  RXDMT0 (Rx-Desc Min Thresh hit)
     2  LSC (Link Status Change)
     1  TXQE (Transmit Queue Empty)
     0  TXDW (Transmit Descriptor Written Back)
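As a minimal sketch of how a handler might log the causes listed above, the following user-space C fragment turns a value read from ICR into a list of event names. The bit positions are the ones given on this slide; the table icr_events and the function describe_icr() are hypothetical names introduced only for illustration, and are not part of recv1000.c.

```c
#include <stdio.h>
#include <stddef.h>
#include <string.h>

/* Bit positions as listed for the 82573L interrupt registers. */
struct icr_event { unsigned int bit; const char *name; };

const struct icr_event icr_events[] = {
	{ 31, "INT_ASSERTED" }, { 17, "ACK" }, { 16, "SRPD" },
	{ 15, "TXD_LOW" }, { 9, "MDAC" }, { 7, "RXT0" }, { 6, "RXO" },
	{ 4, "RXDMT0" }, { 2, "LSC" }, { 1, "TXQE" }, { 0, "TXDW" },
};

/* Hypothetical helper: writes the space-separated names of all causes
 * set in 'cause' into 'out' and returns how many names were written. */
int describe_icr( unsigned int cause, char *out, size_t outlen )
{
	size_t	n_events = sizeof icr_events / sizeof icr_events[0];
	size_t	used = 0;
	int	found = 0;

	if ( outlen == 0 ) return 0;
	out[0] = '\0';
	for (size_t i = 0; i < n_events; i++) {
		if ( !(cause & (1u << icr_events[i].bit)) ) continue;
		int n = snprintf( out + used, outlen - used, "%s%s",
				found ? " " : "", icr_events[i].name );
		if ( n < 0 || (size_t)n >= outlen - used ) break; // buffer is full
		used += (size_t)n;
		++found;
	}
	return found;
}
```

In a real handler the resulting string would be passed to printk() rather than built in user space.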
8. Interrupt Mask Set/Read
- This register is used to enable a selection of
the device's interrupts which the driver will be
prepared to recognize and handle
- A particular interrupt becomes enabled if
software writes a 1 to the corresponding bit of
this Interrupt Mask Set register
- Writing 0 to any register-bit has no effect, so
interrupts can be enabled one-at-a-time
9. Interrupt Mask Clear
- Your driver can discover which interrupts have
been enabled by reading IMS, but your driver
cannot disable any interrupts by writing to
that register
- Instead a specific interrupt can be disabled by
writing a 1 to the corresponding bit in the
Interrupt Mask Clear register
- Writing 0 to a register-bit has no effect on
the interrupt controller's Interrupt Mask
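The set/clear behavior described on the two slides above can be modeled in a few lines of user-space C. This is only a software model of the semantics (writing 1s to IMS sets mask bits, writing 1s to IMC clears them, writing 0s changes nothing); the struct nic_model and the three function names are hypothetical and no real hardware register is touched.

```c
#include <stdint.h>

/* Hypothetical software model of the NIC's interrupt mask. */
struct nic_model { uint32_t mask; };

/* Writing to IMS: each 1-bit enables that interrupt, 0-bits are ignored. */
void write_ims( struct nic_model *nic, uint32_t value )
{
	nic->mask |= value;
}

/* Writing to IMC: each 1-bit disables that interrupt, 0-bits are ignored. */
void write_imc( struct nic_model *nic, uint32_t value )
{
	nic->mask &= ~value;
}

/* Reading IMS reports which interrupts are currently enabled. */
uint32_t read_ims( const struct nic_model *nic )
{
	return nic->mask;
}
```

Because 0-bits are ignored in both registers, a driver can enable or disable one interrupt at a time without first reading the mask and merging bits itself.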
10. Interrupt Cause Read
- Whenever interrupts occur, your driver's
interrupt service routine can discover the
specific conditions that triggered them if it
reads the Interrupt Cause Read register
- In this case your driver can clear any selection
of these bits (except bit 31) by writing 1s to
them (writing 0s to this register will have no
effect)
- In case no interrupt has occurred, reading this
register may have the side-effect of clearing it
11. Interrupt Cause Set
- For testing your driver's interrupt-handler, you
can artificially trigger any particular
combination of interrupts by writing 1s into
the corresponding register-bits of this Interrupt
Cause Set register (assuming your combination of
bits corresponds to interrupts that are enabled
by 1s being present for them in the Interrupt
Mask)
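A small user-space model may help to see how ICS, ICR and the Interrupt Mask interact during such a test. It is a sketch under stated assumptions, not a description of the hardware's internals: the struct irq_model and the model_* function names are hypothetical, and bit 31 (INT_ASSERTED) is left out of the model.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical software model: pending causes plus the enabled-mask. */
struct irq_model { uint32_t cause; uint32_t mask; };

/* Writing 1s to ICS artificially sets the corresponding cause bits. */
void model_write_ics( struct irq_model *m, uint32_t value )
{
	m->cause |= value;
}

/* Writing 1s to ICR clears the corresponding cause bits. */
void model_write_icr( struct irq_model *m, uint32_t value )
{
	m->cause &= ~value;
}

/* An interrupt is signaled only if some pending cause is also enabled. */
bool model_irq_asserted( const struct irq_model *m )
{
	return (m->cause & m->mask) != 0;
}
```

In the model, setting a cause bit whose mask bit is 0 leaves the interrupt unasserted, which matches the condition stated above that the chosen bits must be enabled in the Interrupt Mask.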
12. Our interrupt-handler
- We decided to enable all possible causes (and we
log them via printk() messages we've omitted
in the code-fragment here)
    irqreturn_t my_isr( int irq, void *dev_id )
    {
        // read the cause-bits; zero means this was not our interrupt
        int intr_cause = ioread32( io + E1000_ICR );
        if ( intr_cause == 0 ) return IRQ_NONE;

        // awaken any sleeping reader, then clear the cause-bits
        wake_up_interruptible( &wq_rd );
        iowrite32( intr_cause, io + E1000_ICR );
        return IRQ_HANDLED;
    }
13. We 'tweak' our packet-format
- Our xmit1000.c driver elected to have the NIC
append 'padding' to any short packets
- But this prevents a receiver from knowing how
many bytes represent actual data
- To solve this problem, we added our own 'count'
field to each packet's payload
    Offset  Field
    0       destination MAC-address
    6       source MAC-address
    12      Type/Len
    14      count
    16      actual bytes of user-data
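The effect of the added count field can be illustrated with a short user-space sketch: the sender stores the number of real data-bytes at offset 14 and the data at offset 16, so the receiver can ignore any padding. The offset names and the functions put_payload() and get_payload() are hypothetical; the count is stored in host byte-order, as an assumption mirroring the driver's plain (unsigned short) access.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Offsets within the frame, following the layout shown above. */
enum { DST_OFF = 0, SRC_OFF = 6, TYPE_OFF = 12, COUNT_OFF = 14, DATA_OFF = 16 };

/* Hypothetical sender-side helper: stores 'count' and the data-bytes.
 * Returns the number of frame-bytes used, or 0 if they would not fit. */
size_t put_payload( unsigned char *frame, size_t framelen,
			const void *data, uint16_t count )
{
	if ( framelen < (size_t)DATA_OFF ) return 0;
	if ( count > framelen - (size_t)DATA_OFF ) return 0;
	memcpy( frame + COUNT_OFF, &count, sizeof count ); // host byte-order
	memcpy( frame + DATA_OFF, data, count );
	return (size_t)DATA_OFF + count;
}

/* Hypothetical receiver-side helper: recovers the count from a possibly
 * padded frame and copies at most 'len' data-bytes into 'buf'. */
size_t get_payload( const unsigned char *frame, void *buf, size_t len )
{
	uint16_t count;
	memcpy( &count, frame + COUNT_OFF, sizeof count );
	size_t n = ( count > len ) ? len : count; // can't exceed caller's buffer
	memcpy( buf, frame + DATA_OFF, n );
	return n;
}
```

For a 5-byte message placed in a 60-byte zero-padded frame, the receiver gets back exactly 5 bytes instead of the 44 bytes that follow the count field.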
14. Our read() method
    ssize_t my_read( struct file *file, char *buf, size_t len, loff_t *pos )
    {
        static int    rxhead = 0;  // to remember where we left off
        unsigned char *from = phys_to_virt( rxdesc[ rxhead ].base_addr );
        unsigned int  count;

        // go to sleep if no new data-packets have been received yet
        if ( ioread32( io + E1000_RDH ) == rxhead )
            if ( wait_event_interruptible( wq_rd,
                    ioread32( io + E1000_RDH ) != rxhead ) ) return -EINTR;

        // get the number of actual data-bytes in the new (possibly padded) data-packet
        count = *(unsigned short*)(from + 14);  // data-count as stored by xmit1000.c
        if ( count > len ) count = len;  // can't transfer more bytes than buffer can hold
        if ( copy_to_user( buf, from+16, count ) ) return -EFAULT;

        // advance our static array-index variable to the next receive-descriptor
        rxhead = (1 + rxhead) % 8;  // this index wraps-around after 8 descriptors
        return count;  // tell kernel how many bytes were transferred
    }
15. Hardware's initialization
- We allocate and initialize a minimum-size Receive
Descriptor Queue (8 descriptors)
- We perform a global reset via the RST-bit in
the NIC's Device Control register (with a
side-effect of zeroing both RDH and RDT)
- We configure the receive engine (RCTL) plus a
few additional registers that affect the
network-controller's reception-options (namely
RXCSUM, RFCTL, PSRCTL)
16. Receive Control (0x0100)
Bit layout of the RCTL register (82573L):
    Bits   Field   Meaning
    31     R       reserved (0)
    30-27  FLXBUF  Flexible Buffer size
    26     SECRC   Strip Ethernet CRC
    25     BSEX    Buffer Size Extension
    24     R       reserved (0)
    23     PMCF    Pass MAC Control Frames
    22     DPF     Discard Pause Frames
    21     R       reserved (0)
    20     CFI     Canonical Form Indicator bit-value
    19     CFIEN   Canonical Form Indicator Enable
    18     VFE     VLAN Filter Enable
    17-16  BSIZE   Receive Buffer Size
    15     BAM     Broadcast Accept Mode
    14     R       reserved (0)
    13-12  MO      Multicast Offset
    11-10  DTYP    Descriptor Type
    9-8    RDMTS   Rx-Descriptor Minimum Threshold Size
    7-6    LBM     Loopback Mode
    5      LPE     Long Packet reception Enable
    4      MPE     Multicast Promiscuous Enable
    3      UPE     Unicast Promiscuous Enable
    2      SBP     Store Bad Packets
    1      EN      Receive Enable
    0      R       reserved (0)
Our driver initially will program this register
with the value 0x0400801C. Then later, when
everything is ready, it will turn on bit 1 to
start the receive engine.
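To see which options the value 0x0400801C selects, it can be rebuilt from individual bit-masks. The sketch below assumes the bit positions from Intel's published layout for this register (EN=1, SBP=2, UPE=3, MPE=4, BAM=15, SECRC=26); the macro names and the two helper functions are hypothetical and are not taken from recv1000.c.

```c
#include <stdint.h>

/* Assumed RCTL bit positions (per Intel's register layout). */
#define RCTL_EN		(1u << 1)	/* Receive Enable */
#define RCTL_SBP	(1u << 2)	/* Store Bad Packets */
#define RCTL_UPE	(1u << 3)	/* Unicast Promiscuous Enable */
#define RCTL_MPE	(1u << 4)	/* Multicast Promiscuous Enable */
#define RCTL_BAM	(1u << 15)	/* Broadcast Accept Mode */
#define RCTL_SECRC	(1u << 26)	/* Strip Ethernet CRC */

/* Hypothetical helper: the initial value, with the receiver still off. */
uint32_t rctl_initial( void )
{
	return RCTL_SBP | RCTL_UPE | RCTL_MPE | RCTL_BAM | RCTL_SECRC;
}

/* Hypothetical helper: turn on bit 1 to start the receive engine. */
uint32_t rctl_start( uint32_t rctl )
{
	return rctl | RCTL_EN;
}
```

Under these assumptions rctl_initial() equals 0x0400801C, and starting the engine yields 0x0400801E. All remaining fields (BSIZE, DTYP, LBM and so on) stay zero.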
17. Packet-Split Rx Control (0x2170)
    Bits   Field
    31-30  0
    29-24  BSIZE3 (in KB)
    23-22  0
    21-16  BSIZE2 (in KB)
    15-14  0
    13-8   BSIZE1 (in KB)
    7      0
    6-0    BSIZE0 (in 1/8 KB)
If the controller is configured to use the
packet-split feature (RCTL.DTYP=1), then this
register controls the sizes of the four
receive-buffers, so there are certain
requirements that nonzero values appear in
several of these fields. But our recv1000.c
driver will use the legacy receive-descriptor
format (i.e., RCTL.DTYP=0), so this register
will be disregarded by the NIC and we are
therefore allowed to program it with the value
0x00000000.
18. Receive Filter Control (0x5008)
    Bits   Field
    31-16  reserved
    15     EXSTEN
    14     IPFRSP_DIS
    13     ACKD_DIS
    12     ACKDIS
    11     IPv6XSUM_DIS
    10     IPv6_DIS
    9-8    NFS_VER
    7      NFSR_DIS
    6      NFSW_DIS
    5-1    iSCSI_DWC
    0      iSCSI_DIS
Our driver writes 0x00000000 to this register,
which among other effects will cause the
ethernet controller NOT to write Extended Status
information into our device-driver's
legacy-format Receive Descriptors (bit 15:
EXSTEN=0)
19. RX Checksum Control (0x5000)
    Bits   Field
    31-10  reserved
    9      TCP/UDP Checksum Off-load enabled (1=yes, 0=no)
    8      IP Checksum Off-load enabled (1=yes, 0=no)
    7-0    packet checksum start (this field controls
           the starting byte for the Packet Checksum
           calculation)
Our driver programs this register with the value
0x00000000, which disables Checksum Off-loading
for TCP/UDP packets (which we won't be
receiving) and for IP packets (which likewise
won't be sent by our xmit1000.c driver); all
Packet-Checksums will be calculated starting
from the very first byte
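As a hedged illustration of the three fields just described, the following sketch packs them into a register value. The field positions follow the layout on this slide (start-offset in bits 7-0, IP off-load in bit 8, TCP/UDP off-load in bit 9); the macro names and make_rxcsum() are hypothetical.

```c
#include <stdint.h>

#define RXCSUM_START_MASK	0xFFu		/* bits 7-0: checksum start */
#define RXCSUM_IP_OFFLOAD	(1u << 8)	/* bit 8: IP off-load */
#define RXCSUM_TCPUDP_OFFLOAD	(1u << 9)	/* bit 9: TCP/UDP off-load */

/* Hypothetical helper: combine the three fields into one value. */
uint32_t make_rxcsum( unsigned int start, int ip, int tcpudp )
{
	uint32_t value = start & RXCSUM_START_MASK;
	if ( ip ) value |= RXCSUM_IP_OFFLOAD;
	if ( tcpudp ) value |= RXCSUM_TCPUDP_OFFLOAD;
	return value;
}
```

With a start-offset of 0 and both off-loads disabled, the result is the 0x00000000 that our driver writes.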
20. Rx-Descriptor Control (0x2828)
    Bits   Field
    31-25  0
    24     GRAN (Granularity)
    23-22  0
    21-16  WTHRESH (Writeback Threshold)
    15-14  0
    13-8   HTHRESH (Host Threshold)
    7-6    0
    5-0    PTHRESH (Prefetch Threshold)
"This register controls the fetching and write
back of receive descriptors. The three
threshold values are used to determine when
descriptors are read from, and written to, host
memory. Their values can be in units of cache
lines or of descriptors (each descriptor is 16
bytes), based on the value of the GRAN bit
(0=cache lines, 1=descriptors). When GRAN = 1,
all descriptors are written back (even if not
requested)." --Intel manual
Recommended for 82573: 0x01010000 (GRAN=1,
WTHRESH=1)
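The recommended value can be checked by packing the fields by hand. This sketch assumes the threshold fields are six bits wide and begin at bits 0, 8 and 16, with GRAN at bit 24 (consistent with the recommended value above); the macro names and make_rxdctl() are hypothetical.

```c
#include <stdint.h>

#define RXDCTL_THRESH_MASK	0x3Fu		/* assumed 6-bit threshold fields */
#define RXDCTL_PTHRESH_SHIFT	0
#define RXDCTL_HTHRESH_SHIFT	8
#define RXDCTL_WTHRESH_SHIFT	16
#define RXDCTL_GRAN		(1u << 24)	/* 1 = units of descriptors */

/* Hypothetical helper: pack the three thresholds and the GRAN bit. */
uint32_t make_rxdctl( unsigned int pthresh, unsigned int hthresh,
			unsigned int wthresh, int gran_descriptors )
{
	uint32_t value = 0;
	value |= (uint32_t)(pthresh & RXDCTL_THRESH_MASK) << RXDCTL_PTHRESH_SHIFT;
	value |= (uint32_t)(hthresh & RXDCTL_THRESH_MASK) << RXDCTL_HTHRESH_SHIFT;
	value |= (uint32_t)(wthresh & RXDCTL_THRESH_MASK) << RXDCTL_WTHRESH_SHIFT;
	if ( gran_descriptors ) value |= RXDCTL_GRAN;
	return value;
}
```

Calling make_rxdctl( 0, 0, 1, 1 ) gives 0x01010000, the value recommended for the 82573.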
21. Maximum-size buffers
- We use a minimal number of maximum-size
receive-buffers (eight of 1536-bytes)
[Diagram: a ring of eight rx-descriptors, each
pointing to one of eight buffers (buffer 0 through
buffer 7) in kernel memory]
22. NIC 'owns' our rx-descriptors
[Diagram: the ring of rx-descriptors (drawn as
descriptor 0 through descriptor 8), located by
RDBAH/RDBAL, with RDLEN = 0x80; RDH, RDT and our
static variable rxhead each index into the ring]
- RDH: This register gets initialized to 0, then
gets changed by the controller as new packets are
received
- RDT: This register gets initialized to 8, then
never gets changed
- rxhead: Our static variable
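The bookkeeping around the ring can be sketched in user-space C. Only the field name base_addr and the variable name rxhead come from our driver; the other descriptor field names are assumptions modeled on Intel's 16-byte legacy receive-descriptor format, and packet_pending() and next_rxhead() are hypothetical helpers.

```c
#include <stdint.h>

#define N_RX_DESC 8	/* our minimum-size descriptor queue */

/* Assumed legacy receive-descriptor layout (16 bytes in all). */
struct rx_descriptor {
	uint64_t base_addr;	/* physical address of the packet buffer */
	uint16_t packet_length;
	uint16_t packet_chksum;
	uint8_t  desc_status;
	uint8_t  desc_errors;
	uint16_t vlan_tag;
};

/* Eight 16-byte descriptors occupy 0x80 bytes, the value held in RDLEN. */
_Static_assert( sizeof( struct rx_descriptor ) == 16, "descriptor size" );
_Static_assert( N_RX_DESC * sizeof( struct rx_descriptor ) == 0x80, "RDLEN" );

/* A new packet is waiting whenever the NIC's head has moved past ours. */
int packet_pending( unsigned int rdh, unsigned int rxhead )
{
	return rdh != rxhead;
}

/* Advance our index to the next descriptor, wrapping after the eighth. */
unsigned int next_rxhead( unsigned int rxhead )
{
	return (1 + rxhead) % N_RX_DESC;
}
```

The comparison in packet_pending() is the same test my_read() uses to decide whether it must sleep.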
23. Driver 'defects'
- If an application tries to read from our
device-file /dev/nic, but the controller
received a packet that contains more bytes of
data than the user requested, excess bytes get
lost (i.e., discarded)
- If an application delays reading packets while
the controller continues receiving, then an
earlier packet gets overwritten
24. In-class exercise 1
- Discuss with your nearest class-member your ideas
for how these driver defects might be overcome,
so that packet-data being received will be
protected against getting lost and/or being
overwritten
25. In-class exercise 2
- Login to a pair of machines on the anchor
cluster and install our xmit1000.ko and our
recv1000.ko modules (one on each)
- Try transferring a textfile from one of the
machines to the other, by using cat:
- anchor01$ cat textfile > /dev/nic
- anchor02$ cat /dev/nic > recv1000.out
- How large a textfile can you successfully
transfer using our simple driver-modules?