
PCI express maximal payload size: Finding it and its impact on bandwidth


Finding the maximal payload manually

The truth is, there is no need to do this manually: lspci does the work for us. But looking into the configuration table once and for all helps demystify the issue. So here we go.

According to the PCIe spec (section 7.8), the max_payload_size the card can take is given in the PCIe Device Capabilities Register (Offset 0x04 in the PCI Express Capability structure), bits 2-0. Basically, take that three-bit field as a number, add 7 to it, and you have the base-2 logarithm of the number of bytes allowed.

Let me write this in C for clarity:

max_payload_size_capable = 1 << ( (DevCapReg & 0x07) + 7); // In bytes

The actual value used is set by the host in the Device Control Register (Offset 0x08 in the PCI Express Capability structure). It’s the same drill, but with bits 7-5 instead. So in C it would be

max_payload_size_in_effect = 1 << ( ( (DevCtrlReg >> 5) & 0x07) + 7); // In bytes

OK, so how can we find these registers? How do we find the structure? Let’s start with dumping the hexadecimal representation of the 256-byte configuration space. Using lspci -xxx on a Linux machine we get the dump for all devices, but we’ll look at one specific device:

# lspci -xxx
(...)

01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
00: ee 10 34 12 07 04 10 00 00 00 00 ff 01 00 00 00
10: 04 f0 af fd 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 ee 10 34 12
30: 00 00 00 00 40 00 00 00 00 00 00 00 ff 00 00 00
40: 01 48 03 70 08 00 00 00 05 58 81 00 0c 30 e0 fe
50: 00 00 00 00 71 41 00 00 10 00 01 00 c2 8f 28 00
60: 10 28 00 00 11 f4 03 00 00 00 11 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

The first important thing to know about lspci -xxx output is that the bytes are shown in ascending address order, and that PCI/PCIe configuration registers are little-endian (TLPs travel big-endian on the wire, but that’s not what we’re looking at here). So the way to look at the output is in groups of four bytes each, taking each group as a little-endian 32-bit unsigned int, whose bit map then matches the spec.

For example, according to the spec, bits 15-0 of the word mapped at 00h are the Vendor ID, and bits 31-16 are the Device ID. So we take the first four bytes as a little-endian 32-bit integer, and get 0x123410ee. Bits 15-0 are indeed 0x10ee, Xilinx’s Vendor ID, and bits 31-16 are 0x1234, which is the Device ID I made up for a custom device. So far so good.

Now we need to find the PCI Express Capability structure. It’s one of the structures in a linked list (would you believe that?), and it’s identified by a Cap ID of 0x10.

The pointer to the list is at bits 7-0 of the configuration word at 0x34. In our little-endian representation above, it’s simply the byte at 0x34, which reads 0x40. The capabilities hence start at 0x40.

From here on, we can travel along the list of capability structures. Each starts 32-bit aligned, with the header always having the Capability ID on bits 7-0 (appears as the first byte above), and a pointer to the next structure in bits 15-8 (the second byte).

So we start at offset 0x40, finding a Cap ID of 0x01, with the byte at offset 0x41 telling us that the next entry is at offset 0x48. Moving on to offset 0x48 we find Cap ID 0x05 and the next entry at 0x58. The entry at 0x58 has Cap ID 0x10 (!!!), and it’s the last one (the pointer to the next is zero).

So we found our structure at 0x58. The Device Capabilities Register is hence at 0x5c (offset 0x04) and reads 0x00288fc2. The Device Control Register is at 0x60 (offset 0x08), and reads 0x00002810.

So we learn from bits 2-0 of the Device Capabilities Register (having value 2) that the device supports a max_payload_size of 512 bytes. But bits 7-5 (having value 0) of the Device Control Register tell us that the effective maximal payload is only 128 bytes.

Getting the info with lspci

As I mentioned above, we didn’t really need to find the addresses by hand. lspci -v gives us, for the specific device:

# lspci -v
(...)
01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
 Subsystem: Xilinx Corporation Generic FPGA core
 Flags: bus master, fast devsel, latency 0, IRQ 42
 Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
 Capabilities: [40] Power Management version 3
 Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
 Capabilities: [58] Express Endpoint IRQ 0
 Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-00

So the address of the PCI Express Capability structure is given to us, but not the internal details (maybe some newer version of lspci shows them). And by the way, the size=128 above has nothing to do with the maximal payload: It’s the size of the memory space allocated to the device by the BIOS (the BAR address space, if we’re into it).

For the details, including the maximal payload, we use the lspci -vv option.

# lspci -vv
(...)
01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
 Subsystem: Xilinx Corporation Generic FPGA core
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
 Latency: 0, Cache Line Size: 4 bytes
 Interrupt: pin ? routed to IRQ 42
 Region 0: Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
 Capabilities: [40] Power Management version 3
 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
 Address: 00000000fee0300c  Data: 4171
 Capabilities: [58] Express Endpoint IRQ 0
 Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
 Device: Latency L0s unlimited, L1 unlimited
 Device: AtnBtn- AtnInd- PwrInd-
 Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
 Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
 Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
 Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
 Link: Latency L0s unlimited, L1 unlimited
 Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
 Link: Speed 2.5Gb/s, Width x1
 Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-0

So there we have it, in black and white: The device supports a 512-byte MaxPayload, but below we have MaxPayload given as 128 bytes.

Impact on performance

A 128-byte maximal payload is not good news if one wants to get the most out of the bandwidth. By the way, switches are not permitted to split packets (but the Root Complex is allowed to), so this number effectively tells us how much overhead each TLP (Transaction Layer Packet) carries. I talk about the TLP structure in another post.

Let’s make a quick calculation: Each packet comes with a header of 3 DWs (a DW is a 32-bit word, right?) when using 32-bit addressing, and a header of 4 DWs for 64-bit addressing. Let’s be nice and assume 32-bit addressing, so the header is 3 DWs.

TLPs may optionally carry a one-DW TLP digest (ECRC), which is generally a stupid idea if you trust the switching chipsets not to mess up your data. Otherwise, the Data Link layer’s CRC should be enough. So we’ll assume no TLP digest.

The Data Link layer overhead is a bit more difficult to estimate, because it has its own housekeeping packets. But since most acknowledge and flow control packets go in the opposite direction and hence don’t interfere with a unidirectional bulk data transmission, we’ll focus on the actual data added to each TLP: It consists of a 2-byte header (partially filled with a TLP sequence number) and a 4-byte LCRC.

So the overhead, assuming a 3-DW header, is 12 bytes for the TLP header plus another 6 bytes from the Data Link layer. All in all, we have 18 bytes, which take up ~12% of a TLP carrying 128 bytes of payload, but only ~3.4% of one carrying 512 bytes.

For a 1x configuration, which has 2.5 Gbps on the wires, and an effective 2.0 Gbps (8b/10b coding), we could dream about 250 MBytes/sec. But when the TLPs are 128 bytes long each, our upper limit goes down to some ~219 MBytes/sec. With 512-byte TLPs it’s ~241 MBytes/sec. Does it matter at all? I suppose it depends. In benchmark testing, it’s important to know these limits, or you start thinking something is wrong, when it’s actually the packet network limiting the speed.


PCIe read completion reordering and how it reduces bandwidth efficiency


While the PCI Express standard is impressive in that it actually makes sense (well, most of the time), there is a pretty annoying thing about read request reordering.

By the way, I talk about TLP packet formation in general in another post.

In section 2.4.1 of the PCIe spec 1.1, it says that read requests may be reordered as they travel across the switching network, and the same goes for read completions associated with different read requests. The only read-related packets which must arrive in the same order they were sent are read completions associated with the same read request, or as the spec puts it: “… must return in address order”.

This would be a good time to mention that a read request may be larger than the maximal payload size, so obviously the completer must have a means of splitting the completion into several TLPs, which must be sent in rising address order. And while we’re at it, there’s a boundary restriction on how to cut the data into pieces, namely that the cuts fall on boundaries of the RCB, which can be either 64 bytes or 128 bytes (if you’re not a Root Complex you may cut on 128-byte boundaries only, and if you are a Root Complex, you choose 64 or 128, and configure the endpoints to tell them what you chose).

And there’s a restriction on the maximal read request size, but that’s a different story.

So far the specification makes sense: Read completions will be split into several TLPs pretty often, and they have to arrive at the requester in linear address order, so these packets must not be reordered.

But what if the endpoint needs to collect a chunk of data which is larger than the maximal read request size (typically 512 bytes)? It will have to issue several read requests for that. But read requests and completions from different read requests may be reordered.

So if we want to ensure that the data arrives in linear order (which is necessary when the data goes into a FIFO-like data sink), each read request can be transmitted only when the last completion TLP from the previous request arrives. Otherwise, a completion TLP from the following request may arrive before that last packet. Hence there’s a time gap of non-utilized bandwidth.

In general, there is no problem having several outstanding read requests. So had read requests and read completions been strictly ordered, it would have been possible to send the following read request more or less immediately after the first one, and completions would arrive continuously.

Another issue, which is less likely to bother anyone, is that if some software makes assumptions about the order in which data in some buffer is updated, this can cause rare bugs. For example, if the read completions update some dual port RAM, and there’s software reading from this buffer, it may wrongly deduce from the update of some high-addressed memory cell that the entire buffer is updated.

And a final word: Since PCIe infrastructure is pretty plain as this post is written, I will be surprised if anyone manages to catch any packet reordering taking place for real. But exactly as write reordering is commonplace in modern CPUs to increase performance, it’s only a matter of time before PCIe switching networks do the same.

Update: I got an email from someone informing me that he spotted reordering taking place on some Intel desktop chipsets. So this isn’t just theory, after all.

Questions & Comments

Since the comment section of similar posts tends to turn into a Q & A session, I’ve taken that to a separate place. So if you’d like to discuss something with me, please post questions and comments here instead. No registration is required. The comment section below is closed.

A sniff dump of a PCIe device talking with Linux host


This is just a raw dump of PCIe communication. I wrote a small sniffer on an FPGA and ran some data in a loop to and from the peripheral. The sniffer’s own data was stored while sniffing, so it doesn’t appear in the stream. The whole thing ran on a Linux machine.

I thought that after writing a few words about TLP formation, a real-life example could be in order.

I recorded headers only, and then hacked together a Perl script (too ugly and specific for any future use) and got the dump below.

All writes from host to peripheral (marked with “>>”) are register writes (to the kernel code the BAR is at 0xfacf2000, but see lspci output below).

Writes from peripheral to host (marked with “<<”) consist of DMA transmissions containing data (longer writes) and status updates (shorter).

And then we have DMA reads made by peripheral, with read requests (“<<”) and completions (“>>”).

Each TLP is given in cleartext, followed by the packet’s 3-4 header words in hexadecimal, in parentheses. In the cleartext part, the address and the (sender’s) bus ID are given in hex; all other fields in plain decimal.

As it turned out, the host sends packets using 32-bit addressing, and the peripheral uses 64 bits (as it was told to).

So before getting to the raw dumps, let’s just see what lspci -vv gave us on the specific device:

01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
 Subsystem: Xilinx Corporation Generic FPGA core
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Step
ping- SERR- FastB2B-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
 Latency: 0, Cache Line Size: 4 bytes
 Interrupt: pin ? routed to IRQ 42
 Region 0: Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
 Capabilities: [40] Power Management version 3
 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
 Address: 00000000fee0300c  Data: 4152
 Capabilities: [58] Express Endpoint IRQ 0
 Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
 Device: Latency L0s unlimited, L1 unlimited
 Device: AtnBtn- AtnInd- PwrInd-
 Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
 Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
 Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
 Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
 Link: Latency L0s unlimited, L1 unlimited
 Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
 Link: Speed 2.5Gb/s, Width x1
 Capabilities: [100] Device Serial Number 00-00-00-00-00-00-00-0

And now to the dump itself (unfortunately, I didn’t grab any MSI):

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff008
>> (40000001, 0000000f, fdaff008)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff034
>> (40000001, 0000000f, fdaff034)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c17000
 << (60000020, 010000ff, 00000000, 00c17000)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c17080
 << (60000020, 010000ff, 00000000, 00c17080)

 << (Write) Type = 0, fmt=3, length=11
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c17100
 << (6000000b, 010000ff, 00000000, 00c17100)

 << (Write) Type = 0, fmt=3, length=4
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c29200
 << (60000004, 010000ff, 00000000, 00c29200)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff008
>> (40000001, 0000000f, fdaff008)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff034
>> (40000001, 0000000f, fdaff034)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c15000
 << (60000020, 010000ff, 00000000, 00c15000)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c15080
 << (60000020, 010000ff, 00000000, 00c15080)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c15100
 << (60000020, 010000ff, 00000000, 00c15100)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c15180
 << (60000020, 010000ff, 00000000, 00c15180)

 << (Write) Type = 0, fmt=3, length=12
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c15200
 << (6000000c, 010000ff, 00000000, 00c15200)

 << (Write) Type = 0, fmt=3, length=4
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c29200
 << (60000004, 010000ff, 00000000, 00c29200)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff008
>> (40000001, 0000000f, fdaff008)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff034
>> (40000001, 0000000f, fdaff034)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

 << (Read Rq) Type = 0, fmt=1, length=128
 <<  Bus ID: 0100, Tag=01
 << Address = 0000000000c1f000
 << (20000080, 010001ff, 00000000, 00c1f000)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=01
>>  Completion low addr=0, byte count=512
>> (4a000010, 00000200, 01000100)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=01
>>  Completion low addr=64, byte count=448
>> (4a000010, 000001c0, 01000140)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=01
>>  Completion low addr=0, byte count=384
>> (4a000010, 00000180, 01000100)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=01
>>  Completion low addr=64, byte count=320
>> (4a000010, 00000140, 01000140)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1b000
 << (60000020, 010000ff, 00000000, 00c1b000)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=01
>>  Completion low addr=0, byte count=256
>> (4a000010, 00000100, 01000100)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=01
>>  Completion low addr=64, byte count=192
>> (4a000010, 000000c0, 01000140)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1b080
 << (60000010, 010000ff, 00000000, 00c1b080)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=01
>>  Completion low addr=0, byte count=128
>> (4a000010, 00000080, 01000100)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=01
>>  Completion low addr=64, byte count=64
>> (4a000010, 00000040, 01000140)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c000
 << (60000020, 010000ff, 00000000, 00c1c000)

 << (Read Rq) Type = 0, fmt=1, length=128
 <<  Bus ID: 0100, Tag=02
 << Address = 0000000000c1f200
 << (20000080, 010002ff, 00000000, 00c1f200)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1f400
 << (60000020, 010000ff, 00000000, 00c1f400)

 << (Write) Type = 0, fmt=3, length=2
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c29200
 << (60000002, 010000ff, 00000000, 00c29200)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c100
 << (60000010, 010000ff, 00000000, 00c1c100)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=02
>>  Completion low addr=0, byte count=512
>> (4a000010, 00000200, 01000200)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=02
>>  Completion low addr=64, byte count=448
>> (4a000010, 000001c0, 01000240)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=02
>>  Completion low addr=0, byte count=384
>> (4a000010, 00000180, 01000200)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c140
 << (60000020, 010000ff, 00000000, 00c1c140)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=02
>>  Completion low addr=64, byte count=320
>> (4a000010, 00000140, 01000240)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=02
>>  Completion low addr=0, byte count=256
>> (4a000010, 00000100, 01000200)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c1c0
 << (60000010, 010000ff, 00000000, 00c1c1c0)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=02
>>  Completion low addr=64, byte count=192
>> (4a000010, 000000c0, 01000240)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=02
>>  Completion low addr=0, byte count=128
>> (4a000010, 00000080, 01000200)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c200
 << (60000020, 010000ff, 00000000, 00c1c200)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=02
>>  Completion low addr=64, byte count=64
>> (4a000010, 00000040, 01000240)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c280
 << (60000010, 010000ff, 00000000, 00c1c280)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c2c0
 << (60000020, 010000ff, 00000000, 00c1c2c0)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff008
>> (40000001, 0000000f, fdaff008)

 << (Write) Type = 0, fmt=3, length=2
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c29200
 << (60000002, 010000ff, 00000000, 00c29200)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff008
>> (40000001, 0000000f, fdaff008)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff034
>> (40000001, 0000000f, fdaff034)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

 << (Read Rq) Type = 0, fmt=1, length=128
 <<  Bus ID: 0100, Tag=03
 << Address = 0000000000c20000
 << (20000080, 010003ff, 00000000, 00c20000)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=03
>>  Completion low addr=0, byte count=512
>> (4a000010, 00000200, 01000300)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=03
>>  Completion low addr=64, byte count=448
>> (4a000010, 000001c0, 01000340)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=03
>>  Completion low addr=0, byte count=384
>> (4a000010, 00000180, 01000300)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c340
 << (60000020, 010000ff, 00000000, 00c1c340)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=03
>>  Completion low addr=64, byte count=320
>> (4a000010, 00000140, 01000340)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=03
>>  Completion low addr=0, byte count=256
>> (4a000010, 00000100, 01000300)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=03
>>  Completion low addr=64, byte count=192
>> (4a000010, 000000c0, 01000340)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c3c0
 << (60000020, 010000ff, 00000000, 00c1c3c0)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=03
>>  Completion low addr=0, byte count=128
>> (4a000010, 00000080, 01000300)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=03
>>  Completion low addr=64, byte count=64
>> (4a000010, 00000040, 01000340)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c440
 << (60000010, 010000ff, 00000000, 00c1c440)

 << (Read Rq) Type = 0, fmt=1, length=128
 <<  Bus ID: 0100, Tag=04
 << Address = 0000000000c20200
 << (20000080, 010004ff, 00000000, 00c20200)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c480
 << (60000020, 010000ff, 00000000, 00c1c480)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c500
 << (60000010, 010000ff, 00000000, 00c1c500)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=04
>>  Completion low addr=0, byte count=512
>> (4a000010, 00000200, 01000400)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=04
>>  Completion low addr=64, byte count=448
>> (4a000010, 000001c0, 01000440)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=04
>>  Completion low addr=0, byte count=384
>> (4a000010, 00000180, 01000400)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c540
 << (60000020, 010000ff, 00000000, 00c1c540)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=04
>>  Completion low addr=64, byte count=320
>> (4a000010, 00000140, 01000440)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=04
>>  Completion low addr=0, byte count=256
>> (4a000010, 00000100, 01000400)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c5c0
 << (60000020, 010000ff, 00000000, 00c1c5c0)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=04
>>  Completion low addr=64, byte count=192
>> (4a000010, 000000c0, 01000440)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=04
>>  Completion low addr=0, byte count=128
>> (4a000010, 00000080, 01000400)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=04
>>  Completion low addr=64, byte count=64
>> (4a000010, 00000040, 01000440)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c640
 << (60000020, 010000ff, 00000000, 00c1c640)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c6c0
 << (60000010, 010000ff, 00000000, 00c1c6c0)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c700
 << (60000010, 010000ff, 00000000, 00c1c700)

 << (Write) Type = 0, fmt=3, length=2
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c29200
 << (60000002, 010000ff, 00000000, 00c29200)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff008
>> (40000001, 0000000f, fdaff008)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff034
>> (40000001, 0000000f, fdaff034)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

 << (Read Rq) Type = 0, fmt=1, length=128
 <<  Bus ID: 0100, Tag=05
 << Address = 0000000000c21000
 << (20000080, 010005ff, 00000000, 00c21000)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=05
>>  Completion low addr=0, byte count=512
>> (4a000010, 00000200, 01000500)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=05
>>  Completion low addr=64, byte count=448
>> (4a000010, 000001c0, 01000540)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=05
>>  Completion low addr=0, byte count=384
>> (4a000010, 00000180, 01000500)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c740
 << (60000020, 010000ff, 00000000, 00c1c740)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=05
>>  Completion low addr=64, byte count=320
>> (4a000010, 00000140, 01000540)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=05
>>  Completion low addr=0, byte count=256
>> (4a000010, 00000100, 01000500)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=05
>>  Completion low addr=64, byte count=192
>> (4a000010, 000000c0, 01000540)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c7c0
 << (60000020, 010000ff, 00000000, 00c1c7c0)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=05
>>  Completion low addr=0, byte count=128
>> (4a000010, 00000080, 01000500)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=05
>>  Completion low addr=64, byte count=64
>> (4a000010, 00000040, 01000540)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c840
 << (60000010, 010000ff, 00000000, 00c1c840)

 << (Read Rq) Type = 0, fmt=1, length=128
 <<  Bus ID: 0100, Tag=06
 << Address = 0000000000c21200
 << (20000080, 010006ff, 00000000, 00c21200)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c880
 << (60000020, 010000ff, 00000000, 00c1c880)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c900
 << (60000010, 010000ff, 00000000, 00c1c900)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=06
>>  Completion low addr=0, byte count=512
>> (4a000010, 00000200, 01000600)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=06
>>  Completion low addr=64, byte count=448
>> (4a000010, 000001c0, 01000640)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=06
>>  Completion low addr=0, byte count=384
>> (4a000010, 00000180, 01000600)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c940
 << (60000020, 010000ff, 00000000, 00c1c940)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=06
>>  Completion low addr=64, byte count=320
>> (4a000010, 00000140, 01000640)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=06
>>  Completion low addr=0, byte count=256
>> (4a000010, 00000100, 01000600)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1c9c0
 << (60000020, 010000ff, 00000000, 00c1c9c0)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=06
>>  Completion low addr=64, byte count=192
>> (4a000010, 000000c0, 01000640)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=06
>>  Completion low addr=0, byte count=128
>> (4a000010, 00000080, 01000600)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=06
>>  Completion low addr=64, byte count=64
>> (4a000010, 00000040, 01000640)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1ca40
 << (60000020, 010000ff, 00000000, 00c1ca40)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1cac0
 << (60000010, 010000ff, 00000000, 00c1cac0)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1cb00
 << (60000010, 010000ff, 00000000, 00c1cb00)

 << (Write) Type = 0, fmt=3, length=2
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c29200
 << (60000002, 010000ff, 00000000, 00c29200)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff008
>> (40000001, 0000000f, fdaff008)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff034
>> (40000001, 0000000f, fdaff034)

>> (Write) Type = 0, fmt=2, length=1
>>  Bus ID: 0000, Tag=00
>> Address = fdaff030
>> (40000001, 0000000f, fdaff030)

 << (Read Rq) Type = 0, fmt=1, length=128
 <<  Bus ID: 0100, Tag=07
 << Address = 0000000000c22000
 << (20000080, 010007ff, 00000000, 00c22000)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=07
>>  Completion low addr=0, byte count=512
>> (4a000010, 00000200, 01000700)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=07
>>  Completion low addr=64, byte count=448
>> (4a000010, 000001c0, 01000740)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=07
>>  Completion low addr=0, byte count=384
>> (4a000010, 00000180, 01000700)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=07
>>  Completion low addr=64, byte count=320
>> (4a000010, 00000140, 01000740)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1cb40
 << (60000020, 010000ff, 00000000, 00c1cb40)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=07
>>  Completion low addr=0, byte count=256
>> (4a000010, 00000100, 01000700)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=07
>>  Completion low addr=64, byte count=192
>> (4a000010, 000000c0, 01000740)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1cbc0
 << (60000010, 010000ff, 00000000, 00c1cbc0)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=07
>>  Completion low addr=0, byte count=128
>> (4a000010, 00000080, 01000700)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=07
>>  Completion low addr=64, byte count=64
>> (4a000010, 00000040, 01000740)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1cc00
 << (60000020, 010000ff, 00000000, 00c1cc00)

 << (Read Rq) Type = 0, fmt=1, length=128
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c22200
 << (20000080, 010000ff, 00000000, 00c22200)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c22400
 << (60000020, 010000ff, 00000000, 00c22400)

 << (Write) Type = 0, fmt=3, length=16
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c1cd00
 << (60000010, 010000ff, 00000000, 00c1cd00)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=00
>>  Completion low addr=0, byte count=512
>> (4a000010, 00000200, 01000000)

>> (Completion) Type = 10, fmt=2, length=16
>>  Bus ID: 0000, Tag=00
>>  Completion low addr=64, byte count=448
>> (4a000010, 000001c0, 01000040)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13000
 << (60000020, 010000ff, 00000000, 00c13000)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13080
 << (60000020, 010000ff, 00000000, 00c13080)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13100
 << (60000020, 010000ff, 00000000, 00c13100)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13180
 << (60000020, 010000ff, 00000000, 00c13180)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13200
 << (60000020, 010000ff, 00000000, 00c13200)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13280
 << (60000020, 010000ff, 00000000, 00c13280)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13300
 << (60000020, 010000ff, 00000000, 00c13300)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13380
 << (60000020, 010000ff, 00000000, 00c13380)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13400
 << (60000020, 010000ff, 00000000, 00c13400)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13480
 << (60000020, 010000ff, 00000000, 00c13480)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13500
 << (60000020, 010000ff, 00000000, 00c13500)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13580
 << (60000020, 010000ff, 00000000, 00c13580)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13600
 << (60000020, 010000ff, 00000000, 00c13600)

 << (Write) Type = 0, fmt=3, length=32
 <<  Bus ID: 0100, Tag=00
 << Address = 0000000000c13680
 << (60000020, 010000ff, 00000000, 00c13680)

(and this is where the sniffer's memory got full)

ASPM makes Spartan-6's PCIe core miss TLP packets


The fatal error

Let’s break the bad news: Spartan-6's PCIe core may drop TLP packets sporadically when ASPM (Active State Power Management) is enabled. That means that any TLP given to the core for transmission can silently disappear, as if it had never been submitted. I also suspect that the problem exists in the opposite direction.

Hardware involved: Spartan xc6slx45t-fgg484-3-es (evaluation sample version) on an SP605 evaluation board, mounted on a Gigabyte G31M-ES2L motherboard with the Intel G33 chipset and an E5700 3.0 GHz processor.

The fairly good news is that the core’s cfg_dstatus[2] ( = fatal error detected) will go high as a result of dropping TLPs. Or at least so it did in my case. So it looks like monitoring this signal, and doing something loud if it goes to '1', is enough to at least know whether the core does its job or not.

Let me spell it out: If you’re designing with Xilinx’ PCIe core, you should verify that cfg_dstatus[2] stays '0', and if it goes high you should treat the PCIe endpoint as completely unreliable.

How to know if ASPM is enabled

On a Linux box, become root and go lspci -vv. The output will include all devices, but the relevant part will be something like

01:00.0 Class ff00: Xilinx Corporation Generic FPGA core
 Subsystem: Xilinx Corporation Generic FPGA core
 Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B-
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR-
 Latency: 0, Cache Line Size: 4 bytes
 Interrupt: pin ? routed to IRQ 44
 Region 0: Memory at fdaff000 (64-bit, non-prefetchable) [size=128]
 Capabilities: [40] Power Management version 3
 Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
 Status: D0 PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [48] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
 Address: 00000000fee0300c  Data: 4181
 Capabilities: [58] Express Endpoint IRQ 0
 Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
 Device: Latency L0s unlimited, L1 unlimited
 Device: AtnBtn- AtnInd- PwrInd-
 Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
 Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
 Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
 Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
 Link: Latency L0s unlimited, L1 unlimited
 Link: ASPM L0s Enabled RCB 64 bytes CommClk- ExtSynch-
 Link: Speed 2.5Gb/s, Width x1

There we have it: I set up the device to announce an unlimited acceptable L0s latency, so the BIOS configured the link accordingly, and this ended up with ASPM enabled.

What we really want is the output to end with something like:

Link: Latency L0s unlimited, L1 unlimited
 Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
 Link: Speed 2.5Gb/s, Width x1

The elegant solution

The really good news is that there is a simple solution: Disable ASPM. In other words, program the link partners to never enter the L0s or L1 power saving modes. In a Linux kernel driver, it’s pretty simple:

#include <linux/pci-aspm.h>

pci_disable_link_state(pdev, PCIE_LINK_STATE_L0S | PCIE_LINK_STATE_L1 |
 PCIE_LINK_STATE_CLKPM);

This is something I would do without thinking twice for any device based upon Xilinx’ PCIe core. Actually, I would do this for any device for which power saving is irrelevant.

The maybe-working solution

In theory, the kernel can run with different ASPM policies, one of which is “powersave”. If it runs with “performance”, all transitions to L0s are disabled, and all should be well. In practice, it looks like the kernel community is pushing towards allowing L0s even under the performance policy.

The shaky workaround

When some software wants to allow L0s, it must check whether the wakeup latency from L0s to L0 (that is, from napping to awake) is one the device can tolerate. The device announces its maximal acceptable latency in the PCI Express Capability structure. By setting the acceptable L0s latency limit to the shortest value allowed (64 ns), one can hope that the hardware will not be able to meet this requirement, and hence give up on using ASPM. This trick happened to work on my own motherboard, but another motherboard may be able to meet the 64 ns requirement, and enable ASPM. So this isn’t really a solution.

Anyhow, the success of this method will yield an lspci -vv output with something like

Capabilities: [58] Express Endpoint IRQ 0
 Device: Supported: MaxPayload 512 bytes, PhantFunc 0, ExtTag-
 Device: Latency L0s <64ns, L1 <1us
 Device: AtnBtn- AtnInd- PwrInd-
 Device: Errors: Correctable- Non-Fatal- Fatal- Unsupported-
 Device: RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
 Device: MaxPayload 128 bytes, MaxReadReq 512 bytes
 Link: Supported Speed 2.5Gb/s, Width x1, ASPM L0s, Port 0
 Link: Latency L0s unlimited, L1 unlimited
 Link: ASPM Disabled RCB 64 bytes CommClk- ExtSynch-
 Link: Speed 2.5Gb/s, Width x1

How I know it isn’t my own bug

The transitions from L0 to L0s and back throttle the data flow through the PCIe core, so maybe these on-and-offs exposed a bug in my own HDL code’s data flow? Why do I blame Xilinx?

The answer was found in the dbg_* debug lines supplied from within the PCIe core. These lines go high whenever something bad happens in the core’s lower layers. Running without ASPM, these lines stayed zero. When ASPM was enabled, and in conjunction with packet drops, the following lines were asserted:

  • dbg_reg_detected_fatal: Well, I knew this already. A fatal error was detected.
  • dbg_reg_detected_correctable: A correctable error was detected. Nice, but I really don’t care.
  • dbg_rply_timeout_status: The replay timer expired: A TLP packet was sent, but no acknowledgement was received. That indicates that things aren’t perfect, but as long as the packet is retransmitted, this doesn’t indicate a user-visible issue.
  • dbg_dl_protocol_status: Ayeee. This means that an out of range ACK or NAK was received. In other words, the link partners are not on the same page regarding which packets are waiting for acknowledgement.

The last bullet is our smoking gun: It indicates that the PCIe link protocol has been violated. There is nothing the application HDL code can do to make this happen. The last two bullets point at a TLP being lost, retransmitted, and something going wrong with the acknowledgement. Not a sign saying “a packet was lost”, but as close as one gets to that, I suppose.

Update: My attention was drawn in a comment below to some interesting Xilinx Answer records. Answer record #33871 mentions LL_REPLAY_TIMEOUT as the parameter to fix in order to solve a fatal error condition, but says nothing about packet dropping. It looks like this issue has been fixed in the official PCIe wrapper lately. This leaves me wondering whether people didn’t notice they lost packets, or if Xilinx decided not to admit that too loudly.

PCIe: Is your card silently struggling with TLP retransmits?


Introduction

The PCI Express standard requires an error detection and retransmit mechanism, which ensures that TLP packets indeed arrive correctly. The need for reliable communication on a system bus is obvious, but this mechanism also sweeps problems under the carpet: If data packets arrive faulty or are lost in the lower layers, in practice nobody will notice. While error reporting mechanisms exist at the hardware level, there is no mechanism to inform the end user that something isn’t working so well.

Errors in the low-level packets are not only a performance issue (retransmissions are a waste of bandwidth). With properly designed hardware, there is no reason for them to appear at all, so their very existence indicates that something might be about to stop working.

When developing hardware or using PCIe extension cables, this issue is even more important. A setting which hasn’t been verified extensively may appear to work, but in fact it’s just barely getting the data through.

The methodology

According to the PCIe spec, correctable (as well as uncorrectable) errors are noted in the PCI Express Capability structure by setting bits matching the type of error. Using a command-line application in Linux, we’ll check the status of a specific device.

By checking the status register of our specific device, it’s possible to tell if it has detected (and fixed) something wrong in the TLP packets it has received. To detect corrected errors in TLPs going in the other direction, it’s necessary to locate the device’s link partner (a switch, bridge or the root complex). Even then, it will be difficult to say something definite: If the link partner reports an error, there may not be a way to tell which link (and hence device) caused it.

In this example, we’ll check a Xillybus peripheral (custom hardware), because we can control the amount of data flowing from and to it. For example, in order to send 100 MB of zeros in a loop, just go:

$ dd if=/dev/zero of=/dev/xillybus_write_32 bs=1k count=100k &
$ cat /dev/xillybus_read_32 > /dev/null

The Device Status Register

This register is part of the PCI Express Capability structure, at offset 0x0a. Its four least significant bits can supply information about the device’s health (a decoding sketch follows the list):

  • Bit 0 — Correctable Error Detected. This bit is set when e.g. a TLP packet doesn’t pass the CRC check; such an error is corrected with a retransmit, which is why it counts as correctable.
  • Bit 1 — Non-Fatal Error Detected. A condition which wasn’t expected, but could be recovered from. This may indicate some incompatibility between the link partners, or a physical-layer error which caused a recoverable mishap in the protocol.
  • Bit 2 — Fatal Error Detected. This means that the device should be considered unreliable. Unrecoverable packet loss is one of the reasons for setting this bit.
  • Bit 3 — Unsupported Request Detected. When the device receives a request packet which it doesn’t support, this bit goes high. It may be harmless, in particular if the hosting hardware is significantly newer than the device.

(See section 6.2 for the classification of errors)

Checking status

This requires a fairly recent version of setpci (3.1.7 is enough). Earlier versions may not recognize extended capability registers by their name.

As mentioned earlier, we’ll query a Xillybus peripheral. This allows running a script loop of sending a known amount of data, and then checking if something went wrong.

To read the Device Status Register, become root and go:

# setpci -d 10ee:ebeb CAP_EXP+0xa.w
0000

Despite the command’s name, setpci, it actually reads a word (the “.w” suffix) at offset 0xa on the PCI Express Capability (CAP_EXP) structure. The device is selected by its Vendor/Product IDs, which are 0x10ee and 0xebeb respectively. This works well when there’s a single device with that pair.

Otherwise, it can be singled out by its bus position. For example, checking one of the Root Ports:

# lspci
(... some devices ...)
00:1b.0 Audio device: Intel Corporation Ibex Peak High Definition Audio (rev 05)
00:1c.0 PCI bridge: Intel Corporation Ibex Peak PCI Express Root Port 1 (rev 05)
00:1c.1 PCI bridge: Intel Corporation Ibex Peak PCI Express Root Port 2 (rev 05)
00:1c.3 PCI bridge: Intel Corporation Ibex Peak PCI Express Root Port 4 (rev 05)
00:1d.0 USB Controller: Intel Corporation Ibex Peak USB Universal Host Controller (rev 05)
(... more devices ...)
[root@ocho eli]# setpci -s 00:1c.0 CAP_EXP+0xa.w
0010

In both cases the returned value had zeros on bits 3-0, indicating that no errors whatsoever were detected. But suppose we got something like this (which is a result of playing nasty games with the PCIe connector):

# setpci -d 10ee:ebeb CAP_EXP+0xa.w
000a

Bits 1 and 3 are set here, indicating a non-fatal error has been detected as well as an unsupported request. Surprisingly enough, playing with the connector didn’t cause a correctable error.

When writing to this register, any bit which is '1' in the written word is cleared in the status register. So to clear all four error bits, write the word 0x000f:

# setpci -d 10ee:ebeb CAP_EXP+0xa.w=0x000f
# setpci -d 10ee:ebeb CAP_EXP+0xa.w
0000

Some general notes

  • setpci writes directly to the PCIe peripheral’s configuration space. Typos may be as harmful as anything else done as root. Note that almost all peripherals, including disk controllers, are linked to the PCIe bus somehow.
  • The truth is that all these 0x prefixes are redundant. setpci assumes hex values anyhow.
  • When setpci answers “Capability 0010 not found” it doesn’t necessarily mean that the PCI Express capability structure doesn’t exist on some device. It can also mean that no device was matched, or that you don’t have permission for the relevant operation.

Virtex-5 PCIe endpoint block plus: Stay away from v1.15


While porting Xillybus to Virtex-5, I ran into nasty trouble. In the beginning, it looked like the MSI interrupt delivery mechanism was wrong, and then it turned out that the core gets locked up completely, refusing to send any TLPs after the first few. I also noticed that the PCIe core had the “Fatal Error Detected” flag set in its status register (or more precisely, Xillybus banged me in the head with the bad news). Eventually, I found myself resetting the core with a debounced pushbutton connected to sys_reset_n at a very specific point in the host’s boot process to make the system work. Using just PERST_B, as the user guide suggests, simply didn’t work.

All this was with version 1.15 of the PCIe endpoint block plus, which was introduced in ISE 13.2. Quite by chance, I tried ISE 13.1, which comes with version 1.14 of the core. And guess what, suddenly PERST_B connected to sys_reset_n did the job, and the Fatal Error vanished.

I have to admit I’m quite amazed by this.

Questions & Comments

Since the comment section of similar posts tends to turn into a Q & A session, I’ve taken that to a separate place. So if you’d like to discuss something with me, please post questions and comments here instead. No registration is required. The comment section below is closed.

An FPGA-based PCI Express peripheral for Windows: It’s easy


To make a long story short…

There is really no need to work hard to make your FPGA talk with a PC. Xillybus gives you the end-to-end connection interfacing FPGAs with both Linux and Windows computers.

The challenge

At times, FPGA engineers are faced with the need to transfer data to or from a regular PC. Sometimes it’s the purpose of the project (e.g. data acquisition, frame grabbing, firmware loading etc.). Others need data transport with a PC for testing their HDL design on silicon, or just to run whatever automated tests or calibration processes the project involves. One way or another, the lowest-level piece of logic (the FPGA) now needs to talk with the highest-level form of logic (a user-space application running in protected memory mode within a full-blown operating system). That’s quite a gap to bridge.

Since I published a short guide about the basics of PCI Express, I keep receiving questions implying that some FPGA engineers don’t grasp what’s ahead of them when they start such a project. Even for low-bandwidth assignments (where no DMA is necessary) there’s quite some way to go before having something that works in a stable manner. While Linux offers a fairly straightforward API for writing device drivers, developing any kind of driver for Windows is quite a project in itself. And even though my heart is with Linux, it’s pretty clear that “PC” and “Windows” are synonyms to most people today.

Non-PCIe solutions

There are two common approaches today for making a Windows PC and an FPGA talk:

  • Using Cypress’ EZ-USB chip, which supplies a generic data interface for USB communication. Windows drivers are available from Cypress, but interfacing the chip with the FPGA requires a substantial piece of logic, as well as some 8051 firmware hacking. From my own experience and others’, those chips have some “personality” once they’re put on a real design’s board. So all in all, this is not a rose garden, and yet for many years this was considered the only sane solution.
  • Connecting the FPGA to the PC’s network plug through an Ethernet chip, typically a Marvell Alaska transceiver. This solution is attractive in particular when data goes from the FPGA to the PC, since raw MAC packets are quite easy to assemble. The main problem with this solution is that even though it usually works fine, it’s just because the hardware and software components are more reliable than required, as detailed in another post of mine.

Painless PCIe

As mentioned above, the good news is that Xillybus makes it amazingly easy: On the computer, open a plain file in a user-space application, using whatever programming language you want. No creepy API involved. Just read and write to a file, and put a plain FIFO inside the FPGA to hold the data.

Xillybus supports a variety of Xilinx and Altera FPGAs, regardless of the host’s operating system: All Spartan-6, Virtex-5 and Virtex-6 devices with a “T” suffix (those having a built-in PCIe hardware core). As for Altera, all devices having the hard IP form of the IP Compiler for PCI Express.

List of FPGA boards and IP cores with PCIe/USB and their vendors


I collected some links for my own use (limiting myself to Virtex-5 and later Xilinx FPGAs). Maybe this can help someone else too. This is by no means a complete list, but additions and corrections are welcome in the comment section below (I may delete your comment and update the list, don’t take it personally).

So, in no particular order…

PCI Express IP Cores

PCI Express helper chipsets

USB chipsets

  • Cypress EZ-USB FX2LP (CY7C68013A and friends)
  • FTDI FT2232H UART/FIFO
  • TI’s TUSB series of transceivers. PHY only, without USB’s logical protocol, which is a lot to implement.
  • SMSC’s USB3250 (PHY only as well)

The boards listed have native PCIe connection (that is, with no PCIe bridge)

EZUSB boards

Other USB boards

The full list of boards (all types) is here.


Altera’s IP compiler for PCI express, and how to survive it


This is the good news: Xillybus is now supporting Altera FPGAs having the hard IP transceiver for PCI Express (and other Gigabit interfaces). If you’re into PCI Express, and into a fairly recent project, odds are that your device is on the list.

There is, of course, the possibility to handle the Avalon-ST interface by yourself, with its quirky endianness issues: The payload is endian-swapped, but the header isn’t.

Or its “interesting” alignment of DWords in QWords, which makes it Avalon-compatible, but also opens an opportunity for exotic bugs in getting the payload data right. And is it only me who isn’t very fond of the delay of three clock cycles between rx_st_ready being deasserted and rx_st_valid going low? That kind of forced me to put a FIFO in between. When I can’t accept more data, I can’t. Period. And just to play completely unfair, the application logic may deassert tx_st_valid, but not in the middle of a TLP. So much for flow control. They call it readyLatency; I call it practically no flow control.

This is not to say that Altera’s PCIe core is bad. It just has its little corners. As has PCI Express itself: Minding the credits for incoming read request completions is something one can’t get away from: in theory, it should have been the data link layer’s work, but since it’s forced to announce infinite credits on completions to unposted requests, user logic must make sure the completion buffers don’t overflow.

Having the user logic aware of max_payload_size and the maximal read request size is just another little detail to handle. Not to mention the need to know our bus and device number (the function is always zero. Shhh.). The chosen implementation for retrieving this information (a.k.a. the “Configuration space signals”) could have been easier and less confusing to handle. The tl_cfg_add, tl_cfg_ctl, tl_cfg_ctl_wr, tl_cfg_sts and tl_cfg_sts_wr signals are documented in a somewhat confusing manner, and it’s not always 100% clear what information is where.

And I haven’t even started with the PCI Express protocol itself. Not to mention the host driver, in particular if it’s for Windows.

So indeed, Xillybus is good news.


Please post questions and comments here. No registration is required. The comment section below is closed, since comments to posts like this one tend to turn into Q&A.

Getting the PCIe of Avnet S6LX150T Development Kit detected


About a year ago, I had a client failing to get the PCIe working on an Avnet LX150T development board. Despite countless joint efforts, we failed to get the card detected as a PCIe device by the computer.

A recent comment from another client supplied the clue: The user guide (which I downloaded recently from Avnet) is simply wrong about the DIP switch setting of the board. Table 5 (page 11) says that SW8 should be set OFF-ON-OFF to have a reference clock of 125 MHz to the FPGA.

On the other hand, Avnet’s AXI PCIe Endpoint Design guide for the same board (also downloaded recently from Avnet) says on page 30, that the setting for the same frequency should be ON-ON-ON.

Hmmm… So which one is correct?

Well, those signals go to an IDT ICS874003-05 PCI Express jitter attenuator, which consists of a frequency synthesizer. It multiplies the board’s 100 MHz reference clock by 5 on its internal VCO, and then divides that clock by an integer. The DIP switch setting determines the integers used for both outputs.

Ah, yes, there are generally two different divisors for the two outputs, depending on the DIP switch setting. In other words, PCIe-REFCLK0 and PCIe-REFCLK2 run at different frequencies (except for two settings, for which they happen to be the same). It’s worth downloading the chip’s datasheet and having a look; the relevant table is on its first page.
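
To make the synthesizer’s arithmetic concrete, here’s a small C sketch. The divisor that each DIP setting selects is listed in the datasheet’s table; the divisors below are merely examples matching the frequencies mentioned in this post:

#include <stdio.h>

int main(void)
{
	const double vco_hz = 100e6 * 5;    /* 100 MHz reference, x5 on the VCO */
	const int divisors[] = { 2, 4, 5 }; /* Example divisors only */
	int i;

	for (i = 0; i < 3; i++)
		printf("VCO / %d = %.0f MHz\n", divisors[i],
		       vco_hz / divisors[i] / 1e6);
	return 0;
}

This prints 250, 125 and 100 MHz, respectively.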

The bottom line is that the correct setting for a 125 MHz clock is ON-ON-ON, for which the correct clock is generated on both clock outputs. By the way, if it’s OFF-OFF-OFF, a 250 MHz clock appears on both outputs.

All other combinations generate two different clocks. Refer to the datasheet if you need a 100 MHz clock.

 

Linux kernel hack for calming down a flood of PCIe AER messages


While working on a project involving a custom PCIe interface, Linux’ message log became flooded with messages like

pcieport 0000:00:1c.6:   device [8086:a116] error status/mask=00001081/00002000
pcieport 0000:00:1c.6:    [ 0] Receiver Error
pcieport 0000:00:1c.6:    [ 7] Bad DLLP
pcieport 0000:00:1c.6:    [12] Replay Timer Timeout
pcieport 0000:00:1c.6:   Error of this Agent(00e6) is reported first
pcieport 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)
pcieport 0000:02:00.0:   device [10b5:8606] error status/mask=00003000/00002000
pcieport 0000:02:00.0:    [12] Replay Timer Timeout
pcieport 0000:00:1c.6: AER: Corrected error received: id=00e6
pcieport 0000:00:1c.6: can't find device of ID00e6
pcieport 0000:00:1c.6: AER: Corrected error received: id=00e6
pcieport 0000:02:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0200(Transmitter ID)

And before long, some 400 MB of log messages accumulated in /var/log/messages. In this context, they are merely informative AER (Advanced Error Reporting) messages, telling me that errors have occurred in the link between the computer’s PCIe controller and the PCIe switch on the custom board. But all of these errors were correctable (presumably with retransmits) so from a functional standpoint, the hardware worked.

Advanced Error Reporting and its Linux driver were explained in a paper at OLS 2007 (pdf).

Had it not been for these messages, I could have been misled to think that all was fine, even though there’s a method to tell, which I’ve dedicated an earlier post to. So they’re precious, but they flood the system logs. Even worse, the system is so busy handling them that the boot is slowed down, and sometimes the boot process gets stuck in the middle.

At first I thought that it would be enough to just turn off the logging of these messages, but it seems like the flood of interrupts was the problem.

So one way out is to disable the AER handler altogether: use the pci=noaer kernel parameter on boot, or disable the CONFIG_PCIEAER kernel configuration flag and recompile the kernel. This removes the piece of code that configures the computer’s root port to send interrupts if and when an AER message arrives. But that way, I won’t be alerted that a problem exists.

So I went for hacking the kernel code. In an early attempt, I made the handler produce no more than 5 error messages per second, instead of one for each event. It worked in the sense that the log wasn’t flooded, but it didn’t solve the problem of a slow or impossible boot. As mentioned earlier, the core problem seems to be the bombardment of interrupts.

So the hack that eventually did the job for me tells the root port to stop generating interrupts after 100 kernel messages have been produced. That’s enough to inform me that there’s a problem, and give me an idea of where it is, but it stops soon enough to let the system live.

The only file I modified was drivers/pci/pcie/aer/aerdrv_errprint.c on a 4.2.0 Linux kernel. In retrospect, I could have done it more elegantly. But hey, now that it works, why should I care…?

It goes like this: I defined a static variable, countdown, and initialized it to 100. Before a message is produced, a piece of code like this runs:

	if (!countdown--)
		aer_enough_is_enough(dev);

aer_enough_is_enough() is merely a copy of aerdrv.c’s aer_disable_rootport(), which is defined as static there, and requires an uncomfortable argument. It would have made more sense to make aer_disable_rootport() a wrapper of another function, which could have been used both by aerdrv.c and my little hack; that would have been much more elegant.

Instead, I copied two additional static functions that are required by aer_disable_rootport() into aerdrv_errprint.c, and ended up with an ugly hack that solves the problem.

With all due shame, here are the changes in patch format. It’s not intended to apply to your kernel as is; it’s more of a guideline for how to get it done. And by all means, take a look at aerdrv.c’s relevant functions, and see if they’re different, by any chance.

From b007850486167288ea4c6c6a1bf30ddd1a299f24 Mon Sep 17 00:00:00 2001
From: Eli Billauer <my-mail@gmail.com>
Date: Sat, 17 Oct 2015 07:37:19 +0300
Subject: [PATCH] PCIe AER handler: Turn off interrupts from root port after 100 messages

---
 drivers/pci/pcie/aer/aerdrv_errprint.c |   78 ++++++++++++++++++++++++++++++++
 1 files changed, 78 insertions(+), 0 deletions(-)

diff --git a/drivers/pci/pcie/aer/aerdrv_errprint.c b/drivers/pci/pcie/aer/aerdrv_errprint.c
index 167fe41..31a8572 100644
--- a/drivers/pci/pcie/aer/aerdrv_errprint.c
+++ b/drivers/pci/pcie/aer/aerdrv_errprint.c
@@ -20,6 +20,7 @@
 #include <linux/pm.h>
 #include <linux/suspend.h>
 #include <linux/cper.h>
+#include <linux/pcieport_if.h>

 #include "aerdrv.h"
 #include <ras/ras_event.h>
@@ -129,6 +130,74 @@ static const char *aer_agent_string[] = {
 	"Transmitter ID"
 };

+/* Two functions copied from aerdrv.c, to prevent name space pollution */
+
+static int set_device_error_reporting(struct pci_dev *dev, void *data)
+{
+	bool enable = *((bool *)data);
+	int type = pci_pcie_type(dev);
+
+	if ((type == PCI_EXP_TYPE_ROOT_PORT) ||
+	    (type == PCI_EXP_TYPE_UPSTREAM) ||
+	    (type == PCI_EXP_TYPE_DOWNSTREAM)) {
+		if (enable)
+			pci_enable_pcie_error_reporting(dev);
+		else
+			pci_disable_pcie_error_reporting(dev);
+	}
+
+	if (enable)
+		pcie_set_ecrc_checking(dev);
+
+	return 0;
+}
+
+/**
+ * set_downstream_devices_error_reporting - enable/disable the error reporting  bits on the root port and its downstream ports.
+ * @dev: pointer to root port's pci_dev data structure
+ * @enable: true = enable error reporting, false = disable error reporting.
+ */
+static void set_downstream_devices_error_reporting(struct pci_dev *dev,
+						   bool enable)
+{
+	set_device_error_reporting(dev, &enable);
+
+	if (!dev->subordinate)
+		return;
+	pci_walk_bus(dev->subordinate, set_device_error_reporting, &enable);
+}
+
+/* Allow 100 messages, and then stop it. Since the print functions are called
+   from a work queue, it's safe to call anything, aer_disable_rootport()
+   included. */
+
+static int countdown = 100;
+
+/* aer_enough_is_enough() is a copy of aer_disable_rootport(), only the
+   latter requires to get the aer_rpc structure from the pci_dev structure,
+   and then uses it to get the pci_dev structure. So enough with that too.
+*/
+
+static void aer_enough_is_enough(struct pci_dev *pdev)
+{
+	u32 reg32;
+	int pos;
+
+	dev_err(&pdev->dev, "Exceeded limit of AER errors to report. Turning off Root Port interrupts.\n");
+
+	set_downstream_devices_error_reporting(pdev, false);
+
+	pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
+	/* Disable Root's interrupt in response to error messages */
+	pci_read_config_dword(pdev, pos + PCI_ERR_ROOT_COMMAND, &reg32);
+	reg32 &= ~ROOT_PORT_INTR_ON_MESG_MASK;
+	pci_write_config_dword(pdev, pos + PCI_ERR_ROOT_COMMAND, reg32);
+
+	/* Clear Root's error status reg */
+	pci_read_config_dword(pdev, pos + PCI_ERR_ROOT_STATUS, &reg32);
+	pci_write_config_dword(pdev, pos + PCI_ERR_ROOT_STATUS, reg32);
+}
+
 static void __print_tlp_header(struct pci_dev *dev,
 			       struct aer_header_log_regs *t)
 {
@@ -168,6 +237,9 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 	int layer, agent;
 	int id = ((dev->bus->number << 8) | dev->devfn);

+	if (!countdown--)
+		aer_enough_is_enough(dev);
+
 	if (!info->status) {
 		dev_err(&dev->dev, "PCIe Bus Error: severity=%s, type=Unaccessible, id=%04x(Unregistered Agent ID)\n",
 			aer_error_severity_string[info->severity], id);
@@ -200,6 +272,9 @@ out:

 void aer_print_port_info(struct pci_dev *dev, struct aer_err_info *info)
 {
+	if (!countdown--)
+		aer_enough_is_enough(dev);
+
 	dev_info(&dev->dev, "AER: %s%s error received: id=%04x\n",
 		info->multi_error_valid ? "Multiple " : "",
 		aer_error_severity_string[info->severity], info->id);
@@ -226,6 +301,9 @@ void cper_print_aer(struct pci_dev *dev, int cper_severity,
 	u32 status, mask;
 	const char **status_strs;

+	if (!countdown--)
+		aer_enough_is_enough(dev);
+
 	aer_severity = cper_severity_to_aer(cper_severity);

 	if (aer_severity == AER_CORRECTABLE) {
--
1.7.2.3

And again — it’s given as a patch, but really, it’s not intended for application as is. If you need to do this yourself, read through the patch, understand what it does, and make the changes with respect to your own kernel. Or your system may just hang.

Using Linux’ setpci to program an EEPROM attached to a PLX / Avago PCIe switch


Introduction

These are my notes as I programmed an Atmel AT25128 EEPROM, attached to a PEX 8606 PCIe switch, using PCIe configuration-space writes only (that is, no I2C / SMBus cable). This is frankly quite redundant, as Avago supplies software tools for doing this.

In fact, in order to get their tools, register at Avago’s site, then make the extra registration at PLX Tech’s site. Neither registration requires signing an NDA. At PLX Tech’s site, pick SDK -> PEX at the bottom of the list of devices to get documentation for, and download the PLX SDK. Among others, this suite includes the PEX Device Editor, which is quite a useful tool regardless of switches, as it gives a convenient tree view of the bus. The Device Editor, as well as other tools, allows programming the EEPROM from the host, with or without an I2C cable.

There are also other tools in the SDK that do the same thing, PLXMon in particular. If you have an Aardvark I2C-to-USB cable, PLXMon allows reading and writing to the EEPROM through I2C. And there’s a command line interface, probably for all functionality. So this is really for those who want to get down to the gory details.

All said below will probably work with the entire PEX 86xx family, and possibly with other Avago devices as well. The Data Book is your friend.

The EEPROM format

The organization of data is outlined in the Data Book, but to keep it short and concise: it’s a sequence of bytes, consisting of a concatenation of the following fields, all represented in little endian format:

  1. The signature, always 0x5a, occupying one byte
  2. A zero (0x00), occupying one byte
  3. The number of bytes of payload data to come, given as a 16-bit word (two bytes). Or equivalently, the number of registers to be written to, multiplied by 6.
  4. The address of the register to be written to, divided by 4, and ORed with the port number left-shifted by 10 bits. See the Data Book for how NT ports are addressed. This field occupies 16 bits (two bytes). Or to put it in C’ish:
    unsigned short addr_field = (reg_addr >> 2) | (port << 10);
  5. The data to be written: 32 bits (four bytes)

Items #4 and #5 are repeated for each register write. There is no alignment, so when this stream is organized in 32-bit words, it becomes somewhat inconvenient.

And as the Data Book keeps saying all over the place: if the Debug Control register (at 0x1dc) is written to, it has to be the first entry (occupying bytes 4 to 9 in the stream). Its address representation in the byte stream is 0x0077 (or more precisely, the byte 0x77 followed by 0x00).
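
As a sketch of the format, this hypothetical C function serializes a list of register writes into the byte stream described above (it’s the caller’s responsibility to put a Debug Control entry, if any, first):

#include <stdint.h>
#include <stddef.h>

struct reg_write {
	unsigned int port;     /* Port number, as per the Data Book */
	unsigned int reg_addr; /* Register's byte address */
	uint32_t data;
};

size_t build_eeprom_stream(uint8_t *out, const struct reg_write *w, int n)
{
	uint16_t count = (uint16_t)(n * 6); /* 6 bytes per register write */
	uint8_t *p = out;
	int i;

	*p++ = 0x5a;                    /* Signature */
	*p++ = 0x00;
	*p++ = (uint8_t)(count & 0xff); /* Payload byte count, little endian */
	*p++ = (uint8_t)(count >> 8);

	for (i = 0; i < n; i++) {
		uint16_t addr = (uint16_t)((w[i].reg_addr >> 2) |
					   (w[i].port << 10));

		*p++ = (uint8_t)(addr & 0xff);      /* Address field */
		*p++ = (uint8_t)(addr >> 8);
		*p++ = (uint8_t)(w[i].data & 0xff); /* Data, little endian */
		*p++ = (uint8_t)((w[i].data >> 8) & 0xff);
		*p++ = (uint8_t)((w[i].data >> 16) & 0xff);
		*p++ = (uint8_t)((w[i].data >> 24) & 0xff);
	}
	return (size_t)(p - out);
}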

Accessing configuration space registers

Given the following PCI bus setting:

02:00.0 PCI bridge: PLX Technology, Inc. Unknown device 8606 (rev ba)
03:01.0 PCI bridge: PLX Technology, Inc. Unknown device 8606 (rev ba)
03:05.0 PCI bridge: PLX Technology, Inc. Unknown device 8606 (rev ba)
03:07.0 PCI bridge: PLX Technology, Inc. Unknown device 8606 (rev ba)
03:09.0 PCI bridge: PLX Technology, Inc. Unknown device 8606 (rev ba)

In particular note that the switch’ upstream port 0 is at 02:00.0.

Reading from the Serial EEPROM Buffer register at 264h (as root, of course):

# setpci -s 02:00.0 264.l
00000000

The -s 02:00.0 part selects the device by its bus position (see above).

Note that all arguments as well as return values are given in hexadecimal. A 0x prefix is allowed, but it’s redundant.

Making a dry-run of writing to this register, and verifying nothing happened:

# setpci -Dv -s 02:00.0 264.l=12345678
02:00.0:264 12345678
# setpci -s 02:00.0 0x264.l
00000000

Now let’s write for real:

# setpci -s 02:00.0 264.l=12345678
# setpci -s 02:00.0 264.l
12345678

(Yey, it worked)

Reading from the EEPROM

Reading four bytes from the EEPROM at address 0:

# setpci -s 02:00.0 260.l=00a06000
# setpci -s 02:00.0 264.l
0012005a

The “a0” part above sets the address width explicitly to 2 bytes on each operation. There may be some confusion otherwise, in particular if the device wasn’t detected properly at bringup. The “60” part means “read”.

Just checking the value of the status register after this:

# setpci -s 02:00.0 260.l
00816000

Same, but read from EEPROM address 4. The register’s 13 LSBs hold the EEPROM address in DWord units, so they cover bits [14:2] of the byte address. It’s also possible to access higher addresses (see the respective Data Book).

# setpci -s 02:00.0 260.l=00a06001
# setpci -s 02:00.0 264.l
0008c03a

Or, to put it in a simple Bash script: this one reads the first 16 DWords (i.e. 64 bytes) from the EEPROM of the switch located at the bus address given as the argument to the script (see example below):

#!/bin/bash

DEVICE=$1

for ((i=0; i<16; i++)); do
  setpci -s $DEVICE 260.l=`printf '%08x' $((i+0xa06000))`
  usleep 100000
  setpci -s $DEVICE 264.l
done

Rather than checking the status bit for the read to be finished, the script waits 100 ms. A quick and dirty solution, but it works.

Note: usleep is deprecated as a command-line utility. Instead, odds are that “sleep 0.1” replaces “usleep 100000”. Yes, sleep takes non-integer arguments in non-ancient UNIXes.

Writing to the EEPROM

Important: Writing to the EEPROM, in particular the first word, can make the switch ignore the EEPROM or load faulty data into the registers. On some boards, the EEPROM is essential for the detection of the switch by the host and its enumeration. Consequently, writing junk to the EEPROM can make it impossible to rectify this through the PCIe interface. This can render the PCIe switch useless, unless this is fixed with I2C access.

Before starting to write, the EEPROM’s write enable latch needs to be set. This is done once for each write as follows, regardless of the desired target address:

# setpci -s 02:00.0 260.l=00a0c000

Now we’ll write 0xdeadbeef to the first 4 bytes of the EEPROM.

# setpci -s 02:00.0 264.l=deadbeef
# setpci -s 02:00.0 260.l=00a04000

If another address is desired, add the address in bytes, divided by 4, to 00004000 above. The write-enable latch command is the same (no change in its lower bits is required).

Here’s an example of the sequence for writing to bytes 4-7 of the EEPROM (all three lines are always required):

# setpci -s 02:00.0 260.l=00a0c000
# setpci -s 02:00.0 264.l=010d0077 # Just any value goes
# setpci -s 02:00.0 260.l=00a04001
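
For reference, the command words used above can be composed with a couple of hypothetical C helpers (the opcode values are simply those appearing in the setpci sessions of this post):

#include <stdint.h>

#define EEPROM_WREN  0x00a0c000u /* Set the write-enable latch */
#define EEPROM_WRITE 0x00a04000u /* Write the buffered DWord to EEPROM */
#define EEPROM_READ  0x00a06000u /* Read a DWord from EEPROM */

/* The EEPROM address is carried in the low bits, in DWord units */
static uint32_t eeprom_cmd(uint32_t opcode, uint32_t byte_addr)
{
	return opcode | (byte_addr >> 2);
}

For example, eeprom_cmd(EEPROM_WRITE, 4) yields 0x00a04001, the last command above.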

Or making a script of this, which writes the arguments from address 0 and on (for those who like to make big mistakes…)

#!/bin/bash

numargs=$#
DEVICE=$1

shift

for ((i=0; i<(numargs-1); i++)); do
  setpci -s $DEVICE 260.l=00a0c000
  setpci -s $DEVICE 264.l=$1
  setpci -s $DEVICE 260.l=`printf '%08x' $((i+0xa04000))`
  usleep 100000
  shift
done

Again, usleep can be replaced with a plain sleep with a non-integer argument. See above.

Example of using these scripts

# ./writeeeprom.sh 02:00.0 0006005a 00ff0081 ffff0001
# ./readeeprom.sh 02:00.0
0006005a
00ff0081
ffff0001
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff
ffffffff

When the EEPROM gets messed up

It’s more than possible that the switch becomes unreachable to the host as a result of messing up the EEPROM’s registers. For example, by changing the upstream port setting. A simple way out, if a blank EEPROM is good enough for talking with the switch, is to force the EEPROM undetected by e.g. short-circuiting the EEPROM’s SO pin (pin number 2 on AT25128) to ground with a 33 Ohm resistor or so. This prevents the data from being loaded, but the commands above will nevertheless work, so the content can be altered. Yet another “dirty, but works” solution.

Gigabit transceivers on FPGA: Selected topics


Introduction

This is a summary of a few topics that should be kept in mind when a Multi-Gigabit Transceiver (MGT) is employed in an FPGA design. It’s not a substitute for reading the relevant user guide, nor a tutorial. Rather, it’s here to point at issues that may not be obvious at first glance.

The terminology and signal names are those used with Xilinx FPGAs. The transceiver is referred to as GTX (Gigabit Transceiver), but other variants of transceivers, e.g. GTH and GTZ, are to a large extent the same components with different bandwidth capabilities.

Overview

GTXs, which are the basic building block for common interface protocols (e.g. PCIe and SATA), are becoming an increasingly popular solution for communication between FPGAs. As the GTX instance consists of a clock and parallel data interface, it’s easy to mistake it for a simple channel that moves the data to the other end in a failsafe manner. A more realistic view of the GTX is as the front end of a modem, with possible bit errors and a need to synchronize the serial-to-parallel data alignment at the receiver. Designing with the GTX also requires attention to classic communication-related topics, e.g. the use of data encoding, equalizers and scramblers.

As a result, there are a few application-dependent pieces of logic that need to be developed to support the channel:

  • The possibility of bit errors on the channel must be handled
  • The alignment from a bit stream to a parallel word must be taken care of (which bit is the LSB of the parallel word in the serial stream?)
  • If the transmitter and receiver aren’t based on a common clock, a protocol that injects and tolerates idle periods in the data stream must be used, or the clock difference will cause data underflows or overflows. Sending the data in packets is a common solution.
  • Odds are that a scrambler needs to be applied on the channel. This requires logic that creates the scrambling sequence, as well as synchronizes the receiver. The reason is that an equalizer assumes that the bit stream is uncorrelated on average. Any average correlation between bit positions is considered ISI and is “fixed”. See Wikipedia.

Having said the above, it’s not uncommon that no bit errors are ever observed on a GTX channel, even at very high rates, and possibly with no equalization enabled. This can’t be relied upon, however, as there is in fact no express guarantee for the actual error probability of the channel.

Clocking

The clocking of the GTXs is an issue in itself. Unlike the logic fabric, each GTX has a limited number of possible sources for its reference clock. It’s mandatory to ensure that the reference clock(s) are present on one of the allowed dedicated inputs. Each clock pin can function as the reference clock of up to 12 particular GTXs.

It’s also important to pay attention to the generation of the serial data clocks for each GTX from the reference clock(s). It’s not only a matter of what multiplication ratios are allowed, but also how to allocate PLL resources and their access to the required reference clocks.

QPLL vs. CPLL

Two types of PLLs are available for producing the serial data clock, typically running at several GHz: QPLLs and CPLLs.

The GTXs are organized in groups of four (“quads”). Each quad shares a single QPLL (Quad PLL), which is instantiated separately (as a GTXE2_COMMON). In addition, each GTX has a dedicated CPLL (Channel PLL), which can generate the serial clock for that GTX only.

Each GTX may select its clock source from either the (common) QPLL or its dedicated CPLL. The main difference between these is that the QPLL covers higher frequencies. High-rate applications are hence forced to use the QPLL. The downside is that all GTXs sharing the same QPLL must have the same data rate (except that each GTX may divide the QPLL’s clock by a different ratio). The CPLLs allow for greater flexibility in clock rates, as each GTX can pick its clock independently, but with a limited frequency range.

Jitter

Jitter on the reference clock(s) is the silent killer of GTX links. It’s often neglected by designers because “it works anyhow”, but jitter on the reference clock has a disastrous effect on the channel’s quality, which can be by far worse than a poor PCB layout. As both jitter and poor PCB layout (and/or cabling) contribute to the bit error rate and the channel’s instability, the PCB design is often blamed when things go bad. And indeed, playing with the termination resistors or similar black-magic actions sometimes “fix it”. This makes people believe that GTX links are extremely sensitive to every via or curve in the PCB trace, which is not the case at all. It is, on the other hand, very sensitive to the reference clock’s jitter. And with some luck, a poorly chosen reference clock can be compensated for with a very clean PCB trace.

Jitter is commonly modeled as a noise component added to the timing of the clock transition, i.e. t = kT + n (where n is the noise). Consequently, it is often specified in terms of the RMS of this noise component, or a maximal value which is crossed at a sufficiently low probability. The treatment of a GTX reference clock requires a slightly different approach; the RMS figures are not necessarily a relevant measure. In particular, clock sources with excellent RMS jitter may turn out inadequate, while other sources, with less impressive RMS figures, may work better.

Since the QPLL or CPLL locks on this reference clock, jitter on the reference clock results in jitter in the serial data clock. The prevailing effect is on the transmitter, which relies on this serial data clock; the receiver is mainly based on the clock it recovers from the incoming data stream, and is therefore less sensitive to jitter.

Some of the jitter, in particular “slow” jitter (based upon low-frequency components), is fairly harmless, as the other side’s receiver clock synchronization loop will cancel its effect by tracking the random shifts of the clock. On the other hand, very fast jitter in the reference clock may not be picked up by the QPLL/CPLL, and is hence harmless as well.

All in all, there’s a certain band of frequency components in the clock’s timing noise spectrum which remains relevant: the band that causes jitter components which are slow enough for the QPLL/CPLL to track, and hence present on the serial data clock, yet too fast for the receiver’s tracking loop to follow. The measurable expression for this selective jitter requirement is given in terms of phase noise frequency masks, or sometimes as the RMS jitter in bandwidth segments (e.g. PCIe Base spec 2.1, section 4.3.7, or Xilinx’ AR 44549). Such spectrum masks, as required for the GTX, are published by the hardware vendors. The spectral behavior of clock sources is often more difficult to predict: even when noise spectra are published in datasheets, they are commonly given only for certain scenarios, as typical figures.

8b/10b encoding

Several standardized uses of MGT channels (SATA, PCIe, DisplayPort etc.) involve a specific encoding scheme between payload bytes for transmission and the actual bit sequence on the channel. Each (8-bit) byte is mapped to a 10-bit word, based upon a rather peculiar encoding table. The purpose of this encoding is to ensure a balance between the number of 0's and 1's on the physical channel, allowing AC coupling of the electrical signal. This encoding also ensures frequent toggling between 0's and 1's, which ensures proper bit synchronization at the receiver by virtue of the clock recovery loop (“CDR”). Other things that are worth noting about this encoding:

  • As there are 1024 possible code words covering 256 possible input bytes, some of the excess code words are allocated as control characters. In particular, a control character designated K28.5 is often referred to as a “comma”, and is used for synchronization.
  • The 8b/10b encoding is not an error correction code, despite its redundancy, but it does detect some errors, when the received code word isn’t decodable. On the other hand, a single bit error may lead to a completely different decoded word, without any indication that an error occurred.

Scrambling

To put it short and concise: if an equalizer is applied, the user-supplied data stream must be random. If the data payload can’t be ensured to be random itself (which is almost always the case), a scrambler must be defined in the communication protocol, and applied in the logic design.

Applying a scrambler on the channel is a tedious task, as it requires a synchronization mechanism between the transmitter and receiver. It’s often quite tempting to skip it, as the channel will work quite well even in the absence of a scrambler, even where one is needed. In the long run, however, occasional channel errors are typically experienced.

The rest of this paragraph attempts to explain the connection between the equalizer and scrambler. It’s not the easiest piece of reading, so it’s fine to skip it, if my word on this is enough for you.

In order to understand why scrambling is probably required, it’s first necessary to understand what an equalizer does.

The problem equalizers solve is the filtering effect of the electrical media (the “channel”) through which the bit stream travels. Both cables and PCBs reduce the strength of the signal, but even worse: the attenuation depends on the frequency, and reflections occur along the metal trace. As a result, the signal doesn’t just get smaller in magnitude, but it’s also smeared over time. A perfect, sharp, step-like transition from -1200 mV to +1200 mV at the transmitter’s pins may end up as a slow and round rise from -100 mV to +100 mV. Because of this slow motion of the transitions at the receiver, the clear boundaries between the bits are broken. Each transmitted bit keeps leaving its traces way after its time period. This is called Inter-Symbol Interference (ISI): the received voltage at the sampling time for the bit at t=0 depends on the bits at t=-T, t=-2T and so on. Each bit effectively produces noise for the bits coming after it.

This is where the equalizer comes in. Its inputs are the time sample of the bit at t=0, but also a number of measured voltage samples of the bits before and after it. By making a weighted sum of these inputs, the equalizer manages, to a large extent, to cancel the Inter-Symbol Interference. In a way, it implements a reverse filter of the channel.

So how does the equalizer acquire the coefficients for each of the samples? There are different techniques for training an equalizer to work effectively against the channel’s filtering. For example, cellular phones do their training based upon a sequence of bits on each burst, which is known in advance. But when the data stream runs continuously, and the channel may change slightly over time (e.g. a cable being bent), the training has to be continuous as well. The method chosen for the equalizers in GTXs is therefore continuous.

The Decision Feedback Equalizer, for example, starts with making a decision on whether each input bit is a '0' or '1'. It then calculates the noise signal for this bit, by subtracting the expected voltage for a '0' or '1', whichever was decided upon, from the measured voltage. The algorithm then slightly alters the weighted sums in a way that removes any statistical correlation between the noise and the previous samples. This works well when the bit sequence is completely random: there is no expected correlation between the input samples, and if any such correlation exists, it’s rightfully removed. Also, the adaptation converges into a compromise that works best, on average, for all bit sequences.

But what happens if there is a certain statistical correlation between the bits in the data itself? The equalizer will specialize in reducing the ISI for the bit patterns occurring more often, possibly doing very badly on the less frequent patterns. The equalizer’s role is to compensate for the channel’s filtering effect, but instead, it adds an element of filtering of its own, based upon the common bit patterns. In particular, note that if a constant pattern runs through the channel when there’s no data for transmission (zeros, idle packets etc.), the equalizer will specialize in getting that no-data through, and mess up the actual data.

One could be led to think that the 8b/10b encoding plays a role in this context, but it doesn’t. Even though it cancels out DC on the channel, it does nothing about the correlation between the bits. For example, if the payload for transmission consists of zeros only, the encoded words on the channel will be either 1001110100 or 0110001011. The DC on the channel will remain zero, but the statistical correlation between the bits is far from zero.

So unless the data is inherently random (e.g. an encrypted stream), using an equalizer means that the data which is supplied by the application to the transmitter must be randomized.

The common solution is a scrambler: XORing the payload data with a pseudo-random sequence of bits, generated by a simple state machine. The receiver must XOR the incoming data with the same sequence in order to retrieve the payload data. The comma (K28.5) symbol is often used to synchronize both state machines.

In GTX applications, the (by far) most commonly used scrambler is the G(X)=X^16+X^5+X^4+X^3+1 LFSR, which is defined in a friendly manner in the PCIe standard (e.g. the PCI Express Base Specification, rev. 1.1, section 4.2.3 and in particular Appendix C).
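
For reference, here’s a minimal C model of this LFSR (a sketch only: the full rules, such as K characters bypassing the scrambler, COM resetting the LFSR and SKP not advancing it, are in the spec’s appendix):

#include <stdint.h>

static uint16_t lfsr = 0xffff;

/* Call upon a COM (K28.5) symbol */
void scrambler_reset(void)
{
	lfsr = 0xffff;
}

/* Scramble one data byte, LSB first: each data bit is XORed with the
   LFSR's D15 output, and the LFSR advances once per bit. The Galois
   feedback mask 0x39 implements G(X) = X^16 + X^5 + X^4 + X^3 + 1. */
uint8_t scramble_byte(uint8_t d)
{
	uint8_t out = 0;
	int i;

	for (i = 0; i < 8; i++) {
		uint16_t msb = (lfsr >> 15) & 1;

		out |= (uint8_t)((((d >> i) & 1) ^ msb) << i);
		lfsr = (uint16_t)((lfsr << 1) ^ (msb ? 0x0039 : 0));
	}
	return out;
}

The same function descrambles at the receiver, as XORing twice with the same sequence is a no-op.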

TX/RXUSRCLK and TX/RXUSRCLK2

Almost all signals between the FPGA logic fabric and the GTX are clocked with TXUSRCLK2 (for transmission) and RXUSRCLK2 (for reception). These signals are supplied by the user application logic, without any special restriction, except that the frequency must match the GTX’ data rate so as to avoid overflows or underflows. A common solution for generating this clock is therefore to drive the GTX’ RX/TXOUTCLK through a BUFG.

The logic fabric is required to supply a second clock in each direction, TXUSRCLK and RXUSRCLK (without the “2” suffix). These two clocks are the parallel data clocks in a deeper position of the GTX.

The rationale is that sometimes it’s desired to let the logic fabric work with a word width which is twice as wide as the actual word width. For example, in a high-end data rate application, the GTX’ word width may be set to 40 bits with 8b/10b, so the logic fabric would interface with the GTX through a 32-bit data vector. But because of the high rate, the clock frequency may still be too high for the logic fabric, in which case the GTX allows halving the clock, and applying the data through an 80-bit word. In this case, the logic fabric supplies the 80-bit word clocked with TXUSRCLK2, and is also required to supply a second clock, TXUSRCLK, having twice the frequency and being phase aligned with TXUSRCLK2. TXUSRCLK is for the GTX’ internal use.

A similar arrangement applies for reception.

Unless the required data clock rate is too high for the logic fabric (which is usually not the case), this dual-clock arrangement is best avoided, as it requires an MMCM or PLL to generate two phase-aligned clocks. The lower clock rate applied to the logic fabric is the only reason for going this way.
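
To put numbers on the 40/80-bit example above (assuming, say, a 10 Gb/s lane; the figures scale linearly with the line rate):

#include <stdio.h>

int main(void)
{
	const double line_rate = 10e9; /* Bits per second on the lane */
	const int internal_width = 40; /* GTX-internal word, bits */
	const int fabric_width = 80;   /* Doubled fabric-side word, bits */

	/* TXUSRCLK runs at the GTX-internal word rate; TXUSRCLK2 at
	   half of it, when the fabric-side word is doubled */
	printf("TXUSRCLK  = %.0f MHz\n", line_rate / internal_width / 1e6);
	printf("TXUSRCLK2 = %.0f MHz\n", line_rate / fabric_width / 1e6);
	return 0;
}

This prints 250 MHz and 125 MHz, respectively.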

Word alignment

On the transmitting side, the GTX receives a vector of bits, which forms a word for transmission. The width of this word is one of the parameters that are set when the GTX is instantiated, and so is whether 8b/10b encoding is applied. Either way, some format of parallel words is transmitted over the channel in a serialized manner, bit after bit. Unless explicitly required, there is nothing in this serial bitstream to indicate the words’ boundaries. Hence the receiver has no way, a priori, to recover the word alignment.

The receiver GTX’ output consists of a parallel vector of bits, typically with the same width as the transmitter’s. Unless a mechanism is employed by the user logic, the GTX has no way to recover the correct alignment. Without such alignment, the organization into parallel words arrives wrong at the receiver, possibly as complete garbage, as an incorrect alignment prevents 8b/10b decoding (if employed).

It’s up to the application logic to implement a mechanism for synchronizing the receiver’s word alignment. There are two methodologies for this: moving the alignment one bit at a time at the receiver’s side (“bit slipping”) until the data arrives properly, or transmitting a predefined pattern (a “comma”) periodically, and synchronizing the receiver when this pattern is detected.

Bit slipping is the less recommended practice, even though it’s simpler to understand. It keeps most of the responsibility in the application logic’s domain: the application logic monitors the arriving data, and issues a bit slip request when it has gathered enough errors to conclude that the alignment is out of sync.

However, most well-established GTX-based protocols use commas for alignment. This method is easier in that the GTX aligns the word automatically when a comma is detected (if the GTX is configured to do so). If injecting comma characters periodically into the data stream fits well into the protocol, this is probably the preferred solution. The comma character can also be used to synchronize other mechanisms, in particular the scrambler (if employed).

Comma detection may also have false positives, resulting from errors in the raw data channel. As these data channels usually have a very low bit error rate (BER), this possibility can be overlooked in applications where a short-term false alignment, resulting from a falsely detected comma, is acceptable. When this is not acceptable, the application logic should monitor the incoming data, and disable the GTX’ automatic comma alignment through the rxpcommaalignen and/or rxmcommaalignen inputs of the GTX.
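
As a toy software model of what the GTX’ aligner does in hardware (assuming the well-known 7-bit comma sequences, 0011111 and 1100000, which open a K28.5 code group, and with the first-received bit stored in the LSB):

#include <stdint.h>

/* Scan a 32-bit window of received bits for a comma sequence, and
   return the bit offset of the word boundary, or -1 if none found */
int find_comma_offset(uint32_t window)
{
	int i;

	for (i = 0; i <= 32 - 7; i++) {
		uint32_t bits = (window >> i) & 0x7f;

		if (bits == 0x7c || bits == 0x03) /* 0011111 or 1100000 */
			return i;
	}
	return -1;
}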

Tx buffer, to use or not to use

The Tx buffer is a small dual-clock (“asynchronous”) FIFO in the transmitter’s data path, plus some logic that makes sure it starts off half full.

The underlying problem, which the Tx buffer potentially solves, is that the serializer inside the GTX runs on a certain clock (XCLK), while the application logic is exposed to another clock (TXUSRCLK). The frequencies of these clocks must be exactly the same to prevent overflow or underflow inside the GTX. This is fairly simple to achieve. Ensuring proper timing relationships between these two clocks is, however, less trivial.

There are hence two possibilities:

  • Not requiring a timing relationship between these clocks (just the same frequency). Instead, a dual-clock FIFO interfaces between the two clock domains. This small FIFO is referred to as the “Tx buffer”. Since it’s part of the GTX’ internal logic, going this path doesn’t require any additional resources from the logic fabric.
  • Make sure that the clocks are aligned, by virtue of a state machine. This state machine is implemented in the logic fabric.

The first solution is simpler and requires fewer resources from the FPGA’s logic fabric. Its main drawback is the latency of the Tx buffer, which is typically around 30 TXUSRCLK cycles. While this delay is usually negligible from a functional point of view, it’s not possible to predict its exact magnitude. It’s therefore not possible to use the Tx buffer on several parallel lanes of data if the protocol requires a known alignment between the data in these lanes, or when an extremely low latency is required.

The second solution requires some extra logic, but there is no significant design effort: the logic that aligns the clocks is included automatically by the IP core generator on Vivado 2014.1 and later, when the “Tx/Rx buffer off” mode is chosen.

Xilinx’ GTX documentation is somewhat misleading in that it details the requirements of the state machine in painful detail: there’s no need to read through that long saga in the user guide. As a matter of fact, this logic is included automatically by the IP core generator on Vivado 2014.1, so there’s really no reason to dive into this issue. Only note that gtN_tx_fsm_reset_done_out may take a bit longer to assert after a reset (something like 1 ms on a 10 Gb/s lane).

Rx buffer

The Rx buffer (also called “Rx elastic buffer”) is also a dual-clock FIFO, which is placed in the same clock domain gap as the Tx buffer, and has the same function. Bypassing it requires the same kind of alignment mechanism in the logic fabric.

As with its Tx counterpart, bypassing the Rx buffer makes the latency short and deterministic. It’s, however, less common that such a bypass is practically justified: while a deterministic Tx latency may be required to ensure data alignment between parallel lanes in order to meet certain standard protocol requirements, there are almost always fairly easy methods to compensate for an unknown Rx latency in user logic. Either way, it’s preferable not to rely on the transmitter to meet requirements on data alignment, but to align the data, if required, by virtue of user logic.

Leftover notes

  • sysclk_in must be stable when the FPGA wakes up from configuration. A state machine that brings up the transceivers is based upon this clock. It’s referred to as the DRP clock in the wizard.
  • It’s important to declare the DRP clock’s frequency correctly, as certain required delays which are measured in nanoseconds are implemented by dwelling for a number of clocks, which is calculated from this frequency.
  • In order to transmit a comma, set txcharisk to 1 (since it’s a vector, this sets the LSB) and the 8 LSBs of the data to 0xBC, which is the code for K28.5.

 

PCIe over fiber optics notes (using SFP+)


General

As part of a larger project, I was required to set up a PCIe link between a host and some FPGAs through a fiber link, in order to ensure medical-grade electrical isolation of a high-bandwidth video data link, and to allow for control over the same link.

These are a few jots on carrying a 1x Gen2 PCI Express link over a plain SFP+ fiber optics interface. PCIe is, after all, just one GTX lane going in each direction, so it’s quite natural to carry each Gigabit Transceiver lane on an optical link.

When a general-purpose host computer is used, at least one PCIe switch is required in order to ensure that the optical link is based upon a steady, non-spread spectrum clock. If an FPGA is used as a single endpoint at the other side of the link, it can be connected directly to the SFP+ adapter, with the condition that the FPGA’s PCIe block is set to asynchronous clock mode.

Since my project involved more than one endpoint on the far end (an FPGA and USB 3.0 chip), I went for the solution of one PCIe switch on each end. Avago’s PEX 8606, to be specific.

All in all, there are two issues that really require attention:

  • Clocking: Making sure that the clocks on both sides are within the required range (and it doesn’t hurt if they’re clean from jitter)
  • Handling the receiver detect issue, detailed below

How each signal is handled

  • Tx/Rx lanes: Passed through with fiber. The differential pair is simply connected to the SFP+ respective data input and output.
  • PERST: Signaled by turning off the laser on the upstream side, and issuing PERST to everything on the downstream side on (a debounced) LOS (Loss of Signal).
  • Clock: Not required. Keep both clocks clean, and within 250 ppm.
  • PRSNT: Generated locally, if this is at all relevant
  • All other PCIe signals are not mandatory

Some insights

  • It’s as easy (or difficult) as setting up a PCIe switch on both sides. The optical link itself is not adding any particular difficulty.
  • Dual clock mode on the PCIe switches is mandatory (hence only certain devices are suitable). The isolated clock goes to a specific lane (pair?), and not all configurations are possible (e.g. not all 1x on PEX8606).
  • According to PCIe spec 4.2.6.2, the LTSSM goes to Polling if a receiver has been detected (that is, a load is sensed), but Polling returns to Detect if there is no proper training sequence received from the other end. So apparently there is no problem with a fiber optic transceiver, even though it presents itself as a false load in the absence of a link partner at the other side of the fiber: The LTSSM will just keep looping between Detect and Polling until such partner appears.
  • The SFP+ RD pins are transmitters on the PCIe wire pair, and the TD are receivers. Don’t get confused.
  • AC coupling: All lane wires must have a 100 nF capacitor in series. External connectors (e.g. PCIe fingers) must have a capacitor on the PET side (but must not have one on the ingoing signal).
  • Turn off ASPM wherever possible. Most BIOSes and many Linux kernels volunteer doing that automatically, but it’s worth making sure ASPM is never turned on in any usage scenario. A lot of errors are related to the L0s state (which is invoked by ASPM) in both switches and endpoints.

PEX 86xx notes

  • PEX_NT_RESETn is an output signal (but shouldn’t be used anyhow)
  • It seems like the PLX device cares about nothing that happened before the reset: a lousy voltage ramp-up or the absence of a clock. All is forgotten and forgiven.
  • A fairly new chipset and BIOS are required on the motherboard, say from year 2012 and on, or the switch isn’t handled properly by the host.
  • On a Gigabyte Technology Co., Ltd. G31M-ES2L motherboard (BIOS FH, 04/30/2010), the BIOS stopped the clock shortly after powering up (it gave up, probably), and that made the PEX clockless, probably, leading to completely weird behavior.
  • There’s a difference between lane numbering and port numbering (the latter is used in the function numbers of the “virtual” endpoints created with respect to each port). For example, on an 8606 running a 2x-1x-1x-1x-1x configuration, lanes 0-1, 4, 5, 6 and 7 are mapped to ports 0, 1, 5, 7 and 9 respectively. Port 4 is lane 1 in an all-1x configuration (with the other ports mapped the same).
  • The PEX doesn’t detect an SFP+ transceiver as a receiver on the respective PET lane, which prevents bringup of the fiber lane, unless the SerDes X Mask Receiver Not Detected bit is enabled in the relevant register (e.g. bit 16 at address 0x204). The lane still produces its receiver detection pattern, but ignores the fact it didn’t feel any receiver at the other end. See below.
  • In dual-clock mode, the switch works even if the main REFCLK is idle, given that the respective lane is unused (needless to say, the other clock must work).
  • Read the errata of the device before picking one. It’s available on PLX’ site, on the same page that the Data Book is downloaded from.
  • Connect an EEPROM on custom board designs, and be prepared to use it. It’s a lifesaver.

Why receiver detect is an issue

Before attempting to train a lane, the PCIe spec requires the transmitter to check if there is any receiver on the other side. The spec requires the receiver to have a single-ended impedance of 40-60 Ohm on each of the P/N wires at DC (and a differential impedance of 80-120 Ohm, but that’s not relevant here). The transmitter’s single-ended impedance isn’t specified; only the differential impedance must be 80-120 Ohm. The coupling capacitor may range between 75-200 nF, and is always on the transmitter’s side (this is relevant only when there’s a plug connection between Tx and Rx).

The transmitter performs a receiver detect by creating an upward common-mode pulse of up to 600 mV on both lane wires, and measuring the voltage on them. This pulse lasts for 100 us or so. As the time constant of 50 Ohm combined with 100 nF is 5 us, a charging capacitor’s voltage pattern is expected. Note that the common-mode impedance of the transmitter is not defined by the spec, but the transmitter’s designer knows it. Either way, if a flat pulse is observed on the lane wires, no receiver was sensed.
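
The back-of-the-envelope numbers above can be verified with a few lines of C (compile with -lm):

#include <stdio.h>
#include <math.h>

int main(void)
{
	const double R = 50.0, C = 100e-9; /* Receiver termination, coupling cap */
	const double tau = R * C;          /* The 5 us time constant */
	double t;

	printf("tau = %.1f us\n", tau * 1e6);
	for (t = 5e-6; t <= 100e-6; t *= 2)
		printf("t = %5.1f us: capacitor charged to %.1f%%\n",
		       t * 1e6, 100.0 * (1.0 - exp(-t / tau)));
	return 0;
}

So within a 100 us pulse, the capacitor charges essentially fully, and the charging curve is clearly visible on a scope.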

Now to SFP+ modules: The SFP+ specification requires a nominal 100 Ohm differential impedance on its receivers, but “does not require any common mode termination at the receiver. If common mode terminations are provided, it may reduce common mode voltage and EMI” (SFF-8431, section 3.4). Also, it requires DC-blocking capacitors on both transmitter and receiver lane wires, so there’s some extra capacitance on the PCIe-to-SFP+ direction (where the SFP+ is the PCIe receiver) which is not expected. But the latter issue is negligible compared with the possible absence of common mode termination.

As the common-mode termination on the receiver is optional, some modules may be detected by the PCIe transmitter, and some may not.

This is what one of the PCIe lane’s wires looks like when the PEX 8606 switch is set to ignore the absence of a receiver (with the SerDes X Mask Receiver Not Detected bit): it still runs the receiver detect test (the large pulse), but then goes on to link training despite no load having been detected (that’s the noisy part after the pulse). In the case shown, the training kept failing (no response from the other side), so it goes back and forth between detection and training.

Oscilloscope plot of receiver detect of PLX8606

This capture was done with a plain digital oscilloscope (~ 200 MHz bandwidth).

PCIe: Xilinx’ pipe_clock module and its timing constraints


Introduction

In several versions of Xilinx’ wrapper for the integrated PCIe block, it’s the user application logic’s duty to instantiate the module which generates the “pipe clock”. It typically looks something like this:

pcie_myblock_pipe_clock #
      (
          .PCIE_ASYNC_EN                  ( "FALSE" ),                 // PCIe async enable
          .PCIE_TXBUF_EN                  ( "FALSE" ),                 // PCIe TX buffer enable for Gen1/Gen2 only
          .PCIE_LANE                      ( LINK_CAP_MAX_LINK_WIDTH ), // PCIe number of lanes
          // synthesis translate_off
          .PCIE_LINK_SPEED                ( 2 ),
          // synthesis translate_on
          .PCIE_REFCLK_FREQ               ( PCIE_REFCLK_FREQ ),        // PCIe reference clock frequency
          .PCIE_USERCLK1_FREQ             ( PCIE_USERCLK1_FREQ ),      // PCIe user clock 1 frequency
          .PCIE_USERCLK2_FREQ             ( PCIE_USERCLK2_FREQ ),      // PCIe user clock 2 frequency
          .PCIE_DEBUG_MODE                ( 0 )
      )
      pipe_clock_i
      (

          //---------- Input -------------------------------------
          .CLK_CLK                        ( sys_clk ),
          .CLK_TXOUTCLK                   ( pipe_txoutclk_in ),     // Reference clock from lane 0
          .CLK_RXOUTCLK_IN                ( pipe_rxoutclk_in ),
          .CLK_RST_N                      ( pipe_mmcm_rst_n ),      // Allow system reset for error_recovery
          .CLK_PCLK_SEL                   ( pipe_pclk_sel_in ),
          .CLK_PCLK_SEL_SLAVE             ( pipe_pclk_sel_slave),
          .CLK_GEN3                       ( pipe_gen3_in ),

          //---------- Output ------------------------------------
          .CLK_PCLK                       ( pipe_pclk_out),
          .CLK_PCLK_SLAVE                 ( pipe_pclk_out_slave),
          .CLK_RXUSRCLK                   ( pipe_rxusrclk_out),
          .CLK_RXOUTCLK_OUT               ( pipe_rxoutclk_out),
          .CLK_DCLK                       ( pipe_dclk_out),
          .CLK_OOBCLK                     ( pipe_oobclk_out),
          .CLK_USERCLK1                   ( pipe_userclk1_out),
          .CLK_USERCLK2                   ( pipe_userclk2_out),
          .CLK_MMCM_LOCK                  ( pipe_mmcm_lock_out)

      );

Consequently, some timing constraints that are related to the PCIe block’s internal functionality aren’t added automatically by the wrapper’s own constraints, but must be given explicitly by the user of the block, typically by following an example design.

This post discusses the implications of this situation. Obviously, none of this applies to PCIe block wrappers which handle this instantiation internally.

What is the pipe clock?

For our narrow purposes, the PIPE interface is the parallel data part of the SERDES attached to the Gigabit Transceivers (MGTs), which drive the physical PCIe lanes. For example, data to a Gen1 lane, running at 2.5 GT/s, requires 2.0 Gbit/s of payload data (as it’s expanded by a 10/8 ratio by the 8b/10b encoding). If the SERDES is fed with 16 bits in parallel, a 125 MHz clock yields the correct data rate (125 MHz * 16 = 2 GHz).

By the same coin, a Gen2 interface requires a 250 MHz clock to support a payload data rate of 4.0 Gbit/s per lane (expanded into 5 GT/s with 8b/10b encoding).
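
A quick sanity check of this arithmetic in C:

#include <stdio.h>

int main(void)
{
	const double line_rate[] = { 2.5e9, 5.0e9 }; /* Gen1, Gen2 in T/s */
	const int pipe_width = 16;                   /* PIPE data width, bits */
	int i;

	/* Payload rate is the line rate times 8/10 (8b/10b); the PIPE
	   clock is that rate divided by the parallel word width */
	for (i = 0; i < 2; i++)
		printf("Gen%d: PIPE clock = %.0f MHz\n", i + 1,
		       line_rate[i] * 8 / 10 / pipe_width / 1e6);
	return 0;
}

This prints 125 MHz and 250 MHz, matching the two inputs of the clock mux described next.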

The clock mux

If a PCIe block is configured for Gen2, it’s required to support both rates: 5 GT/s, and also be able to fall back to 2.5 GT/s if the link partner doesn’t support Gen2 or if the link doesn’t work properly at the higher rate.

In the most common setting (or always?), the pipe clock is muxed between two source clocks by this piece of code (in the pipe_clock module):

    //---------- PCLK Mux ----------------------------------
    BUFGCTRL pclk_i1
    (
        //---------- Input ---------------------------------
        .CE0                        (1'd1),
        .CE1                        (1'd1),
        .I0                         (clk_125mhz),
        .I1                         (clk_250mhz),
        .IGNORE0                    (1'd0),
        .IGNORE1                    (1'd0),
        .S0                         (~pclk_sel),
        .S1                         ( pclk_sel),
        //---------- Output --------------------------------
        .O                          (pclk_1)
    );
    end

So pclk_sel, which is a registered version of the CLK_PCLK_SEL input port, is used to switch between a 125 MHz clock (pclk_sel == 0) and a 250 MHz clock (pclk_sel == 1), both clocks generated from the same MMCM_ADV block in the pipe_clock module.

The BUFGCTRL’s output, pclk_1, is assigned as the pipe clock output (CLK_PCLK). It’s also used in other ways, depending on the instantiation parameters of pipe_clock.

Constraints for Gen1 PCIe blocks

If a PCIe block is configured for Gen1 only, there’s no question about the pipe clock’s frequency: It’s 125 MHz. As a matter of fact, if the PCIE_LINK_SPEED instantiation parameter is set to 1, one gets (by virtue of Verilog’s generate commands)

    BUFG pclk_i1
    (
        //---------- Input ---------------------------------
        .I                          (clk_125mhz),
        //---------- Output --------------------------------
        .O                          (clk_125mhz_buf)
    );
    assign pclk_1 = clk_125mhz_buf;

But never mind this — it’s never used: Even when the block is configured as Gen1 only, PCIE_LINK_SPEED is set to 3 in the example design’s instantiation, and we all copy from it.

Instead, the clock mux is used and fed with pclk_sel=0. The constraints reflect this with the following lines appearing in the example design’s XDC file for Gen1 PCIe blocks (only!):

set_case_analysis 1 [get_pins {pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/S0}]
set_case_analysis 0 [get_pins {pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/S1}]
set_property DONT_TOUCH true [get_cells -of [get_nets -of [get_pins {pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/S0}]]]

The first two commands tell the timing analysis tools to assume that the clock mux’ inputs are S0=1 and S1=0, and hence that the mux forwards the 125 MHz clock (connected to I0).

The DONT_TOUCH constraint works around a bug in early Vivado revisions, as explained in AR #62296: the S0 input is assigned ~pclk_sel, which requires a logic inverter. This inverter was optimized into the BUFGCTRL primitive by the synthesizer, flipping the meaning of the first set_case_analysis constraint. This caused the timing tools to analyze the design as if both S0 and S1 were set to zero, hence no clock output, and no constraining of the relevant paths.

The problem with this set of constraints is their cryptic nature: it’s not clear at all why they are there, just by reading the XDC file. If the user of the PCIe block decides, for example, to change from an 8x Gen1 configuration to 4x Gen2, everything will appear to work nicely, since all clocks except the pipe clock remain the same. It takes some initiative and effort to figure out that these constraints are incorrect for a Gen2 block.

To make things even worse, almost all relevant paths will meet the 250 MHz (4 ns) requirement even when constrained for 125 MHz on a sparsely filled FPGA, simply because there’s little logic along these paths. So odds are that everything will work fine during the initial tests (before the useful logic is added to the design), and later on the PCIe interface may become shaky throughout the design process, as some paths accidentally exceed the 4 ns limit.

Dropping the set_case_analysis constraints

As these constraints are relaxing by their nature, what happens if they are dropped? One could expect that the tools would work a bit harder to ensure that all relevant paths meet timing with either 125 MHz or 250 MHz; simply put, that the constraining would occur as if pclk_1 were always driven with a 250 MHz clock.

But this isn’t how timing calculations are made. The tools can’t just pick the faster clock from a clock mux and follow through, since the logic driven by the clock might interact with other clock domains. If so, a slower clock might require stricter timing due to different relations between the source and target clock’s frequencies.

So what actually happens is that the timing tools mark all logic driven by the pipe clock as having multiple clocks: The timing of each path going to and from any such logic element is calculated for each of the two clocks. Even the timing for paths going between logic elements that are both driven by the pipe clock are calculated four times, covering the four combinations of the 125 MHz and 250 MHz clocks, as source and destination clocks.

From a practical point of view, this is rather harmless, since both clocks come from the same MMCM_ADV, and are hence aligned. These excessive timing calculations always end up equivalent to those for the 250 MHz clock alone (possibly with some clock skew uncertainty added for paths going between the two clocks). Since timing is met easily on these paths, this extra work adds very little to the implementation effort (and how long it takes to finish).

On the other hand, this adds some dirt to the timing report. First, the multiple clocks are reported (excerpt from the Timing Report):

7. checking multiple_clock
--------------------------
 There are 2598 register/latch pins with multiple clocks. (HIGH)

Later on, the paths between logic driven by the pipe clock are counted as inter clock paths: Once from 125 MHz to 250 MHz, and vice versa. This adds up to a large number of bogus inter clock paths:

------------------------------------------------------------------------------------------------
| Inter Clock Table
| -----------------
------------------------------------------------------------------------------------------------

From Clock    To Clock          WNS(ns)      TNS(ns)  TNS Failing Endpoints  TNS Total Endpoints      WHS(ns)      THS(ns)  THS Failing Endpoints  THS Total Endpoints
----------    --------          -------      -------  ---------------------  -------------------      -------      -------  ---------------------  -------------------
clk_250mhz    clk_125mhz          0.114        0.000                      0                 5781        0.053        0.000                      0                 5781
clk_125mhz    clk_250mhz          0.114        0.000                      0                 5764        0.053        0.000                      0                 5764

Since a single endpoint might produce many paths (e.g. a block RAM), the number of endpoints need not correlate with the number of paths. However, the similarity between the figures in the two directions seems to indicate that the vast majority of these paths are bogus.

So dropping the set_case_analysis constraints boils down to some noise in the timing report. I can think of two ways to eliminate it:

  • Issue set_case_analysis constraints setting S0=0, S1=1, so the tools assume a 250 MHz clock. This covers the Gen2 case as well as Gen1.
  • Use the constraints of the example design for a Gen2 block (shown below).

Even though both ways (in particular the second) seem OK to me, I prefer taking the dirt in the timing report over adding constraints without understanding their full implications. Being more restrictive never hurts (as long as the design meets timing).

Constraints for Gen2 PCIe blocks

If a PCIe block is configured for Gen2, it has to be able to work as Gen1 as well. So the set_case_analysis constraints are out of the question.

Instead, this is what one gets in the example design:

create_generated_clock -name clk_125mhz_x0y0 [get_pins pcie_myblock_support_i/pipe_clock_i/mmcm_i/CLKOUT0]
create_generated_clock -name clk_250mhz_x0y0 [get_pins pcie_myblock_support_i/pipe_clock_i/mmcm_i/CLKOUT1]
create_generated_clock -name clk_125mhz_mux_x0y0 \
                        -source [get_pins pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/I0] \
                        -divide_by 1 \
                        [get_pins pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/O]
#
create_generated_clock -name clk_250mhz_mux_x0y0 \
                        -source [get_pins pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/I1] \
                        -divide_by 1 -add -master_clock [get_clocks -of [get_pins pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/I1]] \
                        [get_pins pcie_myblock_support_i/pipe_clock_i/pclk_i1_bufgctrl.pclk_i1/O]
#
set_clock_groups -name pcieclkmux -physically_exclusive -group clk_125mhz_mux_x0y0 -group clk_250mhz_mux_x0y0

This may seem tangled, but it says something quite simple: The 125 MHz and 250 MHz clocks are physically exclusive (see AR #58961 for an elaboration on this). In other words, these constraints declare that no path exists between logic driven by one clock and logic driven by the other. If such a path is found, it’s bogus.

So this drops all the bogus paths mentioned above. Each path between logic driven by the pipe clock is now calculated twice (for 125 MHz and 250 MHz, but not across the clocks). This seems to yield the same practical results as without these constraints, but without complaints about multiple clocks, and of course no inter-clock paths.

Both clocks are still related to the pipe clock however. For example, checking a register driven by the pipe clock yields (Tcl session):

get_clocks -of_objects [get_pins -hier -filter {name=~*/pipe_clock_i/pclk_sel_reg1_reg[0]/C}]
clk_250mhz_mux_x0y0 clk_125mhz_mux_x0y0

Not surprisingly, this register is attached to two clocks. The multiple clock complaint disappeared thanks to the set_clock_groups constraint (even the weaker “asynchronous” flag is enough for this purpose).

So can these constraints be used for a Gen1-only block, as a safer alternative to the set_case_analysis constraints? It seems so. Is it a good bargain for getting rid of those extra notes in the timing report? It’s a matter of personal choice. Or of knowing for sure.

Bonus: Meaning of some instantiation parameters of pipe_clock

This is the meaning, according to a dissection of Kintex-7’s pipe_clock Verilog file. It’s probably the same for other targets.

PCIE_REFCLK_FREQ: The frequency of the reference clock

  • 1 => 125 MHz
  • 2 => 250 MHz
  • Otherwise: 100 MHz

CLKFBOUT_MULT_F is set so that the MMCM_ADV’s internal VCO always runs at 1 GHz (e.g. multiply by 10 for a 100 MHz reference clock). Hence the constant CLKOUT0_DIVIDE_F = 8 makes clk_125mhz run at 125 MHz (dividing by 8), and CLKOUT1_DIVIDE = 4 makes clk_250mhz run at 250 MHz (dividing by 4).

PCIE_USERCLK1_FREQ: The frequency of the module’s CLK_USERCLK1 output, which is, among others, the clock of the user interface (a.k.a. user_clk_out or axi_clk)

  • 1 => 31.25 MHz
  • 2 => 62.5 MHz
  • 3 => 125 MHz
  • 4 => 250 MHz
  • 5 => 500 MHz
  • Otherwise: 62.5 MHz

PCIE_USERCLK2_FREQ: The frequency of the module’s CLK_USERCLK2 output. Not used in most applications. Same frequency mapping as PCIE_USERCLK1_FREQ.
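
For quick reference, the same frequency encoding can be restated as a small C helper. This is just the tables above in code form; the function name is made up:

/* Map the PCIE_USERCLK1_FREQ / PCIE_USERCLK2_FREQ encoding to the
   output frequency in kHz. The function name is made up. */
static unsigned int userclk_freq_khz(int param)
{
  switch (param) {
  case 1: return 31250;  /* 31.25 MHz */
  case 2: return 62500;  /* 62.5 MHz */
  case 3: return 125000; /* 125 MHz */
  case 4: return 250000; /* 250 MHz */
  case 5: return 500000; /* 500 MHz */
  default: return 62500; /* anything else: 62.5 MHz */
  }
}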


PCIe on Cyclone 10 GX: Data loss on DMA writes by FPGA


TL;DR

DMA writes from a Cyclone 10 GX PCIe interface may be lost, probably due to a path that isn’t timed properly by the fitter. This has been observed with Quartus Prime Version 17.1.0 Build 240 SJ Pro Edition, and the official Cyclone 10 GX development board. A wider impact is likely, possibly on Arria 10 devices as well (as their PCIe block is the same one).

The problem seems to be rare, and appears and disappears depending on how the fitter places the logic. It’s however fairly easy to diagnose if this specific problem is in effect (see “The smoking gun” below).

Computer hardware: Gigabyte GA-B150M-D2V motherboard (with an Intel B150 Chipset) + Intel i5-6400 CPU.

The story

It started with a routine data transport test (FPGA to host), which failed virtually immediately (that is, after a few kilobytes). It was apparent that some portions of data simply weren’t written into the DMA buffer by the FPGA.

So I tried a fix in my own code, and yep, it helped. Or so I thought. Actually, anything I changed seemed to fix the problem. In the end, I changed nothing, but just added

set_global_assignment -name SEED 2

to the QSF file. This only changes the fitter’s initial placement of the logic elements, which eventually leads to an alternative placement and routing of the design. That should work exactly the same, of course. But it “solved the problem”.

This was consistent: One “magic” build that failed consistently, and any change whatsoever made the issue disappear.

The design was properly constrained, of course, as shown in the development board’s sample SDC file. In fact, there isn’t much to constrain: It’s just setting the main clock to 100 MHz, derive_pll_clocks and derive_clock_uncertainty. And a false path from the PERST pin.

So maybe my bad? Well, no. There were no unconstrained paths in the entire design (with these simple constraints), so any fitting of the design should work exactly like any other. Maybe my application logic? No again:

The smoking gun

The final nail in the coffin was when I noted errors in the PCIe Device Status registers on both sides. I’ve discussed this topic in this and this other post of mine; however, in the current case no AER kernel messages were produced (unfortunately, and it’s not clear why).

And whatever the application code does, Intel / Altera’s PCIe block shouldn’t produce a link error, and normally it doesn’t. Producing one is a violation of the PCIe spec.

These are the steps for observing this issue on a Linux machine. First, find out who the link partners are:

$ lspci
00:00.0 Host bridge: Intel Corporation Device 191f (rev 07)
00:01.0 PCI bridge: Intel Corporation Device 1901 (rev 07)
[ ... ]
01:00.0 Unassigned class [ff00]: Altera Corporation Device ebeb

and then figure out that the FPGA card is connected via the bridge at 00:01.0 with

$ lspci -t
-[0000:00]-+-00.0
           +-01.0-[01]----00.0

So it’s between 00:01.0 and 01:00.0. Then, following that post of mine, use setpci to read the status register, which tells whether an error has occurred.

First, what it should look like: With any bitstream except that specific faulty one, I got

# setpci -s 01:00.0 CAP_EXP+0xa.w
0000
# setpci -s 00:01.0 CAP_EXP+0xa.w
0000

any time and all the time, which says the obvious: No errors sensed on either side.

But with the bitstream that had data losses, before any communication had taken place (except for the driver being loaded):

# setpci -s 01:00.0 CAP_EXP+0xa.w
0009
# setpci -s 00:01.0 CAP_EXP+0xa.w
0000

Non-zero means error. So at this stage the FPGA’s PCIe interface was unhappy with something (more on that below), but the processor’s side had no complaints.

I have to admit that I’ve seen the 0009 status in a lot of other tests, in which communication went through perfectly. So even though it reflects some kind of error, it doesn’t necessarily predict any functional fault. As elaborated below, the 0009 status consists of correctable errors. It’s just that such errors are normally never seen (i.e. with any PCIe card that works properly).
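
By the way, these status words decode easily, as each flag is a single bit of the PCIe Device Status register. Here’s a minimal C sketch of the decoding, reproducing the relevant part of lspci’s DevSta line (the function name is made up):

#include <stdio.h>

/* Decode the low bits of the PCIe Device Status register, as read
   with setpci above. The function name is made up. */
void decode_devsta(unsigned int sta)
{
  printf("CorrErr%c UncorrErr%c FatalErr%c UnsuppReq%c\n",
         (sta & 0x01) ? '+' : '-',  /* Correctable Error Detected */
         (sta & 0x02) ? '+' : '-',  /* Non-Fatal Error Detected */
         (sta & 0x04) ? '+' : '-',  /* Fatal Error Detected */
         (sta & 0x08) ? '+' : '-'); /* Unsupported Request Detected */
}

Hence 0x0009 decodes to CorrErr+ UnsuppReq+, and 0x000a (which shows up further down) to UncorrErr+ UnsuppReq+.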

Anyhow, back to the bitstream that did have data errors. After some data had been written by the FPGA:

# setpci -s 01:00.0 CAP_EXP+0xa.w
0009
root@diskless:/home/eli# setpci -s 00:01.0 CAP_EXP+0xa.w
000a

In this case, the FPGA card’s link partner complained. To save ourselves looking up the meaning of these numbers (even though they’re listed in that post), use lspci -vv:

# lspci -vv
00:01.0 PCI bridge: Intel Corporation Device 1901 (rev 07) (prog-if 00 [Normal decode])
[ ... ]
        Capabilities: [a0] Express (v2) Root Port (Slot+), MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0
                        ExtTag- RBE+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
                        MaxPayload 256 bytes, MaxReadReq 128 bytes
                DevSta: CorrErr- UncorrErr+ FatalErr- UnsuppReq+ AuxPwr- TransPend-
[ ... ]

So the bridge complained about an uncorrectable error and an unsupported request only after the data transmission, but the FPGA side:

01:00.0 Unassigned class [ff00]: Altera Corporation Device ebeb
[ ... ]
        Capabilities: [80] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-

complained about a correctable error and an unsupported request (as seen above, that happened before any payload transmission).

Low-level errors. I couldn’t make this happen even if I wanted to.

Aftermath

The really bad news is that this problem isn’t in the logic itself, but in how it’s placed. It seems to be a rare and random occurrence of a poor job done by the fitter. Or maybe it’s not all that rare, if you let the FPGA heat up a bit. In my case a spinning fan kept an almost idle FPGA quite cool, I suppose.

The somewhat good news is that the data loss comes with these PCIe status errors, and maybe with the relevant kernel messages (not clear why I didn’t see any). So there’s something to hold on to.

And I should also mention that the offending PCIe interface was a Gen2 x4 running with a 64-bit interface at 250 MHz, which is a rather marginal frequency for Arria 10 / Cyclone 10. There’s also no slack in that choice: 4 lanes at 5 GT/s with 8b/10b encoding carry 16 Gb/s, which is exactly 64 bits per clock at 250 MHz. So going with the speculation that this is a timing issue that isn’t handled properly by the fitter, maybe sticking to 125 MHz interfaces on these devices is good enough to be safe against this issue.

Note to self: The outputs are kept in cyclone10-failure.tar.gz

Nvidia graphics cards on Linux: PCIe link speed and width


Why is it at 2.5 GT/s???

With all said about Nvidia’s refusal to release their drivers as open source, their Linux support is great. I don’t think I’ve ever had such a flawless graphics card experience with Linux. After replacing the nouveau driver with Nvidia’s, of course. Ideology is nice, but a computer that works is nicer.

But then I looked at the output of lspci -vv (on an Asus fanless GT 730 2GB DDR3), and horrors, it’s not running at full PCIe speed!

17:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 730] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GK208B [GeForce GT 730]
[ ... ]
        Capabilities: [78] Express (v2) Legacy Endpoint, MSI 00
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 <64us
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
[ ... ]
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

Whatwhat? The card declares it supports 5 GT/s, but runs only at 2.5 GT/s? And on my brand new super-duper motherboard, which supports Gen3 PCIe connected directly to an Intel X-family CPU?

It’s all under control

Well, the answer is surprisingly simple: Nvidia’s driver changes the card’s PCIe speed dynamically to support the bandwidth needed. When there’s no graphics activity, the speed drops to 2.5 GT/s.

This behavior can be controlled with Nvidia’s X Server Settings control panel (it has an icon in the system’s setting panel, or just type “Nvidia” on Gnome’s start menu). Under the PowerMizer sub-menu, the card’s behavior can be changed to stay at 5 GT/s if you like your card hot and electricity bill fat.

Otherwise, in “Adaptive mode” it switches back and forth between 2.5 GT/s and 5 GT/s. The screenshot below was taken after a few seconds of idling (click to enlarge):

Screenshot of Nvidia X Server settings in adaptive mode

And this is how to force it to 5 GT/s constantly (click to enlarge):

Screenshot of Nvidia X Server settings in maximum performance mode

With the latter setting, lspci -vv shows that the card is at 5 GT/s, as promised:

17:00.0 VGA compatible controller: NVIDIA Corporation GK208 [GeForce GT 730] (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GK208B [GeForce GT 730]
[ ... ]
                LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <512ns, L1 <4us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

So don’t worry about a low speed on an Nvidia card (or make sure it steps up on request).
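
By the way, if a script is preferred over eyeballing lspci’s output, the speed and width are plain bit fields, at the same positions in both the Link Capabilities register (setpci’s CAP_EXP+0xc.l) and the Link Status register (CAP_EXP+0x12.w). Recent kernels also expose them as current_link_speed and current_link_width under /sys/bus/pci/devices/. A minimal C sketch of the field decoding (the function names are made up):

#include <stdio.h>

static const char *speed_str(unsigned int code)
{
  switch (code) {
  case 1: return "2.5GT/s";
  case 2: return "5GT/s";
  case 3: return "8GT/s";
  default: return "unknown";
  }
}

/* Decode a Link Capabilities or Link Status register value */
void decode_link(unsigned int reg)
{
  printf("Speed %s, Width x%u\n",
         speed_str(reg & 0x0f),  /* bits 3:0: link speed code */
         (reg >> 4) & 0x3f);     /* bits 9:4: link width */
}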

A word on GT 1030

I added another fanless card, Asus GT 1030 2GB, to the computer for some experiments. This card is somewhat harder to catch at 2.5 GT/s, because it steps up very quickly in response to any graphics event. But I managed to catch this:

65:00.0 VGA compatible controller: NVIDIA Corporation GP108 (rev a1) (prog-if 00 [VGA controller])
        Subsystem: ASUSTeK Computer Inc. GP108 [GeForce GT 1030]
[ ... ]
                LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <512ns, L1 <16us
                        ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                        ExtSynch- ClockPM+ AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 2.5GT/s, Width x4, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

The running 2.5 GT/s speed vs. the maximal 8 GT/s is pretty clear by now, but the declared maximal Width is 4x? If so, why does it have a 16x PCIe form factor? The GT 730 has an 8x form factor, and uses 8x lanes, but GT 1030 has 16x and declares it can only use 4x? Is this some kind of marketing thing to make the card look larger and stronger?

On the other hand, show me a fairly recent motherboard without a 16x PCIe slot. The thing is that sometimes that slot can be used for something else, and the graphics card could then have gone into a vacant 4x slot instead. But no. Let’s make it big and impressive with a long PCIe plug that makes it look massive. Personally, I find the gigantic heatsink impressive enough.

An FPGA-based PCI Express peripheral for Windows: It’s easy


To make a long story short…

There is really no need to work hard to make your FPGA talk with a PC.  Xillybus gives you the end-to-end connection interfacing FPGAs with both Linux and Windows computers.

The challenge

At times, FPGA engineers are faced with the need to transfer data to or from a regular PC. Sometimes it’s the purpose of the project (e.g. data acquisition, frame grabbing, firmware loading etc.). Others need data transport with a PC for testing their HDL design on silicon, or just to run whatever automated tests or calibration processes the project involves. One way or another, the lowest-level piece of logic (the FPGA) now needs to talk with the highest-level form of logic (a user space application running in protected memory mode within a full-blown operating system). That’s quite a gap to bridge.

Since I published a short guide about the basics of PCI Express, I keep receiving questions implying that some FPGA engineers don’t grasp what’s ahead of them when they start such a project. Even for low-bandwidth assignments (where no DMA is necessary) there’s quite some way to go before having something that works in a stable manner. While Linux offers a fairly straightforward API for writing device drivers, developing any kind of driver for Windows is much of a project in itself. And even though my heart is with Linux, it’s pretty clear that “PC” and “Windows” are synonyms to most people today.

Non-PCIe solutions

There are two common approaches today for making a Windows PC and an FPGA talk:

  • Using Cypress’ EZ-USB chip, which supplies a generic data interface for USB communication. Windows drivers are available from Cypress, but interfacing the chip with the FPGA requires a substantial piece of logic, as well as some 8051 firmware hacking. From my own experience and others’, those chips have some “personality” once they’re put on a real design’s board. So all in all, this is not a rose garden, and yet for many years this was considered the only sane solution.
  • Connecting the FPGA to the PC’s network plug through an Ethernet chip, typically a Marvell Alaska transceiver. This solution is attractive in particular when data goes from the FPGA to the PC, since raw MAC packets are quite easy to assemble. The main problem with this solution is that even though it usually works fine, it’s just because the hardware and software components are more reliable than required, as detailed in another post of mine.

Painless PCIe

As mentioned above, the good news is that Xillybus makes it amazingly easy: On the computer, open a plain file in a user space application, using whatever programming language you want. No creepy API involved. Just read and write to a file, and put a plain FIFO inside the FPGA to hold the data.
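
To show what this means in practice, here’s a minimal sketch of the host side in C. The device file’s name follows the convention of Xillybus’ demo bundles; /dev/xillybus_read_32 is an assumption, so substitute the stream that your own configuration creates:

#include <stdio.h>

int main(void)
{
  unsigned char buf[1024];
  size_t n;
  FILE *f = fopen("/dev/xillybus_read_32", "rb"); /* name is an assumption */

  if (!f) {
    perror("Failed to open device file");
    return 1;
  }

  /* Each fread() returns data that the FPGA pushed into its FIFO */
  while ((n = fread(buf, 1, sizeof(buf), f)) > 0)
    fwrite(buf, 1, n, stdout); /* process the data; here, just dump it */

  fclose(f);
  return 0;
}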

Xillybus supports a variety of Xilinx and Altera FPGAs, regardless of the host’s operating system: All Spartan-6, Virtex-5 and Virtex-6 devices with a “T” suffix (those having a built-in PCIe hardware core). As for Altera, all devices having the hard IP form of the PCI Compiler for PCI Express.

Intel FPGA’s Stratix 10: My impressions and notes


Introduction

These are a few random things I wrote down as I worked with the Stratix 10 Development Kit, with focus on its PCIe interface. Quite obviously, it’s mostly about things I found noteworthy about this specific FPGA and its board, compared with previous hardware I’ve encountered.

Generally speaking, Stratix 10 is not for the faint-hearted: It has quite a few special issues that require attention when designing with it (some detailed below), and it’s clearly designed with the assumption that if you’re working with this king-sized beast, you’re most likely part of some high-end project, being far from a novice in the FPGA field.

Some National Geographic

Even though I discuss the development kit further below, I’ll start with a couple of images of the board’s front and back. This 200W piece of logic has a liquid cooler and an exceptionally noisy fan, neither of which is shown in the official Intel images I’ve seen. In other words, it’s not as innocent as it may appear from the official pics.

There are no earplugs in the kit itself, so it’s recommended to buy something of that sort along with it. One could only wish for a temperature controlled fan; measuring the temperature of the liquid would probably have done the job, and provided some silence when the device isn’t working hard.

So here’s what the board looks like out of the box (in particular DIP switches in the default positions). Click images to enlarge.

Front side of Stratix 10 Development Kit

Back side of Stratix 10 Development Kit

“Hyperflex”

The logic on the Stratix 10 FPGAs has been given this rather promising name, implying that there’s something groundbreaking about it. However, synthesizing a real-life design for Stratix 10, I experienced no advantage over Cyclone 10: All of the hyper-something phases got their moment of glory during the project implementation (Quartus Pro 19.2), but frankly speaking, when the design got even slightly heavy (5% of the FPGA resources, but still a 256-bit wide bus everywhere on a 250 MHz clock), timing failed exactly as it would on a Cyclone 10.

Comparing with Xilinx, it feels a bit like Kintex-7 (mainline speed grade -2), in terms of the logic’s timing performance. Maybe if the logic design is tuned to fit the architecture, there’s a difference.

Assuming that this Hyperflex thing is more than just a marketing buzz, I imagine that the features of this architecture are taken advantage of in Intel’s own IP cores for certain tasks (with extensive pipelining?). Just don’t expect anything hyper to happen when implementing your own plain design.

PCIe, Transceivers and Tiles

It’s quite common to use the term “tiles” in the FPGA industry to describe sections on the silicon die that belong to a certain functionality. However the PCIe + transceiver tiles on a Stratix 10 are separate silicon dies on the package substrate, connected to the main logic fabric (“HyperFlex”) through Intel’s Embedded Multi-die Interconnect Bridge (EMIB) interface. Not that it really matters, but anyhow.

H, L and E tiles provide Gigabit transceivers. H and L tiles come with exactly one PCIe hard IP each, E-tiles with 100G Ethernet. There might be one or more of these tiles on a Stratix 10 device. It seems like the L tile will vanish with time, as it has weaker performance in almost all parameters.

All tiles have 24 Gigabit transceivers. Those not used by the hard IP are free for general-purpose use, even though some might become unusable, subject to certain rules (given in the relevant user guides).

And here comes the hard nut: PCIe has a minimal data interface of 256 bits with the application logic. The other possibility is 512 bits. This can be a significant burden when porting a design from earlier FPGA families, in particular if it was based upon a narrower data interface.
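
These widths follow from simple arithmetic. A back-of-the-envelope check in C (my own calculation, for illustration only): a Gen3 x16 link carries about 126 Gb/s after 128b/130b encoding, which requires roughly 504 bits per clock at 250 MHz, hence the 512-bit interface:

#include <stdio.h>

int main(void)
{
  /* Gen3 x16: 8 GT/s per lane, 128b/130b encoding */
  double gbps = 16 * 8.0 * 128.0 / 130.0; /* ~126 Gb/s of raw bandwidth */
  double fclk = 250e6;                    /* application clock frequency */

  printf("Gen3 x16: %.1f Gb/s => %.0f bits per clock at 250 MHz\n",
         gbps, gbps * 1e9 / fclk);        /* ~504 bits, hence 512 */
  return 0;
}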

Xillybus supports the Stratix 10 device family, however.

PCIe unsupported request error

Quite interestingly, there were correctable (and hence practically harmless) errors on the PCIe link consistently when booting a PC with the official development kit, with a production grade (i.e. not ES) H-tile FPGA. This is what plain lspci -vv gave me, even before the application logic got a chance to do anything:

01:00.0 Unassigned class [ff00]: Altera Corporation Device ebeb (rev 01)
        Subsystem: Altera Corporation Device ebeb
        Control: I/O- Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Interrupt: pin A routed to IRQ 16
        Region 0: Memory at d0100000 (64-bit, prefetchable) [size=256]
        Capabilities: [40] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit-
                Address: 00000000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [70] Express (v2) Endpoint, MSI 00
                DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <64ns, L1 <1us
                        ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset+
                DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
                        RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
                LnkCap: Port #1, Speed 5GT/s, Width x16, ASPM not supported, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot-
                LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                LnkSta: Speed 5GT/s, Width x16, TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, LTR-, OBFF Not Supported
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
                LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance De-emphasis: -6dB
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
                         EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
        Capabilities: [100 v2] Advanced Error Reporting
                UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
                AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-

As the DevSta and CESta lines above show, Unsupported Request correctable errors took place on the link. Even though this is harmless, it’s nevertheless nothing that should happen on a properly working PCIe link.

Note that I ran the PCIe link on Gen2 only, even though it supports Gen3. Not that it should matter.

Reset release IP

According to Intel’s Configuration Guide for Stratix 10 for Quartus Design Suite 19.2, one can’t rely on the device’s consistent wakeup, but the nINIT_DONE signal must be used to reset all logic:

“The entire device does not enter user mode simultaneously. Intel requires you to include the Intel Stratix 10 Reset Release IP on page 22 to hold your application logic in the reset state until the entire FPGA fabric is in user mode. Failure to include this IP in your design may result in intermittent application logic failures.”

Note that nINIT_DONE is asserted (low) when it’s fine to run the logic, so it’s effectively an active HIGH reset: the application logic should be held in reset as long as this signal is high. It’s so easy to get confused, as the “n” prefix triggers the “active low reset” part of an FPGA designer’s brain.

Failing to have the Reset Release IP included in the project results in the following critical warning during synthesis (Quartus Pro 19.2):

Critical Warning (20615): Use the Reset Release IP in Intel Stratix 10 designs to ensure a successful configuration. For more information about the Reset Release IP, refer to the Intel Stratix 10 Configuration User Guide.

The IP just exposes the nINIT_DONE signal as an output and has no parameters. It boils down to the following:

wire ninit_done;
altera_s10_user_rst_clkgate init_reset(.ninit_done(ninit_done));

One could instantiate this directly, but it’s not clear whether this is forward compatible with future Quartus versions, and it won’t silence the critical warning.

However, Quartus Pro 18.0 doesn’t issue any warning if the Reset Release IP is missing, and neither is this issue mentioned in the related configuration guide. Actually, the required IP isn’t available in Quartus Pro 18.0 at all. This issue obviously evolved with time.

Variable core voltage (SmartVID)

Another ramp-up in the usage complexity is the core voltage supply. The good old practice is to set the power supply to whatever voltage the datasheet requires, but no, Stratix 10 FPGAs need to control the power supply, in order to achieve the exact voltage that is required for each specific device. So there’s now a Power Management User Guide to tackle this issue.

This has a reason: As the transistors get smaller, the impact of process tolerances grows larger. To compensate for these tolerances, and not take a hit on the timing performance, each device has its own ideal core voltage. So if you’ve gone as far as using a Stratix 10 FPGA, what’s connecting a few I2C wires to the power supply, letting it pick its favorite voltage?

The impact on the FPGA design is the need to inform the tools which pins to use for this purpose, what I2C address to use, which power supply to expect on the other end, and other parameters. This takes the form of a few extra lines, as shown below for the Stratix 10 Development Kit:

set_global_assignment -name USE_PWRMGT_SCL SDM_IO14
set_global_assignment -name USE_PWRMGT_SDA SDM_IO11
set_global_assignment -name VID_OPERATION_MODE "PMBUS MASTER"
set_global_assignment -name PWRMGT_BUS_SPEED_MODE "400 KHZ"
set_global_assignment -name PWRMGT_SLAVE_DEVICE_TYPE LTM4677
set_global_assignment -name PWRMGT_SLAVE_DEVICE0_ADDRESS 4F
set_global_assignment -name PWRMGT_SLAVE_DEVICE1_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE2_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE3_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE4_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE5_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE6_ADDRESS 00
set_global_assignment -name PWRMGT_SLAVE_DEVICE7_ADDRESS 00
set_global_assignment -name PWRMGT_PAGE_COMMAND_ENABLE ON
set_global_assignment -name PWRMGT_VOLTAGE_OUTPUT_FORMAT "AUTO DISCOVERY"
set_global_assignment -name PWRMGT_TRANSLATED_VOLTAGE_VALUE_UNIT VOLTS

It’s among the things that are easy when they work, but when designing your own board and something goes wrong with the I2C bus, well, well.

“Self service licensing”

The Stratix 10 Development Kit includes a one-year license for Quartus Pro, which is activated on Intel’s website. It’s recommended to start this process as soon as possible, as it has a potential of getting tangled and time consuming. In particular, be sure to know which email address was reported to Intel along with the purchase of the Kit, and that you have a fully verified account for that email address on Intel’s website.

That’s because the self-service licensing procedure is possible only from the Intel account that is registered with a specific email address. This email address is the one that the distributor reported when forwarding the order for the development kit to Intel. In my case, they used an address they had on record from a previous purchase I made from the same distributor, and it didn’t even cross my mind to try it.

Be sure to fill in the detailed registration form and to confirm the email address. Access to the licensing area is denied otherwise. It continues to be denied for a few days after filling in the details. Probably a matter of validation by a human.

The serial number that needs to be fed in (or does it? see below) is the one that appears virtually everywhere (on the PCB itself, on the package, on the outer box with which the package arrived), and has the form of e.g. 10SHTPCIe0001234. However the instructions said it should be “printed on the side of the development kit box below the bottom bar code”. Well, there is nothing printed under the bottom bar code. It’s not so difficult to find it, as it says “serial number”, but when the registration fails, this misleading direction adds a level of confusion.

Since the serial number is so out in the open, it’s quite clear why another form of authentication is needed. Too bad that the email issue wasn’t mentioned in the instructions.

In my case, there was no need to feed any serial number. Once the Intel account was validated (a few days after filling in the registration details), the license simply appeared on the self-service licensing page. As I contacted Intel’s licensing support twice throughout the process, it’s possible that someone at Intel’s support took care of pairing the serial number  with my account.

Development kit’s power supplies

I put this section last because, frankly speaking, reading it is quite pointless. The bottom line is simple, exactly like the user guide says: If you use the board stand-alone, use the power supply that came along with it. If the board goes into the PCIe slot, connect both J26 and J27 to the computer’s ATX power supply, or the board will not power up.

J27 is a plain PCIe power connector (6 pins), and J26 is an 8-pin power connector. On my plain ATX power supply there was a PCIe power connector with a pair of extra pins attached with a cable tie (8 pins total). It fit nicely into J26, it worked, no smoke came out, so I guess that’s the way it should be done. See pictures below, click to enlarge.

ATX power supply connected to Stratix 10 Development Kit, front side

ATX power supply connected to Stratix 10 Development Kit, back side

Now to the part you can safely skip:

As the board is rated at 240 W and may draw up to 20 A from its internal +12V power supply, it might be interesting to understand how the power load is distributed among the different sources. However, the gory details have little practical importance, as the board won’t power up when plugged in as a PCIe card unless power is applied to both J26 and J27 (the power-up sequencer is set up this way, I guess). So this is just a little bit of theory.

There are three power groups, each having a separate 12V power rail: 12V_GROUP1, 12V_GROUP2 and 12V_GROUP3.

12V_GROUP2 will feed 12V_GROUP1 and 12V_GROUP3 with current if their voltage is lower than its own, by virtue of emulated ideal diodes: It’s as if there were two ideal diodes with their anodes on 12V_GROUP2, one cathode on 12V_GROUP1, and the other cathode on 12V_GROUP3.

These voltage rails are in turn fed by external connectors, through emulated ideal diodes as follows:

  • J26 (8-pin aux voltage) feeds 12V_GROUP1
  • J27 (6-pin PCIe / power brick) feeds 12V_GROUP2
  • The PCIe slot’s 12V supply feeds 12V_GROUP3

The PCIe slot’s 3.3V supply is not used by the board.

This arrangement makes sense: If the board is used standalone, the brick power supply is connected to J27, and feeds all three groups. When used in a PCIe slot, the slot itself can only power 12V_GROUP3, so by itself, the board can’t power up. Theoretically speaking, J27 needs to be connected to the computer’s power supply through a PCIe power connector, at the very least. For higher-power applications, J26 should be connected to the power supply as well, to allow for the higher current flow. In practice, J27 alone won’t power the board up, probably as a safety measure.

The FPGA’s core voltage is S10_VCC, which is generated from 12V_GROUP1 — this is the heavy lifting, and it’s not surprising that it’s connected to J26, which is intended for the higher currents.

The ideal diode emulation is done with LTC4357 devices, which measure the voltage between the emulated diode’s anode and cathode. If this voltage is slightly positive, the device opens an external power FET by applying voltage to its gate. This FET’s drain and source pins are connected to the emulated diode’s anode and cathode pins, so all in all, when there’s a positive voltage across it, current flows. This reduces the voltage drop considerably, allowing efficient power supply OR-ing, as done extensively on this development kit.

The board’s user guide advises against connecting the brick power supply to J27 when the board is in a PCIe slot, but also mentions the ideal diode mechanism (once again, it won’t power up at all this way). This is understandable, as doing so would cause current to be drawn from the PCIe slot’s 12V supply whenever its voltage is higher than the one supplied by J27, even momentarily. With the voltage turbulence that is typical of switching power supplies, the currents may end up swinging quite a lot in an unfortunate combination of power supplies.

So even though it’s often more comfortable to control the power of the board separately from the hosting computer’s power, or to connect J27 only if the board is expected to draw less than 75W, both possibilities are eliminated. Both the noisy fan and the board’s refusal to power up unless fed properly prepare the board for the worst case power consumption scenario.

Critical Warnings after upgrading a PCIe block for Ultrascale+ on Vivado 2020.1


Introduction

Checking Xillybus’ bundle for Kintex Ultrascale+ on Vivado 2020.1, I got several critical warnings related to the PCIe block. As the bundle is intended to show how Xillybus’ IP core is used for simplifying communication with the host, these warnings aren’t directly related, and yet they’re unacceptable.

This bundle is designed to work with Vivado 2017.3 and later: It sets up the project by virtue of a Tcl script, which, among other things, calls upgrade_ip to update all IPs. Unfortunately, a bug in Vivado 2020.1 (and possibly other versions) causes the upgraded PCIe block to end up misconfigured.

This bug applies to Zynq Ultrascale+ as well, but curiously enough, not to Virtex Ultrascale+; at least with my setting, there was no problem.

The problem

Having upgraded an UltraScale+ Integrated Block (PCIE4) for PCI Express from Vivado 2017.3 (or 2018.3) to Vivado 2020.1, I got several Critical Warnings. Three during synthesis:

[Vivado 12-4739] create_clock:No valid object(s) found for '-objects [get_pins -filter REF_PIN_NAME=~TXOUTCLK -of_objects [get_cells -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GT*E4_CHANNEL_PRIM_INST}]]'. ["project/pcie_ip_block/source/ip_pcie4_uscale_plus_x0y0.xdc":127]
[Vivado 12-4739] get_clocks:No valid object(s) found for '--of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]'. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":63]
[Vivado 12-4739] get_clocks:No valid object(s) found for '--of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]'. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":64]

and another seven during implementation:

[Vivado 12-4739] create_clock:No valid object(s) found for '-objects [get_pins -filter REF_PIN_NAME=~TXOUTCLK -of_objects [get_cells -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GT*E4_CHANNEL_PRIM_INST}]]'. ["project/pcie_ip_block/source/ip_pcie4_uscale_plus_x0y0.xdc":127]
[Vivado 12-4739] set_clock_groups:No valid object(s) found for '-group [get_clocks -of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]]'. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":63]
[Vivado 12-4739] set_clock_groups:No valid object(s) found for '-group '. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":63]
[Vivado 12-4739] set_clock_groups:No valid object(s) found for '-group [get_clocks -of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]]'. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":64]
[Vivado 12-4739] set_clock_groups:No valid object(s) found for '-group '. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":64]
[Vivado 12-5201] set_clock_groups: cannot set the clock group when only one non-empty group remains. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":63]
[Vivado 12-5201] set_clock_groups: cannot set the clock group when only one non-empty group remains. ["project/pcie_ip_block/synth/pcie_ip_block_late.xdc":64]

The first warning in each group points at this line in ip_pcie4_uscale_plus_x0y0.xdc, which was automatically generated by the tools:

create_clock -period 4.0 [get_pins -filter {REF_PIN_NAME=~TXOUTCLK} -of_objects [get_cells -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GT*E4_CHANNEL_PRIM_INST}]]

And the others point at these two lines in pcie_ip_block_late.xdc, also generated by the tools:

set_clock_groups -asynchronous -group [get_clocks -of_objects [get_ports sys_clk]] -group [get_clocks -of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]]
set_clock_groups -asynchronous -group [get_clocks -of_objects [get_pins -hierarchical -filter {NAME =~ *gen_channel_container[1200].*gen_gtye4_channel_inst[3].GTYE4_CHANNEL_PRIM_INST/TXOUTCLK}]] -group [get_clocks -of_objects [get_ports sys_clk]]

So this is clearly about a reference to a non-existent logic cell supposedly named gen_channel_container[1200], and in particular that index, 1200, looks suspicious.

I would have been relatively fine with ignoring these warnings had it been just the set_clock_groups that failed, as these create false paths. If the design implements properly without these, it’s fine. But failing a create_clock command is serious, as this can leave paths unconstrained. I’m not sure if this is indeed the case, and it doesn’t matter all that much. One shouldn’t get used to ignoring critical warnings.

Looking at the .xci file for this PCIe block, it’s apparent that several changes were made to it while upgrading to 2020.1. Among those changes, these three lines were added:

<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT">GTHE4_CHANNEL_X49Y99</spirit:configurableElementValue>
<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT_CONTAINER">1200</spirit:configurableElementValue>
<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT_QUAD_INX">3</spirit:configurableElementValue>

Also, somewhere else in the XCI file, this line was added:

<spirit:configurableElementValue spirit:referenceId="PARAM_VALUE.MASTER_GT">GTHE4_CHANNEL_X49Y99</spirit:configurableElementValue>

So there’s a bug in the upgrading mechanism, which sets some internal parameter to select a nonexistent GT site.

The manual fix (GUI)

To rectify the wrong settings manually, enter the settings of the PCIe block, and click the “Enable GT Quad Selection” checkbox twice: once to uncheck it, and once to check it again. Make sure that the selected GT hasn’t changed.

Then it might be required to return some unrelated settings to their desired values. In particular, the PCI Device ID and similar attributes change to Xilinx’ default as a result of this. It’s therefore recommended to make a copy of the XCI file before making this change, and then use a diff tool to compare the before and after files, looking for irrelevant changes. Given that this revert to default has been going on for so many years, it seems like Xilinx considers this a feature.

But this didn’t solve my problem, as the bundle needs to set itself correctly out of the box.

Modifying the XCI file? (Not)

The immediate thing to check was whether this problem applies to PCIe blocks that are created in Vivado 2020.1 from scratch inside a project which is set to target KCU116 (which is what the said Xillybus bundle targets). As expected, it doesn’t — this occurs just on upgraded IP blocks: With the project that was set up from scratch, the related lines in the XCI file read:

<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT">GTYE4_CHANNEL_X0Y7</spirit:configurableElementValue>
<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT_CONTAINER">1</spirit:configurableElementValue>
<spirit:configurableElementValue spirit:referenceId="MODELPARAM_VALUE.MASTER_GT_QUAD_INX">3</spirit:configurableElementValue>

and

<spirit:configurableElementValue spirit:referenceId="PARAM_VALUE.MASTER_GT">GTYE4_CHANNEL_X0Y7</spirit:configurableElementValue>

respectively. These are values that make sense.

With this information at hand, my first attempt to solve this was to add the four new lines to the old XCI file. This allowed using the XCI file with Vivado 2020.1 properly; however, synthesizing the PCIe block on older Vivado versions failed: As it turns out, all MODELPARAM_VALUE attributes become instantiation parameters for pcie_uplus_pcie4_uscale_core_top inside the PCIe block. Looking at the source file generated by 2020.1, these parameters are indeed defined there (only in files generated by 2020.1), and yet they are unused, like many other instantiation parameters in this module. So apparently, Vivado’s machinery generates an instantiation parameter for each such attribute, even if it’s not used. Those unused parameters are most likely intended for scripting.

So this trick made the older Vivado instantiate pcie_uplus_pcie4_uscale_core_top with instantiation parameters that the module doesn’t have, and hence its synthesis failed. Dead end.

I didn’t examine the possibility of deselecting “Enable GT Quad Selection” in the original block, because Vivado 2017.3 chooses the wrong GT for the board without this option.

Workaround with Tcl

Eventually, I solved the problem by adding a few lines to the Tcl script.

Assuming that $ip_name has been set to the name of the PCIe block IP, this Tcl snippet rectifies the bug:

if {![string equal "" [get_property -quiet CONFIG.MASTER_GT [get_ips $ip_name]]]} {
  set_property -dict [list CONFIG.en_gt_selection {true} CONFIG.MASTER_GT {GTYE4_CHANNEL_X0Y7}] [get_ips $ip_name]
}

This snippet should of course be inserted after updating the IP core (with e.g. upgrade_ip [get_ips]). The code first checks whether the MASTER_GT property is defined, and only if so, sets it to the desired value. This ensures that nothing happens with older Vivado versions. Note the “quiet” flag of get_property, which prevents it from generating an error if the property isn’t defined; rather, it returns an empty string in that case, which is what the result is compared against.

Setting MASTER_GT this way also rectifies MASTER_GT_CONTAINER correctly, and surprisingly enough, this doesn’t change anything it shouldn’t; in particular, the Device IDs remain intact.

However, the disadvantage of this solution is that the GT to select is hardcoded in the Tcl code. That’s fine in my case, as the bundle targets a specific board (KCU116).

Another way to go, which is less recommended, is to emulate the check and uncheck of “Enable GT Quad Selection”:

if {![string equal "" [get_property -quiet CONFIG.MASTER_GT [get_ips $ip_name]]]} {
  set_property CONFIG.en_gt_selection {false} [get_ips $ip_name]
  set_property CONFIG.en_gt_selection {true} [get_ips $ip_name]
}

However, turning the en_gt_selection flag off and on again also resets the Device ID to its default, just like manually toggling the checkbox. And even though it set MASTER_GT correctly in my specific case, I’m not sure whether this can be relied upon.
