Is anybody interested in collaborating on an informal spec + reference implementation for tunneling AMBA over a single high speed serial link?
Think AMBA CHI C2C or PCIe stripped down to its bare essentials.
Basic concepts I have in mind:
* No link training or negotiation. Both sides must agree on the data rate out of band or hard-code it (see the parameter sketch after this list)
* No plug-and-play enumeration, device descriptors, hotswap support, etc. Intended for hard-wired applications between multiple FPGAs (or potentially FPGA-ASIC in the future)
* Capable of being used with FPGA LVDS GPIOs or gigabit transceivers depending on data rate
* Point to point topology only at the physical/link layer. Multiple links can be instantiated to build star/tree structures as needed.
* 8B/10B line code
* Transports raw AMBA requests/responses. Initial implementation will be APB only, but the intent is to support encapsulation of AHB/AXI-Stream/AXI in the future
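To make the "no negotiation" point concrete, here's a rough sketch (Python, names purely illustrative, nothing here is in any spec yet) of the parameters both ends would have to agree on out of band:

```python
# Hypothetical link parameters -- nothing here is negotiated on the wire, so
# both ends must be built or configured with identical values.
from dataclasses import dataclass

@dataclass(frozen=True)
class SerialLinkConfig:
    line_rate_gbps: float     # agreed out of band or hard-coded, no training
    use_8b10b: bool = True    # 8B/10B line code
    addr_width: int = 24      # address bits carried over the wire
    data_width: int = 32      # APB data width for the initial implementation

# e.g. a 5 Gbps transceiver link carrying 32-bit APB with 24-bit addresses
cfg = SerialLinkConfig(line_rate_gbps=5.0)
```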
-
There's a bunch of open questions to work out, keeping it simple but reliable.
For example, do we allow buffered writes? This will allow the requester side of a link to return a completion (PREADY) response before the request actually reaches the link partner. But then what if the bus segment at the far side reports an error?
How do we handle errors?
The simplest option (which I'm leaning toward, given my use cases of low-ish-speed control plane traffic rather than heavy data plane) is a blocking architecture, in which the bus interface is tightly coupled across the SERDES link and maintains end-to-end select/ready flow control.
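As a behavioral sketch of what "tightly coupled" would mean for a write (Python pseudocode, not RTL; send_frame/recv_frame are placeholders for the SERDES framing layer):

```python
# Minimal behavioral model of the blocking bridge idea. send_frame/recv_frame
# are hypothetical stand-ins for the SERDES framing layer, not real APIs.
def apb_write_over_link(addr, wdata, send_frame, recv_frame):
    """Forward one APB write across the link. Local PREADY is only returned
    once the far side's completion comes back, so a remote error arrives
    before the local transaction completes and can be signaled as PSLVERR."""
    send_frame(("WRITE", addr, wdata))   # request crosses the link
    status = recv_frame()                # block until the completion returns
    pslverr = (status == "ERR")          # far-side error maps 1:1 to PSLVERR
    return pslverr                       # caller asserts PREADY at this point
```

The cost of that simplicity is that every transaction eats a full round trip, which is what the latency numbers in the next post are about.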
-
Using some plausible values: let's say we have a 150 MHz (6.67 ns) AMBA clock domain in the FPGA and are using 32-bit APB. The theoretical max on-chip data rate (given APB's throughput of one transaction per 3 clocks) is 1.6 Gbps.
But if the data is being sourced by an STM32H735 over the FMC interface (as will likely be the case for most of my upcoming projects), the max write throughput is one 64-bit word per 9 cycles of the 150 MHz clock (1 address + 2 wait + 4 data + 2 CS# high), or 1.06 Gbps.
The 7 series GTP has latency ranging from 65 to 234 UI at TX, and 141 to 916 UI at RX, depending on configuration, line coder sync phases, etc. Best case total (TX + RX) is 206 UI; adding up the worst cases gives a maximum of 1150 UI.
Suppose we run the SERDES at 2.5 Gbps (a nice round number sufficient to ensure we never saturate it). This gives (400 ps * 1150) = 460 ns one-way SERDES latency, or 69 AMBA clocks worst case. Best case is 82.4 ns or 13 AMBA clocks.
If we bump up to 5 Gbps (a nice round number doable by a -2 Artix-7 GTP) we get 200 ps * 1150 = 230 ns one-way latency, or 35 clocks worst case. Best case is 41.2 ns or 7 AMBA clocks.
These are one-way numbers. If we are moving APB data in lock-step across the link, we have to double the latency to get a round trip flow control period.
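For anyone who wants to poke at the numbers, here's the same arithmetic as a quick Python sketch (GTP latency figures as quoted above; rounding the clock counts up reproduces the 69 / 13 / 35 / 7 figures):

```python
# Re-deriving the one-way SERDES latency figures above.
AMBA_CLK_NS = 1000 / 150.0       # 150 MHz AMBA clock -> ~6.67 ns period
UI_WORST = 234 + 916             # worst-case GTP TX + RX latency = 1150 UI
UI_BEST  = 65 + 141              # best-case GTP TX + RX latency  = 206 UI

for rate_gbps in (2.5, 5.0):
    ui_ns = 1.0 / rate_gbps      # one unit interval: 0.4 ns @ 2.5G, 0.2 ns @ 5G
    for label, ui in (("worst", UI_WORST), ("best", UI_BEST)):
        ns = ui * ui_ns
        print(f"{rate_gbps} Gbps {label}: {ns:.1f} ns one-way, "
              f"{ns / AMBA_CLK_NS:.1f} AMBA clocks")
# Prints roughly: 460.0 ns / 69.0 clks and 82.4 ns / 12.4 clks at 2.5 Gbps,
#                 230.0 ns / 34.5 clks and 41.2 ns / 6.2 clks at 5 Gbps
```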
-
So if I go with 5 Gbps data rate, I'm looking at worst case 70 clocks round trip latency, on top of however long the peripheral on the remote side takes to respond.
Adding a few cycles for clock domain crossing between the APB and SERDES clock domains, and for the remote side to actually process the request, let's call it one transaction per 80 cycles.
That's 32 bits / (80 clocks * 6.66 ns) or 60.06 Mbps. Not great, but probably fine for the kinds of low speed control plane traffic APB is designed for.
And this is a worst-case number. It could be much faster: best case of 14 clocks round trip plus CDC delays etc., call it one transaction per 20 cycles, or 240 Mbps.
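Same caveat as before, here's that throughput math as a tiny sketch (the 80- and 20-cycle budgets are the guesses from above, not measurements):

```python
# Effective APB-over-SERDES throughput under the blocking model: one 32-bit
# transfer per round-trip budget (link latency + CDC + remote peripheral).
AMBA_CLK_NS = 1000 / 150.0                               # ~6.67 ns per AMBA clock

def apb_throughput_mbps(clocks_per_txn: int, bits: int = 32) -> float:
    return bits / (clocks_per_txn * AMBA_CLK_NS) * 1000  # bits/ns -> Mbps

print(apb_throughput_mbps(80))   # worst-case budget: ~60 Mbps
print(apb_throughput_mbps(20))   # optimistic budget: ~240 Mbps
```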
Usable throughput would increase significantly if encapsulating AHB or AXI or something that supports multi-beat bursts (which wouldn't need to add flow control latency for every word of data).
-
For an initial implementation I'm thinking of 32-bit data only, and 24-bit address over the wire (we can probably parameterize it).
Overall packet format would be something like
K28.0        start of APB write
C0 DE 42     24-bit address
12 34 56 78  32-bit data
XX           CRC-8
No end of frame code needed since APB frames are fixed size
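As a quick framing sketch (Python, purely illustrative; the CRC-8 polynomial and init value here are placeholders the spec would still have to nail down):

```python
# Hypothetical framing for the APB write packet above: K28.0, 3 address bytes,
# 4 data bytes, CRC-8 over the payload.
def crc8(data: bytes, poly: int = 0x07, init: int = 0x00) -> int:
    """Bitwise CRC-8 with a placeholder polynomial (x^8 + x^2 + x + 1)."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def apb_write_frame(addr: int, data: int) -> list:
    """Build the 9-codeword frame: start-of-write K character, address, data, CRC."""
    payload = addr.to_bytes(3, "big") + data.to_bytes(4, "big")
    return ["K28.0"] + list(payload) + [crc8(payload)]

# Example frame from the post: address 0xC0DE42, data 0x12345678
print(apb_write_frame(0xC0DE42, 0x12345678))
```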
-
So nine codewords total, or 90 UI, to send a single transaction.
-
@azonenberg I feel like the way you tackle APB and AXI should be very different
AXI is a very nice multi-pipelined bus and I'd be tempted to pipeline the multiple independent sub-buses across separately and statelessly
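Rough illustration of how I read the "independent sub-buses" idea (channel names from the AXI spec; the stream mapping itself is just a guess at one way to do it):

```python
# Each AXI channel tunnels over its own stream on the link, with its own
# ready/valid (or credit) flow control and no shared transaction state in
# the bridge itself.
AXI_CHANNELS = ("AW", "W", "B", "AR", "R")   # write addr/data/resp, read addr/data

def stream_id(channel: str) -> int:
    """Hypothetical fixed mapping from AXI channel to a per-channel stream ID."""
    return AXI_CHANNELS.index(channel)
```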