Using some plausible values: let's say we have a 150 MHz (6.666 ns) AMBA clock domain in the FPGA and are using 32-bit APB. The theoretical max on-chip data rate (given APB's throughput of one transaction per 3 clocks) is 1.6 Gbps.
But if the data is being sourced by an STM32H735 using the FMC interface (as will likely be the case for most of my upcoming projects), the maximum write throughput is one 64-bit word per nine 150 MHz cycles (1 address + 2 wait + 4 data + 2 CS# high), or about 1.07 Gbps.
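As a quick sanity check on both throughput figures, here is the same arithmetic as a short Python snippet. The cycle counts are the assumptions stated above, not measured values:

```python
# Back-of-envelope check of the two throughput figures above.
APB_CLK_HZ        = 150e6          # AMBA clock in the FPGA
APB_WIDTH_BITS    = 32             # 32-bit APB
APB_CLKS_PER_XFER = 3              # one APB transaction every three clocks

apb_gbps = APB_CLK_HZ * APB_WIDTH_BITS / APB_CLKS_PER_XFER / 1e9
print(f"APB theoretical max:  {apb_gbps:.2f} Gbps")    # 1.60 Gbps

FMC_CLK_HZ        = 150e6          # STM32H735 FMC clock
FMC_WORD_BITS     = 64             # one write burst moves a 64-bit word
FMC_CLKS_PER_WORD = 1 + 2 + 4 + 2  # address + wait + data + CS# high = 9 cycles

fmc_gbps = FMC_CLK_HZ * FMC_WORD_BITS / FMC_CLKS_PER_WORD / 1e9
print(f"FMC write throughput: {fmc_gbps:.2f} Gbps")    # ~1.07 Gbps
```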
The 7 series GTP has latency ranging from 65 to 234 UI at TX and 141 to 916 UI at RX, depending on configuration, line coding, sync phases, etc. Summing the minimums gives a best case of 206 UI; summing the maximums gives a worst case of 1150 UI.
Suppose we run the SERDES at 2.5 Gbps (a nice round number sufficient to ensure we never saturate it). At a 400 ps UI, the worst case is 400 ps * 1150 UI = 460 ns of one-way SERDES latency, or 69 AMBA clocks. The best case is 400 ps * 206 UI = 82.4 ns, or 13 AMBA clocks.
If we bump up to 5 Gbps (a nice round number doable by a -2 Artix-7 GTP), the UI shrinks to 200 ps and the worst case becomes 200 ps * 1150 UI = 230 ns one way, or 35 AMBA clocks. The best case is 200 ps * 206 UI = 41.2 ns, or 7 AMBA clocks.
These are one-way numbers. If we are moving APB data in lock-step across the link, we have to double the latency to get the round-trip flow control period.
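Putting the whole latency budget in one place, here is the same arithmetic as a small Python sketch. The TX/RX UI ranges are the GTP figures quoted above, and the clock counts round up to whole 150 MHz AMBA cycles:

```python
from math import ceil

AMBA_CLK_HZ = 150e6                 # 6.666 ns AMBA clock
BEST_UI  = 65 + 141                 # min TX + min RX GTP latency = 206 UI
WORST_UI = 234 + 916                # max TX + max RX GTP latency = 1150 UI

for rate_gbps in (2.5, 5.0):
    bit_rate = rate_gbps * 1e9
    for label, ui in (("best", BEST_UI), ("worst", WORST_UI)):
        one_way_ns = ui / rate_gbps                 # UI period is 1/rate ns
        clocks = ceil(ui * AMBA_CLK_HZ / bit_rate)  # whole AMBA clocks, rounded up
        print(f"{rate_gbps} Gbps {label}: {one_way_ns:.1f} ns one way "
              f"({clocks} clocks), {2 * one_way_ns:.0f} ns round trip")
```

This prints 460 ns / 69 clocks worst case and 82.4 ns / 13 clocks best case at 2.5 Gbps, and 230 ns / 35 clocks worst case and 41.2 ns / 7 clocks best case at 5 Gbps, matching the numbers above, with the doubled round-trip figures alongside.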