Access to UART ring buffers

C programming, build, interpreter/VM.
Target audience: MicroPython Developers.
martincho
Posts: 96
Joined: Mon May 16, 2022 9:59 pm

Access to UART ring buffers

Post by martincho » Mon Jun 20, 2022 4:25 am

RP2040

Is there a way to have direct access to the UART ring buffers from assembler?

I've been studying the source and understand how things work to what I think might be a reasonable degree. I see where UART reading and writing happens in "machine_uart.c" in the rp2 port directory. I am not sure I see a way to directly tap the ring buffers. All I can pass to the initialization routine is a size.

The only option I see is to not use machine.UART and recreate it all in assembler for better performance in my application. This would not be a horrible task. I've already written ring buffer and FIFO code in assembler; I'd just have to configure and manage the two UARTs.

Why?

I need to manage high-speed relaying between the ports with minimal delay while, at the same time, decoding and validating packets and executing the commands they contain. Once you start getting much past 115200 baud, the higher-speed portions of this process really benefit from saving clock cycles everywhere possible. Hence reaching for assembler everywhere I can.

jimmo
Posts: 2754
Joined: Tue Aug 08, 2017 1:57 am
Location: Sydney, Australia

Re: Access to UART ring buffers

Post by jimmo » Wed Jun 22, 2022 6:37 am

martincho wrote:
Mon Jun 20, 2022 4:25 am
Is there a way to have direct access to the UART ring buffers from assembler?
In theory, yes, you could.

When you call an @asm_thumb method, each argument is placed in a register, via the convert_obj_for_inline_asm function (see py/objfun.c).
So what you'll get in (say) r0 is the address of a machine_uart_obj_t instance. From there you can offset to get the read_buffer/write_buffer.
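
As a very rough sketch (the struct offset below is made up; you'd have to check the actual layout of machine_uart_obj_t in ports/rp2/machine_uart.c for your build):

Code: Select all

@micropython.asm_thumb
def peek_uart(r0):
    # r0 arrives as the address of the machine_uart_obj_t instance that
    # was passed as the first argument.
    # HYPOTHETICAL offset: load a 32-bit word from somewhere inside the
    # struct (e.g. part of the read_buffer ringbuf). Verify the offset
    # against the real struct layout before relying on it.
    ldr(r0, [r0, 8])

Calling peek_uart(uart) then returns whatever word sits at that offset as an integer.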

But... this sounds like it would be a lot easier to fix in machine_uart.c (i.e. add a helper method that does what you need).

martincho
Posts: 96
Joined: Mon May 16, 2022 9:59 pm

Re: Access to UART ring buffers

Post by martincho » Wed Jun 22, 2022 7:36 am

jimmo wrote:
Wed Jun 22, 2022 6:37 am
But... this sounds like it would be a lot easier to fix in machine_uart.c (i.e. add a helper method that does what you need).
I think this is likely where I am headed. It will be a couple of months before I can take this path though. My preference would be to try and contribute something potentially useful to the codebase.

That said, it would be really useful to be able to call a routine from assembler that would pop a single byte at a time out of the relevant ring buffer and, similarly, place one in a transmission ring buffer. This would enable such things as high-speed protocol decoding in assembler, which is easy to do... but you need a way to get bytes in and out of the ring buffers (without going through bytearrays and incurring additional MicroPython calls or allocations).

In theory I could use .readinto() and then pass that bytearray to an assembler routine. Same with .write(): I could populate a bytearray from assembler and hand it to write(). My preference would be to stay in assembler for the entire process.
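
In rough terms that bridge would look something like this (uart_in, uart_out and the buffer names are placeholders, and every byte still passes through Python, which is exactly the overhead I'm trying to avoid):

Code: Select all

rx_buffer = bytearray(1)
tx_buffer = bytearray(1)

@micropython.asm_thumb
def relay_byte(r0, r1):
    # Bytearrays arrive as pointers to their data, so this simply copies
    # one byte from the rx buffer to the tx buffer.
    ldrb(r2, [r0, 0])
    strb(r2, [r1, 0])

# uart_in / uart_out are the two machine.UART instances
while uart_in.any():
    uart_in.readinto(rx_buffer, 1)
    relay_byte(rx_buffer, tx_buffer)
    uart_out.write(tx_buffer)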

Roberthh
Posts: 3667
Joined: Sat May 09, 2015 4:13 pm
Location: Rhineland, Europe

Re: Access to UART ring buffers

Post by Roberthh » Wed Jun 22, 2022 9:27 am

Just putting a byte into the ring buffer does not mean it will be sent. Unless sending is already active, the send process has to be initiated. So what you need is a lightweight version of uart_write(). That also has to cater for the case where the ring buffer is full.
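
In outline, such a lightweight single-byte write has to do something like the following (a pure-Python illustration with invented names; the real version belongs in C in machine_uart.c):

Code: Select all

class TxRing:
    def __init__(self, size):
        self.buf = bytearray(size)
        self.head = 0          # next free slot to write into
        self.tail = 0          # next byte to transmit
        self.sending = False   # is a transmission already in progress?

    def putc(self, byte):
        nxt = (self.head + 1) % len(self.buf)
        if nxt == self.tail:
            return False       # ring buffer full - caller must handle this
        self.buf[self.head] = byte
        self.head = nxt
        if not self.sending:
            self.sending = True
            self.start_tx()    # initiate the hardware/IRQ driven send
        return True

    def start_tx(self):
        pass                   # placeholder for starting the actual transmission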

martincho
Posts: 96
Joined: Mon May 16, 2022 9:59 pm

Re: Access to UART ring buffers

Post by martincho » Wed Jun 22, 2022 6:19 pm

Roberthh wrote:
Wed Jun 22, 2022 9:27 am
Just putting a byte into the ring buffer does not mean it will be sent. Unless sending is already active, the send process has to be initiated. So what you need is a lightweight version of uart_write(). That also has to cater for the case where the ring buffer is full.
True. That's pretty much what I meant. An efficient way to transmit one byte at a time without having to pass around bytearrays and incur various call overheads. I tried using memoryviews but that actually was slower than single byte rx/tx buffers for the readinto() and write() methods.

The simplest way I can put it is that just managing this at the tightest MicroPython level I can muster imposes an 80% overhead. In other words, a 3 ms long packet coming into the RX pin, when relayed to the TX pin, ends up being 5.4 ms long (5.4 / 3.0 = 1.8).

[Image: logic analyzer capture of the 3 ms input packet and the 5.4 ms relayed packet]

A couple of relevant portions of the code that produces that output on the logic analyzer:

Code: Select all

        # Performance hack: Grab pointers to frequently used functions to reduce dictionary search time
        # In this case the resulting code is 20% faster
        # I use the "p_" prefix to denote such hacks
        self.p_rx_any = self.rx.any
        self.p_rx_readinto = self.rx.readinto
        self.p_tx_write = self.tx.write

Code: Select all

        self.p_rx_readinto(self.rx_buffer, 1)
        if self.busy:
            # This means we already found STX
            self.packet_index += 1
            self.current_packet[self.packet_index] = self.rx_buffer[0]
            if relay_enabled:
                self.p_tx_write(self.rx_buffer)  # Relay to destination port
        else:
            # No STX yet; if we didn't get STX now, ignore it
            if self.rx_buffer[0] == _STX:
                self.busy = True
                self.packet_index = 0
                self.current_packet[0] = _STX
                if relay_enabled:
                    self.p_tx_write(self.rx_buffer)  # Relay to destination port
The issue at this point is that I can't insert an assembler routine in there because the call overhead is expensive, so it doesn't help. The only way assembler would really help me here is if I could replace that entire block with a single assembler function. I think a single call overhead would work very well.

For example, I tried the simple idea of moving a single byte from rx_buffer to current_packet and incrementing the index in assembler. You'd think this would run a bit faster; it does not:

Code: Select all

# Copy a single byte from one bytearray to another and increment index
#
#   r0 destination bytearray address
#   r1 source bytearray address
#   r2 destination current index - the byte will be written to this index + 1
#
# Return:
#   last index + 1
#   Simply assign the return value to the index counter and you are ready for the next write
#
# Usage:
#   destination = bytearray(n)
#   source = bytearray(m)
#   new_index = copy_one(destination, source, current_index)
#
# WARNING:
#    Assembler routines in MicroPython have no way to know where an array ends!
#
@micropython.asm_thumb
def copy_one(r0, r1, r2) -> int:
    add(r2, 1)          # Index where we want to write
    add(r0, r0, r2)     # Pointer to that byte in destination
    ldrb(r3, [r1, 0])   # Read byte
    strb(r3, [r0, 0])   # Write it
    mov(r0, r2)         # Return the new index
This would turn these two lines of code into a single assembler call:

Code: Select all

self.packet_index += 1
self.current_packet[self.packet_index] = self.rx_buffer[0]
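
i.e. the caller collapses to a single line, something like:

Code: Select all

self.packet_index = copy_one(self.current_packet, self.rx_buffer, self.packet_index)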
Performance is actually worse. Yes, I am using the @micropython.native decorator for the relevant method. Can't go to viper because of the data types involved. To be fair, I haven't played around with viper enough to know if I could actually rewrite that code section to take advantage of that code emitter.
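
If I do experiment with viper, I imagine the equivalent of copy_one would look something like this (untested, and it only covers the byte copy, not the rest of the block):

Code: Select all

@micropython.viper
def copy_one_viper(dest, src, index: int) -> int:
    d = ptr8(dest)      # raw byte pointers into the two bytearrays
    s = ptr8(src)
    index += 1
    d[index] = s[0]
    return index

Same caveat as the assembler version: nothing here knows where the arrays end.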

This is why I was saying that being able to call serial read and write from assembler might be very useful when performance is critical. I mean, you so much as touch a packet to do simple processing and the packet time nearly doubles.

What's missing from this code is some very mild packet validation I want to do on the fly. Again, character by character. This is important because it is the only effective way to deal with receiving bad data. In other words, if I were fed a continuous stream of random numbers -- which would result in pretty much 100% bad packets -- I need a fast way to dump the data before filling a full packet buffer and sending it to the packet processor for validation. A simple example of this is that the third byte is actually a payload length. I can check that for validity (range of legal values) and then stop adding bytes to the candidate packet once that number is reached. Once again, touching the data stream at this point will add overhead. I am starting to believe I'll probably end up with about a 150% increase in packet length after all checks.
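
For example, the length check would slot into the receive handler above roughly like this (_MAX_PAYLOAD and packet_end are purely illustrative, not part of the code above):

Code: Select all

# Inside the receive handler, once the third byte (index 2) has arrived:
if self.packet_index == 2:
    length = self.current_packet[2]
    if not 0 < length <= _MAX_PAYLOAD:    # _MAX_PAYLOAD: illustrative legal-range limit
        self.busy = False                 # illegal length: dump the candidate packet
    else:
        self.packet_end = 2 + length      # stop collecting once this index is reached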

Am I missing something fundamental?

BTW, burst-reading more than one byte at a time makes things far more complicated. For one thing, the latency -- input byte to relayed byte -- is a function of the read size. So, if you burst-read 32 bytes at a time at 115200, the relayed packet will not start coming out for about 3 ms (32 bytes at 10 bits per byte / 115200 baud is roughly 2.8 ms). In the single-byte read scenario the relay starts about 0.5 characters after the byte is received (about 62 microseconds):

[Image: logic analyzer capture showing the single-byte relay latency]

Interesting problem.
