Roberthh wrote: ↑Wed Jun 22, 2022 9:27 am
Just putting a byte into the ringbuffer does not mean it will be sent. Unless sending is already active, the send process has to be initiated. So what you need is a lightweight version of uart_write(). That has as well to cater for the case that the ringbuffer is full.
True. That's pretty much what I meant. An efficient way to transmit one byte at a time without having to pass around bytearrays and incur various call overheads. I tried using memoryviews but that actually was slower than single byte rx/tx buffers for the readinto() and write() methods.
The simplest way I can put it is that just managing this at the tightest MicroPython level I can muster is imposing an 80% overhead. In other words, a 3 ms long packet coming into the RX pin ends up being 5.4 ms long when relayed to the TX pin.
A couple of relevant portions of the code that produces that output on the logic analyzer:
Code: Select all
# Performance hack: Grab pointers to frequently used functions to reduce dictionary search time
# In this case the resulting code is 20% faster
# I use the "p_" prefix to denote such hacks
self.p_rx_any = self.rx.any
self.p_rx_readinto = self.rx.readinto
self.p_tx_write = self.tx.write
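As an aside, this bound-method caching trick is plain Python and can be demonstrated off-device. The sketch below uses illustrative names of my own, with io.BytesIO standing in for the UART objects; it just shows the cached names behaving identically to the attribute lookups they replace:

```python
# Illustrative sketch (plain Python, not the real relay code): caching bound
# methods in "p_" names skips one attribute lookup per call, as in the post.
import io

class Relay:
    def __init__(self, rx, tx):
        self.rx = rx
        self.tx = tx
        # Cached bound methods -- same objects, one less dict lookup per call
        self.p_rx_read = self.rx.read
        self.p_tx_write = self.tx.write

    def pump(self, n):
        data = self.p_rx_read(n)   # identical result to self.rx.read(n)
        if data:
            self.p_tx_write(data)
        return data

rx = io.BytesIO(b"\x02hello")
tx = io.BytesIO()
relay = Relay(rx, tx)
relay.pump(6)
print(tx.getvalue())  # b'\x02hello'
```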
Code: Select all
self.p_rx_readinto(self.rx_buffer, 1)
if self.busy:
    # This means we already found STX
    self.packet_index += 1
    self.current_packet[self.packet_index] = self.rx_buffer[0]
    if relay_enabled:
        self.p_tx_write(self.rx_buffer)  # Relay to destination port
else:
    # No STX yet; if we didn't get STX now, ignore this byte
    if self.rx_buffer[0] == _STX:
        self.busy = True
        self.packet_index = 0
        self.current_packet[0] = _STX
        if relay_enabled:
            self.p_tx_write(self.rx_buffer)  # Relay to destination port
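For anyone without the hardware, the control flow of that snippet can be modelled in plain Python. This is a hedged sketch: the Framer class, the 64-byte buffer size, and the relayed bytearray (standing in for tx.write()) are my inventions, but the _STX / busy / packet_index logic mirrors the code above:

```python
# Pure-Python model of the byte-at-a-time framing loop (illustrative only).
_STX = 0x02

class Framer:
    def __init__(self, relay_enabled=True):
        self.busy = False
        self.packet_index = 0
        self.current_packet = bytearray(64)  # sketch: fixed-size buffer
        self.relayed = bytearray()           # stands in for tx.write()
        self.relay_enabled = relay_enabled

    def feed(self, byte):
        if self.busy:
            # Already inside a packet: append and relay
            self.packet_index += 1
            self.current_packet[self.packet_index] = byte
            if self.relay_enabled:
                self.relayed.append(byte)
        elif byte == _STX:
            # Start of a new packet
            self.busy = True
            self.packet_index = 0
            self.current_packet[0] = _STX
            if self.relay_enabled:
                self.relayed.append(byte)
        # else: noise before STX is silently dropped

f = Framer()
for b in b"\xff\xff\x02ABC":  # two junk bytes, then STX + payload
    f.feed(b)
print(bytes(f.relayed))       # b'\x02ABC'
```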
The issue at this point is that I can't usefully insert an assembler routine in there: the call overhead is expensive enough that it doesn't help. The only way assembler would really help me here is if I could replace that entire block with a single assembler function. I think a single call overhead would work very well.
For example, I tried the simple idea of moving a single byte from rx_buffer to current_packet and incrementing the index in assembler. You'd think this would run a bit faster; it does not:
Code: Select all
# Copy a single byte from one bytearray to another and increment the index
#
# r0: destination bytearray address
# r1: source bytearray address
# r2: destination current index - the byte will be written at this index + 1
#
# Return:
# last index + 1
# Simply assign the return value to the index counter and you are ready for the next write
#
# Usage:
# destination = bytearray(n)
# source = bytearray(m)
# index = copy_one(destination, source, index)
#
# WARNING:
# Assembler routines in MicroPython have no way to know where an array ends!
#
@micropython.asm_thumb
def copy_one(r0, r1, r2):
    add(r2, 1)          # Index where we want to write
    add(r0, r0, r2)     # Pointer to that byte in destination
    ldrb(r3, [r1, 0])   # Read the byte from source
    strb(r3, [r0, 0])   # Write it to destination
    mov(r0, r2)         # Return the new index
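For clarity, here is what the assembler routine computes, written as plain Python (copy_one_py is just an illustrative name for this comparison):

```python
# Plain-Python equivalent of the copy_one assembler routine:
# write source[0] at destination[index + 1] and return the new index.
def copy_one_py(destination, source, index):
    index += 1
    destination[index] = source[0]
    return index

dst = bytearray(8)
src = bytearray(b"\x41")
idx = -1                    # so the next write lands at index 0
idx = copy_one_py(dst, src, idx)
print(idx, dst[0])          # 0 65
```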
This would turn these two lines of code into a single assembler call:
Code: Select all
self.packet_index += 1
self.current_packet[self.packet_index] = self.rx_buffer[0]
Performance is actually worse. Yes, I am using the @micropython.native decorator for the relevant method. Can't go to viper because of the data types involved. To be fair, I haven't played around with viper enough to know if I could actually rewrite that code section to take advantage of that code emitter.
This is why I was saying that being able to make calls to serial read and write from assembler might be very useful when performance is critical. I mean, you so much as touch a packet to do simple processing and the packet time nearly doubles.
What's missing from this code is some very mild packet validation I want to do on the fly, again character by character. This is important because it is the only effective way to deal with receiving bad data. In other words, if I simply sent a continuous stream of random numbers --which would result in pretty much 100% bad packets-- there has to be a fast way to dump the data before filling a full packet buffer and handing it to the packet processor for validation. A simple example: the third byte is actually a payload length. I can check it for validity (range of legal values) and then stop adding bytes to the candidate packet once that count is reached. Once again, touching the data stream at this point adds overhead. I am starting to believe I'll probably end up with about a 150% increase in packet length after all checks.
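To make the idea concrete, here is a hedged pure-Python sketch of that early length check. Everything beyond "byte 3 is a payload length" is an assumption of mine: _MAX_PAYLOAD, the collect_packet name, and the header-plus-payload layout are illustrative only.

```python
# Sketch: reject a candidate packet as soon as its length byte is illegal,
# instead of buffering a full packet and validating afterwards.
_STX = 0x02
_MAX_PAYLOAD = 32  # assumed range of legal length values

def collect_packet(stream):
    """Return the first plausible packet from an iterable of byte values, or None."""
    packet = bytearray()
    expected = None
    for b in stream:
        if not packet:
            if b == _STX:          # wait for the start byte
                packet.append(b)
            continue
        packet.append(b)
        if len(packet) == 3:       # third byte is the payload length
            if not (1 <= b <= _MAX_PAYLOAD):
                packet.clear()     # bad length: dump early, resync on next STX
                expected = None
                continue
            expected = 3 + b       # header + payload
        if expected is not None and len(packet) == expected:
            return bytes(packet)
    return None

# Length byte 200 is illegal, so the first candidate is dumped immediately;
# the second one (payload length 2) is collected in full.
data = bytes([_STX, 0x10, 200, _STX, 0x10, 2, 0xAA, 0xBB])
print(collect_packet(data))  # b'\x02\x10\x02\xaa\xbb'
```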
Am I missing something fundamental?
BTW, burst-reading more than one byte at a time makes things far more complicated. For one thing, the latency --input byte to relayed byte time-- is a function of the burst size. So, if you burst-read 32 bytes at a time at 115200, the relayed packet will not start coming out for about 3 ms. In the single-byte read scenario the relay starts about 0.5 characters after each byte is received (about 62 microseconds).
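The ~3 ms figure falls straight out of the framing arithmetic (a minimal check, assuming 8N1, i.e. 10 bits on the wire per byte):

```python
# Back-of-the-envelope latency of a 32-byte burst read at 115200 baud,
# assuming 8N1 framing (start bit + 8 data bits + stop bit = 10 bits/byte).
baud = 115200
bits_per_byte = 10
char_time_us = bits_per_byte / baud * 1e6   # time to receive one byte

burst = 32
burst_latency_ms = burst * char_time_us / 1000

print(round(char_time_us, 1), round(burst_latency_ms, 2))  # 86.8 2.78
```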
Interesting problem.