Supporting zero-copy operations in MicroPython

C programming, build, interpreter/VM.
Target audience: MicroPython Developers.
pfalcon
Posts: 1155
Joined: Fri Feb 28, 2014 2:05 pm

Supporting zero-copy operations in MicroPython

Post by pfalcon » Sat May 10, 2014 9:58 am

Due to the immutability of strings in Python, string operations in Python are inherently expensive - any change to a string means allocating more memory to hold the result, and usually copying a large part of the original string. And string manipulation arguably covers a good half of the programming done (for example, network programming absolutely requires manipulating data in buffers, naturally represented in Python as strings).

That's not how you typically program in C - there, a buffer is usually allocated knowing the upper bound of the data chunk size, or just using an arbitrary but realistically high enough bound, and then operations are performed as much as possible using pointers within this buffer and in-place operations. Surely, the C way is gazillions of times faster and, no less importantly, memory-friendly.

For MicroPython, it is important to strive for the highest possible efficiency, so it is tempting to support such a programming model. Let's figure out how to do that.

Problem #1: Getting data into buffers

Python has over time grown datatypes suitable for C-like in-place buffer operations. That's obviously bytearray (and array.array for cases where the unit of data is larger than a byte). The problem is how to get data from outside into a bytearray with a minimum of copies (specifically, zero).

If one doesn't know the Python API well, one would consider reading a bytes object from a stream and then converting it to a bytearray. But the latter operation means copying. So one might consider adding an operation to convert a bytes object to a bytearray in place. But that's a completely non-Pythonic way, and thus immediately gets the "ugly" moniker. Implementation-wise, bytes and bytearray are rather different objects, so there's no guarantee that such an "in-place" conversion would happen without copying (or would be possible at all, if the underlying structures need to grow, which is very likely with a mutable, growable type).

But looking at the API reference, Python has that covered - it has methods to read data directly into an arbitrary mutable buffer (including, but not limited to, bytearray). For the stream protocol, that's the readinto() method. But if some type has multiple means of reading data, this leads to noticeable API bloat. This can be seen with the socket type, which also has recv() and recvfrom() methods. Thus, we get recv_into() and recvfrom_into() (note also the arguably inconsistent naming vs readinto() - but it's governed by the existing recvfrom(); surely something like "recvfrominto" would be even uglier).

So, API bloat. Conceptually, the problem is solved - Python has it covered already, so we have little choice but to follow that trend. But of course, implementation-wise, adding extra methods has its overhead, so there's always a concern about adding a "rarely used" one. Well, the solution is along the lines we're discussing - they should be used often! ;-)
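To make this concrete, here's a minimal sketch of the "read into a preallocated buffer" style using the standard readinto()/recv_into() calls (the buffer size and the BytesIO stand-in are purely illustrative):

Code: Select all

import io

# Preallocate a fixed-size buffer once and reuse it for every read.
buf = bytearray(16)

# Anything implementing the stream protocol works; BytesIO stands in
# for a real file/socket stream here.
stream = io.BytesIO(b"GET / HTTP/1.0\r\n\r\n")

# readinto() fills the caller's buffer directly and returns the number of
# bytes read - no intermediate bytes object is allocated for the payload.
n = stream.readinto(buf)
print(n, bytes(buf[:n]))   # 16 b'GET / HTTP/1.0\r\n'

# For sockets, the analogous calls are sock.recv_into(buf) and, for
# datagram sockets, sock.recvfrom_into(buf).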

Problem #2: In-place operations on buffers

First of all, besides literally "in-place" (modifying) operations, another useful class of operations is the analog of C "pointing" operations. That subclass is well served by the memoryview object - it's possible to take a lightweight slice of a bytearray object, which is essentially represented by just offset and length values.
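For illustration, a small sketch of that "pointing" style with memoryview (the HTTP-ish buffer is just an example):

Code: Select all

buf = bytearray(b"HTTP/1.0 200 OK\r\n")

# A memoryview slice is essentially just an (object, offset, length)
# reference - no bytes are copied when taking it.
status = memoryview(buf)[9:12]
print(bytes(status))   # b'200' (bytes() copies only for display here)

# Writes through the view modify the underlying bytearray in place.
status[:] = b"404"
print(buf)             # bytearray(b'HTTP/1.0 404 OK\r\n')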

That still leaves real in-place string operations on bytearrays, like for example "strip". When I started to write them, I thought they were not supported. Live and learn: https://docs.python.org/3.4/library/std ... es-methods:
Due to the common use of ASCII text as the basis for binary protocols, bytes and bytearray objects provide almost all methods found on text strings, with the exceptions of: [few methods]
So, cool! Python rules, as usual! We just need to implement this wealth ;-)


Comments, further ideas?
Awesome MicroPython list
Pycopy - A better MicroPython https://github.com/pfalcon/micropython
MicroPython standard library for all ports and forks - https://github.com/pfalcon/micropython-lib
More up to date docs - http://pycopy.readthedocs.io/

Anton2
Posts: 1
Joined: Sat May 10, 2014 10:29 am

Re: Supporting zero-copy operations in MicroPython

Post by Anton2 » Sat May 10, 2014 11:15 am

One idea would be for bytes and bytearrays to support transparent "stitching" and sharing of buffer slices.

For example:

Code: Select all

packet1 = b'first packet received from some interface'
packet2 = b'next packet'

packet = packet1 + packet2  # Allocate a new bytes object that merely references
                            # the buffers of packet1 and packet2. No copy. Buffers
                            # are shared and only freed when nobody needs them anymore.
Same for bytearray. A bytearray created from a bytes object shares the initial buffer. You grow it by adding a second buffer.

Clearly, you can only share immutable stuff, so we'll have to copy bytes when modifying a bytearray backed by a shared buffer. And maybe bite the bullet and copy stuff when the number of stitched buffers becomes too large.

For buffer ownership, simple reference counting is sufficient: no risk of cycles, and we can cap the number of owners (just copy when/if the max is reached), so it can use a small number of bits.

It's more complicated, but it allows efficient byte handling in day-to-day Python. Because WE know that we shouldn't use packet = packet1 + packet2, but the target user I have in mind doesn't.

Benefits:
Any receiving interface returns a bytes object.
Efficient "concatenation" of received packets.
Zero copy transform between bytes and bytearray.
Less memory fragmentation, as we defer copy and reallocation.
Low overhead for random access, iteration, and slicing.

Things get complicated when we use a C function expecting a flat memory buffer, or want to use DMA on a bytearray... So we'll need a well-defined API to access, slice, and iterate the memoryview, eventually flattening it when the C function can't deal with it.
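Purely as a sketch of the stitching idea (a hypothetical StitchedBytes class - nothing that exists today, just to show the shape of it):

Code: Select all

class StitchedBytes:
    """Hypothetical rope-like container: "concatenation" only records the
    parts; flatten() copies only when a flat buffer is really needed."""

    def __init__(self, *parts):
        self.parts = list(parts)   # each part is a bytes-like object

    def __add__(self, other):
        # Concatenate without copying: just share references to both buffers.
        return StitchedBytes(*self.parts, other)

    def __len__(self):
        return sum(len(p) for p in self.parts)

    def flatten(self):
        # Only here do we pay for a copy, e.g. before a DMA transfer or a
        # call into a C function that needs contiguous memory.
        return b"".join(self.parts)

packet1 = b"first packet received from some interface"
packet2 = b"next packet"
packet = StitchedBytes(packet1) + packet2   # no data copied yet
print(len(packet), packet.flatten())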

I'm surely overlooking lots of details, but the idea is something I would find great for a high-level language targeting low-level hardware...

pfalcon
Posts: 1155
Joined: Fri Feb 28, 2014 2:05 pm

Re: Supporting zero-copy operations in MicroPython

Post by pfalcon » Sat May 10, 2014 2:18 pm

Anton2, I had a somewhat similar idea for handling reads from streams - the obvious implementation (and how it is done currently) is to read some data into a temporary fixed buffer, then reallocate the current read-result string by that size, and repeat. That's of course highly memory-unfriendly. Instead, we could allocate a fixed-size buffer chunk, read into it, chain it to the previous buffer, and repeat.
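Roughly what I mean, in plain Python (the chunk size and helper name are arbitrary):

Code: Select all

import io

def read_all_chunked(stream, chunk_size=64):
    # Read a stream of unknown length without ever reallocating or copying
    # what was read so far: each chunk is a fixed-size bytearray, and the
    # result is just a chain (list) of (buffer, length) pairs.
    chunks = []
    while True:
        buf = bytearray(chunk_size)
        n = stream.readinto(buf)
        if not n:
            break
        chunks.append((buf, n))
    return chunks

chunks = read_all_chunked(io.BytesIO(b"x" * 150))
print([n for _, n in chunks])   # [64, 64, 22]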

The problem with such solutions is that for the same Python-level type (str) we would have different C-level types. And at each place where a str is accepted, we need to be ready to handle any of the underlying types. That leads to explosive growth in the number of combinations to handle. With C++ we might be able to abstract it away, but with C, all the boilerplate code needs to be written manually.

So, I don't think such an approach is viable - we already have many combinations to take care of. For example, yesterday I figured out that bytes comparison was not implemented. D'oh! And looking at the code, it "should have worked out of the box", but didn't, due to various checks and optimizations specific to str (thus forgetfully leaving out bytes, which shares most of its code with the str implementation).
Awesome MicroPython list
Pycopy - A better MicroPython https://github.com/pfalcon/micropython
MicroPython standard library for all ports and forks - https://github.com/pfalcon/micropython-lib
More up to date docs - http://pycopy.readthedocs.io/

nelfata
Posts: 74
Joined: Wed Apr 30, 2014 10:50 pm

Re: Supporting zero-copy operations in MicroPython

Post by nelfata » Tue Nov 25, 2014 1:04 am

Hi,
I am interested to hear if there has been any progress on that front.
I am running into major memory issues trying to read from sockets into blocks and manipulate the data.
So far my only option is to reimplement the code in C.
If someone has done any work on buffer allocation with fixed-size buffers, I would greatly appreciate the help/input.

Thanks.

pfalcon
Posts: 1155
Joined: Fri Feb 28, 2014 2:05 pm

Re: Supporting zero-copy operations in MicroPython

Post by pfalcon » Wed Dec 03, 2014 11:51 pm

Nope, I guess you're the first to hit these issues in practice (vs. the just-theoretical concerns expressed in this thread). Which means you're in an exceptional position to scratch your own itch and implement these things ;-).
Awesome MicroPython list
Pycopy - A better MicroPython https://github.com/pfalcon/micropython
MicroPython standard library for all ports and forks - https://github.com/pfalcon/micropython-lib
More up to date docs - http://pycopy.readthedocs.io/

Damien
Site Admin
Posts: 647
Joined: Mon Dec 09, 2013 5:02 pm

Re: Supporting zero-copy operations in MicroPython

Post by Damien » Sat Jan 10, 2015 1:37 pm

Here is a real-life scenario that we should try to make work without allocating heap memory:
nelfata wrote: The concept is quite simple: the code expects a connection (made periodically) to update sensor data for a client (web browser); it is mainly acting as a web server. The data being exchanged is a dictionary, converted to a byte stream with the JSON library (ujson). This byte stream is about 600 bytes, and it is sent to the client.
Basically we want to convert a Python dict to a JSON string without allocating memory. Here's some semi-realistic code:

Code: Select all

class SensorServer:
    def __init__(self):
        self.accel = pyb.Accel()
        self.uart = pyb.UART(1, 9600)
        self.data = {"timestamp":0, "x":0, "y":0, "z":0}
        self.jsonbuf = bytearray(100)

    def update(self):
        self.data["timestamp"] = pyb.millis()
        self.data["x"] = self.accel().x()
        self.data["y"] = self.accel().y()
        self.data["z"] = self.accel().z()

    def send(self):
        ???
So far the above code only allocates memory in __init__ (stores into pre-existing dict keys don't need any extra memory). How do we write the send function? At the moment we have to do it like this:
self.uart.write(json.dumps(self.data))
but that will allocate memory to create the JSON string. Ideally we want something that fills jsonbuf in place, e.g.:

Code: Select all

json.dumps_into(self.data, self.jsonbuf)
self.uart.write(self.jsonbuf)
This won't work because we need to know the size of the created JSON string. So it gets a bit more complicated; e.g. we could do:

Code: Select all

n = json.dumps_into(self.data, self.jsonbuf)
self.uart.write(self.jsonbuf, n) # length specified as 2nd arg
Or probably better to use memoryview:

Code: Select all

n = json.dumps_into(self.data, self.jsonbuf)
self.uart.write(memoryview(self.jsonbuf)[0:n])
But that piece of code allocates memory for the memoryview object (actually 2 of them). Memoryviews are small, but we still don't want to allocate anything at all. Thus we need an in-place memoryview update method, and we should pre-allocate the memoryview object, e.g. self.jsonbuf = memoryview(bytearray(100)). Then send becomes:

Code: Select all

self.jsonbuf.len = json.dumps_into(self.data, self.jsonbuf)
self.uart.write(self.jsonbuf)
Changes required to uPy to make the above work:
1. Add json.dumps_into method.
2. Allow memoryview length to be updated in-place.

pfalcon
Posts: 1155
Joined: Fri Feb 28, 2014 2:05 pm

Re: Supporting zero-copy operations in MicroPython

Post by pfalcon » Thu Jan 15, 2015 9:55 am

Damien wrote:Here is a real-life scenario that we should try to make work without allocating heap memory:

At the moment we have to do it like this:
self.uart.write(json.dumps(self.data))
but that will allocate memory to create the JSON string. Ideally we want something that fills jsonbuf in place, e.g.:

Code: Select all

json.dumps_into(self.data, self.jsonbuf)
self.uart.write(self.jsonbuf)

No, we want to implement json.dump, which takes a stream (and, as expected, writes to it incrementally). Then we want to implement buffered streams, of course, so as not to be concerned about performance.
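I.e. the dump(obj, stream) form, which pushes output to the stream piece by piece instead of building one big string. A CPython-flavoured sketch (in MicroPython the module is ujson, and the stream would be the uart/socket itself, assuming dump() grows support for arbitrary streams):

Code: Select all

import io
import json

data = {"timestamp": 0, "x": 0, "y": 0, "z": 0}

# json.dump() writes the encoded output incrementally via stream.write(),
# so no single large string for the whole document is ever allocated.
out = io.StringIO()   # stands in for a buffered uart/socket wrapper
json.dump(data, out)
print(out.getvalue())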

But I seriously think we should implement a .write(data, offset, length) extension, as it's a very cheap and obvious optimization. But preferably, start a flamewar about it on python-ideas at the same time.
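To make the comparison concrete (the three-argument write() below is only the proposal, not an existing API; the buffer contents are made up):

Code: Select all

import io

payload = b'{"timestamp": 0, "x": 0, "y": 0, "z": 0}'
jsonbuf = bytearray(100)
length = len(payload)
jsonbuf[0:length] = payload

uart = io.BytesIO()   # stands in for the real UART/socket stream

# Status quo: no intermediate copy of the payload is made to build the
# argument, but every send still allocates a small memoryview object.
uart.write(memoryview(jsonbuf)[0:length])
print(uart.getvalue())

# Proposed extension (hypothetical, does not exist today):
#   uart.write(jsonbuf, 0, length)   # same effect, zero heap allocations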
Awesome MicroPython list
Pycopy - A better MicroPython https://github.com/pfalcon/micropython
MicroPython standard library for all ports and forks - https://github.com/pfalcon/micropython-lib
More up to date docs - http://pycopy.readthedocs.io/

Damien
Site Admin
Posts: 647
Joined: Mon Dec 09, 2013 5:02 pm

Re: Supporting zero-copy operations in MicroPython

Post by Damien » Thu Jan 15, 2015 10:41 am

pfalcon wrote: No, we want to implement json.dump.
Yes of course we do. I was looking for such existing functionality but managed to completely miss that!
Then we want to implement buffered streams, of course, so as not to be concerned about performance.
Yes, seems like a good idea.
But I seriously think we should implement a .write(data, offset, length) extension, as it's a very cheap and obvious optimization. But preferably, start a flamewar about it on python-ideas at the same time.
Yes, it's cheap since we just extend the existing write method in the stream protocol. But are there other places where we'd want to add such an extension? And what's wrong with memoryview?

pfalcon
Posts: 1155
Joined: Fri Feb 28, 2014 2:05 pm

Re: Supporting zero-copy operations in MicroPython

Post by pfalcon » Thu Jan 15, 2015 12:57 pm

Damien wrote: Yes, it's cheap since we just extend the existing write method in the stream protocol. But are there other places where we'd want to add such an extension? And what's wrong with memoryview?
It's probably better to say that it has the maximum (usefulness*obviousness)/cost ratio. I point to it because it's a very obvious way to support data-transfer operations without memory allocation. Using memoryview will still require memory allocations - small, and maybe rare, but allocations nonetheless.

Other places are pointed out at the beginning of this thread - in-place string operations on bytearrays, and/or functions which return memoryviews for string operations. But those are more complex concepts, not immediately obvious due to the Python 3 bytes/str dichotomy, so instead of trying to implement them, it's better to put effort into a compacting GC.
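Just to illustrate the "functions which return memoryviews" flavor, a hypothetical helper (not existing code):

Code: Select all

def strip_view(buf):
    # A "strip" that allocates no new data buffer: it returns a memoryview
    # into the original bytearray covering only the non-whitespace portion.
    start, end = 0, len(buf)
    while start < end and buf[start] in b" \t\r\n":
        start += 1
    while end > start and buf[end - 1] in b" \t\r\n":
        end -= 1
    return memoryview(buf)[start:end]

line = bytearray(b"  GET /index.html  \r\n")
v = strip_view(line)
print(bytes(v))   # b'GET /index.html' (bytes() copies only for display)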
Awesome MicroPython list
Pycopy - A better MicroPython https://github.com/pfalcon/micropython
MicroPython standard library for all ports and forks - https://github.com/pfalcon/micropython-lib
More up to date docs - http://pycopy.readthedocs.io/
