Understanding bytecode compiler and its limits

C programming, build, interpreter/VM.
Target audience: MicroPython Developers.
Post Reply
tomlogic
Posts: 13
Joined: Tue Jul 19, 2016 10:20 pm

Understanding bytecode compiler and its limits

Post by tomlogic » Fri Jun 23, 2017 12:40 am

I'd like to start understanding how the bytecode compiler and lexer work, and why it seems to quickly fill the heap.

For example, trying to compile a version of ftplib (using tabs for indentation instead of 4 spaces) passed to the compiler as a vstr (paste mode) or using mp_lexer_new_from_str_len(), I found success with a 48KB heap but failures in a 32KB heap. The following stats are from code I'm working on (based on the v1.9.1 commit) that relocates compiled bytecode from the heap to addressable flash so it can run in place.

Compiling 13781 bytes of source code...
Used 157/371 QSTR entries.
Compiled 13781 bytes of code to 8583/31232 bytes of flash.

So what I find interesting is that the bytecode and QSTR generated by the compile appears to only use 8583 bytes of the heap. I can move that to flash, power cycle my hardware, and execute it from there so it isn't dependent on anything else in the heap.

What else is in there during a compile? Is this an issue of fragmentation, or are there references in the heap that we can't release until the compiler/lexer are done? What tools should I make use of to understand/monitor the heap contents (maybe micropython.mem_info())?

Where's a good insertion point to instrument the lexer/compiler, if I wanted to dump memory statistics on each line parsed?

I'm willing to invest time in understanding how this portion of MicroPython works, researching the heap usage of a compile, and testing possible solutions to reduce the overhead. I've already poked around the lexer a bit and provided a bug fix and code optimization, but that's just a splash of water from the swimming pool. Hoping for some guidance on where to begin this exploration.

Thanks.

Damien
Site Admin
Posts: 647
Joined: Mon Dec 09, 2013 5:02 pm

Re: Understanding bytecode compiler and its limits

Post by Damien » Sat Jun 24, 2017 1:30 pm

If you are compiling from a vstr (paste mode) then you will need to have enough heap available for the entire script. Much less heap memory is needed if you instead compile from some other source, eg from a file. In the case of compiling from a file, characters that are fed into the lexer are read one at a time from the file (with a small amount of buffering), so there is only a small constant overhead needed in RAM, regardless of the size of the file (eg it could have a huge amount of comments in it, and this wouldn't require any extra heap to compile compared to the same file without comments).

Regarding tracing the lexer/parser/compiler: I've spent a lot of time in the past doing such tracing in order to optimise the parser and compiler. I simply trace the calls to malloc/free/realloc (printing the exact number of bytes that are allocated or freed) and then have a simple script to process this output and generate some data for gnuplot to graph it. You can see an example of such an analysis in this GitHub issue: https://github.com/micropython/micropython/issues/2417, in particular this comment: https://github.com/micropython/micropyt ... -247034375

The "parse-bytecode" branch of the main MicroPython repository has an alternative parser and compiler which uses about 50% the heap of the standard one. The main difference is that this "optimised" version stores the parse tree as a byte stream, rather than a tree with pointers, and so it is much more compressed. But it's also slower (although this hasn't been properly benchmarked) and the code size is larger because it's more complicated. The optimised version also has one big flaw: the entire parse "tree" (which is now a byte stream) must be stored in a contiguous chunk of memory on the heap. Compared to the standard parser/compiler which has the parse tree (which is a true tree in the sense of C structs with pointers) split over many chunks on the heap (although it's not completely fragmented, many parse tree nodes are stored adjacently on the heap if possible). So if you compile a large script with the "optimised" parser/compiler then you may run out of memory due to fragmentation and not being able to store the entire parse byte stream.

Regardless of the problems mentioned above, this alternative parser shows that there is room for improvement in the standard MicroPython parser/compiler. The main reason that it takes a decent amount of heap is that at its peak heap usage it must store the entire parse tree, all generated bytecode, and all support data structures (the scope, names of local variables and their mapping to offsets on the stack, etc). You can see that maximum in the plot on GitHub linked above. One idea to reduce the maximum heap usage is to compile outer-level functions/classes ASAP and then free their parse tree and associated data structures before the rest of the script is compiled, keeping only the bytecode. I didn't get around to working on this idea yet.

pfalcon
Posts: 1155
Joined: Fri Feb 28, 2014 2:05 pm

Re: Understanding bytecode compiler and its limits

Post by pfalcon » Mon Jun 26, 2017 7:25 pm

Few things I'd like to add:

1. We generally need to work on improving memory diagnostics to make MicroPython more "enterprisey". In particular, micropython.mem_info() is pretty bare means to analyze heap structure/usage. Someone should write GUI viewer, where you point at a memory block, and get complete info about what it's part of and complete info about the whole chunk.

2. Something I wanted to ask for a long time: Damien, why did you decide to write parser in such way, as LR(n) automaton at all? Did you consider LL(n=~1), recursive-descent parser? My assumed answer always has been "It's impossible to parse Python with LL(1)", but it always was just a random assumption, I wonder if someone thought and tried that actually.

tomlogic, you should also keep in mind that parser memory allocation is configurable with MICROPY_ALLOC* settings. These unlikely will play much role when parsing big scripts, but there's one important usecase - parsing small scripts, like REPL lines. For example, it's on my TODO to (try to) tweak esp8266's port settings even further down.
Awesome MicroPython list
Pycopy - A better MicroPython https://github.com/pfalcon/micropython
MicroPython standard library for all ports and forks - https://github.com/pfalcon/micropython-lib
More up to date docs - http://pycopy.readthedocs.io/

User avatar
deshipu
Posts: 1388
Joined: Thu May 28, 2015 5:54 pm

Re: Understanding bytecode compiler and its limits

Post by deshipu » Mon Jun 26, 2017 11:16 pm

Recursive descent algorithms have usually pretty nasty characteristics, both memory and time-wise. It would be interesting, however, if the Python grammar could be tackled with an operator-precedence parser -- those are linear both in terms of time and memory, and simple enough to be done by hand.

tomlogic
Posts: 13
Joined: Tue Jul 19, 2016 10:20 pm

Re: Understanding bytecode compiler and its limits

Post by tomlogic » Tue Jun 27, 2017 1:45 am

Damien wrote:If you are compiling from a vstr (paste mode) then you will need to have enough heap available for the entire script. Much less heap memory is needed if you instead compile from some other source, eg from a file.
Actually, in my case I'm compiling from source stored in addressable flash. So I set up the vstr outside of the MicroPython heap, and I thought I reviewed the lexer code to confirm that it just walks through the string without copying it to the heap.

Code: Select all

    vstr_init_fixed_buf(&vstr, SRC_FLASH_MAX_LEN, SRC_FLASH_ADDR);
    vstr.len = SRC_FLASH_LEN;
    parse_compile_execute(&vstr, MP_PARSE_FILE_INPUT,
      EXEC_FLAG_SOURCE_IS_VSTR | EXEC_FLAG_ALLOW_DEBUGGING | EXEC_FLAG_IS_REPL);
And my code for "compilation only" creates the lexer in a similar manner:

Code: Select all

mp_lexer_t *lex = mp_lexer_new_from_str_len(MP_QSTR__lt_stdin_gt_, SRC_FLASH_ADDR, SRC_FLASH_LEN, 0);
Damien wrote:One idea to reduce the maximum heap usage is to compile outer-level functions/classes ASAP and then free their parse tree and associated data structures before the rest of the script is compiled, keeping only the bytecode. I didn't get around to working on this idea yet.
I was wondering about this myself, as possible low hanging fruit. If we've completed compilation of a class or function definition, and are about to start on something new (non-comment and non-whitespace in column 1), is there anything we can release or let go of that we're definitely "done" with? I realize that there's always the top-level mp_raw_code_t structure that's "in progress", but after completing an inner mp_raw_code_t structure, I would think it possible to release resources associated with its creation.
pfalcon wrote:tomlogic, you should also keep in mind that parser memory allocation is configurable with MICROPY_ALLOC* settings. These unlikely will play much role when parsing big scripts, but there's one important usecase - parsing small scripts, like REPL lines.
Ooh, I like that. I might try to work on some tuning code. For example, how much wasted space is there with various settings of MICROPY_ALLOC_QSTR_CHUNK_INIT? And it would be useful to have the compiler report the line number where it ran out of memory.

I need to wrap up some other commitments before the end of the month, and then I'm taking a long holiday weekend, but I hope to dig into this when I get back. Thanks to both of you for the detailed information.

Damien
Site Admin
Posts: 647
Joined: Mon Dec 09, 2013 5:02 pm

Re: Understanding bytecode compiler and its limits

Post by Damien » Tue Jun 27, 2017 6:19 am

pfalcon wrote: 2. Something I wanted to ask for a long time: Damien, why did you decide to write parser in such way, as LR(n) automaton at all? Did you consider LL(n=~1), recursive-descent parser?
The parser/compile was the very first thing I wrote and it was a while ago and I don't remember the details of my decisions from back then. I don't think I did any formal checks on the type of grammar that Python has. The main aims were 1) to make it work at all on the pyboard; 2) use as little code space as possible. I was originally thinking that scripts would be compiled to bytecode and saved as .pyc files to flash, and then executed later on. So it was more important to optimise RAM usage during execution rather than during compilation.

If someone has time/inclination then they it would be interesting to investigate alternative ways of parsing/compiling. For example, it may be possible to compile directly from the Python script, ie not have a parsing stage at all and treat the script as the "parse tree". Kind of like how Espruino does it, executing the JavaScript text directly.

Post Reply