Does the cross compiler make your code run slower?

appels · Post by **appels** » Mon Jan 11, 2021 2:03 pm

Hi,

I was testing out the cross compiler to see how much faster my code could run if the microcontroller didn't have to compile it anymore.

After reading the following posts I realised that I couldn't just cross compile main.py to main.mpy:
viewtopic.php?f=6&t=8410&p=47777&hilit=main.mpy#p47777
viewtopic.php?f=2&t=7627&p=43920&hilit= ... ler#p43920

So I did what the posts suggested: I compiled my code under a different name and imported that .mpy module in main.py.

To test this I wrote a program that wakes the microcontroller up and puts it back to sleep (with a timer for 1s) and just keeps on doing this.
I used an ESP32_DevKitC_V4 with the WROOM-32 module.

My original code (main.py) is the following:

import machine
machine.deepsleep(1000)

Then I used the cross compiler to compile the previous code and I named it "my_code.mpy"
I loaded this file to the flash memory. (So no frozen module)
And then my new code (main.py) became the following:

import my_code

When I run the code, the current trace shows a recurring pattern: the microcontroller sleeps at a constant deep-sleep current, then wakes up and the current changes for a while (this pattern is always the same) until it goes back to sleep for 1 second.
When I zoom in on one active region I get the following graph (green = original code and red = new code with cross compiler):

So as you can see there isn't much difference between the two (green = 685 ms and red = 700 ms).
But what surprised me is that the cross-compiled code actually takes 15 ms longer to execute!

Am I doing something wrong or is this always the case with the cross compiler and is it only meant for using less RAM?
And also a second question: can you see on the graph where the compilation takes place and how long it takes?

Thanks in advance!

stijn · Post by **stijn** » Mon Jan 11, 2021 2:08 pm

appels wrote: ↑
Mon Jan 11, 2021 2:03 pm
Am I doing something wrong or is this always the case with the cross compiler and is it only meant for using less RAM?

I don't know, but what comes to mind here: it would make more sense to test this with an actual application instead of a toy example which doesn't do much at all, and to measure execution time between certain points by e.g. toggling output pins (which is slightly related to your second question: you don't really know what is going on now).

appels · Post by **appels** » Mon Jan 11, 2021 3:05 pm

stijn wrote: ↑
Mon Jan 11, 2021 2:08 pm
I don't know, but what comes to mind here: it would make more sense to test this with an actual application instead of a toy example which doesn't do much at all, and to measure execution time between certain points by e.g. toggling output pins (which is slightly related to your second question: you don't really know what is going on now).

So I tried what you suggested and wrote some code that toggles a pin 50,000 times before going to sleep.
And there is indeed a difference. Now the cross-compiled code is faster (by 10 ms).
But the difference is still barely noticeable (1.81 s vs 1.82 s).

(green = original code and red = new code with cross compiler)

This is the code:

import machine
testPin = machine.Pin(17, machine.Pin.OUT)
for x in range(50000):
    testPin.on()
    testPin.off()
machine.deepsleep(1000)

stijn · Post by **stijn** » Mon Jan 11, 2021 7:15 pm

That's still not 'real life' code. I should have been clearer: I meant toggling pin(s) and measuring those pin(s) with an oscilloscope, to measure the execution time of the code between the toggles. And that code should be something resembling an actual application: now you're just measuring time differences of things which do not matter a lot.
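
To make the idea concrete, here is a minimal sketch of bracketing a section of code with a marker pin, so that the pulse width seen on an oscilloscope equals the execution time of that section. The `workload` function is a hypothetical placeholder for real application code, pin 17 is simply the pin used in the test above, and the stub `Pin` class is an assumption added only so the sketch also runs on desktop Python.

```python
try:
    from machine import Pin  # available on MicroPython boards
except ImportError:
    class Pin:
        """Stand-in for machine.Pin so the sketch runs off-target (assumption)."""
        OUT = 1

        def __init__(self, pin_id, mode=None):
            self.pin_id = pin_id
            self.level = 0
            self.edges = 0  # counts level changes, for checking only

        def on(self):
            self.level = 1
            self.edges += 1

        def off(self):
            self.level = 0
            self.edges += 1


marker = Pin(17, Pin.OUT)  # any free GPIO would do


def workload():
    # Hypothetical placeholder for the application code being measured.
    total = 0
    for i in range(1000):
        total += i * i
    return total


marker.on()          # rising edge: measured section starts
result = workload()
marker.off()         # falling edge: measured section ends
```

Using one pin per section of interest (or several short pulses on one pin) lets the scope show where the time goes between wake-up and sleep.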

Roberthh · Post by **Roberthh** » Mon Jan 11, 2021 8:43 pm

Execution times on an ESP32 can vary quite a bit, even for the same code. The reason is the virtual style of memory. Both flash and SPI RAM are mapped to a virtual memory space through a limited-size cache area, from which code is executed or through which data is handled. The interface to flash and RAM is SPI. When a cache miss occurs, the missing block has to be fetched from the external memory. That takes some time, like 200-300µs. If background activity happens, like WiFi or other communication, even the execution time of identical code with the same data may vary. It is all related to the varying state of the cache.
Architectures with direct mapped memory like the STM32 have a more reproducible and faster execution time.
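
One rough way to see this run-to-run variation is to time the same function several times and compare the fastest and slowest run. The following is only a sketch: `measure_spread` is a made-up helper name, and the fallback branch is an assumption added so that it also runs on desktop Python, where `time.ticks_us` does not exist.

```python
try:
    from time import ticks_us, ticks_diff  # MicroPython
except ImportError:
    from time import perf_counter_ns       # desktop Python fallback

    def ticks_us():
        return perf_counter_ns() // 1000

    def ticks_diff(end, start):
        return end - start


def measure_spread(func, runs=20):
    """Call func `runs` times; return (fastest, slowest) duration in microseconds."""
    samples = []
    for _ in range(runs):
        start = ticks_us()
        func()
        samples.append(ticks_diff(ticks_us(), start))
    return min(samples), max(samples)


def busy():
    # Small fixed workload; identical code and data on every call.
    total = 0
    for i in range(2000):
        total += i
    return total


fastest, slowest = measure_spread(busy)
print("fastest:", fastest, "us, slowest:", slowest, "us")
```

A large gap between the two numbers for identical code would be consistent with the cache effects described above, although other sources of jitter (interrupts, garbage collection) can contribute as well.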

Pre-compiled and on-board-compiled code should, per se, have the same execution time. Only the location in memory will most likely differ.

About your first post. I understand that you had two set-ups:
a) one main.py file, containing

import machine
machine.deepsleep(1000)

b) a main.py file and a my_code.mpy file, where main.py consists of "import my_code", and my_code.mpy is the compiled version of case a)

In that scenario, case b will be a little bit slower, because it executes an additional import, which requires file access (search the directory, read file, ...).
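
If you want to put a number on that import overhead, something along these lines could work. It is a sketch, not a tested benchmark: `timed_import` is a hypothetical helper, and the fallback branch is only there so it runs on desktop Python too.

```python
try:
    from time import ticks_us, ticks_diff  # MicroPython
except ImportError:
    from time import perf_counter_ns       # desktop Python fallback

    def ticks_us():
        return perf_counter_ns() // 1000

    def ticks_diff(end, start):
        return end - start


def timed_import(name):
    """Import a module by name; return (module, elapsed microseconds)."""
    start = ticks_us()
    module = __import__(name)
    elapsed = ticks_diff(ticks_us(), start)
    return module, elapsed


# On the board this would be timed_import("my_code"); "json" is used here
# only because it exists on desktop Python.
module, elapsed_us = timed_import("json")
print("import took", elapsed_us, "us")
```

Note that a module ending in `machine.deepsleep(...)`, like the one in the first post, never returns from the import, so the sleep call would have to be moved out of the module for this measurement. Only the first import of a module does the file access; repeated imports are served from the module cache.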

jimmo · Post by **jimmo** » Tue Jan 12, 2021 12:03 am

To elaborate on what Roberthh and stijn have said:

What you're comparing is:

a)
- Load the python code from flash
- Parse it
- Generate bytecode, save to ram
- Execute bytecode

b)
- Load the bytecode from flash, copy it to ram
- Execute bytecode

My guess is that the time spent is largely dominated by the flash operations, especially if your code is a tight loop (i.e. it has long execution time but it's not actually very much code).

The only way to really analyse this stuff is to instrument the code and measure when various points are reached (e.g. toggle a pin, etc).
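
As a rough illustration of splitting a) into its stages, the sketch below times the compile step and the execute step separately. It assumes the firmware was built with the `compile` builtin enabled, and it does not capture the time to read the file from flash, since the source is an in-memory string here. `SOURCE` is just a made-up snippet, and the fallback branch exists only so this also runs on desktop Python.

```python
try:
    from time import ticks_us, ticks_diff  # MicroPython
except ImportError:
    from time import perf_counter_ns       # desktop Python fallback

    def ticks_us():
        return perf_counter_ns() // 1000

    def ticks_diff(end, start):
        return end - start


# Made-up test program: a tight loop, so little code but some run time.
SOURCE = "total = 0\nfor i in range(1000):\n    total += i\n"

t0 = ticks_us()
code = compile(SOURCE, "<test>", "exec")  # parse + generate bytecode (case a only)
t1 = ticks_us()
namespace = {}
exec(code, namespace)                     # execute bytecode (both cases)
t2 = ticks_us()

compile_us = ticks_diff(t1, t0)
exec_us = ticks_diff(t2, t1)
print("compile:", compile_us, "us, execute:", exec_us, "us")
```

For a tight loop like this, the compile step only sees three lines of source, which fits the guess that pre-compiling saves little when the program is small.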

pythoncoder · Post by **pythoncoder** » Tue Jan 12, 2021 6:02 am

I find that small, optimised routines run fast on an ESP32 (presumably because in my tests the code probably stayed in the cache).

However, complete applications run dramatically slower than on a Pyboard 1.1 or D. I have observed this on a GUI library for the official display. It's also been encountered by users of my asynchronous GPS library. It is hard to put a number to this slowdown as it's highly variable, but it's in the range of 3 to 20 times slower. In the case of the GUI, the code which is dramatically slower is the screen redraw, which is standard synchronous Python. See this RFC.

MicroPython Forum (Archive)

The ESP32 makes your code run slower.