Unexpected difference between PYBD SF2 and SF6

The official PYBD running MicroPython, and its accessories.
Target audience: Users with a PYBD
Roberthh
Posts: 3667
Joined: Sat May 09, 2015 4:13 pm
Location: Rhineland, Europe

Unexpected difference between PYBD SF2 and SF6

Post by Roberthh » Wed Oct 02, 2019 7:29 am

I have a simple FFT test script — the one @pythoncoder used as a starting point for his tests. It behaves repeatably on an SF2 device but shows irregular behaviour on an SF6 device. Both run (almost) the same firmware build. The script is below. What's strange about the SF6 is that it takes between 2 and 6 times longer, and that it sometimes fails with a memory error. The slower execution may be caused by double-precision floats, but that doesn't explain the large variation in execution time. That variation is what puzzles me, along with the memory error despite the larger RAM. On an ESP32 with SPIRAM the script works without problems, at about the speed of the fastest PYBD_SF6 times (single-precision floats).

Code:

# algorithms.py
# 14th March 2015
# Examples of Discrete Fourier Transform code in Python
# http://forum.micropython.org/viewtopic.php?f=2&t=208&hilit=fft
#
import math, cmath, time
import gc

# Recursive algorithm: simple but slow and stack-hungry
# Enter with an array of complex numbers of length 2**N


def fft_recursive(x):
    n = len(x)
    if n <= 1:
        return x
    even = fft_recursive(x[0::2])
    odd = fft_recursive(x[1::2])
    return [even[m] + cmath.exp(-2j*math.pi*m/n)*odd[m] for m in range(n//2)] + \
           [even[m] - cmath.exp(-2j*math.pi*m/n)*odd[m] for m in range(n//2)]

# Code used as the basis for the assembler routine. Cooley-Tukey algorithm with
# twiddle factors precomputed.


def fft(nums, roots, forward=True):
    n = len(nums)
    m = int(math.log(n)/math.log(2))
    # n= 2**m
    # Do the bit reversal
    i2 = n >> 1
    j = 0
    for i in range(n-1):
        if i < j:
            nums[i], nums[j] = nums[j], nums[i]
        k = i2
        while (k <= j):
            j -= k
            k >>= 1
        j += k
    # Compute the FFT
    l2 = 1
    for l in range(m):
        c = roots[l]
        if forward:
            c = c.real - c.imag * 1j
        l1 = l2
        l2 <<= 1
        u = 1 + 0j
        for j in range(l1):
            for i in range(j, n, l2):
                i1 = i+l1
                t1 = u*nums[i1]
                nums[i1] = nums[i] - t1
                nums[i] += t1
            u *= c
    # Scaling for forward transform
    if forward:
        for i in range(n):
            nums[i] /= n
    return nums


def buildarrays(length):
    bits = int(math.log(length)/math.log(2))
    roots = []
    c = -1+0j
    roots.append(c)    # Complex roots of unity
    for x in range(bits):
        cimag = math.sqrt((1.0 - c.real) / 2.0)  # Imaginary part
        creal = math.sqrt((1.0 + c.real) / 2.0)  # Real part
        c = creal + cimag*1j
        roots.append(c)

    re = [0+0j]*length
    return re, roots

# Test/demo for above algorithm

def printlist(q):
    for t in q:
        tc = t + (0 + 0j)
        print("[{:5.2f}{:5.2f}] ".format(tc.real, tc.imag), end="")
    print()

def run(n=1024):
    nums, roots = buildarrays(n)
    print("Initial")
    gc.collect()
    try:
        for x in range(len(nums)):
            nums[x] = 0.1 + math.cos(2*math.pi*x/len(nums))
        start = time.ticks_ms()
        fft_recursive(nums)
        print("Recursive fft", time.ticks_diff(time.ticks_ms(), start)/1000)
    except Exception as e:
        print("Recursive fft failed: ", e)
    for x in range(len(nums)):
        nums[x] = 0.1 + math.cos(2*math.pi*x/len(nums))
    gc.collect()
    start = time.ticks_ms()
    fft(nums, roots, True)
    print("Forward fft", time.ticks_diff(time.ticks_ms(), start)/1000)
    gc.collect()
    start = time.ticks_ms()
    # printlist(nums)
    fft(nums, roots, False)
    print("Reverse fft", time.ticks_diff(time.ticks_ms(), start)/1000)
    # start = time.ticks_ms()
    # printlist(nums)
    # print("List", time.ticks_diff(time.ticks_ms(), start)/1000)
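As a board-independent sanity check of the algorithm itself, the recursive FFT can be compared against a naive O(n²) DFT on desktop Python. This is my own sketch, not part of the original test script:

```python
# Cross-check: the (unscaled) recursive FFT from the script above
# should match a direct DFT sum on the same input.
import math, cmath

def fft_recursive(x):
    n = len(x)
    if n <= 1:
        return x
    even = fft_recursive(x[0::2])
    odd = fft_recursive(x[1::2])
    return [even[m] + cmath.exp(-2j*math.pi*m/n)*odd[m] for m in range(n//2)] + \
           [even[m] - cmath.exp(-2j*math.pi*m/n)*odd[m] for m in range(n//2)]

def dft(x):
    # Direct evaluation of X[m] = sum_k x[k] * e^(-2*pi*i*k*m/n)
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j*math.pi*k*m/n) for k in range(n))
            for m in range(n)]

n = 16
sig = [0.1 + math.cos(2*math.pi*x/n) for x in range(n)]
err = max(abs(a - b) for a, b in zip(fft_recursive(sig), dft(sig)))
print(err < 1e-9)  # True: both agree to rounding error
```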

jimmo
Posts: 2754
Joined: Tue Aug 08, 2017 1:57 am
Location: Sydney, Australia

Re: Unexpected difference between PYBD SF2 and SF6

Post by jimmo » Wed Oct 02, 2019 7:49 am

Interesting!!

I can repro this on a build at master.

SF2:

Code:

Recursive fft 0.849
Forward fft 0.157
Reverse fft 0.147
(very little variance over runs)

SF6:

Code:

Recursive fft 3.075
Forward fft 0.914
Reverse fft 1.136
(with a significant variance as you say, haven't seen the memory error yet)

I tried a build for SF6 with single-fp.

Code:

Recursive fft 0.518
Forward fft 0.118
Reverse fft 0.108
(very little variance)


Re: Unexpected difference between PYBD SF2 and SF6

Post by jimmo » Wed Oct 02, 2019 2:24 pm

I think what's going on here is that complex numbers with double-fp are maximally tough on the heap, because they no longer fit into a single GC block. The default block size is four words, and a double-fp complex needs five. If I set the default block size to 8, the performance problems disappear (and the SF6 comes out about 30% faster than the SF2).
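The block arithmetic behind that is easy to sketch (assuming a 32-bit target with 4-byte words and a one-word object header — the exact layout here is my assumption, not taken from this thread):

```python
# Blocks needed for a boxed complex, assuming: 1-word object header
# plus two floats (1 word each single-fp, 2 words each double-fp).
def blocks_needed(obj_words, block_words):
    return -(-obj_words // block_words)  # ceiling division

single_fp_complex = 1 + 2 * 1  # 3 words
double_fp_complex = 1 + 2 * 2  # 5 words

print(blocks_needed(single_fp_complex, 4))  # 1: fits the 4-word default
print(blocks_needed(double_fp_complex, 4))  # 2: spills into a second block
print(blocks_needed(double_fp_complex, 8))  # 1: fits an 8-word block
```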

To double-check that theory, on the SF2 I added two dummy single-float members to the complex type (keeping the default four-word GC block). That reproduced exactly the same performance regression.

This test code barely spends any time doing actual maths; it's almost entirely dominated by creating float and complex objects. I guess that explains why the number of GC blocks per object has such a big impact.

So... I'm not sure what to suggest. Increasing the default GC block size on double-fp targets may not be an option, because it makes everything else use so much more RAM.

If nothing else, I did find a minor optimisation that makes sqrt ~35% faster when double-fp is enabled (it wasn't using the hardware sqrt instruction).


Re: Unexpected difference between PYBD SF2 and SF6

Post by Roberthh » Wed Oct 02, 2019 6:13 pm

Hello @jimmo. Thank you for the testing. Lesson learnt: if the precision is not required, use single-precision complex numbers. The only drawback is that this needs a separate firmware image, and it affects all floats.
One added observation for ESP32 boards: the non-SPIRAM variant runs about twice as fast as the SPIRAM variant (150 vs 330 ms), and only a little slower than the PYBD_SF2 (150 vs 100 ms). SPIRAM access is obviously slow.

Damien
Site Admin
Posts: 647
Joined: Mon Dec 09, 2013 5:02 pm

Re: Unexpected difference between PYBD SF2 and SF6

Post by Damien » Thu Oct 03, 2019 2:19 am

Very interesting!

Note that the PYBD does not run at its maximum possible frequency by default: the SF2 defaults to 120MHz and the SF6 to 144MHz. This can be increased to the maximum via machine.freq(216_000_000). Raw performance should scale roughly linearly with clock speed.

Maybe it's worth increasing the GC block size to 8 (or 6, not sure if that'll work) when double-precision FP is enabled on these boards (all stm32 targets support double-fp; those without HW support just use SW routines).

We can also provide single-fp builds of the SF6 firmware; that's easy to do.

But in the end, this kind of computation is best left to optimised C routines rather than written purely in Python.

chuckbook
Posts: 135
Joined: Fri Oct 30, 2015 11:55 pm

Re: Unexpected difference between PYBD SF2 and SF6

Post by chuckbook » Thu Oct 03, 2019 1:56 pm

Just out of curiosity:
The code example gives

Code:

>>> run()
Initial
Recursive fft 0.358
Forward fft 0.077
Reverse fft 0.072
on a PYBD SF6, single-precision build, at 216 MHz
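As a cross-check against Damien's point about clock scaling: if jimmo's single-fp SF6 numbers above were taken at the default 144MHz (that default is my assumption for his build), the ratios to these 216MHz figures should sit close to 216/144 = 1.5:

```python
# jimmo's single-fp SF6 times (assumed 144 MHz) over chuckbook's 216 MHz times
pairs = [("Recursive", 0.518, 0.358),
         ("Forward",   0.118, 0.077),
         ("Reverse",   0.108, 0.072)]
for name, t_slow, t_fast in pairs:
    print(name, round(t_slow / t_fast, 2))  # ~1.45-1.53, near the 1.5 clock ratio
```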


Re: Unexpected difference between PYBD SF2 and SF6

Post by jimmo » Thu Oct 03, 2019 2:11 pm

I figured out the underlying reason why the two-block complex objects are so much slower than single-block ones: to prevent fragmentation, only single-block allocations may advance the heap's "next free block" pointer. So when all your code does is multi-block allocations, every allocation has to scan an increasingly large heap to find a free block.
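A toy model of that allocator rule illustrates the effect (a deliberately simplified sketch of the idea, not the real gc.c logic): single-block allocations bump a "next free" hint, multi-block allocations don't, so each multi-block allocation re-scans everything allocated since the hint last moved.

```python
# Toy first-fit heap: only 1-block allocations may advance the
# "next free block" hint, so repeated multi-block allocations
# re-scan from the same starting point every time.
class ToyHeap:
    def __init__(self, nblocks):
        self.used = [False] * nblocks
        self.hint = 0       # "next free block" pointer
        self.scanned = 0    # positions examined, to measure the cost

    def alloc(self, k):
        i = self.hint
        while i + k <= len(self.used):
            self.scanned += 1
            if not any(self.used[i:i+k]):
                for j in range(i, i + k):
                    self.used[j] = True
                if k == 1:  # only single-block allocs move the hint
                    self.hint = i + 1
                return i
            i += 1
        raise MemoryError

single = ToyHeap(10_000)
for _ in range(500):
    single.alloc(1)         # like single-fp complex: one block each

multi = ToyHeap(10_000)
for _ in range(500):
    multi.alloc(2)          # like double-fp complex: two blocks each

print(single.scanned, multi.scanned)  # 500 vs 250000: scans grow quadratically
```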

I've got a very easy workaround for the complex type, but I'll look into whether we can make a more generic fix.

Thanks for the report @Roberthh !!
