.isalpha() works for ascii only :-(

HermannSW · Post by **HermannSW** » Tue Oct 16, 2018 7:09 pm

With Python 2 I counted the "word" characters in Unicode today as 99537 (those for which .isalpha() returns True).
Here I counted again: for the whole Unicode set, for the first 256, and for the first 128 Unicode characters:

Code: Select all

>>> for i in 0x10ffff, 0xff, 0x7f:
...     c=0
...     for j in range(i+1):
...         if unichr(j).isalpha():
...             c+=1
...     print(c)
... 
99537
117
52
>>>
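
For readers on Python 3, where unichr() is gone and chr() covers the full range, the same count can be sketched as below. The helper name count_alpha is made up for this illustration. The total for the whole range depends on the Unicode version bundled with the interpreter, so only the two small ranges should be expected to match the numbers above.

```python
def count_alpha(limit):
    """Count code points in range(limit + 1) whose one-character string is alphabetic."""
    return sum(1 for cp in range(limit + 1) if chr(cp).isalpha())


# Same three ranges as in the Python 2 session: all of Unicode, Latin-1, ASCII.
for limit in (0x10FFFF, 0xFF, 0x7F):
    print(hex(limit), count_alpha(limit))
```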

In MicroPython the chr() function does what unichr() does in Python 2.
But the .isalpha() counts are the same for the first 256 and the first 128 Unicode characters, although we have just seen that they really are 117 and 52:

Code: Select all

>>> 
MicroPython v1.9.4-272-g46091b8a on 2018-07-18; ESP module with ESP8266
Type "help()" for more information.
>>> for i in 0xff, 0x7f:
...     c=0
...     for j in range(i+1):
...         if chr(j).isalpha():
...             c+=1
...             
...             
...     print(c)
...     
...     
... 
52
52
>>> print(chr(8364))
€
>>>

So this experimentally proves that .isalpha() seems to be restricted to ASCII.
If that is correct, is there an alternative in MicroPython() that works as .isalpha() in Python2?

pfalcon · Post by **pfalcon** » Wed Oct 17, 2018 5:12 pm

So this experimentally proves that .isalpha() seems to be restricted to ASCII.

Sure. How bloated the micropython binary would be otherwise (and how little would be left for your apps)?

is there an alternative in MicroPython() that works as .isalpha() in Python2?

Who asks? Must be a complete Python noob who didn't even read the tutorials and who doesn't know that there're many ways to test set containment in Python, starting from:

Code: Select all

print("ф" in "фывапролджэ")

HermannSW · Post by **HermannSW** » Mon Oct 22, 2018 3:27 pm

There is no reason to respond so aggressively; I saw a similar tone in another recent posting of yours to somebody else.
Your example does not even match the question.
And I was able to implement isalpha() for the complete Unicode range as a module that runs even on ESP8266, without touching the firmware.
Seems that I am not the noob you stated.

I started by looking at the distribution of Unicode characters with isalpha True.
It turned out that all Unicode characters at 0x030000 and above have isalpha False.
Next I looked at 256-character pages; there are 768 of them below 0x030000.
There are 339 pages with 256 False entries and 353 pages with 256 True entries for isalpha.
Only 76 pages have mixed True/False entries.

It turned out that isalpha has more True values in Python 3 (101013) than the 99537 in Python 2.
"gen_blk.py" is a Python 3 script that outputs the MicroPython script "isalpha.py".
The biggest part is "ISALPHA", a mixed array of 768 entries, each either a boolean or a 256-bit integer.
Function "isalpha(str)" is the implementation for the whole of Unicode.
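
The generator script itself is only attached, not shown, so the following is a minimal sketch of how such a mixed table could be built and queried on Python 3. The names build_table and lookup are hypothetical, and the built-in str.isalpha stands in for whatever data the real script uses. Note that the 0x030000 cutoff is tied to the interpreter's Unicode data: later Unicode versions assign CJK ideographs at and above that value, so a newer Python may count more letters than the post reports.

```python
PAGE = 256
LIMIT = 0x030000  # cutoff taken from the post; assumption: nothing alphabetic above it


def build_table(pred=str.isalpha, limit=LIMIT):
    """Return one entry per 256-character page: True, False, or an int bitmask."""
    table = []
    for base in range(0, limit, PAGE):
        mask = 0
        for off in range(PAGE):
            if pred(chr(base + off)):
                mask |= 1 << off          # bit 'off' set means character is alphabetic
        if mask == 0:
            table.append(False)           # uniformly non-alphabetic page
        elif mask == (1 << PAGE) - 1:
            table.append(True)            # uniformly alphabetic page
        else:
            table.append(mask)            # mixed page, keep the 256-bit integer
    return table


def lookup(table, cp):
    """Table-driven test for one code point; False beyond the table."""
    if cp >= len(table) * PAGE:
        return False
    entry = table[cp // PAGE]
    if isinstance(entry, bool):
        return entry
    return bool((entry >> (cp % PAGE)) & 1)


table = build_table()
mixed = sum(1 for e in table if not isinstance(e, bool))
print(len(table), "pages,", mixed, "of them mixed")
```

The page counts printed here will only match the 339/353/76 split from the post on an interpreter with the same Unicode data as the author's.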

Since all character processing like "for c in str" creates length-1 strings, these strings end up as interned strings.
I added a "tst(strt,len)" function for testing "isalpha()".
But that runs out of memory even on an ESP32 with 100KB of free memory when testing all 1114112 Unicode characters
(because of the more than 1 million interned strings).

So I added a "tst_(strt,len)" function that passes the ord value of each Unicode character straight to function "_()", which "isalpha()" calls as well.
That allows testing the whole Unicode character set.
"tstPlanes()" is a convenience function that tests all 17 planes (default) or, with a False argument, only the first 3 planes (65536 characters each).

"tstPlanes()" correctly returns 101013 isalpha characters found; the 2nd value is the total time in ms, the 3rd the average time per call in ms.
I tested on an Intel laptop with Python 3,

Code: Select all

$ python3
Python 3.4.9 (default, Aug 14 2018, 21:28:57) 
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import isalpha
>>> isalpha.tstPlanes(False)
(101013, 202.113525390625, 0.0010280025502045949)
>>>

on ESP32

Code: Select all

>>> 
MicroPython v1.9.4-623-g34af10d2e on 2018-10-03; ESP32 module with ESP32
Type "help()" for more information.
>>> import isalpha
>>> isalpha.tstPlanes(False)
(101013, 11504, 0.05851237)
>>>

as well as ESP8266:

Code: Select all

>>> 
MicroPython v1.9.4-272-g46091b8a on 2018-07-18; ESP module with ESP8266
Type "help()" for more information.
>>> import isalpha
>>> isalpha.tstPlanes(False)
(101013, 38704, 0.196859)
>>>
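
The test functions are only in the attached files, so here is a rough sketch of a harness that produces the same kind of three-value result (count, total ms, average ms per call) while iterating over integer code points rather than strings. The names count_range and alpha_by_ord are hypothetical. The stand-in predicate below calls the built-in via chr(), which is fine on a desktop; on a board the table lookup "_()" would take its place, and MicroPython's time.ticks_ms() would be the usual timer instead of time.time().

```python
import time


def count_range(pred, start, length):
    """Return (matches, elapsed_ms, avg_ms_per_call) for pred over a code point range."""
    t0 = time.time()
    hits = 0
    for cp in range(start, start + length):
        if pred(cp):                      # predicate receives the integer, not a string
            hits += 1
    elapsed_ms = (time.time() - t0) * 1000.0
    return hits, elapsed_ms, elapsed_ms / length


def alpha_by_ord(cp):
    # Stand-in only: the real module would test cp against its table here.
    return chr(cp).isalpha()


print(count_range(alpha_by_ord, 0, 3 * 65536))
```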

Python 3 on the laptop needs about 1µs, MicroPython on ESP32 about 58µs, and on ESP8266 about 197µs per function call on average.
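
The per-call figures follow directly from the totals in the three sessions, since tstPlanes(False) makes 3*65536 calls. A small worked check, using only numbers quoted above:

```python
CALLS = 3 * 65536  # tstPlanes(False) covers the first three planes

# Total run times in milliseconds, copied from the three sessions.
runs_ms = {
    "Python3 laptop": 202.113525390625,
    "ESP32": 11504,
    "ESP8266": 38704,
}

# Convert total milliseconds into microseconds per call.
per_call_us = {name: ms / CALLS * 1000.0 for name, ms in runs_ms.items()}

for name, us in per_call_us.items():
    print("%-15s %.1f us per call" % (name, us))
```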

Python 3 reports sys.getsizeof(ISALPHA)=6232. But importing the module in MicroPython takes about 10KB of memory, shown here for ESP8266:

Code: Select all

>>> 
MicroPython v1.9.4-272-g46091b8a on 2018-07-18; ESP module with ESP8266
Type "help()" for more information.
>>> gc.collect()
>>> gc.mem_free()
28608
>>> import isalpha
>>> gc.mem_free()
16848
>>> gc.collect()
>>> gc.mem_free()
18416
>>>

This is a part of ISALPHA array near end:

Code: Select all

..., True, True,
115792089237316195423570985008687907853269984665640564039439146271038674829311, 
1073741823, 
False, False, ...
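
To make the two integer entries less opaque, they can be decoded bit by bit: bit n of an entry stands for the character at offset n within its 256-character page. The helper names below are made up for this illustration.

```python
# The two mixed-page entries quoted from the ISALPHA array.
entry_a = 115792089237316195423570985008687907853269984665640564039439146271038674829311
entry_b = 1073741823


def clear_bits(mask, width=256):
    """Offsets within the page whose bit is 0 (isalpha False)."""
    return [off for off in range(width) if not (mask >> off) & 1]


def set_bits(mask, width=256):
    """Offsets within the page whose bit is 1 (isalpha True)."""
    return [off for off in range(width) if (mask >> off) & 1]


print("entry_a clear offsets:", clear_bits(entry_a))
print("entry_b set offsets:  ", set_bits(entry_b))
```

The first entry turns out to be a page that is alphabetic everywhere except offsets 53 to 63, and the second a page where only the first 30 characters are alphabetic.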

This is function isalpha, calling out to "_()":

Code: Select all

def isalpha(str):
    for c in str:
        if not(_(ord(c))):
            return False

    return True

And here is where the work is done:

Code: Select all

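def _(i):
    # no isalpha characters at or above 0x030000
    if i>=0x030000:
        return False
    else:
        b = i//256                      # index of the 256-character page
        if type(ISALPHA[b])==type(True):
            return ISALPHA[b]           # page is uniformly True or False
        else:
            # mixed page: test bit (i%256) of the 256-bit integer
            return ISALPHA[b] & (1<<(i%256)) != 0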

Summary:

isalpha.py can be used as long as 10KB of free memory is available
ESP8266 average "isalpha()" time is only about 3.4 times that of ESP32
files attached

HermannSW · Post by **HermannSW** » Wed Oct 24, 2018 4:16 pm

I found out that isalpha() is not what is needed to generate the "word" count value for "wc".

So I ran this command to extract what Linux "wc" considers a word character:

Code: Select all

$ time ( for((i=0; i<17*65536; ++i)); do python3 -c "import sys;sys.stdout.write(chr($i))" | wc -w; done > out )

It ran for 433 minutes(!), more than 7 hours, on an i7 laptop.
It generates 17*65536=1114112 one-character documents that get piped to "wc -w".
The "wc" word characters outnumber the isalpha characters by far:

Code: Select all

$ grep "^1" out | wc --lines
238046
$

Even when restricted to planes 0-2, there are more word characters than the 101013 isalpha characters:

Code: Select all

$ head -$((3*65536)) out | grep "^1" | wc --lines
106641
$

The very first isalpha character is chr(65)='A'.
All characters chr(33)-chr(64) are word characters, space character chr(32) is not.

It turned out that the 131405 word characters at 0x030000 and above (in fact all at 0x0e0000 and above) can be determined easily with a few "if" statements:

Code: Select all

    if i>=0x030000:
        if (i<0x0e0000):
            return False
        elif i>=0x0f0000:
            return (i%65536)<0xfffe
        else:
            return (i==0x0e0001) | (0x0e0020<=i<=0x0e007f) | (0x0e0100<=i<=0x0e01ef)
    else:
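
The count of 131405 can be cross-checked against this rule without rerunning "wc": it is the difference between the two totals above (238046 - 106641), and a brute-force loop over the rule gives the same number. The function name is_word_high is made up for this check, and "or" is used in place of "|", which does not change the result.

```python
def is_word_high(i):
    """Rule from the post for code points at or above 0x030000."""
    if i < 0x0E0000:
        return False                      # planes 3..13: no word characters
    if i >= 0x0F0000:
        return (i % 65536) < 0xFFFE       # planes 15/16: all but the last two per plane
    return i == 0x0E0001 or 0x0E0020 <= i <= 0x0E007F or 0x0E0100 <= i <= 0x0E01EF


high_count = sum(1 for i in range(0x030000, 0x110000) if is_word_high(i))
print(high_count)
```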

As for isalpha, only 76 of the 768 256-character pages below 0x030000 are neither all False nor all True for isword.
So I could rewrite the "gen_blk.py" script used to generate "isalpha.py".
Now the "gen_blk_wc.py" script generates the "isword.py" script.
Similar to isalpha, "import isword" costs 10KB of RAM.

Find "gen_blk_wc.py" and details on the upysh "wc" here:
viewtopic.php?f=15&t=233&p=31320#p31320

HermannSW · Post by **HermannSW** » Wed Oct 31, 2018 5:16 pm

I had an idea for reducing the RAM usage of isword.py significantly, described in this posting:
viewtopic.php?f=15&t=233&start=20#p31371

Instead of storing a huge bit array, this time the runs of 0s and 1s are determined, and the indices where the value changes are found by binary search. The new isword.py now occupies only 3056 bytes of RAM, down from 10KB before; it is available through this commit.
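
The generator side of this idea is not shown in the thread, so here is a minimal sketch of how the run boundaries for one 65536-character plane could be collected and queried on Python 3. The names run_ends and lookup_runs are hypothetical, the demo predicate is the built-in isalpha rather than the wc-derived data, and the standard bisect module does the search here where the post uses a hand-written one.

```python
from array import array
from bisect import bisect_left


def run_ends(pred, size=65536):
    """Last index of every run of equal pred() values in range(size)."""
    ends = array('H')                     # 16-bit entries, enough for one plane
    prev = pred(0)
    for i in range(1, size):
        cur = pred(i)
        if cur != prev:
            ends.append(i - 1)            # previous run ended just before i
            prev = cur
    ends.append(size - 1)                 # sentinel: the final run ends at size-1
    return ends


def lookup_runs(ends, first, i):
    """Value of the predicate at i, given the run ends and the value of run 0."""
    run = bisect_left(ends, i)            # index of the run that contains i
    return first if run % 2 == 0 else not first


def demo_pred(i):
    # Stand-in predicate, NOT the "wc" word data from the post.
    return chr(i).isalpha()


ENDS = run_ends(demo_pred)
FIRST = demo_pred(0)
print(len(ENDS), "runs,", len(ENDS) * ENDS.itemsize, "bytes of array data")
```

Since runs alternate, only the parity of the run index matters, which is why storing the boundaries alone is enough.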

As always with time vs. space trade-offs, runtime increases (from 11627ms to 18013ms on ESP32 and from 37467ms to 67989ms on ESP8266 for processing the 196608 characters in the first 3 Unicode planes):

Code: Select all

tstPlanes  False                        True
ESP32old   (106641, 11627, 0.05913798)  (238046, 46374, 0.04162418)
ESP32new   (106641, 18013, 0.09161886)  (238046, 52801, 0.0473929)

ESP8266old (106641, 37467, 0.190567)    (238046, 102917, 0.0923758)
ESP8266new (106641, 67989, 0.34581)     (238046, 131885, 0.118377)

But the memory reduction by more than two thirds is definitely worth that:

Code: Select all

MicroPython v1.9.4-272-g46091b8a on 2018-07-18; ESP module with ESP8266
Type "help()" for more information.
>>> gc.collect(); gc.mem_free()
28656
>>> import isword
>>> gc.collect(); gc.mem_free()
25600
>>>
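
Putting numbers on the trade-off, using only figures quoted in this thread (the old footprint is taken as the stated 10KB, which is an approximation):

```python
# Memory: free heap before and after "import isword" in the session above.
mem_new = 28656 - 25600
mem_old = 10 * 1024                       # "10KB" as stated for the bitmask version

saving = 1.0 - mem_new / mem_old

# Time: tstPlanes(False) totals in ms, old vs. new implementation.
slow_esp32 = 18013 / 11627
slow_esp8266 = 67989 / 37467

print("new footprint: %d bytes, saving %.0f%%" % (mem_new, saving * 100))
print("slowdown ESP32 x%.2f, ESP8266 x%.2f" % (slow_esp32, slow_esp8266))
```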

This is the binary search part of the commit:

Code: Select all

def _s(arr, i):
    lft = 0
    rgt = len(arr)-1
    while lft+1 < rgt:
        mid = (lft + rgt) // 2
        if i <= arr[mid]:
            rgt = mid
        else:
            lft = mid
    return (lft % 2) if i<=arr[lft] else (rgt % 2)

This is the "isword" test for a single Unicode code point:

Code: Select all

def _w(i):
    if i>=0x030000:
        if (i<0x0e0000):
            return False
        elif i>=0x0f0000:
            return (i%65536)<0xfffe
        else:
            return (i==0x0e0001) | (0x0e0020<=i<=0x0e007f) | (0x0e0100<=i<=0x0e01ef)
    elif i>=0x020000:
        if i>=0x02f800:
            return i<0x02fa1e
        else:
            return i<0x02a6d7
    elif i>=0x010000:
        return 1 - _s(_1, i%65536)
    else:
        return _s(_0, i)

These are the two new arrays:

Code: Select all

_0=array.array('H', [32, 126, 159, 887, ..., 65528, 65533, 65535])
_1=array.array('H', [11, 12, 38, 39, ..., 61487, 61587, 65535])

MicroPython Forum (Archive)
