Low Overhead API JSON - avoiding mystifying tokenizer

General discussions and questions about development of code with MicroPython that is not hardware specific.
Target audience: MicroPython Users.
cefn
Posts: 230
Joined: Tue Aug 09, 2016 10:58 am

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by cefn » Wed Aug 02, 2017 11:38 am

Since I need to decompose to single characters for the parsing logic anyway, I wasn't sure whether there was anything to be gained by having a read(x) where x > 1, followed by a separate operation to further decompose the result into single characters.

If I've got the right end of the stick, this is backed by a Unicode-based text stream (constructed from open('filename', 'r'), not open('filename', 'rb')), so I thought it must be buffering in the IO layer even just to do Unicode character composition. I went down the Unicode text stream route, rather than a bytestream, to ensure that the richer JSON served by e.g. Twitter APIs could be handled; I can't assume JSON served from APIs would be ASCII.

Given it's already doing some buffering, is there a way to encourage TextIOWrapper to buffer in efficient chunks? I thought I was just requesting a single byte at a time from something already buffered in the backend, and for that reason I speculated that it might be less efficient to add my own buffer layer. Is this not true?

In practice I will be processing the JSON data from a socket attached to a remote HTTPS server anyway, and I can't imagine the IO is actually handled on a per-byte basis in that case. Do you think there would be a gain from adding my own buffer when the bytes are drawn from a secure socket?
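Something like the following minimal sketch is the kind of buffering layer I have in mind (the function name and chunk size of 512 are just for illustration): one buffered read per chunk, while still handing the parser single characters.

Code: Select all

def chars(stream, chunk=512):
    # one buffered read per chunk, but still yield single characters
    while True:
        buf = stream.read(chunk)
        if not buf:
            return
        for ch in buf:
            yield ch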

cefn
Posts: 230
Joined: Tue Aug 09, 2016 10:58 am

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by cefn » Wed Aug 02, 2017 2:36 pm

Worth noting the efficiency gain from chunking the IO, but unfortunately the majority of time is spent elsewhere.

The total runtime for the weatherMap.py Medea test case ( https://github.com/ShrimpingIt/medea/bl ... therMap.py ) tokenizing from a locally-stored text file is 32495 ms.

If I write a simpler test case that just reads the test file into individual characters, reading single bytes takes only 4551 ms, compared to roughly 1400 ms if you chunk it 512 bytes at a time. That means the vast majority of the time must be spent elsewhere. It looks like I may be able to shave about 3000 ms off the process with smarter buffering, but that's not a big proportion.

I get the results below from the TextIOWrapper just reading the characters, adding my own buffering with file.read(chunkSize) but without tokenizing, simply to test the chunked read speed. It's a little confusing that 2 bytes is slower than 4 bytes. I speculate that for performance it might be worth rolling my own Unicode handling so that everything is bytewise (a rough sketch of the fiddly chunk-boundary part follows the timing code below).

Code: Select all

1 bytes : 4551 ms
2 bytes : 10940 ms
4 bytes : 6105 ms
8 bytes : 3809 ms
16 bytes : 2652 ms
32 bytes : 2051 ms
64 bytes : 1742 ms
128 bytes : 1592 ms
256 bytes : 1509 ms
512 bytes : 1465 ms
1024 bytes : 1444 ms
2048 bytes : 1428 ms
Here's the example read-performance-testing code...

Code: Select all

from utime import ticks_ms, ticks_diff

def measure(size):
    # Time how long it takes to read the whole file `size` characters at a
    # time; counting newlines gives every chunk size the same per-character work.
    linecount = 0
    startMs = ticks_ms()
    with open("examples/data/weatherMap.json") as f:
        while True:
            result = f.read(size)
            if not result:
                break
            for char in result:
                if char == "\n":
                    linecount += 1
    stopMs = ticks_ms()
    return ticks_diff(stopMs, startMs)  # safe against tick counter wraparound

blockSize = 1
for index in range(12):  # chunk sizes 1, 2, 4, ... 2048
    duration = measure(blockSize)
    print("{} bytes : {} ms".format(blockSize, duration))
    blockSize *= 2
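On the bytewise idea: the fiddly part would be chunk boundaries that land in the middle of a multi-byte character. Here is a purely illustrative sketch (not part of Medea) of reading in 'rb' mode and carrying an incomplete UTF-8 sequence over to the next chunk.

Code: Select all

def utf8Chunks(f, size=512):
    # Read raw bytes and decode each chunk to str, carrying any incomplete
    # multi-byte UTF-8 sequence over to the next chunk.
    pending = b""
    while True:
        chunk = f.read(size)
        if not chunk:
            if pending:
                yield pending.decode()  # raises if the stream ended mid-character
            return
        data = pending + chunk
        cut = len(data)
        # walk back at most 3 bytes looking for an unfinished lead byte
        for back in range(1, min(3, len(data)) + 1):
            b = data[-back]
            if b < 0x80:             # ASCII byte: the tail is complete
                break
            if b >= 0xC0:            # lead byte of a multi-byte character
                expected = 2 if b < 0xE0 else 3 if b < 0xF0 else 4
                if back < expected:  # character not finished in this chunk
                    cut = len(data) - back
                break
        pending = data[cut:]
        if cut:
            yield data[:cut].decode()

# usage sketch:
# for text in utf8Chunks(open("examples/data/weatherMap.json", "rb")):
#     ...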

dbc
Posts: 89
Joined: Fri Aug 28, 2015 11:02 pm
Location: Sunnyvale, CA

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by dbc » Wed Aug 02, 2017 5:01 pm

You might want to check out jsmn ( https://github.com/zserge/jsmn ), which tokenizes JSON by simply giving you pointers into the JSON text.

It isn't incremental, but on the other hand it doesn't do any dynamic memory allocation either. I was looking at it for an embedded project and was attracted to it because it is pure C, with no dependencies and no dynamic memory allocation. The project went another direction for architectural reasons, so I never got as far as trying it out. It would probably be fairly simple to write a MicroPython API wrapper.

cefn
Posts: 230
Joined: Tue Aug 09, 2016 10:58 am

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by cefn » Wed Aug 02, 2017 5:34 pm

Thanks, @dbc.

Ideally, I'm hoping to use this tokenizer partly as a (Micro)Python generator lesson.

Sadly I haven't yet had the confidence to author a Python wrapper for any C code, though I am quite interested to understand it. Does making a new C library available require building a fresh MicroPython firmware image that includes the C library, or is there a less complicated approach that would let people dynamically add a capability backed by C to an established image? I would have to limit my work to the latter case, as I can't expect learners to configure and make new images on their machines.

I suspect I'd hit a bit of a dead end constructing streams in Python (e.g. socket streams) and then trying to process them in C.

However, it's pretty easy to start tokenizing values anywhere in the stream, so I might take a hybrid approach. For example, if you know that the object you want is named rain, and is therefore prefixed by '"rain": ', you could have a much simpler matcher find the starting point for your target value and then fire up the full tokenizer only for the sub-part of the JSON you need. That would temporarily trigger the more wasteful generators, but they would also terminate as soon as the value was completed.
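As a purely illustrative sketch (the name and details are made up, not Medea's API), the matcher part might look something like this: scan buffered chunks for the key and hand back whatever follows it, from which point the full tokenizer could take over.

Code: Select all

def afterKey(stream, key='"rain":', chunk=512):
    # Scan the stream in buffered chunks for `key` and return the buffered
    # text that follows it, so a full tokenizer can pick up from the start
    # of the value. Returns None if the key never appears.
    window = ""
    while True:
        buf = stream.read(chunk)
        if not buf:
            return None
        window += buf
        pos = window.find(key)
        if pos >= 0:
            return window[pos + len(key):]
        window = window[-(len(key) - 1):]  # keep a tail in case the key straddles chunks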

Still not an ideal level of performance (compared to native C), but if I can get it to process Twitter or the weather within 5 seconds or so, that's more than enough for my simple Internet-of-Things demos. 30 seconds is a bit much, but I think I can achieve 5 seconds with a hybrid.

cefn
Posts: 230
Joined: Tue Aug 09, 2016 10:58 am

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by cefn » Mon Aug 07, 2017 11:50 pm

By way of a project update, I can now extract the rain information from the roughly 15 kB of JSON retrieved from OpenWeatherMap (as typified by http://shrimping.it/tmp/weatherapi/openweathermap.json ) in just over 3 seconds using a pure MicroPython implementation.

The trick has been a hybrid approach in which buffers of 512 Unicode characters are filled at a time from the source stream, and a match for the "rain" key is then sought as efficiently as possible within those buffers. Only when a match for the key is found is the full generator-based JSON tokenizer fired up to process the characters of the value which follows. Once this procedure has consumed the respective characters, it releases processing back to the matching of a further "rain" key. This result is illustrated by the example...

https://github.com/ShrimpingIt/medea/bl ... esNamed.py

After adding a feature which allows multiple such keys to be searched for in parallel, I was able to create a process for extracting minimal tweet information from the Twitter API, yielding tweet (id, text) pairs without triggering a 'false positive' for the id keys found below the "user" key. This strategy could be the basis for quite detailed yet efficient recursive processing of JSON. This result is illustrated by the example...

https://github.com/ShrimpingIt/medea/bl ... esNamed.py
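For anyone curious about the shape of the multiple-key idea, here is a rough, self-contained sketch. It is not the Medea implementation: unlike Medea it does not guard against false positives under nested keys, and it only handles string and bare (e.g. number) values, but it shows the scan / extract / resume pattern.

Code: Select all

def valuesNamed(stream, keys=('"id":', '"text":'), chunk=512):
    # Scan buffered chunks for any of `keys` and yield (key, value) pairs,
    # resuming the scan after each value has been consumed.
    window = ""
    tail = max(len(k) for k in keys) - 1
    while True:
        buf = stream.read(chunk)
        if not buf:
            return
        window += buf
        while True:
            hits = [(window.find(k), k) for k in keys]
            hits = [(p, k) for (p, k) in hits if p >= 0]
            if not hits:
                window = window[-tail:]   # keep a tail in case a key straddles chunks
                break
            pos, key = min(hits)          # earliest key in the buffer
            rest = window[pos + len(key):].lstrip()
            if rest.startswith('"'):      # string value (escapes not handled)
                end = rest.find('"', 1)
            else:                         # assume a bare value such as a number
                end = -1
                for i in range(len(rest)):
                    if rest[i] in ',}] \r\n':
                        end = i
                        break
            if end < 0:                   # value not complete yet; read more first
                window = window[pos:]
                break
            yield key, (rest[1:end] if rest.startswith('"') else rest[:end])
            window = rest[end:]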
