Low Overhead API JSON - avoiding mystifying tokenizer

General discussions and questions about development of code with MicroPython that is not hardware specific.
Target audience: MicroPython Users.
cefn
Posts: 230
Joined: Tue Aug 09, 2016 10:58 am

Low Overhead API JSON - avoiding mystifying tokenizer

Post by cefn » Tue Jul 04, 2017 9:37 am

I am thinking of putting together a simple weather() function for my Rainbow badge demo (driving an 8-light Neopixel strip) which looks up the forecast for the coming days and sets the blue brightness of each light according to that day's rain.

Unfortunately, the full JSON data structures received from Wunderground (45kB) or OpenWeatherMap (14kB), which you can see at http://shrimping.it/tmp/weatherapi/, are both rather large and mostly redundant for this application, meaning direct JSON parsing isn't a realistic option. For this reason I would rather treat them as a stream, processing incoming chunks while looking for matching substructures, and disposing of any stream contents which don't match.

It should of course be feasible to run a regex-like state machine over the stream directly, avoiding parsing anything but individual key/value pairs deep within the data structures, handling them one by one and preserving only the information needed. For example, in the OpenWeatherMap data there are structures like this showing the rainfall in a 3-hour period...

Code: Select all

"rain":{"3h":0.16}
...or...

Code: Select all

"rain":{}
if there's no rain.

While I could knock up a loop which pulls these structures out (a bit like the processCommand() function in https://github.com/ShrimpingIt/projects ... etTime.ino ), I would like to avoid authoring an impenetrable, API-specific tokenizing state machine to flummox learners. Ideally the demo would be a reference example of good programming practice.

In normal circumstances I would be investigating the 'partial=True' flag of PyPI's regex module. The partial flag would enable me to author a regular expression characterising a 'rain' substructure within the stream, and to handle each byte as a potential continuation of a partially-matching string, until a complete substructure emerges which can be passed to the json module. Of course, when the regex rejects the string outright (which it mostly would), nothing is cached for further matching, meaning you only ever perform substantial matching on strings like...

Code: Select all

"
"r
"ra
"rai
"rain
"rain"
"rain":
"rain":{
...and so on.
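
Under CPython this can be made concrete. Here's a minimal sketch assuming the PyPI regex module and a deliberately simplified pattern for the rain substructure (the buffer handling is naive about matches that start mid-buffer):

Code: Select all

import regex  # PyPI 'regex' module, not the built-in 're'

# Deliberately simplified pattern for a complete rain substructure
RAIN = regex.compile(r'"rain":\{(?:"3h":[0-9.]+)?\}')

def match_rain(chars):
    """Yield each complete rain substructure found in a character stream."""
    buffered = ''
    for ch in chars:
        candidate = buffered + ch
        m = RAIN.fullmatch(candidate, partial=True)
        if m is None:
            buffered = ''  # can never match; discard everything cached
        elif m.partial:
            buffered = candidate  # still a viable prefix; keep accumulating
        else:
            yield m.group()  # complete substructure, ready for the json module
            buffered = ''

for s in match_rain('{"dt":123,"rain":{"3h":0.16},"snow":{}}'):
    print(s)  # prints: "rain":{"3h":0.16}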

Unfortunately the ure module is much more cut-down than PyPI's regex, meaning partial matching is not available.

My question is what would people consider good programming practice for this case?

Is there any generic state-machine strategy that I can expose learners to, and that they can hope to replicate for other APIs they encounter, or is the only way to put together my own mystifying tokenizer?

deshipu
Posts: 1388
Joined: Thu May 28, 2015 5:54 pm

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by deshipu » Tue Jul 04, 2017 10:02 am

What you need is a SAX-style parser, which processes the stream gradually, instead of parsing the whole thing at once and returning a complete parse tree. Note that you can't parse JSON just with regular expressions, just like you can't do that with HTML, or you risk summoning ͓̠̓̊̑ZA͂̇̈ͭ̈̚͢L̯͇͐̊̑͑̂̀G̷͇͈̬̼̠̉̈́O̭̫͌̉̈!̰̗̲͕ͩ̿̂ͭ͆̅.

I think that Python, with its iterator protocol and async/await machinery, is great for building this style of parser. So I'm really surprised that there don't seem to be any around. I guess kids these days prefer to hand-code a recursive descent parser.

cefn
Posts: 230
Joined: Tue Aug 09, 2016 10:58 am

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by cefn » Tue Jul 04, 2017 10:09 am

That's definitely the space I'm operating in, although SAX specifically is for XML of course, so it can't be adopted directly.

Your suggestion has led me to some reference SAX-style parsers for JSON, though, which I might be able to 'port'...

http://rapidjson.org/md_doc_sax.html

https://github.com/dscape/clarinet

Your comment on Python language structures raises the possibility of demonstrating yield as a way of returning results from a state machine progressing through a byte stream without going mad. That's a very helpful point, and could turn this into something of a tutorial resource.
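
As a flavour of what I have in mind, here's a minimal sketch (hypothetical, not taken from any of the libraries mentioned) of a generator that turns a stream of JSON characters into (event, value) tuples, with all the tokenizer state held in local variables between yields:

Code: Select all

def tokenize(chars):
    """Yield (event, value) tuples from an iterable of JSON characters.
    Sketch only: handles braces, escape-free strings and numbers;
    whitespace, arrays, booleans etc. are left as an exercise."""
    it = iter(chars)
    num = []  # characters of the number currently being read, if any
    for ch in it:
        if ch in '0123456789.+-eE' and (num or ch in '0123456789-'):
            num.append(ch)
            continue
        if num:  # a delimiter ended the number we were accumulating
            yield ('number', float(''.join(num)))
            num = []
        if ch == '{':
            yield ('start_map', None)
        elif ch == '}':
            yield ('end_map', None)
        elif ch == '"':
            buf = []
            for c in it:  # consume up to the closing quote
                if c == '"':
                    break
                buf.append(c)
            yield ('string', ''.join(buf))

for event, value in tokenize('{"rain":{"3h":0.16}}'):
    print(event, value)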

cefn
Posts: 230
Joined: Tue Aug 09, 2016 10:58 am

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by cefn » Tue Jul 04, 2017 10:14 am

Intriguingly, ijson is an evented JSON parser with a pure-Python tokenizer backend...

https://github.com/isagalaev/ijson

https://pypi.python.org/pypi/ijson

cefn
Posts: 230
Joined: Tue Aug 09, 2016 10:58 am

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by cefn » Tue Jul 04, 2017 7:12 pm

It's a shame the ijson module clutters the key 'prefix' namespace with the word 'item' for items in a list. It could have used '#', '[]' or '*' (invalid JavaScript variable names, visibly unlike keys), but it's otherwise pretty darn good.

Here's a worked example using genuine OpenWeatherMap data in tokeniser parse mode. This evented mode contrasts with ijson's items processing mode, which unmarshals whole JSON subtrees matching a prefix (misusing and overloading 'item' as vocabulary yet again)...

Code: Select all

from urllib.request import urlopen

import ijson

forecast = []
total = 0
# Walk the event stream; ijson.parse() yields (prefix, event, value) triples
for prefix, event, value in ijson.parse(urlopen('http://shrimping.it/tmp/weatherapi/openweathermap.json')):
    if prefix == "list.item.rain.3h" and event == "number":
        total += float(value)  # accumulate rainfall within one forecast entry
    elif prefix == "list.item.rain" and event == "end_map":
        forecast.append(total)  # entry finished; record its total
        total = 0
print(forecast)
...produces...

Code: Select all

[0.0024999999999999, 0, 0.44, 0.945, 1.12, 0.63, 0.22, 0.0099999999999998, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0.38, 1.91, 0, 0, 0, 0, 0, 0, 0.29, 0.38, 0.12, 0.3, 0.32, 1.17, 0, 0.0099999999999998, 0, 0, 0, 0, 0]
The use of ijson falls out very elegantly in Python 3, so I'll have a go at porting it for MicroPython.

deshipu
Posts: 1388
Joined: Thu May 28, 2015 5:54 pm

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by deshipu » Tue Jul 04, 2017 8:54 pm

Years ago I used a simple query language to extract parts of (already parsed) JSON files easily. Perhaps you would like that syntax better: https://bitbucket.org/thesheep/jpath/src/tip/jpath.py
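
The idea, sketched here with a made-up dotted-path syntax rather than jpath's actual one, is just a small recursive walk over the already-parsed structure:

Code: Select all

def query(data, path):
    """Yield every value reachable from parsed JSON `data` via the
    dotted `path`, where '*' matches every element of a list.
    (Illustrative syntax only, not jpath's.)"""
    key, _, rest = path.partition('.')
    if key == '*' and isinstance(data, list):
        branches = data
    elif isinstance(data, dict) and key in data:
        branches = [data[key]]
    else:
        return  # path doesn't apply here
    for branch in branches:
        if rest:
            yield from query(branch, rest)
        else:
            yield branch

# e.g. total rainfall across a fully parsed OpenWeatherMap response:
# total = sum(query(parsed, 'list.*.rain.3h'))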

cefn
Posts: 230
Joined: Tue Aug 09, 2016 10:58 am

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by cefn » Wed Jul 05, 2017 9:57 pm

Success, of a sort!

I've ported ijson to work in MicroPython (tested on Unix MicroPython 1.8.7), and it can parse a 14kB JSON API result with (I would estimate) just 128 bytes of buffered stream data in scope at any one time.

The main.py print_local() demonstration I've provided currently processes a locally-stored file, but the same result should be possible with a remotely retrieved bytestream over HTTP.

I'm having trouble porting it to run on the ESP8266, though, because I can't seem to get a stable re library via upip, and the built-in ure library complains that the ijson/backends/python.py module contains invalid regular expressions (although the same expressions were handled fine by CPython's re and by micropython-re-pcre). I'll post questions about that under a separate thread.

You can see the microfied ijson at https://github.com/ShrimpingIt/micropython-ijson


cefn
Posts: 230
Joined: Tue Aug 09, 2016 10:58 am

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by cefn » Tue Aug 01, 2017 11:51 pm

I bailed on using regex, and built a tutorial demonstrator of protocols/tokenizers using yield on a per-unicode-character basis.

The nil-memory JSON tokenizer is available at https://github.com/ShrimpingIt/medea

It's very slow, and I would welcome any suggestions for accelerating it without substantial memory overhead. You can see the core of the implementation at...
https://github.com/ShrimpingIt/medea/bl ... _init__.py

The implementation I've provided literally uses a single-byte buffer, thanks to the minimal backtracking required by the JSON grammar, so the memory overhead should be well controlled. I've tested it minimally by processing OpenWeatherMap and Twitter API JSON data, and it has been successful so far.
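
To show what 'a single byte buffer' means in practice, the pattern is roughly the following (a hypothetical helper for illustration, not medea's actual API):

Code: Select all

class Pushback:
    """One character of lookahead is all JSON tokenizing needs, e.g. the
    delimiter that terminates a number must be handed back for the next
    token to see. (Hypothetical illustration, not medea's actual API.)"""
    def __init__(self, chars):
        self.it = iter(chars)
        self.held = None  # the single buffered character

    def read1(self):
        if self.held is not None:
            ch, self.held = self.held, None
            return ch
        return next(self.it, '')  # '' signals end of stream

    def unread(self, ch):
        assert self.held is None  # never more than one character held
        self.held = ch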

User avatar
deshipu
Posts: 1388
Joined: Thu May 28, 2015 5:54 pm

Re: Low Overhead API JSON - avoiding mystifying tokenizer

Post by deshipu » Wed Aug 02, 2017 11:00 am

One obvious thing you could try to speed it up is to make it read more than one byte at a time and buffer it. I know it wouldn't be so "pure" any more, but I can see an opportunity for some speedup there.
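
Something like this wrapper would do it, assuming the tokenizer accepts any iterable of characters (a sketch, not tested against the medea code):

Code: Select all

def buffered_chars(stream, size=64):
    """Read from `stream` in `size`-character chunks, but still hand the
    tokenizer one character at a time, so its interface is unchanged."""
    while True:
        chunk = stream.read(size)
        if not chunk:
            return
        yield from chunk

# e.g. tokenize(buffered_chars(open('weather.json')))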
