Adds a small amount of stats analysis, mostly the maximum callback time,
with a simple display in the UI.
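As a rough illustration of tracking the maximum callback time, here is a minimal sketch. The struct and function names (CallbackStats, stats_record, stats_mean_ns) are hypothetical and not taken from this patch; the caller is assumed to measure each callback's elapsed time in nanoseconds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stats record; names are illustrative, not from the patch. */
typedef struct {
    uint64_t max_callback_ns; /* longest callback seen so far */
    uint64_t total_ns;        /* sum of all recorded callback times */
    uint64_t count;           /* number of callbacks recorded */
} CallbackStats;

/* Record one callback duration and update the running maximum. */
static void stats_record(CallbackStats *s, uint64_t elapsed_ns) {
    assert(s != NULL);
    if (elapsed_ns > s->max_callback_ns) {
        s->max_callback_ns = elapsed_ns;
    }
    s->total_ns += elapsed_ns;
    s->count++;
}

/* Mean callback time in nanoseconds, or 0 if nothing was recorded. */
static uint64_t stats_mean_ns(const CallbackStats *s) {
    assert(s != NULL);
    return s->count ? s->total_ns / s->count : 0;
}
```

The maximum is the most useful figure for audio work, since a single callback that overruns its deadline causes a glitch regardless of the average.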
Also improves the pow calculation to use a lookup-table (LUT)
implementation instead of math.h pow(), for a speedup of roughly 20-30%.
The FM kernel lends itself well to speedup using NEON assembler. This
patch contains the NEON assembly code, plus C integration code
(including making sure that buffers are aligned to 16 bytes).
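For reference, one way to guarantee 16-byte alignment from C is sketched below. This is an assumption about how such alignment could be arranged, not a copy of the integration code in this patch; the helper names (is_aligned16, alloc_aligned_samples) and the int32_t sample type are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define BUF_ALIGN 16

/* Return nonzero if p sits on a 16-byte boundary. */
static int is_aligned16(const void *p) {
    return ((uintptr_t)p % BUF_ALIGN) == 0;
}

/* Allocate room for n int32_t samples on a 16-byte boundary.
   aligned_alloc (C11) expects the size to be a multiple of the
   alignment, so the byte count is rounded up. Release with free(). */
static int32_t *alloc_aligned_samples(size_t n) {
    size_t bytes = n * sizeof(int32_t);
    bytes = (bytes + BUF_ALIGN - 1) / BUF_ALIGN * BUF_ALIGN;
    int32_t *p = aligned_alloc(BUF_ALIGN, bytes);
    assert(p == NULL || is_aligned16(p));
    return p;
}
```

For fixed-size buffers, declaring them with `_Alignas(16)` (C11) achieves the same thing without a heap allocation. Checking alignment with an assert before calling into the assembly catches misaligned buffers early in debug builds.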