VOXL2-Mini (QRB5165): YOLO on NPU/NNAPI or GPU keeps falling back to CPU. Looking for a known-good export/runtime path (or QNN/HTP).
Device: VOXL2-Mini (QRB5165)
Goal: Run object detection on the NPU/HTP via NNAPI (preferred) or the GPU (FP16) using `voxl-tflite-server`, keeping CPU load low.
Problem: Models load, but many ops are not delegated (NNAPI) or the GPU delegate fails, so XNNPACK (CPU) takes over and pegs the CPU.
System snapshot
```
dpkg -l | grep -Ei 'tflite|tensorflow|voxl-tflite-server'
ii packagegroup-qti-ml-tflite  1.0-r0   all
ii qrb5165-tflite              2.8.0-2  arm64   (TensorFlow Lite for qrb5165)
ii tensorflow-lite             2.2-r0   arm64
ii voxl-tflite-server          0.4.1    arm64
```
GPU device and libs appear present:
```
ls -l /sys/class/kgsl/kgsl-3d0
  -> ../../devices/platform/soc/3d00000.qcom,kgsl-3d0/kgsl/kgsl-3d0
ldconfig -p | grep -Ei 'EGL|GLES|OpenCL'
  ... libEGL_adreno.so, libGLESv2_adreno.so, libOpenCL.so, etc.
```
What works (sanity check)
- Mobilenet (quant) with NNAPI via `benchmark_model` delegates and runs:

```
benchmark_model --graph=mobilenetv1_nnapi_quant.tflite \
    --use_nnapi=true --num_runs=50 --enable_op_profiling=true
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: Replacing 63 node(s) with delegate (TfLiteNnapiDelegate) node...
Avg inference ~15.7 ms
```

This confirms NNAPI delegation in the benchmarking tool.
Note: We did not run a Mobilenet-style model end-to-end through `voxl-tflite-server`. The CPU pegging we observed was with YOLO models that partially delegated (many ops refused) and fell back to XNNPACK (CPU). So Mobilenet is validated only in `benchmark_model`, not yet in the full server pipeline.
What fails or falls back
1) GPU (FP16) attempts failed / underutilized
- In app tests, selecting GPU showed ~0.5% GPU utilization while CPU spiked, suggesting the GPU delegate wasn't actually used.
- `benchmark_model` with YOLOv11n FP16 on GPU:

```
--use_gpu=true
ERROR: Didn't find op for builtin opcode 'RESIZE_NEAREST_NEIGHBOR' version '3'
ERROR: Registration failed.
Failed to construct interpreter
```
- `benchmark_model` with YOLOv8n FP16 on GPU:

```
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: Next operations are not supported by GPU delegate:
ADD, CONCATENATION, LOGISTIC, MUL, STRIDED_SLICE, SUB ...
ERROR: TfLiteGpuDelegate Init: MUL: Dimension can not be reduced to linear.
ERROR: TfLiteGpuDelegate Prepare: delegate is not initialized
ERROR: .../tflite/kernels/conv.cc:340 bias->type != input_type (10 != 1)
Failed to apply GPU delegate.
```
- ModalAI's own source notes the same for the YOLOv8 path (in `voxl-tflite-server` v0.4.1):

```
// GPU delegate doesn't seem to work for yolov8 for whatever reason
```

https://gitlab.com/voxl-public/voxl-sdk/services/voxl-tflite-server/-/blob/v0.4.1/src/model_helper/yolov8_model_helper.cpp#L167

The surrounding call path is a straight `interpreter->Invoke()` after feeding FP32 input, consistent with our observation that YOLOv8 graphs don't delegate to GPU on this image.
2) NNAPI (INT8) mostly loads but refuses key ops → CPU fallback
- YOLOv11n INT8 via `voxl-tflite-server`:

```
INFO: Created TensorFlow Lite delegate for NNAPI.
WARNING: PACK (v1) refused: Android sdk version less than 30
WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true
WARNING: DEPTHWISE_CONV_2D (v4) refused: OP Version higher than 3
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Successfully built interpreter
```

→ Runs, but the CPU is pegged.
- YOLOv8n INT8 via `voxl-tflite-server`:

```
INFO: Created TensorFlow Lite delegate for NNAPI.
WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Successfully built interpreter
```

→ Fewer refusals, but still a CPU fallback.
- We also hit a schema/version mismatch on one early export:

```
ERROR: Didn't find op for builtin opcode 'CONV_2D' version '6'
```

→ suggests the exported op versions exceed what this image's TFLite supports.
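For triage on our side, the refusal rules implied by the NNAPI warnings above can be captured in a small pre-check. This is only a sketch: the rules and thresholds below are our reading of the log messages, not an official NNAPI capability list, and `nnapi_refusals` is our own helper name.

```python
# Hypothetical pre-check: flag (op, version) pairs that the NNAPI delegate on
# this image appears to refuse, based solely on the warnings quoted above.

def nnapi_refusals(ops, sdk_version=29, half_pixel_centers=True):
    """ops: list of (builtin_op_name, version) tuples from a TFLite graph.

    Returns [(op, version, reason)] for every op the rules would refuse.
    """
    refused = []
    for name, version in ops:
        if name == "PACK" and sdk_version < 30:
            refused.append((name, version, "Android sdk version less than 30"))
        elif name == "RESIZE_NEAREST_NEIGHBOR" and half_pixel_centers:
            refused.append((name, version, "half_pixel_centers == true"))
        elif name == "DEPTHWISE_CONV_2D" and version > 3:
            refused.append((name, version, "OP Version higher than 3"))
    return refused

# Subset of ops seen in our YOLOv11n INT8 export
ops = [("CONV_2D", 3), ("PACK", 1),
       ("RESIZE_NEAREST_NEIGHBOR", 3), ("DEPTHWISE_CONV_2D", 4)]
for op, ver, why in nnapi_refusals(ops):
    print(f"{op} (v{ver}) refused: {why}")
```

This mirrors the three warnings in the YOLOv11n log and lets us sanity-check a candidate export's op list before copying it to the drone.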
What we tried on the export side
Objective: Emit NNAPI/GPU-friendly TFLite graphs the VOXL2-Mini image accepts.
- Exported YOLO (v11 and v8) INT8 with Ultralytics, imgsz=640, built-in NMS, and a calibration set (~200 images from MPA pipe, letterboxed to 640×640).
- Attempted to force older TFLite converters (Docker images with TF 2.5/2.6) to down-rev op versions / change resize semantics, but the official TF 2.5/2.6 images we pulled ship Python 3.6, so many modern wheels (Ultralytics, numpy, protobuf) are unavailable → dead end.
- Kept a working host venv on TF 2.8.4. Exported YOLOv8n INT8 successfully; it loads on VOXL, but NNAPI still refuses `RESIZE_NEAREST_NEIGHBOR` v3 with `half_pixel_centers==true` → CPU fallback.
Current status
- NNAPI delegation confirmed for classic quant models (Mobilenet) in `benchmark_model`.
- GPU delegate not working for YOLOv8/YOLO11 FP16 on this image (unsupported ops / delegate init failure; echoed in ModalAI's own code comment).
- YOLO (v8/v11) INT8 loads under `voxl-tflite-server`, but NNAPI refuses:
  - `PACK` (v1): SDK < 30
  - `RESIZE_NEAREST_NEIGHBOR` (v3): `half_pixel_centers=true`
  - `DEPTHWISE_CONV_2D` (v4): needs ≤ v3

  → Falls back to XNNPACK (CPU) → high CPU.
Questions for ModalAI
- Exact NNAPI & GPU capabilities on this VOXL2-Mini image:
  - Which builtin op versions are supported by NNAPI (e.g., `DEPTHWISE_CONV_2D` ≤ v3)?
  - Is `PACK` supported at all on this SDK, or always refused on SDK < 30?
  - For `RESIZE_NEAREST_NEIGHBOR`, do we need `half_pixel_centers=false` to get delegation?
  - For GPU, what is the expected op coverage, and are there known limitations with YOLOv8/YOLO11 graphs?
- Recommended export toolchain/settings to produce YOLO TFLite that fully delegates on this image, i.e., the specific TensorFlow/TFLite converter version and export flags to:
  - Emit `RESIZE_NEAREST_NEIGHBOR` in an NNAPI-friendly form
  - Keep `DEPTHWISE_CONV_2D` at v ≤ 3
  - Avoid inserting `PACK` nodes
- A known-good YOLO .tflite that fully delegates to NNAPI/GPU on VOXL2-Mini would be ideal to test.
- HTP/QNN path: Is there an officially supported way to run detection on QNN/HTP while keeping the MPA pipe UX (i.e., a `voxl-qnn-server`-like component or documented pattern)? On QRB5165, QNN/HTP typically gives the best performance/compatibility.
- Image update: Is there a newer VOXL2-Mini image with newer NNAPI/GPU delegates (e.g., SDK ≥ 30, wider op coverage) so current YOLO exports delegate out of the box?
Repro bits (so you can mirror)
Calibration set: ~200 frames from the MPA pipe, 1280×720, letterboxed to 640×640.
`data.yaml` (calibration only):

```
path: /data1/kjwindham/home/kjwindham/src/kw-mpa/test
train: calib_images_640
val: calib_images_640
names: ['obj']
```
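For reference, the letterboxing we apply to the 1280×720 calibration frames is equivalent to this numpy-only sketch (`letterbox` is our own helper name, not part of any VOXL or Ultralytics tooling; Ultralytics itself uses a bilinear OpenCV resize, while this sketch uses nearest-neighbor to stay dependency-free):

```python
import numpy as np

def letterbox(img, size=640, pad_value=114):
    """Resize keeping aspect ratio, then pad to size x size with a gray border."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    nh, nw = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbor resize via integer index maps (no cv2 dependency)
    rows = (np.arange(nh) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(nw) / scale).astype(int).clip(0, w - 1)
    resized = img[rows][:, cols]
    # Center the resized image on a padded canvas
    out = np.full((size, size, img.shape[2]), pad_value, dtype=img.dtype)
    top, left = (size - nh) // 2, (size - nw) // 2
    out[top:top + nh, left:left + nw] = resized
    return out

frame = np.zeros((720, 1280, 3), dtype=np.uint8)  # one MPA hires frame
print(letterbox(frame).shape)  # (640, 640, 3)
```

A 1280×720 frame scales to 640×360 and gets 140 px of padding top and bottom, matching the geometry the INT8 calibration set was built with.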
Export (host venv, TF 2.8.4):

```
yolo export model=yolov8n.pt format=tflite int8=True nms=True imgsz=640 \
    data=/data1/kjwindham/home/kjwindham/src/kw-mpa/test/data.yaml
```
Server config:

```
{
    "skip_n_frames": 0,
    "model": "/usr/bin/dnn/yolov8n_int8.tflite",
    "input_pipe": "/run/mpa/hires_low_latency_misp_color/",
    "delegate": "nnapi",
    "requires_labels": true,
    "labels": "/usr/bin/dnn/yolov5_labels.txt",
    "allow_multiple": false,
    "output_pipe_prefix": "mobilenet"
}
```
Typical `voxl-tflite-server` log (YOLOv8n INT8):

```
INFO: Created TensorFlow Lite delegate for NNAPI.
WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Successfully built interpreter
```
GPU example (YOLOv8n FP16) with `benchmark_model`:

```
INFO: Created TensorFlow Lite delegate for GPU.
... unsupported ops (ADD, CONCAT, LOGISTIC, MUL, STRIDED_SLICE, SUB)
ERROR: TfLiteGpuDelegate Init: MUL: Dimension can not be reduced to linear.
ERROR: ...conv.cc:340 bias->type != input_type (10 != 1)
Failed to apply GPU delegate.
```
What we’re hoping for
- A known-good recipe (export + runtime) that fully delegates YOLO to NNAPI or GPU on VOXL2-Mini as shipped, or
- An official QNN/HTP example/service that integrates with the MPA pipe (ideally with Python bindings, which would make it easy to integrate with our current project), or
- An updated system image that broadens NNAPI/GPU support to match current YOLO exports.
Happy to run any candidate model/image and provide full logs. Thanks!