VOXL2-Mini (QRB5165): YOLO on NPU/NNAPI or GPU keeps falling back to CPU. Looking for a known-good export/runtime path (or QNN/HTP).
Device: VOXL2-Mini (QRB5165)
Goal: Run object detection on the NPU/HTP via NNAPI (preferred) or GPU (FP16) using `voxl-tflite-server`, keeping CPU load low.
Problem: Models load, but many ops aren't delegated (NNAPI) or the GPU delegate fails to initialize → XNNPACK (CPU) takes over and pegs the CPU.
System snapshot
```
dpkg -l | grep -Ei 'tflite|tensorflow|voxl-tflite-server'
ii  packagegroup-qti-ml-tflite  1.0-r0   all
ii  qrb5165-tflite              2.8.0-2  arm64   (TensorFlow Lite for qrb5165)
ii  tensorflow-lite             2.2-r0   arm64
ii  voxl-tflite-server          0.4.1    arm64
```
GPU device and libs appear present:
```
ls -l /sys/class/kgsl/kgsl-3d0
-> ../../devices/platform/soc/3d00000.qcom,kgsl-3d0/kgsl/kgsl-3d0

ldconfig -p | grep -Ei 'EGL|GLES|OpenCL'
... libEGL_adreno.so, libGLESv2_adreno.so, libOpenCL.so, etc.
```
What works (sanity) — clarified
- Mobilenet (quant) with NNAPI via `benchmark_model` delegates and runs:

```
benchmark_model --graph=mobilenetv1_nnapi_quant.tflite \
  --use_nnapi=true --num_runs=50 --enable_op_profiling=true
INFO: Created TensorFlow Lite delegate for NNAPI.
INFO: Replacing 63 node(s) with delegate (TfLiteNnapiDelegate) node...
Avg inference ~15.7 ms
```

This confirms NNAPI delegation works in the benchmarking tool.
⚠️ Note: We did not run a Mobilenet-style model end-to-end through `voxl-tflite-server`. The CPU pegging we observed was with YOLO models that partially delegated (many ops refused) and fell back to XNNPACK (CPU). So Mobilenet is only validated in `benchmark_model`, not in the full server pipeline yet.
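If it helps triage, `benchmark_model` can also be pinned to a specific NNAPI accelerator, which confirms whether the DSP/HTP backend (rather than `nnapi-reference` on CPU) is actually serving the model. The accelerator name below is an assumption based on common Qualcomm NNAPI drivers; we have not confirmed the exact names on this image:

```
# Pin NNAPI to a specific accelerator; an invalid name makes the delegate
# fail loudly instead of silently landing on nnapi-reference (CPU).
benchmark_model --graph=mobilenetv1_nnapi_quant.tflite \
  --use_nnapi=true \
  --nnapi_accelerator_name=qti-dsp \
  --num_runs=50 --enable_op_profiling=true
```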
What fails or falls back
1) GPU (FP16) attempts failed / underutilized
- In app tests, selecting GPU showed GPU load at ~0.5% while CPU spiked → suggests the GPU delegate wasn't actually used.
- `benchmark_model` with YOLOv11n FP16 on GPU (`--use_gpu=true`):

```
ERROR: Didn't find op for builtin opcode 'RESIZE_NEAREST_NEIGHBOR' version '3'
ERROR: Registration failed.
Failed to construct interpreter
```

- `benchmark_model` with YOLOv8n FP16 on GPU:

```
INFO: Created TensorFlow Lite delegate for GPU.
ERROR: Next operations are not supported by GPU delegate: ADD, CONCATENATION, LOGISTIC, MUL, STRIDED_SLICE, SUB ...
ERROR: TfLiteGpuDelegate Init: MUL: Dimension can not be reduced to linear.
ERROR: TfLiteGpuDelegate Prepare: delegate is not initialized
ERROR: .../tflite/kernels/conv.cc:340 bias->type != input_type (10 != 1)
Failed to apply GPU delegate.
```

- ModalAI's own source comments on the same issue (YOLOv8 path) in `voxl-tflite-server` v0.4.1:

```
// GPU delegate doesn't seem to work for yolov8 for whatever reason
```

https://gitlab.com/voxl-public/voxl-sdk/services/voxl-tflite-server/-/blob/v0.4.1/src/model_helper/yolov8_model_helper.cpp#L167

The surrounding call path is a straight `interpreter->Invoke()` after feeding FP32 input, consistent with our observation that YOLOv8 graphs don't delegate to GPU on this image.
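The GPU runs above were invoked along these lines (the model filename is ours; the flags are stock TFLite benchmark-tool flags, and `--enable_op_profiling` at least shows which partitions land on CPU when the delegate only takes part of the graph):

```
benchmark_model --graph=yolov8n_fp16.tflite \
  --use_gpu=true \
  --gpu_precision_loss_allowed=true \
  --num_runs=50 --enable_op_profiling=true
```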
2) NNAPI (INT8) mostly loads but refuses key ops → CPU fallback
- YOLOv11n INT8 via `voxl-tflite-server`:

```
INFO: Created TensorFlow Lite delegate for NNAPI.
WARNING: PACK (v1) refused: Android sdk version less than 30
WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true
WARNING: DEPTHWISE_CONV_2D (v4) refused: OP Version higher than 3
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Successfully built interpreter
```

→ Runs, but CPU is pegged.
- YOLOv8n INT8 via `voxl-tflite-server`:

```
INFO: Created TensorFlow Lite delegate for NNAPI.
WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Successfully built interpreter
```

→ Fewer refusals, but still CPU fallback.
- We also hit a schema/version mismatch on one early export:

```
ERROR: Didn't find op for builtin opcode 'CONV_2D' version '6'
```

→ suggests the exported op versions exceed what this image's TFLite supports.
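To catch version mismatches like this before pushing to the device, we now dump each export's builtin ops and versions on the host. A minimal sketch using the flatbuffer helpers bundled with TF (works in our TF 2.8 venv; `yolov8n_int8.tflite` stands in for any export):

```python
# List every builtin op and its version in a .tflite, so ops newer than
# what the on-target TFLite 2.8 runtime registers stand out immediately.
from tensorflow.lite.python import schema_py_generated as schema_fb
from tensorflow.lite.tools import flatbuffer_utils

# Reverse map: builtin code -> name (e.g., 3 -> 'CONV_2D')
OP_NAMES = {v: k for k, v in vars(schema_fb.BuiltinOperator).items()
            if not k.startswith('_')}

model = flatbuffer_utils.read_model('yolov8n_int8.tflite')
for opcode in model.operatorCodes:
    # Newer schemas keep codes >= 127 in builtinCode and mirror older ones
    # in deprecatedBuiltinCode; taking the max covers both writer styles.
    code = max(opcode.builtinCode or 0, opcode.deprecatedBuiltinCode or 0)
    print(f"{OP_NAMES.get(code, code)} v{opcode.version}")
```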
What we tried on the export side
Objective: Emit NNAPI/GPU-friendly TFLite graphs the VOXL2-Mini image accepts.
- Exported YOLO (v11 and v8) INT8 with Ultralytics, imgsz=640, built-in NMS, and a calibration set (~200 images from MPA pipe, letterboxed to 640×640).
- Attempted to force older TFLite converters (Docker with TF 2.5/2.6) to down-rev op versions / change resize semantics, but the official TF 2.5/2.6 images we pulled are Python 3.6-based → many modern wheels (Ultralytics, numpy, protobuf) unavailable → dead end.
- Kept a working host venv on TF 2.8.4. Exported YOLOv8n INT8 successfully; it loads on VOXL, but NNAPI still refuses `RESIZE_NEAREST_NEIGHBOR` v3 with `half_pixel_centers==true` → CPU fallback.
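For completeness, when bypassing the one-shot Ultralytics export we drive the converter by hand from the intermediate SavedModel directory Ultralytics leaves behind (`yolov8n_saved_model` in our runs), so the quantization knobs are explicit. A sketch of that path; the paths and the `rep_data()` generator are ours, the flags are standard TF 2.8 `TFLiteConverter` options, and as far as we can tell none of them controls resize semantics or emitted op versions, which is exactly the gap we're asking about:

```python
import glob
import numpy as np
import tensorflow as tf

def rep_data():
    # ~200 letterboxed 640x640 calibration frames from the MPA pipe,
    # stored as float32 .npy arrays scaled to [0, 1].
    for path in glob.glob('calib_images_640/*.npy'):
        img = np.load(path).astype(np.float32)
        yield [img[None, ...]]  # add batch dim -> (1, 640, 640, 3)

converter = tf.lite.TFLiteConverter.from_saved_model('yolov8n_saved_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = rep_data
# Force full-integer kernels so NNAPI sees a uniform INT8 graph.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

open('yolov8n_int8.tflite', 'wb').write(converter.convert())
```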
Current status
- NNAPI delegation confirmed for classic quant models (Mobilenet) in `benchmark_model`.
- GPU delegate not working for YOLOv8/YOLO11 FP16 on this image (unsupported ops / delegate init failure; echoed in ModalAI's own code comment).
- ⚠️ YOLO (v8/v11) INT8 loads under `voxl-tflite-server`, but NNAPI refuses:
  - `PACK` (v1): SDK < 30
  - `RESIZE_NEAREST_NEIGHBOR` (v3): `half_pixel_centers=true`
  - `DEPTHWISE_CONV_2D` (v4): needs ≤ v3

  → Falls back to XNNPACK (CPU) → high CPU.
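One workaround we're considering for the `RESIZE_NEAREST_NEIGHBOR` refusal, since no converter flag we've found controls it: patching the exported flatbuffer to clear `half_pixel_centers` after the fact. A hypothetical sketch using TF's bundled flatbuffer helpers; untested in our pipeline, and flipping the flag shifts resize sampling by half a pixel, so detection accuracy would need re-validation:

```python
from tensorflow.lite.python import schema_py_generated as schema_fb
from tensorflow.lite.tools import flatbuffer_utils

model = flatbuffer_utils.read_model('yolov8n_int8.tflite')
for subgraph in model.subgraphs:
    for op in subgraph.operators:
        opcode = model.operatorCodes[op.opcodeIndex]
        code = max(opcode.builtinCode or 0, opcode.deprecatedBuiltinCode or 0)
        if code == schema_fb.BuiltinOperator.RESIZE_NEAREST_NEIGHBOR:
            # NNAPI on this image refuses half_pixel_centers == true.
            op.builtinOptions.halfPixelCenters = False
flatbuffer_utils.write_model(model, 'yolov8n_int8_patched.tflite')
```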
Questions for ModalAI
- Exact NNAPI & GPU capabilities on this VOXL2-Mini image:
  - Which builtin op versions are supported by NNAPI (e.g., `DEPTHWISE_CONV_2D` ≤ v3)?
  - Is `PACK` supported at all on this SDK, or always refused on SDK < 30?
  - For `RESIZE_NEAREST_NEIGHBOR`, do we need `half_pixel_centers=false` to get delegation?
  - For GPU, what's the expected op coverage, and are there known limitations with YOLOv8/YOLO11 graphs?
- Recommended export toolchain/settings to produce a YOLO TFLite that fully delegates on this image:
  - Specific TensorFlow/TFLite converter version and export flags to:
    - Emit `RESIZE_NEAREST_NEIGHBOR` in an NNAPI-friendly form
    - Keep `DEPTHWISE_CONV_2D` at v≤3
    - Avoid inserting `PACK` nodes
  - A known-good YOLO `.tflite` that fully delegates to NNAPI/GPU on VOXL2-Mini would be ideal to test.
- HTP/QNN path: Is there an officially supported way to run detection on QNN/HTP while keeping the MPA pipe UX (i.e., a `voxl-qnn-server`-like component or documented pattern)? On QRB5165, QNN/HTP typically gives the best performance and compatibility.
- Image update: Is there a newer VOXL2-Mini image with newer NNAPI/GPU delegates (e.g., SDK ≥ 30, wider op coverage) so current YOLO exports delegate out-of-the-box?
Repro bits (so you can mirror)
Calibration set: ~200 frames from the MPA pipe, 1280×720, letterboxed to 640×640.
`data.yaml` (calibration only):

```
path: /data1/kjwindham/home/kjwindham/src/kw-mpa/test
train: calib_images_640
val: calib_images_640
names: ['obj']
```
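The letterboxing is the usual YOLO-style scale-and-pad. A minimal sketch of how we prepare calibration frames (OpenCV; the gray value 114 matches Ultralytics' padding convention):

```python
import cv2
import numpy as np

def letterbox(img, size=640, pad_value=114):
    """Resize keeping aspect ratio, then pad to size x size (YOLO-style)."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_LINEAR)
    canvas = np.full((size, size, 3), pad_value, dtype=np.uint8)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas

# A 1280x720 MPA frame becomes 640x360 content centered on a 640x640 canvas.
frame = cv2.imread('frame_0001.png')
cv2.imwrite('calib_images_640/frame_0001.png', letterbox(frame))
```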
Export (host venv, TF 2.8.4):
```
yolo export model=yolov8n.pt format=tflite int8=True nms=True imgsz=640 \
  data=/data1/kjwindham/home/kjwindham/src/kw-mpa/test/data.yaml
```
Server config:
```json
{
  "skip_n_frames": 0,
  "model": "/usr/bin/dnn/yolov8n_int8.tflite",
  "input_pipe": "/run/mpa/hires_low_latency_misp_color/",
  "delegate": "nnapi",
  "requires_labels": true,
  "labels": "/usr/bin/dnn/yolov5_labels.txt",
  "allow_multiple": false,
  "output_pipe_prefix": "mobilenet"
}
```
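After dropping a model/config in place we restart the service and watch load. The commands below assume the stock systemd unit and a running `voxl-cpu-monitor` (standard on VOXL2 images, as far as we know):

```
systemctl restart voxl-tflite-server
voxl-inspect-cpu    # per-core load while frames flow through the pipe
```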
Typical `voxl-tflite-server` log (YOLOv8n INT8):

```
INFO: Created TensorFlow Lite delegate for NNAPI.
WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true.
INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
Successfully built interpreter
```
GPU example (YOLOv8n FP16) with `benchmark_model`:

```
INFO: Created TensorFlow Lite delegate for GPU.
... unsupported ops (ADD, CONCAT, LOGISTIC, MUL, STRIDED_SLICE, SUB)
ERROR: TfLiteGpuDelegate Init: MUL: Dimension can not be reduced to linear.
ERROR: ...conv.cc:340 bias->type != input_type (10 != 1)
Failed to apply GPU delegate.
```
What we’re hoping for
- A known-good recipe (export + runtime) that fully delegates YOLO to NNAPI or GPU on VOXL2-Mini as shipped, or
- An official QNN/HTP example/service that integrates with the MPA pipe (ideally with a Python interface, which would be easiest to integrate with our current project), or
- An updated system image that broadens NNAPI/GPU support to match current YOLO exports.
Happy to run any candidate model/image and provide full logs. Thanks!