      VOXL2-Mini (QRB5165): YOLO on NPU/NNAPI or GPU keeps falling back to CPU. Looking for a known-good export/runtime path (or QNN/HTP).

      Device: VOXL2-Mini (QRB5165)
      Goal: Run object detection on the NPU/HTP via NNAPI (preferred) or the GPU (FP16) using voxl-tflite-server, keeping CPU load low.
      Problem: Models load, but many ops aren’t delegated under NNAPI, or the GPU delegate fails outright → XNNPACK (CPU) takes over and pegs the CPU.


      System snapshot

      dpkg -l | grep -Ei 'tflite|tensorflow|voxl-tflite-server'
      ii  packagegroup-qti-ml-tflite   1.0-r0   all
      ii  qrb5165-tflite               2.8.0-2  arm64   (TensorFlow Lite for qrb5165)
      ii  tensorflow-lite              2.2-r0   arm64
      ii  voxl-tflite-server           0.4.1    arm64
      

      GPU device and libs appear present:

      ls -l /sys/class/kgsl/kgsl-3d0
      -> ../../devices/platform/soc/3d00000.qcom,kgsl-3d0/kgsl/kgsl-3d0
      
      ldconfig -p | grep -Ei 'EGL|GLES|OpenCL'
      ... libEGL_adreno.so, libGLESv2_adreno.so, libOpenCL.so, etc.
      

      What works (sanity check)

      • A quantized Mobilenet delegates and runs under NNAPI via benchmark_model:

        benchmark_model --graph=mobilenetv1_nnapi_quant.tflite \
          --use_nnapi=true --num_runs=50 --enable_op_profiling=true
        
        INFO: Created TensorFlow Lite delegate for NNAPI.
        INFO: Replacing 63 node(s) with delegate (TfLiteNnapiDelegate) node...
        Avg inference ~15.7 ms
        

        ✅ Confirms NNAPI delegation in the benchmarking tool.

        ⚠️ Note: We did not run a Mobilenet-style model end-to-end through voxl-tflite-server. The CPU pegging we observed was with YOLO models that partially delegated (with many ops refused) and fell back to XNNPACK (CPU). So Mobilenet is only validated in benchmark_model, not in the full server pipeline yet.
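
      For a CPU-only reference point, here’s a minimal timing sketch with the TFLite Python interpreter (a sketch under our assumptions: tensorflow or tflite_runtime installed wherever it runs; the model filename is ours):

      import time
      import numpy as np
      import tensorflow as tf  # or: from tflite_runtime.interpreter import Interpreter

      # No delegate requested, so this measures the XNNPACK/CPU path we fall back to.
      interp = tf.lite.Interpreter(model_path="yolov8n_int8.tflite", num_threads=4)
      interp.allocate_tensors()
      inp = interp.get_input_details()[0]

      # Zero input with the model's expected shape/dtype (timing only; real frames are letterboxed).
      dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

      for _ in range(5):  # warm-up
          interp.set_tensor(inp["index"], dummy)
          interp.invoke()

      t0, n = time.time(), 50
      for _ in range(n):
          interp.set_tensor(inp["index"], dummy)
          interp.invoke()
      print(f"CPU avg: {(time.time() - t0) / n * 1e3:.1f} ms")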


      What fails or falls back

      1) GPU (FP16) attempts failed / underutilized

      • In-app tests: selecting GPU showed ~0.5% GPU utilization while the CPU spiked → suggests the GPU delegate wasn’t actually used.

      • benchmark_model with YOLOv11n FP16 on GPU:

        --use_gpu=true
        ERROR: Didn't find op for builtin opcode 'RESIZE_NEAREST_NEIGHBOR' version '3'
        ERROR: Registration failed.
        Failed to construct interpreter
        
      • benchmark_model with YOLOv8n FP16 on GPU:

        INFO: Created TensorFlow Lite delegate for GPU.
        ERROR: Next operations are not supported by GPU delegate:
               ADD, CONCATENATION, LOGISTIC, MUL, STRIDED_SLICE, SUB
        ...
        ERROR: TfLiteGpuDelegate Init: MUL: Dimension can not be reduced to linear.
        ERROR: TfLiteGpuDelegate Prepare: delegate is not initialized
        ERROR: .../tflite/kernels/conv.cc:340 bias->type != input_type (10 != 1)
        Failed to apply GPU delegate.
        
      • ModalAI’s own source notes the same issue (YOLOv8 path):

        // GPU delegate doesn't seem to work for yolov8 for whatever reason
        (In voxl-tflite-server v0.4.1)
        https://gitlab.com/voxl-public/voxl-sdk/services/voxl-tflite-server/-/blob/v0.4.1/src/model_helper/yolov8_model_helper.cpp#L167

        The surrounding call path is a straight interpreter->Invoke() after feeding FP32 input, consistent with our observations that YOLOv8 graphs don’t delegate to GPU on this image.
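
        (In the conv.cc error above, type 10 is kTfLiteFloat16 and type 1 is kTfLiteFloat32 in the TfLiteType enum, i.e., FP16 bias tensors reaching a CPU kernel that expects FP32 once the delegate bails out.) To reproduce the GPU failure outside the server, a sketch that applies the delegate from Python — the delegate .so name/path is an assumption and depends on how TFLite was built for this image:

      import tensorflow as tf

      # Hypothetical path: point this at the GPU delegate library on the image, if present.
      GPU_DELEGATE = "/usr/lib/libtensorflowlite_gpu_delegate.so"

      try:
          delegate = tf.lite.experimental.load_delegate(GPU_DELEGATE)
          interp = tf.lite.Interpreter(
              model_path="yolov8n_fp16.tflite",
              experimental_delegates=[delegate],
          )
          interp.allocate_tensors()  # this is where delegate init errors surface
          print("GPU delegate applied")
      except Exception as e:
          print(f"GPU delegate failed: {e}")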

      2) NNAPI (INT8) mostly loads but refuses key ops → CPU fallback

      • YOLOv11n INT8 via voxl-tflite-server:

        INFO: Created TensorFlow Lite delegate for NNAPI.
        WARNING: PACK (v1) refused: Android sdk version less than 30
        WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true
        WARNING: DEPTHWISE_CONV_2D (v4) refused: OP Version higher than 3
        INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
        Successfully built interpreter
        

        → Runs, but CPU pegged.

      • YOLOv8n INT8 via voxl-tflite-server:

        INFO: Created TensorFlow Lite delegate for NNAPI.
        WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true.
        INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
        Successfully built interpreter
        

        → Fewer refusals, but still CPU fallback.

      • We also hit a schema/version mismatch on one early export:

        ERROR: Didn't find op for builtin opcode 'CONV_2D' version '6'
        

        → suggests the exported op versions exceed what this image’s TFLite (2.8.0) supports.
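
      A quick way to audit an export before deploying it is to dump each builtin op and its version straight from the flatbuffer. Sketch below using the community tflite schema package (pip install tflite — our tooling choice, not something the VOXL image ships):

      import tflite  # pip install tflite (flatbuffer schema bindings)

      with open("yolov8n_int8.tflite", "rb") as f:
          model = tflite.Model.GetRootAsModel(f.read(), 0)

      # Map builtin enum values back to names for readability.
      names = {v: k for k, v in vars(tflite.BuiltinOperator).items()
               if not k.startswith("_")}

      for i in range(model.OperatorCodesLength()):
          oc = model.OperatorCodes(i)
          print(f"{names.get(oc.BuiltinCode(), oc.BuiltinCode())}  v{oc.Version()}")

      Anything this prints above the image’s supported versions (e.g., the CONV_2D v6 above) will fail at interpreter construction.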


      What we tried on the export side

      Objective: Emit NNAPI/GPU-friendly TFLite graphs the VOXL2-Mini image accepts.

      • Exported YOLO (v11 and v8) INT8 with Ultralytics: imgsz=640, built-in NMS, and a calibration set (~200 images from the MPA pipe, letterboxed to 640×640).
      • Attempted to force older TFLite converters (Docker images with TF 2.5/2.6) to down-rev op versions / change the resize semantics, but the official TF 2.5/2.6 images we pulled ship Python 3.6, so many modern wheels (Ultralytics, numpy, protobuf) are unavailable → dead end.
      • Kept a working host venv on TF 2.8.4. Exported YOLOv8n INT8 successfully; it loads on VOXL, but NNAPI still refuses RESIZE_NEAREST_NEIGHBOR v3 with half_pixel_centers==true → CPU fallback.
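
      One export-side experiment we haven’t exhausted (a sketch, not a verified fix): TF’s v1 resize op exposes half_pixel_centers directly, so a tiny converter probe can confirm whether a given converter version preserves half_pixel_centers=false before trying to graft that into the Ultralytics export:

      import tensorflow as tf

      # Probe: does this converter emit RESIZE_NEAREST_NEIGHBOR with
      # half_pixel_centers=false when asked for it explicitly?
      @tf.function(input_signature=[tf.TensorSpec([1, 320, 320, 3], tf.float32)])
      def resize(x):
          return tf.compat.v1.image.resize_nearest_neighbor(
              x, size=[640, 640], align_corners=False, half_pixel_centers=False)

      conv = tf.lite.TFLiteConverter.from_concrete_functions(
          [resize.get_concrete_function()])
      open("resize_probe.tflite", "wb").write(conv.convert())
      # Inspect resize_probe.tflite (e.g., with the op-dump sketch above) to see
      # which RESIZE_NEAREST_NEIGHBOR version and options were emitted.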

      Current status

      • ✅ NNAPI delegation confirmed for classic quant models (Mobilenet) in benchmark_model.

      • ❌ GPU delegate not working for YOLOv8/YOLO11 FP16 on this image (unsupported ops / delegate init failure; echoed in ModalAI’s own code comment).

      • ⚠️ YOLO (v8/v11) INT8 loads under voxl-tflite-server, but NNAPI refuses:

        • PACK (v1) (SDK < 30)
        • RESIZE_NEAREST_NEIGHBOR (v3, half_pixel_centers=true)
        • DEPTHWISE_CONV_2D (v4) (needs ≤ v3)
          → Falls back to XNNPACK (CPU) → high CPU.

      Questions for ModalAI

      1. Exact NNAPI & GPU capabilities on this VOXL2-Mini image:

        • Which builtin op versions are supported by NNAPI (e.g., DEPTHWISE_CONV_2D ≤ v3)?
        • Is PACK supported at all on this SDK, or always refused on SDK < 30?
        • For RESIZE_NEAREST_NEIGHBOR, do we need half_pixel_centers=false to get delegation?
        • For GPU, what’s the expected op coverage and are there known limitations with YOLOv8/YOLO11 graphs?
      2. Recommended export toolchain/settings to produce YOLO TFLite that fully delegates on this image:

        • Specific TensorFlow/TFLite converter version and export flags to:

          • Emit RESIZE_NEAREST_NEIGHBOR in an NNAPI-friendly form
          • Keep DEPTHWISE_CONV_2D at v≤3
          • Avoid inserting PACK nodes
        • A known-good YOLO .tflite that fully delegates to NNAPI/GPU on VOXL2-Mini would be ideal to test.

      3. HTP/QNN path: Is there an officially supported way to run detection on QNN/HTP while keeping the MPA pipe UX (i.e., a voxl-qnn-server-like component or documented pattern)? On QRB5165, QNN/HTP typically gives the best perf/compat.

      4. Image update: Is there a newer VOXL2-Mini image with newer NNAPI/GPU delegates (e.g., SDK ≥ 30; wider op coverage) so current YOLO exports delegate out-of-the-box?


      Repro bits (so you can mirror)

      Calibration set: ~200 frames from the MPA pipe, 1280×720, letterboxed to 640×640.
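
      How we letterbox, as a sketch (assuming OpenCV; the 114 gray pad is Ultralytics’ default, kept here by choice):

      import cv2
      import numpy as np

      def letterbox(img, size=640, pad=114):
          """Resize keeping aspect ratio, then pad to size x size."""
          h, w = img.shape[:2]
          s = size / max(h, w)
          resized = cv2.resize(img, (round(w * s), round(h * s)))
          out = np.full((size, size, 3), pad, dtype=np.uint8)
          top = (size - resized.shape[0]) // 2
          left = (size - resized.shape[1]) // 2
          out[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
          return out

      # e.g., cv2.imwrite("calib_images_640/frame_0001.png", letterbox(cv2.imread(src)))
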
      data.yaml (calibration only):

      path: /data1/kjwindham/home/kjwindham/src/kw-mpa/test
      train: calib_images_640
      val: calib_images_640
      names: ['obj']
      

      Export (host venv, TF 2.8.4):

      yolo export model=yolov8n.pt format=tflite int8=True nms=True imgsz=640 \
           data=/data1/kjwindham/home/kjwindham/src/kw-mpa/test/data.yaml
      

      Server config:

      {
        "skip_n_frames": 0,
        "model": "/usr/bin/dnn/yolov8n_int8.tflite",
        "input_pipe": "/run/mpa/hires_low_latency_misp_color/",
        "delegate": "nnapi",
        "requires_labels": true,
        "labels": "/usr/bin/dnn/yolov5_labels.txt",
        "allow_multiple": false,
        "output_pipe_prefix": "mobilenet"
      }
      

      Typical voxl-tflite-server log (YOLOv8n INT8):

      INFO: Created TensorFlow Lite delegate for NNAPI.
      WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true.
      INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
      Successfully built interpreter
      

      GPU example (YOLOv8n FP16) with benchmark_model:

      INFO: Created TensorFlow Lite delegate for GPU.
      ... unsupported ops (ADD, CONCAT, LOGISTIC, MUL, STRIDED_SLICE, SUB)
      ERROR: TfLiteGpuDelegate Init: MUL: Dimension can not be reduced to linear.
      ERROR: ...conv.cc:340 bias->type != input_type (10 != 1)
      Failed to apply GPU delegate.
      

      What we’re hoping for

      • A known-good recipe (export + runtime) that fully delegates YOLO to NNAPI or GPU on VOXL2-Mini as shipped, or
      • An official QNN/HTP example/service that integrates with the MPA pipe (ideally with a Python interface, to fit our current project), or
      • An updated system image that broadens NNAPI/GPU support to match current YOLO exports.

      Happy to run any candidate model/image and provide full logs. Thanks!
