      VOXL2-Mini (QRB5165): YOLO on NPU/NNAPI or GPU keeps falling back to CPU. Looking for a known-good export/runtime path (or QNN/HTP).

      Device: VOXL2-Mini (QRB5165)
      Goal: Run object detection on the NPU/HTP via NNAPI (preferred) or the GPU (FP16) using voxl-tflite-server, keeping CPU load low.
      Problem: Models load, but many ops aren’t delegated under NNAPI, or the GPU delegate fails outright → XNNPACK (CPU) takes over and pegs the CPU.


      System snapshot

      dpkg -l | grep -Ei 'tflite|tensorflow|voxl-tflite-server'
      ii  packagegroup-qti-ml-tflite   1.0-r0   all
      ii  qrb5165-tflite               2.8.0-2  arm64   (TensorFlow Lite for qrb5165)
      ii  tensorflow-lite              2.2-r0   arm64
      ii  voxl-tflite-server           0.4.1    arm64
      

      GPU device and libs appear present:

      ls -l /sys/class/kgsl/kgsl-3d0
      -> ../../devices/platform/soc/3d00000.qcom,kgsl-3d0/kgsl/kgsl-3d0
      
      ldconfig -p | grep -Ei 'EGL|GLES|OpenCL'
      ... libEGL_adreno.so, libGLESv2_adreno.so, libOpenCL.so, etc.
      

      What works (sanity check)

      • A quantized Mobilenet delegates and runs under NNAPI via benchmark_model:

        benchmark_model --graph=mobilenetv1_nnapi_quant.tflite \
          --use_nnapi=true --num_runs=50 --enable_op_profiling=true
        
        INFO: Created TensorFlow Lite delegate for NNAPI.
        INFO: Replacing 63 node(s) with delegate (TfLiteNnapiDelegate) node...
        Avg inference ~15.7 ms
        

        ✅ Confirms NNAPI delegation in the benchmarking tool.

        ⚠️ Note: We did not run a Mobilenet-style model end-to-end through voxl-tflite-server. The CPU pegging we observed was with YOLO models that partially delegated (with many ops refused) and fell back to XNNPACK (CPU). So Mobilenet is only validated in benchmark_model, not in the full server pipeline yet.
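
      For a CPU-only reference point, here’s a minimal timing sketch with the TFLite Python interpreter (a sketch under our assumptions: tensorflow or tflite_runtime installed wherever it runs; the model filename is ours):

      import time
      import numpy as np
      import tensorflow as tf  # or: from tflite_runtime.interpreter import Interpreter

      # No delegate requested, so this measures the XNNPACK/CPU path we fall back to.
      interp = tf.lite.Interpreter(model_path="yolov8n_int8.tflite", num_threads=4)
      interp.allocate_tensors()
      inp = interp.get_input_details()[0]

      # Zero input with the model's expected shape/dtype (timing only; real frames are letterboxed).
      dummy = np.zeros(inp["shape"], dtype=inp["dtype"])

      for _ in range(5):  # warm-up
          interp.set_tensor(inp["index"], dummy)
          interp.invoke()

      t0, n = time.time(), 50
      for _ in range(n):
          interp.set_tensor(inp["index"], dummy)
          interp.invoke()
      print(f"CPU avg: {(time.time() - t0) / n * 1e3:.1f} ms")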


      What fails or falls back

      1) GPU (FP16) attempts failed / underutilized

      • In-app tests: selecting GPU showed ~0.5% GPU utilization while the CPU spiked → suggests the GPU delegate wasn’t actually used.

      • benchmark_model with YOLOv11n FP16 on GPU:

        --use_gpu=true
        ERROR: Didn't find op for builtin opcode 'RESIZE_NEAREST_NEIGHBOR' version '3'
        ERROR: Registration failed.
        Failed to construct interpreter
        
      • benchmark_model with YOLOv8n FP16 on GPU:

        INFO: Created TensorFlow Lite delegate for GPU.
        ERROR: Next operations are not supported by GPU delegate:
               ADD, CONCATENATION, LOGISTIC, MUL, STRIDED_SLICE, SUB
        ...
        ERROR: TfLiteGpuDelegate Init: MUL: Dimension can not be reduced to linear.
        ERROR: TfLiteGpuDelegate Prepare: delegate is not initialized
        ERROR: .../tflite/kernels/conv.cc:340 bias->type != input_type (10 != 1)
        Failed to apply GPU delegate.
        
      • ModalAI’s own source notes the same issue (YOLOv8 path):

        // GPU delegate doesn't seem to work for yolov8 for whatever reason
        (In voxl-tflite-server v0.4.1)
        https://gitlab.com/voxl-public/voxl-sdk/services/voxl-tflite-server/-/blob/v0.4.1/src/model_helper/yolov8_model_helper.cpp#L167

        The surrounding call path is a straight interpreter->Invoke() after feeding FP32 input, consistent with our observations that YOLOv8 graphs don’t delegate to GPU on this image.
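
        (In the conv.cc error above, type 10 is kTfLiteFloat16 and type 1 is kTfLiteFloat32 in the TfLiteType enum, i.e., FP16 bias tensors reaching a CPU kernel that expects FP32 once the delegate bails out.) To reproduce the GPU failure outside the server, a sketch that applies the delegate from Python — the delegate .so name/path is an assumption and depends on how TFLite was built for this image:

      import tensorflow as tf

      # Hypothetical path: point this at the GPU delegate library on the image, if present.
      GPU_DELEGATE = "/usr/lib/libtensorflowlite_gpu_delegate.so"

      try:
          delegate = tf.lite.experimental.load_delegate(GPU_DELEGATE)
          interp = tf.lite.Interpreter(
              model_path="yolov8n_fp16.tflite",
              experimental_delegates=[delegate],
          )
          interp.allocate_tensors()  # this is where delegate init errors surface
          print("GPU delegate applied")
      except Exception as e:
          print(f"GPU delegate failed: {e}")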

      2) NNAPI (INT8) mostly loads but refuses key ops → CPU fallback

      • YOLOv11n INT8 via voxl-tflite-server:

        INFO: Created TensorFlow Lite delegate for NNAPI.
        WARNING: PACK (v1) refused: Android sdk version less than 30
        WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true
        WARNING: DEPTHWISE_CONV_2D (v4) refused: OP Version higher than 3
        INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
        Successfully built interpreter
        

        → Runs, but CPU pegged.

      • YOLOv8n INT8 via voxl-tflite-server:

        INFO: Created TensorFlow Lite delegate for NNAPI.
        WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true.
        INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
        Successfully built interpreter
        

        → Fewer refusals, but still CPU fallback.

      • We also hit a schema/version mismatch on one early export:

        ERROR: Didn't find op for builtin opcode 'CONV_2D' version '6'
        

        → suggests the exported op versions exceed what this image’s TFLite (2.8.0) supports.
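
      A quick way to audit an export before deploying it is to dump each builtin op and its version straight from the flatbuffer. Sketch below using the community tflite schema package (pip install tflite — our tooling choice, not something the VOXL image ships):

      import tflite  # pip install tflite (flatbuffer schema bindings)

      with open("yolov8n_int8.tflite", "rb") as f:
          model = tflite.Model.GetRootAsModel(f.read(), 0)

      # Map builtin enum values back to names for readability.
      names = {v: k for k, v in vars(tflite.BuiltinOperator).items()
               if not k.startswith("_")}

      for i in range(model.OperatorCodesLength()):
          oc = model.OperatorCodes(i)
          print(f"{names.get(oc.BuiltinCode(), oc.BuiltinCode())}  v{oc.Version()}")

      Anything this prints above the image’s supported versions (e.g., the CONV_2D v6 above) will fail at interpreter construction.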


      What we tried on the export side

      Objective: Emit NNAPI/GPU-friendly TFLite graphs the VOXL2-Mini image accepts.

      • Exported YOLO (v11 and v8) INT8 with Ultralytics: imgsz=640, built-in NMS, and a calibration set (~200 images from the MPA pipe, letterboxed to 640×640).
      • Attempted to force older TFLite converters (Docker images with TF 2.5/2.6) to down-rev op versions / change the resize semantics, but the official TF 2.5/2.6 images we pulled ship Python 3.6, so many modern wheels (Ultralytics, numpy, protobuf) are unavailable → dead end.
      • Kept a working host venv on TF 2.8.4. Exported YOLOv8n INT8 successfully; it loads on VOXL, but NNAPI still refuses RESIZE_NEAREST_NEIGHBOR v3 with half_pixel_centers==true → CPU fallback.
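
      One export-side experiment we haven’t exhausted (a sketch, not a verified fix): TF’s v1 resize op exposes half_pixel_centers directly, so a tiny converter probe can confirm whether a given converter version preserves half_pixel_centers=false before trying to graft that into the Ultralytics export:

      import tensorflow as tf

      # Probe: does this converter emit RESIZE_NEAREST_NEIGHBOR with
      # half_pixel_centers=false when asked for it explicitly?
      @tf.function(input_signature=[tf.TensorSpec([1, 320, 320, 3], tf.float32)])
      def resize(x):
          return tf.compat.v1.image.resize_nearest_neighbor(
              x, size=[640, 640], align_corners=False, half_pixel_centers=False)

      conv = tf.lite.TFLiteConverter.from_concrete_functions(
          [resize.get_concrete_function()])
      open("resize_probe.tflite", "wb").write(conv.convert())
      # Inspect resize_probe.tflite (e.g., with the op-dump sketch above) to see
      # which RESIZE_NEAREST_NEIGHBOR version and options were emitted.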

      Current status

      • ✅ NNAPI delegation confirmed for classic quant models (Mobilenet) in benchmark_model.

      • ❌ GPU delegate not working for YOLOv8/YOLO11 FP16 on this image (unsupported ops / delegate init failure; echoed in ModalAI’s own code comment).

      • ⚠️ YOLO (v8/v11) INT8 loads under voxl-tflite-server, but NNAPI refuses:

        • PACK (v1) (SDK < 30)
        • RESIZE_NEAREST_NEIGHBOR (v3, half_pixel_centers=true)
        • DEPTHWISE_CONV_2D (v4) (needs ≤ v3)
          → Falls back to XNNPACK (CPU) → high CPU.

      Questions for ModalAI

      1. Exact NNAPI & GPU capabilities on this VOXL2-Mini image:

        • Which builtin op versions are supported by NNAPI (e.g., DEPTHWISE_CONV_2D ≤ v3)?
        • Is PACK supported at all on this SDK, or always refused on SDK < 30?
        • For RESIZE_NEAREST_NEIGHBOR, do we need half_pixel_centers=false to get delegation?
        • For GPU, what’s the expected op coverage and are there known limitations with YOLOv8/YOLO11 graphs?
      2. Recommended export toolchain/settings to produce YOLO TFLite that fully delegates on this image:

        • Specific TensorFlow/TFLite converter version and export flags to:

          • Emit RESIZE_NEAREST_NEIGHBOR in an NNAPI-friendly form
          • Keep DEPTHWISE_CONV_2D at v≤3
          • Avoid inserting PACK nodes
        • A known-good YOLO .tflite that fully delegates to NNAPI/GPU on VOXL2-Mini would be ideal to test.

      3. HTP/QNN path: Is there an officially supported way to run detection on QNN/HTP while keeping the MPA pipe UX (i.e., a voxl-qnn-server-like component or documented pattern)? On QRB5165, QNN/HTP typically gives the best perf/compat.

      4. Image update: Is there a newer VOXL2-Mini image with newer NNAPI/GPU delegates (e.g., SDK ≥ 30; wider op coverage) so current YOLO exports delegate out-of-the-box?


      Repro bits (so you can mirror)

      Calibration set: ~200 frames from the MPA pipe, 1280×720, letterboxed to 640×640.
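
      How we letterbox, as a sketch (assuming OpenCV; the 114 gray pad is Ultralytics’ default, kept here by choice):

      import cv2
      import numpy as np

      def letterbox(img, size=640, pad=114):
          """Resize keeping aspect ratio, then pad to size x size."""
          h, w = img.shape[:2]
          s = size / max(h, w)
          resized = cv2.resize(img, (round(w * s), round(h * s)))
          out = np.full((size, size, 3), pad, dtype=np.uint8)
          top = (size - resized.shape[0]) // 2
          left = (size - resized.shape[1]) // 2
          out[top:top + resized.shape[0], left:left + resized.shape[1]] = resized
          return out

      # e.g., cv2.imwrite("calib_images_640/frame_0001.png", letterbox(cv2.imread(src)))
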
      data.yaml (calibration only):

      path: /data1/kjwindham/home/kjwindham/src/kw-mpa/test
      train: calib_images_640
      val: calib_images_640
      names: ['obj']
      

      Export (host venv, TF 2.8.4):

      yolo export model=yolov8n.pt format=tflite int8=True nms=True imgsz=640 \
           data=/data1/kjwindham/home/kjwindham/src/kw-mpa/test/data.yaml
      

      Server config:

      {
        "skip_n_frames": 0,
        "model": "/usr/bin/dnn/yolov8n_int8.tflite",
        "input_pipe": "/run/mpa/hires_low_latency_misp_color/",
        "delegate": "nnapi",
        "requires_labels": true,
        "labels": "/usr/bin/dnn/yolov5_labels.txt",
        "allow_multiple": false,
        "output_pipe_prefix": "mobilenet"
      }
      

      Typical voxl-tflite-server log (YOLOv8n INT8):

      INFO: Created TensorFlow Lite delegate for NNAPI.
      WARNING: RESIZE_NEAREST_NEIGHBOR (v3) refused: half_pixel_centers == true.
      INFO: Created TensorFlow Lite XNNPACK delegate for CPU.
      Successfully built interpreter
      

      GPU example (YOLOv8n FP16) with benchmark_model:

      INFO: Created TensorFlow Lite delegate for GPU.
      ... unsupported ops (ADD, CONCAT, LOGISTIC, MUL, STRIDED_SLICE, SUB)
      ERROR: TfLiteGpuDelegate Init: MUL: Dimension can not be reduced to linear.
      ERROR: ...conv.cc:340 bias->type != input_type (10 != 1)
      Failed to apply GPU delegate.
      

      What we’re hoping for

      • A known-good recipe (export + runtime) that fully delegates YOLO to NNAPI or GPU on VOXL2-Mini as shipped, or
      • An official QNN/HTP example/service that integrates with the MPA pipe (ideally with a Python interface, to fit our current project), or
      • An updated system image that broadens NNAPI/GPU support to match current YOLO exports.

      Happy to run any candidate model/image and provide full logs. Thanks!
