ModalAI Forum

    dario-pisanti

    @dario-pisanti


    Best posts made by dario-pisanti

    • Neural network inference fails on VOXL2 Adreno GPU, but works on CPU, with Qualcomm SDK

      Hi,
      I hope you can help me with the following issue.

      SUMMARY:
      I am interested in running inference of deep neural network models on a VOXL2 using the Qualcomm Neural Processing SDK, ideally benefiting from the onboard GPU and NPU.
      Specifically, I'm trying to run a pre-trained VGG-16 model from the ONNX framework, following the tutorial at https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-2/tutorial_onnx.html
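
      Before converting, I run a short sanity check of the downloaded ONNX model on the host (this is my own check, not part of the tutorial; the vgg16.onnx path is assumed to match the tutorial's folder layout), using the onnx/onnxruntime versions listed in the specs below:

      import numpy as np
      import onnx
      import onnxruntime as ort

      model_path = "vgg16.onnx"  # assumed location of the downloaded model

      # Validate the graph structure before handing the model to the SDK converter.
      onnx.checker.check_model(onnx.load(model_path))

      # Dummy forward pass on the host CPU to confirm the model itself runs.
      session = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
      input_name = session.get_inputs()[0].name
      dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)  # VGG-16 expects a 1x3x224x224 input
      print(session.run(None, {input_name: dummy})[0].shape)     # expected: (1, 1000) class scores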

      After successfully converting the model from ONNX to DLC format through the Qualcomm SDK, everything works fine when I run inference of the vgg16.dlc model (step 7 of the tutorial) on the VOXL2 CPU by running:

      cd $SNPE_ROOT/examples/Models/VGG/data/cropped
      snpe-net-run --input_list raw_list.txt --container ../../dlc/vgg16.dlc --output_dir ../../output
      

      with the expected output:

      -------------------------------------------------------------------------------
      Model String: N/A
      SNPE v2.15.4.231013125348_62905
      -------------------------------------------------------------------------------
      Processing DNN input(s):
      /opt/qcom/aistack/snpe/2.15.4.231013/examples/Models/VGG/data/cropped/kitten.raw
      Successfully executed!
      

      However, when I enable GPU usage by running:

      snpe-net-run --input_list raw_list.txt --container ../../dlc/vgg16.dlc --output_dir ../../output --use_gpu
      

      I get the following error:

      error_code=201; error_message=Casting of tensor failed. error_code=201; error_message=Casting of tensor failed. Failed to create input tensor: vgg0_dense0_weight_permute for Op: vgg0_dense0_fwd error: 1002; error_component=Dl System; line_no=817; thread_id=547788872288; error_component=Dl System; line_no=277; thread_id=547865747472
      

      In conclusion, why does the same model inference work on the VOXL2 CPU but not on its GPU? In addition, does anyone have experience running deep learning inference on the VOXL2 NPU with the Qualcomm SDKs?

      HOW TO REPRODUCE:
      I successfully set up the Qualcomm Neural Processing SDK on the VOXL2 following the instructions at
      https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-2/setup.html, using the binaries in $SNPE_ROOT/bin/aarch64-ubuntu-gcc7.5, and I modified $SNPE_ROOT/bin/envsetup.sh accordingly so that the environment variables are set correctly.

      I followed the instructions from step 1 to step 4 of the VGG tutorial at https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-2/tutori... on the VOXL2.

      I converted the VGG ONNX model into the Qualcomm SDK DLC format (step 5) on a host machine running Ubuntu 20.04 with Clang 9 installed, where I set up the Qualcomm Neural Processing SDK using the binaries in $SNPE_ROOT/bin/x86_64-linux-clang (the conversion step is not supported on the VOXL2 architecture).

      I pushed the converted DLC model to the VOXL2 and followed the remaining instructions of the tutorial up to step 7, where I hit the error reported in the summary above.

      VOXL2 SPECS:
      Architecture: Aarch64
      OS: Ubuntu 18.04
      CPU: Qualcomm® QRB5165: 8 cores up to 3.091 GHz, 8GB LPDDR5
      GPU: Adreno 650 GPU - 1024 ALU
      NPU: 15 TOPS AI embedded Neural Processing Unit
      ONNX PYTHON PACKAGES: onnx==1.14.1, onnxruntime==1.16.1

      HOST SPECS:
      Architecture: x86
      OS: Ubuntu 20.04
      CPU: Intel(R) Xeon(R) W-2125 8 cores @ 4.00GHz
      GPU: NVIDIA Corporation GP106GL [Quadro P2000]
      ONNX PYTHON PACKAGES: onnx==1.14.1, onnxruntime==1.16.1

      FURTHER DETAILS:
      I checked the availability of the GPU runtime on the VOXL2 by executing the snpe-platform-validator tool (shipped with the Qualcomm Neural Processing SDK) from my host machine:

      cd /opt/qcom/aistack/snpe/2.15.4.231013/bin/x86_64-linux-clang 
      python3 snpe-platform-validator-py --runtime="all" --directory=/opt/qcom/aistack/snpe/2.15.4.231013 --buildVariant="aarch64-ubuntu-gcc7.5"
      

      The platform validator results for GPU are:

      Runtime supported: Supported
      Library Prerequisites: Found
      Library Version: Not Queried
      Runtime Core Version: Not Queried
      Unit Test: Passed
      Overall Result: Passed
      
      posted in Software Development

    Latest posts made by dario-pisanti

    • Fail to apply GPU delegate with custom model on voxl-tflite-server

      Hi @modaltb, @Chad-Sweet,

      I hope you can help me with this issue:

      SUMMARY:

      I deployed my own .tflite model on the VOXL2 by customizing the inference_helper.cpp class of voxl-tflite-server. My model takes two input images pre-loaded on board and performs image matching; no input is taken from the VOXL cameras.

      When I run the server, it fails to apply the GPU delegate, as shown in this output:

      (base) voxl2:/$ voxl-tflite-server 
      
      ================================================================= 
      skip_n_frames:                    0 
      ================================================================= 
      model:                            /usr/bin/dnn/outdoor_ds_640_ONNXop12_TFv2.8_ExpNewConv_custOps_float16.tflite 
      ================================================================= 
      input_pipe:                       /run/mpa/hires/ 
      ================================================================= 
      delegate:                         gpu 
      ================================================================= 
      allow_multiple:                   false 
      ================================================================= 
      output_pipe_prefix:               mobilenet 
      ================================================================= 
      
      existing instance of voxl-tflite-server found, attempting to stop it 
      
      INFO: Created TensorFlow Lite delegate for GPU. 
      
      Failed to apply GPU delegate 
      
      ------VOXL TFLite Server------ 
      
      

      It also failed to apply the XNNPACK and NNAPI delegates.

      For the deployment on the VOXL2, I modified the following files of voxl-tflite-server:

      • ./src/inference_helper.cpp
      • ./include/inference_helper.h
      • ./src/main.cpp
      • ./scripts/qrb5165/voxl-configure-tflite
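
      As a quick cross-check (my own addition, not part of voxl-tflite-server), I inspect the converted model's input and output signature on the host with the TFLite Python API, to confirm it really exposes the two image inputs my customized inference_helper.cpp expects; the model path is the one shown in the server output above:

      import tensorflow as tf

      interpreter = tf.lite.Interpreter(
          model_path="outdoor_ds_640_ONNXop12_TFv2.8_ExpNewConv_custOps_float16.tflite")

      # get_input_details()/get_output_details() do not require allocate_tensors(),
      # so this should work even though the model contains custom/Flex ops.
      for detail in interpreter.get_input_details():
          print("input :", detail["name"], detail["shape"], detail["dtype"])
      for detail in interpreter.get_output_details():
          print("output:", detail["name"], detail["shape"], detail["dtype"])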

      VOXL2 SPECS:
      Architecture: Aarch64
      OS: Ubuntu 18.04
      CPU: Qualcomm® QRB5165: 8 cores up to 3.091 GHz, 8GB LPDDR5
      GPU: Adreno 650 GPU - 1024 ALU
      NPU: 15 TOPS AI embedded Neural Processing Unit

      HOST (from which the voxl-tflite-server is deployed):
      Architecture: x86
      OS: Ubuntu 20.04
      CPU: Intel(R) Xeon(R) W-2125 8 cores @ 4.00GHz
      GPU: NVIDIA Corporation GP106GL [Quadro P2000]

      MODEL CONVERSION DETAILS:

      I converted my .tflite model from a TensorFlow model with post-training quantization, following the Python instructions at https://docs.modalai.com/voxl-tflite-server/

      This is my Python code for the conversion with tensorflow==2.8.0, following v2.8 API:

      import tensorflow as tf

      # Model paths: the SavedModel directory and the output .tflite name that
      # appear elsewhere in this post
      tf_model_path = "models/LoFTR/weights/outdoor_ds_640_ONNXop12_TFv2.8"
      tflite_model_path = "outdoor_ds_640_ONNXop12_TFv2.8_ExpNewConv_custOps_float16.tflite"

      # Load the TensorFlow SavedModel
      converter = tf.lite.TFLiteConverter.from_saved_model(tf_model_path)

      # Set converter flags
      converter.experimental_new_converter = True
      converter.allow_custom_ops = True

      # Post-training float16 quantization
      converter.optimizations = [tf.lite.Optimize.DEFAULT]
      converter.target_spec.supported_types = [tf.float16]

      # Model conversion and saving
      tflite_model = converter.convert()
      with open(tflite_model_path, 'wb') as f:
          f.write(tflite_model)
      

      The model is converted, although these warning messages are shown in the output:

      2023-12-08 19:34:18.409799: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA 
      
      To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. 
      
      2023-12-08 19:34:19.967671: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 2945 MB memory:  -> device: 0, name: Quadro P2000, pci bus id: 0000:65:00.0, compute capability: 6.1 
      
      2023-12-08 20:12:28.357684: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:357] Ignored output_format. 
      
      2023-12-08 20:12:28.357739: W tensorflow/compiler/mlir/lite/python/tf_tfl_flatbuffer_helpers.cc:360] Ignored drop_control_dependency. 
      
      2023-12-08 20:12:28.359555: I tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: models/LoFTR/weights/outdoor_ds_640_ONNXop12_TFv2.8 
      
      2023-12-08 20:12:28.437131: I tensorflow/cc/saved_model/reader.cc:78] Reading meta graph with tags { serve } 
      
      2023-12-08 20:12:28.437171: I tensorflow/cc/saved_model/reader.cc:119] Reading SavedModel debug info (if present) from: models/LoFTR/weights/outdoor_ds_640_ONNXop12_TFv2.8 
      
      2023-12-08 20:12:28.618928: I tensorflow/cc/saved_model/loader.cc:228] Restoring SavedModel bundle. 
      
      2023-12-08 20:12:29.814406: I tensorflow/cc/saved_model/loader.cc:212] Running initialization op on SavedModel bundle at path: models/LoFTR/weights/outdoor_ds_640_ONNXop12_TFv2.8 
      
      2023-12-08 20:12:30.886233: I tensorflow/cc/saved_model/loader.cc:301] SavedModel load for tags { serve }; Status: success: OK. Took 2526683 microseconds. 
      
      2023-12-08 20:12:32.444671: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:237] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable. 
      
      2023-12-08 20:12:34.744161: W tensorflow/compiler/mlir/lite/flatbuffer_export.cc:1903] The following operation(s) need TFLite custom op implementation(s): 
      
      Custom ops: Cast, Range, RealDiv, StridedSlice, Transpose 
      
      Details: 
      
              tf.Cast(tensor<1xf64>) -> (tensor<1xi64>) : {Truncate = false, device = ""} 
      
              tf.Cast(tensor<1xi64>) -> (tensor<1xf64>) : {Truncate = false, device = ""} 
      
              tf.Cast(tensor<?xf64>) -> (tensor<?xi64>) : {Truncate = false, device = ""} 
      
              tf.Cast(tensor<?xi64>) -> (tensor<?xf64>) : {Truncate = false, device = ""} 
      
              tf.Cast(tensor<f64>) -> (tensor<i64>) : {Truncate = false, device = ""} 
      
              tf.Cast(tensor<i64>) -> (tensor<f64>) : {Truncate = false, device = ""} 
      
              tf.Range(tensor<i64>, tensor<i64>, tensor<i64>) -> (tensor<?xi64>) : {device = ""} 
      
              tf.RealDiv(tensor<1xf64>, tensor<1xf64>) -> (tensor<1xf64>) : {device = ""} 
      
              tf.RealDiv(tensor<?xf64>, tensor<f64>) -> (tensor<?xf64>) : {device = ""} 
      
              tf.RealDiv(tensor<f64>, tensor<f64>) -> (tensor<f64>) : {device = ""} 
      
              tf.StridedSlice(tensor<5x2x60x80x60x1xi64>, tensor<1xi64>, tensor<1xi64>, tensor<1xi64>) -> (tensor<2x60x80x60x1xi64>) : {begin_mask = 0 : i64, device = "", ellipsis_mask = 0 : i64, end_mask = 0 : i64, new_axis_mask = 0 : i64, shrink_axis_mask = 1 : i64} 
      
              tf.Transpose(tensor<1x128x5x60x5x80xf32>, tensor<6xi32>) -> (tensor<1x128x5x5x60x80xf32>) : {device = ""} 
      
      See instructions: https://www.tensorflow.org/lite/guide/ops_custom
      

      If, in the Python conversion code, I disable the allow_custom_ops flag and instead enable the supported ops as shown in this snippet:

      -  converter.allow_custom_ops = True
      +  converter.target_spec.supported_ops = [
      +      tf.lite.OpsSet.TFLITE_BUILTINS,  # enable TensorFlow Lite ops
      +      tf.lite.OpsSet.SELECT_TF_OPS,    # enable TensorFlow (Flex) ops
      +  ]
      

      then the last output warnings turn into the following:

      2023-12-08 14:35:09.243817: W tensorflow/compiler/mlir/lite/flatbuffer_export.cc:1892] TFLite interpreter needs to link Flex delegate in order to run the model since it contains the following Select TFop(s): 
      Flex ops: FlexCast, FlexRange, FlexRealDiv, FlexStridedSlice, FlexTranspose 
      
      Details: 
      
              tf.Cast(tensor<1xf64>) -> (tensor<1xi64>) : {Truncate = false, device = ""} 
      
              tf.Cast(tensor<1xi64>) -> (tensor<1xf64>) : {Truncate = false, device = ""} 
      
              tf.Cast(tensor<?xf64>) -> (tensor<?xi64>) : {Truncate = false, device = ""} 
      
              tf.Cast(tensor<?xi64>) -> (tensor<?xf64>) : {Truncate = false, device = ""} 
      
              tf.Cast(tensor<f64>) -> (tensor<i64>) : {Truncate = false, device = ""} 
      
              tf.Cast(tensor<i64>) -> (tensor<f64>) : {Truncate = false, device = ""} 
      
              tf.Range(tensor<i64>, tensor<i64>, tensor<i64>) -> (tensor<?xi64>) : {device = ""} 
      
              tf.RealDiv(tensor<1xf64>, tensor<1xf64>) -> (tensor<1xf64>) : {device = ""} 
      
              tf.RealDiv(tensor<?xf64>, tensor<f64>) -> (tensor<?xf64>) : {device = ""} 
      
              tf.RealDiv(tensor<f64>, tensor<f64>) -> (tensor<f64>) : {device = ""} 
      
              tf.StridedSlice(tensor<5x2x60x80x60x1xi64>, tensor<1xi64>, tensor<1xi64>, tensor<1xi64>) -> (tensor<2x60x80x60x1xi64>) : {begin_mask = 0 : i64, device = "", ellipsis_mask = 0 : i64, end_mask = 0 : i64, new_axis_mask = 0 : i64, shrink_axis_mask = 1 : i64} 
      
              tf.Transpose(tensor<1x128x5x60x5x80xf32>, tensor<6xi32>) -> (tensor<1x128x5x5x60x80xf32>) : {device = ""} 
      
      See instructions: https://www.tensorflow.org/lite/guide/ops_select
      

      With the .tflite model generated using these new flags, voxl-tflite-server still fails to apply the GPU delegate.

      FURTHER DETAILS:

      I tested the same .tflite model in C++ by building TensorFlow Lite with CMake on a macOS Ventura 13.6 (x86), following the instructions at https://www.tensorflow.org/lite/guide/build_cmake

      I built the Flex delegate shared library "libtensorflowlite_flex.so" with the following command (see the instructions at https://www.tensorflow.org/lite/guide/ops_select):

      bazel build -c opt --config=monolithic tensorflow/lite/delegates/flex:tensorflowlite_flex
      

      and I linked it into my inference program.

      I was able to successfully run inference with the model and get correct output.
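
      For reference, a roughly equivalent check can be done in Python (again, my own sketch, not from the ModalAI docs): the full tensorflow pip package links the Flex delegate, so the SELECT_TF_OPS model can be exercised with tf.lite.Interpreter without building libtensorflowlite_flex.so. The input shapes are taken from the model and assumed to be static; dynamic inputs would need resize_tensor_input() first:

      import numpy as np
      import tensorflow as tf

      interpreter = tf.lite.Interpreter(
          model_path="outdoor_ds_640_ONNXop12_TFv2.8_ExpNewConv_custOps_float16.tflite")
      interpreter.allocate_tensors()

      # Feed random data matching each input's shape/dtype, then run the graph once.
      for detail in interpreter.get_input_details():
          dummy = np.random.rand(*detail["shape"]).astype(detail["dtype"])
          interpreter.set_tensor(detail["index"], dummy)
      interpreter.invoke()

      for detail in interpreter.get_output_details():
          print(detail["name"], interpreter.get_tensor(detail["index"]).shape)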

      posted in Software Development
    • Neural network inference fails on VOXL2 Adreno GPU, but works on CPU, with Qualcomm SDK

      posted in Software Development