voxl-tflite-server v0.3.0 apq8096 yolov5
-
Hey,
I just saw in the voxl-tflite-server dev branch that you now also support yolov5 models for apq8096. I wanted to ask if you have any performance results (inference time, max fps) on that model yet? Also, is there already a prebuilt ipk package for TensorFlow Lite v2.8.0? -
Also, which yolov5 model size are you using (n, s, m, l)?
-
Hey @Philemon-Benner,
We have yet to record any benchmarks for yolov5 on apq8096, but they will be up on our docs site here once they are collected.
I used the yolov5s weights for the model included in the voxl-tflite-server package, and the ipk can be downloaded from http://voxl-packages.modalai.com/dists/apq8096/dev/binary-arm64/ or installed via opkg if you are pointing to the apq8096 dev repo.
-
@Matt-Turi Thank you for the fast answer
-
@Matt-Turi Also, do you know what exactly the class confidence is?
static constexpr float threshold_class_confidence_ = 0.20; // not sure if this is too low or
But thanks for getting the new version running on apq8096, it helps me a lot.
-
@Matt-Turi In the yolov5 export, did you also use the --half option for fp16 inference?
-
The threshold_class_confidence_ parameter just sets a lower bound below which detections are discarded, based on the confidence of that class being detected in the image. We have this information available since we do the NMS post-processing ourselves, and it can be turned up for more robust detections. Also, the --half option was used on export, so a different quantization technique would be needed for increased speed. -
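For reference, that kind of confidence cutoff applied before NMS can be sketched roughly like this (the Detection struct and function names here are illustrative, not the actual voxl-tflite-server code):

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

struct Detection {
    float x, y, w, h;  // box in model input coordinates
    float objectness;  // yolov5 "object present" score
    float class_conf;  // score of the best class for this box
    int   class_id;
};

static constexpr float threshold_class_confidence_ = 0.20f;

// Discard boxes whose combined score (objectness * class score, which is
// how yolov5 defines the final confidence) falls below the cutoff,
// before running NMS on what remains.
std::vector<Detection> filter_by_confidence(const std::vector<Detection>& in) {
    std::vector<Detection> out;
    std::copy_if(in.begin(), in.end(), std::back_inserter(out),
                 [](const Detection& d) {
                     return d.objectness * d.class_conf >= threshold_class_confidence_;
                 });
    return out;
}
```

Raising the threshold then simply prunes more low-confidence boxes before they ever reach NMS.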
@Matt-Turi Ah, ok. How did you quantize the model?
-
Also, the model that I trained is slower than yours; is that because of the bigger model input size? I use 640x640, yours is 320x320; with your model I get around 41 ms and with mine around 110 ms per image.
Edit: I also used yolov5s and trained from scratch; I exported with the tflite and half options. -
I also used the tflite and half options when exporting, but you could try integer quantization instead of sticking with a floating point model. As for inference speed, that sounds about right - I chose the input dimensions of 320x320 when testing on VOXL for a balance between accuracy and speed; going with larger inputs usually slows down inference drastically.
-
@Matt-Turi Is integer quantization faster than a floating point model? Yeah, the problem is that we are using a higher resolution camera and we are flying at high altitudes, so cropping images down to a smaller resolution drastically changes the precision. Also, which input dimensions are you using for testing (camera size), and have you also tested the model outside of your office?
-
Integer quantization can drastically reduce the size of the model and speed up inference, but you typically do not target the GPU with an int model. See here for more details: https://www.tensorflow.org/lite/performance/post_training_quantization.
I understand the issue of cropping/downscaling images, and it may be worth implementing a tracker of sorts to selectively feed in different portions of the frame rather than just downscaling the entire image to a smaller resolution. Otherwise, if high fps is something that can be sacrificed, using a larger model would likely be better. The model has been tested outside of the office, and I tested with VGA [640x480] camera input but have only used the default yolo weights with no extra training.
-
@Matt-Turi Thanks for the answer. What do you mean by implementing a tracker that selectively feeds in different portions of the frame? Something like just running inference on the ROI of the image?
-
@Matt-Turi Also, besides the topic, you could add this to InferenceHelper.h:

```cpp
// gpu opts, for more info see:
// https://github.com/tensorflow/tensorflow/blob/v2.8.0/tensorflow/lite/delegates/gpu/delegate.h
static constexpr TfLiteGpuDelegateOptionsV2 gpuOptions = {
    -1,                                              // allow precision loss
    TFLITE_GPU_INFERENCE_PREFERENCE_SUSTAINED_SPEED, // preference
    TFLITE_GPU_INFERENCE_PRIORITY_MIN_LATENCY,       // priority 1
    TFLITE_GPU_INFERENCE_PRIORITY_AUTO,              // priority 2
    TFLITE_GPU_INFERENCE_PRIORITY_AUTO,              // priority 3
    TFLITE_GPU_EXPERIMENTAL_FLAGS_ENABLE_QUANT,      // experimental flags
    1, nullptr, nullptr                              // serialization stuff, probably don't touch
};
```

And you might need this in InferenceHelper.cpp (because C++11 handles constexpr static members poorly):

```cpp
constexpr TfLiteGpuDelegateOptionsV2 InferenceHelper::gpuOptions;
```

With that you can move the creation of the GPU options to compile time, and these should always be the same. TensorFlow recommends doing this because the default configuration might change.
Here is the function that you are calling when you are creating the options:
https://github.com/tensorflow/tensorflow/blob/v2.8.0/tensorflow/lite/delegates/gpu/delegate.cc#:~:text=TfLiteGpuDelegateOptionsV2-,TfLiteGpuDelegateOptionsV2Default,-() { -
@Matt-Turi Also, is there any particular reason for having a custom resize function instead of OpenCV's resize function?
-
Hey @Philemon-Benner,
In terms of a tracker, something we have done before is run inference on the entire image (downscaled) only until we find an object of interest, then switch to feeding in regions focused around the last detection to avoid downscaling and losing information. This is only useful in certain cases but can help with a smaller input size.
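The region-selection part of that idea could be sketched like this (all names and the fixed 320x320 crop size are assumptions for the example, not voxl-tflite-server code):

```cpp
#include <algorithm>

struct Rect { int x, y, w, h; };

// Return a crop_w x crop_h region centered on the last detection,
// clamped so the crop stays fully inside the frame_w x frame_h image.
// Feeding this region to the model avoids downscaling the whole frame.
Rect roi_around_last_detection(const Rect& last, int frame_w, int frame_h,
                               int crop_w = 320, int crop_h = 320) {
    int cx = last.x + last.w / 2;
    int cy = last.y + last.h / 2;
    int x = std::clamp(cx - crop_w / 2, 0, frame_w - crop_w);
    int y = std::clamp(cy - crop_h / 2, 0, frame_h - crop_h);
    return {x, y, crop_w, crop_h};
}
```

Detections coming back from the model would then need their coordinates offset by the crop origin before being published in full-frame coordinates.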
I will add your suggestion for the TfLiteGpuDelegateOptionsV2 in the next version. And we have a custom resize function using a lookup table for speed improvements, as OpenCV is a bit too slow for low-latency inference. -
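A lookup-table resize along those lines could look roughly like the following (nearest-neighbor, single channel; just an illustration of the technique, not the actual voxl-tflite-server implementation): the per-pixel source indices are computed once per image size, so each frame only does table lookups.

```cpp
#include <cstdint>
#include <vector>

struct ResizeLUT {
    std::vector<int> src_x;  // source column for each output column
    std::vector<int> src_y;  // source row for each output row
};

// Precompute the source index tables once for a given src/dst size pair.
ResizeLUT build_lut(int src_w, int src_h, int dst_w, int dst_h) {
    ResizeLUT lut;
    lut.src_x.resize(dst_w);
    lut.src_y.resize(dst_h);
    for (int x = 0; x < dst_w; ++x) lut.src_x[x] = x * src_w / dst_w;
    for (int y = 0; y < dst_h; ++y) lut.src_y[y] = y * src_h / dst_h;
    return lut;
}

// Per-frame resize of a single-channel 8-bit image: pure table lookups,
// no per-pixel multiplies or interpolation.
void resize_with_lut(const uint8_t* src, int src_w, uint8_t* dst,
                     int dst_w, int dst_h, const ResizeLUT& lut) {
    for (int y = 0; y < dst_h; ++y)
        for (int x = 0; x < dst_w; ++x)
            dst[y * dst_w + x] = src[lut.src_y[y] * src_w + lut.src_x[x]];
}
```

The trade-off versus a general OpenCV resize is flexibility for speed: the tables are fixed to one size pair, which is exactly the situation with a fixed camera and a fixed model input.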
@Matt-Turi Ah, ok, that sounds interesting; it might be worth a try. Good to know. Thanks for the fast answer, as always.