@Leo-Allesch, if you are talking about VSLAM (Visual Simultaneous Localization and Mapping), it is a sub-class of SLAM that uses only cameras. When an IMU is added, which is common, it is called VISLAM.
VSLAM and VISLAM usually generate and maintain a sparse map of 3D features (points), which is used for self localization.
If you want to detect a target and you have your 3D pose from VIO (VIO = VISLAM), you can either triangulate the target location from multiple frames or get its location from a single frame if you know the size of the target (such as an AprilTag). You can then transform the coordinates of the tag from the camera frame into your "world" (VIO) coordinates.
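As a rough illustration of the single-frame case, here is a minimal Python sketch. It assumes an ideal pinhole camera with no lens distortion, a target that roughly faces the camera, and a VIO pose given as a camera-to-world rotation matrix `R_wc` and translation `t_wc`. All function names and numbers are made up for the example and are not part of any VOXL2 API; a real AprilTag detector would normally give you the full tag pose directly.

```python
def estimate_range(focal_px, target_size_m, target_size_px):
    """Pinhole model: distance along the optical axis from apparent size."""
    return focal_px * target_size_m / target_size_px


def pixel_to_camera(u, v, depth_m, fx, fy, cx, cy):
    """Back-project pixel (u, v) at a known depth into the camera frame."""
    x = (u - cx) / fx * depth_m
    y = (v - cy) / fy * depth_m
    return (x, y, depth_m)


def camera_to_world(p_cam, R_wc, t_wc):
    """Apply the camera-to-world pose: p_world = R_wc * p_cam + t_wc."""
    return tuple(
        sum(R_wc[i][j] * p_cam[j] for j in range(3)) + t_wc[i]
        for i in range(3)
    )


# Hypothetical values: 500 px focal length, 640x480 image,
# a 0.16 m wide tag that appears 40 px wide, centered at pixel (420, 240).
fx = fy = 500.0
cx, cy = 320.0, 240.0
tag_range = estimate_range(fx, 0.16, 40.0)
tag_cam = pixel_to_camera(420.0, 240.0, tag_range, fx, fy, cx, cy)

# Hypothetical VIO pose: camera axes aligned with the world frame,
# camera located at (1.0, 2.0, 0.5) in world coordinates.
R_wc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_wc = (1.0, 2.0, 0.5)
tag_world = camera_to_world(tag_cam, R_wc, t_wc)
print(tag_world)
```

In practice you would also need the fixed camera-to-body (extrinsic) transform, since VIO usually reports the pose of the body or IMU rather than the camera itself.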
The depth sensor would allow you to build a more detailed 3D map / point cloud using the 3D pose output from the VIO algorithm (or using just the depth sensor, which may be more complicated). You can also build a dense 3D map using cameras only (plus an IMU), but that typically requires a lot more computation than using a depth sensor.
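To make the depth sensor plus VIO pose idea concrete, here is a minimal Python sketch that back-projects one depth image into world-frame points. It assumes a pinhole depth camera with known intrinsics, depth values in meters with 0 meaning "no return", and a camera-to-world pose (`R_wc`, `t_wc`) from VIO. The function name and inputs are hypothetical, and a real pipeline would also handle time synchronization, extrinsics, and filtering.

```python
def depth_to_world_points(depth, fx, fy, cx, cy, R_wc, t_wc):
    """Convert a depth image (list of rows, meters) into world-frame points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip invalid or missing depth readings
            # Back-project the pixel into the camera frame.
            p_cam = ((u - cx) / fx * z, (v - cy) / fy * z, z)
            # Move the point into the world frame using the VIO pose.
            points.append(tuple(
                sum(R_wc[i][j] * p_cam[j] for j in range(3)) + t_wc[i]
                for i in range(3)
            ))
    return points


# Tiny made-up example: a 2x2 depth image with two valid pixels.
depth = [[0.0, 1.0],
         [2.0, 0.0]]
R_wc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t_wc = (10.0, 0.0, 0.0)
cloud = depth_to_world_points(depth, 1.0, 1.0, 0.0, 0.0, R_wc, t_wc)
print(cloud)
```

Accumulating the points from many frames, each transformed with its own pose, is what gives you the denser map.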
You don't necessarily need a depth sensor for identifying the position of a specific target that is easily detected by a camera.
There is a lot of research material on these topics. Please look into it deeper and let us know if you have specific questions about using VOXL2 hardware and software.
Alex