Today I would like to talk with you about what visual SLAM is; many people may not know much about it. To help you understand it better, I have summarized the following content, and I hope you get something out of this article.
In recent years, SLAM technology has developed at a remarkable pace. Laser SLAM, the more established approach, is already deployed maturely in a variety of scenarios. Although visual SLAM has not yet reached the same level of real-world deployment, it is a hot topic of current research. Today we will discuss visual SLAM in detail.
What is visual SLAM?
Visual SLAM perceives the environment primarily through cameras. Cameras are relatively cheap, easy to integrate into commodity hardware, and capture rich image information, which is why visual SLAM has attracted so much attention.
Current visual SLAM systems can be divided into monocular, binocular (or multi-camera), and RGBD setups; fisheye, panoramic, and other special cameras exist but remain a minority in research and products. In addition, visual SLAM combined with an inertial measurement unit (IMU) is one of the current research hotspots. In terms of implementation difficulty, the three approaches rank roughly as: monocular > binocular > RGBD.
Monocular camera SLAM, abbreviated MonoSLAM, completes SLAM with only one camera. Its biggest advantage is that the sensor is simple and cheap, but it has a serious limitation: exact depth cannot be recovered.
On the one hand, because absolute depth is unknown, monocular SLAM cannot recover the true scale of the robot's trajectory and map. If the trajectory and the room were both magnified by a factor of two, the monocular view would look exactly the same, so monocular SLAM can only estimate relative depth. On the other hand, a monocular camera cannot determine the distance between itself and objects from a single image. To estimate this relative depth, monocular SLAM relies on triangulation during motion: as the camera moves, it solves for the camera motion and estimates the spatial positions of pixels. In other words, its trajectory and map can only converge after the camera has moved; if the camera stays still, the positions of the pixels cannot be determined. Moreover, the camera motion cannot be a pure rotation, which complicates the application of monocular SLAM. A minimal triangulation sketch follows.
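Here is a minimal sketch of triangulation between two frames, assuming the relative pose is already known (e.g. from an essential-matrix decomposition). The intrinsics, pose, and pixel coordinates below are illustrative values, not real data; cv2.triangulatePoints is the actual OpenCV routine. Note that scaling the translation t would scale the recovered point by the same factor, which is exactly the monocular scale ambiguity described above.

```python
# Triangulating one matched point from two camera views (illustrative values).
import numpy as np
import cv2

K = np.array([[525.0, 0.0, 320.0],      # hypothetical camera intrinsics
              [0.0, 525.0, 240.0],
              [0.0, 0.0, 1.0]])

# Projection matrices for the two poses: P = K [R | t].
R = np.eye(3)                            # pure translation, for clarity
t = np.array([[0.1], [0.0], [0.0]])      # translation is only known up to scale
P0 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P1 = K @ np.hstack([R, t])

# Matched pixel coordinates of the same point in the two frames (2 x N arrays).
pts0 = np.array([[300.0], [240.0]])
pts1 = np.array([[280.0], [240.0]])

# Triangulate to homogeneous 3D coordinates, then dehomogenize.
X_h = cv2.triangulatePoints(P0, P1, pts0, pts1)
X = (X_h[:3] / X_h[3]).ravel()
print("triangulated point:", X)          # depth is relative: scaling t scales X
```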
The difference with a binocular (stereo) camera is that stereo vision can estimate depth both in motion and at rest, which removes many of the headaches of monocular vision. However, the configuration and calibration of binocular or multi-camera rigs are complex, and their usable depth range is limited by the baseline and the resolution. Computing pixel-wise distance from stereo images is computationally expensive, so in practice it is often offloaded to an FPGA. The depth-from-disparity relation behind this is sketched below.
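A minimal sketch of the stereo depth relation, depth = f * B / disparity, which shows why baseline (B) and resolution bound the usable range: at long range, one pixel of disparity corresponds to many meters of depth. The focal length and baseline values here are illustrative assumptions.

```python
# Depth from disparity for a calibrated stereo pair (illustrative parameters).
def stereo_depth(disparity_px: float, focal_px: float = 700.0,
                 baseline_m: float = 0.12) -> float:
    """Return depth in meters given a pixel disparity."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px

print(stereo_depth(40.0))   # ~2.1 m: fine resolution at close range
print(stereo_depth(1.0))    # 84 m: beyond this the pair is effectively blind
```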
RGBD cameras are a type of camera that began to rise around 2010. Their biggest feature is that they directly measure the distance from the camera to each pixel in the image, using infrared structured light or the time-of-flight (ToF) principle. They therefore provide more information than a traditional camera, and they do not have to compute depth in the time-consuming, laborious way that monocular or binocular setups do; a back-projection sketch follows.
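A minimal sketch of why RGBD cameras skip the depth computation step: every pixel already carries a measured distance, so back-projecting it through the camera intrinsics yields a point cloud directly. The intrinsics and the fake depth image are illustrative assumptions.

```python
# Back-projecting a metric depth image to a point cloud (illustrative intrinsics).
import numpy as np

def depth_to_points(depth_m: np.ndarray, fx=525.0, fy=525.0,
                    cx=320.0, cy=240.0) -> np.ndarray:
    """Back-project an H x W depth image (meters) to an (H*W) x 3 point cloud."""
    h, w = depth_m.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth_m
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.dstack([x, y, z]).reshape(-1, 3)

cloud = depth_to_points(np.full((480, 640), 2.0))  # fake flat wall at 2 m
print(cloud.shape)                                  # (307200, 3)
```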
The Visual SLAM Framework
1. Sensor data
In visual SLAM, this stage mainly covers reading and preprocessing the camera images. On a robot, it may also involve reading and synchronizing wheel encoders, inertial sensors, and other sources of information.
2. Visual odometry
The main task of visual odometry is to estimate the camera motion between adjacent images and the appearance of the local map; the simplest case is the motion between two images. How does a computer determine camera motion from images? In an image we only see individual pixels, while knowing that they are projections of spatial points onto the camera's imaging plane. We must therefore first understand the geometric relationship between the camera and those spatial points.
Visual odometry (VO, also known as the front end) estimates camera motion from adjacent frames and recovers the spatial structure of the scene. It is called an odometer because it only computes motion between adjacent moments and carries no memory of earlier information. Chaining these adjacent-moment motions together yields the robot's trajectory, which solves the localization problem. Meanwhile, from the camera pose at each moment, the position of the spatial point behind each pixel can be computed, which yields the map. A two-frame front end is sketched below.
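Here is a minimal sketch of one common two-frame front-end recipe using OpenCV: match ORB features, estimate the essential matrix with RANSAC, and recover R and t. This is one standard approach, not the only one; img0 and img1 are placeholders for two consecutive grayscale frames, and K is the calibrated intrinsics matrix.

```python
# Two-frame visual odometry: feature matching + essential matrix + pose recovery.
import numpy as np
import cv2

def relative_pose(img0, img1, K):
    orb = cv2.ORB_create(2000)
    kp0, des0 = orb.detectAndCompute(img0, None)
    kp1, des1 = orb.detectAndCompute(img1, None)

    # Brute-force Hamming matching suits binary ORB descriptors.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des0, des1)

    pts0 = np.float32([kp0[m.queryIdx].pt for m in matches])
    pts1 = np.float32([kp1[m.trainIdx].pt for m in matches])

    # RANSAC on the essential matrix rejects bad matches (sensor noise).
    E, mask = cv2.findEssentialMat(pts0, pts1, K, cv2.RANSAC, 0.999, 1.0)
    # R, t describe the camera motion between frames; |t| is up to scale.
    _, R, t, _ = cv2.recoverPose(E, pts0, pts1, K, mask=mask)
    return R, t
```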
3. Back-end optimization
Back-end optimization mainly deals with the noise in the SLAM process. Every sensor is noisy, so in addition to solving "how to estimate camera motion from images", we must also care about how much noise that estimate contains.
The front end provides the back end with the data to be optimized, along with initial values for that data, while the back end is responsible for the overall optimization; it typically faces only the data and does not need to care where it came from. In visual SLAM, the front end is closer to the computer vision research field, e.g. image feature extraction and matching, while the back end mainly consists of filtering and nonlinear optimization algorithms, as in the sketch below.
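A toy sketch of back-end optimization as nonlinear least squares: three 1-D poses linked by noisy odometry plus one longer-range constraint, refined with SciPy. All numbers are made up, and real back ends optimize 6-DoF poses and landmarks, but the structure is the same: the front end supplies measurements and initial values, and the optimizer spreads the error consistently.

```python
# Pose adjustment as nonlinear least squares (1-D toy problem).
import numpy as np
from scipy.optimize import least_squares

measurements = [            # (i, j, measured displacement x_j - x_i)
    (0, 1, 1.1),            # odometry, slightly overestimated
    (1, 2, 1.0),            # odometry
    (0, 2, 1.9),            # loop-closure-style constraint
]

def residuals(x):
    x = np.concatenate([[0.0], x])       # anchor pose 0 at the origin
    return [(x[j] - x[i]) - z for i, j, z in measurements]

x0 = np.array([1.1, 2.1])                # initial values from the front end
sol = least_squares(residuals, x0)
print(sol.x)                             # poses adjusted to spread the error
```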
4. Loop detection
Loop detection, also called loop-closure detection, refers to the robot's ability to recognize that it has returned to a previously visited scene. If detection succeeds, the accumulated error can be significantly reduced. Loop detection is essentially an algorithm for measuring the similarity of observations. For visual SLAM, most systems adopt the relatively mature bag-of-words model (BoW): visual features (SIFT, SURF, etc.) from the images are clustered to build a dictionary, and the system then determines which "words" each image contains. Some researchers instead use traditional pattern recognition methods, casting loop detection as a classification problem and training classifiers for it. A toy BoW pipeline is sketched below.
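A minimal sketch of the bag-of-words idea: cluster ORB descriptors into a small vocabulary with k-means, describe each keyframe as a normalized word histogram, and compare histograms by cosine similarity. The vocabulary size, threshold, and `frames` list are all illustrative assumptions; production systems use far larger, pre-trained vocabularies (e.g. DBoW2).

```python
# Toy bag-of-words loop detection over a list of grayscale keyframes.
import numpy as np
import cv2
from sklearn.cluster import KMeans

def orb_descriptors(img):
    _, des = cv2.ORB_create(500).detectAndCompute(img, None)
    return des.astype(np.float32)

def bow_histogram(des, vocab):
    words = vocab.predict(des)                       # map features to "words"
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(np.float32)
    return hist / (np.linalg.norm(hist) + 1e-9)      # unit-length histogram

def detect_loop(frames, threshold=0.85):
    all_des = [orb_descriptors(f) for f in frames]
    vocab = KMeans(n_clusters=50, n_init=10).fit(np.vstack(all_des))
    hists = [bow_histogram(d, vocab) for d in all_des]
    latest = hists[-1]
    # Skip recent neighbors: a loop must revisit an *old* scene.
    for i, h in enumerate(hists[:-10]):
        if float(latest @ h) > threshold:            # cosine similarity
            return i                                 # candidate loop frame
    return None
```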
5. Building a map
The main purpose of mapping is to build a map suited to the task, given the estimated trajectory. In robotics there are four main kinds of map: occupancy grid (raster) maps, direct representations, topological maps, and feature point maps. Feature point maps represent the environment with geometric features (such as points, lines, and planes) and are common in visual SLAM. Such sparse maps are typically produced by vSLAM algorithms running on cameras, sometimes fused with GPS or UWB. Their advantage is that storage and computation are relatively small, and they appear in the earliest SLAM algorithms. A toy occupancy grid, the first map type above, is sketched below.
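A minimal sketch of the occupancy grid map type: the world is discretized into cells, and each cell records whether an obstacle was observed there. The resolution, grid size, and obstacle points are illustrative assumptions.

```python
# A tiny occupancy grid: mark world-frame obstacle points as occupied cells.
import numpy as np

RES = 0.05                                    # 5 cm per cell
grid = np.zeros((200, 200), dtype=np.uint8)   # 10 m x 10 m map, all free

def mark_obstacles(points_xy: np.ndarray):
    """Mark obstacle points (N x 2, meters, world frame) as occupied."""
    cells = (points_xy / RES).astype(int) + 100        # shift origin to center
    valid = (cells >= 0).all(axis=1) & (cells < 200).all(axis=1)
    grid[cells[valid, 1], cells[valid, 0]] = 1         # row = y, col = x

mark_obstacles(np.array([[1.0, 2.0], [1.05, 2.0], [-3.2, 0.4]]))
print(grid.sum(), "occupied cells")
```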
How Visual SLAM Works
Most visual SLAM systems work on continuous camera frames: they track a set of keypoints, locate their 3D positions by triangulation, and use that information to approximate the camera's own pose. Simply put, the goal of these systems is to map the environment relative to their own location. That map can then be used by a robot to navigate the environment. Unlike other forms of SLAM technology, this requires only a 3D vision camera.
By tracking a sufficient number of keypoints across camera video frames, we can quickly understand the sensor's orientation and the structure of the surrounding physical environment. All visual SLAM systems constantly work to minimize the reprojection error, i.e. the difference between a projected point and the actually observed point, usually through an algorithm called bundle adjustment (BA). Because a vSLAM system must run in real time and involves a heavy computational load, the localization data and the mapping data are often bundle-adjusted separately but simultaneously, which speeds up processing before the final merge. A sketch of the reprojection error follows.
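A minimal sketch of the quantity that BA minimizes: project the estimated 3D points through the estimated pose with cv2.projectPoints, then measure the pixel distance to the actually observed keypoints. The intrinsics, landmarks, and observations below are illustrative values; a real BA solver would iteratively adjust poses and points to shrink these residuals.

```python
# Computing reprojection error for two landmarks (illustrative values).
import numpy as np
import cv2

K = np.array([[525.0, 0, 320.0], [0, 525.0, 240.0], [0, 0, 1.0]])
rvec = np.zeros(3)                        # estimated rotation (Rodrigues vector)
tvec = np.zeros(3)                        # estimated translation
points_3d = np.array([[0.0, 0.0, 2.0], [0.5, -0.2, 3.0]])  # estimated landmarks
observed = np.array([[321.0, 240.5], [407.0, 204.0]])      # measured pixels

projected, _ = cv2.projectPoints(points_3d, rvec, tvec, K, None)
errors = np.linalg.norm(projected.reshape(-1, 2) - observed, axis=1)
print("reprojection error (px):", errors)  # BA adjusts poses/points to shrink this
```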
What is the difference between visual SLAM and laser SLAM?
In the industry, the question of whether visual SLAM or laser SLAM is better, and which will become the mainstream, has become a focus of attention, and different people hold different views. Below we compare the two in terms of cost, application scenario, map accuracy, and ease of use.
1. Cost
In terms of cost, lidar is generally expensive, although low-cost lidar solutions also exist in China. VSLAM collects data mainly through cameras, whose cost is obviously much lower than lidar's. However, lidar measures the angle and distance of obstacle points with higher precision, which is convenient for positioning and navigation.
2. Application scenarios
In terms of application scenarios, VSLAM's are much richer. VSLAM can work both indoors and outdoors, but it depends heavily on lighting and cannot work in the dark or in textureless areas. Laser SLAM, at present, is mainly used indoors for map construction and navigation.
3. Map accuracy
When building maps, laser SLAM achieves high accuracy: the RPLIDAR series from Silan Technology (Slamtec) builds maps accurate to around 2 cm. VSLAM with a commonly used depth camera such as the Kinect (ranging roughly 3-12 m) builds maps with an accuracy of about 3 cm. Laser SLAM maps therefore generally have higher accuracy than VSLAM maps and can be used directly for positioning and navigation.
(Figure: map built by visual SLAM)
4. Ease of use
Both laser SLAM and depth-camera-based visual SLAM obtain point cloud data from the environment directly, and from that point cloud they can read off where obstacles are and how far away they lie. By contrast, visual SLAM based on monocular, binocular, or fisheye cameras cannot obtain a point cloud directly: it produces grayscale or color images, and must keep moving, extracting and matching feature points, and triangulating in order to measure the distance to obstacles.
Generally speaking, laser SLAM is more mature and is currently the most reliable positioning and navigation scheme, while visual SLAM remains a mainstream research direction for the future; fusing the two is an inevitable trend.
After reading the above, do you have a better understanding of what visual SLAM is?