Overview
Today I came across a video on Zhihu from Tesla AI Day, "Tesla's Autopilot Explained in 10 mins"1. It explains Autopilot layer by layer, so below I note down the talk along with some related extensions.
Tesla offers 3 tiers of autonomous driving2,3:
- Autopilot(standard)
- Enhanced Autopilot
- Full Self-Driving(FSD)
Eyes
8 cameras + 12 ultrasonic sensors
advanced sensor coverage, via tesla.com
Breakdown
this section describes how the vehicle sees the world
input/hydraNet5
The 8 cameras (left) around the vehicle generate a 3-dimensional "Vector Space" (right) via neural networks, which represents everything needed for driving: lines, edges, curbs, traffic signs, traffic lights, cars, and the positions, orientations, depths, and velocities of those cars
camera image -> image space -> 3D vector space representation
component
object detection structure: input → backbone → neck → head → output, which is roughly analogous to a human upper body (a minimal sketch follows the list below)
- backbone: the feature-extracting network, used to recognize the objects in a single image and provide rich feature information about them. Common backbones are AlexNet, ResNet, VGGNet, and Darknet53
- head: after feature extraction, the backbone gives us a feature-map representation of the input; the head makes predictions based on these feature maps. e.g., one-stage: YOLO, SSD; two-stage: Faster R-CNN
- neck: the neck sits between the backbone and the head and is used to extract more elaborate features, e.g. feature pyramid network (FPN), BiFPN
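A rough PyTorch sketch of this input → backbone → neck → head layout; the ResNet-18 trunk, channel sizes, and head shapes are illustrative choices, not Tesla's actual configuration:

```python
import torch
import torch.nn as nn
import torchvision

class TinyDetector(nn.Module):
    """Illustrative backbone -> neck -> head layout (not Tesla's real network)."""
    def __init__(self, num_classes=10):
        super().__init__()
        # backbone: a ResNet trunk used purely as a feature extractor
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 512, H/32, W/32)
        # neck: refine/compress features (real systems use FPN/BiFPN over several scales)
        self.neck = nn.Conv2d(512, 256, kernel_size=1)
        # head: per-location class and box predictions (one-stage style)
        self.cls_head = nn.Conv2d(256, num_classes, kernel_size=3, padding=1)
        self.box_head = nn.Conv2d(256, 4, kernel_size=3, padding=1)

    def forward(self, x):
        feats = self.backbone(x)
        feats = self.neck(feats)
        return self.cls_head(feats), self.box_head(feats)

cls_map, box_map = TinyDetector()(torch.randn(1, 3, 256, 256))
print(cls_map.shape, box_map.shape)  # torch.Size([1, 10, 8, 8]) torch.Size([1, 4, 8, 8])
```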
Tesla's object detection structure is HydraNet (named after the many-headed Hydra), which means there are multiple detection heads in the network
brief
- multi-camera vs. single camera: multiple cameras capture full 360-degree surround information
- shared feature extraction (multi-scale features), the prerequisite for multiple scenarios/multiple detection tasks
- multi-head prediction with feature sharing across the feature map's multiple scales (see the sketch after this list)
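A hedged sketch of the multi-head idea: one shared trunk is computed once and cached, and several task-specific heads read from it. The task names and output channels here are made up for illustration, not Tesla's actual heads:

```python
import torch
import torch.nn as nn

class HydraSketch(nn.Module):
    """Shared trunk + multiple task heads, in the spirit of HydraNet (illustrative only)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(3, feat_dim, 3, stride=4, padding=1), nn.ReLU())
        # each driving task gets its own lightweight head on top of the shared features
        self.heads = nn.ModuleDict({
            "lanes":          nn.Conv2d(feat_dim, 2, 1),   # e.g. lane-line mask logits
            "traffic_lights": nn.Conv2d(feat_dim, 4, 1),   # e.g. light-state logits
            "vehicles":       nn.Conv2d(feat_dim, 6, 1),   # e.g. box/orientation regression
        })

    def forward(self, x):
        shared = self.trunk(x)                 # computed once, reused by every head
        return {name: head(shared) for name, head in self.heads.items()}

outs = HydraSketch()(torch.randn(1, 3, 128, 128))
print({k: tuple(v.shape) for k, v in outs.items()})
```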
multi-camera
only one input/raw data stream/camera is not enough, so 8 cameras are used
image space (a fusion of per-camera detections) represents a typical intermediate input
a transformer is used to convert image space into vector space (the red part below)
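One common way to realize such an image-space-to-vector-space step is cross-attention, where learned vector-space (BEV) queries attend to the flattened multi-camera features. The sketch below is an assumption about the general mechanism, not Tesla's exact transformer module, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ImageToVectorSpace(nn.Module):
    """Cross-attention from learned BEV queries to multi-camera features (illustrative)."""
    def __init__(self, feat_dim=256, bev_size=32):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.bev_size = bev_size

    def forward(self, cam_feats):
        # cam_feats: (B, num_cams, feat_dim, H, W) features from the per-camera backbones
        B, N, C, H, W = cam_feats.shape
        kv = cam_feats.permute(0, 1, 3, 4, 2).reshape(B, N * H * W, C)  # flatten all cameras
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        bev, _ = self.attn(q, kv, kv)                                   # queries gather image evidence
        return bev.transpose(1, 2).reshape(B, C, self.bev_size, self.bev_size)

bev = ImageToVectorSpace()(torch.randn(2, 8, 256, 12, 20))
print(bev.shape)  # torch.Size([2, 256, 32, 32])
```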
virtual camera/calibration
Because the parameters of Tesla's 8 cameras differ (focal length, angle of view, depth of field, mounting position), the same object does not look the same across cameras
This kind of data cannot be fed directly into training, so before training we standardize the 8 cameras into one synthetic virtual camera6
rectify/calibrate before regNet
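A minimal sketch of warping one physical camera's image into a shared virtual camera, assuming known intrinsics and a pure-rotation model so the mapping reduces to a single homography; the matrices below are hypothetical, not Tesla's calibration:

```python
import cv2
import numpy as np

def to_virtual_camera(img, K_cam, R_cam_to_virtual, K_virtual, out_size):
    """Warp one physical camera image into a shared virtual camera.

    Assumes a pure-rotation relation between the real and virtual views, so the
    mapping is the homography H = K_virtual * R * K_cam^-1 (illustrative only).
    """
    H = K_virtual @ R_cam_to_virtual @ np.linalg.inv(K_cam)
    return cv2.warpPerspective(img, H, out_size)

# hypothetical intrinsics for one camera and for the synthetic virtual camera
K_cam     = np.array([[800., 0., 640.], [0., 800., 360.], [0., 0., 1.]])
K_virtual = np.array([[700., 0., 640.], [0., 700., 360.], [0., 0., 1.]])
R = np.eye(3)  # identity here; in practice the per-camera mounting rotation

img = np.zeros((720, 1280, 3), dtype=np.uint8)
rectified = to_virtual_camera(img, K_cam, R, K_virtual, (1280, 720))
print(rectified.shape)  # (720, 1280, 3)
```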
(backbone + neck + transform) -> multi-camera features + kinematics as input
the kinematics information is basically the velocity and the acceleration
the feature queue contains a time-based queue (36 Hz, i.e. roughly every 27 ms) and a space-based queue (every 1 meter) that cache features and then continue into the video module
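A hedged sketch of such a dual-trigger cache: features are pushed on a time interval (~27 ms) and independently on distance traveled (1 m), so information is kept even when the car is stopped. The class and thresholds are illustrative, not Tesla's implementation:

```python
import collections

class FeatureQueue:
    """Caches features on a time trigger (~27 ms) and a space trigger (1 m) (illustrative)."""
    def __init__(self, maxlen=32):
        self.time_queue = collections.deque(maxlen=maxlen)
        self.space_queue = collections.deque(maxlen=maxlen)
        self._last_push_t = None
        self._dist_since_push = 0.0

    def update(self, feat, t, dist_moved):
        # push on elapsed time
        if self._last_push_t is None or (t - self._last_push_t) >= 0.027:
            self.time_queue.append(feat)
            self._last_push_t = t
        # push on distance traveled (useful when the car is stopped or crawling)
        self._dist_since_push += dist_moved
        if self._dist_since_push >= 1.0:
            self.space_queue.append(feat)
            self._dist_since_push = 0.0

q = FeatureQueue()
q.update(feat="bev_features_frame_0", t=0.00, dist_moved=0.4)
q.update(feat="bev_features_frame_1", t=0.03, dist_moved=0.7)
print(len(q.time_queue), len(q.space_queue))  # 2 1
```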
each brick stands for a vehicle; the red brick is the Tesla
orange marks cars detected from a single frame, while light blue marks cars detected from video (continuous frames)
in the red circle/dark blue area we can see that with video detection a car does not vanish even when occluded
brief
tesla vision network final structure
steps6:
- raw images are fed in at the bottom and go through a rectification layer that corrects for camera calibration and puts everything into one common virtual camera (the single-virtual-camera idea described above)
- they pass through RegNet residual networks that process them into features at a number of different scales, and the multi-scale information is fused with a BiFPN
- the result goes through a transformer module that re-represents it in the vector space, the output space
- it feeds into a feature queue in time and space that gets processed by a video module, such as a spatial RNN (GRU, LSTM); a sketch of this temporal fusion follows the list
- a HydraNet with trunks and heads handles all the different tasks
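A hedged sketch of the temporal fusion step: a GRU run per vector-space cell over the queued per-frame features. This is only one plausible reading of the "spatial RNN" video module; the plain nn.GRU and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class VideoModuleSketch(nn.Module):
    """Fuses queued per-frame BEV features over time with a GRU per spatial cell (illustrative)."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, queued_feats):
        # queued_feats: (B, T, C, H, W) -- T cached frames from the feature queue
        B, T, C, H, W = queued_feats.shape
        seq = queued_feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        _, h_last = self.gru(seq)                       # temporal fusion per BEV cell
        return h_last.squeeze(0).reshape(B, H, W, -1).permute(0, 3, 1, 2)

fused = VideoModuleSketch()(torch.randn(1, 8, 256, 16, 16))
print(fused.shape)  # torch.Size([1, 256, 16, 16])
```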
Rather than blindly chasing the newest models, Tesla fuses several existing models to build a practical engineering network
planner/controller
route planning and speed/acceleration control
route planning also takes other cars' status into consideration in real time
Monte Carlo tree search(MCTS)
a model-based planning system and a model-based reinforcement learning system
neural network + explicit cost functions(e.g., distance, collisions, comfort, traversal time)
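A hedged sketch of the "neural network proposals + explicit cost functions" idea: candidate trajectories (which would come from the neural planner) are scored with explicit terms for distance, collisions, comfort, and traversal time, and the cheapest one wins. The weights, thresholds, and hand-made candidates below are purely illustrative:

```python
import numpy as np

def trajectory_cost(traj, obstacles, w_dist=1.0, w_collision=100.0, w_comfort=0.1, w_time=1.0):
    """Explicit cost for one candidate trajectory (weights and terms are illustrative)."""
    pts = traj["points"]                                     # (N, 2) x/y samples along the path
    dist = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
    # collision: penalize coming closer than 2 m to any obstacle point
    gaps = np.linalg.norm(pts[:, None, :] - obstacles[None, :, :], axis=-1)
    collision = np.sum(gaps.min(axis=1) < 2.0)
    # comfort: penalize large accelerations
    comfort = np.sum(np.abs(traj["accel"]))
    return w_dist * dist + w_collision * collision + w_comfort * comfort + w_time * traj["time"]

# candidates would come from the neural planner; here they are hand-made stand-ins
obstacles = np.array([[5.0, 1.0]])
candidates = [
    {"points": np.array([[0, 0], [2, 0], [4, 0], [6, 0]], float), "accel": np.array([0.2, 0.1]), "time": 3.0},
    {"points": np.array([[0, 0], [2, 2], [4, 3], [6, 3]], float), "accel": np.array([0.8, 0.6]), "time": 3.5},
]
best = min(candidates, key=lambda t: trajectory_cost(t, obstacles))
print(best["points"][-1])  # end point of the cheapest (collision-free) trajectory
```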
the final architecture
vision system
(bottom-left) aims to generate the vector space from the cameras
neural network planner
receives the vector space and other features and generates a trajectory/route
explicit planner
combines the vector space and the trajectory to generate steering and acceleration/deceleration commands
solid dataset
- manual labeling
- auto labeling
- HD map
- simulation in various scenarios
Reference
- Tesla Autopilot Explained in 10 Minutes - Tesla AI Day Highlights
- Autopilot and Full Self-Driving Capability
- Tesla Autopilot – What Does It Do and How Does it Work?
- Tesla’s Autopilot Explained! Tesla AI Day in 10 Minutes
- Deep Understanding Tesla FSD Part 1: HydraNet
- Deep Understanding Tesla FSD Part 2: Vector Space
- Deep Understanding Tesla FSD Part 3: Planning & Control
- Deep Understanding Tesla FSD Part 4: Auto Labeling, Simulation
- 解读: Tesla Autopilot技术架构