Overview
Today I came across a video on Zhihu from Tesla AI Day, "Tesla's Autopilot Explained in 10 mins"1. It explains Autopilot layer by layer, so below I note down the talk along with some related extensions.
Tesla offers 3 tiers of autonomous driving2,3:
- Autopilot(standard)
- Enhanced Autopilot
- Full Self-Driving(FSD)
Eyes
8 cameras + 12 ultrasonic sensors
advanced sensor coverage, via tesla.com
Breakdown
this section describes how the vehicle sees the world
input/hydraNet5
The 8 cameras (left) around the vehicle generate a 3-dimensional "Vector Space" (right) via neural networks, which represents everything needed for driving: lines, edges, curbs, traffic signs, traffic lights, cars, and the positions, orientations, depths, and velocities of those cars
camera image -> image space -> 3D vector space representation
component
object detection structure: input → backbone → neck → head → output, which is roughly analogous to a human upper body (a minimal sketch follows the list below)
- backbone: the feature-extracting network, used to recognize the objects in a single image and provide rich feature information about them. Common backbones are AlexNet, ResNet, VGGNet, and Darknet53
- head: after feature extraction, the backbone gives us a feature-map representation of the input; the head makes predictions based on these feature maps. e.g., one-stage: YOLO, SSD; two-stage: Faster R-CNN
- neck: the neck sits between the backbone and the head and is used to extract more elaborate features, e.g. feature pyramid network (FPN), BiFPN
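A rough PyTorch sketch of this input → backbone → neck → head layout; the ResNet-18 trunk, channel sizes, and head shapes are illustrative choices, not Tesla's actual configuration:

```python
import torch
import torch.nn as nn
import torchvision

class TinyDetector(nn.Module):
    """Illustrative backbone -> neck -> head layout (not Tesla's real network)."""
    def __init__(self, num_classes=10):
        super().__init__()
        # backbone: a ResNet trunk used purely as a feature extractor
        resnet = torchvision.models.resnet18(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 512, H/32, W/32)
        # neck: refine/compress features (real systems use FPN/BiFPN over several scales)
        self.neck = nn.Conv2d(512, 256, kernel_size=1)
        # head: per-location class and box predictions (one-stage style)
        self.cls_head = nn.Conv2d(256, num_classes, kernel_size=3, padding=1)
        self.box_head = nn.Conv2d(256, 4, kernel_size=3, padding=1)

    def forward(self, x):
        feats = self.backbone(x)
        feats = self.neck(feats)
        return self.cls_head(feats), self.box_head(feats)

cls_map, box_map = TinyDetector()(torch.randn(1, 3, 256, 256))
print(cls_map.shape, box_map.shape)  # torch.Size([1, 10, 8, 8]) torch.Size([1, 4, 8, 8])
```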
Tesla's object detection structure is HydraNet (named after the many-headed Hydra), which means there are multiple detection heads in the network
brief
- multi-camera vs. single camera: multiple cameras capture full 360-degree surround information
- shared feature extraction (multi-scale features), the prerequisite for multiple scenarios/multiple detection tasks
- multi-head prediction with feature sharing across the feature map's multiple scales (see the sketch after this list)
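A hedged sketch of the multi-head idea: one shared trunk is computed once and cached, and several task-specific heads read from it. The task names and output channels here are made up for illustration, not Tesla's actual heads:

```python
import torch
import torch.nn as nn

class HydraSketch(nn.Module):
    """Shared trunk + multiple task heads, in the spirit of HydraNet (illustrative only)."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Conv2d(3, feat_dim, 3, stride=4, padding=1), nn.ReLU())
        # each driving task gets its own lightweight head on top of the shared features
        self.heads = nn.ModuleDict({
            "lanes":          nn.Conv2d(feat_dim, 2, 1),   # e.g. lane-line mask logits
            "traffic_lights": nn.Conv2d(feat_dim, 4, 1),   # e.g. light-state logits
            "vehicles":       nn.Conv2d(feat_dim, 6, 1),   # e.g. box/orientation regression
        })

    def forward(self, x):
        shared = self.trunk(x)                 # computed once, reused by every head
        return {name: head(shared) for name, head in self.heads.items()}

outs = HydraSketch()(torch.randn(1, 3, 128, 128))
print({k: tuple(v.shape) for k, v in outs.items()})
```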
multi-camera
only one input/raw data stream/camera is not enough, so 8 cameras are used
image space (a fusion of per-camera detections) represents a typical intermediate input
a transformer is used to convert image space into vector space (the red part below)
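One common way to realize such an image-space-to-vector-space step is cross-attention, where learned vector-space (BEV) queries attend to the flattened multi-camera features. The sketch below is an assumption about the general mechanism, not Tesla's exact transformer module, and all dimensions are illustrative:

```python
import torch
import torch.nn as nn

class ImageToVectorSpace(nn.Module):
    """Cross-attention from learned BEV queries to multi-camera features (illustrative)."""
    def __init__(self, feat_dim=256, bev_size=32):
        super().__init__()
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.bev_size = bev_size

    def forward(self, cam_feats):
        # cam_feats: (B, num_cams, feat_dim, H, W) features from the per-camera backbones
        B, N, C, H, W = cam_feats.shape
        kv = cam_feats.permute(0, 1, 3, 4, 2).reshape(B, N * H * W, C)  # flatten all cameras
        q = self.bev_queries.unsqueeze(0).expand(B, -1, -1)
        bev, _ = self.attn(q, kv, kv)                                   # queries gather image evidence
        return bev.transpose(1, 2).reshape(B, C, self.bev_size, self.bev_size)

bev = ImageToVectorSpace()(torch.randn(2, 8, 256, 12, 20))
print(bev.shape)  # torch.Size([2, 256, 32, 32])
```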
virtual camera/calibration
Because the parameters of Tesla's 8 cameras differ (focal length, angle of view, depth of field, mounting position), the same object does not look the same across cameras
This kind of data cannot be fed directly into training, so before training we standardize the 8 cameras into one synthetic virtual camera6
rectify/calibrate before regNet
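A minimal sketch of warping one physical camera's image into a shared virtual camera, assuming known intrinsics and a pure-rotation model so the mapping reduces to a single homography; the matrices below are hypothetical, not Tesla's calibration:

```python
import cv2
import numpy as np

def to_virtual_camera(img, K_cam, R_cam_to_virtual, K_virtual, out_size):
    """Warp one physical camera image into a shared virtual camera.

    Assumes a pure-rotation relation between the real and virtual views, so the
    mapping is the homography H = K_virtual * R * K_cam^-1 (illustrative only).
    """
    H = K_virtual @ R_cam_to_virtual @ np.linalg.inv(K_cam)
    return cv2.warpPerspective(img, H, out_size)

# hypothetical intrinsics for one camera and for the synthetic virtual camera
K_cam     = np.array([[800., 0., 640.], [0., 800., 360.], [0., 0., 1.]])
K_virtual = np.array([[700., 0., 640.], [0., 700., 360.], [0., 0., 1.]])
R = np.eye(3)  # identity here; in practice the per-camera mounting rotation

img = np.zeros((720, 1280, 3), dtype=np.uint8)
rectified = to_virtual_camera(img, K_cam, R, K_virtual, (1280, 720))
print(rectified.shape)  # (720, 1280, 3)
```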
(backbone + neck + transform) -> multi-camera features + kinematics as input
the kinematics information is basically the velocity and the acceleration
the feature queue contains a time-based queue (36 Hz, i.e. roughly every 27 ms) and a space-based queue (every 1 meter) that cache features and then continue into the video module
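A hedged sketch of such a dual-trigger cache: features are pushed on a time interval (~27 ms) and independently on distance traveled (1 m), so information is kept even when the car is stopped. The class and thresholds are illustrative, not Tesla's implementation:

```python
import collections

class FeatureQueue:
    """Caches features on a time trigger (~27 ms) and a space trigger (1 m) (illustrative)."""
    def __init__(self, maxlen=32):
        self.time_queue = collections.deque(maxlen=maxlen)
        self.space_queue = collections.deque(maxlen=maxlen)
        self._last_push_t = None
        self._dist_since_push = 0.0

    def update(self, feat, t, dist_moved):
        # push on elapsed time
        if self._last_push_t is None or (t - self._last_push_t) >= 0.027:
            self.time_queue.append(feat)
            self._last_push_t = t
        # push on distance traveled (useful when the car is stopped or crawling)
        self._dist_since_push += dist_moved
        if self._dist_since_push >= 1.0:
            self.space_queue.append(feat)
            self._dist_since_push = 0.0

q = FeatureQueue()
q.update(feat="bev_features_frame_0", t=0.00, dist_moved=0.4)
q.update(feat="bev_features_frame_1", t=0.03, dist_moved=0.7)
print(len(q.time_queue), len(q.space_queue))  # 2 1
```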
each brick stands for a vehicle; the red brick is the Tesla
orange marks cars detected from a single frame, while light blue marks cars detected from video (continuous frames)
in the red circle/dark blue area we can see that with video detection a car does not vanish even when occluded
brief
tesla vision network final structure
steps6:
- raw images are fed in at the bottom and go through a rectification layer that corrects for camera calibration and puts everything into one common virtual camera (the single-virtual-camera idea described above)
- they pass through RegNet residual networks that process them into features at a number of different scales, and the multi-scale information is fused with a BiFPN
- the result goes through a transformer module that re-represents it in the vector space, the output space
- it feeds into a feature queue in time and space that gets processed by a video module, such as a spatial RNN (GRU, LSTM); a sketch of this temporal fusion follows the list
- a HydraNet with trunks and heads handles all the different tasks
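A hedged sketch of the temporal fusion step: a GRU run per vector-space cell over the queued per-frame features. This is only one plausible reading of the "spatial RNN" video module; the plain nn.GRU and all dimensions are assumptions:

```python
import torch
import torch.nn as nn

class VideoModuleSketch(nn.Module):
    """Fuses queued per-frame BEV features over time with a GRU per spatial cell (illustrative)."""
    def __init__(self, feat_dim=256, hidden_dim=256):
        super().__init__()
        self.gru = nn.GRU(feat_dim, hidden_dim, batch_first=True)

    def forward(self, queued_feats):
        # queued_feats: (B, T, C, H, W) -- T cached frames from the feature queue
        B, T, C, H, W = queued_feats.shape
        seq = queued_feats.permute(0, 3, 4, 1, 2).reshape(B * H * W, T, C)
        _, h_last = self.gru(seq)                       # temporal fusion per BEV cell
        return h_last.squeeze(0).reshape(B, H, W, -1).permute(0, 3, 1, 2)

fused = VideoModuleSketch()(torch.randn(1, 8, 256, 16, 16))
print(fused.shape)  # torch.Size([1, 256, 16, 16])
```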
Rather than blindly chasing the newest models, Tesla fuses several existing models to build a practical engineering network
planner/controller
route planning and speed/acceleration control
route planning also takes other cars' status into consideration in real time
Monte Carlo tree search(MCTS)
a model-based planning system and a model-based reinforcement learning system
neural network + explicit cost functions(e.g., distance, collisions, comfort, traversal time)
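A hedged sketch of the "neural network proposals + explicit cost functions" idea: candidate trajectories (which would come from the neural planner) are scored with explicit terms for distance, collisions, comfort, and traversal time, and the cheapest one wins. The weights, thresholds, and hand-made candidates below are purely illustrative:

```python
import numpy as np

def trajectory_cost(traj, obstacles, w_dist=1.0, w_collision=100.0, w_comfort=0.1, w_time=1.0):
    """Explicit cost for one candidate trajectory (weights and terms are illustrative)."""
    pts = traj["points"]                                     # (N, 2) x/y samples along the path
    dist = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
    # collision: penalize coming closer than 2 m to any obstacle point
    gaps = np.linalg.norm(pts[:, None, :] - obstacles[None, :, :], axis=-1)
    collision = np.sum(gaps.min(axis=1) < 2.0)
    # comfort: penalize large accelerations
    comfort = np.sum(np.abs(traj["accel"]))
    return w_dist * dist + w_collision * collision + w_comfort * comfort + w_time * traj["time"]

# candidates would come from the neural planner; here they are hand-made stand-ins
obstacles = np.array([[5.0, 1.0]])
candidates = [
    {"points": np.array([[0, 0], [2, 0], [4, 0], [6, 0]], float), "accel": np.array([0.2, 0.1]), "time": 3.0},
    {"points": np.array([[0, 0], [2, 2], [4, 3], [6, 3]], float), "accel": np.array([0.8, 0.6]), "time": 3.5},
]
best = min(candidates, key=lambda t: trajectory_cost(t, obstacles))
print(best["points"][-1])  # end point of the cheapest (collision-free) trajectory
```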
the final architecture
vision system
(bottom-left) aims to generate the vector space from the cameras
neural network planner
receives the vector space and other features and generates a trajectory/route
explicit planner
combines the vector space and the trajectory to generate steering and acceleration/deceleration commands
solid dataset
- manual labeling
- auto labeling
- HD map
- simulation in various scenarios
Reference
- Tesla Autopilot Explained in 10 Minutes - Tesla AI Day Highlights
- Autopilot and Full Self-Driving Capability
- Tesla Autopilot – What Does It Do and How Does it Work?
- Tesla’s Autopilot Explained! Tesla AI Day in 10 Minutes
- Deep Understanding Tesla FSD Part 1: HydraNet
- Deep Understanding Tesla FSD Part 2: Vector Space
- Deep Understanding Tesla FSD Part 3: Planning & Control
- Deep Understanding Tesla FSD Part 4: Auto Labeling, Simulation
- 解读: Tesla Autopilot技术架构