Tesla autonomous driving: points of interest

Overview

Today I came across the video "Tesla's Autopilot Explained in 10 mins"1 from Tesla AI Day on Zhihu. It walks through Autopilot layer by layer, so below I note down its content along with some related extensions.

Tesla has 3 tiers of autonomous driving2,3:

  1. Autopilot(standard)
  2. Enhanced Autopilot
  3. Full Self-Driving(FSD)

Eyes

8 cameras + 12 ultrasonic sensors

image

advanced sensor coverage, via tesla.com

Breakdown

This section describes how the vehicle sees the world.

input/HydraNet5

The 8 cameras (left) around the vehicle generate a 3-dimensional "vector space" (right) via neural networks, which represents everything needed for driving: lines, edges, curbs, traffic signs, traffic lights, and cars, together with the positions, orientations, depths, and velocities of those cars.

image

camera image -> image space -> 3D vector space representation
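
To make the idea of a vector space concrete, here is a toy sketch of what one entry in it might look like; the field names, units, and values are my own assumptions for illustration, not Tesla's actual schema.

```python
# Illustrative sketch of one "vector space" entry (fields/units are assumptions).
from dataclasses import dataclass

@dataclass
class VectorSpaceObject:
    kind: str                      # "car", "curb", "lane_line", "traffic_light", ...
    position: tuple                # (x, y, z) in meters, ego-centered
    orientation_yaw: float         # heading in radians
    depth: float                   # distance from the ego vehicle, meters
    velocity: tuple = (0.0, 0.0)   # (vx, vy) in m/s

# a tiny slice of one frame's vector space
frame = [
    VectorSpaceObject("car", (12.0, -1.5, 0.0), 0.05, 12.1, (8.3, 0.0)),
    VectorSpaceObject("traffic_light", (40.0, 3.0, 5.0), 0.0, 40.2),
]
print(len(frame), frame[0].kind)
```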

component

image

object detection structure, input → backbone → neck → head → output, which is similar to a human upper body

  • backbone: the feature-extraction network, which recognizes multiple objects in a single image and provides rich feature information about them. AlexNet, ResNet, VGGNet, and Darknet53 are commonly used as backbone networks
  • head: after feature extraction, the backbone gives us a feature-map representation of the input; the head makes predictions based on these feature maps. e.g., one-stage: YOLO, SSD; two-stage: Faster R-CNN
  • neck: sits between the backbone and the head and is used to extract more refined features, e.g., feature pyramid network (FPN), BiFPN

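To make the backbone → neck → head split concrete, here is a minimal PyTorch sketch; the layer sizes and the single detection head are illustrative assumptions, not any real detector.

```python
# Minimal backbone -> neck -> head sketch (illustrative, not a real detector).
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Extracts feature maps at two scales from an RGB image."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.stage1(x)          # higher resolution, low-level features
        f2 = self.stage2(f1)         # lower resolution, high-level features
        return f1, f2

class TinyNeck(nn.Module):
    """FPN-style fusion: upsample the deep map and merge it with the shallow one."""
    def __init__(self):
        super().__init__()
        self.lateral = nn.Conv2d(64, 32, 1)

    def forward(self, f1, f2):
        up = nn.functional.interpolate(self.lateral(f2), size=f1.shape[-2:], mode="nearest")
        return f1 + up               # fused multi-scale feature map

class TinyHead(nn.Module):
    """Dense prediction head: per-location class scores and box offsets."""
    def __init__(self, num_classes=4):
        super().__init__()
        self.cls = nn.Conv2d(32, num_classes, 3, padding=1)
        self.box = nn.Conv2d(32, 4, 3, padding=1)

    def forward(self, feat):
        return self.cls(feat), self.box(feat)

backbone, neck, head = TinyBackbone(), TinyNeck(), TinyHead()
img = torch.randn(1, 3, 128, 256)             # dummy camera frame
cls_map, box_map = head(neck(*backbone(img)))
print(cls_map.shape, box_map.shape)
```
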
image

Tesla's object detection structure, HydraNet (named after the many-headed hydra), which means there are multiple detection heads in the network

brief

  • multi-camera vs. single camera: obtain 360-degree surround information
  • common feature extraction (multi-scale features), a prerequisite for multi-scenario / multi-detection tasks
  • multi-head prediction with feature sharing across the feature map's multiple scales
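
The HydraNet idea above can be sketched as one shared trunk whose feature map is computed once and then fed to several task heads; the tasks, channel counts, and layer sizes below are illustrative assumptions.

```python
# HydraNet-style multi-task sketch: one shared feature extractor, many heads (illustrative).
import torch
import torch.nn as nn

class SharedTrunk(nn.Module):
    def __init__(self, out_ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.body(x)          # features computed once, shared by all heads

class MultiHeadDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = SharedTrunk()
        # each head solves one task on top of the same shared feature map
        self.heads = nn.ModuleDict({
            "vehicles":       nn.Conv2d(64, 6, 1),   # e.g. class + box channels
            "traffic_lights": nn.Conv2d(64, 4, 1),
            "lane_lines":     nn.Conv2d(64, 2, 1),
        })
    def forward(self, x):
        feat = self.trunk(x)
        return {name: head(feat) for name, head in self.heads.items()}

model = MultiHeadDetector()
outputs = model(torch.randn(1, 3, 128, 256))
print({k: tuple(v.shape) for k, v in outputs.items()})
```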

multi-camera

a single input / raw image / camera is not enough, hence the 8 cameras

image space (a fusion of per-camera detections) represents a typical input

a transformer is used to convert image space into vector space (the red part below); a toy cross-attention sketch follows the figure

image

the red block: the transformer from image space to vector space
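
The image-space → vector-space step can be approximated with cross-attention: learned vector-space (BEV) queries attend to the flattened multi-camera image features. A toy sketch, where nn.MultiheadAttention and all the sizes stand in for Tesla's actual transformer:

```python
# Toy image-space -> vector-space translation via cross-attention (sizes are made up).
import torch
import torch.nn as nn

class ImageToVectorSpace(nn.Module):
    def __init__(self, dim=64, bev_cells=200):
        super().__init__()
        # one learned query per output (vector-space / BEV) cell
        self.bev_queries = nn.Parameter(torch.randn(bev_cells, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, cam_feats):
        # cam_feats: (batch, n_cameras, channels, H, W) image-space features
        b, n, c, h, w = cam_feats.shape
        kv = cam_feats.permute(0, 1, 3, 4, 2).reshape(b, n * h * w, c)  # flatten all cameras
        q = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.attn(q, kv, kv)        # each BEV cell gathers evidence from every camera
        return bev                            # (batch, bev_cells, channels)

module = ImageToVectorSpace()
bev = module(torch.randn(2, 8, 64, 12, 20))  # feature maps from 8 cameras
print(bev.shape)                              # torch.Size([2, 200, 64])
```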

virtual camera/calibration

Because the parameters of Tesla's 8 cameras differ (focal length, field of view, depth of field, mounting position), the same object looks different under different cameras.

This kind of data cannot be used directly for training, so before training we standardize the 8 cameras into one synthetic virtual camera6.

image

rectify/calibrate before RegNet
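
One simple way to picture rectification into a common virtual camera is a per-camera homography warp, H = K_virtual · R · K_camera⁻¹; this only accounts for intrinsic and rotational differences (not camera placement), and the intrinsics and rotation below are placeholders since Tesla's calibration pipeline is not public. Requires NumPy and OpenCV.

```python
# Sketch: warp one physical camera's image into a shared virtual camera
# via the homography H = K_virt @ R @ inv(K_cam). All values are placeholders.
import numpy as np
import cv2

def rectify_to_virtual(image, K_cam, R_cam_to_virt, K_virt, out_size):
    """Warp `image` from its physical camera into the common virtual camera frame."""
    H = K_virt @ R_cam_to_virt @ np.linalg.inv(K_cam)
    return cv2.warpPerspective(image, H, out_size)

# placeholder intrinsics for one physical camera and the shared virtual camera
K_cam  = np.array([[900.0, 0.0, 640.0], [0.0, 900.0, 360.0], [0.0, 0.0, 1.0]])
K_virt = np.array([[800.0, 0.0, 640.0], [0.0, 800.0, 360.0], [0.0, 0.0, 1.0]])
# small rotation correcting this camera's mounting angle (placeholder)
theta = np.deg2rad(2.0)
R = np.array([[ np.cos(theta), 0.0, np.sin(theta)],
              [ 0.0,           1.0, 0.0          ],
              [-np.sin(theta), 0.0, np.cos(theta)]])

frame = np.zeros((720, 1280, 3), dtype=np.uint8)        # dummy camera frame
rectified = rectify_to_virtual(frame, K_cam, R, K_virt, (1280, 720))
print(rectified.shape)
```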

image

(backbone + neck + transformer) -> multi-camera features, plus kinematics, as input

the kinematics information is basically the velocity and the acceleration

the feature queue contains a time-based queue (pushed every ~27 ms, i.e. 36 Hz) and a space-based queue (pushed every 1 meter traveled) to cache features, which then flow into the video module
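
A feature queue like this can be sketched as two fixed-length deques, one pushed on elapsed time (~27 ms) and one on distance travelled (1 m); the two thresholds come from the talk, while the class structure, queue length, and tick rates are assumptions.

```python
# Sketch of the time-based / space-based feature queue (thresholds from the talk,
# structure and sizes are assumptions).
from collections import deque

class FeatureQueue:
    def __init__(self, maxlen=20, time_step=0.027, space_step=1.0):
        self.time_queue = deque(maxlen=maxlen)    # pushed every ~27 ms
        self.space_queue = deque(maxlen=maxlen)   # pushed every 1 m travelled
        self.time_step, self.space_step = time_step, space_step
        self._last_t, self._last_dist = 0.0, 0.0

    def push(self, features, t, dist_travelled):
        """Cache `features` (plus kinematics, positional encodings, ...) when due."""
        if t - self._last_t >= self.time_step:
            self.time_queue.append((t, features))
            self._last_t = t
        if dist_travelled - self._last_dist >= self.space_step:
            self.space_queue.append((dist_travelled, features))
            self._last_dist = dist_travelled

    def snapshot(self):
        # the video module consumes both queues
        return list(self.time_queue), list(self.space_queue)

q = FeatureQueue()
for i in range(100):
    q.push(features=f"feat_{i}", t=i * 0.01, dist_travelled=i * 0.1)  # 10 ms ticks, 0.1 m steps
time_q, space_q = q.snapshot()
print(len(time_q), len(space_q))
```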

image

each brick stands for a vehicle; the red brick is the Tesla

orange marks cars detected from a single frame, while light blue marks cars detected from video (continuous frames)

we can see that, in the red circle (dark blue), the car does not vanish even when occluded, thanks to video-based detection

brief

image

tesla vision network final structure

steps6:

  1. raw images are fed in at the bottom and pass through a rectification layer that corrects for camera calibration and maps everything into one common virtual camera
  2. the images pass through RegNet residual networks that process them into features at a number of different scales, and the multi-scale information is fused with a BiFPN
  3. the fused features go through a transformer module that re-represents them in the vector space (the output space)
  4. this feeds into a feature queue in time and space, which is processed by a video module such as a spatial RNN (GRU, LSTM)
  5. finally, a HydraNet with trunks and heads handles all the different tasks

Rather than blindly chasing the newest models, Tesla combines several existing models into a practical engineering network.
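
Putting the five steps together, the data flow can be written as one forward pass. Every submodule below is a simplified stand-in I chose for illustration (the real rectification, RegNet/BiFPN, transformer, video module, and heads are far larger, and a real video module would run over the feature queue's time axis rather than over BEV cells).

```python
# High-level wiring of the vision network's five steps (all submodules are stand-ins).
import torch
import torch.nn as nn

class VisionNetworkSketch(nn.Module):
    def __init__(self, dim=64, bev_cells=200):
        super().__init__()
        self.rectify   = nn.Identity()                                    # 1. virtual-camera rectification
        self.backbone  = nn.Sequential(                                   # 2. RegNet + BiFPN stand-in
            nn.Conv2d(3, dim, 3, stride=4, padding=1), nn.ReLU())
        self.bev_query = nn.Parameter(torch.randn(bev_cells, dim))
        self.to_vector = nn.MultiheadAttention(dim, 4, batch_first=True)  # 3. transformer to vector space
        self.video     = nn.GRU(dim, dim, batch_first=True)               # 4. recurrent video-module stand-in
        self.heads     = nn.ModuleDict({"vehicles": nn.Linear(dim, 6),    # 5. HydraNet heads
                                        "lanes":    nn.Linear(dim, 2)})

    def forward(self, cameras):                      # cameras: (batch, 8, 3, H, W)
        b, n, c, h, w = cameras.shape
        x = self.rectify(cameras).reshape(b * n, c, h, w)
        feats = self.backbone(x)                     # (b*n, dim, h', w')
        feats = feats.flatten(2).transpose(1, 2).reshape(b, -1, feats.shape[1])
        q = self.bev_query.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.to_vector(q, feats, feats)     # fuse all cameras into vector space
        bev, _ = self.video(bev)                     # placeholder for the queued-feature video module
        return {name: head(bev) for name, head in self.heads.items()}

net = VisionNetworkSketch()
out = net(torch.randn(1, 8, 3, 64, 128))
print({k: tuple(v.shape) for k, v in out.items()})
```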

planner/controller

route planning and speed/acceleration control

image

route planning that takes other cars' status into consideration in real time

image

Monte Carlo tree search(MCTS)

a model-based planning system and a model-based reinforcement learning system

neural network + explicit cost functions(e.g., distance, collisions, comfort, traversal time)
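
The explicit cost terms can be sketched as a weighted sum scored over candidate trajectories, keeping the cheapest one; the cost terms mirror the list above, while the weights, thresholds, and trajectory format are invented for the example.

```python
# Sketch of explicit trajectory scoring (cost terms from the talk, weights/format invented).
import numpy as np

def trajectory_cost(traj, obstacles, goal, dt=0.1,
                    w_goal=1.0, w_collision=100.0, w_comfort=0.5, w_time=0.1):
    """traj: (T, 2) array of x,y positions; obstacles: (M, 2); goal: (2,)."""
    # distance-to-goal cost: how far the end of the trajectory is from the goal
    goal_cost = np.linalg.norm(traj[-1] - goal)
    # collision cost: penalize coming within 2 m of any obstacle
    dists = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=-1)
    collision_cost = np.sum(dists < 2.0)
    # comfort cost: magnitude of acceleration (second finite difference)
    accel = np.diff(traj, n=2, axis=0) / dt**2
    comfort_cost = np.mean(np.linalg.norm(accel, axis=-1)) if len(accel) else 0.0
    # traversal-time cost: longer trajectories take longer
    time_cost = len(traj) * dt
    return (w_goal * goal_cost + w_collision * collision_cost +
            w_comfort * comfort_cost + w_time * time_cost)

# pick the cheapest of a few candidate trajectories (e.g. proposed by the planner)
goal = np.array([20.0, 0.0])
obstacles = np.array([[10.0, 0.5]])
candidates = [np.stack([np.linspace(0, 20, 50), np.full(50, y)], axis=1) for y in (0.0, 2.0)]
best = min(candidates, key=lambda t: trajectory_cost(t, obstacles, goal))
print("chosen lateral offset:", best[0, 1])   # swerves to the offset lane around the obstacle
```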

image

the final architecture

the vision system (bottom-left) generates the vector space from the cameras

the neural-network planner receives the vector space and other features and generates a trajectory/route

the explicit planner combines the vector space and the trajectory to generate steering and acceleration/deceleration commands

solid dataset

  • manual labeling
  • auto labeling
  • HD map
  • simulation in various scenarios

Reference

  1. Tesla Autopilot Explained in 10 Minutes - Tesla AI Day Highlights
  2. Autopilot and Full Self-Driving Capability
  3. Tesla Autopilot – What Does It Do and How Does it Work?
  4. Tesla’s Autopilot Explained! Tesla AI Day in 10 Minutes
  5. Deep Understanding Tesla FSD Part 1: HydraNet
  6. Deep Understanding Tesla FSD Part 2: Vector Space
  7. Deep Understanding Tesla FSD Part 3: Planning & Control
  8. Deep Understanding Tesla FSD Part 4: Auto Labeling, Simulation
  9. 解读: Tesla Autopilot技术架构