What is SSD (Single Shot MultiBox Detector)?
SSD was devised to improve on YOLO, which struggles to detect small objects. SSD differs from YOLO in two main aspects:
- Multi Scale Feature Maps
- Default Boxes Generation
Let’s look at the overall architecture of SSD before going into these two aspects.
One important thing to note is that the number of channels in each output feature map should be set to $k \times (C + 4)$, where
- $k$ : the number of default boxes at each feature map location.
- $C$ : the number of classes to predict.
- $4$ : the number of box offsets predicted for each default box.
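The channel count above can be checked with a few lines of Python. This is only a minimal sketch: the function name `head_out_channels` and the values `k = 6`, `C = 21`, and the 19 x 19 map size are illustrative assumptions, not values given in this post.

```python
def head_out_channels(k: int, num_classes: int) -> int:
    """Channels of one prediction map: k boxes, each with C class scores + 4 offsets."""
    return k * (num_classes + 4)


# Hypothetical setting: 6 default boxes per location and 21 classes
# (for example 20 object classes plus background; an assumption, not from the text).
k, C = 6, 21
channels = head_out_channels(k, C)

# On a 19 x 19 feature map, every location carries k default boxes,
# and every default box owns one (C + 4)-vector of predictions.
fmap = 19
num_boxes = fmap * fmap * k

print(channels)   # 150
print(num_boxes)  # 2166
```

Each of the 2166 boxes in this example gets 21 class scores and 4 offsets, which is why the channel dimension grows with both $k$ and $C$.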
Now, let’s discuss these two aspects.
Multi Scale Feature Maps
SSD makes predictions from six feature maps of different scales: Conv4_3, Conv7, Conv8_2, Conv9_2, Conv10_2, and Conv11_2.
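To get a feel for how much each scale contributes, the sketch below counts default boxes per layer. The feature map sizes and boxes-per-location values are assumptions taken from the SSD300 configuration in the original paper, not from this post, and `count_default_boxes` is a hypothetical helper name.

```python
# Assumed SSD300 configuration (from the original paper, not from this post):
# layer name -> (feature map side length, default boxes per location)
FEATURE_MAPS = {
    "conv4_3": (38, 4),
    "conv7": (19, 6),
    "conv8_2": (10, 6),
    "conv9_2": (5, 6),
    "conv10_2": (3, 4),
    "conv11_2": (1, 4),
}


def count_default_boxes(feature_maps):
    """Return the number of default boxes contributed by each layer."""
    return {name: side * side * k for name, (side, k) in feature_maps.items()}


per_layer = count_default_boxes(FEATURE_MAPS)
total = sum(per_layer.values())

print(per_layer)
print(total)  # 8732
```

Under these assumptions, most boxes come from the large, early feature maps, which is where small objects are handled.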
Default Boxes Generation
Default boxes are candidate bounding boxes. For example, we create 42 grid points and generate 3 default boxes at each one, like:
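The grid-based generation can be sketched as follows. Several things here are assumptions for illustration only: the 6 x 7 layout is just one hypothetical way to arrange 42 points, the aspect ratios 1, 2, and 1/2 and the scale 0.2 are borrowed from the original SSD paper, and `default_boxes` is a made-up helper name.

```python
import math


def default_boxes(rows, cols, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return (cx, cy, w, h) boxes in image-relative coordinates (0 to 1)."""
    boxes = []
    for i in range(rows):
        for j in range(cols):
            # Centre of grid cell (i, j), normalised by the grid size.
            cx = (j + 0.5) / cols
            cy = (i + 0.5) / rows
            for ar in aspect_ratios:
                # Same area for every aspect ratio: w * h == scale ** 2.
                w = scale * math.sqrt(ar)
                h = scale / math.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return boxes


# Hypothetical 6 x 7 grid: 42 points, 3 default boxes at each point.
boxes = default_boxes(rows=6, cols=7, scale=0.2)
print(len(boxes))  # 126
print(boxes[:3])   # the three boxes that share the first grid point
```

All boxes at one grid point share the same centre and differ only in shape, so the network can pick whichever shape fits the object best.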
In this step, we set $k$ default boxes at each location, and the network predicts class scores and box offsets for each of them. Before doing that, we should set $S_k$, which is the ratio of the default box size to the input image size. $S_k$ is calculated by:
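In the original SSD paper, the scales are spaced linearly between a minimum and a maximum value across the $m$ feature maps:

$$S_k = S_{min} + \frac{S_{max} - S_{min}}{m - 1}(k - 1), \quad k \in [1, m]$$

Note that in this formula the subscript $k$ indexes the feature map, which is a different use of $k$ from the number of default boxes per location. A minimal Python sketch, assuming the paper's values $S_{min} = 0.2$ and $S_{max} = 0.9$ (the function name `box_scales` is hypothetical):

```python
def box_scales(m, s_min=0.2, s_max=0.9):
    """Linearly spaced default box scales for m feature maps (k = 1..m)."""
    if m < 2:
        raise ValueError("the linear rule needs at least two feature maps")
    return [s_min + (s_max - s_min) * (k - 1) / (m - 1) for k in range(1, m + 1)]


# Six feature maps, as in the list of prediction layers above.
scales = box_scales(6)
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

The earliest feature map gets the smallest boxes (20% of the image side under these assumptions) and the last one gets the largest (90%).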
Through these steps, we obtain the red box that detects the person.