What is SPP?
SPP allows CNN to be applied regardless of the size of the input image. Now, let’s see how SPP works.
First, SPP takes the feature map extracted through the Conv Layers as input and then separates the inputs into predetermined areas. For example, as you can see in the image below, it provides three areas in advance: 4x4, 2x2, and 1x1, each called a pyramid.
Spatial Pyramid Pooling Structure
At this point, we call one block of the pyramid a “bin”. For example, if the input size of the feature map is $(128 \times 128 \times 256)$, the bin size of the $4 \times 4$ pyramid will be $(32 \times 32)$. 
Each bin performs max pooling and concatenates all the results. At this point, the output of SPP is a $kM$-dimensional vector where $k$ is the number of channels in the input feature map and $M$ is the number of bins. So, in this example, $k = 256$ and $M = (16 + 4 + 1)$. 
 
As you can see in the calculation above, regardless of the size of the input feature map, the shape of the output vector is the same. This is why SPP allows CNN to be applied regardless of the size of the input image.
