The paper notes that with a 3x3 sliding window, the effective receptive field on the original image is 228 pixels for the VGG model and 171 pixels for the ZF model.

For VGG16 (figure source: Kaggle), the effective receptive field in Faster R-CNN can be calculated as follows:

Img -> Conv1(3) -> Conv1(3) -> Pool1(2) ==> Conv2(3) -> Conv2(3) -> Pool2(2) ==> Conv3(3) -> Conv3(3) -> Conv3(3) -> Pool3(2) ==> Conv4(3) -> Conv4(3) -> Conv4(3) -> Pool4(2) ==> Conv5(3) -> Conv5(3) -> Conv5(3) ====> a 3x3 window in the feature map.

Let's take one dimension for simplicity. Working backward from a window of size 3, each 3x3 convolution (stride 1) adds 2 to the receptive field, and each 2x2 pooling layer (stride 2) doubles it:

1). at the beginning of Conv5: 3 + 2 + 2 + 2 = 9
2). at the beginning of Conv4: 9 * 2 + 2 + 2 + 2 = 24
3). at the beginning of Conv3: 24 * 2 + 2 + 2 + 2 = 54
4). at the beginning of Conv2: 54 * 2 + 2 + 2 = 112
5). at the beginning of Conv1 (original input): 112 * 2 + 2 + 2 = 228
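The backward derivation above can be checked with a few lines of Python. This is a minimal sketch, not code from the paper: the function name `receptive_field` and the `(kernel, stride)` layer encoding are hypothetical, and it assumes every VGG16 convolution is 3x3 with stride 1 and every pooling layer is 2x2 with stride 2. It applies the general one-dimensional recurrence r_in = stride * (r_out - 1) + kernel, which reduces to "+2" for the convolutions and "*2" for the pooling layers.

```python
def receptive_field(layers, window=3):
    """Walk backward from a window on the last feature map to the input.

    layers: list of (kernel_size, stride) tuples in forward order.
    window: size of the sliding window on the final feature map.
    """
    rf = window
    for kernel, stride in reversed(layers):
        # One layer back: r_in = stride * (r_out - 1) + kernel
        rf = stride * (rf - 1) + kernel
    return rf


CONV = (3, 1)  # assumed: 3x3 convolution, stride 1
POOL = (2, 2)  # assumed: 2x2 max pooling, stride 2

# VGG16 up to Conv5 (no Pool5), as in the layer chain above.
VGG16_LAYERS = (
    [CONV] * 2 + [POOL]    # Conv1 block
    + [CONV] * 2 + [POOL]  # Conv2 block
    + [CONV] * 3 + [POOL]  # Conv3 block
    + [CONV] * 3 + [POOL]  # Conv4 block
    + [CONV] * 3           # Conv5 block
)

print(receptive_field(VGG16_LAYERS))  # 228
```

Passing only the last three layers, `receptive_field(VGG16_LAYERS[-3:])`, gives 9, which matches step 1) of the derivation.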