Until now, attention mechanisms could generally be divided into two types:
1) Detection proposals, such as those produced by the Faster R-CNN Region Proposal Network (RPN). The ROI-Pooling operation is an attention mechanism that lets the second stage of the detector attend only to the relevant features. The disadvantage of this approach is that it ignores information outside the proposal, which can be crucial for classifying it correctly in the second stage.
2) Global attention mechanisms, which re-weight the entire feature map according to a learned attention “heat map”. The disadvantage of this approach is that it does not use information about the objects in the image when generating the attention map.
This paper combines the two approaches into one, thus mitigating their disadvantages. It does so by generating the attention map over the proposals produced by the RPN, instead of over the global feature map. This is a very strong mechanism, and you can get an impression of its strength from the images below.
To implement this approach, they use Faster R-CNN to generate the top 36 proposals, and ROI-pool each proposal into a 2048-d feature vector (using average pooling).
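The ROI average-pooling step can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: it assumes box coordinates are already given in feature-map cells, and it simply crops a region and averages over its spatial positions (real ROI pooling/RoIAlign handles sub-cell alignment and bin subdivision).

```python
import numpy as np

def roi_avg_pool(feature_map, box):
    """Average-pool one proposal's region of a C x H x W feature map
    into a single C-dim vector (a simplified stand-in for ROI pooling)."""
    c, h, w = feature_map.shape
    x1, y1, x2, y2 = box                          # box in feature-map cell coords (assumption)
    region = feature_map[:, y1:y2, x1:x2]         # crop the proposal's region
    return region.reshape(c, -1).mean(axis=1)     # average over spatial positions

# toy example: a 2048-channel feature map and one hypothetical proposal box
fmap = np.random.rand(2048, 14, 14)
vec = roi_avg_pool(fmap, (2, 3, 7, 9))
print(vec.shape)  # (2048,)
```

Applying this to each of the 36 proposals yields 36 pooled 2048-d vectors.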
These pooled feature vectors are averaged into a single vector and fed into the attention LSTM. The output of the attention LSTM is a weight vector of size 36 (one weight for each proposal).
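The weight computation can be sketched with a simple additive-attention scorer. This is a hedged sketch, not the paper's exact architecture: the hidden state `h` here is a random stand-in for the attention LSTM's state, and the hidden size `H = 512` and the initialization scale are assumptions for illustration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K, D, H = 36, 2048, 512            # proposals, feature dim, hidden dim (H is an assumption)
V = rng.standard_normal((K, D))    # the 36 pooled proposal features
h = rng.standard_normal(H)         # stand-in for the attention LSTM's hidden state

# additive attention: score each proposal feature against the hidden state
W_v = rng.standard_normal((H, D)) * 0.01
W_h = rng.standard_normal((H, H)) * 0.01
w_a = rng.standard_normal(H) * 0.01

scores = np.tanh(V @ W_v.T + h @ W_h.T) @ w_a   # one scalar score per proposal
alpha = softmax(scores)                          # 36 non-negative weights summing to 1
print(alpha.shape)  # (36,)
```

The softmax guarantees the 36 weights form a proper distribution over proposals.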
The next stage is to compute the attended feature vector by summing the pooled feature vectors, weighted by their predicted attention weights. This attended feature can then serve as input to a second network that performs the actual task. In the paper, it is fed to a second (language) LSTM that generates one word of the image caption at each timestep.
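The weighted sum itself reduces to a single matrix product. A minimal sketch, using random stand-ins for the 36 pooled features and their attention weights:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D = 36, 2048                    # proposals, feature dim
V = rng.standard_normal((K, D))    # pooled proposal features (stand-in)
alpha = rng.random(K)
alpha /= alpha.sum()               # attention weights, normalized to sum to 1

attended = alpha @ V               # weighted sum -> single 2048-d attended feature
print(attended.shape)  # (2048,)
```

Because the weights sum to 1, the attended feature is a convex combination of the proposal features, so it stays in the same scale as the inputs.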
------------------------------------------------------------------------
Bottom-up vs top-down
There are two kinds of attention mechanisms in the human visual system. Top-down attention is determined by the current task: we focus on the parts most relevant to it (e.g., the question in VQA). Bottom-up attention means we are drawn to salient, distinctive, and novel things.
Most visual attention mechanisms in previous methods are of the top-down type: they take the question as input, model an attention distribution, and apply it to the image features extracted by the CNN. However, the resulting attention, shown in the left image of the figure below, does not account for the content of the picture. Human attention focuses more on the objects and other salient regions of an image, so the authors introduce a bottom-up attention mechanism, shown in the right image below, in which attention acts on object proposals.