Faster R-CNN Tensorflow Model

Building on the foundation of Convolutional Neural Networks, Faster R-CNN uses the CNN-computed features together with a Region Proposal Network (RPN) to detect bounding boxes that are likely to contain the object(s) of interest. The model outputs the bounding boxes, the labels assigned to those boxes, and a probability (objectness score) for each label and box. The architecture consists of the following components: the Region Proposal Network, Anchors, Training/Loss, Region of Interest (RoI) Pooling, and a Region-based CNN. A Region Proposal Network takes an image as input and outputs rectangular object proposals, the regions believed to contain an object, each with an objectness score. This is done by sliding a small network over the convolutional feature map produced by the last convolutional layer. The network takes a spatial window of the input feature map and maps each sliding window to a lower-dimensional feature, followed by ReLU. The resulting feature is then fed into two fully connected sibling layers: a box-regression layer and a box-classification layer. Together these components form a single, unified network for object detection.
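The shapes involved in this sliding-window head can be sketched as follows. This is an illustrative outline only, not the model's actual implementation: the layer sizes (512-d backbone features, 256-d intermediate feature, k = 9 anchors) are assumptions, and the dense maps stand in for the 3×3 and 1×1 convolutions of a real RPN.

```python
import numpy as np

# Assumed backbone feature map of H x W x C (e.g., a VGG-style conv layer)
H, W, C = 14, 14, 512
k = 9  # anchors per sliding-window position (3 scales x 3 aspect ratios)

feature_map = np.random.rand(H, W, C)

# Each sliding-window position is mapped to a lower-dimensional
# (here 256-d) feature, followed by ReLU.
intermediate = np.maximum(0.0, feature_map @ np.random.randn(C, 256))

# Two sibling heads: box-classification (2 scores per anchor: object
# vs. not object) and box-regression (4 coordinate offsets per anchor).
cls_scores = intermediate @ np.random.randn(256, 2 * k)
box_deltas = intermediate @ np.random.randn(256, 4 * k)

print(cls_scores.shape)  # (14, 14, 18)
print(box_deltas.shape)  # (14, 14, 36)
```

The key point is that every spatial position of the feature map yields 2k objectness scores and 4k box offsets, one set per anchor.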
Anchors are fixed-size reference bounding boxes placed uniformly throughout the original image. Anchors are translation-invariant: if an object translates to a different location within the image, the proposal translates with it, and the same function predicts the proposal at the new location.
Multi-scale anchors classify and regress bounding boxes with reference to anchor boxes of multiple scales and aspect ratios, allowing the network to address objects of varying scales and sizes.
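A minimal sketch of multi-scale anchor generation at one sliding-window position, assuming 3 scales and 3 aspect ratios (so k = 9 anchors per position). The particular scale and ratio values, and the (x_center, y_center, width, height) box format, are illustrative choices, not taken from the paper's code.

```python
def generate_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centered at (cx, cy).

    Each anchor is (x_center, y_center, width, height).
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep the anchor's area close to scale**2 while varying
            # its shape: height/width equals the aspect ratio.
            w = scale * (1.0 / ratio) ** 0.5
            h = scale * ratio ** 0.5
            anchors.append((cx, cy, w, h))
    return anchors

anchors = generate_anchors(300, 200)
print(len(anchors))  # 9
```

Every position on the feature map gets the same set of k anchors, which is what makes the scheme translation-invariant.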
In training the Region Proposal Network (RPN), each anchor is assigned a binary class label: object or not object. A positive label is assigned to anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or with an IoU overlap greater than 0.7 with any ground-truth box, indicating that an object has been detected. A negative label is assigned if the IoU ratio is less than 0.3 for all ground-truth boxes, indicating that no object has been detected. The RPN can be trained through backpropagation and stochastic gradient descent. To train the network, each mini-batch derives from a single image containing many positive and negative example anchors, which are sampled at a ratio of up to 1:1 to compute the loss function (Ren et al., 2016).
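The IoU-based labeling rule described above can be illustrated directly. The helper names below are hypothetical; the boxes are (x1, y1, x2, y2) corner coordinates, and the 0.7/0.3 thresholds follow Ren et al. (2016).

```python
def iou(a, b):
    """Intersection-over-Union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Label an anchor from its best IoU against all ground-truth boxes."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best > pos_thresh:
        return 1    # positive: object detected
    if best < neg_thresh:
        return 0    # negative: background
    return -1       # in between: ignored during training

gt = [(0, 0, 100, 100)]
print(label_anchor((0, 0, 100, 100), gt))      # 1 (IoU = 1.0)
print(label_anchor((200, 200, 300, 300), gt))  # 0 (IoU = 0.0)
```

Anchors whose best IoU falls between the two thresholds contribute nothing to the loss, which is why the mini-batch is sampled only from the clearly positive and clearly negative anchors.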
Region of Interest Pooling (RoI Pooling) enables the network to detect multiple objects in an image by performing max pooling on inputs of varying sizes to obtain a fixed-size feature map for each region: the region proposal is divided into equal-sized sections, the largest value in each section is found, and these maxima are copied into the output buffer (deepsense.ai). The layer takes two inputs: a feature map obtained from a deep convolutional network with several convolutional and max-pooling layers, and an N × 5 matrix representing a list of N regions of interest, where the first column is the image index and the remaining four columns are the coordinates of the corners of the region. Figure 1.2 shows a simplified process diagram of the Faster R-CNN model and its CNN components:
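The section-wise max pooling described above can be sketched in a few lines. This is a simplified pure-Python illustration over a single 2-D feature grid; a real implementation pools each channel of a feature-map tensor, and the integer section-splitting below is one of several possible conventions.

```python
def roi_max_pool(feature, x1, y1, x2, y2, out_h=2, out_w=2):
    """Max-pool region [y1:y2, x1:x2] of a 2-D grid into out_h x out_w."""
    rh, rw = y2 - y1, x2 - x1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Integer section boundaries; uneven regions produce
            # sections that differ in size by at most one cell.
            ys, ye = y1 + i * rh // out_h, y1 + (i + 1) * rh // out_h
            xs, xe = x1 + j * rw // out_w, x1 + (j + 1) * rw // out_w
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        out.append(row)
    return out

feature = [[r * 4 + c for c in range(4)] for r in range(4)]  # 4x4 ramp
print(roi_max_pool(feature, 0, 0, 4, 4))  # [[5, 7], [13, 15]]
```

However large the proposed region is, the output is always out_h × out_w, which is what lets downstream fully connected layers accept proposals of arbitrary size.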