
A Close Reading of the Faster R-CNN Paper


State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet [7] and Fast R-CNN [5] have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model [19], our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image.


Introduction

Recent advances in object detection are driven by the success of region proposal methods (e.g., [22]) and region-based convolutional neural networks (R-CNNs) [6]. Although region-based CNNs were computationally expensive as originally developed in [6], their cost has been drastically reduced thanks to sharing convolutions across proposals [7, 5]. The latest incarnation, Fast R-CNN [5], achieves near real-time rates using very deep networks [19], when ignoring the time spent on region proposals. Now, proposals are the computational bottleneck in state-of-the-art detection systems.

Region proposal methods typically rely on inexpensive features and economical inference schemes. Selective Search (SS) [22], one of the most popular methods, greedily merges superpixels based on engineered low-level features. Yet when compared to efficient detection networks [5], Selective Search is an order of magnitude slower, at 2s per image in a CPU implementation. EdgeBoxes [24] currently provides the best tradeoff between proposal quality and speed, at 0.2s per image. Nevertheless, the region proposal step still consumes as much running time as the detection network.

One may note that fast region-based CNNs take advantage of GPUs, while the region proposal methods used in research are implemented on the CPU, making such runtime comparisons inequitable. An obvious way to accelerate proposal computation is to re-implement it for the GPU. This may be an effective engineering solution, but re-implementation ignores the down-stream detection network and therefore misses important opportunities for sharing computation.

In this paper, we show that an algorithmic change—computing proposals with a deep net—leads to an elegant and effective solution, where proposal computation is nearly cost-free given the detection network’s computation. To this end, we introduce novel Region Proposal Networks (RPNs) that share convolutional layers with state-of-the-art object detection networks [7, 5]. By sharing convolutions at test-time, the marginal cost for computing proposals is small (e.g., 10ms per image).

Our observation is that the convolutional (conv) feature maps used by region-based detectors, like Fast R-CNN, can also be used for generating region proposals. On top of these conv features, we construct RPNs by adding two additional conv layers: one that encodes each conv map position into a short (e.g., 256-d) feature vector and a second that, at each conv map position, outputs an objectness score and regressed bounds for k region proposals relative to various scales and aspect ratios at that location (k = 9 is a typical value).

Our RPNs are thus a kind of fully-convolutional network (FCN) [14] and they can be trained end-to-end specifically for the task of generating detection proposals. To unify RPNs with Fast R-CNN [5] object detection networks, we propose a simple training scheme that alternates between fine-tuning for the region proposal task and then fine-tuning for object detection, while keeping the proposals fixed. This scheme converges quickly and produces a unified network with conv features that are shared between both tasks.

We evaluate our method on the PASCAL VOC detection benchmarks [4], where RPNs with Fast R-CNNs produce detection accuracy better than the strong baseline of Selective Search with Fast R-CNNs. Meanwhile, our method waives nearly all computational burdens of SS at test-time—the effective running time for proposals is just 10 milliseconds. Using the expensive very deep models of [19], our detection method still has a frame rate of 5fps (including all steps) on a GPU, and thus is a practical object detection system in terms of both speed and accuracy (73.2% mAP on PASCAL VOC 2007 and 70.4% mAP on 2012).

Related Work

Several recent papers have proposed ways of using deep networks for locating class-specific or class-agnostic bounding boxes [21, 18, 3, 20]. In the OverFeat method [18], a fully-connected (fc) layer is trained to predict the box coordinates for the localization task that assumes a single object. The fc layer is then turned into a conv layer for detecting multiple class-specific objects. The MultiBox methods [3, 20] generate region proposals from a network whose last fc layer simultaneously predicts multiple (e.g., 800) boxes, which are used for R-CNN [6] object detection. Their proposal network is applied on a single image or multiple large image crops (e.g., 224×224) [20]. We discuss OverFeat and MultiBox in more depth later in context with our method.

Shared computation of convolutions [18, 7, 2, 5] has been attracting increasing attention for efficient, yet accurate, visual recognition. The OverFeat paper [18] computes conv features from an image pyramid for classification, localization, and detection. Adaptively-sized pooling (SPP) [7] on shared conv feature maps is proposed for efficient region-based object detection [7, 16] and semantic segmentation [2]. Fast R-CNN [5] enables end-to-end detector training on shared conv features and shows compelling accuracy and speed.

Region Proposal Networks

A Region Proposal Network (RPN) takes an image (of any size) as input and outputs a set of rectangular object proposals, each with an objectness score. We model this process with a fully-convolutional network [14], which we describe in this section. Because our ultimate goal is to share computation with a Fast R-CNN object detection network [5], we assume that both nets share a common set of conv layers. In our experiments, we investigate the Zeiler and Fergus model [23] (ZF), which has 5 shareable conv layers, and the Simonyan and Zisserman model [19] (VGG), which has 13 shareable conv layers.

To generate region proposals, we slide a small network over the conv feature map output by the last shared conv layer. This network is fully connected to an n × n spatial window of the input conv feature map. Each sliding window is mapped to a lower-dimensional vector (256-d for ZF and 512-d for VGG). This vector is fed into two sibling fully-connected layers—a box-regression layer (reg) and a box-classification layer (cls). We use n = 3 in this paper, noting that the effective receptive field on the input image is large (171 and 228 pixels for ZF and VGG, respectively). This mini-network is illustrated at a single position in Fig. 1 (left). Note that because the mini-network operates in a sliding-window fashion, the fully-connected layers are shared across all spatial locations. This architecture is naturally implemented with an n × n conv layer followed by two sibling 1 × 1 conv layers (for reg and cls, respectively). ReLUs [15] are applied to the output of the n × n conv layer.
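
To make the architecture concrete, here is a minimal PyTorch sketch of this mini-network (an illustration only; the paper's implementation is in Caffe, and the class and argument names here are invented). The 3×3 conv plays the role of the n × n sliding window, and the two sibling 1×1 convs produce the cls and reg outputs:

    import torch.nn as nn

    class RPNHead(nn.Module):
        """3x3 'sliding window' conv followed by two sibling 1x1 convs."""
        def __init__(self, in_channels=512, mid_channels=512, k=9):
            # in_channels/mid_channels: 256 for ZF, 512 for VGG
            super().__init__()
            self.conv = nn.Conv2d(in_channels, mid_channels, 3, padding=1)
            self.relu = nn.ReLU(inplace=True)
            self.cls = nn.Conv2d(mid_channels, 2 * k, 1)  # 2k objectness scores
            self.reg = nn.Conv2d(mid_channels, 4 * k, 1)  # 4k box coordinates

        def forward(self, feat):
            # feat: (N, C, H, W) feature map from the last shared conv layer
            h = self.relu(self.conv(feat))
            return self.cls(h), self.reg(h)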

Translation-Invariant Anchors

At each sliding-window location, we simultaneously predict k region proposals, so the reg layer has 4k outputs encoding the coordinates of k boxes. The cls layer outputs 2k scores that estimate probability of object / not-object for each proposal. The k proposals are parameterized relative to k reference boxes, called anchors. Each anchor is centered at the sliding window in question, and is associated with a scale and aspect ratio. We use 3 scales and 3 aspect ratios, yielding k = 9 anchors at each sliding position. For a conv feature map of size W × H (typically ∼2,400), there are WHk anchors in total. An important property of our approach is that it is translation invariant, both in terms of the anchors and the functions that compute proposals relative to the anchors.
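
As a sketch, the anchor set can be enumerated as below (assuming a feature stride of 16 pixels, as for the ZF/VGG nets used here; the half-cell center offset and the function name are illustrative choices, not from the paper):

    import numpy as np

    def generate_anchors(feat_h, feat_w, stride=16,
                         scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
        """Return a (feat_h * feat_w * k, 4) array of (x1, y1, x2, y2) anchors."""
        base = []
        for s in scales:
            for r in ratios:              # r = height / width; area stays ~ s**2
                w, h = s / np.sqrt(r), s * np.sqrt(r)
                base.append([-w / 2, -h / 2, w / 2, h / 2])
        base = np.array(base)                              # (k, 4), k = 9
        xs = (np.arange(feat_w) + 0.5) * stride            # anchor centers in
        ys = (np.arange(feat_h) + 0.5) * stride            # image coordinates
        cx, cy = np.meshgrid(xs, ys)
        centers = np.stack([cx, cy, cx, cy], -1).reshape(-1, 1, 4)
        return (centers + base).reshape(-1, 4)             # W*H*k anchors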

As a comparison, the MultiBox method [20] uses k-means to generate 800 anchors, which are not translation invariant. If one translates an object in an image, the proposal should translate and the same function should be able to predict the proposal in either location. Moreover, because the MultiBox anchors are not translation invariant, it requires a (4+1)×800-dimensional output layer, whereas our method requires a (4+2)×9-dimensional output layer. Our proposal layers have an order of magnitude fewer parameters (27 million for MultiBox using GoogLeNet [20] vs. 2.4 million for RPN using VGG-16), and thus have less risk of overfitting on small datasets, like PASCAL VOC.

[Figure 1. Left: Region Proposal Network (RPN). Right: Example detections using RPN proposals on PASCAL VOC 2007 test. Our method detects objects over a wide range of scales and aspect ratios.]

A Loss Function for Learning Region Proposals

For training RPNs, we assign a binary class label (of being an object or not) to each anchor. We assign a positive label to two kinds of anchors: (i) the anchor/anchors with the highest Intersection-over-Union (IoU) overlap with a ground-truth box, or (ii) an anchor that has an IoU overlap higher than 0.7 with any ground-truth box. Note that a single ground-truth box may assign positive labels to multiple anchors. We assign a negative label to a non-positive anchor if its IoU ratio is lower than 0.3 for all ground-truth boxes. Anchors that are neither positive nor negative do not contribute to the training objective.
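
A sketch of this labeling rule, assuming an IoU matrix between all anchors and the ground-truth boxes has already been computed (the function name and the -1 "ignore" convention are illustrative):

    import numpy as np

    def label_anchors(iou):
        """iou: (num_anchors, num_gt) matrix. 1 = positive, 0 = negative, -1 = ignored."""
        labels = np.full(iou.shape[0], -1, dtype=np.int64)
        max_iou = iou.max(axis=1)
        labels[max_iou < 0.3] = 0    # negative: IoU < 0.3 for all ground-truth boxes
        labels[max_iou > 0.7] = 1    # positive (ii): IoU > 0.7 with some ground-truth box
        # positive (i): the highest-IoU anchor(s) for each ground-truth box
        best = iou.max(axis=0)
        labels[((iou == best) & (iou > 0)).any(axis=1)] = 1
        return labels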

With these definitions, we minimize an objective function following the multi-task loss in Fast R-CNN [5]. Our loss function for an image is defined as:

$$L(\{p_i\}, \{t_i\}) = \frac{1}{N_{cls}} \sum_i L_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}(t_i, t_i^*)$$

Here, $i$ is the index of an anchor in a mini-batch and $p_i$ is the predicted probability of anchor $i$ being an object. The ground-truth label $p_i^*$ is 1 if the anchor is positive, and is 0 if the anchor is negative. $t_i$ is a vector representing the 4 parameterized coordinates of the predicted bounding box, and $t_i^*$ is that of the ground-truth box associated with a positive anchor. The classification loss $L_{cls}$ is log loss over two classes (object vs. not object). For the regression loss, we use $L_{reg}(t_i, t_i^*) = R(t_i - t_i^*)$, where $R$ is the robust loss function (smooth $L_1$) defined in [5]. The term $p_i^* L_{reg}$ means the regression loss is activated only for positive anchors ($p_i^* = 1$) and is disabled otherwise ($p_i^* = 0$). The outputs of the cls and reg layers consist of $\{p_i\}$ and $\{t_i\}$ respectively. The two terms are normalized with $N_{cls}$ and $N_{reg}$, and a balancing weight $\lambda$.
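
Assembled into code, the loss might look like the following sketch. The defaults λ = 10 and N_reg ≈ 2,400 are assumptions taken from the published version of the paper rather than from the text above, and cross_entropy's mean already supplies the 1/N_cls normalization over the sampled mini-batch:

    import torch.nn.functional as F

    def rpn_loss(cls_logits, labels, reg_deltas, reg_targets, lam=10.0, n_reg=2400):
        # cls_logits: (A, 2); labels: (A,) in {1, 0, -1}; reg_*: (A, 4)
        keep = labels >= 0        # anchors that are neither positive nor negative are skipped
        l_cls = F.cross_entropy(cls_logits[keep], labels[keep])
        pos = labels == 1         # p*_i gates the regression term
        l_reg = F.smooth_l1_loss(reg_deltas[pos], reg_targets[pos],
                                 reduction="sum") / n_reg
        return l_cls + lam * l_reg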

For regression, we adopt the parameterizations of the 4 coordinates following [6]:

$$t_x = (x - x_a)/w_a, \quad t_y = (y - y_a)/h_a, \quad t_w = \log(w/w_a), \quad t_h = \log(h/h_a),$$
$$t_x^* = (x^* - x_a)/w_a, \quad t_y^* = (y^* - y_a)/h_a, \quad t_w^* = \log(w^*/w_a), \quad t_h^* = \log(h^*/h_a),$$

where $x$, $y$, $w$, and $h$ denote the two coordinates of the box center, width, and height. Variables $x$, $x_a$, and $x^*$ are for the predicted box, anchor box, and ground-truth box respectively (likewise for $y$, $w$, $h$). This can be thought of as bounding-box regression from an anchor box to a nearby ground-truth box.
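
A direct transcription of this parameterization (the encoding direction, computing targets from boxes and their anchors; names here are illustrative):

    import numpy as np

    def encode_boxes(boxes, anchors):
        """Both inputs are (N, 4) arrays of (x1, y1, x2, y2); returns (tx, ty, tw, th)."""
        wa, ha = anchors[:, 2] - anchors[:, 0], anchors[:, 3] - anchors[:, 1]
        xa, ya = anchors[:, 0] + wa / 2, anchors[:, 1] + ha / 2
        w, h = boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]
        x, y = boxes[:, 0] + w / 2, boxes[:, 1] + h / 2
        return np.stack([(x - xa) / wa, (y - ya) / ha,
                         np.log(w / wa), np.log(h / ha)], axis=1)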

Nevertheless, our method achieves bounding-box regression in a manner different from previous feature-map-based methods [7, 5]. In [7, 5], bounding-box regression is performed on features pooled from arbitrarily sized regions, and the regression weights are shared by all region sizes. In our formulation, the features used for regression are of the same spatial size (n × n) on the feature maps. To account for varying sizes, a set of k bounding-box regressors is learned. Each regressor is responsible for one scale and one aspect ratio, and the k regressors do not share weights. As such, it is still possible to predict boxes of various sizes even though the features are of a fixed size/scale.

Optimization

The RPN, which is naturally implemented as a fully-convolutional network [14], can be trained end-to-end by back-propagation and stochastic gradient descent (SGD) [12]. We follow the "image-centric" sampling strategy from [5] to train this network. Each mini-batch arises from a single image that contains many positive and negative anchors. It is possible to optimize for the loss functions of all anchors, but this will bias towards negative samples as they dominate. Instead, we randomly sample 256 anchors in an image to compute the loss function of a mini-batch, where the sampled positive and negative anchors have a ratio of up to 1:1. If there are fewer than 128 positive samples in an image, we pad the mini-batch with negative ones.
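
A sketch of this sampling strategy (assuming there are always enough negatives to pad with, which holds in practice given the thousands of candidate anchors per image):

    import numpy as np

    def sample_anchors(labels, batch_size=256, pos_fraction=0.5):
        """Sample up to 128 positives, then pad the mini-batch with negatives."""
        pos, neg = np.flatnonzero(labels == 1), np.flatnonzero(labels == 0)
        n_pos = min(len(pos), int(batch_size * pos_fraction))
        n_neg = min(len(neg), batch_size - n_pos)
        keep = np.concatenate([np.random.choice(pos, n_pos, replace=False),
                               np.random.choice(neg, n_neg, replace=False)])
        out = np.full_like(labels, -1)   # -1: excluded from this mini-batch's loss
        out[keep] = labels[keep]
        return out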

We randomly initialize all new layers by drawing weights from a zero-mean Gaussian distribution with standard deviation 0.01. All other layers (i.e., the shared conv layers) are initialized by pretraining a model for ImageNet classification [17], as is standard practice [6]. We tune all layers of the ZF net, and conv3_1 and up for the VGG net to conserve memory [5]. We use a learning rate of 0.001 for 60k mini-batches, and 0.0001 for the next 20k mini-batches on the PASCAL dataset. We also use a momentum of 0.9 and a weight decay of 0.0005 [11]. Our implementation uses Caffe.

Sharing Convolutional Features for Region Proposal and Object Detection

Thus far we have described how to train a network for region proposal generation, without considering the region-based object detection CNN that will utilize these proposals. For the detection network, we adopt Fast R-CNN [5] and now describe an algorithm that learns conv layers that are shared between the RPN and Fast R-CNN.

Both RPN and Fast R-CNN, trained independently, will modify their conv layers in different ways. We therefore need to develop a technique that allows for sharing conv layers between the two networks, rather than learning two separate networks. Note that this is not as easy as simply defining a single network that includes both RPN and Fast R-CNN, and then optimizing it jointly with backpropagation. The reason is that Fast R-CNN training depends on fixed object proposals and it is not clear a priori whether learning Fast R-CNN while simultaneously changing the proposal mechanism will converge. While this joint optimization is an interesting question for future work, we develop a pragmatic 4-step training algorithm to learn shared features via alternating optimization.

In the first step, we train the RPN as described above. This network is initialized with an ImageNet-pre-trained model and fine-tuned end-to-end for the region proposal task. In the second step, we train a separate detection network by Fast R-CNN using the proposals generated by the step-1 RPN. This detection network is also initialized by the ImageNet-pre-trained model. At this point the two networks do not share conv layers. In the third step, we use the detector network to initialize RPN training, but we fix the shared conv layers and only fine-tune the layers unique to RPN. Now the two networks share conv layers. Finally, keeping the shared conv layers fixed, we fine-tune the fc layers of the Fast R-CNN. As such, both networks share the same conv layers and form a unified network.

Implementation Details

We train and test both region proposal and object detection networks on single-scale images [7, 5]. We re-scale the images such that their shorter side is s = 600 pixels [5]. Multi-scale feature extraction may improve accuracy but does not exhibit a good speed-accuracy trade-off [5]. We also note that for ZF and VGG nets, the total stride on the last conv layer is 16 pixels on the re-scaled image, and thus is ∼10 pixels on a typical PASCAL image (∼500×375). Even such a large stride provides good results, though accuracy may be further improved with a smaller stride.
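
For example, a trivial helper for this re-scaling (assuming integer pixel sizes; not from the paper's code):

    def rescale_size(width, height, s=600):
        """Re-scaled size such that the image's shorter side becomes s pixels."""
        scale = s / min(width, height)
        return round(width * scale), round(height * scale)

    # A typical PASCAL image: rescale_size(500, 375) -> (800, 600), scale 1.6,
    # so the 16-pixel stride on the re-scaled image is ~10 pixels on the original.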

For anchors, we use 3 scales with box areas of 128², 256², and 512² pixels, and 3 aspect ratios of 1:1, 1:2, and 2:1. We note that our algorithm allows the use of anchor boxes that are larger than the underlying receptive field when predicting large proposals. Such predictions are not impossible— one may still roughly infer the extent of an object if only the middle of the object is visible. With this design, our solution does not need multi-scale features or multi-scale sliding windows to predict large regions, saving considerable running time. Fig. 1 (right) shows the capability of our method for a wide range of scales and aspect ratios. The table below shows the learned average proposal size for each anchor using the ZF net (numbers for s = 600).

[Table: learned average proposal size for each of the 9 anchors, ZF net, s = 600]

The anchor boxes that cross image boundaries need to be handled with care. During training, we ignore all cross-boundary anchors so they do not contribute to the loss. For a typical 1000 × 600 image, there will be roughly 20k (≈ 60 × 40 × 9) anchors in total. With the cross-boundary anchors ignored, there are about 6k anchors per image for training. If the boundary-crossing outliers are not ignored in training, they introduce large, difficult to correct error terms in the objective, and training does not converge. During testing, however, we still apply the fully-convolutional RPN to the entire image. This may generate cross-boundary proposal boxes, which we clip to the image boundary.
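
Both behaviors are simple to express (a sketch; training filters anchors, testing clips proposal boxes):

    import numpy as np

    def inside_image(anchors, im_w, im_h):
        """Training: indices of anchors that lie fully inside the image."""
        keep = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
                (anchors[:, 2] <= im_w) & (anchors[:, 3] <= im_h))
        return np.flatnonzero(keep)

    def clip_boxes(boxes, im_w, im_h):
        """Testing: clip cross-boundary proposal boxes to the image boundary."""
        boxes[:, 0::2] = boxes[:, 0::2].clip(0, im_w)   # x1, x2
        boxes[:, 1::2] = boxes[:, 1::2].clip(0, im_h)   # y1, y2
        return boxes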

Some RPN proposals highly overlap with each other. To reduce redundancy, we adopt nonmaximum suppression (NMS) on the proposal regions based on their cls scores. We fix the IoU threshold for NMS at 0.7, which leaves us about 2k proposal regions per image. As we will show, NMS does not harm the ultimate detection accuracy, but substantially reduces the number of proposals. After NMS, we use the top-N ranked proposal regions for detection. In the following, we train Fast R-CNN using 2k RPN proposals, but evaluate different numbers of proposals at test-time.
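
For reference, greedy NMS at this threshold can be sketched as follows (a standard implementation, not specific to this paper):

    import numpy as np

    def nms(boxes, scores, iou_thresh=0.7):
        """Greedily keep the highest-scoring boxes, dropping near-duplicates."""
        x1, y1, x2, y2 = boxes.T
        areas = (x2 - x1) * (y2 - y1)
        order = scores.argsort()[::-1]          # rank proposals by cls score
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            xx1 = np.maximum(x1[i], x1[order[1:]])
            yy1 = np.maximum(y1[i], y1[order[1:]])
            xx2 = np.minimum(x2[i], x2[order[1:]])
            yy2 = np.minimum(y2[i], y2[order[1:]])
            inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
            iou = inter / (areas[i] + areas[order[1:]] - inter)
            order = order[1:][iou <= iou_thresh]
        return keep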

Experiments

We comprehensively evaluate our method on the PASCAL VOC 2007 detection benchmark [4]. This dataset consists of about 5k trainval images and 5k test images over 20 object categories. We also provide results in the PASCAL VOC 2012 benchmark for a few models. For the ImageNet pre-trained network, we use the "fast" version of ZF net [23] that has 5 conv layers and 3 fc layers, and the public VGG-16 model [19] that has 13 conv layers and 3 fc layers. We primarily evaluate detection mean Average Precision (mAP), because this is the actual metric for object detection (rather than focusing on object proposal proxy metrics).

Table 1 (top) shows Fast R-CNN results when trained and tested using various region proposal methods. These results use the ZF net. For Selective Search (SS) [22], we generate about 2k SS proposals by the "fast" mode. For EdgeBoxes (EB) [24], we generate the proposals by the default EB setting tuned for 0.7 IoU. SS has an mAP of 58.7% and EB has an mAP of 58.6%. RPN with Fast R-CNN achieves competitive results, with an mAP of 59.9% while using up to 300 proposals. Using RPN yields a much faster detection system than using either SS or EB because of shared conv computations; the fewer proposals also reduce the region-wise fc cost. Next, we consider several ablations of RPN and then show that proposal quality improves when using the very deep network.

Ablation Experiments. To investigate the behavior of RPNs as a proposal method, we conducted several ablation studies. First, we show the effect of sharing conv layers between the RPN and Fast R-CNN detection network. To do this, we stop after the second step in the 4-step training process. Using separate networks reduces the result slightly to 58.7% (RPN+ZF, unshared, Table 1). We observe that this is because in the third step when the detector-tuned features are used to fine-tune the RPN, the proposal quality is improved.

Next, we disentangle the RPN’s influence on training the Fast R-CNN detection network. For this purpose, we train a Fast R-CNN model by using the 2k SS proposals and ZF net. We fix this detector and evaluate the detection mAP by changing the proposal regions used at test-time. In these ablation experiments, the RPN does not share features with the detector.

Replacing SS with 300 RPN proposals at test-time leads to an mAP of 56.8%. The loss in mAP is because of the inconsistency between the training/testing proposals. This result serves as the baseline for the following comparisons.

Somewhat surprisingly, the RPN still leads to a competitive result (55.1%) when using the top-ranked 100 proposals at test-time, indicating that the top-ranked RPN proposals are accurate. On the other extreme, using the top-ranked 6k RPN proposals (without NMS) has a comparable mAP (55.2%), suggesting NMS does not harm the detection mAP and may reduce false alarms.

Next, we separately investigate the roles of RPN’s cls and reg outputs by turning off either of them at test-time. When the cls layer is removed at test-time (thus no NMS/ranking is used), we randomly sample N proposals from the unscored regions. The mAP is nearly unchanged with N = 1k (55.8%), but degrades considerably to 44.6% when N = 100. This shows that the cls scores account for the accuracy of the highest ranked proposals.

On the other hand, when the reg layer is removed at test-time (so the proposals become anchor boxes), the mAP drops to 52.1%. This suggests that the high-quality proposals are mainly due to regressed positions. The anchor boxes alone are not sufficient for accurate detection.

We also evaluate the effects of more powerful networks on the proposal quality of RPN alone. We use VGG-16 to train the RPN, and still use the above detector of SS+ZF. The mAP improves from 56.8% (using RPN+ZF) to 59.2% (using RPN+VGG). This is a promising result, because it suggests that the proposal quality of RPN+VGG is better than that of RPN+ZF. Because proposals of RPN+ZF are competitive with SS (both are 58.7% when consistently used for training and testing), we may expect RPN+VGG to be better than SS. The following experiments justify this hypothesis.

[Tables 1-3: detection results with various proposal methods on the PASCAL VOC 2007 and 2012 test sets]

Detection Accuracy and Running Time of VGG-16. Table 2 shows the results of VGG-16 for both proposal and detection. Using RPN+VGG, the Fast R-CNN result is 68.5% for unshared features, slightly higher than the SS baseline. As shown above, this is because the proposals generated by RPN+VGG are more accurate than SS. Unlike SS that is pre-defined, the RPN is actively trained and benefits from better networks. For the feature-shared variant, the result is 69.9%—better than the strong SS baseline, yet with nearly cost-free proposals. We further train the RPN and detection network on the union set of PASCAL VOC 2007 trainval and 2012 trainval, following [5]. The mAP is 73.2%. On the PASCAL VOC 2012 test set (Table 3), our method has an mAP of 70.4% trained on the union set of VOC 2007 trainval+test and VOC 2012 trainval, following [5].

In Table 4 we summarize the running time of the entire object detection system. SS takes 1-2 seconds depending on content (on average 1.51s), and Fast R-CNN with VGG-16 takes 320ms on 2k SS proposals (or 223ms if using SVD on fc layers [5]). Our system with VGG-16 takes in total 198ms for both proposal and detection. With the conv features shared, the RPN alone only takes 10ms computing the additional layers. Our region-wise computation is also low, thanks to fewer proposals (300). Our system has a frame-rate of 17 fps with the ZF net.

Analysis of Recall-to-IoU. Next we compute the recall of proposals at different IoU ratios with ground-truth boxes. It is noteworthy that the Recall-to-IoU metric is just loosely [9, 8, 1] related to the ultimate detection accuracy. It is more appropriate to use this metric to diagnose the proposal method than to evaluate it.

[Figure 2: recall vs. IoU overlap ratio on the PASCAL VOC 2007 test set, using 300, 1k, and 2k proposals]

In Fig. 2, we show the results of using 300, 1k, and 2k proposals. We compare with SS and EB, and the N proposals are the top-N ranked ones based on the confidence generated by these methods. The plots show that the RPN method behaves gracefully when the number of proposals drops from 2k to 300. This explains why the RPN has a good ultimate detection mAP when using as few as 300 proposals. As we analyzed before, this property is mainly attributed to the cls term of the RPN. The recall of SS and EB drops more quickly than RPN when the proposals are fewer.

One-Stage Detection vs. Two-Stage Proposal + Detection. The OverFeat paper [18] proposes a detection method that uses regressors and classifiers on sliding windows over conv feature maps. OverFeat is a one-stage, class-specific detection pipeline, and ours is a two-stage cascade consisting of class-agnostic proposals and class-specific detections. In OverFeat, the region-wise features come from a sliding window of one aspect ratio over a scale pyramid. These features are used to simultaneously determine the location and category of objects. In RPN, the features are from square (3×3) sliding windows and predict proposals relative to anchors with different scales and aspect ratios. Though both methods use sliding windows, the region proposal task is only the first stage of RPN + Fast R-CNN—the detector attends to the proposals to refine them. In the second stage of our cascade, the region-wise features are adaptively pooled [7, 5] from proposal boxes that more faithfully cover the features of the regions. We believe these features lead to more accurate detections.

To compare the one-stage and two-stage systems, we emulate the OverFeat system (and thus also circumvent other differences of implementation details) by one-stage Fast R-CNN. In this system, the “proposals” are dense sliding windows of 3 scales (128, 256, 512) and 3 aspect ratios (1:1, 1:2, 2:1). Fast R-CNN is trained to predict class-specific scores and regress box locations from these sliding windows. Because the OverFeat system uses an image pyramid, we also evaluate using conv features extracted from 5 scales. We use those 5 scales as in [7, 5].

Table 5 compares the two-stage system and two variants of the one-stage system. Using the ZF model, the one-stage system has an mAP of 53.9%. This is lower than the two-stage system (58.7%) by 4.8%. This experiment justifies the effectiveness of cascaded region proposals and object detection. Similar observations are reported in [5, 13], where replacing SS region proposals with sliding windows leads to ∼6% degradation in both papers. We also note that the one-stage system is slower as it has considerably more proposals to process.

Conclusion

We have presented Region Proposal Networks (RPNs) for efficient and accurate region proposal generation. By sharing convolutional features with the down-stream detection network, the region proposal step is nearly cost-free. Our method enables a unified, deep-learning-based object detection system to run at 5-17 fps. The learned RPN also improves region proposal quality and thus the overall object detection accuracy.

Original post: https://www.cnblogs.com/gui-yan-ru-yun/p/12092752.html
