3. VSRNet: End-to-end video segment retrieval with text query

zhangxianrong 2021-09-23 19:27

3. Method

In this section, we introduce the proposed VSRNet, which aims to jointly retrieve the corresponding videos and locate the related segments according to a text query. The overall pipeline is shown in Fig. 3. We use two branches to complete the tasks of video-level retrieval and description localization, respectively. In the first branch, the discriminative representations of videos and texts are mapped into a common feature space, in which metric learning is adopted to enhance the correlation between paired data and distinguish the difference between mismatched ones, as shown in the blue area of Fig. 3. In the second branch, we design a supervised text-aligned attention mechanism to measure the response of each frame to the query, which provides evidence to locate relevant segments. These two branches are embedded into one network and optimized in an end-to-end manner. We believe such a design enables each branch to utilize the complementary information of the other branch and leads the neural network to extract powerful representations of video and text. Details of our method are described as follows.


 

Fig. 3. The pipeline of the proposed VSRNet. The video and text features are extracted independently and then mapped into a common feature space through two branches. In the first branch (leftmost part), a classical ranking method is applied to establish the relation between video and text. In the second branch, we fuse the video and text features to form a joint embedding, based on which we rank the videos and texts. We propose text-aligned attention to obtain a semantic attention score that reflects the response of the local video structure to the text, which is supervised by the ground-truth temporal boundary.


3.1. Multimodal embedding 


The aim of video segment retrieval is to find the targets most closely related to a query. Thus, one of the key points is to learn powerful representations for both videos and texts. In this section, we describe the design of the feature extractors for videos and texts, respectively.


3.1.1. Video encoding

A video can be decomposed into a sequence of frames, which contains static RGB pictures and motion signals. For the former, we rely on convolutional neural networks to extract deep features. To capture motion information, one choice is to derive optical flow images representing horizontal and vertical vector fields and then feed them into convolutional neural networks. Despite the gains in performance, such operations need plenty of extra computing resources. In this context, we instead adopt a bidirectional GRU [34] to model the motion and capture the temporal information between continuous frames. Before feeding these varying-length frame-level features into the BiGRU units, we sample them into a fixed length along the temporal dimension by interpolation. On the one hand, the original feature sequence is quite redundant, and it is difficult for recurrent neural networks to converge and model long-term temporal relations. On the other hand, it is convenient to associate the event timestamps with specific frames. In order to fully exploit the subtle short-term patterns and make the representations more discriminative, 1D convolutions are applied on the output of the BiGRU. We use convolutions with kernel sizes of $\{3, 5, 7\}$ to combine features in different receptive fields; these features are then concatenated to form a local descriptor for every frame. We denote the concatenated frame-level features as $\{v_1, v_2, v_3, \ldots, v_n\}$. The global representation of a video $v$ is obtained through weighted average pooling across the frame-level features.
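To make the video branch concrete, below is a minimal PyTorch sketch of the encoder described above: frame features are interpolated to a fixed length, passed through a BiGRU, and refined by 1D convolutions with kernel sizes {3, 5, 7}. The feature dimensions, the number of sampled frames, and the use of a plain (uniform) average in place of the paper's weighted pooling are assumptions for illustration, since the exact pooling formula is not reproduced in this excerpt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoEncoder(nn.Module):
    """Sketch of the video branch: interpolate frame features to a fixed
    length, model temporal context with a BiGRU, then apply 1D convolutions
    with kernel sizes {3, 5, 7} to capture short-term patterns."""
    def __init__(self, feat_dim=2048, hidden=512, n_frames=32):
        super().__init__()
        self.n_frames = n_frames
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        # one conv per kernel size; per-frame outputs are concatenated
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * hidden, hidden, k, padding=k // 2) for k in (3, 5, 7)
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, n_raw_frames, feat_dim), variable n_raw_frames
        x = frame_feats.transpose(1, 2)                      # (B, D, T)
        x = F.interpolate(x, size=self.n_frames, mode="linear", align_corners=False)
        x = x.transpose(1, 2)                                # (B, n_frames, D)
        h, _ = self.bigru(x)                                 # (B, n_frames, 2*hidden)
        c = torch.cat([conv(h.transpose(1, 2)) for conv in self.convs], dim=1)
        frame_level = c.transpose(1, 2)                      # {v_1, ..., v_n}
        # global video embedding; a uniform average stands in for the
        # paper's weighted average pooling, whose weights are not shown here
        video_level = frame_level.mean(dim=1)
        return frame_level, video_level
```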


 

 3.1.2. Text encoding

Since both videos and texts can be viewed as sequences, the methods that handle videos can be adapted to process texts with minor modifications. We follow [17] and encode the text in a coarse-to-fine way by stacking different operations. The words in a sentence are represented by one-hot vectors. By multiplying the one-hot vector with a word embedding matrix, we obtain the embedding of that word, which is further processed by a BiGRU. The word embedding matrix is initialized by training word2vec [35] on the English tags of 30 million Flickr images. We denote by $h_i$ the output at the $i$-th time step, which contains both the forward and backward GRU hidden states; the BiGRU-based feature $f_t^{(1)}$ is then obtained by average pooling along the temporal dimension, $f_t^{(1)} = \mathrm{average\_pooling}(h_1, \ldots, h_l)$. In addition, 1D convolutions with kernel sizes of $\{2, 3, 4\}$ are applied to the output of each BiGRU time step, followed by a max-pooling layer that compresses these features into a vector $f_t^{(2)}$. Thus, the final encoding vector of the text is the concatenation of the hierarchical representations, i.e., $t = (f_t^{(1)}, f_t^{(2)})$.
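A minimal PyTorch sketch of this text branch follows: word embeddings are fed to a BiGRU, whose outputs are average-pooled into $f_t^{(1)}$ and, in parallel, passed through 1D convolutions with kernel sizes {2, 3, 4} and max-pooled into $f_t^{(2)}$; the two parts are concatenated. The embedding and hidden sizes are placeholders, and the word2vec initialization from Flickr tags is omitted here.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the coarse-to-fine text branch: word embeddings -> BiGRU,
    with (1) temporal average pooling and (2) 1D convolutions with kernel
    sizes {2, 3, 4} followed by max pooling; the two parts are concatenated."""
    def __init__(self, vocab_size, embed_dim=500, hidden=512):
        super().__init__()
        # in the paper the embedding matrix is initialized from word2vec
        # trained on Flickr tags; random init is used here as a placeholder
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.bigru = nn.GRU(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.convs = nn.ModuleList(
            nn.Conv1d(2 * hidden, hidden, k) for k in (2, 3, 4)
        )

    def forward(self, word_ids):
        # word_ids: (batch, sentence_length)
        h, _ = self.bigru(self.embed(word_ids))          # (B, L, 2*hidden)
        f1 = h.mean(dim=1)                               # f_t^(1): average pooling
        conv_in = h.transpose(1, 2)                      # (B, 2*hidden, L)
        f2 = torch.cat(
            [conv(conv_in).max(dim=2).values for conv in self.convs], dim=1
        )                                                # f_t^(2): max pooling
        return torch.cat([f1, f2], dim=1)                # t = (f_t^(1), f_t^(2))
```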


3.2. Supervised text-aligned attention

We have both video-level and frame-level features after video feature extraction. Since our goal is to achieve segment-level retrieval, we need to fully explore the local structure of a video and derive the relation between the text embedding and video segments. A common attention practice is to take the dot product between the text embedding and the frame-level features, followed by a softmax operation that serves as a normalization function. The attention-based video embedding is then obtained by a weighted summation or average over the frame-level features, which is essentially a combination of pure visual features and is only loosely related to the text embedding. To solve this problem, we propose a text-aligned attention mechanism that builds a close relation between the multi-modality features, with the description's temporal boundaries as supervision. Specifically, we first align the text with the video frames using the Hadamard product between the text embedding and the frame-level features, obtaining the alignment vectors

$$a_i = t \odot v_i. \tag{3}$$

The alignment matrix $A = [a_1, \ldots, a_n]$ is then used to derive a sequence of attention scores $s = (s_1, \ldots, s_n)$, where $s_i$ is calculated by

$$s_i = \mathrm{sigmoid}(u^{T} a_i), \tag{4}$$

where $u$ is a learnable weight. We refer to $s$ as the semantic attention score, which reflects the correlation between the text and each frame. Compared to the traditional attention mechanism, our text-aligned attention builds a tight relation between frame-level features and text features. Another difference is that we replace the softmax activation with the sigmoid activation, which makes sense in this scenario. Unlike most existing temporal localization methods that predict the start and end times of an activity, we use the semantic attention scores to generate temporal proposals. Thus, in Eq. (4) we use sigmoid as the activation function so that $s_i$ represents the confidence of the $i$-th frame being included in the proposal.

The semantic attention score is supervised by ground-truth labels, which are created in the following way. Suppose $t_s$ and $t_e$ denote the start and end times of an activity, respectively, and $d$ represents the total duration of the whole video. The relative time points $\hat{t}_s$ and $\hat{t}_e$ are obtained via

$$\hat{t}_s = t_s / d, \quad \hat{t}_e = t_e / d. \tag{5}$$

Since we have interpolated the frame-level video features to a fixed length $n$ at the beginning, it is convenient to match the corresponding frame locations according to the relative time points. The ground-truth label $\hat{s}$ is an $n$-dimensional vector $(\hat{s}_1, \ldots, \hat{s}_n)$, where $\hat{s}_i$ equals 1 if the $i$-th frame lies between the start and end points, and 0 otherwise. We apply an L2 loss between the semantic attention score and the ground truth:

$$L_{reg} = \frac{1}{n} \sum_{i=1}^{n} \| \hat{s}_i - s_i \|^2. \tag{6}$$

Through our experiments, we find that the learned semantic attention score acts as an approximately unimodal curve, which conforms to the fact that the annotated duration of a text query is continuous. This motivates us to generate temporal proposals efficiently by screening out segments whose scores are larger than a threshold. It is difficult to find a constant threshold that suits all samples, because the distribution of attention scores may vary from one sample to another. Thus, we instead use a self-adaptive strategy. To be specific, for every pair of video and text, we find the maximum value $V_{max}$ among the semantic attention scores and set the threshold to $\delta = \gamma V_{max}$, where $\gamma \in (0, 1)$ can be adjusted according to the dataset. Empirically, we set $\gamma = 0.33$ for ActivityNet Captions and $\gamma = 0.5$ for DiDeMo. We then look for the first position whose value is larger than $\delta$ in two directions, from the beginning to the end and vice versa. The segment between the two selected positions serves as the temporal proposal. At test time, we apply segment localization only on the top-$K$ retrieved videos by running the feedforward process once.
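The sketch below illustrates Eqs. (3)-(6) and the self-adaptive thresholding step, assuming the text embedding and the frame-level features have already been projected to the same dimension so that the Hadamard product is well defined; tensor shapes and helper names are illustrative, not part of the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextAlignedAttention(nn.Module):
    """Supervised text-aligned attention (Eqs. (3)-(4)): Hadamard product
    between the text embedding and each frame feature, then a
    sigmoid-activated score per frame."""
    def __init__(self, dim):
        super().__init__()
        self.u = nn.Linear(dim, 1, bias=False)   # learnable weight u

    def forward(self, frame_feats, text_emb):
        # frame_feats: (B, n, dim), text_emb: (B, dim), same dimension assumed
        a = frame_feats * text_emb.unsqueeze(1)          # a_i = t ⊙ v_i, Eq. (3)
        s = torch.sigmoid(self.u(a)).squeeze(-1)         # s_i, Eq. (4), shape (B, n)
        return a, s

def attention_loss(s, t_start, t_end, duration, n):
    """L_reg of Eq. (6): build the 0/1 ground-truth vector from the relative
    start/end times (Eq. (5)) and compare it with the predicted scores."""
    idx = torch.arange(n, dtype=torch.float32, device=s.device) / n
    gt = ((idx >= (t_start / duration).unsqueeze(1)) &
          (idx <= (t_end / duration).unsqueeze(1))).float()
    return F.mse_loss(s, gt)

def propose_segment(s, gamma=0.33):
    """Self-adaptive thresholding for a single video-text pair: keep the span
    between the first and last positions whose score exceeds gamma * max(s)."""
    thresh = gamma * s.max()
    above = (s >= thresh).nonzero(as_tuple=True)[0]
    return above[0].item(), above[-1].item()
```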


3.3. Collaborative ranking

In order to exploit the correlation between the representations of video and text at different granularities, the proposed architecture consists of two branches, each constructing its own feature space. For the first branch, we follow the idea of the classical retrieval pipeline. The features of video $v$ and text $t$ are extracted independently and then mapped into a common space. The projected features in the common space are associated by optimizing the applied ranking loss, so that the similarity between matched pairs is larger than that between mismatched ones. We use the improved triplet loss with maximum violation [36], which is defined as

$$L_{rank}(v, t) = \max_{t^-}\big[\alpha + S(v, t^-) - S(v, t)\big]_{+} + \max_{v^-}\big[\alpha + S(v^-, t) - S(v, t)\big]_{+}, \tag{7}$$

where $v^-$ indicates a negative sample for video $v$, and $t^-$ a negative sample for text $t$. $S$ denotes the cosine similarity between the representations of video and text, and $\alpha$ is a predefined margin constant. When optimizing, we select the most similar ones among the negative samples of videos and texts as the hardest negative examples.
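For reference, here is a common batch-wise implementation of the max-violation triplet loss of [36] that matches the description above (hardest in-batch negatives in both directions); the margin value and the use of in-batch negatives as the candidate pool are assumptions.

```python
import torch

def max_violation_triplet_loss(sim, margin=0.2):
    """Improved triplet loss with maximum violation (Eq. (7)). `sim` is the
    (B, B) cosine-similarity matrix between B videos and B texts in a batch;
    sim[i, i] are the matched pairs. The hardest negative in the batch is
    used for each direction."""
    pos = sim.diag().unsqueeze(1)                        # S(v, t) for matched pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t = (margin + sim - pos).clamp(min=0).masked_fill(mask, 0)      # negative texts t^-
    cost_v = (margin + sim - pos.t()).clamp(min=0).masked_fill(mask, 0)  # negative videos v^-
    return cost_t.max(dim=1).values.sum() + cost_v.max(dim=0).values.sum()
```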


The relationship between video and text in the common feature space is built only by optimizing the ranking loss, which is somewhat weak at distinguishing similar samples. In fact, the triplet loss only cares about the ranking of the targets with respect to the query rather than their exact distances. Ideally, the video embedding should be as close to the corresponding text embedding as possible. That is to say, it is feasible to find an anchor for both embeddings and make them close to the anchor. Meanwhile, we push mismatched video and text embeddings away from this anchor. Thus, the second retrieval branch is based on the assumption that a joint embedding $E(v, t)$ exists in the common feature space. Here we formulate $E(v, t)$ by averaging the frame-level text-aligned features with the semantic attention weights


 

An intuitive idea is that the joint embedding $E(v, t)$ matches the corresponding samples $v$ and $t$ but stays far away from the negative samples $v^-$ and $t^-$, which can be formulated as


 

We name this the collaborative ranking (CR) strategy. Different from Eq. (7), we do not use the hardest negative samples in optimization, since we find that doing so does not always improve performance. Note that this strategy is only adopted in the training stage. In the inference stage, the features of all videos in the repository can be extracted in advance; that is to say, we do not need to compute the joint embedding for video-level retrieval.
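Since Eqs. (8)-(10) are not reproduced in this excerpt, the sketch below only approximates the collaborative ranking branch: the joint embedding is taken as a score-weighted average of the text-aligned frame features, and a hinge loss over all in-batch negatives (without hardest-negative mining) pulls $E(v, t)$ toward the matched $v$ and $t$. The normalization, the hinge form, and the margin are assumptions.

```python
import torch
import torch.nn.functional as F

def joint_embedding(aligned, scores):
    """Approximation of E(v, t): the text-aligned frame features a_i weighted
    by the semantic attention scores s_i and averaged over time. The exact
    normalization of the paper is not shown here, so a score-normalized
    average is assumed."""
    # aligned: (B, n, dim), scores: (B, n)
    w = scores / (scores.sum(dim=1, keepdim=True) + 1e-8)
    return (aligned * w.unsqueeze(-1)).sum(dim=1)        # (B, dim)

def collaborative_ranking_loss(E, v, t, margin=0.2):
    """Assumed CR objective: pull E(v, t) toward the matched video v and
    text t, push it away from in-batch negatives; unlike Eq. (7), all
    negatives are used (no hardest-negative mining)."""
    sim_v = F.cosine_similarity(E.unsqueeze(1), v.unsqueeze(0), dim=-1)  # (B, B)
    sim_t = F.cosine_similarity(E.unsqueeze(1), t.unsqueeze(0), dim=-1)
    mask = torch.eye(E.size(0), dtype=torch.bool, device=E.device)
    pos_v, pos_t = sim_v.diag().unsqueeze(1), sim_t.diag().unsqueeze(1)
    loss_v = (margin + sim_v - pos_v).clamp(min=0).masked_fill(mask, 0).mean()
    loss_t = (margin + sim_t - pos_t).clamp(min=0).masked_fill(mask, 0).mean()
    return loss_v + loss_t
```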


All components of the network are trained together to optimize the overall objective function.
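The exact form of the overall objective is not shown in this excerpt; a natural reading is a weighted sum of the ranking loss, the attention regression loss $L_{reg}$, and the collaborative ranking loss, as in the hypothetical sketch below (the weighting coefficients are assumptions).

```python
def total_loss(l_rank, l_reg, l_cr, lambda_reg=1.0, lambda_cr=1.0):
    """Sketch of end-to-end training: sum the ranking loss, the attention
    regression loss L_reg (Eq. (6)) and the collaborative ranking loss.
    The weighting coefficients are assumed, not taken from the paper."""
    return l_rank + lambda_reg * l_reg + lambda_cr * l_cr
```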
