Plizzari C, Cannici M, Matteucci M. Skeleton-based action recognition via spatial and temporal transformer networks[J]. Computer Vision and Image Understanding, 2021, 208: 103219.
Politecnico di Torino 意大利都灵理工大学
Q3

Introduction

尽管ST-GCN的结构已经在骨架动作识别中广泛应用，但是仍然有一些结构上的缺陷。

本文设计了Spatial Self-Attention (SSA) 模块，用于在骨架之间动态的建立联系，而独立于人体真实骨架结构。另外在时间维度上设计了Temporal Self-Attention (TSA)模块用于学习关节在时间上的变化。

Spatial–Temporal Transformer (ST-TR)

Self-Attention最初的灵感是希望对句子中单词进行跨越距离的相关性编码。因此本文希望同样的方式也可以应用到骨架节点上。

空间上：节点之间的相关性是很重要的，因此抛弃了任何预定义的骨架结构，让self-attention自动查找关节关系，类似于动态边的图卷积。

时间上：也希望通过self-attention查找出不同帧的关系。

Spatial Self-Attention (SSA)

类似于文本的self-Attention，抛开图结构，对关节点进行注意力计算。，最终输出这个关节点经过注意力编码的特征：

\[ \mathbf{z}_{i}^{t}=\sum_{j} \operatorname{softmax}_{j}\left(\frac{\alpha_{i j}^{t}}{\sqrt{d_{\mathrm{k}}}}\right) \mathbf{v}_{j}^{t} \]

Temporal Self-Attention (TSA)

在时间计算中，每个关节点被视为独立的存在，去计算帧之间的注意力关联性：

\[ \alpha_{t u}^{v}=\mathbf{q}_{t}^{v} \cdot \mathbf{k}_{u}^{v} \quad \forall v \in V, \quad \mathbf{z}_{t}^{v}=\sum_{j} \operatorname{softmax}_{u}\left(\frac{\alpha_{t u}^{v}}{\sqrt{d_{\mathrm{k}}}}\right) \mathbf{v}_{u}^{v}, \]

Two-Stream Spatial–Temporal Transformer (ST-TR) Network

为了结合SSA和TSA模块，构造了一个 two-stream architecture(ST-TR)。SSA和TSA分别独立在S-TR stream和 T-TR stream进行应用，然后再融合。(Shi et al. 2019)

两个streams中都先应用了三层的残差网络对特征进行提取。S-TR中在空间上使用了GCN进行提取特征，T-TR中使用了标准的2D卷积(TCN,Yan et al. 2018)。

然后在后续的S-TR模块，T-TR模块(结构如上图)处理中分别用SSA和TSA来替换GCN和TCN。

最后在经历一系列S-TR , T-TR模块的处理后，通过累加起两个stream的softmax输出来获得最终的score。

参考文献

[1] Shi, L., Zhang, Y., Cheng, J., Lu, H., 2019b. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12026–12035.

[2] Shi, L., Zhang, Y., Cheng, J., Lu, H., 2019a. Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7912–7921.

Sitch's Blog

SkeletonTransformer