An Efficient PointLSTM for Point Clouds Based Gesture Recognition

Min Y, Zhang Y, Chai X, et al. An efficient pointlstm for point clouds based gesture recognition[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 5761-5770.

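The paper treats gesture recognition as an irregular sequence recognition task and aims to capture long-term spatial correlations across point cloud sequences. PointLSTM propagates information from the past to the future while preserving the spatial structure. It combines the state information of neighboring points from the past with the current features and updates the current states through a weight-shared LSTM layer. The method can be integrated into many other sequence learning approaches.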

On gesture recognition benchmarks (NVGesture and SHREC'17), it achieves state-of-the-art results and outperforms previous skeleton-based methods.

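Main contributions:

Proposes PointLSTM, which extracts long-term spatio-temporal relationships from irregular sequence data while preserving the spatial structure.
The simplified version, PointLSTM-PSS, reduces computation and makes it easier to explore ways of improving performance.
Shows great potential for real-time applications in 3D gesture recognition and action recognition.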

Introduction

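Compared with RGB data, point clouds describe the geometric structure and distance information of object surfaces more precisely, which provides additional cues for gesture recognition. The main task is then how to extract rich features from point clouds. PointNet by Qi et al. [30] extracts information directly from raw point clouds. PointNet++ [31] extends it with hierarchical aggregation and sampling operations to capture local structure. Some recent works [19, 20, 23] adapt the grouping operation to extract motion and structure features from spatio-temporal neighborhoods. However, these works are limited to short-term modeling and lack the ability to capture long-term relationships.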

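The recent success of RNNs and LSTMs in sequence modeling [3, 7] offers inspiration for addressing this problem. However, point cloud data is unordered, so directly applying a weight-shared LSTM layer to unaligned point cloud sequences makes the state update difficult. The main challenge is therefore how to exploit temporal information while preserving the spatial structure.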

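To address this, the paper proposes an LSTM tailored to point clouds, PointLSTM. Ideally, each point in the current frame would find one corresponding point in the past and be processed together with it. This is a very strong assumption that almost never holds in practice, so the condition is relaxed: find and aggregate the states of some relevant points in past frames.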

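Original Figure 1: s denotes the state and f the feature. (a) In the ideal case, every point can find its corresponding point in the previous time step. (b) Without such a strong assumption, PointLSTM can still aggregate relevant information from the spatial neighborhood in the past.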

In addition, to reduce computation, a simplified version of PointLSTM with point-shared states is proposed: PointLSTM-PSS.

Vision-based gesture recognition

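[2, 26, 27] use graph neural networks and LSTMs to learn spatio-temporal sequences of hand joints. However, skeleton-based methods are sensitive to occlusion, motion speed, image resolution, and similar factors. Compared with skeleton data, point clouds reflect geometric features better.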

LSTM for sequence modeling

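Works such as [3, 7] confirm that the LSTM, a special case of the RNN, has strong long-term modeling ability in sequence modeling. PointRNN [8] and CloudLSTM [43] both apply RNNs to dynamic point clouds for pointwise prediction.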

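Those methods use pooling to summarize local information for pointwise prediction. PointLSTM differs in that it preserves the spatial structure and uses pooling to gather the relevant information for global features.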

Method

PointLSTM

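To handle the problem that points in the current frame cannot be matched to points in past frames, two solutions for unaligned point clouds are proposed, depending on whether points in the same frame share state information.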

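Definitions:

The point cloud sequence is $\mathbb{P}$. Each frame contains an arbitrary number of points, $\mathbb{P}^{(t)}=\{p_{i}^{(t)} \mid i=1,2, \cdots, n_{t}\}$. Each point $p_{i}^{(t)}$ consists of two parts: a $d$-dimensional coordinate $\boldsymbol{x}_{i}^{(t)}$ and an $m$-dimensional feature vector $\boldsymbol{f}_{i}^{(t)}$.
The neighborhood set of point $p_{i}^{(t)}$ in frame $\mathbb{P}^{(t+\Delta t)}$: $\mathcal{N}_{\Delta t}\left(\boldsymbol{x}_{i}^{(t)}\right)$.
A general LSTM layer: $h^{(t)}, c^{(t)}=\operatorname{LSTM}\left(\boldsymbol{y}^{(t)}, h^{(t-1)}, c^{(t-1)}\right)$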

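Original Figure 2: (a) In PointLSTM, each point has an independent state, updated from the current input and the states of its past neighbors. (b) In PointLSTM-PSS, points in the same frame share one state, updated by averaging over the set of output states.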

Point-independent states

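Assume each point has an independent hidden state $h^{(t)}_i$ and cell state $c^{(t)}_i$. For each point, relevant points in the past neighborhood are found to form point pairs $\left(p_{i}^{(t)}, p_{j}^{(t-1)}\right), p_{j}^{(t-1)} \in \mathcal{N}_{-1}\left(\boldsymbol{x}_{i}^{(t)}\right)$. The LSTM output is computed for each pair: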

\[ \begin{aligned} \boldsymbol{y}_{i, j}^{(t)} &=\left[\boldsymbol{x}_{i}^{(t)}-\boldsymbol{x}_{j}^{(t-1)} ; \boldsymbol{f}_{i}^{(t)}\right] \\ \tilde{h}_{i, j}^{(t)}, \tilde{\boldsymbol{c}}_{i, j}^{(t)} &=\operatorname{LSTM}\left(\boldsymbol{y}_{i, j}^{(t)}, h_{j}^{(t-1)}, c_{j}^{(t-1)}\right) \end{aligned} \]

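This gives a temporary hidden state and cell state $\tilde{h}_{i, j}^{(t)}, \tilde{c}_{i, j}^{(t)}$ for every point pair $\left(p_{i}^{(t)}, p_{j}^{(t-1)}\right)$. These temporary variables are then used to update the hidden state and cell state $h_{i}^{(t)}, c_{i}^{(t)}$ of each point $p_{i}^{(t)}$, which completes the update: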

\[ \begin{gathered} h_{i}^{(t)}=g\left(\tilde{h}_{i, 1}^{(t)}, \tilde{h}_{i, 2}^{(t)}, \cdots, \tilde{h}_{i, n_{t-1}}^{(t)}\right) \\ c_{i}^{(t)}=g\left(\tilde{c}_{i, 1}^{(t)}, \tilde{c}_{i, 2}^{(t)}, \cdots, \tilde{c}_{i, n_{t-1}}^{(t)}\right) \end{gathered} \]

where $g$ is a symmetric function, implemented as a max pooling layer.
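The pairwise update above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names `lstm_cell` and `pointlstm_step`, the gate ordering inside `W`, the random weights, and the toy sizes are all assumptions made for the example. The neighbor indices are taken as given.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(y, h_prev, c_prev, W, b):
    """Standard LSTM cell applied along the last axis.

    W maps the concatenation [y; h_prev] to the four gates
    (input, forget, output, candidate), in that assumed order.
    """
    z = np.concatenate([y, h_prev], axis=-1) @ W + b
    i, f, o, g = np.split(z, 4, axis=-1)
    c = sigmoid(f) * c_prev + sigmoid(i) * np.tanh(g)
    h = sigmoid(o) * np.tanh(c)
    return h, c

def pointlstm_step(x_t, f_t, x_prev, h_prev, c_prev, neighbors, W, b):
    """One update with point-independent states.

    x_t: (n_t, d) coordinates, f_t: (n_t, m) features of the current frame.
    x_prev: (n_prev, d); h_prev, c_prev: (n_prev, hidden) from the previous frame.
    neighbors: (n_t, k) integer indices into the previous frame.
    """
    n_t, k = neighbors.shape
    # y_{i,j} = [x_i^{(t)} - x_j^{(t-1)} ; f_i^{(t)}] for every pair (i, j)
    offsets = x_t[:, None, :] - x_prev[neighbors]            # (n_t, k, d)
    feats = np.broadcast_to(f_t[:, None, :], (n_t, k, f_t.shape[1]))
    y = np.concatenate([offsets, feats], axis=-1)            # (n_t, k, d + m)
    # Temporary states, one per pair, using the neighbor's previous states
    h_tmp, c_tmp = lstm_cell(y, h_prev[neighbors], c_prev[neighbors], W, b)
    # g: max pooling over the k neighbors of each point
    return h_tmp.max(axis=1), c_tmp.max(axis=1)

# Toy usage with random data (all sizes are illustrative)
rng = np.random.default_rng(0)
n_prev, n_t, d, m, hidden, k = 6, 5, 3, 4, 8, 2
W = 0.1 * rng.normal(size=(d + m + hidden, 4 * hidden))
b = np.zeros(4 * hidden)
x_prev = rng.normal(size=(n_prev, d))
h_prev = np.zeros((n_prev, hidden))
c_prev = np.zeros((n_prev, hidden))
x_t = rng.normal(size=(n_t, d))
f_t = rng.normal(size=(n_t, m))
neighbors = rng.integers(0, n_prev, size=(n_t, k))
h_t, c_t = pointlstm_step(x_t, f_t, x_prev, h_prev, c_prev, neighbors, W, b)
```

Because the same `W` and `b` are used for every pair, the layer is weight-shared, and because max pooling is symmetric, the result does not depend on the order of the neighbors.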

Point-shared states

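Computing all point pairs for every point as above is expensive. In the simplified version, all points in the same frame share one hidden state and one cell state. The formulas are obtained from those above by dropping the point index $i$ from $h_i$ and $c_i$, so the states need to be updated only once per time step.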
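Written out, one plausible form of the shared-state update is shown here. The notes only say to drop the point index from the states, so the form of the input $\boldsymbol{y}_{i}^{(t)}$ is an assumption. $g$ is again a symmetric pooling function (the caption of Figure 2 describes averaging for this variant).

\[ \begin{aligned} \boldsymbol{y}_{i}^{(t)} &=\left[\boldsymbol{x}_{i}^{(t)} ; \boldsymbol{f}_{i}^{(t)}\right] \\ \tilde{h}_{i}^{(t)}, \tilde{c}_{i}^{(t)} &=\operatorname{LSTM}\left(\boldsymbol{y}_{i}^{(t)}, h^{(t-1)}, c^{(t-1)}\right) \\ h^{(t)} &=g\left(\tilde{h}_{1}^{(t)}, \tilde{h}_{2}^{(t)}, \cdots, \tilde{h}_{n_{t}}^{(t)}\right) \\ c^{(t)} &=g\left(\tilde{c}_{1}^{(t)}, \tilde{c}_{2}^{(t)}, \cdots, \tilde{c}_{n_{t}}^{(t)}\right) \end{aligned} \]

As a rough cost comparison under these assumptions, take a hypothetical frame with $n_t = 128$ points and $k = 16$ neighbors per point. The point-independent version evaluates the LSTM cell $128 \times 16 = 2048$ times per frame, while the shared-state version evaluates it $128$ times.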

Neighborhood Grouping

To show the effect of alignment, two grouping methods are used.

Direct grouping

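Directly find the k nearest neighbors of the center point $p_{i}^{(t)}$ in the previous frame. When the object is static, this aggregates spatial information from adjacent frames. Without a distance limit, it can also capture some motion information.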
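Direct grouping amounts to a brute-force k-nearest-neighbor search between two frames. The sketch below is only an illustration: the name `direct_grouping` and the toy coordinates are assumptions, and no distance limit is applied.

```python
import numpy as np

def direct_grouping(x_t, x_prev, k):
    """Return (n_t, k) indices of the k nearest previous-frame points
    for each point of the current frame (no distance limit)."""
    # Pairwise squared Euclidean distances, shape (n_t, n_prev)
    diff = x_t[:, None, :] - x_prev[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    # Sort each row from nearest to farthest and keep the first k columns
    return np.argsort(dist2, axis=1)[:, :k]

x_prev = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 5.0, 5.0]])
x_t = np.array([[0.1, 0.0, 0.0], [4.0, 5.0, 5.0]])
neighbors = direct_grouping(x_t, x_prev, k=2)
print(neighbors.tolist())  # [[0, 1], [2, 1]]
```

A full sort is used for clarity. For larger point clouds, a partial sort or a spatial index would be the more usual choice.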

Aligned grouping

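Assume the current point $p^{(t)}_i$ has a virtual corresponding point $\tilde{p}^{(t-1)}_i$ in the previous frame. This point is located by estimating the backward flow between them, $\Delta \boldsymbol{x}^{(t)}_i=\tilde{\boldsymbol{x}}^{(t-1)}_i-\boldsymbol{x}^{(t)}_i$, and is then used to find the k-neighborhood set $\mathcal{N}_{-1}\left(\boldsymbol{x}_{i}^{(t)};k\right)$ in the previous frame.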

However, estimating such non-rigid scene flow remains a difficult problem.
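If a backward flow estimate were available, aligned grouping could be sketched as a neighbor search around the shifted points. The code below assumes the flow is the displacement from each current point to its virtual corresponding point in the previous frame. The function name `aligned_grouping` and the hand-written flow values are hypothetical, and how the flow is estimated is outside the scope of this sketch.

```python
import numpy as np

def aligned_grouping(x_t, flow_back, x_prev, k):
    """kNN in the previous frame around the virtual corresponding points.

    flow_back: (n_t, d) assumed backward flow, so that the virtual point
    in frame t-1 is x_t + flow_back.
    """
    x_virtual = x_t + flow_back
    diff = x_virtual[:, None, :] - x_prev[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    return np.argsort(dist2, axis=1)[:, :k]

# A point that moved by +2 along the first axis between frames
x_prev = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
x_t = np.array([[2.1, 0.0, 0.0]])
flow_back = np.array([[-2.0, 0.0, 0.0]])

aligned = aligned_grouping(x_t, flow_back, x_prev, k=1)
# With zero flow the search reduces to direct grouping
direct = aligned_grouping(x_t, np.zeros_like(x_t), x_prev, k=1)
print(aligned.tolist(), direct.tolist())  # [[0]] [[2]]
```

In this toy case the aligned search returns the point the current point came from, while the direct search returns whichever previous point is spatially closest.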

Implementation details

Density-based sampling layer

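Most points obtained from a depth video are redundant. Work [23] shows that, for gesture recognition, a small point cloud of 100 to 200 points per frame is a reasonable choice. To reduce redundant computation, a density-based sampling method [21] is therefore adopted. The density at point $\boldsymbol{x}^{(t)}_i$ is estimated as: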

\[ \rho\left(\boldsymbol{x}_{i}^{(t)}\right)=\frac{1}{n_{t} r^{d}} \sum_{j=1}^{n_{t}} w\left(\frac{\boldsymbol{x}_{i}^{(t)}-\boldsymbol{x}_{j}^{(t)}}{r}\right) \]

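where $r$ is the Euclidean distance between $\boldsymbol{x}^{(t)}_i$ and its k-th nearest neighbor, and $w$ is a bounded integrable weight function. Based on this density estimate, each sampling layer keeps the points with lower density, which correspond to the boundary of the point cloud.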
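The density estimate and the sampling rule can be sketched as follows. The notes do not specify the weight function, so a Gaussian $w(\boldsymbol{u})=\exp(-\|\boldsymbol{u}\|^{2}/2)$ is assumed here. The function names and the toy points are also assumptions, and the sketch presumes no duplicate points (so that $r > 0$).

```python
import numpy as np

def knn_density(x, k):
    """k-nearest-neighbor density estimate for each point of one frame.

    x: (n, d) coordinates. Uses an assumed Gaussian weight function.
    """
    n, d = x.shape
    diff = x[:, None, :] - x[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))        # (n, n)
    # r_i: distance to the k-th nearest neighbor; column 0 of each
    # sorted row is the point itself at distance 0
    r = np.sort(dist, axis=1)[:, k]
    w = np.exp(-0.5 * (dist / r[:, None]) ** 2)
    return w.sum(axis=1) / (n * r ** d)

def density_based_sample(x, n_keep, k):
    """Keep the n_keep points with the lowest estimated density."""
    rho = knn_density(x, k)
    keep = np.argsort(rho)[:n_keep]
    return np.sort(keep)

# Four clustered points and one isolated point
x = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [3.0, 3.0]])
kept = density_based_sample(x, n_keep=1, k=2)
print(kept.tolist())  # [4]
```

The isolated point has by far the lowest density and is kept first, which matches the idea that low-density points tend to lie on the boundary of the point cloud.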

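Original Figure 4: The first row shows preprocessed point cloud sequences of the hand region segmented from depth data, with 128 points per frame. The second row shows the result after sampling, with 64 points per frame. The third row shows the corresponding skeleton sequences.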

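Difference between action recognition and gesture recognition: gestures are designed for non-verbal communication and carry linguistic properties. Actions are forms of behavior performed to accomplish a goal and have larger intra-class variation.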

References

[1] Roy P, Bhattacharya S, Roy P P, et al. Position and Rotation Invariant Sign Language Recognition from 3D Point Cloud Data with Recurrent Neural Networks[J]. arXiv preprint arXiv:2010.12669, 2020.

[2] Yuxiao Chen, Long Zhao, Xi Peng, Jianbo Yuan, and Dimitris N Metaxas. Construct dynamic graphs for hand gesture recognition via spatial-temporal attention. In British Machine Vision Conference, 2019

[3] Kyunghyun Cho, Bart Van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1724–1734, 2014.

[7] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

[8] Hehe Fan and Yi Yang. Pointrnn: Point recurrent neural network for moving point cloud processing. arXiv preprint arXiv:1910.08287, 2019.

[19] Xingyu Liu, Charles R Qi, and Leonidas J Guibas. Flownet3d: Learning scene flow in 3d point clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 529–537, 2019.

[20] Xingyu Liu, Mengyuan Yan, and Jeannette Bohg. Meteornet: Deep learning on dynamic 3d point cloud sequences. In Proceedings of the IEEE International Conference on Computer Vision, pages 9246–9255, 2019

[21] YP Mack and Murray Rosenblatt. Multivariate k-nearest neighbor density estimates. Journal of Multivariate Analysis, 9(1):1–15, 1979.

[23] Yuecong Min, Xiujuan Chai, Lei Zhao, and Xilin Chen. Flickernet: Adaptive 3d gesture recognition from sparse point clouds. In British Machine Vision Conference, 2019.

[26] Xuan Son Nguyen, Luc Brun, Olivier Lezoray, and Sébastien Bougleux. A neural network based on spd manifold learning for skeleton-based hand gesture recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12036–12045, 2019.

[27] Juan C Nunez, Raul Cabido, Juan J Pantrigo, Antonio S Montemayor, and Jose F Velez. Convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition. Pattern Recognition, 76:80–94, 2018.

[30] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 652–660,

[31] Qi C R, Yi L, Su H, et al. Pointnet++: Deep hierarchical feature learning on point sets in a metric space[J]. arXiv preprint arXiv:1706.02413, 2017.

[43] Chaoyun Zhang, Marco Fiore, Iain Murray, and Paul Patras. Cloudlstm: A recurrent neural model for spatiotemporal point-cloud stream forecasting. arXiv preprint arXiv:1907.12410, 2019.