In this paper, we optimize the computation of third-order low-tubal-rank tensor operations on many-core GPUs. Tensor operations are compute-intensive, and existing studies optimize them in a case-by-case manner, which can be inefficient and error-prone. We develop and optimize a BLAS-like library for the low-tubal-rank tensor model, called cuTensor-tubal, which includes efficient GPU primitives for tensor operations and their key processes. We compute tensor operations in the frequency domain and fully exploit tube-wise and slice-wise parallelism. We design, implement, and optimize four key tensor operations, namely t-FFT, inverse t-FFT, t-product, and t-SVD. For t-product and t-SVD, cuTensor-tubal demonstrates significant speedups: maximum speedups of 29.16× and 6.72× over the non-optimized GPU counterparts, and maximum speedups of 16.91× and 27.03× over CPU implementations running on dual 10-core Xeon CPUs.
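To make the frequency-domain workflow concrete, below is a minimal NumPy sketch of the t-product pipeline the abstract describes: a t-FFT along the tube dimension, independent matrix multiplies on the frontal slices (the slice-wise parallelism exploited on the GPU), and an inverse t-FFT. This is an illustrative CPU reference under standard t-product definitions, not the cuTensor-tubal API; the function name `t_product` and the tensor shapes are assumptions for the example.

```python
import numpy as np

def t_product(A, B):
    """Reference sketch of the t-product C = A * B.

    A: n1 x n2 x n3 tensor, B: n2 x n4 x n3 tensor -> C: n1 x n4 x n3.
    (Illustrative only; not the cuTensor-tubal interface.)
    """
    n1, n2, n3 = A.shape
    _, n4, _ = B.shape
    # t-FFT: FFT along every tube (the third dimension) of A and B.
    A_hat = np.fft.fft(A, axis=2)
    B_hat = np.fft.fft(B, axis=2)
    # Slice-wise products: each frontal slice multiplies independently,
    # so all n3 multiplications can run in parallel (batched on a GPU).
    C_hat = np.empty((n1, n4, n3), dtype=complex)
    for k in range(n3):
        C_hat[:, :, k] = A_hat[:, :, k] @ B_hat[:, :, k]
    # Inverse t-FFT returns the result to the original domain.
    return np.real(np.fft.ifft(C_hat, axis=2))

# Usage: multiply two small random third-order tensors.
A = np.random.rand(4, 3, 5)
B = np.random.rand(3, 2, 5)
C = t_product(A, B)  # shape (4, 2, 5)
```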
Topic:
Tensor decomposition and applications
Citations:
4
Altmetrics:
0
Source Information:
ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)