We developed efficient tensor libraries for tensor decompositions and tensor networks, including CP, Tucker, hierarchical Tucker, tensor-train, tensor-ring, and low-tubal-rank decompositions. We provide efficient primitives on tensor cores for the tensor, Hadamard, and Khatri-Rao products, as well as contraction, matricization, tensor-times-matrix (TTM), and the matricized tensor times Khatri-Rao product (MTTKRP). These operations are the key building blocks of tensor algebra.
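To make the primitives above concrete, here is a minimal NumPy sketch of the Khatri-Rao product and a mode-1 MTTKRP. This is an illustrative reference implementation, not the library's actual API; it assumes C-order (row-major) unfoldings, under which the mode-1 MTTKRP takes the form X_(1) (B ⊙ C).

```python
import numpy as np

def khatri_rao(A, B):
    # Column-wise Kronecker product: maps (J, R) and (K, R) to (J*K, R),
    # where row j*K + k holds the elementwise product A[j, :] * B[k, :].
    J, R = A.shape
    K, R2 = B.shape
    assert R == R2, "operands must have the same number of columns"
    return (A[:, None, :] * B[None, :, :]).reshape(J * K, R)

def mttkrp_mode1(X, B, C):
    # Mode-1 MTTKRP for a third-order tensor X of shape (I, J, K):
    # result[i, r] = sum_{j,k} X[i, j, k] * B[j, r] * C[k, r].
    # With a C-order reshape, the mode-1 unfolding places index k fastest,
    # so the matching Khatri-Rao factor order is (B, C).
    I, J, K = X.shape
    X1 = X.reshape(I, J * K)          # mode-1 matricization
    return X1 @ khatri_rao(B, C)
```

This dense-matrix formulation is what maps naturally onto GPU matrix-multiply units, since the MTTKRP reduces to one large GEMM after the unfolding.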

For example, the cuTensor-tubal library adopts a frequency-domain computation scheme. We optimize data transfers and memory accesses, and support seven key tensor operations: t-FFT, inverse t-FFT, t-product, t-SVD, t-QR, t-inverse, and t-normalization. The library fully exploits the separability in the frequency domain and maps the tube-wise and slice-wise parallelism onto the single-instruction multiple-thread (SIMT) GPU architecture.
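The frequency-domain separability can be sketched as follows: an FFT along the tube (third) dimension turns the t-product's circular convolution of tubes into independent matrix multiplies, one per frontal slice, which is exactly the slice-wise parallelism a GPU exploits. This NumPy version is an assumed illustration of the scheme, not cuTensor-tubal's CUDA implementation:

```python
import numpy as np

def t_product(A, B):
    # t-product of A (n1 x n2 x n3) and B (n2 x n4 x n3), computed in the
    # frequency domain: FFT along the tubes, multiply each frontal slice
    # independently, then transform back.
    n1, n2, n3 = A.shape
    m1, n4, m3 = B.shape
    assert n2 == m1 and n3 == m3, "inner tensor dimensions must agree"
    Af = np.fft.fft(A, axis=2)        # t-FFT
    Bf = np.fft.fft(B, axis=2)
    # Each of the n3 frequency slices is an independent matrix product;
    # on a GPU these would be dispatched as a batched GEMM.
    Cf = np.einsum('ijk,jlk->ilk', Af, Bf)
    return np.fft.ifft(Cf, axis=2).real   # inverse t-FFT
```

Because the per-slice multiplies share no data, they can all be launched concurrently, which is what makes this scheme a good fit for SIMT hardware.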