Sparse General Matrix-matrix Multiplication for Sunway Manycore Architecture

doi:10.19596/j.cnki.1001-246x.8766

Abstract

This paper proposes swSpGEMM, a parallel algorithm for sparse general matrix-matrix multiplication (SpGEMM) targeting the new-generation Sunway many-core architecture. The algorithm addresses the load imbalance caused by the distribution of nonzeros in the input matrices with a lightweight parallel task partitioning scheme. To handle the irregular memory access and inefficient instruction pipelining that arise when accumulating products, a hierarchical sparse accumulator is proposed; it maximizes the utilization of local memory across input matrices with different features and relieves the instruction dependency in integer searching, resulting in more efficient use of the hardware's computing capability. On large matrices from the SuiteSparse sparse matrix collection, the algorithm outperforms MKL on two Intel Xeon Gold 6132 processors by 21.1% and cuSPARSE on an NVIDIA A100 by 95.3%.
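To make the two ideas in the abstract concrete, the following is a minimal Python sketch of row-wise SpGEMM on CSR inputs, with rows partitioned by their number of intermediate products. It is not the swSpGEMM implementation: the function names are hypothetical, the plain dictionary is only a stand-in for the hierarchical sparse accumulator, and the contiguous prefix-sum split is an assumed form of lightweight partitioning.

```python
from bisect import bisect_left
from itertools import accumulate


def intermediate_products_per_row(a_ptr, a_idx, b_ptr):
    """Count the scalar multiplications needed for each row of C = A * B.

    Row i of A touches row k of B for every nonzero A[i, k], so the work is
    the sum of the lengths of those rows of B.
    """
    counts = []
    for i in range(len(a_ptr) - 1):
        cols = a_idx[a_ptr[i]:a_ptr[i + 1]]
        counts.append(sum(b_ptr[k + 1] - b_ptr[k] for k in cols))
    return counts


def partition_rows(work, n_parts):
    """Split rows into n_parts contiguous ranges of roughly equal total work.

    Returns n_parts + 1 row boundaries; part p owns rows
    [bounds[p], bounds[p + 1]).
    """
    prefix = [0] + list(accumulate(work))
    total = prefix[-1]
    bounds = [0]
    for p in range(1, n_parts):
        target = total * p / n_parts
        # First row boundary whose cumulative work reaches the target.
        bounds.append(max(bounds[-1], bisect_left(prefix, target)))
    bounds.append(len(work))
    return bounds


def spgemm_rows(a, b, row_begin, row_end):
    """Compute rows [row_begin, row_end) of C = A * B.

    a and b are (ptr, idx, val) CSR triples. A dict keyed by column index
    plays the role of the sparse accumulator (an illustrative stand-in).
    """
    a_ptr, a_idx, a_val = a
    b_ptr, b_idx, b_val = b
    rows = []
    for i in range(row_begin, row_end):
        acc = {}
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, v = a_idx[p], a_val[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + v * b_val[q]
        rows.append(sorted(acc.items()))  # (column, value) pairs per row
    return rows
```

In a parallel setting, each worker would call `spgemm_rows` on its own range from `partition_rows`; balancing by intermediate products rather than by row count keeps a few dense rows from dominating one worker's share.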

Key words: Sunway many-core architecture, sparse matrix computation, matrix-matrix multiplication

CLC Number: O4-39

Kan LIU, Lei YANG, Wei XUE, Wenguang CHEN. Sparse General Matrix-matrix Multiplication for Sunway Manycore Architecture[J]. Chinese Journal of Computational Physics, 2024, 41(1): 22-32.

Figures/Tables 10

Fig.1 The architecture of SW26010-Pro

Fig.2 An example of sparse matrix representation (a) sparse matrix; (b) COO; (c) CSR

Fig.3 An example of SpGEMM

Fig.4 An example of partition according to the number of nonzeros of A

Fig.5 An example of partition according to the number of intermediate products

Table 1 Experimental platforms

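	Platform 1	Platform 2	Platform 3
Processor	Sunway SW26010-Pro, 6 core groups, 6 main cores + 384 compute cores	Intel Xeon Gold 6132, 2 chips, 28 cores, 56 threads	NVIDIA A100-PCIe
Double-precision floating-point performance	13.8 TFlop·s^-1	2.33 TFlop·s^-1 (base)	9.7 TFlop·s^-1; 19.5 TFlop·s^-1 (Tensor Core)
Memory capacity	96 GB	96 GB	82 GB (GPU)
Memory bandwidth	307 GB·s^-1	256 GB·s^-1	1.9 TB·s^-1
Base software	swgcc	icc 2019	CUDA 11.4, GCC 10.2.1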

Fig.6 The floating-point performance of swSpGEMM (a) nnz 10^1~10^3; (b) nnz 10^3~10^5; (c) nnz 10^5~10^7; (d) nnz 10^7~10^9

Fig.7 The floating-point performance of swSpGEMM

Fig.8 Performance comparison between swSpGEMM and standard libraries on other platforms (a) nnz 10^1~10^3; (b) nnz 10^3~10^5; (c) nnz 10^5~10^7; (d) nnz 10^7~10^9

Fig.9 Performance comparison between swSpGEMM and the state-of-the-art algorithms on GPU (a) nnz 10^1~10^3; (b) nnz 10^3~10^5; (c) nnz 10^5~10^7; (d) nnz 10^7~10^9
