Sparse General Matrix-matrix Multiplication for Sunway Manycore Architecture

doi:10.19596/j.cnki.1001-246x.8766

Abstract

This paper proposes swSpGEMM, a parallel algorithm for sparse general matrix-matrix multiplication (SpGEMM) on the new-generation Sunway many-core architecture. A lightweight parallel task-partitioning scheme addresses the load imbalance caused by the distribution of nonzeros in the input matrices. To counter the irregular memory access and inefficient instruction pipelining that arise when accumulating the products, a hierarchical sparse accumulator is designed. It makes efficient use of the hierarchical memory of the Sunway compute cores across different input characteristics and reduces the dependencies between instructions in integer lookups, so that the computing capability of the hardware is used more effectively. On the larger matrices from the SuiteSparse sparse matrix collection, swSpGEMM outperforms MKL on two Intel Xeon Gold 6132 (Skylake) processors by 21.1% and cuSPARSE on an NVIDIA A100 by 95.3%.

Key words: Sunway many-core architecture, sparse matrix computation, matrix-matrix multiplication
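As a point of reference for the accumulation step described in the abstract, the following is a minimal sketch of row-wise SpGEMM on CSR inputs. It is not the swSpGEMM implementation: a plain Python dictionary stands in for the hierarchical sparse accumulator, whose design the abstract does not detail, and the function name `spgemm_csr` and the tuple layout `(row_ptr, col_idx, values)` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Assumed CSR layout for this sketch: (row_ptr, col_idx, values)
CSR = Tuple[List[int], List[int], List[float]]


def spgemm_csr(a: CSR, b: CSR, n_rows_a: int) -> CSR:
    """Compute C = A * B row by row, with A, B and C all stored in CSR."""
    a_ptr, a_col, a_val = a
    b_ptr, b_col, b_val = b
    c_ptr: List[int] = [0]
    c_col: List[int] = []
    c_val: List[float] = []
    for i in range(n_rows_a):
        # Sparse accumulator for row i of C: column index -> partial sum.
        acc: Dict[int, float] = {}
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_col[p], a_val[p]
            # Nonzero A[i, k] scales row k of B; every nonzero of that row
            # contributes one intermediate product to row i of C.
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_col[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[q]
        # Emit the accumulated row with column indices in ascending order.
        for j in sorted(acc):
            c_col.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_col))
    return c_ptr, c_col, c_val


# A = [[1, 0, 2], [0, 3, 0]], B = [[0, 1], [4, 0], [5, 6]]
a = ([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0])
b = ([0, 1, 2, 4], [1, 0, 0, 1], [1.0, 4.0, 5.0, 6.0])
print(spgemm_csr(a, b, 2))  # ([0, 2, 3], [0, 1, 0], [10.0, 13.0, 12.0])
```

The scattered updates to `acc` are the irregular memory accesses the abstract refers to: which columns of C receive a contribution depends entirely on the sparsity pattern of the inputs.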

CLC number: O4-39

Kan LIU, Lei YANG, Wei XUE, Wenguang CHEN. Sparse General Matrix-matrix Multiplication for Sunway Manycore Architecture[J]. Chinese Journal of Computational Physics, 2024, 41(1): 22-32.

Figures/Tables (10)

Fig. 1 The architecture of SW26010-Pro

Fig. 2 An example of sparse matrix representation: (a) sparse matrix; (b) coordinate list (COO); (c) compressed sparse row (CSR)

Fig. 3 An example of SpGEMM

Fig. 4 An example of partitioning by the number of nonzeros in A

Fig. 5 An example of partitioning by the number of intermediate products

Table 1 Experimental platforms

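	Platform 1	Platform 2	Platform 3
Processor	SW26010-Pro, 6 core groups (6 management cores + 384 compute cores)	Intel Xeon Gold 6132, 2 chips (28 cores, 56 threads)	NVIDIA A100-PCIe
Peak double-precision performance	13.8 TFlop·s^-1	2.33 TFlop·s^-1 (base)	9.7 TFlop·s^-1 (19.5 TFlop·s^-1 with Tensor Cores)
Memory capacity	96 GB	96 GB	82 GB (GPU)
Memory bandwidth	307 GB·s^-1	256 GB·s^-1	1.9 TB·s^-1
Base software	swgcc	icc 2019	CUDA 11.4, GCC 10.2.1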

Fig. 6 Floating-point performance of swSpGEMM: (a) nnz 10^1 to 10^3; (b) nnz 10^3 to 10^5; (c) nnz 10^5 to 10^7; (d) nnz 10^7 to 10^9

Fig. 7 Floating-point performance of swSpGEMM

Fig. 8 Performance of swSpGEMM compared with standard libraries on other platforms: (a) nnz 10^1 to 10^3; (b) nnz 10^3 to 10^5; (c) nnz 10^5 to 10^7; (d) nnz 10^7 to 10^9

Fig. 9 Performance of swSpGEMM compared with state-of-the-art algorithms on GPU: (a) nnz 10^1 to 10^3; (b) nnz 10^3 to 10^5; (c) nnz 10^5 to 10^7; (d) nnz 10^7 to 10^9
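The two partitioning criteria named in the captions of Fig. 4 and Fig. 5 can be illustrated with a short sketch. It assumes that work is split at row granularity and that cut points come from a prefix sum plus binary search; the paper's actual partitioning scheme may differ, and the function names `intermediate_products_per_row` and `partition_rows` are hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate
from typing import List


def intermediate_products_per_row(
    a_ptr: List[int], a_col: List[int], b_ptr: List[int]
) -> List[int]:
    """For each row i of A, sum nnz(B[k, :]) over the nonzeros A[i, k]."""
    return [
        sum(b_ptr[a_col[p] + 1] - b_ptr[a_col[p]]
            for p in range(a_ptr[i], a_ptr[i + 1]))
        for i in range(len(a_ptr) - 1)
    ]


def partition_rows(work: List[int], n_parts: int) -> List[int]:
    """Return n_parts + 1 row boundaries that roughly equalize total work."""
    prefix = [0] + list(accumulate(work))  # prefix[r] = work of rows [0, r)
    total = prefix[-1]
    bounds = [0]
    for t in range(1, n_parts):
        target = total * t / n_parts
        # First row boundary whose cumulative work reaches the target.
        cut = bisect_left(prefix, target)
        bounds.append(max(cut, bounds[-1]))  # keep boundaries non-decreasing
    bounds.append(len(work))
    return bounds


# CSR row pointers and column indices of A = [[1, 0, 2], [0, 3, 0]],
# and row pointers of B = [[0, 1], [4, 0], [5, 6]].
a_ptr, a_col = [0, 2, 3], [0, 2, 1]
b_ptr = [0, 1, 2, 4]

nnz_per_row = [a_ptr[i + 1] - a_ptr[i] for i in range(len(a_ptr) - 1)]
products_per_row = intermediate_products_per_row(a_ptr, a_col, b_ptr)
print(nnz_per_row, products_per_row)  # [2, 1] [3, 1]
print(partition_rows([3, 1, 1, 1, 2], 2))  # [0, 2, 5]
```

Passing `nnz_per_row` as the work vector corresponds to partitioning by the nonzeros of A, while `products_per_row` weights each row by the multiplications it actually triggers, which tracks the true cost more closely when the rows of B differ widely in length.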

