计算物理 ›› 2024, Vol. 41 ›› Issue (1): 22-32.DOI: 10.19596/j.cnki.1001-246x.8766

• 面向超级计算机的性能优化技术与数值并行算法专刊 • 上一篇    下一篇

适用于申威众核架构的稀疏矩阵-矩阵乘法

刘侃1(), 杨磊2, 薛巍1,*(), 陈文光1   

  1. 1. 清华大学计算机科学与技术系, 北京 100084
    2. 国家超级计算无锡中心, 江苏 无锡 214000
  • 收稿日期:2023-05-30 出版日期:2024-01-25 发布日期:2024-02-05
  • 通讯作者: 薛巍
  • 作者简介:刘侃, 男, 博士研究生, 研究方向为高性能稀疏矩阵计算, E-mail: liukan17@mails.tsinghua.edu.cn
  • 基金资助:
    国家重点专项(2020YFA0607903);国家自然科学基金项目(U2242210)

Sparse General Matrix-matrix Multiplication for Sunway Manycore Architecture

Kan LIU1(), Lei YANG2, Wei XUE1,*(), Wenguang CHEN1   

  1. 1. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
    2. National Supercomputing Center in Wuxi, Wuxi, Jiangsu 214000, China
  • Received:2023-05-30 Online:2024-01-25 Published:2024-02-05
  • Contact: Wei XUE

摘要:

本文提出新一代申威众核架构上稀疏通用矩阵-矩阵乘法(SpGEMM)的并行算法swSpGEMM。设计轻量级并行任务划分有效地应对了矩阵非零元分布引起的负载不均衡问题; 针对累加过程中的不规则访存和指令流水低效问题, 设计了分层稀疏累加器, 在不同输入特征下高效利用申威从核层次化内存, 且减少了整数查找中的指令间依赖, 更有效地发挥硬件的计算能力。SuiteSparse稀疏矩阵测试集中较大规模输入矩阵上, swSpGEMM的性能相比Intel Skylake双CPU上的MKL和NVIDIA A100上的cuSPARSE分别加速了21.1%和95.3%。

关键词: 申威众核架构, 稀疏矩阵计算, 矩阵-矩阵乘法

Abstract:

A parallel algorithm for sparse general matrix-matrix multiplication (SpGEMM), swSpGEMM, targeting the new generation Sunway many-core architecture is proposed. The algorithm addresses the load balance issue caused by the distribution of nonzeros in input matrix, using a light weight parallel task partitioning. For the irregular memory access and inefficient instruction pipelining in accumulating the product, a hierarchical sparse accumulator has been proposed to maximize the utilization of local memory with different input matrix features and to relieve the instruction dependency in integer searching, resulting in more efficient use of the computing capability of the hardware. On large matrices from the SuiteSparse sparse matrix collection, the algorithm outperforms MKL on two Intel Xeon GOLD 6132 processors by 21.1% and cuSPARSE on NVIDIA A100 by 95.3%.

Key words: Sunway many-core architecture, sparse matrix computation, matrix-matrix multiplication

中图分类号: