Bruce-Lee-LY

@Bruce-Lee-LY

LLM Infer, AI Infra, CUDA

Tsinghua University

221 Followers


Language Breakdown

Lines of code distribution across 17 owned repositories

7.3M Total LOC

C: 3,381,987 lines (46.4%)
C++: 2,840,105 lines (39.0%)
Python: 609,951 lines (8.4%)
Cuda: 360,848 lines (4.9%)
Shell: 68,825 lines (0.9%)
Other: 28,286 lines (0.4%)

T-Shaped Developer

Deep in C, with breadth across C++, Python, Cuda, and Shell.

Collaboration Network

Global Impact visualization


0 active collaborators

Growth: +18%

Top Collaborators

No collaborator data yet.

Coding Streak

Contribution activity over the past year

3 days

Contribution calendar (Jun to Jun) covering contributions, commits, and pull requests, based on GitHub activity.

Followers 221

Shreekanth Guttedar

@shreekanthashokg-lang

Yinzuo Jiang

@jiangyinzuo

alemredd

@alemredd

Donghwi Seo

@blowthehwistle

Huang Yubiao

@H-Y-B


Following

1 total

Tri Dao

@tridao

Synced via GitHub

Top Repositories

cuda_hgemm

Several optimization methods for half-precision general matrix multiplication (HGEMM) using Tensor Cores with the WMMA API and MMA PTX instructions.

548 90

Cuda

cuda_hook

Hooks CUDA-related dynamic libraries using automated code generation tools.

173 48

cuda_hgemv

Several optimization methods for half-precision general matrix-vector multiplication (HGEMV) using CUDA cores.

74 9

Cuda

decoding_attention

Decoding Attention is specially optimized for MHA, MQA, GQA, and MLA, using CUDA cores for the decoding stage of LLM inference.

47 4

C++

flash_attention_inference

Performance of the C++ interfaces of Flash Attention and Flash Attention v2 in large language model (LLM) inference scenarios.

45 7

C++

cuda_auto_tune

NCU-driven iterative optimization workflow for CUDA/CUTLASS/Triton/CuTe DSL kernels.

23 2

Python

cutlass_gemm

Multiple GEMM operators built with CUTLASS to support LLM inference.

20 2

C++

matrix_multiply

Several common matrix multiplication methods implemented on CPU and NVIDIA GPU using C++11 and CUDA.

14 3

C++

cuda_back2back_hgemm

Uses Tensor Cores to compute back-to-back HGEMM (half-precision general matrix multiplication) with MMA PTX instructions.

13 3

Cuda

memory_pool

A simple and efficient memory pool implemented in C++11.

10 4

C++

Open Source Impact

Contributions to external projects

30 merged PRs

Contributed to 2 repositories