• Accelerated BERT inference by 34% on an NVIDIA T4 GPU by fusing the GEMM, bias-add, and GELU operations into a single kernel (illustrative sketch below).
  • Reduced training step time by 38% by applying the fused kernel during back-propagation, speeding up model iteration and lowering projected GPU cloud costs.
  • Identified bottlenecks consuming ~83% of GPU time with profiling, focusing optimization effort where it delivered the greatest return.
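
A minimal CUDA sketch of the fusion idea above, for illustration only: it assumes row-major float32 matrices, a one-thread-per-output-element mapping, and the tanh GELU approximation. The kernel and parameter names are hypothetical, and a production version on the T4 would use tiling and tensor cores rather than this naive inner loop.

  #include <cuda_runtime.h>
  #include <math.h>

  // Simplified fused kernel: each thread computes one element of
  // C = GELU(A * B + bias), so the GEMM, bias-add, and GELU activation
  // happen in a single pass with no intermediate writes to global memory.
  // A is M x K, B is K x N, bias has length N (all row-major).
  // Shapes, names, and the tanh GELU approximation are illustrative assumptions.
  __global__ void fused_gemm_bias_gelu(const float* A, const float* B,
                                       const float* bias, float* C,
                                       int M, int N, int K) {
      int row = blockIdx.y * blockDim.y + threadIdx.y;
      int col = blockIdx.x * blockDim.x + threadIdx.x;
      if (row >= M || col >= N) return;

      // GEMM: dot product of one row of A with one column of B.
      float acc = 0.0f;
      for (int k = 0; k < K; ++k) {
          acc += A[row * K + k] * B[k * N + col];
      }

      // Bias-add fused into the same kernel.
      acc += bias[col];

      // GELU (tanh approximation) applied before the single write-back.
      float x3 = acc * acc * acc;
      C[row * N + col] =
          0.5f * acc * (1.0f + tanhf(0.7978845608f * (acc + 0.044715f * x3)));
  }

Because the bias-add and GELU run inside the same kernel as the matrix multiply, the intermediate GEMM output is never written to and re-read from global memory, and two extra kernel launches are avoided; that reduction in memory traffic and launch overhead is where the fused kernel's speedup comes from.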
