• Developed a cycle-accurate simulator for VMIPS-based vector processors with 6 pipelined functional units
  • Implemented machine learning kernels in the VMIPS ISA, including dot products, matrix multiplications, and strided convolution layers
  • Conducted design space exploration to optimize configurations, resulting in up to 35% faster execution across benchmarks
  • Introduced novel architectural optimizations, improving performance by 15% (dot product), 20% (matrix multiplication), and 10% (convolution)
  • Improved memory access efficiency via parallel memory bank access, reducing execution time by 33% for dot product and 27% for convolution
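
As a rough illustration of why parallel bank access helps, the sketch below models a vector load against interleaved memory banks. It is a simplified toy model, not the project's simulator: the function name `vector_load_cycles`, its parameters, and the one-address-per-cycle issue rule are all assumptions made for this example.

```python
def vector_load_cycles(n_elems, stride, n_banks, bank_busy):
    """Cycle at which the last of n_elems word accesses completes.

    Hypothetical model: one address is issued per cycle at best, words are
    interleaved across banks (bank = word index mod n_banks), and a bank is
    unavailable for bank_busy cycles after it accepts an access.
    """
    free_at = [0] * n_banks  # cycle at which each bank can accept a new access
    cycle = 0                # earliest cycle the next address can be issued
    for i in range(n_elems):
        bank = (i * stride) % n_banks
        start = max(cycle, free_at[bank])  # stall while the target bank is busy
        free_at[bank] = start + bank_busy
        cycle = start + 1
    return max(free_at)


# Unit stride over 8 banks overlaps the accesses; a single bank serializes them.
print(vector_load_cycles(8, stride=1, n_banks=8, bank_busy=4))  # 11
print(vector_load_cycles(8, stride=1, n_banks=1, bank_busy=4))  # 32
# A stride equal to the bank count maps every access to the same bank.
print(vector_load_cycles(8, stride=8, n_banks=8, bank_busy=4))  # 32
```

The cycle counts here come only from the toy parameters above and are unrelated to the measured 33% and 27% reductions.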
