- Developed a cycle-accurate simulator for VMIPS-based vector processors with 6 pipelined functional units
- Implemented machine learning kernels in the VMIPS ISA, including dot products, matrix multiplications, and strided convolution layers
- Conducted design space exploration over processor configurations, achieving up to 35% faster execution across benchmarks
- Introduced novel architectural optimizations, improving performance by 15% (dot product), 20% (matrix multiplication), and 10% (convolution)
- Improved memory access efficiency via parallel memory bank access, reducing dot product execution time by 33% and convolution by 27%
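
To illustrate why parallel bank access matters for strided vector loads, here is a minimal sketch of a banked-memory timing model. It is not the project's simulator: the function name `vector_load_cycles`, the word-addressed layout, the one-issue-per-cycle rule, and the example values (16 banks, 4-cycle bank busy time) are all assumptions chosen for illustration.

```python
def vector_load_cycles(base, stride, length, num_banks, bank_busy):
    """Return the cycle at which the last of `length` word loads completes.

    Hypothetical model: at most one access issues per cycle, addresses are
    word-indexed, and an access stalls until its target bank is free again.
    """
    bank_free_at = [0] * num_banks  # cycle at which each bank becomes idle
    cycle = 0                       # earliest cycle the next access may issue
    for i in range(length):
        bank = (base + i * stride) % num_banks
        start = max(cycle, bank_free_at[bank])  # stall on a busy bank
        bank_free_at[bank] = start + bank_busy
        cycle = start + 1
    return max(bank_free_at)


# Unit stride spreads accesses over all banks, so no access stalls.
unit_stride = vector_load_cycles(0, 1, 64, num_banks=16, bank_busy=4)
# A stride equal to the bank count sends every access to the same bank.
conflict_stride = vector_load_cycles(0, 16, 64, num_banks=16, bank_busy=4)
```

Under these assumptions the unit-stride load overlaps bank latencies almost completely, while the conflicting stride serializes on one bank, which is the kind of gap that bank-level parallelism is meant to close.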