Wenqi Lou (娄文启)

Associate Researcher (副研究员)
School of Software Engineering
Suzhou Institute for Advanced Research
University of Science and Technology of China (USTC)
Research Interests: Neural Network Accelerators, Hardware-Software Co-Optimization
Address: 508, ShaoJun Building, Suzhou Institute for Advanced Research, USTC, Suzhou, Jiangsu, China
E-mail: louwenqi@ustc.edu.cn

About me

I received my Bachelor's degree from the School of Computer, Northwestern Polytechnical University (NWPU), Xi'an, in June 2018. In the same year, I was admitted to the M.Sc. program at the School of Computer Science and Technology, USTC, with an exemption from the entrance examination. In September 2020, I began my Ph.D. studies under the supervision of Professor Xuehai Zhou and Professor Chao Wang, and I received my Ph.D. degree in computer science from USTC in December 2023. My research focuses on intelligent accelerator architecture, FPGA accelerator design, and software-hardware co-optimization, with the goal of alleviating the deployment challenges of deep learning models from both algorithmic and hardware perspectives. In recent years, I have published over 20 academic papers in the field of computer architecture, including in top-tier journals and conferences such as IEEE TCAD, IEEE TC, DAC, FPGA, and RTSS. Among these, more than 10 CCF-A/B papers were published as first or corresponding author, and I hold 4 granted patents.

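I am currently a tenured Associate Researcher and master's supervisor at the School of Software Engineering, USTC, and an executive member of the CCF Technical Committee on Computer Architecture. I received my bachelor's degree from the School of Computer, Northwestern Polytechnical University, in 2018 and my Ph.D. in computer architecture from USTC in 2023, advised by Professor Xuehai Zhou and Professor Chao Wang, and I stayed on with the group as a faculty member after graduation. My main research interests are intelligent accelerator architecture, FPGA accelerator design, and software-hardware co-optimization, with the aim of easing the deployment burden of deep learning models from both the algorithmic and hardware sides. In recent years I have published more than 30 academic papers in computer architecture, 26 of them as first or corresponding author, including 8 CCF-A papers (DAC, AAAI, TCAD, TC, etc.) and 10 CCF-B papers (ICPP, DATE, CODES, TVLSI, etc.), and I hold 5 granted invention patents. I am the principal investigator of a National Natural Science Foundation of China (NSFC) Young Scientists Fund project, a Jiangsu Provincial Natural Science Foundation Youth Fund project, university-level research projects, and an industry collaboration project with AISpeech (思必驰). I serve as a reviewer for the Chinese Journal of Computers (《计算机学报》), IEEE TCAD, TVLSI, TCBB, and other venues. I have received provincial and municipal research and teaching awards, including the Jiangsu Province Young Talent Support Program, the Third Prize of the Jiangsu Provincial Natural Science Award, and the First Prize of the Anhui Provincial Teaching Achievement Award.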

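📢: I am recruiting students with a solid background in mathematics and science who are interested in hardware acceleration and model inference optimization (recommended-admission master's, direct-entry Ph.D., and joint-training students). Students are co-supervised by Professor Chao Wang and me. Feel free to contact me by e-mail with your CV attached.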

Education

2018.09 - 2023.12, Ph.D., School of Computer Science and Technology, University of Science and Technology of China (USTC)
2014.09 - 2018.06, B.S., School of Computer, Northwestern Polytechnical University (NWPU)

Research Interests

FPGA Accelerator Design: Specialized in designing FPGA accelerators for Convolutional Neural Networks (CNN), Vision Transformers (ViT), and Large Language Models (LLM) to enhance performance and efficiency.
Neural Architecture and Accelerator Co-Search: Focused on the co-evolution of neural network architectures and hardware accelerators to achieve optimized performance on specific hardware platforms.
Model Quantization and Pruning for FPGA/GPU Inference: Expertise in reducing model complexity and size through quantization and pruning techniques, specifically tailored for efficient inference on FPGA and GPU architectures.
AI for HW Design: Utilizing artificial intelligence to revolutionize the hardware design process, including automated design exploration, optimization, and verification.

Publications (* corresponding author)

Journal Articles

[TCAD] Wenqi Lou, Hongbing Wen, Zihao Wang, Jiale Dong, Teng Wang, Lei Gong, Chao Wang, Xuehai Zhou. "AlloBit: Algorithm-Hardware Co-Design for Group-Wise Mixed-Bit LLM Inference on FPGA", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), 2026, 1:14. (CCF-A, accepted at CODES+ISSS 2026)
[TCAD] Zihan Wang, Lei Gong, Xiangjun Qu, Cheng Tang, Wenqi Lou, Teng Wang, Xianglan Chen, Chao Wang, Xuehai Zhou. "UniSparTa: A Unified Sparse Tensor Program Tuning Framework." IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), 2025:1-14. Early Access Article (CCF-A)
[TC] Zihan Wang, Lei Gong, Wenqi Lou, Teng Wang, Qianyu Cheng, Xianglan Chen, Chao Wang, Xuehai Zhou. "UniCoX: A Unified Cost Model for Tensorized Program Tuning Across Ubiquitous Accelerators". IEEE Transactions on Computers (IEEE TC), 2025:1-15. Early Access Article (CCF-A)
[TCAD] Zhendong Zheng, Qianyu Cheng, Teng Wang, Wenqi Lou, Lei Gong, Chao Wang, Xuehai Zhou. "LORA: A Latency-Oriented Recurrent Architecture for Large Language Model on Multi-FPGA Platform with Communication Optimization", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), 2025:1-14. Early Access Article (CCF-A)
[TVLSI] Jiale Dong, Wenqi Lou*, Hao Wu, Zhendong Zheng, Yunji Qin, Lei Gong, Chao Wang*, Xuehai Zhou. "MoE-Sched: Enabling Efficient FPGA Deployment of Mixture-of-Experts Vision Transformers via Coordinated Scheduling", IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 2026, 34(1):104 - 117. (CCF-B, JCR-Q1)
[TOMM] Haoyu Cai, Wenqi Lou*, Chao Wang*, Xuehai Zhou. "Picasso: Analyzing Prompt Design for Text-To-Image Generative Diffusion Models from a Temporal-Spatial Perspective", ACM Transactions on Multimedia Computing Communications and Applications (TOMM), 2025, 21(11):1-24. (CCF-B, JCR-Q1)
[ESL] Hongbing Wen, Zihao Wang, Jiale Dong, Wenqi Lou*, Chao Wang, Xuehai Zhou. "QLlama: An FPGA-Based Microscaling Quantization Accelerator for Energy-Efficient Llama2 Inference", IEEE Embedded Systems Letters (ESL), September 28-October 3, 2025, 17(5):337-340, Taipei, China. (CCF-B, CODES+ISSS 2025, Late Breaking Tracks)
[TCAD] Wenqi Lou, Yunji Qin, Xuan Wang, Lei Gong, Chao Wang, Xuehai Zhou. "FlexBCM: Hybrid Block-Circulant Neural Network and Accelerator Co-Search on FPGAs", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), 2024, 43(11):3852-3863. (CCF-A, accepted at CODES+ISSS 2024)
[TCAD] Wenqi Lou, Lei Gong, Chao Wang, Jiaming Qian, Xuan Wang, Changlong Li, Xuehai Zhou. "Unleashing Network/Accelerator Co-Exploration Potential on FPGAs: A Deeper Joint Search", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), 2024, 43(10):3041-3054. (CCF-A)
[TC] Wenqi Lou, Lei Gong, Chao Wang, Zidong Du, Xuehai Zhou. "OctCNN: A High Throughput FPGA Accelerator for CNNs Using Octave Convolution Algorithm", IEEE Transactions on Computers (IEEE TC), 2022, 71(8): 1847-1859. (CCF-A)
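[JOS] Wenqi Lou, Chao Wang, Lei Gong, Xuehai Zhou. "A Neural Network Instruction Set Extension and Code Mapping Mechanism" (一种神经网络指令集扩展与代码映射机制), Journal of Software (软件学报), 2020:3074-3086. (CCF-T1, Chinese Journal)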

Conference Papers

[APPT'26] Fengrui Zuo, Zhiwei Ke, Yiming Liu, Cheng Tang, Wenqi Lou*, Teng Wang, Lei Gong, Chao Wang, Xuehai Zhou. "Window-Diffusion: Accelerating Diffusion Language Model Inference with Windowed Token Pruning and Caching", 17th International Symposium on Advanced Parallel Processing Technology (APPT), Brussels, Belgium, July 27-31, 2026:1-12. (CCF-C, organized by the CCF Technical Committee on Computer Architecture)
[DAC'26] Haoran Xue, Teng Wang*, Qianyu Cheng, Zhendong Zheng, Wenqi Lou*, Lei Gong, Xi Li, Xuehai Zhou. "Efficient HLS Accelerator Floorplan on Multi-Die FPGA Aided by Graph Neural Networks", ACM/IEEE Design Automation Conference (DAC), 2026:1-7. (CCF-A)
[DAC'26] Zhiwei Ke, Wenqi Lou*, Yiming Liu, Fengrui Zuo, Chao Wang, Xuehai Zhou. "Late Breaking Results: ASTFusion: Two-Stage Structural Enhancement Learning for Robust HLS Code Generation", ACM/IEEE Design Automation Conference (DAC), 2026:1-2. (CCF-A)
[DAC'26] Yiming Liu, Wenqi Lou*, Zhiwei Ke, Fengrui Zuo, Chao Wang, Xuehai Zhou. "Late Breaking Results: Recoverability-guided Layer-wise N:M Sparsity under Latency Constraints", ACM/IEEE Design Automation Conference (DAC), 2026:1-2. (CCF-A)
[AAAI'26] Cheng Tang, Guochong Sui, Wenqi Lou*, Zihan Wang, Jiayi Tuo, Wenqian Xie, Yinkang Gao, Yixuan Zhu, Lei Gong, Chao Wang*, Xuehai Zhou. "CloserToMe: A Unified Framework for Accurate and Transferable Latency Prediction across Heterogeneous Devices." AAAI Conference on Artificial Intelligence (AAAI). 2026. (CCF-A)
[EuroPar'26] Yiming Liu, Wenqi Lou*, Zhiguang Wang, Zhiwei Ke, Fengrui Zuo, Chao Wang, Xuehai Zhou. "Realizable N:M Sparse Transformer Inference via Search–Kernel Co-Design", 32nd International European Conference on Parallel and Distributed Computing (EURO-PAR), 2026:1-14. (CCF-B)
[EuroPar'26] Zhiwei Ke, Yiming Liu, Fengrui Zuo, Wenqi Lou*, Chao Wang, Xuehai Zhou. "Two-Stage Hierarchy-Aware Learning with Gradient Conflict Mitigation for HLS Latency and Resource Prediction", 32nd International European Conference on Parallel and Distributed Computing (EURO-PAR), 2026:1-14. (CCF-B)
[RTSS'25] Yinkang Gao, Bo Zhang, Yixuan Zhu, Lei Gong, Teng Wang, Wenqi Lou, Chao Wang, Xi Li, Xuehai Zhou. "TSI: A Time-semantic Instruction Set for Deterministic Data-flow Execution in Real-time Embedded Systems." The 46th IEEE Real-Time Systems Symposium (RTSS), 2025. (CCF-A)
[DAC'25] Wei Fu#, Wenqi Lou#*, Cheng Tang, Hongbing Wen, Yunji Qin, Lei Gong, Chao Wang*, Xuehai Zhou. "UniCoS: A Unified Neural and Accelerator Co-Search Framework for CNNs and ViTs", ACM/IEEE Design Automation Conference (DAC), San Francisco, Jun. 22-25, 2025:1-6. (CCF-A, Top Conference in EDA Area)
[ICPP'25] Wenqi Lou, Yunji Qin, Zihao Wang, Chao Wang*, Lei Gong, Xuehai Zhou. "Automated FPGA Accelerator Generation Framework for Transformers with Dataflow Optimization", 54th International Conference on Parallel Processing (ICPP), San Diego, Sept. 8-11, 2025:406-416. (CCF-B, Leading Conference in Parallel Computing)
[EuroPar'25] Jiale Dong#, Hao Wu#, Zihao Wang, Wenqi Lou*, Zhendong Zheng, Lei Gong, Chao Wang, Xuehai Zhou. "CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA", 31st International European Conference on Parallel and Distributed Computing (EURO-PAR), 2025:60-74. (CCF-B)
[ISCAS'25] Jiale Dong, Wenqi Lou*, Zhendong Zheng, Yunji Qin, Lei Gong, Chao Wang, Xuehai Zhou. "UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA". 2025 IEEE International Symposium on Circuits and Systems (ISCAS), London, May 25-28, 2025:1-5. (CCF-B, Oral)
[GLVLSI'24] Yunji Qin, Wenqi Lou*, Chao Wang*, Lei Gong, Xuehai Zhou. "Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators through Attention Fusion". Proceedings of the 2024 ACM Great Lakes Symposium on VLSI (GLVLSI). 2024:599-603. (CCF-C)
[ICCD'24] Yixuan Zhu, Wenqi Lou, Yinkang Gao, Binze Jiang, Xiaohang Gong, Xi Li. "Fine-Grained Shared Cache Interference Analysis using Basic Block's Execution Time". IEEE International Conference on Computer Design (ICCD), 2024. (CCF-B)
[FPGA'23] Xuan Wang, Lei Gong, Jing Cao, Wenqi Lou, Weiya Wang, Chao Wang, Xuehai Zhou. "hAP: A Spatial-von Neumann Heterogeneous Automata Processor with Optimized Resource and IO Overhead on FPGA", Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA). 2023: 185-196. (CCF-B, Top Conference in FPGA Area)
[DATE'23] Wenqi Lou, Jiaming Qian, Lei Gong, Xuan Wang, Chao Wang, Xuehai Zhou. "NAF: Deeper Network/Accelerator Co-Exploration for Customizing CNNs on FPGA", Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2023:1-6. (CCF-B, Top Conference in EDA Area)
[CLUSTER'20] Wenqi Lou, Chao Wang, Lei Gong, Xuehai Zhou. "OctCNN: An Energy-Efficient FPGA Accelerator for CNNs using Octave Convolution Algorithm". IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2020:410-411. (CCF-B, Poster)
[APPT'19] Wenqi Lou, Chao Wang, Lei Gong, Xuehai Zhou. "RV-CNN: Flexible and efficient instruction set for CNNs based on RISC-V processors". Advanced Parallel Processing Technologies: 13th International Symposium (APPT), 2019. (CCF-C, organized by the CCF Technical Committee on Computer Architecture)

Academic Services

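Young Editorial Board Member, 《计算机工程与技术》 (CCF-T2, Chinese core journal), 2026-
Executive Member, CCF Technical Committee on Computer Architecture, 2025-
Reviewer for Chinese Journal of Computers (《计算机学报》), CCF-T1 journal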
Reviewer for IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)
Reviewer for IEEE Transactions on Very Large Scale Integration Systems (TVLSI)
Reviewer for Neural Networks
Reviewer for Journal of Systems Architecture (JSA)
Reviewer for 2026 IEEE International Symposium on Circuits and Systems (ISCAS)
Reviewer for IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)
Reviewer for IEEE Transactions on Biomedical Circuits and Systems (TBioCAS), JCR-Q1
Reviewer for The Journal of Supercomputing
Reviewer for Neurocomputing
Reviewer for Computers & Electrical Engineering (JCR-Q1)
Reviewer for Scientific Reports (JCR-Q1)
Reviewer for Journal of Cryptographic Engineering (JCR-Q2)
Reviewer for International Journal of Electronics (SCIE)
Reviewer for IET Computers & Digital Techniques (EI)
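Chair of the Programming Practical Training Case Forum, 2025 National Conference on Programming Education in Universities (全国高校程序设计教育大会)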

Awards

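2026 Asia-Pacific Outstanding Young Scholar in Artificial Intelligence Education
2025 First Prize, Anhui Provincial Teaching Achievement Award
2025 Jiangsu Province Young Science and Technology Talent Support Program (江苏省青年科技人才托举工程)
2024 Third Prize, Jiangsu Provincial Natural Science Award
2025 Jiangsu Province "Innovation and Entrepreneurship Doctor" (双创博士) Program
2025 Special Prize, USTC Teaching Achievement Award
2025 Second Prize, 10th National Computer Course Experimental Teaching Cases (全国计算机类课程实验教学案例)
2025 Special Prize, Programming Practical Training Cases, National Conference on Programming Education in Universities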
2022 Intel Fellowship
2021 USTC-Gusu First Class Scholarship
2018 Outstanding Graduate of NWPU

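Projects

"Research on Joint Customization Methods for Vision Neural Networks and Accelerators Driven by Heterogeneous Operators", National Natural Science Foundation of China (NSFC) Young Scientists Fund (Category C), 2026-2028, Principal Investigator, ongoing.
"Research on Joint Optimization Methods for Convolution-Attention Hybrid Models and FPGA Accelerators", Jiangsu Provincial Natural Science Foundation Youth Fund, 2025-2028, Principal Investigator, ongoing.
"Research and Development of Optimized Deployment Techniques for Speech Models on General-Purpose Chips", collaboration project with AISpeech Co., Ltd. (思必驰科技有限公司), 2025-2026, Principal Investigator, completed.
"Research and Practice on Teaching Reform of the Practical Algorithms Course", Anhui Province New-Era Education Quality Project (teaching reform research), 2025-2026, Principal Investigator, ongoing.
"New Principles, Structures, and Methods for Reconfigurable Neural Network Accelerators Based on Frequency-Domain Filtering Convolution", NSFC General Program, 2022-2025, Key Technical Member, completed.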

Teaching

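2024 "Practical Algorithm Design" (《实用算法设计》), core course, School of Software Engineering, USTC, Lecturer, 2024.03; 2024.09; 2025.09
2025 "Advanced Computer Architecture" (《高级计算机体系结构》), elective course, School of Software Engineering, USTC, Lecturer, 2025.02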

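Patents

"Training Method and Apparatus for a Performance Prediction Model, and Performance Prediction Method and Apparatus"; Lei Gong, Chao Wang, Xuehai Zhou, Teng Wang, Wenqi Lou, Xi Li, Xianglan Chen, invention patent, granted
"Graph Data Processing Method and Apparatus"; Lei Gong, Chao Wang, Xuehai Zhou, Teng Wang, Wenqi Lou, Xi Li, Xianglan Chen, invention patent, granted
"Joint Search Method, Apparatus, Device, and Medium for CNN Models and Accelerators"; Wenqi Lou, invention patent, published
"Joint Search Method and Apparatus for Neural Networks and Hardware"; Wenqi Lou, Chao Wang, Wei Fu, Cheng Tang, Lei Gong, Teng Wang, Xuehai Zhou, invention patent, granted
"Data Processing Method and Accelerator Based on Fused Attention and Quantization Operations"; Wenqi Lou, Chao Wang, Yunji Qin, Ziqi Chen, Lei Gong, Teng Wang, Xuehai Zhou, invention patent, granted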