Wenqi Lou (娄文启)
About me
I received my Bachelor's degree from the School of Computer, Northwestern Polytechnical University (NWPU), Xi'an, in June 2018.
In the same year, I was admitted without entrance examination to the M.Sc. program of the School of Computer Science and Technology, USTC.
In September 2020, I began my Ph.D. studies under the supervision of Professor Xuehai Zhou and Professor Chao Wang.
I received my Ph.D. in Computer Science from USTC in December 2023.
My research focuses on intelligent accelerator architectures, FPGA accelerator design, and software-hardware co-optimization, with the aim of easing the deployment of deep learning models from both the algorithmic and the hardware side. In recent years, I have published over 20 papers in computer architecture, in top-tier journals and conferences such as IEEE TCAD, IEEE TC, DAC, FPGA, and RTSS.
Among these, more than 10 CCF-A/B papers were published as first or corresponding author. I also hold 4 granted patents.
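Wenqi Lou is a tenured Associate Research Fellow at the School of Software Engineering, USTC, a master's student supervisor, and an executive committee member of the CCF Technical Committee on Computer Architecture. He received his bachelor's degree from the School of Computer, Northwestern Polytechnical University, in 2018,
and his Ph.D. in Computer Architecture from the University of Science and Technology of China in 2023, supervised by
Professor Xuehai Zhou and Professor Chao Wang. After graduating, he stayed on in the group as a faculty member.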
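His main research interests are intelligent accelerator architectures, FPGA accelerator design, and software-hardware co-optimization, with the aim of easing the deployment pressure of deep learning models from both the algorithmic and the hardware perspectives.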
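In recent years, he has published more than 30 papers in computer architecture, 26 of them as first or corresponding author; these include 8 CCF-A papers (in top venues such as DAC, AAAI, TCAD, and TC) and 10 CCF-B papers (in flagship venues such as ICPP, DATE, EuroPar, and TVLSI). He has also been granted 5 patents.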
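He is also the principal investigator of a Young Scientists Fund project of the National Natural Science Foundation of China, a Youth Fund project of the Natural Science Foundation of Jiangsu Province, a university-level project, and an industry collaboration project with AISpeech (思必驰), and he serves as a reviewer for Chinese Journal of Computers (《计算机学报》), IEEE TCAD, TVLSI, TCBB, and other journals. His honors include selection for the Jiangsu Young Science and Technology Talent Support Program, the Third Prize of the Jiangsu Provincial Natural Science Award, and the First Prize of the Anhui Provincial Teaching Achievement Award.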
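📢: I am recruiting students (exam-exempt recommended master's students, direct-entry Ph.D. students, and joint-training students) with solid mathematical foundations and an interest in hardware acceleration and model inference optimization. Students will be co-supervised by me and Professor Chao Wang. Please contact me by email with your CV attached.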
Education
- 2018.09 - 2023.12, Ph.D., School of Computer Science and Technology, University of Science and Technology of China (USTC)
- 2014.09 - 2018.06, B.S., School of Computer, Northwestern Polytechnical University (NWPU)
Research Interests
- FPGA Accelerator Design: Specialized in designing FPGA accelerators for Convolutional Neural Networks (CNN), Vision Transformers (ViT), and Large Language Models (LLM) to enhance performance and efficiency.
- Neural Architecture and Accelerator Co-Search: Focused on the co-evolution of neural network architectures and hardware accelerators to achieve optimized performance on specific hardware platforms.
- Model Quantization and Pruning for FPGA/GPU Inference: Expertise in reducing model complexity and size through quantization and pruning techniques, specifically tailored for efficient inference on FPGA and GPU architectures.
- AI for HW Design: Applying artificial intelligence to the hardware design process, including automated design exploration, optimization, and verification.
Publications (* corresponding author, # equal contribution)
Journal Articles
-
[TCAD]
Zihan Wang, Lei Gong, Xiangjun Qu, Cheng Tang, Wenqi Lou, Teng Wang, Xianglan Chen, Chao Wang, Xuehai Zhou.
"UniSparTa: A Unified Sparse Tensor Program Tuning Framework."
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), 2025:1-14. Early Access Article
(CCF-A)
-
[TC]
Zihan Wang, Lei Gong, Wenqi Lou, Teng Wang, Qianyu Cheng, Xianglan Chen, Chao Wang, Xuehai Zhou.
"UniCoX: A Unified Cost Model for Tensorized Program Tuning Across Ubiquitous Accelerators".
IEEE Transactions on Computers (IEEE TC), 2025:1-15. Early Access Article
(CCF-A)
-
[TCAD]
Zhendong Zheng, Qianyu Cheng, Teng Wang, Wenqi Lou, Lei Gong, Chao Wang, Xuehai Zhou.
"LORA: A Latency-Oriented Recurrent Architecture for Large Language Model on Multi-FPGA Platform with Communication Optimization",
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), 2025:1-14. Early Access Article
(CCF-A)
-
[TVLSI]
Jiale Dong, Wenqi Lou*, Hao Wu, Zhendong Zheng, Yunji Qin, Lei Gong, Chao Wang*, Xuehai Zhou.
"MoE-Sched: Enabling Efficient FPGA Deployment of Mixture-of-Experts Vision Transformers via Coordinated Scheduling",
IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI), 2026, 34(1):104-117.
(CCF-B, JCR-Q1)
-
[TOMM]
Haoyu Cai, Wenqi Lou*, Chao Wang*, Xuehai Zhou.
"Picasso: Analyzing Prompt Design for Text-To-Image Generative Diffusion Models from a Temporal-Spatial Perspective",
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2025, 21(11):1-24.
(CCF-B, JCR-Q1)
-
[ESL]
Hongbing Wen, Zihao Wang, Jiale Dong, Wenqi Lou*, Chao Wang, Xuehai Zhou.
"QLlama: An FPGA-Based Microscaling Quantization Accelerator for Energy-Efficient Llama2 Inference",
IEEE Embedded Systems Letters (ESL), 2025, 17(5):337-340.
(CCF-B, CODES+ISSS 2025 Late Breaking Tracks, Taipei, China, September 28-October 3, 2025)
-
[TCAD]
Wenqi Lou, Yunji Qin, Xuan Wang, Lei Gong, Chao Wang, Xuehai Zhou.
"FlexBCM: Hybrid Block-Circulant Neural Network and Accelerator Co-Search on FPGAs",
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), 2024, 43(11):3852-3863.
(CCF-A, accepted at CODES+ISSS 2024)
-
[TCAD]
Wenqi Lou, Lei Gong, Chao Wang, Jiaming Qian, Xuan Wang, Changlong Li, Xuehai Zhou.
"Unleashing Network/Accelerator Co-Exploration Potential on FPGAs: A Deeper Joint Search",
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (IEEE TCAD), 2024, 43(10):3041-3054.
(CCF-A)
-
[TC]
Wenqi Lou, Lei Gong, Chao Wang, Zidong Du, Xuehai Zhou.
"OctCNN: A High Throughput FPGA Accelerator for CNNs Using Octave Convolution Algorithm",
IEEE Transactions on Computers (IEEE TC), 2022, 71(8):1847-1859.
(CCF-A)
-
[JOS]
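Wenqi Lou, Chao Wang, Lei Gong, Xuehai Zhou.
"A Neural Network Instruction Set Extension and Code Mapping Mechanism" (一种神经网络指令集扩展与代码映射机制),
Journal of Software (软件学报), 2020:3074-3086.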
(CCF-T1, Chinese Journal)
-
[IJSI]
Wenqi Lou, Chao Wang, Lei Gong, Xuehai Zhou.
"Neural Network Instruction Set Extension and Code Mapping Mechanism".
International Journal of Software and Informatics (IJSI), 2020.
(EI-indexed)
Conference Papers
-
[DAC'26]
Haoran Xue, Teng Wang*, Qianyu Cheng, Zhendong Zheng, Wenqi Lou*, Lei Gong, Xi Li, Xuehai Zhou.
"Efficient HLS Accelerator Floorplan on Multi-Die FPGA Aided by Graph Neural Networks",
ACM/IEEE Design Automation Conference (DAC), 2026:1-7.
(CCF-A)
-
[DAC'26]
Zhiwei Ke, Wenqi Lou*, Yiming Liu, Fengrui Zuo, Chao Wang, Xuehai Zhou.
"Late Breaking Results: ASTFusion: Two-Stage Structural Enhancement Learning for Robust HLS Code Generation",
ACM/IEEE Design Automation Conference (DAC), 2026:1-2.
(CCF-A)
-
[DAC'26]
Yiming Liu, Wenqi Lou*, Zhiwei Ke, Fengrui Zuo, Chao Wang, Xuehai Zhou.
"Late Breaking Results: Recoverability-guided Layer-wise N:M Sparsity under Latency Constraints",
ACM/IEEE Design Automation Conference (DAC), 2026:1-2.
(CCF-A)
-
[AAAI'26]
Cheng Tang, Guochong Sui, Wenqi Lou*, Zihan Wang, Jiayi Tuo, Wenqian Xie, Yinkang Gao, Yixuan Zhu, Lei Gong, Chao Wang*, Xuehai Zhou.
"CloserToMe: A Unified Framework for Accurate and Transferable Latency Prediction across Heterogeneous Devices."
AAAI Conference on Artificial Intelligence (AAAI). 2026.
(CCF-A)
-
[EuroPar'26]
Yiming Liu, Wenqi Lou*, Zhiguang Wang, Zhiwei Ke, Fengrui Zuo, Chao Wang, Xuehai Zhou.
"Realizable N:M Sparse Transformer Inference via Search–Kernel Co-Design",
32nd International European Conference on Parallel and Distributed Computing (EURO-PAR), 2026:1-14.
(CCF-B)
-
[EuroPar'26]
Zhiwei Ke, Yiming Liu, Fengrui Zuo, Wenqi Lou*, Chao Wang, Xuehai Zhou.
"Two-Stage Hierarchy-Aware Learning with Gradient Conflict Mitigation for HLS Latency and Resource Prediction",
32nd International European Conference on Parallel and Distributed Computing (EURO-PAR), 2026:1-14.
(CCF-B)
-
[RTSS'25]
YinKang Gao, Bo Zhang, Yixuan Zhu, Lei Gong, Teng Wang, Wenqi Lou, Chao Wang, Xi Li, Xuehai Zhou.
"TSI: A Time-semantic Instruction Set for Deterministic Data-flow Execution in Real-time Embedded Systems."
The 46th IEEE Real-Time Systems Symposium (RTSS), 2025.
(CCF-A)
-
[DAC'25]
Wei Fu#, Wenqi Lou#*, Cheng Tang, Hongbing Wen, Yunji Qin, Lei Gong, Chao Wang*, Xuehai Zhou.
"UniCoS: A Unified Neural and Accelerator Co-Search Framework for CNNs and ViTs",
ACM/IEEE Design Automation Conference (DAC), San Francisco, Jun. 22-25, 2025:1-6.
(CCF-A, Top Conference in EDA Area)
-
[ICPP'25]
Wenqi Lou, Yunji Qin, Zihao Wang, Chao Wang*, Lei Gong, Xuehai Zhou.
"Automated FPGA Accelerator Generation Framework for Transformers with Dataflow Optimization",
54th International Conference on Parallel Processing (ICPP), San Diego, Sept. 8-11, 2025:406-416.
(CCF-B, Leading Conference in Parallel Computing)
-
[EuroPar'25]
Jiale Dong#, Hao Wu#, Zihao Wang, Wenqi Lou*, Zhendong Zheng, Lei Gong, Chao Wang and Xuehai Zhou.
"CoQMoE: Co-Designed Quantization and Computation Orchestration for Mixture-of-Experts Vision Transformer on FPGA",
31st International European Conference on Parallel and Distributed Computing (EURO-PAR), 2025:60-74.
(CCF-B)
-
[ICME'25]
Cheng Tang, Wenqi Lou*, Qianyu Cheng, Jiayi Tuo, Wei Fu, Tianhao Jiang, Chao Wang, Xuehai Zhou.
"Spectral Enhanced Tuning: A Plug-and-Play Framework for Dehazing Models with Frequency Decoding and Fusion".
IEEE International Conference on Multimedia and Expo (ICME), Nantes, Jun. 30-Jul. 4, 2025.
(CCF-B)
-
[ISCAS'25]
Jiale Dong, Wenqi Lou*, Zhendong Zheng, Yunji Qin, Lei Gong, Chao Wang, Xuehai Zhou.
"UbiMoE: A Ubiquitous Mixture-of-Experts Vision Transformer Accelerator With Hybrid Computation Pattern on FPGA".
2025 IEEE International Symposium on Circuits and Systems (ISCAS), London, May 25-28, 2025:1-5.
(CCF-B, Oral)
-
[GLVLSI'24]
Yunji Qin, Wenqi Lou*, Chao Wang*, Lei Gong, Xuehai Zhou.
"Enhancing Long Sequence Input Processing in FPGA-Based Transformer Accelerators through Attention Fusion".
Proceedings of the 2024 ACM Great Lakes Symposium on VLSI (GLVLSI). 2024:599-603.
(CCF-C)
-
[PRICAI'24]
Wei Fu, Wenqi Lou*, Yunji Qin, Lei Gong, Chao Wang, Xuehai Zhou.
"MFNAS: Multi-Fidelity Exploration in Neural Architecture Search with Stable Zero-shot Proxy".
Pacific Rim International Conference on Artificial Intelligence (PRICAI). Springer, 2024:348-360.
(CCF-C)
-
[CLUSTER'24]
Wei Fu, Wenqi Lou*, Lei Gong, Chao Wang, Xuehai Zhou.
"Beyond Training: A Zero-Shot Framework to Neural Architecture and Accelerator Co-Exploration".
IEEE International Conference on Cluster Computing Workshops (CLUSTER Workshops). IEEE, 2024.
(CCF-B)
-
[ACCV'24]
Cheng Tang, Wenqi Lou*.
"TCL-Net: A Lightweight and Efficient Dehazing Network with Frequency-Domain Fusion and Multi-Angle Attention".
Asian Conference on Computer Vision (ACCV). Springer, 2024.
(CCF-C, oral, 5.6%)
-
[ICCD'24]
Yixuan Zhu, Wenqi Lou, Yinkang Gao, Binze Jiang, Xiaohang Gong, Xi Li.
"Fine-Grained Shared Cache Interference Analysis using Basic Block's Execution Time".
IEEE International Conference on Computer Design (ICCD), 2024.
(CCF-B)
-
[ICCD'24]
Xiangjun Qu, Lei Gong, Wenqi Lou, Qianyu Cheng, Xianglan Chen, Chao Wang, Xuehai Zhou.
"AutoSparse: A Source-to-Source Format and Schedule Auto-Tuning Framework for Sparse Tensor Program".
IEEE International Conference on Computer Design (ICCD), IEEE, 2024: 504-512.
(CCF-B)
-
[ICCD'24]
Zihan Wang, Lei Gong, Wenqi Lou, Qianyu Cheng, Xianglan Chen, Chao Wang, Xuehai Zhou.
"UniCoMo: A Unified Learning-Based Cost Model for Tensorized Program Tuning".
IEEE International Conference on Computer Design (ICCD), IEEE, 2024: 487-495.
(CCF-B)
-
[FPGA'23]
Xuan Wang, Lei Gong, Jing Cao, Wenqi Lou, Weiya Wang, Chao Wang, Xuehai Zhou.
"hAP: A Spatial-von Neumann Heterogeneous Automata Processor with Optimized Resource and IO Overhead on FPGA",
Proceedings of the 2023 ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA). 2023: 185-196.
(CCF-B, Top Conference in FPGA Area)
-
[DATE'23]
Wenqi Lou, Jiaming Qian, Lei Gong, Xuan Wang, Chao Wang, Xuehai Zhou.
"NAF: Deeper Network/Accelerator Co-Exploration for Customizing CNNs on FPGA",
Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2023:1-6.
(CCF-B, Top Conference in EDA Area)
-
[CLUSTER'20]
Wenqi Lou, Chao Wang, Lei Gong, Xuehai Zhou.
"OctCNN: An Energy-Efficient FPGA Accelerator for CNNs using Octave Convolution Algorithm".
IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 2020:410-411.
(CCF-B, Poster)
-
[APPT'19]
Wenqi Lou, Chao Wang, Lei Gong, Xuehai Zhou.
"RV-CNN: Flexible and efficient instruction set for CNNs based on RISC-V processors".
Advanced Parallel Processing Technologies: 13th International Symposium (APPT), 2019.
(CCF-C, organized by the CCF Technical Committee on Computer Architecture)
Academic Services
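- Young Editorial Board Member, 《计算机工程与技术》 (CCF-T2, Chinese core journal), 2026-
- Executive Committee Member, CCF Technical Committee on Computer Architecture, 2025-
- Reviewer for Chinese Journal of Computers (《计算机学报》), CCF-T1 journal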
- Reviewer for IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), CCF-A
- Reviewer for IEEE Transactions on Very Large Scale Integration (VLSI) Systems (TVLSI), CCF-B
- Reviewer for Neural Networks, CCF-B
- Reviewer for Journal of Systems Architecture (JSA), CCF-B
- Reviewer for 2026 IEEE International Symposium on Circuits and Systems (ISCAS), CCF-B
- Reviewer for IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), CCF-B
- Reviewer for IEEE Transactions on Biomedical Circuits and Systems (TBioCAS), JCR-Q1
- Reviewer for The Journal of Supercomputing, CCF-C
- Reviewer for Neurocomputing, CCF-C
- Reviewer for Computers & Electrical Engineering, JCR-Q1
- Reviewer for Scientific Reports, JCR-Q1
- Reviewer for Journal of Cryptographic Engineering, JCR-Q2
- Reviewer for International Journal of Electronics, SCIE
- Reviewer for IET Computers & Digital Techniques, EI
- Host, Forum on Programming Practical Training Cases, 2025 National Conference on Programming Education in Higher Education
Awards
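- 2025 First Prize, Anhui Provincial Teaching Achievement Award
- 2025 Jiangsu Young Science and Technology Talent Support Program
- 2024 Third Prize, Jiangsu Provincial Natural Science Award
- 2025 Jiangsu Provincial Innovation and Entrepreneurship Doctor (双创博士)
- 2025 Special Prize, USTC Teaching Achievement Award
- 2025 Second Prize, 10th National Experimental Teaching Cases for Computer Courses
- 2025 Special Prize, Programming Practical Training Cases, National Conference on Programming Education in Higher Education
- 2022 Intel Fellowship
- 2021 USTC-Gusu First Class Scholarship
- 2018 Outstanding Graduate of NWPU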
Projects
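- "Research on Heterogeneous-Operator-Driven Joint Customization of Vision Neural Networks and Accelerators", National Natural Science Foundation of China, Young Scientists Fund (Category C), 2026-2028, Principal Investigator, ongoing.
- "Research on Joint Optimization of Hybrid Convolution-Attention Models and FPGA Accelerators", Natural Science Foundation of Jiangsu Province, Youth Fund, 2025-2028, Principal Investigator, ongoing.
- "R&D on Optimized Deployment of Speech Models on General-Purpose Chips", industry collaboration project with AISpeech (思必驰科技有限公司), 2025-2026, Principal Investigator, completed.
- "Research and Practice on Teaching Reform of the Practical Algorithms Course", Anhui Provincial New-Era Education Quality Project (Teaching Reform Research), 2025-2026, Principal Investigator, ongoing.
- "New Principles, Architectures, and Methods for Reconfigurable Neural Network Accelerators Based on Frequency-Domain Filtering Convolution", National Natural Science Foundation of China, General Program, 2022-2025, Key Technical Member, completed.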
Teaching
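- Practical Algorithm Design (《实用算法设计》), core professional course, School of Software Engineering, USTC, Lecturer, 2024.03; 2024.09; 2025.09
- Advanced Computer Architecture (《高级计算机体系结构》), professional elective course, School of Software Engineering, USTC, Lecturer, 2025.02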
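Patents
- "Training Method and Device for a Performance Prediction Model, and Performance Prediction Method and Device"; Lei Gong, Chao Wang, Xuehai Zhou, Teng Wang, Wenqi Lou, Xi Li, Xianglan Chen; invention patent, granted
- "Graph Data Processing Method and Device"; Lei Gong, Chao Wang, Xuehai Zhou, Teng Wang, Wenqi Lou, Xi Li, Xianglan Chen; invention patent, granted
- "Joint Search Method, Apparatus, Device, and Medium for CNN Models and Accelerators"; Wenqi Lou; invention patent, published
- "Joint Search Method and Device for Neural Networks and Hardware"; Wenqi Lou, Chao Wang, Wei Fu, Cheng Tang, Lei Gong, Teng Wang, Xuehai Zhou; invention patent, granted
- "Data Processing Method and Accelerator Based on Fused Attention and Quantization Operations"; Wenqi Lou, Chao Wang, Yunji Qin, Ziqi Chen, Lei Gong, Teng Wang, Xuehai Zhou; invention patent, granted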