Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation

摘要

Vision Transformers (ViTs) have emerged with superior performance on computer vision tasks compared to convolutional neural network (CNN)-based models. However, ViTs are mainly designed for image classification that generate single-scale low-resolution representations, which makes dense prediction tasks such as semantic segmentation challenging for ViTs. Therefore, we propose HRViT, which enhances ViTs to learn semantically-rich and spatially-precise multi-scale representations by integrating high-resolution multi-branch architectures with ViTs. We balance the model performance and efficiency of HRViT by various branch-block co-optimization techniques. Specifically, we explore heterogeneous branch designs, reduce the redundancy in linear layers, and augment the attention block with enhanced expressiveness. Those approaches enabled HRViT to push the Pareto frontier of performance and efficiency on semantic segmentation to a new level, as our evaluation results on ADE20K and Cityscapes show. HRViT achieves 50.20% mIoU on ADE20K and 83.16% mIoU on Cityscapes, surpassing state-of-the-art MiT and CSWin backbones with an average of +1.78 mIoU improvement, 28% parameter saving, and 21% FLOPs reduction, demonstrating the potential of HRViT as a strong vision backbone for semantic segmentation.

出版物
In Conference on Computer Vision and Pattern Recognition
李萌
李萌
助理教授、研究员、博雅青年学者

李萌,北京大学人工智能研究院和集成电路双聘助理教授、研究员、博雅青年学者。他的研究兴趣集中于高效、安全的多模态人工智能加速算法和芯片,旨在通过算法到芯片的跨层次协同设计和优化,为人工智能构建高能效、高可靠、高安全的算力基础。

var dimensionValue = 'SOME_DIMENSION_VALUE'; ga('set', 'dimension1', dimensionValue);