Official PyTorch implementation of "MobileMamba: Lightweight Multi-Receptive Visual Mamba Network".

Congratulations! Our MobileMamba has been accepted at the CVPR'25 conference!

Haoyang He¹*, Jiangning Zhang²*, Yuxuan Cai³, Hongxu Chen¹, Xiaobin Hu²,

Zhenye Gan², Yabiao Wang², Chengjie Wang², Yunsheng Wu², Lei Xie¹†

¹College of Control Science and Engineering, Zhejiang University, ²Youtu Lab, Tencent, ³Huazhong University of Science and Technology

Abstract: Previous research on lightweight models has primarily focused on CNNs and Transformer-based designs. CNNs, with their local receptive fields, struggle to capture long-range dependencies, while Transformers, despite their global modeling capabilities, are limited by quadratic computational complexity in high-resolution scenarios. Recently, state-space models have gained popularity in the visual domain due to their linear computational complexity. Despite their low FLOPs, current lightweight Mamba-based models exhibit suboptimal throughput. In this work, we propose the MobileMamba framework, which balances efficiency and performance. We design a three-stage network to enhance inference speed significantly. At a fine-grained level, we introduce the Multi-Receptive Field Feature Interaction (MRFFI) module, comprising the Long-Range Wavelet Transform-Enhanced Mamba (WTE-Mamba), Efficient Multi-Kernel Depthwise Convolution (MK-DeConv), and Eliminate Redundant Identity components. This module integrates multi-receptive field information and enhances high-frequency detail extraction. Additionally, we employ training and testing strategies to further improve performance and efficiency. MobileMamba achieves up to 83.6% Top-1 accuracy, surpassing existing state-of-the-art methods, and is up to 21× faster than LocalVim on GPU. Extensive experiments on high-resolution downstream tasks demonstrate that MobileMamba surpasses current efficient models, achieving an optimal balance between speed and accuracy.

Top: Visualization of the Effective Receptive Fields (ERF) for different architectures. Bottom: Performance vs. FLOPs with recent CNN/Transformer/Mamba-based methods.

Accuracy vs. Speed with Mamba-based methods.

Classification results

Image Classification for ImageNet-1K:

Model	FLOPs	#Params	Resolution	Top-1	Cfg	Log	Model
MobileMamba-T2	255M	8.8M	192 x 192	73.6	cfg	log	model
MobileMamba-T2†	255M	8.8M	192 x 192	76.9	cfg	log	model
MobileMamba-T4	413M	14.2M	192 x 192	76.1	cfg	log	model
MobileMamba-T4†	413M	14.2M	192 x 192	78.9	cfg	log	model
MobileMamba-S6	652M	15.0M	224 x 224	78.0	cfg	log	model
MobileMamba-S6†	652M	15.0M	224 x 224	80.7	cfg	log	model
MobileMamba-B1	1080M	17.1M	256 x 256	79.9	cfg	log	model
MobileMamba-B1†	1080M	17.1M	256 x 256	82.2	cfg	log	model
MobileMamba-B2	2427M	17.1M	384 x 384	81.6	cfg	log	model
MobileMamba-B2†	2427M	17.1M	384 x 384	83.3	cfg	log	model
MobileMamba-B4	4313M	17.1M	512 x 512	82.5	cfg	log	model
MobileMamba-B4†	4313M	17.1M	512 x 512	83.6	cfg	log	model

Downstream Results

Object Detection and Instance Segmentation Results

Object Detection and Instance Segmentation Performance Based on Mask-RCNN for COCO2017:

Backbone	AP^b	AP^b₅₀	AP^b₇₅	AP^b_S	AP^b_M	AP^b_L	AP^m	AP^m₅₀	AP^m₇₅	AP^m_S	AP^m_M	AP^m_L	#Params	FLOPs	Cfg	Log	Model
MobileMamba-B1	40.6	61.8	43.8	22.4	43.5	55.9	37.4	58.9	39.9	17.1	39.9	56.4	38.0M	178G	cfg	log	model

Object Detection Performance Based on RetinaNet for COCO2017:

Backbone	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L	#Params	FLOPs	Cfg	Log	Model
MobileMamba-B1	39.6	59.8	42.4	21.5	43.4	53.9	27.1M	151G	cfg	log	model

Object Detection Performance Based on SSDLite for COCO2017:

Backbone	AP	AP₅₀	AP₇₅	AP_S	AP_M	AP_L	#Params	FLOPs	Cfg	Log	Model
MobileMamba-B1	24.0	39.5	24.0	3.1	23.4	46.9	18.0M	1.7G	cfg	log	model
MobileMamba-B1-r512	29.5	47.7	30.4	8.9	35.0	47.0	18.0M	4.4G	cfg	log	model

Semantic Segmentation Results

Semantic Segmentation Based on Semantic FPN for ADE20k:

Backbone	aAcc	mIoU	mAcc	#Params	FLOPs	Cfg	Log	Model
MobileMamba-B4	79.9	42.5	53.7	19.8M	5.6G	cfg	log	model

Semantic Segmentation Based on DeepLabv3 for ADE20k:

Backbone	aAcc	mIoU	mAcc	#Params	FLOPs	Cfg	Log	Model
MobileMamba-B4	76.3	36.6	47.1	23.4M	4.7G	cfg	log	model

Semantic Segmentation Based on PSPNet for ADE20k:

Backbone	aAcc	mIoU	mAcc	#Params	FLOPs	Cfg	Log	Model
MobileMamba-B4	76.2	36.9	47.9	20.5M	4.5G	cfg	log	model

All Pretrained Weights and Logs

The model weights and log files for all classification and downstream tasks are available for download via weights.

Classification

Environments

pip3 install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118
pip3 install timm==0.9.16 tensorboardX einops torchprofile fvcore==0.1.5.post20221221 triton==2.1.0
cd model/lib_mamba/kernels/selective_scan && pip install . && cd ../../../..
git clone https://github.com/NVIDIA/apex && cd apex && pip3 install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./  # optional
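
After installation, it can help to confirm that the pinned versions are the ones actually installed. The following is a minimal sketch rather than part of the repository: the helper names `installed_version` and `check_pins` are hypothetical, and the pins are copied from the pip commands above. It assumes a local build suffix such as `+cu118` should be ignored when comparing.

```python
from importlib import metadata

# Pins copied from the pip commands above.
PINNED = {
    "torch": "2.1.2",
    "torchvision": "0.16.2",
    "torchaudio": "2.1.2",
    "timm": "0.9.16",
    "fvcore": "0.1.5.post20221221",
    "triton": "2.1.0",
}


def installed_version(name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def check_pins(pins=PINNED, lookup=installed_version):
    """Return {package: (expected, found)} for every mismatched or missing pin."""
    problems = {}
    for name, expected in pins.items():
        found = lookup(name)
        # Ignore a local build suffix such as "+cu118" when comparing.
        if found is None or found.split("+")[0] != expected:
            problems[name] = (expected, found)
    return problems
```

Calling `check_pins()` in the target environment returns an empty dict when every pinned package matches.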

Prepare ImageNet-1K Dataset

Download the ImageNet-1K dataset and extract it into the following directory structure:

├── imagenet
    ├── train
        ├── n01440764
            ├── n01440764_10026.JPEG
            ├── ...
        ├── ...
    ├── train.txt (optional)
    ├── val
        ├── n01440764
            ├── ILSVRC2012_val_00000293.JPEG
            ├── ...
        ├── ...
    └── val.txt (optional)
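
A quick sanity check of the extracted layout can catch a partial download before a long run starts. The sketch below is illustrative only: `summarize_split` is a hypothetical helper, and it assumes images carry the `.JPEG` suffix shown in the tree above.

```python
from pathlib import Path


def summarize_split(root, split):
    """Count class folders and .JPEG files under <root>/<split>."""
    split_dir = Path(root) / split
    if not split_dir.is_dir():
        raise FileNotFoundError(f"missing directory: {split_dir}")
    # Each class is one sub-directory named by its WordNet ID (e.g. n01440764).
    class_dirs = [d for d in split_dir.iterdir() if d.is_dir()]
    n_images = sum(1 for d in class_dirs for f in d.iterdir() if f.suffix == ".JPEG")
    return len(class_dirs), n_images
```

For a complete copy of ImageNet-1K, `summarize_split("imagenet", "train")` and `summarize_split("imagenet", "val")` should each report 1000 class folders.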

There are two methods to load ImageNet data.

The first method uses imagenet/train.lmdb for loading. The train.lmdb and val.lmdb files can be generated using the repository at https://github.com/xunge/pytorch_lmdb_imagenet. On a mechanical hard drive, using LMDB for data I/O increases the speed by approximately ten times compared to the default PyTorch data loading interface.

The second method uses the original ImageNet data. To use this method, change line 26 in each config file to data.type = 'DefaultCLS'. This allows loading from the original ImageNet data, but it is significantly slower.
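
To apply that change to every config in one pass, a small script along these lines may help. It is a sketch under assumptions that are not confirmed here: that the configs are `.py` files in a single directory such as `configs/mobilemamba/`, and that each contains a line starting with `data.type =`. The helper names `set_data_type` and `patch_configs` are hypothetical. The script matches the assignment by name instead of relying on the line number.

```python
import re
from pathlib import Path

# Matches a whole line that assigns data.type, whatever its current value.
DATA_TYPE_LINE = re.compile(r"^([ \t]*)data\.type[ \t]*=.*$", re.MULTILINE)


def set_data_type(text, new_type="DefaultCLS"):
    """Return (new_text, count) with every `data.type = ...` line rewritten."""
    return DATA_TYPE_LINE.subn(
        lambda m: f"{m.group(1)}data.type = '{new_type}'", text
    )


def patch_configs(config_dir, new_type="DefaultCLS"):
    """Rewrite data.type in every *.py file in config_dir; return changed paths."""
    changed = []
    for path in sorted(Path(config_dir).glob("*.py")):
        old = path.read_text()
        new, count = set_data_type(old, new_type)
        # Only touch files whose content actually changes.
        if count and new != old:
            path.write_text(new)
            changed.append(path)
    return changed
```

Running `patch_configs("configs/mobilemamba")` from the repository root would return the list of files it modified; reviewing the diff before training is advisable.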

Test

Test with 8 GPUs in one node:

MobileMamba-T2

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_t2 -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_T2/mobilemamba_t2.pth
```

This should give `Top-1: 73.638 (Top-5: 91.422)`

MobileMamba-T2†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_t2s -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_T2s/mobilemamba_t2s.pth
```

This should give `Top-1: 76.934 (Top-5: 93.100)`

MobileMamba-T4

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_t4 -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_T4/mobilemamba_t4.pth
```

This should give `Top-1: 76.086 (Top-5: 92.772)`

MobileMamba-T4†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_t4s -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_T4s/mobilemamba_t4s.pth
```

This should give `Top-1: 78.914 (Top-5: 94.160)`

MobileMamba-S6

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_s6 -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_S6/mobilemamba_s6.pth
```

This should give `Top-1: 78.002 (Top-5: 93.992)`

MobileMamba-S6†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_s6s -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_S6s/mobilemamba_s6s.pth
```

This should give `Top-1: 80.742 (Top-5: 95.182)`

MobileMamba-B1

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b1 -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_B1/mobilemamba_b1.pth
```

This should give `Top-1: 79.948 (Top-5: 94.924)`

MobileMamba-B1†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b1s -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_B1s/mobilemamba_b1s.pth
```

This should give `Top-1: 82.234 (Top-5: 95.872)`

MobileMamba-B2

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b2 -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_B2/mobilemamba_b2.pth
```

This should give `Top-1: 81.624 (Top-5: 95.890)`

MobileMamba-B2†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b2s -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_B2s/mobilemamba_b2s.pth
```

This should give `Top-1: 83.260 (Top-5: 96.438)`

MobileMamba-B4

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b4 -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_B4/mobilemamba_b4.pth
```

This should give `Top-1: 82.496 (Top-5: 96.252)`

MobileMamba-B4†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b4s -m test model.model_kwargs.checkpoint_path=weights/MobileMamba_B4s/mobilemamba_b4s.pth
```

This should give `Top-1: 83.644 (Top-5: 96.606)`
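
When evaluating several checkpoints, the expected result lines above can be compared with the Top-1 values in the classification table automatically. The following sketch assumes the result is printed in exactly the `Top-1: ... (Top-5: ...)` form quoted above; `parse_accuracy` and `matches_reported` are hypothetical helpers, not functions provided by the repository.

```python
import re

# Pattern for a result line such as "Top-1: 76.934 (Top-5: 93.100)".
RESULT_LINE = re.compile(r"Top-1:\s*([0-9.]+)\s*\(Top-5:\s*([0-9.]+)\)")


def parse_accuracy(line):
    """Extract (top1, top5) as floats from a result line, or None if absent."""
    match = RESULT_LINE.search(line)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def matches_reported(line, reported_top1, tolerance=0.05):
    """True if the parsed Top-1 is within `tolerance` of the table value."""
    parsed = parse_accuracy(line)
    return parsed is not None and abs(parsed[0] - reported_top1) <= tolerance
```

For example, `matches_reported("Top-1: 76.934 (Top-5: 93.100)", 76.9)` checks the MobileMamba-T2† output against its table entry. The default tolerance of 0.05 reflects the table being rounded to one decimal place.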

Train

Train with 8 GPUs in one node:

MobileMamba-T2

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_t2 -m train
```

MobileMamba-T2†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_t2s -m train
```

MobileMamba-T4

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_t4 -m train
```

MobileMamba-T4†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_t4s -m train
```

MobileMamba-S6

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_s6 -m train
```

MobileMamba-S6†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_s6s -m train
```

MobileMamba-B1

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b1 -m train
```

MobileMamba-B1†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b1s -m train
```

MobileMamba-B2

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b2 -m train
```

MobileMamba-B2†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b2s -m train
```

MobileMamba-B4

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b4 -m train
```

MobileMamba-B4†

```
python3 -m torch.distributed.launch --nproc_per_node=8 --nnodes=1 --use_env run.py -c configs/mobilemamba/mobilemamba_b4s -m train
```
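
The training and test commands differ only in the config name, the mode, and an optional checkpoint override, so they can be generated instead of copied. This is a minimal sketch: `launch_command` and `VARIANTS` are hypothetical names, and the config suffixes are taken from the commands listed in this README. The source only documents 8-GPU runs, so any other `gpus` value is an assumption and may need matching changes to batch size or learning rate.

```python
import shlex

# Config suffixes used by the commands in this README ("s" marks the † variants).
VARIANTS = ("t2", "t2s", "t4", "t4s", "s6", "s6s", "b1", "b1s", "b2", "b2s", "b4", "b4s")


def launch_command(variant, mode="train", gpus=8, checkpoint=None):
    """Build the torch.distributed.launch argument list for one variant."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    if mode not in ("train", "test"):
        raise ValueError(f"unknown mode: {mode}")
    cmd = [
        "python3", "-m", "torch.distributed.launch",
        f"--nproc_per_node={gpus}", "--nnodes=1", "--use_env",
        "run.py", "-c", f"configs/mobilemamba/mobilemamba_{variant}", "-m", mode,
    ]
    # The checkpoint override is only needed when evaluating saved weights.
    if checkpoint is not None:
        cmd.append(f"model.model_kwargs.checkpoint_path={checkpoint}")
    return cmd
```

`shlex.join(launch_command("t2"))` yields a copy-pasteable command string, and the list form can be passed directly to `subprocess.run`.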

Downstream Tasks

Environments

pip3 install terminaltables pycocotools prettytable xtcocotools
pip3 install mmpretrain==1.2.0 mmdet==3.3.0 mmsegmentation==1.2.2
pip3 install mmcv==2.1.0 -f https://download.openmmlab.com/mmcv/dist/cu118/torch2.1/index.html
cd det/backbones/lib_mamba/kernels/selective_scan && pip install . && cd ../../../..

Prepare COCO and ADE20k Dataset

Download the COCO2017 and ADE20k datasets and extract them into the following directory structure:

downstream
├── det
├──── data
│   ├──── coco
│   │   ├──── annotations
│   │   ├──── train2017
│   │   ├──── val2017
│   │   ├──── test2017
├── seg
├──── data
│   ├──── ade
│   │   ├──── ADEChallengeData2016
│   │   ├──────── annotations
│   │   ├──────── images
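
Before launching a detection or segmentation run, the layout above can be verified with a short check. This is an illustrative sketch: `missing_dirs` is a hypothetical helper, and the paths are read off the tree above on the assumption that `downstream` sits at the repository root.

```python
from pathlib import Path

# Directories taken from the tree above, relative to the repository root.
EXPECTED_DIRS = (
    "downstream/det/data/coco/annotations",
    "downstream/det/data/coco/train2017",
    "downstream/det/data/coco/val2017",
    "downstream/det/data/coco/test2017",
    "downstream/seg/data/ade/ADEChallengeData2016/annotations",
    "downstream/seg/data/ade/ADEChallengeData2016/images",
)


def missing_dirs(repo_root, expected=EXPECTED_DIRS):
    """Return the expected directories that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in expected if not (root / rel).is_dir()]
```

An empty list from `missing_dirs(".")` means every directory in the tree is present; it does not check the contents of those directories.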

Object Detection

Mask-RCNN

#### Train:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/mask_rcnn/mask-rcnn_mobilemamba_b1_fpn_1x_coco.py 4
```

#### Test:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_test.sh configs/mask_rcnn/mask-rcnn_mobilemamba_b1_fpn_1x_coco.py ../../weights/downstream/det/maskrcnn.pth 4
```

RetinaNet

#### Train:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/retinanet/retinanet_mobilemamba_b1_fpn_1x_coco.py 4
```

#### Test:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_test.sh configs/retinanet/retinanet_mobilemamba_b1_fpn_1x_coco.py ../../weights/downstream/det/retinanet.pth 4
```

SSDLite

#### Train with 320 x 320 resolution:

```
./tools/dist_train.sh configs/ssd/ssdlite_mobilemamba_b1_8gpu_2lr_coco.py 8
```

#### Test with 320 x 320 resolution:

```
./tools/dist_test.sh configs/ssd/ssdlite_mobilemamba_b1_8gpu_2lr_coco.py ../../weights/downstream/det/ssdlite.pth 8
```

#### Train with 512 x 512 resolution:

```
./tools/dist_train.sh configs/ssd/ssdlite_mobilemamba_b1_8gpu_2lr_512_coco.py 8
```

#### Test with 512 x 512 resolution:

```
./tools/dist_test.sh configs/ssd/ssdlite_mobilemamba_b1_8gpu_2lr_512_coco.py ../../weights/downstream/det/ssdlite_512.pth 8
```

Semantic Segmentation

DeepLabV3

#### Train:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/deeplabv3/deeplabv3_mobilemamba_b4-80k_ade20k-512x512.py 4
```

#### Test:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_test.sh configs/deeplabv3/deeplabv3_mobilemamba_b4-80k_ade20k-512x512.py ../../weights/downstream/seg/deeplabv3.pth 4
```

Semantic FPN

#### Train:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/sem_fpn/fpn_mobilemamba_b4-160k_ade20k-512x512.py 4
```

#### Test:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_test.sh configs/sem_fpn/fpn_mobilemamba_b4-160k_ade20k-512x512.py ../../weights/downstream/seg/fpn.pth 4
```

PSPNet

#### Train:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_train.sh configs/pspnet/pspnet_mobilemamba_b4-80k_ade20k-512x512.py 4
```

#### Test:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 ./tools/dist_test.sh configs/pspnet/pspnet_mobilemamba_b4-80k_ade20k-512x512.py ../../weights/downstream/seg/pspnet.pth 4
```

Citation

If our work is helpful for your research, please consider citing:

@article{mobilemamba,
  title={MobileMamba: Lightweight Multi-Receptive Visual Mamba Network},
  author={Haoyang He and Jiangning Zhang and Yuxuan Cai and Hongxu Chen and Xiaobin Hu and Zhenye Gan and Yabiao Wang and Chengjie Wang and Yunsheng Wu and Lei Xie},
  journal={arXiv preprint arXiv:2411.15941},
  year={2024}
}

Acknowledgements

We thank the following repositories, among others, for their assistance with our research:
