Main Results of VMamba Series

name	pretrain	resolution	acc@1	#params	FLOPs	TP.	Train TP.	configs/logs/ckpts
Swin-T	ImageNet-1K	224x224	81.2	28M	4.5G	1244	987	--
Swin-S	ImageNet-1K	224x224	83.2	50M	8.7G	718	642	--
Swin-B	ImageNet-1K	224x224	83.5	88M	15.4G	458	496	--
Vanilla-VMamba-T	ImageNet-1K	224x224	82.2	23M	~~4.5G~~ 5.6G	638	195	config/log/ckpt
Vanilla-VMamba-S	ImageNet-1K	224x224	83.5	44M	~~9.1G~~ 11.2G	359	111	config/log/ckpt
Vanilla-VMamba-B	ImageNet-1K	224x224	83.7	76M	~~15.2G~~ 18.0G	268	84	config/log/ckpt
VMamba-T[`s2l5`]	ImageNet-1K	224x224	82.5	31M	4.9G	1340	464	config/log/ckpt
VMamba-S[`s2l15`]	ImageNet-1K	224x224	83.6	50M	8.7G	877	314	config/log/ckpt
VMamba-B[`s2l15`]	ImageNet-1K	224x224	83.9	89M	15.4G	646	247	config/log/ckpt
VMamba-T[`s1l8`]	ImageNet-1K	224x224	82.6	30M	4.9G	1686	571	config/log/ckpt
VMamba-S[`s1l20`]	ImageNet-1K	224x224	83.3	49M	8.6G	1106	390	config/log/ckpt
VMamba-B[`s1l20`]	ImageNet-1K	224x224	83.8	87M	15.2G	827	313	config/log/ckpt

Models in this subsection are trained from scratch with random or manual initialization. The hyper-parameters are inherited from Swin, except for drop_path_rate and EMA. All models are trained with EMA except Vanilla-VMamba-T.
TP. (Throughput) and Train TP. (Train Throughput) are measured on an A100 GPU paired with an AMD EPYC 7542 CPU, with batch size 128. Train TP. is tested with mix-resolution and excludes the time consumed by the optimizer.
FLOPs and parameters are now counted with the head included (previous versions excluded it, so the numbers have risen slightly).
We calculate FLOPs with the algorithm @albertgu provides, which gives larger values than the previous calculation (which was based on the selective_scan_ref function and ignored the hardware-aware algorithm). Struck-through FLOPs in the table are the values from the previous calculation.
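As a rough illustration of how a throughput figure such as TP. can be obtained, the sketch below times repeated forward steps at batch size 128 and reports samples per second. It is not the benchmarking script used for the table; `measure_throughput`, `step`, `warmup`, and `iters` are hypothetical names, and the dummy workload stands in for a real model forward pass.

```python
import time


def measure_throughput(step, batch_size=128, warmup=10, iters=30):
    """Return samples per second for `step`.

    `step` is a hypothetical zero-argument callable that processes one batch.
    On a GPU it should block until the work has finished (for example by
    calling torch.cuda.synchronize()), otherwise only the kernel launch
    time is measured.
    """
    # Warm-up iterations are excluded from the timing.
    for _ in range(warmup):
        step()

    start = time.perf_counter()
    for _ in range(iters):
        step()
    elapsed = time.perf_counter() - start

    return batch_size * iters / elapsed


if __name__ == "__main__":
    # Dummy CPU workload standing in for a model forward pass.
    def dummy_step():
        return sum(i * i for i in range(10_000))

    print(f"{measure_throughput(dummy_step):.1f} samples/s")
```

For a training-throughput figure, `step` would run the forward and backward passes but skip the optimizer update, in line with the note above.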

Backbone	#params	FLOPs	Detector	bboxAP	bboxAP50	bboxAP75	segmAP	segmAP50	segmAP75	configs/logs/ckpts
Swin-T	48M	267G	MaskRCNN@1x	42.7	65.2	46.8	39.3	62.2	42.2	--
Swin-S	69M	354G	MaskRCNN@1x	44.8	66.6	48.9	40.9	63.4	44.2	--
Swin-B	107M	496G	MaskRCNN@1x	46.9	--	--	42.3	--	--	--
Vanilla-VMamba-T	42M	~~262G~~ 286G	MaskRCNN@1x	46.5	68.5	50.7	42.1	65.5	45.3	config/log/ckpt
Vanilla-VMamba-S	64M	~~357G~~ 400G	MaskRCNN@1x	48.2	69.7	52.5	43.0	66.6	46.4	config/log/ckpt
Vanilla-VMamba-B	96M	~~482G~~ 540G	MaskRCNN@1x	48.6	70.0	53.1	43.3	67.1	46.7	config/log/ckpt
VMamba-T[`s2l5`]	50M	270G	MaskRCNN@1x	47.4	69.5	52.0	42.7	66.3	46.0	config/log/ckpt
VMamba-S[`s2l15`]	70M	384G	MaskRCNN@1x	48.7	70.0	53.4	43.7	67.3	47.0	config/log/ckpt
VMamba-B[`s2l15`]	108M	485G	MaskRCNN@1x	49.2	71.4	54.0	44.1	68.3	47.7	config/log/ckpt
VMamba-B[`s2l15`]	108M	485G	MaskRCNN@1x[`bs8`]	49.2	70.9	53.9	43.9	67.7	47.6	config/log/ckpt
VMamba-T[`s1l8`]	50M	271G	MaskRCNN@1x	47.3	69.3	52.0	42.7	66.4	45.9	config/log/ckpt
Swin-T	48M	267G	MaskRCNN@3x	46.0	68.1	50.3	41.6	65.1	44.9	--
Swin-S	69M	354G	MaskRCNN@3x	48.2	69.8	52.8	43.2	67.0	46.1	--
Vanilla-VMamba-T	42M	~~262G~~ 286G	MaskRCNN@3x	48.5	70.0	52.7	43.2	66.9	46.4	config/log/ckpt
Vanilla-VMamba-S	64M	~~357G~~ 400G	MaskRCNN@3x	49.7	70.4	54.2	44.0	67.6	47.3	config/log/ckpt
VMamba-T[`s2l5`]	50M	270G	MaskRCNN@3x	48.9	70.6	53.6	43.7	67.7	46.8	config/log/ckpt
VMamba-S[`s2l15`]	70M	384G	MaskRCNN@3x	49.9	70.9	54.7	44.2	68.2	47.7	config/log/ckpt
VMamba-T[`s1l8`]	50M	271G	MaskRCNN@3x	48.8	70.4	53.5	43.7	67.4	47.0	config/log/ckpt

Models in this subsection are initialized from the models trained for classification.
We now calculate FLOPs with the algorithm @albertgu provides, which gives larger values than the previous calculation (which was based on the selective_scan_ref function and ignored the hardware-aware algorithm).
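The text above does not spell out the FLOPs formula. As a minimal sketch, assuming the selective scan is estimated at roughly 9·B·L·D·N operations (batch B, sequence length L, channels D, state size N) plus B·D·L for each optional skip or gate term, the count could look like the following. The constant 9, the function name, and the argument names are assumptions made for illustration and are not taken from this document.

```python
def selective_scan_flops(batch, length, dim, state, with_d=True, with_z=False):
    """Estimate FLOPs of one selective scan call.

    ASSUMPTION: the scan costs about 9 * B * L * D * N operations; this
    constant is an illustrative estimate, not a value stated in the text.
    """
    flops = 9 * batch * length * dim * state
    if with_d:
        # Skip connection term (D * u): one multiply-add per element.
        flops += batch * dim * length
    if with_z:
        # Optional gating term: one more operation per element.
        flops += batch * dim * length
    return flops


if __name__ == "__main__":
    # Hypothetical shapes: a 56x56 token grid, 192 channels, state size 16.
    print(selective_scan_flops(batch=1, length=56 * 56, dim=192, state=16))
```

Because the estimate scales linearly with the sequence length L, backbone FLOPs grow with input resolution, which is why the detection and segmentation tables report much larger totals than the 224x224 classification table.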

Backbone	Input	#params	FLOPs	Segmentor	mIoU(SS)	mIoU(MS)	configs/logs/logs(ms)/ckpts
Swin-T	512x512	60M	945G	UperNet@160k	44.4	45.8	--
Swin-S	512x512	81M	1039G	UperNet@160k	47.6	49.5	--
Swin-B	512x512	121M	1188G	UperNet@160k	48.1	49.7	--
Vanilla-VMamba-T	512x512	55M	~~939G~~ 964G	UperNet@160k	47.3	48.3	config/log/log(ms)/ckpt
Vanilla-VMamba-S	512x512	76M	~~1037G~~ 1081G	UperNet@160k	49.5	50.5	config/log/log(ms)/ckpt
Vanilla-VMamba-B	512x512	110M	~~1167G~~ 1226G	UperNet@160k	50.0	51.3	config/log/log(ms)/ckpt
VMamba-T[`s2l5`]	512x512	62M	948G	UperNet@160k	48.3	48.6	config/log/log(ms)/ckpt
VMamba-S[`s2l15`]	512x512	82M	1028G	UperNet@160k	50.6	51.2	config/log/log(ms)/ckpt
VMamba-B[`s2l15`]	512x512	122M	1170G	UperNet@160k	51.0	51.6	config/log/log(ms)/ckpt
VMamba-T[`s1l8`]	512x512	62M	949G	UperNet@160k	47.9	48.8	config/log/log(ms)/ckpt

Models in this subsection are initialized from the models trained for classification.
We now calculate FLOPs with the algorithm @albertgu provides, which gives larger values than the previous calculation (which was based on the selective_scan_ref function and ignored the hardware-aware algorithm).