Publications

Publications by category in reverse chronological order.

2026

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Shang-Jui Ray Kuo and Paola Cascante-Bonilla

Featured on Hugging Face Daily Papers, 2026

Under review

We conduct a controlled study comparing Transformer, SSM, and hybrid vision backbones as frozen encoders in a LLaVA-style VLM pipeline. Under matched pretraining conditions, SSM backbones (VMamba) provide substantially stronger spatial grounding while remaining competitive on open-ended VQA, and can match or outperform much larger ViT-based encoders on localization benchmarks. We further show that localization failures can be stabilized with simple interface adjustments.

2024

Improving Limited Supervised Foot Ulcer Segmentation Using Cross-Domain Augmentation

Shang-Jui Kuo^*, Po-Han Huang^*, Chia-Ching Lin, and 2 more authors

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024. * Equal contribution (Kuo and Huang)

We propose TransMix, a two-stage cross-domain augmentation method for foot-ulcer segmentation under limited supervision: Augmented Global Pre-training on the source-domain skin-lesion dataset HAM10000, followed by Localized CutMix Fine-tuning on the target-domain FUSeg benchmark. On FUSeg, the method lifts Dice from 74.83% to 85.26% using only 40 labeled images (+10.43 percentage points over the U-Net baseline), and reaches 91.08% at the full 810-image scale with a lighter ResNet-50 backbone.