Shang-Jui Ray Kuo

I am a PhD researcher at the SPELL Lab (Synthetic Perception and Learning Lab), Stony Brook University, advised by Prof. Paola Cascante-Bonilla. My research focuses on fundamental questions about the architectural choices AI systems have inherited and built on. Many of those choices were made when the practical constraints looked very different, and I work on rigorously comparing the defaults against the alternatives, whether those alternatives already exist or have to be designed for the comparison.

My current work runs concurrent threads on both sides of the vision-language interface. On the vision side, I ask whether Vision Transformers are actually the right backbone for VLMs, or whether this is a historical default worth revisiting. On the language side, I ask whether the standard tokenization pipeline is the right input interface, particularly for writing systems it was never designed for. Underlying both is a deeper question: are there more natural ways for AI systems to process the modalities they take in, ways that fit both the structure of the data and the modern computing systems that process it? My background in hardware and systems shapes how I work at this boundary.

Before starting my PhD I was an AI researcher and AI accelerator engineer at Inventec Corporation in Taipei, working on medical image segmentation and NPU IP design. I received my B.S. in Electrical Engineering from National Taiwan University in 2023.

Recent. Our VLM-SSM vision-encoder paper was featured on Hugging Face Daily Papers; the HF open-source team offered a ZeroGPU (A100) grant for a public demo. The work was also accepted as a poster at the SUNY AI Symposium 2026.

Research interests. Vision-language models · state space models as vision encoders · tokenization and learned input representations · multimodal learning · hardware-AI co-design.

Selected Publications

Do VLMs Need Vision Transformers? Evaluating State Space Models as Vision Encoders

Shang-Jui Ray Kuo and Paola Cascante-Bonilla

Featured on Hugging Face Daily Papers, 2026

Under review

We conduct a controlled study comparing Transformer, SSM, and hybrid vision backbones as frozen encoders in a LLaVA-style VLM pipeline. Under matched pretraining conditions, SSM backbones (VMamba) provide substantially stronger spatial grounding while remaining competitive on open-ended VQA, and can match or outperform much larger ViT-based encoders on localization benchmarks. We further show that localization failures can be stabilized with simple interface adjustments.

Improving Limited Supervised Foot Ulcer Segmentation Using Cross-Domain Augmentation

Shang-Jui Kuo^*, Po-Han Huang^*, Chia-Ching Lin, and 2 more authors

In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024. * Equal contribution (Kuo and Huang).

We propose a two-stage cross-domain augmentation methodology (TransMix) for foot-ulcer segmentation under limited supervision: Augmented Global Pre-training on the source-domain skin-lesion dataset HAM10000, followed by Localized CutMix Fine-tuning on the target-domain FUSeg benchmark. On FUSeg, the method lifts Dice from 74.83% to 85.26% using only 40 labeled images (a gain of 10.43 percentage points over the U-Net baseline), and reaches 91.08% at the full 810-image scale with a lighter ResNet-50 backbone.