Xiaoxia (Shirley) Wu (吴晓霞)
Principal Scientist, Together AI
Email: shirley AT together dot ai
I am a Principal Scientist at Together AI (promoted May 2026; joined July 2024), where I lead research on LLM inference efficiency. My work spans speculative decoding, quantization, and RL-driven post-training. Concretely, I lead the Aurora and ATLAS projects, unified training–serving systems that reduce the training–serving mismatch in speculative decoding through continuous online adaptation from live traffic. I also build and scale speculator training pipelines (SFT, distillation, RL post-training) and drive full-stack inference optimization across quantization formats including FP8, FP6, NVFP4, INT4, INT2, ternary, and binary, with deployment in vLLM, SGLang, and TensorRT-LLM. Ping me if you're interested in building fast and efficient LLM inference!
Previously, I was a Senior Researcher on the DeepSpeed team at Microsoft, led by Zhewei Yao and Yuxiong He. I focused on algorithm- and system-level optimizations for large-scale LLM training and inference, with emphasis on compression, quantization, and multi-modal research. Key projects include DeepSpeed-FP6, DeepSpeed-Chat, ZeroQuant, and Extreme Compression for Pre-trained Transformers (NeurIPS 2022 Oral).
Before that, I was a postdoctoral research fellow at the University of Chicago and the Toyota Technological Institute at Chicago, mentored by Rebecca Willett, where I worked on differentially private empirical risk minimization.
I completed my Ph.D. in Machine Learning at The University of Texas at Austin, advised by Rachel Ward and co-advised by Léon Bottou. My dissertation, awarded the Frank Gerth III Dissertation Award (top dissertation in Mathematics at UT Austin), focused on gradient-based optimization and implicit regularization over non-convex landscapes. I interned at Meta AI Research (Fall 2017, with Léon Bottou) and at Google (Summer 2020, with Behnam Neyshabur and Ethan Dyer); my work on curriculum learning from the Google internship was published as an ICLR 2021 Oral.
I hold an M.Sc. with Distinction in Financial Mathematics from the University of Edinburgh. Before that, I studied Mathematics and Applied Mathematics at Shantou University, where I was awarded the Li Ka-shing Scholarship to participate in Semester at Sea. I am from Guangdong, China, and speak Cantonese and Hakka.
6,700+ citations, h-index 24 (as of May 2026). For a full list of publications, see my Google Scholar profile.
LLM Systems & Speculative Decoding
When RL Meets Adaptive Speculative Training
J. Wang, F. Bie, J. Li, Z. Zhou, et al., X. Wu (project lead)
ICML 2026
Beat the Long Tail: Distribution-Aware Speculative Decoding for RL Training
Z. Shao, V. Srivatsa, Q. Wu, A. Ariyak, X. Wu, et al.
MLSys 2026
Understanding and Steering the Cognitive Behaviors of Reasoning Models at Test-Time
Z. Zhang, X. Wu (project lead), et al.
arXiv 2025
Model Compression & Quantization
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
Z. Zhou, et al., X. Wu (project lead)
arXiv 2025
Kitty: Accurate and Efficient 2-bit KV Cache Quantization
H. Xia, X. Wu (project lead), et al.
MLSys 2026
Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs
H. Xia, Z. Zheng, X. Wu, S. Chen, Z. Yao, S. Youn, et al.
USENIX ATC 2024
ZeroQuant(4+2): Redefining LLM Quantization with an FP6 Strategy
X. Wu, H. Xia, S. Youn, Z. Zheng, et al.
arXiv 2023
Extreme Compression for Pre-trained Transformers Made Simple
X. Wu, et al.
NeurIPS 2022 (Oral)
Optimization & Theory
When Do Curricula Work?
X. Wu, E. Dyer, B. Neyshabur
ICLR 2021 (Oral)
[code, slides]
AdaGrad Stepsizes: Sharp Convergence over Nonconvex Landscapes
R. Ward*, X. Wu*, L. Bottou
ICML 2019 (Oral); extended version in Journal of Machine Learning Research
[code]
*: equal contribution.