Xiaoxia Wu


Xiaoxia (Shirley) Wu (吴晓霞)

Email: shirley AT Together dot ai

Google Scholar

About me

I am currently a Senior Staff Scientist at TogetherAI (since July 2024), where I build tools and work on quantization. Ping me if you're interested in building tools and making inference fast!

Previously, I was a Senior Researcher at Microsoft GenAI, where we developed the Phi-3 family of models. I was fortunate to be a member of the DeepSpeed team, led by Zhewei Yao and Yuxiong He, and worked closely with Weizhu's team. At DeepSpeed, I focused on system- and algorithm-level optimizations for large-scale training and inference of LLMs, with a particular emphasis on compression, long sequences, and multi-modal research. Some of my projects include DeepSpeed-FP6 and DeepSpeed-Chat. For more information, please check deepspeed.ai.

I was a postdoctoral research fellow mentored by Rebecca Willett at the University of Chicago and the Toyota Technological Institute at Chicago. I completed my Ph.D. at The University of Texas at Austin, where I was fortunate to be advised by Rachel Ward and informally co-advised by Léon Bottou. My Ph.D. research was in optimization methods, focusing on methods that are efficient and robust to hyperparameter tuning, such as adaptive gradient descent and batch normalization. I was a research intern at Facebook AI Research (New York office) during Fall 2017, and a research intern at Google working with Ethan Dyer and Behnam Neyshabur during Summer 2020.

I hold an M.Sc. with Distinction in Financial Mathematics from the University of Edinburgh. Before that, I spent four wonderful years in the Department of Mathematics and Applied Mathematics at Shantou University, where I was awarded the Li Ka-shing Scholarship to participate in Semester at Sea. I am from Guangdong, China, and speak Cantonese and Hakka.

Papers and Preprints (updated Nov 2021)


*: indicates equal contribution.

Teaching Assistant at UT Austin