SkinTokens: A Learned Compact Representation
for Unified Autoregressive Rigging

1Tsinghua University, 2VAST

Corresponding Author
Automated rigging with TokenRig. We present TokenRig, a unified generative framework that produces high-quality rigs for diverse 3D assets. By leveraging SkinTokens, our novel discrete representation for skinning weights, our method robustly generates high-fidelity skeletons and precise skinning maps (visualized as heatmaps) for complex, real-world geometries, ranging from stylized anime characters to quadrupeds and fantasy creatures.

Abstract

The rapid proliferation of generative 3D models has created a critical bottleneck in animation pipelines: rigging. Existing automated methods are fundamentally limited by their approach to skinning, treating it as an ill-posed, high-dimensional regression task that is inefficient to optimize and is typically decoupled from skeleton generation. We posit this is a representation problem and introduce SkinTokens: a learned, compact, and discrete representation for skinning weights. By leveraging an FSQ-CVAE to capture the intrinsic sparsity of skinning, we reframe the task from continuous regression to a more tractable token sequence prediction problem.

This representation enables TokenRig, a unified autoregressive framework that models the entire rig as a single sequence of skeletal parameters and SkinTokens, learning the intricate dependencies between skeletons and skin deformations. The unified model is then amenable to a reinforcement learning stage, where tailored geometric and semantic rewards improve generalization to complex, out-of-distribution assets.

Quantitatively, the SkinTokens representation leads to a 98%–133% improvement in skinning accuracy over state-of-the-art methods, while the full TokenRig framework, refined with RL, enhances bone prediction by 17%–22%. Our work presents a unified, generative approach to rigging that yields higher fidelity and robustness, offering a scalable solution to a long-standing challenge in 3D content creation.

Method

Pipeline overview

Overview of the TokenRig Framework. Our method consists of three key stages: (1) Learning SkinTokens: We first train an FSQ-CVAE to compress sparse skinning weights into a compact, discrete representation. Mesh geometry and skinning weights are processed by VecSet encoders, and the resulting features are discretized into SkinTokens via Finite Scalar Quantization (FSQ). We employ nested dropout and importance sampling to ensure robust reconstruction of active deformation regions. (2) Unified Autoregressive Modeling: We formulate rigging as a sequence generation task. A Transformer generates a single, unified sequence comprising the complete skeleton followed by the learned SkinTokens, conditioned on global shape embeddings to capture structural dependencies. (3) RL Refinement via GRPO: To improve generalization to in-the-wild assets, we fine-tune the model using Group Relative Policy Optimization (GRPO). We introduce four specific rewards: Volumetric Joint Coverage, Bone-Mesh Containment, Skinning Coverage and Sparsity, and Deformation Smoothness.
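To make the FSQ step concrete, the sketch below shows how Finite Scalar Quantization turns a continuous latent vector into discrete codes and a single token id. This is a minimal NumPy illustration of the general FSQ scheme, not the paper's implementation: the level counts, the mixed-radix token encoding, and the restriction to odd level counts (even counts need an extra half-bin offset) are all our assumptions.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization (sketch): bound each latent channel with
    tanh, scale it, and round to one of `levels[d]` uniformly spaced values.

    z      : (..., D) array of continuous latent features
    levels : list of D odd ints, number of quantization levels per channel
    Returns (quantized latents in [-1, 1], integer codes in [0, L-1]).
    """
    L = np.asarray(levels, dtype=np.float64)
    half = (L - 1) / 2.0
    bounded = np.tanh(z) * half            # each channel now lies in (-half, half)
    rounded = np.round(bounded)            # nearest integer grid point
    codes = rounded + half                 # shift to non-negative code in [0, L-1]
    quantized = rounded / half             # map back to [-1, 1]
    return quantized, codes.astype(np.int64)

def codes_to_token(codes, levels):
    """Flatten per-channel codes into one integer token id via a
    mixed-radix encoding (an assumed convention, for illustration)."""
    token = np.zeros(codes.shape[:-1], dtype=np.int64)
    for d, L in enumerate(levels):
        token = token * L + codes[..., d]
    return token
```

In training, the rounding would be wrapped in a straight-through estimator so gradients pass through the quantizer; that detail is omitted here for brevity.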

Key Contributions

  • A learned discrete representation for skinning weights, SkinTokens, that transforms skinning from a high-dimensional regression task into a compact sequence prediction problem.
  • A unified autoregressive framework, TokenRig, that jointly models skeleton generation and skinning, capturing their mutual dependencies for higher-fidelity results.
  • A reinforcement learning framework for rig refinement, with novel reward functions designed to improve the generalization and robustness of the generated rigs on out-of-distribution 3D models.
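The unified sequence formulation above can be sketched as follows: skeleton joint coordinates are discretized and concatenated with the SkinToken codes into one token stream for the Transformer. The special tokens, vocabulary layout, and 256-bin coordinate resolution are illustrative assumptions, not the paper's actual vocabulary.

```python
import numpy as np

BOS, SKIN_SEP, EOS = 0, 1, 2   # hypothetical special tokens
COORD_OFFSET = 3               # coordinate tokens start here (assumed layout)
N_BINS = 256                   # per-axis quantization resolution (assumed)

def quantize_coord(x):
    """Map a coordinate in [-1, 1] to a discrete bin id."""
    return int(np.clip((x + 1) / 2 * (N_BINS - 1), 0, N_BINS - 1))

def build_rig_sequence(joints, skin_tokens,
                       skin_vocab_offset=COORD_OFFSET + N_BINS):
    """Serialize skeleton joints followed by SkinTokens into one sequence.

    joints      : (J, 3) array of joint positions in [-1, 1]
    skin_tokens : iterable of integer SkinToken codes (from the FSQ-CVAE)
    """
    seq = [BOS]
    for j in joints:                       # skeleton part of the sequence
        seq += [COORD_OFFSET + quantize_coord(c) for c in j]
    seq.append(SKIN_SEP)                   # boundary between skeleton and skin
    seq += [skin_vocab_offset + int(t) for t in skin_tokens]
    seq.append(EOS)
    return seq
```

Because skeleton and skinning share one autoregressive context, the model can condition skinning predictions on the skeleton it has already emitted, which is the dependency the unified formulation is designed to capture.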

Results

Skeleton Generation Comparison

Skeleton generation comparison

Qualitative Comparison of Skeleton Generation. We compare TokenRig (Ours) against state-of-the-art baselines. While baseline methods exhibit partial structures, missing details, or redundant joints, our method synthesizes structurally coherent and semantically faithful skeletons across diverse character types.

Skinning Prediction Comparison

Skinning prediction comparison

Qualitative Comparison of Skinning Prediction. We visualize predicted skinning weights and the corresponding average L1 error maps. Baseline methods often suffer from "bleeding" artifacts, where weights spill onto unconnected mesh parts. TokenRig (Ours) produces clean, locally coherent influence maps that closely match the Ground Truth, particularly in fine-grained regions like fingers.

Impact of Reinforcement Learning

GRPO comparison
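The core of the GRPO refinement is group-relative advantage estimation: several rigs are sampled per shape, each is scored by the combined reward, and each sample's advantage is its reward normalized against the group. A minimal sketch, with assumed uniform reward weights (the paper's actual weighting of the four reward terms is not specified here):

```python
import numpy as np

def total_reward(r_joint_cov, r_containment, r_skin, r_smooth,
                 w=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four rewards (Volumetric Joint Coverage,
    Bone-Mesh Containment, Skinning Coverage and Sparsity, Deformation
    Smoothness). The weights are illustrative assumptions."""
    return (w[0] * r_joint_cov + w[1] * r_containment
            + w[2] * r_skin + w[3] * r_smooth)

def grpo_advantages(rewards, eps=1e-8):
    """Group Relative Policy Optimization normalizes each sampled rig's
    reward against its group: advantage = (r - mean) / (std + eps)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)
```

Because advantages are computed relative to the group rather than a learned value function, no critic network is needed, which keeps the RL stage lightweight.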

Diverse Generation Results

Diverse generation results

Diverse Generation Results. We demonstrate the generalization capacity of TokenRig on a wide range of inputs, including unseen test-set samples and complex in-the-wild assets. The model robustly synthesizes fully articulated skeletons and accurate skinning weights.

Quantitative Results

Skeleton Generation (Chamfer Distance, lower is better):

                             ModelsResource            Articulation 2.0
Method                       J2J↓   J2B↓   B2B↓        J2J↓   J2B↓   B2B↓
RigNet                       3.901  2.412  2.213       7.376  5.841  4.802
MagicArticulate              3.024  2.260  1.915       4.003  3.026  2.586
Puppeteer                    3.841  2.881  2.475       3.033  2.300  1.923
UniRig                       3.390  2.592  1.890       3.115  2.211  1.926
TokenRig (Ours, w/ GRPO)     2.893  2.012  1.547       2.485  1.599  1.463

Skinning Prediction (lower L1/Motion is better, higher Precision/Recall is better):

                     ModelsResource                            Articulation 2.0
Method               L1↓    Var.↓  Prec.↑  Rec.↑  Motion↓      L1↓    Var.↓  Prec.↑  Rec.↑  Motion↓
RigNet               0.057  0.046  62.4    59.9   0.079        0.043  0.040  67.8    54.6   0.092
Puppeteer            0.032  0.017  64.4    87.2   0.028        0.028  0.014  76.7    75.1   0.031
UniRig               0.038  0.021  65.8    86.7   0.031        0.030  0.017  72.6    73.5   0.042
TokenRig (Ours)      0.016  0.007  79.2    89.1   0.016        0.015  0.006  79.0    89.2   0.021

Acknowledgements

We sincerely thank our friends and colleagues for their generous help with this work.