We present JointDiT, a diffusion transformer that models the joint distribution of RGB and depth. By leveraging the architectural benefits and strong image prior of state-of-the-art diffusion transformers, JointDiT not only generates high-fidelity images but also produces geometrically plausible and accurate depth maps. This solid joint distribution modeling is achieved through two simple yet effective techniques that we propose: adaptive scheduling weights, which depend on the noise level of each modality, and an unbalanced timestep sampling strategy. With these techniques, we train our model across all noise levels for each modality, enabling JointDiT to naturally handle various combinatorial generation tasks, including joint generation, depth estimation, and depth-conditioned image generation, simply by controlling the timestep of each branch. JointDiT demonstrates outstanding joint generation performance. Furthermore, it achieves comparable results in depth estimation and depth-conditioned image generation, suggesting that joint distribution modeling can serve as a viable alternative to conditional generation.
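The abstract describes how a single jointly trained model covers joint generation, depth estimation, and depth-conditioned image generation by setting the timestep of each branch independently. The sketch below is a minimal illustration of that idea, not the authors' implementation: the function names (`task_timesteps`, `adaptive_weights`) and the specific weighting rule are assumptions made for clarity; the paper only states that the weights depend on each modality's noise level.

```python
def task_timesteps(task: str, t: float) -> tuple[float, float]:
    """Map a task to per-branch timesteps (t_rgb, t_depth) in [0, 1],
    where 1.0 means pure noise and 0.0 means a clean (given) signal.
    A clean branch acts as the condition; a noisy branch is denoised."""
    if task == "joint_generation":      # both RGB and depth start from noise
        return t, t
    if task == "depth_estimation":      # RGB is given clean, depth is denoised
        return 0.0, t
    if task == "depth_conditioned":     # depth is given clean, RGB is denoised
        return t, 0.0
    raise ValueError(f"unknown task: {task}")


def adaptive_weights(t_rgb: float, t_depth: float) -> tuple[float, float]:
    """Hypothetical adaptive scheduling weights driven by each branch's
    noise level: a branch that is already clean receives zero weight."""
    total = t_rgb + t_depth + 1e-8
    return t_rgb / total, t_depth / total


if __name__ == "__main__":
    for task in ("joint_generation", "depth_estimation", "depth_conditioned"):
        t_rgb, t_depth = task_timesteps(task, t=0.7)
        w_rgb, w_depth = adaptive_weights(t_rgb, t_depth)
        print(f"{task:18s}  t_rgb={t_rgb:.1f}  t_depth={t_depth:.1f}  "
              f"w_rgb={w_rgb:.2f}  w_depth={w_depth:.2f}")
```

Running the sketch shows the intended behavior: for joint generation both branches share the same noise level, while for the two conditional tasks one branch is pinned to the clean signal and receives no denoising weight.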
@InProceedings{Byung-Ki_2025_ICCV,
author = {Byung-Ki, Kwon and Dai, Qi and Hyoseok, Lee and Luo, Chong and Oh, Tae-Hyun},
title = {JointDiT: Enhancing RGB-Depth Joint Modeling with Diffusion Transformers},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025},
pages = {25261-25271}
}