Surflo: Consistent 3D Surface Flow from a Global State

arXiv preprint · 2026

Antoine Guédon* LIX, École Polytechnique

Shu Nakamura* Kyoto University

Nicolas Dufour* Kyutai

Jiahui Lei UC Berkeley

Ko Nishino Kyoto University

Angjoo Kanazawa UC Berkeley

* Equal contribution


Fuse N views into a global state,
then decode a single coherent surface from it.

Variable views → one latent.

Fuse N views into a fixed state of K=128 tokens.

Decode at any resolution.

Decode as many points as you want from one global state.

Independent flow, coherent surface.

While points flow independently, our guidance couples them.

State-of-the-art performance.

Evaluated on 8 benchmarks, from 2 to 32 views.

New surface dataset for real-world scenes.

~10.5K DL3DV scenes, full scene meshes.

01

Overview

Pipeline: N input views (N = 16) → Encoder E_φ (VGGT backbone + Perceiver compressor) → Global state z (K = 128 tokens, fixed for any N) → Decoder v_θ (per-point flow-matching ODE) → M oriented points (M up to 10⁶)

Global state. Surflo turns a variable number of unposed images into a single global latent—not a stack of per-view tokens.
Arbitrary resolution. Each surface point is then decoded independently, with a flow-matching ODE conditioned on that latent: we can sample any number of points from one encoder pass.
Coupling via guidance. We finally introduce a communication guidance mechanism relying on a shared rendering loss. At each step of the ODE integration, points are converted into 3D Gaussians and rendered with Gaussian Splatting. This rendering guidance limits disagreement between nearby query points.
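
The per-point decoding described above can be sketched as a plain Euler integration of an ODE conditioned on a shared state. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_velocity` is a hypothetical stand-in for the learned decoder v_θ, the four-number `z` (a sphere centre and radius) stands in for the K = 128 latent tokens, and both normals and the rendering guidance are left out.

```python
import numpy as np


def toy_velocity(x, t, z):
    """Hypothetical stand-in for v_theta(x, t, z).

    z = (cx, cy, cz, radius) describes a sphere; the field moves every
    point in a straight line toward its radial projection on that sphere.
    """
    center, radius = z[:3], z[3]
    offset = x - center
    target = center + radius * offset / np.linalg.norm(offset, axis=-1, keepdims=True)
    return (target - x) / (1.0 - t)


def decode_points(z, num_points, num_steps=16, seed=0):
    """Integrate dx/dt = v(x, t, z) from t = 0 to t = 1 with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_points, 3))  # one noise sample per query point
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        # Each row of x is updated from itself and the shared state z only.
        x = x + dt * toy_velocity(x, t, z)
    return x


z = np.array([0.5, 0.0, 0.0, 2.0])  # toy "global state" (assumed layout)
coarse = decode_points(z, num_points=1_000)
dense = decode_points(z, num_points=100_000)
```

Both calls reuse the same `z`; only the number of noise samples changes, which is what makes the output resolution a free choice at decode time.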

02

One global state, One coherent surface

Modern feed-forward 3D models (VGGT, DUSt3R, DepthAnything-3) produce one pointmap per view, so the representation grows linearly with the number of views and accumulates both noise and redundancy.
Surflo encodes the entire image set into a global state z — one fixed-size representation, regardless of how many views you provide. From it, we decode a single coherent surface at whichever resolution we ask for.

The result is a representation of geometry that, by construction, captures only what is shared across views.

Geometry is what remains invariant under transformations of view.

Felix Klein · Erlangen Programme, 1872

Comparison: Surflo (a single coherent surface decoded from one global state) vs VGGT (per-view pointmaps stacked across views), i.e. one shared latent vs N independent pointmaps.

03

Decoding explicit surfaces

Each scene below is a Surflo reconstruction from 16 unposed images. For each scene, 100K points are decoded from the global state and then assembled into a mesh. For visualization only, RGB colors are computed by naively averaging over the input images.

Buzz · 16 views · textured

04

A Global State filtering out Redundancy

Adding more input images doesn't grow the global state. The K = 128 latent tokens simply see more information through cross-attention, leading to more complete reconstructions. The Surflo points below are shown for an increasing number of input images. For visualization only, RGB colors are computed by naively averaging over the input images.
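
Why the state stays at K tokens can be illustrated with a bare cross-attention step: the latents act as queries, the view tokens as keys and values, so the output has one row per latent however many view tokens there are. This is a simplified sketch, not Surflo's encoder: it uses a single head with no learned projections, and the feature width `D` and `TOKENS_PER_VIEW` are assumed values chosen for illustration.

```python
import numpy as np


def cross_attention(latents, tokens):
    """Single-head cross-attention without learned projections.

    latents: (K, D) queries; tokens: (T, D) keys and values.
    Returns an array of shape (K, D), whatever T is.
    """
    dim = latents.shape[-1]
    scores = latents @ tokens.T / np.sqrt(dim)             # (K, T)
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ tokens                                 # (K, D)


rng = np.random.default_rng(0)
K, D, TOKENS_PER_VIEW = 128, 64, 196  # D and TOKENS_PER_VIEW are assumptions
latents = rng.standard_normal((K, D))

state_shapes = {}
for num_views in (2, 16, 32):
    view_tokens = rng.standard_normal((num_views * TOKENS_PER_VIEW, D))
    state_shapes[num_views] = cross_attention(latents, view_tokens).shape
```

Every entry of `state_shapes` is `(128, 64)`: more views widen the attention matrix, not the state.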

Woody · 17 input images · K=128 latent

05

One latent, Any output resolution

One encoder forward pass, one global state — decode the surface at whichever density you can afford. No re-encoding required.
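
One practical consequence of independent decoding is that a large point budget can be processed in memory-sized chunks without changing the result. The sketch below illustrates this for the unguided case only; `decode_chunk` is a hypothetical per-point decoder (a fixed nonlinear map), not the Surflo decoder, and the 3×3 `z` is a toy stand-in for the global state.

```python
import numpy as np


def decode_chunk(z, noise):
    """Hypothetical per-point decoder: each output row depends only on z
    and on the matching row of noise."""
    return np.tanh(noise @ z)  # (m, 3) @ (3, 3) -> (m, 3)


def decode_in_chunks(z, noise, chunk_size):
    """Decode all query points in batches of at most chunk_size rows."""
    outputs = [
        decode_chunk(z, noise[start:start + chunk_size])
        for start in range(0, len(noise), chunk_size)
    ]
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
z = rng.standard_normal((3, 3))           # toy global state, computed once
noise = rng.standard_normal((10_000, 3))  # one row per requested surface point

all_at_once = decode_chunk(z, noise)
chunked = decode_in_chunks(z, noise, chunk_size=1_024)
```

The two outputs agree up to floating-point rounding, so the chunk size can be set by available memory rather than by the model.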

06

Points communicate through rendering

Independent ODEs are cheap and parallel, but two nearby queries can settle on different resolutions of the same surface ambiguity. We couple the points at inference time with a guidance mechanism: at each ODE step, we render the points with Gaussian splatting, back-propagate an image-space loss, and update the velocities. The rendering gradient is the channel through which points communicate.
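
The idea of a guided step can be sketched in one dimension. The example below is a toy under loud assumptions: points live on a line, the "renderer" is a sum of 1D Gaussians sampled at pixel centres, the loss is a plain squared error, and the correction rule `velocity - scale * grad` is one simple choice, not necessarily the update used by Surflo. What it does show is the coupling: each point's gradient depends on a residual image that all points contribute to.

```python
import numpy as np


def splat(x, pixels, sigma):
    """Per-point Gaussian footprints on a 1D pixel grid: shapes (M, P)."""
    diff = pixels[None, :] - x[:, None]
    weights = np.exp(-0.5 * (diff / sigma) ** 2)
    return weights, diff


def render_loss_and_grad(x, target, pixels, sigma):
    """Squared image error of the splatted points and its gradient w.r.t. x."""
    weights, diff = splat(x, pixels, sigma)
    residual = weights.sum(axis=0) - target  # rendered minus observed, (P,)
    loss = 0.5 * np.sum(residual ** 2)
    grad = (weights * diff * residual[None, :]).sum(axis=1) / sigma ** 2
    return loss, grad


def guided_euler_step(x, velocity, dt, target, pixels, sigma, scale):
    """Euler step whose velocity is corrected by the shared rendering gradient."""
    _, grad = render_loss_and_grad(x, target, pixels, sigma)
    return x + dt * (velocity - scale * grad)


pixels = np.linspace(0.0, 1.0, 32)
sigma = 0.1
target = splat(np.array([0.40, 0.70]), pixels, sigma)[0].sum(axis=0)  # toy observation

x = np.array([0.30, 0.60])      # current point estimates
velocity = np.zeros_like(x)     # placeholder for the decoder's velocity
loss_before, _ = render_loss_and_grad(x, target, pixels, sigma)
x_next = guided_euler_step(x, velocity, 1.0 / 16, target, pixels, sigma, scale=1e-3)
loss_after, _ = render_loss_and_grad(x_next, target, pixels, sigma)
```

With the decoder velocity set to zero, the step is driven by the rendering gradient alone, and the image error after the step is lower than before it.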

no guidance: plain flow matching
+ photometric: L₁ + DSSIM
+ monodepth expert: depth order regulariser

07

Comparison with state-of-the-art

The comparisons below set Surflo against VGGT and Gaussian Wrapping, the leading method for surface reconstruction from images. Surflo holds up even on tough captures with strong exposure variation and transparent objects; see, for instance, the translucent Totoro figurine in Scene 01.

Scene 01 · Totoro · Custom capture · 16 views · transparent figurine
Points: Surflo vs VGGT pointmaps
Mesh: Surflo vs Gaussian Wrapping

Scene 02 · Gallos · Custom capture · 16 views
Scene 03 · Caterpillar · Tanks & Temples · 16 views
Scene 04 · Garden · Mip-NeRF 360 · 16 views
Scene 05 · Ignatius · Tanks & Temples · 16 views
Scene 06 · Robot · Custom capture · 16 views

08

Numbers

Surflo is trained only on our augmented version of DL3DV and evaluated on eight benchmarks: four standard novel-view synthesis datasets, with reference surfaces computed from dense views using the state-of-the-art meshing method Gaussian Wrapping, and four benchmarks with native surface ground truth. Every method sees the same 16 unposed input views per scene; we report Chamfer Distance (CD ↓) and F1-score (F1 ↑). Per-view feed-forward baselines are fused into a single global mesh with TSDF fusion. Both Surflo variants are reported, with and without the shared rendering guidance.
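
For readers unfamiliar with the two metrics, here is a minimal brute-force sketch on point sets. The exact protocol behind the numbers on this page (sampling density, threshold, normalization) is not stated here, so the choices below are assumptions: a symmetric Chamfer distance averaged over both directions, and an F1-score at a hypothetical distance threshold `tau`, scaled to 0–100.

```python
import numpy as np


def nearest_distances(a, b):
    """Distance from every point in a (n, 3) to its nearest point in b (m, 3)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)


def chamfer_and_f1(pred, gt, tau):
    """Symmetric Chamfer distance and F1-score (in percent) at threshold tau."""
    d_pred = nearest_distances(pred, gt)  # accuracy direction
    d_gt = nearest_distances(gt, pred)    # completeness direction
    chamfer = 0.5 * (d_pred.mean() + d_gt.mean())
    precision = (d_pred < tau).mean()
    recall = (d_gt < tau).mean()
    if precision + recall == 0:
        return chamfer, 0.0
    return chamfer, 100.0 * 2 * precision * recall / (precision + recall)


# Toy check: the corners of a unit cube against a copy shifted by 0.1 along x.
cube = np.array([[i, j, k] for i in (0.0, 1.0) for j in (0.0, 1.0) for k in (0.0, 1.0)])
shifted = cube + np.array([0.1, 0.0, 0.0])

cd_same, f1_same = chamfer_and_f1(cube, cube, tau=0.05)
cd_shift, f1_shift = chamfer_and_f1(shifted, cube, tau=0.05)
```

Identical sets give a Chamfer distance of 0 and an F1 of 100; the shifted copy gives a Chamfer distance of about 0.1 and an F1 of 0, since no point falls within the 0.05 threshold. The pairwise distance matrix makes this suitable for small point sets only.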

Proxy surfaces obtained from dense views

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0126              69.23
DA3 + TSDF          0.0120              72.30
NOVA3R              0.0459              30.51
2DGS                0.0163              60.10
RaDe-GS             0.0166              59.48
Gaussian Wrapping   0.0168              60.67
Surflo (no guid.)   0.0072              81.92
Surflo (guid.)      0.0083              78.55

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0113              77.46
DA3 + TSDF          0.0177              70.80
NOVA3R              0.0432              32.99
2DGS                0.0161              62.95
RaDe-GS             0.0170              61.67
Gaussian Wrapping   0.0157              64.94
Surflo (no guid.)   0.0053              88.57
Surflo (guid.)      0.0056              86.40

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0178              60.64
DA3 + TSDF          0.0182              59.91
NOVA3R              0.0429              25.60
2DGS                0.0222              51.08
RaDe-GS             0.0224              50.83
Gaussian Wrapping   0.0201              57.86
Surflo (no guid.)   0.0068              82.00
Surflo (guid.)      0.0103              76.57

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0193              62.30
DA3 + TSDF          0.0210              54.03
NOVA3R              0.0550              27.61
2DGS                0.0204              59.54
RaDe-GS             0.0202              60.04
Gaussian Wrapping   0.0164              64.54
Surflo (no guid.)   0.0116              70.96
Surflo (guid.)      0.0109              75.09

Native surface ground truth

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0138              74.08
DA3 + TSDF          0.0151              69.07
NOVA3R              0.0635              27.65
2DGS                0.0176              62.03
RaDe-GS             0.0174              62.68
Gaussian Wrapping   0.0145              66.86
Surflo (no guid.)   0.0097              77.98
Surflo (guid.)      0.0079              87.97

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0270              59.64
DA3 + TSDF          0.0875              52.61
NOVA3R              0.0413              32.13
2DGS                0.0295              48.19
RaDe-GS             0.0303              48.85
Gaussian Wrapping   0.0259              55.64
Surflo (no guid.)   0.0103              76.50
Surflo (guid.)      0.0114              77.28

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0380              28.93
DA3 + TSDF          0.0801              17.93
NOVA3R              0.0307              31.41
2DGS                0.0394              28.78
RaDe-GS             0.0393              28.30
Gaussian Wrapping   0.0460              30.17
Surflo (no guid.)   0.0242              39.23
Surflo (guid.)      0.0240              42.05

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0226              59.89
DA3 + TSDF          0.0220              60.95
NOVA3R              0.0771              27.41
2DGS                0.0234              53.57
RaDe-GS             0.0242              53.52
Gaussian Wrapping   0.0123              62.96
Surflo (no guid.)   0.0114              61.20
Surflo (guid.)      0.0070              81.11

Surflo continues to outperform the baselines as the number of input views changes. We vary the number of unposed views per scene from 2 to 32 and report the same Chamfer Distance (CD ↓) and F1-score (F1 ↑) on the out-of-distribution (OOD) datasets Tanks & Temples and Mip-NeRF 360. Surflo leads at every view count, including the hard 2-view regime.

Varying input views

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.1444              6.83
DA3 + TSDF          0.1428              6.97
NOVA3R              0.2620              5.78
2DGS                0.1453              6.09
RaDe-GS             0.1454              6.30
Gaussian Wrapping   0.1476              5.24
Surflo (no guid.)   0.1345              9.28
Surflo (guid.)      0.1416              7.08

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0285              53.62
DA3 + TSDF          0.0293              47.90
NOVA3R              0.0502              30.29
2DGS                0.0316              43.05
RaDe-GS             0.0314              42.60
Gaussian Wrapping   0.0313              45.28
Surflo (no guid.)   0.0135              75.07
Surflo (guid.)      0.0198              72.65

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0138              70.85
DA3 + TSDF          0.0210              61.64
NOVA3R              0.0557              31.78
2DGS                0.0187              55.95
RaDe-GS             0.0191              55.67
Gaussian Wrapping   0.0176              60.04
Surflo (no guid.)   0.0059              86.59
Surflo (guid.)      0.0061              86.25

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0094              84.83
DA3 + TSDF          0.0140              74.59
NOVA3R              0.0423              35.54
2DGS                0.0152              68.43
RaDe-GS             0.0156              66.24
Gaussian Wrapping   0.0133              72.10
Surflo (no guid.)   0.0049              90.76
Surflo (guid.)      0.0049              90.34

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0755              9.63
DA3 + TSDF          0.0746              9.24
NOVA3R              0.0795              8.46
2DGS                0.0754              8.73
RaDe-GS             0.0756              8.76
Gaussian Wrapping   0.0766              7.28
Surflo (no guid.)   0.0714              13.07
Surflo (guid.)      0.0736              10.40

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0326              40.79
DA3 + TSDF          0.0322              40.31
NOVA3R              0.0642              21.41
2DGS                0.0344              37.15
RaDe-GS             0.0352              37.11
Gaussian Wrapping   0.0339              41.61
Surflo (no guid.)   0.0192              58.68
Surflo (guid.)      0.0263              53.71

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0244              55.29
DA3 + TSDF          0.0220              55.17
NOVA3R              0.0492              28.01
2DGS                0.0283              47.45
RaDe-GS             0.0293              47.89
Gaussian Wrapping   0.0251              53.97
Surflo (no guid.)   0.0127              74.07
Surflo (guid.)      0.0145              73.44

Method              Chamfer distance ↓  F1-score ↑
VGGT + TSDF         0.0161              65.56
DA3 + TSDF          0.0162              64.75
NOVA3R              0.0562              18.95
2DGS                0.0217              52.13
RaDe-GS             0.0215              53.08
Gaussian Wrapping   0.0162              62.41
Surflo (no guid.)   0.0071              81.24
Surflo (guid.)      0.0137              80.66

09

A new Dataset for Surface Reconstruction

Alongside Surflo we will release an augmented version of DL3DV where every scene ships with a full surface mesh covering both foreground and background geometry. Each mesh is computed with the state-of-the-art Gaussian Wrapping pipeline, giving the community a large, scene-level supervision signal for surface reconstruction.

~10.5K scenes · indoor & outdoor · foreground + background · posed images, depth, mesh

Scene 01: reference photo vs Gaussian Wrapping mesh render
Scene 02: reference photo vs Gaussian Wrapping mesh render
Scene 03: reference photo vs Gaussian Wrapping mesh render

11

Citation

@article{guedon2026surflo,
  title       = {Surflo: Consistent 3D Surface Flow from a Global State},
  author      = {Gu{\'e}don, Antoine and Nakamura, Shu and Dufour, Nicolas
                  and Lei, Jiahui and Nishino, Ko and Kanazawa, Angjoo},
  journal     = {arXiv preprint},
  year        = {2026}
}
