Photorealistic facial surgery outcome prediction from a single photo, powered by anatomically-conditioned latent diffusion.
The 2D pipeline ships now and works well. The end goal is full 3D: you hold up your phone, slowly rotate your head, and we reconstruct a 3D face model from that video alone. Surgical deformations then happen in 3D space (anatomically grounded, not pixel-level warping) and you get an interactive model you can rotate to see the predicted result from any angle. No depth sensors, no clinical scanning rigs. Just a phone camera and a short video. See the Roadmap for details on each step.
LandmarkDiff extracts MediaPipe's 478-point face mesh from the input photo, applies procedure-specific Gaussian RBF deformations calibrated from anthropometric surgical data, renders the deformed mesh as a tessellation wireframe, and feeds that wireframe into a ControlNet-conditioned Stable Diffusion 1.5 backbone to synthesize the predicted face. The output is composited back onto the original image using Laplacian pyramid blending with feathered surgical masks, then refined through neural face restoration and identity verification.
Paper: "LandmarkDiff: Anatomically-Conditioned Latent Diffusion for Photorealistic Facial Surgery Outcome Prediction," arXiv preprint, March 2026. Targeting MICCAI 2026.
Full pipeline: input photo, landmark extraction, mesh deformation, ControlNet synthesis, compositing
The hosted demo runs entirely on CPU, with no GPU or local install needed. Upload a photo, pick a procedure, adjust intensity, and see the predicted result with symmetry analysis in seconds.
# Quick install
pip install -e ".[train,eval,app,dev]"
# Run a prediction
python scripts/run_inference.py photo.jpg --procedure rhinoplasty --intensity 60 --mode controlnet
- Features
- Why LandmarkDiff
- Supported Procedures
- How It Works
- Demo Outputs
- Quick Start
- Inference Modes
- Gradio Web Demo
- Symmetry Analysis
- Training
- Evaluation and Metrics
- Clinical Edge Cases
- Post-Processing Pipeline
- Project Structure
- Configuration
- Benchmarks
- Model Zoo
- Requirements
- Docker
- Make Targets
- Roadmap
- Citation
- Contributors
- Contributing
- License
- Acknowledgments
- Single-photo input: works from any 2D clinical photograph or phone selfie, no 3D scanning hardware needed
- 6 surgical procedure presets: rhinoplasty, blepharoplasty, rhytidectomy, orthognathic surgery, brow lift, mentoplasty (extensible to custom procedures)
- 4 inference modes: TPS (instant CPU), img2img, ControlNet, and ControlNet+IP-Adapter with configurable quality/speed tradeoffs
- MediaPipe 478-point face mesh: anatomically grounded landmark extraction for precise deformation control
- Gaussian RBF deformation engine: smooth, spatially weighted displacements calibrated from anthropometric surgical data
- ControlNet-conditioned generation: photorealistic texture synthesis via Stable Diffusion 1.5 with wireframe conditioning
- Neural post-processing: CodeFormer face restoration, Real-ESRGAN upscaling, LAB histogram matching, Laplacian pyramid blending
- ArcFace identity verification: ensures the predicted face preserves patient identity (cosine similarity check)
- Clinical edge-case handling: built-in support for vitiligo, Bell's palsy, keloid-prone skin, and Ehlers-Danlos syndrome
- Fitzpatrick-stratified evaluation: all metrics (FID, LPIPS, SSIM, NME, identity) broken down by skin type I through VI
- Intensity slider (0-100%): preview subtle through aggressive versions of any procedure
- Gradio web demo: 5-tab interface with single procedure, multi-procedure comparison, intensity sweep, face analysis, and multi-angle capture
- HPC training pipeline: SLURM scripts with preemption checkpointing, DDP multi-GPU, curriculum training configs
- Docker and Apptainer support: CPU and GPU container images for reproducible deployment
- PEP 561 typed package: ships with a py.typed marker for downstream type checking
Facial cosmetic surgery is one of the most common elective procedures worldwide. The American Society of Plastic Surgeons (ASPS) reported 15.6 million cosmetic procedures in the US in 2020, with rhinoplasty and blepharoplasty consistently ranking among the top 5 surgical procedures. These numbers have only grown since.
The problem is expectation management. Roughly 10 to 15% of rhinoplasty patients seek revision surgery, and a significant driver is the gap between what patients expected and what they got (Rohrich & Ahmad, "A Practical Approach to Rhinoplasty," Plastic and Reconstructive Surgery, 2016). Preoperative visualization directly affects satisfaction; patients who see a realistic preview report better alignment between expectations and results (Kandathil et al., "Examining Preoperative Expectations and Postoperative Satisfaction in Rhinoplasty Patients," Facial Plastic Surgery & Aesthetic Medicine, 2021). Systematic reviews of patient-reported outcomes in rhinoplasty confirm that expectation alignment is a key predictor of satisfaction (Leong & Iglesias, "A systematic review of patient-reported outcome measures in aesthetic and functional rhinoplasty," Journal of Plastic, Reconstructive & Aesthetic Surgery, 2016).
But here's the catch: the tools that produce good visualizations are expensive, proprietary, or both. Most surgeons, especially outside wealthy urban practices, don't have access to them.
Tier 1: Clinical 3D Simulation
- Canfield Scientific VECTRA (~$30-100K): Dedicated structured-light 3D scanner paired with Mirror simulation software. The gold standard in top-tier practices. Produces accurate surface meshes with Face Sculptor for tissue movement simulation. Requires trained operators, expensive hardware, and in-office capture. Proprietary with no published validation studies on prediction accuracy.
- Crisalix (~$200-500/mo): Cloud-based 3D simulation from 2D photos. 17 years in market, PE-backed (BID Equity). Supports breast and face procedures. Uses geometric morphing, not AI or diffusion. More accessible than VECTRA, but subscription-based, proprietary, and there's no open evaluation of its fidelity.
- AEDIT ($60/mo consumer): Phone-based 3D scanning using 100+ photos via TrueDepth camera. Patented morphing with "100,000 facial recognition points." Covers rhinoplasty, lip filler, brow lift, and Botox simulation. Multiple patents on 3D reconstruction from phone input. Consumer-first approach, iOS only.
Tier 2: Practice Management + Lite Simulation
- FaceTouchUp (~$50-100/mo): 2D morphing tool with AR overlay. Affordable and quick for consultations, but results look like warped photographs because that's exactly what they are: geometric transforms with no understanding of how skin, light, or tissue actually behave.
- TouchMD / Symplast / Consentz: EMR and practice management platforms with basic photo ghosting or overlay features, not true surgical simulation.
Tier 3: Consumer Beauty Tech
- Perfect Corp: AI-powered face reshape for beauty and med spa applications. Focused on fillers and Botox visualization, not structural surgical prediction.
- GlamAR: Virtual try-on API for beauty brands. Cosmetics overlay layer, not surgical simulation.
Academic approaches:
Most recent academic work on face manipulation focuses on generic editing (make someone look older, change their expression, swap identities) rather than surgery-specific prediction. A few notable examples:
- DiscoFaceGAN (Deng et al., CVPR 2020): Disentangled controllable face generation using 3DMM coefficients. Powerful for attribute editing, but designed for general-purpose face manipulation, not surgical planning. No procedure-specific deformation models.
- FaceShifter (Li et al., 2019): High-fidelity face swapping with occlusion awareness. Impressive identity transfer, but the goal is swapping one person's face onto another, not simulating what a surgical procedure would do to the same person.
- DiffFace (Kim et al., 2022): Diffusion-based face swapping with facial guidance. Shows the potential of diffusion models for face manipulation, but targets identity transfer, not surgical outcome prediction.
The common thread: none of the commercial tools use diffusion models (all rely on geometric warping or morphing), almost none of the academic work uses real surgical data to drive deformations, none evaluates fairness across skin tones, and none handles clinical edge cases like Bell's palsy or keloid-prone skin.
| Feature | Canfield VECTRA | Crisalix | AEDIT | FaceTouchUp | LandmarkDiff |
|---|---|---|---|---|---|
| Input | $50K+ scanner | Photos | Phone (iOS) | Photos | Any phone |
| Method | Geometric warp | Geometric morph | Patented morph | 2D pixel push | ControlNet diffusion |
| Output quality | High (3D mesh) | Medium (3D morph) | Medium (morph) | Low (pixel warp) | High (photorealistic) |
| Procedures | Many | Breast + face | Face + injectables | Manual any | 6 facial |
| Price | $30-100K | ~$200-500/mo | Free/$60/mo | $50-100/mo | Free (MIT) |
| Open source | No | No | No | No | Yes |
| Published research | No | No | No | No | Yes (arXiv) |
| Diffusion-based | No | No | No | No | Yes |
| Fairness eval | No | No | No | No | Fitzpatrick I-VI |
LandmarkDiff is not trying to compete with VECTRA on 3D accuracy; we're solving a different problem. We want to make surgery visualization accessible to any surgeon with a phone and any patient who walks into a consultation, while being honest about what the tool can and can't do.
None of the commercial tools in the comparison table above uses diffusion models; each relies on geometric warping or morphing. To our knowledge, LandmarkDiff is the first published system to apply ControlNet-conditioned latent diffusion to surgical outcome prediction, synthesizing photorealistic texture rather than pushing pixels geometrically. Combined with open-source access, published research, and Fitzpatrick-stratified fairness evaluation, we believe this makes LandmarkDiff both the most technically advanced and the most transparent of the systems compared.
Concretely:
- Open source (MIT license). Unlike every commercial tool listed above, you can inspect, modify, and extend the code. If you don't trust the output, you can trace exactly how it was generated.
- Single 2D photo input. No $50K+ hardware, no multi-view capture rigs. A standard clinical photograph or phone selfie is enough.
- Anatomically grounded deformations. Procedure-specific landmark displacements are fitted from real surgical data (pre/post pairs), not hand-tuned or based on generic face editing semantics.
- Diffusion-based photorealism. ControlNet-guided Stable Diffusion produces realistic skin texture, lighting, and shadows, not geometric morphs.
- Clinical edge-case handling. Built-in flags and modified behavior for vitiligo, Bell's palsy, keloid-prone skin, and Ehlers-Danlos syndrome.
- Fitzpatrick-stratified fairness evaluation. All metrics are broken down by Fitzpatrick skin type (I through VI) to catch and prevent performance disparities across skin tones.
- Roadmap toward 3D. We're working on phone-video-to-3D reconstruction to eventually provide accessible 3D visualization without VECTRA-class hardware.
Honest limitations: We don't have prospective clinical validation yet (that's planned). Our deformation model is calibrated from a limited dataset. We currently produce 2D output, not 3D. And diffusion models can hallucinate details, so outputs should always be reviewed by a clinician before showing to patients. This is a research tool, not a medical device. The comparison above reflects publicly available information as of March 2026. Commercial tools may have undisclosed technical capabilities.
- American Society of Plastic Surgeons. 2020 Plastic Surgery Statistics Report. ASPS, 2021.
- Rohrich RJ, Ahmad J. "A Practical Approach to Rhinoplasty." Plastic and Reconstructive Surgery. 2016;137(4):725e-746e.
- Kandathil CK, et al. "Examining Preoperative Expectations and Postoperative Satisfaction in Rhinoplasty Patients: A Single-Center Study." Facial Plastic Surgery & Aesthetic Medicine. 2021;23(1):33-38.
- Leong SC, Iglesias MA. "A systematic review of patient-reported outcome measures in aesthetic and functional rhinoplasty." Journal of Plastic, Reconstructive & Aesthetic Surgery. 2016;69(12):1635-1645.
- Deng Y, et al. "Disentangled and Controllable Face Image Generation via 3D Imitative-Contrastive Learning." CVPR 2020.
- Li L, et al. "FaceShifter: Towards High Fidelity And Occlusion Aware Face Swapping." arXiv:1912.13457, 2019.
- Kim K, et al. "DiffFace: Diffusion-based Face Swapping with Facial Guidance." arXiv:2212.13344, 2022.
LandmarkDiff ships with six procedure presets, each targeting specific anatomical regions with calibrated displacement vectors.
Rhinoplasty - Targets 25 landmarks across the nasal bridge, tip, and alar base. Key deformations include alar base narrowing (nostril width reduction), tip refinement with upward rotation, and dorsal hump reduction. Uses a 30px Gaussian RBF influence radius for smooth transitions across the nasal region.
Landmark indices: 1, 2, 4, 5, 6, 19, 94, 141, 168, 195, 197, 236, 240, 274, 275, 278, 279, 294, 326, 327, 360, 363, 370, 456, 460
Blepharoplasty - Targets 30 landmarks around the upper and lower eyelids. Deformations include upper lid elevation (hooded eye correction), medial and lateral canthal tapering, and lower lid tightening. Uses a tighter 15px influence radius to avoid affecting surrounding structures like the brow.
Landmark indices: 33, 7, 163, 144, 145, 153, 154, 155, 157, 158, 159, 160, 161, 246, 362, 382, 381, 380, 374, 373, 390, 249, 263, 466, 388, 387, 386, 385, 384, 398
Rhytidectomy (facelift) - Targets 33 landmarks along the jawline, cheeks, and periauricular region. Deformations include jowl lifting (upward and lateral traction), submental tightening, and gentle temple lifting to simulate tissue redistribution. Uses a wider 40px influence radius for the broad soft tissue mobilization typical of facelifts.
Landmark indices: 10, 21, 54, 58, 67, 93, 103, 109, 127, 132, 136, 150, 162, 172, 176, 187, 207, 213, 234, 284, 297, 323, 332, 338, 356, 361, 365, 379, 389, 397, 400, 427, 454
Orthognathic surgery - Targets 47 landmarks across the mandible, maxilla, and chin. Deformations simulate mandibular advancement or setback, chin projection changes, and lateral jaw narrowing. Uses a 35px influence radius. Identity loss is disabled for orthognathic predictions because jaw repositioning inherently changes facial proportions more than the other procedures.
Landmark indices: 0, 17, 18, 36, 37, 39, 40, 57, 61, 78, 80, 81, 82, 84, 87, 88, 91, 95, 146, 167, 169, 170, 175, 181, 191, 200, 201, 202, 204, 208, 211, 212, 214, 269, 270, 291, 311, 312, 317, 321, 324, 325, 375, 396, 405, 407, 415
Brow lift - Targets 19 landmarks across the left and right brows and the upper forehead. Lateral brow landmarks receive progressively stronger upward displacement (weighted 0.7 to 1.1), simulating the lateral brow peak that defines a youthful arch. Forehead landmarks get a gentler lift with a wider influence radius (1.2x) for smooth tissue redistribution. Uses a 25px influence radius.
Landmark indices: 70, 63, 105, 66, 107, 300, 293, 334, 296, 336, 9, 8, 10, 109, 67, 103, 338, 297, 332
Contributed by @Deepak8858 in #35.
Mentoplasty - Targets 8 landmarks on the chin tip, lower contour, and jaw angles. The chin tip (landmarks 152, 175) receives the strongest advancement, the lower contour follows with softer displacement at a tighter radius (0.8x), and the jaw angles get minimal pull (0.6x radius) for a natural transition. Uses a 25px influence radius.
Landmark indices: 148, 149, 150, 152, 171, 175, 176, 377
Contributed by @P-r-e-m-i-u-m in #36.
You can define custom procedures by specifying which landmarks to move, how far, and in what direction. See docs/tutorials/custom_procedures.md for a step-by-step guide.
LandmarkDiff is a five-stage pipeline. Each stage is independently testable and swappable.
graph TD
classDef stage fill:#2563eb,stroke:#1e40af,color:#fff,stroke-width:2px
classDef post fill:#1d4ed8,stroke:#1e3a8a,color:#fff,stroke-width:2px
classDef io fill:#0f172a,stroke:#334155,color:#fff,stroke-width:2px
A["📷 Input Photo (512x512)"]:::io
A --> B["MediaPipe Face Mesh: 478 landmarks"]:::stage
B --> C["Gaussian RBF Deformation: procedure-specific displacements"]:::stage
C --> D["Conditioning Generation: wireframe + Canny edges + mask"]:::stage
D --> E["ControlNet + Stable Diffusion 1.5: CrucibleAI model"]:::stage
E --> F["Post-Processing: CodeFormer · Real-ESRGAN · LAB matching · Laplacian blending"]:::post
F --> G["ArcFace Identity Verification"]:::post
G --> H["🎯 Output Prediction"]:::io
MediaPipe Face Mesh detects 478 facial landmarks in 3D (x, y, z normalized coordinates) at roughly 30 fps on CPU. The landmarks are grouped into anatomical regions:
| Region | Landmark count |
|---|---|
| Jawline | 33 |
| Left eye | 16 |
| Right eye | 16 |
| Left eyebrow | 10 |
| Right eyebrow | 10 |
| Nose | 25 |
| Lips | 22 |
| Left iris | 5 |
| Right iris | 5 |
| Face oval | 37 |
The extraction runs at the start of every prediction and again on the output for evaluation (NME metric).
Each procedure preset defines a set of DeformationHandle objects, each specifying:
- Which landmark to move (index into the 478-point mesh)
- How far to move it (pixel displacement vector, scaled by the intensity slider)
- How wide the influence is (Gaussian RBF radius in pixels)
The deformation is applied as a smooth, spatially weighted field. Landmarks near the handle move the most; landmarks far away are unaffected. This prevents the jarring discontinuities you get from simple point-to-point warping.
All displacement magnitudes are scaled by the intensity parameter (0 to 100), so you can preview subtle through aggressive versions of the same procedure.
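The handle mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the project's implementation: the function name `apply_rbf_handles`, the tuple layout of a handle, and the choice of `exp(-d^2 / (2 r^2))` as the Gaussian falloff (treating the radius as a standard deviation) are all assumptions made for the example.

```python
import numpy as np


def apply_rbf_handles(landmarks, handles, intensity):
    """Displace 2D landmarks with a sum of Gaussian-weighted handle offsets.

    landmarks: (N, 2) array of pixel coordinates.
    handles:   iterable of (landmark_index, (dx, dy), radius_px) tuples
               (hypothetical layout, chosen for this sketch).
    intensity: slider value in [0, 100]; displacements scale linearly with it.
    """
    pts = np.asarray(landmarks, dtype=np.float64)
    out = pts.copy()
    scale = intensity / 100.0
    for index, displacement, radius in handles:
        center = pts[index]
        # Squared distance from every landmark to the handle's rest position.
        d2 = np.sum((pts - center) ** 2, axis=1)
        # Assumed Gaussian falloff: 1.0 at the handle, near 0 a few radii away.
        weights = np.exp(-d2 / (2.0 * radius ** 2))
        out += weights[:, None] * (scale * np.asarray(displacement, dtype=np.float64))
    return out


# Toy example: three landmarks, one handle pulling landmark 0 upward by 10 px.
points = np.array([[100.0, 100.0], [110.0, 100.0], [400.0, 400.0]])
moved = apply_rbf_handles(points, [(0, (0.0, -10.0), 30.0)], intensity=60)
```

At 60% intensity the handle landmark moves 6 px, its close neighbour moves slightly less, and the distant landmark stays where it was, which is the smooth falloff the text describes.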
The deformed landmarks are rendered into conditioning images for ControlNet:
- Tessellation wireframe - The full 2556-edge MediaPipe face mesh drawn on a black canvas. This is the primary conditioning signal. It uses a static anatomical adjacency list (not Delaunay triangulation), so the topology is invariant to landmark displacement.
- Adaptive Canny edges - Edge detection with thresholds derived from the image median (low = 0.66 * median, high = 1.33 * median), which adapts to different skin tones without hardcoded thresholds. Morphological skeletonization then thins the result to the 1-pixel edges that ControlNet expects.
- Surgical mask - A feathered mask indicating where the procedure affects the face. Built from the convex hull of procedure-specific landmarks, dilated, Gaussian-feathered, then perturbed with Perlin-style boundary noise (2-4px) to prevent visible seam lines.
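The median-based threshold rule for the Canny step is simple enough to show directly. The sketch below only computes the two thresholds; the function name `adaptive_canny_thresholds`, the rounding, and the clamp to the 8-bit range are assumptions, and the edge detection and skeletonization themselves are left out.

```python
import numpy as np


def adaptive_canny_thresholds(gray):
    """Return (low, high) Canny thresholds derived from the image median."""
    median = float(np.median(gray))
    # Rounding and clamping to [0, 255] are assumptions for 8-bit input.
    low = int(round(max(0.0, 0.66 * median)))
    high = int(round(min(255.0, 1.33 * median)))
    return low, high


# A synthetic 8-bit grayscale image whose median is 100.
gray = np.full((4, 4), 100, dtype=np.uint8)
low, high = adaptive_canny_thresholds(gray)
```

The resulting pair can be passed as the two hysteresis thresholds of any Canny implementation. Because both scale with the median, a darker or lighter image shifts the thresholds with it.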
The conditioning images are fed to CrucibleAI's pre-trained ControlNet for MediaPipe Face, which conditions Stable Diffusion 1.5 to generate a face matching the deformed mesh topology. Procedure-specific text prompts emphasize clinical photography qualities (natural appearance, sharp focus, studio lighting).
Six-step refinement:
- CodeFormer neural face restoration (fidelity weight 0.7 for quality-fidelity balance)
- Real-ESRGAN background super-resolution (non-face regions only)
- Histogram matching in LAB color space for robust skin tone transfer from input to output
- Frequency-aware sharpening on the L channel only (avoids color fringing)
- Laplacian pyramid blending (6 levels) - low frequencies blend smoothly for lighting continuity, high frequencies transition sharply for texture/pore preservation
- ArcFace identity verification - flags if the output drifts too far from the input identity (cosine similarity threshold 0.6)
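To make the Laplacian pyramid step concrete, here is a small NumPy-only sketch of multi-band blending. It is an illustration under simplifying assumptions, not the project's code: it handles single-channel float images whose sides are divisible by 2^(levels-1), and it uses a 2x2 box filter with nearest-neighbour upsampling in place of a proper Gaussian kernel. All function names are hypothetical.

```python
import numpy as np


def downsample(img):
    """Halve resolution by averaging 2x2 blocks (stand-in for blur + decimate)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample(img):
    """Double resolution by nearest-neighbour repetition."""
    return img.repeat(2, axis=0).repeat(2, axis=1)


def laplacian_pyramid(img, levels):
    """Band-pass layers from fine to coarse, ending with the low-pass residual."""
    pyramid, current = [], img.astype(np.float64)
    for _ in range(levels - 1):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller))
        current = smaller
    pyramid.append(current)
    return pyramid


def mask_pyramid(mask, levels):
    """Progressively smoothed copies of the mask, one per pyramid level."""
    pyramid = [mask.astype(np.float64)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid


def blend(foreground, background, mask, levels=6):
    """Blend two images band by band, weighting each band by the mask."""
    fg = laplacian_pyramid(foreground, levels)
    bg = laplacian_pyramid(background, levels)
    weights = mask_pyramid(mask, levels)
    mixed = [w * f + (1.0 - w) * b for f, b, w in zip(fg, bg, weights)]
    # Collapse the pyramid from coarse to fine.
    result = mixed[-1]
    for layer in reversed(mixed[:-1]):
        result = upsample(result) + layer
    return result


rng = np.random.default_rng(0)
generated = rng.random((64, 64))
original = rng.random((64, 64))
mask = np.zeros((64, 64))
mask[16:48, 16:48] = 1.0
composite = blend(generated, original, mask)
```

Coarse levels see a heavily averaged mask, so low frequencies cross the boundary gradually, while the finest level sees the original mask and switches texture sharply.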
Each image shows five pipeline stages: Input | Landmarks | Conditioning | TPS Warp | Output. The first demo uses a real rhinoplasty test sample showing the full ControlNet pipeline with composited output.
Prerequisites: Python 3.10+ and PyTorch 2.1+ (install guide). GPU with 6GB+ VRAM recommended for neural modes; CPU works for TPS mode.
git clone https://github.com/dreamlessx/LandmarkDiff-public.git
cd LandmarkDiff-public
# Core (inference only)
pip install -e .
# With training dependencies
pip install -e ".[train]"
# With Gradio demo
pip install -e ".[app]"
# With evaluation metrics
pip install -e ".[eval]"
# With GPU optimizations (xformers, triton)
pip install -e ".[gpu]"
# Everything
pip install -e ".[train,eval,app,dev]"
python scripts/run_inference.py /path/to/face.jpg \
    --procedure rhinoplasty \
    --intensity 60 \
    --mode controlnet
This will:
- Detect the face and extract 478 landmarks
- Apply rhinoplasty deformation at 60% intensity
- Generate the ControlNet-conditioned prediction
- Composite the result back onto the original
- Save the output to output/result.png
python examples/tps_only.py /path/to/face.jpg \
    --procedure rhinoplasty \
    --intensity 60
TPS mode does pure geometric warping. It runs instantly on CPU and produces a geometrically accurate result, but without the photorealistic texture synthesis that the diffusion modes provide.
python examples/batch_inference.py /path/to/image_dir/ \
    --procedure blepharoplasty \
    --intensity 50 \
    --output output/batch/
LandmarkDiff supports four inference modes with different quality-speed-hardware tradeoffs:
| Mode | GPU Required | Speed | Quality | Identity Preservation |
|---|---|---|---|---|
| tps | No | Instant (~0.5s) | Geometric only | Perfect (pixel-level) |
| img2img | Yes (6GB) | ~5s | Good | Good |
| controlnet | Yes (6GB) | ~5s | Best | Good |
| controlnet_ip | Yes (8GB) | ~7s | Best | Best |
TPS mode - Thin-plate spline warping. No diffusion, no neural network inference. Just mathematically warps the pixels according to landmark displacements. Fast and deterministic, but the output looks like a geometric morph rather than a natural photo. Good for previewing the deformation before committing to a full diffusion run.
img2img mode - Standard Stable Diffusion img2img with the TPS-warped image as input and a feathered mask restricting generation to the surgical region. Faster than ControlNet but less controllable.
ControlNet mode - The primary mode. Uses CrucibleAI's pre-trained ControlNet for MediaPipe Face mesh conditioning. The deformed wireframe directly controls the spatial layout of the generated face, producing the most anatomically accurate results.
ControlNet + IP-Adapter mode - Adds IP-Adapter FaceID on top of ControlNet for stronger identity preservation. Uses face embeddings from the input photo to condition generation, reducing the chance of producing a different-looking person. Slightly slower due to the additional encoder pass.
from landmarkdiff.inference import LandmarkDiffPipeline
pipeline = LandmarkDiffPipeline(mode="controlnet", device="cuda")
pipeline.load()
result = pipeline.generate(
    image,
    procedure="rhinoplasty",
    intensity=60,
    num_inference_steps=30,
    guidance_scale=7.5,
    controlnet_conditioning_scale=1.0,
    strength=0.75,
    seed=42,
    postprocess=True,
)
# result dict contains:
# result["output"] - final composited image
# result["output_raw"] - raw diffusion output (before compositing)
# result["output_tps"] - TPS-only geometric warp
# result["conditioning"] - wireframe fed to ControlNet
# result["mask"] - surgical mask
# result["landmarks_original"] - input landmarks
# result["landmarks_manipulated"] - deformed landmarks
# result["identity_check"] - ArcFace similarity score
Try it online: huggingface.co/spaces/dreamlessx/LandmarkDiff (TPS mode, runs on CPU)
Or run locally:
python scripts/app.py
# Opens at http://localhost:7860
The demo has five tabs:
Single procedure - Upload a photo, pick a procedure, adjust intensity from 0-100%. The interface shows every intermediate step: extracted landmarks, deformed mesh, wireframe conditioning, surgical mask, TPS warp, and the final result in a side-by-side before/after view. Clinical flags (vitiligo, Bell's palsy with side selector, keloid-prone regions, Ehlers-Danlos) are available as checkboxes.
Multi-procedure comparison - Set independent intensity sliders for all six procedures and generate them all from the same photo. Useful for showing a patient their options side by side.
Intensity sweep - Pick a procedure and a number of steps (3 to 10). Generates a gallery progressing from 0% to 100% intensity so you can see exactly how the result changes with the intensity parameter.
Face analysis - Upload a photo and get back the detected Fitzpatrick skin type, face view classification (frontal, three-quarter, or profile), yaw and pitch angles in degrees, per-region landmark counts, confidence scores, and an annotated landmark visualization.
Multi-angle capture - Guides the user through capturing 5 standardized clinical views: frontal (0 degrees), left three-quarter (45 degrees), right three-quarter (45 degrees), left profile (90 degrees), right profile (90 degrees). Validates each photo against the expected yaw range and generates predictions for all views, producing a combined before/after gallery.
LandmarkDiff includes bilateral facial symmetry measurement as part of both the demo and the evaluation pipeline. The analysis works by reflecting left-side landmarks across the facial midline (computed from the forehead apex to the chin) and measuring their Euclidean distance to the corresponding right-side landmarks.
Five anatomical regions are scored independently:
| Region | Landmark pairs | What it captures |
|---|---|---|
| Eyes | 6 pairs | Palpebral fissure symmetry, canthal tilt |
| Brows | 5 pairs | Brow arch height and position |
| Cheeks | 4 pairs | Malar prominence, midface balance |
| Mouth | 5 pairs | Commissure position, lip symmetry |
| Jaw | 5 pairs | Mandibular contour, chin alignment |
Scores range from 0 to 100, where 90-100 indicates high symmetry, 70-89 mild asymmetry, and below 70 notable asymmetry. All distances are normalized by inter-ocular distance for scale invariance.
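The reflect-and-compare measurement can be sketched with NumPy as follows. This is only an illustration of the geometry: the function names are hypothetical, averaging the pair distances is an assumption, and the mapping from the normalized distance to the 0-100 score is not specified above, so it is not reproduced here.

```python
import numpy as np


def reflect_across_line(points, a, b):
    """Mirror 2D points across the line through a and b."""
    a = np.asarray(a, dtype=np.float64)
    direction = np.asarray(b, dtype=np.float64) - a
    direction = direction / np.linalg.norm(direction)
    rel = np.asarray(points, dtype=np.float64) - a
    # Keep the component along the line, flip the perpendicular component.
    along = rel @ direction
    return a + 2.0 * np.outer(along, direction) - rel


def asymmetry(left, right, forehead, chin, inter_ocular):
    """Mean distance between mirrored left landmarks and their right partners,
    in units of inter-ocular distance (0.0 means perfectly symmetric)."""
    mirrored = reflect_across_line(left, forehead, chin)
    distances = np.linalg.norm(mirrored - np.asarray(right, dtype=np.float64), axis=1)
    return float(distances.mean() / inter_ocular)


# Toy face with a vertical midline at x = 100 and one pair that is 2 px off.
left = [(60.0, 80.0), (70.0, 120.0)]
right = [(140.0, 80.0), (132.0, 120.0)]
score = asymmetry(left, right, forehead=(100.0, 0.0), chin=(100.0, 200.0), inter_ocular=80.0)
```

Here the first pair mirrors exactly and the second is 2 px off, so the mean error is 1 px, or 0.0125 inter-ocular distances.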
The demo's Symmetry Analysis tab offers two modes:
- Single photo: upload any face photo to get a per-region symmetry breakdown with a color-coded overlay (green/yellow/red).
- Pre vs. post comparison: upload before and after photos to see how a procedure changed the symmetry profile, with per-region deltas.
Symmetry scores are also computed automatically during inference runs and reported alongside the prediction output.
Training happens in two phases.
In Phase A, generate TPS-warped face pairs from FFHQ, then fine-tune ControlNet to reconstruct the original face from the deformed wireframe.
# 1. Download FFHQ samples
python scripts/download_ffhq.py --num 50000 --resolution 512
# 2. Generate training pairs (original + TPS-warped + wireframe)
python scripts/generate_synthetic_data.py \
    --input data/ffhq_samples/ \
    --output data/synthetic_pairs/ \
    --num 50000
# 3. Train ControlNet
python scripts/train_controlnet.py \
    --data_dir data/synthetic_pairs/ \
    --output_dir checkpoints/ \
    --num_train_steps 50000
Phase A uses diffusion loss only (MSE between predicted and target noise).
In Phase B, fine-tune further on clinical before/after pairs with the full four-term loss:
| Loss | Weight | Purpose |
|---|---|---|
| Diffusion (MSE) | 1.0 | Primary training signal |
| Landmark L2 | 0.1 | Anatomical accuracy (inside surgical mask only) |
| Identity (ArcFace) | 0.05 | Patient identity preservation |
| Perceptual (LPIPS) | 0.1 | Texture quality (outside mask, prevents penalizing the TPS warp) |
The landmark loss is normalized by inter-ocular distance (landmarks 33 vs 263) for scale invariance. The identity loss uses procedure-dependent face cropping - rhinoplasty crops to the upper face, blepharoplasty uses the full face, rhytidectomy crops above the jawline, and orthognathic disables identity loss entirely since jaw surgery inherently changes proportions.
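The weighting in the table, together with the orthognathic exception, amounts to a weighted sum. The sketch below assumes the four terms have already been computed as scalars (or tensors); the names `LOSS_WEIGHTS` and `total_loss` are illustrative and not taken from the training script.

```python
# Weights copied from the loss table above.
LOSS_WEIGHTS = {
    "diffusion": 1.0,
    "landmark": 0.1,
    "identity": 0.05,
    "perceptual": 0.1,
}


def total_loss(terms, procedure):
    """Weighted sum of per-term losses; identity is skipped for orthognathic."""
    total = 0.0
    for name, weight in LOSS_WEIGHTS.items():
        if name == "identity" and procedure == "orthognathic":
            continue
        total = total + weight * terms[name]
    return total


# Made-up per-term values, purely to show the arithmetic.
terms = {"diffusion": 0.20, "landmark": 0.50, "identity": 0.40, "perceptual": 0.30}
rhinoplasty_loss = total_loss(terms, "rhinoplasty")    # 0.20 + 0.05 + 0.02 + 0.03
orthognathic_loss = total_loss(terms, "orthognathic")  # identity term dropped
```

With these example values the rhinoplasty total is 0.30 and the orthognathic total is 0.28, the difference being the 0.05-weighted identity term.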
Default config at configs/training.yaml:
| Parameter | Value | Notes |
|---|---|---|
| Learning rate | 1e-5 | With cosine scheduler |
| Warmup steps | 500 | |
| Batch size | 4 | Gradient accumulation 4x, effective batch 16 |
| Mixed precision | bf16 | NOT fp16 - activation range exceeded |
| EMA decay | 0.9999 | |
| Checkpoint interval | 5000 steps | |
| ControlNet scale max | 1.2 | Sum > 1.2 causes saturation |
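For readers unfamiliar with warmup plus cosine scheduling, the learning-rate curve implied by the table might look like the sketch below. The linear warmup shape, decay to zero, and the 50,000-step horizon (borrowed from the Phase A command) are assumptions; the config itself only states the base rate, the warmup length, and that the scheduler is cosine.

```python
import math


def learning_rate(step, base_lr=1e-5, warmup_steps=500, total_steps=50_000):
    """Linear warmup to base_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))


# Batch size 4 with 4x gradient accumulation gives the effective batch of 16.
effective_batch = 4 * 4
peak = learning_rate(500)
final = learning_rate(50_000)
```

The rate climbs to 1e-5 over the first 500 steps and then falls smoothly toward zero by the final step.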
Important training safeguards:
- VAE is always frozen (gradient leak corrupts the latent space)
- GroupNorm instead of BatchNorm (batch size 4 makes BN unstable)
- TPS warps are precomputed to avoid CPU bottleneck during training
- Git LFS required for checkpoints
sbatch scripts/train_slurm.sh
See docs/GPU_TRAINING_GUIDE.md for detailed HPC setup, Apptainer containers, and multi-node configurations.
| Metric | What it measures | Target | How it's computed |
|---|---|---|---|
| FID | Realism | < 50 | Frechet Inception Distance via torch-fidelity (GPU-accelerated) |
| LPIPS | Perceptual similarity | < 0.15 | Learned Perceptual Image Patch Similarity (AlexNet backbone) |
| SSIM | Structural similarity | > 0.80 | Structural Similarity Index between input and output |
| NME | Landmark accuracy | < 0.05 | Normalized Mean Error - L2 distance between predicted and target landmarks, normalized by inter-ocular distance (landmarks 33 vs 263) |
| Identity Sim | Identity preservation | > 0.85 | ArcFace cosine similarity between input and output face embeddings (InsightFace buffalo_l, 512-dim) |
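Two of the metrics in the table reduce to short formulas, sketched here with NumPy. The function names are hypothetical, and normalizing by the target's (rather than the prediction's) inter-ocular distance is an assumption; the embeddings passed to `cosine_similarity` would come from a face recognition model, which is not included.

```python
import numpy as np

# Landmark indices used for inter-ocular normalization, as stated in the table.
EYE_A, EYE_B = 33, 263


def nme(predicted, target):
    """Mean landmark error divided by the target's inter-ocular distance."""
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    inter_ocular = np.linalg.norm(target[EYE_A] - target[EYE_B])
    errors = np.linalg.norm(predicted - target, axis=1)
    return float(errors.mean() / inter_ocular)


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy 478-point mesh: eyes 100 px apart, every prediction off by (3, 4) px.
target = np.zeros((478, 2))
target[EYE_A] = (100.0, 200.0)
target[EYE_B] = (200.0, 200.0)
predicted = target + np.array([3.0, 4.0])
error = nme(predicted, target)
```

Each landmark is 5 px off and the eyes are 100 px apart, so the NME is 0.05, exactly at the target threshold listed above.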
Every metric is broken down by Fitzpatrick skin type to ensure equitable performance. Skin type is classified automatically from the input photo using Individual Typology Angle (ITA):
ITA = arctan((L - 50) / b) * (180 / pi)
where L and b come from the LAB color space.
| ITA Range | Fitzpatrick Type | Description |
|---|---|---|
| > 55 | Type I | Very light |
| 41 to 55 | Type II | Light |
| 28 to 41 | Type III | Intermediate |
| 10 to 28 | Type IV | Tan |
| -30 to 10 | Type V | Brown |
| < -30 | Type VI | Dark |
This catches cases where the model might work well on lighter skin but degrade on darker skin (or vice versa). Results are reported per-type in evaluation output.
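The ITA formula and the range table translate into a few lines of Python. This sketch uses `atan2`, which matches `arctan((L - 50) / b)` whenever b is positive (assumed here) and avoids a division by zero; how values that fall exactly on a shared boundary are assigned is also an assumption, since the table lists each boundary in two rows.

```python
import math


def ita_degrees(L, b):
    """Individual Typology Angle from LAB lightness L and the b channel."""
    return math.degrees(math.atan2(L - 50.0, b))


def fitzpatrick_type(ita):
    """Map an ITA value to a Fitzpatrick type using the ranges in the table."""
    if ita > 55:
        return "I"
    if ita > 41:
        return "II"
    if ita > 28:
        return "III"
    if ita > 10:
        return "IV"
    if ita >= -30:
        return "V"
    return "VI"


# Example LAB values (illustrative, not measured data).
light = fitzpatrick_type(ita_degrees(70.0, 15.0))
dark = fitzpatrick_type(ita_degrees(30.0, 10.0))
```

For L = 70 and b = 15 the angle is about 53 degrees, which falls in the Type II range; for L = 30 and b = 10 it is about -63 degrees, which falls in Type VI.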
python scripts/evaluate.py \
    --pred_dir output/predictions/ \
    --target_dir data/targets/ \
    --output eval_results.json
The evaluation harness computes all metrics, stratifies by Fitzpatrick type and by procedure, and writes a JSON report.
Evaluated on the HDA plastic surgery database (67 before/after pairs, 4 procedures). All LandmarkDiff results are mean over 5 random seeds.
| Method | LPIPS ↓ | NME ↓ | SSIM ↑ | ArcFace ↑ |
|---|---|---|---|---|
| Rhinoplasty (n=21) | ||||
| TPS-only | 0.357 | 0.000 | 0.574 | — |
| SD1.5 Img2Img | 0.379 | 0.034 | 0.539 | 0.568 |
| LandmarkDiff | 0.380 | 0.043 | 0.533 | 0.607 |
| Blepharoplasty (n=27) | ||||
| TPS-only | 0.373 | 0.000 | 0.515 | — |
| SD1.5 Img2Img | 0.386 | 0.037 | 0.481 | 0.635 |
| LandmarkDiff | 0.388 | 0.047 | 0.474 | 0.670 |
| Rhytidectomy (n=9) | ||||
| TPS-only | 0.320 | 0.000 | 0.586 | — |
| SD1.5 Img2Img | 0.354 | 0.017 | 0.563 | 0.437 |
| LandmarkDiff | 0.369 | 0.048 | 0.540 | 0.360 |
| Orthognathic (n=10) | ||||
| TPS-only | 0.393 | 0.000 | 0.548 | — |
| SD1.5 Img2Img | 0.395 | 0.039 | 0.521 | 0.544 |
| LandmarkDiff | 0.399 | 0.055 | 0.511 | 0.568 |
Key finding: LandmarkDiff achieves the highest ArcFace identity similarity for 3 of 4 procedures while producing procedure-specific geometric changes. TPS-only achieves the best LPIPS/SSIM (it directly fits target landmarks), but produces visible warping artifacts. SD1.5 Img2Img applies no geometric deformation.
| Skin Type | LPIPS ↓ | SSIM ↑ | NME ↓ | ArcFace ↑ | n |
|---|---|---|---|---|---|
| Type I/II | 0.439 | 0.478 | 0.059 | 0.510 | 18 |
| Type III | 0.422 | 0.469 | 0.052 | 0.541 | 13 |
| Type IV | 0.463 | 0.440 | 0.049 | 0.470 | 17 |
| Type V/VI | 0.410 | 0.531 | 0.037 | 0.568 | 19 |
Darker skin tones show no performance penalty in this evaluation: Type V/VI (n=19) achieves the best scores on all four metrics despite training data that under-represents these groups. Type IV (n=17) scores lowest on LPIPS, SSIM, and ArcFace.
LandmarkDiff handles four clinical conditions that affect how deformations should be applied or how the mask should behave.
Vitiligo causes depigmented patches on the skin that should be preserved, not blended over. LandmarkDiff detects vitiligo patches using LAB luminance thresholding (high L, low saturation), filters by minimum area (200 px squared), and reduces mask intensity over detected patches by a preservation factor of 0.3. This means the surgical region is still modified, but depigmented areas are largely left alone.
Bell's palsy causes unilateral facial paralysis. Deforming the paralyzed side produces unrealistic results because the tissue doesn't respond to surgery the same way. LandmarkDiff takes the affected side (left or right) as input and disables all deformation handles on that side. The bilateral landmark groups (eye, eyebrow, mouth corner, jawline) for the affected side are excluded from manipulation.
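As a rough illustration of the side exclusion, the sketch below drops every handle whose landmark belongs to the affected side. The `Handle` class, the `SIDE_LANDMARKS` table, and the index values are all hypothetical placeholders; they are not the library's `DeformationHandle` or real MediaPipe landmark ids.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Handle:
    """Hypothetical stand-in for a deformation handle."""
    landmark_index: int
    dx: float
    dy: float


# Placeholder indices only, NOT real MediaPipe landmark ids.
SIDE_LANDMARKS = {
    "left": {1, 2, 3},
    "right": {11, 12, 13},
}


def drop_affected_side(handles, affected_side):
    """Return only the handles that do not sit on the paralyzed side."""
    if affected_side not in SIDE_LANDMARKS:
        raise ValueError(f"affected_side must be 'left' or 'right', got {affected_side!r}")
    excluded = SIDE_LANDMARKS[affected_side]
    return [h for h in handles if h.landmark_index not in excluded]


handles = [Handle(1, 2.0, 0.0), Handle(11, -2.0, 0.0), Handle(20, 0.0, 1.0)]
kept = drop_affected_side(handles, "left")
kept_indices = [h.landmark_index for h in kept]
```

Midline handles (index 20 in this toy example) are untouched, so the procedure still applies to the unaffected side and the center of the face.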
Keloid-prone patients develop raised scars at incision sites. LandmarkDiff identifies keloid-prone regions (specified by anatomical zone, e.g., "jawline", "nose"), creates exclusion masks with margins, and reduces mask intensity by a factor of 0.5 with additional Gaussian blur (sigma 10.0) for softer transitions. This prevents sharp compositing boundaries that would suggest incision lines.
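One way to read the keloid adjustment is as a blurred attenuation map: the mask is pulled toward 50% of its value inside the keloid-prone zone, and the Gaussian blur spreads that reduction smoothly across the zone boundary. The sketch below assumes exactly that combination, which the text does not spell out, and uses a NumPy-only blur so it runs without an image library. The function names and the zone coordinates are hypothetical.

```python
import numpy as np


def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge padding (NumPy only)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, "valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, "valid")


def soften_keloid_zone(mask, zone, factor=0.5, sigma=10.0):
    """Attenuate `mask` toward `factor` inside `zone`, with a blurred falloff.

    Assumption: the blurred zone acts as a per-pixel attenuation weight.
    """
    weight = gaussian_blur(zone.astype(float), sigma)  # 1 inside, ~0 far away
    return mask * (1.0 - (1.0 - factor) * weight)


mask = np.ones((128, 128))
zone = np.zeros((128, 128), dtype=bool)
zone[32:96, 32:96] = True  # hypothetical keloid-prone region
softened = soften_keloid_zone(mask, zone)
```

Deep inside the zone the mask sits at half strength, far outside it is unchanged, and along the zone edge it ramps between the two instead of stepping.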
Ehlers-Danlos causes tissue hypermobility: the skin stretches more than typical. LandmarkDiff multiplies the Gaussian RBF influence radius by 1.5 for Ehlers-Danlos patients, producing wider, more gradual deformations that reflect how hypermobile tissue actually responds to surgical manipulation.
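To make the effect of the 1.5x radius concrete, here is a minimal single-handle Gaussian RBF sketch. The weight formula `exp(-d^2 / (2 r^2))`, which treats the influence radius as the Gaussian's standard deviation, is an assumption, and `rbf_displace` is a hypothetical name rather than a library function.

```python
import numpy as np


def rbf_displace(landmarks, center, displacement, radius, ehlers_danlos=False):
    """Shift landmarks by `displacement`, weighted by a Gaussian of distance.

    Assumes weight = exp(-d^2 / (2 * radius^2)); the real kernel may differ.
    """
    if ehlers_danlos:
        radius = radius * 1.5  # wider influence for hypermobile tissue
    landmarks = np.asarray(landmarks, dtype=float)
    dist_sq = np.sum((landmarks - np.asarray(center, dtype=float)) ** 2, axis=1)
    weights = np.exp(-dist_sq / (2.0 * radius ** 2))
    return landmarks + weights[:, None] * np.asarray(displacement, dtype=float)


points = np.array([[0.0, 0.0], [30.0, 0.0]])  # one point on the handle, one 30 px away
typical = rbf_displace(points, (0.0, 0.0), (0.0, -4.0), radius=20.0)
hypermobile = rbf_displace(points, (0.0, 0.0), (0.0, -4.0), radius=20.0, ehlers_danlos=True)
```

The landmark at the handle center moves by the full displacement in both cases, while the landmark 30 px away moves further with the widened radius, giving the broader and more gradual deformation described above.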
from landmarkdiff.clinical import ClinicalFlags

flags = ClinicalFlags(
    vitiligo=True,
    bells_palsy=True,
    bells_palsy_side="left",
    keloid_prone=True,
    keloid_regions=["jawline", "nose"],
    ehlers_danlos=False,
)

# Assumes `pipeline` (a LandmarkDiffPipeline) and `image` were created earlier.
result = pipeline.generate(
    image,
    procedure="rhinoplasty",
    intensity=60,
    clinical_flags=flags,
)

In the Gradio demo, these are checkboxes and dropdowns in Tab 1.
The raw diffusion output needs refinement before it looks right. The post-processing pipeline runs six steps:
CodeFormer: neural face restoration that fixes small artifacts, enhances detail, and sharpens facial features. Uses a fidelity weight of 0.7 (range 0.0 to 1.0) to balance quality enhancement against faithfulness to the diffusion output. Falls back to GFPGAN if CodeFormer is unavailable.
Real-ESRGAN: super-resolution applied only to non-face regions (background, hair, clothing). Prevents the background from looking noticeably lower quality than the restored face.
CDF histogram matching in LAB color space transfers the input photo's skin tone to the generated output. LAB matching is more robust than RGB for this because it separates luminance from color, preventing brightness shifts.
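CDF histogram matching on a single channel can be sketched in a few lines of NumPy. The sketch assumes the image has already been converted to LAB elsewhere and that the function is applied to each channel separately; `match_channel` is a hypothetical name, not the project's function.

```python
import numpy as np


def match_channel(source, reference):
    """Remap `source` values so their CDF follows the CDF of `reference`."""
    s_vals, s_idx, s_counts = np.unique(
        source.ravel(), return_inverse=True, return_counts=True
    )
    r_vals, r_counts = np.unique(reference.ravel(), return_counts=True)
    s_cdf = np.cumsum(s_counts) / source.size
    r_cdf = np.cumsum(r_counts) / reference.size
    # For each source quantile, look up the reference value at that quantile.
    mapped = np.interp(s_cdf, r_cdf, r_vals)
    return mapped[s_idx].reshape(source.shape)


generated = np.array([[0.0, 1.0], [2.0, 3.0]])      # toy channel from the output
original = np.array([[10.0, 20.0], [30.0, 40.0]])   # toy channel from the input photo
matched = match_channel(generated, original)
```

In this toy case the four ranked values of the generated channel are mapped onto the four ranked values of the original, so the output takes on the input photo's value distribution while keeping the generated pixel ordering.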
Unsharp masking applied to the L channel only (luminance) with a default strength of 0.25. Sharpening only luminance avoids the color fringing artifacts you get from sharpening RGB channels directly.
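A minimal luminance-only unsharp mask might look like the sketch below. It assumes the common form `L + strength * (L - blur(L))`, an L range of 0 to 100, and a blur sigma of 2.0; only the strength of 0.25 comes from the text, and `unsharp_luminance` is a hypothetical name.

```python
import numpy as np


def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge padding (NumPy only)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, "valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, "valid")


def unsharp_luminance(lab, strength=0.25, sigma=2.0):
    """Sharpen only the L channel of a LAB image; a and b pass through unchanged.

    sigma and the [0, 100] clipping range are assumptions for this sketch.
    """
    out = lab.astype(float)  # astype returns a copy
    lum = out[..., 0]
    detail = lum - gaussian_blur(lum, sigma)
    out[..., 0] = np.clip(lum + strength * detail, 0.0, 100.0)
    return out


lab = np.zeros((32, 32, 3))
lab[:, :16, 0] = 40.0   # darker half
lab[:, 16:, 0] = 60.0   # brighter half
lab[..., 1], lab[..., 2] = 5.0, 7.0
sharp = unsharp_luminance(lab)
```

The step edge gains a small overshoot on both sides, flat regions stay as they were, and the color channels are untouched, which is what avoids color fringing.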
The compositing step blends the generated face into the original photo. It uses a 6-level Laplacian pyramid in which low-frequency levels blend smoothly (lighting and color continuity) while high-frequency levels transition sharply (texture and pore detail). This prevents the color halos and "pasted on" look that simple alpha blending produces.
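The idea behind pyramid blending can be shown with a simplified, NumPy-only sketch. It uses 2x2 averaging and nearest-neighbor upsampling in place of the Gaussian filtering a production implementation would use, works on a single channel, and requires image sides divisible by 32 for six levels. All function names are hypothetical.

```python
import numpy as np


def down(img):
    """Halve resolution by averaging 2x2 blocks (stand-in for a Gaussian pyrDown)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def up(img):
    """Double resolution by pixel repetition (stand-in for a Gaussian pyrUp)."""
    return img.repeat(2, axis=0).repeat(2, axis=1)


def laplacian_pyramid(img, levels):
    """Band-pass detail at each scale, plus the coarsest low-pass image last."""
    pyramid, current = [], img
    for _ in range(levels - 1):
        smaller = down(current)
        pyramid.append(current - up(smaller))
        current = smaller
    pyramid.append(current)
    return pyramid


def mask_pyramid(mask, levels):
    """The blend mask at every scale; coarser levels get softer edges."""
    pyramid = [mask]
    for _ in range(levels - 1):
        pyramid.append(down(pyramid[-1]))
    return pyramid


def pyramid_blend(generated, original, mask, levels=6):
    """Blend each frequency band with the mask at that scale, then collapse."""
    bands_g = laplacian_pyramid(generated, levels)
    bands_o = laplacian_pyramid(original, levels)
    masks = mask_pyramid(mask, levels)
    blended = [m * g + (1.0 - m) * o for g, o, m in zip(bands_g, bands_o, masks)]
    out = blended[-1]
    for band in reversed(blended[:-1]):
        out = up(out) + band
    return out


rng = np.random.default_rng(0)
generated = rng.random((64, 64))
original = rng.random((64, 64))
half_mask = np.zeros((64, 64))
half_mask[:, 32:] = 1.0
result = pyramid_blend(generated, original, half_mask)
```

Because the mask is downsampled along with the images, coarse levels see a softened mask edge while the finest level sees the sharp one, which is the smooth-low, sharp-high behavior described above.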
Final sanity check. Extracts ArcFace embeddings from the input and output, computes their cosine similarity, and flags the result if the similarity falls below 0.6. This catches cases where the diffusion model drifted too far from the patient's appearance.
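The thresholding part of the identity check is simple enough to sketch without any model. The example assumes the embeddings have already been extracted and uses toy 3-dimensional vectors; `identity_check` is a hypothetical helper, not the library's API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def identity_check(embedding_in, embedding_out, threshold=0.6):
    """Flag the result when similarity falls below the threshold."""
    score = cosine_similarity(embedding_in, embedding_out)
    return {"similarity": score, "flagged": score < threshold}


same_person = identity_check([1.0, 0.0, 0.0], [1.0, 0.0, 0.0])
drifted = identity_check([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

Identical embeddings score 1.0 and pass, while orthogonal ones score 0.0 and are flagged for review.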
landmarkdiff/                   # Core library
  landmarks.py                  # MediaPipe 478-point face mesh extraction
                                # FaceLandmarks dataclass, extract_landmarks(),
                                # render_landmark_image(), LANDMARK_REGIONS dict
  conditioning.py               # ControlNet conditioning generation
                                # Tessellation wireframe (2556 edges), adaptive
                                # Canny edge detection, generate_conditioning()
  manipulation.py               # Gaussian RBF landmark deformation
                                # DeformationHandle, PROCEDURE_LANDMARKS,
                                # apply_procedure_preset(), clinical modifiers
  masking.py                    # Feathered surgical mask generation
                                # Convex hull + dilation + Gaussian feather +
                                # Perlin boundary noise, clinical adjustments
  inference.py                  # Full pipeline (4 modes: tps/img2img/controlnet/
                                # controlnet_ip), LandmarkDiffPipeline class,
                                # face view estimation, procedure-specific prompts
  losses.py                     # Combined loss (diffusion + landmark + identity
                                # + perceptual), phase A/B control,
                                # procedure-dependent identity cropping
  evaluation.py                 # Metrics (FID, LPIPS, SSIM, NME, Identity Sim),
                                # Fitzpatrick ITA classification, per-type and
                                # per-procedure stratification
  clinical.py                   # Clinical edge cases: ClinicalFlags dataclass,
                                # vitiligo patch detection, Bell's palsy side
                                # exclusion, keloid mask adjustment, Ehlers-Danlos
  postprocess.py                # Neural + classical post-processing: CodeFormer,
                                # GFPGAN, Real-ESRGAN, LAB histogram matching,
                                # Laplacian pyramid blend, ArcFace verification
  synthetic/
    pair_generator.py           # Training pair generation pipeline
    tps_warp.py                 # Thin-plate spline warping with rigid regions
                                # (teeth, sclera), smart control point subsampling
                                # (max 80 from 478), batched evaluation
    augmentation.py             # Clinical photography augmentations
scripts/                        # CLI tools
  app.py                        # Gradio web demo (5 tabs)
  run_inference.py              # Single image inference
  train_controlnet.py           # ControlNet fine-tuning
  evaluate.py                   # Automated evaluation harness
  demo.py                       # CLI demo with visualizations
  download_ffhq.py              # FFHQ face image downloader
  generate_synthetic_data.py    # Synthetic training pair generator
  train_slurm.sh                # SLURM job script (single GPU)
  train_slurm_v2.sh             # SLURM job script (multi-GPU)
  gen_synthetic_slurm.sh        # SLURM job for data generation
examples/                       # Runnable example scripts
  basic_inference.py            # Single image with GPU fallback to TPS
  batch_inference.py            # Process a directory of images
  tps_only.py                   # CPU-only TPS warp (no GPU)
  compare_procedures.py         # Side-by-side all procedures grid
  custom_procedure.py           # Define a lip augmentation procedure
  landmark_visualization.py     # Visualize mesh with displacement arrows
benchmarks/                     # Performance benchmarks
  benchmark_inference.py        # Inference speed across hardware
  benchmark_landmarks.py        # Landmark extraction throughput
  benchmark_training.py         # Training steps/hour
configs/                        # Training configuration
  training.yaml                 # Default hyperparameters, loss weights, safeguards
paper/                          # MICCAI 2026 manuscript (Springer LNCS)
docs/                           # Documentation
  tutorials/                    # quickstart, custom_procedures, training,
                                # evaluation, deployment
  api/                          # Per-module API reference (landmarks,
                                # manipulation, conditioning, inference,
                                # evaluation, clinical)
  GPU_TRAINING_GUIDE.md         # HPC setup, Apptainer, SLURM
containers/                     # Apptainer/Singularity container definitions
tests/                          # Unit tests (9 test modules)
demos/                          # Curated sample output images
The training config controls all hyperparameters, loss weights, and safeguards. Key sections:
model:
  controlnet: CrucibleAI/ControlNetMediaPipeFace
  base_model: runwayml/stable-diffusion-v1-5
training:
  learning_rate: 1.0e-5
  lr_scheduler: cosine
  warmup_steps: 500
  batch_size: 4
  gradient_accumulation_steps: 4 # effective batch = 16
  num_train_steps: 10000
  mixed_precision: bf16
  ema_decay: 0.9999
loss_weights: # Phase B only
  diffusion: 1.0
  landmark: 0.1
  identity: 0.05
  perceptual: 0.1

| Parameter | Default | Range | Effect |
|---|---|---|---|
| intensity | 60 | 0 - 100 | How aggressive the deformation is (percentage) |
| num_inference_steps | 30 | 10 - 100 | Diffusion denoising steps (more = higher quality, slower) |
| guidance_scale | 7.5 | 1.0 - 20.0 | Classifier-free guidance strength |
| controlnet_conditioning_scale | 1.0 | 0.0 - 1.2 | How strongly the wireframe controls generation. Max 1.2 to avoid saturation |
| strength | 0.75 | 0.0 - 1.0 | img2img denoising strength |
| seed | None | any int | For reproducible results |
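When wrapping the pipeline in your own scripts, it can help to validate inputs against the documented ranges before starting a run. The helper below is a hypothetical sketch: `resolve_params`, `PARAM_RANGES`, and `DEFAULTS` are not part of the library, and only the defaults and ranges themselves come from the parameter table.

```python
# Ranges and defaults copied from the parameter table; the names are hypothetical.
PARAM_RANGES = {
    "intensity": (0, 100),
    "num_inference_steps": (10, 100),
    "guidance_scale": (1.0, 20.0),
    "controlnet_conditioning_scale": (0.0, 1.2),
    "strength": (0.0, 1.0),
}

DEFAULTS = {
    "intensity": 60,
    "num_inference_steps": 30,
    "guidance_scale": 7.5,
    "controlnet_conditioning_scale": 1.0,
    "strength": 0.75,
    "seed": None,
}


def resolve_params(**overrides):
    """Merge overrides into the defaults, rejecting out-of-range values."""
    params = dict(DEFAULTS)
    for name, value in overrides.items():
        if name == "seed":
            params["seed"] = value  # any int (or None) is accepted
            continue
        if name not in PARAM_RANGES:
            raise KeyError(f"unknown parameter: {name}")
        low, high = PARAM_RANGES[name]
        if not low <= value <= high:
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
        params[name] = value
    return params


params = resolve_params(intensity=80, seed=42)
```

Rejecting out-of-range values up front, rather than clamping silently, makes a mistyped `controlnet_conditioning_scale` obvious before any GPU time is spent.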
| Hardware | Mode | Time per image |
|---|---|---|
| A100 80GB | ControlNet (30 steps) | ~3 sec |
| A100 40GB | ControlNet (30 steps) | ~4 sec |
| RTX 4090 | ControlNet (30 steps) | ~5 sec |
| RTX 3090 | ControlNet (30 steps) | ~7 sec |
| T4 16GB | ControlNet (30 steps) | ~15 sec |
| M3 Pro (MPS) | ControlNet (30 steps) | ~45 sec |
| Any CPU | TPS only | ~0.5 sec |
| Hardware | Batch size | Grad accum | Effective batch | Steps/hour |
|---|---|---|---|---|
| A100 80GB | 4 | 4 | 16 | ~600 |
| A100 40GB | 2 | 8 | 16 | ~400 |
| RTX 4090 | 2 | 8 | 16 | ~350 |
| RTX 3090 | 1 | 16 | 16 | ~200 |
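As a back-of-the-envelope check, the throughput figures can be combined with the default `num_train_steps: 10000` from the training config to estimate wall-clock time. The sketch assumes the table's steps/hour count the same steps as `num_train_steps` and that throughput stays constant, so treat the results as rough estimates.

```python
NUM_TRAIN_STEPS = 10_000  # num_train_steps from the training config

# (batch size, gradient accumulation, approximate steps/hour) from the table.
HARDWARE = {
    "A100 80GB": (4, 4, 600),
    "A100 40GB": (2, 8, 400),
    "RTX 4090": (2, 8, 350),
    "RTX 3090": (1, 16, 200),
}


def estimate(gpu, num_steps=NUM_TRAIN_STEPS):
    """Effective batch size and approximate hours for a full training run."""
    batch, accum, steps_per_hour = HARDWARE[gpu]
    return {"effective_batch": batch * accum, "hours": num_steps / steps_per_hour}


for gpu in HARDWARE:
    info = estimate(gpu)
    print(f"{gpu}: effective batch {info['effective_batch']}, ~{info['hours']:.1f} h")
```

Every configuration reaches the same effective batch of 16; only the time to finish differs, from roughly 17 hours on an A100 80GB to about 50 hours on an RTX 3090.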
| Component | VRAM |
|---|---|
| SD 1.5 (FP16) | ~2.5 GB |
| ControlNet (FP16) | ~1.5 GB |
| VAE (FP32) | ~0.5 GB |
| CodeFormer | ~0.4 GB |
| ArcFace | ~0.3 GB |
| Total inference | ~5.2 GB |
| Total training | ~25 GB |
Run benchmarks yourself:
python benchmarks/benchmark_inference.py --device cuda --num_images 100
python benchmarks/benchmark_landmarks.py --num_images 1000
python benchmarks/benchmark_training.py --device cuda --num_steps 100

See MODEL_ZOO.md for the full list of required and optional models.
Base models (auto-downloaded on first run):
- runwayml/stable-diffusion-v1-5 - ~4 GB
- CrucibleAI/ControlNetMediaPipeFace - ~1.4 GB
- MediaPipe Face Mesh - ~5 MB
Post-processing models (optional, auto-downloaded):
- CodeFormer - ~400 MB
- GFPGAN v1.4 - ~350 MB
- Real-ESRGAN x4 - ~64 MB
- ArcFace (InsightFace buffalo_l) - ~250 MB
- Python 3.10+
- PyTorch 2.1+ with CUDA (or MPS on Apple Silicon)
- ~6 GB VRAM for inference (SD 1.5 + ControlNet)
- ~25 GB VRAM for training (A100 40GB minimum, 80GB recommended)
- MediaPipe 0.10.9+
- diffusers 0.27.0+, transformers 4.38.0+
Full dependency list in pyproject.toml.
# CPU-only demo (TPS mode, no GPU required)
docker build -t landmarkdiff:cpu -f Dockerfile.cpu .
docker run -p 7860:7860 landmarkdiff:cpu
# GPU-accelerated demo (ControlNet inference)
docker build -t landmarkdiff:gpu -f Dockerfile.gpu .
docker run --gpus all -p 7860:7860 landmarkdiff:gpu

Or with Docker Compose:
docker compose up app # CPU demo on :7860
docker compose up gpu # GPU demo on :7861
docker compose --profile training run train # training (GPU)

GPU passthrough requires NVIDIA Container Toolkit. See docs/docker-gpu.md for prerequisites, VRAM requirements by GPU tier, and troubleshooting.
For HPC environments using Apptainer/Singularity, see containers/.
make help # show all commands
make install # install (inference only)
make install-dev # install with dev tools
make install-train # install with training deps
make install-app # install with Gradio
make install-all # install everything
make test # run full test suite
make test-fast # run tests excluding slow ones
make lint # run ruff linter
make format # auto-format code
make type-check # run mypy
make check # lint + format + type-check
make demo # launch Gradio demo
make inference # run single inference
make train # train ControlNet
make evaluate # run evaluation
make docker # build Docker image
make paper # build MICCAI paper PDF
make clean # remove build artifacts

Install pre-commit hooks to run linting and formatting automatically before commits:
pip install pre-commit
pre-commit install

Run pre-commit manually:
pre-commit run --all-files

See docs/ROADMAP.md for the detailed roadmap with full milestone descriptions.
- Core pipeline: landmark extraction, RBF deformation, ControlNet conditioning, mask compositing
- 6 procedure presets (rhinoplasty, blepharoplasty, rhytidectomy, orthognathic, brow lift, mentoplasty)
- Synthetic training pair generation via TPS warps
- Clinical edge case handling (vitiligo, Bell's palsy, keloid, Ehlers-Danlos)
- Neural post-processing (CodeFormer, Real-ESRGAN, ArcFace identity verification)
- Gradio demo with multi-angle capture
- Fitzpatrick-stratified evaluation protocol
- Docker and Apptainer container support
- Hugging Face Spaces interactive demo (live)
- Data-driven displacement model fitted from real surgical pairs
- Quantitative evaluation on HDA dataset (67 pairs, 4 procedures)
- Fitzpatrick-stratified fairness results
- arXiv preprint
- ControlNet fine-tuning on 50K+ synthetic pairs (Phase A)
- Combined loss training on clinical pairs (Phase B)
- Additional procedure presets (otoplasty, genioplasty)
- Anatomically constrained displacement sampling with per-procedure variance
- Phone video capture: rotate head, reconstruct full 3D face from frames
- FLAME 3D morphable model fitting from monocular video
- FLUX.1-dev or SDXL backbone upgrade (higher quality generation at 1024x1024)
- IP-Adapter FaceID v2 for stronger identity preservation
- 3D surgical deformation: procedure-specific warps in 3D space
- Interactive 3D preview: rotate the predicted result from any angle
- Mobile-optimized capture and preview workflow
- IRB-approved prospective clinical validation study
- Multi-view consistency loss across frontal/profile predictions
- Physics-informed tissue simulation (FEM for soft tissue response)
- Mobile capture app with guided head-rotation scan
- Cloud deployment with Triton inference server
- arXiv preprint (March 2026)
- MICCAI 2026 (submission July 2026)
- Full conference paper (CVPR/NeurIPS 2027)
If you use LandmarkDiff in your research, please cite it. Machine-readable citation metadata is available in CITATION.cff.
@software{landmarkdiff2026,
  title = {LandmarkDiff: Anatomically-Conditioned Facial Surgery Outcome Prediction},
  author = {dreamlessx},
  year = {2026},
  url = {https://github.com/dreamlessx/LandmarkDiff-public},
  version = {0.2.2}
}

We track all contributions, and contributors will be acknowledged in the MICCAI 2026 paper. Significant contributions earn co-authorship.
| Contribution Level | Recognition |
|---|---|
| Bug fix or typo | Listed in CONTRIBUTORS.md |
| New procedure preset | Acknowledged in paper and README |
| Feature module (new loss, metric, clinical handler) | Co-author on paper |
| Clinical validation data | Co-author on paper |
| Sustained multi-feature contributions | Co-author on paper |
| GitHub Handle | Contribution |
|---|---|
| @dreamlessx | Core architecture, training pipeline, paper |
| @Deepak8858 | Brow lift procedure preset (#35) |
| @P-r-e-m-i-u-m | Mentoplasty procedure preset (#36) |
| @PredictiveManish | OpenAPI spec (#340), batch notebook (#334), pre-commit + mypy (#386, #405) |
| @lshariprasad | histogram_match_skin tests (#263) |
| @dagangtj | SafetyResult dataclass, file logging, API error messages (#235, #236, #237) |
| @srikar117 | Dark mode toggle for Gradio app (#387) |
| @Flames4fun | ONNX TPS warp export (#404), ONNX inference backend (#408), TPS landmark runtime wrapper (#411) |
| @passionworkeer | Docker architecture feedback (#5) |
To join this list, open a PR or contribute to an issue. See CONTRIBUTING.md for guidelines.
Contributions welcome. See CONTRIBUTING.md for the full guide, including development setup, coding style, testing requirements, and how to add new procedures.
For bug reports and feature requests, use the issue templates.
For questions and general discussion, visit GitHub Discussions.
For major changes, please open an issue first to discuss the proposed approach.
MIT License. See LICENSE for details.
- CrucibleAI for the MediaPipe Face ControlNet
- MediaPipe for the 478-point face mesh
- Stable Diffusion and ControlNet for the diffusion backbone
- CodeFormer and Real-ESRGAN for face restoration
- InsightFace for ArcFace identity verification
- FFHQ for training data
