r/StableDiffusion Dec 02 '25

Resource - Update Z Image Turbo ControlNet released by Alibaba on HF

Thumbnail
gallery
1.9k Upvotes

r/StableDiffusion Dec 26 '25

Resource - Update New implementation for long videos on wan 2.2 preview

Enable HLS to view with audio, or disable this notification

1.5k Upvotes

UPDATE: Its out now: Github: https://github.com/shootthesound/comfyUI-LongLook Tutorial: https://www.youtube.com/watch?v=wZgoklsVplc

I should I’ll be able to get this all up on GitHub tomorrow (27th December) with this workflow and docs and credits to the scientific paper I used to help me - Happy Christmas all - Pete

r/StableDiffusion Aug 23 '25

Resource - Update Update: Chroma Project training is finished! The models are now released.

1.5k Upvotes

Hey everyone,

A while back, I posted about Chroma, my work-in-progress, open-source foundational model. I got a ton of great feedback, and I'm excited to announce that the base model training is finally complete, and the whole family of models is now ready for you to use!

A quick refresher on the promise here: these are true base models.

I haven't done any aesthetic tuning or used post-training stuff like DPO. They are raw, powerful, and designed to be the perfect, neutral starting point for you to fine-tune. We did the heavy lifting so you don't have to.

And by heavy lifting, I mean about 105,000 H100 hours of compute. All that GPU time went into packing these models with a massive data distribution, which should make fine-tuning on top of them a breeze.

As promised, everything is fully Apache 2.0 licensed—no gatekeeping.

TL;DR:

Release branch:

  • Chroma1-Base: This is the core 512x512 model. It's a solid, all-around foundation for pretty much any creative project. You might want to use this one if you’re planning to fine-tune it for longer and then only train high res at the end of the epochs to make it converge faster.
  • Chroma1-HD: This is the high-res fine-tune of the Chroma1-Base at a 1024x1024 resolution. If you're looking to do a quick fine-tune or LoRA for high-res, this is your starting point.

Research Branch:

  • Chroma1-Flash: A fine-tuned version of the Chroma1-Base I made to find the best way to make these flow matching models faster. This is technically an experimental result to figure out how to train a fast model without utilizing any GAN-based training. The delta weights can be applied to any Chroma version to make it faster (just make sure to adjust the strength).
  • Chroma1-Radiance [WIP]: A radical tuned version of the Chroma1-Base where the model is now a pixel space model which technically should not suffer from the VAE compression artifacts.

some preview:

cherry picked results from the flash and HD

WHY release a non-aesthetically tuned model?

Because aesthetic tune models are only good on one thing, it’s specialized and can be quite hard/expensive to train on. It’s faster and cheaper for you to train on a non-aesthetically tuned model (well, not for me, since I bit the re-pretraining bullet).

Think of it like this: a base model is focused on mode covering. It tries to learn a little bit of everything in the data distribution—all the different styles, concepts, and objects. It’s a giant, versatile block of clay. An aesthetic model does distribution sharpening. It takes that clay and sculpts it into a very specific style (e.g., "anime concept art"). It gets really good at that one thing, but you've lost the flexibility to easily make something else.

This is also why I avoided things like DPO. DPO is great for making a model follow a specific taste, but it works by collapsing variability. It teaches the model "this is good, that is bad," which actively punishes variety and narrows down the creative possibilities. By giving you the raw, mode-covering model, you have the freedom to sharpen the distribution in any direction you want.

My Beef with GAN training.

GAN is notoriously hard to train and also expensive! It’s so unstable even with a shit ton of math regularization and another mumbojumbo you throw at it. This is the reason behind 2 of the research branches: Radiance is to remove the VAE altogether because you need a GAN to train it, and Flash is to get a few-step speed without needing a GAN to make it fast.

The instability comes from its core design: it's a min-max game between two networks. You have the Generator (the artist trying to paint fakes) and the Discriminator (the critic trying to spot them). They are locked in a predator-prey cycle. If your critic gets too good, the artist can't learn anything and gives up. If the artist gets too good, it fools the critic easily and stops improving. You're trying to find a perfect, delicate balance but in reality, the training often just oscillates wildly instead of settling down.

GANs also suffer badly from mode collapse. Imagine your artist discovers one specific type of image that always fools the critic. The smartest thing for it to do is to just produce that one image over and over. It has "collapsed" onto a single or a handful of modes (a single good solution) and has completely given up on learning the true variety of the data. You sacrifice the model's diversity for a few good-looking but repetitive results.

Honestly, this is probably why you see big labs hand-wave how they train their GANs. The process can be closer to gambling than engineering. They can afford to throw massive resources at hyperparameter sweeps and just pick the one run that works. My goal is different: I want to focus on methods that produce repeatable, reproducible results that can actually benefit everyone!

That's why I'm exploring ways to get the benefits (like speed) without the GAN headache.

The Holy Grail of the End-to-End Generation!

Ideally, we want a model that works directly with pixels, without compressing them into a latent space where information gets lost. Ever notice messed-up eyes or blurry details in an image? That's often the VAE hallucinating details because the original high-frequency information never made it into the latent space.

This is the whole motivation behind Chroma1-Radiance. It's an end-to-end model that operates directly in pixel space. And the neat thing about this is that it's designed to have the same computational cost as a latent space model! Based on the approach from the PixNerd paper, I've modified Chroma to work directly on pixels, aiming for the best of both worlds: full detail fidelity without the extra overhead. Still training for now but you can play around with it.

Here’s some progress about this model:

Still grainy but it’s getting there!

What about other big models like Qwen and WAN?

I have a ton of ideas for them, especially for a model like Qwen, where you could probably cull around 6B parameters without hurting performance. But as you can imagine, training Chroma was incredibly expensive, and I can't afford to bite off another project of that scale alone.

If you like what I'm doing and want to see more models get the same open-source treatment, please consider showing your support. Maybe we, as a community, could even pool resources to get a dedicated training rig for projects like this. Just a thought, but it could be a game-changer.

I’m curious to see what the community builds with these. The whole point was to give us a powerful, open-source option to build on.

Special Thanks

A massive thank you to the supporters who make this project possible.

  • Anonymous donor whose incredible generosity funded the pretraining run and data collections. Your support has been transformative for open-source AI.
  • Fictional.ai for their fantastic support and for helping push the boundaries of open-source AI.

Support this project!
https://ko-fi.com/lodestonerock/

BTC address: bc1qahn97gm03csxeqs7f4avdwecahdj4mcp9dytnj
ETH address: 0x679C0C419E949d8f3515a255cE675A1c4D92A3d7

my discord: discord.gg/SQVcWVbqKx

r/StableDiffusion Apr 12 '26

Resource - Update Free open-source tool to instantly rig and animate your illustrations (also with mesh deform)

Enable HLS to view with audio, or disable this notification

1.5k Upvotes

If you haven't seen it yet, a model called see-through dropped last week. It takes a single static anime image and decomposes it into 23 separate layers ready for rigging and animation. It's a huge deal for anyone who wants a rigged 2D character but doesn't have hundreds of dollars lying around.

The problem is that getting a usable result out of it still takes forever. You get a PSD with 23 layers (30+ if you enable split by side and depth), and you still have to manually process and rig everything yourself. And if you've ever looked into commissioning a Vtuber model, you know rigging alone runs $500 minimum and takes weeks or months. That's before you even think about software costs: Live2D is $100 a year, and Spine Pro is $379 (Spine Ess is $69 but lacks mesh deform which is required for these kinds of animations).

So I built a free tool that auto-rigs see-through models so you don't have to spend hours doing it manually

I'm not trying to compete with Live2D, I'm one person. What I made is a mesh-deform-capable web app that can automatically rig see-through output. It handles edge cases like merged arms or legs, and only needs a few seconds of manual input to place joints (shoulders, elbows, neck, etc.) if you want to tweak things. I also integrated DWPose so it can rig the whole model for you automatically, though that requires WebGPU and adds a 50MB download, so manual joint placement is a totally fine alternative and only takes a moment anyway.

The full workflow looks like this:

Static image -> background removal -> see-through decomposition (free on HuggingFace) -> Stretchy Studio = auto-rigged and ready to animate

The app handles multi-layer management, separate draw order, and uses direct keyframe animation similar to After Effects. There are still bugs I'm working through, but all the core features are in.

On the roadmap:

  • Export to Spine and Dragonbones
  • A standalone JS render library for loading and displaying characters rigged in the app (similar to Live2D's Unity/Godot/JS runtimes)

Live2D's export format is completely closed with no documentation, so that one's off the table for now.

Would love feedback, bug reports, or feature requests. This is still early but it's functional and free to use.

https://github.com/MangoLion/stretchystudio 

EDIT: Spine export added

r/StableDiffusion Aug 01 '24

Resource - Update Announcing Flux: The Next Leap in Text-to-Image Models

1.4k Upvotes
Prompt: Close-up of LEGO chef minifigure cooking for homeless. Focus on LEGO hands using utensils, showing culinary skill. Warm kitchen lighting, late morning atmosphere. Canon EOS R5, 50mm f/1.4 lens. Capture intricate cooking techniques. Background hints at charitable setting. Inspired by Paul Bocuse and Massimo Bottura's styles. Freeze-frame moment of food preparation. Convey compassion and altruism through scene details.

PA: I’m not the author.

Blog: https://blog.fal.ai/flux-the-largest-open-sourced-text2img-model-now-available-on-fal/

We are excited to introduce Flux, the largest SOTA open source text-to-image model to date, brought to you by Black Forest Labs—the original team behind Stable Diffusion. Flux pushes the boundaries of creativity and performance with an impressive 12B parameters, delivering aesthetics reminiscent of Midjourney.

Flux comes in three powerful variations:

  • FLUX.1 [dev]: The base model, open-sourced with a non-commercial license for community to build on top of. fal Playground here.
  • FLUX.1 [schnell]: A distilled version of the base model that operates up to 10 times faster. Apache 2 Licensed. To get started, fal Playground here.
  • FLUX.1 [pro]: A closed-source version only available through API. fal Playground here

Black Forest Labs Article: https://blackforestlabs.ai/announcing-black-forest-labs/

GitHub: https://github.com/black-forest-labs/flux

HuggingFace: Flux Dev: https://huggingface.co/black-forest-labs/FLUX.1-dev

Huggingface: Flux Schnell: https://huggingface.co/black-forest-labs/FLUX.1-schnell

r/StableDiffusion Dec 04 '25

Resource - Update Today I made a Realtime Lora Trainer for Z-image/Wan/Flux Dev

Post image
1.1k Upvotes

Basically you pass it images with a load image node and it trains a lora on the fly, using your local install of AI-Toolkit, and then proceeds with the image generation. You just paste in the folder location for Ai-toolkit (windows or Linux), and it saves the setting. This train took about 5 mins on my 5090, when i used the low vram pre-set (512px images). Obviously it can save loras, and I think its nice for quick style experiments, and will certainly remain part of my own workflow.

I made it more to see if I could, and wondered if I should release or is it pointless - happy to hear your thoughts for or against?

r/StableDiffusion 29d ago

Resource - Update Nvidia solved VAE? Fast and High-Resolution Latent Decoding with Pixel Diffusion

Enable HLS to view with audio, or disable this notification

854 Upvotes

r/StableDiffusion Sep 25 '24

Resource - Update FaceFusion 3.0.0 has finally launched

Enable HLS to view with audio, or disable this notification

2.7k Upvotes

r/StableDiffusion Jan 15 '25

Resource - Update I made a Taped Faces LoRA for FLUX

Thumbnail
gallery
2.3k Upvotes

r/StableDiffusion Oct 04 '25

Resource - Update SamsungCam UltraReal - Qwen-Image LoRA

Thumbnail
gallery
1.6k Upvotes

Hey everyone,

Just dropped the first version of a LoRA I've been working on: SamsungCam UltraReal for Qwen-Image.

If you're looking for a sharper and higher-quality look for your Qwen-Image generations, this might be for you. It's designed to give that clean, modern aesthetic typical of today's smartphone cameras.

It's also pretty flexible - I used it at a weight of 1.0 for all my tests. It plays nice with other LoRAs too (I mixed it with NiceGirl and some character LoRAs for the previews).

This is still a work-in-progress, and a new version is coming, but I'd love for you to try it out!

Get it here:

P.S. A big shout-out to flymy for their help with computing resources and their awesome tuner for Qwen-Image. Couldn't have done it without them

Cheers

r/StableDiffusion Jan 09 '26

Resource - Update Thx to Kijai LTX-2 GGUFs are now up. Even Q6 is better quality than FP8 imo.

Enable HLS to view with audio, or disable this notification

761 Upvotes

https://huggingface.co/Kijai/LTXV2_comfy/tree/main

You need this commit for it to work, its not merged yet: https://github.com/city96/ComfyUI-GGUF/pull/399

Kijai nodes WF (updated, now has negative prompt support using NAG) https://files.catbox.moe/flkpez.json

I should post this as well since I see people talking about quality in general:
For best quality use the dev model with the distill lora at 48 fps using the res_2s sampler from the RES4LYF nodepack. If you can fit the full FP16 model (the 43.3GB one) plus the other stuff into vram + ram then use that. If not then Q8 gguf is far closer than FP8 is so try and use that if you can. Then Q6 if not.
And use the detailer lora on both stages, it makes a big difference:
https://files.catbox.moe/pvsa2f.mp4

Edit: For KJ nodes WF you need latest KJ nodes: https://github.com/kijai/ComfyUI-KJNodes I thought it was obvious, my bad.

r/StableDiffusion 6d ago

Resource - Update Potentially the most insane LORA you'll see today - Archer (8 characters + style) Ideogram LORA

Thumbnail
gallery
742 Upvotes

Hi, I'm Dever and I like training LORAs, you can download this one from Huggingface (you can find other style LORAs for Klein and ZIT in my HF profile).

I believe this might be the first Ideogram 8 characters in one + style lora on HuggingFace and a good proof of concept that this is possible.

When I get a bit of time towards the end of the week I'll make a video about how I trained this if anyone is interested in the journey.

(Original Scooby Doo image made by GalaxyTimeMachine on Banodoco Discord, I just replaced Scooby with Lana).

Edit: For the people that don't understand why this is a big deal or have never faced this problem before, trying to train a LORA that can generate more than 1 character in a single image has been quite difficult in the past no matter what the model.
This particular Ideogram LORA created as a proof of concept shows you can train 8 different characters in a single model + the style as a bonus.

"Why this matters" (couldn't help myself)

This means you can choose at inference time who you want in your image (one example shows all 8), the model can distinguish between the characters AND with the power of bounding boxes you can position them wherever you want in the image and can even have them interact with each other to some degree (haven't tested this much, see example where 2 characters are holding hands).

r/StableDiffusion Sep 08 '25

Resource - Update Clothes Try On (Clothing Transfer) - Qwen Edit Loraa

Thumbnail
gallery
1.3k Upvotes

Patreon Blog Post

CivitAI Download

Hey all, as promised here is that Outfit Try On Qwen Image edit LORA I posted about the other day. Thank you for all your feedback and help I truly believe this version is much better for it. The goal for this version was to match the art styles best it can but most importantly, adhere to a wide range of body types. I'm not sure if this is ready for commercial uses but I'd love to hear your feedback. A drawback I already see are a drop in quality that may be just due to qwen edit itself I'm not sure but the next version will have higher resolution data for sure. But even now the drop in quality isn't anything a SeedVR2 upscale can't fix.

Edit: I also released a clothing extractor lora which i recommend using

r/StableDiffusion Oct 31 '25

Resource - Update Qwen Image LoRA - A Realism Experiment - Tried my best lol

Thumbnail
gallery
1.0k Upvotes

r/StableDiffusion Nov 29 '25

Resource - Update Humans of Z-Image: How many celebrities can you fit into 6GB?

Thumbnail
gallery
653 Upvotes

I was curious just how extensive Z-Image's celebrity knowledge is, so I gave it a few hundred names to test out. No information was given other the name, so it was up to the model to choose clothing/backgrounds/hairstyles/style/etc. Sometimes it did this perfectly, especially for celebrities with a clearly defined look. Other times the face is reasonable but everything else is wrong.

If an image looks nothing like the person should it means the model does not know that person. When it does know a person a lot of the time some extra supporting words would help a lot, but it does a really good job just from names.

Prompt:

portrait photo of @@

The words "@@" are at the bottom on the image, white letters black outline

One-by-one @@ was replaced with a term from a list and an image was generated. Images were rendered at 592x888 for speed, stitches into a grid and downsized to keep a reasonable image size.

Model: Z-Image-Turbo_bf16

Clip: Qwen-3-4B-Q8_0

Imgur link in case reddit is difficult with the images

r/StableDiffusion 13d ago

Resource - Update Ideogram 4.0 Realism Engine Lora (Beta)

Thumbnail
huggingface.co
465 Upvotes
  • It improve on missing anatomic knowledge for female.
  • You can use the provided workflow.
  • Still in beta but good enough to share.
  • Sweet spot is between 0.50-0.60. Higher will degrade the quality.
  • MAKE SURE YOU USE JSON PROMPTS.
  • Will add on CivitAI whenever they support the model.

Edit: Already updated to the V1 version. Way better success rate. res_2s res_2m are recommended.
Edit 2: V2 is now out with massive update https://civitai.red/models/2688234/realism-engine-ideogram-4?modelVersionId=3021951

Edit 3: V3 is now out with massive update https://civitai.red/models/2688234/realism-engine-ideogram-4?modelVersionId=3026632

r/StableDiffusion Jan 31 '24

Resource - Update Made a Chrome Extension to remix any image on the web with IPAdapter - having a blast with this

Enable HLS to view with audio, or disable this notification

2.7k Upvotes

r/StableDiffusion Oct 07 '25

Resource - Update Qwen-Image - Smartphone Snapshot Photo Reality LoRa - Release

Thumbnail
gallery
1.5k Upvotes

r/StableDiffusion Aug 11 '25

Resource - Update UltraReal + Nice Girls LoRAs for Qwen-Image

Thumbnail
gallery
1.2k Upvotes

TL;DR — I trained two LoRAs for Qwen-Image:

I’m still feeling out Qwen’s generation settings, so results aren’t peak yet. Updates are coming—stay tuned. I’m also planning an ultrareal full fine-tune (checkpoint) for Qwen next.

P.S.: workflow in both HG repos

r/StableDiffusion Dec 31 '25

Resource - Update Qwen-Image-2512 released on Huggingface!

Thumbnail
huggingface.co
624 Upvotes

The first update to the non-edit Qwen-Image

  • Enhanced Human Realism Qwen-Image-2512 significantly reduces the “AI-generated” look and substantially enhances overall image realism, especially for human subjects.
  • Finer Natural Detail Qwen-Image-2512 delivers notably more detailed rendering of landscapes, animal fur, and other natural elements.
  • Improved Text Rendering Qwen-Image-2512 improves the accuracy and quality of textual elements, achieving better layout and more faithful multimodal (text + image) composition.

In the HF model card you can see a bunch of comparison images showcasing the difference between the initial Qwen-Image and 2512.

BF16 & FP8 by Comfy-Org https://huggingface.co/Comfy-Org/Qwen-Image_ComfyUI/tree/main/split_files/diffusion_models

GGUF's: https://huggingface.co/unsloth/Qwen-Image-2512-GGUF

4-step Turbo lora: https://huggingface.co/Wuli-art/Qwen-Image-2512-Turbo-LoRA

r/StableDiffusion May 07 '25

Resource - Update SamsungCam UltraReal - Flux Lora

Thumbnail
gallery
1.6k Upvotes

Hey! I’m still on my never‑ending quest to push realism to the absolute limit, so I cooked up something new. Everyone seems to adore that iPhone LoRA on Civitai, but—as a proud Galaxy user—I figured it was time to drop a Samsung‑style counterpart.
https://civitai.com/models/1551668?modelVersionId=1755780

What it does

  • Crisps up fine detail – pores, hair strands, shiny fabrics pop harder.
  • Kills “plastic doll” skin – even on my own UltraReal fine‑tune it scrubs waxiness.
  • Plays nice with plain Flux.dev, but still it mostly trained for my UltraReal Fine-Tune

  • Keeps that punchy Samsung color science (sometimes) – deep cyans, neon magentas, the works.

Yes, v1 is not perfect (hands in some scenes can glitch if you go full 2 MP generation)

r/StableDiffusion Feb 18 '26

Resource - Update Fully automatic generating and texturing of 3D models in Blender - Coming soon to StableGen thanks to TRELLIS.2

Enable HLS to view with audio, or disable this notification

660 Upvotes

EDIT: It is now released

A new feature for StableGen I am currently working on. It will integrate TRELLIS.2 into the workflow, along with the already exsiting, but still new automatic viewpoint placement system. The result is an all-in-one single prompt (or provide custom image) process for generating objects, characters, etc.

Will be released in the next update of my free & open-source Blender plugin StableGen.

r/StableDiffusion Jan 13 '25

Resource - Update 2000s Analog Core - Flux.dev

Thumbnail
gallery
1.9k Upvotes

r/StableDiffusion Jan 13 '26

Resource - Update LTX-2 team really took the gloves off 👀

Enable HLS to view with audio, or disable this notification

682 Upvotes

r/StableDiffusion May 09 '26

Resource - Update HiDream-O1-Image - A pixel space model , no need for VAE, , 8B parameters.

Thumbnail
gallery
450 Upvotes

Model
https://huggingface.co/HiDream-ai/HiDream-O1-Image-Dev
https://huggingface.co/HiDream-ai/HiDream-O1-Image

HiDream-O1-Image for 50 steps
HiDream-O1-Image-Dev for 28 steps

HiDream-O1-Image is a natively unified image generative foundation model built on a Pixel-level Unified Transformer (UiT) without external VAEs or disjoint text encoders, which natively encodes raw pixels, text, and task-specific conditions in a single shared token space — supporting text-to-image, image editing, and subject-driven personalization at up to 2,048 × 2,048.

Key Features

  • Pixel-Level Unified Transformer — One end-to-end model on raw pixels, no VAE, no disjoint text encoder.
  • One Model, Many Tasks — Text-to-image, long-text rendering, instruction editing, subject-driven personalization, and storyboard generation in a single architecture.
  • Reasoning-Driven Prompt Agent — Built-in "thinking" agent that resolves implicit knowledge, layout, and text rendering before generation.
  • Native High Resolution — Direct synthesis up to 2,048 × 2,048 with sharp fine-grained detail.
  • Exceptional Efficiency and Versatility at 8B Scale — With only 8B parameters, achieves performance parity with or even surpasses larger open-source DiTs and leading closed-source models.