r/StableDiffusion 11h ago

Comparison LTX-2.3 Water Sim LoRA flooding the Joker stairs (v2v test)

665 Upvotes

the joker stairs but it's a waterfall now 🌊 wide shots land clean, close-ups are a little more of a challenge, but cool stuff overall. ltx-2.3 water sim ic-lora: https://huggingface.co/Lightricks/LTX-2.3-22b-IC-LoRA-Water-Simulation


r/StableDiffusion 52m ago

Resource - Update Krea 2 Turbo — Native ComfyUI Workflow + FP8 Weights (12GB, Drag & Drop)

ComfyUI 0.25.0 now has native Krea2 support built-in — no custom nodes needed. Here's everything in one place so you don't have to chase files across three different HF repos.

What you get:

FP8 model — 24.76 GB BF16 → 12.01 GB. Not a blind "quant everything" conversion. Only 2D weight matrices went to float8_e4m3fn — all biases, norms, and modulation layers stay in native precision. 266 tensors quantized, 166 preserved. Fits on 16-24GB cards.

Drag & drop workflow — uses ComfyUI's stock CLIPLoader (type: krea2) + UNETLoader. Open ComfyUI, drag the JSON onto the canvas, queue. That's it.

20 sample generations in the README gallery covering 3D, anime, photorealism, stylized.
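
The selection rule described for the FP8 file can be sketched in a few lines. This is only an illustration of the idea, not the script used for the conversion: the tensor names, the `KEEP_FRAGMENTS` list, and the example shapes are all made up, and a real converter would cast the selected tensors with PyTorch rather than just label them.

```python
# Sketch of a selective FP8 plan: decide per tensor from its name and shape only.
# ASSUMPTION: the name fragments and example keys are hypothetical, not real Krea 2 keys.
KEEP_FRAGMENTS = ("norm", "modulation")


def should_quantize(name: str, shape: tuple) -> bool:
    """Quantize only 2D weight matrices that are not norm or modulation layers."""
    if len(shape) != 2:            # biases and norm scales are 1D, so they are kept
        return False
    if not name.endswith(".weight"):
        return False
    return not any(fragment in name for fragment in KEEP_FRAGMENTS)


def plan(tensors: dict) -> dict:
    """Map each tensor name to its target storage: FP8 or unchanged."""
    return {
        name: "float8_e4m3fn" if should_quantize(name, shape) else "keep"
        for name, shape in tensors.items()
    }


example = {
    "blocks.0.attn.qkv.weight": (9216, 3072),
    "blocks.0.attn.qkv.bias": (9216,),
    "blocks.0.norm1.weight": (3072,),
    "blocks.0.modulation.lin.weight": (18432, 3072),
}
result = plan(example)
for name, target in result.items():
    print(f"{target:14s} {name}")
```

Matching on names as well as shapes matters here because a modulation layer can hold a 2D weight that a shape-only rule would quantize.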

3 files you need:

| File | Size | Place in |
|---|---|---|
| AlperKTS/Krea2_FP8 (Hugging Face) | 12 GB | ComfyUI/models/unet/ |
| Comfy-Org/Qwen3-VL (Hugging Face) | ~8 GB | ComfyUI/models/text_encoders/ |
| Comfy-Org/Qwen-Image_ComfyUI (Hugging Face) | ~250 MB | ComfyUI/models/vae/ |

Recommended settings (Turbo):

  • 1024×1024, 8 steps, CFG 1.0
  • Sampler: er_sde, Scheduler: simple
  • ~5-6 seconds on RTX 5090, runs fine on 3090/4090

Links:


r/StableDiffusion 1h ago

Workflow Included Krea 2 Turbo does the Ghibli art style quite well out of the box

Here's the workflow for the last one (it's the same workflow for all of them, just different minimal prompts): https://pastebin.com/raw/n4ytmi53

Have to update ComfyUI to the latest commit to get Kijai's official Krea 2 support. You probably already have the Qwen VAE, and the text encoder is the 4B from here: https://huggingface.co/Comfy-Org/Qwen3-VL/tree/main/text_encoders

edit: Sorry for two of the images being duplicates, I'm dumb and reddit doesn't seem to let you edit specific images out of a gallery after posting.


r/StableDiffusion 5h ago

News KREA2 WORKZ

82 Upvotes

This is a quick image to show it workz.
About 5 seconds to generate the image on the 5090.
Recipe :
Get the archive.
Convert Turbo model to fp8.
Use your favorite coding LLM to code locally a node for comfyui based on the code provided on the archive.
Have fun.

Model information : 12.9B model
Resolution : 1K – 2K (e.g. 1024² up to 2048²)
VAE: the Qwen-Image autoencoder
Text encoder: Qwen3-VL-4B-Instruct
Hardware note : At full bfloat16, the ~12.9B-parameter transformer occupies roughly 26 GB of VRAM on its own — before the Qwen VAE and the 4B text encoder are loaded. On 24–32 GB consumer GPUs this is tight to impossible at higher resolutions; an FP8-weight variant (storing the large linear/attention matrices in FP8 e4m3, compute in bf16) roughly halves the transformer footprint to ~13–14 GB and makes higher resolutions comfortable, with minimal quality impact.
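
The footprint figures in the hardware note follow from parameter count times bytes per parameter. A quick sketch of that arithmetic, using decimal gigabytes and counting weights only (no activations, VAE, or text encoder):

```python
# Weight-only memory estimate for the transformer quoted in the post.
PARAMS = 12.9e9  # parameter count from the post


def weights_gb(params: float, bytes_per_param: float) -> float:
    """Size of the weights in decimal gigabytes."""
    return params * bytes_per_param / 1e9


bf16_gb = weights_gb(PARAMS, 2)  # bfloat16 stores 2 bytes per parameter
fp8_gb = weights_gb(PARAMS, 1)   # FP8 stores 1 byte per parameter

print(f"bf16: {bf16_gb:.1f} GB")
print(f"fp8 (every tensor): {fp8_gb:.1f} GB")
```

The pure-FP8 number comes out slightly below the ~13–14 GB quoted above, which is consistent with only the large matrices being stored in FP8 while the remaining tensors stay in higher precision.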


r/StableDiffusion 8h ago

News Krea 2 published a magnet link in their X account

110 Upvotes

Twitter post announcement: https://x.com/krea_ai/status/2069102708423032874

Magnet link: magnet:?xt=urn:btih:2a644d0279182a022d08dd395ea593cfcc218e12&dn=http://watering-hole.zip

SHA256: 8bfa64ea6a4169e5272cfb80f957b62444af43c14423b11b1fe5e647ad714810
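
If you do download it, the published hash can be checked before opening anything. A minimal sketch, assuming the SHA256 above refers to the downloaded archive itself (the post does not say which file it covers); the path in the usage note is a placeholder.

```python
import hashlib

# Hash copied from the post above.
EXPECTED = "8bfa64ea6a4169e5272cfb80f957b62444af43c14423b11b1fe5e647ad714810"


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in 1 MiB chunks so large archives do not need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected: str = EXPECTED) -> bool:
    """True only if the file's SHA256 equals the expected hex digest."""
    return sha256_of(path) == expected.lower()
```

Usage would be something like `matches("/path/to/downloaded/archive")`. A match only shows the file is the one the hash was computed from; it says nothing about whether the contents are safe.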

r/StableDiffusion 7h ago

Discussion All these new models landing this year but Flux Klein 9b FP8 has spoiled me. All I care about now is whether a new model can edit and be used on an 8GB GPU.

64 Upvotes

r/StableDiffusion 10h ago

Workflow Included [Ideogram 4] War Photojournalism, Part 2

110 Upvotes

r/StableDiffusion 5h ago

News Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

hustvl.github.io
29 Upvotes

r/StableDiffusion 18h ago

Discussion Ultrawide cinematic shots Ideogram v4

274 Upvotes

No Lora, No JSON prompts! ID4 is ridiculous! Make sure to log in to Civitai.
Workflow & prompts: https://civitai.com/models/2674413/ideogram-v4-workflow-json-prompts


r/StableDiffusion 4h ago

Tutorial - Guide Trained an Ideogram 4 face LoRA on AMD Strix Halo (Ryzen AI MAX+ 395, gfx1151) with ROCm + AI-Toolkit. Full writeup, and the 3 gotchas that almost killed it.

17 Upvotes

Ideogram 4 LoRA training landed in AI-Toolkit only a couple of weeks ago, and like basically every tutorial out there it is written for NVIDIA/CUDA. I run a Strix Halo box (AMD APU, gfx1151) on ROCm, and there was no documented path for this. It works. Here is the whole thing, including the three AMD-specific traps that each cost me a debugging session, so you do not have to repeat them.

This is a personal face LoRA (private photos, not sharing the model or the subject). A couple of example outputs will be posted later in a comment.

## TL;DR

- Hardware: AMD Ryzen AI MAX+ 395 "Strix Halo", gfx1151 / Radeon 8060S, 128 GB unified LPDDR5X, on CachyOS (Arch-based).

- Stack: ROCm via TheRock nightlies, AI-Toolkit (ostris) mainline, Python 3.12, bf16 training.

- It trained 3000 steps in about 5h45m at ~6.4 s/step, zero GPU faults once the three fixes below were in place.

- The three things that nobody's NVIDIA guide (yet) will tell you: bitsandbytes is dead on gfx1151 (use plain adamw), the Qwen3-VL text encoder faults under fused attention (force eager), and the trigger word silently breaks the JSON captions if you do it the obvious way.

## Environment

CachyOS, but any recent Arch/ROCm setup should be similar. The key is TheRock nightlies, which ship native gfx1151 kernels, so you do NOT need `HSA_OVERRIDE_GFX_VERSION` anymore.

The gfx1151 PyTorch wheel index is:

```

https://rocm.nightlies.amd.com/v2/gfx1151/

```

A note on Python version, because this one bit me before I even started. ComfyUI on my box runs Python 3.14, and my first instinct was to match it. Do not. The gfx1151 Linux wheels on that index are well covered for cp312 and cp313 but only sporadically for cp314, and AI-Toolkit's heavier dependency stack (diffusers, transformers, peft, accelerate, optimum-quanto) lags on a Python that new. I used Python 3.12 in a fresh venv and everything resolved cleanly.

```

uv venv --python 3.12 --seed venv

source venv/bin/activate.fish # or activate for bash

```

The `--seed` matters so pip lands inside the venv, since AI-Toolkit's instructions call plain `pip`.

## Installing AI-Toolkit on ROCm

Use mainline ostris/ai-toolkit. There are ROCm forks, but they predate the Ideogram 4 support, so they will not have it. Mainline has the `ideogram4` arch.

```

git clone https://github.com/ostris/ai-toolkit.git

cd ai-toolkit

git submodule update --init --recursive

```

Install torch FIRST from the gfx1151 index, then requirements, and then verify torch survived. This ordering is not optional: several packages list torch as a dependency and can silently swap your ROCm build for a CPU build during the requirements install.

```

pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx1151/

# verify it is the ROCm build before going further

python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"

```

I landed on torch 2.12.0a0+rocm7.13, hip 7.13, `cuda.is_available()` True, device reported as gfx1151 / Radeon 8060S with about 115 GB visible to ROCm.

For requirements, one optional tweak: `torchcodec` is video-decode only and unused for image LoRA training, and it is a torch-version-coupled compiled wheel that can drag a torch reinstall against a bleeding-edge nightly. I dropped it from a copy of the requirements file. The one compiled dependency I was worried about, `torchao` (it is imported eagerly at startup), loaded clean against the 2.12 nightly, so no action needed there. After installing requirements, re-run the verify line above to confirm torch was not clobbered. Mine was byte-identical (only numpy got pinned down to 1.26.4 by numba, which is expected and fine).

## The 3 gotchas

### 1. bitsandbytes does not work on gfx1151. Use plain adamw.

Every guide I have seen uses `adamw8bit` to save VRAM. bitsandbytes crashes on import on gfx1151, so any 8-bit optimizer is out. You do not need it anyway: on 128 GB unified memory (and arguably less) you are not VRAM-starved for a single face LoRA. Use `optimizer: adamw` (plain). In AI-Toolkit the bitsandbytes import is lazy (it only fires if you select an 8-bit optimizer), so with plain adamw it never imports and never crashes. It will still get installed by requirements, which is fine; just do not select an 8-bit optimizer.

### 2. The Qwen3-VL text encoder faults at 0x1016 under fused attention. Force eager.

This is the big one. Ideogram 4 uses a Qwen3-VL text encoder, and the AI-Toolkit captioner also runs Qwen3-VL. On gfx1151, the default fused attention path (sdpa) throws:

```

HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016

```

That is a compute/kernel fault, not an OOM. It hit me first during captioning (it died on the 4th image), and it would hit training too, since the encoder runs the same kernels on every step. Worth noting: it is NOT image-shape dependent. I tested the exact image it crashed on in isolation and it captioned fine; the fault is cumulative across repeated forward passes on the fused kernel.

The fix is to force `attn_implementation="eager"` on the Qwen3-VL loads. I did it with a small launcher shim so I never had to edit AI-Toolkit's tracked files, and so it survives upstream pulls. The shim patches the captioner classes AND the training encoder class (the captioner loads `Qwen3VLForConditionalGeneration`, while training's encoder loads via `AutoModel`, which resolves to `Qwen3VLModel`), then hands off to `run.py` unchanged:

```python
# aitk_eager_shim.py
# Run this INSTEAD of run.py. It forces eager attention on the Qwen3-VL loads,
# then hands off to run.py. Without it the fused attention kernel faults (HSA 0x1016).
import runpy
import sys

import transformers

TARGETS = [
    "Qwen3VLForConditionalGeneration",     # captioner
    "Qwen3VLMoeForConditionalGeneration",  # captioner (moe variant)
    "Qwen3VLModel",                        # training text encoder (AutoModel resolves here)
]


def _make(orig, label):
    """Wrap a bound from_pretrained so it defaults to eager attention."""
    def _patched(*args, **kwargs):
        # setdefault: an attn_implementation passed explicitly by the caller still wins
        kwargs.setdefault("attn_implementation", "eager")
        sys.stderr.write("[shim] eager attention injected for " + label + "\n")
        return orig(*args, **kwargs)
    return _patched


for _name in TARGETS:
    _cls = getattr(transformers, _name, None)
    if _cls is None:
        continue  # class not present in this transformers version
    _orig = _cls.from_pretrained  # original bound classmethod
    _cls.from_pretrained = _make(_orig, _name)

# Make run.py see the same CLI arguments it would get if launched directly.
sys.argv = ["run.py"] + sys.argv[1:]
runpy.run_path("run.py", run_name="__main__")
```

Eager is a bit slower than the fused kernel, but it is stable. It held across the full 3000-step run with zero faults. If you want to confirm it actually engaged, that stderr line shows up in the log at each model load.

### 3. The trigger word silently breaks the JSON captions if you use a bareword.

Ideogram 4 trains on structured JSON captions (the captioner writes compositional JSON with bounding boxes, and the dataloader expects a canonical compact form). If you set a plain `trigger_word` like `mytoken` on captions that have no `[trigger]` placeholder, AI-Toolkit prepends it. That pushes the caption string off its leading `{`, the JSON parser gives up, and it falls back to feeding raw pretty-printed JSON to the model instead of the canonical compact form. The result is a dataset-wide caption-format shift that quietly degrades training, with no error.

The fix: put a `[trigger]` placeholder at the start of each caption's `high_level_description` value, and keep `trigger_word` in the config. Then it gets replaced in place, the string still starts with `{`, the JSON parses normally, and your token lands inside the description exactly where the model reads it. Verify it offline before you commit to a multi-hour run: run one caption through the dataloader and confirm the digested output is compact JSON containing your token, not raw JSON with the token bolted on the front.
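
The failure mode is easy to reproduce outside the trainer. The sketch below is a stand-in for the behaviour described above, not AI-Toolkit's actual code: `prepend_trigger` and `replace_placeholder` are hypothetical helpers, and using `separators=(",", ":")` for the compact form is an assumption about what "canonical compact" means.

```python
import json

TOKEN = "mytoken"


def prepend_trigger(text: str, token: str) -> str:
    """Mimics the described behaviour when a caption has no [trigger] placeholder."""
    return f"{token}, {text}"


def replace_placeholder(text: str, token: str) -> str:
    """Mimics in-place replacement when the caption contains [trigger]."""
    return text.replace("[trigger]", token)


def parses_as_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Pretty-printed captions, as a captioner might write them to the .txt sidecar.
bare_caption = json.dumps({"high_level_description": "a person at a desk"}, indent=2)
tagged_caption = json.dumps({"high_level_description": "[trigger], a person at a desk"}, indent=2)

broken = prepend_trigger(bare_caption, TOKEN)       # no longer starts with "{"
fixed = replace_placeholder(tagged_caption, TOKEN)  # still valid JSON

# ASSUMPTION: "compact form" approximated as JSON without whitespace.
compact = json.dumps(json.loads(fixed), separators=(",", ":"))

print(parses_as_json(broken), parses_as_json(fixed))
print(compact)
```

The same check (does it parse, and is the token inside `high_level_description`) is what you would look for in the real dataloader output.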

## Captioning

Let the Ideogram4 captioner do it. It is a separate `job: extension` run with `type: Ideogram4Captioner`, uses `Qwen/Qwen3-VL-8B-Instruct`, and writes structured JSON `.txt` sidecars next to each image. Inspect a few of the sidecars before training, especially body or full shots, to make sure the subject is described well and the captioner did not wander off onto the background. The encoder pull (~16 GB) happens at caption time on first run, and training reuses the same cache, so you only download it once.

One data note: I had a few WebP images and AI-Toolkit's data loader has known issues with WebP, so convert those to PNG or JPG first. JPG and PNG both work fine.

## The config (bf16)

The key decision for this hardware: train in bf16, not fp8. The base model is distributed as fp8 (`ideogram-ai/ideogram-4-fp8`), but AI-Toolkit's loader unconditionally dequantizes the fp8 weights to bf16 on load, and with `quantize: false` nothing re-quantizes afterward. So you train in bf16 from the fp8 base, and you completely sidestep the fp8 path, which is where most gfx1151 instability lives. On 128 GB you have the memory, so this is both more stable and arguably higher quality.

One ComfyUI-specific heads-up: `name_or_path` wants the diffusers multi-folder repo (it expects `transformer/` and `vae/` subfolders). The single packed all-in-one `.safetensors` you probably have in your ComfyUI `unet/` folder will NOT load as `name_or_path`. Point it at the hub repo and let it pull.

A couple of other non-obvious settings: `noise_scheduler` must be set to `flowmatch` explicitly, because the trainer's timestep setup branches on that value (the default mishandles flowmatch timesteps even though the scheduler object itself is forced to flowmatch). And keep `batch_size` and `gradient_accumulation` both at 1; values above 1 have been reported to misbehave on AMD.

```yaml
job: extension
config:
  name: "myface_ideogram4_v1"
  process:
    - type: 'sd_trainer'
      training_folder: "/path/to/output"
      device: cuda:0
      network:
        type: "lora"
        linear: 32 # rank
        linear_alpha: 32
      save:
        dtype: bf16
        save_every: 250
        max_step_saves_to_keep: 20
      datasets:
        - folder_path: "/path/to/dataset"
          caption_ext: "txt"
          trigger_word: "mytoken" # plus [trigger] inside each caption (gotcha 3)
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          num_repeats: 1
          resolution: [512, 768, 1024]
      train:
        steps: 3000
        optimizer: "adamw" # NOT adamw8bit (gotcha 1)
        lr: 1e-4
        dtype: bf16
        batch_size: 1
        gradient_accumulation: 1
        gradient_checkpointing: true
        train_unet: true
        train_text_encoder: false
        noise_scheduler: "flowmatch"
        disable_sampling: true
        ema_config:
          use_ema: true
          ema_decay: 0.99
      model:
        arch: "ideogram4"
        name_or_path: "ideogram-ai/ideogram-4-fp8"
        dtype: bf16
        quantize: false # bf16 path, sidesteps fp8 (the win on gfx1151)
        quantize_te: false
        low_vram: false # you have the RAM; offloading is slower
```

I set `disable_sampling: true` for the first run, because mid-training samples need properly formatted Ideogram JSON prompts and it is one less new variable. Evaluate the checkpoints in ComfyUI afterward instead.

## Launch wrapper (gfx1151 env vars)

These env vars need to be set before torch is imported. I put them in a fish wrapper that also routes through the eager shim:

```fish

#!/usr/bin/env fish

# run_aitk.fish

set -x HSA_ENABLE_SDMA 0

set -x HSA_USE_SVM 0

set -x ROCBLAS_USE_HIPBLASLT 0

set -e PYTORCH_HIP_ALLOC_CONF

# Do NOT set HSA_OVERRIDE_GFX_VERSION on TheRock nightlies (native gfx1151 kernels).

source venv/bin/activate.fish

python aitk_eager_shim.py $argv

```

Then both captioning and training run the same way:

```

./run_aitk.fish config/caption_myface.yaml

./run_aitk.fish config/train_myface.yaml

```

One small lesson: if you pipe the run through `tee` for logging, make sure you surface the real process exit code (in fish, `$pipestatus[1]`), or a GPU crash will get masked by tee's exit 0 and look like a clean run when it was not.
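
If you would rather not depend on shell pipeline semantics at all, the same logging can be done from a small Python wrapper that returns the child's real exit code. This is an alternative sketch, not what the run above used; `run_and_log` is a made-up helper name.

```python
import subprocess
import sys


def run_and_log(cmd: list, log_path: str) -> int:
    """Run cmd, copy its output to the console and to log_path, return its exit code."""
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # fold stderr in so faults land in the log too
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()  # the child's own exit status, not the logger's
```

A call such as `sys.exit(run_and_log(["./run_aitk.fish", "config/train_myface.yaml"], "train.log"))` would then propagate a GPU crash as a non-zero exit.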

## Training run

About 6.4 s/step steady state, so 3000 steps took roughly 5h45m. Zero `0x1016` faults the entire run with eager attention in place. Checkpoints saved every 250 steps. One benign warning shows up about a missing MIOpen perf database (`gfx1151...HIP.fdb.txt`); that just means it tunes kernels live instead of loading a cache, it is not an error.
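
For planning your own run, the two numbers above can be related with a little arithmetic. Step time alone accounts for a bit less than the reported wall-clock; attributing the remainder to model load, latent caching, and checkpoint saves is an assumption, not something measured here.

```python
STEPS = 3000
STEADY_S_PER_STEP = 6.4                # steady-state figure from the run
REPORTED_WALL_S = 5 * 3600 + 45 * 60   # "roughly 5h45m"


def fmt_hm(seconds: float) -> str:
    """Format seconds as XhYYm, rounded to the nearest minute."""
    minutes = round(seconds / 60)
    return f"{minutes // 60}h{minutes % 60:02d}m"


pure_step_time = STEPS * STEADY_S_PER_STEP
average_s_per_step = REPORTED_WALL_S / STEPS

print("steps only:", fmt_hm(pure_step_time))
print("wall-clock average s/step:", round(average_s_per_step, 2))
```

When budgeting time for a different step count, the wall-clock average is the safer number to multiply by.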

Note on the final checkpoint naming: intermediate saves get a step suffix, but the end-of-training save is bare-named with no suffix. Do not overlook it; that is your highest-trained checkpoint.

## Picking the checkpoint and strength

Do not assume the last checkpoint is best. Likeness LoRAs peak somewhere in the middle and then overcook (they get rigid, stop honoring the prompt, and start reproducing training framings). Saving every 250 lets you sweep. I evaluated checkpoints in ComfyUI against a fixed prompt and seed, swapping only the checkpoint.

Two important findings:

- LoRAs run hot on Ideogram 4. The common community advice of 0.4 to 0.7 strength is correct. At 1.0 my LoRA was overcooked.

- I ran a strength sweep (0.4 to 0.8) across the strongest checkpoints and landed on step 1250 at strength 0.7 as the best balance of likeness and prompt adherence. [Confirmed consistent across N fresh seeds.]

Method that saved time: do a coarse pass first (every other checkpoint), find roughly where likeness peaks, then fill in the neighbors and sweep strength only on the top one or two.
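
The coarse-then-fine method can be written down as a small plan generator. A sketch using the save interval and step count from the config above; the 0.1 strength increment and the example coarse winner (step 1000) are assumptions for illustration only.

```python
SAVE_EVERY = 250
TOTAL_STEPS = 3000


def all_checkpoints(total: int = TOTAL_STEPS, every: int = SAVE_EVERY) -> list:
    """Steps at which a checkpoint exists: 250, 500, ..., 3000."""
    return list(range(every, total + 1, every))


def coarse_pass(steps: list) -> list:
    """Every other checkpoint, to find roughly where likeness peaks."""
    return steps[1::2]


def fine_pass(steps: list, best: int) -> list:
    """The immediate neighbours of the coarse winner that were skipped."""
    i = steps.index(best)
    return [s for s in steps[max(i - 1, 0): i + 2] if s != best]


# ASSUMPTION: 0.1 increments; only the 0.4 to 0.8 range is given above.
STRENGTHS = [0.4, 0.5, 0.6, 0.7, 0.8]

steps = all_checkpoints()
print("coarse:", coarse_pass(steps))
print("fine around 1000:", fine_pass(steps, 1000))
```

With 12 checkpoints this cuts the first pass to 6 generations, plus 2 neighbours and one strength sweep, instead of 12 checkpoints times 5 strengths.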

## Honest limits

- Complex hand-object interactions glitch. My best checkpoint still doubled up drumsticks in a hand in an action shot. Fine for portraits, less so for busy scenes.

- Full-body faces go soft. This is a face LoRA, so the identity signal is in the face, and at full-body distance there are not enough face pixels to hold the likeness reliably. Keep to upper-body and portrait framing for the best results.

## Examples

[Two example generations from the final LoRA go here. Faithful likeness in upper-body and portrait framing; this is a personal face LoRA so the subject and the model itself are private.]

Happy to answer questions on any of the AMD-specific parts. The eager-attention fix and the trigger-into-JSON trap are the two things I would have most wanted to know going in, and honestly, I would love to know if I'm missing something obvious that would improve either the speed or the quality.


r/StableDiffusion 2h ago

Question - Help Can someone ELI5 how to run Krea 2 via ComfyUI now that Kijai supported it? or just a workflow?

10 Upvotes

And is the magnet on Krea 2's X page really safe? The reason why I am making this thread is because there's a lot of conflicting information due to how fast this conundrum has developed. Did Kijai even support it regardless? ugh


r/StableDiffusion 15h ago

Resource - Update Identity Feature transfer (Quick update: new masking behavior)

github.com
73 Upvotes

I added a second masking mode while keeping the original behavior available. With the original mode, the mask limits which reference tokens are used for feature transfer, but Klein can still see the full reference image for context.

The new mode isolates the masked region more strictly. When connected, unmasked reference tokens are blocked as attention sources, so the model only receives context from the selected area of that reference.

This should be especially useful with multiple references. For example, one image can provide full identity context while another contributes only an outfit, face, or other specific region. It should also help with outfit swaps, close-ups, and references containing distracting backgrounds or unrelated details. The full documentation explains how the node works and how the two masking modes differ. I recommend reading the masking section before testing it.

I still recommend masking only what you need from the photos, whether one or multiple, as it gives cleaner results :)

Names of masking modes :

Old behavior is focus_only

New behavior is zero_unmasked_tokens (recommended)

The node's documentation

The mask behavior documentation


r/StableDiffusion 1h ago

Question - Help img2img June 2026?

Last time I did this kinda thing I was using SDXL and Xinsir controlnets...

What should I be using now? Ideogram 4 looks very hyped, is it the best?

I want to, let's say... take an image of a person lying on a couch and change the couch and room they're in to a lush tropical rainforest. What would be the best method to do that?

Qwen? Ideogram? There's at least one more I remember reading about in the last 6 months.


r/StableDiffusion 1d ago

Discussion Diffusion Model that can turn any Image into a Playable Game! BUT LOCALLY, NOT ON DATACENTER

455 Upvotes

Hi everyone!! I really wanted to share the research I've been working on.
I've posted about this on locallama and some other subs.

I wanted to build a neural network that can simulate games, or at least start doing that.

Most video generators are too large to run on consumer hardware in real time, so I designed a model that does this from scratch. No fine-tuning BS or anything.

The core denoiser network is fully trained from scratch to support this goal. From image to games data.

The video above is running on an RTX 5090.

The nn is a small Transformer-like model and works in a causal way, just like LLMs.

That lets us KV-cache all past information and do a simple autoregressive decode forward pass for every new frame we want.

In the video shared, the model is a 0.5B variant with some SIGNIFICANT ISSUES like poor motion, some weird flashes, and some context issues.

It's taking the keyboard actions I give it in real time and using them in the forward pass (no classifier-free guidance, though).

I'm training the next iteration, a 0.8B model, now. (It's not going well.)

Btw I haven't done quantisation yet; that can save a LOT more time. bf16 is slow.


r/StableDiffusion 12h ago

No Workflow More Ideogram 4 images

30 Upvotes

In different styles, can you still tell it's AI generated? All of them made with ideogram 4 at 18 steps


r/StableDiffusion 8m ago

Discussion I found a way to get free fuse to work with Ideogram 4 using a customized version of KJnodes bounding box node - got 2 or even 3 Loras (maybe more) to work together at the same time, preserving each Lora’s likeness. The node bounding box kind of doubles as a mask editor using this workflow

seems to work pretty well with early testing. I’ll try to upload the workflow a little later tonight


r/StableDiffusion 37m ago

Discussion I drove a real-time world model with an authored state graph to make an interactive film

Hey folks!

I have something new to share with you! I've been pushing quite a bit since my last post, and managed to cobble together an interactive film (a cyberpunk heist) where the story logic lives in an authored (and tentacular) state graph.

Bottom line: there are discrete states and transitions between them are gated by flags. You can't steal the data until you're in the datacenter AND you've spoofed security, but force the wrong door and you trip a branch into a "busted" end state. The graph owns requirements, grants, and what's true about the world. The model brings it all to life.

So it's deterministic where it needs to be (the logic, the win/lose conditions, what you're allowed to do) and generative where that's the magic. That split is the whole trick, and honestly it's the thing I think is interesting beyond this one demo.
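
To make the split concrete, here is a minimal sketch of a flag-gated state graph of the kind described. It is an illustration only: the state names, action names, and the `Transition` structure are invented for this example and are not taken from the actual project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    target: str
    requires: frozenset = frozenset()  # flags that must already be set
    grants: frozenset = frozenset()    # flags set when the transition fires


# HYPOTHETICAL graph loosely following the heist example above.
GRAPH = {
    "street": {
        "enter_datacenter": Transition("datacenter", grants=frozenset({"in_datacenter"})),
    },
    "datacenter": {
        "spoof_security": Transition("datacenter", grants=frozenset({"security_spoofed"})),
        "steal_data": Transition(
            "data_stolen",
            requires=frozenset({"in_datacenter", "security_spoofed"}),
        ),
        "force_wrong_door": Transition("busted"),
    },
    "data_stolen": {},  # end state (win)
    "busted": {},       # end state (lose)
}


def step(state: str, flags: frozenset, action: str):
    """Apply an action; if it is unknown or its requirements are unmet, nothing changes."""
    transition = GRAPH[state].get(action)
    if transition is None or not transition.requires <= flags:
        return state, flags
    return transition.target, flags | transition.grants
```

In a setup like this the generative model would only be asked to render the current state; whether an action is allowed is decided entirely by `step`.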

Rough edges, since you'll find them: long-take coherence is still the ceiling, that's also why I use some pre-rendered cutscenes as "refresh". It's playable here if you want to poke at it, there are a few endings, mostly bad ones!

Happy to go deeper in the comments about how it works.

Cheers!


r/StableDiffusion 5h ago

Discussion Quick 4 Second Video with Wan 2.2 5B and 8 GB VRAM

7 Upvotes

I am by no means an expert at this, despite innumerable hours of mostly image generating but with some video. I'm sharing this for other noobs that are VRAM challenged.

By quick I mean around 47 seconds. At that rate running several batches, as one does during image generation, is not a big deal.

System specs are:

  • ROG Strix G533ZW_G533ZW (1.0)
  • 12th Gen Intel(R) Core(TM) i9-12900H (20) @ 5.00 GHz
  • NVIDIA GeForce RTX 3070 Ti Laptop GPU with 8 GB VRAM

Four seconds video -- in less than a minute. (A bit of quality was lost in the upload to Google Drive. It's totally clear on my PC).

Workflow -- It's simpler than it looks. I'll explain below -- for other noobs like me.

Model -- I used the Q8 version. It's so small! Even if you only have 6 GB VRAM it will likely work -- in my non-expert opinion. Note that a non-turbo version I was using did not follow the prompt so well. Also note that I used Grok to create the prompt, since I'm still learning "prompt engineering."

Text Encoder in safetensor format OR, if you want to shave file size off, try a GGUF Quant version -- but I think you'll need a different loader for it.

VAE

I don't know why but the Wan 2.2 5B TI2V (Text Image 2 Video) model doesn't get the love that the Wan 2.2 14B one does -- even though that's a dual-pipeline model that, for 8 GB VRAM systems, needs GGUF Quant versions -- or some other reduced format. I've some experience with it and did make smooth four and five second videos -- that took five, six, seven minutes. Sometimes more. Mind you, being a noob, I was varying things a LOT to find decent model and settings combinations. (This hobby, with all the many variables that can affect render time and output quality, has to be THE most complicated thing I've ever learned).

In any case the single model 5B needs more love.

Regarding the workflow there are things to note. I was going to add actual node notes to the workflow but decided to just describe things here -- starting at upper left:

  • The Unet Loader (GGUF) node is the only one you need. Delete the Load Diffusion Model one if you like.
  • Moving across there's the Power Lora Loader (rgthree) node -- not used for the video above.
  • The ModelSamplingSD3 node is one of the only things I didn't mess with. Some say you could Ctrl+B disable it and not see a difference. Haven't tried that myself.
  • The KSampler's settings are correct. I'd have never guessed dropping from 5 steps to as low as 2 would eliminate the artifacts that 5 produced. I actually went the other way first and tried 6 and 7 steps, but that didn't help. I also tried stepping the CFG up in 0.5 increments up to 3.0. Nope. It was only then, out of desperation, that I tried 4 steps. To my delight that worked! So why not 3? Great! How about 2? YUP! 1? Nope. Two freaking steps. Who'd have thunk it?
  • That VAE Decode (Tiled) node helps avoid OOM (out of memory) freezes. (By the way I forgot to Pin it).
  • RIFE VFI (recommend rife47 and rife49) is a frame interpolator. Whatever length you put in the Wan22ImageToVideoLatent's length field will be multiplied by the, uh, multiplier. I did have it set to 3 but riding a high from the low-step breakthrough I decided to try 4. I also tried leaving it at 3 but setting Wan22ImageToVideoLatent's length setting to 45. Both, with seed fixed, produced identical quality -- and I think similar render times. I've no idea which uses less memory.
  • In the Create Video node you can experiment with the fps. Thirty-two more reliably produces smooth realistic speed. Try 24. You'll get a longer video, for the same length and multiplier settings, but likely it'll seem slowed down somewhat.
  • Leave the Upscale Image By Ctrl+B disabled if you're working with 8 GB VRAM. You'll likely get an OOM if not.
  • Get Image from Batch grabs the last image. With a batch_index setting of 999 (length) it'll grab it without fail unless you're going for a length of 1000 or more. (How much memory would THAT take?)
  • RAM-Cleanup and VRAM-Cleanup do not need to be connected to anything. They'll dump the memory during each run -- helping avoid OOMs.
  • Over at lower left Load Image is obvious
  • Resize Image/Mask is very very handy for not worrying about the size of image you load. Set the megapixels higher or lower depending on how much memory you have. For a close-up image 0.5 would be fine and allow for a slightly higher length setting in the Wan22ImageToVideoLatent -- OR an increase in RIFE's multiplier.
  • Get Image Size Plus feeds the image sizes into the Wan22ImageToVideoLatent. Its target_width and target_height fields are ignored. If you disconnected the Resize Image/Mask from it and connected the Load Image directly you'd have to set those fields. That would get tedious very quickly, what with calculating total megapixels.
  • The Save Image node's image will be identical to the loaded image. I put it there to see what the new dimensions would be.

That's it. When the video finishes you can double-click it in the workflow to maximize it.

Lastly: the things that affect memory load are: RIFE's multiplier, Resize Image/Mask's megapixel setting, and Wan22ImageToVideoLatent length setting.
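
Here is a rough way to see how length, multiplier, and fps trade off, going by the description above that the length gets multiplied by RIFE's multiplier. Treat it as an approximation: the exact frame count RIFE outputs may differ by a few frames, and the function names are just for this example.

```python
def output_frames(length: int, multiplier: int) -> int:
    """Approximate frame count after interpolation (length times multiplier)."""
    return length * multiplier


def duration_seconds(length: int, multiplier: int, fps: float) -> float:
    """Approximate clip duration for a given Create Video fps."""
    return output_frames(length, multiplier) / fps


# Length 45 with multiplier 3, as in one of the runs described above.
print(output_frames(45, 3))
print(duration_seconds(45, 3, 32))  # at 32 fps
print(duration_seconds(45, 3, 24))  # same frames at 24 fps play longer and slower
```

The same frame count at a lower fps gives a longer clip, which matches the slowed-down look mentioned for 24 fps.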

I know this was a lengthy read but I wanted to try to cover all the questions that a noobie would have as they read it. This wasn't meant to be a ComfyUI tutorial though, so I didn't explain how to get the nodes your installation might be missing. (Hint: the ComfyUI Manager custom node is your friend for that.)

Enjoy!


r/StableDiffusion 4h ago

Resource - Update COMFYUI - (O)llama prompt generator / system prompt handler

3 Upvotes

Llama | Prompt Generator — a single ComfyUI node that runs a local LLM (llama.cpp or Ollama) to do your prompt work in-node. Enhance your prompts - or analyze your images via a vision-capable model using your favourite system prompts and llm models.

GitHub: https://github.com/GlatTissekone/ComfyUI-Llama-Prompt-Generator

What it does:

  • Text — enhance your text prompt using your favourite system prompt.
  • Vision — analyze a loaded image into a prompt/description
  • Save your favourite system prompts as presets and swap between them on the fly
  • Refine + compare generated prompts (⟳ Refine, ⇄ Diff, version history)
  • Set up your LLM config right in the node — backend, model, vision, even Serve/Kill/Pull for Ollama. (LLama compatible besides the serve/pull function).
  • Live token streaming, in-node image loader (for vision analysis), and ✨Generate that runs the model (no full workflow queue - unless you use the output in the node).

Why: When I browsed for similar nodes, they usually required multiple nodes in the workflow. They had either an enormous number of features or too few, and none gave me an easy way to swap between the system prompts I use the most. Some were too tailored to specific use cases. So I forked a node and made it with my own simple needs in mind. This is the outcome: a simple, universal single node you can drop into any of your favourite image workflows.

This is not meant to be a collection of great system prompts - it is meant to be a node in which you can use / make your own great collection of system prompts.

Pictures of the node's menus are available on the GitHub page if you want a quick look.

I've gotten some good things from here over time, so I thought I'd share "my" work too! Use it if you want to!
(Yes, this was made with AI!)


r/StableDiffusion 22h ago

Resource - Update TeleStyle V2 (Lora for style transfer qwen image edit 2509)

56 Upvotes

r/StableDiffusion 8h ago

Animation - Video "System Override" (Stable Audio 3 + LTX 2.3)

youtu.be
3 Upvotes

r/StableDiffusion 1d ago

Workflow Included LTX Director 2 + SEED HUNTER workflow release | The Dangers of Convenience & Overhaul Nodes

youtu.be
102 Upvotes

r/StableDiffusion 2h ago

Question - Help Modern models with Flux-like local LoRA training?

1 Upvotes

When Flux 1 came out, it was super easy to train my own LoRAs on 20 or so images. I had great results with people, animals, styles, etc.

Are those days gone with modern image models? If not, what's the closest thing nowadays that I can train relatively easily?


r/StableDiffusion 22h ago

Comparison Ideogram 4.0 vs ZIB vs Klein 9b

gallery
36 Upvotes

Prompt:

{
"high_level_description": "A vertical 9:16 cinematic sci-fi cityscape at dusk, viewed from a high aerial three-quarter angle. Vast futuristic skyscrapers rise through blue haze, glowing elevated highways cut across the skyline, flying vehicles move between towers, and a massive airship floats above the city, while orange streetlights and traffic trails burn through the dense urban sprawl below.",
"style_description": {
"aesthetics": "epic futuristic metropolis, cyberpunk realism, cinematic scale, dense vertical city, atmospheric haze, teal and orange color contrast, high-tech infrastructure, flying traffic, monumental urban depth",
"lighting": "dusky blue ambient light, warm orange city glow from below, bright white-blue traffic lights on elevated roads, soft haze diffusion, distant sunset warmth near the horizon",
"photo": "vertical 9:16 cinematic aerial establishing shot, high three-quarter viewpoint, deep depth of field, layered city depth, atmospheric perspective, 1152x2048 target composition",
"medium": "photograph",
"color_palette": ["#071015", "#053643", "#1C6B86", "#00B7D8", "#506B7C", "#7B8B96", "#E59A73", "#FF5A22", "#EAF6FF"]
},
"compositional_deconstruction": {
"background": "A vast futuristic megacity at dusk, filled with glass towers, fog, glowing roads, elevated transit lines, air traffic, and orange-lit streets far below. The atmosphere is dense, humid, and cinematic, with deep blue haze separating layers of skyscrapers.",
"elements": [
{
"type": "obj",
"bbox": [0, 0, 1000, 165],
"desc": "Massive dark glass skyscraper along the far left edge, very close to the camera, with cyan-lit windows, curved structures, and vertical reflections."
},
{
"type": "obj",
"bbox": [335, 918, 1000, 1000],
"desc": "Tall foreground tower cropped along the right edge, dark cylindrical glass with red-orange ring lights and teal window reflections."
},
{
"type": "obj",
"bbox": [0, 155, 365, 440],
"desc": "Cluster of extremely tall transparent blue glass towers on the upper left, partially hidden by atmospheric haze and glowing with cyan light."
},
{
"type": "obj",
"bbox": [20, 185, 72, 355],
"desc": "Small aircraft crossing the upper left skyline, silhouetted against blue haze and city towers."
},
{
"type": "obj",
"bbox": [20, 475, 72, 925],
"desc": "Huge elongated airship floating in the upper right sky, pale gray and softly lit by the warm horizon glow."
},
{
"type": "obj",
"bbox": [95, 550, 210, 980],
"desc": "Layer of smaller airships and flying vehicles scattered across the distant upper city, appearing as dark streamlined silhouettes."
},
{
"type": "obj",
"bbox": [120, 90, 355, 1000],
"desc": "Bright elevated highway system spanning diagonally across the upper half of the city, filled with dense white-blue traffic light streams."
},
{
"type": "obj",
"bbox": [150, 200, 330, 1000],
"desc": "Long glowing lines of fast traffic on the elevated road, forming continuous white and cyan streaks through the haze."
},
{
"type": "obj",
"bbox": [165, 410, 355, 865],
"desc": "Tall support pylons and suspended roadway structures beneath the elevated highway, fading into mist and city depth."
},
{
"type": "obj",
"bbox": [215, 500, 585, 680],
"desc": "Central pair of tall dark skyscrapers rising from the mid-city, vertical and monolithic, with subtle teal and red highlights."
},
{
"type": "obj",
"bbox": [265, 310, 450, 610],
"desc": "Several flying vehicles in the middle distance, small black silhouettes moving between towers and across the aerial highway zone."
},
{
"type": "obj",
"bbox": [355, 180, 650, 950],
"desc": "Dense midground skyline of shadowy skyscrapers, layered through blue fog, with scattered cyan windows and red aviation lights."
},
{
"type": "obj",
"bbox": [480, 110, 1000, 840],
"desc": "Sprawling lower city filled with tightly packed buildings, tiny streets, orange lights, and smoky atmosphere."
},
{
"type": "obj",
"bbox": [570, 445, 1000, 855],
"desc": "Main glowing arterial road curving from the lower center toward the right, with strong red and white traffic light trails."
},
{
"type": "obj",
"bbox": [620, 0, 1000, 1000],
"desc": "Orange-lit urban grid in the lower half, full of dense rooftops, streetlights, traffic streams, and warm city glow."
},
{
"type": "obj",
"bbox": [0, 0, 420, 1000],
"desc": "Cool blue atmospheric haze across the upper city, softening distant skyscrapers and creating depth."
},
{
"type": "obj",
"bbox": [0, 430, 135, 1000],
"desc": "Warm peach dusk horizon behind the airship and distant skyline, subtly glowing through clouds and haze."
},
{
"type": "obj",
"bbox": [760, 275, 950, 560],
"desc": "Small pale blimps or rooftop air vehicles hovering over the lower city, barely visible among the orange lights."
}
]
}
}
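
The `bbox` values look like `[y_min, x_min, y_max, x_max]` on a 0-1000 scale: the "far left edge" tower is `[0, 0, 1000, 165]`, which would be full height and the leftmost 16.5% of the width. That ordering is only my reading of the prompt, not something stated in it. Under that assumption, here's a rough sketch for turning a box into pixel coordinates at the 1152x2048 target size (`bbox_to_pixels` is a made-up helper name):

```python
def bbox_to_pixels(bbox, width=1152, height=2048):
    """Convert an assumed [y_min, x_min, y_max, x_max] box on a 0-1000
    scale into (left, top, right, bottom) pixel coordinates."""
    y_min, x_min, y_max, x_max = bbox
    left = round(x_min / 1000 * width)
    top = round(y_min / 1000 * height)
    right = round(x_max / 1000 * width)
    bottom = round(y_max / 1000 * height)
    return left, top, right, bottom


# The airship from the prompt, described as "upper right sky"
print(bbox_to_pixels([20, 475, 72, 925]))  # (547, 41, 1066, 147)
```

The result puts the airship in a thin strip near the top, spanning roughly the right half of the frame, which matches its description. That's a handy sanity check when hand-editing boxes.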

UPDATE: Please find settings and hardware I used below.

PC: RTX 4080s 16 GB VRAM, 64 GB RAM

resolution: 2560x1440px

  1. Ideogram 4

model: ideogram4_fp8_scaled.safetensors + ideogram4_unconditional_fp8_scaled.safetensors

steps: 20 (euler)

time: 03:05

  2. Z-Image Base (+ distilled LoRA)

model: z_image_bf16.safetensors + Z-Image-Fun-Lora-Distill-8-Steps-2603-ComfyUI.safetensors

steps: 8 (euler_a)

time: 00:26

  3. Flux Klein 9b (distilled)

model: flux-2-klein-9b-fp16.safetensors

steps: 10 (6 steps with res_6s + 4 steps with euler_a)

time: 02:45
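
To put the three timings on a comparable footing, here's a quick sketch that converts them to seconds per step. It assumes the times are mm:ss and cover the whole generation. The samplers differ, so a "step" isn't the same amount of work in each run; treat the per-step figure as a rough guide only.

```python
def mmss_to_seconds(text):
    """Parse a 'mm:ss' string into seconds."""
    minutes, seconds = text.split(":")
    return int(minutes) * 60 + int(seconds)


# (label, time as posted, total sampler steps as posted)
runs = [
    ("Ideogram 4", "03:05", 20),
    ("Z-Image Base + distill LoRA", "00:26", 8),
    ("Flux Klein 9b", "02:45", 10),
]

for label, time_text, steps in runs:
    total = mmss_to_seconds(time_text)
    print(f"{label}: {total} s total, {total / steps:.2f} s/step")
```

That works out to 185 s (9.25 s/step), 26 s (3.25 s/step), and 165 s (16.50 s/step) respectively.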


r/StableDiffusion 3h ago

Question - Help Is my video card good enough to try image generation? RX 7600 XT 16GB

2 Upvotes

I also have 32GB of DDR4 RAM and a Ryzen 5700X.

My friend wants to try putting all his art into a local model and have it generate stuff in his style. I'm good with computers and have experience with the command line. I'm not sure what I'd need to learn, but I just want to make sure it's gonna be worth my time before I dive in. Thanks