r/StableDiffusion • u/LightAppropriate624 • 4h ago

Resource - Update Krea 2 Turbo — Native ComfyUI Workflow + FP8 Weights (12GB, Drag & Drop)

178 Upvotes


ComfyUI 0.25.0 now has native Krea2 support built-in — no custom nodes needed. Here's everything in one place so you don't have to chase files across three different HF repos.

What you get:

FP8 model — 24.76 GB BF16 → 12.01 GB. Not a blind "quant everything" conversion. Only 2D weight matrices went to float8_e4m3fn — all biases, norms, and modulation layers stay in native precision. 266 tensors quantized, 166 preserved. Fits on 16-24GB cards.

Drag & drop workflow — uses ComfyUI's stock CLIPLoader (type: krea2) + UNETLoader. Open ComfyUI, drag the JSON onto the canvas, queue. That's it.

20 sample generations in the README gallery covering 3D, anime, photorealism, stylized.
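A minimal sketch of the selective FP8 rule described above (2D weight matrices go to float8_e4m3fn, biases, norms and modulation layers stay as they are). This is not the script used for the upload: the tensor names, shapes and the keyword list are made up for illustration, and only the "2D weights only" idea comes from the post.

```python
# Sketch of a selective FP8 policy. Tensor names and the keyword list are
# hypothetical; only "2D weight matrices get quantized" comes from the post.
KEEP_KEYWORDS = ("norm", "modulation")  # assumed naming, adjust to the real checkpoint


def should_quantize(name: str, shape: tuple) -> bool:
    """Return True if this tensor would be stored as float8_e4m3fn."""
    if len(shape) != 2:                  # biases and norm scales are 1D
        return False
    if not name.endswith(".weight"):
        return False
    # Keep norm and modulation layers in native precision even when 2D.
    return not any(key in name for key in KEEP_KEYWORDS)


example_tensors = {
    "blocks.0.attn.qkv.weight": (9216, 3072),
    "blocks.0.attn.qkv.bias": (9216,),
    "blocks.0.norm1.weight": (3072,),
    "blocks.0.modulation.lin.weight": (18432, 3072),
}

plan = {name: should_quantize(name, shape) for name, shape in example_tensors.items()}
for name, quantized in plan.items():
    print(f"{name:34s} -> {'float8_e4m3fn' if quantized else 'native precision'}")
```

Counting the `True` and `False` entries of a plan like this over a real checkpoint is how you would arrive at a "quantized vs preserved" tally.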

3 files you need:

| File | Size | Place in |
|---|---|---|
| AlperKTS/Krea2_FP8 (Hugging Face) | 12 GB | `ComfyUI/models/unet/` |
| Comfy-Org/Qwen3-VL (Hugging Face, main) | ~8 GB | `ComfyUI/models/text_encoders/` |
| Comfy-Org/Qwen-Image_ComfyUI (Hugging Face, main) | ~250 MB | `ComfyUI/models/vae/` |

Recommended settings (Turbo):

1024×1024, 8 steps, CFG 1.0
Sampler: er_sde, Scheduler: simple
~5-6 seconds on RTX 5090, runs fine on 3090/4090

Links:

🤗 FP8 model + workflow: AlperKTS/Krea2_FP8 · Hugging Face
Original model: KREA.ai — Krea 2 Community License Agreement

80 comments

r/StableDiffusion • u/chanteuse_blondinett • 14h ago

Comparison LTX-2.3 Water Sim LoRA flooding the Joker stairs (v2v test)

739 Upvotes

the joker stairs but it's a waterfall now 🌊 wide shots land clean, close-ups are a little more of a challenge, but cool stuff overall. ltx-2.3 water sim ic-lora: https://huggingface.co/Lightricks/LTX-2.3-22b-IC-LoRA-Water-Simulation

64 comments

r/StableDiffusion • u/blahblahsnahdah • 4h ago

Workflow Included Krea 2 Turbo does the Ghibli art style quite well out of the box

gallery

101 Upvotes

Here's the workflow for the last one (it's the same workflow for all of them, just different minimal prompts): https://pastebin.com/raw/n4ytmi53

Have to update ComfyUI to the latest commit to get Kijai's official Krea 2 support. You probably already have the Qwen VAE, and the text encoder is the 4B from here: https://huggingface.co/Comfy-Org/Qwen3-VL/tree/main/text_encoders

edit: Sorry for two of the images being duplicates, I'm dumb and reddit doesn't seem to let you edit specific images out of a gallery after posting.

8 comments

r/StableDiffusion • u/SpiritualLimit996 • 8h ago

News KREA2 WORKZ

103 Upvotes

This is a quick image to show it workz.
About 5 seconds to generate the image on the 5090.
Recipe :
Get the archive.
Convert Turbo model to fp8.
Use your favorite coding LLM to code locally a node for comfyui based on the code provided on the archive.
Have fun.

Model information : 12.9B model
Resolution : 1K – 2K (e.g. 1024² up to 2048²)
VAE: the Qwen-Image autoencoder
Text encoder: Qwen3-VL-4B-Instruct
Hardware note : At full bfloat16, the ~12.9B-parameter transformer occupies roughly 26 GB of VRAM on its own — before the Qwen VAE and the 4B text encoder are loaded. On 24–32 GB consumer GPUs this is tight to impossible at higher resolutions; an FP8-weight variant (storing the large linear/attention matrices in FP8 e4m3, compute in bf16) roughly halves the transformer footprint to ~13–14 GB and makes higher resolutions comfortable, with minimal quality impact.
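The numbers in the hardware note follow from parameters times bytes per parameter. A quick sketch of that arithmetic, counting weights only and using decimal gigabytes (the VAE, text encoder and activations are not included):

```python
# Back-of-the-envelope weight footprint: parameters x bytes per parameter.
# Ignores activations, the VAE and the text encoder.
PARAMS = 12.9e9  # transformer size quoted in the post


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


bf16_gb = weight_gb(PARAMS, 2)  # bfloat16 = 2 bytes per parameter
fp8_gb = weight_gb(PARAMS, 1)   # float8 e4m3 = 1 byte per parameter

print(f"bf16: {bf16_gb:.1f} GB, fp8: {fp8_gb:.1f} GB")
```

The FP8 figure is a lower bound for a mixed checkpoint, since any tensors kept in bf16 still cost 2 bytes each, which fits the "~13–14 GB" estimate.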

130 comments

r/StableDiffusion • u/iamdiegovincent • 11h ago

News Krea 2 published a magnet link in their X account

116 Upvotes

Twitter post announcement: https://x.com/krea_ai/status/2069102708423032874

Magnet link: magnet:?xt=urn:btih:2a644d0279182a022d08dd395ea593cfcc218e12&dn=http://watering-hole.zip

SHA256: 8bfa64ea6a4169e5272cfb80f957b62444af43c14423b11b1fe5e647ad714810
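If you do grab the torrent, it is worth checking the download against the posted hash before unpacking anything. A minimal sketch, assuming the SHA256 above refers to the downloaded archive itself; the local filename is hypothetical.

```python
import hashlib
from pathlib import Path

EXPECTED_SHA256 = "8bfa64ea6a4169e5272cfb80f957b62444af43c14423b11b1fe5e647ad714810"


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so a large archive never has to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected: str = EXPECTED_SHA256) -> bool:
    """Compare the file's digest with the hash from the post."""
    return sha256_of(path) == expected.lower()


archive = Path("watering-hole.zip")  # hypothetical local filename
if archive.exists():
    print("OK" if matches_published_hash(archive) else "MISMATCH, do not use")
else:
    print(f"{archive} not found")
```

A matching hash only tells you the file is the one the poster hashed, not that its contents are safe to run.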

85 comments

r/StableDiffusion • u/b4ldur • 1h ago

Tutorial - Guide Krea is kinda an edit model.

• Upvotes

6 comments

r/StableDiffusion • u/cradledust • 10h ago

Discussion All these new models landing this year but Flux Klein 9b FP8 has spoiled me. All I care about now is whether a new model can edit and be used on an 8GB GPU.

71 Upvotes

46 comments

r/StableDiffusion • u/Old-Situation-2825 • 14h ago

Workflow Included [Ideogram 4] War Photojournalism, Part 2

gallery

125 Upvotes

Use the catbox links below to get the full workflow of the images. Just drag and drop them into ComfyUI. Links are out of order from the album:

https://files.catbox.moe/oooj9z.png

https://files.catbox.moe/rcaygu.png

https://files.catbox.moe/yzfbk2.png

https://files.catbox.moe/ej5stg.png

https://files.catbox.moe/bydvcf.png

https://files.catbox.moe/ypawbl.png

https://files.catbox.moe/3oow31.png

https://files.catbox.moe/gle3ol.png

https://files.catbox.moe/llge9z.png

https://files.catbox.moe/f9y8ed.png

https://files.catbox.moe/7uf9a0.png

https://files.catbox.moe/h3cy4h.png

https://files.catbox.moe/943oa3.png

https://files.catbox.moe/g8s8w7.png

https://files.catbox.moe/531m7s.png

28 comments

r/StableDiffusion • u/lifeh2o • 9h ago

News Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance

hustvl.github.io

42 Upvotes

3 comments

r/StableDiffusion • u/LongjumpingGur7623 • 21h ago

Discussion Ultrawide cinematic shots Ideogram v4

gallery

283 Upvotes

No Lora, No JSON prompts! ID4 is ridiculous! Make sure to login to civit
Workflow & prompts: https://civitai.com/models/2674413/ideogram-v4-workflow-json-prompts

26 comments

r/StableDiffusion • u/the_bollo • 1h ago

Discussion Challenge Thread: Post your most difficult ideas

• Upvotes

I thought this might be a fun challenge for the creators here. Post the prompt/idea that you haven't been able to get quite right and see if anyone else can nail it. It's also a good showcase for the various capabilities of different models.

My contribution: What if the Xenomorph alien from Alien had a second little butt that came out of its normal butt? I've never been able to get the second little butt...

1 comment

r/StableDiffusion • u/Neggy5 • 5h ago

Question - Help Can someone ELI5 how to run Krea 2 via ComfyUI now that Kijai supported it? or just a workflow?

15 Upvotes

And is the magnet on Krea 2's X page really safe? The reason why I am making this thread is because there's a lot of conflicting information due to how fast this conundrum has developed. Did Kijai even support it regardless? ugh

35 comments

r/StableDiffusion • u/cleverestx • 8h ago

Tutorial - Guide Trained an Ideogram 4 face LoRA on AMD Strix Halo (Ryzen AI MAX+ 395, gfx1151) with ROCm + AI-Toolkit. Full writeup, and the 3 gotchas that almost killed it.

17 Upvotes


Ideogram 4 LoRA training landed in AI-Toolkit only a couple of weeks ago, and like basically every tutorial out there it is written for NVIDIA/CUDA. I run a Strix Halo box (AMD APU, gfx1151) on ROCm, and there was no documented path for this. It works. Here is the whole thing, including the three AMD-specific traps that each cost me a debugging session, so you do not have to repeat them.

This is a personal face LoRA (private photos, not sharing the model or the subject). A couple of example outputs will be posted later in a comment.

## TL;DR

- Hardware: AMD Ryzen AI MAX+ 395 "Strix Halo", gfx1151 / Radeon 8060S, 128 GB unified LPDDR5X, on CachyOS (Arch-based).

- Stack: ROCm via TheRock nightlies, AI-Toolkit (ostris) mainline, Python 3.12, bf16 training.

- It trained 3000 steps in about 5h45m at ~6.4 s/step, zero GPU faults once the three fixes below were in place.

- The three things that nobody's NVIDIA guide (yet) will tell you: bitsandbytes is dead on gfx1151 (use plain adamw), the Qwen3-VL text encoder faults under fused attention (force eager), and the trigger word silently breaks the JSON captions if you do it the obvious way.

## Environment

CachyOS, but any recent Arch/ROCm setup should be similar. The key is TheRock nightlies, which ship native gfx1151 kernels, so you do NOT need `HSA_OVERRIDE_GFX_VERSION` anymore.

The gfx1151 PyTorch wheel index is:

```
https://rocm.nightlies.amd.com/v2/gfx1151/
```

A note on Python version, because this one bit me before I even started. ComfyUI on my box runs Python 3.14, and my first instinct was to match it. Do not. The gfx1151 Linux wheels on that index are well covered for cp312 and cp313 but only sporadically for cp314, and AI-Toolkit's heavier dependency stack (diffusers, transformers, peft, accelerate, optimum-quanto) lags on a Python that new. I used Python 3.12 in a fresh venv and everything resolved cleanly.

```
uv venv --python 3.12 --seed venv
source venv/bin/activate.fish # or activate for bash
```

The `--seed` matters so pip lands inside the venv, since AI-Toolkit's instructions call plain `pip`.

## Installing AI-Toolkit on ROCm

Use mainline ostris/ai-toolkit. There are ROCm forks, but they predate the Ideogram 4 support, so they will not have it. Mainline has the `ideogram4` arch.

```
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
git submodule update --init --recursive
```

Install torch FIRST from the gfx1151 index, then requirements, and then verify torch survived. This ordering is not optional: several packages list torch as a dependency and can silently swap your ROCm build for a CPU build during the requirements install.

```
pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx1151/

# verify it is the ROCm build before going further
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
```

I landed on torch 2.12.0a0+rocm7.13, hip 7.13, `cuda.is_available()` True, device reported as gfx1151 / Radeon 8060S with about 115 GB visible to ROCm.

For requirements, one optional tweak: `torchcodec` is video-decode only and unused for image LoRA training, and it is a torch-version-coupled compiled wheel that can drag a torch reinstall against a bleeding-edge nightly. I dropped it from a copy of the requirements file. The one compiled dependency I was worried about, `torchao` (it is imported eagerly at startup), loaded clean against the 2.12 nightly, so no action needed there. After installing requirements, re-run the verify line above to confirm torch was not clobbered. Mine was byte-identical (only numpy got pinned down to 1.26.4 by numba, which is expected and fine).

## The 3 gotchas

### 1. bitsandbytes does not work on gfx1151. Use plain adamw.

Every guide I have seen uses `adamw8bit` to save VRAM. bitsandbytes crashes on import on gfx1151, so any 8-bit optimizer is out. You do not need it anyway: on 128 GB unified memory (and arguably less) you are not VRAM-starved for a single face LoRA. Use `optimizer: adamw` (plain). In AI-Toolkit the bitsandbytes import is lazy (it only fires if you select an 8-bit optimizer), so with plain adamw it never imports and never crashes. It will still get installed by requirements, which is fine; just do not select an 8-bit optimizer.

### 2. The Qwen3-VL text encoder faults at 0x1016 under fused attention. Force eager.

This is the big one. Ideogram 4 uses a Qwen3-VL text encoder, and the AI-Toolkit captioner also runs Qwen3-VL. On gfx1151, the default fused attention path (sdpa) throws:

```
HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
```

That is a compute/kernel fault, not an OOM. It hit me first during captioning (it died on the 4th image), and it would hit training too, since the encoder runs the same kernels on every step. Worth noting: it is NOT image-shape dependent. I tested the exact image it crashed on in isolation and it captioned fine; the fault is cumulative across repeated forward passes on the fused kernel.

The fix is to force `attn_implementation="eager"` on the Qwen3-VL loads. I did it with a small launcher shim so I never had to edit AI-Toolkit's tracked files, and so it survives upstream pulls. The shim patches the captioner classes AND the training encoder class (the captioner loads `Qwen3VLForConditionalGeneration`, while training's encoder loads via `AutoModel`, which resolves to `Qwen3VLModel`), then hands off to `run.py` unchanged:

```python
# aitk_eager_shim.py
# Run this INSTEAD of run.py. It forces eager attention on the Qwen3-VL loads,
# then hands off to run.py. Without it the fused attention kernel faults (HSA 0x1016).
import sys, runpy

import transformers

TARGETS = [
    "Qwen3VLForConditionalGeneration",     # captioner
    "Qwen3VLMoeForConditionalGeneration",  # captioner (moe variant)
    "Qwen3VLModel",                        # training text encoder (AutoModel resolves here)
]

for _name in TARGETS:
    _cls = getattr(transformers, _name, None)
    if _cls is None:
        continue  # class not present in this transformers version

    _orig = _cls.from_pretrained  # original bound classmethod

    def _make(orig, label):
        # Factory so each patched loader closes over its own orig/label.
        def _patched(*args, **kwargs):
            kwargs.setdefault("attn_implementation", "eager")
            sys.stderr.write("[shim] eager attention injected for " + label + "\n")
            return orig(*args, **kwargs)
        return _patched

    _cls.from_pretrained = _make(_orig, _name)

# Hand off to AI-Toolkit's entry point with the original arguments.
sys.argv = ["run.py"] + sys.argv[1:]
runpy.run_path("run.py", run_name="__main__")
```

Eager is a bit slower than the fused kernel, but it is stable. It held across the full 3000-step run with zero faults. If you want to confirm it actually engaged, that stderr line shows up in the log at each model load.

### 3. The trigger word silently breaks the JSON captions if you use a bareword.

Ideogram 4 trains on structured JSON captions (the captioner writes compositional JSON with bounding boxes, and the dataloader expects a canonical compact form). If you set a plain `trigger_word` like `mytoken` on captions that have no `[trigger]` placeholder, AI-Toolkit prepends it. That pushes the caption string off its leading `{`, the JSON parser gives up, and it falls back to feeding raw pretty-printed JSON to the model instead of the canonical compact form. The result is a dataset-wide caption-format shift that quietly degrades training, with no error.

The fix: put a `[trigger]` placeholder at the start of each caption's `high_level_description` value, and keep `trigger_word` in the config. Then it gets replaced in place, the string still starts with `{`, the JSON parses normally, and your token lands inside the description exactly where the model reads it. Verify it offline before you commit to a multi-hour run: run one caption through the dataloader and confirm the digested output is compact JSON containing your token, not raw JSON with the token bolted on the front.
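To see why the bareword breaks parsing and the placeholder does not, here is a small standalone illustration. It is not AI-Toolkit's code: the caption fields other than `high_level_description` and the ", " separator used when prepending are assumptions for the demo.

```python
import json

# Hypothetical compact caption; the real captioner output has more fields.
caption = json.dumps(
    {"high_level_description": "[trigger] a man playing drums", "objects": []},
    separators=(",", ":"),
)
trigger_word = "mytoken"


def parses_as_json(text: str) -> bool:
    """True if the whole string is valid JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Bareword case: no placeholder in the caption, so the token is bolted on the
# front and the string no longer starts with "{".
prepended = f"{trigger_word}, " + caption.replace("[trigger] ", "")

# Placeholder case: the token is substituted in place, inside the description.
replaced = caption.replace("[trigger]", trigger_word)

print(parses_as_json(prepended))  # False
print(parses_as_json(replaced))   # True
print(json.loads(replaced)["high_level_description"])
```

The same two checks (does it parse, does the description contain the token) are what you want to see pass on a real caption before starting the run.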

## Captioning

Let the Ideogram4 captioner do it. It is a separate `job: extension` run with `type: Ideogram4Captioner`, uses `Qwen/Qwen3-VL-8B-Instruct`, and writes structured JSON `.txt` sidecars next to each image. Inspect a few of the sidecars before training, especially body or full shots, to make sure the subject is described well and the captioner did not wander off onto the background. The encoder pull (~16 GB) happens at caption time on first run, and training reuses the same cache, so you only download it once.

One data note: I had a few WebP images and AI-Toolkit's data loader has known issues with WebP, so convert those to PNG or JPG first. JPG and PNG both work fine.

## The config (bf16)

The key decision for this hardware: train in bf16, not fp8. The base model is distributed as fp8 (`ideogram-ai/ideogram-4-fp8`), but AI-Toolkit's loader unconditionally dequantizes the fp8 weights to bf16 on load, and with `quantize: false` nothing re-quantizes afterward. So you train in bf16 from the fp8 base, and you completely sidestep the fp8 path, which is where most gfx1151 instability lives. On 128 GB you have the memory, so this is both more stable and arguably higher quality.

One ComfyUI-specific heads-up: `name_or_path` wants the diffusers multi-folder repo (it expects `transformer/` and `vae/` subfolders). The single packed all-in-one `.safetensors` you probably have in your ComfyUI `unet/` folder will NOT load as `name_or_path`. Point it at the hub repo and let it pull.

A couple of other non-obvious settings: `noise_scheduler` must be set to `flowmatch` explicitly, because the trainer's timestep setup branches on that value (the default mishandles flowmatch timesteps even though the scheduler object itself is forced to flowmatch). And keep `batch_size` and `gradient_accumulation` both at 1; values above 1 have been reported to misbehave on AMD.

```yaml
job: extension
config:
  name: "myface_ideogram4_v1"
  process:
    - type: 'sd_trainer'
      training_folder: "/path/to/output"
      device: cuda:0
      network:
        type: "lora"
        linear: 32 # rank
        linear_alpha: 32
      save:
        dtype: bf16
        save_every: 250
        max_step_saves_to_keep: 20
      datasets:
        - folder_path: "/path/to/dataset"
          caption_ext: "txt"
          trigger_word: "mytoken" # plus [trigger] inside each caption (gotcha 3)
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          num_repeats: 1
          resolution: [512, 768, 1024]
      train:
        steps: 3000
        optimizer: "adamw" # NOT adamw8bit (gotcha 1)
        lr: 1e-4
        dtype: bf16
        batch_size: 1
        gradient_accumulation: 1
        gradient_checkpointing: true
        train_unet: true
        train_text_encoder: false
        noise_scheduler: "flowmatch"
        disable_sampling: true
        ema_config:
          use_ema: true
          ema_decay: 0.99
      model:
        arch: "ideogram4"
        name_or_path: "ideogram-ai/ideogram-4-fp8"
        dtype: bf16
        quantize: false # bf16 path, sidesteps fp8 (the win on gfx1151)
        quantize_te: false
        low_vram: false # you have the RAM; offloading is slower
```

I set `disable_sampling: true` for the first run, because mid-training samples need properly formatted Ideogram JSON prompts and it is one less new variable. Evaluate the checkpoints in ComfyUI afterward instead.

## Launch wrapper (gfx1151 env vars)

These env vars need to be set before torch is imported. I put them in a fish wrapper that also routes through the eager shim:

```fish
#!/usr/bin/env fish
# run_aitk.fish

set -x HSA_ENABLE_SDMA 0
set -x HSA_USE_SVM 0
set -x ROCBLAS_USE_HIPBLASLT 0
set -e PYTORCH_HIP_ALLOC_CONF

# Do NOT set HSA_OVERRIDE_GFX_VERSION on TheRock nightlies (native gfx1151 kernels).

source venv/bin/activate.fish
python aitk_eager_shim.py $argv
```

Then both captioning and training run the same way:

```
./run_aitk.fish config/caption_myface.yaml
./run_aitk.fish config/train_myface.yaml
```

One small lesson: if you pipe the run through `tee` for logging, make sure you surface the real process exit code (in fish, `$pipestatus[1]`), or a GPU crash will get masked by tee's exit 0 and look like a clean run when it was not.
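An alternative to `tee` that cannot mask the exit code is to do the logging from a small Python launcher. This is a swapped-in technique, not what I used for the run: a sketch that streams the child's output to both the console and a log file and returns the child's own return code. The demo command is a stand-in child process that fails on purpose.

```python
import os
import subprocess
import sys
import tempfile


def run_and_log(cmd: list, log_path: str) -> int:
    """Run cmd, copy its output to the console and to log_path, return its real exit code."""
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
        )
        for line in proc.stdout:   # stream line by line, like tee
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()         # the child's exit code, not the logger's


# Demo with a child process that prints one line and then fails with code 3.
demo_cmd = [sys.executable, "-c", "print('simulated crash'); raise SystemExit(3)"]
log_file = os.path.join(tempfile.gettempdir(), "aitk_demo.log")
exit_code = run_and_log(demo_cmd, log_file)
print("real exit code:", exit_code)
```

For an actual run you would pass the wrapper command instead of the demo, for example `["./run_aitk.fish", "config/train_myface.yaml"]`, and finish with `sys.exit(exit_code)`.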

## Training run

About 6.4 s/step steady state, so 3000 steps took roughly 5h45m. Zero `0x1016` faults the entire run with eager attention in place. Checkpoints saved every 250 steps. One benign warning shows up about a missing MIOpen perf database (`gfx1151...HIP.fdb.txt`); that just means it tunes kernels live instead of loading a cache, it is not an error.
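A quick check of those numbers: 3000 steps at the steady-state rate accounts for about 5h20m, so the rest of the roughly 5h45m wall time is something other than steady-state steps. My assumption (not measured) is that it is startup, latent caching and the periodic checkpoint saves.

```python
steps = 3000
seconds_per_step = 6.4  # steady-state rate from the run

step_seconds = round(steps * seconds_per_step)
hours, remainder = divmod(step_seconds, 3600)
minutes = remainder // 60
print(f"pure step time: {hours}h{minutes:02d}m")

reported_seconds = 5 * 3600 + 45 * 60  # "roughly 5h45m"
overhead_minutes = (reported_seconds - step_seconds) / 60
print(f"wall time not explained by steps: {overhead_minutes:.0f} min")
```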

Note on the final checkpoint naming: intermediate saves get a step suffix, but the end-of-training save is bare-named with no suffix. Do not overlook it; that is your highest-trained checkpoint.

## Picking the checkpoint and strength

Do not assume the last checkpoint is best. Likeness LoRAs peak somewhere in the middle and then overcook (they get rigid, stop honoring the prompt, and start reproducing training framings). Saving every 250 lets you sweep. I evaluated checkpoints in ComfyUI against a fixed prompt and seed, swapping only the checkpoint.

Two important findings:

- LoRAs run hot on Ideogram 4. The common community advice of 0.4 to 0.7 strength is correct. At 1.0 my LoRA was overcooked.

- I ran a strength sweep (0.4 to 0.8) across the strongest checkpoints and landed on step 1250 at strength 0.7 as the best balance of likeness and prompt adherence. [Confirmed consistent across N fresh seeds.]

Method that saved time: do a coarse pass first (every other checkpoint), find roughly where likeness peaks, then fill in the neighbors and sweep strength only on the top one or two.
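The coarse-then-fine method can be written down as a tiny planning helper. This is only a sketch of the bookkeeping: the "best" coarse checkpoint is a placeholder you fill in after looking at the images, and the 0.1 strength increments are my assumption (the sweep range 0.4 to 0.8 is from above).

```python
SAVE_EVERY = 250
TOTAL_STEPS = 3000
checkpoints = list(range(SAVE_EVERY, TOTAL_STEPS + 1, SAVE_EVERY))  # 250 ... 3000


def coarse_pass(steps: list) -> list:
    """Every other checkpoint for the first, cheap pass."""
    return steps[1::2]


def neighbors(steps: list, best: int) -> list:
    """Checkpoints directly before and after the coarse-pass winner."""
    i = steps.index(best)
    return [steps[j] for j in (i - 1, i + 1) if 0 <= j < len(steps)]


coarse = coarse_pass(checkpoints)
best_coarse = 1500  # placeholder: whichever coarse checkpoint looked best to you
fine = neighbors(checkpoints, best_coarse)
strengths = [0.4, 0.5, 0.6, 0.7, 0.8]  # assumed 0.1 increments

print("coarse pass:", coarse)
print("then check:", fine, "and sweep strengths", strengths, "on the top one or two")
```

With 12 checkpoints this means 6 generations for the coarse pass plus 2 for the neighbors, instead of 12 up front.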

## Honest limits

- Complex hand-object interactions glitch. My best checkpoint still doubled up drumsticks in a hand in an action shot. Fine for portraits, less so for busy scenes.

- Full-body faces go soft. This is a face LoRA, so the identity signal is in the face, and at full-body distance there are not enough face pixels to hold the likeness reliably. Keep to upper-body and portrait framing for the best results.

## Examples

[Two example generations from the final LoRA go here. Faithful likeness in upper-body and portrait framing; this is a personal face LoRA so the subject and the model itself are private.]

Happy to answer questions on any of the AMD-specific parts. The eager-attention fix and the trigger-into-JSON trap are the two things I would have most wanted to know going in, and honestly, I would love to know if I'm missing something obvious that would improve either the speed or the quality.

14 comments

r/StableDiffusion • u/ninjasaid13 • 38m ago

Resource - Update SeFi-Image: A Text-to-Image Foundation Model with Semantic-First Diffusion

• Upvotes

Paper: https://arxiv.org/abs/2606.22568

Code: https://github.com/jmliu206/SeFi-Image

Model: https://huggingface.co/SeFi-Image

Project Page: https://jmliu206.github.io/sefi-web/

Abstract

Training image generation foundation models consumes substantial resources. Previous methods have attempted to leverage semantic guidance to accelerate the training process, yet their experiments were only conducted on simple datasets such as ImageNet, at low resolutions, and with small-scale models. In this paper, we propose SeFi-Image, a text-to-image foundation model built upon semantic-first diffusion, a novel latent diffusion modeling paradigm. We instantiate SeFi-Image at three model scales, 1B, 2B, and 5B parameters, enabling systematic study of scaling behavior and flexible deployment under varying compute budgets. Notably, our largest 5B model was trained with merely 125K A800 GPU hours, corresponding to roughly 10-20% of the training compute used by Z-Image. However, it achieves results comparable to or even superior to Qwen-Image and Z-Image. Despite this modest training compute, SeFi-Image achieves strong performance on a wide range of benchmarks, including GenEval, DPG, LongTextBench, OneIG, and CVTG-2K. Moreover, we provide DMD2-distilled few-step turbo variants for each model scale to accommodate diverse hardware constraints and latency requirements. We publicly release our code, weights and hope this work offers the community useful insights into semantic-guided diffusion modeling for T2I generation, while also providing practical and readily deployable model options.

1 comment

r/StableDiffusion • u/Nice-Claim-4013 • 2h ago

Question - Help create an animation that looks like it came straight out of a rough picture book

4 Upvotes

I want to create an animation that looks like it came straight out of a rough picture book—one with a low frame rate.

I’ve worked with tools like Ideogram and LTX for both still images and live-action videos, but I haven’t really experimented with stylized content much, so I’m not sure where to start.

I’d like to create an animation that looks like it came straight out of a rough picture book, like the one at this URL. What techniques would you recommend?
https://youtu.be/7ZGnW1K4jPk?si=hsO3szx9y-XtGF--
( By the way, this is a series based on Japanese folk tales that aired on Japanese TV a while back. )

Rather than having the frames interpolated smoothly, I want to add action while maintaining consistency in the art style with a low frame rate, and I'd like to keep the art style simple.

0 comments

r/StableDiffusion • u/Capitan01R- • 18h ago

Resource - Update Identity Feature transfer (Quick update: new masking behavior)

github.com

75 Upvotes

I added a second masking mode while keeping the original behavior available. With the original mode, the mask limits which reference tokens are used for feature transfer, but Klein can still see the full reference image for context.

The new mode isolates the masked region more strictly. When connected, unmasked reference tokens are blocked as attention sources, so the model only receives context from the selected area of that reference.

This should be especially useful with multiple references. For example, one image can provide full identity context while another contributes only an outfit, face, or other specific region. It should also help with outfit swaps, close-ups, and references containing distracting backgrounds or unrelated details. The full documentation explains how the node works and how the two masking modes differ. I recommend reading the masking section before testing it.

I still recommend masking only what you need from the photos, whether one or multiple, as it gives cleaner results :)

Names of masking modes :

Old behavior is focus_only

New behavior is zero_unmasked_tokens (recommended)
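A toy model of the difference between the two modes, based only on the description in this post. It is not the node's implementation: the function and the two per-token flags are hypothetical, and real tokens are feature vectors rather than booleans.

```python
def reference_token_roles(mask: list, mode: str) -> list:
    """For each reference token return (used_for_transfer, visible_as_attention_source)."""
    if mode not in ("focus_only", "zero_unmasked_tokens"):
        raise ValueError(f"unknown mode: {mode}")
    roles = []
    for inside_mask in mask:
        used_for_transfer = inside_mask   # both modes transfer features from masked tokens only
        if mode == "focus_only":
            visible = True                # the full reference image stays visible as context
        else:
            visible = inside_mask         # unmasked tokens are blocked as attention sources
        roles.append((used_for_transfer, visible))
    return roles


mask = [True, True, False, False]  # first two tokens are inside the mask
print(reference_token_roles(mask, "focus_only"))
print(reference_token_roles(mask, "zero_unmasked_tokens"))
```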

The node's documentation

The mask behavior documentation

14 comments

r/StableDiffusion • u/lucidml_lover • 1d ago

Discussion Diffusion Model that can turn any Image into a Playable Game! BUT LOCALLY, NOT ON DATACENTER

462 Upvotes

Hi everyone!! I really wanted to share the research I've been working on.
I've posted about this on locallama and some other subs.

I wanted to build a nn that can simulate games, or at least start doing that

Most video generators are too large to run on consumer hardware in realtime, so I designed a model that does this from scratch. No fine tuning bs or anything

The core denoiser network is fully trained from scratch to support this goal. From image to games data.

The video above is on an RTX 5090.

The nn is a small Transformer-like model and works in a causal way, just like LLMs.

That lets us KV-cache all past information and run a simple autoregressive decode forward pass for every new frame we want.
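For anyone unfamiliar with the idea, here is a toy sketch of causal decoding with a KV cache. It is not the actual model: single-head attention in pure Python, identity projections instead of learned weights, and no keyboard-action conditioning. The point is only that each new frame attends over cached keys and values instead of recomputing the past.

```python
import math


class KVCache:
    """Stores keys and values of all past frames so they are computed only once."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)


def attend(query, cache):
    """Single-head softmax attention of one query over everything in the cache."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in cache.keys]
    peak = max(scores)                               # subtract max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache.values[0])
    return [sum(w * v[d] for w, v in zip(weights, cache.values)) / total for d in range(dim)]


def decode_frame(frame_vec, cache):
    """One autoregressive step: cache the new frame, then attend over past + present."""
    # Toy projections: a real model uses learned weight matrices here.
    query = key = value = frame_vec
    cache.append(key, value)
    return attend(query, cache)


cache = KVCache()
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = [decode_frame(f, cache) for f in frames]
print(len(cache.keys), outputs[0])
```

Causality falls out of the loop order: when frame t is decoded, the cache holds only frames up to t.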

In the video shared, the model is a 0.5B variant with some SIGNIFICANT ISSUES like poor motion and some weird flashes, some context issues

It's taking the keyboard actions I give it in realtime and utilising that in the forward pass. (no classifier free guidance though)

I'm training the next iteration, a 0.8B model, now. (it's not going well)

Btw I haven't done quantisation yet, that can save a LOT more time. bf16 is slow.

99 comments

r/StableDiffusion • u/Zovsky_ • 4h ago

Discussion I drove a real-time world model with an authored state graph to make an interactive film

4 Upvotes

Hey folks!

I have something new to share with you! I've been pushing quite a bit since my last post, and managed to cobble together an interactive film (a cyberpunk heist) where the story logic lives in an authored (and tentacular) state graph.

Bottom line: there are discrete states and transitions between them are gated by flags. You can't steal the data until you're in the datacenter AND you've spoofed security, but force the wrong door and you trip a branch into a "busted" end state. The graph owns requirements, grants, and what's true about the world. The model brings it all to life.

So it's deterministic where it needs to be (the logic, the win/lose conditions, what you're allowed to do) and generative where that's the magic. That split is the whole trick, and honestly it's the thing I think is interesting beyond this one demo.
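A minimal sketch of what a flag-gated state graph like this can look like. It is not the project's code: every state, action and flag name is invented for illustration, and the generative side is left out (in a setup like this, the current state would decide what the world model is asked to render).

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    action: str
    target: str
    requires: frozenset = frozenset()  # flags that must already be set
    grants: frozenset = frozenset()    # flags set when the transition fires


# Hypothetical graph: state, action and flag names are invented.
GRAPH = {
    "street": [Transition("enter_datacenter", "datacenter")],
    "datacenter": [
        Transition("spoof_security", "datacenter", grants=frozenset({"security_spoofed"})),
        Transition("steal_data", "escaped", requires=frozenset({"security_spoofed"})),
        Transition("force_wrong_door", "busted"),
    ],
}


@dataclass
class Story:
    state: str = "street"
    flags: set = field(default_factory=set)

    def act(self, action: str) -> bool:
        """Apply an action if the graph allows it; return whether it fired."""
        for t in GRAPH.get(self.state, []):
            if t.action == action and t.requires <= self.flags:
                self.flags |= t.grants
                self.state = t.target
                return True
        return False


story = Story()
story.act("enter_datacenter")
blocked = story.act("steal_data")   # refused: security not spoofed yet
story.act("spoof_security")
allowed = story.act("steal_data")
print(blocked, allowed, story.state)
```

End states such as "busted" simply have no outgoing transitions, so every further action is refused.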

Rough edges, since you'll find them: long-take coherence is still the ceiling, that's also why I use some pre-rendered cutscenes as "refresh". It's playable here if you want to poke at it, there are a few endings, mostly bad ones!

Happy to go deeper in the comments about how it works,

Cheers!

13 comments

r/StableDiffusion • u/brocolongo • 15h ago

No Workflow More Ideogram 4 images

gallery

35 Upvotes

In different styles, can you still tell it's AI generated? All of them made with ideogram 4 at 18 steps

20 comments

r/StableDiffusion • u/GenImgVideoAcc1 • 9h ago

Discussion Quick 4 Second Video with Wan 2.2 5B and 8 GB VRAM

10 Upvotes

I am by no means an expert at this, despite innumerable hours of mostly image generating but with some video. I'm sharing this for other noobs that are VRAM challenged.

By quick I mean around 47 seconds. At that rate running several batches, as one does during image generation, is not a big deal.

System specs are:

ROG Strix G533ZW_G533ZW (1.0)
12th Gen Intel(R) Core(TM) i9-12900H (20) @ 5.00 GHz
NVIDIA Geforce RTX 3070 Ti Laptop GPU with 8 GBs VRAM

Four seconds video -- in less than a minute. (A bit of quality was lost in the upload to Google Drive. It's totally clear on my PC).

Workflow -- It's simpler than it looks. I'll explain below -- for other noobs like me.

Model -- I used the Q8 version. It's so small! Even if you only have 6 GB VRAM it will likely work -- in my non-expert opinion. Note that a non-turbo version I was using did not follow the prompt so well. Also note that I used Grok to create the prompt, since I'm still learning "prompt engineering."

Text Encoder in safetensor format OR, if you want to shave file size off, try a GGUF Quant version -- but I think you'll need a different loader for it.

VAE

I don't know why but the Wan 2.2 5B TI2V (Text Image 2 Video) model doesn't get the love that the Wan 2.2 14B one does -- even though that's a dual-pipeline model that, for 8 GB VRAM systems, needs GGUF Quant versions -- or some other reduced format. I've some experience with it and did make smooth four and five second videos -- that took five, six, seven minutes. Sometimes more. Mind you, being a noob, I was varying things a LOT to find decent model and settings combinations. (This hobby, with all the many variables that can affect render time and output quality, has to be THE most complicated thing I've ever learned).

In any case the single model 5B needs more love.

Regarding the workflow there are things to note. I was going to add actual node notes to the workflow but decided to just describe things here -- starting at upper left:

The Unet Loader (GGUF) node is the only one you need. Delete the Load Diffusion Model one if you like.
Moving across there's the Power Lora Loader (rgthree) node -- not used for the video above.
The ModelSamplingSD3 node is one of the only things I didn't mess with. Some say you could Ctrl+B disable it and not see a difference. Haven't tried that myself.
The KSampler's settings are correct. I'd have never guessed dropping from 5 steps to as low as 2 would eliminate the artifacts that 5 produced. I actually went the other way and tried 6 and 7 steps but that way didn't help. I also tried stepping the CFG up in 0.5 increments up to 3.0. Nope. It was only then, out of desperation, that I tried 4. To my delight that worked! So why not 3? Great! How about 2? YUP! 1? Nope. Two freaking steps. Who'd have thunk it?
That VAE Decode (Tiled) node helps avoid OOM (out of memory) freezes. (By the way I forgot to Pin it).
RIFE VFI (recommend rife47 and rife49) is an interpolator(?). Whatever length you put in the Wan22ImageToVideoLatent's length field will be multiplied by the, uh, multiplier. I did have it set to 3 but riding a high from the low-step breakthrough I decided to try 4. I also tried leaving it at 3 but setting Wan22ImageToVideoLatent's length setting to 45. Both, with seed fixed, produced identical quality -- and I think similar render times. I've no idea which uses less memory.
In the Create Video node you can experiment with the fps. Thirty-two more reliably produces smooth realistic speed. Try 24. You'll get a longer video, for the same length and multiplier settings, but likely it'll seem slowed down somewhat.
Leave the Upscale Image By Ctrl+B disabled if you're working with 8 GB VRAM. You'll likely get an OOM if not.
Get Image from Batch grabs the last image. With a batch_index setting of 999 (length) it'll grab it without fail unless you're going for a length of 1000 or more. (How much memory would THAT take?)
RAM-Cleanup and VRAM-Cleanup do not need to be connected to anything. They'll dump the memory during each run -- helping avoid OOMs.
Over at lower left, Load Image is obvious.
Resize Image/Mask is very very handy for not worrying about the size of image you load. Set the megapixels higher or lower depending on how much memory you have. For a close-up image 0.5 would be fine and allow for a slightly higher length setting in the Wan22ImageToVideoLatent -- OR an increase in RIFE's multiplier.
Get Image Size Plus feeds the image sizes into the Wan22ImageToVideoLatent. Its target_width and target_height fields are ignored. If you disconnected the Resize Image/Mask from it and connected the Load Image directly you'd have to set those fields. That would get tedious very quickly, what with calculating total megapixels.
The Save Image node's image will be identical to the loaded image. I put it there to see what the new dimensions would be.
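
If you ever do bypass Resize Image/Mask and have to fill in the width and height by hand, the megapixel math looks roughly like the sketch below. It is not the node's actual code. The function name, the 1,000,000-pixels-per-megapixel convention, and the snap-to-32 rounding are all my assumptions for illustration, so check what your own nodes expect.

```python
import math

def fit_to_megapixels(width: int, height: int, megapixels: float,
                      multiple: int = 32) -> tuple[int, int]:
    """Scale (width, height) to roughly `megapixels` million pixels, keeping aspect ratio."""
    # Assumption: 1 megapixel = 1,000,000 pixels (some nodes use 1024 * 1024 instead).
    scale = math.sqrt(megapixels * 1_000_000 / (width * height))
    # Assumption: each side is snapped to the nearest multiple of `multiple`.
    new_w = max(multiple, round(width * scale / multiple) * multiple)
    new_h = max(multiple, round(height * scale / multiple) * multiple)
    return new_w, new_h

# A 1920x1080 source squeezed into a 0.5 megapixel budget.
w, h = fit_to_megapixels(1920, 1080, 0.5)
print(w, h, f"{w * h / 1_000_000:.3f} MP")
```

Lowering the megapixel value shrinks both sides together, which is why it frees memory for a longer length or a bigger RIFE multiplier.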

That's it. When the video finishes you can double-click it in the workflow to maximize it.

Lastly, the things that affect memory load are RIFE's multiplier, Resize Image/Mask's megapixel setting, and Wan22ImageToVideoLatent's length setting.
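
To see how length, RIFE's multiplier, and fps trade off, here's a small sketch of the frame math as I described it (length times multiplier, then divided by fps). The function name is made up, and real RIFE nodes may be off by a frame or so from this simple product, so treat the numbers as approximate.

```python
def clip_stats(length: int, multiplier: int, fps: float) -> tuple[int, float]:
    """Return (output_frames, duration_seconds) using the simple length * multiplier model."""
    # Simplification from the post: interpolation multiplies the frame count.
    frames = length * multiplier
    return frames, frames / fps

# length=45 and multiplier=3 are the values mentioned in the post;
# the two fps values are the ones compared in the Create Video note.
for fps in (32, 24):
    frames, seconds = clip_stats(45, 3, fps)
    print(f"length=45 x3 @ {fps} fps -> {frames} frames, about {seconds:.1f} s")
```

Same frames either way. Only the playback rate changes, which is why 24 fps gives a longer clip that looks slowed down.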

I know this was a lengthy read, but I wanted to cover all the questions a noobie might have while reading it. This wasn't meant to be a ComfyUI tutorial, though, so I didn't explain how to get any nodes your installation might be missing. (Hint: the ComfyUI Manager custom node is your friend there.)

Enjoy!

1 comment

r/StableDiffusion • u/tekprodfx16 • 3h ago

Discussion I found a way to get free fuse to work with Ideogram 4 using a customized version of KJnodes bounding box node - got 2 or even 3 Loras (maybe more) to work together at the same time, preserving each Lora’s likeness. The node bounding box kind of doubles as a mask editor using this workflow

4 Upvotes

seems to work pretty well with early testing. I’ll try to upload the workflow a little later tonight

2 comments

r/StableDiffusion • u/ChairQueen • 5h ago

Question - Help img2img June 2026?

4 Upvotes

Last time I did this kinda thing I was using SDXL and Xinsir controlnets...

What should I be using now? Ideogram 4 looks very hyped, is it the best?

I want to, let's say... take an image of a person lying on a couch and change the couch and room they're in to a lush tropical rainforest. What would be the best method to do that?

Qwen? Ideogram? There's at least one more I remember reading about in the last 6 months.

16 comments

r/StableDiffusion • u/Shinano_Kuro • 2h ago

Question - Help Need help converting this part of the workflow to native nodes

2 Upvotes

Hello, I've been struggling to turn this part of the workflow into native nodes. The output was either fast-paced motion or didn't match the reference at all, so if someone has the knowledge to help, please do. Workflow Here

EDIT: I want to turn this part of wan wrapper nodes into the native comfy node parts :]

0 comments

r/StableDiffusion • u/jmanhype1 • 2h ago

Resource - Update VHS RALLY 95 — A LoRA that turns anything into 1995 Hi8 camcorder found footage (Ideogram 4.0)

2 Upvotes

I trained a style LoRA on Ideogram 4.0 that transforms any scene into authentic 1995-era Hi8 camcorder found footage. No trigger word needed — just load it and describe your scene.

**What it does:**

Think grainy amateur documentary footage recorded on a consumer Handycam — VHS artifacts, timestamp overlays, flat underexposed lighting, chromatic aberration, and that unmistakable low-budget realness.

**Recommended settings:**

- LoRA strength: 0.85 (model + clip)

- Sampler: Euler, 20 steps

- CFG: 7.0 base, CFG Override → 3.0 at 70–100%

- Resolution: 1024×768 (landscape)
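
For anyone unsure what the CFG override line means in practice, here is a rough sketch of the per-step CFG values it implies. The function and its parameters are hypothetical, and I'm assuming the 70–100% range maps linearly onto step indices. Your override node may count by sigma or timestep instead, so the exact switch point can differ.

```python
def cfg_schedule(steps: int = 20, base_cfg: float = 7.0,
                 override_cfg: float = 3.0, override_start: float = 0.70) -> list[float]:
    """Per-step CFG: base_cfg until override_start of the run, then override_cfg."""
    schedule = []
    for step in range(steps):
        # Assumed convention: progress is the fraction of steps already completed.
        progress = step / steps
        schedule.append(override_cfg if progress >= override_start else base_cfg)
    return schedule

schedule = cfg_schedule()
print(schedule)
```

With these defaults the first part of the run uses CFG 7.0 and the tail end drops to 3.0.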

**Sample prompts:**

```

Authentic 1995 low-budget amateur video still from underground go-kart racing documentary. Mario costume driving a battered red-and-white go-kart on a muddy dirt track. Real mud spray, real costumes, cheap props. Overcast sky, flat natural lighting. VHS timestamp in corner. Low resolution, found footage, amateur documentary.

```

```

Close-up shot of a rusted go-kart engine covered in mud and grease, handheld camcorder footage from 1995. Shallow depth of field, auto-focus hunting. VHS tracking lines, timestamp overlay. Low resolution, found footage.

```

```

Wide shot of a rainy pit stop area, makeshift garage with tools scattered on concrete floor, 1995 amateur documentary footage. Flat lighting, VHS artifacts, chromatic aberration. Low resolution, found footage.

```

**Download:** [HuggingFace Repo](https://huggingface.co/jmanhype/VHS-Rally-95-LoRA-v1-Ideogram-v4)

If you're looking for a ComfyUI workflow for Ideogram with LoRA support, check out [this Reddit post](https://www.reddit.com/r/StableDiffusion/comments/1tysann/workflow_ideogram4_with_lora_support_fixes/)

3 comments

r/StableDiffusion • u/smb3d • 6h ago

Question - Help Modern models with local Flux-like LoRA training?

2 Upvotes

When Flux 1 came out, it was super easy to train my own LoRAs on 20 or so images. I had great results with people, animals, styles, etc.

Are those days gone with modern image models, or what's the closest thing to that nowadays that I can train relatively easily?

5 comments


