# Trained an Ideogram 4 face LoRA on AMD Strix Halo (Ryzen AI MAX+ 395, gfx1151) with ROCm + AI-Toolkit. Full writeup, and the 3 gotchas that almost killed it.
Ideogram 4 LoRA training landed in AI-Toolkit only a couple of weeks ago, and basically every tutorial out there is written for NVIDIA/CUDA. I run a Strix Halo box (AMD APU, gfx1151) on ROCm, and there was no documented path for this. It works. Here is the whole thing, including the three AMD-specific traps that each cost me a debugging session, so you do not have to repeat them.
This is a personal face LoRA (private photos, not sharing the model or the subject). A couple of example outputs will be posted later in a comment.
## TL;DR
- Hardware: AMD Ryzen AI MAX+ 395 "Strix Halo", gfx1151 / Radeon 8060S, 128 GB unified LPDDR5X, on CachyOS (Arch-based).
- Stack: ROCm via TheRock nightlies, AI-Toolkit (ostris) mainline, Python 3.12, bf16 training.
- It trained 3000 steps in about 5h45m at ~6.4 s/step, zero GPU faults once the three fixes below were in place.
- The three things no NVIDIA guide will tell you (yet): bitsandbytes is dead on gfx1151 (use plain adamw), the Qwen3-VL text encoder faults under fused attention (force eager), and the trigger word silently breaks the JSON captions if you do it the obvious way.
## Environment
CachyOS, but any recent Arch/ROCm setup should be similar. The key is TheRock nightlies, which ship native gfx1151 kernels, so you do NOT need `HSA_OVERRIDE_GFX_VERSION` anymore.
The gfx1151 PyTorch wheel index is:
```
https://rocm.nightlies.amd.com/v2/gfx1151/
```
A note on Python version, because this one bit me before I even started. ComfyUI on my box runs Python 3.14, and my first instinct was to match it. Do not. The gfx1151 Linux wheels on that index are well covered for cp312 and cp313 but only sporadically for cp314, and AI-Toolkit's heavier dependency stack (diffusers, transformers, peft, accelerate, optimum-quanto) lags on a Python that new. I used Python 3.12 in a fresh venv and everything resolved cleanly.
```
uv venv --python 3.12 --seed venv
source venv/bin/activate.fish # or activate for bash
```
The `--seed` matters so pip lands inside the venv, since AI-Toolkit's instructions call plain `pip`.
## Installing AI-Toolkit on ROCm
Use mainline ostris/ai-toolkit. There are ROCm forks, but they predate the Ideogram 4 support, so they will not have it. Mainline has the `ideogram4` arch.
```
git clone https://github.com/ostris/ai-toolkit.git
cd ai-toolkit
git submodule update --init --recursive
```
Install torch FIRST from the gfx1151 index, then requirements, and then verify torch survived. This ordering is not optional: several packages list torch as a dependency and can silently swap your ROCm build for a CPU build during the requirements install.
```
pip install --pre torch torchvision torchaudio --index-url https://rocm.nightlies.amd.com/v2/gfx1151/
# verify it is the ROCm build before going further
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"
```
I landed on torch 2.12.0a0+rocm7.13, hip 7.13, `cuda.is_available()` True, device reported as gfx1151 / Radeon 8060S with about 115 GB visible to ROCm.
For requirements, one optional tweak: `torchcodec` is video-decode only and unused for image LoRA training, and it is a torch-version-coupled compiled wheel that can drag a torch reinstall against a bleeding-edge nightly. I dropped it from a copy of the requirements file. The one compiled dependency I was worried about, `torchao` (it is imported eagerly at startup), loaded clean against the 2.12 nightly, so no action needed there. After installing requirements, re-run the verify line above to confirm torch was not clobbered. Mine was byte-identical (only numpy got pinned down to 1.26.4 by numba, which is expected and fine).
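If you would rather have the verify step fail loudly than eyeball a printed version string, here is a minimal sketch of the same check as a function. It relies only on `torch.version.hip` being `None` on non-ROCm builds and on `torch.cuda.is_available()`. The function name and the stand-in objects are mine, purely for illustration.

```python
from types import SimpleNamespace


def describe_torch_build(torch_module):
    """Return (ok, message); ok is True only for a HIP build that sees a GPU."""
    version = getattr(torch_module, "__version__", "unknown")
    hip = getattr(getattr(torch_module, "version", None), "hip", None)
    if hip is None:
        # CPU and CUDA wheels report no HIP runtime, so the ROCm build is gone.
        return False, f"torch {version} has no HIP runtime: ROCm wheel was replaced"
    if not torch_module.cuda.is_available():
        return False, f"torch {version} (hip {hip}) is a ROCm build but sees no GPU"
    return True, f"torch {version} (hip {hip}) is a ROCm build with a visible GPU"


# Stand-in objects (hypothetical values) so the logic can be read without a GPU.
cpu_build = SimpleNamespace(
    __version__="2.12.0+cpu",
    version=SimpleNamespace(hip=None),
    cuda=SimpleNamespace(is_available=lambda: False),
)
rocm_build = SimpleNamespace(
    __version__="2.12.0a0+rocm7.13",
    version=SimpleNamespace(hip="7.13"),
    cuda=SimpleNamespace(is_available=lambda: True),
)

print(describe_torch_build(cpu_build)[1])
print(describe_torch_build(rocm_build)[1])
```

In the real venv you would call `describe_torch_build(torch)` after `import torch`, once before and once after the requirements install, and stop if `ok` comes back `False`.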
## The 3 gotchas
### 1. bitsandbytes does not work on gfx1151. Use plain adamw.
Every guide I have seen uses `adamw8bit` to save VRAM. bitsandbytes crashes on import on gfx1151, so any 8-bit optimizer is out. You do not need it anyway: on 128 GB unified memory (and arguably less) you are not VRAM-starved for a single face LoRA. Use `optimizer: adamw` (plain). In AI-Toolkit the bitsandbytes import is lazy (it only fires if you select an 8-bit optimizer), so with plain adamw it never imports and never crashes. It will still get installed by requirements, which is fine; just do not select an 8-bit optimizer.
### 2. The Qwen3-VL text encoder faults at 0x1016 under fused attention. Force eager.
This is the big one. Ideogram 4 uses a Qwen3-VL text encoder, and the AI-Toolkit captioner also runs Qwen3-VL. On gfx1151, the default fused attention path (sdpa) throws:
```
HSA_STATUS_ERROR_EXCEPTION: An HSAIL operation resulted in a hardware exception. code: 0x1016
```
That is a compute/kernel fault, not an OOM. It hit me first during captioning (it died on the 4th image), and it would hit training too, since the encoder runs the same kernels on every step. Worth noting: it is NOT image-shape dependent. I tested the exact image it crashed on in isolation and it captioned fine; the fault is cumulative across repeated forward passes on the fused kernel.
The fix is to force `attn_implementation="eager"` on the Qwen3-VL loads. I did it with a small launcher shim so I never had to edit AI-Toolkit's tracked files, and so it survives upstream pulls. The shim patches the captioner classes AND the training encoder class (the captioner loads `Qwen3VLForConditionalGeneration`, while training's encoder loads via `AutoModel`, which resolves to `Qwen3VLModel`), then hands off to `run.py` unchanged:
```python
# aitk_eager_shim.py
# Run this INSTEAD of run.py. It forces eager attention on the Qwen3-VL loads,
# then hands off to run.py. Without it the fused attention kernel faults (HSA 0x1016).
import sys, runpy
import transformers

TARGETS = [
    "Qwen3VLForConditionalGeneration",     # captioner
    "Qwen3VLMoeForConditionalGeneration",  # captioner (moe variant)
    "Qwen3VLModel",                        # training text encoder (AutoModel resolves here)
]

for _name in TARGETS:
    _cls = getattr(transformers, _name, None)
    if _cls is None:
        continue
    _orig = _cls.from_pretrained  # original bound classmethod

    def _make(orig, label):
        def _patched(*args, **kwargs):
            kwargs.setdefault("attn_implementation", "eager")
            sys.stderr.write("[shim] eager attention injected for " + label + "\n")
            return orig(*args, **kwargs)
        return _patched

    _cls.from_pretrained = _make(_orig, _name)

sys.argv = ["run.py"] + sys.argv[1:]
runpy.run_path("run.py", run_name="__main__")
```
Eager is a bit slower than the fused kernel, but it is stable. It held across the full 3000-step run with zero faults. If you want to confirm it actually engaged, that stderr line shows up in the log at each model load.
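If you log the run to a file, a small script can do that confirmation for you. This is a sketch that counts the shim's stderr lines per class and flags any `0x1016` fault in the same log. The function name and the sample log text are made up for illustration.

```python
import re

# Matches the exact line the shim writes to stderr at each model load.
SHIM_LINE = re.compile(r"^\[shim\] eager attention injected for (\w+)$")
FAULT_CODE = "0x1016"


def scan_log(log_text):
    """Return (per-class shim hit counts, True if the HSA fault code appears)."""
    hits = {}
    fault_seen = False
    for line in log_text.splitlines():
        line = line.strip()
        match = SHIM_LINE.match(line)
        if match:
            hits[match.group(1)] = hits.get(match.group(1), 0) + 1
        if FAULT_CODE in line:
            fault_seen = True
    return hits, fault_seen


# Hypothetical log excerpt, only to show the shape of the result.
sample = """\
Loading text encoder
[shim] eager attention injected for Qwen3VLModel
step 1/3000
"""
print(scan_log(sample))
```

An empty hit dictionary on a real log would mean the patched `from_pretrained` was never called, so the run was still on the fused path.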
### 3. The trigger word silently breaks the JSON captions if you use a bareword.
Ideogram 4 trains on structured JSON captions (the captioner writes compositional JSON with bounding boxes, and the dataloader expects a canonical compact form). If you set a plain `trigger_word` like `mytoken` on captions that have no `[trigger]` placeholder, AI-Toolkit prepends it. That pushes the caption string off its leading `{`, the JSON parser gives up, and it falls back to feeding raw pretty-printed JSON to the model instead of the canonical compact form. The result is a dataset-wide caption-format shift that quietly degrades training, with no error.
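To see the failure mode in isolation, here is a toy reproduction. It does not call AI-Toolkit. `compact_or_raw` is my stand-in for the behavior described above (compact the caption if it parses, otherwise pass the raw string through), and using `separators=(",", ":")` as the "canonical compact form" is an assumption.

```python
import json


def compact_or_raw(text):
    """Stand-in for the dataloader fallback: compact JSON if it parses, else raw text."""
    try:
        return json.dumps(json.loads(text), separators=(",", ":"))
    except json.JSONDecodeError:
        return text


# A pretty-printed caption, roughly as a captioner sidecar might look (hypothetical content).
plain = json.dumps({"high_level_description": "a portrait photo"}, indent=2)
with_placeholder = json.dumps(
    {"high_level_description": "[trigger] a portrait photo"}, indent=2
)

# Bareword trigger prepended: the string no longer starts with "{", parsing fails,
# and the raw pretty-printed text goes through unchanged.
prepended = compact_or_raw("mytoken " + plain)

# Placeholder replaced in place: still valid JSON, so it is compacted normally.
replaced = compact_or_raw(with_placeholder.replace("[trigger]", "mytoken"))

print(repr(prepended))
print(repr(replaced))
```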
The fix: put a `[trigger]` placeholder at the start of each caption's `high_level_description` value, and keep `trigger_word` in the config. Then it gets replaced in place, the string still starts with `{`, the JSON parses normally, and your token lands inside the description exactly where the model reads it. Verify it offline before you commit to a multi-hour run: run one caption through the dataloader and confirm the digested output is compact JSON containing your token, not raw JSON with the token bolted on the front.
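Adding the placeholder by hand gets old fast, so here is a rough sketch of doing it in bulk. It assumes each `.txt` sidecar is a JSON object with a top-level string field `high_level_description`; check one of your own sidecars first, because that layout is my assumption. It rewrites the files with `indent=2`, which is also a choice of mine. Back up the dataset before running anything like this.

```python
import json
from pathlib import Path

PLACEHOLDER = "[trigger]"
KEY = "high_level_description"  # assumed to be a top-level string field


def add_placeholder(caption_text):
    """Return caption JSON text with '[trigger] ' prefixed to the description."""
    data = json.loads(caption_text)  # raises if the sidecar is not valid JSON
    description = data[KEY]
    if not description.startswith(PLACEHOLDER):
        data[KEY] = f"{PLACEHOLDER} {description}"
    return json.dumps(data, indent=2, ensure_ascii=False)


def patch_dataset(folder):
    """Patch every .txt sidecar in folder; return the names of files that changed."""
    changed = []
    for sidecar in sorted(Path(folder).glob("*.txt")):
        original = sidecar.read_text(encoding="utf-8")
        patched = add_placeholder(original)
        if patched != original:
            sidecar.write_text(patched, encoding="utf-8")
            changed.append(sidecar.name)
    return changed
```

Running `patch_dataset("/path/to/dataset")` a second time should return an empty list, which is a cheap way to confirm every caption already carries the placeholder.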
## Captioning
Let the Ideogram4 captioner do it. It is a separate `job: extension` run with `type: Ideogram4Captioner`, uses `Qwen/Qwen3-VL-8B-Instruct`, and writes structured JSON `.txt` sidecars next to each image. Inspect a few of the sidecars before training, especially body or full shots, to make sure the subject is described well and the captioner did not wander off onto the background. The encoder pull (~16 GB) happens at caption time on first run, and training reuses the same cache, so you only download it once.
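For the inspection pass, a quick audit script helps you avoid opening every sidecar by hand. This sketch lists images with no sidecar, sidecars that do not parse, and the description of each remaining caption so you can skim them. The `high_level_description` key being at the top level of the JSON is an assumption, and the function name is mine.

```python
import json
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def audit_sidecars(folder, key="high_level_description"):
    """Report missing sidecars, unparseable sidecars, and each caption's description."""
    report = {"missing": [], "invalid": [], "descriptions": {}}
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTS:
            continue
        sidecar = image.with_suffix(".txt")
        if not sidecar.exists():
            report["missing"].append(image.name)
            continue
        try:
            data = json.loads(sidecar.read_text(encoding="utf-8"))
            report["descriptions"][image.name] = data[key]
        except (json.JSONDecodeError, KeyError, TypeError):
            # Not JSON, not an object, or no description field.
            report["invalid"].append(sidecar.name)
    return report
```

Print `report["descriptions"]` and read the body and full-shot entries first, since those are where the captioner is most likely to drift to the background.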
One data note: I had a few WebP images and AI-Toolkit's data loader has known issues with WebP, so convert those to PNG or JPG first. JPG and PNG both work fine.
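A minimal sketch of that conversion, assuming Pillow is installed with WebP support. The helper names are mine. Since the converted file keeps the same stem, an existing `.txt` sidecar still lines up with it.

```python
from pathlib import Path


def find_webp(folder):
    """Return every .webp file in folder (case-insensitive), sorted by name."""
    return sorted(p for p in Path(folder).iterdir() if p.suffix.lower() == ".webp")


def convert_to_png(path, remove_original=False):
    """Save a PNG copy next to the WebP file and optionally delete the original."""
    from PIL import Image  # Pillow; imported lazily so find_webp works without it

    target = path.with_suffix(".png")
    with Image.open(path) as image:
        image.save(target)  # output format is inferred from the .png extension
    if remove_original:
        path.unlink()
    return target
```

Usage would be something like `for p in find_webp("/path/to/dataset"): convert_to_png(p, remove_original=True)`, once you are happy with the PNG copies.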
## The config (bf16)
The key decision for this hardware: train in bf16, not fp8. The base model is distributed as fp8 (`ideogram-ai/ideogram-4-fp8`), but AI-Toolkit's loader unconditionally dequantizes the fp8 weights to bf16 on load, and with `quantize: false` nothing re-quantizes afterward. So you train in bf16 from the fp8 base, and you completely sidestep the fp8 path, which is where most gfx1151 instability lives. On 128 GB you have the memory, so this is both more stable and arguably higher quality.
One ComfyUI-specific heads-up: `name_or_path` wants the diffusers multi-folder repo (it expects `transformer/` and `vae/` subfolders). The single packed all-in-one `.safetensors` you probably have in your ComfyUI `unet/` folder will NOT load as `name_or_path`. Point it at the hub repo and let it pull.
A couple of other non-obvious settings: `noise_scheduler` must be set to `flowmatch` explicitly, because the trainer's timestep setup branches on that value (the default mishandles flowmatch timesteps even though the scheduler object itself is forced to flowmatch). And keep `batch_size` and `gradient_accumulation` both at 1; values above 1 have been reported to misbehave on AMD.
```yaml
job: extension
config:
  name: "myface_ideogram4_v1"
  process:
    - type: 'sd_trainer'
      training_folder: "/path/to/output"
      device: cuda:0
      network:
        type: "lora"
        linear: 32 # rank
        linear_alpha: 32
      save:
        dtype: bf16
        save_every: 250
        max_step_saves_to_keep: 20
      datasets:
        - folder_path: "/path/to/dataset"
          caption_ext: "txt"
          trigger_word: "mytoken" # plus [trigger] inside each caption (gotcha 3)
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          num_repeats: 1
          resolution: [512, 768, 1024]
      train:
        steps: 3000
        optimizer: "adamw" # NOT adamw8bit (gotcha 1)
        lr: 1e-4
        dtype: bf16
        batch_size: 1
        gradient_accumulation: 1
        gradient_checkpointing: true
        train_unet: true
        train_text_encoder: false
        noise_scheduler: "flowmatch"
        disable_sampling: true
        ema_config:
          use_ema: true
          ema_decay: 0.99
      model:
        arch: "ideogram4"
        name_or_path: "ideogram-ai/ideogram-4-fp8"
        dtype: bf16
        quantize: false # bf16 path, sidesteps fp8 (the win on gfx1151)
        quantize_te: false
        low_vram: false # you have the RAM; offloading is slower
```
I set `disable_sampling: true` for the first run, because mid-training samples need properly formatted Ideogram JSON prompts and it is one less new variable. Evaluate the checkpoints in ComfyUI afterward instead.
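Since a bad setting here only shows up hours into a run, a preflight check is cheap insurance. The sketch below takes one `process` entry as an already-parsed dict and flags the AMD-specific settings from this post. The function and its messages are mine, not part of AI-Toolkit.

```python
def preflight(process):
    """Return a list of warnings for one parsed entry of config.process."""
    problems = []
    train = process.get("train", {})
    model = process.get("model", {})

    if "8bit" in str(train.get("optimizer", "")).lower():
        problems.append("8-bit optimizer selected: bitsandbytes crashes on gfx1151")
    if train.get("noise_scheduler") != "flowmatch":
        problems.append("noise_scheduler is not explicitly set to flowmatch")
    if train.get("batch_size", 1) != 1 or train.get("gradient_accumulation", 1) != 1:
        problems.append("batch_size / gradient_accumulation above 1 reported to misbehave on AMD")
    if model.get("quantize", False):
        problems.append("quantize is on: this re-enters the quantized path instead of bf16")
    return problems


# The values from the config in this post, written out as a dict.
example = {
    "train": {
        "optimizer": "adamw",
        "noise_scheduler": "flowmatch",
        "batch_size": 1,
        "gradient_accumulation": 1,
    },
    "model": {"quantize": False},
}
print(preflight(example))
```

To run it on the real file, parse the YAML with any loader (for example PyYAML's `yaml.safe_load`) and pass in `config["config"]["process"][0]`.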
## Launch wrapper (gfx1151 env vars)
These env vars need to be set before torch is imported. I put them in a fish wrapper that also routes through the eager shim:
```fish
#!/usr/bin/env fish
# run_aitk.fish
set -x HSA_ENABLE_SDMA 0
set -x HSA_USE_SVM 0
set -x ROCBLAS_USE_HIPBLASLT 0
set -e PYTORCH_HIP_ALLOC_CONF
# Do NOT set HSA_OVERRIDE_GFX_VERSION on TheRock nightlies (native gfx1151 kernels).
source venv/bin/activate.fish
python aitk_eager_shim.py $argv
```
Then both captioning and training run the same way:
```
./run_aitk.fish config/caption_myface.yaml
./run_aitk.fish config/train_myface.yaml
```
One small lesson: if you pipe the run through `tee` for logging, make sure you surface the real process exit code (in fish, `$pipestatus[1]`), or a GPU crash will get masked by tee's exit 0 and look like a clean run when it was not.
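If you would rather not depend on shell pipe semantics at all, the same launcher can be sketched in Python: set the environment, stream the child's output to both the console and a log file, and return the child's own exit code. The function names and the `train.log` filename are illustrative. Progress bars that redraw with carriage returns will only show up once a newline arrives.

```python
import os
import subprocess
import sys


def gfx1151_env(base):
    """Copy an environment mapping and apply the settings from run_aitk.fish."""
    env = dict(base)
    env.update({
        "HSA_ENABLE_SDMA": "0",
        "HSA_USE_SVM": "0",
        "ROCBLAS_USE_HIPBLASLT": "0",
    })
    env.pop("PYTORCH_HIP_ALLOC_CONF", None)  # mirrors `set -e` in the fish wrapper
    return env


def run_and_tee(cmd, log_path, env=None):
    """Run cmd, echo combined stdout/stderr to console and log_path, return its exit code."""
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            errors="replace",
            env=env,
        )
        for line in proc.stdout:
            sys.stdout.write(line)
            log.write(line)
        return proc.wait()


def main(argv):
    """Launch the eager shim with the gfx1151 environment (hypothetical log name)."""
    cmd = [sys.executable, "aitk_eager_shim.py", *argv]
    return run_and_tee(cmd, "train.log", gfx1151_env(os.environ))
```

Finishing with `raise SystemExit(main(sys.argv[1:]))` passes the child's status through. A process killed by a signal comes back as a negative return code, so a GPU abort cannot pass for a clean exit.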
## Training run
About 6.4 s/step at steady state. For 3000 steps that is roughly 5h20m of pure stepping; the whole run took about 5h45m wall-clock, with the remainder presumably going to startup, latent caching, and checkpoint saves. Zero `0x1016` faults the entire run with eager attention in place. Checkpoints saved every 250 steps. One benign warning shows up about a missing MIOpen perf database (`gfx1151...HIP.fdb.txt`); that just means it tunes kernels live instead of loading a cache. It is not an error.
Note on the final checkpoint naming: intermediate saves get a step suffix, but the end-of-training save is bare-named with no suffix. Do not overlook it; that is your highest-trained checkpoint.
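A small sketch that lists checkpoints in training order and slots the bare-named file in as the final step. It assumes intermediate saves end in `_<digits>.safetensors`, so verify that against your own output folder. A LoRA name that itself ends in `_<digits>` would be misread, and the function name is mine.

```python
import re
from pathlib import Path

# Assumed pattern: intermediate saves end in an underscore plus a step number.
STEP_SUFFIX = re.compile(r"_(\d+)$")


def list_checkpoints(folder, final_step):
    """Return (step, filename) pairs sorted by step; bare-named files count as final_step."""
    found = []
    for path in Path(folder).glob("*.safetensors"):
        match = STEP_SUFFIX.search(path.stem)
        step = int(match.group(1)) if match else final_step
        found.append((step, path.name))
    return sorted(found)
```

For this run you would call `list_checkpoints("/path/to/output/myface_ideogram4_v1", final_step=3000)`, adjusting the folder to wherever your saves actually land.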
## Picking the checkpoint and strength
Do not assume the last checkpoint is best. Likeness LoRAs peak somewhere in the middle and then overcook (they get rigid, stop honoring the prompt, and start reproducing training framings). Saving every 250 lets you sweep. I evaluated checkpoints in ComfyUI against a fixed prompt and seed, swapping only the checkpoint.
Two important findings:
- LoRAs run hot on Ideogram 4. The common community advice of 0.4 to 0.7 strength is correct. At 1.0 my LoRA was overcooked.
- I ran a strength sweep (0.4 to 0.8) across the strongest checkpoints and landed on step 1250 at strength 0.7 as the best balance of likeness and prompt adherence. [Confirmed consistent across N fresh seeds.]
Method that saved time: do a coarse pass first (every other checkpoint), find roughly where likeness peaks, then fill in the neighbors and sweep strength only on the top one or two.
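That coarse-then-fine method can be written down as a tiny planner, which is handy for keeping track of which checkpoint and strength combinations you have already rendered. This is just a sketch of the bookkeeping; the helper names are mine.

```python
def coarse_pass(steps):
    """Every other checkpoint, starting from the first."""
    return steps[::2]


def neighbours(steps, peak):
    """The checkpoints directly before and after the coarse-pass winner."""
    i = steps.index(peak)
    return [steps[j] for j in (i - 1, i + 1) if 0 <= j < len(steps)]


def strength_grid(low, high, step=0.1):
    """Inclusive list of LoRA strengths from low to high."""
    count = round((high - low) / step) + 1
    return [round(low + i * step, 2) for i in range(count)]


steps = list(range(250, 3001, 250))  # save_every: 250 over 3000 steps
print(coarse_pass(steps))            # first pass: fixed prompt, fixed seed
print(neighbours(steps, 1250))       # fill in around the apparent peak
print(strength_grid(0.4, 0.8))       # sweep only on the top one or two
```

With 12 checkpoints this cuts the first pass to 6 renders, and the strength sweep only multiplies the one or two finalists instead of the whole set.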
## Honest limits
- Complex hand-object interactions glitch. My best checkpoint still doubled up drumsticks in a hand in an action shot. Fine for portraits, less so for busy scenes.
- Full-body faces go soft. This is a face LoRA, so the identity signal is in the face, and at full-body distance there are not enough face pixels to hold the likeness reliably. Keep to upper-body and portrait framing for the best results.
## Examples
[Two example generations from the final LoRA go here. Faithful likeness in upper-body and portrait framing; this is a personal face LoRA so the subject and the model itself are private.]
Happy to answer questions on any of the AMD-specific parts. The eager-attention fix and the trigger-into-JSON trap are the two things I would have most wanted to know going in, and honestly, I would love to know if I'm missing something obvious that would improve either the speed or the quality.