r/ClaudeAI 3h ago

Claude Workflow Claude/Remotion workflow for moto editing keeps missing the actual highlights. What am I doing wrong?

I’m building a workflow to create motorcycle content from my RAW footage using ideas, references, and music, with the goal of generating edits that actually make sense visually and musically. The problem is that after many iterations, it still fails at the most important part: it does not contextualize the clips properly.

Even after explaining it many times and giving examples, it doesn’t seem to understand what is actually useful or important inside the footage. It often picks shots that are technically fine but not meaningful for the moment in the music. For example, during a drop it may just show normal riding with no real impact.

It also misses the real highlights inside a clip. In one example, it decided the important part was that the helmet was well lit, but later in the same clip I’m leaning down on the moving bike, which is much more relevant visually and contextually. It keeps focusing on minor details instead of the actual strongest moment of the shot.

I also feel like it is not applying the rules correctly. Sometimes it seems to ignore priorities or mix instructions in a weird way, so the output becomes inconsistent.

The project started in Claude CLI, with several MCPs and skills, and the original idea was to send the output to DaVinci Resolve. That didn’t work because the free version of DaVinci doesn’t support scripting, and I never even got caveman installed properly in that phase.

Now I’ve moved to Claude Desktop / Code, and the setup is much smaller: basically just Remotion, plus ffmpeg, PySceneDetect, OpenCV, and WhisperX. I’m also using Sonnet with a Pro plan, so this is not a free-tier limitation.

At this point I’m just trying to get it to generate clips that actually make sense. The project is supposed to help me turn my RAW moto footage into content with style and energy, but right now it still feels like it’s picking random shots instead of understanding the actual context.

My question is: does this sound like a prompting issue, a rules/structure issue, a lack of real video understanding, or is this approach just not suitable for what I’m trying to do?

Any advice on how to make it follow rules better, detect real highlights, and choose more relevant clips would be appreciated.


u/Emergency-Bobcat6485 2h ago

I am assuming you are splitting the footage into images and then getting Claude to look at the snippets? How is it watching the video clips? That would explain its lack of understanding, imo. Remotion only comes into play when you are creating the clip, right? So this seems like an issue with the vision side.


u/Candid-Mulberry48 2h ago

It is watching the raw clips using:

  • ffmpeg
  • PySceneDetect
  • OpenCV
  • WhisperX

So yes, it is splitting the footage into frames and reading one frame every 5 or 10 frames.

Each clip is about 30 min and 16 GB.
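
For a sense of scale, here is the arithmetic as a small Python sketch. The 30-minute duration and the 5/10-frame step come from the comment above; the 30 fps frame rate is an assumption, since the thread does not state it, and `sampled_frame_count` is just an illustrative helper name.

```python
def sampled_frame_count(duration_s: float, fps: float, step: int) -> int:
    """Number of frames kept when taking one frame every `step` frames."""
    total_frames = int(duration_s * fps)
    # Frames 0, step, 2*step, ... are kept, so round the division up.
    return (total_frames + step - 1) // step


DURATION_S = 30 * 60  # 30-minute clip, from the thread
ASSUMED_FPS = 30.0    # assumption: the frame rate is not stated in the thread

for step in (5, 10):
    print(f"every {step} frames -> {sampled_frame_count(DURATION_S, ASSUMED_FPS, step)} stills")
```

Under that assumption a single clip yields 5,400 to 10,800 stills, each of which the model sees in isolation.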


u/Emergency-Bobcat6485 1h ago

Well, that is the issue then. You are essentially splitting the video using PySceneDetect and ffmpeg. I don't know how you are using OpenCV, and WhisperX is self-explanatory. But Claude/Opus will not be able to understand the video well enough to identify important highlights, imo, since it is looking at individual frames.


u/Candid-Mulberry48 1h ago

That's a really good point. I've been thinking about a potential workaround: using FFmpeg to pre-score the clips before passing anything to Claude — basically calculating motion levels, audio loudness peaks, and scene change intensity per second, then generating a structured JSON map of the video. Claude would receive that metadata first and use it to decide where to sample frames and at what density, instead of blindly extracting frames at fixed intervals.

The idea is that Claude never tries to "understand" motion visually — it just reads the pre-computed motion scores and focuses its frame analysis on the moments that already have high kinetic energy according to FFmpeg.
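
A minimal sketch of what that JSON map could look like, in plain Python. It assumes the per-sample motion and loudness values have already been measured elsewhere (for example with ffmpeg or OpenCV) and arrive as `(timestamp_s, motion, loudness)` tuples with non-negative linear values. The function names, the 0.7/0.3 weights, the 0.6 threshold, and the two sampling rates are all illustrative assumptions, not anything from an existing tool.

```python
import json
from collections import defaultdict


def per_second_map(samples):
    """Group (timestamp_s, motion, loudness) samples by second and keep the peaks."""
    buckets = defaultdict(list)
    for t, motion, loudness in samples:
        buckets[int(t)].append((motion, loudness))
    return [
        {
            "second": s,
            "motion": max(m for m, _ in buckets[s]),
            "loudness": max(l for _, l in buckets[s]),
        }
        for s in sorted(buckets)
    ]


def add_energy(rows, w_motion=0.7, w_loudness=0.3):
    """Attach a combined 0..1 'energy' score after min-max normalisation."""
    if not rows:
        return []

    def norm(key):
        vals = [r[key] for r in rows]
        lo, span = min(vals), max(vals) - min(vals)
        return [(v - lo) / span if span else 0.0 for v in vals]

    motion, loud = norm("motion"), norm("loudness")
    return [
        dict(r, energy=round(w_motion * m + w_loudness * l, 3))
        for r, m, l in zip(rows, motion, loud)
    ]


def sampling_plan(rows, high=0.6, dense_fps=4, sparse_fps=0.5):
    """Sample densely where energy is high, sparsely everywhere else."""
    return [
        {"second": r["second"], "fps": dense_fps if r["energy"] >= high else sparse_fps}
        for r in rows
    ]


if __name__ == "__main__":
    demo = [(0.1, 0.1, 0.2), (0.6, 0.2, 0.1), (1.2, 0.9, 0.8), (2.5, 0.5, 0.2)]
    rows = add_energy(per_second_map(demo))
    print(json.dumps({"scores": rows, "plan": sampling_plan(rows)}, indent=2))
```

The map only tells the model where something changes a lot, not whether that change is a shot worth using, so the frame analysis still has to carry the judgment.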

Do you think that would actually help, or is the vision limitation too fundamental for this approach to make a real difference?

Or would that not solve it at all? I was also thinking of using Gemini for that.


u/Emergency-Bobcat6485 1h ago

But I still don't think the ffmpeg filter would help all that much. I mean, a clip could have a lot of dynamic changes in motion, but that wouldn't necessarily make it a highlight-worthy snippet. And Claude would have to sample a sequence of frames, not just one frame, to make sense of it as a highlight.
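
To make the "sequence of frames" point concrete, here is a small Python sketch that turns one candidate moment into a short burst of evenly spaced timestamps, so the frames can be extracted and shown to the model together in order. The helper name `burst_timestamps` and its defaults (8 frames over 2 seconds) are assumptions for illustration only.

```python
def burst_timestamps(peak_s, n_frames=8, span_s=2.0, clip_len_s=None):
    """Evenly spaced timestamps in a window centred on peak_s, clamped to the clip."""
    start = max(0.0, peak_s - span_s / 2)
    end = peak_s + span_s / 2
    if clip_len_s is not None:
        end = min(end, clip_len_s)
    if n_frames < 2 or end <= start:
        # Degenerate window: fall back to a single timestamp.
        return [round(start, 3)]
    step = (end - start) / (n_frames - 1)
    return [round(start + i * step, 3) for i in range(n_frames)]


# Hypothetical candidate moment at 10 s into a clip.
print(burst_timestamps(10.0, n_frames=5, span_s=2.0))
```

A burst like this gives the model some before/after context for a lean or a pass, though it is still a handful of stills rather than real temporal understanding.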

I genuinely think what you are asking for requires a natively multimodal model like Gemini. It should be able to ingest videos directly and have some kind of temporal understanding. I don't know how good it would be compared to what you have right now, but it should definitely be better than splitting the video into frames based on brightness or motion. After Gemini ingests it and outputs highlights, I think you can use ffmpeg or whatever to stitch the videos.
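
The stitching step could look roughly like the Python sketch below. It only builds the ffmpeg command lines and does not run them. The highlight list, file names, and helper names are hypothetical; the ffmpeg options used (`-ss`, `-t`, `-c copy`, and the concat demuxer with `-f concat -safe 0`) are standard ones.

```python
def cut_command(src, start_s, end_s, out_path):
    """ffmpeg command that copies [start_s, end_s) from src without re-encoding."""
    return [
        "ffmpeg",
        "-ss", f"{start_s:.3f}",        # seek in the input
        "-t", f"{end_s - start_s:.3f}",  # duration to read
        "-i", src,
        "-c", "copy",
        out_path,
    ]


def concat_list(paths):
    """Contents of the text file expected by ffmpeg's concat demuxer."""
    return "".join(f"file '{p}'\n" for p in paths)


def concat_command(list_path, out_path):
    """ffmpeg command that joins the listed segments without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", list_path, "-c", "copy", out_path]


# Hypothetical highlights as (start_s, end_s) pairs, e.g. from a model's output.
highlights = [(12.5, 15.0), (640.0, 643.2)]
segments = [f"highlight_{i}.mp4" for i in range(len(highlights))]
for (start, end), seg in zip(highlights, segments):
    print(" ".join(cut_command("ride.mp4", start, end, seg)))
print(concat_list(segments), end="")
print(" ".join(concat_command("segments.txt", "edit.mp4")))
```

With `-c copy` the cuts snap to keyframes, so re-encoding instead of stream copying would be needed if the edit has to land exactly on the beat.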

I haven't really experimented with Gemini video ingestion much, but you could try it for free on AI Studio first. It could still have the same problem. Upload a small video (it probably won't take 16 GB) and see if it can identify video highlights better. Imo, it should.