Most early-stage products need a product demo video to land on the homepage. You know the kind: 30 seconds, a few UI screenshots, a clean voiceover, some background music. Professional but not over-engineered.
The usual route is painful. You hire a videographer or use a screen recording tool. You write a script. You find a voiceover person on Fiverr or do it yourself. You source royalty-free music. Then you stitch it all together in After Effects, Premiere, or CapCut, going back and forth for a week.
I replaced all four of those roles with free-tier AI tools and wrote the entire thing as code. This is the case study.
Time spent: 3.5 hours of agentic coding sessions
Cost: $0 (free tiers only)
Result: A 35-second, 1920x1080 rendered MP4 with animated scenes, AI voiceover, and background music
Tools: Claude (Sonnet 4.6), Remotion, ElevenLabs, Node.js v24
The problem it solves
Before building anything, I needed to answer a question most developers skip: what roles does this actually replace?
For a landing page video, you normally need:
- A content writer to script the scenes
- A designer to spec out what each scene looks like
- A videographer or motion designer to actually animate it
- A voiceover artist to record the narration
- A sound designer to find and mix background music
- A content writer + developer to seed professional-looking demo data on your platform for the video screens
That is a week of calendar coordination at minimum, or several hundred dollars on a freelance platform.
Here is what replaced each of them.
The stack
| Role | Tool | Free tier |
|---|---|---|
| Motion designer | Remotion | Open source |
| Content writer + director | Claude (Sonnet 4.6) | Copilot integration |
| Voiceover artist | ElevenLabs TTS (eleven_multilingual_v2) | 10,000 credits/month |
| Sound designer | ElevenLabs Sound Effects (eleven_text_to_sound_v2) | Separate quota |
| Screenshots / assets | Custom agent + Supabase seed data | Free |
Give your agent the right context first
Before writing any code, install the official agent skills for both tools. If you skip this, your AI agent will guess at API shapes, miss known gotchas, and produce code you have to correct manually. These two commands fix that.
Remotion skills (remotion.dev/docs/ai/skills):
npx skills add remotion-dev/skills
This installs a set of markdown files that teach your agent Remotion-specific patterns: how TransitionSeries handles frame overlap, how to structure compositions, how spring() and interpolate() work, and which common mistakes to avoid. Remotion also offers this setup when creating a new project via bun create video.
ElevenLabs skills (elevenlabs.io/blog/elevenlabs-agent-skills):
npx skills add elevenlabs/skills
This installs skills for text-to-speech, sound-effects, and music generation. Each skill includes the correct API call shapes, the parameters that matter (like previousText/nextText for natural TTS delivery), and the options most developers miss on the first pass.
Both commands detect your coding agent (Claude Code, Cursor, Codex) and copy the files into the right location automatically. For Claude Code it is .claude/skills/, for Cursor it is .cursor/skills/.
Once installed, your agent knows both libraries well enough that you spend your prompting budget on product decisions, not on explaining how an API works.
Phase 1: Ideation and scene design
The first question the LLM needs to answer is: what should a 30-second SaaS demo video show?
The answer is not "features." It is a problem gaining clarity over time. Each scene should move the viewer one step closer to understanding why the product exists. The classic structure is:
- Hook: establish what the product is about (emotional frame)
- Problem: name the pain the viewer already feels
- Feature 1: show the fix
- Feature 2: show the follow-through
- Feature 3: show the client experience
- CTA: close on the outcome, not the product
For FitComrade, the product I built (a fitness coaching platform), that became six scenes:
- Intro: the brand promise
- Problem: the time wasted on admin
- Program builder: build once, assign everywhere
- Client dashboard: see who needs attention
- Client share: one link per client, any device
- CTA: less admin, more coaching
Breaking down the ideation into 3 layers
Most people try to think about a video as one thing. That is why they get stuck. The way to un-stick it is to realize a product demo video is three separate problems layered on top of each other. Solve them in order. (It is similar to solving a dev problem by breaking it into tiny, loosely coupled, composable components.)
Layer 1: Screens
What does the viewer need to see? For each scene, define one screen and one thing it shows. Not two things. One. If you need to show two things, you need two scenes. This forces clarity. It also makes the script considerably easier to write.
For a 30-second video with 6 scenes, that means six screens. List them before writing a single line of copy:
Scene 1: No UI, brand statement only
Scene 2: No UI, problem statement only
Scene 3: Program builder screen
Scene 4: Client dashboard screen
Scene 5: Client share / mobile view screen
Scene 6: No UI, CTA only
Once this list exists, you know exactly what screenshots you need and you know which scenes need real assets versus which are text-only. The screenshot prompt in Phase 2 is written against this list.
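To make that list actionable in code, it can be encoded as a small typed checklist so the asset requirements fall straight out of the data. This is an illustrative sketch, not the actual project's file; the type and constant names are assumptions:

```typescript
// Illustrative sketch: mark each scene as text-only or screenshot-backed,
// so the Phase 2 asset list is derivable rather than maintained by hand.
type SceneAsset =
  | { kind: 'text-only' }
  | { kind: 'screenshot'; file: string };

const SCREENS: Record<string, SceneAsset> = {
  'scene-1-intro': { kind: 'text-only' },
  'scene-2-problem': { kind: 'text-only' },
  'scene-3-program-builder': { kind: 'screenshot', file: 'scene-3-program-builder.png' },
  'scene-4-client-dashboard': { kind: 'screenshot', file: 'scene-4-client-dashboard.png' },
  'scene-5-client-share': { kind: 'screenshot', file: 'scene-5-client-share.png' },
  'scene-6-cta': { kind: 'text-only' },
};

// The screenshots you need to capture are exactly the screenshot-kind entries.
const neededScreenshots = Object.values(SCREENS)
  .filter((a): a is { kind: 'screenshot'; file: string } => a.kind === 'screenshot')
  .map((a) => a.file);
```

For this six-scene video, that yields exactly three screenshot files, which is the list the Phase 2 prompt is written against.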
Layer 2: Content per screen
With screens defined, write the voiceover for each one. This is just one to two sentences per scene. The goal is that each line could stand alone as a clear statement and still make sense as part of a connected narrative when heard in sequence.
The technique that works: write the first line and the last line of the full video first. Everything in between is a bridge from one to the other. You will end up with copy that flows instead of copy that sounds like six separate product screenshots narrated by a robot.
This was the hardest part. I had to iterate for 30-45 minutes to get copy that was impactful and still short.
export const SCENES = [
{
id: 'scene-1-intro',
durationSeconds: 5,
audioDelay: 0.5,
text: "Coaching your clients shouldn't feel this messy.",
},
{
id: 'scene-2-problem',
durationSeconds: 5,
audioDelay: 0.7,
text: "But most weeks, you're rebuilding plans and chasing updates.",
},
{
id: 'scene-3-program-builder',
durationSeconds: 7,
audioDelay: 0.7,
text: 'With FitComrade, you build a program once, then assign it in a few clicks.',
},
{
id: 'scene-4-client-dashboard',
durationSeconds: 6.5,
audioDelay: 0.7,
text: "You can see who's on track, and who needs a quick check-in.",
},
{
id: 'scene-5-client-share',
durationSeconds: 6,
audioDelay: 0.7,
text: 'And each client gets one simple link that works on any device.',
},
{
id: 'scene-6-cta',
durationSeconds: 5.5,
audioDelay: 0.5,
text: 'So you spend less time on admin, and more time coaching.',
},
] as const;
P.S. One tip that really helps: think of each scene's copy as connective tissue that supports the previous and next lines (if any).
Layer 3: Animation and background music
This layer is decided last, not first. The mistake most developers make is opening a video tool and thinking about animation before they know what the screen shows or what the voiceover says. Animation should serve the content, not the other way around.
For each scene, one question: what is the thing the viewer's eye should land on? That element gets the animation. Everything else fades in passively. Background music is a single decision for the whole video, not per-scene. Choose a mood that fits the voiceover tone and stay consistent.
Claude wrote all six scene scripts in the same session it designed the architecture. The instruction that made the copy sound human was:
PROMPT:
**"write it like one person thinking out loud, not six separate headlines."** Using connectors like "But most weeks...", "With FitComrade...", and "So you..." made the voiceover feel like a continuous thought across scene cuts.
A follow-up instruction handled timing: for each scene, take the voiceover config file (shared above), set the scene duration based on attention and conversion principles, map each voiceover line to its scene's duration, and leave a short gap of silence before each scene ends so the transition into the next scene feels clean.
Phase 2: Building the video in code with Remotion
Remotion is a React-based video framework. You write your video as React components, then render to MP4. This is the right mental model for a developer: your scenes are components, your animations are hooks, your timeline is JSX.
The project structure is straightforward:
_video/
├── src/
│ ├── Root.tsx # Remotion entry point, composition config
│ ├── ProductVideo.tsx # Main timeline — assembles scenes
│ ├── voiceover-config.ts # Single source of truth for timing
│ ├── styles.ts # Brand colors, fonts, FPS
│ ├── animations.ts # Reusable spring/fade hooks
│ └── scenes/
│ ├── IntroScene.tsx
│ ├── ProblemScene.tsx
│ ├── ProgramBuilderScene.tsx
│ ├── ClientDashboardScene.tsx
│ ├── ClientShareScene.tsx
│ └── CtaScene.tsx
├── generate-voiceover.ts # ElevenLabs TTS script
└── generate-music.ts # ElevenLabs sound effects script
One configuration note before writing any scene code: _video/ needs its own package.json with "type": "module" and Node v22 or higher. Without it, native TypeScript execution fails with SyntaxError: Cannot use import statement. This is a two-line fix, but only if you know to look for it. Set it up first.
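A minimal sketch of what that _video/package.json could look like (the package name is illustrative; the only line that matters for the import error is "type"):

```json
{
  "name": "video",
  "type": "module",
  "private": true
}
```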
You can use npx create-video@latest to scaffold the Remotion project.
The single source of truth pattern
The most important architecture decision was voiceover-config.ts. Every scene's duration, audio delay, and script text live in one place:
export const SCENES = [
{
id: "scene-1-intro",
durationSeconds: 5,
audioDelay: 0.5,
text: "Coaching your clients shouldn't feel this messy.",
},
// ...
] as const;
Both Root.tsx (total frame count) and ProductVideo.tsx (each TransitionSeries.Sequence) read directly from this config. Change a scene duration in one file, everything updates automatically. No hardcoded frame math scattered across multiple files.
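As a sketch of the derivation Root.tsx performs (FPS is assumed to be 30 here; the real project keeps it in styles.ts, and transition overlaps would shorten the total slightly):

```typescript
// Sketch: derive the composition's total frame count from the config
// instead of hardcoding it. Durations are the ones listed above.
const FPS = 30; // assumed; the real project keeps this in styles.ts

const SCENES = [
  { id: 'scene-1-intro', durationSeconds: 5 },
  { id: 'scene-2-problem', durationSeconds: 5 },
  { id: 'scene-3-program-builder', durationSeconds: 7 },
  { id: 'scene-4-client-dashboard', durationSeconds: 6.5 },
  { id: 'scene-5-client-share', durationSeconds: 6 },
  { id: 'scene-6-cta', durationSeconds: 5.5 },
] as const;

// 5 + 5 + 7 + 6.5 + 6 + 5.5 = 35 seconds → 1050 frames at 30fps.
const durationInFrames = Math.round(
  FPS * SCENES.reduce((sum, s) => sum + s.durationSeconds, 0),
);
```

Change one durationSeconds value and the composition length, every scene slot, and the audio offsets all follow, which is the whole point of the pattern.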
Before this pattern, durations were hardcoded in three places. Changing a scene length made the render look correct while the audio drifted out of sync by a few frames: invisible at build time, only caught on playback.
When building a video with LLMs, config-first is not a stylistic choice. It is what makes frame-level debugging avoidable, and it makes the process more enjoyable because you control each layer, the same way you would build a complex application: broken down level by level so it stays composable.
Animations as reusable hooks
Rather than writing frame-by-frame animation logic in each scene, I established a shared animations.ts with composable spring-based hooks:
export const useFadeIn = (delay = 0, slideDistance = 40) => { ... }
export const useSlideIn = (delay, slideDistance, direction) => { ... }
export const useScaleIn = (delay = 0) => { ... }
export const useScreenshotZoom = (delay = 0) => { ... }
export const useFloat = (speed, amplitude) => { ... }
export const useCinematicReveal = (delay = 0) => { ... }
You can just prompt your AI to fill in these functions and it will do a decent job writing what goes inside them. This worked well with Gemini 3.1, GPT 5.4, and Claude Sonnet 4.6.
Each scene composes these hooks. ProgramBuilderScene, for example, uses useSlideIn for the text panel, useScreenshotZoom for the UI screenshot, and useFloat to give the screenshot a subtle hover animation in the background.
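As an example of what lives inside one of those hooks, here is the fade-in math written as a dependency-free function. The real hook would call Remotion's useCurrentFrame() and likely spring() for a natural ease; the names and default values below are assumptions for illustration:

```typescript
// Dependency-free sketch of useFadeIn's core math: a linear ramp from
// invisible-and-offset to fully-visible-in-place over `durationFrames`.
function fadeInStyle(
  frame: number,
  delayFrames = 0,
  durationFrames = 15,
  slideDistance = 40,
): { opacity: number; transform: string } {
  // Progress clamped to [0, 1]; before the delay it stays at 0.
  const t = Math.min(Math.max((frame - delayFrames) / durationFrames, 0), 1);
  return {
    opacity: t,
    transform: `translateY(${(1 - t) * slideDistance}px)`,
  };
}
```

In a scene component you would spread this result into a style prop; swapping the linear ramp for spring() is what gives Remotion animations their characteristic bounce.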
The visual design itself came from a combination of Claude's layout reasoning and a pass through Gemini for color suggestions. The result used the brand's green (#48bb78) with soft gradient backgrounds and coordinated accent colors per scene.
Screenshot assets
The scenes needed real product UI, not placeholder images. The approach: give your AI agent a demo account, point it at your running app, and tell it exactly which screens you need.
Here is the prompt structure used — copy it for any product:
PROMPT:
I need screenshots for a product demo video. Access [your app URL] with
demo credentials: [email] / [password].
For each scene below, navigate to the URL, make sure all visible data
looks production-realistic (no "Test User", "Sample Program", or obvious
placeholder text), adjust any content that looks fake, then take a
screenshot at 1440x900. Save as `[scene-id].png` in `_video/public/`.
Scenes:
1. [scene-id] → [path] — show: [what this scene demonstrates]
2. [scene-id] → [path] — show: [what this scene demonstrates]
The "production-realistic" phrase matters more than it looks. Without it the agent leaves obvious seeded defaults in place. With it, it adjusts names, numbers, and content to look like a real product in active use.
This is the hidden leverage of the whole pipeline: when your screenshots are generated from a running app, they stay in sync with your product as the UI evolves.
Transitions
@remotion/transitions provides composable scene transitions. The video uses three types:
- fade() with linearTiming: between Intro and Problem, soft and contemplative
- wipe({ direction: 'from-right' }) with springTiming: into the product section, more energetic
- slide({ direction: 'from-left' }): lateral movement between product feature scenes
- slide({ direction: 'from-bottom' }): upward reveal into the CTA
The TransitionSeries component handles the frame overlap automatically. The key insight: during transitions, two scenes render simultaneously. If your audio started at frame 0 of each scene, it would overlap with the tail of the previous scene's audio. You need an explicit audioDelay.
Phase 3: AI voiceover with ElevenLabs
ElevenLabs offers 10,000 TTS credits per month on the free tier (at the time of writing). The entire voiceover — all six scenes — uses about 360 credits per generation. That is about 27 full regenerations per month before you hit the ceiling.
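The arithmetic behind that regeneration budget, using the numbers above:

```typescript
// Credit budget: ~360 credits per full 6-scene voiceover generation
// against the 10,000/month free tier.
const monthlyCredits = 10_000;
const creditsPerGeneration = 360;
const fullRegenerations = Math.floor(monthlyCredits / creditsPerGeneration);
```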
Voice selection
The API does not expose a list of your available voices on the free tier. The workaround was sending a test generation request to each known voice ID. Out of 29 tested, 21 worked without a 402 error. "Laura", available on the ElevenLabs free tier, was selected for a professional female delivery suited to a product demo.
Voice settings for natural delivery
The default ElevenLabs voice settings produce a slightly robotic result. Two parameters matter most:
- Stability (0.55): Lower values allow more natural emotional variation. The default 0.7 is too uniform for conversational copy.
- Style (0.2): Adds character and personality to the delivery without making it unstable.
voiceSettings: {
stability: 0.55,
similarityBoost: 0.6,
style: 0.2,
speed: 1.0,
useSpeakerBoost: true,
}
Speed was kept at 1.0 throughout. Rushing TTS makes it sound robotic. The extra scene time added in Phase 1 specifically created room for natural pacing.
The initial cut was 26 seconds, with scenes at 3.5-5 seconds each. Even with the right voice settings, the delivery felt mechanical — not because of the voice model but because there was no breathing room between sentences. Adding 9 seconds across all scenes changed the delivery completely. If your TTS sounds robotic, try longer scene durations before touching stability or style values.
The audio delay problem
TransitionSeries overlaps consecutive scenes during transitions (0.4-0.5 seconds). Without an explicit delay, the incoming scene's voiceover starts playing while the outgoing scene's audio is still active. The fix is wrapping each <Audio> in a <Sequence from={...}>. Here is the full structure for a single scene slot:
<TransitionSeries.Sequence
durationInFrames={Math.round(FPS * scene.durationSeconds)}
>
<Sequence from={Math.round(FPS * scene.audioDelay)}>
<Audio
src={staticFile(`voiceover/${scene.id}.mp3`)}
volume={fadeOutVolume(scene.durationSeconds, scene.audioDelay)}
/>
</Sequence>
<SceneComponent />
</TransitionSeries.Sequence>
Breaking this down:
- TransitionSeries.Sequence's durationInFrames sets how long the scene slot is in total frames. Remotion uses this to calculate transition overlap.
- Sequence from={Math.round(FPS * scene.audioDelay)} delays when the audio starts playing within that slot. Frame 0 of the outer sequence is still the scene start, but the audio does not begin until audioDelay seconds in. This is what clears the transition overlap.
- fadeOutVolume(scene.durationSeconds, scene.audioDelay) is a helper that fades the audio volume to zero in the last 0.3 seconds of the available window. Without it, any voiceover line that runs close to the scene end will cut abruptly rather than fade gracefully.
- SceneComponent renders at frame 0 of the slot, not delayed. Only the audio is delayed.
All three values (durationSeconds, audioDelay, the MP3 filename via scene.id) come from voiceover-config.ts. Nothing is hardcoded here.
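The fadeOutVolume helper itself is not shown in the snippet above. A plausible implementation matching its description (fade to silence over the last 0.3 seconds of the audio window) might look like this; the linear fade shape and the FPS value of 30 are assumptions:

```typescript
const FPS = 30; // assumed frame rate
const FADE_SECONDS = 0.3;

// Returns a per-frame volume function: full volume through most of the
// audio window, then a linear fade to silence over the final 0.3 seconds.
const fadeOutVolume =
  (durationSeconds: number, audioDelay: number) =>
  (frame: number): number => {
    // Frames available to the audio after its delayed start.
    const windowFrames = Math.round(FPS * (durationSeconds - audioDelay));
    const fadeStart = windowFrames - Math.round(FPS * FADE_SECONDS);
    if (frame < fadeStart) return 1;
    if (frame >= windowFrames) return 0;
    return 1 - (frame - fadeStart) / (windowFrames - fadeStart);
  };
```

Because the returned function takes the frame number, it plugs directly into Remotion's volume prop, which accepts a per-frame callback.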
Request stitching for continuity
VERY IMPORTANT: ElevenLabs supports previousText and nextText parameters on each TTS request. Passing these tells the model the conversational context around each line, which improves how the voice inflects at the ends and beginnings of sentences:
await client.textToSpeech.convert(VOICE_ID, {
text: scene.text,
previousText: scenes[i - 1]?.text,
nextText: scenes[i + 1]?.text,
// ...
});
Phase 4: Background music with ElevenLabs Sound Effects
Background music uses ElevenLabs' separate sound effects API (textToSoundEffects), which has its own quota independent of the TTS credit limit. A 22-second seamlessly looping ambient track was generated with one API call:
const audio = await client.textToSoundEffects.convert({
text: "Smooth modern tech startup background music, subtle electronic ambient with light beats, clean and minimal, forward-moving energy, suitable for a SaaS product demo video",
durationSeconds: 22,
promptInfluence: 0.5,
loop: true,
});
The loop: true parameter generates a seamlessly looping file. Remotion's <Audio loop /> prop handles repeating it for the full 35 seconds.
The volume is not a fixed number. It is a curve calculated per frame using Remotion's interpolate() helper:
volume={(f) =>
interpolate(
f,
[0, FPS, durationInFrames - Math.round(FPS * 1.5), durationInFrames],
[0, 0.12, 0.12, 0],
{ extrapolateLeft: 'clamp', extrapolateRight: 'clamp' },
)
}
interpolate takes four arguments: the current frame f, an array of keyframe positions, an array of output values at those positions, and extrapolation options.
Reading the keyframes left to right:
- Frame 0 → volume 0. The track starts silent.
- Frame FPS (1 second in) → volume 0.12. It fades up over the first second.
- Frame durationInFrames - Math.round(FPS * 1.5) (1.5 seconds before the end) → volume 0.12. It holds steady through the whole video.
- Frame durationInFrames (last frame) → volume 0. It fades out over the final 1.5 seconds.
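To sanity-check that envelope, interpolate()'s piecewise-linear behavior can be reproduced in a few lines. This is a simplified stand-in for Remotion's helper with clamping built in, not the library implementation; FPS and the 35-second length are taken from the numbers above:

```typescript
// Simplified stand-in for Remotion's interpolate() with clamped ends:
// piecewise-linear mapping from frame positions to output values.
function interp(f: number, input: readonly number[], output: readonly number[]): number {
  if (f <= input[0]) return output[0];
  if (f >= input[input.length - 1]) return output[output.length - 1];
  for (let i = 0; i < input.length - 1; i++) {
    if (f <= input[i + 1]) {
      const t = (f - input[i]) / (input[i + 1] - input[i]);
      return output[i] + t * (output[i + 1] - output[i]);
    }
  }
  return output[output.length - 1];
}

const FPS = 30;
const durationInFrames = 35 * FPS; // 35-second video at 30fps

// The same keyframes as the volume prop above.
const volume = (f: number) =>
  interp(
    f,
    [0, FPS, durationInFrames - Math.round(FPS * 1.5), durationInFrames],
    [0, 0.12, 0.12, 0],
  );
```

Evaluating it at the keyframes confirms the shape: silent at frame 0, holding 0.12 through the middle, silent again on the last frame.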
The 0.12 level is deliberately low. The music should not compete with the voiceover. At 0.12 it sits under the voice during speech and becomes just noticeable in the half-second gaps between scenes, which is exactly where it earns its place.
What this actually replaces
Let me be specific about what was eliminated and what was produced:
Eliminated:
- Content writer: 6 voiceover lines, scene descriptions, CTA copy
- Motion designer: 6 animated scenes, 5 scene transitions, progress bar overlay
- Voiceover artist: 35 seconds of natural-sounding narration in a professional female voice
- Sound designer: 35 seconds of looping background music, mixed and faded
Produced in 3.5 hours:
- 1,600+ lines of TypeScript/TSX across 10 files
- 6 animated scenes using spring physics
- 6 MP3 voiceover files, total ~340KB
- 1 background music loop, 22 seconds, seamless
- 1 rendered MP4, 1920x1080, ready to upload
Lessons for other developers
1. Write the script first, then set your timing. The copy determines how long each scene needs to be, not the other way around. We initially had scenes at 3.5-5 seconds. The copy was too rushed. Adding 9 seconds total (26s → 35s) and running TTS at 1.0x speed eliminated the robotic feeling completely.
2. Single source of truth is not optional. Having voiceover-config.ts own all durations meant when we changed a scene length, nothing broke. Hardcoding the same numbers across three files breaks silently and is miserable to debug in a video context.
3. Audio delay is a first-class concern. This is the bug most developers will not anticipate. TransitionSeries overlaps scenes. If you do not add an explicit audioDelay, your voiceover clips stack. Calculate it, put it in your config, and wire it in one place.
4. Free tier ElevenLabs is genuinely enough for a launch video. 10,000 credits/month for TTS, plus a separate quota for sound effects. A full 35-second video with human-quality narration uses under 400 credits. You have room for 25+ complete revisions of your script before hitting any limit.
5. The AI is the director, not just the coder. The biggest value was not that Claude wrote animation code. It was that it reasoned about what makes a good product demo, iterated on the script until it sounded natural, caught the audio overlap bug before it was visible, and kept the architecture clean across 3 hours of changes. That is the director role.
The layered approach
Building a video with this stack is easier to reason about when you treat it like software development: break it into sequential layers where each layer should be satisfactory before you build the next one on top.
Layer 1: Content — script the scenes first
Write your voiceover script and derive scene durations from how long the script takes to say naturally. Do not set timing first and squeeze copy in later — that is what produces robotic-feeling videos. Your voiceover-config.ts holds all of this as the single source of truth.
Layer 2: Screens — capture assets for each scene
With the script done, you know exactly what each scene needs to show. Use the screenshot prompt from Phase 2 to generate one image per scene. The agent handles getting demo data into a production-realistic state.
Layer 3: Motion — animate each scene component
With assets ready, build each scene component. One primary spring animation per main element is enough — resist the urge to animate everything. Remotion's useCurrentFrame() with spring() or interpolate() handles 90% of what you need.
Layer 4: Voiceover — generate and review per scene
Run node generate-voiceover.ts. Review playback in Remotion Studio. Getting this layer right before moving on is worth the time — regenerate with adjusted stability and style values until the delivery sounds right.
Layer 5: Music — one looping track across everything
Once voiceover is good, add background music. Run node generate-music.ts. Keep volume between 0.10-0.15. It should be present in the silence between scenes and inaudible behind the voiceover.
Render: one command, one output file
npm run render
# → out/hero-video.mp4, 1920x1080, h264
If any layer needs changing, fix that layer in isolation and re-render. There is no After Effects project to corrupt, no sync issues between tools, no export format mismatches. Every layer is a TypeScript file or an npm script. It works the same way a software build does.
FAQ
Can I use this for any product, not just fitness?
Yes. The scene structure (problem → solution → features → CTA) and the technical stack (Remotion + ElevenLabs) are product-agnostic. Swap the scene copy, screenshots, and brand colors in styles.ts and voiceover-config.ts.
What would this actually cost if I needed to do it manually?
A freelance motion designer charges $500-1,500 for a 30-second product video. Add $100-300 for voiceover. Add $50-100 for music licensing. You are looking at $650-$1,900 minimum for a decent-quality video, and at least a week of back-and-forth. This took 3.5 hours and $0.
Conclusion
Four roles, $0, 3.5 hours. That is the honest summary.
The individual tools are impressive on their own. But that is not where the value is. A coding agent can hold the whole picture (timeline architecture, copy, voice settings, audio sync) in one context and iterate on all of it together. That is what a video director does. It is just that the director now writes TypeScript.
If you are building a product and need a homepage video, this is a repeatable, version-controlled, and entirely free approach to get there. The finished video is live on the homepage; you can also watch it on YouTube.
If you are building a product and are looking for someone to wear multiple hats, book a call.