
Video Generation

Generate videos from text, images, or interpolate between keyframes.

Video Generation creates short video clips using AI. You can generate from text alone, use an image as a starting frame, or interpolate between two keyframes.

For pricing details, see the Pricing page.

Sub-Modes

Sub-Mode      | Description                   | Input Required
--------------|-------------------------------|-------------------------
Text to Video | Generate from a text prompt   | Text only
Start Frame   | Animate from a source image   | 1 image + text
Interpolation | Transition between two frames | 2 images (start + end)
References    | Generate with reference guidance | Reference images + text
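
The input requirements for each sub-mode can be checked before a job is submitted. The following is a minimal sketch only: the sub-mode keys and the `check_inputs` helper are hypothetical names, not part of any documented API. It also assumes that References needs at least one image and that the prompt is optional for Interpolation, since the sub-mode list above does not say either way.

```python
from typing import List, Optional

# Hypothetical sub-mode keys; the product's real identifiers may differ.
REQUIRED_IMAGES = {
    "text_to_video": 0,
    "start_frame": 1,
    "interpolation": 2,
}


def check_inputs(sub_mode: str, prompt: Optional[str], image_count: int) -> List[str]:
    """Return a list of problems; an empty list means the inputs fit the sub-mode."""
    problems = []
    has_prompt = bool(prompt and prompt.strip())

    if sub_mode == "references":
        # Assumption: at least one reference image (no exact count is documented).
        if image_count < 1:
            problems.append("references needs at least one reference image")
    elif sub_mode in REQUIRED_IMAGES:
        needed = REQUIRED_IMAGES[sub_mode]
        if image_count != needed:
            problems.append(
                f"{sub_mode} needs exactly {needed} image(s), got {image_count}"
            )
    else:
        return [f"unknown sub-mode: {sub_mode}"]

    # Assumption: Interpolation lists only images as required, so its prompt is optional.
    if sub_mode != "interpolation" and not has_prompt:
        problems.append(f"{sub_mode} needs a text prompt")
    return problems
```

For example, `check_inputs("interpolation", None, 2)` returns an empty list, while calling it for Start Frame with no image reports the missing image.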

Providers

Provider              | Key Features
----------------------|---------------------------------------------------
VEO (Google)          | Up to 4K, 4-8 seconds
WAN (Alibaba)         | Cost-effective, multiple sub-modes, optional audio
Kling (KlingAI)       | Motion control, premium quality (Pro only)
SeedAnce (ByteDance)  | Audio-inclusive generation
xAI (Grok)            | Budget-friendly
LTX                   | Up to 4K, audio-driven mode
OmniHuman (ByteDance) | Human-focused video

See Providers & Models for the full model list.

Settings

Resolution

  • 720p — HD (fastest, all plans)
  • 1080p — Full HD (Basic+ plans)
  • 4K — Ultra HD (Pro plan only, VEO and LTX)
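
The resolution rules above combine a plan tier with a provider restriction. The sketch below shows one way to express them. The plan names (`free`, `basic`, `pro`), the lowercase provider keys, and the `allowed_resolutions` helper are all assumptions made for illustration; the real plan lineup may include other tiers.

```python
from typing import List

# Hypothetical plan names, ordered from lowest to highest tier.
PLAN_RANK = {"free": 0, "basic": 1, "pro": 2}

# Providers listed as supporting 4K.
FOUR_K_PROVIDERS = {"veo", "ltx"}


def allowed_resolutions(plan: str, provider: str) -> List[str]:
    """List the resolutions a plan/provider pair can use under the rules above."""
    rank = PLAN_RANK[plan.lower()]  # raises KeyError for an unknown plan
    resolutions = ["720p"]  # available on all plans
    if rank >= PLAN_RANK["basic"]:
        resolutions.append("1080p")  # Basic and higher
    if rank == PLAN_RANK["pro"] and provider.lower() in FOUR_K_PROVIDERS:
        resolutions.append("4K")  # Pro only, and only on VEO or LTX
    return resolutions
```

Under these assumptions a Pro user on WAN still tops out at 1080p, because 4K depends on the provider as well as the plan.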

Duration

Duration varies by provider:

  • VEO — 4 or 8 seconds
  • WAN — 2 to 15 seconds (model-dependent)
  • Kling — 5 or 10 seconds
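
Because VEO and Kling accept fixed durations while WAN accepts a range, a small lookup can catch an unsupported duration early. This is a sketch under stated assumptions: the provider keys and the `is_valid_duration` helper are hypothetical, and the WAN range is the widest span listed above, so individual WAN models may accept less.

```python
# Allowed clip lengths in seconds, taken from the list above.
DURATION_RULES = {
    "veo": {4, 8},
    "kling": {5, 10},
    # 2 to 15 inclusive; the real span depends on the WAN model.
    "wan": range(2, 16),
}


def is_valid_duration(provider: str, seconds: int) -> bool:
    """Return True if the provider lists this duration as supported."""
    allowed = DURATION_RULES.get(provider.lower())
    if allowed is None:
        raise ValueError(f"no duration rules recorded for provider: {provider}")
    return seconds in allowed
```

Providers whose durations are not listed raise a `ValueError` instead of being silently accepted.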

Aspect Ratio

  • 16:9 — Standard landscape video
  • 9:16 — Vertical video (stories, reels)
  • 1:1 — Square
  • Additional ratios available with some providers

Audio

Some providers support audio generation alongside video:

  • WAN — Optional audio track
  • SeedAnce — Audio-inclusive generation
  • LTX Pro Audio — Audio-driven video

Tips

  • Start Frame mode gives the best control — upload an image you like and describe the motion
  • VEO 3.1 Fast is a good default for quick iterations
  • WAN Flash models are cost-effective for high-volume work
  • Video generation is slower than image generation — expect each clip to take 30 seconds to several minutes
