Input Modes

Seedance 2.0 supports several input modes. Each mode changes how the prompt should be written.

Text-to-Video (T2V)

The baseline mode. Full 6-step formula required — this is where prompt quality matters most.

Structure: Subject + Action + Environment + Camera + Style + Constraints

Target prompt length: 60–100 words. Too short = vague output. Too long = conflicting instructions.
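The structure and length target above can be checked mechanically before spending a generation. The following is a minimal sketch, not part of Seedance or Pica: `T2VPrompt` and `length_verdict` are hypothetical helper names, and splitting on whitespace is only an approximation of how words are counted.

```python
from dataclasses import dataclass, astuple

MIN_WORDS, MAX_WORDS = 60, 100  # target range from the guidance above


@dataclass
class T2VPrompt:
    """Hypothetical container for the six formula parts, in formula order."""
    subject: str
    action: str
    environment: str
    camera: str
    style: str
    constraints: str

    def render(self) -> str:
        # Normalize each part to end with exactly one period, then join.
        parts = [part.strip().rstrip(".") for part in astuple(self)]
        if not all(parts):
            raise ValueError("all six parts are required for T2V")
        return " ".join(part + "." for part in parts)


def length_verdict(prompt: str) -> str:
    """Classify a prompt against the 60-100 word target (whitespace split)."""
    words = len(prompt.split())
    if words < MIN_WORDS:
        return "too short"
    if words > MAX_WORDS:
        return "too long"
    return "ok"
```

A rendered prompt that comes back "too short" usually means one of the six parts is a single word and needs more detail.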

Image-to-Video (I2V)

When a first frame image is provided, the prompt should only describe what changes — not what is already visible. The model reads the image for composition and identity.

Key rule: focus the prompt on motion, not on describing the static image. Include `preserve composition and colors` to lock the visual style.
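One way to make the key rule hard to forget is to append the preserve clause automatically. This is a small sketch under that assumption; `i2v_prompt` is a hypothetical helper, not a Seedance or Pica function.

```python
PRESERVE_CLAUSE = "Preserve composition and colors."


def i2v_prompt(motion: str) -> str:
    """Return the motion description with the preserve clause present once."""
    text = motion.strip()
    # Leave the prompt alone if the author already included the clause.
    if "preserve composition and colors" in text.lower():
        return text
    if not text.endswith("."):
        text += "."
    return f"{text} {PRESERVE_CLAUSE}"
```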

Starting Image Quality

The starting image is the single biggest quality lever in I2V. If the starting image looks like AI slop, the video will look like AI slop. Seedance inherits composition, lighting, color grade, and detail fidelity directly from the input frame.

Before generating the video, invest in the starting image:

Find a real reference photo that matches the mood and quality you want (Pinterest, brand lookbooks, film stills)
Use a high-quality image model to generate the starting frame

# Search for high-quality image generation models on Pica
pica model search "nano banana"
pica model search "flux"

# Generate the starting image with a detailed prompt
pica generate --model <image-model-id> --kind image_generation \
  --input '{ "prompt": "..." }'

These search terms target popular photorealistic image models. Actual availability depends on what is currently live; use `pica model info` to confirm.

Do not skip this step. A 30-second detour to get a strong starting frame saves multiple failed video generations.

First Frame

Prompt describes the animation:

She slowly turns her head toward camera, a subtle smile forming. Eyes blink naturally.
Hair moves gently in a light breeze. Camera holds fixed. Soft natural window light.
Preserve composition and colors.

First Frame + Last Frame

When both endpoints are provided, describe the motion arc between them:

Character leaps upward with explosive energy, arms raised, reaching maximum height
at the final frame. Natural motion arc. Preserve character appearance.

Landscape / Scene Animation

For non-character images, describe environmental motion:

Water surface ripples gently. Clouds drift slowly left across the sky. Birds fly across
the frame in the distance. Camera holds fixed. Preserve composition and colors, cinematic.

Reference Video (Motion Transfer)

The most powerful Seedance feature. Feed a reference video to transfer motion, camera language, and choreography to a new scene.

When using a reference video, prompt less. The clip handles motion — the text tells the model what to keep vs what to reinterpret.

Clip Preparation Rules

Length: 3–8 seconds is ideal. Under 2s = mushy. Over 10s = model loses confidence
Cuts: one continuous shot — no edits, no jump cuts
One idea per clip: either the subject moves OR the camera moves, not both simultaneously
Compression: export clean H.264 at native frame rate
Clarity: plain backgrounds, steady lighting, strong silhouette, deliberate pacing
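The length rule above can be turned into a quick pre-flight check. This sketch assumes the clip duration has already been measured with whatever tool you use. The "usable" band for 2-3s and 8-10s is an assumption, since the rules only name the ideal range and the two failure zones.

```python
def classify_clip_length(seconds: float) -> str:
    """Classify a reference clip by duration, following the length rule above."""
    if seconds <= 0:
        raise ValueError("duration must be positive")
    if seconds < 2:
        return "too short"   # described above as producing mushy motion
    if seconds > 10:
        return "too long"    # described above as the model losing confidence
    if 3 <= seconds <= 8:
        return "ideal"
    return "usable"          # assumption: 2-3s and 8-10s are not covered explicitly
```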

Prompting With Reference Video

Keep style adjectives to 2–3 max. More = motion drift.

Key phrases that work:

Respect motion from reference: reinterpret texture and color
Keep silhouette and tempo: permit style variation
Camera move from reference, new subject and environment
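The adjective cap and the key phrases can be combined into a small prompt builder. This is a sketch only: `reference_prompt` is a hypothetical helper, and it counts style terms from an explicit list rather than trying to detect adjectives in free text.

```python
MAX_STYLE_TERMS = 3  # upper end of the 2-3 guideline above


def _sentence(text: str) -> str:
    """Strip whitespace and make sure the text ends with a single period."""
    return text.strip().rstrip(".") + "."


def reference_prompt(scene: str, keep_phrase: str, style_terms: list[str]) -> str:
    """Build a short reference-video prompt: scene, what to keep, then style."""
    if len(style_terms) > MAX_STYLE_TERMS:
        raise ValueError(f"use at most {MAX_STYLE_TERMS} style terms")
    parts = [_sentence(scene), _sentence(keep_phrase)]
    if style_terms:
        style = ", ".join(term.strip() for term in style_terms)
        # Capitalize only the first character so terms like "4K" are untouched.
        parts.append(_sentence(style[0].upper() + style[1:]))
    return " ".join(parts)
```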

Transfer Camera Move to New Scene

A chef in a professional kitchen prepares a dish. Respect camera movement from reference.
Warm kitchen lighting, steam rising from pots. Cinematic, shallow depth of field.

Transfer Person's Motion to New Character

A woman in a blazer presents to camera with natural confident gestures.
Keep motion timing and rhythm from reference. Modern office background. Professional, clean, 4K.

Troubleshooting Reference Video

Problem	Cause	Fix
Jittery motion	Reference too long or has cuts	Trim to 3–5s, single shot only
Camera move ignored	Reference motion not prominent enough	Isolate strongest 2–3 seconds, mention explicitly in prompt
Style drift	Too many adjectives	Reduce to 2–3 max, add single style frame
Identity lost	Conflicting subject descriptions	Stable descriptors + `preserve character appearance`

Multimodal (Video + Image + Audio)

Seedance 2.0 supports up to 12 files across all modalities. Each file gets a role that tells the model how to use it.

Role Reference

Role	What it controls	Best file type
`reference_video`	Motion, camera language, choreography	3–8s clean clip
`reference_image`	Face, outfit, identity lock	Mid-body portrait, simple background
`first_frame`	Starting composition and character	High-res still
`last_frame`	Ending composition	High-res still
`reference_audio`	Mood, pacing, lip-sync, beat matching	Clean audio file
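Before uploading, a file list can be checked against the role table and the 12-file limit. The sketch below is illustrative: `validate_manifest` is a hypothetical helper, and the rule that only one `first_frame` and one `last_frame` may be supplied is an assumption, not something the source states.

```python
ALLOWED_ROLES = {
    "reference_video",
    "reference_image",
    "first_frame",
    "last_frame",
    "reference_audio",
}
MAX_FILES = 12  # limit across all modalities, per the text above


def validate_manifest(files: list[tuple[str, str]]) -> list[str]:
    """Check (path, role) pairs and return a list of problems (empty if fine)."""
    problems = []
    if len(files) > MAX_FILES:
        problems.append(f"{len(files)} files exceeds the limit of {MAX_FILES}")
    for path, role in files:
        if role not in ALLOWED_ROLES:
            problems.append(f"{path}: unknown role {role!r}")
    # Assumption: a generation has at most one start and one end frame.
    for single in ("first_frame", "last_frame"):
        if sum(1 for _, role in files if role == single) > 1:
            problems.append(f"more than one {single}")
    return problems
```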

Person Consistency Across Scenes

The person in [Video1] who looks like [Image1] stands at a podium and delivers a keynote
presentation. Confident gestures, makes eye contact with camera. Modern auditorium,
dramatic stage lighting. Slow push-in. Cinematic.

Audio-Driven Generation

The person in [Image1] speaks directly to camera with energy matching the audio rhythm
in [Audio1]. Natural expressions, occasional smile. Clean white studio background.
Camera holds fixed.

Multi-Image Continuity

Character transitions from standing in [Image1] to reaching forward in [Image2]
to catching the object in [Image3]. Smooth continuous motion arc connecting all three poses.
Cinematic, 24fps look.
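Prompts like the ones above break quietly when a tag such as `[Image3]` points at a file that was never uploaded. A minimal check is sketched here. It assumes tags are numbered from 1 per modality in upload order, which the source implies but does not state, and `missing_references` is a hypothetical helper name.

```python
import re

# Matches tags of the form [Video1], [Image2], [Audio1] as used in the examples.
TAG = re.compile(r"\[(Video|Image|Audio)(\d+)\]")


def missing_references(prompt: str, counts: dict[str, int]) -> list[str]:
    """Return tags in the prompt that have no matching uploaded file.

    counts maps a modality name ("Video", "Image", "Audio") to the number
    of files uploaded for it.
    """
    missing = []
    for kind, index in TAG.findall(prompt):
        number = int(index)
        if number < 1 or number > counts.get(kind, 0):
            tag = f"[{kind}{index}]"
            if tag not in missing:
                missing.append(tag)
    return missing
```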

Video Extension (Chain Clips)

Seedance generates up to 15 seconds per clip. To build longer videos (30–60s), chain clips by feeding each output back as a reference for the next generation.

How It Works

Generate the first 15-second clip
Upload the output as a reference video
Prompt the model to extend, specifying what the next section should add
The model automatically maintains continuity — character, scene, environment, lighting

Extension Prompt Pattern

Extend this video. Keep the same character, voice, and environment from the reference.
[New action or dialogue for the next section]. Smooth transition from the previous scene.

Rules

Each extension inherits the original clip's character, scene, and style
Introduce one new element per extension (new product, new angle, new dialogue topic) — not a full scene change
3 extensions (4 clips total) is the practical ceiling before drift accumulates
If you need a new product image or prop, add it as a separate reference image in the extension prompt
Review each extension before chaining the next — errors compound
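The arithmetic behind the chaining rules is simple enough to check up front: 15 seconds per clip and a ceiling of 4 clips gives about 60 seconds. A sketch, with `clips_needed` as a hypothetical helper name:

```python
import math

MAX_CLIP_SECONDS = 15  # per-clip limit stated above
MAX_CLIPS = 4          # first clip plus 3 extensions, the practical ceiling


def clips_needed(target_seconds: float) -> int:
    """Return how many chained clips a target duration requires."""
    if target_seconds <= 0:
        raise ValueError("target duration must be positive")
    count = math.ceil(target_seconds / MAX_CLIP_SECONDS)
    if count > MAX_CLIPS:
        raise ValueError(
            f"{target_seconds}s needs {count} clips, above the ceiling of {MAX_CLIPS}"
        )
    return count
```

The number of extensions is `clips_needed(...) - 1`, so a 45-second video is one initial clip plus two extensions.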

Practical Workflow

# 1. Generate the first clip
pica generate --model <seedance-id> --kind video_generation \
  --input '{ "prompt": "..." }'

# 2. Use the output as reference for extension
pica generate --model <seedance-id> --kind video_generation \
  --input '{
    "prompt": "Extend this video. Same character and environment. [Next section description].",
    "reference_video": "<first-clip-output>"
  }'
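When the extension step is scripted, building the `--input` JSON in code avoids shell quoting mistakes in the prompt text. The sketch below only reuses the flags and keys shown in the commands above. `extension_argv` is a hypothetical helper, and the value passed as `reference_video` is whatever the previous generation returned (the source leaves its exact form open).

```python
import json


def extension_argv(model_id: str, next_section: str, previous_output: str) -> list[str]:
    """Build the argument list for the extension command shown above."""
    payload = {
        "prompt": f"Extend this video. Same character and environment. {next_section}",
        "reference_video": previous_output,
    }
    return [
        "pica", "generate",
        "--model", model_id,
        "--kind", "video_generation",
        "--input", json.dumps(payload),  # JSON-encodes quotes in the prompt safely
    ]
```

The returned list can be passed to `subprocess.run(argv, check=True)`, which skips the shell entirely so no extra quoting is needed.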

Prompt-as-Image Bypass

Seedance has a ~2000 character prompt limit. For complex multi-section videos that need more space, render the full prompt as text on an image and upload it as a reference.

The model can read text in reference images and follow the instructions.

When To Use

Detailed timestamp breakdowns that exceed the character limit
Complex choreography with many sections
Any prompt where you are cutting essential information to fit

How To Do It

Write the full detailed prompt (timestamps, dialogue, visuals) in any design tool or text editor
Export as a clean, high-res image with readable text
Upload as a reference image alongside a short summary prompt

Follow the detailed shot breakdown in the reference image. Use the timestamps,
dialogue, and visual directions exactly as described.

This is a workaround, not the default. If your prompt fits in 2000 characters, write it directly.
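The decision between writing the prompt directly and using the image bypass comes down to a length check. A minimal sketch follows. The limit is approximate per the text above, counting Python characters is an assumption about how the limit is measured, and `choose_delivery` is a hypothetical helper.

```python
PROMPT_CHAR_LIMIT = 2000  # approximate limit stated above


def choose_delivery(prompt: str) -> tuple[str, int]:
    """Return ("direct", 0) if the prompt fits, else ("image", overflow)."""
    overflow = len(prompt) - PROMPT_CHAR_LIMIT
    if overflow <= 0:
        return ("direct", 0)
    # Too long: render the full prompt as an image and send a short summary prompt.
    return ("image", overflow)
```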

Voice And Audio Preparation

If the video includes dialogue or voiceover, prepare the audio asset before generating video. Seedance uses `reference_audio` to drive lip movement, expression timing, and pacing.

Voice Generation Tools

Ranked by ease of achieving realistic output:

MiniMax — easiest path to natural-sounding speech
ElevenLabs V2 — highest quality with the right voice setup
Resemble AI — best for voice cloning and enhancement
Qwen2-TTS — free open-source option

pica model search "minimax speech"
pica model search "text to speech"

Vocal Reference From Real Video

Instead of generating a voice from scratch, find a real video clip with the vocal quality you want. Upload it as `reference_audio` and write your own dialogue in the prompt; the model adopts the vocal character while generating new speech.

The prompt should make the role explicit:

Use this audio as the voice reference for character delivery. Do not reproduce
the words — speak the new dialogue with the same vocal quality and energy.

Rules

Generate audio before video — Seedance needs the audio file as input, not the other way around
Match the audio duration to the target video length — mismatched lengths cause pacing drift
Clean audio only: background music or noise in `reference_audio` bleeds into the output
For beat-driven videos (music videos, dance), the audio is the timing backbone: write time slices that align with beats
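For the beat-alignment rule, time slices can be derived from the track tempo instead of estimated by ear. This sketch assumes a constant tempo and a first beat at 0 seconds, neither of which the source guarantees. `beat_slices` is a hypothetical helper name.

```python
def beat_slices(bpm: float, beats_per_slice: int, total_seconds: float) -> list[tuple[float, float]]:
    """Split a clip into (start, end) slices that begin on beat boundaries.

    Assumes constant tempo and that the first beat falls at t = 0.
    The final slice is cut off at total_seconds.
    """
    if bpm <= 0 or beats_per_slice <= 0 or total_seconds <= 0:
        raise ValueError("bpm, beats_per_slice and total_seconds must be positive")
    step = 60.0 / bpm * beats_per_slice  # seconds covered by one slice
    slices = []
    index = 0
    while index * step < total_seconds:
        start = index * step
        end = min((index + 1) * step, total_seconds)
        slices.append((round(start, 2), round(end, 2)))
        index += 1
    return slices
```

For a 120 BPM track with 8 beats per slice, each slice spans 4 seconds, so a 15-second clip yields three full slices and one shorter closing slice.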

Seedance-Based Talking Head

For avatar-style videos where a person speaks to camera, Seedance can produce natural talking heads using `reference_video` + `reference_image` for face lock. This is not precise lip sync (use `kling-avatar` for that); it produces natural-looking conversational delivery.

The person in [Video1] who looks like [Image1] speaks conversationally to camera.
Natural micro-expressions — subtle head tilts, eye movements, occasional blink.
Slight smile. Clean modern office background. Camera holds fixed.
Warm professional lighting, shallow depth of field. Natural short pauses between sentences.

Keep background simple. Specify micro-expressions and natural pauses for realism.