Input Modes
Seedance 2.0 supports several input modes. Each mode changes how the prompt should be written.
Text-to-Video (T2V)
The baseline mode. Full 6-step formula required — this is where prompt quality matters most.
Structure: Subject + Action + Environment + Camera + Style + Constraints
Target prompt length: 60–100 words. Too short = vague output. Too long = conflicting instructions.
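
The structure and length target above can be checked mechanically before spending a generation. The following is a minimal Python sketch, not part of Seedance or Pica: `build_t2v_prompt` and `length_verdict` are hypothetical helper names, and word count is approximated by splitting on whitespace.

```python
# Illustrative helpers; the names are hypothetical, not a Seedance or Pica API.
T2V_FIELDS = ("subject", "action", "environment", "camera", "style", "constraints")


def build_t2v_prompt(parts: dict[str, str]) -> str:
    """Join the six formula parts in order, failing loudly if one is missing."""
    missing = [f for f in T2V_FIELDS if not parts.get(f, "").strip()]
    if missing:
        raise ValueError(f"missing formula parts: {', '.join(missing)}")
    return " ".join(parts[f].strip() for f in T2V_FIELDS)


def length_verdict(prompt: str, low: int = 60, high: int = 100) -> str:
    """Classify a prompt by whitespace-separated word count (60-100 target)."""
    words = len(prompt.split())
    if words < low:
        return "too short"
    if words > high:
        return "too long"
    return "ok"
```

Calling `length_verdict(build_t2v_prompt(parts))` before generating flags a missing formula part or an out-of-range prompt early.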
Image-to-Video (I2V)
When a first frame image is provided, the prompt should only describe what changes — not what is already visible. The model reads the image for composition and identity.
Key rule: focus the prompt on motion, not on describing the static image. Include the phrase "Preserve composition and colors" to lock the visual style.
Starting Image Quality
The starting image is the single biggest quality lever in I2V. If the starting image looks like AI slop, the video will look like AI slop. Seedance inherits composition, lighting, color grade, and detail fidelity directly from the input frame.
Before generating the video, invest in the starting image:
- Find a real reference photo that matches the mood and quality you want (Pinterest, brand lookbooks, film stills)
- Use a high-quality image model to generate the starting frame
# Search for high-quality image generation models on Pica
pica model search "nano banana"
pica model search "flux"
# Generate the starting image with a detailed prompt
pica generate --model <image-model-id> --kind image_generation \
--input '{ "prompt": "..." }'
These search terms target popular photorealistic image models. Actual availability depends on what is currently live — use pica model info to confirm.
Do not skip this step. A 30-second detour to get a strong starting frame saves multiple failed video generations.
First Frame
Prompt describes the animation:
She slowly turns her head toward camera, a subtle smile forming. Eyes blink naturally.
Hair moves gently in a light breeze. Camera holds fixed. Soft natural window light.
Preserve composition and colors.
First Frame + Last Frame
When both endpoints are provided, describe the motion arc between them:
Character leaps upward with explosive energy, arms raised, reaching maximum height
at the final frame. Natural motion arc. Preserve character appearance.
Landscape / Scene Animation
For non-character images, describe environmental motion:
Water surface ripples gently. Clouds drift slowly left across the sky. Birds fly across
the frame in the distance. Camera holds fixed. Preserve composition and colors, cinematic.
Reference Video (Motion Transfer)
The most powerful Seedance feature. Feed a reference video to transfer motion, camera language, and choreography to a new scene.
When using a reference video, prompt less. The clip handles motion — the text tells the model what to keep vs what to reinterpret.
Clip Preparation Rules
- Length: 3–8 seconds is ideal. Under 2s = mushy. Over 10s = model loses confidence
- Cuts: one continuous shot — no edits, no jump cuts
- One idea wide: either the subject moves OR the camera moves — not both simultaneously
- Compression: export clean H.264 at native frame rate
- Clarity: plain backgrounds, steady lighting, strong silhouette, deliberate pacing
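
As a rough pre-flight check, the length, cut, and single-mover rules above could be encoded as below. This is a sketch under assumptions: the `ReferenceClip` fields are filled in by hand (nothing here inspects a video file), and treating 2–3s and 8–10s as "usable but not ideal" is an interpretation of the gap between the stated thresholds.

```python
from dataclasses import dataclass


@dataclass
class ReferenceClip:
    """Facts about a clip that you supply by hand; no file is inspected."""
    seconds: float
    has_cuts: bool
    subject_moves: bool
    camera_moves: bool


def clip_problems(clip: ReferenceClip) -> list[str]:
    """Return the preparation rules the clip breaks (empty list means none)."""
    problems = []
    if clip.seconds < 2:
        problems.append("under 2s: motion tends to come out mushy")
    elif clip.seconds > 10:
        problems.append("over 10s: trim to the strongest 3-8s")
    elif not 3 <= clip.seconds <= 8:
        # Assumption: 2-3s and 8-10s are usable but outside the ideal window.
        problems.append("outside the ideal 3-8s window")
    if clip.has_cuts:
        problems.append("contains cuts: use one continuous shot")
    if clip.subject_moves and clip.camera_moves:
        problems.append("subject and camera both move: pick one")
    return problems
```

Compression and clarity are left out because they cannot be judged from these fields alone.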
Prompting With Reference Video
Keep style adjectives to 2–3 max. More = motion drift.
Key phrases that work:
- Respect motion from reference: reinterpret texture and color
- Keep silhouette and tempo: permit style variation
- Camera move from reference, new subject and environment
Transfer Camera Move to New Scene
A chef in a professional kitchen prepares a dish. Respect camera movement from reference.
Warm kitchen lighting, steam rising from pots. Cinematic, shallow depth of field.
Transfer Person's Motion to New Character
A woman in a blazer presents to camera with natural confident gestures.
Keep motion timing and rhythm from reference. Modern office background. Professional, clean, 4K.
Troubleshooting Reference Video
| Problem | Cause | Fix |
|---|---|---|
| Jittery motion | Reference too long or has cuts | Trim to 3–5s, single shot only |
| Camera move ignored | Reference motion not prominent enough | Isolate strongest 2–3 seconds, mention explicitly in prompt |
| Style drift | Too many adjectives | Reduce to 2–3 max, add a single style frame |
| Identity lost | Conflicting subject descriptions | Stable descriptors + preserve character appearance |
Multimodal (Video + Image + Audio)
Seedance 2.0 supports up to 12 files across all modalities. Each file gets a role that tells the model how to use it.
Role Reference
| Role | What it controls | Best file type |
|---|---|---|
| reference_video | Motion, camera language, choreography | 3–8s clean clip |
| reference_image | Face, outfit, identity lock | Mid-body portrait, simple background |
| first_frame | Starting composition and character | High-res still |
| last_frame | Ending composition | High-res still |
| reference_audio | Mood, pacing, lip-sync, beat matching | Clean audio file |
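
Before uploading, the role names and the 12-file cap can be sanity-checked locally. The sketch below assumes assets are tracked as `(role, path)` pairs; that shape and the `asset_errors` helper are hypothetical, and only the role names and the limit come from this section.

```python
# Role names come from the table above; the (role, path) shape is hypothetical.
ALLOWED_ROLES = {
    "reference_video",
    "reference_image",
    "first_frame",
    "last_frame",
    "reference_audio",
}
MAX_FILES = 12


def asset_errors(assets: list[tuple[str, str]]) -> list[str]:
    """Return human-readable problems with an asset list (empty means none)."""
    errors = []
    if len(assets) > MAX_FILES:
        errors.append(f"{len(assets)} files exceeds the {MAX_FILES}-file limit")
    for role, path in assets:
        if role not in ALLOWED_ROLES:
            errors.append(f"unknown role '{role}' for {path}")
    return errors
```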
Person Consistency Across Scenes
The person in [Video1] who looks like [Image1] stands at a podium and delivers a keynote
presentation. Confident gestures, makes eye contact with camera. Modern auditorium,
dramatic stage lighting. Slow push-in. Cinematic.
Audio-Driven Generation
The person in [Image1] speaks directly to camera with energy matching the audio rhythm
in [Audio1]. Natural expressions, occasional smile. Clean white studio background.
Camera holds fixed.
Multi-Image Continuity
Character transitions from standing in [Image1] to reaching forward in [Image2]
to catching the object in [Image3]. Smooth continuous motion arc connecting all three poses.
Cinematic, 24fps look.
Video Extension (Chain Clips)
Seedance generates up to 15 seconds per clip. To build longer videos (30–60s), chain clips by feeding each output back as a reference for the next generation.
How It Works
- Generate the first 15-second clip
- Upload the output as a reference video
- Prompt the model to extend, specifying what the next section should add
- The model automatically maintains continuity — character, scene, environment, lighting
Extension Prompt Pattern
Extend this video. Keep the same character, voice, and environment from the reference.
[New action or dialogue for the next section]. Smooth transition from the previous scene.
Rules
- Each extension inherits the original clip's character, scene, and style
- Introduce one new element per extension (new product, new angle, new dialogue topic) — not a full scene change
- 3 extensions (4 clips total) is the practical ceiling before drift accumulates
- If you need a new product image or prop, add it as a separate reference image in the extension prompt
- Review each extension before chaining the next — errors compound
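
The arithmetic behind these limits is simple enough to sketch. Assuming every clip runs the full 15 seconds (shorter clips would need more links in the chain), a hypothetical `plan_chain` helper gives the clip count and whether the plan stays within the three-extension ceiling:

```python
import math

CLIP_SECONDS = 15      # maximum length per generated clip, from this section
MAX_EXTENSIONS = 3     # practical ceiling before drift accumulates


def plan_chain(target_seconds: float) -> dict:
    """Estimate clips and extensions needed, assuming full-length clips."""
    clips = max(1, math.ceil(target_seconds / CLIP_SECONDS))
    extensions = clips - 1
    return {
        "clips": clips,
        "extensions": extensions,
        "within_ceiling": extensions <= MAX_EXTENSIONS,
    }
```

For a 60-second target this yields 4 clips and 3 extensions, right at the ceiling; a 75-second target needs a fourth extension and is flagged as over it.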
Practical Workflow
# 1. Generate the first clip
pica generate --model <seedance-id> --kind video_generation \
--input '{ "prompt": "..." }'
# 2. Use the output as reference for extension
pica generate --model <seedance-id> --kind video_generation \
--input '{
"prompt": "Extend this video. Same character and environment. [Next section description].",
"reference_video": "<first-clip-output>"
}'
Prompt-as-Image Bypass
Seedance has a ~2000 character prompt limit. For complex multi-section videos that need more space, render the full prompt as text on an image and upload it as a reference.
The model can read text in reference images and follow the instructions.
When To Use
- Detailed timestamp breakdowns that exceed the character limit
- Complex choreography with many sections
- Any prompt where you are cutting essential information to fit
How To Do It
- Write the full detailed prompt (timestamps, dialogue, visuals) in any design tool or text editor
- Export as a clean, high-res image with readable text
- Upload as a reference image alongside a short summary prompt
Follow the detailed shot breakdown in the reference image. Use the timestamps,
dialogue, and visual directions exactly as described.
This is a workaround, not the default. If your prompt fits in 2000 characters, write it directly.
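
The decision rule can be expressed as a small check. This is a sketch, not Seedance behavior: the limit is approximate, `choose_prompt` is a hypothetical helper, and Python's `len` counts Unicode code points, which may differ from how the service counts characters.

```python
PROMPT_CHAR_LIMIT = 2000  # approximate, per the "~2000" figure above

# Short summary prompt to send alongside the rendered prompt image.
SUMMARY_PROMPT = (
    "Follow the detailed shot breakdown in the reference image. Use the timestamps, "
    "dialogue, and visual directions exactly as described."
)


def choose_prompt(full_prompt: str, limit: int = PROMPT_CHAR_LIMIT) -> tuple[str, str]:
    """Return (strategy, text_to_send): 'direct' if the prompt fits, else 'image'."""
    if len(full_prompt) <= limit:
        return "direct", full_prompt
    return "image", SUMMARY_PROMPT
```

When the strategy is `"image"`, the full prompt still has to be rendered to an image and uploaded separately; this helper only picks the text to send.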
Voice And Audio Preparation
If the video includes dialogue or voiceover, prepare the audio asset before generating video. Seedance uses reference_audio to drive lip movement, expression timing, and pacing.
Voice Generation Tools
Ranked by ease of achieving realistic output:
- MiniMax — easiest path to natural-sounding speech
- ElevenLabs V2 — highest quality with the right voice setup
- Resemble AI — best for voice cloning and enhancement
- Qwen2-TTS — free open-source option
pica model search "minimax speech"
pica model search "text to speech"
Vocal Reference From Real Video
Instead of generating a voice from scratch, find a real video clip with the vocal quality you want. Upload it as reference_audio and write your own dialogue in the prompt — the model adopts the vocal character while generating new speech.
The prompt should make the role explicit:
Use this audio as the voice reference for character delivery. Do not reproduce
the words — speak the new dialogue with the same vocal quality and energy.
Rules
- Generate audio before video — Seedance needs the audio file as input, not the other way around
- Match the audio duration to the target video length — mismatched lengths cause pacing drift
- Clean audio only: background music or noise in reference_audio bleeds into the output
- For beat-driven videos (MV, dance), the audio is the timing backbone; write time slices that align with beats
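
The duration-matching rule is easy to verify before generating. The sketch below uses Python's standard `wave` module, so it only handles uncompressed WAV files; other formats would need a different tool. The helper names and the 0.25-second tolerance are assumptions for illustration, not Seedance requirements.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file, from frame count and sample rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def duration_matches(audio_seconds: float, video_seconds: float,
                     tolerance: float = 0.25) -> bool:
    """True when audio and target video length differ by at most `tolerance` seconds."""
    # Assumption: a quarter second of slack is acceptable; tighten as needed.
    return abs(audio_seconds - video_seconds) <= tolerance
```

For example, `duration_matches(wav_duration_seconds("voice.wav"), 15)` checks a voiceover against a full-length clip, where `voice.wav` is a placeholder path.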
Seedance-Based Talking Head
For avatar-style videos where a person speaks to camera, Seedance can produce natural talking heads using reference_video + reference_image for face lock. This is not precise lip sync (use kling-avatar for that) — it produces natural-looking conversational delivery.
The person in [Video1] who looks like [Image1] speaks conversationally to camera.
Natural micro-expressions — subtle head tilts, eye movements, occasional blink.
Slight smile. Clean modern office background. Camera holds fixed.
Warm professional lighting, shallow depth of field. Natural short pauses between sentences.
Keep background simple. Specify micro-expressions and natural pauses for realism.