Avatar & Voice FAQ: Troubleshooting, Best Practices, and Credits

Written by Avi

Use this guide for common questions about avatar quality, Avatar III / IV / V, digital twin setup, Avatar Shots, Generate Looks, voice selection, speech pacing, and avoiding unnecessary credit usage.


Before you render: how to avoid unnecessary credit usage in AI Studio

Making changes to A-Roll scenes requires the scene to be re-rendered, which consumes credits.

To minimize unnecessary credit usage, we recommend the following workflow:

  1. Preview the speech in AI Studio to ensure the text-to-speech output sounds natural and accurate.

  2. Regenerate the audio or update the script as needed until you're satisfied with the voice output.

  3. Preview each scene individually to confirm it looks and sounds as expected.

  4. Review the avatar's expression, motion, timing, and pacing.

  5. Make any necessary adjustments to individual scenes before generating the full video.

  6. Once everything looks and sounds correct, render the final video.

Following this workflow can help reduce re-renders and avoid unnecessary credit consumption.
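
If you like to track the review steps above explicitly, the workflow can be sketched as a small checklist. This is only an illustration: the step labels and the `next_step` helper are hypothetical, nothing here calls HeyGen, and the steps are still something you check by hand in AI Studio.

```python
# Hypothetical checklist mirroring the five review steps that come
# before the final render. This is not a HeyGen API.
PRE_RENDER_STEPS = [
    "Preview the speech",
    "Regenerate audio or update the script until satisfied",
    "Preview each scene individually",
    "Review expression, motion, timing, and pacing",
    "Adjust individual scenes",
]


def next_step(completed):
    """Return the first step not yet completed, or None when ready to render."""
    for step in PRE_RENDER_STEPS:
        if step not in completed:
            return step
    return None


done = {"Preview the speech", "Preview each scene individually"}
print(next_step(done))  # Regenerate audio or update the script until satisfied
```

When `next_step` returns `None`, every review step has been ticked off and the final render is the only step left.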


How do I fix a “too wide” mouth on an already-rendered avatar video?

If the avatar’s mouth looks too wide, the best fix depends on the avatar look type, avatar model, and expression settings. Changing the avatar result requires re-rendering, so preview first whenever possible.

For a photo look

Try these steps:

  1. Use a photo with a subtle smile, not a wide “cheese” smile.

  2. Turn More Expressive motion style off in Advanced Settings.

  3. Switch between Avatar III, IV, and V to compare the output.

For a video look

Try this order:

  1. Use Avatar V with Less Expressive.

  2. If that does not help, try Avatar IV.

  3. As a last resort, re-record the source footage.


Avatar III vs. Avatar IV vs. Avatar V: which one should I use?

HeyGen offers three avatar models. Each one is useful for different needs.

Avatar V

Use when you want the most consistent, high-quality avatar output. Best for real human avatars, digital twins, consistent facial detail, more natural avatar behavior, and high-quality final videos.

Avatar IV

Use when you need strong quality with more customization. Best for custom prompting, one-off creative needs, virtual characters, non-human or stylized avatars, and more control over the result.

Avatar III

Use when you need a simpler, lower-cost option. Best for basic avatar videos, lower-cost generation, and use cases where advanced motion and consistency are less important.


Which avatar model is used by default?

The default model depends on the avatar type:

  • Avatar V: Default for real-human avatars and Digital Twins

  • Avatar IV: Default for virtual characters and Photo Avatars

In general, Avatar IV is better for custom prompting, while Avatar V is better for realistic and consistent real-human avatar output.
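
The defaults above amount to a simple lookup. The sketch below restates them in code for teams that keep their own notes or tooling. The type labels (`real_human`, `digital_twin`, and so on) and the `default_model` helper are made-up names for illustration and do not correspond to any HeyGen setting or API.

```python
# Defaults as stated in this article; the keys are hypothetical labels.
DEFAULT_MODEL = {
    "real_human": "Avatar V",
    "digital_twin": "Avatar V",
    "virtual_character": "Avatar IV",
    "photo_avatar": "Avatar IV",
}


def default_model(avatar_type):
    """Look up the default model for an avatar type, raising on unknown types."""
    try:
        return DEFAULT_MODEL[avatar_type]
    except KeyError:
        raise ValueError(f"Unknown avatar type: {avatar_type!r}") from None


print(default_model("digital_twin"))  # Avatar V
```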


Which avatars do not support Avatar V?

Most avatars support Avatar V, but there are a few exceptions. Avatar V is not supported for:

  • Public Expressive avatars

  • Some older legacy custom studio avatars

  • Photo-only avatar groups that do not include a motion or video look

For the best Avatar V results with digital twins, use a video-based avatar or add a motion/video look when possible.


Should I use Avatar IV or Avatar V for a virtual or non-human character?

Use Avatar V if consistency is the top priority. For example, Avatar V may better preserve unique identity details across generations, including non-human character traits.

Use Avatar IV if you need custom prompting or more creative control.

Simple rule:

  • Choose Avatar V for consistency.

  • Choose Avatar IV for customization.


Avatar V looks too static or robotic. How can I improve it?

If your Avatar V output feels too still, stiff, or robotic, try the following:

  • Keep More Expressive Motion enabled.

  • Avoid using a custom motion prompt unless you need a specific movement or style. (Note: this is specific to Avatar V; see the prompt question below. On Photo/IV looks, a custom prompt such as "talks excitedly" can help add motion.)

  • Use a more expressive motion-reference video.

  • Use voice audio with natural energy, emotion, and variation.

  • Avoid flat or monotone speech.

Avatar V performs best when the source material is expressive. If the input video or audio lacks emotion or energy, the resulting avatar may appear more static.


Why doesn’t my prompt change the avatar’s gesture?

Avatar V prioritizes inputs in this order:

  1. Audio

  2. Input image expression

  3. Prompt

Because prompts have the lowest priority on Avatar V, they may have limited impact on gestures and facial expressions. If the audio is flat or lacks emotion, the avatar may not make significant gesture changes, even when instructed by the prompt.

👉 This is why a custom prompt helps on Photo/IV looks but is discouraged for fixing a static Avatar V: on Avatar V, audio and the input image outrank the prompt, so expressive audio and reference footage do the heavy lifting.


How do I control where the avatar looks?

Avatar V keeps the gaze direction from the start frame.

This means:

  • If the avatar starts by looking at the camera, the avatar should continue looking toward the camera.

  • If the avatar starts with an angled gaze, that gaze direction is usually maintained.

There are also gaze presets available:

  • Looking directly at camera – use for front-facing videos.

  • Looking straight ahead – use for side-angle setups.

For finished video looks where the avatar's gaze is off, you can also apply Eye Contact Correction, or re-record looking at the camera.


What makes a good motion-reference video?

A strong motion-reference video should be more expressive than a normal recording. Best practices:

  • Be more energetic than feels natural.

  • Use clear facial expressions.

  • Vary your tone and delivery.

  • Use natural hand movement.

  • Keep your face visible.

  • Avoid under-acting.

Choose your most expressive video look as the motion reference. For side-angle output, use a side-angle reference when possible. Under-acting often produces robotic output; a more expressive input usually creates a more natural avatar result.


How do I pick a good base look or reference image?

Use a clear image that represents the identity you want the avatar to keep. Best practices:

  • Use a close-up or half-body image.

  • Make sure the face is clearly visible.

  • Use a subtle expression.

  • Avoid large accessories that cover the face.

  • Avoid blurry, dark, or low-quality images.

  • Choose an image that looks like the identity you want to preserve.

The base look acts as the identity anchor for future generated looks, so a clean reference image can improve consistency.


How much footage do I need? Does recording more help?

Avatar V can create an avatar from a short recording, but longer high-quality footage can improve the final result. For the best results:

  • Record in at least 1080p.

  • Use one continuous recording.

  • Keep your face visible throughout.

  • Use expressive facial delivery.

  • Avoid cuts, scene changes, or unstable footage.

More footage can help with facial expression and emotional range. It does not automatically improve hand gestures or body movement unless those gestures are clearly present in the recording.
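
As a rough pre-upload self-check, the measurable guidelines above can be written as a short function. This is a sketch under assumptions: the `Recording` fields are hypothetical values you fill in after reviewing your own file, and "at least 1080p" is read here as a frame height of 1080 pixels or more. Expressive delivery is not something this kind of check can judge.

```python
from dataclasses import dataclass


@dataclass
class Recording:
    # All fields are hypothetical; fill them in from your own review of the file.
    height_px: int
    continuous: bool
    face_always_visible: bool
    stable: bool


def recording_issues(rec):
    """Return a list of guideline violations; an empty list means none were found."""
    issues = []
    if rec.height_px < 1080:
        issues.append("Resolution is below 1080p")
    if not rec.continuous:
        issues.append("Recording contains cuts or scene changes")
    if not rec.face_always_visible:
        issues.append("Face is not visible throughout")
    if not rec.stable:
        issues.append("Footage is unstable")
    return issues


print(recording_issues(Recording(720, True, True, False)))
# ['Resolution is below 1080p', 'Footage is unstable']
```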


What should I wear when recording an avatar?

Wear whatever feels natural and comfortable. For best results:

  • Avoid clothing with large logos or text.

  • Avoid very distracting patterns.

  • Keep your face clearly visible.

  • Make sure accessories do not block your face.

Glasses, jewelry, watches, facial hair, and makeup are allowed, but they may sometimes create small visual artifacts.


Tips for creating an avatar on mobile

Mobile is a great option for avatar creation because phone cameras are usually high quality and camera-roll upload is easy. For best results:

  • Keep the camera stable.

  • Avoid walking while recording.

  • Avoid background motion.

  • Record in good lighting.

  • Keep your face visible.

  • Use one continuous shot.

  • Avoid cuts or scene changes.

The iOS app may also detect some common recording issues automatically.


How do I fix artifacts or glitches in a finished render?

Upload checks catch many problems before training (see the rejection sections below), but some issues only appear in the finished render. Common artifacts and their fixes:

  • Hallucinated detail or distortion – lower Expressiveness and simplify the prompt.

  • Background-removal halos or edge artifacts — increase the contrast between your outfit and the background. For a photo look, you can edit the source image in Nano Banana to improve separation.

  • Occlusions (hands or objects crossing the face) — re-record so the face stays clear.

  • Bad intro/outro frames — trim them, or re-record a continuous take of one minute or more.

  • General glitchiness or instability — keep a stable position and face the camera throughout the recording.

Because each fix changes the avatar result, preview before re-rendering whenever possible to avoid unnecessary credit usage.


What if the avatar doesn't look like me?

If a generated look doesn't preserve your likeness, the right fix depends on the look type:

Photo looks:

  • Edit the source in Nano Banana, starting from a real photo of yourself.

  • Upload more photos so the identity has more reference material.

  • Retrain Flux.

  • Use a close-up frame with an IV Digital Twin.

  • Re-record with a clear face and good lighting.

  • If the issue is motion rather than likeness, record a Video Look or switch to an IV Digital Twin.

Video looks:

  • If your face appears too small, use a close-up with an IV Digital Twin, or re-record with your face filling more of the frame.

  • If the footage is low resolution, re-record with higher-res, well-lit video.


Why was my recording rejected at upload?

A recording may be rejected if the system cannot use it to create a stable avatar. Common reasons include:

  • Too much face or head movement

  • Face not visible throughout the recording

  • Multiple scenes or cuts detected

  • Corrupt file

  • Blurry or low-quality footage

  • Poor lighting

  • Unclear audio

To fix this, record one continuous shot with your face clearly visible, stable, and well lit.


Why did the consent step fail?

Consent may fail if the recording does not clearly confirm the real person’s identity and permission. Common reasons include:

  • The consent code was not read correctly.

  • The person in the consent video does not match the training footage.

  • The consent was recorded using an AI avatar instead of the real person.

  • The video was blurry, dark, or unclear.

  • The audio was not clear enough.

  • The consent video was a screen recording instead of a real recording.

To fix this, record a new consent video with the real person, clear audio, good lighting, and the correct consent code.


Why was my photo upload rejected?

A photo upload may be rejected if the image cannot be processed. Common reasons include:

  • Unsupported image format

  • Corrupt image file

  • No detectable face

  • Face is too small

  • Face is blocked

  • Image is blurry or too dark

Try uploading a clean JPG or PNG with a clearly visible face.


Is Photo Avatar “Unlimited” mode still available?

There is no separate Photo Avatar Unlimited mode anymore. HeyGen now offers avatar generation through Avatar III, Avatar IV, and Avatar V. These models can support different avatar types.

Creating photo avatars and using avatars to generate videos are separate actions. Paid users can create unlimited photo avatars, but generating videos with those avatars uses credits based on the selected motion engine.


What is Generate Looks?

Generate Looks helps you create new looks for an existing avatar identity.

You can use it to create:

  • Persona-themed looks

  • Multi-angle looks

  • New outfits

  • New backgrounds

  • New poses

  • Edited versions of existing looks

Generate Looks uses AI image generation and enhancement designed for avatar video creation.


Are Avatar Shots editable?

No. Avatar Shots are not editable after generation in the same way as videos created in AI Studio. If you need more control over scenes, scripts, timing, or other video elements, we recommend creating your project in AI Studio instead.


How does voice selection work in Avatar Shots or cinematic videos?

Each avatar identity has an associated voice. When you select an avatar, HeyGen automatically uses the voice assigned to that avatar. If you'd like to use a different voice, update the avatar's assigned voice where supported, or use a workflow that allows manual voice selection.


Should I use a cloned voice, designed voice, or public voice?

Use this simple rule:

  • Cloned voice — best for digital twins, real people, personal avatars, and matching an avatar to a specific person.

  • Public voice — best for realistic virtual characters, general business videos, and presenters that do not need a custom voice.

  • Designed voice — best for non-realistic characters, stylized virtual characters, fictional characters, and creative use cases where you want a new voice.

For digital twins, the cloned voice is usually the best match.
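
The rule above can be read as a two-question decision. The sketch below is one possible way to write it down. The `recommend_voice` function and its two parameters are hypothetical simplifications and not part of any HeyGen API, and real projects may weigh other factors.

```python
def recommend_voice(represents_real_person, realistic_character):
    """Suggest a voice type following the simple rule in this article.

    represents_real_person: True for digital twins and personal avatars.
    realistic_character: True for realistic virtual presenters.
    """
    if represents_real_person:
        return "cloned"
    if realistic_character:
        return "public"
    # Stylized, fictional, or otherwise non-realistic characters.
    return "designed"


print(recommend_voice(True, True))    # cloned
print(recommend_voice(False, True))   # public
print(recommend_voice(False, False))  # designed
```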


How do I fix pauses or pacing in speech?

If the voice is speaking too quickly, too slowly, or without the right pauses, try editing the script and voice settings.

For pauses:

  • Add punctuation: commas for short pauses, periods for clearer sentence breaks, and ellipses for longer pauses.

  • Use Add pause controls when needed.

For pacing:

  • Break long sentences into shorter ones.

  • Adjust punctuation.

  • Rewrite the script to sound more natural when spoken.

  • Adjust Speed in Voice settings.

Small script changes can make a big difference in how natural the voice sounds.
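
To find sentences worth breaking up before you preview the speech, a quick script check can help. The sketch below flags sentences over a word limit. The 20-word threshold is an assumption chosen for illustration, not a HeyGen limit, and the sentence splitter is deliberately naive (it also splits after an ellipsis typed as three periods).

```python
import re

MAX_WORDS = 20  # assumed threshold for illustration, not a HeyGen limit


def long_sentences(script, max_words=MAX_WORDS):
    """Return the sentences whose word count exceeds max_words."""
    # Naive split: break after ., !, or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    return [s for s in sentences if len(s.split()) > max_words]


script = (
    "Welcome to the demo. "
    "Today we will walk through the dashboard, the reporting tools, the export "
    "options, and the sharing settings so that your whole team can get started quickly."
)
for sentence in long_sentences(script):
    print(sentence)
```

Here only the second sentence is flagged. Splitting it into two or three shorter sentences gives the voice more natural places to pause.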


Do I need to create an avatar when I sign up?

No. You do not need to create an avatar immediately when signing up. You can create a digital twin whenever you are ready. For the best digital twin experience, create both:

  1. A high-quality avatar recording

  2. A matching voice clone

This helps the avatar look and sound more natural in generated videos.
