AI Video Glossary

Plain-English definitions for every term you'll encounter in AI video generation.


Aspect Ratio: The width-to-height ratio of a video frame. Common ratios: 16:9 (landscape/YouTube), 9:16 (vertical/TikTok), 1:1 (square/Instagram). Specified in your prompt or in Playground settings.
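As a small illustration of the definition above, the ratio can be recovered from pixel dimensions by dividing both by their greatest common divisor. The function name `aspect_ratio` is made up for this sketch and is not part of any platform API.

```python
from math import gcd

def aspect_ratio(width: int, height: int) -> str:
    """Reduce pixel dimensions to a width:height ratio string."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    divisor = gcd(width, height)
    return f"{width // divisor}:{height // divisor}"

print(aspect_ratio(1920, 1080))  # 16:9
print(aspect_ratio(1080, 1920))  # 9:16
print(aspect_ratio(1080, 1080))  # 1:1
```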
Character Consistency: A generation technique where the same character (face, build, style) is maintained across multiple separate clip generations, typically by conditioning each generation on a reference-image anchor.
CFG Scale: Classifier-Free Guidance scale controls how closely the model follows the prompt vs. producing free-form results. Higher values = closer to prompt, lower = more creative. Usually 7–12 for video.
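The standard classifier-free guidance rule combines two noise predictions per step, one made with the prompt and one made without it. Below is a minimal sketch in plain Python, using short lists as stand-ins for model outputs. The name `apply_cfg` and the sample numbers are illustrative assumptions, not any model's actual API.

```python
def apply_cfg(uncond, cond, scale):
    """Classifier-free guidance: move the prediction away from the
    unconditional output and toward the prompt-conditioned output."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 1.0, -1.0]  # prediction without the prompt (made-up values)
cond = [1.0, 1.0, 0.0]     # prediction with the prompt (made-up values)

print(apply_cfg(uncond, cond, 0.0))  # scale 0 ignores the prompt entirely
print(apply_cfg(uncond, cond, 1.0))  # scale 1 reproduces the conditional prediction
print(apply_cfg(uncond, cond, 7.5))  # larger scales exaggerate the prompt's influence
```

With a scale above 1, the difference between the two predictions is amplified, which is why high values track the prompt more closely at the cost of variety.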
Credits: The unit of compute consumption on this platform. Each generation costs credits based on model type, resolution, and duration. 1 credit ≈ 1 unit of generation compute.
Denoising Steps: The number of iterations a diffusion model runs to remove noise from a latent representation. More steps = higher quality but slower. Typical range: 20–50 steps for video.
Diffusion Model: A class of generative AI model that learns to reverse a noise-adding process. Training teaches the model to remove noise step by step; at generation time it starts from pure noise and denoises progressively until it produces structured data (image or video) that matches the training distribution.
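The noise-adding process that the model learns to reverse can be sketched in a few lines. This follows one common formulation, in which clean data is blended with Gaussian noise according to a signal fraction. The names `add_noise` and `alpha_bar` and the sample values are assumptions for illustration; real models apply this to large latent tensors, not three-element lists.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Blend clean data with Gaussian noise. alpha_bar in [0, 1] is the
    fraction of signal variance that survives the noising step."""
    signal = math.sqrt(alpha_bar)
    noise = math.sqrt(1.0 - alpha_bar)
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]

rng = random.Random(0)
clean = [1.0, -1.0, 0.5]

print(add_noise(clean, 1.0, rng))  # alpha_bar = 1: data is unchanged
print(add_noise(clean, 0.5, rng))  # partly noised
print(add_noise(clean, 0.0, rng))  # alpha_bar = 0: pure noise, no signal left
```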
FPS (Frames Per Second): The number of video frames displayed per second. The film standard is 24 fps, while broadcast and web video commonly use 25 or 30 fps. Higher frame rates (60+) are used for sports or gaming content. AI video models typically generate at 24 fps.
Image-to-Video: A generation mode where a static image is used as the first frame, and the model generates plausible subsequent frames to create motion. Also called "I2V" in the research literature.
Latent Space: A compressed, lower-dimensional mathematical representation of data (image or video frames). Diffusion models operate in latent space rather than pixel space for computational efficiency. The decoder then converts latent representations back to pixels.
Lip Sync: Matching a character's mouth movements to an audio track at the phoneme level. AI lip-sync also generates secondary natural movements: head nods, blinks, and micro-expressions.
LoRA: Low-Rank Adaptation, a fine-tuning technique that adapts a large model to a specific style or subject by training a small set of additional weights. Used to make video generation style-specific or character-specific.
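To see why the additional weights are small, consider a single weight matrix of shape d × k. LoRA keeps it frozen and trains two low-rank factors of shapes d × r and r × k instead. The sizes below (4096 × 4096, rank 8) are illustrative assumptions, not values taken from any particular video model.

```python
def lora_param_counts(d: int, k: int, rank: int) -> tuple[int, int]:
    """Return (full_matrix_params, lora_params) for one d x k weight matrix."""
    full = d * k
    lora = rank * (d + k)  # one factor is d x rank, the other is rank x k
    return full, lora

full, lora = lora_param_counts(4096, 4096, 8)
print(full)                  # 16777216
print(lora)                  # 65536
print(f"{lora / full:.2%}")  # 0.39%
```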
Motion Score: A model parameter that controls the amount of motion in a generated video. Higher motion score = more movement; lower = more static, panning, or slow-motion style output.
Negative Prompt: Terms or descriptions you want the model to avoid. Example: "blurry, low quality, watermark, extra fingers, distorted face". Helps steer generation away from common failure modes.
Phoneme: The smallest unit of sound in a language. AI lip-sync works at the phoneme level, mapping each sound to a specific mouth shape (viseme) to produce accurate lip movements.
Prompt: The text description you provide to guide AI generation. A good video prompt includes: subject, action, setting, style, camera movement, and lighting. See the Prompting Guide for best practices.
Resolution: The pixel dimensions of the output video. Common resolutions: 720p (1280×720), 1080p (1920×1080), 4K (3840×2160). Higher resolution = sharper output, more credits used.
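The jump between resolution tiers is larger than the names suggest, because pixel count grows with both width and height. The sketch below only counts pixels per frame for the three resolutions listed above; it does not model credit pricing, which this glossary does not specify. The helper name `pixels_per_frame` is made up for the example.

```python
RESOLUTIONS = {
    "720p": (1280, 720),
    "1080p": (1920, 1080),
    "4K": (3840, 2160),
}

def pixels_per_frame(name: str) -> int:
    """Total pixels in one frame at the named resolution."""
    width, height = RESOLUTIONS[name]
    return width * height

base = pixels_per_frame("720p")
for name in RESOLUTIONS:
    count = pixels_per_frame(name)
    print(f"{name}: {count:,} pixels per frame ({count / base:.2f}x 720p)")
# 720p: 921,600 pixels per frame (1.00x 720p)
# 1080p: 2,073,600 pixels per frame (2.25x 720p)
# 4K: 8,294,400 pixels per frame (9.00x 720p)
```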
Seed: A number that initializes the random state of a generation. The same seed + same prompt = highly similar output, allowing reproducible results. Different seeds produce variation.
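The effect of a seed can be demonstrated with Python's standard `random` module, used here as a stand-in for a model's noise sampler. The function `sample_noise` is hypothetical. Real video models may still show small differences across hardware or software versions, which is consistent with "highly similar" rather than "identical" above.

```python
import random

def sample_noise(seed: int, n: int = 4) -> list[float]:
    """Draw n Gaussian values from a generator initialized with `seed`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

print(sample_noise(42) == sample_noise(42))  # True: same seed, same starting noise
print(sample_noise(42) == sample_noise(43))  # False: different seed, different noise
```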
T2V (Text-to-Video): Generating a video from a text description alone, with no image or video input. The model creates all visual content from the semantic content of the text prompt.
Temporal Consistency: The quality of an AI video where objects, characters, and backgrounds remain visually stable from frame to frame, without flickering, morphing, or disappearing. Achieved through cross-frame attention in modern video diffusion models.
Upscaling: Increasing the resolution of a video after generation using a separate AI model. Useful for taking a 1080p draft to 4K for final delivery.
V2V (Video-to-Video): A generation mode that takes an existing video as input and restyles it while preserving the original motion. Also called "video restyle" or "video transfer".
VAE (Variational Autoencoder): The encoder/decoder pair used in latent diffusion models to convert between pixel space and latent space. The encoder compresses frames; the decoder reconstructs them. Video quality is heavily influenced by VAE quality.
Viseme: A visual phoneme, i.e. the mouth shape that corresponds to a specific speech sound. Lip-sync AI generates the correct viseme for each phoneme in the audio track.
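The phoneme-to-viseme step can be pictured as a lookup table. The table below is hypothetical and heavily simplified; the phoneme symbols and viseme labels are invented for illustration and do not come from any lip-sync product. It does reflect one real property: several sounds (such as p, b, and m) look the same on the lips and so share a viseme.

```python
# Hypothetical, heavily simplified phoneme-to-viseme table (illustration only).
PHONEME_TO_VISEME = {
    "p": "lips_closed", "b": "lips_closed", "m": "lips_closed",
    "f": "lip_to_teeth", "v": "lip_to_teeth",
    "aa": "jaw_open",
    "uw": "lips_rounded",
}

def to_visemes(phonemes):
    """Map each phoneme to a mouth shape, falling back to a neutral pose."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

print(to_visemes(["m", "aa", "p"]))  # ['lips_closed', 'jaw_open', 'lips_closed']
```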
