I want the generated video to be completely free of subtitles. How should I best phrase my positive prompt and negative prompt to achieve this?

#42
by yykjy2004 - opened

from ltx_pipelines.ti2vid_two_stages import TI2VidTwoStagesPipeline
from ltx_core.loader import LoraPathStrengthAndSDOps
from ltx_core.loader.sd_ops import LTXV_LORA_COMFY_RENAMING_MAP
from ltx_core.model.video_vae import TilingConfig, get_video_chunks_number
from ltx_pipelines.utils.constants import DEFAULT_NEGATIVE_PROMPT, AUDIO_SAMPLE_RATE
from ltx_pipelines.utils.media_io import encode_video
import torch
import time

"""
hf download google/gemma-3-12b-it-qat-q4_0-unquantized --local-dir /mnt/localdisk/google/gemma-3-12b-it-qat-q4_0-unquantized
hf download Lightricks/LTX-2 --local-dir /mnt/localdisk/Lightricks/LTX-2
"""

pipeline = TI2VidTwoStagesPipeline(
    checkpoint_path="/data3/LTX-2/ltx-2-19b-dev-fp8.safetensors",
    spatial_upsampler_path="/data3/LTX-2/ltx-2-spatial-upscaler-x2-1.0.safetensors",
    gemma_root="/data3/LTX-2/gemma",
    distilled_lora=[
        LoraPathStrengthAndSDOps(
            path="/data3/LTX-2/ltx-2-19b-distilled-lora-384.safetensors",
            strength=0.8,
            sd_ops=LTXV_LORA_COMFY_RENAMING_MAP,
        )
    ],
    loras=[],
    fp8transformer=False,
)

tiling_config = TilingConfig.default()

duration = 20
height, width = 1280, 704
frame_rate = 25
num_frames = duration * frame_rate + 1
num_inference_steps = 20
cfg_guidance_scale = 3.5
images = [("/data3/LTX-2/inputs/7.png", 0, 0.8)]
output_path = "/data3/LTX-2/output111/101688889999_test.mp4"
prompt = """
A dimly lit long hospital corridor at night, with fluorescent lights flickering on and off, creating an oppressive and terrifying atmosphere.
Cool blue-green tones, deep shadows, and subtle volumetric fog fill the space. In the frame, on the left is female doctor Alen (holding a medical chart clipboard in her hand), on the right is female nurse Emma (wearing a nurse cap and blue scrubs). The two stand face to face, maintaining intense eye contact throughout the entire dialogue. The camera alternates between tight close-ups on the speaking character's face (focusing on expressions and lip movements) and medium shots showing both figures.
Emma (the nurse on the right) slightly tilts her head upward, her eyes flickering nervously, looking tensely at the other person, speaking in a trembling voice with perfect lip-sync, mouth opening and closing naturally in sync with the words: "You finally came…"
Alen (the doctor on the left) responds immediately with perfect lip-sync, mouth moving naturally: "Why do you say that?"
The corridor lights continue to flicker subtly.
Emma (the nurse on the right) lowers her head to organize the medical records in her hands, her fingers trembling slightly, speaking in a soft but uneasy voice with clear realistic lip movements and accurate lip-sync: "Tonight's night shift will be very long."
Alen (the doctor on the left) asks with perfect lip-sync, natural mouth articulation: "What happened in the hospital before?"
Emma (the nurse on the right) suddenly trembles, glancing nervously around, speaking in a low and urgent voice with visible mouth articulation and accurate lip-sync: "Some things shouldn't be asked."
Alen (the doctor on the left) responds with perfect lip-sync, mouth opening and closing naturally: "Okay, then let's go to work."
Emma (the nurse on the right) shows a relieved expression, speaking in a slightly relaxed but still fearful tone with lips visibly moving in sync: "Follow me, first check the ward."
A slow cart rolling sound comes from the end of the corridor.
Emma (the nurse on the right) quickly walks toward the end of the corridor, turning while walking and speaking nervously: "There's an abnormality in Room 304."
Throughout the scene: low subtle horror background music, distant echoing footsteps and clinking of medical instruments, characters' expressions filled with fear and unease, eyes frequently glancing toward dark corners. Cinematic horror style, realistic lighting and shadows, delicate facial expressions, high-quality details, ultra-realistic skin textures and movements. No text of any kind, no subtitles, no captions, no overlaid words, letters, signs, logos, or any form of on-screen text whatsoever.
"""
negative_prompt = "subtitle, caption, text, watermark, logo, timestamp, OSD, UI, score, channel icon, three hands, extra hand, third hand, 3 hands, multiple hands, identity inconsistency, frame-by-frame facial drift, static image, blurry, out of focus, overexposed, underexposed, low contrast, washed out colors, excessive noise, grainy texture, poor lighting, flickering, motion blur, distorted proportions, unnatural skin tones, deformed facial features, asymmetrical face, missing facial features, extra limbs, disfigured hands, wrong hand count, artifacts around text, inconsistent perspective, camera shake, incorrect depth of field, background too sharp, background clutter, distracting reflections, harsh shadows, inconsistent lighting direction, color banding, cartoonish rendering, 3D CGI look, unrealistic materials, uncanny valley effect, incorrect ethnicity, wrong gender, exaggerated expressions, wrong gaze direction, mismatched lip sync, silent or muted audio, distorted voice, robotic voice, echo, background noise, off-sync audio, incorrect dialogue, added dialogue, repetitive speech, jittery movement, awkward pauses, incorrect timing, unnatural transitions, inconsistent framing, tilted camera, flat lighting, inconsistent tone, cinematic oversaturation, stylized filters, or AI artifacts."
video_chunks_number = get_video_chunks_number(num_frames, tiling_config)

with torch.inference_mode():
    st = time.time()
    video, audio = pipeline(
        prompt=prompt,
        # Use the custom negative prompt defined above.
        negative_prompt=negative_prompt,
        seed=521,
        height=height,
        width=width,
        num_frames=num_frames,
        frame_rate=frame_rate,
        num_inference_steps=num_inference_steps,
        cfg_guidance_scale=cfg_guidance_scale,
        images=images,
        tiling_config=tiling_config,
    )

    encode_video(
        video=video,
        fps=frame_rate,
        audio=audio,
        audio_sample_rate=AUDIO_SAMPLE_RATE,
        output_path=output_path,
        video_chunks_number=video_chunks_number,
    )
print(f"time cost: {time.time() - st}")
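
A long negative prompt pasted as one wrapped string is easy to damage: when lines are joined, neighbouring words can fuse and stop matching the concept they were meant to exclude. One possible way to guard against that is to keep the terms in a list and join them in code. The sketch below is only an illustration: `build_negative_prompt`, `TEXT_TERMS`, and `QUALITY_TERMS` are hypothetical names, not part of `ltx_pipelines`, and the terms are a small sample taken from the negative prompt in the script.

```python
def build_negative_prompt(terms):
    """Join negative-prompt terms into one comma-separated string.

    Whitespace inside each term is collapsed, empty entries are skipped,
    and case-insensitive duplicates are dropped, keeping first-seen order.
    """
    seen = set()
    cleaned = []
    for term in terms:
        normalized = " ".join(term.split())
        key = normalized.lower()
        if normalized and key not in seen:
            seen.add(key)
            cleaned.append(normalized)
    return ", ".join(cleaned)


# Hypothetical grouping; the terms themselves come from the script's negative prompt.
TEXT_TERMS = ["subtitle", "caption", "text", "watermark", "logo", "timestamp"]
QUALITY_TERMS = ["blurry", "out of  focus", "incorrect depth of field", "Blurry"]

negative_prompt = build_negative_prompt(TEXT_TERMS + QUALITY_TERMS)
print(negative_prompt)
```

The resulting string can then be passed as the `negative_prompt` argument of the pipeline call. Whether this changes the subtitle behaviour is not something the sketch can show.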

image:
7
video:
