feat(P0): Enhance image prompt generation with core imagery extraction for better text-image alignment

2026-01-06 23:04:20 +08:00
parent b62fdb6958
commit 2978622f7f
1 changed file with 28 additions and 21 deletions
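
The revised prompt still requires a JSON object whose `image_prompts` array holds exactly `{narrations_count}` entries, one per narration. A caller may want to check that contract before using the result. The following is a minimal sketch of such a check, not code from this commit: the helper name `parse_image_prompts` and its error messages are hypothetical, and it assumes the model's reply arrives as a plain JSON string.

```python
import json
from typing import List


def parse_image_prompts(raw: str, narrations_count: int) -> List[str]:
    """Parse a model reply and enforce the one-to-one prompt contract.

    Hypothetical helper for illustration; not part of pixelle_video.
    """
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("reply must be a JSON object")

    prompts = data.get("image_prompts")
    if not isinstance(prompts, list):
        raise ValueError("'image_prompts' must be a list")

    # The template demands exactly one prompt per narration.
    if len(prompts) != narrations_count:
        raise ValueError(
            f"expected {narrations_count} prompts, got {len(prompts)}"
        )

    cleaned = []
    for index, prompt in enumerate(prompts):
        if not isinstance(prompt, str) or not prompt.strip():
            raise ValueError(f"prompt {index} is empty or not a string")
        cleaned.append(prompt.strip())
    return cleaned
```

A check like this only verifies the shape of the reply. Whether each prompt has real semantic relevance to its narration still depends on the prompt wording changed below.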
--- a/pixelle_video/prompts/image_generation.py
+++ b/pixelle_video/prompts/image_generation.py
@@ -58,31 +58,37 @@ Based on the existing video script, create corresponding **English** image promp
 # Input Content
 {narrations_json}

+# ⭐ Core Imagery Extraction (Critical for Relevance)
+For EACH narration, you MUST:
+1. **Extract 2-3 core visual imagery/metaphors** that best represent the narration's meaning
+2. **Identify the emotional tone** (hopeful, melancholic, inspiring, etc.)
+3. **Determine concrete visual elements** that embody these abstract concepts
+
+Example thought process:
+- Narration: "给自己一个不设限的探索时间"
+- Core Imagery: exploration, freedom, open paths
+- Emotional Tone: hopeful, adventurous
+- Visual Elements: open road, person looking at horizon, map with unmarked routes
+
 # Output Requirements

 ## Image Prompt Specifications
 - Language: **Must use English** (for AI image generation models)
- Description structure: scene + character action + emotion + symbolic elements
- Description length: Ensure clear, complete, and creative descriptions (recommended 50-100 English words)
+- **REQUIRED Structure**: [Core imagery] + [Scene description] + [Character action] + [Emotional atmosphere]
+- Description length: 50-100 English words
+- **The image prompt MUST directly reflect the extracted core imagery from the narration**

 ## Visual Creative Requirements
 - Each image must accurately reflect the specific content and emotion of the corresponding narration
- Use symbolic techniques to visualize abstract concepts (e.g., use paths to represent life choices, chains to represent constraints, etc.)
+- **Prioritize core visual metaphors** - the main visual elements must embody the narration's key message
+- Use symbolic techniques to visualize abstract concepts (e.g., paths=choices, chains=constraints, open doors=opportunities)
 - Scenes should express rich emotions and actions to enhance visual impact
- Highlight themes through composition and element arrangement, avoid overly literal representations

-## Key English Vocabulary Reference
- Symbolic elements: symbolic elements
- Expression: expression / facial expression
- Action: action / gesture / movement
- Scene: scene / setting
- Atmosphere: atmosphere / mood
-
-## Visual and Copy Coordination Principles
- Images should serve the copy, becoming a visual extension of the copy content
- Avoid visual elements unrelated to or contradicting the copy content
- Choose visual presentation methods that best enhance the persuasiveness of the copy
- Ensure the audience can quickly understand the core viewpoint of the copy through images
+## Visual and Narration Coordination Principles (Most Important)
+- **Direct semantic connection**: The main visual elements MUST represent the narration's core meaning
+- **Avoid decorative scenes**: Don't add unrelated beautiful scenery that doesn't support the message
+- **Ask yourself**: If someone saw only the image, could they guess what the narration is about?
+- **Test question**: What is the ONE THING this narration is about? Make sure that thing is visible in the image.

 ## Creative Guidance
 1. **Phenomenon Description Copy**: Use intuitive scenes to represent social phenomena
@@ -97,8 +103,8 @@ Strictly output in the following JSON format, **image prompts must be in English
 ```json
 {{
  "image_prompts": [
-    "[detailed English image prompt following the style requirements]",
-    "[detailed English image prompt following the style requirements]"
+    "[Core imagery visible] + [Scene with semantic connection to narration] + [Character/action reflecting the message] + [Emotional atmosphere]",
+    "[Next image prompt following the same structure]"
  ]
 }}
 ```
@@ -109,14 +115,15 @@ Strictly output in the following JSON format, **image prompts must be in English
 3. Input is {{"narrations": [narration array]}} format, output is {{"image_prompts": [image prompt array]}} format
 4. **The output image_prompts array must contain exactly {narrations_count} elements, corresponding one-to-one with the input narrations array**
 5. **Image prompts must use English** (for AI image generation models)
-6. Image prompts must accurately reflect the specific content and emotion of the corresponding narration
-7. Each image must be creative and visually impactful, avoid being monotonous
-8. Ensure visual scenes can enhance the persuasiveness of the copy and audience understanding
+6. **⭐ Most Critical: Each image prompt must have DIRECT semantic relevance to its narration**
+7. Before writing each prompt, mentally extract the core visual metaphor from the narration
+8. Verify: Could someone understand the narration's message from the image alone?

 Now, please create {narrations_count} corresponding **English** image prompts for the above {narrations_count} narrations. Only output JSON, no other content.
 """


+
 def build_image_prompt_prompt(
    narrations: List[str],
    min_words: int,