DeepBeepMeep committed on
Commit
30d0c66
·
1 Parent(s): 03085c8

Vace improvements

README.md CHANGED
@@ -21,6 +21,7 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTV Video models
21
 
22
 
23
  ## 🔥 Latest News!!
 
24
  * May 20 2025: 👋 Wan 2.1GP v5.2 : Added support for Wan CausVid which is a distilled Wan model that can generate nice looking videos in only 4 to 12 steps.
25
  The great thing is that Kijai (Kudos to him !) has created a CausVid Lora that can be combined with any existing Wan t2v model 14B like Wan Vace 14B.
26
  See instructions below on how to use CausVid.\
@@ -307,17 +308,20 @@ You can define multiple lines of macros. If there is only one macro line, the ap
307
 
308
  ### VACE ControlNet introduction
309
 
310
- Vace is a ControlNet 1.3B text2video model that allows you to do Video to Video and Reference to Video (inject your own images into the output video). So with Vace you can inject in the scene people or objects of your choice, animate a person, perform inpainting or outpainting, continue a video, ...
311
 
312
- First you need to select the Vace 1.3B model in the Drop Down box at the top. Please note that Vace works well for the moment only with videos up to 5s (81 frames).
313
 
314
 Besides the usual Text Prompt, three new types of visual hints can be provided (and combined!):
315
- - a Control Video: Based on your choice, you can decide to transfer the motion, the depth in a new Video. You can tell WanGP to use only the first n frames of Control Video and to extrapolate the rest. You can also do inpainting ). If the video contains area of grey color 127, they will be considered as masks and will be filled based on the Text prompt of the reference Images.
 
316
 
317
- - reference Images: Use this to inject people or objects of your choice in the video. You can select multiple reference Images. The integration of the image is more efficient if the background is replaced by the full white color. You can do that with your preferred background remover or use the built in background remover by checking the box *Remove background*
 
 
 
 
318
 
319
- - a Video Mask
320
- This offers a stronger mechanism to tell Vace which parts should be kept (black) or replaced (white). You can do as well inpainting / outpainting, fill the missing part of a video more efficientlty with just the video hint. If a video mask is white, it will be generated so with black frames at the beginning and at the end and the rest white, you could generate the missing frames in between.
321
 
322
 
323
  Examples:
@@ -336,13 +340,29 @@ There is also a guide that describes the various combination of hints (https://g
336
  It seems you will get better results with Vace if you turn on "Skip Layer Guidance" with its default configuration.
337
 
338
 Other recommended settings for Vace:
339
- - Use a long prompt description especially for the people / objects that are in the background and not in reference images. This will ensure consistency between the windows.
340
 - Set a medium-size overlap window: long enough to give the model a sense of the motion, but short enough that any overlapped blurred frames do not turn the rest of the video into a blurred video
341
 - Truncate at least the last 4 frames of each generated window, as Vace's last frames tend to be blurry
342
 
343
 
344
- ### VACE and Sky Reels v2 Diffusion Forcing Slidig Window
345
- With this mode (that works for the moment only with Vace and Sky Reels v2) you can merge mutiple Videos to form a very long video (up to 1 min).
346
 
347
 When combined with Vace this feature can use the same control video to generate the full video that results from concatenating the different windows. For instance the first 0-4s of the control video will be used to generate the first window, then the next 4-8s of the control video will be used to generate the second window, and so on. So if your control video contains a person walking, your generated video could contain up to one minute of this person walking.
348
 
@@ -352,12 +372,16 @@ Sliding Windows are turned on by default and are triggered as soon as you try to
352
 
353
  Although the window duration is set by the *Sliding Window Size* form field, the actual number of frames generated by each iteration will be less, because of the *overlap frames* and *discard last frames*:
354
 - *overlap frames* : the first frames of a new window are filled with the last frames of the previous window in order to ensure continuity between the two windows
355
- - *discard last frames* : quite often (Vace model Only) the last frames of a window have a worse quality. You can decide here how many ending frames of a new window should be dropped.
356
- s
 
 
 
357
  Number of Generated Frames = [Number of Windows - 1] * ([Window Size] - [Overlap Frames] - [Discard Last Frames]) + [Window Size]
358
 
359
  Experimental: if your prompt is broken into multiple lines (each line separated by a carriage return), then each line of the prompt will be used for a new window. If there are more windows to generate than prompt lines, the last prompt line will be repeated.
360
 
 
361
  ### Command line parameters for Gradio Server
362
  --i2v : launch the image to video generator\
363
  --t2v : launch the text to video generator (default defined in the configuration)\
 
21
 
22
 
23
  ## 🔥 Latest News!!
24
+ * May 23 2025: 👋 Wan 2.1GP v5.21 : Improvements for Vace: better transitions between Sliding Windows, support for Image masks in Matanyone, new Extend Video for Vace, and different types of automated background removal
25
  * May 20 2025: 👋 Wan 2.1GP v5.2 : Added support for Wan CausVid which is a distilled Wan model that can generate nice looking videos in only 4 to 12 steps.
26
  The great thing is that Kijai (Kudos to him !) has created a CausVid Lora that can be combined with any existing Wan t2v model 14B like Wan Vace 14B.
27
  See instructions below on how to use CausVid.\
 
308
 
309
  ### VACE ControlNet introduction
310
 
311
+ Vace is a ControlNet that allows you to do Video to Video and Reference to Video (inject your own images into the output video). It is probably one of the most powerful Wan models, and you will be able to do amazing things once you master it: inject people or objects of your choice into the scene, animate a person, perform inpainting or outpainting, continue a video, ...
312
 
313
+ First you need to select the Vace 1.3B model or the Vace 14B model in the drop-down box at the top. Please note that, for the moment, Vace works well only with videos up to 7s, with the Riflex option turned on.
314
 
315
 Besides the usual Text Prompt, three new types of visual hints can be provided (and combined!):
316
+ - *a Control Video*\
317
+ Based on your choice, you can decide to transfer the motion or the depth of this video to a new video. You can tell WanGP to use only the first n frames of the Control Video and to extrapolate the rest. You can also do inpainting: if the video contains areas of the grey color 127, they will be treated as masks and will be filled based on the text prompt and the Reference Images (see the sketch after this list).
318
 
319
+ - *Reference Images*\
320
+ A reference Image can be either a background that you want to use as the setting for the video, or people or objects of your choice that you want to inject in the video. You can select multiple reference Images. The integration of an object / person image is more efficient if its background is replaced by full white. For complex background removal you can use the Image version of the Matanyone tool embedded in WanGP, or you can use the fast on-the-fly background remover by selecting an option in the *Remove background* drop-down box. Be careful not to remove the background of a reference image that is a landscape or setting (always the first reference image) that you want to use as a start image / background for the video. It helps greatly to reference and describe explicitly the injected objects / people of the Reference Images in the text prompt.
321
+
322
+ - *a Video Mask*\
323
+ This offers a stronger mechanism to tell Vace which parts should be kept (black) or replaced (white). You can also do inpainting / outpainting and fill the missing parts of a video more efficiently than with the Control Video hint alone. For instance, if a video mask is white except at the beginning and at the end where it is black, the first and last frames will be kept and everything in between will be generated.
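
To make the colour conventions above concrete, here is a minimal sketch (not WanGP code) that prepares one control-video frame with a grey 127 area to regenerate, the matching black / white mask frame, and a reference image flattened onto a white background. It assumes numpy and Pillow; the file names, resolution and coordinates are purely illustrative.

```python
import numpy as np
from PIL import Image

W, H = 832, 480                               # illustrative resolution
x0, y0, x1, y1 = 300, 100, 520, 360           # illustrative region to regenerate

# Control video frame: keep the original pixels, paint the area to fill in grey 127.
control = np.array(Image.open("frame_0001.png").convert("RGB").resize((W, H)))
control[y0:y1, x0:x1] = 127

# Video mask frame: black = keep the original video, white = replace / generate.
mask = np.zeros((H, W, 3), dtype=np.uint8)
mask[y0:y1, x0:x1] = 255

# Reference image: an injected person / object integrates better on a plain
# white background (the *Remove background* option can also do this for you).
ref = Image.open("person.png").convert("RGBA")
white = Image.new("RGBA", ref.size, (255, 255, 255, 255))
ref_on_white = Image.alpha_composite(white, ref).convert("RGB")

Image.fromarray(control).save("control_0001.png")
Image.fromarray(mask).save("mask_0001.png")
ref_on_white.save("reference_white_bg.png")
```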
324
 
 
 
325
 
326
 
327
  Examples:
 
340
  It seems you will get better results with Vace if you turn on "Skip Layer Guidance" with its default configuration.
341
 
342
 Other recommended settings for Vace:
343
+ - Use a long prompt description, especially for the people / objects that are in the background and not in the Reference Images. This will ensure consistency between the windows.
344
 - Set a medium-size overlap window: long enough to give the model a sense of the motion, but short enough that any overlapped blurred frames do not turn the rest of the video into a blurred video
345
 - Truncate at least the last 4 frames of each generated window, as Vace's last frames tend to be blurry
346
 
347
+ **WanGP integrates the Matanyone tool, which is tuned to work with Vace**.
348
+
349
+ This can be very useful to create a control video and a matching mask video at the same time.\
350
+ For example, if you want to replace a face of a person in a video:
351
+ - load the video in the Matanyone tool
352
+ - click the face on the first frame and create a mask for it (if you have trouble selecting only the face, check the tips below)
353
+ - generate both the control video and the mask video by clicking *Generate Video Matting*
354
+ - Click *Export to current Video Input and Video Mask*
355
+ - In the *Reference Image* field of the Vace screen, load a picture of the replacement face
356
+
357
+ Please note that sometimes it may be useful to create *Background Masks*, for instance if you want to replace everything except a character that is in the video. You can do that by selecting *Background Mask* in the *Matanyone settings*.
358
+
359
+ If you have some trouble creating the perfect mask, be aware of these tips:
360
+ - Using the Matanyone Settings you can also define Negative Point Prompts to remove parts of the current selection.
361
+ - Sometimes it is very hard to fit everything you want in a single mask; it may be much easier to combine multiple independent sub masks before producing the matting: each sub mask is created by selecting an area of the image and clicking the *Add Mask* button. Sub masks can then be enabled / disabled in the Matanyone settings (a rough sketch of how they are combined follows this list).
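
For the curious, merging the enabled sub masks into one template mask looks roughly like the numpy snippet below. This is a simplified reading of the image_matting code added in this commit: each sub mask is a binary array, gets its own integer label, and overlapping pixels end up with the later mask's label. The sizes and regions are illustrative.

```python
import numpy as np

H, W = 480, 832                                # illustrative frame size

# Two binary (0/1) sub masks such as those created with "Add Mask".
mask_001 = np.zeros((H, W), dtype=np.uint8)
mask_001[100:300, 50:400] = 1
mask_002 = np.zeros((H, W), dtype=np.uint8)
mask_002[250:450, 350:700] = 1

# Combine them into a single labelled template mask (label 1, label 2, ...).
template_mask = mask_001 * 1
for label, sub_mask in [(2, mask_002)]:
    template_mask = np.clip(template_mask + sub_mask * label, 0, label)
```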
362
+
363
 
364
+ ### VACE, Sky Reels v2 Diffusion Forcing Sliding Window and LTX Video
365
+ With this mode (which works for the moment only with Vace, Sky Reels v2 and LTX Video) you can merge multiple videos to form a very long video (up to 1 min).
366
 
367
 When combined with Vace this feature can use the same control video to generate the full video that results from concatenating the different windows. For instance the first 0-4s of the control video will be used to generate the first window, then the next 4-8s of the control video will be used to generate the second window, and so on. So if your control video contains a person walking, your generated video could contain up to one minute of this person walking.
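
A rough illustration of that per-window mapping, assuming 16 fps and 4 s windows, and ignoring the overlap and discarded frames discussed below (all values are illustrative):

```python
fps = 16                 # assumed frame rate
window_seconds = 4       # assumed window duration
total_seconds = 60       # target length of the merged video

frames_per_window = window_seconds * fps
for w in range(total_seconds // window_seconds):
    start = w * frames_per_window            # control frames for 0-4 s, 4-8 s, ...
    end = start + frames_per_window
    print(f"window {w + 1}: control-video frames [{start}, {end})")
```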
368
 
 
372
 
373
  Although the window duration is set by the *Sliding Window Size* form field, the actual number of frames generated by each iteration will be less, because of the *overlap frames* and *discard last frames*:
374
 - *overlap frames* : the first frames of a new window are filled with the last frames of the previous window in order to ensure continuity between the two windows
375
+ - *discard last frames* : sometimes (Vace 1.3B model only) the last frames of a window have a lower quality. You can decide here how many ending frames of each new window should be dropped.
376
+
377
+ There is some inevitable quality degradation over time due to accumulated calculation errors. One trick to reduce or hide it is to add some noise (usually not noticeable) to the overlapped frames using the *add overlapped noise* option.
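
Under the hood, this option appears to mix a small amount of Gaussian noise into the overlapped latents, roughly as in the sketch below (simplified from the wan/text2video.py changes in this commit; the tensor shape is illustrative and the real code operates on the VAE latents of the overlapped frames):

```python
import torch

overlap_noise = 20                             # value of the *add overlapped noise* option
noise_factor = overlap_noise / 1000            # same scaling as in the commit

# Latent frames shared with the previous window (illustrative shape).
overlapped_latents = torch.randn(16, 4, 60, 104)
noisy_overlap = (overlapped_latents * (1.0 - noise_factor)
                 + torch.randn_like(overlapped_latents) * noise_factor)
```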
378
+
379
+
380
  Number of Generated Frames = [Number of Windows - 1] * ([Window Size] - [Overlap Frames] - [Discard Last Frames]) + [Window Size]
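
For example, with purely illustrative values the formula gives:

```python
window_size = 81           # frames generated per window
overlap_frames = 8
discard_last_frames = 4
num_windows = 5

generated_frames = (num_windows - 1) * (window_size - overlap_frames - discard_last_frames) + window_size
print(generated_frames)    # (5 - 1) * (81 - 8 - 4) + 81 = 357
```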
381
 
382
  Experimental: if your prompt is broken into multiple lines (each line separated by a carriage return), then each line of the prompt will be used for a new window. If there are more windows to generate than prompt lines, the last prompt line will be repeated.
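
A small sketch of that prompt-per-window behaviour (the last line is reused once there are more windows than prompt lines; the prompt text is illustrative):

```python
prompt = """A man walks through a forest.
He reaches a clearing at sunset.
He sits down next to a campfire."""

lines = [line for line in prompt.splitlines() if line.strip()]
num_windows = 5
window_prompts = [lines[min(i, len(lines) - 1)] for i in range(num_windows)]
# windows 4 and 5 both reuse "He sits down next to a campfire."
```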
383
 
384
+
385
  ### Command line parameters for Gradio Server
386
  --i2v : launch the image to video generator\
387
  --t2v : launch the text to video generator (default defined in the configuration)\
ltx_video/pipelines/pipeline_ltx_video.py CHANGED
@@ -1502,7 +1502,7 @@ class LTXVideoPipeline(DiffusionPipeline):
1502
  extra_conditioning_mask.append(conditioning_mask)
1503
 
1504
  # Patchify the updated latents and calculate their pixel coordinates
1505
- init_latents, init_latent_coords = self.patchifier.patchify(
1506
  latents=init_latents
1507
  )
1508
  init_pixel_coords = latent_to_pixel_coords(
 
1502
  extra_conditioning_mask.append(conditioning_mask)
1503
 
1504
  # Patchify the updated latents and calculate their pixel coordinates
1505
+ init_latents, init_latent_coords = self.patchifier.patchify(
1506
  latents=init_latents
1507
  )
1508
  init_pixel_coords = latent_to_pixel_coords(
preprocessing/matanyone/app.py CHANGED
@@ -85,7 +85,7 @@ def get_frames_from_image(image_input, image_state):
85
  model.samcontroler.sam_controler.reset_image()
86
  model.samcontroler.sam_controler.set_image(image_state["origin_images"][0])
87
  return image_state, image_info, image_state["origin_images"][0], \
88
- gr.update(visible=True, maximum=10, value=10), gr.update(visible=True, maximum=len(frames), value=len(frames)), gr.update(visible=False, maximum=len(frames), value=len(frames)), \
89
  gr.update(visible=True), gr.update(visible=True), \
90
  gr.update(visible=True), gr.update(visible=True),\
91
  gr.update(visible=True), gr.update(visible=True), \
@@ -273,6 +273,57 @@ def save_video(frames, output_path, fps):
273
 
274
  return output_path
275
 
276
  # video matting
277
  def video_matting(video_state, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
278
  matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
@@ -397,7 +448,7 @@ def restart():
397
  "inference_times": 0,
398
  "negative_click_times" : 0,
399
  "positive_click_times": 0,
400
- "mask_save": arg_mask_save,
401
  "multi_mask": {
402
  "mask_names": [],
403
  "masks": []
@@ -457,6 +508,15 @@ def export_to_vace_video_input(foreground_video_output):
457
  gr.Info("Masked Video Input transferred to Vace For Inpainting")
458
  return "V#" + str(time.time()), foreground_video_output
459
 
 
460
  def export_to_current_video_engine(foreground_video_output, alpha_video_output):
461
  gr.Info("Masked Video Input and Full Mask transferred to Current Video Engine For Inpainting")
462
  # return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
@@ -471,15 +531,18 @@ def teleport_to_vace_1_3B():
471
  def teleport_to_vace_14B():
472
  return gr.Tabs(selected="video_gen"), gr.Dropdown(value="vace_14B")
473
 
474
- def display(tabs, model_choice, vace_video_input, vace_video_mask, video_prompt_video_guide_trigger):
475
  # my_tab.select(fn=load_unload_models, inputs=[], outputs=[])
476
 
477
  media_url = "https://github.com/pq-yang/MatAnyone/releases/download/media/"
478
 
479
  # download assets
480
 
481
- gr.Markdown("Mast Edition is provided by MatAnyone")
482
-
 
 
 
483
  with gr.Column( visible=True):
484
  with gr.Row():
485
  with gr.Accordion("Video Tutorial (click to expand)", open=False, elem_classes="custom-bg"):
@@ -493,216 +556,368 @@ def display(tabs, model_choice, vace_video_input, vace_video_mask, video_prompt_
493
  gr.Video(value="preprocessing/matanyone/tutorial_multi_targets.mp4", elem_classes="video")
494
 
495
 
496
- click_state = gr.State([[],[]])
497
-
498
- interactive_state = gr.State({
499
- "inference_times": 0,
500
- "negative_click_times" : 0,
501
- "positive_click_times": 0,
502
- "mask_save": arg_mask_save,
503
- "multi_mask": {
504
- "mask_names": [],
505
- "masks": []
506
- },
507
- "track_end_number": None,
508
- }
509
- )
510
-
511
- video_state = gr.State(
512
- {
513
- "user_name": "",
514
- "video_name": "",
515
- "origin_images": None,
516
- "painted_images": None,
517
- "masks": None,
518
- "inpaint_masks": None,
519
- "logits": None,
520
- "select_frame_number": 0,
521
- "fps": 16,
522
- "audio": "",
523
- }
524
- )
525
-
526
- with gr.Column( visible=True):
527
- with gr.Row():
528
- with gr.Accordion('MatAnyone Settings (click to expand)', open=False):
529
- with gr.Row():
530
- erode_kernel_size = gr.Slider(label='Erode Kernel Size',
531
- minimum=0,
532
- maximum=30,
533
- step=1,
534
- value=10,
535
- info="Erosion on the added mask",
536
- interactive=True)
537
- dilate_kernel_size = gr.Slider(label='Dilate Kernel Size',
538
- minimum=0,
539
- maximum=30,
540
- step=1,
541
- value=10,
542
- info="Dilation on the added mask",
543
- interactive=True)
544
-
545
- with gr.Row():
546
- image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Start Frame", info="Choose the start frame for target assignment and video matting", visible=False)
547
- end_selection_slider = gr.Slider(minimum=1, maximum=300, step=1, value=81, label="Last Frame to Process", info="Last Frame to Process", visible=False)
548
 
549
- track_pause_number_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="End frame", visible=False)
 
550
  with gr.Row():
551
- point_prompt = gr.Radio(
552
- choices=["Positive", "Negative"],
553
- value="Positive",
554
- label="Point Prompt",
555
- info="Click to add positive or negative point for target mask",
556
- interactive=True,
557
- visible=False,
558
- min_width=100,
559
- scale=1)
560
- matting_type = gr.Radio(
561
- choices=["Foreground", "Background"],
562
- value="Foreground",
563
- label="Matting Type",
564
- info="Type of Video Matting to Generate",
565
- interactive=True,
566
- visible=False,
567
- min_width=100,
568
- scale=1)
569
- mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False, scale=2)
570
-
571
- gr.Markdown("---")
572
-
573
- with gr.Column():
574
- # input video
575
- with gr.Row(equal_height=True):
576
- with gr.Column(scale=2):
577
- gr.Markdown("## Step1: Upload video")
578
- with gr.Column(scale=2):
579
- step2_title = gr.Markdown("## Step2: Add masks <small>(Several clicks then **`Add Mask`** <u>one by one</u>)</small>", visible=False)
580
- with gr.Row(equal_height=True):
581
- with gr.Column(scale=2):
582
- video_input = gr.Video(label="Input Video", elem_classes="video")
583
- extract_frames_button = gr.Button(value="Load Video", interactive=True, elem_classes="new_button")
584
- with gr.Column(scale=2):
585
- video_info = gr.Textbox(label="Video Info", visible=False)
586
- template_frame = gr.Image(label="Start Frame", type="pil",interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
 
587
  with gr.Row():
588
- clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, min_width=100)
589
- add_mask_button = gr.Button(value="Set Mask", interactive=True, visible=False, min_width=100)
590
- remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, min_width=100) # no use
591
- matting_button = gr.Button(value="Generate Video Matting", interactive=True, visible=False, min_width=100)
 
592
  with gr.Row():
593
- gr.Markdown("")
594
-
595
- # output video
596
- with gr.Column() as output_row: #equal_height=True
597
- with gr.Row():
598
- with gr.Column(scale=2):
599
- foreground_video_output = gr.Video(label="Masked Video Output", visible=False, elem_classes="video")
600
- foreground_output_button = gr.Button(value="Black & White Video Output", visible=False, elem_classes="new_button")
601
- with gr.Column(scale=2):
602
- alpha_video_output = gr.Video(label="B & W Mask Video Output", visible=False, elem_classes="video")
603
  alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
604
- with gr.Row():
605
- with gr.Row(visible= False):
606
- export_to_vace_video_14B_btn = gr.Button("Export to current Video Input Video For Inpainting", visible= False)
607
- with gr.Row(visible= True):
608
- export_to_current_video_engine_btn = gr.Button("Export to current Video Input and Video Mask", visible= False)
609
-
610
- export_to_vace_video_14B_btn.click( fn=teleport_to_vace_14B, inputs=[], outputs=[tabs, model_choice]).then(
611
- fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [video_prompt_video_guide_trigger, vace_video_input, vace_video_mask])
612
-
613
- export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [vace_video_input, vace_video_mask]).then( #video_prompt_video_guide_trigger,
614
- fn=teleport_to_video_tab, inputs= [], outputs= [tabs])
615
-
616
- # first step: get the video information
617
- extract_frames_button.click(
618
- fn=get_frames_from_video,
619
- inputs=[
620
- video_input, video_state
621
- ],
622
- outputs=[video_state, video_info, template_frame,
623
- image_selection_slider, end_selection_slider, track_pause_number_slider, point_prompt, matting_type, clear_button_click, add_mask_button, matting_button, template_frame,
624
- foreground_video_output, alpha_video_output, foreground_output_button, alpha_output_button, mask_dropdown, step2_title]
625
- )
626
-
627
- # second step: select images from slider
628
- image_selection_slider.release(fn=select_video_template,
629
- inputs=[image_selection_slider, video_state, interactive_state],
630
- outputs=[template_frame, video_state, interactive_state], api_name="select_image")
631
- track_pause_number_slider.release(fn=get_end_number,
632
- inputs=[track_pause_number_slider, video_state, interactive_state],
633
- outputs=[template_frame, interactive_state], api_name="end_image")
634
-
635
- # click select image to get mask using sam
636
- template_frame.select(
637
- fn=sam_refine,
638
- inputs=[video_state, point_prompt, click_state, interactive_state],
639
- outputs=[template_frame, video_state, interactive_state]
640
- )
641
 
642
- # add different mask
643
- add_mask_button.click(
644
- fn=add_multi_mask,
645
- inputs=[video_state, interactive_state, mask_dropdown],
646
- outputs=[interactive_state, mask_dropdown, template_frame, click_state]
647
- )
 
648
 
649
- remove_mask_button.click(
650
- fn=remove_multi_mask,
651
- inputs=[interactive_state, mask_dropdown],
652
- outputs=[interactive_state, mask_dropdown]
653
- )
654
 
655
- # video matting
656
- matting_button.click(
657
- fn=show_outputs,
658
- inputs=[],
659
- outputs=[foreground_video_output, alpha_video_output]).then(
660
- fn=video_matting,
661
- inputs=[video_state, end_selection_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size],
662
- outputs=[foreground_video_output, alpha_video_output,foreground_video_output, alpha_video_output, export_to_vace_video_14B_btn, export_to_current_video_engine_btn]
663
- )
664
 
665
- # click to get mask
666
- mask_dropdown.change(
667
- fn=show_mask,
668
- inputs=[video_state, interactive_state, mask_dropdown],
669
- outputs=[template_frame]
670
- )
671
-
672
- # clear input
673
- video_input.change(
674
- fn=restart,
675
- inputs=[],
676
- outputs=[
677
- video_state,
678
- interactive_state,
679
- click_state,
680
- foreground_video_output, alpha_video_output,
681
- template_frame,
682
- image_selection_slider, end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
683
- add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
684
- ],
685
- queue=False,
686
- show_progress=False)
687
-
688
- video_input.clear(
689
- fn=restart,
690
- inputs=[],
691
- outputs=[
692
- video_state,
693
- interactive_state,
694
- click_state,
695
- foreground_video_output, alpha_video_output,
696
- template_frame,
697
- image_selection_slider , end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
698
- add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
699
- ],
700
- queue=False,
701
- show_progress=False)
702
-
703
- # points clear
704
- clear_button_click.click(
705
- fn = clear_click,
706
- inputs = [video_state, click_state,],
707
- outputs = [template_frame,click_state],
708
- )
 
85
  model.samcontroler.sam_controler.reset_image()
86
  model.samcontroler.sam_controler.set_image(image_state["origin_images"][0])
87
  return image_state, image_info, image_state["origin_images"][0], \
88
+ gr.update(visible=True, maximum=10, value=10), gr.update(visible=False, maximum=len(frames), value=len(frames)), \
89
  gr.update(visible=True), gr.update(visible=True), \
90
  gr.update(visible=True), gr.update(visible=True),\
91
  gr.update(visible=True), gr.update(visible=True), \
 
273
 
274
  return output_path
275
 
276
+ # image matting
277
+ def image_matting(video_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, refine_iter):
278
+ matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
279
+ if interactive_state["track_end_number"]:
280
+ following_frames = video_state["origin_images"][video_state["select_frame_number"]:interactive_state["track_end_number"]]
281
+ else:
282
+ following_frames = video_state["origin_images"][video_state["select_frame_number"]:]
283
+
284
+ if interactive_state["multi_mask"]["masks"]:
285
+ if len(mask_dropdown) == 0:
286
+ mask_dropdown = ["mask_001"]
287
+ mask_dropdown.sort()
288
+ template_mask = interactive_state["multi_mask"]["masks"][int(mask_dropdown[0].split("_")[1]) - 1] * (int(mask_dropdown[0].split("_")[1]))
289
+ for i in range(1,len(mask_dropdown)):
290
+ mask_number = int(mask_dropdown[i].split("_")[1]) - 1
291
+ template_mask = np.clip(template_mask+interactive_state["multi_mask"]["masks"][mask_number]*(mask_number+1), 0, mask_number+1)
292
+ video_state["masks"][video_state["select_frame_number"]]= template_mask
293
+ else:
294
+ template_mask = video_state["masks"][video_state["select_frame_number"]]
295
+
296
+ # operation error
297
+ if len(np.unique(template_mask))==1:
298
+ template_mask[0][0]=1
299
+ foreground, alpha = matanyone(matanyone_processor, following_frames, template_mask*255, r_erode=erode_kernel_size, r_dilate=dilate_kernel_size, n_warmup=refine_iter)
300
+
301
+
302
+ foreground_mat = False
303
+
304
+ output_frames = []
305
+ for frame_origin, frame_alpha in zip(following_frames, alpha):
306
+ if foreground_mat:
307
+ frame_alpha[frame_alpha > 127] = 255
308
+ frame_alpha[frame_alpha <= 127] = 0
309
+ else:
310
+ frame_temp = frame_alpha.copy()
311
+ frame_alpha[frame_temp > 127] = 0
312
+ frame_alpha[frame_temp <= 127] = 255
313
+
314
+
315
+ output_frame = np.bitwise_and(frame_origin, 255-frame_alpha)
316
+ frame_grey = frame_alpha.copy()
317
+ frame_grey[frame_alpha == 255] = 255
318
+ output_frame += frame_grey
319
+ output_frames.append(output_frame)
320
+ foreground = output_frames
321
+
322
+ foreground_output = Image.fromarray(foreground[-1])
323
+ alpha_output = Image.fromarray(alpha[-1][:,:,0])
324
+
325
+ return foreground_output, gr.update(visible=True)
326
+
327
  # video matting
328
  def video_matting(video_state, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
329
  matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
 
448
  "inference_times": 0,
449
  "negative_click_times" : 0,
450
  "positive_click_times": 0,
451
+ "mask_save": False,
452
  "multi_mask": {
453
  "mask_names": [],
454
  "masks": []
 
508
  gr.Info("Masked Video Input transferred to Vace For Inpainting")
509
  return "V#" + str(time.time()), foreground_video_output
510
 
511
+
512
+ def export_image(image_refs, image_output):
513
+ gr.Info("Masked Image transferred to Current Video")
514
+ # return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
515
+ if image_refs == None:
516
+ image_refs =[]
517
+ image_refs.append( image_output)
518
+ return image_refs
519
+
520
  def export_to_current_video_engine(foreground_video_output, alpha_video_output):
521
  gr.Info("Masked Video Input and Full Mask transferred to Current Video Engine For Inpainting")
522
  # return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
 
531
  def teleport_to_vace_14B():
532
  return gr.Tabs(selected="video_gen"), gr.Dropdown(value="vace_14B")
533
 
534
+ def display(tabs, model_choice, vace_video_input, vace_video_mask, vace_image_refs, video_prompt_video_guide_trigger):
535
  # my_tab.select(fn=load_unload_models, inputs=[], outputs=[])
536
 
537
  media_url = "https://github.com/pq-yang/MatAnyone/releases/download/media/"
538
 
539
  # download assets
540
 
541
+ gr.Markdown("<B>Mast Edition is provided by MatAnyone</B>")
542
+ gr.Markdown("If you have some trouble creating the perfect mask, be aware of these tips:")
543
+ gr.Markdown("- Using the Matanyone Settings you can also define Negative Point Prompts to remove parts of the current selection.")
544
+ gr.Markdown("- Sometime it is very hard to fit everything you want in a single mask, it may be much easier to combine multiple independent sub Masks before producing the Matting : each sub Mask is created by selecting an area of an image and by clicking the Add Mask button. Sub masks can then be enabled / disabled in the Matanyone settings.")
545
+
546
  with gr.Column( visible=True):
547
  with gr.Row():
548
  with gr.Accordion("Video Tutorial (click to expand)", open=False, elem_classes="custom-bg"):
 
556
  gr.Video(value="preprocessing/matanyone/tutorial_multi_targets.mp4", elem_classes="video")
557
 
558
 
559
+
 
 
560
 
561
+ with gr.Tabs():
562
+ with gr.TabItem("Video"):
563
+
564
+ click_state = gr.State([[],[]])
565
+
566
+ interactive_state = gr.State({
567
+ "inference_times": 0,
568
+ "negative_click_times" : 0,
569
+ "positive_click_times": 0,
570
+ "mask_save": arg_mask_save,
571
+ "multi_mask": {
572
+ "mask_names": [],
573
+ "masks": []
574
+ },
575
+ "track_end_number": None,
576
+ }
577
+ )
578
+
579
+ video_state = gr.State(
580
+ {
581
+ "user_name": "",
582
+ "video_name": "",
583
+ "origin_images": None,
584
+ "painted_images": None,
585
+ "masks": None,
586
+ "inpaint_masks": None,
587
+ "logits": None,
588
+ "select_frame_number": 0,
589
+ "fps": 16,
590
+ "audio": "",
591
+ }
592
+ )
593
+
594
+ with gr.Column( visible=True):
595
  with gr.Row():
596
+ with gr.Accordion('MatAnyone Settings (click to expand)', open=False):
597
+ with gr.Row():
598
+ erode_kernel_size = gr.Slider(label='Erode Kernel Size',
599
+ minimum=0,
600
+ maximum=30,
601
+ step=1,
602
+ value=10,
603
+ info="Erosion on the added mask",
604
+ interactive=True)
605
+ dilate_kernel_size = gr.Slider(label='Dilate Kernel Size',
606
+ minimum=0,
607
+ maximum=30,
608
+ step=1,
609
+ value=10,
610
+ info="Dilation on the added mask",
611
+ interactive=True)
612
+
613
+ with gr.Row():
614
+ image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Start Frame", info="Choose the start frame for target assignment and video matting", visible=False)
615
+ end_selection_slider = gr.Slider(minimum=1, maximum=300, step=1, value=81, label="Last Frame to Process", info="Last Frame to Process", visible=False)
616
+
617
+ track_pause_number_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="End frame", visible=False)
618
+ with gr.Row():
619
+ point_prompt = gr.Radio(
620
+ choices=["Positive", "Negative"],
621
+ value="Positive",
622
+ label="Point Prompt",
623
+ info="Click to add positive or negative point for target mask",
624
+ interactive=True,
625
+ visible=False,
626
+ min_width=100,
627
+ scale=1)
628
+ matting_type = gr.Radio(
629
+ choices=["Foreground", "Background"],
630
+ value="Foreground",
631
+ label="Matting Type",
632
+ info="Type of Video Matting to Generate",
633
+ interactive=True,
634
+ visible=False,
635
+ min_width=100,
636
+ scale=1)
637
+ mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False, scale=2)
638
+
639
+ # input video
640
+ with gr.Row(equal_height=True):
641
+ with gr.Column(scale=2):
642
+ gr.Markdown("## Step1: Upload video")
643
+ with gr.Column(scale=2):
644
+ step2_title = gr.Markdown("## Step2: Add masks <small>(Several clicks then **`Add Mask`** <u>one by one</u>)</small>", visible=False)
645
+ with gr.Row(equal_height=True):
646
+ with gr.Column(scale=2):
647
+ video_input = gr.Video(label="Input Video", elem_classes="video")
648
+ extract_frames_button = gr.Button(value="Load Video", interactive=True, elem_classes="new_button")
649
+ with gr.Column(scale=2):
650
+ video_info = gr.Textbox(label="Video Info", visible=False)
651
+ template_frame = gr.Image(label="Start Frame", type="pil",interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
652
+ with gr.Row():
653
+ clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, min_width=100)
654
+ add_mask_button = gr.Button(value="Set Mask", interactive=True, visible=False, min_width=100)
655
+ remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, min_width=100) # no use
656
+ matting_button = gr.Button(value="Generate Video Matting", interactive=True, visible=False, min_width=100)
657
+ with gr.Row():
658
+ gr.Markdown("")
659
+
660
+ # output video
661
+ with gr.Column() as output_row: #equal_height=True
662
+ with gr.Row():
663
+ with gr.Column(scale=2):
664
+ foreground_video_output = gr.Video(label="Masked Video Output", visible=False, elem_classes="video")
665
+ foreground_output_button = gr.Button(value="Black & White Video Output", visible=False, elem_classes="new_button")
666
+ with gr.Column(scale=2):
667
+ alpha_video_output = gr.Video(label="B & W Mask Video Output", visible=False, elem_classes="video")
668
+ alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
669
+ with gr.Row():
670
+ with gr.Row(visible= False):
671
+ export_to_vace_video_14B_btn = gr.Button("Export to current Video Input Video For Inpainting", visible= False)
672
+ with gr.Row(visible= True):
673
+ export_to_current_video_engine_btn = gr.Button("Export to current Video Input and Video Mask", visible= False)
674
+
675
+ export_to_vace_video_14B_btn.click( fn=teleport_to_vace_14B, inputs=[], outputs=[tabs, model_choice]).then(
676
+ fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [video_prompt_video_guide_trigger, vace_video_input, vace_video_mask])
677
+
678
+ export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [vace_video_input, vace_video_mask]).then( #video_prompt_video_guide_trigger,
679
+ fn=teleport_to_video_tab, inputs= [], outputs= [tabs])
680
+
681
+
682
+ # first step: get the video information
683
+ extract_frames_button.click(
684
+ fn=get_frames_from_video,
685
+ inputs=[
686
+ video_input, video_state
687
+ ],
688
+ outputs=[video_state, video_info, template_frame,
689
+ image_selection_slider, end_selection_slider, track_pause_number_slider, point_prompt, matting_type, clear_button_click, add_mask_button, matting_button, template_frame,
690
+ foreground_video_output, alpha_video_output, foreground_output_button, alpha_output_button, mask_dropdown, step2_title]
691
+ )
692
+
693
+ # second step: select images from slider
694
+ image_selection_slider.release(fn=select_video_template,
695
+ inputs=[image_selection_slider, video_state, interactive_state],
696
+ outputs=[template_frame, video_state, interactive_state], api_name="select_image")
697
+ track_pause_number_slider.release(fn=get_end_number,
698
+ inputs=[track_pause_number_slider, video_state, interactive_state],
699
+ outputs=[template_frame, interactive_state], api_name="end_image")
700
+
701
+ # click select image to get mask using sam
702
+ template_frame.select(
703
+ fn=sam_refine,
704
+ inputs=[video_state, point_prompt, click_state, interactive_state],
705
+ outputs=[template_frame, video_state, interactive_state]
706
+ )
707
+
708
+ # add different mask
709
+ add_mask_button.click(
710
+ fn=add_multi_mask,
711
+ inputs=[video_state, interactive_state, mask_dropdown],
712
+ outputs=[interactive_state, mask_dropdown, template_frame, click_state]
713
+ )
714
+
715
+ remove_mask_button.click(
716
+ fn=remove_multi_mask,
717
+ inputs=[interactive_state, mask_dropdown],
718
+ outputs=[interactive_state, mask_dropdown]
719
+ )
720
+
721
+ # video matting
722
+ matting_button.click(
723
+ fn=show_outputs,
724
+ inputs=[],
725
+ outputs=[foreground_video_output, alpha_video_output]).then(
726
+ fn=video_matting,
727
+ inputs=[video_state, end_selection_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size],
728
+ outputs=[foreground_video_output, alpha_video_output,foreground_video_output, alpha_video_output, export_to_vace_video_14B_btn, export_to_current_video_engine_btn]
729
+ )
730
+
731
+ # click to get mask
732
+ mask_dropdown.change(
733
+ fn=show_mask,
734
+ inputs=[video_state, interactive_state, mask_dropdown],
735
+ outputs=[template_frame]
736
+ )
737
+
738
+ # clear input
739
+ video_input.change(
740
+ fn=restart,
741
+ inputs=[],
742
+ outputs=[
743
+ video_state,
744
+ interactive_state,
745
+ click_state,
746
+ foreground_video_output, alpha_video_output,
747
+ template_frame,
748
+ image_selection_slider, end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
749
+ add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
750
+ ],
751
+ queue=False,
752
+ show_progress=False)
753
+
754
+ video_input.clear(
755
+ fn=restart,
756
+ inputs=[],
757
+ outputs=[
758
+ video_state,
759
+ interactive_state,
760
+ click_state,
761
+ foreground_video_output, alpha_video_output,
762
+ template_frame,
763
+ image_selection_slider , end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
764
+ add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
765
+ ],
766
+ queue=False,
767
+ show_progress=False)
768
+
769
+ # points clear
770
+ clear_button_click.click(
771
+ fn = clear_click,
772
+ inputs = [video_state, click_state,],
773
+ outputs = [template_frame,click_state],
774
+ )
775
+
776
+
777
+
778
+ with gr.TabItem("Image"):
779
+ click_state = gr.State([[],[]])
780
+
781
+ interactive_state = gr.State({
782
+ "inference_times": 0,
783
+ "negative_click_times" : 0,
784
+ "positive_click_times": 0,
785
+ "mask_save": False,
786
+ "multi_mask": {
787
+ "mask_names": [],
788
+ "masks": []
789
+ },
790
+ "track_end_number": None,
791
+ }
792
+ )
793
+
794
+ image_state = gr.State(
795
+ {
796
+ "user_name": "",
797
+ "image_name": "",
798
+ "origin_images": None,
799
+ "painted_images": None,
800
+ "masks": None,
801
+ "inpaint_masks": None,
802
+ "logits": None,
803
+ "select_frame_number": 0,
804
+ "fps": 30
805
+ }
806
+ )
807
+
808
+ with gr.Group(elem_classes="gr-monochrome-group", visible=True):
809
  with gr.Row():
810
+ with gr.Accordion('MatAnyone Settings (click to expand)', open=False):
811
+ with gr.Row():
812
+ erode_kernel_size = gr.Slider(label='Erode Kernel Size',
813
+ minimum=0,
814
+ maximum=30,
815
+ step=1,
816
+ value=10,
817
+ info="Erosion on the added mask",
818
+ interactive=True)
819
+ dilate_kernel_size = gr.Slider(label='Dilate Kernel Size',
820
+ minimum=0,
821
+ maximum=30,
822
+ step=1,
823
+ value=10,
824
+ info="Dilation on the added mask",
825
+ interactive=True)
826
+
827
+ with gr.Row():
828
+ image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Num of Refinement Iterations", info="More iterations → More details & More time", visible=False)
829
+ track_pause_number_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Track end frame", visible=False)
830
+ with gr.Row():
831
+ point_prompt = gr.Radio(
832
+ choices=["Positive", "Negative"],
833
+ value="Positive",
834
+ label="Point Prompt",
835
+ info="Click to add positive or negative point for target mask",
836
+ interactive=True,
837
+ visible=False,
838
+ min_width=100,
839
+ scale=1)
840
+ mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False)
841
+
842
+
843
+ with gr.Column():
844
+ # input image
845
+ with gr.Row(equal_height=True):
846
+ with gr.Column(scale=2):
847
+ gr.Markdown("## Step1: Upload image")
848
+ with gr.Column(scale=2):
849
+ step2_title = gr.Markdown("## Step2: Add masks <small>(Several clicks then **`Add Mask`** <u>one by one</u>)</small>", visible=False)
850
+ with gr.Row(equal_height=True):
851
+ with gr.Column(scale=2):
852
+ image_input = gr.Image(label="Input Image", elem_classes="image")
853
+ extract_frames_button = gr.Button(value="Load Image", interactive=True, elem_classes="new_button")
854
+ with gr.Column(scale=2):
855
+ image_info = gr.Textbox(label="Image Info", visible=False)
856
+ template_frame = gr.Image(type="pil", label="Start Frame", interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
857
+ with gr.Row(equal_height=True, elem_classes="mask_button_group"):
858
+ clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, elem_classes="new_button", min_width=100)
859
+ add_mask_button = gr.Button(value="Add Mask", interactive=True, visible=False, elem_classes="new_button", min_width=100)
860
+ remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, elem_classes="new_button", min_width=100)
861
+ matting_button = gr.Button(value="Image Matting", interactive=True, visible=False, elem_classes="green_button", min_width=100)
862
+
863
+ # output image
864
+ with gr.Row(equal_height=True):
865
+ foreground_image_output = gr.Image(type="pil", label="Foreground Output", visible=False, elem_classes="image")
866
  with gr.Row():
867
+ with gr.Row():
868
+ export_image_btn = gr.Button(value="Add to current Reference Images", visible=False, elem_classes="new_button")
869
+ with gr.Column(scale=2, visible= False):
870
+ alpha_image_output = gr.Image(type="pil", label="Alpha Output", visible=False, elem_classes="image")
 
 
871
  alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
 
 
 
872
 
873
+ export_image_btn.click( fn=export_image, inputs= [vace_image_refs, foreground_image_output], outputs= [vace_image_refs]).then( #video_prompt_video_guide_trigger,
874
+ fn=teleport_to_video_tab, inputs= [], outputs= [tabs])
875
+
876
+ # first step: get the image information
877
+ extract_frames_button.click(
878
+ fn=get_frames_from_image,
879
+ inputs=[
880
+ image_input, image_state
881
+ ],
882
+ outputs=[image_state, image_info, template_frame,
883
+ image_selection_slider, track_pause_number_slider,point_prompt, clear_button_click, add_mask_button, matting_button, template_frame,
884
+ foreground_image_output, alpha_image_output, export_image_btn, alpha_output_button, mask_dropdown, step2_title]
885
+ )
886
+
887
+ # second step: select images from slider
888
+ image_selection_slider.release(fn=select_image_template,
889
+ inputs=[image_selection_slider, image_state, interactive_state],
890
+ outputs=[template_frame, image_state, interactive_state], api_name="select_image")
891
+ track_pause_number_slider.release(fn=get_end_number,
892
+ inputs=[track_pause_number_slider, image_state, interactive_state],
893
+ outputs=[template_frame, interactive_state], api_name="end_image")
894
+
895
+ # click select image to get mask using sam
896
+ template_frame.select(
897
+ fn=sam_refine,
898
+ inputs=[image_state, point_prompt, click_state, interactive_state],
899
+ outputs=[template_frame, image_state, interactive_state]
900
+ )
901
+
902
+ # add different mask
903
+ add_mask_button.click(
904
+ fn=add_multi_mask,
905
+ inputs=[image_state, interactive_state, mask_dropdown],
906
+ outputs=[interactive_state, mask_dropdown, template_frame, click_state]
907
+ )
908
+
909
+ remove_mask_button.click(
910
+ fn=remove_multi_mask,
911
+ inputs=[interactive_state, mask_dropdown],
912
+ outputs=[interactive_state, mask_dropdown]
913
+ )
914
+
915
+ # image matting
916
+ matting_button.click(
917
+ fn=image_matting,
918
+ inputs=[image_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, image_selection_slider],
919
+ outputs=[foreground_image_output, export_image_btn]
920
+ )
921
 
 
 
preprocessing/matanyone/tutorial_multi_targets.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:39eaa5740d67e7fc97138c7d74cbcbaffd1f798b30d206c50eb19ba6f33adfb8
3
+ size 621144
preprocessing/matanyone/tutorial_single_target.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:397719759b1c3c10c1a15c8603ca8a4ee7889fd8f4e9896703575387e8118826
3
+ size 211460
wan/text2video.py CHANGED
@@ -111,7 +111,7 @@ class WanT2V:
111
 
112
  self.adapt_vace_model()
113
 
114
- def vace_encode_frames(self, frames, ref_images, masks=None, tile_size = 0, overlapped_latents = 0, overlap_noise = 0):
115
  if ref_images is None:
116
  ref_images = [None] * len(frames)
117
  else:
@@ -123,10 +123,10 @@ class WanT2V:
123
  inactive = [i * (1 - m) + 0 * m for i, m in zip(frames, masks)]
124
  reactive = [i * m + 0 * (1 - m) for i, m in zip(frames, masks)]
125
  inactive = self.vae.encode(inactive, tile_size = tile_size)
126
- # inactive = [ t * (1.0 - noise_factor) + torch.randn_like(t ) * noise_factor for t in inactive]
127
- # if overlapped_latents > 0:
128
- # for t in inactive:
129
- # t[:, :overlapped_latents ] = t[:, :overlapped_latents ] * (1.0 - noise_factor) + torch.randn_like(t[:, :overlapped_latents ] ) * noise_factor
130
 
131
  reactive = self.vae.encode(reactive, tile_size = tile_size)
132
  latents = [torch.cat((u, c), dim=0) for u, c in zip(inactive, reactive)]
@@ -190,13 +190,13 @@ class WanT2V:
190
  num_frames = total_frames - prepend_count
191
  if sub_src_mask is not None and sub_src_video is not None:
192
  src_video[i], src_mask[i], _, _, _ = self.vid_proc.load_video_pair(sub_src_video, sub_src_mask, max_frames= num_frames, trim_video = trim_video - prepend_count, start_frame = start_frame, canvas_height = canvas_height, canvas_width = canvas_width, fit_into_canvas = fit_into_canvas)
193
- # src_video is [-1, 1], 0 = inpainting area (in fact 127 in [0, 255])
194
- # src_mask is [-1, 1], 0 = preserve original video (in fact 127 in [0, 255]) and 1 = Inpainting (in fact 255 in [0, 255])
195
  src_video[i] = src_video[i].to(device)
196
  src_mask[i] = src_mask[i].to(device)
197
  if prepend_count > 0:
198
  src_video[i] = torch.cat( [sub_pre_src_video, src_video[i]], dim=1)
199
- src_mask[i] = torch.cat( [torch.zeros_like(sub_pre_src_video), src_mask[i]] ,1)
200
  src_video_shape = src_video[i].shape
201
  if src_video_shape[1] != total_frames:
202
  src_video[i] = torch.cat( [src_video[i], src_video[i].new_zeros(src_video_shape[0], total_frames -src_video_shape[1], *src_video_shape[-2:])], dim=1)
@@ -300,7 +300,8 @@ class WanT2V:
300
  slg_end = 1.0,
301
  cfg_star_switch = True,
302
  cfg_zero_step = 5,
303
- overlapped_latents = 0,
 
304
  overlap_noise = 0,
305
  model_filename = None,
306
  **bbargs
@@ -373,8 +374,10 @@ class WanT2V:
373
  input_frames = [u.to(self.device) for u in input_frames]
374
  input_ref_images = [ None if u == None else [v.to(self.device) for v in u] for u in input_ref_images]
375
  input_masks = [u.to(self.device) for u in input_masks]
376
-
377
- z0 = self.vace_encode_frames(input_frames, input_ref_images, masks=input_masks, tile_size = VAE_tile_size, overlapped_latents = overlapped_latents, overlap_noise = overlap_noise )
 
 
378
  m0 = self.vace_encode_masks(input_masks, input_ref_images)
379
  z = self.vace_latent(z0, m0)
380
 
@@ -442,8 +445,9 @@ class WanT2V:
442
  if vace:
443
  ref_images_count = len(input_ref_images[0]) if input_ref_images != None and input_ref_images[0] != None else 0
444
  kwargs.update({'vace_context' : z, 'vace_context_scale' : context_scale})
445
- if overlapped_latents > 0:
446
- z_reactive = [ zz[0:16, ref_images_count:overlapped_latents + ref_images_count].clone() for zz in z]
 
447
 
448
 
449
  if self.model.enable_teacache:
@@ -453,13 +457,14 @@ class WanT2V:
453
  if callback != None:
454
  callback(-1, None, True)
455
  for i, t in enumerate(tqdm(timesteps)):
456
- if vace and overlapped_latents > 0 :
457
- # noise_factor = overlap_noise *(i/(len(timesteps)-1)) / 1000
458
- noise_factor = overlap_noise / 1000 # * (999-t) / 999
459
- # noise_factor = overlap_noise / 1000 # * t / 999
460
- for zz, zz_r in zip(z, z_reactive):
461
- zz[0:16, ref_images_count:overlapped_latents + ref_images_count] = zz_r * (1.0 - noise_factor) + torch.randn_like(zz_r ) * noise_factor
462
-
 
463
  if target_camera != None:
464
  latent_model_input = torch.cat([latents, source_latents], dim=1)
465
  else:
@@ -552,6 +557,13 @@ class WanT2V:
552
 
553
  x0 = [latents]
554
 
555
  if input_frames == None:
556
  if phantom:
557
  # phantom post processing
@@ -560,11 +572,9 @@ class WanT2V:
560
  else:
561
  # vace post processing
562
  videos = self.decode_latent(x0, input_ref_images, VAE_tile_size)
563
-
564
- del latents
565
- del sample_scheduler
566
-
567
- return videos[0] if self.rank == 0 else None
568
 
569
  def adapt_vace_model(self):
570
  model = self.model
 
111
 
112
  self.adapt_vace_model()
113
 
114
+ def vace_encode_frames(self, frames, ref_images, masks=None, tile_size = 0, overlapped_latents = None):
115
  if ref_images is None:
116
  ref_images = [None] * len(frames)
117
  else:
 
123
  inactive = [i * (1 - m) + 0 * m for i, m in zip(frames, masks)]
124
  reactive = [i * m + 0 * (1 - m) for i, m in zip(frames, masks)]
125
  inactive = self.vae.encode(inactive, tile_size = tile_size)
126
+ self.toto = inactive[0].clone()
127
+ if overlapped_latents != None :
128
+ # inactive[0][:, 0:1] = self.vae.encode([frames[0][:, 0:1]], tile_size = tile_size)[0] # redundant
129
+ inactive[0][:, 1:overlapped_latents.shape[1] + 1] = overlapped_latents
130
 
131
  reactive = self.vae.encode(reactive, tile_size = tile_size)
132
  latents = [torch.cat((u, c), dim=0) for u, c in zip(inactive, reactive)]
 
190
  num_frames = total_frames - prepend_count
191
  if sub_src_mask is not None and sub_src_video is not None:
192
  src_video[i], src_mask[i], _, _, _ = self.vid_proc.load_video_pair(sub_src_video, sub_src_mask, max_frames= num_frames, trim_video = trim_video - prepend_count, start_frame = start_frame, canvas_height = canvas_height, canvas_width = canvas_width, fit_into_canvas = fit_into_canvas)
193
+ # src_video is [-1, 1] (at this function output), 0 = inpainting area (in fact 127 in [0, 255])
194
+ # src_mask is [-1, 1] (at this function output), 0 = preserve original video (in fact 127 in [0, 255]) and 1 = Inpainting (in fact 255 in [0, 255])
195
  src_video[i] = src_video[i].to(device)
196
  src_mask[i] = src_mask[i].to(device)
197
  if prepend_count > 0:
198
  src_video[i] = torch.cat( [sub_pre_src_video, src_video[i]], dim=1)
199
+ src_mask[i] = torch.cat( [torch.full_like(sub_pre_src_video, -1.0), src_mask[i]] ,1)
200
  src_video_shape = src_video[i].shape
201
  if src_video_shape[1] != total_frames:
202
  src_video[i] = torch.cat( [src_video[i], src_video[i].new_zeros(src_video_shape[0], total_frames -src_video_shape[1], *src_video_shape[-2:])], dim=1)
 
300
  slg_end = 1.0,
301
  cfg_star_switch = True,
302
  cfg_zero_step = 5,
303
+ overlapped_latents = None,
304
+ return_latent_slice = None,
305
  overlap_noise = 0,
306
  model_filename = None,
307
  **bbargs
 
374
  input_frames = [u.to(self.device) for u in input_frames]
375
  input_ref_images = [ None if u == None else [v.to(self.device) for v in u] for u in input_ref_images]
376
  input_masks = [u.to(self.device) for u in input_masks]
377
+ previous_latents = None
378
+ # if overlapped_latents != None:
379
+ # input_ref_images = [u[-1:] for u in input_ref_images]
380
+ z0 = self.vace_encode_frames(input_frames, input_ref_images, masks=input_masks, tile_size = VAE_tile_size, overlapped_latents = overlapped_latents )
381
  m0 = self.vace_encode_masks(input_masks, input_ref_images)
382
  z = self.vace_latent(z0, m0)
383
 
 
445
  if vace:
446
  ref_images_count = len(input_ref_images[0]) if input_ref_images != None and input_ref_images[0] != None else 0
447
  kwargs.update({'vace_context' : z, 'vace_context_scale' : context_scale})
448
+ if overlapped_latents != None:
449
+ overlapped_latents_size = overlapped_latents.shape[1] + 1
450
+ z_reactive = [ zz[0:16, 0:overlapped_latents_size + ref_images_count].clone() for zz in z]
451
 
452
 
453
  if self.model.enable_teacache:
 
457
  if callback != None:
458
  callback(-1, None, True)
459
  for i, t in enumerate(tqdm(timesteps)):
460
+ if overlapped_latents != None:
461
+ # overlap_noise_factor = overlap_noise *(i/(len(timesteps)-1)) / 1000
462
+ overlap_noise_factor = overlap_noise / 1000
463
+ latent_noise_factor = t / 1000
464
+ for zz, zz_r, ll in zip(z, z_reactive, [latents]):
465
+ pass
466
+ zz[0:16, ref_images_count:overlapped_latents_size + ref_images_count] = zz_r[:, ref_images_count:] * (1.0 - overlap_noise_factor) + torch.randn_like(zz_r[:, ref_images_count:] ) * overlap_noise_factor
467
+ ll[:, 0:overlapped_latents_size + ref_images_count] = zz_r * (1.0 - latent_noise_factor) + torch.randn_like(zz_r ) * latent_noise_factor
468
  if target_camera != None:
469
  latent_model_input = torch.cat([latents, source_latents], dim=1)
470
  else:
 
 
     x0 = [latents]
 
+    if return_latent_slice != None:
+        if overlapped_latents != None:
+            # latents [:, 1:] = self.toto
+            for zz, zz_r, ll in zip(z, z_reactive, [latents]):
+                ll[:, 0:overlapped_latents_size + ref_images_count] = zz_r
+
+        latent_slice = latents[:, return_latent_slice].clone()
     if input_frames == None:
         if phantom:
             # phantom post processing
 
     else:
         # vace post processing
         videos = self.decode_latent(x0, input_ref_images, VAE_tile_size)
+        if return_latent_slice != None:
+            return { "x" : videos[0], "latent_slice" : latent_slice }
+        return videos[0]
 
 
 def adapt_vace_model(self):
     model = self.model
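When `return_latent_slice` is set, `generate()` now returns a dict instead of a bare tensor, so a caller can hand the preserved latent slice back in as `overlapped_latents` for the next sliding window. A self-contained sketch of that calling pattern; `generate_window` is a stand-in stub, not the model's real signature:

```python
# Hedged sketch of chaining windows with the new dict return value.
import torch

def generate_window(prompt, overlapped_latents=None, return_latent_slice=None):
    frames = torch.zeros(3, 81, 64, 64)                      # placeholder decoded frames
    if return_latent_slice is not None:
        return {"x": frames, "latent_slice": torch.zeros(16, 2, 8, 8)}
    return frames

overlapped_latents, all_frames = None, []
for window_no in range(3):
    out = generate_window("a boat at sea", overlapped_latents=overlapped_latents,
                          return_latent_slice=slice(-3, -2))
    if isinstance(out, dict):                                # a latent slice was requested
        overlapped_latents = out["latent_slice"]             # seeds the next window's overlap
        out = out["x"]
    all_frames.append(out)
```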
wan/utils/utils.py CHANGED
@@ -91,11 +91,11 @@ def calculate_new_dimensions(canvas_height, canvas_width, height, width, fit_int
     return new_height, new_width
 
 def resize_and_remove_background(img_list, budget_width, budget_height, rm_background, fit_into_canvas = False ):
-    if rm_background:
+    if rm_background > 0:
         session = new_session()
 
     output_list =[]
-    for img in img_list:
+    for i, img in enumerate(img_list):
         width, height = img.size
 
         if fit_into_canvas:
@@ -113,9 +113,10 @@ def resize_and_remove_background(img_list, budget_width, budget_height, rm_backg
         new_height = int( round(height * scale / 16) * 16)
         new_width = int( round(width * scale / 16) * 16)
         resized_image= img.resize((new_width,new_height), resample=Image.Resampling.LANCZOS)
-        if rm_background:
-            resized_image = remove(resized_image, session=session, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
-        output_list.append(resized_image)
+        if rm_background == 1 or rm_background == 2 and i > 0 :
+            # resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1, alpha_matting_background_threshold = 70, alpha_foreground_background_threshold = 100, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
+            resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
+        output_list.append(resized_image) # alpha_matting_background_threshold = 30, alpha_foreground_background_threshold = 200,
     return output_list
 
 
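The `rm_background` flag is no longer a boolean: 0 keeps every background, 1 strips every background, and 2 keeps the first image (typically a landscape) while stripping the rest, since `rm_background == 1 or rm_background == 2 and i > 0` only skips index 0 in mode 2. A hedged usage sketch with placeholder file names:

```python
# Hedged usage sketch of the reworked helper; the image paths are placeholders.
from PIL import Image
from wan.utils.utils import resize_and_remove_background

refs = [Image.open("landscape.png"), Image.open("person.png"), Image.open("object.png")]
refs = resize_and_remove_background(refs, budget_width=832, budget_height=480,
                                    rm_background=2,          # keep first background, strip the others
                                    fit_into_canvas=False)
```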
wgp.py CHANGED
@@ -204,9 +204,6 @@ def process_prompt_and_add_tasks(state, model_choice):
 
     if isinstance(image_refs, list):
         image_refs = [ convert_image(tup[0]) for tup in image_refs ]
-        # os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
-        # from wan.utils.utils import resize_and_remove_background
-        # image_refs = resize_and_remove_background(image_refs, width, height, inputs["remove_background_image_ref"] ==1, fit_into_canvas= True)
 
 
     if len(prompts) > 0:
@@ -333,8 +330,10 @@ def process_prompt_and_add_tasks(state, model_choice):
     if "O" in video_prompt_type :
         keep_frames_video_guide= inputs["keep_frames_video_guide"]
         video_length = inputs["video_length"]
-        if len(keep_frames_video_guide) ==0:
-            gr.Info(f"Warning : you have asked to reuse all the frames of the control Video in the Alternate Video Ending it. Please make sure the number of frames of the control Video is lower than the total number of frames to generate otherwise it won't make a difference.")
+        if len(keep_frames_video_guide) > 0:
+            gr.Info("Keeping Frames with Extending Video is not yet supported")
+            return
+        # gr.Info(f"Warning : you have asked to reuse all the frames of the control Video in the Alternate Video Ending it. Please make sure the number of frames of the control Video is lower than the total number of frames to generate otherwise it won't make a difference.")
     # elif keep_frames >= video_length:
     #     gr.Info(f"The number of frames in the control Video to reuse ({keep_frames_video_guide}) in Alternate Video Ending can not be bigger than the total number of frames ({video_length}) to generate.")
     #     return
@@ -347,12 +346,7 @@ def process_prompt_and_add_tasks(state, model_choice):
             return
 
     if isinstance(image_refs, list):
-        image_refs = [ convert_image(tup[0]) for tup in image_refs ]
-
-        # os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
-        # from wan.utils.utils import resize_and_remove_background
-        # image_refs = resize_and_remove_background(image_refs, width, height, inputs["remove_background_image_ref"] ==1)
-
+        image_refs = [ convert_image(tup[0]) for tup in image_refs ]
 
     if len(prompts) > 0:
         prompts = ["\n".join(prompts)]
@@ -1464,7 +1458,6 @@ lock_ui_attention = False
 lock_ui_transformer = False
 lock_ui_compile = False
 
-preload =int(args.preload)
 force_profile_no = int(args.profile)
 verbose_level = int(args.verbose)
 quantizeTransformer = args.quantize_transformer
@@ -1482,15 +1475,19 @@ if os.path.isfile("t2v_settings.json"):
 if not os.path.isfile(server_config_filename) and os.path.isfile("gradio_config.json"):
     shutil.move("gradio_config.json", server_config_filename)
 
+if not os.path.isdir("ckpts/umt5-xxl/"):
+    os.makedirs("ckpts/umt5-xxl/")
 src_move = [ "ckpts/models_clip_open-clip-xlm-roberta-large-vit-huge-14-bf16.safetensors", "ckpts/models_t5_umt5-xxl-enc-bf16.safetensors", "ckpts/models_t5_umt5-xxl-enc-quanto_int8.safetensors" ]
 tgt_move = [ "ckpts/xlm-roberta-large/", "ckpts/umt5-xxl/", "ckpts/umt5-xxl/"]
 for src,tgt in zip(src_move,tgt_move):
     if os.path.isfile(src):
         try:
-            shutil.move(src, tgt)
+            if os.path.isfile(tgt):
+                shutil.remove(src)
+            else:
+                shutil.move(src, tgt)
         except:
             pass
-
 
 
 if not Path(server_config_filename).is_file():
@@ -1755,7 +1752,10 @@ def get_default_settings(filename):
             "flow_shift": 13,
             "resolution": "1280x720"
         })
-
+    elif get_model_type(filename) in ("vace_14B"):
+        ui_defaults.update({
+            "sliding_window_discard_last_frames": 0,
+        })
 
 
     with open(defaults_filename, "w", encoding="utf-8") as f:
@@ -2136,6 +2136,9 @@ def load_models(model_filename):
     global transformer_filename, transformer_loras_filenames
     model_family = get_model_family(model_filename)
     perc_reserved_mem_max = args.perc_reserved_mem_max
+    preload =int(args.preload)
+    if preload == 0:
+        preload = server_config.get("preload_in_VRAM", 0)
     new_transformer_loras_filenames = None
     dependent_models = get_dependent_models(model_filename, quantization= transformer_quantization, dtype_policy = transformer_dtype_policy)
     new_transformer_loras_filenames = [model_filename] if "_lora" in model_filename else None
@@ -2259,7 +2262,8 @@ def apply_changes( state,
                     preload_model_policy_choice = 1,
                     UI_theme_choice = "default",
                     enhancer_enabled_choice = 0,
-                    fit_canvas_choice = 0
+                    fit_canvas_choice = 0,
+                    preload_in_VRAM_choice = 0
 ):
     if args.lock_config:
         return
@@ -2284,6 +2288,7 @@ def apply_changes( state,
                    "UI_theme" : UI_theme_choice,
                    "fit_canvas": fit_canvas_choice,
                    "enhancer_enabled" : enhancer_enabled_choice,
+                   "preload_in_VRAM" : preload_in_VRAM_choice
     }
 
     if Path(server_config_filename).is_file():
@@ -2456,26 +2461,20 @@ def refresh_gallery(state): #, msg
         prompt = "<BR><DIV style='height:8px'></DIV>".join(prompts)
         if enhanced:
             prompt = "<U><B>Enhanced:</B></U><BR>" + prompt
-
+        list_uri = []
         start_img_uri = task.get('start_image_data_base64')
-        start_img_uri = start_img_uri[0] if start_img_uri !=None else None
+        if start_img_uri != None:
+            list_uri += start_img_uri
         end_img_uri = task.get('end_image_data_base64')
-        end_img_uri = end_img_uri[0] if end_img_uri !=None else None
+        if end_img_uri != None:
+            list_uri += end_img_uri
+
         thumbnail_size = "100px"
-        if start_img_uri:
-            start_img_md = f'<img src="{start_img_uri}" alt="Start" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" />'
-        if end_img_uri:
-            end_img_md = f'<img src="{end_img_uri}" alt="End" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" />'
+        thumbnails = ""
+        for img_uri in list_uri:
+            thumbnails += f'<TD><img src="{img_uri}" alt="Start" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" /></TD>'
 
-        label = f"Prompt of Video being Generated"
-
-        html = "<STYLE> #PINFO, #PINFO th, #PINFO td {border: 1px solid #CCCCCC;background-color:#FFFFFF;}</STYLE><TABLE WIDTH=100% ID=PINFO ><TR><TD width=100%>" + prompt + "</TD>"
-        if start_img_md != "":
-            html += "<TD>" + start_img_md + "</TD>"
-        if end_img_md != "":
-            html += "<TD>" + end_img_md + "</TD>"
-
-        html += "</TR></TABLE>"
+        html = "<STYLE> #PINFO, #PINFO th, #PINFO td {border: 1px solid #CCCCCC;background-color:#FFFFFF;}</STYLE><TABLE WIDTH=100% ID=PINFO ><TR><TD width=100%>" + prompt + "</TD>" + thumbnails + "</TR></TABLE>"
         html_output = gr.HTML(html, visible= True)
         return gr.Gallery(selected_index=choice, value = file_list), html_output, gr.Button(visible=False), gr.Button(visible=True), gr.Row(visible=True), update_queue_data(queue), gr.Button(interactive= abort_interactive), gr.Button(visible= onemorewindow_visible)
@@ -2680,7 +2679,7 @@ def generate_video(
     sliding_window_overlap,
     sliding_window_overlap_noise,
     sliding_window_discard_last_frames,
-    remove_background_image_ref,
+    remove_background_images_ref,
     temporal_upsampling,
     spatial_upsampling,
     RIFLEx_setting,
@@ -2816,13 +2815,14 @@ def generate_video(
        fps = 30
    else:
        fps = 16
+   latent_size = 8 if ltxv else 4
 
    original_image_refs = image_refs
    if image_refs != None and len(image_refs) > 0 and (hunyuan_custom or phantom or vace):
        send_cmd("progress", [0, get_latest_status(state, "Removing Images References Background")])
        os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
        from wan.utils.utils import resize_and_remove_background
-       image_refs = resize_and_remove_background(image_refs, width, height, remove_background_image_ref ==1, fit_into_canvas= not vace)
+       image_refs = resize_and_remove_background(image_refs, width, height, remove_background_images_ref, fit_into_canvas= not vace)
        update_task_thumbnails(task, locals())
        send_cmd("output")
 
@@ -2879,7 +2879,6 @@ def generate_video(
    repeat_no = 0
    extra_generation = 0
    initial_total_windows = 0
-   max_frames_to_generate = video_length
    if diffusion_forcing or vace or ltxv:
        reuse_frames = min(sliding_window_size - 4, sliding_window_overlap)
    else:
@@ -2888,8 +2887,9 @@ def generate_video(
        video_length += sliding_window_overlap
    sliding_window = (vace or diffusion_forcing or ltxv) and video_length > sliding_window_size
 
+   discard_last_frames = sliding_window_discard_last_frames
+   default_max_frames_to_generate = video_length
    if sliding_window:
-       discard_last_frames = sliding_window_discard_last_frames
        left_after_first_window = video_length - sliding_window_size + discard_last_frames
        initial_total_windows= 1 + math.ceil(left_after_first_window / (sliding_window_size - discard_last_frames - reuse_frames))
        video_length = sliding_window_size
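The window count formula above can be checked with a small worked example using the UI defaults (window size 81, overlap 5, discard 8). This is a hedged illustration of the arithmetic only; the numbers are assumptions, not values read from a config file.

```python
# Hedged worked example of the sliding-window count computed above.
import math

video_length = 161                    # requested frames (assumed)
sliding_window_size = 81
sliding_window_overlap = 5
sliding_window_discard_last_frames = 8

reuse_frames = min(sliding_window_size - 4, sliding_window_overlap)                    # 5
discard_last_frames = sliding_window_discard_last_frames                               # 8
left_after_first_window = video_length - sliding_window_size + discard_last_frames    # 88
initial_total_windows = 1 + math.ceil(
    left_after_first_window / (sliding_window_size - discard_last_frames - reuse_frames))
print(initial_total_windows)          # 1 + ceil(88 / 68) = 3 windows
```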
@@ -2913,6 +2913,7 @@ def generate_video(
    prefix_video_frames_count = 0
    frames_already_processed = None
    pre_video_guide = None
+   overlapped_latents = None
    window_no = 0
    extra_windows = 0
    guide_start_frame = 0
@@ -2920,6 +2921,8 @@ def generate_video(
    gen["extra_windows"] = 0
    gen["total_windows"] = 1
    gen["window_no"] = 1
+   num_frames_generated = 0
+   max_frames_to_generate = default_max_frames_to_generate
    start_time = time.time()
    if prompt_enhancer_image_caption_model != None and prompt_enhancer !=None and len(prompt_enhancer)>0:
        text_encoder_max_tokens = 256
@@ -2955,38 +2958,50 @@ def generate_video(
    while not abort:
        if sliding_window:
            prompt = prompts[window_no] if window_no < len(prompts) else prompts[-1]
-           extra_windows += gen.get("extra_windows",0)
-           if extra_windows > 0:
-               video_length = sliding_window_size
+           new_extra_windows = gen.get("extra_windows",0)
            gen["extra_windows"] = 0
+           extra_windows += new_extra_windows
+           max_frames_to_generate += new_extra_windows * (sliding_window_size - discard_last_frames - reuse_frames)
+           sliding_window = sliding_window or extra_windows > 0
+           if sliding_window and window_no > 0:
+               num_frames_generated -= reuse_frames
+               if (max_frames_to_generate - prefix_video_frames_count - num_frames_generated) < latent_size:
+                   break
+               video_length = min(sliding_window_size, ((max_frames_to_generate - num_frames_generated - prefix_video_frames_count + reuse_frames + discard_last_frames) // latent_size) * latent_size + 1 )
+
            total_windows = initial_total_windows + extra_windows
            gen["total_windows"] = total_windows
            if window_no >= total_windows:
                break
        window_no += 1
        gen["window_no"] = window_no
-
+       return_latent_slice = None
+       if reuse_frames > 0:
+           return_latent_slice = slice(-(reuse_frames - 1 + discard_last_frames ) // latent_size, None if discard_last_frames == 0 else -(discard_last_frames // latent_size) )
+
        if hunyuan_custom:
            src_ref_images = image_refs
        elif phantom:
            src_ref_images = image_refs.copy() if image_refs != None else None
-       elif diffusion_forcing or ltxv:
+       elif diffusion_forcing or ltxv or vace and "O" in video_prompt_type:
+           if vace:
+               video_source = video_guide
+               video_guide = None
            if video_source != None and len(video_source) > 0 and window_no == 1:
                keep_frames_video_source= 1000 if len(keep_frames_video_source) ==0 else int(keep_frames_video_source)
+               keep_frames_video_source = (keep_frames_video_source // latent_size ) * latent_size + 1
                prefix_video = preprocess_video(None, width=width, height=height,video_in=video_source, max_frames= keep_frames_video_source , start_frame = 0, fit_canvas= fit_canvas, target_fps = fps, block_size = 32 if ltxv else 16)
                prefix_video = prefix_video .permute(3, 0, 1, 2)
                prefix_video = prefix_video .float().div_(127.5).sub_(1.) # c, f, h, w
-               prefix_video_frames_count = prefix_video.shape[1]
                pre_video_guide = prefix_video[:, -reuse_frames:]
-
-       elif vace:
-           # video_prompt_type = video_prompt_type +"G"
+               prefix_video_frames_count = pre_video_guide.shape[1]
+               if vace:
+                   height, width = pre_video_guide.shape[-2:]
+       if vace:
            image_refs_copy = image_refs.copy() if image_refs != None else None # required since prepare_source do inplace modifications
            video_guide_copy = video_guide
            video_mask_copy = video_mask
            if any(process in video_prompt_type for process in ("P", "D", "G")) :
-               prompts_max = gen["prompts_max"]
-
                preprocess_type = None
                if "P" in video_prompt_type :
                    progress_args = [0, get_latest_status(state,"Extracting Open Pose Information")]
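Two of the new expressions above are easy to misread, so here is a hedged, self-contained walk-through of what they evaluate to for the same assumed settings as before (latent_size 4 for Wan, reuse 5, discard 8, window size 81, 161 requested frames). The numbers are illustrative only.

```python
# Hedged walk-through of the new sliding-window bookkeeping.
latent_size = 4
reuse_frames, discard_last_frames = 5, 8
sliding_window_size, max_frames_to_generate = 81, 161
prefix_video_frames_count = 0

# Latent slice that generate() hands back so the next window can reuse the overlap:
return_latent_slice = slice(-(reuse_frames - 1 + discard_last_frames) // latent_size,
                            None if discard_last_frames == 0 else -(discard_last_frames // latent_size))
print(return_latent_slice)            # slice(-3, -2, None)

# After window 1 produced 81 frames and the last 8 were discarded:
num_frames_generated = sliding_window_size - discard_last_frames    # 73
num_frames_generated -= reuse_frames                                 # 68: the overlap is re-generated
remaining = max_frames_to_generate - prefix_video_frames_count - num_frames_generated
video_length = min(sliding_window_size,
                   ((remaining + reuse_frames + discard_last_frames) // latent_size) * latent_size + 1)
print(video_length)                    # 81: window 2 still runs at full size
```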
@@ -3005,8 +3020,11 @@ def generate_video(
            if len(error) > 0:
                raise gr.Error(f"invalid keep frames {keep_frames_video_guide}")
            keep_frames_parsed = keep_frames_parsed[guide_start_frame: guide_start_frame + video_length]
+
            if window_no == 1:
-               image_size = (height, width) # VACE_SIZE_CONFIGS[resolution_reformated] # default frame dimensions until it is set by video_src (if there is any)
+               image_size = (height, width) # default frame dimensions until it is set by video_src (if there is any)
+
+
            src_video, src_mask, src_ref_images = wan_model.prepare_source([video_guide_copy],
                                                                           [video_mask_copy ],
                                                                           [image_refs_copy],
@@ -3017,29 +3035,24 @@ def generate_video(
                                                                           pre_src_video = [pre_video_guide],
                                                                           fit_into_canvas = fit_canvas
                                                                           )
-           # if window_no == 1 and src_video != None and len(src_video) > 0:
-           #     image_size = src_video[0].shape[-2:]
-           prompts_max = gen["prompts_max"]
        status = get_latest_status(state)
-
-
        gen["progress_status"] = status
        gen["progress_phase"] = ("Encoding Prompt", -1 )
        callback = build_callback(state, trans, send_cmd, status, num_inference_steps)
        progress_args = [0, merge_status_context(status, "Encoding Prompt")]
        send_cmd("progress", progress_args)
 
+       if trans.enable_teacache:
+           trans.teacache_counter = 0
+           trans.num_steps = num_inference_steps
+           trans.teacache_skipped_steps = 0
+           trans.previous_residual = None
+           trans.previous_modulated_input = None
+
        # samples = torch.empty( (1,2)) #for testing
        # if False:
 
        try:
-           if trans.enable_teacache:
-               trans.teacache_counter = 0
-               trans.num_steps = num_inference_steps
-               trans.teacache_skipped_steps = 0
-               trans.previous_residual = None
-               trans.previous_modulated_input = None
-
            samples = wan_model.generate(
                input_prompt = prompt,
                image_start = image_start,
@@ -3049,7 +3062,7 @@ def generate_video(
                input_masks = src_mask,
                input_video= pre_video_guide if diffusion_forcing or ltxv else source_video,
                target_camera= target_camera,
-               frame_num=(video_length // 4)* 4 + 1,
+               frame_num=(video_length // latent_size)* latent_size + 1,
                height = height,
                width = width,
                fit_into_canvas = fit_canvas == 1,
@@ -3076,7 +3089,8 @@ def generate_video(
                causal_block_size = 5,
                causal_attention = True,
                fps = fps,
-               overlapped_latents = 0 if reuse_frames == 0 or window_no == 1 else ((reuse_frames - 1) // 4 + 1),
+               overlapped_latents = overlapped_latents,
+               return_latent_slice= return_latent_slice,
                overlap_noise = sliding_window_overlap_noise,
                model_filename = model_filename,
            )
@@ -3109,6 +3123,7 @@ def generate_video(
            tb = traceback.format_exc().split('\n')[:-1]
            print('\n'.join(tb))
            send_cmd("error", new_error)
+           clear_status(state)
            return
        finally:
            trans.previous_residual = None
@@ -3118,33 +3133,42 @@ def generate_video(
                print(f"Teacache Skipped Steps:{trans.teacache_skipped_steps}/{trans.num_steps}" )
 
            if samples != None:
+               if isinstance(samples, dict):
+                   overlapped_latents = samples.get("latent_slice", None)
+                   samples= samples["x"]
                samples = samples.to("cpu")
                offload.last_offload_obj.unload_all()
                gc.collect()
                torch.cuda.empty_cache()
 
+           # time_flag = datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d-%Hh%Mm%Ss")
+           # save_prompt = "_in_" + original_prompts[0]
+           # file_name = f"{time_flag}_seed{seed}_{sanitize_file_name(save_prompt[:50]).strip()}.mp4"
+           # sample = samples.cpu()
+           # cache_video( tensor=sample[None].clone(), save_file=os.path.join(save_path, file_name), fps=16, nrow=1, normalize=True, value_range=(-1, 1))
+
            if samples == None:
                abort = True
                state["prompt"] = ""
                send_cmd("output")
            else:
                sample = samples.cpu()
-               if True: # for testing
-                   torch.save(sample, "output.pt")
-               else:
-                   sample =torch.load("output.pt")
-
+               # if True: # for testing
+               #     torch.save(sample, "output.pt")
+               # else:
+               #     sample =torch.load("output.pt")
+               if gen.get("extra_windows",0) > 0:
+                   sliding_window = True
                if sliding_window :
                    guide_start_frame += video_length
                    if discard_last_frames > 0:
                        sample = sample[: , :-discard_last_frames]
                        guide_start_frame -= discard_last_frames
                    if reuse_frames == 0:
-                       pre_video_guide = sample[:,9999 :]
+                       pre_video_guide = sample[:,9999 :].clone()
                    else:
-                       # noise_factor = 200/ 1000
-                       # pre_video_guide = sample[:, -reuse_frames:] * (1.0 - noise_factor) + torch.randn_like(sample[:, -reuse_frames:] ) * noise_factor
-                       pre_video_guide = sample[:, -reuse_frames:]
+                       pre_video_guide = sample[:, -reuse_frames:].clone()
+                   num_frames_generated += sample.shape[1]
 
 
                if prefix_video != None:
@@ -3158,7 +3182,6 @@ def generate_video(
                        sample = sample[: , :]
                    else:
                        sample = sample[: , reuse_frames:]
-
                    guide_start_frame -= reuse_frames
 
                exp = 0
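The pixel-space bookkeeping above (trim the discarded tail, keep the last `reuse_frames` frames as the next guide, drop the overlapped head before concatenation) can be summarised in a small standalone helper. This is a hedged, simplified sketch, not the function used by wgp.py.

```python
# Hedged sketch of stitching consecutive windows in pixel space.
import torch

def stitch(previous, sample: torch.Tensor, reuse_frames: int, discard_last_frames: int):
    """`sample` is the (C, F, H, W) tensor a window just produced."""
    if discard_last_frames > 0:
        sample = sample[:, :-discard_last_frames]         # drop the possibly blurry tail
    pre_video_guide = sample[:, -reuse_frames:].clone()   # seeds the next window
    if previous is not None:
        sample = sample[:, reuse_frames:]                 # the overlap was already emitted
        merged = torch.cat([previous, sample], dim=1)
    else:
        merged = sample
    return merged, pre_video_guide
```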
@@ -3252,15 +3275,9 @@ def generate_video(
                print(f"New video saved to Path: "+video_path)
                file_list.append(video_path)
                send_cmd("output")
-               if sliding_window :
-                   if max_frames_to_generate > 0 and extra_windows == 0:
-                       current_length = sample.shape[1]
-                       if (current_length - prefix_video_frames_count)>= max_frames_to_generate:
-                           break
-                       video_length = min(sliding_window_size, ((max_frames_to_generate - (current_length - prefix_video_frames_count) + reuse_frames + discard_last_frames) // 4) * 4 + 1 )
 
        seed += 1
-
+    clear_status(state)
    if temp_filename!= None and os.path.isfile(temp_filename):
        os.remove(temp_filename)
    offload.unload_loras_from_model(trans)
@@ -3630,7 +3647,16 @@ def merge_status_context(status="", context=""):
        return status
    else:
        return status + " - " + context
-
+
+def clear_status(state):
+    gen = get_gen_info(state)
+    gen["extra_windows"] = 0
+    gen["total_windows"] = 1
+    gen["window_no"] = 1
+    gen["extra_orders"] = 0
+    gen["repeat_no"] = 0
+    gen["total_generation"] = 0
+
def get_latest_status(state, context=""):
    gen = get_gen_info(state)
    prompt_no = gen["prompt_no"]
@@ -3999,7 +4025,7 @@ def prepare_inputs_dict(target, inputs ):
    inputs.pop("model_mode")
 
    if not "Vace" in model_filename or not "phantom" in model_filename or not "hunyuan_video_custom" in model_filename:
-       unsaved_params = ["keep_frames_video_guide", "video_prompt_type", "remove_background_image_ref"]
+       unsaved_params = ["keep_frames_video_guide", "video_prompt_type", "remove_background_images_ref"]
        for k in unsaved_params:
            inputs.pop(k)
 
@@ -4066,7 +4092,7 @@ def save_inputs(
            sliding_window_overlap,
            sliding_window_overlap_noise,
            sliding_window_discard_last_frames,
-           remove_background_image_ref,
+           remove_background_images_ref,
            temporal_upsampling,
            spatial_upsampling,
            RIFLEx_setting,
@@ -4458,7 +4484,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                        ("Transfer Human Motion from the Control Video", "PV"),
                        ("Transfer Depth from the Control Video", "DV"),
                        ("Recolorize the Control Video", "CV"),
-                       # ("Alternate Video Ending", "OV"),
+                       ("Extend Video", "OV"),
                        ("Video contains Open Pose, Depth, Black & White, Inpainting ", "V"),
                        ("Control Video and Mask video for Inpainting ", "MV"),
                    ],
@@ -4489,7 +4515,17 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                    )
 
                # with gr.Row():
-               remove_background_image_ref = gr.Checkbox(value=ui_defaults.get("remove_background_image_ref",1), label= "Remove Background of Images References", visible= "I" in video_prompt_type_value, scale =1 )
+               remove_background_images_ref = gr.Dropdown(
+                   choices=[
+                       ("Keep Backgrounds of All Images (landscape)", 0),
+                       ("Remove Backgrounds of All Images (objects / faces)", 1),
+                       ("Keep it for first Image (landscape) and remove it for other Images (objects / faces)", 2),
+                   ],
+                   value=ui_defaults.get("remove_background_images_ref",1),
+                   label="Remove Background of Images References", scale = 3, visible= "I" in video_prompt_type_value
+               )
+
+               # remove_background_images_ref = gr.Checkbox(value=ui_defaults.get("remove_background_images_ref",1), label= "Remove Background of Images References", visible= "I" in video_prompt_type_value, scale =1 )
 
 
                video_mask = gr.Video(label= "Video Mask (for Inpainting or Outpaing, white pixels = Mask)", visible= "M" in video_prompt_type_value, value= ui_defaults.get("video_mask", None))
@@ -4730,7 +4766,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                    else:
                        sliding_window_size = gr.Slider(5, 137, value=ui_defaults.get("sliding_window_size", 81), step=4, label="Sliding Window Size")
                        sliding_window_overlap = gr.Slider(1, 97, value=ui_defaults.get("sliding_window_overlap",5), step=4, label="Windows Frames Overlap (needed to maintain continuity between windows, a higher value will require more windows)")
-                       sliding_window_overlap_noise = gr.Slider(0, 100, value=ui_defaults.get("sliding_window_overlap_noise",20), step=1, label="Noise to be added to overlapped frames to reduce blur effect")
+                       sliding_window_overlap_noise = gr.Slider(0, 150, value=ui_defaults.get("sliding_window_overlap_noise",20), step=1, label="Noise to be added to overlapped frames to reduce blur effect")
                        sliding_window_discard_last_frames = gr.Slider(0, 20, value=ui_defaults.get("sliding_window_discard_last_frames", 8), step=4, label="Discard Last Frames of a Window (that may have bad quality)", visible = True)
@@ -4811,7 +4847,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
 
        image_prompt_type.change(fn=refresh_image_prompt_type, inputs=[state, image_prompt_type], outputs=[image_start, image_end, video_source, keep_frames_video_source] )
        video_prompt_video_guide_trigger.change(fn=refresh_video_prompt_video_guide_trigger, inputs=[video_prompt_type, video_prompt_video_guide_trigger], outputs=[video_prompt_type, video_prompt_type_video_guide, video_guide, video_mask, keep_frames_video_guide])
-       video_prompt_type_image_refs.input(fn=refresh_video_prompt_type_image_refs, inputs = [video_prompt_type, video_prompt_type_image_refs], outputs = [video_prompt_type, image_refs, remove_background_image_ref ])
+       video_prompt_type_image_refs.input(fn=refresh_video_prompt_type_image_refs, inputs = [video_prompt_type, video_prompt_type_image_refs], outputs = [video_prompt_type, image_refs, remove_background_images_ref ])
        video_prompt_type_video_guide.input(fn=refresh_video_prompt_type_video_guide, inputs = [video_prompt_type, video_prompt_type_video_guide], outputs = [video_prompt_type, video_guide, keep_frames_video_guide, video_mask])
 
        show_advanced.change(fn=switch_advanced, inputs=[state, show_advanced, lset_name], outputs=[advanced_row, preset_buttons_rows, refresh_lora_btn, refresh2_row ,lset_name ]).then(
@@ -5036,7 +5072,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
    )
 
    return ( state, loras_choices, lset_name, state,
-            video_guide, video_mask, video_prompt_video_guide_trigger, prompt_enhancer
+            video_guide, video_mask, image_refs, video_prompt_video_guide_trigger, prompt_enhancer
    )
@@ -5250,6 +5286,7 @@ def generate_configuration_tab(state, blocks, header, model_choice, prompt_enhan
                value= profile,
                label="Profile (for power users only, not needed to change it)"
            )
+           preload_in_VRAM_choice = gr.Slider(0, 40000, value=server_config.get("preload_in_VRAM", 0), step=100, label="Number of MB of Models that are Preloaded in VRAM (0 will use Profile default)")
@@ -5277,7 +5314,8 @@ def generate_configuration_tab(state, blocks, header, model_choice, prompt_enhan
                preload_model_policy_choice,
                UI_theme_choice,
                enhancer_enabled_choice,
-               fit_canvas_choice
+               fit_canvas_choice,
+               preload_in_VRAM_choice
            ],
            outputs= [msg , header, model_choice, prompt_enhancer_row]
        )
@@ -5661,7 +5699,7 @@ def create_demo():
    theme = gr.themes.Soft(font=["Verdana"], primary_hue="sky", neutral_hue="slate", text_size="md")
 
    with gr.Blocks(css=css, theme=theme, title= "WanGP") as main:
-       gr.Markdown("<div align=center><H1>Wan<SUP>GP</SUP> v5.2 <FONT SIZE=4>by <I>DeepBeepMeep</I></FONT> <FONT SIZE=3>") # (<A HREF='https://github.com/deepbeepmeep/Wan2GP'>Updates</A>)</FONT SIZE=3></H1></div>")
+       gr.Markdown("<div align=center><H1>Wan<SUP>GP</SUP> v5.21 <FONT SIZE=4>by <I>DeepBeepMeep</I></FONT> <FONT SIZE=3>") # (<A HREF='https://github.com/deepbeepmeep/Wan2GP'>Updates</A>)</FONT SIZE=3></H1></div>")
        global model_list
 
        tab_state = gr.State({ "tab_no":0 })
@@ -5680,7 +5718,7 @@ def create_demo():
            header = gr.Markdown(generate_header(transformer_filename, compile, attention_mode), visible= True)
            with gr.Row():
                ( state, loras_choices, lset_name, state,
-                 video_guide, video_mask, video_prompt_type_video_trigger, prompt_enhancer_row
+                 video_guide, video_mask, image_refs, video_prompt_type_video_trigger, prompt_enhancer_row
                ) = generate_video_tab(model_choice=model_choice, header=header, main = main)
            with gr.Tab("Informations", id="info"):
                generate_info_tab()
@@ -5688,7 +5726,7 @@ def create_demo():
            from preprocessing.matanyone import app as matanyone_app
            vmc_event_handler = matanyone_app.get_vmc_event_handler()
 
-           matanyone_app.display(main_tabs, model_choice, video_guide, video_mask, video_prompt_type_video_trigger)
+           matanyone_app.display(main_tabs, model_choice, video_guide, video_mask, image_refs, video_prompt_type_video_trigger)
            if not args.lock_config:
                with gr.Tab("Downloads", id="downloads") as downloads_tab:
                    generate_download_tab(lset_name, loras_choices, state)
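One of the visible additions in this commit is the re-enabled "Extend Video" (OV) mode for Vace: the uploaded control video is reused as the source to extend, and the reused frame count is snapped to the latent grid before the first window runs. A hedged, self-contained sketch of that path; all values below are assumptions for illustration, not defaults read from the app.

```python
# Hedged sketch of the "Extend Video" (OV) path as wired above in wgp.py.
vace = True
video_prompt_type = "OV"
video_guide, video_source = "control.mp4", None    # placeholder file name
latent_size = 4                                     # 8 for LTX Video
keep_frames_video_source = 1000                     # "use everything" when the field is empty

if vace and "O" in video_prompt_type:
    # The control video becomes the source to extend; no guide is passed to VACE.
    video_source, video_guide = video_guide, None
    # Snap the reused frame count to the latent grid (4n + 1 frames for Wan).
    keep_frames_video_source = (keep_frames_video_source // latent_size) * latent_size + 1

print(video_source, keep_frames_video_source)       # control.mp4 1001
```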