DeepBeepMeep committed on
Commit
30d0c66
·
1 Parent(s): 03085c8

Vace improvements

README.md CHANGED
@@ -21,6 +21,7 @@ WanGP supports the Wan (and derived models), Hunyuan Video and LTV Video models
21
 
22
 
23
  ## 🔥 Latest News!!
 
24
  * May 20 2025: 👋 Wan 2.1GP v5.2 : Added support for Wan CausVid which is a distilled Wan model that can generate nice looking videos in only 4 to 12 steps.
25
  The great thing is that Kijai (Kudos to him !) has created a CausVid Lora that can be combined with any existing Wan t2v model 14B like Wan Vace 14B.
26
  See instructions below on how to use CausVid.\
@@ -307,17 +308,20 @@ You can define multiple lines of macros. If there is only one macro line, the ap
307
 
308
  ### VACE ControlNet introduction
309
 
310
- Vace is a ControlNet 1.3B text2video model that allows you to do Video to Video and Reference to Video (inject your own images into the output video). So with Vace you can inject in the scene people or objects of your choice, animate a person, perform inpainting or outpainting, continue a video, ...
311
 
312
- First you need to select the Vace 1.3B model in the Drop Down box at the top. Please note that Vace works well for the moment only with videos up to 5s (81 frames).
313
 
314
 Besides the usual Text Prompt, three new types of visual hints can be provided (and combined!):
315
- - a Control Video: Based on your choice, you can decide to transfer the motion, the depth in a new Video. You can tell WanGP to use only the first n frames of Control Video and to extrapolate the rest. You can also do inpainting ). If the video contains area of grey color 127, they will be considered as masks and will be filled based on the Text prompt of the reference Images.
 
316
 
317
- - reference Images: Use this to inject people or objects of your choice in the video. You can select multiple reference Images. The integration of the image is more efficient if the background is replaced by the full white color. You can do that with your preferred background remover or use the built in background remover by checking the box *Remove background*
 
 
 
 
318
 
319
- - a Video Mask
320
- This offers a stronger mechanism to tell Vace which parts should be kept (black) or replaced (white). You can do as well inpainting / outpainting, fill the missing part of a video more efficientlty with just the video hint. If a video mask is white, it will be generated so with black frames at the beginning and at the end and the rest white, you could generate the missing frames in between.
321
 
322
 
323
  Examples:
@@ -336,13 +340,29 @@ There is also a guide that describes the various combination of hints (https://g
336
  It seems you will get better results with Vace if you turn on "Skip Layer Guidance" with its default configuration.
337
 
338
 Other recommended settings for Vace:
339
- - Use a long prompt description especially for the people / objects that are in the background and not in reference images. This will ensure consistency between the windows.
340
 - Set a medium-size overlap window: long enough to give the model a sense of the motion, but short enough that any overlapped blurred frames do not turn the rest of the video into a blurred video
341
 - Truncate at least the last 4 frames of each generated window, as Vace's last frames tend to be blurry
342
 
343
 
344
- ### VACE and Sky Reels v2 Diffusion Forcing Slidig Window
345
- With this mode (that works for the moment only with Vace and Sky Reels v2) you can merge mutiple Videos to form a very long video (up to 1 min).
346
 
347
 When combined with Vace this feature can use the same control video to generate the full video that results from concatenating the different windows. For instance the first 0-4s of the control video will be used to generate the first window, then the next 4-8s of the control video will be used to generate the second window, and so on. So if your control video contains a person walking, your generated video could contain up to one minute of this person walking.
348
 
@@ -352,12 +372,16 @@ Sliding Windows are turned on by default and are triggered as soon as you try to
352
 
353
  Although the window duration is set by the *Sliding Window Size* form field, the actual number of frames generated by each iteration will be less, because of the *overlap frames* and *discard last frames*:
354
 - *overlap frames* : the first frames of a new window are filled with the last frames of the previous window in order to ensure continuity between the two windows
355
- - *discard last frames* : quite often (Vace model Only) the last frames of a window have a worse quality. You can decide here how many ending frames of a new window should be dropped.
356
- s
 
 
 
357
  Number of Generated Frames = [Number of Windows - 1] * ([Window Size] - [Overlap Frames] - [Discard Last Frames]) + [Window Size]
358
 
359
  Experimental: if your prompt is broken into multiple lines (each line separated by a carriage return), then each line of the prompt will be used for a new window. If there are more windows to generate than prompt lines, the last prompt line will be repeated.
360
 
 
361
  ### Command line parameters for Gradio Server
362
  --i2v : launch the image to video generator\
363
  --t2v : launch the text to video generator (default defined in the configuration)\
 
21
 
22
 
23
  ## 🔥 Latest News!!
24
+ * May 23 2025: 👋 Wan 2.1GP v5.21 : Improvements for Vace: better transitions between Sliding Windows, support for Image masks in Matanyone, new Extend Video for Vace, and different types of automated background removal
25
  * May 20 2025: 👋 Wan 2.1GP v5.2 : Added support for Wan CausVid which is a distilled Wan model that can generate nice looking videos in only 4 to 12 steps.
26
  The great thing is that Kijai (Kudos to him !) has created a CausVid Lora that can be combined with any existing Wan t2v model 14B like Wan Vace 14B.
27
  See instructions below on how to use CausVid.\
 
308
 
309
  ### VACE ControlNet introduction
310
 
311
+ Vace is a ControlNet that allows you to do Video to Video and Reference to Video (inject your own images into the output video). It is probably one of the most powerful Wan models, and you will be able to do amazing things once you master it: inject people or objects of your choice into the scene, animate a person, perform inpainting or outpainting, continue a video, ...
312
 
313
+ First you need to select the Vace 1.3B model or the Vace 14B model in the drop-down box at the top. Please note that, for the moment, Vace works well only with videos up to 7s, with the Riflex option turned on.
314
 
315
 Besides the usual Text Prompt, three new types of visual hints can be provided (and combined!):
316
+ - *a Control Video*\
317
+ Based on your choice, you can decide to transfer the motion or the depth of this video to a new video. You can tell WanGP to use only the first n frames of the Control Video and to extrapolate the rest. You can also do inpainting: if the video contains areas of the grey color 127, they will be treated as masks and will be filled based on the text prompt and the Reference Images (see the sketch after this list).
318
 
319
+ - *Reference Images*\
320
+ A reference Image can be either a background that you want to use as the setting for the video, or people or objects of your choice that you want to inject in the video. You can select multiple reference Images. The integration of an object / person image is more efficient if its background is replaced by full white. For complex background removal you can use the Image version of the Matanyone tool embedded in WanGP, or you can use the fast on-the-fly background remover by selecting an option in the *Remove background* drop-down box. Be careful not to remove the background of a reference image that is a landscape or setting (always the first reference image) that you want to use as a start image / background for the video. It helps greatly to reference and describe explicitly the injected objects / people of the Reference Images in the text prompt.
321
+
322
+ - *a Video Mask*\
323
+ This offers a stronger mechanism to tell Vace which parts should be kept (black) or replaced (white). You can also do inpainting / outpainting and fill the missing parts of a video more efficiently than with the Control Video hint alone. For instance, if a video mask is white except at the beginning and at the end where it is black, the first and last frames will be kept and everything in between will be generated.
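
To make the colour conventions above concrete, here is a minimal sketch (not WanGP code) that prepares one control-video frame with a grey 127 area to regenerate, the matching black / white mask frame, and a reference image flattened onto a white background. It assumes numpy and Pillow; the file names, resolution and coordinates are purely illustrative.

```python
import numpy as np
from PIL import Image

W, H = 832, 480                               # illustrative resolution
x0, y0, x1, y1 = 300, 100, 520, 360           # illustrative region to regenerate

# Control video frame: keep the original pixels, paint the area to fill in grey 127.
control = np.array(Image.open("frame_0001.png").convert("RGB").resize((W, H)))
control[y0:y1, x0:x1] = 127

# Video mask frame: black = keep the original video, white = replace / generate.
mask = np.zeros((H, W, 3), dtype=np.uint8)
mask[y0:y1, x0:x1] = 255

# Reference image: an injected person / object integrates better on a plain
# white background (the *Remove background* option can also do this for you).
ref = Image.open("person.png").convert("RGBA")
white = Image.new("RGBA", ref.size, (255, 255, 255, 255))
ref_on_white = Image.alpha_composite(white, ref).convert("RGB")

Image.fromarray(control).save("control_0001.png")
Image.fromarray(mask).save("mask_0001.png")
ref_on_white.save("reference_white_bg.png")
```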
324
 
 
 
325
 
326
 
327
  Examples:
 
340
  It seems you will get better results with Vace if you turn on "Skip Layer Guidance" with its default configuration.
341
 
342
 Other recommended settings for Vace:
343
+ - Use a long prompt description, especially for the people / objects that are in the background and not in the Reference Images. This will ensure consistency between the windows.
344
 - Set a medium-size overlap window: long enough to give the model a sense of the motion, but short enough that any overlapped blurred frames do not turn the rest of the video into a blurred video
345
 - Truncate at least the last 4 frames of each generated window, as Vace's last frames tend to be blurry
346
 
347
+ **WanGP integrates the Matanyone tool, which is tuned to work with Vace**.
348
+
349
+ This can be very useful to create a control video and a matching mask video at the same time.\
350
+ For example, if you want to replace a face of a person in a video:
351
+ - load the video in the Matanyone tool
352
+ - click the face on the first frame and create a mask for it (if you have trouble selecting only the face, check the tips below)
353
+ - generate both the control video and the mask video by clicking *Generate Video Matting*
354
+ - Click *Export to current Video Input and Video Mask*
355
+ - In the *Reference Image* field of the Vace screen, load a picture of the replacement face
356
+
357
+ Please note that sometimes it may be useful to create *Background Masks*, for instance if you want to replace everything except a character that is in the video. You can do that by selecting *Background Mask* in the *Matanyone settings*.
358
+
359
+ If you have some trouble creating the perfect mask, be aware of these tips:
360
+ - Using the Matanyone Settings you can also define Negative Point Prompts to remove parts of the current selection.
361
+ - Sometimes it is very hard to fit everything you want in a single mask; it may be much easier to combine multiple independent sub masks before producing the matting: each sub mask is created by selecting an area of the image and clicking the *Add Mask* button. Sub masks can then be enabled / disabled in the Matanyone settings (a rough sketch of how they are combined follows this list).
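
For the curious, merging the enabled sub masks into one template mask looks roughly like the numpy snippet below. This is a simplified reading of the image_matting code added in this commit: each sub mask is a binary array, gets its own integer label, and overlapping pixels end up with the later mask's label. The sizes and regions are illustrative.

```python
import numpy as np

H, W = 480, 832                                # illustrative frame size

# Two binary (0/1) sub masks such as those created with "Add Mask".
mask_001 = np.zeros((H, W), dtype=np.uint8)
mask_001[100:300, 50:400] = 1
mask_002 = np.zeros((H, W), dtype=np.uint8)
mask_002[250:450, 350:700] = 1

# Combine them into a single labelled template mask (label 1, label 2, ...).
template_mask = mask_001 * 1
for label, sub_mask in [(2, mask_002)]:
    template_mask = np.clip(template_mask + sub_mask * label, 0, label)
```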
362
+
363
 
364
+ ### VACE, Sky Reels v2 Diffusion Forcing Sliding Window and LTX Video
365
+ With this mode (which works for the moment only with Vace, Sky Reels v2 and LTX Video) you can merge multiple videos to form a very long video (up to 1 min).
366
 
367
 When combined with Vace this feature can use the same control video to generate the full video that results from concatenating the different windows. For instance the first 0-4s of the control video will be used to generate the first window, then the next 4-8s of the control video will be used to generate the second window, and so on. So if your control video contains a person walking, your generated video could contain up to one minute of this person walking.
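
A rough illustration of that per-window mapping, assuming 16 fps and 4 s windows, and ignoring the overlap and discarded frames discussed below (all values are illustrative):

```python
fps = 16                 # assumed frame rate
window_seconds = 4       # assumed window duration
total_seconds = 60       # target length of the merged video

frames_per_window = window_seconds * fps
for w in range(total_seconds // window_seconds):
    start = w * frames_per_window            # control frames for 0-4 s, 4-8 s, ...
    end = start + frames_per_window
    print(f"window {w + 1}: control-video frames [{start}, {end})")
```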
368
 
 
372
 
373
  Although the window duration is set by the *Sliding Window Size* form field, the actual number of frames generated by each iteration will be less, because of the *overlap frames* and *discard last frames*:
374
 - *overlap frames* : the first frames of a new window are filled with the last frames of the previous window in order to ensure continuity between the two windows
375
+ - *discard last frames* : sometimes (Vace 1.3B model only) the last frames of a window have a lower quality. You can decide here how many ending frames of each new window should be dropped.
376
+
377
+ There is some inevitable quality degradation over time due to accumulated calculation errors. One trick to reduce or hide it is to add some noise (usually not noticeable) to the overlapped frames using the *add overlapped noise* option.
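
Under the hood, this option appears to mix a small amount of Gaussian noise into the overlapped latents, roughly as in the sketch below (simplified from the wan/text2video.py changes in this commit; the tensor shape is illustrative and the real code operates on the VAE latents of the overlapped frames):

```python
import torch

overlap_noise = 20                             # value of the *add overlapped noise* option
noise_factor = overlap_noise / 1000            # same scaling as in the commit

# Latent frames shared with the previous window (illustrative shape).
overlapped_latents = torch.randn(16, 4, 60, 104)
noisy_overlap = (overlapped_latents * (1.0 - noise_factor)
                 + torch.randn_like(overlapped_latents) * noise_factor)
```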
378
+
379
+
380
  Number of Generated Frames = [Number of Windows - 1] * ([Window Size] - [Overlap Frames] - [Discard Last Frames]) + [Window Size]
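
For example, with purely illustrative values the formula gives:

```python
window_size = 81           # frames generated per window
overlap_frames = 8
discard_last_frames = 4
num_windows = 5

generated_frames = (num_windows - 1) * (window_size - overlap_frames - discard_last_frames) + window_size
print(generated_frames)    # (5 - 1) * (81 - 8 - 4) + 81 = 357
```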
381
 
382
  Experimental: if your prompt is broken into multiple lines (each line separated by a carriage return), then each line of the prompt will be used for a new window. If there are more windows to generate than prompt lines, the last prompt line will be repeated.
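
A small sketch of that prompt-per-window behaviour (the last line is reused once there are more windows than prompt lines; the prompt text is illustrative):

```python
prompt = """A man walks through a forest.
He reaches a clearing at sunset.
He sits down next to a campfire."""

lines = [line for line in prompt.splitlines() if line.strip()]
num_windows = 5
window_prompts = [lines[min(i, len(lines) - 1)] for i in range(num_windows)]
# windows 4 and 5 both reuse "He sits down next to a campfire."
```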
383
 
384
+
385
  ### Command line parameters for Gradio Server
386
  --i2v : launch the image to video generator\
387
  --t2v : launch the text to video generator (default defined in the configuration)\
ltx_video/pipelines/pipeline_ltx_video.py CHANGED
@@ -1502,7 +1502,7 @@ class LTXVideoPipeline(DiffusionPipeline):
1502
  extra_conditioning_mask.append(conditioning_mask)
1503
 
1504
  # Patchify the updated latents and calculate their pixel coordinates
1505
- init_latents, init_latent_coords = self.patchifier.patchify(
1506
  latents=init_latents
1507
  )
1508
  init_pixel_coords = latent_to_pixel_coords(
 
1502
  extra_conditioning_mask.append(conditioning_mask)
1503
 
1504
  # Patchify the updated latents and calculate their pixel coordinates
1505
+ init_latents, init_latent_coords = self.patchifier.patchify(
1506
  latents=init_latents
1507
  )
1508
  init_pixel_coords = latent_to_pixel_coords(
preprocessing/matanyone/app.py CHANGED
@@ -85,7 +85,7 @@ def get_frames_from_image(image_input, image_state):
85
  model.samcontroler.sam_controler.reset_image()
86
  model.samcontroler.sam_controler.set_image(image_state["origin_images"][0])
87
  return image_state, image_info, image_state["origin_images"][0], \
88
- gr.update(visible=True, maximum=10, value=10), gr.update(visible=True, maximum=len(frames), value=len(frames)), gr.update(visible=False, maximum=len(frames), value=len(frames)), \
89
  gr.update(visible=True), gr.update(visible=True), \
90
  gr.update(visible=True), gr.update(visible=True),\
91
  gr.update(visible=True), gr.update(visible=True), \
@@ -273,6 +273,57 @@ def save_video(frames, output_path, fps):
273
 
274
  return output_path
275
 
276
  # video matting
277
  def video_matting(video_state, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
278
  matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
@@ -397,7 +448,7 @@ def restart():
397
  "inference_times": 0,
398
  "negative_click_times" : 0,
399
  "positive_click_times": 0,
400
- "mask_save": arg_mask_save,
401
  "multi_mask": {
402
  "mask_names": [],
403
  "masks": []
@@ -457,6 +508,15 @@ def export_to_vace_video_input(foreground_video_output):
457
  gr.Info("Masked Video Input transferred to Vace For Inpainting")
458
  return "V#" + str(time.time()), foreground_video_output
459
 
 
460
  def export_to_current_video_engine(foreground_video_output, alpha_video_output):
461
  gr.Info("Masked Video Input and Full Mask transferred to Current Video Engine For Inpainting")
462
  # return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
@@ -471,15 +531,18 @@ def teleport_to_vace_1_3B():
471
  def teleport_to_vace_14B():
472
  return gr.Tabs(selected="video_gen"), gr.Dropdown(value="vace_14B")
473
 
474
- def display(tabs, model_choice, vace_video_input, vace_video_mask, video_prompt_video_guide_trigger):
475
  # my_tab.select(fn=load_unload_models, inputs=[], outputs=[])
476
 
477
  media_url = "https://github.com/pq-yang/MatAnyone/releases/download/media/"
478
 
479
  # download assets
480
 
481
- gr.Markdown("Mast Edition is provided by MatAnyone")
482
-
 
 
 
483
  with gr.Column( visible=True):
484
  with gr.Row():
485
  with gr.Accordion("Video Tutorial (click to expand)", open=False, elem_classes="custom-bg"):
@@ -493,216 +556,368 @@ def display(tabs, model_choice, vace_video_input, vace_video_mask, video_prompt_
493
  gr.Video(value="preprocessing/matanyone/tutorial_multi_targets.mp4", elem_classes="video")
494
 
495
 
496
- click_state = gr.State([[],[]])
497
-
498
- interactive_state = gr.State({
499
- "inference_times": 0,
500
- "negative_click_times" : 0,
501
- "positive_click_times": 0,
502
- "mask_save": arg_mask_save,
503
- "multi_mask": {
504
- "mask_names": [],
505
- "masks": []
506
- },
507
- "track_end_number": None,
508
- }
509
- )
510
-
511
- video_state = gr.State(
512
- {
513
- "user_name": "",
514
- "video_name": "",
515
- "origin_images": None,
516
- "painted_images": None,
517
- "masks": None,
518
- "inpaint_masks": None,
519
- "logits": None,
520
- "select_frame_number": 0,
521
- "fps": 16,
522
- "audio": "",
523
- }
524
- )
525
-
526
- with gr.Column( visible=True):
527
- with gr.Row():
528
- with gr.Accordion('MatAnyone Settings (click to expand)', open=False):
529
- with gr.Row():
530
- erode_kernel_size = gr.Slider(label='Erode Kernel Size',
531
- minimum=0,
532
- maximum=30,
533
- step=1,
534
- value=10,
535
- info="Erosion on the added mask",
536
- interactive=True)
537
- dilate_kernel_size = gr.Slider(label='Dilate Kernel Size',
538
- minimum=0,
539
- maximum=30,
540
- step=1,
541
- value=10,
542
- info="Dilation on the added mask",
543
- interactive=True)
544
-
545
- with gr.Row():
546
- image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Start Frame", info="Choose the start frame for target assignment and video matting", visible=False)
547
- end_selection_slider = gr.Slider(minimum=1, maximum=300, step=1, value=81, label="Last Frame to Process", info="Last Frame to Process", visible=False)
548
 
549
- track_pause_number_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="End frame", visible=False)
 
550
  with gr.Row():
551
- point_prompt = gr.Radio(
552
- choices=["Positive", "Negative"],
553
- value="Positive",
554
- label="Point Prompt",
555
- info="Click to add positive or negative point for target mask",
556
- interactive=True,
557
- visible=False,
558
- min_width=100,
559
- scale=1)
560
- matting_type = gr.Radio(
561
- choices=["Foreground", "Background"],
562
- value="Foreground",
563
- label="Matting Type",
564
- info="Type of Video Matting to Generate",
565
- interactive=True,
566
- visible=False,
567
- min_width=100,
568
- scale=1)
569
- mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False, scale=2)
570
-
571
- gr.Markdown("---")
572
-
573
- with gr.Column():
574
- # input video
575
- with gr.Row(equal_height=True):
576
- with gr.Column(scale=2):
577
- gr.Markdown("## Step1: Upload video")
578
- with gr.Column(scale=2):
579
- step2_title = gr.Markdown("## Step2: Add masks <small>(Several clicks then **`Add Mask`** <u>one by one</u>)</small>", visible=False)
580
- with gr.Row(equal_height=True):
581
- with gr.Column(scale=2):
582
- video_input = gr.Video(label="Input Video", elem_classes="video")
583
- extract_frames_button = gr.Button(value="Load Video", interactive=True, elem_classes="new_button")
584
- with gr.Column(scale=2):
585
- video_info = gr.Textbox(label="Video Info", visible=False)
586
- template_frame = gr.Image(label="Start Frame", type="pil",interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
 
587
  with gr.Row():
588
- clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, min_width=100)
589
- add_mask_button = gr.Button(value="Set Mask", interactive=True, visible=False, min_width=100)
590
- remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, min_width=100) # no use
591
- matting_button = gr.Button(value="Generate Video Matting", interactive=True, visible=False, min_width=100)
 
592
  with gr.Row():
593
- gr.Markdown("")
594
-
595
- # output video
596
- with gr.Column() as output_row: #equal_height=True
597
- with gr.Row():
598
- with gr.Column(scale=2):
599
- foreground_video_output = gr.Video(label="Masked Video Output", visible=False, elem_classes="video")
600
- foreground_output_button = gr.Button(value="Black & White Video Output", visible=False, elem_classes="new_button")
601
- with gr.Column(scale=2):
602
- alpha_video_output = gr.Video(label="B & W Mask Video Output", visible=False, elem_classes="video")
603
  alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
604
- with gr.Row():
605
- with gr.Row(visible= False):
606
- export_to_vace_video_14B_btn = gr.Button("Export to current Video Input Video For Inpainting", visible= False)
607
- with gr.Row(visible= True):
608
- export_to_current_video_engine_btn = gr.Button("Export to current Video Input and Video Mask", visible= False)
609
-
610
- export_to_vace_video_14B_btn.click( fn=teleport_to_vace_14B, inputs=[], outputs=[tabs, model_choice]).then(
611
- fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [video_prompt_video_guide_trigger, vace_video_input, vace_video_mask])
612
-
613
- export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [vace_video_input, vace_video_mask]).then( #video_prompt_video_guide_trigger,
614
- fn=teleport_to_video_tab, inputs= [], outputs= [tabs])
615
-
616
- # first step: get the video information
617
- extract_frames_button.click(
618
- fn=get_frames_from_video,
619
- inputs=[
620
- video_input, video_state
621
- ],
622
- outputs=[video_state, video_info, template_frame,
623
- image_selection_slider, end_selection_slider, track_pause_number_slider, point_prompt, matting_type, clear_button_click, add_mask_button, matting_button, template_frame,
624
- foreground_video_output, alpha_video_output, foreground_output_button, alpha_output_button, mask_dropdown, step2_title]
625
- )
626
-
627
- # second step: select images from slider
628
- image_selection_slider.release(fn=select_video_template,
629
- inputs=[image_selection_slider, video_state, interactive_state],
630
- outputs=[template_frame, video_state, interactive_state], api_name="select_image")
631
- track_pause_number_slider.release(fn=get_end_number,
632
- inputs=[track_pause_number_slider, video_state, interactive_state],
633
- outputs=[template_frame, interactive_state], api_name="end_image")
634
-
635
- # click select image to get mask using sam
636
- template_frame.select(
637
- fn=sam_refine,
638
- inputs=[video_state, point_prompt, click_state, interactive_state],
639
- outputs=[template_frame, video_state, interactive_state]
640
- )
641
 
642
- # add different mask
643
- add_mask_button.click(
644
- fn=add_multi_mask,
645
- inputs=[video_state, interactive_state, mask_dropdown],
646
- outputs=[interactive_state, mask_dropdown, template_frame, click_state]
647
- )
 
648
 
649
- remove_mask_button.click(
650
- fn=remove_multi_mask,
651
- inputs=[interactive_state, mask_dropdown],
652
- outputs=[interactive_state, mask_dropdown]
653
- )
654
 
655
- # video matting
656
- matting_button.click(
657
- fn=show_outputs,
658
- inputs=[],
659
- outputs=[foreground_video_output, alpha_video_output]).then(
660
- fn=video_matting,
661
- inputs=[video_state, end_selection_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size],
662
- outputs=[foreground_video_output, alpha_video_output,foreground_video_output, alpha_video_output, export_to_vace_video_14B_btn, export_to_current_video_engine_btn]
663
- )
664
 
665
- # click to get mask
666
- mask_dropdown.change(
667
- fn=show_mask,
668
- inputs=[video_state, interactive_state, mask_dropdown],
669
- outputs=[template_frame]
670
- )
671
-
672
- # clear input
673
- video_input.change(
674
- fn=restart,
675
- inputs=[],
676
- outputs=[
677
- video_state,
678
- interactive_state,
679
- click_state,
680
- foreground_video_output, alpha_video_output,
681
- template_frame,
682
- image_selection_slider, end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
683
- add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
684
- ],
685
- queue=False,
686
- show_progress=False)
687
-
688
- video_input.clear(
689
- fn=restart,
690
- inputs=[],
691
- outputs=[
692
- video_state,
693
- interactive_state,
694
- click_state,
695
- foreground_video_output, alpha_video_output,
696
- template_frame,
697
- image_selection_slider , end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
698
- add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
699
- ],
700
- queue=False,
701
- show_progress=False)
702
-
703
- # points clear
704
- clear_button_click.click(
705
- fn = clear_click,
706
- inputs = [video_state, click_state,],
707
- outputs = [template_frame,click_state],
708
- )
 
85
  model.samcontroler.sam_controler.reset_image()
86
  model.samcontroler.sam_controler.set_image(image_state["origin_images"][0])
87
  return image_state, image_info, image_state["origin_images"][0], \
88
+ gr.update(visible=True, maximum=10, value=10), gr.update(visible=False, maximum=len(frames), value=len(frames)), \
89
  gr.update(visible=True), gr.update(visible=True), \
90
  gr.update(visible=True), gr.update(visible=True),\
91
  gr.update(visible=True), gr.update(visible=True), \
 
273
 
274
  return output_path
275
 
276
+ # image matting
277
+ def image_matting(video_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, refine_iter):
278
+ matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
279
+ if interactive_state["track_end_number"]:
280
+ following_frames = video_state["origin_images"][video_state["select_frame_number"]:interactive_state["track_end_number"]]
281
+ else:
282
+ following_frames = video_state["origin_images"][video_state["select_frame_number"]:]
283
+
284
+ if interactive_state["multi_mask"]["masks"]:
285
+ if len(mask_dropdown) == 0:
286
+ mask_dropdown = ["mask_001"]
287
+ mask_dropdown.sort()
288
+ template_mask = interactive_state["multi_mask"]["masks"][int(mask_dropdown[0].split("_")[1]) - 1] * (int(mask_dropdown[0].split("_")[1]))
289
+ for i in range(1,len(mask_dropdown)):
290
+ mask_number = int(mask_dropdown[i].split("_")[1]) - 1
291
+ template_mask = np.clip(template_mask+interactive_state["multi_mask"]["masks"][mask_number]*(mask_number+1), 0, mask_number+1)
292
+ video_state["masks"][video_state["select_frame_number"]]= template_mask
293
+ else:
294
+ template_mask = video_state["masks"][video_state["select_frame_number"]]
295
+
296
+ # operation error
297
+ if len(np.unique(template_mask))==1:
298
+ template_mask[0][0]=1
299
+ foreground, alpha = matanyone(matanyone_processor, following_frames, template_mask*255, r_erode=erode_kernel_size, r_dilate=dilate_kernel_size, n_warmup=refine_iter)
300
+
301
+
302
+ foreground_mat = False
303
+
304
+ output_frames = []
305
+ for frame_origin, frame_alpha in zip(following_frames, alpha):
306
+ if foreground_mat:
307
+ frame_alpha[frame_alpha > 127] = 255
308
+ frame_alpha[frame_alpha <= 127] = 0
309
+ else:
310
+ frame_temp = frame_alpha.copy()
311
+ frame_alpha[frame_temp > 127] = 0
312
+ frame_alpha[frame_temp <= 127] = 255
313
+
314
+
315
+ output_frame = np.bitwise_and(frame_origin, 255-frame_alpha)
316
+ frame_grey = frame_alpha.copy()
317
+ frame_grey[frame_alpha == 255] = 255
318
+ output_frame += frame_grey
319
+ output_frames.append(output_frame)
320
+ foreground = output_frames
321
+
322
+ foreground_output = Image.fromarray(foreground[-1])
323
+ alpha_output = Image.fromarray(alpha[-1][:,:,0])
324
+
325
+ return foreground_output, gr.update(visible=True)
326
+
327
  # video matting
328
  def video_matting(video_state, end_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size):
329
  matanyone_processor = InferenceCore(matanyone_model, cfg=matanyone_model.cfg)
 
448
  "inference_times": 0,
449
  "negative_click_times" : 0,
450
  "positive_click_times": 0,
451
+ "mask_save": False,
452
  "multi_mask": {
453
  "mask_names": [],
454
  "masks": []
 
508
  gr.Info("Masked Video Input transferred to Vace For Inpainting")
509
  return "V#" + str(time.time()), foreground_video_output
510
 
511
+
512
+ def export_image(image_refs, image_output):
513
+ gr.Info("Masked Image transferred to Current Video")
514
+ # return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
515
+ if image_refs == None:
516
+ image_refs =[]
517
+ image_refs.append( image_output)
518
+ return image_refs
519
+
520
  def export_to_current_video_engine(foreground_video_output, alpha_video_output):
521
  gr.Info("Masked Video Input and Full Mask transferred to Current Video Engine For Inpainting")
522
  # return "MV#" + str(time.time()), foreground_video_output, alpha_video_output
 
531
  def teleport_to_vace_14B():
532
  return gr.Tabs(selected="video_gen"), gr.Dropdown(value="vace_14B")
533
 
534
+ def display(tabs, model_choice, vace_video_input, vace_video_mask, vace_image_refs, video_prompt_video_guide_trigger):
535
  # my_tab.select(fn=load_unload_models, inputs=[], outputs=[])
536
 
537
  media_url = "https://github.com/pq-yang/MatAnyone/releases/download/media/"
538
 
539
  # download assets
540
 
541
+ gr.Markdown("<B>Mast Edition is provided by MatAnyone</B>")
542
+ gr.Markdown("If you have some trouble creating the perfect mask, be aware of these tips:")
543
+ gr.Markdown("- Using the Matanyone Settings you can also define Negative Point Prompts to remove parts of the current selection.")
544
+ gr.Markdown("- Sometime it is very hard to fit everything you want in a single mask, it may be much easier to combine multiple independent sub Masks before producing the Matting : each sub Mask is created by selecting an area of an image and by clicking the Add Mask button. Sub masks can then be enabled / disabled in the Matanyone settings.")
545
+
546
  with gr.Column( visible=True):
547
  with gr.Row():
548
  with gr.Accordion("Video Tutorial (click to expand)", open=False, elem_classes="custom-bg"):
 
556
  gr.Video(value="preprocessing/matanyone/tutorial_multi_targets.mp4", elem_classes="video")
557
 
558
 
559
+
 
 
560
 
561
+ with gr.Tabs():
562
+ with gr.TabItem("Video"):
563
+
564
+ click_state = gr.State([[],[]])
565
+
566
+ interactive_state = gr.State({
567
+ "inference_times": 0,
568
+ "negative_click_times" : 0,
569
+ "positive_click_times": 0,
570
+ "mask_save": arg_mask_save,
571
+ "multi_mask": {
572
+ "mask_names": [],
573
+ "masks": []
574
+ },
575
+ "track_end_number": None,
576
+ }
577
+ )
578
+
579
+ video_state = gr.State(
580
+ {
581
+ "user_name": "",
582
+ "video_name": "",
583
+ "origin_images": None,
584
+ "painted_images": None,
585
+ "masks": None,
586
+ "inpaint_masks": None,
587
+ "logits": None,
588
+ "select_frame_number": 0,
589
+ "fps": 16,
590
+ "audio": "",
591
+ }
592
+ )
593
+
594
+ with gr.Column( visible=True):
595
  with gr.Row():
596
+ with gr.Accordion('MatAnyone Settings (click to expand)', open=False):
597
+ with gr.Row():
598
+ erode_kernel_size = gr.Slider(label='Erode Kernel Size',
599
+ minimum=0,
600
+ maximum=30,
601
+ step=1,
602
+ value=10,
603
+ info="Erosion on the added mask",
604
+ interactive=True)
605
+ dilate_kernel_size = gr.Slider(label='Dilate Kernel Size',
606
+ minimum=0,
607
+ maximum=30,
608
+ step=1,
609
+ value=10,
610
+ info="Dilation on the added mask",
611
+ interactive=True)
612
+
613
+ with gr.Row():
614
+ image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Start Frame", info="Choose the start frame for target assignment and video matting", visible=False)
615
+ end_selection_slider = gr.Slider(minimum=1, maximum=300, step=1, value=81, label="Last Frame to Process", info="Last Frame to Process", visible=False)
616
+
617
+ track_pause_number_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="End frame", visible=False)
618
+ with gr.Row():
619
+ point_prompt = gr.Radio(
620
+ choices=["Positive", "Negative"],
621
+ value="Positive",
622
+ label="Point Prompt",
623
+ info="Click to add positive or negative point for target mask",
624
+ interactive=True,
625
+ visible=False,
626
+ min_width=100,
627
+ scale=1)
628
+ matting_type = gr.Radio(
629
+ choices=["Foreground", "Background"],
630
+ value="Foreground",
631
+ label="Matting Type",
632
+ info="Type of Video Matting to Generate",
633
+ interactive=True,
634
+ visible=False,
635
+ min_width=100,
636
+ scale=1)
637
+ mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False, scale=2)
638
+
639
+ # input video
640
+ with gr.Row(equal_height=True):
641
+ with gr.Column(scale=2):
642
+ gr.Markdown("## Step1: Upload video")
643
+ with gr.Column(scale=2):
644
+ step2_title = gr.Markdown("## Step2: Add masks <small>(Several clicks then **`Add Mask`** <u>one by one</u>)</small>", visible=False)
645
+ with gr.Row(equal_height=True):
646
+ with gr.Column(scale=2):
647
+ video_input = gr.Video(label="Input Video", elem_classes="video")
648
+ extract_frames_button = gr.Button(value="Load Video", interactive=True, elem_classes="new_button")
649
+ with gr.Column(scale=2):
650
+ video_info = gr.Textbox(label="Video Info", visible=False)
651
+ template_frame = gr.Image(label="Start Frame", type="pil",interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
652
+ with gr.Row():
653
+ clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, min_width=100)
654
+ add_mask_button = gr.Button(value="Set Mask", interactive=True, visible=False, min_width=100)
655
+ remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, min_width=100) # no use
656
+ matting_button = gr.Button(value="Generate Video Matting", interactive=True, visible=False, min_width=100)
657
+ with gr.Row():
658
+ gr.Markdown("")
659
+
660
+ # output video
661
+ with gr.Column() as output_row: #equal_height=True
662
+ with gr.Row():
663
+ with gr.Column(scale=2):
664
+ foreground_video_output = gr.Video(label="Masked Video Output", visible=False, elem_classes="video")
665
+ foreground_output_button = gr.Button(value="Black & White Video Output", visible=False, elem_classes="new_button")
666
+ with gr.Column(scale=2):
667
+ alpha_video_output = gr.Video(label="B & W Mask Video Output", visible=False, elem_classes="video")
668
+ alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
669
+ with gr.Row():
670
+ with gr.Row(visible= False):
671
+ export_to_vace_video_14B_btn = gr.Button("Export to current Video Input Video For Inpainting", visible= False)
672
+ with gr.Row(visible= True):
673
+ export_to_current_video_engine_btn = gr.Button("Export to current Video Input and Video Mask", visible= False)
674
+
675
+ export_to_vace_video_14B_btn.click( fn=teleport_to_vace_14B, inputs=[], outputs=[tabs, model_choice]).then(
676
+ fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [video_prompt_video_guide_trigger, vace_video_input, vace_video_mask])
677
+
678
+ export_to_current_video_engine_btn.click( fn=export_to_current_video_engine, inputs= [foreground_video_output, alpha_video_output], outputs= [vace_video_input, vace_video_mask]).then( #video_prompt_video_guide_trigger,
679
+ fn=teleport_to_video_tab, inputs= [], outputs= [tabs])
680
+
681
+
682
+ # first step: get the video information
683
+ extract_frames_button.click(
684
+ fn=get_frames_from_video,
685
+ inputs=[
686
+ video_input, video_state
687
+ ],
688
+ outputs=[video_state, video_info, template_frame,
689
+ image_selection_slider, end_selection_slider, track_pause_number_slider, point_prompt, matting_type, clear_button_click, add_mask_button, matting_button, template_frame,
690
+ foreground_video_output, alpha_video_output, foreground_output_button, alpha_output_button, mask_dropdown, step2_title]
691
+ )
692
+
693
+ # second step: select images from slider
694
+ image_selection_slider.release(fn=select_video_template,
695
+ inputs=[image_selection_slider, video_state, interactive_state],
696
+ outputs=[template_frame, video_state, interactive_state], api_name="select_image")
697
+ track_pause_number_slider.release(fn=get_end_number,
698
+ inputs=[track_pause_number_slider, video_state, interactive_state],
699
+ outputs=[template_frame, interactive_state], api_name="end_image")
700
+
701
+ # click select image to get mask using sam
702
+ template_frame.select(
703
+ fn=sam_refine,
704
+ inputs=[video_state, point_prompt, click_state, interactive_state],
705
+ outputs=[template_frame, video_state, interactive_state]
706
+ )
707
+
708
+ # add different mask
709
+ add_mask_button.click(
710
+ fn=add_multi_mask,
711
+ inputs=[video_state, interactive_state, mask_dropdown],
712
+ outputs=[interactive_state, mask_dropdown, template_frame, click_state]
713
+ )
714
+
715
+ remove_mask_button.click(
716
+ fn=remove_multi_mask,
717
+ inputs=[interactive_state, mask_dropdown],
718
+ outputs=[interactive_state, mask_dropdown]
719
+ )
720
+
721
+ # video matting
722
+ matting_button.click(
723
+ fn=show_outputs,
724
+ inputs=[],
725
+ outputs=[foreground_video_output, alpha_video_output]).then(
726
+ fn=video_matting,
727
+ inputs=[video_state, end_selection_slider, matting_type, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size],
728
+ outputs=[foreground_video_output, alpha_video_output,foreground_video_output, alpha_video_output, export_to_vace_video_14B_btn, export_to_current_video_engine_btn]
729
+ )
730
+
731
+ # click to get mask
732
+ mask_dropdown.change(
733
+ fn=show_mask,
734
+ inputs=[video_state, interactive_state, mask_dropdown],
735
+ outputs=[template_frame]
736
+ )
737
+
738
+ # clear input
739
+ video_input.change(
740
+ fn=restart,
741
+ inputs=[],
742
+ outputs=[
743
+ video_state,
744
+ interactive_state,
745
+ click_state,
746
+ foreground_video_output, alpha_video_output,
747
+ template_frame,
748
+ image_selection_slider, end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
749
+ add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
750
+ ],
751
+ queue=False,
752
+ show_progress=False)
753
+
754
+ video_input.clear(
755
+ fn=restart,
756
+ inputs=[],
757
+ outputs=[
758
+ video_state,
759
+ interactive_state,
760
+ click_state,
761
+ foreground_video_output, alpha_video_output,
762
+ template_frame,
763
+ image_selection_slider , end_selection_slider, track_pause_number_slider,point_prompt, export_to_vace_video_14B_btn, export_to_current_video_engine_btn, matting_type, clear_button_click,
764
+ add_mask_button, matting_button, template_frame, foreground_video_output, alpha_video_output, remove_mask_button, foreground_output_button, alpha_output_button, mask_dropdown, video_info, step2_title
765
+ ],
766
+ queue=False,
767
+ show_progress=False)
768
+
769
+ # points clear
770
+ clear_button_click.click(
771
+ fn = clear_click,
772
+ inputs = [video_state, click_state,],
773
+ outputs = [template_frame,click_state],
774
+ )
775
+
776
+
777
+
778
+ with gr.TabItem("Image"):
779
+ click_state = gr.State([[],[]])
780
+
781
+ interactive_state = gr.State({
782
+ "inference_times": 0,
783
+ "negative_click_times" : 0,
784
+ "positive_click_times": 0,
785
+ "mask_save": False,
786
+ "multi_mask": {
787
+ "mask_names": [],
788
+ "masks": []
789
+ },
790
+ "track_end_number": None,
791
+ }
792
+ )
793
+
794
+ image_state = gr.State(
795
+ {
796
+ "user_name": "",
797
+ "image_name": "",
798
+ "origin_images": None,
799
+ "painted_images": None,
800
+ "masks": None,
801
+ "inpaint_masks": None,
802
+ "logits": None,
803
+ "select_frame_number": 0,
804
+ "fps": 30
805
+ }
806
+ )
807
+
808
+ with gr.Group(elem_classes="gr-monochrome-group", visible=True):
809
  with gr.Row():
810
+ with gr.Accordion('MatAnyone Settings (click to expand)', open=False):
811
+ with gr.Row():
812
+ erode_kernel_size = gr.Slider(label='Erode Kernel Size',
813
+ minimum=0,
814
+ maximum=30,
815
+ step=1,
816
+ value=10,
817
+ info="Erosion on the added mask",
818
+ interactive=True)
819
+ dilate_kernel_size = gr.Slider(label='Dilate Kernel Size',
820
+ minimum=0,
821
+ maximum=30,
822
+ step=1,
823
+ value=10,
824
+ info="Dilation on the added mask",
825
+ interactive=True)
826
+
827
+ with gr.Row():
828
+ image_selection_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Num of Refinement Iterations", info="More iterations → More details & More time", visible=False)
829
+ track_pause_number_slider = gr.Slider(minimum=1, maximum=100, step=1, value=1, label="Track end frame", visible=False)
830
+ with gr.Row():
831
+ point_prompt = gr.Radio(
832
+ choices=["Positive", "Negative"],
833
+ value="Positive",
834
+ label="Point Prompt",
835
+ info="Click to add positive or negative point for target mask",
836
+ interactive=True,
837
+ visible=False,
838
+ min_width=100,
839
+ scale=1)
840
+ mask_dropdown = gr.Dropdown(multiselect=True, value=[], label="Mask Selection", info="Choose 1~all mask(s) added in Step 2", visible=False)
841
+
842
+
843
+ with gr.Column():
844
+ # input image
845
+ with gr.Row(equal_height=True):
846
+ with gr.Column(scale=2):
847
+ gr.Markdown("## Step1: Upload image")
848
+ with gr.Column(scale=2):
849
+ step2_title = gr.Markdown("## Step2: Add masks <small>(Several clicks then **`Add Mask`** <u>one by one</u>)</small>", visible=False)
850
+ with gr.Row(equal_height=True):
851
+ with gr.Column(scale=2):
852
+ image_input = gr.Image(label="Input Image", elem_classes="image")
853
+ extract_frames_button = gr.Button(value="Load Image", interactive=True, elem_classes="new_button")
854
+ with gr.Column(scale=2):
855
+ image_info = gr.Textbox(label="Image Info", visible=False)
856
+ template_frame = gr.Image(type="pil", label="Start Frame", interactive=True, elem_id="template_frame", visible=False, elem_classes="image")
857
+ with gr.Row(equal_height=True, elem_classes="mask_button_group"):
858
+ clear_button_click = gr.Button(value="Clear Clicks", interactive=True, visible=False, elem_classes="new_button", min_width=100)
859
+ add_mask_button = gr.Button(value="Add Mask", interactive=True, visible=False, elem_classes="new_button", min_width=100)
860
+ remove_mask_button = gr.Button(value="Remove Mask", interactive=True, visible=False, elem_classes="new_button", min_width=100)
861
+ matting_button = gr.Button(value="Image Matting", interactive=True, visible=False, elem_classes="green_button", min_width=100)
862
+
863
+ # output image
864
+ with gr.Row(equal_height=True):
865
+ foreground_image_output = gr.Image(type="pil", label="Foreground Output", visible=False, elem_classes="image")
866
  with gr.Row():
867
+ with gr.Row():
868
+ export_image_btn = gr.Button(value="Add to current Reference Images", visible=False, elem_classes="new_button")
869
+ with gr.Column(scale=2, visible= False):
870
+ alpha_image_output = gr.Image(type="pil", label="Alpha Output", visible=False, elem_classes="image")
 
 
871
  alpha_output_button = gr.Button(value="Alpha Mask Output", visible=False, elem_classes="new_button")
 
 
 
872
 
873
+ export_image_btn.click( fn=export_image, inputs= [vace_image_refs, foreground_image_output], outputs= [vace_image_refs]).then( #video_prompt_video_guide_trigger,
874
+ fn=teleport_to_video_tab, inputs= [], outputs= [tabs])
875
+
876
+ # first step: get the image information
877
+ extract_frames_button.click(
878
+ fn=get_frames_from_image,
879
+ inputs=[
880
+ image_input, image_state
881
+ ],
882
+ outputs=[image_state, image_info, template_frame,
883
+ image_selection_slider, track_pause_number_slider,point_prompt, clear_button_click, add_mask_button, matting_button, template_frame,
884
+ foreground_image_output, alpha_image_output, export_image_btn, alpha_output_button, mask_dropdown, step2_title]
885
+ )
886
+
887
+ # second step: select images from slider
888
+ image_selection_slider.release(fn=select_image_template,
889
+ inputs=[image_selection_slider, image_state, interactive_state],
890
+ outputs=[template_frame, image_state, interactive_state], api_name="select_image")
891
+ track_pause_number_slider.release(fn=get_end_number,
892
+ inputs=[track_pause_number_slider, image_state, interactive_state],
893
+ outputs=[template_frame, interactive_state], api_name="end_image")
894
+
895
+ # click select image to get mask using sam
896
+ template_frame.select(
897
+ fn=sam_refine,
898
+ inputs=[image_state, point_prompt, click_state, interactive_state],
899
+ outputs=[template_frame, image_state, interactive_state]
900
+ )
901
+
902
+ # add different mask
903
+ add_mask_button.click(
904
+ fn=add_multi_mask,
905
+ inputs=[image_state, interactive_state, mask_dropdown],
906
+ outputs=[interactive_state, mask_dropdown, template_frame, click_state]
907
+ )
908
+
909
+ remove_mask_button.click(
910
+ fn=remove_multi_mask,
911
+ inputs=[interactive_state, mask_dropdown],
912
+ outputs=[interactive_state, mask_dropdown]
913
+ )
914
+
915
+ # image matting
916
+ matting_button.click(
917
+ fn=image_matting,
918
+ inputs=[image_state, interactive_state, mask_dropdown, erode_kernel_size, dilate_kernel_size, image_selection_slider],
919
+ outputs=[foreground_image_output, export_image_btn]
920
+ )
921
 
 
 
preprocessing/matanyone/tutorial_multi_targets.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:39eaa5740d67e7fc97138c7d74cbcbaffd1f798b30d206c50eb19ba6f33adfb8
3
+ size 621144
preprocessing/matanyone/tutorial_single_target.mp4 ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:397719759b1c3c10c1a15c8603ca8a4ee7889fd8f4e9896703575387e8118826
3
+ size 211460
wan/text2video.py CHANGED
@@ -111,7 +111,7 @@ class WanT2V:
111
 
112
  self.adapt_vace_model()
113
 
114
- def vace_encode_frames(self, frames, ref_images, masks=None, tile_size = 0, overlapped_latents = 0, overlap_noise = 0):
115
  if ref_images is None:
116
  ref_images = [None] * len(frames)
117
  else:
@@ -123,10 +123,10 @@ class WanT2V:
123
  inactive = [i * (1 - m) + 0 * m for i, m in zip(frames, masks)]
124
  reactive = [i * m + 0 * (1 - m) for i, m in zip(frames, masks)]
125
  inactive = self.vae.encode(inactive, tile_size = tile_size)
126
- # inactive = [ t * (1.0 - noise_factor) + torch.randn_like(t ) * noise_factor for t in inactive]
127
- # if overlapped_latents > 0:
128
- # for t in inactive:
129
- # t[:, :overlapped_latents ] = t[:, :overlapped_latents ] * (1.0 - noise_factor) + torch.randn_like(t[:, :overlapped_latents ] ) * noise_factor
130
 
131
  reactive = self.vae.encode(reactive, tile_size = tile_size)
132
  latents = [torch.cat((u, c), dim=0) for u, c in zip(inactive, reactive)]
@@ -190,13 +190,13 @@ class WanT2V:
190
  num_frames = total_frames - prepend_count
191
  if sub_src_mask is not None and sub_src_video is not None:
192
  src_video[i], src_mask[i], _, _, _ = self.vid_proc.load_video_pair(sub_src_video, sub_src_mask, max_frames= num_frames, trim_video = trim_video - prepend_count, start_frame = start_frame, canvas_height = canvas_height, canvas_width = canvas_width, fit_into_canvas = fit_into_canvas)
193
- # src_video is [-1, 1], 0 = inpainting area (in fact 127 in [0, 255])
194
- # src_mask is [-1, 1], 0 = preserve original video (in fact 127 in [0, 255]) and 1 = Inpainting (in fact 255 in [0, 255])
195
  src_video[i] = src_video[i].to(device)
196
  src_mask[i] = src_mask[i].to(device)
197
  if prepend_count > 0:
198
  src_video[i] = torch.cat( [sub_pre_src_video, src_video[i]], dim=1)
199
- src_mask[i] = torch.cat( [torch.zeros_like(sub_pre_src_video), src_mask[i]] ,1)
200
  src_video_shape = src_video[i].shape
201
  if src_video_shape[1] != total_frames:
202
  src_video[i] = torch.cat( [src_video[i], src_video[i].new_zeros(src_video_shape[0], total_frames -src_video_shape[1], *src_video_shape[-2:])], dim=1)
@@ -300,7 +300,8 @@ class WanT2V:
300
  slg_end = 1.0,
301
  cfg_star_switch = True,
302
  cfg_zero_step = 5,
303
- overlapped_latents = 0,
 
304
  overlap_noise = 0,
305
  model_filename = None,
306
  **bbargs
@@ -373,8 +374,10 @@ class WanT2V:
373
  input_frames = [u.to(self.device) for u in input_frames]
374
  input_ref_images = [ None if u == None else [v.to(self.device) for v in u] for u in input_ref_images]
375
  input_masks = [u.to(self.device) for u in input_masks]
376
-
377
- z0 = self.vace_encode_frames(input_frames, input_ref_images, masks=input_masks, tile_size = VAE_tile_size, overlapped_latents = overlapped_latents, overlap_noise = overlap_noise )
 
 
378
  m0 = self.vace_encode_masks(input_masks, input_ref_images)
379
  z = self.vace_latent(z0, m0)
380
 
@@ -442,8 +445,9 @@ class WanT2V:
442
  if vace:
443
  ref_images_count = len(input_ref_images[0]) if input_ref_images != None and input_ref_images[0] != None else 0
444
  kwargs.update({'vace_context' : z, 'vace_context_scale' : context_scale})
445
- if overlapped_latents > 0:
446
- z_reactive = [ zz[0:16, ref_images_count:overlapped_latents + ref_images_count].clone() for zz in z]
 
447
 
448
 
449
  if self.model.enable_teacache:
@@ -453,13 +457,14 @@ class WanT2V:
453
  if callback != None:
454
  callback(-1, None, True)
455
  for i, t in enumerate(tqdm(timesteps)):
456
- if vace and overlapped_latents > 0 :
457
- # noise_factor = overlap_noise *(i/(len(timesteps)-1)) / 1000
458
- noise_factor = overlap_noise / 1000 # * (999-t) / 999
459
- # noise_factor = overlap_noise / 1000 # * t / 999
460
- for zz, zz_r in zip(z, z_reactive):
461
- zz[0:16, ref_images_count:overlapped_latents + ref_images_count] = zz_r * (1.0 - noise_factor) + torch.randn_like(zz_r ) * noise_factor
462
-
 
463
  if target_camera != None:
464
  latent_model_input = torch.cat([latents, source_latents], dim=1)
465
  else:
@@ -552,6 +557,13 @@ class WanT2V:
552
 
553
  x0 = [latents]
554
 
555
  if input_frames == None:
556
  if phantom:
557
  # phantom post processing
@@ -560,11 +572,9 @@ class WanT2V:
560
  else:
561
  # vace post processing
562
  videos = self.decode_latent(x0, input_ref_images, VAE_tile_size)
563
-
564
- del latents
565
- del sample_scheduler
566
-
567
- return videos[0] if self.rank == 0 else None
568
 
569
  def adapt_vace_model(self):
570
  model = self.model
 
111
 
112
  self.adapt_vace_model()
113
 
114
+ def vace_encode_frames(self, frames, ref_images, masks=None, tile_size = 0, overlapped_latents = None):
115
  if ref_images is None:
116
  ref_images = [None] * len(frames)
117
  else:
 
123
  inactive = [i * (1 - m) + 0 * m for i, m in zip(frames, masks)]
124
  reactive = [i * m + 0 * (1 - m) for i, m in zip(frames, masks)]
125
  inactive = self.vae.encode(inactive, tile_size = tile_size)
126
+ self.toto = inactive[0].clone()
127
+ if overlapped_latents != None :
128
+ # inactive[0][:, 0:1] = self.vae.encode([frames[0][:, 0:1]], tile_size = tile_size)[0] # redundant
129
+ inactive[0][:, 1:overlapped_latents.shape[1] + 1] = overlapped_latents
130
 
131
  reactive = self.vae.encode(reactive, tile_size = tile_size)
132
  latents = [torch.cat((u, c), dim=0) for u, c in zip(inactive, reactive)]
 
190
  num_frames = total_frames - prepend_count
191
  if sub_src_mask is not None and sub_src_video is not None:
192
  src_video[i], src_mask[i], _, _, _ = self.vid_proc.load_video_pair(sub_src_video, sub_src_mask, max_frames= num_frames, trim_video = trim_video - prepend_count, start_frame = start_frame, canvas_height = canvas_height, canvas_width = canvas_width, fit_into_canvas = fit_into_canvas)
193
+ # src_video is [-1, 1] (at this function output), 0 = inpainting area (in fact 127 in [0, 255])
194
+ # src_mask is [-1, 1] (at this function output), 0 = preserve original video (in fact 127 in [0, 255]) and 1 = Inpainting (in fact 255 in [0, 255])
195
  src_video[i] = src_video[i].to(device)
196
  src_mask[i] = src_mask[i].to(device)
197
  if prepend_count > 0:
198
  src_video[i] = torch.cat( [sub_pre_src_video, src_video[i]], dim=1)
199
+ src_mask[i] = torch.cat( [torch.full_like(sub_pre_src_video, -1.0), src_mask[i]] ,1)
200
  src_video_shape = src_video[i].shape
201
  if src_video_shape[1] != total_frames:
202
  src_video[i] = torch.cat( [src_video[i], src_video[i].new_zeros(src_video_shape[0], total_frames -src_video_shape[1], *src_video_shape[-2:])], dim=1)
 
300
  slg_end = 1.0,
301
  cfg_star_switch = True,
302
  cfg_zero_step = 5,
303
+ overlapped_latents = None,
304
+ return_latent_slice = None,
305
  overlap_noise = 0,
306
  model_filename = None,
307
  **bbargs
 
374
  input_frames = [u.to(self.device) for u in input_frames]
375
  input_ref_images = [ None if u == None else [v.to(self.device) for v in u] for u in input_ref_images]
376
  input_masks = [u.to(self.device) for u in input_masks]
377
+ previous_latents = None
378
+ # if overlapped_latents != None:
379
+ # input_ref_images = [u[-1:] for u in input_ref_images]
380
+ z0 = self.vace_encode_frames(input_frames, input_ref_images, masks=input_masks, tile_size = VAE_tile_size, overlapped_latents = overlapped_latents )
381
  m0 = self.vace_encode_masks(input_masks, input_ref_images)
382
  z = self.vace_latent(z0, m0)
383
 
 
445
  if vace:
446
  ref_images_count = len(input_ref_images[0]) if input_ref_images != None and input_ref_images[0] != None else 0
447
  kwargs.update({'vace_context' : z, 'vace_context_scale' : context_scale})
448
+ if overlapped_latents != None:
449
+ overlapped_latents_size = overlapped_latents.shape[1] + 1
450
+ z_reactive = [ zz[0:16, 0:overlapped_latents_size + ref_images_count].clone() for zz in z]
451
 
452
 
453
  if self.model.enable_teacache:
 
457
  if callback != None:
458
  callback(-1, None, True)
459
  for i, t in enumerate(tqdm(timesteps)):
460
+ if overlapped_latents != None:
461
+ # overlap_noise_factor = overlap_noise *(i/(len(timesteps)-1)) / 1000
462
+ overlap_noise_factor = overlap_noise / 1000
463
+ latent_noise_factor = t / 1000
464
+ for zz, zz_r, ll in zip(z, z_reactive, [latents]):
465
+ pass
466
+ zz[0:16, ref_images_count:overlapped_latents_size + ref_images_count] = zz_r[:, ref_images_count:] * (1.0 - overlap_noise_factor) + torch.randn_like(zz_r[:, ref_images_count:] ) * overlap_noise_factor
467
+ ll[:, 0:overlapped_latents_size + ref_images_count] = zz_r * (1.0 - latent_noise_factor) + torch.randn_like(zz_r ) * latent_noise_factor
468
  if target_camera != None:
469
  latent_model_input = torch.cat([latents, source_latents], dim=1)
470
  else:
 
 
     x0 = [latents]
 
+    if return_latent_slice != None:
+        if overlapped_latents != None:
+            # latents [:, 1:] = self.toto
+            for zz, zz_r, ll in zip(z, z_reactive, [latents]):
+                ll[:, 0:overlapped_latents_size + ref_images_count] = zz_r
+
+        latent_slice = latents[:, return_latent_slice].clone()
     if input_frames == None:
         if phantom:
             # phantom post processing
 
     else:
         # vace post processing
         videos = self.decode_latent(x0, input_ref_images, VAE_tile_size)
+        if return_latent_slice != None:
+            return { "x" : videos[0], "latent_slice" : latent_slice }
+        return videos[0]
 
 
 def adapt_vace_model(self):
     model = self.model
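When `return_latent_slice` is set, `generate()` now returns a dict instead of a bare tensor, so a caller can hand the preserved latent slice back in as `overlapped_latents` for the next sliding window. A self-contained sketch of that calling pattern; `generate_window` is a stand-in stub, not the model's real signature:

```python
# Hedged sketch of chaining windows with the new dict return value.
import torch

def generate_window(prompt, overlapped_latents=None, return_latent_slice=None):
    frames = torch.zeros(3, 81, 64, 64)                      # placeholder decoded frames
    if return_latent_slice is not None:
        return {"x": frames, "latent_slice": torch.zeros(16, 2, 8, 8)}
    return frames

overlapped_latents, all_frames = None, []
for window_no in range(3):
    out = generate_window("a boat at sea", overlapped_latents=overlapped_latents,
                          return_latent_slice=slice(-3, -2))
    if isinstance(out, dict):                                # a latent slice was requested
        overlapped_latents = out["latent_slice"]             # seeds the next window's overlap
        out = out["x"]
    all_frames.append(out)
```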
wan/utils/utils.py CHANGED
@@ -91,11 +91,11 @@ def calculate_new_dimensions(canvas_height, canvas_width, height, width, fit_int
     return new_height, new_width
 
 def resize_and_remove_background(img_list, budget_width, budget_height, rm_background, fit_into_canvas = False ):
-    if rm_background:
+    if rm_background > 0:
         session = new_session()
 
     output_list =[]
-    for img in img_list:
+    for i, img in enumerate(img_list):
         width, height = img.size
 
         if fit_into_canvas:
@@ -113,9 +113,10 @@ def resize_and_remove_background(img_list, budget_width, budget_height, rm_backg
         new_height = int( round(height * scale / 16) * 16)
         new_width = int( round(width * scale / 16) * 16)
         resized_image= img.resize((new_width,new_height), resample=Image.Resampling.LANCZOS)
-        if rm_background:
-            resized_image = remove(resized_image, session=session, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
-        output_list.append(resized_image)
+        if rm_background == 1 or rm_background == 2 and i > 0 :
+            # resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1, alpha_matting_background_threshold = 70, alpha_foreground_background_threshold = 100, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
+            resized_image = remove(resized_image, session=session, alpha_matting_erode_size = 1, alpha_matting = True, bgcolor=[255, 255, 255, 0]).convert('RGB')
+        output_list.append(resized_image) # alpha_matting_background_threshold = 30, alpha_foreground_background_threshold = 200,
     return output_list
 
 
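The `rm_background` flag is no longer a boolean: 0 keeps every background, 1 strips every background, and 2 keeps the first image (typically a landscape) while stripping the rest, since `rm_background == 1 or rm_background == 2 and i > 0` only skips index 0 in mode 2. A hedged usage sketch with placeholder file names:

```python
# Hedged usage sketch of the reworked helper; the image paths are placeholders.
from PIL import Image
from wan.utils.utils import resize_and_remove_background

refs = [Image.open("landscape.png"), Image.open("person.png"), Image.open("object.png")]
refs = resize_and_remove_background(refs, budget_width=832, budget_height=480,
                                    rm_background=2,          # keep first background, strip the others
                                    fit_into_canvas=False)
```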
wgp.py CHANGED
@@ -204,9 +204,6 @@ def process_prompt_and_add_tasks(state, model_choice):
 
     if isinstance(image_refs, list):
         image_refs = [ convert_image(tup[0]) for tup in image_refs ]
-        # os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
-        # from wan.utils.utils import resize_and_remove_background
-        # image_refs = resize_and_remove_background(image_refs, width, height, inputs["remove_background_image_ref"] ==1, fit_into_canvas= True)
 
 
     if len(prompts) > 0:
@@ -333,8 +330,10 @@ def process_prompt_and_add_tasks(state, model_choice):
     if "O" in video_prompt_type :
         keep_frames_video_guide= inputs["keep_frames_video_guide"]
         video_length = inputs["video_length"]
-        if len(keep_frames_video_guide) ==0:
-            gr.Info(f"Warning : you have asked to reuse all the frames of the control Video in the Alternate Video Ending it. Please make sure the number of frames of the control Video is lower than the total number of frames to generate otherwise it won't make a difference.")
+        if len(keep_frames_video_guide) > 0:
+            gr.Info("Keeping Frames with Extending Video is not yet supported")
+            return
+        # gr.Info(f"Warning : you have asked to reuse all the frames of the control Video in the Alternate Video Ending it. Please make sure the number of frames of the control Video is lower than the total number of frames to generate otherwise it won't make a difference.")
     # elif keep_frames >= video_length:
     #     gr.Info(f"The number of frames in the control Video to reuse ({keep_frames_video_guide}) in Alternate Video Ending can not be bigger than the total number of frames ({video_length}) to generate.")
     #     return
@@ -347,12 +346,7 @@ def process_prompt_and_add_tasks(state, model_choice):
             return
 
     if isinstance(image_refs, list):
-        image_refs = [ convert_image(tup[0]) for tup in image_refs ]
-
-        # os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
-        # from wan.utils.utils import resize_and_remove_background
-        # image_refs = resize_and_remove_background(image_refs, width, height, inputs["remove_background_image_ref"] ==1)
-
+        image_refs = [ convert_image(tup[0]) for tup in image_refs ]
 
     if len(prompts) > 0:
         prompts = ["\n".join(prompts)]
@@ -1464,7 +1458,6 @@ lock_ui_attention = False
 lock_ui_transformer = False
 lock_ui_compile = False
 
-preload =int(args.preload)
 force_profile_no = int(args.profile)
 verbose_level = int(args.verbose)
 quantizeTransformer = args.quantize_transformer
@@ -1482,15 +1475,19 @@ if os.path.isfile("t2v_settings.json"):
 if not os.path.isfile(server_config_filename) and os.path.isfile("gradio_config.json"):
     shutil.move("gradio_config.json", server_config_filename)
 
+if not os.path.isdir("ckpts/umt5-xxl/"):
+    os.makedirs("ckpts/umt5-xxl/")
 src_move = [ "ckpts/models_clip_open-clip-xlm-roberta-large-vit-huge-14-bf16.safetensors", "ckpts/models_t5_umt5-xxl-enc-bf16.safetensors", "ckpts/models_t5_umt5-xxl-enc-quanto_int8.safetensors" ]
 tgt_move = [ "ckpts/xlm-roberta-large/", "ckpts/umt5-xxl/", "ckpts/umt5-xxl/"]
 for src,tgt in zip(src_move,tgt_move):
     if os.path.isfile(src):
         try:
-            shutil.move(src, tgt)
+            if os.path.isfile(tgt):
+                shutil.remove(src)
+            else:
+                shutil.move(src, tgt)
         except:
             pass
-
 
 
 if not Path(server_config_filename).is_file():
@@ -1755,7 +1752,10 @@ def get_default_settings(filename):
             "flow_shift": 13,
             "resolution": "1280x720"
         })
-
+    elif get_model_type(filename) in ("vace_14B"):
+        ui_defaults.update({
+            "sliding_window_discard_last_frames": 0,
+        })
 
 
     with open(defaults_filename, "w", encoding="utf-8") as f:
@@ -2136,6 +2136,9 @@ def load_models(model_filename):
     global transformer_filename, transformer_loras_filenames
     model_family = get_model_family(model_filename)
     perc_reserved_mem_max = args.perc_reserved_mem_max
+    preload =int(args.preload)
+    if preload == 0:
+        preload = server_config.get("preload_in_VRAM", 0)
     new_transformer_loras_filenames = None
     dependent_models = get_dependent_models(model_filename, quantization= transformer_quantization, dtype_policy = transformer_dtype_policy)
     new_transformer_loras_filenames = [model_filename] if "_lora" in model_filename else None
@@ -2259,7 +2262,8 @@ def apply_changes( state,
                     preload_model_policy_choice = 1,
                     UI_theme_choice = "default",
                     enhancer_enabled_choice = 0,
-                    fit_canvas_choice = 0
+                    fit_canvas_choice = 0,
+                    preload_in_VRAM_choice = 0
 ):
     if args.lock_config:
         return
@@ -2284,6 +2288,7 @@ def apply_changes( state,
                    "UI_theme" : UI_theme_choice,
                    "fit_canvas": fit_canvas_choice,
                    "enhancer_enabled" : enhancer_enabled_choice,
+                   "preload_in_VRAM" : preload_in_VRAM_choice
     }
 
     if Path(server_config_filename).is_file():
@@ -2456,26 +2461,20 @@ def refresh_gallery(state): #, msg
         prompt = "<BR><DIV style='height:8px'></DIV>".join(prompts)
         if enhanced:
             prompt = "<U><B>Enhanced:</B></U><BR>" + prompt
-
+        list_uri = []
         start_img_uri = task.get('start_image_data_base64')
-        start_img_uri = start_img_uri[0] if start_img_uri !=None else None
+        if start_img_uri != None:
+            list_uri += start_img_uri
         end_img_uri = task.get('end_image_data_base64')
-        end_img_uri = end_img_uri[0] if end_img_uri !=None else None
+        if end_img_uri != None:
+            list_uri += end_img_uri
+
         thumbnail_size = "100px"
-        if start_img_uri:
-            start_img_md = f'<img src="{start_img_uri}" alt="Start" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" />'
-        if end_img_uri:
-            end_img_md = f'<img src="{end_img_uri}" alt="End" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" />'
+        thumbnails = ""
+        for img_uri in list_uri:
+            thumbnails += f'<TD><img src="{img_uri}" alt="Start" style="max-width:{thumbnail_size}; max-height:{thumbnail_size}; display: block; margin: auto; object-fit: contain;" /></TD>'
 
-        label = f"Prompt of Video being Generated"
-
-        html = "<STYLE> #PINFO, #PINFO th, #PINFO td {border: 1px solid #CCCCCC;background-color:#FFFFFF;}</STYLE><TABLE WIDTH=100% ID=PINFO ><TR><TD width=100%>" + prompt + "</TD>"
-        if start_img_md != "":
-            html += "<TD>" + start_img_md + "</TD>"
-        if end_img_md != "":
-            html += "<TD>" + end_img_md + "</TD>"
-
-        html += "</TR></TABLE>"
+        html = "<STYLE> #PINFO, #PINFO th, #PINFO td {border: 1px solid #CCCCCC;background-color:#FFFFFF;}</STYLE><TABLE WIDTH=100% ID=PINFO ><TR><TD width=100%>" + prompt + "</TD>" + thumbnails + "</TR></TABLE>"
         html_output = gr.HTML(html, visible= True)
         return gr.Gallery(selected_index=choice, value = file_list), html_output, gr.Button(visible=False), gr.Button(visible=True), gr.Row(visible=True), update_queue_data(queue), gr.Button(interactive= abort_interactive), gr.Button(visible= onemorewindow_visible)
@@ -2680,7 +2679,7 @@ def generate_video(
     sliding_window_overlap,
     sliding_window_overlap_noise,
     sliding_window_discard_last_frames,
-    remove_background_image_ref,
+    remove_background_images_ref,
     temporal_upsampling,
     spatial_upsampling,
     RIFLEx_setting,
@@ -2816,13 +2815,14 @@ def generate_video(
        fps = 30
    else:
        fps = 16
+   latent_size = 8 if ltxv else 4
 
    original_image_refs = image_refs
    if image_refs != None and len(image_refs) > 0 and (hunyuan_custom or phantom or vace):
        send_cmd("progress", [0, get_latest_status(state, "Removing Images References Background")])
        os.environ["U2NET_HOME"] = os.path.join(os.getcwd(), "ckpts", "rembg")
        from wan.utils.utils import resize_and_remove_background
-       image_refs = resize_and_remove_background(image_refs, width, height, remove_background_image_ref ==1, fit_into_canvas= not vace)
+       image_refs = resize_and_remove_background(image_refs, width, height, remove_background_images_ref, fit_into_canvas= not vace)
        update_task_thumbnails(task, locals())
        send_cmd("output")
 
@@ -2879,7 +2879,6 @@ def generate_video(
    repeat_no = 0
    extra_generation = 0
    initial_total_windows = 0
-   max_frames_to_generate = video_length
    if diffusion_forcing or vace or ltxv:
        reuse_frames = min(sliding_window_size - 4, sliding_window_overlap)
    else:
@@ -2888,8 +2887,9 @@ def generate_video(
        video_length += sliding_window_overlap
    sliding_window = (vace or diffusion_forcing or ltxv) and video_length > sliding_window_size
 
+   discard_last_frames = sliding_window_discard_last_frames
+   default_max_frames_to_generate = video_length
    if sliding_window:
-       discard_last_frames = sliding_window_discard_last_frames
        left_after_first_window = video_length - sliding_window_size + discard_last_frames
        initial_total_windows= 1 + math.ceil(left_after_first_window / (sliding_window_size - discard_last_frames - reuse_frames))
        video_length = sliding_window_size
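The window count formula above can be checked with a small worked example using the UI defaults (window size 81, overlap 5, discard 8). This is a hedged illustration of the arithmetic only; the numbers are assumptions, not values read from a config file.

```python
# Hedged worked example of the sliding-window count computed above.
import math

video_length = 161                    # requested frames (assumed)
sliding_window_size = 81
sliding_window_overlap = 5
sliding_window_discard_last_frames = 8

reuse_frames = min(sliding_window_size - 4, sliding_window_overlap)                    # 5
discard_last_frames = sliding_window_discard_last_frames                               # 8
left_after_first_window = video_length - sliding_window_size + discard_last_frames    # 88
initial_total_windows = 1 + math.ceil(
    left_after_first_window / (sliding_window_size - discard_last_frames - reuse_frames))
print(initial_total_windows)          # 1 + ceil(88 / 68) = 3 windows
```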
@@ -2913,6 +2913,7 @@ def generate_video(
    prefix_video_frames_count = 0
    frames_already_processed = None
    pre_video_guide = None
+   overlapped_latents = None
    window_no = 0
    extra_windows = 0
    guide_start_frame = 0
@@ -2920,6 +2921,8 @@ def generate_video(
    gen["extra_windows"] = 0
    gen["total_windows"] = 1
    gen["window_no"] = 1
+   num_frames_generated = 0
+   max_frames_to_generate = default_max_frames_to_generate
    start_time = time.time()
    if prompt_enhancer_image_caption_model != None and prompt_enhancer !=None and len(prompt_enhancer)>0:
        text_encoder_max_tokens = 256
@@ -2955,38 +2958,50 @@ def generate_video(
    while not abort:
        if sliding_window:
            prompt = prompts[window_no] if window_no < len(prompts) else prompts[-1]
-           extra_windows += gen.get("extra_windows",0)
-           if extra_windows > 0:
-               video_length = sliding_window_size
+           new_extra_windows = gen.get("extra_windows",0)
            gen["extra_windows"] = 0
+           extra_windows += new_extra_windows
+           max_frames_to_generate += new_extra_windows * (sliding_window_size - discard_last_frames - reuse_frames)
+           sliding_window = sliding_window or extra_windows > 0
+           if sliding_window and window_no > 0:
+               num_frames_generated -= reuse_frames
+               if (max_frames_to_generate - prefix_video_frames_count - num_frames_generated) < latent_size:
+                   break
+               video_length = min(sliding_window_size, ((max_frames_to_generate - num_frames_generated - prefix_video_frames_count + reuse_frames + discard_last_frames) // latent_size) * latent_size + 1 )
+
            total_windows = initial_total_windows + extra_windows
            gen["total_windows"] = total_windows
            if window_no >= total_windows:
                break
        window_no += 1
        gen["window_no"] = window_no
-
+       return_latent_slice = None
+       if reuse_frames > 0:
+           return_latent_slice = slice(-(reuse_frames - 1 + discard_last_frames ) // latent_size, None if discard_last_frames == 0 else -(discard_last_frames // latent_size) )
+
        if hunyuan_custom:
            src_ref_images = image_refs
        elif phantom:
            src_ref_images = image_refs.copy() if image_refs != None else None
-       elif diffusion_forcing or ltxv:
+       elif diffusion_forcing or ltxv or vace and "O" in video_prompt_type:
+           if vace:
+               video_source = video_guide
+               video_guide = None
            if video_source != None and len(video_source) > 0 and window_no == 1:
                keep_frames_video_source= 1000 if len(keep_frames_video_source) ==0 else int(keep_frames_video_source)
+               keep_frames_video_source = (keep_frames_video_source // latent_size ) * latent_size + 1
                prefix_video = preprocess_video(None, width=width, height=height,video_in=video_source, max_frames= keep_frames_video_source , start_frame = 0, fit_canvas= fit_canvas, target_fps = fps, block_size = 32 if ltxv else 16)
                prefix_video = prefix_video .permute(3, 0, 1, 2)
                prefix_video = prefix_video .float().div_(127.5).sub_(1.) # c, f, h, w
-               prefix_video_frames_count = prefix_video.shape[1]
                pre_video_guide = prefix_video[:, -reuse_frames:]
-
-       elif vace:
-           # video_prompt_type = video_prompt_type +"G"
+               prefix_video_frames_count = pre_video_guide.shape[1]
+               if vace:
+                   height, width = pre_video_guide.shape[-2:]
+       if vace:
            image_refs_copy = image_refs.copy() if image_refs != None else None # required since prepare_source do inplace modifications
            video_guide_copy = video_guide
            video_mask_copy = video_mask
            if any(process in video_prompt_type for process in ("P", "D", "G")) :
-               prompts_max = gen["prompts_max"]
-
                preprocess_type = None
                if "P" in video_prompt_type :
                    progress_args = [0, get_latest_status(state,"Extracting Open Pose Information")]
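Two of the new expressions above are easy to misread, so here is a hedged, self-contained walk-through of what they evaluate to for the same assumed settings as before (latent_size 4 for Wan, reuse 5, discard 8, window size 81, 161 requested frames). The numbers are illustrative only.

```python
# Hedged walk-through of the new sliding-window bookkeeping.
latent_size = 4
reuse_frames, discard_last_frames = 5, 8
sliding_window_size, max_frames_to_generate = 81, 161
prefix_video_frames_count = 0

# Latent slice that generate() hands back so the next window can reuse the overlap:
return_latent_slice = slice(-(reuse_frames - 1 + discard_last_frames) // latent_size,
                            None if discard_last_frames == 0 else -(discard_last_frames // latent_size))
print(return_latent_slice)            # slice(-3, -2, None)

# After window 1 produced 81 frames and the last 8 were discarded:
num_frames_generated = sliding_window_size - discard_last_frames    # 73
num_frames_generated -= reuse_frames                                 # 68: the overlap is re-generated
remaining = max_frames_to_generate - prefix_video_frames_count - num_frames_generated
video_length = min(sliding_window_size,
                   ((remaining + reuse_frames + discard_last_frames) // latent_size) * latent_size + 1)
print(video_length)                    # 81: window 2 still runs at full size
```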
@@ -3005,8 +3020,11 @@ def generate_video(
            if len(error) > 0:
                raise gr.Error(f"invalid keep frames {keep_frames_video_guide}")
            keep_frames_parsed = keep_frames_parsed[guide_start_frame: guide_start_frame + video_length]
+
            if window_no == 1:
-               image_size = (height, width) # VACE_SIZE_CONFIGS[resolution_reformated] # default frame dimensions until it is set by video_src (if there is any)
+               image_size = (height, width) # default frame dimensions until it is set by video_src (if there is any)
+
+
            src_video, src_mask, src_ref_images = wan_model.prepare_source([video_guide_copy],
                                                                           [video_mask_copy ],
                                                                           [image_refs_copy],
@@ -3017,29 +3035,24 @@ def generate_video(
                                                                           pre_src_video = [pre_video_guide],
                                                                           fit_into_canvas = fit_canvas
                                                                           )
-           # if window_no == 1 and src_video != None and len(src_video) > 0:
-           #     image_size = src_video[0].shape[-2:]
-           prompts_max = gen["prompts_max"]
        status = get_latest_status(state)
-
-
        gen["progress_status"] = status
        gen["progress_phase"] = ("Encoding Prompt", -1 )
        callback = build_callback(state, trans, send_cmd, status, num_inference_steps)
        progress_args = [0, merge_status_context(status, "Encoding Prompt")]
        send_cmd("progress", progress_args)
 
+       if trans.enable_teacache:
+           trans.teacache_counter = 0
+           trans.num_steps = num_inference_steps
+           trans.teacache_skipped_steps = 0
+           trans.previous_residual = None
+           trans.previous_modulated_input = None
+
        # samples = torch.empty( (1,2)) #for testing
        # if False:
 
        try:
-           if trans.enable_teacache:
-               trans.teacache_counter = 0
-               trans.num_steps = num_inference_steps
-               trans.teacache_skipped_steps = 0
-               trans.previous_residual = None
-               trans.previous_modulated_input = None
-
            samples = wan_model.generate(
                input_prompt = prompt,
                image_start = image_start,
@@ -3049,7 +3062,7 @@ def generate_video(
                input_masks = src_mask,
                input_video= pre_video_guide if diffusion_forcing or ltxv else source_video,
                target_camera= target_camera,
-               frame_num=(video_length // 4)* 4 + 1,
+               frame_num=(video_length // latent_size)* latent_size + 1,
                height = height,
                width = width,
                fit_into_canvas = fit_canvas == 1,
@@ -3076,7 +3089,8 @@ def generate_video(
                causal_block_size = 5,
                causal_attention = True,
                fps = fps,
-               overlapped_latents = 0 if reuse_frames == 0 or window_no == 1 else ((reuse_frames - 1) // 4 + 1),
+               overlapped_latents = overlapped_latents,
+               return_latent_slice= return_latent_slice,
                overlap_noise = sliding_window_overlap_noise,
                model_filename = model_filename,
            )
@@ -3109,6 +3123,7 @@ def generate_video(
            tb = traceback.format_exc().split('\n')[:-1]
            print('\n'.join(tb))
            send_cmd("error", new_error)
+           clear_status(state)
            return
        finally:
            trans.previous_residual = None
@@ -3118,33 +3133,42 @@ def generate_video(
                print(f"Teacache Skipped Steps:{trans.teacache_skipped_steps}/{trans.num_steps}" )
 
            if samples != None:
+               if isinstance(samples, dict):
+                   overlapped_latents = samples.get("latent_slice", None)
+                   samples= samples["x"]
                samples = samples.to("cpu")
                offload.last_offload_obj.unload_all()
                gc.collect()
                torch.cuda.empty_cache()
 
+           # time_flag = datetime.fromtimestamp(time.time()).strftime("%Y-%m-%d-%Hh%Mm%Ss")
+           # save_prompt = "_in_" + original_prompts[0]
+           # file_name = f"{time_flag}_seed{seed}_{sanitize_file_name(save_prompt[:50]).strip()}.mp4"
+           # sample = samples.cpu()
+           # cache_video( tensor=sample[None].clone(), save_file=os.path.join(save_path, file_name), fps=16, nrow=1, normalize=True, value_range=(-1, 1))
+
            if samples == None:
                abort = True
                state["prompt"] = ""
                send_cmd("output")
            else:
                sample = samples.cpu()
-               if True: # for testing
-                   torch.save(sample, "output.pt")
-               else:
-                   sample =torch.load("output.pt")
-
+               # if True: # for testing
+               #     torch.save(sample, "output.pt")
+               # else:
+               #     sample =torch.load("output.pt")
+               if gen.get("extra_windows",0) > 0:
+                   sliding_window = True
                if sliding_window :
                    guide_start_frame += video_length
                    if discard_last_frames > 0:
                        sample = sample[: , :-discard_last_frames]
                        guide_start_frame -= discard_last_frames
                    if reuse_frames == 0:
-                       pre_video_guide = sample[:,9999 :]
+                       pre_video_guide = sample[:,9999 :].clone()
                    else:
-                       # noise_factor = 200/ 1000
-                       # pre_video_guide = sample[:, -reuse_frames:] * (1.0 - noise_factor) + torch.randn_like(sample[:, -reuse_frames:] ) * noise_factor
-                       pre_video_guide = sample[:, -reuse_frames:]
+                       pre_video_guide = sample[:, -reuse_frames:].clone()
+                   num_frames_generated += sample.shape[1]
 
 
                if prefix_video != None:
@@ -3158,7 +3182,6 @@ def generate_video(
                        sample = sample[: , :]
                    else:
                        sample = sample[: , reuse_frames:]
-
                    guide_start_frame -= reuse_frames
 
                exp = 0
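The pixel-space bookkeeping above (trim the discarded tail, keep the last `reuse_frames` frames as the next guide, drop the overlapped head before concatenation) can be summarised in a small standalone helper. This is a hedged, simplified sketch, not the function used by wgp.py.

```python
# Hedged sketch of stitching consecutive windows in pixel space.
import torch

def stitch(previous, sample: torch.Tensor, reuse_frames: int, discard_last_frames: int):
    """`sample` is the (C, F, H, W) tensor a window just produced."""
    if discard_last_frames > 0:
        sample = sample[:, :-discard_last_frames]         # drop the possibly blurry tail
    pre_video_guide = sample[:, -reuse_frames:].clone()   # seeds the next window
    if previous is not None:
        sample = sample[:, reuse_frames:]                 # the overlap was already emitted
        merged = torch.cat([previous, sample], dim=1)
    else:
        merged = sample
    return merged, pre_video_guide
```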
@@ -3252,15 +3275,9 @@ def generate_video(
                print(f"New video saved to Path: "+video_path)
                file_list.append(video_path)
                send_cmd("output")
-               if sliding_window :
-                   if max_frames_to_generate > 0 and extra_windows == 0:
-                       current_length = sample.shape[1]
-                       if (current_length - prefix_video_frames_count)>= max_frames_to_generate:
-                           break
-                       video_length = min(sliding_window_size, ((max_frames_to_generate - (current_length - prefix_video_frames_count) + reuse_frames + discard_last_frames) // 4) * 4 + 1 )
 
        seed += 1
-
+    clear_status(state)
    if temp_filename!= None and os.path.isfile(temp_filename):
        os.remove(temp_filename)
    offload.unload_loras_from_model(trans)
@@ -3630,7 +3647,16 @@ def merge_status_context(status="", context=""):
        return status
    else:
        return status + " - " + context
-
+
+def clear_status(state):
+    gen = get_gen_info(state)
+    gen["extra_windows"] = 0
+    gen["total_windows"] = 1
+    gen["window_no"] = 1
+    gen["extra_orders"] = 0
+    gen["repeat_no"] = 0
+    gen["total_generation"] = 0
+
def get_latest_status(state, context=""):
    gen = get_gen_info(state)
    prompt_no = gen["prompt_no"]
@@ -3999,7 +4025,7 @@ def prepare_inputs_dict(target, inputs ):
    inputs.pop("model_mode")
 
    if not "Vace" in model_filename or not "phantom" in model_filename or not "hunyuan_video_custom" in model_filename:
-       unsaved_params = ["keep_frames_video_guide", "video_prompt_type", "remove_background_image_ref"]
+       unsaved_params = ["keep_frames_video_guide", "video_prompt_type", "remove_background_images_ref"]
        for k in unsaved_params:
            inputs.pop(k)
 
@@ -4066,7 +4092,7 @@ def save_inputs(
            sliding_window_overlap,
            sliding_window_overlap_noise,
            sliding_window_discard_last_frames,
-           remove_background_image_ref,
+           remove_background_images_ref,
            temporal_upsampling,
            spatial_upsampling,
            RIFLEx_setting,
@@ -4458,7 +4484,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                        ("Transfer Human Motion from the Control Video", "PV"),
                        ("Transfer Depth from the Control Video", "DV"),
                        ("Recolorize the Control Video", "CV"),
-                       # ("Alternate Video Ending", "OV"),
+                       ("Extend Video", "OV"),
                        ("Video contains Open Pose, Depth, Black & White, Inpainting ", "V"),
                        ("Control Video and Mask video for Inpainting ", "MV"),
                    ],
@@ -4489,7 +4515,17 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                    )
 
                # with gr.Row():
-               remove_background_image_ref = gr.Checkbox(value=ui_defaults.get("remove_background_image_ref",1), label= "Remove Background of Images References", visible= "I" in video_prompt_type_value, scale =1 )
+               remove_background_images_ref = gr.Dropdown(
+                   choices=[
+                       ("Keep Backgrounds of All Images (landscape)", 0),
+                       ("Remove Backgrounds of All Images (objects / faces)", 1),
+                       ("Keep it for first Image (landscape) and remove it for other Images (objects / faces)", 2),
+                   ],
+                   value=ui_defaults.get("remove_background_images_ref",1),
+                   label="Remove Background of Images References", scale = 3, visible= "I" in video_prompt_type_value
+               )
+
+               # remove_background_images_ref = gr.Checkbox(value=ui_defaults.get("remove_background_images_ref",1), label= "Remove Background of Images References", visible= "I" in video_prompt_type_value, scale =1 )
 
 
                video_mask = gr.Video(label= "Video Mask (for Inpainting or Outpaing, white pixels = Mask)", visible= "M" in video_prompt_type_value, value= ui_defaults.get("video_mask", None))
@@ -4730,7 +4766,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
                    else:
                        sliding_window_size = gr.Slider(5, 137, value=ui_defaults.get("sliding_window_size", 81), step=4, label="Sliding Window Size")
                        sliding_window_overlap = gr.Slider(1, 97, value=ui_defaults.get("sliding_window_overlap",5), step=4, label="Windows Frames Overlap (needed to maintain continuity between windows, a higher value will require more windows)")
-                       sliding_window_overlap_noise = gr.Slider(0, 100, value=ui_defaults.get("sliding_window_overlap_noise",20), step=1, label="Noise to be added to overlapped frames to reduce blur effect")
+                       sliding_window_overlap_noise = gr.Slider(0, 150, value=ui_defaults.get("sliding_window_overlap_noise",20), step=1, label="Noise to be added to overlapped frames to reduce blur effect")
                        sliding_window_discard_last_frames = gr.Slider(0, 20, value=ui_defaults.get("sliding_window_discard_last_frames", 8), step=4, label="Discard Last Frames of a Window (that may have bad quality)", visible = True)
@@ -4811,7 +4847,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
 
        image_prompt_type.change(fn=refresh_image_prompt_type, inputs=[state, image_prompt_type], outputs=[image_start, image_end, video_source, keep_frames_video_source] )
        video_prompt_video_guide_trigger.change(fn=refresh_video_prompt_video_guide_trigger, inputs=[video_prompt_type, video_prompt_video_guide_trigger], outputs=[video_prompt_type, video_prompt_type_video_guide, video_guide, video_mask, keep_frames_video_guide])
-       video_prompt_type_image_refs.input(fn=refresh_video_prompt_type_image_refs, inputs = [video_prompt_type, video_prompt_type_image_refs], outputs = [video_prompt_type, image_refs, remove_background_image_ref ])
+       video_prompt_type_image_refs.input(fn=refresh_video_prompt_type_image_refs, inputs = [video_prompt_type, video_prompt_type_image_refs], outputs = [video_prompt_type, image_refs, remove_background_images_ref ])
        video_prompt_type_video_guide.input(fn=refresh_video_prompt_type_video_guide, inputs = [video_prompt_type, video_prompt_type_video_guide], outputs = [video_prompt_type, video_guide, keep_frames_video_guide, video_mask])
 
        show_advanced.change(fn=switch_advanced, inputs=[state, show_advanced, lset_name], outputs=[advanced_row, preset_buttons_rows, refresh_lora_btn, refresh2_row ,lset_name ]).then(
@@ -5036,7 +5072,7 @@ def generate_video_tab(update_form = False, state_dict = None, ui_defaults = Non
    )
 
    return ( state, loras_choices, lset_name, state,
-            video_guide, video_mask, video_prompt_video_guide_trigger, prompt_enhancer
+            video_guide, video_mask, image_refs, video_prompt_video_guide_trigger, prompt_enhancer
    )
@@ -5250,6 +5286,7 @@ def generate_configuration_tab(state, blocks, header, model_choice, prompt_enhan
                value= profile,
                label="Profile (for power users only, not needed to change it)"
            )
+           preload_in_VRAM_choice = gr.Slider(0, 40000, value=server_config.get("preload_in_VRAM", 0), step=100, label="Number of MB of Models that are Preloaded in VRAM (0 will use Profile default)")
@@ -5277,7 +5314,8 @@ def generate_configuration_tab(state, blocks, header, model_choice, prompt_enhan
                preload_model_policy_choice,
                UI_theme_choice,
                enhancer_enabled_choice,
-               fit_canvas_choice
+               fit_canvas_choice,
+               preload_in_VRAM_choice
            ],
            outputs= [msg , header, model_choice, prompt_enhancer_row]
        )
@@ -5661,7 +5699,7 @@ def create_demo():
    theme = gr.themes.Soft(font=["Verdana"], primary_hue="sky", neutral_hue="slate", text_size="md")
 
    with gr.Blocks(css=css, theme=theme, title= "WanGP") as main:
-       gr.Markdown("<div align=center><H1>Wan<SUP>GP</SUP> v5.2 <FONT SIZE=4>by <I>DeepBeepMeep</I></FONT> <FONT SIZE=3>") # (<A HREF='https://github.com/deepbeepmeep/Wan2GP'>Updates</A>)</FONT SIZE=3></H1></div>")
+       gr.Markdown("<div align=center><H1>Wan<SUP>GP</SUP> v5.21 <FONT SIZE=4>by <I>DeepBeepMeep</I></FONT> <FONT SIZE=3>") # (<A HREF='https://github.com/deepbeepmeep/Wan2GP'>Updates</A>)</FONT SIZE=3></H1></div>")
        global model_list
 
        tab_state = gr.State({ "tab_no":0 })
@@ -5680,7 +5718,7 @@ def create_demo():
            header = gr.Markdown(generate_header(transformer_filename, compile, attention_mode), visible= True)
            with gr.Row():
                ( state, loras_choices, lset_name, state,
-                 video_guide, video_mask, video_prompt_type_video_trigger, prompt_enhancer_row
+                 video_guide, video_mask, image_refs, video_prompt_type_video_trigger, prompt_enhancer_row
                ) = generate_video_tab(model_choice=model_choice, header=header, main = main)
            with gr.Tab("Informations", id="info"):
                generate_info_tab()
@@ -5688,7 +5726,7 @@ def create_demo():
            from preprocessing.matanyone import app as matanyone_app
            vmc_event_handler = matanyone_app.get_vmc_event_handler()
 
-           matanyone_app.display(main_tabs, model_choice, video_guide, video_mask, video_prompt_type_video_trigger)
+           matanyone_app.display(main_tabs, model_choice, video_guide, video_mask, image_refs, video_prompt_type_video_trigger)
            if not args.lock_config:
                with gr.Tab("Downloads", id="downloads") as downloads_tab:
                    generate_download_tab(lset_name, loras_choices, state)
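One of the visible additions in this commit is the re-enabled "Extend Video" (OV) mode for Vace: the uploaded control video is reused as the source to extend, and the reused frame count is snapped to the latent grid before the first window runs. A hedged, self-contained sketch of that path; all values below are assumptions for illustration, not defaults read from the app.

```python
# Hedged sketch of the "Extend Video" (OV) path as wired above in wgp.py.
vace = True
video_prompt_type = "OV"
video_guide, video_source = "control.mp4", None    # placeholder file name
latent_size = 4                                     # 8 for LTX Video
keep_frames_video_source = 1000                     # "use everything" when the field is empty

if vace and "O" in video_prompt_type:
    # The control video becomes the source to extend; no guide is passed to VACE.
    video_source, video_guide = video_guide, None
    # Snap the reused frame count to the latent grid (4n + 1 frames for Wan).
    keep_frames_video_source = (keep_frames_video_source // latent_size) * latent_size + 1

print(video_source, keep_frames_video_source)       # control.mp4 1001
```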