Inference
from diffusers import StableDiffusionPipeline
from safetensors.torch import load_model

from gemma_encoder import Encoder  # shipped with this repo

if __name__ == '__main__':
    # Load the SD1.5 checkpoint; pass the VAE you want to use.
    pipeline = StableDiffusionPipeline.from_single_file('rosaceae_inkRose.safetensors', vae=...)
    pipeline.enable_model_cpu_offload()

    # adapter_model: instantiate the adapter module from this repo first,
    # then load the trained weights into it.
    encoder = Encoder(adapter_model, 'google/t5gemma-2b-2b-ul2-it', device='cpu')
    load_model(adapter_model, 'adapter.safetensors')

    # Example prompt: booru tags mixed with natural language both work.
    text = '1girl, silver hair, school uniform, she smiles at the viewer'

    image = pipeline(
        None,  # the usual prompt argument is unused; conditioning comes from prompt_embeds
        prompt_embeds=encoder.encode(pipeline, text).to('cpu'),
        negative_prompt='bad quality, low quality, worst quality',
    ).images[0]
    image.save('preview.png')
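Conceptually, the encode step maps T5Gemma encoder hidden states into the 768-dimensional embedding space that SD1.5 cross-attention expects. Below is a minimal sketch under that assumption; the Adapter class, the hidden size, and the encode_prompt helper are illustrative stand-ins, not the actual gemma_encoder API.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class Adapter(nn.Module):
    # Hypothetical stand-in for the repo's adapter module.
    def __init__(self, in_dim=2304, out_dim=768):  # in_dim: assumed T5Gemma-2B hidden size
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, hidden_states):
        return self.proj(hidden_states)

def encode_prompt(text, adapter, model_id='google/t5gemma-2b-2b-ul2-it'):
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    inputs = tokenizer(text, return_tensors='pt')
    with torch.no_grad():
        # Run only the encoder stack of the encoder-decoder model.
        hidden = model.get_encoder()(**inputs).last_hidden_state  # (1, seq, in_dim)
    return adapter(hidden)  # (1, seq, 768), usable as prompt_embeds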
SD1.5 and Gemma
- Text conditioning with spatial positional encoding: the conditioning sequence can carry both image and text tokens, but this adapter was trained on text tokens only (similar to OneDiffusion, Qwen Image, etc.; see the sketch after this list).
- Supports long captions; the dataset emphasized a mix of booru tags and natural language.
- Unlike similar T5-conditioned models, you don't need to write a novel or run a second LLM to expand your prompt: it works with plain human-written text.
- Character appearance and actions are prioritized over the ImageNet-1K categories.
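A minimal sketch of what such spatial positional encoding could look like (an illustrative assumption, not this repo's actual implementation): image tokens receive 2D row/column positions, text tokens receive ordinary 1D sequence positions in the same embedding space, and only the text path is exercised during training.

import torch
import torch.nn as nn

class SpatialPositionalEncoding(nn.Module):
    # Hypothetical module: one embedding table per spatial axis for image
    # tokens, a 1D table for text tokens. Names and sizes are illustrative.
    def __init__(self, dim, max_h=64, max_w=64, max_text=512):
        super().__init__()
        self.row = nn.Embedding(max_h, dim)
        self.col = nn.Embedding(max_w, dim)
        self.txt = nn.Embedding(max_text, dim)

    def encode_text(self, tokens):
        # tokens: (batch, seq, dim); the only path used in this training setup
        pos = torch.arange(tokens.shape[1], device=tokens.device)
        return tokens + self.txt(pos)

    def encode_image(self, tokens, h, w):
        # tokens: (batch, h*w, dim), flattened row-major
        rows = torch.arange(h, device=tokens.device).repeat_interleave(w)
        cols = torch.arange(w, device=tokens.device).repeat(h)
        return tokens + self.row(rows) + self.col(cols)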
Datasets
- alfredplpl/artbench-pd-256x256
- anime-art-multicaptions (multicharacter interactions)
- danbooru2023-florence2-caption (verb, action clauses)
- spatial-caption
- SPRIGHT-T2I/spright_coco
- colormix (synthetic color, fashion dataset)
- trojblue/danbooru2025-metadata
Model tree for nightknocker/rosaceae-t5gemma-adapter
Base model: minaiosu/RabbitYourMajesty