Z-Image-Turbo - GPU Benchmark (RTX 5090, RTX Pro 6000, RTX 3090 (Ti))
https://redd.it/1pfyk0y
@rStableDiffusion
Auto-generate caption files for LoRA training with local vision LLMs
Hey everyone!
I made a tool that automatically generates .txt caption files for your training datasets using local Ollama vision models (Qwen3-VL, LLaVA, Llama Vision).
Why this tool over other image annotators?
Modern models like Z-Image or Flux need long, precise, and well-structured descriptions to perform at their best, not just a string of tags separated by commas.
The advantage of multimodal vision LLMs is that you can give them instructions in natural language to define exactly the output format you want. The result: much richer descriptions, better organized, and truly adapted to what these models actually expect.
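The tool's internals aren't shown here, but the core call is easy to picture. A minimal sketch, assuming the official `ollama` Python client and a pulled `llava` model; the instruction string below is illustrative, not one of the tool's actual presets:

```python
# Sketch: a natural-language instruction sent to a local Ollama vision
# model controls the caption's structure end to end.
# Requires `pip install ollama` and e.g. `ollama pull llava`.
import ollama

INSTRUCTION = (
    "Describe this image as a text-to-image training caption. "
    "Cover, in order: subject, composition, lighting, textures, atmosphere. "
    "Write flowing prose, not comma-separated tags."
)

def caption_image(image_path: str, model: str = "llava") -> str:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": INSTRUCTION, "images": [image_path]}],
    )
    return response["message"]["content"].strip()
```

Swapping the instruction is all it takes to go from prose captions to tag-style output, which is what the presets below boil down to.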
Built-in presets:
Z-Image / Flux: detailed, structured descriptions (composition, lighting, textures, atmosphere); the prompt uses the [official instructions](https://huggingface.co/spaces/Tongyi-MAI/Z-Image-Turbo/blob/main/pe.py) from Tongyi-MAI, the team behind Z-Image
Stable Diffusion: classic format with weight syntax (element:1.2) and quality tags
You can also create your own presets very easily by editing the config file.
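For context, here is how the batch step might look: one .txt sidecar file per image, the convention most LoRA trainers (e.g. kohya_ss) read. A sketch built on the hypothetical `caption_image()` above, not the repo's actual code:

```python
# Sketch: write a .txt caption next to each image in the dataset folder.
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def caption_dataset(folder: str, model: str = "llava") -> None:
    for image in sorted(Path(folder).iterdir()):
        if image.suffix.lower() not in IMAGE_EXTS:
            continue
        txt = image.with_suffix(".txt")
        if txt.exists():  # don't overwrite captions you've already edited
            continue
        txt.write_text(caption_image(str(image), model), encoding="utf-8")
        print(f"captioned {image.name}")

caption_dataset("dataset")
```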
Check out the project on GitHub: https://github.com/hydropix/ollama-image-describer
Feel free to open issues or suggest improvements!
https://redd.it/1pfusb3
@rStableDiffusion