Z-Image Turbo ControlNet 2.0 Released Just 9 Days After 1.0!?

Author: Z-Image.me · 2 min read
Tags: Z-Image, ControlNet, AI Image Generation, Inpainting, Alibaba, ComfyUI, Image Editing, Diffusion Models

Introduction

Recently, Alibaba has been making frequent moves in the image generation field. Shortly after renaming Z-Image Base (not Z-Image-Base, but Z-Image-Omni-Base), it hastily released Z-Image-Turbo-Fun-Controlnet-Union-2.0 on December 14th.

Notably, this comes just 9 days after the release of Z-Image-Turbo ControlNet Union 1.0, inevitably raising the question: what secrets lie behind such rapid iteration?

As outsiders, we can't know the exact details, but the update notes offer some clues. Without further ado, let's examine what's new:

Key Updates and Features

Version 2.0 emphasizes reliability and creativity. Here's what's inside:

  • Supported Control Modes: Handles standard inputs like Canny (edge detection for contours), HED (soft edges for artistic effects), Depth (3D structure from depth maps), Pose (human or object positioning), and MLSD (straight lines for architecture). These let you "condition" the AI: provide a rough sketch, for example, and the model generates a refined image that matches it (see the code example after this list).

  • Inpainting Mode: A major new addition! This allows you to mask and edit specific regions of an image (e.g., replace the background without changing the foreground). However, users note it sometimes blurs unmasked areas, so ComfyUI's masking tools help refine results.

  • Adjustable Parameters: Tune control_context_scale (recommended 0.65–0.90) to balance how strictly the AI follows your control inputs. Higher values need more inference steps (e.g., 20–40) to produce clean output; staying within the recommended range avoids over-control that distorts details. The example after this list shows where these knobs fit.

  • Training Foundation: Trained from scratch for 70,000 steps using 1 million high-quality images (a mix of general scenes and human-centric content). Uses 1328 resolution, BFloat16 precision, batch size 64, and learning rate 2e-5. The "Fun" name hints at its playful, creative focus, with a text dropout ratio of 0.10 to encourage diverse prompts.
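To make the control-mode and parameter bullets concrete, here is a minimal sketch of preparing a Canny control image in Python. The OpenCV and Pillow calls are standard; the file name input.jpg and the threshold values are illustrative, and the commented parameter block at the end simply restates the release notes' recommendations. Z-Image's actual loading API is not shown, since it depends on your workflow (e.g., which ComfyUI nodes you use).

```python
import cv2
from PIL import Image

# Build a Canny edge map -- the "contours" control input described above.
# Thresholds 100/200 are common defaults; tune them per image.
src = cv2.imread("input.jpg", cv2.IMREAD_GRAYSCALE)
edges = cv2.Canny(src, threshold1=100, threshold2=200)

# Most diffusion tooling expects a 3-channel control image.
control_image = Image.fromarray(edges).convert("RGB")
control_image.save("canny_control.png")

# The control image is then wired into the ControlNet alongside the prompt.
# In ComfyUI that is a Load Image -> Apply ControlNet chain; the knobs below
# are the ones the 2.0 release notes recommend:
#   control_context_scale = 0.8   # recommended range 0.65-0.90
#   num_inference_steps   = 30    # higher scale values need more steps (20-40)
```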

Comparison with Previous Version (1.0)

The previous version, Z-Image-Turbo-Fun-Controlnet-Union (commonly called 1.0), laid the foundation but had limitations. It was trained on a similar 1-million-image dataset for only 10,000 steps, with control layers added on just 6 blocks, which led to occasional retraining errors and slower loading times. Users often needed workarounds for effective control, and inpainting was unavailable.

In contrast, 2.0 feels like upgrading from a basic bicycle to a geared one: more layers (15 + 2 refinement) mean finer control, longer training improves quality, and inpainting opens new editing possibilities. It addresses all reported issues from 1.0, such as stability failures, while maintaining the same core controls. Extended training and refinement blocks improve detail preservation, especially in human poses or complex scenes, though custom training may require 24GB+ VRAM.
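On the inpainting point: the new mode works from a binary mask, and preparing one is straightforward. Below is a minimal sketch using Pillow. The white-edits/black-preserves convention is an assumption here (it is common in diffusion inpainting, but check your workflow's docs, since some pipelines invert it), and the file name portrait.jpg and the rectangle coordinates are placeholders.

```python
from PIL import Image, ImageDraw

# Build a binary inpainting mask: white = regenerate, black = preserve.
# This convention is assumed; some pipelines invert it.
image = Image.open("portrait.jpg")
mask = Image.new("L", image.size, 255)   # start fully editable (white)

# Protect a hypothetical foreground region; in practice you would paint
# the mask in ComfyUI's mask editor or derive it from a segmentation model.
draw = ImageDraw.Draw(mask)
draw.rectangle([200, 100, 600, 900], fill=0)

mask.save("inpaint_mask.png")
```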

| Dimension | Version 1.0 | Version 2.0 | Why It Matters |
|---|---|---|---|
| Training Steps | 10,000 | 70,000 | Longer training leads to more refined, realistic output with fewer artifacts. |
| Dataset Focus | 1M high-quality images (general) | 1M high-quality images (general + human-centric) | Better handling of people and poses, reducing common AI flaws like distorted hands. |
| Control Layers | Added on 6 blocks | Added on 15 layer blocks + 2 refinement blocks | Deeper integration for smoother control fusion, improving overall image coherence. |
| Inpainting Support | None | Full support with masking | Enables targeted editing, like fixing backgrounds, a game-changer for iterative design. |
| Resolution & Precision | Basic (unspecified) | 1328 resolution, BFloat16 precision | Higher resolution supports detailed generation; BFloat16 optimizes speed on modern GPUs. |
| Batch Size & Learning Rate | Not detailed | Batch size 64, learning rate 2e-5 | Efficient training on large datasets, translating to faster inference in practice. |
| Control Tuning | Basic strength adjustment | Adjustable control_context_scale (0.65–0.90), plus step recommendations | More user control over balance, avoiding over- or under-adherence to inputs. |
| Issues & Performance | Retraining errors, slow loading; requires skill to use | All reported issues resolved; slight loading tradeoff but better stability | Makes workflows like ComfyUI more reliable, with quick community fixes. |
| Hardware Notes | Lower requirements but under-optimized | Benefits from 8GB+ VRAM; not distilled (requires more steps) | Suitable for mid-range setups, but professionals can tune further. |
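For reference, the 2.0 training hyperparameters reported above can be gathered in one place. The dictionary below is purely illustrative (the key names are hypothetical, not an official API); only the values come from the published notes.

```python
# The 2.0 training hyperparameters from the release notes, collected into
# one illustrative config. Key names are hypothetical, not an official API.
train_config = {
    "training_steps": 70_000,       # vs. 10,000 for version 1.0
    "dataset_size": 1_000_000,      # general + human-centric images
    "resolution": 1328,
    "precision": "bfloat16",
    "batch_size": 64,
    "learning_rate": 2e-5,
    "text_dropout_ratio": 0.10,     # drops prompts during training for diversity
    "control_blocks": 15,           # layer blocks carrying control in 2.0
    "refinement_blocks": 2,
}
```

Reproducing this from scratch is out of reach for most setups; as noted above, even custom fine-tuning likely needs 24GB+ of VRAM.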

Summary

This upgrade brings improvements in quality and functionality, including support for Inpainting mode and longer training steps. It's an incremental update that addresses some issues from the previous version, such as training errors and slow loading, making the model more reliable for creative tasks. While performance is better, complex scenes (like hand poses) may still require manual optimization, and hardware requirements are relatively high.

It feels more like it should be called V1.1 or V1.5 than V2.0. My own speculation is that this burst of updates is aimed at a faster rollout of Z-Image-Omni-Base, using modular upgrades and distributed iteration to drive improvements in the unified model's capabilities.

Regardless, I hope Alibaba can maintain Z-Image's momentum, continuing to lower the barriers to AI so that more people can enjoy its convenience.
