When I first delved into the Stable Diffusion paper, it felt like I had stumbled upon an enigmatic realm. This advanced technology, seemingly crafted by virtuosos with formidable hardware, held the uncanny power to conjure images. I couldn’t help but wonder if there were hints of quantum wizardry at play. Several months later, ControlNet emerged on the scene, revealing its internal intricacies, though it remained an arcane art to me. ControlNet, in all its mystique, allowed us to steer Stable Diffusion toward specific creative destinations – it was akin to wielding a form of digital sorcery. For a while, I held the belief that the original eight control nets were the handiwork of highly-versed AI experts, far beyond the grasp of a layperson like myself. However, the original paper provided a resounding clarity:
“The ControlNet learns task-specific conditions in an end-to-end way, and the learning is robust even when the training dataset is small (< 50k). Moreover, training a ControlNet is as fast as fine-tuning a diffusion model, and the model can be trained on personal devices.”
As a software developer, my AI knowledge barely scratched the surface, but recently, I embarked on a journey to train my own control net. How challenging could it be, I thought? To my pleasant surprise, the path became remarkably straightforward once certain setup pitfalls were sidestepped. I couldn’t contain my excitement as I witnessed the emergence of novel functionality for Stable Diffusion after a mere few hours of training, all on consumer-grade hardware, using a dataset of 50k images. With this setup guide, my aim is to inspire more individuals to take the plunge and explore this realm, perhaps even nurturing their own concepts for a control net.
In this tutorial, our mission is to train a control net using white-gray-black images, a creative endeavor aimed at directing Stable Diffusion to craft images with variations in light and darkness. Think of it as guiding the AI to produce intriguing squint illusion images or stylized QR codes. To achieve this, we’ll set up various components, download an image dataset, convert these images to the white-gray-black format, and then entrust our GPU to undertake the arduous task of training on 50k images, a journey spanning over 6 hours. We’ll be using SD 1.5 with fp16 for the sake of expediency, although the fundamental setup principles should remain applicable to SD 2.x and SDXL. It’s worth noting that I am relatively new to this endeavor, and while my approach indeed yields results, I might not possess answers to exceedingly intricate questions about control nets.
Before embarking on my maiden voyage in control net training, I harbored several pressing questions:
- Are there readily available tutorials on control net training? => Not many, but see the References section below. We’ll be following the HuggingFace ControlNet Training method.
- What’s the time investment for control net training? => Approximately 6 hours when employing an RTX 3060 for a functional outcome.
- Where can you acquire a suitable training dataset? => The dataset is thoughtfully provided and will be fetched automatically.
- How many images are necessary for a control net to function effectively? => A minimum of 25k for a proof of concept, over 50k for practical use, and 10 to 100 times that for remarkable outcomes (for context, Canny edge detection was trained on 3M images).
- Can you verify that your setup and training are effective at the earliest opportunity? => Yes, you can accomplish this within approximately 15 minutes with the first checkpoint.
- In case the process becomes too time-consuming, can you pause and resume it? => Absolutely, refer to the QA section below for guidance.
- Are there strategies to optimize performance and expedite training? => Indeed, this aspect is crucial and will be thoroughly addressed.
- Will it be compatible with ComfyUI or Automatic1111, or how can you test it? => ComfyUI works directly; Automatic1111 needs an additional conversion step. A straightforward inference script is also provided for testing without a UI.
Furthermore, it’s vital to recognize that the machine learning community predominantly relies on the Python ecosystem, and most tutorials assume a degree of familiarity with Python scripting using the provided libraries. This is noteworthy if you are transitioning from a user-oriented background accustomed to Automatic1111 or ComfyUI. Nonetheless, with fundamental command-line and programming skills, you should be well-equipped to set up control net training. While we await the evolution of a more user-friendly, one-click control net UI, rest assured that with determination and a modicum of guidance, you can traverse this path successfully, too!
For those eager to explore, my very first control net model can be found here: ControlNet Mysee – Light and Dark!
For this tutorial, you’ll need the following:
- Fundamental command-line and programming skills.
- Windows 10 operating system.
- Nvidia GPU with a minimum of 12GB VRAM (8GB is currently viable only on Linux with DeepSpeed).
- Approximately 100GB of available disk space.
- Around 30GB of internet download capacity.
- Approximately 5 hours of your time.
- Approximately 5 to 15 hours of compute time (utilizing an RTX 3060).
- A positive and upbeat attitude.
Prerequisites
Before we dive into the main setup, there are several items we need to download and install. Later in the process, we’ll also acquire a version of Stable Diffusion 1.5 with its sub-models stored separately (~5GB) and the laion2B-en aesthetics>=6.5 dataset (>20GB). If you have a slow internet connection, start these steps as early as possible, as some of them involve substantial downloads (indicated as “huge download”).
- Download LAION Parquet File (120MB): This file contains a lengthy list of URLs, captions, and other metadata for images, but not the actual images themselves. You can find it manually via LAION-Aesthetics V2 -> “6.5 or higher” huggingface -> Datasets -> Files and versions -> data/.
- Install CUDA (3GB): The CUDA Toolkit is NVIDIA’s platform for running computations on the GPU; PyTorch relies on it to accelerate model training. Download the version for Windows 10. Note that this is a “huge download” and may take some time.
- Install Miniconda (80MB): Miniconda is a minimal installer for conda, a package and environment manager that lets us install Python packages in isolated virtual environments.
- Install Git (60MB): Git is a widely-used software versioning tool. In this tutorial, we’ll primarily use it to download code repositories.
- Install Imagemagick CLI (40MB): ImageMagick is a popular command-line tool used for converting images and applying various filters.
- Optional – Install ParquetViewer (10MB): ParquetViewer is a viewer for parquet files, which are commonly used for image datasets in the machine learning community. You can also install the .NET Runtime (60MB) to run ParquetViewer.
Setup
Now that we have our prerequisites in place, let’s proceed with the setup steps for Diffusers train_controlnet.py. Follow these instructions to configure your environment:
- Launch the Anaconda shell by searching for “miniconda” in the Windows start menu.
- Create a Conda environment for your Python packages and install pip into it (you only need to do this once). Activate the environment before installing pip; otherwise pip ends up in the base environment:
conda update -n base -c defaults conda
conda create -n controlnet
conda activate controlnet
conda install pip
- Activate the environment again each time you restart the Anaconda shell, using the command:
conda activate controlnet
These steps lay the foundation for setting up your environment to train a control net and explore the exciting world of AI and image generation.
Activating the ControlNet Environment
Now that we’ve set up the Conda environment, we can proceed with the steps to set up diffusers and configure accelerate for performance optimization. These steps need to be done just once. Here’s how you can do it:
- Clone the Diffusers Repository:
git clone https://github.com/huggingface/diffusers
- Navigate to the Diffusers Directory:
cd diffusers
- Install Diffusers Locally:
pip install -e .
- Go to the ControlNet Example Directory:
cd examples\controlnet
- Install ControlNet Requirements:
pip install -r requirements.txt
- Configure Accelerate:
accelerate config
Answer all the questions during the configuration process. The default settings should suffice.
- Note: If you ever restart the Anaconda shell, you’ll need to navigate to the control net directory and activate the environment again using the following commands:
cd %USERPROFILE%\diffusers\examples\controlnet
conda activate controlnet
Next, let’s perform a test launch. This run is primarily for downloading and testing purposes; it involves a “huge download” and requires Miniconda and the diffusers repository from the previous steps.
It will verify that your environment is properly set up and ready for control net training.
Launching ControlNet Training
With the environment in place, here’s how to launch the control net training:
- Copy and paste the following command into your Anaconda shell:
accelerate launch train_controlnet.py ^
- Press [Enter] to execute the command.
You’ll notice some activity, indicating that the process has started. If you want to stop it at any point, press Control+C.
Here’s what just happened:
- Stable Diffusion 1.5 with sub-models separated was downloaded.
- The fusing/fill50k example training set was downloaded.
- A control net directory was created in %USERPROFILE%\diffusers\examples\controlnet\control-ini, but it’s currently empty.
- You might have experienced a surge of endorphins due to this exciting endeavor!
However, you may have noticed a few things:
- It shows “Device: cpu” even though we set it to use the GPU in accelerate.
- The estimated training time is extremely long (566 hours or approximately 24 days).
The reason for this is that the installed PyTorch package isn’t built with CUDA support, so the script falls back to the CPU. Additionally, we haven’t specified any optimizations in the arguments yet. For comparison, the same training took somewhere between 5 and 15 hours on my RTX 3060.
To address this, you should install PyTorch with CUDA support by following these steps:
- Visit the PyTorch website.
- Choose “Stable” as the build and “Windows” as the operating system.
- Select “Conda” as the package, “Python” as the language, and the latest CUDA version as the compute platform.
- Copy the install command the page generates and paste it into your Anaconda shell.
This step ensures that PyTorch is properly configured with CUDA.
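A quick way to confirm the fix is to ask PyTorch directly whether it can see the GPU. The following is a minimal sketch, not part of the original tutorial; the helper name describe_torch_device is made up for illustration, and it relies only on torch.cuda.is_available() and torch.cuda.get_device_name().

```python
def describe_torch_device() -> str:
    """Report whether the installed PyTorch build can use CUDA."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return "PyTorch is not installed in this environment"

    if torch.cuda.is_available():
        # Index 0 is the first GPU visible to PyTorch.
        return f"cuda: {torch.cuda.get_device_name(0)} (torch {torch.__version__})"
    # A CPU-only build reports torch.version.cuda as None.
    return f"cpu only (torch {torch.__version__}, CUDA build: {torch.version.cuda})"


if __name__ == "__main__":
    print(describe_torch_device())
```

If this still reports “cpu only” after the reinstall, the training script will most likely show “Device: cpu” again.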
First Launch with Optimizations
Now that PyTorch with CUDA is installed, let’s launch the control net training with all recommended optimizations for a GPU with 12GB VRAM. If you have a different setup, you can consult the HuggingFace ControlNet Training documentation for additional options.
First, install a few more dependencies (you only need to do this once):
- Install chardet and cchardet with Conda:
conda install chardet cchardet
- Install xformers and bitsandbytes-windows with pip:
pip install xformers bitsandbytes-windows
Then, initiate the optimized control net training with the following command:
accelerate launch train_controlnet.py ^
Execute this command, and your training process should be optimized for better performance.
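Because the command above is cut off after its first line, here is a hedged sketch of how such a launch could be assembled. It is not the author’s exact command. The flags are options of the diffusers train_controlnet.py script that the HuggingFace ControlNet training documentation suggests for a 12GB GPU; the model id, the learning rate, and the helper name build_launch_command are assumptions for illustration.

```python
def build_launch_command(
    output_dir="control-ini-fp16",
    model_id="runwayml/stable-diffusion-v1-5",  # assumed SD 1.5 repository id
):
    """Assemble an `accelerate launch` call with memory-saving flags."""
    return [
        "accelerate", "launch", "train_controlnet.py",
        f"--pretrained_model_name_or_path={model_id}",
        f"--output_dir={output_dir}",
        "--dataset_name=fusing/fill50k",
        "--resolution=512",
        "--learning_rate=1e-5",                          # assumed value
        "--mixed_precision=fp16",
        "--train_batch_size=1",
        "--gradient_accumulation_steps=4",               # effective batch size 4
        "--gradient_checkpointing",                      # trades compute for VRAM
        "--use_8bit_adam",                               # needs bitsandbytes
        "--enable_xformers_memory_efficient_attention",  # needs xformers
        "--set_grads_to_none",
        "--checkpointing_steps=500",
    ]


if __name__ == "__main__":
    # Print as a single line that can be pasted into the Anaconda shell.
    print(" ".join(build_launch_command()))
```

With checkpointing_steps set to 500, the first checkpoint lands in a checkpoint-500 folder inside the output directory, which is the one used for the setup validation below.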
- Validation for Setup: Before diving into validating your ControlNet idea, you should ensure that your setup is functioning correctly. To do this, you only need to run the first checkpoint. However, note that this validation does not check if your own ControlNet concept is working; it merely confirms that your setup is correct.
- Validation for ControlNet Ideas: To validate your own ControlNet idea, you will need to take additional steps. Unfortunately, the original ControlNet code requires either TensorBoard or Weights & Biases to store images, which introduces extra complexity. Therefore, in this tutorial, you’ll run inference manually to validate your ideas.
If you don’t have a Stable Diffusion UI, you can use the attached “inference.py” script. Open the script in your text editor, adjust the paths, and run it with Python from the Anaconda shell.
For ComfyUI users, you can copy the “diffusion_pytorch_model.safetensors” file from “control-ini-fp16\checkpoint-500\controlnet” to your ControlNet directory and test it.
For Automatic1111 users, you need to convert “.safetensors” files to “.pth” before testing. Refer to the QA section for more details on this.
Color Circles Example Control Net
You can assess the effectiveness of your ControlNet with the “color circles” example, using the following details:
- Prompt (https://www.daprompts.com/search/circles): “red circle with blue background”
- Steps: 20
- Sampler: UniPC
- CFG scale: 7.5
- Seed: 0
- Model: v1-5-pruned-emaonly.safetensors
- Batch size: 9 (to match “inference.py”)
The example demonstrates how ControlNet evolves in different steps. Key steps include 500, 4500, 6000, 8500, and 12500. Note that these numbers are only rough estimates and can vary based on your ControlNet concept.
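The settings listed above map onto a diffusers pipeline call roughly as follows. This is a sketch under assumptions, not the attached inference.py: the base model id and the function names are made up for illustration, and generating nine images in one call may need to be reduced on smaller GPUs.

```python
def inference_settings():
    """Sampler settings matching the color-circles example above."""
    return {
        "prompt": "red circle with blue background",
        "num_inference_steps": 20,
        "guidance_scale": 7.5,
        "num_images_per_prompt": 9,
    }


def run_inference(controlnet_dir, conditioning_image_path, seed=0):
    """Generate images with a trained ControlNet checkpoint (needs a CUDA GPU)."""
    # Heavy imports stay inside the function so the settings above can be
    # inspected on a machine without torch or diffusers installed.
    import torch
    from diffusers import (
        ControlNetModel,
        StableDiffusionControlNetPipeline,
        UniPCMultistepScheduler,
    )
    from diffusers.utils import load_image

    controlnet = ControlNetModel.from_pretrained(
        controlnet_dir, torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",  # assumed base model id
        controlnet=controlnet,
        torch_dtype=torch.float16,
    )
    # Swap in the UniPC sampler, reusing the pipeline's scheduler config.
    pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
    pipe = pipe.to("cuda")

    generator = torch.Generator(device="cuda").manual_seed(seed)
    result = pipe(
        image=load_image(conditioning_image_path),
        generator=generator,
        **inference_settings(),
    )
    return result.images  # list of PIL images
```

A call could look like run_inference("control-ini-fp16/checkpoint-500/controlnet", "conditioning_image_1.png"), where the image file name is a placeholder for your own conditioning image.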
FP32 vs. FP16 Comparison
For comparison, you can also check the final checkpoint with FP32 precision. In some cases, the FP16 result may appear better than FP32, which might be due to the seed or other factors. However, higher precision typically leads to better results with more training steps.
Uncanny Faces Control Net
Another example from the “Uncanny faces” tutorial is provided. In this case, it shows that after 10k steps, the ControlNet already appears to work well. With just one epoch, it converges to following poses without overfitting.
The tutorial also mentions attempts to use bf16 (bfloat16) precision, which didn’t work as expected. It resulted in different outcomes even with the same seed. Troubleshooting and optimizing training parameters can help overcome such issues.
To train your ControlNet, you’ll need an image dataset. In this tutorial, you’ll work with the “laion2B-en-aesthetics65” dataset, which has images with English captions and aesthetics scores of 6.5 or higher. This dataset is a subset of the larger “laion2B-en” dataset.
Remember that the full “laion2B-en” dataset is enormous, with parquet files totaling over 6.2TB. The subset you are using, “laion2B-en-aesthetics65,” contains 625K entries, with a parquet file size of around 120MB. This dataset will be sufficient for your ControlNet training.
To view the contents of a .parquet file, you can use the ParquetViewer, a tool designed for working with such files.
Using img2dataset for Downloading Images
To efficiently download images from the “laion2B-en-aesthetics65” dataset, you’ll utilize the img2dataset tool. This command-line tool can read parquet files and download a large number of image files in parallel while allowing you to apply various filters and transformations. Here are the steps to set up and use img2dataset:
- Install img2dataset: Open the Anaconda shell and run the following command to install img2dataset:
pip install img2dataset
This will make img2dataset available for use in your environment.
- Consider Download Sizes: Before proceeding, you should consider the download sizes involved. Stable Diffusion 1.5 was trained on 512×512 images, so ideally, you should use images with this exact size. However, finding a large number of images with this exact size can be challenging. You’ll need to decide whether to upscale smaller images, downscale larger ones, or automatically crop them, each of which may introduce slight errors.
- If you want to download all 625k images without filtering, expect an average of ~165kB per image, with around 15% skips (due to robots.txt, dead links, etc.). This will result in approximately 87GB of downloads and disk space usage.
- If you filter out some entries (e.g., small images, those with weird aspect ratios), you can reduce the dataset to around 110k images, requiring about 21GB of space.
- If you also apply image transformations immediately (e.g., resize, crop, reencode), you can save more disk space. However, keep in mind that you’ll lose the source images if you want to use different sizes or croppings later.
- Consider Image Formats: When downloading and processing the images, consider image formats and potential artifacts. For example, if you crop every image to 512×512, you might encounter artifacts when upscaling smaller images. It’s crucial to choose the right filters and transformations to minimize such issues.
- Downloading Images with img2dataset: To download images efficiently, use the img2dataset tool with the appropriate arguments. Here are some key arguments to consider:
--min_image_size 512: Filter for images with a minimum size of 512×512 pixels.
--max_aspect_ratio 2: Filter out images with aspect ratios greater than 2. This helps avoid cropping out too much of the image.
--resize_mode="center_crop": Apply a center crop to images, maintaining the aspect ratio.
These arguments will help you obtain a workable dataset that aligns with the size and cropping preferences used in training Stable Diffusion 1.5. Make sure to explore and understand all available img2dataset arguments for further customization.
By following these steps and carefully choosing your filters and transformations, you can efficiently download and preprocess the image dataset for your ControlNet training.
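The download-size estimate for the unfiltered dataset can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted above (625k entries, about 15% skips, about 165kB per image); the function name is made up for illustration.

```python
def estimate_download_gb(entries, skip_rate, avg_kb_per_image):
    """Estimate the total download size in GB (1 GB = 1,000,000 kB)."""
    downloaded = entries * (1 - skip_rate)  # images that actually arrive
    return downloaded * avg_kb_per_image / 1_000_000


full = estimate_download_gb(entries=625_000, skip_rate=0.15, avg_kb_per_image=165)
print(f"{full:.1f} GB")  # prints "87.7 GB", in line with the ~87GB quoted above
```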
Using img2dataset to Download Images Efficiently
With img2dataset installed as described above, the remaining steps are to pick the filter criteria, put the parquet file in place, and start the download:
- Determine Criteria: Decide on the criteria you want to use for downloading and preprocessing images:
--min_image_size 512: Specify a minimum image size of 512×512 pixels to avoid upscaling for SD 1.5.
--max_aspect_ratio 2: Set a maximum aspect ratio of 2 to accommodate both landscape and portrait images.
--max_image_area: If needed, you can further reduce downloads by specifying a maximum image area.
--image_size 512: In case larger images need to be downscaled to 512×512, use this option.
--processes_count: Set the number of processes to match your computer’s number of CPU cores or lower if you want to use your computer during the download.
- Move the .parquet File: Place the downloaded .parquet file from the laion2B-en-aesthetics65 dataset into the “diffusers\examples\controlnet” directory. Rename the file to “laion2b-en-aesthetics65.parquet”.
- Start the Download: Run the following command in your Anaconda shell to start the download (ensure you have img2dataset and the parquet file):
img2dataset ^
--url_list "./laion2b-en-aesthetics65.parquet" ^
--input_format "parquet" ^
--url_col "URL" ^
--caption_col "TEXT" ^
--output_format files ^
--output_folder laion2B-en-aesthetics65 ^
--processes_count 8 ^
--thread_count 24 ^
--min_image_size 512 ^
--max_aspect_ratio 2 ^
--image_size 512 ^
The tool will start downloading the images based on your specified criteria.
- Monitoring Progress: img2dataset doesn’t provide real-time progress updates. You should see image files populating the designated folder structure immediately. If you don’t see any images appearing, double-check your command-line arguments to ensure they match your criteria and file locations.
The downloaded images are written into numbered subfolders inside the output folder, each image accompanied by a .txt file containing its caption. This gives you a dataset that meets the size requirements of Stable Diffusion 1.5.
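img2dataset can also be driven from Python through its download function, which takes the same options as keyword arguments. The sketch below mirrors the options shown in the command above; the helper names are made up for illustration, and any option not shown in the command is left at its default.

```python
def download_kwargs(
    parquet_path="laion2b-en-aesthetics65.parquet",
    output_folder="laion2B-en-aesthetics65",
):
    """Keyword arguments mirroring the command-line options above."""
    return {
        "url_list": parquet_path,
        "input_format": "parquet",
        "url_col": "URL",
        "caption_col": "TEXT",
        "output_format": "files",
        "output_folder": output_folder,
        "processes_count": 8,   # lower this to keep the PC usable
        "thread_count": 24,
        "min_image_size": 512,  # skip images that would need upscaling
        "max_aspect_ratio": 2,  # skip very wide or very tall images
        "image_size": 512,      # target size after resizing
    }


def run_download():
    """Start the download; requires `pip install img2dataset`."""
    from img2dataset import download

    download(**download_kwargs())


if __name__ == "__main__":
    # Only prints the settings. Replace this line with run_download() to start
    # the real download; on Windows, keep that call inside this guard because
    # img2dataset starts worker processes.
    print(download_kwargs())
```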
After downloading the “laion2b-en-aesthetics65” dataset and filtering it based on your specified criteria, here’s an overview of the downloaded entries and their details:
- Total Entries: There are a total of 635,561 entries in the dataset.
- Entries with min_image_size 512: After filtering for a minimum image size of 512×512 pixels, there are 211,857 entries that meet this criterion.
- Entries with max_aspect_ratio 2: Among the images with a minimum size of 512×512, 208,140 have an aspect ratio of at most 2, making them suitable for Stable Diffusion 1.5.
- Entries with min_image_size 768: If you need larger images with a minimum size of 768×768 pixels, there are 83,224 entries in this category.
- Entries with min_image_size 1024: For even larger images with a minimum size of 1024×1024 pixels, there are 37,310 entries available.
- Largest Image Size: The largest image in the dataset has dimensions of 12,833×5,500 pixels.
- Text Captions: Every image in the dataset comes with a text caption, and there are no entries without captions. In cases where the .txt file is empty, it might indicate an issue with the entry, such as an encoding error.
- Hash Column: The hash column in the dataset is not intended for file verification purposes.
This analysis provides an overview of the dataset’s contents and how it aligns with the criteria you set for Stable Diffusion training. You now have a well-prepared dataset that meets the requirements of your training goals.
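Counts like the ones above can be reproduced from the width and height stored for each entry. The sketch below makes two assumptions: that the size filter compares the shorter side against min_image_size, and that the aspect ratio is the longer side divided by the shorter side. The function names and the sample sizes are made up for illustration.

```python
def passes_filters(width, height, min_image_size=512, max_aspect_ratio=2.0):
    """Return True if an image of this size would survive both filters."""
    shorter, longer = min(width, height), max(width, height)
    if shorter <= 0 or shorter < min_image_size:
        return False
    return longer / shorter <= max_aspect_ratio


def count_matching(sizes, **filters):
    """Count the (width, height) pairs that pass the filters."""
    return sum(1 for w, h in sizes if passes_filters(w, h, **filters))


# In practice the pairs would come from the parquet file, for example via
# pandas.read_parquet on its width and height columns (assumed to be named
# WIDTH and HEIGHT).
sample = [(512, 512), (1024, 400), (2000, 600), (768, 1024), (12833, 5500)]
print(count_matching(sample))                      # 2
print(count_matching(sample, min_image_size=768))  # 1
```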
Generating the Training Dataset for Control Net
To generate the training dataset for the Control Net, you’ll need to follow these steps:
Step 1: Create a 3-Color Palette
Generate a 3-color palette using ImageMagick. Run the following command:
magick convert -size 3x1 xc:black xc:fractal xc:white +append palette.png
This command creates a palette.png file containing three colors: black, mid-gray, and white (“fractal” is ImageMagick’s built-in name for a 50% gray).
Step 2: Convert Images to White-Gray-Black
Convert your images to white-gray-black using the generated palette. You can use ImageMagick for this purpose. Run this command for each image:
magick convert lenna.png -dither None -remap palette.png lenna_mysee.png
This command remaps the colors in lenna.png to the nearest colors in the palette.png, resulting in lenna_mysee.png.
You can either do this manually for all images or use the provided batch script (img2dataset_3_convert.bat).
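To make the remapping idea concrete, here is a small pure-Python illustration of nearest-color mapping onto a three-entry palette. It is not ImageMagick’s exact algorithm, only the underlying idea: each pixel is replaced by the palette color with the smallest squared RGB distance. The palette values assume that the mid-gray is (128, 128, 128), and the function names are made up for illustration.

```python
PALETTE = [(0, 0, 0), (128, 128, 128), (255, 255, 255)]  # black, mid-gray, white


def nearest_palette_color(pixel, palette=PALETTE):
    """Return the palette entry with the smallest squared RGB distance."""
    return min(palette, key=lambda c: sum((a - b) ** 2 for a, b in zip(pixel, c)))


def remap_pixels(pixels, palette=PALETTE):
    """Map every (r, g, b) pixel to its nearest palette color."""
    return [nearest_palette_color(p, palette) for p in pixels]


print(remap_pixels([(10, 20, 30), (120, 130, 110), (250, 240, 245), (200, 30, 30)]))
# [(0, 0, 0), (128, 128, 128), (255, 255, 255), (128, 128, 128)]
```

Note how the saturated red pixel ends up as gray: with this kind of remapping, brightness decides the outcome and hue is discarded.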
Step 3: Prepare the Directory Structure
Set up the directory structure for your dataset: one folder for the images, one folder for the conditioning images, and a train.jsonl file next to them.
Step 4: Create a train.jsonl File
You need to prepare a train.jsonl file that specifies where to find the images and their associated captions. Run the provided Python script (img2dataset_2_makejsonl.py) from the Anaconda shell to generate this file.
This script will go through all the images in the specified directory, fetch their corresponding text from the .parquet file, and create entries in train.jsonl. The captions may be sanitized to remove strange characters.
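For orientation, here is a hedged sketch of what such a generator could look like. It is not the provided img2dataset_2_makejsonl.py. It swaps the parquet lookup for the .txt caption files that sit next to the downloaded images, and it assumes a layout with images/*.jpg, conditioning_images/*.png, and the keys “text”, “image”, and “conditioning_image” as used by the fill50k example. All folder and function names are assumptions.

```python
import json
from pathlib import Path


def write_train_jsonl(dataset_dir, images="images", conditioning="conditioning_images"):
    """Write dataset_dir/train.jsonl with one JSON object per complete pair."""
    root = Path(dataset_dir)
    count = 0
    with open(root / "train.jsonl", "w", encoding="utf-8") as out:
        for image_path in sorted((root / images).glob("*.jpg")):
            cond_path = root / conditioning / (image_path.stem + ".png")
            caption_path = image_path.with_suffix(".txt")
            if not cond_path.exists() or not caption_path.exists():
                continue  # skip images without a conditioning image or caption
            caption = caption_path.read_text(encoding="utf-8", errors="replace")
            entry = {
                "text": caption.strip(),
                "image": f"{images}/{image_path.name}",
                "conditioning_image": f"{conditioning}/{cond_path.name}",
            }
            out.write(json.dumps(entry) + "\n")
            count += 1
    return count
```

The function returns the number of entries written, which makes it easy to compare against the number of images you expected to end up in the dataset.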
Step 5: Create a Python Dataset Object
The HuggingFace train_controlnet.py script expects a Python dataset object to be created manually. You can use the provided mydataset.py script, which is adapted from fill50k.py. This script defines a class and some functions that train_controlnet.py will use to generate examples and splits.
With these steps completed, you should have a well-prepared training dataset for your Control Net. This dataset includes white-gray-black images along with captions, and it’s ready for use in training.
Reflections on Control Net Training:
- Color Associations: Initially, the goal was to train the control net to link white with light areas and black with dark areas when guiding Stable Diffusion. While the Control Net seems effective, questions arise regarding whether this assumption holds true.
- Color Remapping: Using a simple color remapping strategy might warrant reconsideration. Exploring a broader color spectrum, including semi-light and semi-dark areas, may offer advantages. Moreover, distinctions exist between remapping RGB to grayscale and converting RGB to luminance, posing interesting psychological considerations.
- Optimal Training Steps: The optimal number of training steps remains uncertain. Evaluating the potential under-fitting or over-fitting of the model calls for generating test images at various checkpoints. Additionally, exploring tools like “wandb” could shed light on their relevance in the training process.
- Selective Area Manipulation: Experimenting with the selective removal of white and black areas could instruct Stable Diffusion to generate light and dark regions more flexibly. This approach aligns with methodologies employed in training inpainting models. Alternatively, focusing on “interesting” areas for training warrants clarification.
- Dataset Diversity: Diversifying the image dataset is a consideration. While the current dataset focuses on aesthetics, expanding it to include illustrations, screenshots, text, and more could enrich the Control Net’s versatility.
- Prompt Randomization: The practice of randomly substituting 50% of text prompts with empty strings during training appears valuable. This technique enhances the Control Net’s ability to interpret semantic content from condition maps, such as Canny edge maps or human scribbles. It acknowledges that the absence of a prompt encourages the encoder to glean semantics from input control maps.
In summary, these reflections underscore the importance of iterative experimentation and evaluation in refining the Control Net’s capabilities and its role in guiding Stable Diffusion effectively.
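The prompt randomization mentioned above is simple to express in code. The sketch below shows the idea in isolation; the function name is made up for illustration. The diffusers train_controlnet.py script exposes a comparable setting through its --proportion_empty_prompts option.

```python
import random


def maybe_drop_caption(caption, rng, proportion_empty=0.5):
    """Return "" with probability proportion_empty, else the caption unchanged."""
    return "" if rng.random() < proportion_empty else caption


rng = random.Random(0)  # seeded so the run is repeatable
captions = ["red circle with blue background"] * 1000
dropped = sum(1 for c in captions if maybe_drop_caption(c, rng) == "")
print(dropped / len(captions))  # close to 0.5
```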
References
- HuggingFace ControlNet Training Documentation: The most up-to-date tutorial by HuggingFace, featuring essential optimizations for training Control Nets.
- HuggingFace Uncanny Faces: This tutorial explores training your own Control Net on facial images, highlighting the importance of data quality in achieving good results and underscoring the need for caution when working with synthetic datasets.
- HuggingFace Ultra Fast ControlNet with Diffusers: A blog post providing insights into Control Net training with diffusers, offering practical guidance and tips.
- Original ControlNet Tutorial: An unoptimized and outdated tutorial, available on GitHub, that serves as the foundation for later developments in Control Net training.
- “Adding Conditional Control to Text-to-Image Diffusion Models” (Original ControlNet Paper): This paper introduces the ControlNet concept, explaining how conditional control is integrated into text-to-image diffusion models.
- “High-Resolution Image Synthesis with Latent Diffusion Models” (Original Stable Diffusion Paper): The original Stable Diffusion paper lays the groundwork for latent diffusion models in high-resolution image synthesis.
- “SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis” (Stable Diffusion XL Paper): An extension of the Stable Diffusion framework, this paper focuses on improving latent diffusion models for high-resolution image synthesis.
- “Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer” (Original Midas Paper): This paper explores monocular depth estimation and the use of diverse training datasets, including 3D films.
- ControlNet 1.0 Repository: The repository for ControlNet 1.0, which contains resources and code for the earlier version of Control Net.
- ControlNet 1.1 Repository: The repository for ControlNet 1.1, an updated version of the Control Net with improvements and optimizations.
- ControlNet for SDXL: A specialized Control Net repository tailored for Stable Diffusion XL.
- Stable Diffusion 1.5 Model Card: Information and details about the Stable Diffusion 1.5 model, providing insights into its specifications and usage.
- laion-aesthetics: An image database used for various experiments and training involving aesthetics.
- Exploring 12 Million of the 2.3 Billion Images Used to Train Stable Diffusion’s Image Generator: A resource for understanding the massive image dataset employed in training Stable Diffusion’s image generator.
- CLIP Retrieval: References to resources related to CLIP retrieval techniques.
- https://www.daprompts.com/model-stable-diffusion: Resources and discussions related to identifying whether a model has been trained on specific data.
- ControlNet GitHub Issues: A GitHub repository for reporting and discussing issues related to ControlNet.
- ControlNet GitHub Discussions: A space on GitHub for engaging in discussions about ControlNet.
- ControlNet v1.1 Nightly GitHub Issues: GitHub repository for tracking and discussing issues specifically related to ControlNet version 1.1 nightly builds.
- ControlNet v1.1 Nightly GitHub Discussions: Discussions and collaboration related to ControlNet version 1.1 nightly builds on GitHub.
- Diffusers GitHub Issues: GitHub repository for addressing issues and inquiries related to the use of diffusers in training.
These references serve as valuable sources for tutorials, papers, repositories, image databases, and additional support to aid in the exploration and understanding of ControlNet and related concepts.