
Fashion Segmentation Datasets and Their Common Problems

In this technical blog post, we will explore the most common datasets used for clothing segmentation and human parsing. We will review their key characteristics and highlight issues to watch out for when working with them.

Written by Dan Bochman | April 10, 2024


Introduction

Many fashion-related technologies and applications start with an important fundamental step: Clothes Segmentation. Whether the goal is to classify a garment, compose a look, or recommend a matching item, it is first necessary for an algorithm to have a good distinction between the clothing item itself and its surroundings.

At FASHN, clothes segmentation is also an important first step for performing virtual try-ons.

ex1

Fashion model with masked top, segmented top to try-on, and try-on result. Original model and top images from boohoo.com.

The state-of-the-art approaches for clothes segmentation today are based on neural networks, which require labeled training data to learn this task. In this blog post, we will discuss the datasets commonly used to train these neural networks and what to watch out for when working with them.

Human Parsing

Before we dive into the datasets, it will be helpful to introduce a term you may not be familiar with: Human Parsing.

Although clothes segmentation exists as a standalone task, it is more commonly performed within the broader context of human parsing. This approach not only segments clothing but also labels other body parts such as the face, hair, arms, and legs.

ex2

Human parsing example from the Clothing Co-Parsing dataset.

Common Datasets

There are many publicly available datasets for segmentation. When looking at some of the most popular clothing segmentation or human parsing models currently in use (e.g. SCHP, U2NET, and SegFormer), the most commonly used training datasets are:

  1. ATR (Active Template Regression):

    Developed at Sun Yat-sen University, specifically by its Human Cyber Physical Intelligence Integration Lab, the ATR dataset includes ~17k fashion-focused images with human parsing annotations (i.e. both clothing and body part labels).

  2. LIP (Look Into Person):

    Originating from the same lab at Sun Yat-sen University, the LIP dataset is another human parsing dataset with ~50k images. Unlike ATR, the images are not necessarily fashion-focused; instead, they capture people in a wide range of everyday scenarios, with annotations for both their clothing and body parts.

  3. iMaterialist:

    Introduced at the FGVC6 workshop during CVPR 2019, the iMaterialist dataset was a collaborative effort by organizations including Google and Kaggle. It contains ~45k fashion-focused images with rich clothing segmentation annotations, which also include garment attributes such as sleeves, zippers, and pockets. However, it does not contain any human body part labels.

ATR

The ATR dataset overall has high-quality fashion-related images; however, the same cannot be said about the quality of its annotations.

ex3

Samples from the ATR dataset. Lots of love for K-Pop idols ❤️

When working with the ATR dataset, there are 2 main issues to consider:

  1. Annotation “holes”

  2. Annotation labels “spilling” into each other

Annotation “holes”

The ATR dataset has a consistent problem of “holes”, i.e. background-labeled pixels (value = 0) interleaved along the edges and even in the middle of other label masks.

ex4

An image from the ATR dataset, side-by-side with its corresponding segmentation annotations.

Zooming in on the mask image:

ex5

Enlarged mask with holes circled in red.

These false labels can significantly hurt the training of a deep neural network. Our primary goal here is to highlight the issue rather than suggest a solution. However, as an example, these holes can be addressed using image processing techniques such as morphological closing or an iterative nearest neighbors analysis. The choice of method is up to you.
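
As a minimal sketch of the morphological closing option, assuming integer label masks where 0 is the background class and OpenCV is available (an illustration, not the exact procedure used internally at FASHN):

```python
import cv2
import numpy as np

def fill_mask_holes(mask: np.ndarray, kernel_size: int = 5) -> np.ndarray:
    """Fill small background-labeled "holes" inside each class region of an
    integer label mask (0 = background) using per-class morphological closing."""
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (kernel_size, kernel_size))
    filled = mask.copy()
    for label in np.unique(mask):
        if label == 0:  # never fill the background class itself
            continue
        binary = (mask == label).astype(np.uint8)
        closed = cv2.morphologyEx(binary, cv2.MORPH_CLOSE, kernel)
        # Only claim pixels that are currently background, so that
        # neighboring class labels are never overwritten.
        holes = (closed == 1) & (filled == 0)
        filled[holes] = label
    return filled
```

Larger kernels close bigger holes but also risk bridging genuinely separate regions, so the kernel size is worth tuning against a few inspected samples.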

Labels “spilling”

Another recurring problem in the ATR dataset is labels that extend beyond the boundaries of the objects they describe, spilling into the background or into neighboring labels.

ex6

Let us have a closer look:

ex7

Enlarged mask with annotated instances of label spillage.

As you can see, the previous issue of “holes” is also prominent in this mask. In addition, this particular example shows a severe lack of positional accuracy in the mask labels.

LIP

The LIP dataset is as random as it gets, appearing to be a web crawl for images of people from the early days of the internet. Coming from the same team that curated the ATR dataset, it unfortunately shares the same problems, along with a few additional ones.

ex8

Samples from the LIP dataset.

When working with the LIP dataset, there are 4 main issues to consider:

  1. Same issues as the ATR dataset

  2. Wrong and inconsistent labels

  3. Problems caused by the team’s cropping mechanism

  4. Inappropriate images

“Holes” and “Spillage”

The same issues of mask integrity and precision that plague the ATR dataset are also present in the LIP dataset.

ex9

An image from the LIP dataset, side-by-side with its corresponding segmentation annotations.

Wrong and Inconsistent Labels

As with the ATR dataset, the LIP dataset includes labels such as Left Arm, Right Arm, Left Leg, and Right Leg. However, in LIP, it is common to find regions that should be labeled Upper-clothes or Pants incorrectly labeled as body parts.

ex10

LIP sample where Pants and Upper-clothes regions are mislabeled as body parts. A recurring pattern.

Aggressive Crops

The LIP dataset attempts to feature only one person per image. It seems that in the process of creating the dataset, the team took images containing multiple people and cropped them to bounding boxes around individual subjects. This often results in instances where:

  1. Images are only slightly cropped, becoming borderline duplicates of other images.

  2. There are still multiple (unlabeled) people visible in the image.

  3. The crops are too aggressive, leading to loss of important details.

ex11

Pairs of images cropped from the same source image in the LIP dataset.

Questionable Ethics

The ATR dataset consists of images that were clearly meant for public distribution, such as celebrity images, magazine covers, and fashion photoshoots. In contrast, the LIP dataset is composed of images that, while found online, appear to be intended for personal use, such as those shared among friends and family.

An even more glaring issue, immediately apparent upon browsing the dataset, is that a significant portion of it includes minors - from teens down to toddlers. For these reasons, FASHN has opted out of using this dataset for any purpose.

iMaterialist

The iMaterialist dataset represents a significant improvement in terms of both image and annotation quality. The images were collected in highly relevant fashion contexts, such as studio photoshoots, runway shows, and street fashion photography.

ex12

When working with the iMaterialist dataset, one should be aware of the following issues:

  1. Multi-person images with only 1 person labeled

  2. Lack of body part labels (for human parsing)

Multi-Person Images

A notable drawback of this otherwise high-quality dataset is the frequency of images in which multiple people appear but only one of them is annotated.

ex13

An example from the iMaterialist dataset where two people are equally visible in the image, but only the woman on the left is annotated.

ex14

An example from the iMaterialist dataset of an annotated woman posing in a crowd of unlabeled people.

Given the frequency of these occurrences in the dataset, it's essential to establish a mechanism to handle these images before training a segmentation model on these image-annotation pairs (a simple filtering sketch follows the list below):

  • Instances where similarly sized people appear in the image account for approximately 6% of the dataset.

  • Instances where there are visible people in the background behind the annotated person account for more than 10% of the dataset.
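
One possible mechanism, sketched below, is to run an off-the-shelf person detector over the dataset and flag image-annotation pairs in which more than one sufficiently large, confident person detection is found. The thresholds, and what to do with flagged images (drop them, mask out the extra people, or re-annotate), are left to you; torchvision's pre-trained Faster R-CNN is used here purely for illustration, not as FASHN's actual pipeline.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

# Pre-trained COCO detector; class id 1 corresponds to "person".
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()
PERSON_CLASS_ID = 1

def count_people(image_path: str, score_thresh: float = 0.8,
                 min_rel_area: float = 0.05) -> int:
    """Count detected people whose bounding box covers at least
    min_rel_area of the image (to ignore tiny background figures)."""
    image = Image.open(image_path).convert("RGB")
    with torch.no_grad():
        output = detector([to_tensor(image)])[0]
    img_area = image.width * image.height
    count = 0
    for box, label, score in zip(output["boxes"], output["labels"], output["scores"]):
        if label.item() != PERSON_CLASS_ID or score.item() < score_thresh:
            continue
        x1, y1, x2, y2 = box.tolist()
        if (x2 - x1) * (y2 - y1) / img_area >= min_rel_area:
            count += 1
    return count

# Flag samples with more than one prominent person for special handling:
# if count_people("sample.jpg") > 1: ...
```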

No Body Part Labels

This is not a problem per se, as it reflects a deliberate choice by the dataset's creators. However, it is unfortunate for those who wish to enhance their human parsing models with this dataset. For the human parsing use case, the missing body part labels must be complemented, perhaps with the help of another segmentation model.
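
As a rough sketch of what that complementing step could look like: obtain body part masks from a separate human parsing model, then overlay the iMaterialist clothing annotations on top of them so that the clothing labels take precedence. The label ranges and offset below are hypothetical; they would need to be mapped to whatever taxonomy your models actually use.

```python
import numpy as np

# Hypothetical offset that keeps clothing ids in a separate range from
# the body part ids produced by the human parsing model.
CLOTHING_OFFSET = 100

def merge_parsing_masks(clothing_mask: np.ndarray, body_mask: np.ndarray) -> np.ndarray:
    """Overlay clothing annotations (treated as the more reliable source here)
    on top of body part predictions; body part labels only keep pixels that
    are not covered by any clothing label."""
    assert clothing_mask.shape == body_mask.shape
    merged = body_mask.copy()
    clothed = clothing_mask > 0
    merged[clothed] = clothing_mask[clothed] + CLOTHING_OFFSET
    return merged
```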

Closing Words

Clothing segmentation and human parsing models trained on the previously mentioned datasets are often used out-of-the-box in application pipelines or as a pre-processing step to train other models. Given the prevalence of these models, we at FASHN deemed it necessary to take a good look under the hood and were quite surprised by what we discovered.

At FASHN, we take great care to perfect the quality of our datasets. Addressing these issues has significantly improved the performance of our internal segmentation models.