RECOGNIZING FASHION STYLE USING COMPONENT-DEPENDENT CONVOLUTIONAL NEURAL NETWORKS


Fashion style recognition is important in online marketing applications. Several algorithms have been proposed, but their accuracy is still unsatisfactory. In this paper, we propose a method for improved fashion style recognition based on component-dependent convolutional neural networks (CD-CNNs). Since many fashion styles largely depend on the features of specific body parts or postures, we first obtain images of body parts and postures using semantic segmentation and pose estimation algorithms and then pre-train the CD-CNNs. Classification is performed by a support vector machine (SVM) on the concatenated outputs of the CD-CNNs. Experimental results on the HipsterWars and FashionStyle14 datasets show that our method is effective and improves classification accuracy, achieving 85.3% on HipsterWars and 77.7% on FashionStyle14, compared with 80.9% and 72.0%, respectively, for existing methods.

In pre-training, six CD-CNNs, one per component, are pre-trained. In prediction, feature vectors are extracted from the CD-CNNs and concatenated, then used for fashion style classification. We use an SVM as the classifier in all experiments.
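As a rough illustration of this prediction pipeline, the sketch below concatenates one feature vector per component and hands the result to a trained SVM. The cd_cnns mapping, its extract_features() method, and the component names are hypothetical stand-ins for the models described in this paper.

    import numpy as np

    COMPONENTS = ["whole_body", "clothes", "head", "limbs", "posture", "grayscale"]

    def predict_style(image, cd_cnns, svm):
        """Concatenate one feature vector per CD-CNN, then classify with the SVM."""
        feats = [cd_cnns[name].extract_features(image) for name in COMPONENTS]
        x = np.concatenate(feats)               # e.g. 6 x 4,096 = 24,576 dimensions
        return svm.predict(x.reshape(1, -1))[0]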

Creating component images from an input image

In this subsection, we describe how to create six component images from an input image. After the input image is resized while maintaining its aspect ratio, human body parts are extracted using DeepLabv3+ [19] and the joint human parsing and pose estimation network (JPPNet) [20]. JPPNet detects 19 human body elements: Hat, Hair, Glove, Sunglasses, UpperClothes, Dress, Coat, Socks, Pants, Jumpsuits, Scarf, Skirt, Face, LeftArm, RightArm, LeftLeg, RightLeg, LeftShoe, and RightShoe. Since JPPNet cannot detect Neck, we additionally use DeepLabv3+. As a result, we obtain 20 human body elements and create four component images by merging them: Whole body (all 20 elements); Clothes (UpperClothes, Dress, Coat, Jumpsuits, Skirt, Pants); Head (Hair, Hat, Sunglasses, Face); and Limbs (Glove, Socks, Scarf, LeftArm, RightArm, LeftLeg, RightLeg, LeftShoe, RightShoe, Neck). To make the posture image, we use Newell et al.'s method [21], which shows good performance in pose estimation. We also use a grayscale image as shape information; to make it, the R and B components of each pixel are replaced with the G component.
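The component images can be produced by masking the input with the merged segmentation labels. The following sketch assumes the segmentation output is a per-pixel integer label map and that id_of maps element names to those integers; the actual JPPNet and DeepLabv3+ label IDs may differ.

    import numpy as np

    CLOTHES = {"UpperClothes", "Dress", "Coat", "Jumpsuits", "Skirt", "Pants"}
    HEAD = {"Hair", "Hat", "Sunglasses", "Face"}
    LIMBS = {"Glove", "Socks", "Scarf", "LeftArm", "RightArm",
             "LeftLeg", "RightLeg", "LeftShoe", "RightShoe", "Neck"}

    def component_image(rgb, label_map, id_of, elements):
        """Keep pixels whose segmentation label is in `elements`; zero out the rest."""
        ids = [id_of[name] for name in elements]
        mask = np.isin(label_map, ids)          # boolean mask over the image grid
        return rgb * mask[..., None]            # broadcast the mask over RGB channels

    def grayscale_image(rgb):
        """Replace the R and B components with the G component, as described above."""
        g = rgb[..., 1]
        return np.stack([g, g, g], axis=-1)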

Pre-training CD-CNNs, feature extraction and concatenation, classifier

We use ResNet50 as a feature extractor for two reasons. First, the VGG feature extractor performed better than the handcrafted feature extractor [12] in the evaluation of [13]. Second, the ResNet50 feature extractor performed better than other deep learning architectures in the experiments of [7]. The default input image size of ResNet50 is 224×224, but we use 214×474, based on the average human area size in the HipsterWars and FashionStyle14 datasets after resizing images to a height of 500 pixels while maintaining the aspect ratio and cropping the human area. From this input size, we obtain a 4,096-dimensional feature through ResNet50. To pre-train the CD-CNNs, we start from the weights of ResNet50 trained on ImageNet [22] and then pre-train on the component images of each fashion style dataset. These CD-CNNs yield appropriate feature vectors for each component image, and the extracted feature vectors are concatenated.
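A minimal PyTorch sketch of pre-training one CD-CNN from ImageNet weights and then reading out its features is given below. The paper does not specify the head that produces the 4,096-dimensional feature from ResNet50's 2,048-dimensional pooled output, so the Linear(2048, 4096) layer here is an assumption; ResNet50's adaptive average pooling accommodates the 214×474 input.

    import torch
    import torch.nn as nn
    from torchvision import models

    def build_cd_cnn(num_styles):
        """Build one CD-CNN: ResNet50 initialized from ImageNet weights."""
        net = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
        net.fc = nn.Sequential(
            nn.Linear(2048, 4096),        # assumed head giving the 4,096-dim feature
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_styles),  # style logits used during pre-training
        )
        return net

    def extract_features(net, batch):
        """Return the 4,096-dim activations that feed the final style logits."""
        backbone = nn.Sequential(*list(net.children())[:-1])  # everything up to avgpool
        net.eval()
        with torch.no_grad():
            pooled = backbone(batch).flatten(1)   # 2,048-dim pooled ResNet50 features
            return net.fc[1](net.fc[0](pooled))   # Linear(2048 -> 4096) + ReLU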

We conducted evaluation experiments on the HipsterWars and FashionStyle14 datasets to examine the effectiveness of our method. In the HipsterWars experiment, we used the images within the top 50% of rating scores and evaluated 100 times, splitting the images into 9 : 1 (train : test) at random, as done in related studies [12, 13, 14].

In the FashionStyle14 experiment, we split all images into 6.5 : 3.5 (train : test) as in the related study [7]. That study did not shuffle the data, but we considered 10 random shuffles reasonable and necessary because this dataset contains 10 times as many images as HipsterWars. In both experiments, the training data was used to pre-train the CD-CNNs and to train the SVM. As in related studies [12, 13, 14], we carried out 5-fold cross-validation to find the best SVM parameters, using [23]. The test data was used to evaluate classification accuracy, and we compared average classification accuracy across all experiments. Table 1 summarizes the best feature combination results in average classification accuracy for each used component number (UCN) and dataset. We call the methods using the best feature combination for each component number (except five components) "1-component", "2-component", "3-component", "4-component", and "6-component", respectively. We call the best five-component feature combination for HipsterWars "5EL" and that for FashionStyle14 "5EP".
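The HipsterWars protocol above can be summarized as in the sketch below: 100 random 9 : 1 splits, with a 5-fold cross-validated grid search for the SVM parameters on each training split. The concatenated CD-CNN features X and labels y are assumed precomputed, and the parameter grid is illustrative rather than the one actually searched with [23].

    import numpy as np
    from sklearn.model_selection import GridSearchCV, train_test_split
    from sklearn.svm import SVC

    def evaluate_hipsterwars(X, y, n_runs=100):
        """Average test accuracy over 100 random 9:1 splits with a tuned SVM."""
        accuracies = []
        for seed in range(n_runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.1, stratify=y, random_state=seed)
            # 5-fold cross-validated grid search; the parameter grid is assumed.
            grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10, 100]}, cv=5)
            grid.fit(X_tr, y_tr)
            accuracies.append(grid.score(X_te, y_te))
        return float(np.mean(accuracies))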

HipsterWars dataset

The 1-component method is the same as that of Nakajima et al. (except for preprocessing and feature dimension), but its accuracy improved by approximately 2%; the differences in preprocessing and feature dimension account for this improvement. Our 4-component method achieved 85.3% accuracy, a 4.4% improvement over that of Nakajima et al. Our 5EL, 5EP, and 6-component methods also outperformed the other existing methods in classification accuracy.

FashionStyle14 dataset

Our 6-component method achieved 77.7% accuracy, a 5.7% improvement over that of Takagi et al. Unlike the HipsterWars results, all six components (whole body, clothes, head, limbs, posture, and grayscale) are important for this dataset. We attribute the difference in important components to the different fashion styles the two datasets cover. For example, the girlish style is similar to the gal style, but girlish outfits tend to incorporate sporty shoes, as shown in Fig. 1; thus, when discriminating these two styles, the limbs component (which includes shoes) helped improve accuracy.
