TWC #6

SOTA updates between 5 Sept and 11 Sept 2022
- Conditional image generation
- Synthetic-to-real translation (domain adaptation from synthetic, or virtual, data to real data)
- Few-shot image classification (image classification with only a few examples per category, typically fewer than 6)
- Unsupervised object segmentation
- Lip reading
- Video object segmentation
This post is a consolidation of daily Twitter posts tracking SOTA changes.
Official code releases (with pre-trained models in most cases) are also available for these tasks.
#1 SOTA in conditional image generation on ImageNet 128x128

Paper: Entropy-driven Sampling and Training Scheme for Conditional Diffusion Generation
Submitted on 23 June 2022 (v1), last revised 23 Aug 2022 (v4). Code updated 6 Sept 2022
Github code released by Guangcong Zheng (author in paper) Model link: Pretrained models in Github page
Notes: This model proposes a solution to the vanishing-gradient problem that arises when a classifier is used to guide conditional image generation. It proposes an adaptive scaling method to recover conditional semantic guidance during sampling (see the sketch at the end of this entry). On the training side, it offers a solution to overconfident predictions on noisy data.
Model Name: ADM-G + EDS
Score (↓): 2.63 (Prev: 2.68)
Δ: 0.05 (Metric: FID)
Model links.
License: MIT license
Demo page link? None to date
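For intuition, here is a minimal PyTorch sketch of entropy-aware classifier guidance: the classifier-gradient term is rescaled using the entropy of the classifier's prediction on the noisy input, boosting guidance where it would otherwise fade out. The scaling rule and function signatures are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def entropy_scaled_guidance(classifier, x_t, t, y, base_scale=1.0):
    """Hypothetical sketch: rescale the classifier-guidance gradient by
    prediction entropy so guidance does not vanish at noisy timesteps."""
    x_t = x_t.detach().requires_grad_(True)
    logits = classifier(x_t, t)                  # classifier on noisy images
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Per-sample entropy, normalized to [0, 1] by the maximum log(C).
    entropy = -(probs * log_probs).sum(dim=-1)
    norm_entropy = entropy / torch.log(torch.tensor(float(logits.shape[-1])))
    # Gradient of the target-class log-probability w.r.t. the noisy input.
    selected = log_probs[torch.arange(len(y)), y].sum()
    grad = torch.autograd.grad(selected, x_t)[0]
    # Upweight guidance when the classifier is uncertain (high entropy),
    # which is where plain classifier guidance tends to vanish.
    scale = base_scale * (1.0 + norm_entropy).view(-1, *([1] * (x_t.dim() - 1)))
    return scale * grad
```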
#1 SOTA in Synthetic-to-Real Translation on GTAV-to-Cityscapes Labels dataset

Paper: CLUDA: Contrastive Learning in Unsupervised Domain Adaptation for Semantic Segmentation
Submitted on 27 Aug 2022 (v1). Code updated 13 Sept 2022
Github code released by user0407 Model link: models not released yet
Notes: The model performs unsupervised domain adaptation (UDA) for semantic segmentation by incorporating contrastive losses into a student-teacher learning paradigm that makes use of pseudo-labels generated on the target domain by the teacher network (see the sketch at the end of this entry).
Model Name: HRDA + CLUDA
Score (↑): 74.4 (Prev: 73.8)
Δ: 0.6 (Metric: mIoU)
Model links. Trained models not released yet.
License: Not specified
Demo page link? None to date
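As a rough illustration of the student-teacher pseudo-labeling plus pixel-contrastive idea, here is a hedged PyTorch sketch. The confidence threshold, the feature-sampling scheme, and the InfoNCE-style loss form are assumptions for illustration; CLUDA's actual losses and schedules differ in detail.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def make_pseudo_labels(teacher, target_images, threshold=0.9):
    """Teacher (typically an EMA of the student) predicts on unlabeled
    target-domain images; only confident pixels keep a label (-1 = ignore)."""
    probs = torch.softmax(teacher(target_images), dim=1)   # (B, C, H, W)
    conf, labels = probs.max(dim=1)
    labels[conf < threshold] = -1
    return labels

def pixel_contrastive_loss(feats, labels, temperature=0.1):
    """Illustrative InfoNCE-style loss over sampled pixel embeddings:
    pull together same-(pseudo-)class pixels, push apart the rest."""
    feats = F.normalize(feats, dim=1)            # (N, D) sampled pixel features
    sim = feats @ feats.t() / temperature        # (N, N) similarity matrix
    same = (labels[:, None] == labels[None, :]) & (labels[:, None] >= 0)
    not_self = ~torch.eye(len(feats), dtype=torch.bool)
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(~not_self, float('-inf')), dim=1, keepdim=True)
    pos = same & not_self
    denom = pos.sum(dim=1).clamp(min=1)          # avoid division by zero
    return -(log_prob * pos).sum(dim=1).div(denom).mean()
```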
#1 SOTA in Few-shot image classification on 8 datasets

Paper: Class-Specific Channel Attention for Few-Shot Learning
Submitted on 3 Sept 2022 (v1). Code updated 7 Sept 2022
Github code released by Ying-Yu Chen (author in paper) Model link: Trained models in Github repository.
Notes: This model addresses a central challenge of few-shot learning: the training and testing categories (the base vs. novel sets) can differ widely. It extends transfer-based methods by incorporating metric learning and channel attention. The approach learns to highlight the discriminative channels in each class (see the sketch at the end of this entry). Unlike general attention modules designed to learn global class features, the model aims to learn local, class-specific features while remaining computationally efficient.
Model Name: CSCA
Model links.
License: Not specified
Demo page link? None to date
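The class-specific channel attention idea can be illustrated with a small, hypothetical PyTorch module: each class gets its own learnable channel-weight vector that reweights the backbone feature maps. This is an assumed simplification for intuition, not CSCA's exact architecture.

```python
import torch
import torch.nn as nn

class ClassSpecificChannelAttention(nn.Module):
    """Illustrative sketch: one channel-attention vector per class,
    applied to backbone feature maps before pooling and comparison."""
    def __init__(self, num_classes, channels):
        super().__init__()
        # Learnable per-class channel weights (the "class-specific" part).
        self.attn = nn.Parameter(torch.ones(num_classes, channels))

    def forward(self, feats, class_idx):
        # feats: (B, C, H, W); class_idx: (B,) class of each support sample.
        weights = torch.sigmoid(self.attn[class_idx])   # (B, C) in (0, 1)
        return feats * weights[:, :, None, None]        # reweighted maps
```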
#1 SOTA in Unsupervised Object Segmentation on ClevrTex

Paper: Unsupervised multi-object segmentation using attention and soft-argmax
Submitted on 26 May 2022 (v1), last revised 31 Aug 2022 (v2). Code updated 8 Sept 2022
Github code released by Bruno Sauvalle (first author in paper) Model link: Trained models not released yet
Notes: The model performs unsupervised object-centric representation learning and multi-object detection and segmentation. It uses a translation-equivariant attention mechanism to predict the coordinates of the objects present in the scene (see the soft-argmax sketch at the end of this entry) and to associate a feature vector with each object. A transformer encoder handles occlusions and redundant detections, and a convolutional autoencoder is in charge of background reconstruction.
Model Name: AST-Seg-B3-CT
Score (↑): 79.58 (Prev: 66.62)
Δ: 12.96 (Metric: mIoU)
License: MIT license
Demo page link? None to date
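The soft-argmax readout that turns a spatial attention map into differentiable object coordinates is a standard construction and easy to sketch in PyTorch; the normalization of coordinates to [0, 1] is an assumption here.

```python
import torch

def soft_argmax_2d(heatmap):
    """Differentiable coordinate readout: the expected (x, y) position
    under a softmax distribution over a spatial attention map."""
    b, h, w = heatmap.shape
    probs = torch.softmax(heatmap.view(b, -1), dim=-1).view(b, h, w)
    ys = torch.linspace(0, 1, h, device=heatmap.device)
    xs = torch.linspace(0, 1, w, device=heatmap.device)
    y = (probs.sum(dim=2) * ys).sum(dim=1)   # marginal over rows
    x = (probs.sum(dim=1) * xs).sum(dim=1)   # marginal over columns
    return torch.stack([x, y], dim=-1)       # (B, 2) normalized coordinates
```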
#1 SOTA in Lip reading on the "Lip Reading in the Wild" dataset

Paper: Training Strategies for Improved Lip-reading
Submitted on 3 Sept 2022 (v1). Code updated 9 Sept 2022
Github code released by Pingchuan Ma (first author in paper) Model link: Trained models in Github page
Notes: This paper examines data augmentations, temporal models, and other training strategies for lip reading, such as self-distillation and using word boundary indicators. The authors find that Time Masking (TM) is the most important augmentation, followed by mixup, and that Densely-Connected Temporal Convolutional Networks (DC-TCN) are the best temporal model for lip reading of isolated words (see the time-masking sketch at the end of this entry). Self-distillation and word boundary indicators are also beneficial, but to a lesser extent. A combination of all the above methods yields a clear improvement in classification accuracy.
Model Name: 3D Conv + ResNet-18 + DC-TCN + KD (Ensemble)
Score (↑): 94.1 (Prev: 88.5)
Δ: 5.6 (Metric: Top-1 accuracy)
License: Non-commercial use license
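Time Masking, the augmentation the authors find most important, is simple to sketch: zero out a random span of consecutive frames, analogous to SpecAugment in speech recognition. The mask length and zero fill value below are illustrative choices, not the paper's exact settings.

```python
import torch

def time_mask(frames, max_mask_len=15, num_masks=1):
    """Illustrative time-masking augmentation: zero out random spans of
    consecutive frames in a (T, ...) video clip."""
    t = frames.shape[0]
    out = frames.clone()
    for _ in range(num_masks):
        span = int(torch.randint(0, max_mask_len + 1, (1,)))
        if span == 0 or span >= t:
            continue
        start = int(torch.randint(0, t - span + 1, (1,)))
        out[start:start + span] = 0.0   # masking with the mean is also common
    return out
```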
#1 SOTA in video object segmentation on 4 datasets

Paper: XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model
Submitted on 14 July 2022 (v1), updated 18 July 2022 (v2). Code updated 9 Sept 2022
Github code released by Ho Kei Cheng (first author in paper) Model link: Trained models not released yet
Notes: This paper introduces a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically uses only one type of feature memory; for videos longer than a minute, a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, this architecture incorporates multiple independent yet deeply-connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact, and thus sustained, long-term memory. The paper also introduces a memory potentiation algorithm that routinely consolidates actively used working-memory elements into the long-term memory (see the toy sketch at the end of this entry), which avoids memory explosion and minimizes performance decay in long-term prediction.
Model Name: XMem (BL30K, MS)
Datasets: YouTube-VOS (2018 & 2019), DAVIS (2016 & 2017)
Model links. Trained models not released yet.
License: GPL-3.0 license
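To make the consolidation idea concrete, here is a toy PyTorch sketch of a bounded working memory whose most-attended entries are "potentiated" into a smaller long-term store. This illustrates the principle only; XMem's actual data structures and readout are considerably more elaborate.

```python
import torch

class TwoStoreMemory:
    """Toy sketch of the multi-store idea: a bounded working memory whose
    most frequently attended entries survive consolidation."""
    def __init__(self, capacity=64, keep_top=16):
        self.capacity, self.keep_top = capacity, keep_top
        self.keys, self.values, self.usage = [], [], []

    def add(self, key, value):
        self.keys.append(key); self.values.append(value); self.usage.append(0.0)
        if len(self.keys) > self.capacity:
            self.consolidate()

    def read(self, query):
        keys = torch.stack(self.keys)                 # (N, D)
        attn = torch.softmax(keys @ query, dim=0)     # similarity weights
        for i, w in enumerate(attn.tolist()):         # track attention usage
            self.usage[i] += w
        return (attn[:, None] * torch.stack(self.values)).sum(dim=0)

    def consolidate(self):
        # Keep only the most-attended entries ("potentiation"), discarding
        # the rest instead of letting memory grow without bound.
        order = sorted(range(len(self.usage)), key=lambda i: -self.usage[i])
        keep = order[:self.keep_top]
        self.keys = [self.keys[i] for i in keep]
        self.values = [self.values[i] for i in keep]
        self.usage = [self.usage[i] for i in keep]
```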