TWC #6

State-of-the-art papers with GitHub avatars of researchers who released code, models (in most cases), and demo apps (in a few cases) along with their papers. Image created from the papers described below

SOTA updates between 5 Sept – 11 Sept 2022

  • Conditional image generation
  • Synthetic-to-real translation (the task of domain adaptation from synthetic (or virtual) data to real data)
  • Few-shot image classification (the task of classifying images with only a few examples per category, typically < 6)
  • Unsupervised object segmentation
  • Lip reading
  • Video object segmentation

This post is a consolidation of daily Twitter posts tracking SOTA changes.

Official code releases (with pre-trained models in most cases) are also available for these tasks.


#1 SOTA in conditional image generation on ImageNet 128x128

Paper:  Entropy-driven Sampling and Training Scheme for Conditional Diffusion Generation

Submitted on 23 June 2022 (v1), last revised 23 Aug 2022 (v4). Code updated 6 Sept 2022

GitHub code released by Zheng Guang Cong (author of the paper). Model link: pretrained models on the GitHub page

Notes:  This model proposes a solution to the vanishing-gradient problem that arises when a classifier is used to guide conditional image generation. It introduces a scaling method to adaptively recover conditional semantic guidance. On the training side, it offers a solution to overconfident predictions on noisy data.
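
To make the guidance idea concrete, here is a minimal, hedged sketch of classifier-guided diffusion sampling with an entropy-aware scale. The specific scaling rule, the `classifier(x_t, t)` interface, and all parameter names are illustrative assumptions, not the paper's exact formulation; the point is only that the guidance gradient can be rescaled adaptively when the classifier becomes uncertain on noisy inputs.

```python
# Hedged sketch: one classifier-guidance gradient with an entropy-aware scale.
# `classifier` is any timestep-conditioned classifier returning logits; the
# (1 + entropy) rule below is an illustrative assumption, not the paper's formula.
import torch
import torch.nn.functional as F

def entropy_scaled_guidance(classifier, x_t, t, y, base_scale=1.0):
    """Return a class-conditional guidance gradient for noisy images x_t."""
    x_t = x_t.detach().requires_grad_(True)
    logits = classifier(x_t, t)
    log_probs = F.log_softmax(logits, dim=-1)
    probs = log_probs.exp()
    # Normalised prediction entropy in [0, 1]; 1 means a uniform (uninformative) prediction.
    entropy = -(probs * log_probs).sum(dim=-1) / torch.log(
        torch.tensor(float(logits.shape[-1])))
    selected = log_probs[torch.arange(len(y)), y].sum()
    grad = torch.autograd.grad(selected, x_t)[0]
    # Boost guidance when the classifier is uncertain, so conditional
    # semantic guidance does not vanish on very noisy inputs.
    scale = base_scale * (1.0 + entropy).view(-1, *([1] * (x_t.dim() - 1)))
    return scale * grad
```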

Model Name: ADM-G + EDS

Score (FID): 2.63 (Prev: 2.68)

Δ: 0.05 (Metric: FID)

Model links.  Pretrained models available on the GitHub page.

License: MIT license

Demo page link? None to date


#1 SOTA in Synthetic-to-Real Translation on GTAV-to-Cityscapes Labels dataset

Paper:  CLUDA : Contrastive Learning in Unsupervised Domain Adaptation for Semantic Segmentation

Submitted on 27 Aug 2022 (v1). Code updated 13 Sept 2022

GitHub code released by user0407. Model link: models not released yet

Notes:  The model performs unsupervised domain adaptation (UDA) for semantic segmentation by incorporating contrastive losses into a student-teacher learning paradigm that makes use of pseudo-labels generated from the target domain by the teacher network.
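
For orientation, below is a hedged sketch of the student-teacher pseudo-labelling loop that this family of UDA methods builds on: an EMA teacher produces pseudo-labels on the target domain, and the student is trained on source labels plus those pseudo-labels. CLUDA's contrastive losses are not shown, and the function and parameter names are assumptions for illustration only.

```python
# Hedged sketch of a student-teacher UDA step with target-domain pseudo-labels.
# `student`/`teacher` are segmentation networks returning (B, C, H, W) logits;
# CLUDA's contrastive losses are omitted here.
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_teacher(teacher, student, ema=0.999):
    # The teacher is an exponential moving average of the student's weights.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(ema).add_(p_s, alpha=1.0 - ema)

def uda_step(student, teacher, source_img, source_lbl, target_img, optimizer):
    # Supervised loss on the labelled source domain.
    loss = F.cross_entropy(student(source_img), source_lbl)
    # Pseudo-labels on the unlabelled target domain come from the teacher.
    with torch.no_grad():
        pseudo = teacher(target_img).argmax(dim=1)
    loss = loss + F.cross_entropy(student(target_img), pseudo)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    update_teacher(teacher, student)
    return loss.item()
```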

Model Name:  HRDA + CLUDA

Score (mIoU): 74.4 (Prev: 73.8)

Δ: 0.6 (Metric: mIoU)

Model links.  Trained models not released yet.

License: Not specified

Demo page link? None to date


#1 SOTA in few-shot image classification on 8 datasets

Paper:  Class-Specific Channel Attention for Few-Shot Learning

Submitted on 3 Sept 2022 (v1). Code updated 7 Sept 2022

GitHub code released by Ying-Yu Chen (author of the paper). Model link: trained models in the GitHub repository

Notes:  This model addresses a core challenge of few-shot learning: the training and testing categories (the base vs. novel sets) can be largely diversified. It extends transfer-based methods by incorporating metric learning and channel attention, learning to highlight the discriminative channels in each class. Unlike general attention modules designed to learn global class features, the model aims to learn local, class-specific features with efficient computation.
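
A rough sketch of the general idea (class-specific channel attention re-weighting features before comparison to class prototypes) is given below. The module, its squeeze-and-excite style attention branches, and all names are assumptions for illustration; this is not the paper's exact CSCA architecture.

```python
# Illustrative module: per-class channel attention applied before comparing a
# query feature to each class prototype. Not the paper's exact CSCA design.
import torch
import torch.nn as nn

class ClassChannelAttention(nn.Module):
    def __init__(self, num_classes, channels, reduction=4):
        super().__init__()
        # One small squeeze-and-excite style attention branch per class.
        self.attn = nn.ModuleList([
            nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())
            for _ in range(num_classes)])

    def forward(self, feats, prototypes):
        # feats: (B, C) query features; prototypes: (num_classes, C) class means.
        logits = []
        for c, attn in enumerate(self.attn):
            w = attn(prototypes[c])               # class-specific channel weights
            # Negative distance between re-weighted query and prototype as class score.
            logits.append(-((feats * w - prototypes[c] * w) ** 2).sum(dim=1))
        return torch.stack(logits, dim=1)         # (B, num_classes)
```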

Model Name:  CSCA

Model links.  Trained models available in the GitHub repository.

License: Not specified

Demo page link? None to date


#1 SOTA in Unsupervised Object Segmentation on ClevrTex

Paper:  Unsupervised multi-object segmentation using attention and soft-argmax

Submitted on 26 May 2022 (v1), last revised 31 Aug 2022 (v2). Code updated 8 Sept 2022

GitHub code released by Bruno Sauvalle (first author of the paper). Model link: trained models not released yet

Notes:  The model performs unsupervised object-centric representation learning and multi-object detection and segmentation by using a translation-equivariant attention mechanism to predict the coordinates of the objects present in the scene and associate a feature vector with each object. A transformer encoder handles occlusions and redundant detections, and a convolutional autoencoder is in charge of background reconstruction.
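
The soft-argmax step, which turns a spatial attention map into differentiable object coordinates, can be sketched as follows. The normalisation and temperature are assumptions; only the mechanism itself is shown.

```python
# Minimal sketch of a 2-D soft-argmax: the expected (x, y) coordinate under the
# attention distribution, which stays differentiable for end-to-end training.
import torch

def soft_argmax_2d(attn, temperature=1.0):
    """attn: (B, H, W) unnormalised attention maps -> (B, 2) coordinates in [0, 1]."""
    b, h, w = attn.shape
    probs = torch.softmax(attn.view(b, -1) / temperature, dim=-1).view(b, h, w)
    ys = torch.linspace(0, 1, h, device=attn.device)
    xs = torch.linspace(0, 1, w, device=attn.device)
    y = (probs.sum(dim=2) * ys).sum(dim=1)   # marginal over columns, expectation over rows
    x = (probs.sum(dim=1) * xs).sum(dim=1)   # marginal over rows, expectation over columns
    return torch.stack([x, y], dim=1)
```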

Model Name:  AST-Seg-B3-CT

Score (mIoU): 79.58 (Prev: 66.62)

Δ: 12.96 (Metric: mIoU)

License: MIT license

Demo page link? None to date


#1 SOTA in lip reading on the "Lip Reading in the Wild" dataset

Paper:  Training Strategies for Improved Lip-reading

Submitted on 3 Sept 2022 (v1). Code updated 9 Sept 2022

GitHub code released by Pingchuan Ma (first author of the paper). Model link: trained models on the GitHub page

Notes:  This paper examines several augmentation strategies, temporal models, and other training strategies for lip reading, such as self-distillation and word boundary indicators. They find that Time Masking (TM) is the most important augmentation, followed by mixup, and that Densely-Connected Temporal Convolutional Networks (DC-TCN) are the best temporal model for lip reading of isolated words. Using self-distillation and word boundary indicators is also beneficial, but to a lesser extent. A combination of all the above methods results in an improvement in classification accuracy.
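
As a concrete example of the most important augmentation, here is a minimal sketch of Time Masking for a video clip: one random contiguous span of frames is replaced during training. The span length and the choice to fill with the clip's mean frame are assumptions, not the paper's exact settings.

```python
# Sketch of Time Masking for a video clip: one random contiguous span of frames
# is replaced with the clip's mean frame (span length and fill value are assumptions).
import torch

def time_mask(frames, max_span=15):
    """frames: (T, C, H, W) float tensor; masks one random temporal span in-place."""
    t = frames.shape[0]
    span = int(torch.randint(1, max_span + 1, (1,)))
    if span >= t:
        return frames
    start = int(torch.randint(0, t - span, (1,)))
    frames[start:start + span] = frames.mean(dim=0, keepdim=True)
    return frames
```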

Model Name:  3D Conv + ResNet-18 + DC-TCN + KD (Ensemble)

Score (Top-1 accuracy): 94.1 (Prev: 88.5)

Δ: 5.6 (Metric: Top-1 accuracy)

License: Non-commercial use license


#1 SOTA in video object segmentation on 4 datasets

Paper:  XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model

Submitted on 14 July 2022 (v1), last revised 18 July 2022 (v2). Code updated 9 Sept 2022

GitHub code released by Ho Kei Cheng (first author of the paper). Model link: trained models not released yet

Notes:  This paper introduces a video object segmentation architecture for long videos with unified feature memory stores inspired by the Atkinson-Shiffrin memory model. Prior work on video object segmentation typically uses only one type of feature memory, and for videos longer than a minute a single feature memory model tightly links memory consumption and accuracy. In contrast, following the Atkinson-Shiffrin model, this architecture incorporates multiple independent yet deeply connected feature memory stores: a rapidly updated sensory memory, a high-resolution working memory, and a compact, thus sustained, long-term memory. The paper also introduces a memory potentiation algorithm that routinely consolidates actively used working-memory elements into the long-term memory, which avoids memory explosion and minimizes performance decay for long-term prediction.
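
A highly simplified sketch of the consolidation idea (not XMem's actual algorithm or data structures) is shown below: working-memory entries that are read most often are periodically promoted into a compact long-term store, and the working memory is cleared.

```python
# Toy illustration of the consolidation idea: frequently read working-memory
# entries are promoted to a compact long-term store. Not XMem's actual algorithm.
class FeatureMemory:
    def __init__(self, max_working=50, keep_top=10):
        self.working_keys, self.working_vals, self.usage = [], [], []
        self.long_term_keys, self.long_term_vals = [], []
        self.max_working, self.keep_top = max_working, keep_top

    def read(self, index):
        # Reading an entry counts as "actively using" it.
        self.usage[index] += 1
        return self.working_keys[index], self.working_vals[index]

    def add(self, key, value):
        self.working_keys.append(key)
        self.working_vals.append(value)
        self.usage.append(0)
        if len(self.working_keys) > self.max_working:
            self.consolidate()

    def consolidate(self):
        # Keep only the most frequently used entries, move them to long-term
        # memory, and clear the working memory to bound memory growth.
        order = sorted(range(len(self.usage)), key=lambda i: -self.usage[i])
        for i in order[:self.keep_top]:
            self.long_term_keys.append(self.working_keys[i])
            self.long_term_vals.append(self.working_vals[i])
        self.working_keys, self.working_vals, self.usage = [], [], []
```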

Model Name:  XMem (BL30K, MS)

Datasets: YouTube-VOS (2018 & 2019), DAVIS (2016 & 2017)

Model links.  Trained models not released yet.

License: GPL-3.0 license