TWC #19

State-of-the-art papers with GitHub avatars of researchers who released code, models (in most cases), and demo apps (in a few cases) along with their paper. Image created from the papers described below

State-of-the-art (SOTA) updates for 5–11 Dec 2022.

This weekly newsletter highlights the work of researchers who produced state-of-the-art results, breaking existing records on benchmarks. They also

  • authored their paper
  • released their code
  • released models in most cases
  • released notebooks/apps in a few cases

New records were set on the following tasks (in order of the papers below):

  • Image Generation
  • Unsupervised Domain adaptation
  • SMAC tasks (StarCraft Multi-Agent Challenge)
  • Dense Captioning
  • Motion Synthesis

To date, 27.8% (93,701) of the 337,412 papers published have code released along with the paper (source).

The SOTA details below are snapshots of the SOTA models at the time of publishing this newsletter. The details at the links provided below the snapshots will most likely differ from these snapshots over time as new SOTA models emerge.


#1 in Image Generation on CelebA-HQ 512x512 dataset

Paper: Wavelet Diffusion Models are fast and scalable Image Generators
#1 in Image Generation on CelebA-HQ 512x512 dataset
Github code with pretrained models released by Hao Phung (first author in paper)

Model Name:  WaveDiff

Notes:  Diffusion models are emerging as a powerful solution for high-fidelity image generation, exceeding GANs in quality in many circumstances. However, their slow training and inference speed is a huge bottleneck, blocking them from being used in real-time applications. The recent DiffusionGAN method significantly decreases running time by reducing the number of sampling steps from thousands to several, but its speed still lags far behind that of GANs. This paper aims to close the speed gap by proposing a novel wavelet-based diffusion scheme. The authors extract low- and high-frequency components at both the image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality. Furthermore, they propose a reconstruction term, which effectively boosts model training convergence.
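
For intuition on the wavelet split the paper builds on, here is a minimal, self-contained sketch (not the authors' code) of a single-level 2D Haar transform: it splits an image into one low-frequency subband and three high-frequency detail subbands, each at half the spatial resolution, and the split is exactly invertible. All function names below are our own.

```python
# Minimal sketch (not the authors' code): a single-level 2D Haar wavelet
# transform splitting an image into a low-frequency subband (LL) and three
# high-frequency detail subbands (LH, HL, HH). WaveDiff-style models run the
# expensive diffusion backbone on subbands like these instead of on
# full-resolution pixels, so each spatial dimension is halved first.
import numpy as np

def haar_dwt2(x: np.ndarray):
    """x: (H, W) array with even H and W. Returns (LL, LH, HL, HH), each (H/2, W/2)."""
    a = x[0::2, 0::2]  # top-left of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0   # low frequency: local average
    lh = (a - b + c - d) / 2.0   # detail across width (vertical edges)
    hl = (a + b - c - d) / 2.0   # detail across height (horizontal edges)
    hh = (a - b - c + d) / 2.0   # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Exact inverse of haar_dwt2, so no information is lost by the split."""
    H, W = ll.shape
    x = np.zeros((H * 2, W * 2), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

img = np.random.rand(512, 512)            # stand-in for one CelebA-HQ channel
ll, lh, hl, hh = haar_dwt2(img)
assert np.allclose(haar_idwt2(ll, lh, hl, hh), img)   # lossless round trip
print(ll.shape)                           # (256, 256): 4x fewer pixels per subband
```

Because the transform is lossless and each subband has a quarter of the pixels, a backbone operating on subbands does less work per step, which is the intuition behind the faster processing described above.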

Demo page: No demo page yet.

License: Apache-2.0 license


#1 in Unsupervised Domain Adaptation on multiple datasets

Paper: MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation
#1 in Unsupervised Domain Adaptation on multiple datasets
Github code with pretrained models released by Lukas Hoyer (first author in paper)

Model Name:  MIC

Notes:  In unsupervised domain adaptation (UDA), a model trained on source data (e.g. synthetic) is adapted to target data (e.g. real-world) without access to target annotation. Most previous UDA methods struggle with classes that have a similar visual appearance on the target domain, as no ground truth is available to learn the slight appearance differences. To address this problem, this paper proposes a Masked Image Consistency (MIC) module to enhance UDA by learning spatial context relations of the target domain as additional clues for robust visual recognition. MIC enforces the consistency between predictions of masked target images, where random patches are withheld, and pseudo-labels that are generated based on the complete image by an exponential moving average teacher. To minimize the consistency loss, the network has to learn to infer the predictions of the masked regions from their context. Due to its simple and universal concept, MIC can be integrated into various UDA methods across different visual recognition tasks such as image classification, semantic segmentation, and object detection. MIC significantly improves the state-of-the-art performance across the different recognition tasks for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA.
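
As an illustration of the masked-consistency idea, the snippet below masks random patches of a target image, obtains pseudo-labels from an EMA teacher that sees the full image, and penalizes the student for disagreeing on the masked view. This is a sketch under our own assumptions (patch size, mask ratio, and the toy 1x1-conv "segmenter" are placeholders), not the released MIC implementation.

```python
# Illustrative sketch of the MIC idea (not the released implementation).
import copy
import torch
import torch.nn.functional as F

def random_patch_mask(x, patch=64, ratio=0.7):
    """Zero out roughly `ratio` of the patch x patch blocks in the image batch x."""
    B, _, H, W = x.shape
    keep = (torch.rand(B, 1, H // patch, W // patch, device=x.device) > ratio).float()
    mask = F.interpolate(keep, size=(H, W), mode="nearest")  # block mask -> pixel mask
    return x * mask

@torch.no_grad()
def ema_update(teacher, student, alpha=0.999):
    """Exponential moving average of student weights into the teacher."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(alpha).add_(ps, alpha=1 - alpha)

def mic_consistency_loss(student, teacher, target_img):
    # Teacher sees the complete target image and produces (hard) pseudo-labels.
    with torch.no_grad():
        pseudo = teacher(target_img).argmax(dim=1)           # (B, H, W)
    # Student must reproduce them from the masked view, i.e. from context alone.
    logits = student(random_patch_mask(target_img))          # (B, C, H, W)
    return F.cross_entropy(logits, pseudo)

# Toy usage: a 1x1-conv "segmenter" stands in for a real UDA segmentation model.
student = torch.nn.Conv2d(3, 19, kernel_size=1)
teacher = copy.deepcopy(student)
imgs = torch.rand(2, 3, 256, 256)
loss = mic_consistency_loss(student, teacher, imgs)
loss.backward()
ema_update(teacher, student)
```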

Demo page: No demo page yet.

License: None specified to date


#1 in SMAC tasks

Paper: ACE: Cooperative Multi-agent Q-learning with Bidirectional Action-Dependency
#1 in SMAC tasks 
Github code released by Jie Liu (second author in paper). Models not released yet

Model Name:  ACE

Notes:  Multi-agent reinforcement learning (MARL) suffers from the non-stationarity problem: the learning targets keep changing at every iteration as multiple agents update their policies at the same time. This paper proposes a solution to the non-stationarity problem with bidirectional action-dependent Q-learning (ACE). Central to the development of ACE is the sequential decision-making process, wherein only one agent is allowed to take an action at a time. Within this process, each agent maximizes its value function given the actions taken by the preceding agents at the inference stage. In the learning phase, each agent minimizes a TD error that depends on how the subsequent agents have reacted to its chosen action. Given this design of bidirectional dependency, ACE effectively turns a multi-agent MDP into a single-agent MDP.
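
A toy sketch of the sequential, action-dependent selection described above (our own simplification, not the released ACE architecture): each agent's Q-network sees its observation plus one-hot encodings of the actions already chosen by earlier agents, and the joint action is assembled greedily, one agent at a time.

```python
# Toy sketch of ACE's sequential decision process (not the released network).
# The tiny MLP is a placeholder for the paper's bidirectional action-dependent model.
import torch
import torch.nn as nn

N_AGENTS, N_ACTIONS, OBS_DIM = 3, 5, 16

class ActionDependentQ(nn.Module):
    """Q(obs_i, actions of preceding agents) -> scores over agent i's actions."""
    def __init__(self):
        super().__init__()
        # Input: own observation + one-hot action slots for all agents that acted earlier.
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + N_AGENTS * N_ACTIONS, 64),
            nn.ReLU(),
            nn.Linear(64, N_ACTIONS),
        )

    def forward(self, obs_i, prev_actions_onehot):
        return self.net(torch.cat([obs_i, prev_actions_onehot], dim=-1))

@torch.no_grad()
def select_joint_action(q_net, observations):
    """Greedy sequential selection: each agent maximizes Q given earlier agents' choices."""
    prev = torch.zeros(N_AGENTS * N_ACTIONS)  # one-hot slots, empty for agents not yet acted
    joint_action = []
    for i in range(N_AGENTS):
        q_values = q_net(observations[i], prev)
        a_i = int(q_values.argmax())
        joint_action.append(a_i)
        prev[i * N_ACTIONS + a_i] = 1.0        # later agents condition on this choice
    return joint_action

q_net = ActionDependentQ()
obs = torch.rand(N_AGENTS, OBS_DIM)
print(select_joint_action(q_net, obs))         # e.g. [2, 0, 4]
```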

Demo page: Select demo examples on this page

License: Apache-2.0 license


#1 in Dense Captioning on Visual Genome dataset

Paper: GRiT: A Generative Region-to-text Transformer for Object Understanding
#1 in Dense Captioning on Visual Genome dataset
Github code with model released by Jialian Wu (first author in paper). 

Model Name:  GRiT

Notes:  This paper presents a Generative RegIon-to-Text transformer, GRiT, for object understanding. The spirit of GRiT is to formulate object understanding as <region, text> pairs, where region locates objects and text describes objects. For example, the text in object detection denotes class names while that in dense captioning refers to descriptive sentences. Specifically, GRiT consists of a visual encoder to extract image features, a foreground object extractor to localize objects, and a text decoder to generate open-set object descriptions. With the same model architecture, GRiT can understand objects via not only simple nouns, but also rich descriptive sentences including object attributes or actions. GRiT is applied to object detection and dense captioning tasks.
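
To make the three-component structure concrete, here is a schematic, runnable sketch with stub components (our own naming, not the released implementation): a visual encoder produces features, a foreground extractor proposes boxes, and a text decoder turns each region into free-form text, so detection and dense captioning differ only in the text that gets generated.

```python
# Schematic sketch of a GRiT-like region-to-text pipeline (stub components,
# not the released code): every detected region is mapped to a piece of text.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

@dataclass
class RegionDescription:
    box: Box
    text: str  # open-set description generated for this region

class GritLikePipeline:
    def __init__(self,
                 visual_encoder: Callable,       # image -> feature map
                 foreground_extractor: Callable, # features -> list of boxes
                 text_decoder: Callable):        # (features, box) -> text
        self.visual_encoder = visual_encoder
        self.foreground_extractor = foreground_extractor
        self.text_decoder = text_decoder

    def describe(self, image) -> List[RegionDescription]:
        features = self.visual_encoder(image)
        boxes = self.foreground_extractor(features)
        # One generated sentence (or class name) per detected foreground region.
        return [RegionDescription(box, self.text_decoder(features, box)) for box in boxes]

# Stub components so the control flow can be run end to end.
pipeline = GritLikePipeline(
    visual_encoder=lambda image: {"size": (640, 480)},
    foreground_extractor=lambda feats: [(10, 20, 200, 180), (250, 40, 400, 300)],
    text_decoder=lambda feats, box: f"object of size {box[2] - box[0]}x{box[3] - box[1]}",
)
for r in pipeline.describe(image=None):
    print(r.box, "->", r.text)
```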

Demo page: We forked the repo and added a Google Colab link for evaluating the model

Model output for dense captioning on Pexels.com images

Evaluation notes: Captioning for detected objects captures some aspects of an object's features beyond just its class name. This is a consequence of the captioning component receiving object features in the form of image patches. For instance, in some pictures the captioning describes the "sky" (a detected object) as clear vs. dark, etc.
This is perhaps a distinguishing factor of this approach compared to dense captioning with a model like Detic, which feeds the object class name, coordinates, and dimensions to ChatGPT. ChatGPT's output is superior to GRiT's dense captioning, particularly in its ability to offer a rich summary of a scene beyond just individual objects, but its descriptions are at a disadvantage because it is blind to the characteristics of each object: it only receives the object name, position, and size as input.
Also, Detic's object detection capability appears to be better than GRiT's on the images tested (images were selected from the royalty-free site Pexels).

The number of objects detected by GRiT may appear to be fewer than the number detected by Detic only because of the difference in training datasets (LVIS vs. COCO; thank you Jialian for pointing this out). So, if we combine the dense captioning capability of GRiT with the object detection capability of Detic and feed the bounding box information along with the dense captions to ChatGPT, it outputs rich descriptions of the scene, even describing the relative spatial positions of objects in the image. Here are a couple of our Twitter post threads analyzing this: Thread 1, Thread 2.

Combining Detic and GRiT to create object bounding boxes with descriptions as input to ChatGPT, which returns rich descriptions of the scene
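
Below is a minimal sketch of how such a combined input can be serialized into a ChatGPT prompt. The detections, captions, and prompt wording are illustrative placeholders, not the exact inputs used in the threads above.

```python
# Minimal sketch: pair Detic-style boxes/labels with GRiT-style region captions
# and serialize them into a text prompt asking for a scene-level description.
from typing import List, Tuple

def build_scene_prompt(image_size: Tuple[int, int], regions: List[dict]) -> str:
    """regions: [{'label': ..., 'box': (x1, y1, x2, y2), 'caption': ...}, ...]"""
    w, h = image_size
    lines = [f"An image is {w}x{h} pixels and contains these objects:"]
    for r in regions:
        x1, y1, x2, y2 = r["box"]
        lines.append(
            f"- {r['label']} at box ({x1},{y1})-({x2},{y2}), "
            f"{x2 - x1}x{y2 - y1} px, described as: {r['caption']}"
        )
    lines.append("Describe the scene in a few sentences, including the relative "
                 "positions of the objects.")
    return "\n".join(lines)

# Hypothetical merged output: labels/boxes as a Detic-style detector might give,
# captions as a GRiT-style dense captioner might give for the same boxes.
regions = [
    {"label": "sky",    "box": (0, 0, 1280, 300),    "caption": "a clear blue sky"},
    {"label": "person", "box": (520, 340, 700, 820), "caption": "a person walking on the beach"},
]
print(build_scene_prompt((1280, 960), regions))
# The resulting string would then be sent to ChatGPT as the user message.
```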

License: MIT license


#1 in Motion Synthesis on 2 datasets

Paper: Executing your Commands via Motion Diffusion in Latent Space
#1 in Motion Synthesis on 2 datasets
Placeholder code released by Xin Chen (first author in paper). Models not released yet

Model Name:  MLD

Notes:  This paper offers a solution for conditional human motion generation, which produces plausible human motion sequences according to various conditional inputs, such as action classes or textual descriptors. Since human motions are highly diverse and their distribution is quite different from that of conditional modalities such as textual descriptors in natural language, it is hard to learn a probabilistic mapping from the desired conditional modality to human motion sequences. Besides, raw motion data from a motion capture system may be redundant across a sequence and contain noise; directly modeling the joint distribution over raw motion sequences and conditional modalities would require heavy computational overhead and might result in artifacts introduced by the captured noise. To learn a better representation of the various human motion sequences, this paper uses a Variational AutoEncoder (VAE) to arrive at a representative, low-dimensional latent code for a human motion sequence. Then, instead of using a diffusion model to establish connections between the raw motion sequences and the conditional inputs, they perform the diffusion process in the motion latent space. The proposed Motion Latent-based Diffusion model (MLD) produces vivid motion sequences conforming to the given conditional inputs and substantially reduces the computational overhead in both the training and inference stages. Extensive experiments demonstrate that MLD achieves significant improvements over state-of-the-art methods across a range of human motion generation tasks, while being two orders of magnitude faster than previous diffusion models that operate on raw motion sequences.
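
The two-stage design can be sketched as follows (placeholder modules and arbitrary dimensions, not the released MLD model): a motion VAE compresses a motion sequence into a compact latent, a denoiser runs a very simplified reverse diffusion process in that latent space conditioned on a text embedding, and the VAE decoder maps the sampled latent back to a motion sequence.

```python
# Conceptual sketch of a latent-diffusion pipeline for motion (placeholder
# modules, not the released MLD code). Dimensions below are arbitrary.
import torch
import torch.nn as nn

SEQ_LEN, JOINT_DIM, LATENT_DIM, TEXT_DIM = 196, 263, 256, 512

class MotionVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Linear(SEQ_LEN * JOINT_DIM, LATENT_DIM)   # motion -> latent
        self.dec = nn.Linear(LATENT_DIM, SEQ_LEN * JOINT_DIM)   # latent -> motion

    def encode(self, motion):                  # motion: (B, SEQ_LEN, JOINT_DIM)
        return self.enc(motion.flatten(1))     # (B, LATENT_DIM)

    def decode(self, z):
        return self.dec(z).view(-1, SEQ_LEN, JOINT_DIM)

class LatentDenoiser(nn.Module):
    """Predicts the noise in a latent, conditioned on the timestep and a text embedding."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + 1 + TEXT_DIM, 512), nn.SiLU(),
            nn.Linear(512, LATENT_DIM),
        )

    def forward(self, z_t, t, text_emb):
        t = t.float().unsqueeze(-1) / 1000.0
        return self.net(torch.cat([z_t, t, text_emb], dim=-1))

@torch.no_grad()
def sample_motion(vae, denoiser, text_emb, steps=50):
    """Crudely simplified reverse process: iteratively strip predicted noise in latent space."""
    z = torch.randn(text_emb.shape[0], LATENT_DIM)
    for t in reversed(range(steps)):
        t_batch = torch.full((z.shape[0],), t)
        z = z - denoiser(z, t_batch, text_emb) / steps   # toy denoising step
    return vae.decode(z)                                  # (B, SEQ_LEN, JOINT_DIM)

vae, denoiser = MotionVAE(), LatentDenoiser()
text_emb = torch.randn(1, TEXT_DIM)            # stand-in for a CLIP-style text embedding
print(sample_motion(vae, denoiser, text_emb).shape)   # torch.Size([1, 196, 263])
```

The key point the sketch illustrates is that the iterative denoising loop runs over a 256-dimensional latent rather than over the full motion sequence, which is where the reported speedup over raw-motion diffusion comes from.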

Demo page: No demo page yet.

License: MIT license