TWC #20

State-of-the-art papers with GitHub avatars of researchers who released code, models (in most cases), and demo apps (in a few cases) along with their paper. Image created from the papers described below

State-of-the-art (SOTA) updates for 12–18 Dec 2022.

This weekly newsletter highlights the work of researchers who produced state-of-the-art results, breaking existing records on benchmarks. They also

  • authored their paper
  • released their code
  • released models in most cases
  • released notebooks/apps in few cases

New records were set on the following tasks (in order of the papers below):

  • Image-to-Image Translation
  • 3D Point Cloud Classification
  • Classifier Calibration
  • Seeing Beyond the Visible

The paper with code released by OpenAI did not set a new record, but is reported for its potential (one to two orders of magnitude faster to sample from than other approaches) to generate 3D objects from a text description.

  • 3D Point cloud generation from complex prompts

To date, 27.8% (94,134) of the 338,394 papers published have code released along with the paper (source).

SOTA details below are snapshots of the SOTA models at the time of publishing this newsletter. The details in the links provided below the snapshots will most likely differ from the snapshots over time as new SOTA models emerge.


Our contributions last week

  • Our work from last week was featured and developed further in the Medium post "ChatGPT - an epochal event"
  • We created a fork to replicate Point-E, the model OpenAI released yesterday. See results below

#1 in Image-to-Image Translation

Paper: PiPa: Pixel- and Patch-wise Self-supervised Learning for Domain Adaptive Semantic Segmentation
GitHub code with trained models released by Mu Chen (first author of the paper)

Model Name:  HRDA+PiPa

Notes:  Unsupervised Domain Adaptation (UDA) aims to enhance the generalization of a learned model to other domains. Domain-invariant knowledge is transferred from a model trained on a labeled source domain, e.g., a video game, to unlabeled target domains, e.g., real-world scenarios, saving annotation expenses. Existing UDA methods for semantic segmentation usually focus on minimizing the inter-domain discrepancy at various levels, e.g., pixels, features, and predictions, to extract domain-invariant knowledge. However, the primary intra-domain knowledge, such as context correlation inside an image, remains under-explored. In an attempt to fill this gap, this paper proposes a unified pixel- and patch-wise self-supervised learning framework, called PiPa, for domain adaptive semantic segmentation that facilitates intra-image pixel-wise correlations and patch-wise semantic consistency against different contexts. The proposed framework exploits the inherent structures of intra-domain images, which (1) explicitly encourages learning discriminative pixel-wise features with intra-class compactness and inter-class separability, and (2) motivates robust feature learning of the identical patch against different contexts or fluctuations.
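
To make the pixel-wise objective concrete, here is a minimal sketch of a supervised, InfoNCE-style pixel contrast loss in the spirit of PiPa's intra-image learning. This is not the authors' implementation; the pixel sampling, the temperature value, and the exact positive/negative definitions are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def pixel_contrast_loss(features, labels, temperature=0.1):
    """features: (N, D) embeddings of pixels sampled from one image.
    labels: (N,) semantic class of each sampled pixel."""
    features = F.normalize(features, dim=1)
    # Cosine similarity between all sampled pixel pairs, scaled by temperature.
    logits = features @ features.t() / temperature
    n = labels.size(0)
    self_mask = torch.eye(n, device=features.device)
    # Positives are pixels of the same class, excluding the pixel itself.
    pos_mask = (labels[:, None] == labels[None, :]).float() - self_mask
    # InfoNCE-style loss: pull same-class pixels together, push others apart.
    exp_logits = torch.exp(logits) * (1 - self_mask)
    log_prob = logits - torch.log(exp_logits.sum(dim=1, keepdim=True) + 1e-8)
    mean_log_prob_pos = (pos_mask * log_prob).sum(dim=1) / pos_mask.sum(dim=1).clamp(min=1)
    return -mean_log_prob_pos.mean()
```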

Demo page: No demo page yet.

License: None to date


#1 in 3D Point Cloud Classification

Paper: Learning 3D Representations from 2D Pre-trained Models via Image-to-Point Masked Autoencoders
Placeholder repo released by Renrui Zhang (first author of the paper)

Model Name:  I2P-MAE

Notes:  Pre-training on large amounts of image data has become the de facto approach for learning robust 2D representations. In contrast, due to expensive data acquisition and annotation, the paucity of large-scale 3D datasets severely hinders the learning of high-quality 3D features. This paper proposes an alternative: obtaining superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE. Through self-supervised pre-training, they leverage 2D pre-trained models to guide 3D masked autoencoding, which reconstructs the masked point tokens with an encoder-decoder architecture. Specifically, they first utilize off-the-shelf 2D models to extract the multi-view visual features of the input point cloud, and then conduct two types of image-to-point learning schemes on top. They introduce a 2D-guided masking strategy that keeps semantically important point tokens visible to the encoder. Compared to random masking, the network can better concentrate on significant 3D structures and recover the masked tokens from key spatial cues. They also enforce these visible tokens to reconstruct the corresponding multi-view 2D features after the decoder. This enables the network to effectively inherit high-level 2D semantics learned from rich image data for discriminative 3D modeling.
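
The 2D-guided masking idea can be sketched as follows: score each point token with saliency aggregated from the multi-view 2D features, then keep the highest-scoring tokens visible for the encoder. This is a simplified illustration, not the released code; the paper may use a softer, probabilistic sampling rather than the strict top-k shown here, and the saliency scores are assumed to come from projecting 2D features onto point tokens.

```python
import torch

def two_d_guided_mask(token_saliency, visible_ratio=0.4):
    """token_saliency: (B, T) importance of each point token, aggregated from
    multi-view 2D features. Returns boolean visible/masked token masks."""
    B, T = token_saliency.shape
    num_visible = int(T * visible_ratio)
    # Keep the semantically most important tokens visible for the encoder,
    # instead of a uniformly random subset as in vanilla MAE.
    idx = token_saliency.argsort(dim=1, descending=True)
    visible = torch.zeros(B, T, dtype=torch.bool)
    visible.scatter_(1, idx[:, :num_visible], True)
    return visible, ~visible
```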

Demo page: No demo page yet.

License: None to date


#1 in Classifier Calibration

Paper: Expeditious Saliency-guided Mix-up through Random Gradient Thresholding
GitHub code with trained models released by Minh-Long Luu (first author of the paper)

Model Name:  R-Mix

Notes:  Mix-up training approaches have proven effective in improving the generalization ability of Deep Neural Networks. Over the years, the research community has expanded mix-up methods in two directions, with extensive efforts to improve saliency-guided procedures but minimal focus on the arbitrary path, leaving the randomization domain unexplored. This paper introduces a novel method that lies at the junction of the two routes. By combining the best elements of randomness and saliency utilization, the proposed method balances speed, simplicity, and accuracy. The method is named R-Mix, following the concept of "Random Mix-up". The paper demonstrates its effectiveness in generalization, weakly supervised object localization, calibration, and robustness to adversarial attacks. In order to address the question of whether there exists a better decision protocol, they also train a Reinforcement Learning agent that decides the mix-up policies based on the classifier's performance, reducing dependency on human-designed objectives and hyperparameter tuning. Extensive experiments further show that the agent is capable of performing at the cutting-edge level, laying the foundation for a fully automatic mix-up.
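
A rough sketch of what "saliency plus a random threshold" can look like in practice is below. It is not the authors' implementation: the gradient-based saliency map, the per-image random threshold, and the label-mixing rule are illustrative assumptions.

```python
import torch

def r_mix_batch(model, x, y, loss_fn):
    """x: (B, C, H, W) images, y: (B,) labels. Returns mixed images and label pairs."""
    x = x.clone().requires_grad_(True)
    loss_fn(model(x), y).backward()
    model.zero_grad()  # only the input gradient is needed for saliency
    # Gradient magnitude as a cheap per-pixel saliency map, normalized per image.
    saliency = x.grad.detach().abs().sum(dim=1)                        # (B, H, W)
    saliency = saliency / saliency.amax(dim=(1, 2), keepdim=True).clamp(min=1e-8)
    # Random threshold per image: salient pixels stay, the rest come from
    # a randomly paired partner image in the batch.
    thresh = torch.rand(x.size(0), 1, 1, device=x.device)
    keep = (saliency >= thresh).unsqueeze(1).float()                   # (B, 1, H, W)
    perm = torch.randperm(x.size(0), device=x.device)
    mixed = keep * x.detach() + (1 - keep) * x.detach()[perm]
    # Label weight follows the fraction of pixels kept from the original image.
    lam = keep.mean(dim=(1, 2, 3))
    return mixed, (y, y[perm], lam)
```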

Demo page: No demo page yet.

License: MIT license


#1 in Seeing Beyond the Visible on KITTI360-EX dataset

Paper: FlowLens: Seeing Beyond the FoV via Flow-guided Clip-Recurrent Transformer
Placeholder repo with examples released by Hao Shi (first author of the paper)

Model Name:  FlowLens

Notes:  Limited by hardware cost and system size, a camera's Field-of-View (FoV) is not always satisfactory. However, from a spatio-temporal perspective, information beyond the camera's physical FoV is off-the-shelf and can actually be obtained "for free" from the past. This paper proposes a novel task termed Beyond-FoV Estimation, aiming to exploit past visual cues to bidirectionally break through the physical FoV of a camera. The paper puts forward the FlowLens architecture, which expands the FoV by achieving feature propagation explicitly via optical flow and implicitly via a novel clip-recurrent transformer, and which has two appealing features: 1) FlowLens comprises a newly proposed Clip-Recurrent Hub with 3D-Decoupled Cross Attention (DDCA) to progressively process global information accumulated in the temporal dimension. 2) A multi-branch Mix Fusion Feed Forward Network (MixF3N) is integrated to enhance the spatially-precise flow of local features.
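
The explicit, flow-guided part of the propagation can be sketched as warping past-frame features into the current frame with optical flow, so regions outside the current FoV are filled from what the camera saw earlier. This is an illustrative sketch, not the released code; the tensor layouts and the use of grid_sample for warping are assumptions.

```python
import torch
import torch.nn.functional as F

def warp_past_features(past_feat, flow):
    """past_feat: (B, C, H, W) features from a past frame.
    flow: (B, 2, H, W) optical flow mapping current-frame pixels into the past frame.
    Returns past features resampled into the current frame."""
    B, _, H, W = past_feat.shape
    # Base sampling grid in pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(past_feat.device)   # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                  # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (W - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)            # (B, H, W, 2)
    return F.grid_sample(past_feat, sample_grid, align_corners=True)
```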

Demo page: No demo page yet.

License: MIT license.


3D Point cloud generation from complex prompts

Paper: Point-E: A System for Generating 3D Point Clouds from Complex Prompts

This model is not a SOTA model but samples one to two orders of magnitude faster than SOTA models.
GitHub code with trained models released by Alex Nichol (first author of the paper)

Model Name:  Point-E

Notes:  While recent work on text-conditional 3D object generation has shown promising results, the state-of-the-art methods typically require multiple GPU-hours to produce a single sample. This is in stark contrast to state-of-the-art generative image models, which produce samples in seconds or minutes. This paper explores an alternative method for 3D object generation which produces 3D models in 1-2 minutes on a single GPU. The method first generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model which conditions on the generated image. The paper claims that even though this approach falls short of the state-of-the-art in terms of sample quality, it is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases.
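
For reference, here is a condensed sketch of the two-stage sampling pipeline, following the text-to-point-cloud example notebook in the released repo. Model and config names are taken from that example and may change over time; treat this as illustrative rather than authoritative.

```python
import torch
from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stage 1: text-conditioned base model; Stage 2: upsampler that densifies the cloud.
base_name = "base40M-textvec"
base_model = model_from_config(MODEL_CONFIGS[base_name], device).eval()
base_model.load_state_dict(load_checkpoint(base_name, device))
upsampler = model_from_config(MODEL_CONFIGS["upsample"], device).eval()
upsampler.load_state_dict(load_checkpoint("upsample", device))

sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler],
    diffusions=[diffusion_from_config(DIFFUSION_CONFIGS[base_name]),
                diffusion_from_config(DIFFUSION_CONFIGS["upsample"])],
    num_points=[1024, 4096 - 1024],
    aux_channels=["R", "G", "B"],
    guidance_scale=[3.0, 0.0],
    model_kwargs_key_filter=("texts", ""),  # only the base model sees the prompt
)

samples = None
for x in sampler.sample_batch_progressive(batch_size=1,
                                          model_kwargs=dict(texts=["a black tesla"])):
    samples = x
pc = sampler.output_to_point_clouds(samples)[0]  # point cloud with xyz + RGB channels
```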

Demo page: Three notebooks have been released to try the model. Link to our fork, which we used to generate the images below.

Notebook link to generate these images

The average generation time on a V100 is 19 seconds, well below the numbers reported in the paper. There seems to be a lot of variance in the generations for the same input prompt across reruns.

Example generations for "a black tesla"

Input: a black tesla
Input: a black tesla

Input: a red tesla

License: Not specified to date