TWC #17

State-of-the-art papers with Github avatars of researchers who released code, models (in most cases), and demo apps (in a few cases) along with their paper. Image created from the papers described below.

State-of-the-art (SOTA) updates for 21–27 Nov 2022.

This weekly newsletter highlights researchers whose state-of-the-art work broke existing records on benchmarks. They also

  • authored their paper
  • released their code
  • released models in most cases
  • released notebooks/apps in a few cases

Nearly half of the released source-code licenses allow commercial use with just attribution. Most, if not all, ML-powered companies owe their existence, at least in part, to the work of these researchers. Please consider supporting open research by starring/sponsoring them on Github.

New records were set on the following tasks

  • Object Detection
  • Image Generation
  • Image Harmonization (adjusting the foreground to make it compatible with the background)
  • Cross-modal retrieval
  • Salient Object Detection (update)
  • Few shot semantic segmentation
  • Video Generation
  • Unsupervised Video Object Segmentation

This weekly is a consolidation of daily Twitter posts tracking SOTA researchers. Daily SOTA updates are also done on - "a twitter alternative by and for the AI community"

To date, 27.6% (92,202) of all published papers (334,005) have code released along with them (source).

The SOTA details below are snapshots of SOTA models at the time of publishing this newsletter. Over time, the details at the links provided below the snapshots will likely differ from the snapshots as new SOTA models emerge.

#1 in Object Detection on LVIS dataset

Paper: Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information
SOTA details
Github code placeholder with models released by Weijie Su (first author in paper).

Model Name:  InternImage-H (M3I Pre-training)

Notes: This paper proposes a general multi-modal mutual information formula as a unified optimization target and demonstrates that all existing pre-training approaches are special cases of this framework. The authors pre-train a billion-parameter image backbone and achieve state-of-the-art performance on various benchmarks.
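The paper frames pre-training as maximizing multi-modal mutual information. As a rough illustration only (not the paper's exact formula), the InfoNCE loss below is a standard lower-bound estimator of mutual information between paired embeddings; all names, shapes, and the temperature value are made up for this sketch:

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.07):
    """InfoNCE lower bound on mutual information between two
    batches of paired embeddings (row i of z_a and z_b is a positive pair)."""
    # L2-normalize embeddings so the dot product is cosine similarity
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # loss = mean negative log-probability of the matching pair (the diagonal)
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 32))
loss_pos = info_nce(z, z)                        # perfectly aligned views
loss_rand = info_nce(z, rng.normal(size=(8, 32)))  # unrelated views
```

Aligned pairs yield a much lower loss than random pairs, which is exactly what makes the bound usable as a pre-training objective.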

Demo page: None to date

License:  None to date

#1 in Image Generation on Places50 dataset

Paper: SinDiffusion: Learning a Diffusion Model from a Single Natural Image
SOTA details
Github code released by Weilun Wang (first author in paper). Models not released yet

Model Name:  SinDiffusion

Notes: This paper proposes SinDiffusion, a model leveraging denoising diffusion models to capture the internal distribution of patches from a single natural image. SinDiffusion improves the quality and diversity of generated samples compared with existing GAN-based approaches. It is based on two design choices. First, SinDiffusion is trained with a single model at a single scale, instead of multiple models with progressive growing of scales, which is the default setting in prior work. This avoids the accumulation of errors that causes characteristic artifacts in generated results. Second, the authors identify that a patch-level receptive field of the diffusion network is crucial and effective for capturing the image's patch statistics, and therefore redesign the network structure of the diffusion model. Coupling these two designs enables the model to generate photorealistic and diverse images from a single image. Furthermore, SinDiffusion can be applied to various applications, e.g., text-guided image generation and image outpainting, due to the inherent capability of diffusion models.
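For intuition on the diffusion side, here is a minimal sketch of the generic DDPM forward process that such models are trained against: the network learns to predict the noise eps, from which x0 is exactly recoverable. This is standard diffusion machinery, not SinDiffusion's specific single-image architecture; the shapes and the alpha_bar value are made up:

```python
import numpy as np

def forward_diffuse(x0, alpha_bar_t, eps):
    """DDPM forward process: q(x_t | x_0) samples
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(64, 64, 3))   # a toy "image"
eps = rng.normal(size=x0.shape)             # the noise the denoiser must predict
x_t = forward_diffuse(x0, alpha_bar_t=0.5, eps=eps)

# Given the true eps, x0 is exactly recoverable -- this inversion is
# what makes "predict the noise" a sufficient training target.
x0_rec = (x_t - np.sqrt(1 - 0.5) * eps) / np.sqrt(0.5)
```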

Demo page: None to date

License:  Apache-2.0 license

#1 in Image Harmonization on iHarmony4 dataset

Paper: Hierarchical Dynamic Image Harmonization

SOTA details

Github Placeholder code released by Haoxing Chen (first author in paper). Models not released yet

Model Name:  HDNet

Notes: Current image harmonization models ignore local consistency, and their size limits their harmonization ability on edge devices. This paper proposes a hierarchical dynamic network (HDNet) for efficient image harmonization that adapts the model parameters and features from a local to a global view for better feature transformation. Specifically, local dynamics (LD) and mask-aware global dynamics (MGD) are applied. LD enables features at different channels and positions to change adaptively, improving the representation of geometric transformations through structural information learning. MGD learns the representations of foreground and background regions and their correlations for global harmonization.
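To make the mask-aware idea concrete, the toy function below matches the foreground's per-channel statistics to the background's under a composite mask. This is a crude hand-coded stand-in for the kind of foreground/background adaptation MGD learns, not the paper's actual module; all names and shapes are invented:

```python
import numpy as np

def mask_aware_harmonize(img, mask):
    """Toy mask-aware transform: shift each channel's foreground
    mean/std to match the background's (a stand-in for learned MGD)."""
    fg = mask.astype(bool)
    out = img.astype(float).copy()
    for c in range(img.shape[2]):
        ch = out[..., c]
        fg_mu, fg_sd = ch[fg].mean(), ch[fg].std() + 1e-6
        bg_mu, bg_sd = ch[~fg].mean(), ch[~fg].std() + 1e-6
        # standardize foreground, then re-scale to background statistics
        ch[fg] = (ch[fg] - fg_mu) / fg_sd * bg_sd + bg_mu
    return out

rng = np.random.default_rng(0)
img = rng.uniform(0, 1, size=(32, 32, 3))
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True
img[mask] += 0.5                      # composite foreground is too bright
out = mask_aware_harmonize(img, mask)
```

After the transform, the foreground's per-channel mean matches the background's, which is the statistical effect harmonization aims for (real methods do this with learned, spatially varying transforms).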

Demo page: None to date

License:  MIT license

#1 in Cross-modal retrieval on multiple datasets

Paper: X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks
SOTA details - 1 of 3
SOTA details - 2 of 3
SOTA details - 3 of 3

Github Placeholder code released by Yan Zeng (first author in paper). Models not released yet

Model Name:  X2-VLM

Notes: This paper proposes multi-grained vision-language pre-training, a unified approach that learns vision-language alignments at multiple granularities. The paper further advances the proposed method by unifying image and video encoding in one model and scaling up the model with large-scale data.

Demo page: None to date

License:  None to date

#1 in Salient Object Detection on 8 datasets (update)

Paper: Revisiting Image Pyramid Structure for High Resolution Salient Object Detection
SOTA details - 1 of 4
SOTA details - 2 of 4
SOTA details - 3 of 4
SOTA details - 4 of 4
Github code released by: Taehun Kim. Model link: On Github page. A fork was created to replicate the results.

Model Name: InSPyReNet (update). This model was reported in TWC #9. This is an update reporting an additional repo by the author showcasing a further use case.

Notes: This paper proposes a salient object detection (SOD) model for high-resolution (HR) prediction without any HR dataset. The model is designed as an image pyramid structure of the saliency map, which enables ensembling multiple results with pyramid-based image blending. For HR prediction, they design a pyramid blending method that synthesizes two different image pyramids from a pair of LR and HR scales of the same image to overcome the effective receptive field (ERF) discrepancy.
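The pyramid-blending idea can be sketched with a classic Laplacian blend: decompose two inputs into frequency bands and combine each band under a downsampled mask. This is the textbook technique the paper builds on, not InSPyReNet's exact scheme; the pooling/upsampling operators here are deliberately simplistic:

```python
import numpy as np

def down(x):   # 2x2 average pooling
    return x.reshape(x.shape[0] // 2, 2, x.shape[1] // 2, 2).mean(axis=(1, 3))

def up(x):     # nearest-neighbour upsampling
    return x.repeat(2, axis=0).repeat(2, axis=1)

def blend_pyramid(a, b, mask, levels=3):
    """Toy Laplacian-pyramid blend of images a and b under a soft mask."""
    la, lb, gm = [], [], [mask]
    for _ in range(levels):
        a2, b2 = down(a), down(b)
        la.append(a - up(a2))            # Laplacian band of a
        lb.append(b - up(b2))            # Laplacian band of b
        a, b, mask = a2, b2, down(mask)
        gm.append(mask)                  # Gaussian pyramid of the mask
    out = gm[-1] * a + (1 - gm[-1]) * b  # blended coarsest level
    for l_a, l_b, m in zip(la[::-1], lb[::-1], gm[-2::-1]):
        out = up(out) + m * l_a + (1 - m) * l_b  # add blended bands back
    return out

a = np.ones((32, 32))                    # "white" image
b = np.zeros((32, 32))                   # "black" image
mask = np.zeros((32, 32)); mask[:, :16] = 1.0
out = blend_pyramid(a, b, mask)
```

Blending per band, rather than per pixel, is what avoids visible seams when the two pyramids come from different scales of the same image.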

Demo page links.

The new repo showcases a utility of the same model reported earlier: overlaying a salient object on a second input image by making the salient object's background transparent.
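Once a saliency map is available, the overlay itself is plain alpha compositing: use the map as an alpha channel and paste the object onto a new background. A minimal sketch (the function name and shapes are invented; the repo's own implementation may differ):

```python
import numpy as np

def overlay_salient(fg_img, saliency_mask, bg_img):
    """Paste the salient object onto a new background: the saliency
    map becomes an alpha channel, everything else turns transparent."""
    alpha = saliency_mask[..., None].astype(float)  # (H, W, 1) in [0, 1]
    return alpha * fg_img + (1.0 - alpha) * bg_img

rng = np.random.default_rng(0)
fg = rng.uniform(0, 1, size=(16, 16, 3))          # source photo
bg = np.zeros((16, 16, 3))                        # new background
mask = np.zeros((16, 16)); mask[4:12, 4:12] = 1.0 # toy "salient" square
out = overlay_salient(fg, mask, bg)
```

A soft (non-binary) saliency map gives feathered edges for free, since the composite is a weighted average rather than a hard cut-out.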

An additional use case of the same model reported earlier

A notebook was created for the original model release to replicate inference on CPU using one of their pre-trained models. Samples are shown below. Salient object detection takes ~3 seconds per image on a CPU. For a 6-second high-resolution video (1920×1080), it took 53 minutes on a CPU.

Inference test of one of their pre-trained models (trained with the LR+HR dataset - LR scale 384×384). Note that the salient "object" in an image can be multiple objects, as seen in the last test: the two people in the foreground are considered salient by the model. Beyond the standard uses of salient object detection, one practical use could be automating mask creation for an inpainting model. All images above, except the lady sitting on the bench, are 384×384; the lady sitting on the bench is 512×512. The second row is model output for SOD with the rest masked out; the third row is model output for SOD with the rest blurred.

This is a 6-second high-resolution 1920 × 1080 video. The model performed salient object detection (shown below) on a CPU in 53 minutes. Video from Pexels.

Gif image created from the video above. The dog with the stick is detected as the salient object in the video frames. This high-resolution 1920 × 1080, 6-second video took 53 minutes for SOD on a CPU. Code for the replicated results

We also released an app built on top of InSPyReNet, a SOTA model we reviewed in TWC #9. The app is also hosted on HuggingFace. It addresses the use case of removing the background from a picture: users can upload any picture and have the background removed by a state-of-the-art model. This could be a handy tool for removing the background from pictures taken on a phone.

License: MIT license

#1 in Few-Shot Semantic Segmentation on 2 datasets

Paper: Feature-Proxy Transformer for Few-Shot Segmentation
SOTA details
Github code with models released by: Jian-Wei Zhang (first author in paper)

Model Name:  FPTrans

Notes: Few-shot segmentation (FSS) aims at performing semantic segmentation on novel classes given a few annotated support samples. The current FSS framework has deviated far from the supervised segmentation framework: given the deep features, FSS methods typically use an intricate decoder to perform sophisticated pixel-wise matching, while supervised segmentation methods use a simple linear classification head. Due to the intricacy of the decoder and its matching pipeline, such an FSS framework is not easy to follow. This paper revives the straightforward framework of "feature extractor + linear classification head" and proposes a novel Feature-Proxy Transformer (FPTrans) method, in which the "proxy" is the vector representing a semantic class in the linear classification head. FPTrans has two key points for learning discriminative features and representative proxies: 1) to better utilize the limited support samples, the feature extractor makes the query interact with the support features from the bottom to the top layers using a novel prompting strategy; 2) FPTrans uses multiple local background proxies (instead of a single one) because the background is not homogeneous and may contain some novel foreground regions. Both key points can be integrated into a vision transformer backbone through its prompting mechanism. Given the learned features and proxies, FPTrans directly compares their cosine similarity for segmentation.
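The final classification step described above is simple enough to sketch directly: compare each pixel's feature to each class proxy by cosine similarity and take the argmax. The learning of features and proxies is where FPTrans's contribution lies; this sketch only shows the head, with invented shapes:

```python
import numpy as np

def segment_by_proxies(features, proxies):
    """FPTrans-style head: cosine similarity between each pixel
    feature and each class proxy, then argmax over classes."""
    f = features / np.linalg.norm(features, axis=-1, keepdims=True)
    p = proxies / np.linalg.norm(proxies, axis=-1, keepdims=True)
    sim = f @ p.T                 # (H, W, num_proxies)
    return sim.argmax(axis=-1)    # per-pixel class index

rng = np.random.default_rng(0)
proxies = rng.normal(size=(2, 8))           # background + foreground proxy
feats = np.tile(proxies[0], (16, 16, 1))    # start as pure background
feats[4:12, 4:12] = proxies[1]              # a "foreground" patch
pred = segment_by_proxies(feats, proxies)
```

Because the head is just a normalized linear classifier, all the few-shot adaptation happens in the features and proxies, not in a bespoke decoder.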

Demo page: None to date

License:  None to date

#1 in Video Generation on 2 datasets

Paper: Latent Video Diffusion Models for High-Fidelity Video Generation with Arbitrary Lengths
SOTA details
Github code released by: Yingqing He (first author in paper). Models not released yet

Model Name:  MoCoGAN-HD

Notes: Photo-realistic video synthesis remains a challenge despite all the attention AI-generated content has recently garnered. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models (DMs) are another class of deep generative models and have recently achieved remarkable performance on various image synthesis tasks. However, training image diffusion models usually requires substantial computational resources, which makes extending diffusion models to high-dimensional video synthesis even more expensive. To ease this problem while keeping the advantages of DMs, this paper introduces lightweight video diffusion models that synthesize high-fidelity, arbitrarily long videos from pure noise. Specifically, the authors propose to perform diffusion and denoising in a low-dimensional 3D latent space, which significantly outperforms previous 3D pixel-space methods under a limited computational budget. In addition, though trained on tens of frames, their models can generate videos with arbitrary lengths, i.e., thousands of frames, in an autoregressive way. Finally, conditional latent perturbation is introduced to reduce performance degradation when synthesizing long-duration videos.
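The autoregressive extension scheme can be sketched as a loop: generate one chunk of latent frames, then repeatedly condition on the last few frames to produce the next chunk. The sketch below mocks the trained denoiser with a stand-in function; names, chunk sizes, and latent shapes are all invented:

```python
import numpy as np

def generate_long_video(denoise_fn, latent_shape, chunk=8, cond=2, total=24):
    """Sketch of arbitrary-length generation: sample a first chunk of
    latent frames, then repeatedly condition on the last `cond` frames
    to extend the sequence (denoise_fn stands in for the trained model)."""
    rng = np.random.default_rng(0)
    frames = denoise_fn(rng.normal(size=(chunk, *latent_shape)), context=None)
    while frames.shape[0] < total:
        context = frames[-cond:]                          # overlap frames
        noise = rng.normal(size=(chunk - cond, *latent_shape))
        new = denoise_fn(noise, context=context)          # new latent frames
        frames = np.concatenate([frames, new], axis=0)
    return frames[:total]

# dummy "denoiser" that just squashes noise into [-1, 1]
video = generate_long_video(lambda z, context: np.tanh(z), (4, 4, 3))
```

Each iteration reuses the same model, so video length is bounded only by compute, not by the training clip length.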

Demo page: Project page has generated video  examples

License:  MIT license

#1 in Unsupervised Video Object Segmentation on 2 datasets

Paper: Domain Alignment and Temporal Aggregation for Unsupervised Video Object Segmentation
SOTA details
Github placeholder code released by Minhyeok Lee (second author in paper). Models not released yet

Model Name:  DATA

Notes:  Unsupervised video object segmentation aims at detecting and segmenting the most salient object in videos. In recent times, two-stream approaches that collaboratively leverage appearance cues and motion cues have attracted extensive attention thanks to their powerful performance. However, there are two limitations faced by those methods: 1) the domain gap between appearance and motion information is not well considered; and 2) long-term temporal coherence within a video sequence is not exploited. To overcome these limitations, this paper proposes a domain alignment module (DAM) and a temporal aggregation module (TAM). DAM resolves the domain gap between two modalities by forcing the values to be in the same range using a cross-correlation mechanism. TAM captures long-term coherence by extracting and leveraging global cues of a video.
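The "force the values into the same range" idea behind DAM can be illustrated with a toy standardization before fusion. This is a hand-coded stand-in, not the paper's learned cross-correlation module; every name and shape here is invented:

```python
import numpy as np

def align_domains(appearance, motion):
    """Toy domain alignment: standardize both feature maps so appearance
    and motion values live in the same range before fusing them."""
    def standardize(x):
        return (x - x.mean()) / (x.std() + 1e-6)
    a, m = standardize(appearance), standardize(motion)
    # fuse with a correlation-style product plus the aligned features
    return a * m + a + m

rng = np.random.default_rng(0)
# appearance and motion features with very different ranges
app = rng.normal(loc=5.0, scale=3.0, size=(16, 16, 8))
mot = rng.normal(loc=-1.0, scale=0.1, size=(16, 16, 8))
fused = align_domains(app, mot)
```

Without the alignment step, the product term would be dominated by whichever modality has the larger magnitude, which is the domain-gap failure mode the paper addresses.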

Demo page: None to date

License: None to date