TWC #10

State-of-the-art papers with GitHub avatars of researchers who released code, models (in most cases), and demo apps (in a few cases) along with their papers. Image created from the papers described below

State-of-the-art (SOTA) updates for 3 Oct – 9 Oct 2022

TasksWithCode weekly newsletter highlights the work of researchers who publish their code (often with models) along with their SOTA paper. This weekly is a consolidation of daily Twitter posts tracking SOTA changes.

To date, 27% (87,114) of the 322,182 papers published have code released along with them (source).

The selected researchers in the figure above produced state-of-the-art work, breaking existing records on benchmarks. They also

  • authored their paper
  • released their code
  • released models in most cases
  • released notebooks/apps in a few cases

They broke existing records on the following tasks:

  • Link Prediction
  • Generalized Few-Shot Image Classification
  • Few-Shot Image Classification
  • Video Object Tracking
  • 3D Instance Segmentation
  • Temporal Action Proposal Generation

#1 in Link Prediction on the YAGO3-10 dataset

Paper: MEIM: Multi-partition Embedding Interaction Beyond Block Term Format for Efficient and Expressive Link Prediction
SOTA details
GitHub code released by Hung-Nghiep Tran (first author of the paper). Model link: not released to date; datasets are provided for training.

Model Name: MEIM

Notes: This is a knowledge graph embedding model that predicts missing relations between entities in knowledge graphs. The model proposed in this paper addresses some of the drawbacks of prior models and claims to be more expressive and efficient. It outperforms strong baselines and achieves state-of-the-art results on difficult link prediction benchmarks using fairly small embedding sizes.
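MEIM's multi-partition embedding interaction is more elaborate than anything shown here, but the general idea of link prediction with a knowledge graph embedding model can be sketched with a much simpler, classic scoring function (DistMult-style trilinear product). Everything below — the entity/relation counts, the random embeddings — is a hypothetical illustration, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(0)
n_entities, n_relations, dim = 5, 2, 8

# Random embedding tables; a real model would learn these from known triples.
E = rng.normal(size=(n_entities, dim))   # entity embeddings
R = rng.normal(size=(n_relations, dim))  # relation embeddings

def score(head, relation, tail):
    """Trilinear DistMult score: higher means the triple is more plausible."""
    return float(np.sum(E[head] * R[relation] * E[tail]))

def predict_tail(head, relation):
    """Rank all entities as candidate tails for the query (head, relation, ?)."""
    scores = (E[head] * R[relation]) @ E.T  # score against every entity at once
    return np.argsort(-scores)              # best candidates first

ranking = predict_tail(head=0, relation=1)
print(list(ranking))  # entity indices ordered by predicted plausibility
```

Link prediction benchmarks like YAGO3-10 then measure how highly the true tail entity appears in such a ranking (e.g. mean reciprocal rank, Hits@10).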

Demo page link: None to date

License: None to date

#1 in Generalized Few-Shot Image Classification

Paper: A Continual Development Methodology for Large-scale Multitask Dynamic ML Systems
SOTA details
GitHub code released by Andrea Gesmundo (sole author of the paper). Model link: model checkpoints can be obtained by contacting the author.

Model Name: µ2Net+ & µ2Net

Notes: This paper proposes a method for the generation of dynamic multitask ML models as a sequence of extensions and generalizations. The author first analyzes the capabilities of the proposed method using the standard ML empirical evaluation methodology, then proposes a novel continuous development methodology that allows a pre-existing multitask large-scale ML system to be dynamically extended while analyzing the properties of the proposed method extensions. This results in the generation of an ML model capable of jointly solving 124 image classification tasks, achieving state-of-the-art quality with improved size and compute cost.

Demo page link: Multiple Colab notebook links are available on the GitHub page

License: The Colab notebooks mention the Apache 2.0 license

#1 in Few-Shot Image Classification on the Mini-ImageNet and Tiered-ImageNet datasets

Paper: Transductive Decoupled Variational Inference for Few-Shot Classification
SOTA details
GitHub code released by Anuj Singh (first author of the paper). Model link: model checkpoints are available on Google Drive.

Model Name: TRIDENT

Notes: This paper proposes a novel variational inference network for few-shot classification that decouples the representation of an image into semantic and label latent variables and simultaneously infers them in an intertwined fashion. To induce task-awareness, as part of the inference mechanics of the model, they exploit information across both query and support images of a few-shot task using a built-in attention-based transductive feature-extraction module.
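TRIDENT's variational inference machinery is well beyond a short snippet, but the support/query structure of a few-shot episode it operates on can be illustrated with a much simpler baseline (prototypical-network-style nearest-prototype classification). The feature dimensions, class layout, and synthetic embeddings below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n_way, k_shot, dim = 3, 5, 32  # a 3-way, 5-shot episode

# Synthetic feature embeddings: class c's support embeddings cluster around mean c,
# and the query embeddings cluster around class 2's mean.
support = rng.normal(loc=np.arange(n_way)[:, None, None], size=(n_way, k_shot, dim))
query = rng.normal(loc=2.0, scale=0.3, size=(4, dim))

# Class prototypes: the mean embedding of each class's support set.
prototypes = support.mean(axis=1)  # shape (n_way, dim)

# Classify each query image by its nearest prototype (Euclidean distance).
dists = np.linalg.norm(query[:, None, :] - prototypes[None, :, :], axis=-1)
preds = dists.argmin(axis=1)
print(preds)  # each query should be assigned to class 2
```

A transductive method like TRIDENT additionally lets the (unlabeled) query set influence the inferred representations, rather than treating each query independently as this sketch does.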

Demo page link: None to date

License: MIT license

#1 in Video Object Tracking

Paper: Learning What and Where -- Unsupervised Disentangling Location and Identity Tracking
SOTA details
GitHub code released by Manuel Traub (first author of the paper). Model checkpoints have also been released.

Model Name: Loci

Notes: This paper introduces a self-supervised LOCation and Identity tracking system (Loci). Inspired by the dorsal-ventral pathways in the brain, Loci addresses the binding problem by processing separate, slot-wise encodings of 'what' and 'where'. Loci's predictive coding-like processing encourages active error minimization, such that individual slots tend to encode individual objects. Interactions between objects and object dynamics are processed in the disentangled latent space. Truncated backpropagation through time combined with forward eligibility accumulation significantly speeds up learning and improves memory efficiency. Loci effectively extracts objects from video streams and separates them into location and Gestalt components (an organized whole that is perceived as more than the sum of its parts). This separation offers an encoding that could facilitate effective planning and reasoning on conceptual levels.

Demo page link: An interface to explore the learned latent representations has also been released

License: MIT license

#1 in 3D Instance Segmentation

Paper: Mask3D for 3D Semantic Instance Segmentation
SOTA details
GitHub code released by Jonas Schult (first author of the paper). Model checkpoints have also been released.

Model Name: Mask3D

Notes: This paper proposes the first Transformer-based approach for 3D semantic instance segmentation. They leverage generic Transformer building blocks to directly predict instance masks from 3D point clouds. Each object instance is represented as an instance query. Using Transformer decoders, the instance queries are learned by iteratively attending to point cloud features at multiple scales. Combined with point features, the instance queries directly yield all instance masks in parallel. The model, Mask3D, is claimed to have several advantages over current state-of-the-art approaches: it neither relies on (1) voting schemes, which require hand-selected geometric properties (such as centers), nor on (2) geometric grouping mechanisms requiring manually tuned hyperparameters (e.g. radii), and (3) it enables a loss that directly optimizes instance masks.
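The final step described above — instance queries combined with point features yielding all masks in parallel — can be sketched in a few lines. This is a rough illustration of the query-to-mask mechanism, not Mask3D's actual architecture; the feature sizes are made up and the features here are random rather than produced by a backbone and Transformer decoder:

```python
import numpy as np

rng = np.random.default_rng(2)
n_points, n_queries, dim = 1000, 4, 32

point_feats = rng.normal(size=(n_points, dim))  # per-point features (from a 3D backbone)
queries = rng.normal(size=(n_queries, dim))     # instance queries (learned in the real model)

# Each query produces a heatmap over all points via dot products;
# thresholding the sigmoid yields one binary instance mask per query,
# for all queries at once — no grouping or voting step needed.
logits = queries @ point_feats.T                 # shape (n_queries, n_points)
masks = 1 / (1 + np.exp(-logits)) > 0.5          # boolean masks, computed in parallel
print(masks.shape)  # (4, 1000): one mask over all points per instance query
```

Because the masks fall directly out of a query–feature similarity, a mask-level loss (e.g. binary cross-entropy on the heatmaps) can be optimized end-to-end, which is the advantage point (3) above refers to.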

Demo page link: The demo page allows users to upload a 3D scan captured with a LiDAR-capable device (iPad Pro or iPhone 12 Pro) and see the segmentation.

Semantic instance segmentation of a room created using the demo page linked above. The dark orange region is a couch. The GitHub page has much better visualization links than this one.

License: None specified to date

#1 in Temporal Action Proposal Generation

Paper: AOE-Net: Entities Interactions Modeling with Adaptive Attention Mechanism for Temporal Action Proposals Generation
SOTA details
GitHub code released by Khoa Vo (first author of the paper). Model checkpoints have not been released to date.

Model Name: AOE-Net

Notes: Temporal action proposal generation (TAPG) requires localizing action intervals in an untrimmed video. Humans perceive an action through the interactions between actors, relevant objects, and the surrounding environment. Despite the significant progress of TAPG, the vast majority of existing methods ignore this principle of the human perceiving process by applying a backbone network to a given video as a black box. This paper proposes to model these interactions with a multi-modal representation network, namely the Actors-Objects-Environment Interaction Network (AOE-Net). The proposed model consists of two modules: a perception-based multi-modal representation (PMR) module and a boundary-matching module (BMM). Additionally, they introduce an adaptive attention mechanism (AAM) in PMR to focus only on main actors (or relevant objects) and model the relationships among them. The PMR module represents each video snippet by a visual-linguistic feature, in which main actors and the surrounding environment are represented by visual information, whereas relevant objects are depicted by linguistic features through an image-text model. The BMM module processes the sequence of visual-linguistic features as its input and generates action proposals.
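The proposal-generation step at the end of such a pipeline can be illustrated with a simple boundary-matching sketch: given per-snippet probabilities that an action starts or ends at that snippet, every candidate interval is scored by combining its boundary probabilities and the top-scoring intervals are kept. This is a generic illustration of the boundary-matching idea, not AOE-Net's BMM; the probabilities here are random stand-ins for a model's outputs:

```python
import numpy as np

rng = np.random.default_rng(3)
T = 20  # number of video snippets

# Per-snippet probabilities that an action starts or ends there
# (random stand-ins; a real model predicts these from snippet features).
start_prob = rng.random(T)
end_prob = rng.random(T)

# Score every candidate interval (s, e) with s < e by combining its
# boundary probabilities, then keep the highest-scoring proposals.
proposals = [(s, e, start_prob[s] * end_prob[e])
             for s in range(T) for e in range(s + 1, T)]
proposals.sort(key=lambda p: -p[2])
top5 = proposals[:5]
for s, e, conf in top5:
    print(f"proposal [{s}, {e}] confidence={conf:.3f}")
```

In practice the proposal list is then deduplicated (e.g. with non-maximum suppression) before being passed to a downstream action detector.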

Demo page link: None to date

License: None to date