TWC Issue #1

State-of-the-art papers, with the Github avatars of researchers who released code, models (in most cases), and demo apps (in a few cases) along with their paper. Image created from the papers described below.

State-of-the-art categories reviewed

(5–7 August 2022)

  • Pose estimation
  • Zero-shot action recognition in videos
  • Image Retrieval from Contextual Descriptions
  • Age estimation from image

Official code releases are also available for these tasks.

Pose estimation

3D human pose and shape estimation

Image from CLIFF model Github page

Top-down approaches to human pose and shape estimation typically crop the image around the subject to focus on the core problem. However, by doing so, they lose important global information about the actual angle of the subject relative to the camera (shown below).

In the upper part, the cropped images look similar and thus receive similar predictions from a CNN model. However, two of the three predictions are wrong (marked in red): they in fact have clearly different global rotations relative to the original camera, as can be seen from the bird’s-eye view. In the lower part, the three global rotations are the same, but again two estimations are wrong. Only the yaw angles of the global rotations are shown here for simplicity. Figure from paper

CLIFF (Carrying Location Information in Full Frames), in contrast, takes a holistic approach and feeds the model enough information to estimate the global rotation.

The model improved the state of the art by 7.6 units (from 40.4 to 32.8 PA-MPJPE) on 3DPW, where the evaluation metric PA-MPJPE measures the Euclidean distance in millimeters between prediction and ground truth after Procrustes alignment.
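For readers new to the metric, here is a minimal NumPy sketch of PA-MPJPE; the alignment step is a standard orthogonal Procrustes fit, and the function is illustrative rather than taken from the CLIFF codebase:

```python
import numpy as np

def pa_mpjpe(pred, gt):
    """PA-MPJPE: mean Euclidean distance (in mm) between predicted and
    ground-truth 3D joints after a similarity (Procrustes) alignment.
    pred, gt: (J, 3) arrays of joint positions."""
    # Center both joint sets
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g
    # Optimal rotation via SVD (orthogonal Procrustes)
    U, S, Vt = np.linalg.svd(p.T @ g)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:          # correct a reflection if present
        Vt[-1, :] *= -1
        S[-1] *= -1
        R = Vt.T @ U.T
    scale = S.sum() / (p ** 2).sum()  # optimal isotropic scale
    aligned = scale * p @ R.T + mu_g
    # Per-joint Euclidean distance, averaged over joints
    return np.linalg.norm(aligned - gt, axis=1).mean()
```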

3DPW (3D Poses in the Wild) is the first in-the-wild dataset with accurate 3D poses for evaluation. While other outdoor datasets exist, they are all restricted to a small recording volume. 3DPW is the first to include video footage taken from a moving phone camera.

Official code release of the paper is available on Github. Models are available for download after registering.

The dataset can be used for non-commercial and scientific research purposes, and can be downloaded directly after reading the license agreement.

Key Links:

  1. Paper
  2. Github Code released by Zhihao Li (first author of the paper). Model: model links on the Github page
  3. Dataset
  4. Demo page? None to date
  5. Google Colab link? None to date
  6. Container image? None to date

6 Degrees of freedom (6-DOF) pose estimation in RGB-D images

This model (RCVPose+ICP) improves the state of the art by 0.9 points (OccludedLINEMOD dataset) and 0.3 points (LineMOD dataset). The evaluation metric ADD measures the average distance between corresponding points under the ground-truth and predicted poses. A pose estimate is considered correct if its ADD falls within 10% of the object radius.
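Here is a minimal sketch of the ADD computation, assuming model points as an (N, 3) array and poses given as rotation matrices plus translation vectors; the names and the helper are illustrative, not from the RCVPose codebase:

```python
import numpy as np

def add_metric(model_pts, R_gt, t_gt, R_pred, t_pred):
    """ADD: average distance between corresponding 3D model points
    transformed by the ground-truth and the predicted 6-DOF poses.
    model_pts: (N, 3) points sampled from the object's 3D model;
    R_*: (3, 3) rotation matrices, t_*: (3,) translation vectors."""
    pts_gt = model_pts @ R_gt.T + t_gt
    pts_pred = model_pts @ R_pred.T + t_pred
    return np.linalg.norm(pts_gt - pts_pred, axis=1).mean()

def pose_is_correct(add, object_radius, tol=0.10):
    # Correct if ADD falls within 10% of the object radius
    # (the criterion described above)
    return add < tol * object_radius
```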

Official code release of the paper is available on Github. Pretrained models are also available for download.

The dataset links are directly available for download on the Github page: LineMOD, OccludedLINEMOD.

Key Links:

  1. Paper
  2. Github Code released by aaronWool/Yangzheng Wu (Yangzheng Wu, first author of the paper). Model: model links on the Github page
  3. Datasets LineMOD, OccludedLINEMOD
  4. Demo page?   None to date
  5. Google Colab link? None to date
  6. Container image? None to date

Zero-shot action recognition in videos

X-CLIP extends language-image pretrained models to video recognition, instead of pretraining a model from scratch. A cross-frame attention mechanism exchanges information across frames.
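To make the idea concrete, here is a naive zero-shot baseline that simply mean-pools per-frame CLIP embeddings and matches them against text prompts. It assumes OpenAI's clip package and does not implement X-CLIP's learned cross-frame attention, which is precisely what the paper adds on top of this kind of pooling:

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def zero_shot_action(frames, class_names):
    """frames: list of PIL images sampled from a video clip.
    Returns the most likely action class, zero-shot."""
    images = torch.stack([preprocess(f) for f in frames]).to(device)
    prompts = clip.tokenize([f"a video of {c}" for c in class_names]).to(device)
    with torch.no_grad():
        # Naive temporal pooling: average the per-frame embeddings.
        # X-CLIP instead lets frames exchange information through a
        # learned cross-frame attention mechanism.
        video_emb = model.encode_image(images).mean(dim=0, keepdim=True)
        text_emb = model.encode_text(prompts)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    sims = (video_emb @ text_emb.T).squeeze(0)
    return class_names[sims.argmax().item()]
```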

Image from Github page

This model improved state-of-the-art accuracy by 23.1 units (to 65.2 from 42.1 top-1 accuracy) on the Kinetics dataset. Kinetics is a large-scale collection of 650,000 high-quality video clips covering 600 action classes. The dataset is directly available for download under a Creative Commons license.

Pretrained models are also available for download on the Github page.

Key Links:

  1. Paper
  2. Github code released by Houwen Peng (second author of the paper). Model: model links on the Github page
  3. Dataset
  4. Demo page spaces link? None to date
  5. Google Colab link None to date
  6. Container image None to date

Image Retrieval from Contextual Descriptions

This paper introduces a new challenge dataset, Image Retrieval from Contextual Descriptions (ImageCoDe). The challenge illustrates that models still lag far behind human performance on this task: 90.8 accuracy for humans, compared to 20.8 on video frames and 59.4 on static images for the best model. The dataset is intended to foster progress in grounded language understanding.

An example contextual description: “The girl in blue is to the left of the girl in the middle with the purple shoes. The girl in blue is not obscured in any way.” Image from paper
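As a sketch of the task setup, here is a plain CLIP retrieval baseline that scores every candidate image against the description and picks the best match. This is a simplified stand-in, not the paper's ContextualCLIP model, and it assumes OpenAI's clip package:

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def retrieve(description, image_paths):
    """Score each candidate image against the contextual description
    and return the index of the best match."""
    images = torch.stack(
        [preprocess(Image.open(p)) for p in image_paths]).to(device)
    # Contextual descriptions are long; truncate to CLIP's 77-token limit
    text = clip.tokenize([description], truncate=True).to(device)
    with torch.no_grad():
        img_emb = model.encode_image(images)
        txt_emb = model.encode_text(text)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return (img_emb @ txt_emb.T).squeeze(1).argmax().item()
```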

The dataset can be downloaded from HuggingFace, Zenodo, or Google Drive. The models used in the experiments can be downloaded from this link. The ContextualCLIP model, which achieved the highest performance (29.9 accuracy), can be downloaded from here.

Key Links:

  1. Paper
  2. Github code released by Benno Krojer (first author of the paper). Model: Links, ContextualCLIP
  3. Dataset
  4. Demo page spaces link? A HuggingFace space to examine the dataset
  5. Google Colab link None
  6. Container image None

Age estimation from image

Two models, MWR and FP-Age, improved the state of the art on age estimation from images.

Image from Paper. In the MWR algorithm for facial age estimation, an initial rank (age) estimate of the input image is refined iteratively by estimating the ρ-rank within a search window (orange bar).
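Here is an illustrative sketch of that refinement loop; rho_regressor is a hypothetical stand-in for the paper's ρ-rank network, and the window size and iteration count are made-up values:

```python
def mwr_refine(image, rho_regressor, init_age, half_window=6.0, n_iter=3):
    """Iteratively refine an age estimate, MWR-style.
    rho_regressor(image, y_lo, y_hi) is assumed to return a relative
    rank rho in [-1, 1] locating the subject's age within the search
    window [y_lo, y_hi]."""
    age = init_age
    for _ in range(n_iter):
        y_lo, y_hi = age - half_window, age + half_window  # the orange bar
        rho = rho_regressor(image, y_lo, y_hi)
        # Map rho from [-1, 1] back to an age inside the window,
        # then re-center the window on the new estimate
        age = (y_lo + y_hi) / 2 + rho * (y_hi - y_lo) / 2
    return age
```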

MWR improved the state of the art on four datasets: ChaLearn 2015 (2.95 vs. previous 3.135), FG-NET (2.23 vs. 2.95), UTKFace (4.37 vs. 4.55), and CACD (4.41 vs. 4.60). The unit is mean absolute error (MAE): the average absolute difference, in years, between predicted and ground-truth ages. FP-Age set a new state of the art on IMDB-Clean (4.68) and KANFace (6.81).

Key Links:

  1. Paper MWR
  2. Github Code released by Nyeong-Ho Shin (first author of the paper). Model: link on the Github page
  3. Datasets ChaLearn 2015, FG-NET, UTKFace, CACD; IMDB-Clean, KANFace
  4. Demo page spaces link? None to date
  5. Google Colab link? None to date
  6. Container image? None to date