I am an MS by Research student at the International Institute of Information Technology, Hyderabad (IIIT-H). I am advised by Professor C V Jawahar and Professor Vinay Namboodiri at the Center for Visual Information Technology (CVIT) Lab. I am also advised by Professor K. Madhav Krishna and Professor Srinath Sridhar at the Robotics Research Center (RRC).

My primary research interest is computer vision for robotics applications. Specifically, I am interested in Implicit Neural Representations, 3D Computer Vision, and Neural Planners for Autonomous Machines.

Before IIIT-H, I was a Data Scientist at Microsoft Research & Development, Hyderabad, where I led the recommendation and suggestion team for the world’s biggest enterprise-facing email client, Outlook. These features are used by more than 100 million users per month; my hunch is that you have seen some of them!

I am also a musician: I sing and play the guitar. I have toured and performed at several places with my previous band, Andrometa. I have also tried my hand at travel vlogging and YouTubing! My brother is a (really awesome) piano player and has taken over the channel now - find it here! I also love traveling.

In 2018, I travelled solo to 6 countries and 13 states, and interviewed 128 independent musicians!

Recent Updates
  • [ October 2022 ] INR-V: A Continuous Representation Space for Video-based Generative Tasks accepted at TMLR 2022!
  • [ September 2022 ] I am grateful to Microsoft Research for awarding me a travel grant of $2000 for WACV 2023!
  • [ August 2022 ] 2 papers accepted at WACV 2023!
  • [ August 2022 ] Gave a tutorial on "Computer Vision challenges in Table-top rearrangement and Planning" at the 6th Summer School of AI, IIIT-H! (find the talk here)
  • [ June 2022 ] We are in the news!
  • [ April 2022 ] We came 3rd in the ICRA 2022 Open Cloud Robot Table Organization Challenge!
  • [ December 2021 ] Granted a Provisional Patent on "SYSTEM AND METHOD FOR TRAINING USERS TO LIP READ"!
  • [ November 2021 ] Joined Robotics Research Center as a Research Fellow!
  • [ August 2021 ] Joined MS by Research at IIIT-H!
  • [ June 2021 ] "Personalized One-Shot Lipreading for an ALS Patient" accepted at BMVC 2021!
  • [ March 2021 ] Joined Center for Visual Information Technology as a Research Fellow!
Selected Research
SCARP: 3D Shape Completion in ARbitrary Poses for Improved Grasping
Bipasha Sen*, Aditya Agarwal*, Gaurav Singh*, Brojeshwar Bhowmick, Srinath Sridhar, Madhava Krishna, Under Review @ ICRA 2023, [ Project Page ] [ video ]
Recovering full 3D shapes from partial observations is a challenging task that has been extensively addressed in the computer vision community. Many deep learning methods tackle this problem by training 3D shape generation networks to learn a prior over the full 3D shapes. In this training regime, the methods expect the inputs to be in a fixed canonical form, without which they fail to learn a valid prior over the 3D shapes. We propose SCARP, a model that performs Shape Completion in ARbitrary Poses. Given a partial point cloud of an object, SCARP learns a disentangled feature representation of pose and shape by relying on rotationally equivariant pose features and geometric shape features trained using a multi-tasking objective. Unlike existing methods that depend on an external canonicalization, SCARP performs canonicalization, pose estimation, and shape completion in a single network, improving the performance by 45% over the existing baselines. In this work, we use SCARP to improve grasp proposals on tabletop objects. By completing partial tabletop objects directly in their observed poses, SCARP enables a SOTA grasp proposal network to improve its proposals by 71.2% on partial shapes.
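A minimal sketch of how a multi-task objective over shape completion and pose estimation could be wired up (the Chamfer formulation, rotation error, and weighting below are illustrative assumptions, not SCARP's exact losses):

```python
import torch

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between point clouds pred (B, N, 3) and gt (B, M, 3)."""
    d = torch.cdist(pred, gt)  # (B, N, M) pairwise distances
    return d.min(dim=2).values.mean() + d.min(dim=1).values.mean()

def multitask_loss(pred_shape, gt_shape, pred_rot, gt_rot, w_pose=1.0):
    """Illustrative multi-task objective: complete the shape and estimate the pose jointly."""
    shape_loss = chamfer_distance(pred_shape, gt_shape)
    pose_loss = torch.norm(pred_rot - gt_rot, dim=(1, 2)).mean()  # Frobenius error on 3x3 rotations
    return shape_loss + w_pose * pose_loss

# Toy usage with random placeholders in place of network predictions and ground truth.
B = 2
pred_shape, gt_shape = torch.rand(B, 1024, 3), torch.rand(B, 2048, 3)
pred_rot, gt_rot = torch.eye(3).expand(B, 3, 3), torch.eye(3).expand(B, 3, 3)
loss = multitask_loss(pred_shape, gt_shape, pred_rot, gt_rot)
```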
INR-V: A Continuous Representation Space for Video-based Generative Tasks
Bipasha Sen*, Aditya Agarwal*, Vinay Namboodiri, C V Jawahar, TMLR 2022, [ OpenReview ] [ Project Page ] [ video ]
Generating videos is a complex task that is accomplished by generating a set of temporally coherent images frame-by-frame. This limits the expressivity of videos to only image-based operations on the individual video frames, requiring network designs that obtain temporally coherent trajectories in the underlying image space. We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using implicit neural representations (INRs), a multi-layer perceptron that predicts an RGB value for each input pixel location of the video. The INR is predicted using a meta-network, a hypernetwork trained on neural representations of multiple video instances. Later, the meta-network can be sampled to generate diverse novel videos, enabling many downstream video-based generative tasks. Interestingly, we find that conditional regularization and progressive weight initialization play a crucial role in obtaining INR-V. The representation space learned by INR-V is more expressive than an image space, showcasing many interesting properties not possible with existing works. For instance, INR-V can smoothly interpolate intermediate videos between known video instances (such as intermediate identities, expressions, and poses in face videos). It can also inpaint missing portions in videos to recover temporally coherent full videos. In this work, we evaluate the space learned by INR-V on diverse generative tasks such as video interpolation, novel video generation, video inversion, and video inpainting against the existing baselines. INR-V significantly outperforms the baselines on several of these demonstrated tasks, clearly showing the potential of the proposed representation space.
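To make the coordinate-MLP idea behind an INR concrete, here is a minimal sketch that fits a single video as a network mapping a normalized (x, y, t) pixel location to an RGB value. The layer widths, sinusoidal positional encoding, and training loop are assumptions for illustration; they are not the INR-V architecture or its hypernetwork.

```python
import torch
import torch.nn as nn

class VideoINR(nn.Module):
    """Minimal coordinate MLP: maps an (x, y, t) location to an RGB value.
    Widths and the sinusoidal encoding are illustrative choices, not INR-V's."""
    def __init__(self, num_freqs=6, hidden=256):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 3 * 2 * num_freqs  # sin/cos encoding of (x, y, t)
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def encode(self, coords):
        # coords: (N, 3) with x, y, t normalized to [-1, 1]
        freqs = 2.0 ** torch.arange(self.num_freqs, dtype=torch.float32, device=coords.device)
        angles = coords.unsqueeze(-1) * freqs  # (N, 3, num_freqs)
        return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)

    def forward(self, coords):
        return self.mlp(self.encode(coords))

# Fit one video by regressing RGB values at sampled pixel locations.
model = VideoINR()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
coords = torch.rand(4096, 3) * 2 - 1   # random (x, y, t) samples
target_rgb = torch.rand(4096, 3)       # placeholder ground-truth colors
loss = nn.functional.mse_loss(model(coords), target_rgb)
loss.backward()
optimizer.step()
```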
FaceOff: A Video-to-Video Face Swapping System
Aditya Agarwal*, Bipasha Sen*, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar, WACV 2023, [ preprint ] [ video ]
Doubles play an indispensable role in the movie industry. They take the place of the actors in dangerous stunt scenes or in scenes where the same actor plays multiple characters. The double's face is later replaced with the actor's face and expressions manually using expensive CGI technology, costing millions of dollars and taking months to complete. An automated, inexpensive, and fast alternative is to use face-swapping techniques that aim to swap an identity from a source face video (or an image) to a target face video. However, such methods cannot preserve the source expressions of the actor that are important for the scene's context. To tackle this challenge, we introduce video-to-video (V2V) face-swapping, a novel task of face-swapping that preserves (1) the identity and expressions of the source (actor) face video and (2) the background and pose of the target (double) video. We propose FaceOff, a V2V face-swapping system that operates by learning a robust blending operation to merge two face videos following the constraints above. It first reduces the videos to a quantized latent space and then blends them in the reduced space. FaceOff is trained in a self-supervised manner and robustly tackles the non-trivial challenges of V2V face-swapping. As shown in the experimental section, FaceOff significantly outperforms alternate approaches qualitatively and quantitatively.
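As a loose illustration of the reduce-then-blend idea (map both videos into a shared reduced space, blend there, decode back to frames), here is a toy skeleton. Every module is a stand-in linear layer, the vector quantization step is omitted, and the shapes are arbitrary; this is not FaceOff's architecture.

```python
import torch
import torch.nn as nn

class BlendSwap(nn.Module):
    """Toy reduce-then-blend pipeline: encode both face videos into a shared
    reduced latent space, blend there, and decode back to frames."""
    def __init__(self, latent_dim=64, frame_hw=64):
        super().__init__()
        frame_pixels = 3 * frame_hw * frame_hw
        self.frame_hw = frame_hw
        self.encode = nn.Sequential(nn.Flatten(start_dim=2), nn.Linear(frame_pixels, latent_dim))
        self.blend = nn.Linear(2 * latent_dim, latent_dim)
        self.decode = nn.Linear(latent_dim, frame_pixels)

    def forward(self, source, target):
        # source/target: (B, T, 3, H, W) face-video clips
        z_src = self.encode(source)   # should carry source identity + expressions
        z_tgt = self.encode(target)   # should carry target pose + background
        z_mix = self.blend(torch.cat([z_src, z_tgt], dim=-1))
        return self.decode(z_mix).view(*source.shape[:2], 3, self.frame_hw, self.frame_hw)

model = BlendSwap()
src, tgt = torch.rand(1, 8, 3, 64, 64), torch.rand(1, 8, 3, 64, 64)
out = model(src, tgt)   # (1, 8, 3, 64, 64) blended frames
```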
Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale
Aditya Agarwal*, Bipasha Sen*, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar, WACV 2023, [ preprint ]
Many people with some form of hearing loss consider lipreading their primary mode of day-to-day communication. However, finding resources to learn or improve one's lipreading skills can be challenging. This is further exacerbated during the COVID-19 pandemic due to restrictions on direct interactions with peers and speech therapists. Today, online MOOC platforms like Coursera and Udemy have become the most effective form of training for many kinds of skill development. However, online lipreading resources are scarce, as creating such resources is an extensive process needing months of manual effort to record hired actors. Because of the manual pipeline, such platforms are also limited in vocabulary, supported languages, accents, and speakers, and have a high usage cost. In this work, we investigate the possibility of replacing real human talking videos with synthetically generated videos. Synthetic data can be used to easily incorporate larger vocabularies, variations in accent, local languages, and many speakers. We propose an end-to-end automated pipeline to develop such a platform using state-of-the-art talking head video generation networks, text-to-speech models, and computer vision techniques. We then perform an extensive human evaluation using carefully thought-out lipreading exercises to validate the quality of our designed platform against existing lipreading platforms. Our studies concretely point towards the potential of our approach for the development of a large-scale lipreading MOOC platform that can impact millions of people with hearing loss.
Approaches and Challenges in Robotic Perception for Table-top Rearrangement and Planning
Aditya Agarwal*, Bipasha Sen*, Shankara Narayanan V*, Vishal Reddy Mandadi*, Brojeshwar Bhowmick, K Madhava Krishna, arXiv 2022, [ paper ] [ video ]
Table-top Rearrangement and Planning is a challenging problem that relies heavily on an excellent perception stack. The perception stack involves observing and registering the 3D scene on the table, detecting the objects on the table, and determining how to manipulate them. Consequently, it greatly influences the system's task-planning and motion-planning stacks that follow. We present a comprehensive overview and discuss the different challenges associated with the perception module. This work is a result of our extensive involvement in the ICRA 2022 Open Cloud Robot Table Organization Challenge, in which we stood third in the final rankings.
Personalized One-Shot Lipreading for an ALS Patient
Bipasha Sen*, Aditya Agarwal*, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar, BMVC 2021, [ paper ] [ video ]
Lipreading, or visually recognizing speech from the mouth movements of a speaker, is a challenging and mentally taxing task. Unfortunately, multiple medical conditions force people to depend on this skill in their day-to-day lives for essential communication. Patients suffering from Amyotrophic Lateral Sclerosis (ALS) often lose muscle control and, consequently, their ability to generate speech and communicate via lip movements. Existing large datasets do not focus on medical patients or curate personalized vocabulary relevant to an individual. Collecting a large-scale dataset of a patient, needed to train modern data-hungry deep learning models, is, however, extremely challenging. In this work, we propose a personalized network to lipread an ALS patient using only one-shot examples. We depend on synthetically generated lip movements to augment the one-shot scenario. A Variational Encoder based domain adaptation technique is used to bridge the real-synthetic domain gap. Our approach significantly improves over comparable methods, achieving a top-5 accuracy of 83.2% for the patient compared to 62.6%. Apart from evaluating our approach on the ALS patient, we also extend it to people with hearing impairment who rely extensively on lip movements to communicate.