I am a first-year Ph.D. student at MIT CSAIL, advised by Prof. Pulkit Agrawal. I am supported by the Ida Green Fellowship. I am grateful to have been advised by some amazing researchers, including Prof. C V Jawahar, Vinay Namboodiri, K. Madhav Krishna, Srinath Sridhar, Liam Paull, and Florian Shkurti.
Before this, I was a Data Scientist at Microsoft, where I led the recommendations and suggestions team for Outlook, the world's biggest enterprise-facing email client. These features are used by more than 100 million users per month!
Creative Outlet. I am a musician. I sing and play guitar. I have toured and performed at several places with my previous band, Andrometa. I also LOVE traveling and used to create travel vlogs and music covers on YouTube! My brother is an amazing pianist and has taken over the channel now: Insen: Outdoor Pianist.
Research Interest
My interest lies at the intersection of 3D computer vision and robotics. Specifically, I am interested in designing improved representations of the 3D world that enable embodied agents to acquire a holistic view of their surroundings. With such representations, an agent can make better-informed control decisions for a given downstream goal, for example, manipulation or autonomous navigation.
Today, most works rely on explicit representations such as point clouds or voxel grids. These are limiting in many ways: they are high-dimensional, discrete, and, most importantly, incomplete; they only capture explicit values at specific locations and miss the underlying structure. I am more interested in implicit representations of the world and in designing improved task-specific representations. Ultimately, I am excited to see embodied AI become part of the real world and integrate seamlessly with humans!
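To make the explicit-versus-implicit contrast concrete, here is a minimal sketch (the network size and the signed-distance choice are my own illustration, not any specific paper's design): a voxel grid stores values only at fixed cells, while a coordinate MLP can be queried at any continuous point.

```python
import torch
import torch.nn as nn

# Explicit: a voxel grid holds values only at fixed, discrete locations,
# and memory grows cubically with resolution.
voxels = torch.zeros(64, 64, 64)

# Implicit: a small MLP maps any continuous 3D point to a signed distance;
# the scene lives in the network weights and can be queried anywhere.
sdf = nn.Sequential(
    nn.Linear(3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

query = torch.tensor([[0.12, -0.53, 0.98]])  # an arbitrary off-grid point
distance = sdf(query)  # defined everywhere and differentiable in the input
```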
Selected Research
*Equal Contribution / Highlighted Papers
ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning
Qiao Gu*, Ali Kuwajerwala*, Sacha Morin*, Krishna Murthy Jatavallabhula*, Bipasha Sen, Aditya Agarwal, Kirsty Ellis, Celso Miguel de Melo, Corban Rivera, William Paul, Rama Chellappa, Chuang Gan, Joshua B. Tenenbaum, Antonio Torralba, Florian Shkurti, Liam Paull
Preprint
For robots to perform a wide variety of tasks, they require a 3D representation of the world that is semantically rich, yet compact and efficient for task-driven perception and planning. ConceptGraphs is an open-vocabulary, graph-structured representation for 3D scenes that generalizes to novel semantic classes without the need to collect large 3D datasets or finetune models.
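As a rough illustration of what a graph-structured, open-vocabulary scene representation might look like in code (the fields and the similarity-based query below are my simplification, not the paper's exact schema):

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class ObjectNode:
    points: np.ndarray     # (N, 3) fused 3D points belonging to this object
    caption: str           # open-vocabulary description, e.g. from a VLM
    embedding: np.ndarray  # unit-norm feature used for language queries

@dataclass
class SceneGraph:
    nodes: list = field(default_factory=list)   # ObjectNode instances
    edges: list = field(default_factory=list)   # (i, j, relation) triples

    def query(self, text_embedding: np.ndarray) -> ObjectNode:
        # Return the object whose embedding best matches a language query;
        # no fixed class list is needed, hence "open vocabulary".
        sims = [node.embedding @ text_embedding for node in self.nodes]
        return self.nodes[int(np.argmax(sims))]
```

A planner can then operate over the compact node/edge structure instead of raw points.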
EDMP: Ensemble-of-costs-guided Diffusion for Motion Planning
Kallol Saha*, Vishal Mandadi*, Jayaram Reddy*, Ajit Srikanth, Aditya Agarwal, Bipasha Sen (in advising capacity), Arun Singh, Madhava Krishna
Preprint / arXiv
EDMP combines the strengths of classical planning and deep learning: a diffusion policy learns a prior over kinematically valid trajectories, which is guided directly at inference time by scene-specific costs such as collision cost. Instead of a single cost, we use multiple cost functions (ensemble-of-costs guidance) to capture variations across scenes, thereby generalizing to diverse scenes.
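A minimal sketch of the idea, assuming a trained denoiser `model(trajs, t)` and a list of scene-specific `costs` (both hypothetical interfaces; the paper's exact update differs):

```python
import torch

def ensemble_guided_sampling(model, costs, T=100, horizon=64, dof=7, step=0.1):
    # One trajectory per cost function; each is guided by its own gradient.
    trajs = torch.randn(len(costs), horizon, dof)
    for t in reversed(range(T)):
        trajs = model(trajs, t).detach()  # one reverse step of the learned prior
        for i, cost in enumerate(costs):
            x = trajs[i].clone().requires_grad_(True)
            (grad,) = torch.autograd.grad(cost(x), x)
            trajs[i] = (x - step * grad).detach()  # steer toward low cost
    # Pick the best trajectory across the ensemble for this scene.
    final_costs = torch.stack([c(tr) for c, tr in zip(costs, trajs)])
    return trajs[final_costs.argmin()]
```

The ensemble matters because no single hand-designed cost fits every scene: each guide proposes a candidate, and the cheapest one wins.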
HyP-NeRF: Learning Improved NeRF Priors using a HyperNetwork
Bipasha Sen*, Gaurav Singh*, Aditya Agarwal*, Rohith Agaram, Madhava Krishna, Srinath Sridhar
NeurIPS 2023 / arXiv
Learning generalizable NeRF priors over categories of scenes or objects has been challenging due to the high dimensionality of the network weight space. To address the limitations of existing work in generalization and multi-view consistency, and to improve quality, we propose HyP-NeRF, a latent-conditioning method for learning generalizable category-level NeRF priors using hypernetworks.
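A toy version of the hypernetwork idea (the layer sizes and the tiny two-layer NeRF head are my choices for brevity, not the paper's architecture):

```python
import torch
import torch.nn as nn

class HyperNeRF(nn.Module):
    # Map a per-instance latent code to ALL weights of a tiny coordinate
    # MLP (3 -> hidden -> RGB + density), so one hypernetwork encodes a
    # category-level prior and each latent code selects an instance.
    def __init__(self, latent_dim=64, hidden=32):
        super().__init__()
        self.h = hidden
        n_params = 3 * hidden + hidden + hidden * 4 + 4  # w1, b1, w2, b2
        self.hyper = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                   nn.Linear(256, n_params))

    def forward(self, z, xyz):
        p, h = self.hyper(z), self.h
        w1, b1 = p[:3 * h].view(h, 3), p[3 * h:4 * h]
        w2, b2 = p[4 * h:8 * h].view(4, h), p[-4:]
        feat = torch.relu(xyz @ w1.T + b1)
        return feat @ w2.T + b2  # per-point (R, G, B, sigma)

z = torch.randn(64)              # latent code for one object instance
xyz = torch.rand(1024, 3)        # sample points along camera rays
rgb_sigma = HyperNeRF()(z, xyz)  # (1024, 4)
```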
Disentangling Planning and Control for Non-prehensile Tabletop Manipulation
Vishal Mandadi, Kallol Saha, Dipanwita Guhathakurta, Mohammad Nomaan Qureshi, Aditya Agarwal, Bipasha Sen (in advising capacity), Dipanjan Das, Brojeshwar Bhowmick, Arun Kumar Singh, Madhava Krishna
CASE 2023
Our framework disentangles planning and control, enabling us to operate in a context-free manner. It consists of an A* planner and a low-level RL controller: the controller is agnostic to the scene context, while A* is independent of the low-level control and takes only the scene context into account.
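In pseudo-Python, the separation looks roughly like this (`astar_plan`, `rl_controller`, and the `env` interface are hypothetical stand-ins, not the paper's API):

```python
def rearrange(env, astar_plan, rl_controller, goal, tol=1e-2):
    # Scene-aware planning happens once, up front: A* searches the scene
    # map for a collision-free sequence of object waypoints.
    waypoints = astar_plan(env.scene_map(), env.object_pose(), goal)
    obs = env.observe()
    for wp in waypoints:
        # The RL controller only ever sees (observation, next waypoint);
        # it is blind to the scene, which is what makes it context-free.
        while dist(env.object_pose(), wp) > tol:
            obs = env.step(rl_controller(obs, wp))

def dist(pose, wp):
    return sum((p - w) ** 2 for p, w in zip(pose, wp)) ** 0.5
```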
Uncovering hidden biases against Indian independent artists in the perception of music quality by Indian Audience
Bipasha Sen*, Aditya Agarwal*, Vinoo Alluri
ICMPC 2023
Unlike well-established (W-E) music artists, Indian indie (In-In) music artists are small-scale artists unsigned by major music labels. Consequently, In-In music receives less publicity and fewer resources during music production, and may therefore be perceived by the Indian audience as lower in quality than W-E music. In this work, we investigate whether the Indian audience's perception of music quality is biased.
SCARP: 3D Shape Completion in ARbitrary Poses for Improved Grasping
Bipasha Sen*, Aditya Agarwal*, Gaurav Singh*, Brojeshwar Bhowmick, Srinath Sridhar, Madhava Krishna
ICRA 2023, RSS-W 2023 / project page / video
We propose SCARP, a model that performs Shape Completion in ARbitrary Poses. Given a partial point cloud of an object, SCARP learns a disentangled feature representation of pose and shape by relying on rotationally equivariant pose features and geometric shape features trained using a multi-tasking objective.
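A stripped-down sketch of the multi-task setup (the plain PointNet-style encoder below stands in for the rotationally equivariant features the paper actually uses):

```python
import torch
import torch.nn as nn

class PoseShapeNet(nn.Module):
    def __init__(self, n_points=2048):
        super().__init__()
        self.n_points = n_points
        self.encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(),
                                     nn.Linear(128, 256))
        self.pose_head = nn.Linear(256, 6)  # pose that canonicalizes the input
        self.shape_head = nn.Linear(256, n_points * 3)  # canonical completion

    def forward(self, partial):  # partial: (B, N, 3) point cloud
        feat = self.encoder(partial).max(dim=1).values  # permutation-invariant
        pose = self.pose_head(feat)
        shape = self.shape_head(feat).view(-1, self.n_points, 3)
        return pose, shape  # trained jointly with a multi-task loss
```

Disentangling the two lets completion happen in a canonical frame even when the input arrives in an arbitrary pose.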
INR-V: A Continuous Representation Space for Video-based Generative Tasks
We propose INR-V, a video representation network that learns a continuous space for video-based generative tasks. INR-V parameterizes videos using an implicit neural representation (INR): a multi-layer perceptron that predicts an RGB value for each input pixel location of the video.
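A minimal video INR in this spirit (the sizes and the sigmoid output range are my own choices):

```python
import torch
import torch.nn as nn

# One MLP maps a continuous (x, y, t) coordinate to an RGB value,
# so the network weights themselves are the video representation.
inr = nn.Sequential(
    nn.Linear(3, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 3), nn.Sigmoid(),  # RGB in [0, 1]
)

# Render a 64x64 frame at time t = 0.5 by querying every pixel coordinate.
H = W = 64
ys, xs = torch.meshgrid(torch.linspace(0, 1, H),
                        torch.linspace(0, 1, W), indexing="ij")
coords = torch.stack([xs, ys, torch.full_like(xs, 0.5)], dim=-1)
frame = inr(coords.view(-1, 3)).view(H, W, 3)
```

Learning a continuous space over such networks is what enables generative tasks like interpolating between videos.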
FaceOff: A Video-to-Video Face Swapping System
Aditya Agarwal*, Bipasha Sen*, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
WACV 2023 / project page / paper / video
We introduce video-to-video (V2V) face swapping, a novel face-swapping task that preserves (1) the identity and expressions of the source (actor) face video and (2) the background and pose of the target (double) video. We propose FaceOff, a V2V face-swapping system that learns a robust blending operation to merge two face videos under the above constraints.
Towards MOOCs for Lipreading: Using Synthetic Talking Heads to Train Humans in Lipreading at Scale
Aditya Agarwal*, Bipasha Sen*, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
WACV 2023 / paper
We propose an end-to-end automated pipeline for a lipreading training platform built with state-of-the-art talking-head video generation networks, text-to-speech models, and computer vision techniques. We then perform an extensive human evaluation using carefully designed lipreading exercises to validate the quality of our platform against existing lipreading platforms.
Approaches and Challenges in Robotic Perception for Table-top Rearrangement and Planning
Aditya Agarwal*, Bipasha Sen*, Shankara Narayanan V*, Vishal Reddy Mandadi*, Brojeshwar Bhowmick, K Madhava Krishna
arXiv 2022 / paper / video
Table-top rearrangement and planning is a challenging problem that relies heavily on an excellent perception stack. We present a comprehensive overview and discuss the different challenges associated with the perception module. This work is a result of our extensive involvement in the ICRA 2022 OCRTOC Challenge.
Personalized One-Shot Lipreading for an ALS Patient
Bipasha Sen*, Aditya Agarwal*, Rudrabha Mukhopadhyay, Vinay Namboodiri, C V Jawahar
BMVC 2021 / paper / video / portal
We propose a personalized network that lipreads an ALS patient using only one-shot examples. Our approach achieves 83.2% top-5 accuracy for the patient, a significant improvement over the 62.6% achieved by comparable methods. Beyond the ALS patient, we also extend our approach to people with hearing impairments who rely extensively on lip movements to communicate.