I am a Research Fellow at the CVIT lab of IIIT-H, advised by Prof. CV Jawahar and Prof. Vinay Namboodiri. Between 2019 and 2020, I was also a Visiting Researcher at the LTRC lab, where I was advised by Prof. Anil Kumar Vuppala.

I am primarily interested in developing deep learning models that enhance a human’s visual, auditory, and interactive experiences. I think this can be achieved by training models to use multimodal environmental cues (video, sound, interaction, etc.) to improve their understanding of the environment and to generate novel content, inspired by that environment, across different modalities. In this pursuit, I am exploring audio-visual generative modeling, complex scene understanding, and human intent forecasting. My primary motivation is AR/VR, where I think such multimodal generative models can improve content generation.

I previously worked as a Data Scientist at Microsoft Research & Development - Hyderabad, where I developed intelligent features for Outlook, the world’s biggest enterprise-facing email client. These features increase users’ productivity and reduce their time to task completion. They are used by more than 100 million users per month, so my hunch is that you might have seen some of them!

Suggested Attachments (Outlook)
Meeting Insights (Outlook)

My favorite language is Python, simply because it is so simple, yet elegant and powerful! You can find a sample of my code here. Below, I have listed some of the major projects I’ve undertaken in the past few years.

PS. I am a musician: I sing and play guitar. I have toured and performed at several places with my previous band, Andrometa. I also tried my hand at travel vlogging and YouTubing! Find them here!

Major Projects

BReQS: Self-Supervised Meeting Summarization

Self-Supervision, Natural Language Processing, Deep Learning, Ongoing

*Cited from “Adversarial Text Generation Without Reinforcement Learning”.

A self-supervised framework for generating summaries of long meetings with multiple participants and speakers. The model is called BReQS, which stands for Brevity, Relevance, Quality and Span; I use these four measures as training objectives. Brevity keeps the summary concise and is achieved by using an Autoencoder, trained with a reconstruction loss, to compress the meeting transcript into a short latent representation. Relevance is a measure of information loss. To obtain training data for relevance, I exploit an interesting property of meetings: a continuous utterance by a single speaker, barring pauses and short interruptions by other participants, can be considered a single context. These short interruptions are predominantly queries about the context shared by the speaker, so they underscore the most relevant points covered in the meeting. A pre-trained question-answering model determines whether the generated summary answers these queries. Quality is a measure of readability. It is achieved by using a Discriminator (a combination of an autoencoder and a generative adversarial network) that discriminates between human-generated and machine-generated summaries. Lastly, a loss based on the length of the summary maintains the Span, which keeps the summary from getting too short. A combination of these losses is used to train the model.
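As a rough illustration of how the four BReQS terms could be combined, here is a minimal sketch. The weighted-sum form, the `LossWeights` values, and the `span_penalty` shape are my assumptions for illustration only; the description above says only that a combination of the losses is used.

```python
from dataclasses import dataclass


@dataclass
class LossWeights:
    # Hypothetical weights; the actual weighting is not specified above.
    brevity: float = 1.0
    relevance: float = 1.0
    quality: float = 1.0
    span: float = 1.0


def span_penalty(summary_len: int, min_len: int) -> float:
    """Assumed length-based loss: positive only when the summary is too short."""
    return max(0, min_len - summary_len) / min_len


def breqs_loss(reconstruction, relevance, quality, summary_len, min_len, weights=None):
    """Weighted sum of the four terms (one possible way to combine them).

    reconstruction: autoencoder reconstruction loss (Brevity)
    relevance:      loss from the question-answering check (Relevance)
    quality:        discriminator-based loss (Quality)
    """
    w = weights or LossWeights()
    return (
        w.brevity * reconstruction
        + w.relevance * relevance
        + w.quality * quality
        + w.span * span_penalty(summary_len, min_len)
    )


# Toy numbers: a 5-token summary when at least 10 tokens are wanted.
total = breqs_loss(0.2, 0.3, 0.1, summary_len=5, min_len=10)
```

In practice, each term would come from its own network, and the weights would need tuning.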

Reed: An approach towards quickly bootstrapping multilingual acoustic models

Speech Recognition, Deep Learning [ paper ] [ presentation ] (accepted at SLT, 2021)

Abstract: A multilingual automatic speech recognition (ASR) system is a single entity capable of transcribing multiple languages sharing a common phone space. The performance of such a system is highly dependent on the compatibility of the languages. State-of-the-art speech recognition systems are built using sequential architectures based on recurrent neural networks (RNNs), limiting the computational parallelization in training. This poses a significant challenge in terms of the time taken to bootstrap and validate the compatibility of multiple languages for building a robust multilingual system. Complex architectural choices based on self-attention networks are made to improve the parallelization, thereby reducing the training time. In this work, we propose Reed, a simple system based on 1D convolutions which uses very short context to improve the training time. To improve the performance of our system, we use raw time-domain speech signals directly as input. This enables the convolutional layers to learn feature representations rather than relying on handcrafted features such as MFCC. We report improvement on training and inference times by at least a factor of 4× and 7.4× respectively, with comparable WERs against standard RNN-based baseline systems on SpeechOcean’s multilingual low-resource dataset.
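To make the idea of a short-context 1D convolution over raw samples concrete, here is a small pure-Python sketch. It is not the Reed architecture: the kernel values, stride, and toy waveform are made up for illustration, and a real model would learn its kernels and stack many such layers.

```python
def conv1d(signal, kernel, stride=1):
    """Valid (unpadded) 1D cross-correlation, the operation used in conv layers.

    Each output value depends only on len(kernel) neighbouring samples,
    so all outputs can be computed independently (and hence in parallel).
    """
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(0, len(signal) - k + 1, stride)
    ]


def relu(values):
    """Element-wise rectifier."""
    return [max(0.0, v) for v in values]


# Toy "raw waveform" (time-domain samples) and a hand-picked 3-tap kernel.
waveform = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
kernel = [1.0, 0.0, -1.0]  # responds to a falling signal

features = relu(conv1d(waveform, kernel, stride=2))
print(features)  # [0.0, 1.0, 1.0]
```

Unlike a recurrent layer, no output here waits on a previous time step, which is what makes this kind of layer easy to parallelize.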

An Approach Towards Action Recognition using Part Based Hierarchical Fusion

Computer Vision, Deep Learning [ paper ] [ presentation ] (accepted at ISVC, 2020)

Abstract: The human body can be represented as an articulation of rigid and hinged joints which can be combined to form the parts of the body. Human actions can be thought of as a collective action of these parts. Hence, learning an effective spatio-temporal representation of the collective motion of these parts is key to action recognition. In this work, we propose an end-to-end pipeline for the task of human action recognition on video sequences using 2D joint trajectories estimated from a pose estimation framework. We use a Hierarchical Bidirectional Long Short Term Memory Network (HBLSTM) to model the spatio-temporal dependencies of the motion by fusing the pose-based joint trajectories in a part-based hierarchical fashion. To demonstrate the effectiveness of our proposed approach, we compare its performance with six comparative architectures based on our model.

Sentence Modelling for Contextual Meeting Segmentation

Natural Language Processing, Summarization [ short paper ]

Abstract: We propose a novel technique of contextual meeting segmentation for the task of meeting summarization. Unlike documents, meetings span multiple topics spread throughout the course of the meeting. In order to capture the true summary of the meeting, it is important to capture the summary of each of the topics present in the meeting. The segmentation approaches existing today ignore the fact that sentences belonging to the same context can be continuous or non-continuous in nature. We solve the problem of contextual meeting segmentation using a pointer mechanism to extract the related sentences from a meeting transcription without assuming that the sentences are consecutive in nature.
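The pointer idea can be sketched as attention over sentence positions: score every sentence against a context query, normalize the scores, and pick positions by probability rather than by adjacency. The snippet below is only a toy version under my own assumptions (dot-product scoring, hand-written 2D sentence vectors, and a fixed `threshold`); it is not the model from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def point_to_related(query, sentences, threshold=0.3):
    """Return indices of sentences the pointer distribution favours.

    The selected indices need not be consecutive.
    """
    probs = softmax([dot(query, s) for s in sentences])
    return [i for i, p in enumerate(probs) if p >= threshold]


# Hypothetical sentence embeddings: sentences 0 and 2 share a topic.
sentence_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
context_query = [3.0, 0.0]

segment = point_to_related(context_query, sentence_vectors)
print(segment)  # [0, 2]
```

The selected segment skips sentence 1, which is the non-consecutive case that adjacency-based segmentation would miss.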

Knowledge Graph (AiGraph) Based Meeting Insights

Information Retrieval, Recommendation [ short paper ]

Abstract: In this paper we present AiGraph, an enterprise knowledge graph, representing details about how an employee communicates through emails, meetings, and documents. By representing all her communication in the form of a graph, we are able to extract complex insights which are computationally expensive in siloed applications. We consider a recommendation application – Meeting Insights – to show the power of AiGraph. This application recommends related emails and documents for a given meeting. There are a number of ways in which AiGraph can improve Meeting Insights – most significantly, it can improve the relevance of the system by providing better candidate emails and features for a ranker to rank these candidates. In this paper we describe various ways to improve the relevance of Meeting Insights using AiGraph.

Anterior Segment Imaging - MIT Media Lab’s REDX

Computer Vision, Anomaly Detection, Hardware [ poster ]

Rethinking Engineering Design Execution (REDX) is an interdisciplinary platform that enables collaboration between world-renowned medical professionals, engineers and computer scientists to build solutions for society’s most pressing healthcare challenges. I collaborated with Hyderabad’s leading eye-care institute, L.V.Prasad Eye Institute, to build a low-cost, wearable solid-state device as a replacement for existing ophthalmic slit lamps, which ophthalmologists use extensively for examining the eye. A slit lamp is bulky and expensive, and consequently extremely difficult to carry out of hospitals, particularly to remote locations in India, which limits the eye care available in such locations. I, along with my team, worked on a simple portable device that could capture high-definition stills of the cornea (the anterior segment of the eye) from multiple viewpoints. The multiple viewpoints enabled me to reconstruct the stills into a 3D model of the cornea. I built an anomaly detector to identify any abnormalities in the reconstructed cornea, which, along with the reconstructed 3D corneal model, served as a preliminary report to the medical professional for further analysis.

Cloud Based Group Oriented File Sharing Network - TheBhaad

Cloud Computing, Security, User Behavioral Study [ video ]

A one-stop online environment that complemented the real environment in terms of interactions between students and professors. I built a file-sharing network instead of just a portal. Back in 2013, the means for sharing assignments and coursework online (such as Facebook groups) were limited and weren’t user-friendly. Moreover, they were inconvenient for group interaction. Seeing the dire need for an organized, group-based file-sharing platform, I single-handedly developed TheBhaad over the course of 5 months. TheBhaad had an operating-system-like user interface for easy operation, with advanced search across groups (classrooms) and contacts, personalized document realignment, a discussion forum, and request and push-notification features. It was used extensively at my undergraduate institution, at one point averaging 5000 active users per month. I was awarded Best Entrepreneur by my institute for my work on TheBhaad.

Virtual Shopping Assistant Bot

Reinforcement Learning, Conversational Bot [ Reference ] (Microsoft - Data Science Intern)

Developed a virtual shopping assistant bot that proactively engaged users and assisted them in placing an order. From a set of curated questions, the agent learned the optimal order of questions to ask to maximize user engagement. I integrated multi-world testing (MWT), a machine learning toolbox based on reinforcement learning for principled and efficient experimentation, into the bot.
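To give a feel for how an agent can learn which question order works best from engagement feedback, here is a minimal epsilon-greedy bandit sketch. It is a stand-in I chose for illustration and not the MWT toolbox or the bot's actual logic; the question orders, the 0/1 engagement reward, and the `EpsilonGreedy` class are all hypothetical.

```python
import random


class EpsilonGreedy:
    """Tiny epsilon-greedy bandit over a fixed set of candidate actions."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.counts = {arm: 0 for arm in self.arms}
        self.values = {arm: 0.0 for arm in self.arms}  # running mean reward
        self.rng = random.Random(seed)

    def choose(self):
        # Explore with probability epsilon, otherwise exploit the best estimate.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)
        return max(self.arms, key=lambda arm: self.values[arm])

    def update(self, arm, reward):
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]


# Hypothetical question orders the bot could try.
order_a = ("budget", "category", "brand")
order_b = ("category", "brand", "budget")

agent = EpsilonGreedy([order_a, order_b], epsilon=0.0)
agent.update(order_a, 1.0)  # user engaged
agent.update(order_a, 0.0)  # user dropped off
agent.update(order_b, 1.0)  # user engaged
best_order = agent.choose()  # order_b, since its mean reward (1.0) beats 0.5
```

With `epsilon` above zero, the agent would keep occasionally trying the other orders, which is what lets its estimates improve over time.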

Other Projects

COVID-19 fact checker

Developed a pipeline to fact-check COVID-19 news queried on Bing by validating the facts against curated authentic sources.


Developed an application that projected and augmented a piano on any tabletop and detected finger positions on the projection to play the corresponding piano notes.

Football Match - Emotion Segregation

Developed a model using word2vec and RNNs to detect emotion in Twitter feeds during football matches and segregate the tweets into buckets representing the supporters of the competing teams.


As a part of the IEEE chapter of my undergraduate institution, I, along with my team, developed a hoverboard.