Recent developments in Deep Learning have paved the way for tasks involving multimodal learning. Visual Question Answering (VQA) is one such challenge: it requires high-level scene interpretation from images combined with language modelling of the associated questions and answers.

The learning architecture behind this demo is based on the model proposed in the original VQA paper and is implemented in Keras. Check out the code, system design, training details, and other information here.
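For a rough sense of the architecture (this is a sketch, not the demo's exact code): the VQA baseline encodes the image with a pretrained CNN and the question with an LSTM, fuses the two by element-wise multiplication, and classifies over a fixed set of top answers. The vocabulary size, question length, and layer widths below are illustrative assumptions.

```python
# Minimal Keras sketch of the classic VQA baseline (Antol et al., 2015).
# All hyperparameters here are assumptions for illustration only.
from tensorflow.keras import layers, Model

NUM_WORDS = 10000      # vocabulary size (assumed)
MAX_Q_LEN = 30         # max question length in tokens (assumed)
NUM_ANSWERS = 1000     # top-K answers treated as output classes

# Image branch: precomputed CNN features (e.g. VGG fc7, 4096-d),
# projected down to a 1024-d joint embedding space.
img_in = layers.Input(shape=(4096,), name="image_features")
img_vec = layers.Dense(1024, activation="tanh")(img_in)

# Question branch: embed tokens, encode with an LSTM,
# then project into the same 1024-d space.
q_in = layers.Input(shape=(MAX_Q_LEN,), dtype="int32", name="question_tokens")
q_emb = layers.Embedding(NUM_WORDS, 300)(q_in)
q_vec = layers.LSTM(512)(q_emb)
q_vec = layers.Dense(1024, activation="tanh")(q_vec)

# Fuse the two modalities element-wise, then classify over answers.
fused = layers.Multiply()([img_vec, q_vec])
x = layers.Dense(1024, activation="tanh")(fused)
x = layers.Dropout(0.5)(x)
out = layers.Dense(NUM_ANSWERS, activation="softmax", name="answer")(x)

model = Model(inputs=[img_in, q_in], outputs=out)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

Element-wise multiplication is the fusion used in the paper; it forces the image and question embeddings into a shared space where agreement between the two modalities drives the answer prediction.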

Read more about the work various research teams have done on this problem. There is also a working demo by the MIT team.

Reach out with feedback or suggestions at anant718@gmail.com.