Abstract: Text-video retrieval is one of the basic tasks for multimodal research and has been widely harnessed in many real-world systems. Most existing approaches directly compare the global ...