TEMPLE: Incentivizing Temporal Understanding of Video Large Language Models via Progressive Pre-SFT Alignment

Publication
In Conference on Computer Vision and Pattern Recognition