We study how the participants interact during a pair program comprehension task, using a dual eye-tracking setup. We identify four layers of interaction episodes, each at a different time scale and each spanning the whole interaction. The first and third layers are based on the participants' utterances, while the second and fourth layers are based on their gaze. The present study concerns the relationships between these layers across time scales.