A student's level of attention during a lecture is a factor that may determine how well a learned concept is retained and subsequently applied. Students who pay attention are generally more participatory in the teaching and learning process than those who do not and, consequently, tend to reach the competencies proposed in their courses. Hence, it is important to design strategies and tools that help teachers monitor students' attention in a non-invasive way, so that they can adjust the dynamics of a lecture when needed. In this work, we introduce a fully automated system, based on computer vision algorithms, for monitoring students' attention. To this end, we feed a recurrent neural network with one-second sequences generated from facial landmarks. This spatiotemporal analysis of video recordings makes it possible to identify when a student is attending to a given explanation in online educational environments. The system is tested on a database of more than 3000 sequences of students who are or are not paying attention to online video lectures. The results show that the proposed system is suitable for monitoring students' attention to a particular explanation.
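As an illustration of the one-second sequences mentioned above, the following minimal sketch groups per-frame facial landmarks into fixed-length windows that a recurrent network could consume. It is not the implementation used in this work: the function name `make_sequences`, the 30 fps frame rate, the 68-point landmark layout, and the use of non-overlapping windows are all assumptions made for the example.

```python
import numpy as np


def make_sequences(landmarks, fps=30, window_seconds=1.0):
    """Split per-frame landmarks into non-overlapping fixed-length windows.

    landmarks: array of shape (num_frames, num_points, 2) holding (x, y)
               coordinates per frame (hypothetical layout, assumed here).
    Returns an array of shape (num_windows, frames_per_window, num_points * 2).
    Trailing frames that do not fill a whole window are dropped.
    """
    landmarks = np.asarray(landmarks, dtype=np.float32)
    frames_per_window = int(round(fps * window_seconds))
    num_windows = landmarks.shape[0] // frames_per_window
    features = landmarks.shape[1] * landmarks.shape[2]

    # Keep only the frames that fill complete windows, then flatten
    # each frame's (x, y) points into a single feature vector.
    usable = landmarks[: num_windows * frames_per_window]
    return usable.reshape(num_windows, frames_per_window, features)


# Example: 95 frames of 68 landmarks at an assumed 30 fps.
frames = np.zeros((95, 68, 2), dtype=np.float32)
sequences = make_sequences(frames, fps=30)
print(sequences.shape)  # (3, 30, 136)
```

Each resulting window has the shape (time steps, features), which is the usual input layout for a recurrent layer; the label for a window (attending or not attending) would come from the annotated database.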