Gastric cancer is the fourth deadliest cancer worldwide. Esophagogastroduodenoscopy (EGD) is the preferred method for diagnosing upper gastrointestinal lesions, particularly early gastric cancer. The procedure's success depends on the endoscopist's experience and on a comprehensive examination that covers a defined set of anatomical landmarks. Most gastric neoplasias go undetected at early stages despite being present during examinations; it is therefore essential to evaluate examination quality and to audit which anatomical regions are inspected during the endoscopy procedure. This study assesses the performance of a recurrent neural network and a transformer architecture in classifying anatomical and sub-anatomical regions within the gastrointestinal tract. By leveraging temporal information, the study aims to improve the accuracy with which these regions are identified. We collected and labeled endoscopy videos from 32 patients, organizing them into four organ categories. Additionally, we used 565 labeled sequences covering six sub-anatomical stomach regions for a separate classification task. The trained networks achieved a macro F1-score of 87.25% for organ classification and 85.31% for stomach region identification. These findings indicate that temporal information improves the ability to accurately identify upper gastrointestinal regions.
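
For readers unfamiliar with the reported metric, the macro F1-score is the unweighted mean of the per-class F1-scores, so every class counts equally regardless of how many samples it has. The following is a minimal sketch of that standard definition, not the evaluation code used in the study; the function name `macro_f1` and the integer labels 0 to 3 (standing in for four classes) are illustrative assumptions, and the toy predictions are invented for demonstration only.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (hypothetical helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed
    scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never occurs
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data only: four placeholder classes, two samples each.
y_true = [0, 0, 1, 1, 2, 2, 3, 3]
y_pred = [0, 0, 1, 2, 2, 2, 3, 0]
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

Because the average is unweighted, a class with few examples influences the score as much as a frequent one, which is why macro F1 is a common choice when class sizes differ.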