This paper presents a novel multimodal recurrent model for time series forecasting based on the LSTM architecture, with a focus on production forecasting in oil wells equipped with rod lift systems. The model is designed to handle heterogeneous time series, incorporating both images and numerical data at each time step, which enables a comprehensive analysis over specified temporal windows. The architecture consists of distinct submodels, each tailored to one data modality. Their outputs are concatenated into a single feature vector that provides a holistic representation of the well's operational status. This representation is refined by a dense layer that applies a non-linear transformation and integrates the modalities. Temporal analysis, the core of the model, is performed by a Long Short-Term Memory (LSTM) layer, which is well suited to capturing long-range dependencies in the data. Finally, a fully connected layer with linear activation produces one-shot multi-step forecasts, which is necessary because the input and output have different modalities. Experimental results show that the proposed multimodal model achieved the best performance in the studied cases, with a Mean Absolute Percentage Error (MAPE) of 8.2%, outperforming univariate and multivariate deep learning-based models, as well as ARIMA implementations, which yielded MAPE values greater than 9%.
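For readers less familiar with the evaluation metric, the following is a minimal Python sketch of MAPE under its conventional definition, the mean of |actual - forecast| / |actual| expressed as a percentage. It assumes this standard formula is the one used here. The function name `mape` and the sample values are illustrative only and are not taken from the experiments.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent (conventional definition).

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one observation is required")
    if any(a == 0 for a in actual):
        # The percentage error is undefined when the true value is zero.
        raise ValueError("actual values must be non-zero")

    # Average the absolute relative error over all forecast steps.
    total = sum(abs((a - f) / a) for a, f in zip(actual, forecast))
    return 100.0 * total / len(actual)


# Illustrative multi-step forecast (made-up numbers, arbitrary units).
actual_production = [100.0, 200.0, 400.0]
forecast_production = [110.0, 180.0, 400.0]

# Relative errors are 10%, 10% and 0%, so the score is about 6.67%.
score = mape(actual_production, forecast_production)
print(f"MAPE: {score:.2f}%")
```

Because each error is scaled by the true value, MAPE is comparable across wells with different production levels, but it is undefined whenever a true value is zero, which the sketch rejects explicitly.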