This paper describes the joint participation of the TIA-LabTL (INAOE) and the MindLab research group (UNAL) in the ImageCLEF 2015 Scalable Concept Image Annotation challenge, subtask 2: generation of textual descriptions of images (noisy track). Our strategy relies on a multimodal representation built in an unsupervised way from the text associated with images and the visual features that represent them. In this representation, a visual prototype is formed for every word extracted from the indexed web pages, each prototype being a distribution over visual descriptors. The generation of a textual description is then formulated as a two-step IR problem. First, the image to be described is used as a visual query and compared with all the visual prototypes in the multimodal representation; next, the k nearest prototypes are used as a textual query to search a collection of textual descriptions, and the retrieved phrase is used to describe the image.
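The two-step retrieval pipeline described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: prototypes are approximated here as mean visual descriptors rather than full distributions, cosine similarity is assumed for the visual comparison, and the textual retrieval step is reduced to simple word overlap; all function names and data shapes are hypothetical.

```python
from collections import defaultdict

def cosine(u, v):
    # Cosine similarity between two dense descriptor vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def build_prototypes(pairs):
    # pairs: (visual_descriptor, associated_words) from web images.
    # Returns word -> prototype, sketched here as the mean descriptor
    # of all images whose text contains that word.
    sums, counts = {}, defaultdict(int)
    for vec, words in pairs:
        for w in words:
            if w not in sums:
                sums[w] = list(vec)
            else:
                sums[w] = [s + x for s, x in zip(sums[w], vec)]
            counts[w] += 1
    return {w: [s / counts[w] for s in sums[w]] for w in sums}

def describe(query_vec, prototypes, phrases, k=2):
    # Step 1: the image (query_vec) is compared with all prototypes;
    # the k nearest words form the textual query.
    nearest = sorted(prototypes,
                     key=lambda w: cosine(query_vec, prototypes[w]),
                     reverse=True)[:k]
    # Step 2: retrieve the phrase best matching the textual query
    # (word overlap stands in for a real IR scoring function).
    return max(phrases,
               key=lambda p: len(set(p.lower().split()) & set(nearest)))
```

For example, with prototypes built from a few (descriptor, words) pairs, `describe` returns the candidate phrase sharing the most words with the k nearest prototypes of the query image.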