Interpreting endomicroscopic images remains a significant challenge, especially because a single still image may not contain enough information for a robust diagnosis. To aid physicians, we investigated local feature-based retrieval methods that, given a query image, return similar annotated images from a database of endomicroscopic images paired with high-level diagnoses stored as textual information. Local feature-based methods may be limited by the small field of view (FOV) of endomicroscopy and by the fact that they take into account neither the spatial relationship between local features nor the temporal relationship between successive images of a video sequence. To extract discriminative information over the entire image field, our proposed method collects local features densely instead of using a standard salient region detector. After the retrieval process, we introduce a verification step that is driven by the textual information in the database and that uses the spatial relationship between local features. A spatial criterion is built from the co-occurrence matrix of local features, and outliers are removed by thresholding this criterion. To overcome the small-FOV problem and take advantage of the video sequence, we propose combining image retrieval with mosaicing. Mosaicing essentially projects the temporal dimension onto a single image with a large field of view. In this framework, videos, represented by mosaics, and single images can be retrieved with the same tools. Using leave-n-out cross-validation, our results show that retrieval accuracy improves when the spatial relationship between local features and the temporal information of endomicroscopic videos, captured through image mosaicing, are taken into account.
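
The verification step described above can be illustrated with a minimal sketch. It assumes that each image has already been reduced to a dense grid of visual-word labels (one integer per grid cell), which the abstract does not specify. The function names, the 4-neighbour adjacency, the histogram-intersection score, and the threshold value are all hypothetical choices made for illustration; they are not the spatial criterion actually used in this work.

```python
from collections import Counter


def cooccurrence(labels):
    """Count unordered visual-word pairs in 4-adjacent cells of a dense grid.

    `labels` is a list of rows, each a list of integer visual-word ids
    (an assumed representation, not one taken from the paper).
    """
    counts = Counter()
    rows, cols = len(labels), len(labels[0])
    for r in range(rows):
        for c in range(cols):
            a = labels[r][c]
            if c + 1 < cols:  # pair with the right-hand neighbour
                counts[tuple(sorted((a, labels[r][c + 1])))] += 1
            if r + 1 < rows:  # pair with the neighbour below
                counts[tuple(sorted((a, labels[r + 1][c])))] += 1
    return counts


def spatial_score(query, candidate):
    """Hypothetical criterion: histogram intersection of the two
    normalised co-occurrence distributions, a value in [0, 1]."""
    q, c = cooccurrence(query), cooccurrence(candidate)
    q_total, c_total = sum(q.values()), sum(c.values())
    if q_total == 0 or c_total == 0:
        return 0.0
    shared = q.keys() & c.keys()
    return sum(min(q[k] / q_total, c[k] / c_total) for k in shared)


def verify(query, retrieved, threshold=0.5):
    """Keep retrieved (name, labels) pairs whose score reaches the
    threshold; everything below it is treated as an outlier."""
    return [name for name, labels in retrieved
            if spatial_score(query, labels) >= threshold]
```

Under these assumptions, `verify(query_labels, [("img_a", labels_a), ("img_b", labels_b)])` returns the names of the retrieved images whose local-feature layout is sufficiently similar to that of the query. Because a mosaic is itself an image, the same functions would apply unchanged to a label grid computed on a mosaic.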