Learning to interpret medical images is an evolving process that requires modular training systems able to accompany users from non-expert to expert level. Our study aims to develop such a system for endomicroscopy diagnosis, using a difficulty predictor to help shorten the physician's learning curve. Because the understanding of video diagnosis is driven by visual similarities, we propose a content-based video retrieval approach to estimate the level of interpretation difficulty. We compare the performance of our retrieval method with several state-of-the-art methods and demonstrate its genericity on two different clinical databases, one of Barrett's esophagus and one of colonic polyps. From our retrieval results, we learn a difficulty predictor against a ground truth given by the percentage of false diagnoses among several physicians. Our experiments show that, although our datasets are not large enough to test for statistical significance, there is a noticeable relationship between our retrieval-based difficulty estimation and the difficulty experienced by the physicians.