Div150Cred - A Social Image Retrieval Result Diversification Dataset with User Tagging Credibility Estimation

This dataset is designed to support information retrieval research on technologies that improve both the relevance and the diversification of search results, with an explicit focus on the social media context.

The dataset consists of Creative Commons data for 300 landmark locations, represented via 45,375 Flickr photos, 16M photo links for around 3,000 users, metadata, Wikipedia pages and content descriptors for text and visual modalities. Data is annotated for the relevance and the diversity of the photos.
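To illustrate how relevance and diversity annotations of this kind are commonly used, the following is a minimal sketch of two standard measures for a ranked photo list: precision at a cutoff (fraction of the top results that are relevant) and cluster recall at a cutoff (fraction of the annotated diversity clusters covered by the top results). The function names, the in-memory inputs (`relevant`, `cluster_of`) and the toy photo IDs are assumptions made for this example only; they do not reflect the dataset's file formats, which are described in the readme, nor the behaviour of any tool shipped with the dataset.

```python
from typing import Dict, Hashable, Sequence, Set


def precision_at(ranked: Sequence[str], relevant: Set[str], n: int) -> float:
    """Fraction of the top-n ranked photos that are annotated as relevant."""
    if n <= 0:
        return 0.0
    hits = sum(1 for photo in ranked[:n] if photo in relevant)
    return hits / n


def cluster_recall_at(
    ranked: Sequence[str], cluster_of: Dict[str, Hashable], n: int
) -> float:
    """Fraction of all annotated clusters represented in the top-n photos.

    `cluster_of` maps each relevant photo ID to its diversity cluster;
    photos missing from the mapping are treated as non-relevant.
    """
    all_clusters = set(cluster_of.values())
    if n <= 0 or not all_clusters:
        return 0.0
    covered = {cluster_of[p] for p in ranked[:n] if p in cluster_of}
    return len(covered) / len(all_clusters)


# Toy example with hypothetical photo IDs (not taken from the dataset).
ranked = ["a", "b", "c", "d"]
cluster_of = {"a": 1, "c": 1, "d": 2}  # relevant photo -> cluster ID
relevant = set(cluster_of)

p2 = precision_at(ranked, relevant, 2)
cr2 = cluster_recall_at(ranked, cluster_of, 2)
p4 = precision_at(ranked, relevant, 4)
cr4 = cluster_recall_at(ranked, cluster_of, 4)
```

In this toy ranking, the top two results contain one relevant photo and cover one of the two clusters; extending the cutoff to four results covers both clusters. A ranking can therefore score well on precision while still scoring poorly on cluster recall, which is the trade-off that diversification methods aim to address.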

The dataset also includes information about user annotation credibility. Credibility is determined as an automatic estimation of the quality (correctness) of a particular user's tags.

Important: much of the information has been obtained by crawling the Internet and from Flickr. Every possible measure has been taken to ensure that the content has been released under a Creative Commons license that allows redistribution. However, the authors cannot fully guarantee that the collection contains absolutely no content without a Creative Commons license. Such content could potentially enter the collection if it was not correctly marked at the source. As for the content descriptors, features are provided on an as-is basis with no guarantee of being correct. The dataset was validated during the 2014 Retrieving Diverse Social Images Task at the MediaEval Benchmarking Initiative for Multimedia Evaluation.

Resources:

  • Div150Cred readme file describing the data format
  • credibilityser
  • div_eval.jar
  • devset
  • me14div_example_run.txt
  • testset