Datasets

LICA Dataset

Elad Hirsch, Shubham Yadav, Mohit Garg, Purvanshi Mehta

The LICA dataset is a collection of graphic design layouts, released to promote research in the field of AI for Design. Each layout captures the complete rendering specification of a design — component positions, typography, images, and background — alongside rich natural-language annotations at both the layout and template level.

20 design categories
971,850 unique templates
27,261 animated layouts annotated

arXiv | GitHub | Hugging Face

Nano-Consistent-150K

Junyan Ye, Dongzhi Jiang, Zilong Huang, Jun He, Leqi Zhu, Zhiyuan Yan, Ruichuan An, Hongsheng Li, Conghui He, Weijia Li

Nano-Consistent-150K is the first dataset constructed using Nano-Banana that exceeds 150k high-quality samples. It is designed to preserve consistent human identity across diverse and complex editing scenarios. A key feature is its identity consistency: for a single portrait, more than 35 distinct editing outputs are provided across diverse tasks and instructions.

120k single-image editing instances
40k multi-reference generation samples
8 distinct sub-tasks

Website | GitHub | Hugging Face

Echo-4o-Image

Junyan Ye, Dongzhi Jiang, Zihao Wang, Leqi Zhu, Zhenghao Hu, Zilong Huang, Jun He, Zhiyuan Yan, Jinghua Yu, Hongsheng Li, Conghui He, Weijia Li

Echo-4o-Image is a large-scale synthetic dataset distilled from GPT-4o. It contains approximately 179,000 samples spanning three task types: 38K surreal fantasy generation tasks, 73K multi-reference image generation tasks, and 68K complex instruction execution tasks.

Imagination: 37,541 Images
Multi-Reference: 72,729 Images
Instruction Following: 67,958 Images

Website | GitHub | Hugging Face

ShareGPT-4o-Image

Junying Chen, Zhenyang Cai, Pengcheng Chen, Shunian Chen, Ke Ji, Xidong Wang, Yunjin Yang, Benyou Wang

ShareGPT-4o-Image is a large-scale, high-quality dataset of 92K samples generated by GPT-4o’s image generation capabilities, including 45K text-to-image and 46K text-and-image-to-image examples. It aims to support the development of open multimodal models aligned with GPT-4o’s strengths in image generation.

ShareGPT-4o-Image contains a total of 92K image generation samples from GPT-4o (45,717 + 46,539 = 92,256), categorized as follows:

Text-to-Image: 45,717
Text-and-Image-to-Image: 46,539

The image data is packaged into .tar archives:

text_to_image_part_*.tar contains images from the text-to-image set.
text_and_image_to_image_part_*.tar contains images from the text-and-image-to-image set.
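
The archive naming above suggests that the images can be enumerated shard by shard. The following is a minimal sketch of that idea, not the dataset authors' loader: the local `data` directory, the helper name `list_shard_images`, and the set of image extensions are assumptions made for illustration.

```python
import glob
import tarfile

# Assumed image file types; the source does not state which formats are used.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg", ".webp")


def list_shard_images(pattern):
    """Return (archive_path, member_name) pairs for image files in matching tar shards."""
    found = []
    for archive_path in sorted(glob.glob(pattern)):
        with tarfile.open(archive_path, "r") as archive:
            for member in archive.getmembers():
                # Skip directories and non-image files such as metadata.
                if member.isfile() and member.name.lower().endswith(IMAGE_EXTENSIONS):
                    found.append((archive_path, member.name))
    return found


# Example call, assuming the shards were downloaded into a local "data" directory:
# images = list_shard_images("data/text_to_image_part_*.tar")
```

The same function can be pointed at `text_and_image_to_image_part_*.tar` to enumerate the second set.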

Website

ART500K

Hui Mao, Ming Cheung, James She
HKUST-NIE Social Media Lab, The Hong Kong University of Science and Technology

ART500K is a large-scale visual arts dataset with more than 500K images, each with over 10 attribute labels. In addition to general labels (e.g., artist, genre, art movement), it includes special labels (e.g., event, historical figure, description). The dataset can be used for tasks such as visual arts classification, visual arts retrieval, and visual arts image captioning. Both the computer science community and the visual arts community can benefit from the dataset.

Raw images of visual arts with general label list: Data & Labels (56.6 GB) / Labels
Raw images of visual arts with event label list: Data (15 GB) / Labels
Raw images of visual arts with historical figure label list: Data (26 GB) / Labels
Raw images of visual arts with place label list: Place1 (86 GB) / Place2 (85 GB) / Place3 (73 GB) / Label
Toy artwork dataset with 43,455 images: Data (7 GB) / Labels
Artwork photos: Data (62 GB) / Introduction

Website

Painter by Numbers

Files: 17 files
Size: 90.51 GB
Type: zip, csv

Most of the images in this competition are from WikiArt.org. Please assume that all images are protected by copyright and utilize the images only for the purposes of data mining, which constitutes a form of fair use.

train.zip – zip file containing the images in the training set (.jpg)
train_{1,2,...,9}.zip – subsets of train.zip, provided because the data size is large. You don’t need any of these if you can download train.zip.
test.zip – zip file containing the images in the test set (.jpg)
train_info.csv – file listing the image filename, artistID, genre, style, date and title for images in the training set.
submission_info.csv – each row lists an index and the filenames of the two images for which the algorithm needs to make a prediction about whether they were created by the same artist.
sampleSubmission.csv – a sample submission file with a column for comparison number and the predicted probability
replacements_for_corrupted_files.zip – correct versions of the 10 images in the train and test sets that were corrupted
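
The relationship between submission_info.csv and sampleSubmission.csv described above can be sketched as a small pipeline: read each image pair, score it, and write one probability per row. This is only an illustration. The column names `index`, `image1`, `image2`, and `sameArtist` are assumptions (check the headers of the downloaded files), and `predict_same_artist` is a hypothetical stand-in for a real model.

```python
import csv
import io


def write_submission(pairs_file, out_file, predict_same_artist):
    """Read (index, image1, image2) rows and write (index, probability) rows."""
    reader = csv.DictReader(pairs_file)
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["index", "sameArtist"])  # assumed header names
    for row in reader:
        # The callable returns the predicted probability that both
        # images were created by the same artist.
        probability = predict_same_artist(row["image1"], row["image2"])
        writer.writerow([row["index"], f"{probability:.4f}"])


if __name__ == "__main__":
    # In-memory stand-in for submission_info.csv with made-up file names.
    pairs = io.StringIO("index,image1,image2\n0,a.jpg,b.jpg\n1,c.jpg,c.jpg\n")
    result = io.StringIO()
    # Placeholder scorer: identical file names count as the same artist.
    write_submission(pairs, result, lambda a, b: 1.0 if a == b else 0.5)
    print(result.getvalue())
```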

Kaggle

Biodiversity Heritage Library (BHL)

Access: API, Data Dumps
Format: JSON, XML, MODS, BibTeX, RIS, TSV, TXT
License: CC0-1.0 license

The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. BHL is revolutionizing global research by providing free, worldwide access to knowledge about life on Earth.

BHL provides data exports and APIs to allow individual users and data providers to download, remix and reuse BHL content.

Exports of BHL bibliographic, scientific name, and full optical character recognized text are available in a variety of formats.
A series of files is available for download that will enable libraries and other data providers to identify digitized titles available within BHL. These files include metadata about each volume scanned, as well as information about the millions of scientific names that have been identified throughout the BHL corpus and the pages on which those names occur.

Website | 300,000+ Illustrations on Flickr

The Art Genome Project

Format: JSON, CSV, API
License: CC-BY-4.0 license

The Art Genome Project is the classification system and technological framework that powers Artsy. It maps the characteristics (we call them “genes”) that connect artists, artworks, architecture, and design objects across history. There are currently over 1,000 characteristics in The Art Genome Project, including art historical movements, subject matter, and formal qualities.

In The Art Genome Project, every gene (category) is applied along with a value from 1-100 that indicates its relevance to the particular work, artist, designer or architect.

However, this list may still be used as subject headings, tags, or general categories for organizing collections of artworks, artists, designers, functional objects, and more.
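
The gene-plus-value scheme described above can be pictured as a mapping from gene name to a relevance score between 1 and 100. The sketch below is purely illustrative: the gene names and values are made up, and `top_genes` is a hypothetical helper, not part of any Artsy tooling.

```python
def top_genes(gene_values, minimum=50):
    """Return gene names whose relevance is at least `minimum`, most relevant first."""
    for name, value in gene_values.items():
        # The text describes values from 1 to 100; reject anything else.
        if not 1 <= value <= 100:
            raise ValueError(f"{name}: value {value} is outside 1-100")
    kept = [(name, value) for name, value in gene_values.items() if value >= minimum]
    return [name for name, _ in sorted(kept, key=lambda item: -item[1])]


if __name__ == "__main__":
    # Hypothetical gene values for a single artwork.
    artwork_genes = {"Pop Art": 90, "Collage": 30, "Portrait": 60}
    print(top_genes(artwork_genes))
```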

GitHub | Demo

Prints & Photographs Online Catalog (PPOC)

The Library of Congress

Format: JSON
Access: API

The Library of Congress’ recently re-released Prints & Photographs Online Catalog (http://www.loc.gov/pictures/) provides a JSON serialization of the request-scoped state used to create every HTML page. This immediately enables PPOC to serve as a simple API for developers to make use of while building other applications, integrating Library data in new and innovative ways.
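
Since each catalog page has a JSON counterpart, a client mainly needs to build the right URL. The sketch below assumes that the JSON view is selected with an `fo=json` query parameter and that searches live under a `search/` path with a `q` parameter; these details are not stated in the text above and should be checked against the PPOC API documentation. `ppoc_json_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Base URL from the catalog description; the "search/" path is an assumption.
PPOC_SEARCH_URL = "http://www.loc.gov/pictures/search/"


def ppoc_json_url(query):
    """Build a search URL that requests the JSON serialization of the results page."""
    # "q" and "fo" are assumed parameter names (see the note above).
    return PPOC_SEARCH_URL + "?" + urlencode({"q": query, "fo": "json"})


if __name__ == "__main__":
    print(ppoc_json_url("mark twain"))
```

The resulting URL can be fetched with any HTTP client and parsed with a standard JSON library.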

The Prints & Photographs Online Catalog includes images in the following formats:

gif – generally small “thumbnails” used for previewing images; a gif image displays at the top of its associated catalog record and, in some cases, it is the only image that will display to those searching outside the Library of Congress because of rights considerations (extension on the file name is .gif). The resolution is generally about 150×150 pixels.
jpeg – generally a larger image that displays in a separate screen from the catalog record; sometimes two types of jpeg files are available, one for reference viewing and one at a higher resolution (extension on the file name is .jpg).
tiff – generally the highest resolution file available in PPOC, viewed or downloaded via a link on the screen where the reference jpeg displays (extension on the file name is .tif).

API

parrot zone

@proximasan , @EErratica , @KyrickYoung , @sureailabs , @yontelbrot

The Image Synthesis Style Studies is a database of publicly collected information documenting the responses of open source “AI” image synthesis models, such as CLIP and Stable Diffusion, to specific text-based inputs. These inputs include adjectives, names of artists, popular media, or other descriptors (hereafter “modifiers”) with plausible visual effects on the style of the images synthesized by these models.

The database includes the recognition status (i.e. whether the model recognized the modifier, indicated with “Yes”, “No” or “Unsure”) of each individual modifier for each set of models. It also includes examples of images synthesized by the above models (Figure 1) when tested with the specified modifier to determine recognition status (see FAQ below – “How do you determine whether the models recognize a modifier?”).

NOTION

Art Dataset / Museum

Artworks: 130,000+
Format: CSV
License: CC0-1.0 license

National Gallery of Art Open Data Program

The dataset provides data records relating to the 130,000+ artworks in our collection and the artists who created them. You can download the dataset free of charge without seeking authorization from the National Gallery of Art.
The dataset is published in CSV format and uses UTF-8 encoding, and is updated daily. Links and references to images and other media such as audio and video files are contained in the dataset, but the images and media files themselves are not included under this program.

Website | GitHub

Records: 15,679+
Format: CSV, JSON
License: CC0-1.0 license

The Museum of Modern Art (MoMA) Collection

The Artists dataset contains 15,679 records, representing all the artists who have work in MoMA’s collection and have been cataloged in our database. It includes basic metadata for each artist, including name, nationality, gender, birth year, death year, Wiki QID, and Getty ULAN ID.
At this time, both datasets are available in CSV format, encoded in UTF-8. While UTF-8 is the standard for multilingual character encodings, it is not correctly interpreted by Excel on a Mac. Users of Excel on a Mac can convert the UTF-8 to UTF-16 so the file can be imported correctly. The datasets are also available in JSON.
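
The UTF-8 to UTF-16 conversion mentioned above can be done with a few lines of standard-library Python. This is a minimal sketch under stated assumptions: `convert_utf8_to_utf16` is a hypothetical helper, the file names in the usage comment are placeholders, and whether the result opens cleanly depends on the Excel version in use.

```python
def convert_utf8_to_utf16(src_path, dst_path):
    """Re-encode a UTF-8 text file as UTF-16, leaving the content unchanged."""
    # newline="" disables newline translation so line endings are preserved.
    # The "utf-16" codec writes a byte order mark at the start of the output.
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding="utf-16", newline="") as dst:
        for line in src:
            dst.write(line)


# Example call with placeholder file names:
# convert_utf8_to_utf16("Artists.csv", "Artists-utf16.csv")
```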

Website | API

Records: 470,000+
Format: CSV
License: CC0-1.0 license

The Metropolitan Museum of Art Open Access CSV

The Metropolitan Museum of Art provides select datasets of information on more than 470,000 artworks in its Collection for unrestricted commercial and noncommercial use.
At this time, the datasets are available in CSV format, encoded in UTF-8. While UTF-8 is the standard for multilingual character encodings, it is not correctly interpreted by Excel on a Mac. Users of Excel on a Mac can convert the UTF-8 to UTF-16 so the file can be imported correctly.

Website | GitHub

Artworks: 70,000+
Artists: 3,500+
Format: CSV, JSON
License: CC0-1.0 license

The Tate Collection

The dataset in this repository was last updated in October 2014. Tate has no plans to resume updating this repository, but we are keeping it available for the time being in case this snapshot of the Tate collection is a useful tool for researchers and developers.
Here we present the metadata for around 70,000 artworks that Tate owns or jointly owns with the National Galleries of Scotland as part of ARTIST ROOMS. Metadata for around 3,500 associated artists is also included.

Website | GitHub

Size: 1.75 GB
Access: API, Data Dumps
Format: JSON
License: CC0-1.0 license

The Art Institute of Chicago

Founded in 1879, the Art Institute of Chicago is one of the world’s major museums, housing an extraordinary collection of objects from across places, cultures, and time. We are also a place of active learning for all—dedicated to investigation, innovation, education, and dialogue—continually aspiring to greater public service and civic engagement.
We provide API and data dumps. These data dumps are updated nightly. They are generated from our API. As such, they contain the same data as our API, and their schema mirrors that of the API. The data is dumped in JSON format, with one JSON file per record. Records are grouped by API resource type.
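
A dump with one JSON file per record, grouped by resource type, can be read by walking one directory per resource. The sketch below assumes a layout of `<dump_root>/<resource>/<id>.json`; the actual directory structure of the downloaded dump may differ, and `iter_records` is a hypothetical helper name.

```python
import json
from pathlib import Path


def iter_records(dump_root, resource):
    """Yield one parsed record per JSON file under dump_root/resource."""
    # Assumed layout: one directory per API resource type, one file per record.
    for path in sorted((Path(dump_root) / resource).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            yield json.load(handle)


# Example call with a placeholder path and resource name:
# for record in iter_records("path/to/dump", "artworks"):
#     print(record.get("title"))
```

Using a generator keeps memory use flat, which matters when a dump holds many small files.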

Website | Data Dumps | API

Artworks: 64,000+
Access: API, Data Dumps
Format: CSV, JSON
License: CC0-1.0 license

The Cleveland Museum of Art Open Access

The Cleveland Museum of Art (CMA) was founded in 1913 “for the benefit of all the people forever.” The museum strives to help the broadest possible audience understand and engage with the world’s great art. The Cleveland Museum of Art is one of the most comprehensive art museums in the world and one of northeastern Ohio’s principal civic and cultural institutions.
The Cleveland Museum of Art provides datasets of information on more than 64,000 artwork records in its Collection for unrestricted commercial and noncommercial use. Additionally, the museum provides image assets for over 37,000 works, which are made available under the same terms. Links to the web, print, and full-sized, uncompressed versions of these images are included in the dataset where applicable.

Website | Data Dumps | API

Artworks: 245,688+
Access: API
Format: JSON

Harvard Art Museums

The Harvard Art Museums API is a REST-style service designed for developers who wish to explore and integrate the museums’ collections in their projects. The API provides direct access to JSON formatted data that powers this website and many other aspects of the museums.
The API uses keys to authenticate requests, so every request must include the apikey parameter set to a valid API key. API keys take the form 00000000-0000-0000-0000-000000000000.
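
The apikey requirement can be illustrated by building a request URL. In this sketch the base URL, the `object` resource name, and the `size` parameter are assumptions not stated in the text above, and `build_request_url` is a hypothetical helper; consult the API documentation for the actual endpoints.

```python
from urllib.parse import urlencode

# Assumed base URL; verify against the Harvard Art Museums API documentation.
API_BASE = "https://api.harvardartmuseums.org"


def build_request_url(resource, api_key, **params):
    """Build a URL that always carries the apikey parameter, plus optional filters."""
    query = {"apikey": api_key, **params}
    return f"{API_BASE}/{resource}?{urlencode(query)}"


if __name__ == "__main__":
    # Placeholder key in the documented format; replace with a real key.
    print(build_request_url("object", "00000000-0000-0000-0000-000000000000", size=5))
```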

Website | API