ART500K
Hui Mao Ming Cheung James She
HKUST-NIE Social Media Lab, The Hong Kong University of Science and Technology
ART500K is a large-scale visual arts dataset with more than 500K images, each with over 10 attribute labels. Apart from general labels (e.g., artist, genre, art movement), it includes some special labels (e.g., event, historical figure, description). The dataset can be used for different tasks (e.g., visual arts classification, visual arts retrieval, visual arts image captioning). Both the computer science community and the visual arts community can benefit from the dataset.
- Raw images of visual arts with general label list: Data&Labels(56.6GB)/Labels
- Raw images of visual arts with event label list: Data(15GB)/Labels
- Raw images of visual arts with historical figure label list: Data(26GB)/Labels
- Raw images of visual arts with place label list: Place1(86GB)/Place2(85GB)/Place3(73GB)/Label
- Toy artwork dataset with 43,455 images: Data (7GB)/Labels
- Artwork photos: Data (62GB)/Introduction
Painter by Numbers
Files: 17 files
Size: 90.51 GB
Type: zip, csv
Most of the images in this competition are from WikiArt.org. Please assume that all images are protected by copyright and utilize the images only for the purposes of data mining, which constitutes a form of fair use.
- train.zip – zip file containing the images in the training set (.jpg)
- train_{1,2,..9}.zip – subsets of train.zip since the data size is large. You don’t need any of these if you can download train.zip.
- test.zip – zip file containing the images in the test set (.jpg)
- train_info.csv – file listing the image filename, artistID, genre, style, date and title for images in the training set.
- submission_info.csv – each row lists an index and the filenames of the two images for which the algorithm needs to make a prediction about whether they were created by the same artist.
- sampleSubmission.csv – a sample submission file with a column for comparison number and the predicted probability
- replacements_for_corrupted_files.zip – correct versions of the 10 images in the train and test sets that were corrupted
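As a rough illustration of how submission_info.csv and sampleSubmission.csv relate, the sketch below reads image pairs and writes one predicted probability per pair. The column names (`index`, `image1`, `image2`, `sameArtist`), the sample rows, and the constant-probability predictor are all assumptions made for illustration; check the header rows of the downloaded files before relying on them.

```python
import csv
import io

# Made-up rows in the assumed layout of submission_info.csv.
SAMPLE_SUBMISSION_INFO = """index,image1,image2
0,100.jpg,101.jpg
1,102.jpg,103.jpg
"""


def predict_same_artist(image1: str, image2: str) -> float:
    """Hypothetical placeholder: return the same probability for every pair."""
    return 0.5


def build_submission(info_csv: str) -> str:
    """Turn pair rows into 'index,probability' rows."""
    reader = csv.DictReader(io.StringIO(info_csv))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["index", "sameArtist"])  # assumed header names
    for row in reader:
        prob = predict_same_artist(row["image1"], row["image2"])
        writer.writerow([row["index"], prob])
    return out.getvalue()


submission = build_submission(SAMPLE_SUBMISSION_INFO)
print(submission)
```

A real model would replace `predict_same_artist` with something that actually compares the two images.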
Biodiversity Heritage Library (BHL)
Access : API, Data Dumps
Format: JSON, XML, MODS, BibTeX, RIS, TSV, TXT
License : CC0-1.0 license
The Biodiversity Heritage Library (BHL) is the world’s largest open access digital library for biodiversity literature and archives. BHL is revolutionizing global research by providing free, worldwide access to knowledge about life on Earth.
BHL provides data exports and APIs to allow individual users and data providers to download, remix and reuse BHL content.
Exports of BHL bibliographic, scientific name, and full optical character recognized text are available in a variety of formats.
A series of files is available for download that will enable libraries and other data providers to identify digitized titles available within BHL. These files include metadata about each volume scanned, as well as information about the millions of scientific names that have been identified throughout the BHL corpus and the pages on which those names occur.
The Art Genome Project
Format: JSON, CSV, API
License: CC-BY-4.0 license
The Art Genome Project is the classification system and technological framework that powers Artsy. It maps the characteristics (we call them “genes”) that connect artists, artworks, architecture, and design objects across history. There are currently over 1,000 characteristics in The Art Genome Project, including art historical movements, subject matter, and formal qualities.
In The Art Genome Project, every gene (category) is applied along with a value from 1-100 that indicates its relevance to the particular work, artist, designer or architect.
However, this list may still be used as subject headings, tags, or general categories for organizing collections of artworks, artists, designers, functional objects, and more.
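One way to picture the gene values described above is as a mapping from gene name to a relevance value between 1 and 100. The gene names, the values, and the `top_genes` helper below are hypothetical and are not taken from Artsy's data.

```python
# Hypothetical gene names and values, for illustration only.
artwork_genes = {
    "Pop Art": 100,
    "Bright and Vivid Colors": 70,
    "Portrait": 35,
}


def top_genes(genes: dict[str, int], min_value: int = 50) -> list[str]:
    """Return gene names with relevance >= min_value, most relevant first."""
    for name, value in genes.items():
        if not 1 <= value <= 100:
            raise ValueError(f"{name}: value {value} is outside 1-100")
    ranked = sorted(genes.items(), key=lambda item: item[1], reverse=True)
    return [name for name, value in ranked if value >= min_value]


print(top_genes(artwork_genes))
```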
Prints & Photographs Online Catalog (PPOC)
The Library of Congress
Format: JSON
Access: API
The Library of Congress’ recently re-released Prints & Photographs Online Catalog (http://www.loc.gov/pictures/) provides a JSON serialization of the request-scoped state used to create every HTML page. This immediately enables PPOC to serve as a simple API for developers to make use of while building other applications, integrating Library data in new and innovative ways.
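A minimal sketch of requesting that JSON serialization might look like the following. It assumes that adding `fo=json` to a catalog URL selects the JSON output and that `/pictures/search/` accepts a `q` parameter; both should be confirmed against the Library's own API notes. The example only builds and prints the URL, and `fetch_ppoc` is left uncalled because it needs network access.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PPOC_SEARCH = "http://www.loc.gov/pictures/search/"


def ppoc_search_url(query: str) -> str:
    """Build a PPOC search URL that asks for JSON instead of HTML."""
    # "fo=json" is assumed to be the switch that selects the JSON serialization.
    return PPOC_SEARCH + "?" + urlencode({"q": query, "fo": "json"})


def fetch_ppoc(query: str) -> dict:
    """Fetch and decode one page of results (requires network access)."""
    with urlopen(ppoc_search_url(query)) as response:
        return json.load(response)


print(ppoc_search_url("dorothea lange"))
```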
The Prints & Photographs Online Catalog includes images in the following formats:
- gif – generally small “thumbnails” used for previewing images; a gif image displays at the top of its associated catalog record and, in some cases, it is the only image that will display to those searching outside the Library of Congress because of rights considerations (extension on the file name is .gif). The resolution is generally about 150×150 pixels.
- jpeg – generally a larger image that displays in a separate screen from the catalog record; sometimes two types of jpeg files are available, one for reference viewing and one at a higher resolution (extension on the file name is .jpg).
- tiff – generally the highest resolution file available in PPOC, viewed or downloaded via a link on the screen where the reference jpeg displays (extension on the file name is .tif).
The Image Synthesis Style Studies is a database of publicly collected information documenting the responses of open source “AI” image synthesis models, such as CLIP and Stable Diffusion, to specific text-based inputs. These inputs include adjectives, names of artists, popular media, or other descriptors (hereafter “modifiers”) with plausible visual effects on the style of the images synthesized by these models.
The database includes the recognition status (i.e. whether the model recognized the modifier, indicated with “Yes”, “No” or “Unsure”) of each individual modifier for each set of models. It also includes examples of images synthesized by the above models (Figure 1) when tested with the specified modifier to determine recognition status (see FAQ below – “How do you determine whether the models recognize a modifier?”).
Art Dataset / Museum
Artwork : 130,000+
Format: CSV
License : CC0-1.0 license
National Gallery of Art Open Data Program
The dataset provides data records relating to the 130,000+ artworks in our collection and the artists who created them. You can download the dataset free of charge without seeking authorization from the National Gallery of Art.
The dataset is published in CSV format and uses UTF-8 encoding, and is updated daily. Links and references to images and other media such as audio and video files are contained in the dataset, but the images and media files themselves are not included under this program.
Records : 15,679+
Format: CSV, JSON
License : CC0-1.0 license
The Museum of Modern Art (MoMA) Collection
The Artists dataset contains 15,679 records, representing all the artists who have work in MoMA’s collection and have been cataloged in our database. It includes basic metadata for each artist, including name, nationality, gender, birth year, death year, Wiki QID, and Getty ULAN ID.
At this time, both datasets are available in CSV format, encoded in UTF-8. While UTF-8 is the standard for multilingual character encodings, it is not correctly interpreted by Excel on a Mac. Users of Excel on a Mac can convert the UTF-8 to UTF-16 so the file can be imported correctly. The datasets are also available in JSON.
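The UTF-8 to UTF-16 conversion mentioned above can be done in a few lines of Python. This is a sketch only: the file names and the sample row are made up, and whether the resulting file opens cleanly depends on the Excel version and import settings.

```python
import tempfile
from pathlib import Path


def convert_utf8_to_utf16(src: Path, dst: Path) -> None:
    """Re-encode a UTF-8 text file as UTF-16 (Python writes a byte-order mark)."""
    # "utf-8-sig" also accepts files that start with a UTF-8 BOM.
    text = src.read_text(encoding="utf-8-sig")
    dst.write_text(text, encoding="utf-16")


# Demonstration on a temporary file; names and contents are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "Artists.csv"
    dst = Path(tmp) / "Artists_utf16.csv"
    src.write_text("DisplayName,Nationality\nJoan Miró,Spanish\n", encoding="utf-8")
    convert_utf8_to_utf16(src, dst)
    round_trip = dst.read_text(encoding="utf-16")

print(round_trip)
```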
Records : 470,000+
Format: CSV
License : CC0-1.0 license
The Metropolitan Museum of Art Open Access CSV
The Metropolitan Museum of Art provides select datasets of information on more than 470,000 artworks in its Collection for unrestricted commercial and noncommercial use.
At this time, the datasets are available in CSV format, encoded in UTF-8. While UTF-8 is the standard for multilingual character encodings, it is not correctly interpreted by Excel on a Mac. Users of Excel on a Mac can convert the UTF-8 to UTF-16 so the file can be imported correctly.
Artworks : 70,000+
Artists : 3,500+
Format: CSV, JSON
License : CC0-1.0 license
The Tate Collection
The dataset in this repository was last updated in October 2014. Tate has no plans to resume updating this repository, but we are keeping it available for the time being in case this snapshot of the Tate collection is a useful tool for researchers and developers.
Here we present the metadata for around 70,000 artworks that Tate owns or jointly owns with the National Galleries of Scotland as part of ARTIST ROOMS. Metadata for around 3,500 associated artists is also included.
Size : 1.75 GB
Access : API, Data Dumps
Format: JSON
License : CC0-1.0 license
The Art Institute of Chicago
Founded in 1879, the Art Institute of Chicago is one of the world’s major museums, housing an extraordinary collection of objects from across places, cultures, and time. We are also a place of active learning for all—dedicated to investigation, innovation, education, and dialogue—continually aspiring to greater public service and civic engagement.
We provide API and data dumps. These data dumps are updated nightly. They are generated from our API. As such, they contain the same data as our API, and their schema mirrors that of the API. The data is dumped in JSON format, with one JSON file per record. Records are grouped by API resource type.
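Since the dump stores one JSON file per record, grouped by resource type, it can be walked with ordinary file-system tools. The sketch below assumes a layout of `<root>/<resource type>/<id>.json`; that layout, the resource names, and the record fields are guesses for illustration, so a tiny fake dump is built in a temporary directory rather than using real data.

```python
import json
import tempfile
from collections import Counter
from pathlib import Path


def count_records(dump_root: Path) -> Counter:
    """Count JSON record files per resource-type folder."""
    counts = Counter()
    for path in dump_root.glob("*/*.json"):
        counts[path.parent.name] += 1
    return counts


def load_record(path: Path) -> dict:
    """Load a single record file."""
    return json.loads(path.read_text(encoding="utf-8"))


# Build a tiny fake dump in the assumed layout.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for resource, record_id in [("artworks", 1), ("artworks", 2), ("agents", 7)]:
        folder = root / resource
        folder.mkdir(exist_ok=True)
        record = {"id": record_id, "title": "placeholder"}
        (folder / f"{record_id}.json").write_text(json.dumps(record), encoding="utf-8")
    counts = count_records(root)
    first = load_record(root / "artworks" / "1.json")

print(dict(counts))
```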
Artworks : 64,000+
Access : API, Data Dumps
Format: CSV, JSON
License : CC0-1.0 license
The Cleveland Museum of Art Open Access
The Cleveland Museum of Art (CMA) was founded in 1913 “for the benefit of all the people forever.” The museum strives to help the broadest possible audience understand and engage with the world’s great art. The Cleveland Museum of Art is one of the most comprehensive art museums in the world and one of northeastern Ohio’s principal civic and cultural institutions.
The Cleveland Museum of Art provides datasets of information on more than 64,000 artwork records in its Collection for unrestricted commercial and noncommercial use. Additionally, the museum provides image assets for over 37,000 works, which are made available under the same terms. Links to the web, print, and full-sized, uncompressed versions of these images are included in the dataset where applicable.
Artworks : 245,688+
Access : API
Format: JSON
Harvard Art Museums
The Harvard Art Museums API is a REST-style service designed for developers who wish to explore and integrate the museums’ collections in their projects. The API provides direct access to JSON formatted data that powers this website and many other aspects of the museums.
The API uses keys to authenticate requests, so every request must include the apikey parameter set to a valid API key. API keys take the form 00000000-0000-0000-0000-000000000000.
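A request URL carrying the apikey parameter could be assembled as in this sketch. The base URL, the `object` resource, and the `size` parameter are assumptions to be checked against the API documentation, and the all-zero key is only the placeholder form shown above, not a working key.

```python
from urllib.parse import urlencode

# Base URL is an assumption; confirm it in the API documentation.
BASE_URL = "https://api.harvardartmuseums.org"


def harvard_url(resource: str, apikey: str, **params: str) -> str:
    """Build a request URL with the required apikey parameter."""
    query = {"apikey": apikey, **params}
    return f"{BASE_URL}/{resource}?{urlencode(query)}"


# Placeholder key in the documented shape; "object" and "size" are assumed names.
url = harvard_url("object", "00000000-0000-0000-0000-000000000000", size="5")
print(url)
```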
Movie Poster / Music Cover

Movie Genre from its Poster
The collected dataset contains IMDB Id, IMDB Link, Title, IMDB Score, Genre and link to download movie posters.
Poster: 39,371
Size: 26.78 MB

Movie-Poster Dataset
We collected 1,500 movie posters featuring various artistic-style titles to address the current market’s lack of artistic-style text data.
Poster: 1500
Size: 1.0 GB