# laion400m

> **Warning:** Manual download required. See instructions below.

- **Description**:

The LAION-400M dataset is completely openly, freely accessible.

Check
<https://laion.ai/laion-400-open-dataset/>
for the full description of this dataset.

All images and texts in the LAION-400M dataset have been filtered with OpenAI's
CLIP by calculating the cosine similarity between the text and image embeddings
and dropping those with a similarity below 0.3. The threshold of 0.3 was
determined through human evaluation and appears to be a good heuristic for
estimating semantic image-text content matching.
The image-text pairs have been extracted from the Common Crawl web data dump
and come from random web pages crawled between 2014 and 2021. A minimal sketch
of the CLIP-based filtering step is shown below.
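
The filtering criterion above amounts to a cosine-similarity threshold over paired CLIP embeddings. The following is only an illustrative sketch of that idea, not the original LAION processing pipeline; the random arrays stand in for real CLIP image and text embeddings.

    import numpy as np

    SIMILARITY_THRESHOLD = 0.3  # cutoff used for LAION-400M, per the description above


    def cosine_similarity(image_emb: np.ndarray, text_emb: np.ndarray) -> np.ndarray:
        """Row-wise cosine similarity between image and text embedding matrices."""
        image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
        text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
        return np.sum(image_emb * text_emb, axis=-1)


    # Toy stand-ins for CLIP embeddings of four candidate image-text pairs.
    image_emb = np.random.randn(4, 512).astype(np.float32)
    text_emb = np.random.randn(4, 512).astype(np.float32)

    similarity = cosine_similarity(image_emb, text_emb)
    keep_mask = similarity >= SIMILARITY_THRESHOLD  # pairs below 0.3 are dropped
    print(similarity, keep_mask)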
**Manual download instructions**: This dataset requires you to download the
source data manually into `download_config.manual_dir` (defaults to
`~/tensorflow_datasets/downloads/manual/`). Refer to the "Download Information"
section on <https://laion.ai/blog/laion-400-open-dataset/>.
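
Once the source files are in place, the dataset can be prepared and read with the standard TFDS builder workflow. The sketch below is a hedged example rather than an official recipe: the `manual_dir` is simply the default quoted above, and the `'train'` split name is an assumption (the splits table below is empty).

    import tensorflow_datasets as tfds

    # Assumes the LAION-400M source data has already been downloaded by hand as
    # described above. The manual_dir below is the TFDS default; adjust as needed.
    builder = tfds.builder('laion400m/images')
    builder.download_and_prepare(
        download_config=tfds.download.DownloadConfig(
            manual_dir='~/tensorflow_datasets/downloads/manual/'))

    # The 'train' split name is an assumption; inspect builder.info.splits after
    # preparation to see which splits were actually generated.
    ds = builder.as_dataset(split='train')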
[[["Easy to understand","easyToUnderstand","thumb-up"],["Solved my problem","solvedMyProblem","thumb-up"],["Other","otherUp","thumb-up"]],[["Missing the information I need","missingTheInformationINeed","thumb-down"],["Too complicated / too many steps","tooComplicatedTooManySteps","thumb-down"],["Out of date","outOfDate","thumb-down"],["Samples / code issue","samplesCodeIssue","thumb-down"],["Other","otherDown","thumb-down"]],["Last updated 2024-09-03 UTC."],[],[],null,["# laion400m\n\n\u003cbr /\u003e\n\n| **Warning:** Manual download required. See instructions below.\n\n- **Description**:\n\nThe LAION-400M dataset is completely openly, freely accessible.\n\nCheck\n\u003chttps://laion.ai/laion-400-open-dataset/\u003e\nfor the full description of this dataset.\n\nAll images and texts in the LAION-400M dataset have been filtered with OpenAI's\nCLIP by calculating the cosine similarity between the text and image embeddings\nand dropping those with a similarity below 0.3. The threshold of 0.3 had been\ndetermined through human evaluations and seemed to be a good heuristic for\nestimating semantic image-text-content matching.\n\nThe image-text-pairs have been extracted from the Common Crawl web data dump and\nare from random web pages crawled between 2014 and 2021.\n\n- **Additional Documentation** :\n [Explore on Papers With Code\n north_east](https://paperswithcode.com/dataset/laion-400m)\n\n- **Homepage** :\n \u003chttps://laion.ai/blog/laion-400-open-dataset/\u003e\n\n- **Source code** :\n [`tfds.vision_language.laion400m.Laion400m`](https://github.com/tensorflow/datasets/tree/master/tensorflow_datasets/vision_language/laion400m/laion400m.py)\n\n- **Versions**:\n\n - **`1.0.0`** (default): Initial release.\n- **Download size** : `Unknown size`\n\n- **Dataset size** : `Unknown size`\n\n- **Manual download instructions** : This dataset requires you to\n download the source data manually into `download_config.manual_dir`\n (defaults to `~/tensorflow_datasets/downloads/manual/`): \n\n Refer to \"Download Information\" section on \u003chttps://laion.ai/blog/laion-400-open-dataset/\u003e\n\n- **Auto-cached**\n ([documentation](https://www.tensorflow.org/datasets/performances#auto-caching)):\n Unknown\n\n- **Splits**:\n\n| Split | Examples |\n|-------|----------|\n\n- **Supervised keys** (See\n [`as_supervised` doc](https://www.tensorflow.org/datasets/api_docs/python/tfds/load#args)):\n `None`\n\n- **Figure**\n ([tfds.show_examples](https://www.tensorflow.org/datasets/api_docs/python/tfds/visualization/show_examples)):\n Not supported.\n\n- **Examples**\n ([tfds.as_dataframe](https://www.tensorflow.org/datasets/api_docs/python/tfds/as_dataframe)):\n Missing.\n\n- **Citation**:\n\n @article{DBLP:journals/corr/abs-2111-02114,\n author = {Christoph Schuhmann and\n Richard Vencu and\n Romain Beaumont and\n Robert Kaczmarczyk and\n Clayton Mullis and\n Aarush Katta and\n Theo Coombes and\n Jenia Jitsev and\n Aran Komatsuzaki},\n title = { {LAION-400M:} Open Dataset of CLIP-Filtered 400 Million Image-Text\n Pairs},\n journal = {CoRR},\n volume = {abs/2111.02114},\n year = {2021},\n url = {https://arxiv.org/abs/2111.02114},\n eprinttype = {arXiv},\n eprint = {2111.02114},\n timestamp = {Fri, 05 Nov 2021 15:25:54 +0100},\n biburl = {https://dblp.org/rec/journals/corr/abs-2111-02114.bib},\n bibsource = {dblp computer science bibliography, https://dblp.org}\n }\n\nlaion400m/images (default config)\n---------------------------------\n\n- **Feature structure**:\n\n FeaturesDict({\n 'caption': Text(shape=(), 
laion400m/embeddings
--------------------

- **Feature structure**:

    FeaturesDict({
        'caption': Text(shape=(), dtype=string),
        'image_embedding': Tensor(shape=(512,), dtype=float16, description=CLIP image embedding),
        'license': Text(shape=(), dtype=string),
        'nsfw': ClassLabel(shape=(), dtype=int64, num_classes=4),
        'original_height': Scalar(shape=(), dtype=int32, description=original height of the image),
        'original_width': Scalar(shape=(), dtype=int32, description=original width of the image),
        'similarity': Scalar(shape=(), dtype=float64, description=cosine similarity score between the text and image embedding. Missing values default to -1.0),
        'text_embedding': Tensor(shape=(512,), dtype=float16, description=CLIP text embedding),
        'url': Text(shape=(), dtype=string),
    })

- **Feature documentation**:

| Feature         | Class        | Shape  | Dtype   | Description                                                                                   | Value range |
|-----------------|--------------|--------|---------|-----------------------------------------------------------------------------------------------|-------------|
|                 | FeaturesDict |        |         |                                                                                               |             |
| caption         | Text         |        | string  | HTML alt-text attribute                                                                       |             |
| image_embedding | Tensor       | (512,) | float16 | CLIP image embedding                                                                          |             |
| license         | Text         |        | string  | type of Creative Commons license (if applicable)                                              |             |
| nsfw            | ClassLabel   |        | int64   | NSFW tag (detected with CLIP). Incohesive and missing tags are replaced with UNTAGGED         |             |
| original_height | Scalar       |        | int32   | original height of the image                                                                  |             |
| original_width  | Scalar       |        | int32   | original width of the image                                                                   |             |
| similarity      | Scalar       |        | float64 | cosine similarity score between the text and image embedding. Missing values default to -1.0 | [0.0, 1.0]  |
| text_embedding  | Tensor       | (512,) | float16 | CLIP text embedding                                                                           |             |
| url             | Text         |        | string  | image URL                                                                                     |             |
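
A hedged sketch of working with the `embeddings` config: re-deriving the cosine similarity from the stored float16 CLIP embeddings and comparing it against the precomputed `similarity` field. As before, the `'train'` split name is an assumption, and the data must already have been prepared from the manually downloaded sources.

    import numpy as np
    import tensorflow_datasets as tfds

    ds = tfds.load('laion400m/embeddings', split='train')  # split name is an assumption

    for ex in tfds.as_numpy(ds.take(1)):
        img = ex['image_embedding'].astype(np.float32)  # stored as float16
        txt = ex['text_embedding'].astype(np.float32)
        # Recompute the cosine similarity from the stored CLIP embeddings and
        # compare it with the dataset's precomputed 'similarity' value
        # (-1.0 indicates a missing score).
        cos = float(np.dot(img, txt) / (np.linalg.norm(img) * np.linalg.norm(txt)))
        print(cos, float(ex['similarity']))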