DINOv3

🆕 [2025-08-14] 🔥 DINOv3 backbones are now available on the Hugging Face Hub and supported by the Hugging Face Transformers library.

DINOv3 🦖🦖🦖

Meta AI Research, FAIR

Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab,
Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa,
Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang,
Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts,
Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,
Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski

[ 📜 Paper] [ 📰 Blog] [ 🌐 Website] [ 📖 BibTeX]

Reference PyTorch implementation and models for DINOv3. For details, see the DINOv3 paper.

Overview

High-resolution dense features.
We visualize the cosine similarity maps obtained with DINOv3 output features
between the patches marked with a red cross and all other patches.
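As a rough illustration of what such a similarity map involves, the sketch below compares one reference patch against every patch in a grid using cosine similarity. The feature values are made up for the example and the helper names (`cosine_similarity`, `similarity_map`) are hypothetical; real maps would be computed from DINOv3 patch features instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def similarity_map(patch_features, ref_row, ref_col):
    """Similarity of the reference patch to every patch in an H x W grid."""
    ref = patch_features[ref_row][ref_col]
    return [[cosine_similarity(ref, feat) for feat in row] for row in patch_features]


# Toy 2 x 2 grid of 3-dimensional features (made-up values, not model outputs).
toy_features = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [-1.0, 0.0, 0.0]],
]
sim = similarity_map(toy_features, ref_row=0, ref_col=0)
for row in sim:
    print([round(value, 3) for value in row])
```

The reference patch always has similarity 1 with itself; plotting the resulting grid as a heat map gives a figure of the kind described above.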

An extended family of versatile vision foundation models that produce high-quality dense features and achieve outstanding performance on a variety of vision tasks, including outperforming the specialized state of the art across a broad range of settings, without fine-tuning.

Pretrained models

ℹ️ Please follow the link provided below to get access to all the model weights: once accepted, an e-mail will be sent with the complete list of URLs pointing to all the available model weights (both backbones and adapters). These URLs can then be used to either:

download the model or adapter weights to a local filesystem and point torch.hub.load() to these local weights via the weights or backbone_weights parameters, or
invoke torch.hub.load() directly with the URL, again via the weights or backbone_weights parameters, to download and load a backbone or an adapter.

See the example code snippets below.

⚠️ Please use wget instead of a web browser to download the weights.

ViT models pretrained on web dataset (LVD-1689M):

Model	Parameters	Pretraining Dataset	Download
ViT-S/16 distilled	21M	LVD-1689M	[link]
ViT-S+/16 distilled	29M	LVD-1689M	[link]
ViT-B/16 distilled	86M	LVD-1689M	[link]
ViT-L/16 distilled	300M	LVD-1689M	[link]
ViT-H+/16 distilled	840M	LVD-1689M	[link]
ViT-7B/16	6,716M	LVD-1689M	[link]

ConvNeXt models pretrained on web dataset (LVD-1689M):

Model	Parameters	Pretraining Dataset	Download
ConvNeXt Tiny	29M	LVD-1689M	[link]
ConvNeXt Small	50M	LVD-1689M	[link]
ConvNeXt Base	89M	LVD-1689M	[link]
ConvNeXt Large	198M	LVD-1689M	[link]

ViT models pretrained on satellite dataset (SAT-493M):

Model	Parameters	Pretraining Dataset	Download
ViT-L/16 distilled	300M	SAT-493M	[link]
ViT-7B/16	6,716M	SAT-493M	[link]

Pretrained backbones (via PyTorch Hub)

Please follow the instructions here to install PyTorch (the only required dependency for loading the model). Installing PyTorch with CUDA support is strongly recommended.

Pretrained backbones (via Hugging Face Transformers)

All the backbones are available in the DINOv3 collection on the Hugging Face Hub and are supported by the Hugging Face Transformers library. Please refer to the corresponding documentation for usage; the short examples below show how to obtain an image embedding with either [Pipeline] or the [AutoModel] class.

from transformers import pipeline
from transformers.image_utils import load_image

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/pipeline-cat-chonk.jpeg"
image = load_image(url)

feature_extractor = pipeline(
    model="facebook/dinov3-convnext-tiny-pretrain-lvd1689m",
    task="image-feature-extraction", 
)
features = feature_extractor(image)

import torch
from transformers import AutoImageProcessor, AutoModel
from transformers.image_utils import load_image

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = load_image(url)

pretrained_model_name = "facebook/dinov3-convnext-tiny-pretrain-lvd1689m"
processor = AutoImageProcessor.from_pretrained(pretrained_model_name)
model = AutoModel.from_pretrained(
    pretrained_model_name, 
    device_map="auto", 
)

inputs = processor(images=image, return_tensors="pt").to(model.device)
with torch.inference_mode():
    outputs = model(**inputs)

pooled_output = outputs.pooler_output
print("Pooled output shape:", pooled_output.shape)

In the examples above, the model argument (pipeline example) and pretrained_model_name (AutoModel example) can be any of the following:

facebook/dinov3-vits16-pretrain-lvd1689m
facebook/dinov3-vits16plus-pretrain-lvd1689m
facebook/dinov3-vitb16-pretrain-lvd1689m
facebook/dinov3-vitl16-pretrain-lvd1689m
facebook/dinov3-vith16plus-pretrain-lvd1689m
facebook/dinov3-vit7b16-pretrain-lvd1689m
facebook/dinov3-convnext-base-pretrain-lvd1689m
facebook/dinov3-convnext-large-pretrain-lvd1689m
facebook/dinov3-convnext-small-pretrain-lvd1689m
facebook/dinov3-convnext-tiny-pretrain-lvd1689m
facebook/dinov3-vitl16-pretrain-sat493m
facebook/dinov3-vit7b16-pretrain-sat493m
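The names in this list all follow the same pattern, so a small helper can assemble one from an architecture tag and a dataset tag and reject combinations that are not listed. This is only a convenience sketch: `checkpoint_name` and `DINOV3_CHECKPOINTS` are hypothetical names introduced here, not part of the Transformers API.

```python
# Checkpoint names copied from the list above.
DINOV3_CHECKPOINTS = {
    "facebook/dinov3-vits16-pretrain-lvd1689m",
    "facebook/dinov3-vits16plus-pretrain-lvd1689m",
    "facebook/dinov3-vitb16-pretrain-lvd1689m",
    "facebook/dinov3-vitl16-pretrain-lvd1689m",
    "facebook/dinov3-vith16plus-pretrain-lvd1689m",
    "facebook/dinov3-vit7b16-pretrain-lvd1689m",
    "facebook/dinov3-convnext-base-pretrain-lvd1689m",
    "facebook/dinov3-convnext-large-pretrain-lvd1689m",
    "facebook/dinov3-convnext-small-pretrain-lvd1689m",
    "facebook/dinov3-convnext-tiny-pretrain-lvd1689m",
    "facebook/dinov3-vitl16-pretrain-sat493m",
    "facebook/dinov3-vit7b16-pretrain-sat493m",
}


def checkpoint_name(arch: str, dataset: str = "lvd1689m") -> str:
    """Build a checkpoint name and check it against the list above."""
    name = f"facebook/dinov3-{arch}-pretrain-{dataset}"
    if name not in DINOV3_CHECKPOINTS:
        raise ValueError(f"Not in the list of published checkpoints: {name}")
    return name


print(checkpoint_name("vitl16", "sat493m"))
print(checkpoint_name("convnext-tiny"))
```

The returned string can be passed as pretrained_model_name in the AutoModel example, which avoids typos such as requesting a satellite checkpoint for an architecture that only has web-pretrained weights.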

Image transforms

For models using the LVD-1689M weights (pretrained on web images), please use the following transform (standard ImageNet evaluation transform):

from torchvision import transforms

def make_transform(resize_size: int = 224):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return transforms.Compose([to_tensor, resize, normalize])
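For the ViT-*/16 backbones, the choice of resize_size determines how many patch tokens the model sees. The sketch below does that arithmetic, assuming the "/16" suffix denotes 16×16 pixel patches (the usual ViT naming convention) and counting only whole patches; `patch_grid` is a hypothetical helper, and any extra tokens the model may add are not counted.

```python
def patch_grid(resize_size: int, patch_size: int = 16) -> tuple[int, int, int]:
    """Return (rows, cols, total) of whole patches for a square input."""
    per_side = resize_size // patch_size
    return per_side, per_side, per_side * per_side


for size in (224, 768, 1024):
    rows, cols, total = patch_grid(size)
    print(f"{size}x{size} input -> {rows}x{cols} grid, {total} patch tokens")
```

Because the patch count grows quadratically with the side length, going from 224 to 1024 pixels multiplies the number of patch tokens by roughly 21.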

For models using the SAT-493M weights (pretrained on satellite imagery), please use the following transform:

from torchvision import transforms

def make_transform(resize_size: int = 224):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.430, 0.411, 0.296),
        std=(0.213, 0.156, 0.143),
    )
    return transforms.Compose([to_tensor, resize, normalize])
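To make the difference between the two sets of statistics concrete, the sketch below reproduces by hand what ToTensor (scaling 8-bit values to [0, 1]) followed by Normalize (subtracting the mean and dividing by the standard deviation) does to a single RGB pixel. The pixel value and the helper name `normalize_pixel` are chosen only for illustration.

```python
# Channel statistics copied from the two transforms above.
LVD_MEAN, LVD_STD = (0.485, 0.456, 0.406), (0.229, 0.224, 0.225)
SAT_MEAN, SAT_STD = (0.430, 0.411, 0.296), (0.213, 0.156, 0.143)


def normalize_pixel(rgb_uint8, mean, std):
    """Mimic ToTensor followed by Normalize for one 8-bit RGB pixel."""
    return tuple(
        (value / 255.0 - m) / s for value, m, s in zip(rgb_uint8, mean, std)
    )


pixel = (128, 128, 128)  # mid-gray, an arbitrary example value
print("LVD-1689M stats:", normalize_pixel(pixel, LVD_MEAN, LVD_STD))
print("SAT-493M stats: ", normalize_pixel(pixel, SAT_MEAN, SAT_STD))
```

The same gray pixel lands at noticeably different normalized values, especially in the blue channel, which is why the statistics should match the weights being used.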

Pretrained heads – Image classification

Backbone	Pretraining Dataset	Head Dataset	Download
ViT-7B/16	LVD-1689M	ImageNet	[link]

The (full) classifier models can be loaded via PyTorch Hub:

import torch

# DINOv3
dinov3_vit7b16_lc = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_lc', source="local", weights=<CLASSIFIER/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)
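Once a classifier is loaded, its raw scores usually need to be turned into ranked predictions. The sketch below shows that step on a toy list of scores, under the assumption (not confirmed here) that the classifier returns one logit per class; `softmax` and `top_k` are hypothetical helpers written in plain Python so the logic is easy to follow.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a flat list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, k=5):
    """Return the k (class_index, probability) pairs with the highest probability."""
    probs = softmax(logits)
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


toy_logits = [0.1, 2.5, -1.0, 0.7]  # made-up scores for four classes
for class_index, prob in top_k(toy_logits, k=2):
    print(f"class {class_index}: {prob:.3f}")
```

With real model outputs the same ranking would typically be done on tensors, and the class indices mapped back to label names.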

Pretrained heads – Depther trained on SYNTHMIX dataset

Backbone	Pretraining Dataset	Head Dataset	Download
ViT-7B/16	LVD-1689M	SYNTHMIX	[link]

depther = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_dd', source="local", weights=<DEPTHER/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

Full example code of depther on an image

from PIL import Image
import torch
from torchvision import transforms
import matplotlib.pyplot as plt
from matplotlib import colormaps

def get_img():
    import requests
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image

def make_transform(resize_size: int | list[int] = 768):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return transforms.Compose([to_tensor, resize, normalize])

depther = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_dd', source="local", weights=<DEPTHER/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

img_size = 1024
img = get_img()
transform = make_transform(img_size)
with torch.inference_mode():
    with torch.autocast('cuda', dtype=torch.bfloat16):
        batch_img = transform(img)[None]
        depths = depther(batch_img)

plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.imshow(img)
plt.axis("off")
plt.subplot(122)
plt.imshow(depths[0,0].cpu(), cmap=colormaps["Spectral"])
plt.axis("off")

Pretrained heads – Detector trained on COCO2017 dataset

Backbone	Pretraining Dataset	Head Dataset	Download
ViT-7B/16	LVD-1689M	COCO2017	[link]

detector = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_de', source="local", weights=<DETECTOR/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

Pretrained heads – Segmentor trained on ADE20K dataset

Backbone	Pretraining Dataset	Head Dataset	Download
ViT-7B/16	LVD-1689M	ADE20K	[link]

segmentor = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_ms', source="local", weights=<SEGMENTOR/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

Full example code of segmentor on an image

import sys
sys.path.append(REPO_DIR)

from PIL import Image
import torch
from torchvision import transforms
import matplotlib.pyplot as plt
from matplotlib import colormaps
from functools import partial
from dinov3.eval.segmentation.inference import make_inference


def get_img():
    import requests
    url = "http://images.cocodataset.org/val2017/000000039769.jpg"
    image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
    return image

def make_transform(resize_size: int | list[int] = 768):
    to_tensor = transforms.ToTensor()
    resize = transforms.Resize((resize_size, resize_size), antialias=True)
    normalize = transforms.Normalize(
        mean=(0.485, 0.456, 0.406),
        std=(0.229, 0.224, 0.225),
    )
    return transforms.Compose([to_tensor, resize, normalize])

segmentor = torch.hub.load(REPO_DIR, 'dinov3_vit7b16_ms', source="local", weights=<SEGMENTOR/CHECKPOINT/URL/OR/PATH>, backbone_weights=<BACKBONE/CHECKPOINT/URL/OR/PATH>)

img_size = 896
img = get_img()
transform = make_transform(img_size)
with torch.inference_mode():
    with torch.autocast('cuda', dtype=torch.bfloat16):
        batch_img = transform(img)[None]
        pred_vit7b = segmentor(batch_img)  # raw predictions  
        # actual segmentation map
        segmentation_map_vit7b = make_inference(
            batch_img,
            segmentor,
            inference_mode="slide",
            decoder_head_type="m2f",
            rescale_to=(img.size[-1], img.size[-2]),
            n_output_channels=150,
            crop_size=(img_size, img_size),
            stride=(img_size, img_size),
            output_activation=partial(torch.nn.functional.softmax, dim=1),
        ).argmax(dim=1, keepdim=True)
plt.figure(figsize=(12, 6))
plt.subplot(121)
plt.imshow(img)
plt.axis("off")
plt.subplot(122)
plt.imshow(segmentation_map_vit7b[0,0].cpu(), cmap=colormaps["Spectral"])
plt.axis("off")
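The example above passes inference_mode="slide" together with crop_size and stride. As a generic illustration of sliding-window inference (not the repository's implementation), the sketch below enumerates the crop boxes that would tile an image for a given crop size and stride, shifting the last window back so that it ends at the image border. The helper names are hypothetical.

```python
def window_starts(length: int, crop: int, stride: int) -> list[int]:
    """Start offsets of windows of size `crop` that together cover [0, length)."""
    if length <= crop:
        return [0]
    starts = list(range(0, length - crop + 1, stride))
    if starts[-1] + crop < length:
        starts.append(length - crop)  # last window is shifted back to reach the border
    return starts


def sliding_windows(height, width, crop_size, stride):
    """Yield (top, left, bottom, right) boxes for every window."""
    crop_h, crop_w = crop_size
    stride_h, stride_w = stride
    for top in window_starts(height, crop_h, stride_h):
        for left in window_starts(width, crop_w, stride_w):
            yield top, left, min(top + crop_h, height), min(left + crop_w, width)


# Same numbers as the example above: an 896 x 896 input with crop = stride = 896.
print(list(sliding_windows(896, 896, (896, 896), (896, 896))))
# A larger, hypothetical input with overlapping windows.
print(window_starts(2000, crop=896, stride=448))
```

Under this scheme, an input that has already been resized to the crop size is covered by a single window; larger inputs or smaller strides produce overlapping windows whose predictions are typically averaged.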

Pretrained heads – Zero-shot tasks with `dino.txt`

Backbone	Download
ViT-L/16 distilled	[link], vocabulary, vocabulary license
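For readers unfamiliar with zero-shot classification using an image/text-aligned model, the general idea is to embed the image and one text prompt per class into the same space and pick the class whose text embedding is most similar to the image embedding. The sketch below shows only that final comparison step on made-up embeddings; it does not call the dino.txt model, and `zero_shot_classify` is a hypothetical helper.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_embedding, text_embeddings, labels):
    """Return the best label and all cosine scores between image and text embeddings."""
    image = l2_normalize(image_embedding)
    scores = {}
    for label, text in zip(labels, text_embeddings):
        unit_text = l2_normalize(text)
        scores[label] = sum(i * t for i, t in zip(image, unit_text))
    best = max(scores, key=scores.get)
    return best, scores


# Made-up 3-dimensional embeddings standing in for real model outputs.
image_vec = [0.9, 0.1, 0.2]
text_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
best, scores = zero_shot_classify(image_vec, text_vecs, ["cat", "dog"])
print(best, {label: round(score, 3) for label, score in scores.items()})
```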

The (full) dino.txt model can be loaded via PyTorch Hub:

Installation

The training and evaluation code requires PyTorch version >= 2.7.1 as well as a few other third-party packages. Note that the code has only been tested with the specified versions and expects a Linux environment. To set up all the required dependencies for training and evaluation, please follow the instructions below:

micromamba (Recommended) – Clone the repository and then create and activate a dinov3 conda environment using the provided environment definition:

micromamba env create -f conda.yaml
micromamba activate dinov3
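After activating the environment, it can be useful to confirm that the installed PyTorch meets the stated minimum version. The sketch below compares version strings using only the standard library; `parse_version` and `meets_minimum` are hypothetical helpers, and the comparison deliberately ignores local build tags (such as "+cu126") and pre-release suffixes.

```python
import re


def parse_version(version: str) -> tuple[int, ...]:
    """Extract the leading numeric release, e.g. '2.7.1+cu126' -> (2, 7, 1)."""
    match = re.match(r"(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version: str, minimum: str = "2.7.1") -> bool:
    """True if the numeric release is at least `minimum`."""
    return parse_version(version) >= parse_version(minimum)


# In a real environment one would pass torch.__version__ here.
for candidate in ("2.6.0", "2.7.1+cu126", "2.10.0"):
    print(candidate, "->", meets_minimum(candidate))
```

Comparing integer tuples rather than raw strings avoids the classic mistake of treating "2.10.0" as older than "2.7.1".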

Getting started

Several notebooks are provided to get started applying DINOv3:

PCA of patch features: display the PCA of DINOv3 patch features on a foreground object (rainbow visualizations from the paper) [Run in Google Colab]
Foreground segmentation: train a linear foreground segmentation model based on DINOv3 features [Run in Google Colab]
Dense and sparse matching: match patches from objects on two different images based on DINOv3 features [Run in Google Colab]
Segmentation tracking: video segmentation tracking using a non-parametric method based on DINOv3 features [Run in Google Colab]

Data preparation

ImageNet-1k

The root directory of the dataset should hold the following contents:

/test/ILSVRC2012_test_00000001.JPEG
/test/[...]
/test/ILSVRC2012_test_00100000.JPEG
/train/n01440764/n01440764_10026.JPEG
/train/[...]
/train/n15075141/n15075141_9993.JPEG
/val/n01440764/ILSVRC2012_val_00000293.JPEG
/val/[...]
/val/n15075141/ILSVRC2012_val_00049174.JPEG
/labels.txt

The provided dataset implementation expects a few additional metadata files to be present under the extra directory:

/class-ids-TRAIN.npy
/class-ids-VAL.npy
/class-names-TRAIN.npy
/class-names-VAL.npy
/entries-TEST.npy
/entries-TRAIN.npy
/entries-VAL.npy

These metadata files can be generated (once) with the following lines of Python code:

from dinov3.data.datasets import ImageNet

for split in ImageNet.Split:
    dataset = ImageNet(split=split, root="<ROOT>", extra="<EXTRA>")
    dataset.dump_extra()

Note that the root and extra directories do not have to be distinct directories.
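Before launching a long job, a quick check that the expected top-level entries exist can save time. The sketch below looks only for the directories and files named above; it does not inspect individual images, and `missing_entries` is a hypothetical helper rather than part of the dinov3 package.

```python
from pathlib import Path

# Top-level entries and metadata files named in the layout above.
EXPECTED_ROOT_ENTRIES = ["test", "train", "val", "labels.txt"]
EXPECTED_EXTRA_FILES = [
    "class-ids-TRAIN.npy",
    "class-ids-VAL.npy",
    "class-names-TRAIN.npy",
    "class-names-VAL.npy",
    "entries-TEST.npy",
    "entries-TRAIN.npy",
    "entries-VAL.npy",
]


def missing_entries(root: str, extra: str) -> list[str]:
    """List expected entries that are absent from the root and extra directories."""
    missing = [
        str(Path(root) / name)
        for name in EXPECTED_ROOT_ENTRIES
        if not (Path(root) / name).exists()
    ]
    missing += [
        str(Path(extra) / name)
        for name in EXPECTED_EXTRA_FILES
        if not (Path(extra) / name).exists()
    ]
    return missing


# Usage (paths are placeholders): an empty list means nothing is missing.
# print(missing_entries("<ROOT>", "<EXTRA>"))
```

Since root and extra may be the same directory, the helper accepts identical paths for both arguments.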

ImageNet-22k

Please adapt the dataset class to match your local setup.

⚠️ To execute the commands provided in the next sections for training and evaluation, the dinov3 package should be included in the Python module search path, i.e. simply prefix the command to run with PYTHONPATH=. (the current directory, assumed to be the repository root).
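Inside a notebook or an interactive session, where prefixing a command with an environment variable is not convenient, a similar effect can be obtained by extending sys.path, as the segmentation example does with sys.path.append(REPO_DIR). The sketch below wraps that in a small idempotent helper; `add_repo_to_path` is a hypothetical name, and the change only affects the current interpreter, not commands launched separately.

```python
import os
import sys


def add_repo_to_path(repo_dir: str) -> str:
    """Put repo_dir at the front of sys.path (once) and return its absolute path."""
    repo_dir = os.path.abspath(repo_dir)
    if repo_dir not in sys.path:
        sys.path.insert(0, repo_dir)
    return repo_dir


# "." stands in for the local clone of the repository.
repo = add_repo_to_path(".")
print(repo in sys.path)
```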

Training

Fast setup: training DINOv3 ViT-L/16 on ImageNet-1k

Run DINOv3 pretraining on 4 H100-80GB nodes (32 GPUs) in a SLURM cluster environment with submitit:

Training time is approximately 14 hours and the resulting checkpoint should reach 82.0% on k-NN eval and 83.5% on linear eval.

The training code saves the weights of the teacher in the eval folder every 12500 iterations for evaluation.

Exact DINOv3 setup: training DINOv3 ViT-7B/16

DINOv3 ViT-7B/16 is trained on a private dataset. The training involves 3 stages:

Pretraining
Gram anchoring
High-resolution adaptation

Pretraining

Launch DINOv3 ViT-7B/16 pretraining on 32 nodes (256 GPUs) in a SLURM cluster environment with submitit.

Gram anchoring

High-resolution adaptation

Multi-distillation

Test setup:

Evaluation

The training code regularly saves the teacher weights. In order to evaluate the model, run the following evaluation on a single node:

Logistic regression classification on ImageNet-1k

k-NN classification on ImageNet-1k

Linear classification with data augmentation on ImageNet-1k

Text alignment on DINOv3 using dino.txt

Text alignment can be done following the method from dino.txt aka DINOv2 Meets Text.

# An example config for text alignment is here: dinov3/eval/text/configs/dinov3_vitl_text.yaml
PYTHONPATH=${PWD} python -m dinov3.run.submit dinov3/eval/text/train_dinotxt.py \
  --nodes 4 \
  trainer_config_file="<PATH/TO/DINOv3/TEXT/CONFIG>" \
  output-dir=<PATH/TO/OUTPUT/DIR>

Launching the above trains text alignment on 4 nodes with 8 GPUs each (32 GPUs in total).
Please note that the text alignment model in the DINOv3 paper was trained on a private dataset; the example config in dinov3/eval/text/configs/dinov3_vitl_text.yaml uses the CocoCaptions dataset for illustration purposes.
Please adapt the provided CocoCaptions dataset class to your setup; the dataset can be found here.

License

DINOv3 code and model weights are released under the DINOv3 License. See LICENSE.md for additional details.

Contributing

See contributing and the code of conduct.

Citing DINOv3

If you find this repository useful, please consider giving a star ⭐ and citation 🦖:

@article{simeoni2025dinov3,
  title = {{{DINOv3}}},
  author = {Sim{\'e}oni, Oriane and Vo, Huy V. and Seitzer, Maximilian and Baldassarre, Federico and Oquab, Maxime and Jose, Cijo and Khalidov, Vasil and Szafraniec, Marc and Yi, Seungeun and Ramamonjisoa, Micha{\"e}l and Massa, Francisco and Haziza, Daniel and Wehrstedt, Luca and Wang, Jianyuan and Darcet, Timoth{\'e}e and Moutakanni, Th{\'e}o and Sentana, Leonel and Roberts, Claire and Vedaldi, Andrea and Tolan, Jamie and Brandt, John and Couprie, Camille and Mairal, Julien and J{\'e}gou, Herv{\'e} and Labatut, Patrick and Bojanowski, Piotr},
  year = {2025},
  month = aug,
  url = {https://ai.meta.com/research/publications/dinov3},
  urldate = {2025-08-13},
}
