Video length is 6:11

Automated Data Labeling Using Vision-Language Models

Learn how to use the latest vision-language models in MATLAB® to automate data labeling for aerial imagery. The workflow uses CLIP for image retrieval, Grounding DINO for text-prompted object detection, and Moondream for image captioning. The Segment Anything Model (SAM) is also included to generate pixel-level object masks from the object detections. The imagery comes from National Agriculture Imagery Program (NAIP) data obtained through USGS EarthExplorer and was captured over Hanscom Air Force Base.

Published: 13 Feb 2026