Large vision model (LVM) platforms

Technology Radar

Published: Oct 23, 2024

Assess

Large language models (LLMs) grab so much of our attention these days that we tend to overlook ongoing developments in large vision models (LVMs). These models can be used to segment, synthesize, reconstruct and analyze video streams and images, sometimes in combination with diffusion models or standard convolutional neural networks. Despite the potential for LVMs to revolutionize the way we work with visual data, we still face significant challenges in adapting and applying them in production environments. Video data, for instance, presents unique engineering challenges at every stage: collecting training data, segmenting and labeling objects, fine-tuning models and then deploying the resulting models and monitoring them in production. So, while LLMs lend themselves to simple chat interfaces or plain-text APIs, a computer vision engineer or data scientist must manage, version, annotate and analyze large quantities of streaming video data, and this work requires a visual interface. LVM platforms are a new category of tools and services, including V7, NVIDIA DeepStream SDK and Roboflow, that have emerged to address these challenges. DeepStream and Roboflow are particularly interesting to us because they combine an integrated GUI development environment for managing and annotating video streams with a set of Python, C++ or REST APIs to invoke the models from application code.