Computer Vision

Production computer vision systems for detection, classification, OCR, and video analytics — trained on your data, deployed where you need them, monitored after launch.

What is computer vision?

Computer vision is the engineering discipline of building systems that interpret images, video, and 3D data — detecting objects, classifying scenes, reading text, measuring dimensions, or tracking activity over time. In production, it powers automated quality inspection, drone-based infrastructure monitoring, OCR pipelines, video analytics, and visual search.

We build computer vision systems that ship to production — not research demos. The difference is the data pipeline, the labeling and re-labeling loop, the model optimization for the deployment target, and the monitoring that catches degradation before it causes a quality escape.

Key terms used on this page:

  • Object detection: Locating and classifying multiple objects within an image with bounding boxes. The output is "what" and "where." Models: YOLO, Detectron2, DETR. (A minimal detection sketch follows this list.)
  • Semantic segmentation: Labeling every pixel with a class — useful for measuring areas, defect shapes, or precise boundaries. Models: U-Net, Mask R-CNN, SAM.
  • OCR (Optical Character Recognition): Extracting text from images and documents. Tools: Tesseract, AWS Textract, Google Document AI, Azure Form Recognizer.
  • Edge inference: Running the model on a local device (Jetson, Coral, industrial PC) instead of in the cloud, for latency, bandwidth, or data-residency reasons.
  • mAP (mean Average Precision): The standard accuracy metric for object detection, combining precision and recall across confidence thresholds and classes.
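
To make the first of these terms concrete, here is a minimal detection sketch using the Ultralytics API referenced above. The image path is a hypothetical stand-in for a frame from your camera.

```python
# Minimal sketch: run a pre-trained detector on one image and print
# "what" and "where": class label, confidence score, and bounding box.
# Assumes the ultralytics package is installed; "line_camera.jpg" is a
# hypothetical local image.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # small COCO-pretrained model
results = model("line_camera.jpg")  # one Results object per input image

for box in results[0].boxes:
    label = model.names[int(box.cls)]        # "what"
    conf = float(box.conf)                   # confidence
    x1, y1, x2, y2 = box.xyxy[0].tolist()    # "where" (pixel coordinates)
    print(f"{label}: {conf:.2f} at ({x1:.0f}, {y1:.0f}) - ({x2:.0f}, {y2:.0f})")
```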

How does a computer vision project actually ship?

We follow a sequence designed to validate feasibility on real images before committing to a full build. Most failed CV projects skip data collection in real lighting and camera conditions, then discover the model doesn't work outside the sample set.

1. Discovery and feasibility. We collect representative images from the actual deployment environment — same camera, same lighting, same angles — and label a baseline set. Output: a feasibility report with achievable accuracy estimates.

2. Rapid prototyping. We fine-tune a pre-trained model (YOLO, ResNet, ViT) on the baseline set and measure mAP, precision, and recall on held-out data. If the prototype hits the business threshold, we proceed; if not, we recommend more data, better cameras, or a different approach.
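
A sketch of that prototyping step, assuming a baseline set already labeled in YOLO format and described by a hypothetical defects.yaml (which also configures the held-out split):

```python
# Fine-tune a pre-trained backbone on the baseline set, then measure
# mAP, precision, and recall on the held-out split. Epoch count and
# image size are typical starting points, not fixed values.
from ultralytics import YOLO

model = YOLO("yolov8s.pt")                             # pre-trained weights
model.train(data="defects.yaml", epochs=50, imgsz=640)

metrics = model.val()                                  # held-out evaluation
print(f"mAP50-95:  {metrics.box.map:.3f}")
print(f"mAP50:     {metrics.box.map50:.3f}")
print(f"precision: {metrics.box.mp:.3f}")
print(f"recall:    {metrics.box.mr:.3f}")
```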

3. Production engineering. We build the inference pipeline, optimize for the deployment target (cloud GPU, Jetson, Coral), integrate with cameras and downstream systems, and ship monitoring.

4. Deployment and tuning. We run a shadow mode period where the model sees real production data without acting on it, tune thresholds, and cut over with a rollback plan.
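
A minimal sketch of the shadow-mode pattern, with hypothetical names (model, legacy_decision, shadow_log.jsonl) standing in for the real systems: the model scores live production data and its predictions are logged for offline comparison, but the existing process still makes every decision that reaches the line.

```python
# Shadow mode: score real traffic, log both outputs, and act on no
# model prediction until the comparison justifies cutover.
import json
import time

def shadow_step(frame, model, legacy_decision, log_path="shadow_log.jsonl"):
    prediction = model(frame)          # model sees real production data
    decision = legacy_decision(frame)  # existing process still decides

    with open(log_path, "a") as f:     # one JSON record per frame
        f.write(json.dumps({
            "ts": time.time(),
            "model": prediction,
            "legacy": decision,
        }) + "\n")

    return decision                    # only the legacy decision acts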

A typical engagement runs 8 to 16 weeks. Multi-camera systems or projects requiring custom labeling pipelines run 4 to 6 months.

When should you use a cloud vision API versus a custom model?

Cloud vision APIs (AWS Rekognition, Google Vision API, Azure Computer Vision) are excellent for generic tasks — face detection, common object labels, celebrity recognition, content moderation, OCR on standard documents. They're fast to integrate and cheap at low volume.

Train a custom model when:

  • The objects are domain-specific. Manufacturing defects, specific equipment classes, retail SKUs, agricultural diseases — none of these are in the cloud APIs' training data.
  • Accuracy on your data is unacceptable. Cloud APIs often hit 70–80% on adjacent tasks; a custom model fine-tuned on your data routinely hits 95%+.
  • Latency requires edge inference. Real-time defect detection at 30+ FPS on a manufacturing line can't make a round-trip to the cloud.
  • Data residency or cost rules out the API. High-volume video analytics is prohibitively expensive on cloud APIs and often requires on-prem inference.

The middle path: use cloud APIs for the parts that are generic (OCR, face blurring, content moderation) and train custom models for the parts that are differentiated.
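
A sketch of that middle path under stated assumptions: boto3 installed with AWS credentials configured, and custom_defects.pt a hypothetical fine-tuned weights file. The cloud API covers the generic moderation layer; the custom model covers the domain-specific detection.

```python
# Hybrid pipeline: generic layer via a cloud API, differentiated layer
# via a custom fine-tuned detector.
import boto3
from ultralytics import YOLO

rekognition = boto3.client("rekognition")
custom_model = YOLO("custom_defects.pt")   # hypothetical fine-tuned weights

def analyze(image_path):
    with open(image_path, "rb") as f:
        image_bytes = f.read()

    # Generic: content moderation from the cloud API
    moderation = rekognition.detect_moderation_labels(
        Image={"Bytes": image_bytes}
    )["ModerationLabels"]

    # Differentiated: domain-specific objects from the custom model
    detections = custom_model(image_path)[0].boxes

    return moderation, detections
```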

How do you handle quality inspection and defect detection?

Defect detection is the highest-value computer vision use case in manufacturing and logistics, and the hardest to do well. The challenge is class imbalance — defects are rare, so naive accuracy metrics lie.

  • Data collection. We work with the operations team to capture defects in their natural distribution, augmented with rarer defect classes.
  • Modeling. Classification works for "is this part defective?" Object detection or segmentation works for "where on the part is the defect?" We typically use YOLOv8/YOLO-NAS for detection and U-Net or Mask R-CNN for segmentation.
  • Synthetic data. For very rare defects (1 in 10,000), we use diffusion models or copy-paste augmentation to expand the training set without waiting months for real defects.
  • Human-in-the-loop. We deploy with a confidence threshold below which the part routes to a human inspector. As the model improves, the threshold rises and the human review rate falls (see the routing sketch after this list).
  • Precision-recall tuning. We set the operating threshold for the cost ratio between false positives (rework) and false negatives (escapes to customer).
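
The routing sketch referenced above. The threshold value and the detections format are illustrative, not a fixed API; the real gate sits wherever the inspection pipeline hands off to rework or packaging.

```python
# Human-in-the-loop gate: confident defects auto-reject, uncertain
# parts route to a human inspector. The threshold starts conservative
# and rises as the model improves.
REVIEW_THRESHOLD = 0.80

def route_part(detections):
    """detections: list of (defect_class, confidence) pairs."""
    if not detections:
        return "pass"                  # no defect found
    top_conf = max(conf for _, conf in detections)
    if top_conf >= REVIEW_THRESHOLD:
        return "reject"                # confident defect: auto-reject
    return "human_review"              # uncertain: human inspector decides
```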

Production defect-detection systems we've shipped routinely hit 95–99% precision after the active-learning loop has run for 8–12 weeks.

How do you deploy computer vision at the edge?

Most industrial computer vision applications require real-time processing close to the camera — sub-100ms latency, no cloud dependency, often no network connectivity at all. We optimize models for the target hardware and treat deployment as a first-class engineering problem.

  • Hardware. NVIDIA Jetson Orin / Xavier for GPU-class workloads, Coral Edge TPU for low-power embedded, x86 industrial PCs with discrete GPUs for harder workloads, and increasingly Apple Silicon for niche cases.
  • Optimization. TensorRT for NVIDIA, ONNX Runtime as the cross-platform default, OpenVINO for Intel hardware. Quantization (INT8, FP16) cuts latency 2–4x with minimal accuracy loss when calibrated correctly.
  • Camera integration. RTSP from IP cameras, USB UVC, GigE Vision and USB3 Vision for industrial Basler/FLIR/Cognex cameras. We design the frame pipeline (capture, preprocessing, inference, postprocessing) to hit the latency target end-to-end, not just for the model; a capture-and-inference sketch follows this list.
  • Reliability. Watchdog services, automatic restart, frame-drop monitoring, and a heartbeat back to a central dashboard so you know which devices are healthy.
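
The capture-and-inference sketch referenced above, under stated assumptions: a fine-tuned model exported to ONNX at FP16, frames pulled over RTSP with OpenCV. The camera URL and weights file are hypothetical; on a Jetson you would typically export to TensorRT (format="engine") instead.

```python
# Edge pipeline sketch: one-time export, then a capture/inference loop.
import cv2
from ultralytics import YOLO

# One-time optimization step (FP16 roughly halves inference latency on
# supported hardware; calibrated INT8 goes further).
YOLO("custom_defects.pt").export(format="onnx", half=True)

model = YOLO("custom_defects.onnx")                   # ONNX Runtime backend
cap = cv2.VideoCapture("rtsp://192.0.2.10/stream1")   # hypothetical camera

while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        continue                       # count frame drops in monitoring
    results = model(frame, verbose=False)
    boxes = results[0].boxes
    # ...postprocess: thresholds, downstream PLC or alerting integration

cap.release()
```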

Should you build, buy, or partner for computer vision?

| Option | Best for | Speed | Differentiation | Cost (3-yr TCO) | Lock-in |
|---|---|---|---|---|---|
| Buy a cloud vision API (AWS Rekognition, Google Vision, Azure Computer Vision) | Generic tasks, low volume, fast integration | Days | None — competitors get the same labels | USD 10K–200K depending on volume | Medium — easy to switch APIs |
| Buy an industrial CV product (Cognex VisionPro, Keyence, Basler-based suites) | Standard manufacturing inspection, off-the-shelf gauging | Weeks | Low — vendor's library | USD 100K–600K including hardware | High — proprietary, expensive to leave |
| Buy a no-code platform (Roboflow, Landing AI, Datature) | Mid-complexity custom models, smaller data team | 2–8 weeks | Medium — you own the model weights | USD 30K–200K | Medium — weights portable, training pipeline sticky |
| Build in-house on open source (PyTorch, Ultralytics YOLO, OpenCV, MMDetection) | Mature engineering org, distinctive imagery, long horizon | 4–12 months to first production model | Highest | USD 500K–2M+ including platform team | Low |
| Partner with a custom shop (our model) | Domain-specific detection, edge deployment, want to own IP | 8–16 weeks per system | High — built on your data and cameras | USD 100K–350K per system, predictable | Low — you own the code and weights |

The pattern we recommend most often: use a cloud API for the generic layer (OCR, content moderation, face detection), partner on the domain-specific custom models, and build internal capability after 2–3 production wins justify a platform team.

How do you choose between YOLO, Detectron2, and Vision Transformers?

The architecture is downstream of the deployment target and accuracy requirements.

  • YOLO (YOLOv8, YOLO-NAS, YOLOv10) via Ultralytics. Default for real-time object detection, edge deployment, and most industrial use cases. Fast, well-supported, easy to train on custom data.
  • Detectron2 / MMDetection. Better accuracy on dense detection and segmentation tasks where speed is less critical. Default for research-grade benchmarks and complex scenes.
  • Vision Transformers (ViT, DINOv2, SAM). Strongest general-purpose visual backbones, excellent for classification and zero-shot tasks. Heavier compute, harder to deploy at the edge without distillation.
  • Hugging Face transformers (CLIP, BLIP, OWL-ViT). Default for vision-language tasks, image captioning, and open-vocabulary detection.
  • OpenCV. Still the right tool for classical CV (calibration, geometric transforms, blob detection) that ML overcomplicates.

We typically prototype with YOLOv8 for detection, ResNet or ViT for classification, and SAM/Mask R-CNN for segmentation. The final architecture is decided by the evaluation harness, not the framework's marketing.
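
A sketch of what "decided by the evaluation harness" means: every candidate trains and scores on the same held-out split, with a crude latency check against the deployment budget. The dataset config and sample image are hypothetical carry-overs from the prototyping sketch above.

```python
# Evaluation harness: same data, same metrics; the numbers pick the winner.
import time
from ultralytics import YOLO

candidates = ["yolov8n.pt", "yolov8s.pt", "yolov8m.pt"]

for weights in candidates:
    model = YOLO(weights)
    model.train(data="defects.yaml", epochs=50, imgsz=640)
    m = model.val()                    # held-out mAP, precision, recall

    t0 = time.perf_counter()           # crude single-shot latency; warm up
    model("sample.jpg")                # and average over many frames in practice
    latency_ms = (time.perf_counter() - t0) * 1000

    print(f"{weights}: mAP50-95={m.box.map:.3f} P={m.box.mp:.3f} "
          f"R={m.box.mr:.3f} latency={latency_ms:.0f}ms")
```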

What does a computer vision engagement look like with us?

As outlined in the delivery sequence above, a single system ships in 8 to 16 weeks; multi-camera deployments or projects with custom labeling pipelines run 4 to 6 months. Every engagement starts with a 1-week scoping sprint that captures real images from the actual deployment environment and validates feasibility before you commit to the full build.

We charge hourly with a cap, so the budget is predictable and scope can flex. Outcomes are measured against the business metric — defects caught, false-reject rate reduction, throughput gain, manual-review time saved — and instrumented from day one. Model metrics (mAP, precision, recall) and business metrics live in the same dashboard, so the trade-offs are visible to operations, not just to engineering.

What does computer vision cost?

Realistic ranges based on the engagements we run:

  • Single-camera detection or classification system (one model, cloud or edge): USD 100,000 to 180,000.
  • Multi-camera edge deployment (3–10 cameras, real-time inference, monitoring): USD 200,000 to 450,000.
  • Full inspection platform (custom labeling pipeline, retraining loop, multi-line deployment): USD 350,000 to 800,000.

Hardware for edge deployments runs USD 1,500–6,000 per device for Jetson-class hardware plus camera and enclosure. Cloud GPU inference for non-real-time workloads typically runs USD 500–5,000 per month depending on volume.

Annual maintenance runs 15–25% of build cost — re-labeling on drifted data, retraining, and adapting to new product variants or seasonal changes.

For pricing on the strategy work that often precedes a build, see our AI Consulting page. For platform and engagement pricing details, see Pricing.

Frequently asked questions about computer vision

Should we use a cloud vision API or train a custom model?

Use a cloud API (AWS Rekognition, Google Vision, Azure Computer Vision) when the task is generic — face detection, common object labeling, OCR on standard documents. Train a custom model when you need to detect domain-specific objects, defects, or conditions the API was never trained on. Most industrial use cases require custom.

How much labeled data do we need for a custom computer vision model?

With transfer learning from a pre-trained backbone (YOLO, ResNet, Vision Transformer), 500–2,000 labeled images per class is usually enough for a working model. From scratch, you'd need 10x that. For defect detection on rare events, augmentation and synthetic data extend the dataset further.

Can computer vision run in real time on edge devices?

Yes. With model optimization — quantization, pruning, TensorRT or ONNX conversion — modern detection models run at 30+ FPS on NVIDIA Jetson Orin, Coral Edge TPU, and even Raspberry Pi class hardware for smaller models. Latency targets of 30–100ms are routine.

How accurate are production computer vision systems?

On well-defined detection tasks with clean data, mAP (mean average precision) of 90%+ is achievable. Defect detection in manufacturing routinely hits 95–99% precision once the model is trained on enough representative defects. The constraint is data, not the model.

How do you handle model degradation when conditions change?

Lighting, camera angle, seasonal changes, and new product variants all degrade vision models. We instrument production with confidence-distribution monitoring, scheduled re-evaluation on fresh samples, and an active-learning loop that flags uncertain predictions for human review and re-labeling.
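
A hedged sketch of that monitoring loop: a rolling window of per-frame top confidences compared against a baseline measured during shadow mode, with uncertain predictions queued for re-labeling. Every threshold here is illustrative.

```python
# Confidence-distribution monitoring plus an active-learning queue.
from collections import deque

BASELINE_MEAN_CONF = 0.91        # measured during shadow mode
DRIFT_TOLERANCE = 0.10           # alert if the rolling mean sags this much
UNCERTAIN_BAND = (0.30, 0.70)    # predictions here get queued for labeling

window = deque(maxlen=1000)      # rolling window of top confidences
relabel_queue = []               # frames flagged for human review

def observe(frame_id, top_confidence):
    window.append(top_confidence)

    lo, hi = UNCERTAIN_BAND
    if lo <= top_confidence <= hi:
        relabel_queue.append(frame_id)         # active-learning candidate

    if len(window) == window.maxlen:
        mean_conf = sum(window) / len(window)
        if mean_conf < BASELINE_MEAN_CONF - DRIFT_TOLERANCE:
            print(f"DRIFT ALERT: rolling mean confidence {mean_conf:.2f}")
```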

What hardware do we need to run a computer vision system?

Cloud inference on GPU instances (T4, L4, A10) for non-real-time workloads. NVIDIA Jetson Orin or Xavier for real-time edge. Coral Edge TPU for low-power embedded. Industrial deployments often pair Cognex or Basler cameras with Jetson devices. We pick the hardware in scoping, not after.

Can we use computer vision with our existing camera infrastructure?

Usually yes. We integrate with RTSP streams, IP cameras, USB webcams, and industrial cameras (Basler, FLIR, Cognex). For older analog systems, we add a frame grabber or migrate to IP cameras as part of the project. We don't require ripping out your camera hardware.

Ready to Transform Your Business with AI?

Let's discuss how our AI solutions can drive growth, reduce costs, and create competitive advantages for your organization.

Schedule a Consultation