This project brings object detection and segmentation to life using YOLOv8 (You Only Look Once, version 8), a state-of-the-art generation of the YOLO deep learning framework. It delivers advanced computer vision functionality for identifying and segmenting objects in images, video files, and live camera streams.
Representing a significant advancement in object detection technology, YOLOv8 provides higher accuracy, faster performance, and a more user-friendly design compared to its predecessors. The project highlights real-world applications of YOLOv8 for both detection (drawing bounding boxes around objects) and segmentation (creating pixel-level masks), making it highly suitable for applications like security surveillance, autonomous driving systems, retail analytics, and industrial automation.
The system includes both a Command Line Interface (CLI) and a Python API, offering flexibility for different workflows. It handles batch processing for static images and videos, as well as real-time analysis through webcam input.
Core Features:
The project leverages YOLOv8's state-of-the-art architecture for object detection and segmentation tasks. YOLOv8 processes images in a single forward pass through the neural network, making it exceptionally fast while maintaining high accuracy.
YOLOv8 represents a major evolution in the YOLO series, introducing several architectural improvements:
Backbone Network: a CSPDarknet-based feature extractor with residual connections and an SPPF layer, producing feature maps at multiple scales.
Neck Network: a PAN (Path Aggregation Network) that fuses features both top-down and bottom-up for multi-scale aggregation.
Head Network (Detection): an anchor-free, decoupled head with separate classification and box regression branches.
Head Network (Segmentation): extends the detection head with prototype mask generation and per-instance mask coefficients.
Object Detection uses YOLOv8 detection models (yolov8x.pt) to identify objects and draw bounding boxes around them. The model predicts:
- Bounding box coordinates for each object
- A confidence score for each detection
- A class label from the 80 COCO categories

The detection process involves:
- Preprocessing the input (resizing, normalization, letterbox padding)
- A single forward pass through the backbone, neck, and detection head
- Post-processing with confidence filtering and Non-Maximum Suppression
Object Segmentation employs YOLOv8 segmentation models (yolov8x-seg.pt) to perform instance segmentation. Beyond detection, the model generates:
- A pixel-level mask for every detected instance
- Mask coefficients that combine a shared set of prototype masks

The segmentation process extends detection with:
- Prototype mask generation in the segmentation head
- A per-instance linear combination of prototypes, followed by sigmoid activation
- Mask upsampling and thresholding to the original resolution
The system is organized into six independent functionalities:
1. Object detection on images
2. Object segmentation on images
3. Object detection on videos
4. Object segmentation on videos
5. Real-time object detection (webcam)
6. Real-time object segmentation (webcam)
Each functionality can be executed through either CLI commands or Python scripts, providing flexibility for different use cases. The CLI approach is ideal for quick testing and batch processing, while the Python API allows for integration into larger applications and custom workflows.
All operations use pre-trained YOLOv8 models capable of detecting 80 different object classes from the COCO dataset. The models are optimized for both accuracy and real-time inference speed.
YOLOv8 divides the input image into an $S \times S$ grid and predicts bounding boxes directly without anchor boxes:
Grid Cell Prediction: For each grid cell $(i, j)$, the model predicts:
$$\mathbf{P}_{ij} = [\hat{x}, \hat{y}, \hat{w}, \hat{h}, \text{conf}, c_1, c_2, ..., c_n]$$
where:
- $\hat{x}, \hat{y}$ are the predicted box-center offsets within the grid cell
- $\hat{w}, \hat{h}$ are the raw width and height predictions
- $\text{conf}$ is the objectness confidence score
- $c_1, c_2, ..., c_n$ are the per-class probabilities
The model predicts offsets that are transformed to absolute coordinates:
$$x = \sigma(\hat{x}) + c_x$$
$$y = \sigma(\hat{y}) + c_y$$
$$w = p_w \cdot e^{\hat{w}}$$
$$h = p_h \cdot e^{\hat{h}}$$
where:
- $\sigma$ is the sigmoid function
- $c_x, c_y$ are the coordinates of the grid cell's top-left corner
- $p_w, p_h$ are the width and height of the reference (prior) box
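To make the decoding concrete, here is a minimal, illustrative Python sketch of these transforms. The function name and the prior values are assumptions for demonstration, not the Ultralytics internals:

```python
import math

def decode_box(x_hat, y_hat, w_hat, h_hat, cx, cy, pw, ph):
    """Decode raw network outputs into absolute box coordinates
    using the transforms above. All names are illustrative."""
    sigmoid = lambda t: 1.0 / (1.0 + math.exp(-t))
    x = sigmoid(x_hat) + cx      # box center x, offset into grid cell (cx, cy)
    y = sigmoid(y_hat) + cy      # box center y
    w = pw * math.exp(w_hat)     # width scaled from the prior pw
    h = ph * math.exp(h_hat)     # height scaled from the prior ph
    return x, y, w, h

# Example: raw outputs for the cell at grid position (7, 4) with a 1x1 prior
print(decode_box(0.2, -0.1, 0.3, 0.05, cx=7, cy=4, pw=1.0, ph=1.0))
```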
IoU measures the overlap between predicted box $B_p$ and ground truth box $B_{gt}$:
$$\text{IoU}(B_p, B_{gt}) = \frac{\text{Area}(B_p \cap B_{gt})}{\text{Area}(B_p \cup B_{gt})}$$
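A straightforward Python implementation of this formula (illustrative, not the library's vectorized version):

```python
def iou(box_p, box_gt):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Intersection rectangle
    ix1 = max(box_p[0], box_gt[0])
    iy1 = max(box_p[1], box_gt[1])
    ix2 = min(box_p[2], box_gt[2])
    iy2 = min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    # Union = sum of areas minus intersection
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7 ≈ 0.143
```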
Complete IoU (CIoU) Loss: YOLOv8 uses CIoU for bounding box regression:
$$\mathcal{L}_{\text{CIoU}} = 1 - \text{IoU} + \frac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2} + \alpha v$$
where:
- $\rho(\mathbf{b}, \mathbf{b}^{gt})$ is the Euclidean distance between the predicted and ground-truth box centers
- $c$ is the diagonal length of the smallest box enclosing both
- $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ measures aspect-ratio consistency
- $\alpha = \frac{v}{(1 - \text{IoU}) + v}$ is a positive trade-off weight
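Putting these pieces together, a scalar Python sketch of the CIoU loss, reusing the iou() helper defined above; this mirrors the formula rather than the optimized Ultralytics implementation:

```python
import math

def ciou_loss(box_p, box_gt):
    """CIoU loss for boxes in (x1, y1, x2, y2) format; reuses iou() from above."""
    # Center-distance term rho^2 / c^2
    pcx, pcy = (box_p[0] + box_p[2]) / 2, (box_p[1] + box_p[3]) / 2
    gcx, gcy = (box_gt[0] + box_gt[2]) / 2, (box_gt[1] + box_gt[3]) / 2
    rho2 = (pcx - gcx) ** 2 + (pcy - gcy) ** 2
    # Diagonal of the smallest enclosing box
    ex1, ey1 = min(box_p[0], box_gt[0]), min(box_p[1], box_gt[1])
    ex2, ey2 = max(box_p[2], box_gt[2]), max(box_p[3], box_gt[3])
    c2 = (ex2 - ex1) ** 2 + (ey2 - ey1) ** 2
    # Aspect-ratio consistency term v and trade-off weight alpha
    w_p, h_p = box_p[2] - box_p[0], box_p[3] - box_p[1]
    w_gt, h_gt = box_gt[2] - box_gt[0], box_gt[3] - box_gt[1]
    v = (4 / math.pi ** 2) * (math.atan(w_gt / h_gt) - math.atan(w_p / h_p)) ** 2
    i = iou(box_p, box_gt)
    alpha = v / ((1 - i) + v + 1e-9)
    return 1 - i + rho2 / c2 + alpha * v
```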
Total Loss:
$$\mathcal{L}_{\text{total}} = \lambda_{\text{box}} \mathcal{L}_{\text{box}} + \lambda_{\text{cls}} \mathcal{L}_{\text{cls}} + \lambda_{\text{dfl}} \mathcal{L}_{\text{dfl}}$$
Box Loss (CIoU):
$$\mathcal{L}_{\text{box}} = \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \, \mathcal{L}_{\text{CIoU}}(B_{ij}, \hat{B}_{ij})$$
Classification Loss (Binary Cross-Entropy):
$$\mathcal{L}_{\text{cls}} = -\sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{\text{obj}} \sum_{c \in \text{classes}} \left[ p_c \log(\hat{p}_c) + (1-p_c)\log(1-\hat{p}_c) \right]$$
Distribution Focal Loss (DFL):
$$\mathcal{L}_{\text{dfl}} = -\left[(y_{i+1} - y)\log(S_i) + (y - y_i)\log(S_{i+1})\right]$$
where $y$ is the continuous regression target, $y_i$ and $y_{i+1} = y_i + 1$ are its two neighboring integer bins, and $S_i$, $S_{i+1}$ are the softmax probabilities the box-regression distribution assigns to those bins.
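As a sanity check on the formula, a minimal Python sketch of DFL for a single coordinate (the bin layout and names are illustrative):

```python
import math

def dfl(probs, y):
    """Distribution Focal Loss for one coordinate.
    probs: softmax distribution over integer bins 0..n
    y:     continuous regression target, with floor(y) = y_i"""
    yi = int(math.floor(y))       # left bin y_i
    # Interpolate the target between the two neighboring bins
    left_weight = (yi + 1) - y    # (y_{i+1} - y)
    right_weight = y - yi         # (y - y_i)
    return -(left_weight * math.log(probs[yi])
             + right_weight * math.log(probs[yi + 1]))

# Example: target 2.3 lies between bins 2 and 3
print(dfl([0.05, 0.10, 0.60, 0.20, 0.05], 2.3))  # ≈ 0.84
```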
NMS eliminates duplicate detections by suppressing boxes with high IoU overlap:
Algorithm:
1. Sort all detections by confidence score in descending order.
2. Select the detection with the highest confidence and add it to the keep set.
3. Remove all remaining boxes whose IoU with the selected box exceeds the threshold $\tau$.
4. Repeat steps 2-3 until no boxes remain.
Mathematical Formulation:
$$\mathcal{D} = \{B_1, B_2, ..., B_n\} \quad \text{(sorted by confidence)}$$
$$\mathcal{D}_{\text{keep}} = \{B_i \in \mathcal{D} \mid \text{IoU}(B_i, B_j) < \tau \ \ \forall B_j \in \mathcal{D}_{\text{keep}} \text{ with } \text{conf}(B_j) > \text{conf}(B_i)\}$$
where $\tau$ is the NMS threshold.
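A compact NumPy implementation of this greedy procedure (illustrative; the library applies a batched, class-aware variant internally):

```python
import numpy as np

def nms(boxes, scores, tau=0.45):
    """Greedy NMS. boxes: (N, 4) array in (x1, y1, x2, y2); scores: (N,)."""
    order = np.argsort(scores)[::-1]      # indices sorted by confidence, descending
    keep = []
    while order.size > 0:
        i = order[0]                      # highest-confidence remaining box
        keep.append(i)
        # IoU of box i against every other remaining box
        x1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        y1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        x2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        y2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        # Keep only boxes whose overlap with box i is below the threshold
        order = order[1:][iou < tau]
    return keep
```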
For instance segmentation, YOLOv8 predicts mask coefficients and combines them with prototype masks:
Prototype Masks: The neck network generates $k$ prototype masks:
$$\mathbf{P} = \{\mathbf{P}_1, \mathbf{P}_2, ..., \mathbf{P}_k\} \in \mathbb{R}^{k \times H \times W}$$
Mask Coefficients: For each detected instance, predict coefficient vector:
$$\mathbf{c}_i = [c_{i1}, c_{i2}, ..., c_{ik}] \in \mathbb{R}^k$$
Final Mask: Linear combination followed by sigmoid activation:
$$\mathbf{M}_i = \sigma\left(\sum_{j=1}^{k} c_{ij} \cdot \mathbf{P}_j\right)$$
where $\mathbf{M}_i \in [0,1]^{H \times W}$ is the soft mask for instance $i$, which is thresholded to a binary mask during post-processing.
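The prototype-and-coefficient scheme reduces to a single matrix multiply; a NumPy sketch under assumed shapes:

```python
import numpy as np

def assemble_masks(prototypes, coeffs, threshold=0.5):
    """Combine k prototype masks with per-instance coefficients.
    prototypes: (k, H, W) array; coeffs: (N, k), one row per detected instance."""
    k, H, W = prototypes.shape
    # Linear combination: (N, k) @ (k, H*W) -> (N, H*W)
    logits = coeffs @ prototypes.reshape(k, H * W)
    soft = 1.0 / (1.0 + np.exp(-logits))          # sigmoid -> values in [0, 1]
    return (soft > threshold).reshape(-1, H, W)   # binary instance masks

protos = np.random.randn(32, 160, 160)   # k = 32 prototypes (assumed size)
coeffs = np.random.randn(3, 32)          # 3 detected instances
print(assemble_masks(protos, coeffs).shape)  # (3, 160, 160)
```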
Mask Loss (Binary Cross-Entropy):
$$\mathcal{L}_{\text{mask}} = -\frac{1}{HW} \sum_{x,y} \left[ m_{xy} \log(\hat{m}_{xy}) + (1-m_{xy})\log(1-\hat{m}_{xy}) \right]$$
where $m_{xy}$ is ground truth mask and $\hat{m}_{xy}$ is predicted mask at pixel $(x,y)$.
The final detection confidence combines objectness and class probability:
$$\text{Score} = \text{Objectness} \times \text{Class Probability}$$
$$\text{Score}_c = P(\text{Object}) \times P(\text{Class}=c \mid \text{Object})$$
Detections with $\text{Score}_c < \text{threshold}$ (typically 0.25) are filtered out.
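In vectorized form, the scoring and filtering step looks like this (toy numbers for illustration):

```python
import numpy as np

# Illustrative: objectness (N,) and per-class probabilities (N, C)
objectness = np.array([0.9, 0.4, 0.1])
class_probs = np.array([[0.8, 0.2], [0.5, 0.5], [0.9, 0.1]])

scores = objectness[:, None] * class_probs   # Score_c = P(Object) * P(Class=c | Object)
keep = scores.max(axis=1) >= 0.25            # filter by the best class score
print(keep)                                  # [ True False False]
```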
requirements.txt
ultralytics>=8.0.0
# Clone the repository
git clone https://github.com/kemalkilicaslan/Object-Detection-and-Segmentation-with-YOLOv8.git
cd Object-Detection-and-Segmentation-with-YOLOv8
# Install required package
pip install -r requirements.txt
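To confirm the installation, the ultralytics package ships a built-in environment check:

```python
# Optional sanity check after installation
import ultralytics
ultralytics.checks()  # prints version, Python, torch, and hardware info
```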
Object-Detection-and-Segmentation-with-YOLOv8
├── Object-Detection-with-YOLOv8/
├── Object-Segmentation-with-YOLOv8/
├── README.md
├── requirements.txt
└── LICENSE
Pre-trained Models (automatically downloaded on first use):
- yolov8x.pt - YOLOv8 extra-large detection model
- yolov8x-seg.pt - YOLOv8 extra-large segmentation model

Input Files:
- Images: .jpg, .png, .webp formats
- Videos: .mp4, .avi, .mov formats

CLI:
yolo detect predict model=yolov8x.pt source="img.jpg" save=True
Python:
from ultralytics import YOLO
model = YOLO('yolov8x.pt')
results = model('img.jpg', save=True)
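The returned Results objects expose the detections directly; for example, to print each detection's class name, confidence, and coordinates:

```python
# Inspect the returned results (Ultralytics Results API)
for box in results[0].boxes:
    cls_id = int(box.cls)           # class index
    print(model.names[cls_id],      # class name, e.g. 'person'
          float(box.conf),          # confidence score
          box.xyxy[0].tolist())     # [x1, y1, x2, y2] coordinates
```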
CLI:
yolo segment predict model=yolov8x-seg.pt source="img.jpg" save=True
Python:
from ultralytics import YOLO
model = YOLO('yolov8x-seg.pt')
results = model('img.jpg', save=True)
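For segmentation runs, the predicted masks are available alongside the boxes:

```python
# Access the predicted masks (Ultralytics Results API)
masks = results[0].masks          # None if nothing was detected
if masks is not None:
    print(masks.data.shape)       # (num_instances, H, W) mask tensor
```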
CLI:
yolo detect predict model=yolov8x.pt source="video.mp4" save=True
Python:
from ultralytics import YOLO
model = YOLO('yolov8x.pt')
results = model('video.mp4', save=True)
CLI:
yolo segment predict model=yolov8x-seg.pt source="video.mp4" save=True
Python:
from ultralytics import YOLO
model = YOLO('yolov8x-seg.pt')
results = model('video.mp4', save=True)
CLI:
yolo detect predict model=yolov8x.pt source=0 show=True
Python:
from ultralytics import YOLO
model = YOLO('yolov8x.pt')
model.predict(source="0", show=True)
Controls:
- Press q or Esc to quit the application

CLI:
yolo segment predict model=yolov8x-seg.pt source=0 show=True
Python:
from ultralytics import YOLO
model = YOLO('yolov8x-seg.pt')
model.predict(source="0", show=True)
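For custom frame-by-frame processing, passing stream=True makes predict() yield one Results object per frame instead of accumulating them all in memory:

```python
from ultralytics import YOLO

model = YOLO('yolov8x-seg.pt')
# stream=True returns a generator: one Results object per webcam frame
for result in model.predict(source=0, stream=True, show=True):
    print(len(result.boxes), 'objects in this frame')
```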
Input Image:
Output Image:
Input Image:
Output Image:
Input Video:
Output Video:
Input Video:
Output Video:
Demo Video:
Demo Video:
Performance varies based on hardware, model size, and input resolution:
| Metric | Object Detection | Object Segmentation |
|---|---|---|
| Processing Speed (GPU) | 50-100+ FPS | 30-60 FPS |
| Processing Speed (CPU) | 5-15 FPS | 2-8 FPS |
| Detection Accuracy (mAP) | 53.9% (COCO) | 52.3% (COCO) |
| Supported Classes | 80 (COCO dataset) | 80 (COCO dataset) |
Model Comparison:
| Model Size | Parameters | Speed (ms) | mAP50 | mAP50-95 |
|---|---|---|---|---|
| YOLOv8n | 3.2M | 1.5 | 37.3% | 28.4% |
| YOLOv8s | 11.2M | 2.3 | 44.9% | 36.2% |
| YOLOv8m | 25.9M | 4.5 | 50.2% | 42.8% |
| YOLOv8l | 43.7M | 6.8 | 52.9% | 45.7% |
| YOLOv8x | 68.2M | 9.2 | 53.9% | 47.1% |
[Image/Video/Camera Input]
↓
[Preprocessing]
├── Resize to 640×640
├── Normalize pixel values
└── Letterbox padding
↓
[YOLOv8 Backbone (CSPDarknet)]
├── Feature extraction at multiple scales
├── Residual connections
└── SPPF layer
↓
[Neck Network (PAN)]
├── Bottom-up feature fusion
├── Top-down feature fusion
└── Multi-scale feature aggregation
↓
[Detection Head (Anchor-free)]
├── Classification branch → Class probabilities
├── Regression branch → Box coordinates
└── Objectness branch → Confidence scores
↓
[Post-processing]
├── Confidence filtering (threshold > 0.25)
├── Non-Maximum Suppression (IoU threshold)
└── Coordinate transformation to original size
↓
[Output: Bounding Boxes + Labels + Confidence]
[Image/Video/Camera Input]
↓
[Preprocessing]
↓
[YOLOv8 Backbone + Neck]
↓
[Detection Head]
├── Classification
├── Box regression
└── Mask coefficients prediction
↓
[Segmentation Head]
├── Prototype mask generation (k masks)
├── Mask coefficient per instance
└── Linear combination: M = σ(Σ c_i · P_i)
↓
[Post-processing]
├── NMS on boxes
├── Mask upsampling to original resolution
├── Binary mask thresholding
└── Instance-level mask refinement
↓
[Output: Segmentation Masks + Boxes + Labels]
| Library | Version | Purpose |
|---|---|---|
| ultralytics | 8.0+ | YOLOv8 implementation, model training, and inference |
YOLOv8 Detection Models:
- yolov8x.pt (extra-large)

YOLOv8 Segmentation Models:
- yolov8x-seg.pt (extra-large)

Supported Object Classes (COCO): person, bicycle, car, motorcycle, airplane, bus, train, truck, boat, traffic light, fire hydrant, stop sign, parking meter, bench, bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe, backpack, umbrella, handbag, tie, suitcase, frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket, bottle, wine glass, cup, fork, knife, spoon, bowl, banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake, chair, couch, potted plant, bed, dining table, toilet, tv, laptop, mouse, remote, keyboard, cell phone, microwave, oven, toaster, sink, refrigerator, book, clock, vase, scissors, teddy bear, hair drier, toothbrush.
This project is open source and available under the Apache License 2.0.
Special thanks to the Ultralytics team for developing and maintaining YOLOv8, making state-of-the-art object detection and segmentation accessible to everyone. This project builds upon the COCO dataset and the extensive research in computer vision that has enabled these capabilities.
Note: This project uses pre-trained models for demonstration purposes. For production applications, consider fine-tuning models on domain-specific datasets and ensuring compliance with relevant regulations regarding computer vision and AI systems.