Using object segmentation to roto moving objects and people in live-action video, with the aim of separating specific parts, such as individual clothing items (e.g., shirts, pants, arms). The ideal output would be the generated masks saved as a Cryptomatte channel in an EXR sequence for use in Nuke, similar to the functionality available for 3D renders but for live-action footage.
My background is in VFX, where I have experience with manual tracking and rotoscoping; however, I have no prior experience with machine learning.
Paper: SAM 2: Segment Anything in Images and Videos
I decided to train my own model instead of using a pre-trained one from Meta so I could learn the full workflow. I first tried U-Net, which identified humans but didn’t produce clean enough results for usable roto. After more research, I switched to SegNet. It sometimes segmented humans well but often predicted in empty areas during early epochs, wasting resources. I realized that focusing only on regions containing humans could improve results and reduce false positives. Since YOLO's detections are robust, I combined it with SegNet to provide both a bounding box and an additional filter: the bounding box helps focus the model while also serving as a garbage mask.
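To illustrate the idea, here is a minimal sketch of the detect-then-segment step, assuming the ultralytics YOLOv8 bindings and a trained SegNet-style PyTorch model. The `segnet` argument, the 960-pixel crop resolution, and the paste-back logic are illustrative assumptions, not the exact training code:

```python
import cv2
import numpy as np
import torch
from ultralytics import YOLO

yolo = YOLO("yolov8n.pt")  # pre-trained detector; COCO class 0 is "person"
SIZE = 960                 # assumed SegNet training resolution

def segment_people(frame_rgb, segnet, device="cuda"):
    """Run YOLO to find people, then run SegNet only inside each box."""
    h, w = frame_rgb.shape[:2]
    matte = np.zeros((h, w), dtype=np.float32)    # final soft matte
    garbage = np.zeros((h, w), dtype=np.float32)  # union of person boxes

    results = yolo(frame_rgb, classes=[0], verbose=False)  # people only
    for x1, y1, x2, y2 in results[0].boxes.xyxy.cpu().numpy().astype(int):
        garbage[y1:y2, x1:x2] = 1.0
        crop = cv2.resize(frame_rgb[y1:y2, x1:x2], (SIZE, SIZE))
        # HWC uint8 -> NCHW float tensor in [0, 1]
        t = torch.from_numpy(crop).permute(2, 0, 1).float().div(255.0)
        t = t.unsqueeze(0).to(device)
        with torch.no_grad():
            pred = torch.sigmoid(segnet(t))[0, 0].cpu().numpy()
        pred = cv2.resize(pred, (x2 - x1, y2 - y1))
        # paste the crop's prediction back; everything outside the boxes
        # stays 0, so the detection doubles as a garbage mask
        matte[y1:y2, x1:x2] = np.maximum(matte[y1:y2, x1:x2], pred)
    return matte, garbage
```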
- Data Source: https://www.kaggle.com/datasets/tapakah68/supervisely-filtered-segmentation-person-dataset (Kaggle / Supervisely).
The dataset is available for non-commercial use, such as research and teaching, which is a positive step. A benefit is that the company offers automated annotation tools, reducing the need for underpaid manual work, as seen in their labelling toolbox: https://supervisely.com/labeling-toolbox/images.
A problematic area is the licensing of the images. Supervisely has stated that some images in the dataset originate from Pexels and are linked to the platform, without specifying which images. Pexels' licensing policy allows unrestricted use of content for personal and commercial purposes without direct permission from the subjects. Moreover, since Pexels accepts user-generated uploads, it is difficult to verify whether every individual depicted has given consent, especially in a dataset as varied as this one.
A reverse image search identified some randomly chosen images as stock photos, which is reassuring. However, the dataset's diversity and the lack of clarity about image sources make it difficult to verify ethical use thoroughly. The company should pay closer attention to rights and ensure consent is obtained from participants. Despite these concerns, using AI tools for labelling is a step in the right direction, and I have used the dataset within its terms of service and ethical guidelines.
Here is an example of the results on both the training and validation sets.

(overlay is just for visualization purposes)
Here is an example of the model working on video files. I use YOLO to create a bounding box that helps SegNet focus only on the regions of the scene where humans are; it is also used to remove any artefacts that occur outside the box. I then run my pre-trained SegNet to make predictions on each frame. I provide the user with both the roto'd human from SegNet and the bounding box, in case they want to use it as a garbage mask or for other purposes. The final output is a multichannel EXR that I successfully tested in Nuke.
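For reference, here is a minimal sketch of how the per-frame mattes might be written into a multichannel EXR using the OpenEXR Python bindings. The channel names `matte.person` and `matte.bbox` are illustrative choices, not necessarily the ones my script uses:

```python
import Imath
import OpenEXR
import numpy as np

def write_multichannel_exr(path, rgb, matte, bbox_mask):
    """Write an RGB frame plus two extra mask channels to one EXR file."""
    h, w = matte.shape
    header = OpenEXR.Header(w, h)
    pt = Imath.Channel(Imath.PixelType(Imath.PixelType.FLOAT))
    # declare every channel Nuke should see in the file
    header["channels"] = {
        "R": pt, "G": pt, "B": pt,
        "matte.person": pt,  # SegNet prediction
        "matte.bbox": pt,    # YOLO garbage mask
    }
    out = OpenEXR.OutputFile(path, header)
    out.writePixels({
        "R": rgb[..., 0].astype(np.float32).tobytes(),
        "G": rgb[..., 1].astype(np.float32).tobytes(),
        "B": rgb[..., 2].astype(np.float32).tobytes(),
        "matte.person": matte.astype(np.float32).tobytes(),
        "matte.bbox": bbox_mask.astype(np.float32).tobytes(),
    })
    out.close()
```

A nice side effect of the dot-separated names is that Nuke groups them into a layer (here `matte`) with sub-channels, so the extra mattes become directly selectable in a Shuffle node.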
Screen recordings: https://github.com/NCCA/ml-programming-assignment-alexmoed/tree/main/Screen_recording_demos
Example EXR: https://github.com/NCCA/ml-programming-assignment-alexmoed/tree/main/Output
Overall, I enjoyed this project and am happy with the progress. I learned a lot about implementing neural networks, and I also gained a better understanding of how EXRs work, how to generate channels, and how to process video data.
But with any project, there is always room for improvement. What would I do with more time? The first priority would be to change the aspect ratio and image resolution to get tighter rotos and capture detail better. I'd also like to keep working on filling holes in the mattes; I started on this, but it's not quite there yet, and I may need to rethink the approach. I would also train on a much larger dataset, ideally one that lets me select different parts of a person, such as a shirt, and I would like to figure out Cryptomattes. In the long term, I'd like to turn this into a plugin for Nuke.
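As a starting point for the hole-filling idea, here is a minimal sketch using SciPy's binary hole fill on a thresholded matte. The 0.5 threshold and the way filled holes are blended back into the soft matte are assumptions, not my current code:

```python
import numpy as np
from scipy import ndimage

def fill_matte_holes(matte, threshold=0.5):
    """Fill fully enclosed holes in a soft matte, leaving edges untouched."""
    solid = matte > threshold                  # binarise the prediction
    filled = ndimage.binary_fill_holes(solid)  # close fully enclosed gaps
    holes = filled & ~solid                    # pixels that were holes
    out = matte.copy()
    out[holes] = 1.0                           # set filled holes to solid
    return out
```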
Supplemental files:
checkpoint: https://storage.googleapis.com/anmstorage/epoch_checkpoint_v052_960.pt
NP array cache (optional):
https://storage.googleapis.com/anmstorage/focus_np.npy
https://storage.googleapis.com/anmstorage/images_np.npy
https://storage.googleapis.com/anmstorage/masks_np.npy
