WAPO_AR

Quick Start

Installs to remember:

```
brew install python
brew install ffmpeg
pip install torch torchvision
```

To list the available resolutions of a YouTube video:

```
yt-dlp -F --no-playlist "link goes here"
```

The entire process can also be run from a single command in the terminal: downloading the YouTube video from its link, turning it into frames, passing the frames through the model, getting probabilities, and cutting and piecing together the final keep and cut videos. To run everything in one command:

```
python run_workflow.py <youtube_link> <path to contain the video frames> <path where the downloaded youtube video should be stored>
```

The last two arguments are optional.
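A minimal sketch of how such a `run_workflow.py` entry point could be wired up; the stage names in the comments (`download_video`, `extract_frames`, and so on) are hypothetical placeholders, not the repository's actual functions:

```python
import sys

def parse_args(argv):
    """<youtube_link> [frames_dir] [video_dir] -- the last two are optional."""
    if not 1 <= len(argv) <= 3:
        raise SystemExit(
            "usage: python run_workflow.py <youtube_link> [frames_dir] [video_dir]"
        )
    link = argv[0]
    frames_dir = argv[1] if len(argv) > 1 else "frames"      # assumed default
    video_dir = argv[2] if len(argv) > 2 else "downloads"    # assumed default
    return link, frames_dir, video_dir

def main(argv):
    link, frames_dir, video_dir = parse_args(argv)
    # Each stage mirrors the manual steps described above:
    # download_video(link, video_dir)        # yt-dlp
    # extract_frames(video_dir, frames_dir)  # ffmpeg -> 1 fps frames
    # probs = run_model(frames_dir)          # MobileNetV2 per-frame scores
    # write_keep_and_cut_videos(probs)       # smoothing + threshold + padding
    return link, frames_dir, video_dir

if __name__ == "__main__":
    main(sys.argv[1:])
```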

1 Introduction

My sister, a water polo player, inspired me to help her analyze her water polo videos. My idea was to turn the tedious task of trimming raw videos into an autonomous process.

2 Related Works

My objective of trimming unnecessary moments from water polo videos closely resembles the action recognition problem, so while scouring the internet for helpful related work, I mainly looked for models aimed at action recognition. I ended up finding three main action recognition methods [1] rooted in deep neural networks: Two-Stream networks, 3D convolutional networks, and Transformer-based methods.

Two-Stream networks separate the video into two separate streams (haha): an RGB stream that handles spatial information and an Optical Flow stream that deals with temporal information. Each stream is processed separately, usually through its own neural network, and the results are then fused before a final classifier determines the appropriate label. 3D CNNs are similar to 2D CNNs, but their convolutions also move through time (hence the third dimension). Transformer-based methods start by chopping frames into spatial-temporal patches, used as tokens, which are flattened and turned into vectors; position information is then added to help with self-attention before the action is finally classified. This method differs from the others in that transformers take all the frames at once and use self-attention to compare each patch to every other patch. Transformer-based methods perform the best [1] and are the current state of the art in computer vision. However, even briefly looking at models using any of the three methods, such as Swin-L [2] and Masked Video Distillation [3], I learned that these models require high-end GPU clusters, which I don't have access to.

3 My Methodology

Due to hardware constraints, I had to adopt a very different approach from standard action recognition methods. Instead of multi-frame input models, I settled on a single-frame approach using MobileNetV2, trading a heavy loss of temporal information for efficiency. To remedy this, I focused mainly on postprocessing, including temporal smoothing, filtering, and padding. The following image shows the basic workflow of how my model makes its decisions.

[Figure: model decision workflow]

4 Data Preprocessing

The training videos were raw recordings captured by another parent during the Stanford 14U games. I then went through 118 videos and labeled by hand the seconds that I deemed should be cut. Examples of situations I labeled for removal were the referee blocking the view of the game, timeouts, the time before the first sprint, the pause between quarters, and the camera panning to the crowd or the scoreboard.

I rescaled the original 4K 60 fps videos to 480x270 at 1 fps (frame per second) using ffmpeg's scale feature. I also manually recorded, for every training video, the start and end timestamps in seconds in a JSON file. The extracted frames are then renamed so that their names end with 1 for frames that should be removed or 0 for frames that should be kept.
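The rescale step can be done in one ffmpeg invocation with the `scale` and `fps` filters; the exact flags used in the project aren't shown, so treat this as a sketch:

```python
import subprocess

def rescale_cmd(src, dst, width=480, height=270, fps=1):
    # Downscale to 480x270 and drop the frame rate to 1 fps in one pass.
    return [
        "ffmpeg", "-i", src,
        "-vf", f"scale={width}:{height},fps={fps}",
        dst,
    ]

# Example invocation (commented out so the sketch has no side effects):
# subprocess.run(rescale_cmd("game.mp4", "game_small.mp4"), check=True)
```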

I then randomly chose 70% of the images to form the training set and the remaining 30% to form the validation set.
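The 70/30 split can be sketched with the standard library (a plain seeded shuffle; the notebook's actual split code may differ):

```python
import random

def split_dataset(paths, train_frac=0.7, seed=0):
    # Shuffle a copy so the caller's list is untouched, then slice 70/30.
    rng = random.Random(seed)
    shuffled = list(paths)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```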

One issue I noticed during this process was a disproportionate amount of keep data compared to no-keep data, so I had to resample the no-keep data more frequently to create a more balanced dataset for training.

5 Training and Validation Results

Some data augmentation functions I used included random resized cropping, horizontal flips, random rotations of up to 15 degrees, and color jitter (brightness ±20%, contrast ±20%, saturation ±20%). Because I did not have enough negative data, I counted all the labels and weighted each sample by the inverse of its class count (more can be found in my source code, mobilenet_training.ipynb). This ensures that the rarer a class, the more often it is resampled, hopefully balancing out the large gap in the amount of data between classes. I have around 20k keep images and 7k no-keep images, so the no-keep images are resampled roughly 3 times more often than the keep images.
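The inverse-class-count weighting can be sketched as follows; in the notebook these weights would feed a weighted sampler, which is omitted here:

```python
from collections import Counter

def sample_weights(labels):
    # Weight each sample by 1 / (its class's frequency):
    # the rarer class gets proportionally larger sampling weight.
    counts = Counter(labels)  # e.g. {0: 20000, 1: 7000}
    return [1.0 / counts[y] for y in labels]

# With ~20k keep (label 0) and ~7k no-keep (label 1) frames, each no-keep
# frame's weight is (1/7000) / (1/20000) ~= 2.9x a keep frame's weight.
```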

A few notes: my validation testing did not use thresholds, and I only trained for 10 epochs. TensorBoard logs with training and validation results will come with future experimentation.

6 Post Processing Method

Overview: First, I performed temporal smoothing on the predictions, because whether a frame is keep or no-keep depends on the context of its surrounding frames. To perform this smoothing, I created two functions: moving_average and smooth_and_pad. moving_average averages over an odd window of frames with the middle one being the current probability (±(n frames)/2; I used n = 21 frame windows). np.convolve then moves across the array, filtering and averaging the probabilities. smooth_and_pad calls moving_average, then builds one final boolean array that says whether the frame corresponding to each index should be kept in the final video.

Next, I applied a threshold to the smoothed scores, acting as a rudimentary keep/no-keep mask. However, I noticed that relying on the threshold alone would cut every short (3-5 second) stretch of no-keep data from the video. This wouldn't make sense for a real viewer, as constantly cutting the video would ruin the whole experience. So I added padding of n frames to both the start and end of each keep range (I used 5 frames of padding). This ensures that very short no-keep gaps aren't cut out and won't impact the viewing quality.
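The two functions above can be reconstructed as follows; the names match the writeup, but the bodies are my own sketch, assuming the score array holds per-frame keep probabilities:

```python
import numpy as np

def moving_average(probs, window=21):
    # Centered average over an odd window (current frame ± window//2).
    kernel = np.ones(window) / window
    return np.convolve(probs, kernel, mode="same")

def smooth_and_pad(probs, window=21, threshold=0.3, pad=5):
    smoothed = moving_average(np.asarray(probs, dtype=float), window)
    keep = smoothed >= threshold           # rudimentary keep/no-keep mask
    padded = keep.copy()
    for i in np.flatnonzero(keep):         # widen each keep run by `pad` frames
        padded[max(0, i - pad): i + pad + 1] = True
    return padded                          # True = keep this frame
```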

Details: Using validation data, I settled on a threshold of 0.3 on the temporally filtered scores (21-frame window) and padding of 5 frames (5 seconds at 1 fps).

Padding around the action peaks. Raw scores:

[Figure: raw per-frame scores]

Smoothed and Padded Scores:

[Figure: smoothed and padded scores]

I shaded in all the areas/frames that were deemed to be kept due to the padding method.

New Keep Compilation:

[Figure: new keep compilation]

New No Keep Compilation:

[Figure: new no-keep compilation]

I tested the model on around 10 other, longer videos, with mostly successful results.

7 Hyperparameter Sweeping Results

In the Works

8 Failure Cases

Stanford v Patriot B Success Case:

[Figure: Stanford v Patriot B success case]

Stanford 14U v Patriot B No Keep Video

Thoughts: There is a slight padding issue where one or two frames between shots, such as five meters or turnarounds, are not chosen to be kept. In one five meter, the only shot kept is the one where the ball is in the cage, which makes it a little iffy but still good enough. On the other hand, the waiting period before the actual five meter shots was always chosen for deletion, which is exactly what the model should be doing. For the keep compilation, the beginning has a couple of seconds where the video shows the girls warming up and shooting into a cage; keeping these is not really the model's fault even though they are false positives. There are also a few seconds where the quarter has ended and the players are swimming back to the edge of the pool, but as before, keeping these frames isn't a big deal.

Stanford v Thunder Failure Case:

[Figure: Stanford v Thunder failure case]

Stanford 14U v Thunder: Keep Comp

Thoughts: The no-keep compilation correctly captured the beginning, where all the players are listening to the coach by the edge of the pool, but it also pulled in pretty much the rest of the video. It seems that only with good lighting is the model able to reliably determine whether a part of the video should be kept. To be fair, I did not include these kinds of harsh lighting cases in the training process. The model does not seem able to analyze the footage properly when harsh lighting obstructs much of the players' positions and movement.

[Figure]

Sun Glare Example

[Figure]

Example where the referee stays in front of the camera for a couple of frames.

9 Next Steps

There are a couple of steps I can take in the future to improve the adaptability and predictive power of the model. The first is to collect more data involving fog, sun glare, and referee interference with the camera. The second is to improve the augmentation of negative data. The most noticeable case that would benefit from this would be using image segmentation methods [4] to composite images of referee interference into other videos, creating a synthetic referee. Another useful strategy, particularly for the Stanford v Thunder videos, is to add synthetic sun glare to random frames.

Finally, I could use inter-frame feature fusion. Instead of processing each frame entirely independently, I would pass multiple consecutive frames through the same MobileNet backbone separately, then extract each frame's final feature maps from MobileNet's last convolutional or pooling layer and combine them to capture correlations over time (for example, with a 1D temporal convolution). This would let the model integrate short-term temporal context, something the current single-frame approach desperately needs.
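The fusion idea can be sketched in NumPy with stand-in features; a real implementation would operate on actual MobileNet feature vectors inside the training framework:

```python
import numpy as np

def temporal_conv1d(features, kernel):
    # features: [T, C] array of per-frame pooled feature vectors.
    # kernel: [K] temporal weights mixing each channel across neighbouring frames.
    T, C = features.shape
    fused = np.empty_like(features)
    for c in range(C):  # convolve each channel independently over time
        fused[:, c] = np.convolve(features[:, c], kernel, mode="same")
    return fused

# Stand-in for 8 frames of 4-dim pooled MobileNet features:
feats = np.random.rand(8, 4)
fused = temporal_conv1d(feats, np.array([0.25, 0.5, 0.25]))  # blur over ±1 frame
```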

References

[1] Zixuan Tang, Youjun Zhao, Yuhang Wen, Mengyuan Liu. A survey on backbones for deep video action recognition. arXiv:2405.05584. 2024

[2] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, Han Hu. Video swin transformer. arXiv:2106.13230. 2021

[3] Rui Wang, Dongdong Chen, Zuxuan Wu, Yinpeng Chen, Xiyang Dai, Mengchen Liu, Lu Yuan, Yu-Gang Jiang. Masked video distillation: rethinking masked feature modeling for self-supervised video representation learning. arXiv:2212.04500. 2023

[4] C. Zhang, K. Zou, Y. Pan. A method of apple image segmentation based on color-texture fusion feature and machine learning. Agronomy, 10, 972. 2020
