Lucia Rapanova Woollett - Data Engineer and ML Enthusiast

Ball Detection in Sports Videos

21 Jul 2023

Python PyTorch YOLO models Transformers CNNs Deep Learning Object Detection

Team Project

In 2023, I started my Master’s degree, and one of the “flagship” classes was a two-semester class called “Team Project”, which tries to simulate the real work of a development team. A team of 5-7 students develops a complex project, including collecting requirements from the supervisor (usually a teacher, but often someone from a company or an NGO). So, we got together, selected a topic, and it was time to dive into it.

Ball Detection

In recent years, technology has become increasingly common in sports, with ball-tracking systems being widely used in professional matches. One important use is analyzing game strategies and evaluating player performance based on ball trajectory data. However, this is particularly hard to achieve, as the ball in an image can be small, blurry, smeared into an afterimage trail, and sometimes not visible at all.

This ball detection topic really caught our interest, but we had some doubts, since none of us had any experience with object detection (or neural networks) yet. But of course, we love challenges, and what could go wrong, right?

This project was made in cooperation with Sportradar, which is a leading sports technology company focused on many areas in sport, combining data and technology. This was a fantastic opportunity for us. We were also assigned a faculty supervisor - he is a fantastic teacher (and very patient).

First Semester

A large portion of the first semester was spent learning about the topic, setting up our workspace and tech stack, trying to understand the current state of object detection, and learning the fundamentals of neural networks. We were also expected to use Jira and Confluence, run sprints and stand-up meetings, and we even played planning poker. Frankly, though, this project does not seem to be an ideal example of an agile project. Personally, I would choose a different approach, as the principles of agile and scrum fit classic software development better, or at least adapt them to a machine learning project in a different way. AI projects have a unique life cycle, and we felt like the agile methodology was “forced on us” in a way.

Another challenge was actually working as a team. We are still friends to this day, don’t worry, that wasn’t the problem. But how exactly do you divide such a project between people? When working on a software product, you would probably assign some people to the backend, some to the frontend, and so on. But here? We decided that each of us would test different models and then discuss the results with the rest of the team, after which we would pick the best models, continue training, and so on. We had a couple of machines available for training at the University, but they were shared with other students, so we also had to plan around that.

Our lovely computers working hard

Object Detection Models

Deep learning methods use multiple layers to progressively extract higher-level features from the input data, which has improved learning performance across a wide range of applications. Convolutional Neural Networks (CNNs) are a type of deep neural network that uses a series of convolutional layers to automatically learn hierarchical representations of the input data, starting with simple features such as edges and lines and building up to more complex features such as shapes and patterns.

One such model is YOLO - You Only Look Once. YOLO is a neural network for object detection that divides an input image into a grid and predicts bounding boxes and class probabilities for each grid cell in a single forward pass. It has undergone several iterations and improvements, becoming one of the most popular object detection frameworks. We tested several YOLO versions (5, 7, and 8), but of course many new versions have been released since then (9, 10, 11), and the latest version as of 5/26/2026 is YOLO26 (they jumped from 11 to 26, breaking the tradition).
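To make the grid idea a bit more concrete, here is a minimal sketch of how a ball centre maps onto a grid cell, in the spirit of the original YOLO formulation. The function name `responsible_cell` and the 20 x 20 grid are assumptions made up for this illustration; newer YOLO versions predict at several scales and parameterise boxes differently, so treat this as a picture of the concept rather than the code of any specific version.

```python
def responsible_cell(cx, cy, img_w, img_h, grid_w, grid_h):
    """Return (col, row, x_offset, y_offset) for a point given in pixels.

    The offsets are the position of the point inside its cell, each in [0, 1).
    """
    if not (0 <= cx < img_w and 0 <= cy < img_h):
        raise ValueError("point lies outside the image")
    gx = cx / img_w * grid_w  # horizontal position measured in cell units
    gy = cy / img_h * grid_h  # vertical position measured in cell units
    col, row = int(gx), int(gy)
    return col, row, gx - col, gy - row


# A ball in the middle of a 1280 x 720 frame, with a hypothetical 20 x 20 grid.
print(responsible_cell(640, 360, 1280, 720, 20, 20))  # (10, 10, 0.0, 0.0)
```

The cell that contains the centre is the one expected to predict the object, and the offsets are what the network regresses inside that cell.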

In 2017, another popular neural network architecture was introduced - the Transformer, followed in 2020 by the Vision Transformer (ViT). ViT showed that a CNN is not necessary and that a pure transformer applied directly to sequences of image patches can perform very well on computer vision tasks. Image patches are treated like words in a sentence - they are fed into a Transformer encoder to build a global understanding of the image. Based on these findings, we focused on transformer-based detectors, such as YOLOS (which builds on ViT) and Deformable DETR.
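The "patches as words" idea can be sketched in a few lines of plain Python. This is only an illustration: the helper names `patch_grid` and `to_patches` are made up, the 16 x 16 patch size is the one from the ViT paper's title, and real models resize the input and add a learned projection and position embeddings on top of this step.

```python
def patch_grid(img_w, img_h, patch):
    """Return (columns, rows, total tokens) for a given square patch size."""
    if img_w % patch or img_h % patch:
        raise ValueError("image size must be divisible by the patch size")
    cols, rows = img_w // patch, img_h // patch
    return cols, rows, cols * rows


def to_patches(image, patch):
    """Split a 2D image (a list of rows) into flattened patches, row by row."""
    height, width = len(image), len(image[0])
    cols, rows, _ = patch_grid(width, height, patch)
    patches = []
    for r in range(rows):
        for c in range(cols):
            flat = [image[r * patch + i][c * patch + j]
                    for i in range(patch) for j in range(patch)]
            patches.append(flat)
    return patches


# A 1280 x 720 frame cut into 16 x 16 patches gives the "sentence" length.
print(patch_grid(1280, 720, 16))  # (80, 45, 3600)

# A toy 4 x 4 image with pixel values 0..15, cut into 2 x 2 patches.
toy = [[row * 4 + col for col in range(4)] for row in range(4)]
print(to_patches(toy, 2)[0])  # [0, 1, 4, 5]
```

Each flattened patch plays the role of one token, so a full-resolution frame would already be a sequence of 3600 tokens, which hints at why these models are hungry for data and compute.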

The final model we tried was TrackNet, a deep learning network created specifically for tracking balls in broadcast videos. It has two key ideas. First, the model uses temporal information by processing multiple adjacent video frames at once instead of relying on a single frame. This enables the model to detect the ball even when it temporarily disappears from a single frame, which can occur due to camera artifacts. Second, the model generates a detection heatmap. Ideally, the heatmap is a Gaussian distribution centered on the ball, and it is used to indicate the ball’s position.
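As a rough illustration of the heatmap idea, the sketch below builds an ideal Gaussian target around a ball position and reads the position back as the location of the maximum. The function names, the tiny 32 x 18 map and the `sigma` value are assumptions chosen for the example; TrackNet’s actual target generation and post-processing are more involved.

```python
import math


def gaussian_heatmap(width, height, cx, cy, sigma):
    """Build a height x width heatmap with a Gaussian peak at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]


def heatmap_peak(heatmap):
    """Return (x, y) of the largest value, taken as the ball position."""
    _, x, y = max((value, x, y)
                  for y, row in enumerate(heatmap)
                  for x, value in enumerate(row))
    return x, y


# A small map with the ball at x=20, y=7 (hypothetical values).
target = gaussian_heatmap(32, 18, cx=20, cy=7, sigma=2.0)
print(heatmap_peak(target))  # (20, 7)
```

In practice the network outputs a noisy version of such a map, so a confidence threshold on the peak value is usually needed to decide whether a ball is present at all.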

Most of the models we selected were pre-trained on datasets consisting of various types of objects, such as ImageNet or COCO. To fine-tune the selected models, we used a custom dataset.

Dataset

The dataset was created by Sportradar. They provided us with 135 annotated videos (for private use during the research). Each video has a resolution of 1280 x 720, the videos vary in length (30-150 seconds), and together they consist of more than 230 000 annotated frames. Each frame may contain one or more tennis balls and one or more tennis players. Ball coordinates are provided in a format describing only the center of the ball as x and y positions, while players’ annotations consist of a full bounding box.
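Since the ball annotations are only centre points while YOLO-style detectors train on boxes, some conversion is needed. The sketch below shows one possible way to do it, not necessarily what was done in this project: it assumes a fixed, hypothetical box size (`box_px=16`) around each centre and writes the common YOLO label format of class id followed by normalised centre, width and height.

```python
def center_to_yolo_box(cx, cy, img_w=1280, img_h=720, box_px=16, class_id=0):
    """Turn a ball centre in pixels into a YOLO-style label line.

    box_px is an assumed fixed box side in pixels, not a value from the dataset.
    """
    half = box_px / 2
    # Clip the box to the image so that balls near the border stay valid.
    x1, y1 = max(0.0, cx - half), max(0.0, cy - half)
    x2, y2 = min(float(img_w), cx + half), min(float(img_h), cy + half)
    # YOLO labels store the normalised box centre, width and height.
    xc, yc = (x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h
    w, h = (x2 - x1) / img_w, (y2 - y1) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


print(center_to_yolo_box(640, 360))  # 0 0.500000 0.500000 0.012500 0.022222
```

A fixed size is a simplification, because the ball appears larger or smaller depending on its distance from the camera; a size that depends on the position in the frame would be a natural refinement.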

We extracted frames from the videos and split them into subsets containing 92 train, 20 validation, and 23 test videos. Experts manually selected each subset to ensure a balance of court surfaces and players involved. We focused on tennis balls during rallies and those considered in use.

Split	Videos	Images
Train	92	83 599
Validation	20	17 877
Test	23	22 859

Results

There would be a lot to discuss here - how many epochs we trained for, which optimizer we used, what the learning rate was… I don’t believe these details are the most important, but I will share the final results. Overall, transformer-based models didn’t perform very well (probably because they need a larger dataset). TrackNet’s approach showed promising results, but its postprocessing was slow. The best-performing model was YOLOv5m6, which was pre-trained on higher-resolution images (1280 × 1280), making it better suited to our dataset. Models such as YOLOv8 Nano or YOLOv5 Small have shown potential for being effective with a minimal number of parameters, reaching 0.88 recall and 0.83 precision, with an average inference time of 14 ms.
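For readers less familiar with the metrics, here is a small sketch of how precision, recall and their harmonic mean (F1) follow from detection counts. The counts in the example are invented so that they land near the reported 0.83 precision and 0.88 recall; they are not the project’s real numbers.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positives and misses."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 88 balls found, 18 false alarms, 12 balls missed.
p, r, f1 = precision_recall_f1(tp=88, fp=18, fn=12)
print(round(p, 2), round(r, 2), round(f1, 3))  # 0.83 0.88 0.854
```

In other words, roughly one in six detections is a false alarm and roughly one in eight visible balls is missed, which combines into an F1 of about 0.85.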

After this, we basically kept tuning the hyperparameters. We received some more annotations, and we had a couple more ideas to test to improve the results. One was to feed multiple adjacent video frames to the model at once instead of a single frame, which helps the model detect the ball even if it disappears in a single frame. Another was grayscale difference stacking: since the camera is static, the differences between consecutive frames highlight what has changed. We also created an app where users can input any video and see the model’s output. The final results can be seen on the poster:

Our poster from IIT.SRC conference
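The grayscale difference stacking idea mentioned above can be sketched as follows. This is a toy version under simple assumptions (frames as nested lists of grayscale values, absolute differences between consecutive frames, made-up helper names); it does not reproduce the preprocessing we actually used.

```python
def frame_difference(prev, curr):
    """Absolute per-pixel difference of two grayscale frames (lists of rows)."""
    return [[abs(c - p) for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]


def difference_stack(frames):
    """Stack differences of consecutive frames: N frames give N - 1 channels."""
    if len(frames) < 2:
        raise ValueError("need at least two frames")
    return [frame_difference(a, b) for a, b in zip(frames, frames[1:])]


def make_frame(ball_x, width=4, height=3, background=50, ball=255):
    """Toy frame: uniform static background with one bright 'ball' pixel."""
    frame = [[background] * width for _ in range(height)]
    frame[1][ball_x] = ball
    return frame


# The ball moves one pixel to the right in each frame.
frames = [make_frame(0), make_frame(1), make_frame(2)]
for diff in difference_stack(frames):
    print(diff[1])
# [205, 205, 0, 0]
# [0, 205, 205, 0]
```

The static background cancels out to zero, and only the pixels the ball has left or entered remain, which is exactly the signal a detector needs for a small, fast object.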

With the project wrapped up, it was time to show it to the world.

Presenting

The Team Project class culminates in a competition called Team Project Cup (TP Cup), where the teams present their projects in front of a jury, which picks the top 3 projects; these also receive a cash prize. We also attended the IIT.SRC conference, where we had a chance to meet the jury for the first time and to discuss the solution with other students and anyone else who attended. Probably the most common question we received was whether the solution could be used for other sports. I mean… we hadn’t tested it initially, but later we tried table tennis, badminton and soccer (football), and it actually performed pretty well! (But to be honest, it would probably detect anything that looked like a small yellow ball at that point.)

Presenting was definitely a big challenge. The teams were very strong and the jury had a lot of good points. It was very balanced and truly felt like anyone could be the winner. The suspense was kept up until the end, but I am happy to say that we won second place with our ball detection project!

I also want to take a moment to thank everyone for their work: our fantastic team, our supervisor and everyone from Sportradar. It was a struggle, but also a lot of fun (I really went down memory lane while writing this post).

And one of the best moments was when we filmed ourselves playing tennis in the University’s parking lot and ran inference on the footage. It worked impressively well, which was the best proof of our hard work.