Lecturer Tracking

By Maximilian Hahn

Introduction

The lecturer-tracking module can be considered the core, or backbone, of an automated lecture recording system. Its primary purpose is to correctly identify and track the lecturer throughout the video and output the lecturer's coordinates. The tracking module therefore needed to be robust enough to deal with varying environments and use cases. We used OpenCV, an open-source computer vision library, for much of our processing. This library provided us with many sophisticated image processing methods, allowing us to test a wide range of approaches to a single problem with minimal effort.

Aims

CILT requires the system to run well across a wide range of lecture scenarios. We therefore devised a set of use cases that cover all realistic scenarios that would occur in a lecture theatre, as well as some edge cases. Meeting this requirement would provide strong support for the robustness of the system across all realistic scenarios. CILT also requires that lecture videos are released within 8 hours of the lecture finishing, so all video post-processing must fall within that period. This does not leave much time for our framework, as other processes and steps also form part of this 8-hour period; one example is manually cutting the video to the exact start and end of the lecture. CILT recommended that this module take at most three times the runtime of the actual video. From this we devised a research aim of processing the lecturer-tracking module in less than two times the length of the input video, as we believe our planned algorithms can meet and beat the recommended requirement.

Our research aims therefore are:

  1. Can this module run efficiently enough to be processed in less than 2 times the length of the video?
    This will fulfill the processing speed requirement, meaning lecture videos will be available to students faster without creating a backlog of unedited videos.
  2. Does our system correctly segment out all motion and decide on the correct lecturer 90% of the time for likely use cases?
    This will fulfill the main functional requirements, meaning our solution will be usable without further work for most lecture recordings.
  3. Does our system correctly segment out all motion and decide on the correct lecturer 90% of the time for unlikely (edge) use cases?
    This will fulfill the edge-case functional requirements, meaning our solution will work for a variety of unusual cases and is generally robust.

Our overall aim is to develop a fully functioning 4K lecturer-tracking solution. This includes making the solution both time-efficient and robust enough to track a lecturer precisely across a wide range of possible cases.

Component Overview

The diagram below shows a high-level overview of the tracking module. After that, we discuss each component in more detail.



Motion Detection

The movement detection and lecturer recognition module takes the lecture video as input and reads every frame of the video file during its execution. Not all frames need to be processed; to reduce processing time for this module, we implemented the ability to skip frames, which are then only read, not processed. The processed frames are passed through a background subtraction method to determine where motion occurs, and this motion is then represented as contours.
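A minimal sketch of this stage, assuming OpenCV 4's Python bindings and the MOG2 background subtractor, is shown below; the frame-skip interval and the detect_motion name are illustrative placeholders rather than the actual implementation.

    import cv2

    FRAME_SKIP = 4  # illustrative: process every 4th frame; the rest are read but not processed

    def detect_motion(video_path):
        capture = cv2.VideoCapture(video_path)
        # The background subtractor models the static scene and flags moving pixels.
        subtractor = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
        frame_index = 0
        motion_per_frame = {}

        while True:
            ok, frame = capture.read()
            if not ok:
                break
            if frame_index % FRAME_SKIP == 0:
                # The foreground mask highlights pixels that differ from the background model.
                mask = subtractor.apply(frame)
                # Motion regions are represented as contours around the foreground blobs.
                contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
                motion_per_frame[frame_index] = contours
            frame_index += 1

        capture.release()
        return motion_per_frame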

Recognition Algorithm

This stage yields rectangles that encapsulate areas of motion. First, any contour with very few points is removed, since complex shapes of interest will have many points. A bounding box is then created around each remaining contour using OpenCV's boundingRect method. Since we assume that the lecturer is roughly in the middle of the frame with respect to the y-axis, any rectangle whose top edge is greater than a maximum y threshold is deleted. For the same reason, any rectangle whose bottom edge is smaller than a minimum y threshold is eliminated. The rectangles' aspect ratios are also used to remove unrealistically broad rectangles.
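The filtering described above could look roughly like the sketch below; the threshold constants and their names are hypothetical values chosen for illustration, not the ones used in our module.

    import cv2

    MIN_CONTOUR_POINTS = 20   # contours with fewer points are treated as noise
    MAX_TOP_Y = 900           # tops greater than this (lower in the frame) are deleted
    MIN_BOTTOM_Y = 300        # bottoms smaller than this (higher in the frame) are eliminated
    MAX_ASPECT_RATIO = 3.0    # width / height above this is unrealistically broad

    def filter_motion_rectangles(contours):
        rectangles = []
        for contour in contours:
            # Discard simple contours with very few nodes.
            if len(contour) < MIN_CONTOUR_POINTS:
                continue
            x, y, w, h = cv2.boundingRect(contour)
            # Note that y grows downward in image coordinates.
            if y > MAX_TOP_Y:            # top edge too low in the frame
                continue
            if y + h < MIN_BOTTOM_Y:     # bottom edge too high in the frame
                continue
            if w / float(h) > MAX_ASPECT_RATIO:
                continue
            rectangles.append((x, y, w, h))
        return rectangles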

Ghost Tracking and Lecturer Selection

A history of these rectangles is kept in a class called “Ghost”. Once all frames have been read, these ghosts are post-processed to select a lecturer for each frame, and the ghosts' locations are shifted to represent the position of the lecturer more accurately. A lecturer is chosen from these rectangles of motion using the intersection of rectangles across frames and the amount of time a given area of the screen has contained motion; this can be thought of as similar to a heat map. This stage then sends the location of the lecturer at each processed frame to the virtual cinematographer module.
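As an illustration of this bookkeeping, the sketch below shows one possible shape for the ghost history and a heat-map-like selection heuristic; only the name “Ghost” comes from our design, while the fields, matching rule, and scoring shown here are assumptions made for the example.

    def intersection_area(a, b):
        """Overlap area of two (x, y, w, h) rectangles."""
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        dx = min(ax + aw, bx + bw) - max(ax, bx)
        dy = min(ay + ah, by + bh) - max(ay, by)
        return dx * dy if dx > 0 and dy > 0 else 0

    class Ghost:
        """History of motion rectangles believed to belong to one moving object."""

        def __init__(self, frame_index, rect):
            self.rects = {frame_index: rect}

        def matches(self, rect):
            # A new rectangle extends this ghost if it overlaps the most recent one.
            last_rect = self.rects[max(self.rects)]
            return intersection_area(last_rect, rect) > 0

        def add(self, frame_index, rect):
            self.rects[frame_index] = rect

        def score(self):
            # Heat-map-like heuristic: ghosts present in more frames score higher.
            return len(self.rects)

    def select_lecturer(ghosts):
        # Pick the ghost most likely to be the lecturer across the whole video.
        return max(ghosts, key=lambda g: g.score())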

Results and Conclusions

We evaluated our program by measuring the processing time of this module and by testing it against lecture use cases that we set up and acted out. In these tests, the module correctly identified the lecturer for most use cases, with a runtime well within our defined restrictions. The runtime of the solution averaged 1.13 times the length of the video. We do, however, highly recommend that the software be run on an SSD with a fast read speed, because read time otherwise slows processing severely.