Automating Camera Motions

Detecting Regions of Interest in Dynamic Scenes with Camera Motions

(paper) (video)


Kihwan Kim (NVIDIA),

Dongryeol Lee (Georgia Tech),

Irfan Essa (Georgia Tech)

This paper details methods that researchers from NVIDIA and Georgia Tech have developed for automating dynamic camera motions. Most motion fields are generated from static multi-view videos for precision, but this paper presents a method that uses a single moving view to generate a stochastic motion field using Gaussian process regression (their words). In layman's terms, what they're doing is generating a vector field by tracking motion events and finding points of convergence. Vector fields are naturally the way to go here, since finding points of convergence is as simple as taking a derivative (the divergence of the field); generating the motion field from reliable data, dynamically, is the feat worth mentioning (if, of course, that was not obvious). Kim et al. couple their motion tracking with fantastic PTZ (pan-tilt-zoom) control. The contributions of the paper are (1) methods for generating a 'stochastic' motion field representing motion tendencies, (2) predicting important future locations, and (3) an evaluation for measuring the quality of those predictions.
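To make the "taking a derivative" remark concrete, here is a minimal sketch (my own toy example, not the paper's implementation): given a 2-D motion field, the divergence tells you where vectors converge, and the strongest sink marks a candidate point of interest. The synthetic field below is an assumption for illustration; all vectors point toward the origin with varying magnitude.

```python
import numpy as np

# Synthetic motion field on a grid: vectors aimed at the origin,
# with magnitude falling off away from it (purely illustrative).
ys, xs = np.mgrid[-2:2:41j, -2:2:41j]
falloff = np.exp(-(xs**2 + ys**2))
vx, vy = -xs * falloff, -ys * falloff

# Finite-difference divergence: d(vx)/dx + d(vy)/dy.
div = np.gradient(vx, axis=1) + np.gradient(vy, axis=0)

# The most negative divergence (strongest sink) is where motion converges.
iy, ix = np.unravel_index(np.argmin(div), div.shape)
print(xs[iy, ix], ys[iy, ix])  # the convergence point of this field
```

Here the sink lands at the origin by construction; on real tracked motion the same argmin-of-divergence idea would flag where players or objects are flowing together.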

The generation of motion fields is not unexplored territory, and the paper goes into some detail comparing Radial Basis Functions (RBF) against Gaussian Process Regression (GPR); ultimately GPR wins out. Essentially, the motion field maps each point x (position and time) to 'noisy' observed velocity components in two orthogonal directions, and GPR is applied, with some fancy statistics, to smooth and interpolate those observations. After generating the motion field, locating points of interest is another application of math. They use these methods to identify the regions a well-versed cameraman would track, and adjust the camera accordingly.
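A stripped-down sketch of the GPR step, under my own assumptions (a 1-D position, one velocity component, an RBF kernel with a made-up length scale — not the paper's exact model): the predictive mean smooths the noisy velocities, and the predictive variance is the per-point confidence the paper later exploits.

```python
import numpy as np

def rbf(a, b, ell=0.5):
    """Squared-exponential kernel between two 1-D point sets."""
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

rng = np.random.default_rng(0)
x_train = np.linspace(0, 3, 15)                      # observation positions
v_train = np.sin(x_train) + 0.05 * rng.standard_normal(15)  # noisy velocities

noise = 0.05**2
K = rbf(x_train, x_train) + noise * np.eye(15)       # kernel + noise

x_test = np.array([1.5])
k_star = rbf(x_test, x_train)

# GPR predictive mean (smoothed velocity) and variance (confidence).
mean = k_star @ np.linalg.solve(K, v_train)
var = rbf(x_test, x_test) - k_star @ np.linalg.solve(K, k_star.T)
```

The mean recovers the underlying velocity despite the noise, and the variance stays small inside the data; far outside the data it would grow toward the prior, which is exactly what flags unreliable extrapolations.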

The prediction of future locations pretty much falls out of the modeling of the motion vector field. The only concern would be computational expense, but that seems worth it. Compared to the previously touted RBF method, both require inverting an n-by-n kernel matrix (O(n^3)); however, the calculation of GPR confidence coefficients clocks in at only O(n^2). The key difference is that the RBF method must update all vectors continuously, whereas the GPR method updates only the final destination, excluding the extrapolated vectors that have a low confidence value.
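The cost argument can be sketched as follows (again my own toy, with assumed points and kernel): the n-by-n kernel inversion is the one-time O(n^3) hit, after which each confidence query reuses the cached inverse and costs only O(n^2) — and queries far from the data come back with high variance, i.e. low confidence, so they can be excluded.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
pts = rng.random((n, 2))                      # observed (x, y) positions

def rbf(a, b, ell=0.3):
    """Squared-exponential kernel between two 2-D point sets."""
    d2 = ((a[:, None, :] - b[None, :, :])**2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

# One-time O(n^3): invert the regularized kernel matrix.
K_inv = np.linalg.inv(rbf(pts, pts) + 1e-4 * np.eye(n))

def pred_var(q):
    """Per-query O(n^2) predictive variance; high variance = low confidence."""
    k = rbf(q[None, :], pts)                  # 1 x n kernel row
    return float(1.0 - k @ K_inv @ k.T)

inside  = pred_var(np.array([0.5, 0.5]))     # interpolation, confident
outside = pred_var(np.array([5.0, 5.0]))     # extrapolation, not confident
print(inside < outside)
```

Thresholding `pred_var` is the mechanism by which low-confidence extrapolated vectors get dropped while the final destination estimate is kept.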

They test the worthiness of their method by taking an actual sports broadcast, conforming the image to a plane, and then seeing how well the cameraman's panning matched the computer's panning. The result isn't bad at all; in fact it's actually kind of good, which is really worth noting. All in all, this paper is interesting for its application of significant math and its contribution to computer vision. It's not very hard to let your imagination run wild with possible applications of this sort of technology.
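The spirit of that comparison can be captured in a few lines (with entirely made-up pan angles; the paper's actual metric and numbers are not reproduced here): log both pan trajectories frame by frame and score the predicted one against the human's.

```python
import numpy as np

# Hypothetical per-frame pan angles (degrees) — illustrative values only.
human_pan     = np.array([0.0, 2.0, 4.5, 7.0, 9.0])
predicted_pan = np.array([0.0, 1.8, 4.9, 6.5, 9.4])

# Mean absolute angular error between the two trajectories.
mean_abs_err = np.abs(human_pan - predicted_pan).mean()
print(round(mean_abs_err, 2))
```

A small error here would mean the automated PTZ control pans much like the broadcast operator did, which is the qualitative takeaway of the paper's evaluation.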