The results in the table below are the most comprehensive benchmarks we have released to date on PRCV’s people tracking capabilities, compared against other published results. Overall, the system outperforms the other published trackers listed here on most metrics, and it is a general solution. These results do not cover our other capabilities, such as gaze tracking and behavior recognition. Bolded scores are the best in their categories. The meaning of each metric is explained below.

| Tracker | MOTA (%) | MOTP | MT (%) | PT (%) | ML (%) | #IDSW | #FRAG | Notes |
|---|---|---|---|---|---|---|---|---|
| multiview | 47.44 | 0.166 m | 51.22 | 31.71 | 17.07 | 10 | 78 | Lower resolution images. Fast runtime. Real-world data. |
| SORT | 33.4 | 72.1 | 11.7 | 57.4 | 30.9 | 1001 | 1664 | High resolution images. 2D projective. MOT database |
| Xu 2016 | 30.6 | 72.02 | 23.19 | 65.79 | 11.01 | 299 | 200 | High resolution images. 2D projective. CAMPUS |
| Xu 2017 | 35.64 | 72.38 | 25.89 | 61.59 | 12.52 | 265 | 219 | High resolution images. 2D projective. CAMPUS |
| Tang 2017 | 46.8 | 79 | 18.2 | 41.7 | 40.1 | 481 | 595 | MOT16 test set |

Table 3: Comparison of Perceive to other top tracking algorithms. The metrics are:

  • MOTA: Accuracy. MOTA later increased as a localization issue was addressed. Higher scores are better.
  • MOTP: Precision. On average the tracker is about 17 cm (0.166 m) off when tracking someone. Not directly comparable to the pixel-based scores that other systems report. Lower scores are better.
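  • MT: Mostly tracked tracks. Percentage of ground-truth trajectories that are at least 80% covered by the tracker. Higher better.
  • PT: Partly tracked tracks. The remainder: PT = 100% - MT - ML.
  • ML: Mostly lost tracks. Percentage of ground-truth trajectories that are at most 20% covered by the tracker. Lower better.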
  • IDSW: Identity switches. The total number of times a tracked trajectory switches its ground-truth identity, for example when two people cross in the video and the tracker swaps their tracks. Lower better.
  • FRAG: Track fragments. A person is being tracked for a bit, then the track is lost, and then re-acquired, resulting in a "fragmentation" of the track. Lower better.
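
The MT, PT, and ML percentages defined above can be sketched in a few lines. This is a minimal illustration, not the evaluation code behind Table 3. The function name `classify_tracks` and its inputs are hypothetical, and the 20% cutoff for "mostly lost" is assumed from the usual convention for this metric; only the 80% cutoff for "mostly tracked" is stated above.

```python
def classify_tracks(coverage, mt_threshold=0.8, ml_threshold=0.2):
    """Return (MT, PT, ML) as percentages of ground-truth trajectories.

    `coverage` holds one fraction per ground-truth trajectory: the share
    of its frames in which the tracker covered it (hypothetical input).
    The 0.2 "mostly lost" cutoff is an assumed convention.
    """
    if not coverage:
        raise ValueError("need at least one ground-truth trajectory")
    total = len(coverage)
    mostly_tracked = sum(1 for c in coverage if c >= mt_threshold)
    mostly_lost = sum(1 for c in coverage if c <= ml_threshold)
    mt = 100.0 * mostly_tracked / total
    ml = 100.0 * mostly_lost / total
    pt = 100.0 - mt - ml  # partly tracked is whatever remains
    return mt, pt, ml


# Four made-up trajectories: two well covered, one partial, one mostly lost.
mt, pt, ml = classify_tracks([0.95, 0.85, 0.5, 0.1])
print(mt, pt, ml)  # 50.0 25.0 25.0
```

Because PT is computed as the remainder, the three percentages always sum to 100, which matches every row of the table.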

Measuring precision and recall in the classic sense is more complicated with a tracker. The following four metrics explain how this is measured in more detail.

MOTP: Multiple object tracking precision. The ability to measure precise object positions: the total error in track-point positions divided by the total number of matches. In meters.
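
As a rough sketch of the MOTP calculation described above (total position error divided by the total number of matches), assuming the matching between tracker output and ground truth has already been done. The function name `motp` and the per-frame list layout are hypothetical, and the distances are made-up values.

```python
def motp(distances_per_frame):
    """Mean position error over all matched track points, in meters.

    `distances_per_frame` is a list with one entry per frame; each entry
    lists the distance (in meters) between every matched tracker point
    and its ground-truth point (hypothetical input layout).
    """
    total_error = 0.0
    total_matches = 0
    for frame_distances in distances_per_frame:
        total_error += sum(frame_distances)
        total_matches += len(frame_distances)
    if total_matches == 0:
        raise ValueError("no matches to average over")
    return total_error / total_matches


# Three frames; the last one has no matches and contributes nothing.
score = motp([[0.10, 0.20], [0.15], []])
print(round(score, 3))
```

Frames without matches do not affect the score; they show up in MOTA as misses instead.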

MOTA: Multiple object tracking accuracy. The first term is the ratio of misses (false negatives) computed over every track. The second term is the ratio of false positives. The third term is the ratio of mismatches (i.e., ID switches, see below). These three terms together give the total error rate, and MOTA is one minus that rate.
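
The three error terms can be combined as in the sketch below. It assumes the standard convention that each ratio is taken over the total number of ground-truth objects summed across all frames, and that the score is one minus the total error rate. The function name `mota` and the counts in the example are hypothetical.

```python
def mota(misses, false_positives, mismatches, num_ground_truth):
    """One minus the total error rate.

    All arguments are counts summed over every frame. `num_ground_truth`
    is the total number of ground-truth objects across all frames
    (assumed normalizer for all three ratios).
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    miss_ratio = misses / num_ground_truth
    false_positive_ratio = false_positives / num_ground_truth
    mismatch_ratio = mismatches / num_ground_truth
    return 1.0 - (miss_ratio + false_positive_ratio + mismatch_ratio)


# Made-up counts: 30 misses, 15 false positives, 5 ID switches, 100 objects.
print(round(100 * mota(30, 15, 5, 100), 2))  # MOTA as a percentage
```

Since false positives are unbounded, the error rate can exceed 1 and the score can go negative for a very noisy tracker.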

MODP: Multiple object detection precision. The average IoU (i.e., intersection area divided by union area) over every tracked point. A consistent bias of about 20 cm crops up, presumably because of an integer math bug in the system; fixing this bug should improve the MODP results. Note also that MODP is much more challenging to compute in a 3D environment, so comparing our MODP results to those in most papers is comparing apples to oranges.
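
For readers unfamiliar with IoU, the sketch below computes it for axis-aligned 2D boxes and averages it over matched pairs. This is the simple image-plane case used in most papers, not the 3D calculation discussed above; the function names `iou` and `modp`, the `(x1, y1, x2, y2)` box layout, and the example boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection area divided by union area for two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2 (assumed layout).
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped to zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def modp(matched_pairs):
    """Average IoU over (tracker_box, ground_truth_box) pairs."""
    if not matched_pairs:
        raise ValueError("no matched pairs to average over")
    return sum(iou(a, b) for a, b in matched_pairs) / len(matched_pairs)


# Two 2x2 boxes offset by one unit share a 1x2 strip: IoU = 2 / 6.
print(round(iou((0, 0, 2, 2), (1, 0, 3, 2)), 3))
```

A constant positional offset lowers the IoU of every pair, which is why a systematic bias like the one noted above drags the average down.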

MODA: Multiple object detection accuracy. Same as MOTA, but mismatches (ID switches) are not counted.

These results show that our long-sought goal of building a tracker that works on almost any kind of input video, with many different tracker configurations, has been achieved. These metrics form the basis of the argument that the software solution from PRCV is superior even to some hardware-assisted solutions on the market today.