Another difference from naturalistic driving is that the driver may not always look forward (i.e., at the F1 gaze position), as was the case in the videos used. With regard to the longer duration, changing this feature to a more realistic viewing pattern can be expected to reduce the accuracy of the observations because of greater confusion between the regions considered.
If statistical significance is not a useful guide, what magnitude of kappa reflects adequate agreement? Guidelines would be helpful, but factors other than agreement can influence the magnitude of kappa, which makes any given value difficult to interpret. As Sim and Wright noted, two important factors are prevalence (whether the codes are equiprobable or vary in probability) and bias (whether the marginal probabilities are similar or different for the two observers). Other things being equal, kappas are higher when the codes are equiprobable. On the other hand, kappas are higher when the codes are distributed asymmetrically by the two observers. In contrast to variations in probability, the effect of bias is greater when kappa is small than when it is large (pp. 261-262).
An important challenge in naturalistic driving studies is the large amount of data. For example, the SHRP2 database contains more than 4300 years of naturalistic driving data collected from approximately 3400 drivers (Hankey et al., 2016). Similarly, the UDRIVE database contains 41,000 h of data on passenger cars and more than 45,000 h of data on heavy goods vehicles (Van Nes et al., 2019). Frame-by-frame analysis of video data by human annotators is time-consuming and expensive. Even when the analysis focuses specifically on automatically detected events such as right turns or hard braking, such events are still common enough in typical naturalistic studies to require significant annotation work.
There has therefore been a drive to develop computer algorithms that annotate video images automatically, for example to predict where drivers are looking. Fridman et al. (2016) developed a machine learning algorithm that classifies gaze direction into six gaze regions based on head pose (road, center stack, instrument cluster, rearview mirror, left, right). Importantly, the algorithm was trained and validated on video data from a field study, collected with a camera placed on the dashboard and annotated by human coders. Similarly, Vora et al. (2017) trained their convolutional neural network on annotated naturalistic video data, collected with a camera mounted near the rear-view mirror. Because human annotations serve as labels, the algorithms strive to match the quality of those annotations and will therefore be, at best, as good as the human annotation. Accurate human annotation is thus indispensable for the development of such algorithms. The above approaches seem to assume implicitly that if two coders agree, their annotation is correct. Consistent with this assumption, some studies explicitly refer to annotated glance datasets as “ground truth” and then use the annotated data to train and validate automatic annotation algorithms (Tawari and Trivedi, 2014; Vora et al., 2017). Other studies seem to treat annotation as ground truth implicitly and then use the data for algorithm development (Belyusar et al., 2016; Fridman et al., 2016; Seppelt et al., 2017).
It is important to recognize that using annotations as ground truth assumes that, if two annotators agree, their annotation is correct. However, whether this is indeed the case has rarely been investigated, although some authors have raised the issue. For example, Naqvi et al. (2018) decided not to use a selection of existing visual recordings, in particular because “information on ground truth gaze position is not available” (p. 15). Therefore, they argue, it is not possible to assess the accuracy of their gaze-detection algorithm.
Cohen's kappa measures the agreement between two raters as κ = (p_o − p_e) / (1 − p_e), where p_o is the relative observed agreement between the raters (identical to accuracy) and p_e is the hypothetical probability of chance agreement. . . .
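As a minimal sketch of the definition above, the following Python snippet computes p_o, p_e and Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), for two annotators, estimating p_e from each annotator's marginal code proportions. The function names (`cohen_kappa`, `labels_from_table`), the two gaze codes ("road", "mirror") and all counts are assumptions invented for illustration; they are not data from any of the studies cited. The two tables are chosen so that observed agreement is identical while the prevalence of the codes differs.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Return (p_o, p_e, kappa) for two equally long sequences of codes."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: share of items coded identically by both annotators.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each code, multiply the two annotators'
    # marginal proportions, then sum over codes.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[code] * count_b[code] for code in count_a) / (n * n)

    if p_e == 1.0:
        raise ValueError("kappa is undefined when both annotators use one code only")
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


def labels_from_table(table):
    """Expand {(code_a, code_b): count} into two parallel label lists."""
    labels_a, labels_b = [], []
    for (code_a, code_b), count in table.items():
        labels_a.extend([code_a] * count)
        labels_b.extend([code_b] * count)
    return labels_a, labels_b


# Hypothetical counts for illustration only (not taken from any study).
# Both tables contain 100 frames with 90 agreements.
balanced = {("road", "road"): 45, ("road", "mirror"): 5,
            ("mirror", "road"): 5, ("mirror", "mirror"): 45}
skewed = {("road", "road"): 85, ("road", "mirror"): 5,
          ("mirror", "road"): 5, ("mirror", "mirror"): 5}

for name, table in (("balanced", balanced), ("skewed", skewed)):
    p_o, p_e, kappa = cohen_kappa(*labels_from_table(table))
    print(f"{name}: p_o={p_o:.2f}, p_e={p_e:.2f}, kappa={kappa:.2f}")
```

In both hypothetical tables the annotators agree on 90% of the frames, yet kappa drops from 0.80 to about 0.44 once one code dominates. This is in line with the statement that, other things being equal, kappa is higher when the codes are equiprobable, and it shows why a given kappa value is hard to interpret without knowing the code prevalence.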