Introduction
The Physical Activity Guidelines for Americans suggest that adults achieve 150 min of moderate to vigorous physical activity (MVPA) per week [1] to obtain health benefits. Self-reported data indicate that 62.0% of adults met these physical activity guidelines, but this dropped to 9.6% when activity was measured objectively using accelerometry [2]. Such a disconnect between perceived and actual activity illustrates the importance of using a wearable device to quantitatively monitor activity and reduce sedentary behaviour in the adult population.
Activity monitor validation is crucial when monitoring habitual physical activity and health behaviour within clinical trials. Pedometer use can increase physical activity during leisure time, especially with the assistance of self-help walking programs [3]. A goal of 3000 steps in 30 min on five days a week, which meets current physical activity guidelines, can be easily tracked using pedometers [4]. Step counts recorded by activity trackers can also signify health status: favourable relationships between step counts and blood pressure, cholesterol, body mass index (BMI), and waist circumference have been shown [5, 6].
Understanding the relationship between step count and health assessments is important to researchers, consumers, and health professionals when using activity monitors to achieve healthy living. As activity monitors grow in popularity, the validity of these devices becomes paramount, and it is important to assess their accuracy in various environments. Objective measurement accuracy of steps can be determined by evaluating the device against an established criterion measure such as manual step counts. The validity of consumer devices is difficult to ascertain during overground conditions, which is why most researchers perform testing on treadmills.
For this study, we selected the Actigraph and the Apple Watch to measure steps in these conditions. The Actigraph was selected since it is considered the gold standard portable device to measure step count in research, and the Apple Watch was selected due to its popularity in three categories of wearable devices - smart watches, basic wearable bands, and portable navigation [7, 8]. Although the Actigraph is considered the gold standard for measuring step count using a portable device, the video-recorded step counting will provide the true step value as it is less prone to measurement errors. It should also be noted that these devices are not the only commercial devices on the market. Evaluating the accuracy of a popular consumer device, the Apple Watch, against an already established industry standard medical-grade activity monitoring device, the Actigraph, could help reassure the general population over using consumer devices to monitor health variables such as step counts.
Minimal research exists to validate the Apple Watch during overground conditions. However, there are studies on validating the step-counting accuracy of activity monitors that used treadmill protocols [9-13]. Although treadmill protocols are convenient and easily replicable, they cannot establish the validity of step counts in off-treadmill settings. Therefore, it is important to validate the devices in different settings, as most daily activities occur in overground conditions, not on a treadmill. One-third of the population attain their physical activity by walking, and 60% of them reported using neighbourhood streets, shopping malls, and parks as their preferred locations to walk [14]. Validating the Apple Watch during overground conditions is crucial for individuals who use walking as their primary source of physical activity and will assure them that they are accurately measuring progress towards their physical activity goals.
Some shortcomings of validation using treadmill protocols include altering gait mechanics by shortening stride length, increasing step cadence, and reducing normal gait variability [15]. Thus, assessing the accuracy of step counts on a treadmill may not accurately translate to steps during overground walking or in a free-living environment. An important benefit of using treadmill protocols for validating activity monitors is the ability to manually observe participant behaviour during the testing period. Overground walking behaviour can be simulated by instructing participants to walk freely over a pre-determined distance while under video observation. Allowing the participant to walk freely overground would avoid alterations in gait mechanics, which can occur on treadmills. Each participant’s average walking speed is calculated by dividing the total distance travelled by the time elapsed. This same speed and time are used on the treadmill to create comparable conditions between overground and treadmill stages. Analysing both conditions, we can compare the activity monitor accuracy in both environments across a range of different speeds. Such protocols result in comparable and repeatable data for validating devices, with direct manual observation under both conditions.
More research is necessary on optimal activity monitor placement and performance during overground conditions [16]. Therefore, the purpose of this study was to validate the step counts of the Apple Watch and wrist, hip and ankle-worn Actigraphs during off-treadmill (overground) and on-treadmill conditions using self-paced walking at different intensities and video monitoring to manually count steps. The novelty of this work is comparing the accuracy of step counting of consumer and medical-grade activity monitoring devices for varied pace walking during overground conditions to the results of a controlled setting on the treadmill. To the best of our knowledge, there is minimal data on the validity of step counts of activity monitors between overground and treadmill walking.
Material and methods
Participants
A total of 40 participants aged 18-65 years, mainly recruited from the Pennsylvania State University Berks campus and the surrounding Berks County community, participated in the study. Participants who were able to wear the activity monitors and who could walk on a flat overground surface and a treadmill were recruited. Participant recruitment, testing, and data analysis occurred between 2017 and 2018.
Physiological measures
A Health-O-Meter Professional (model: 500KL; Healthometer, MO, USA) stadiometer was used to measure height to within 0.25 cm accuracy. Each participant’s height was measured twice barefoot, with an optional third measurement taken if the two measurements differed by 0.25 cm or more. Height was used when initialising each wearable device before participation. Weight was measured to the nearest 0.1 kg using the Health-O-Meter Professional (model: 500KL; Healthometer, MO, USA) scale, which was calibrated before each trial. Participant weight was measured in workout clothes and without shoes and used to initialise each wearable device before participation. BMI was determined as mass (kg)/height² (m²).
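The BMI calculation described above (mass in kilograms divided by the square of height in metres) can be illustrated with a minimal sketch. The function name and the example participant are hypothetical and are not taken from the study data; height is assumed to be recorded in centimetres, as with the stadiometer readings.

```python
def body_mass_index(mass_kg: float, height_cm: float) -> float:
    """Return BMI in kg/m^2 from mass (kg) and height (cm)."""
    if mass_kg <= 0 or height_cm <= 0:
        raise ValueError("mass and height must be positive")
    height_m = height_cm / 100.0  # convert centimetres to metres
    return mass_kg / height_m ** 2


# Hypothetical participant; values are illustrative, not study data.
example_bmi = body_mass_index(mass_kg=72.0, height_cm=175.0)
print(f"BMI = {example_bmi:.1f} kg/m^2")
```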
Activity monitors
Participants simultaneously wore four wearable activity monitors during testing, including the Apple Watch Series 1 and wrist, hip, and ankle-worn GT9X Actigraphs. The Apple Watch Series 1 (Apple Inc., CA, USA) is an accelerometer-based device that provides estimates of heart rate, distance travelled, calories expended, activity minutes, and standing time. Before each test, an Apple Watch was calibrated using a designated phone and applied to fit snugly on top of the participant’s left wrist. Each participant’s demographic information, including height, weight, and age, was entered into the Apple Watch application. The Apple Watch is pre-programmed to record step counts every minute the user is active.
The Actigraph GT9X Link Monitor (Actigraph, FL, USA) is a small, lightweight (14 g; 3.5 x 3.5 x 1 cm) accelerometer. Actigraph wrist (proximal to the Apple Watch), hip (anterior superior iliac spine), and ankle (lateral malleolus) monitors were placed on the participant’s left side for consistency. The three Actigraph GT9X Link devices were calibrated using the participants’ height, weight, age, and location of device attachment. The Actigraph GT9X Link Monitor was set to record data in one-minute epochs to be consistent with the Apple Watch data collection.
Walking protocol
Participants walked on a set track on a flat, overground surface to represent typical walking experienced under overground conditions. Verbal instructions were provided to walk at three stages (slow, moderate, fast) that would allow participants to self-regulate their behaviour as much as they would do under overground conditions and would emulate various real-life scenarios. The slow stage was described as taking a stroll in the park, where they would walk and talk with a friend. The moderate stage was described as being late to an event and walking briskly, and the fast stage was described as walking swiftly to catch a bus that was about to leave. Participants walked around the outer perimeter of a rectangular track that was 24 m long and 1 m wide, with four cones placed at the outer edges of the walking track. Each walking stage consisted of six laps, and the time to complete each stage was recorded. The known distance and time taken to walk six laps for each exercise stage during the overground condition were measured to calculate the average speed (speed = distance/time) for each stage. After a brief rest from the overground walking stages, participants completed an equivalent three-stage treadmill protocol. The average speed and time taken to complete each stage during the overground walking condition were used to determine the treadmill speed and duration for each stage of the treadmill protocol. Consistent speed in both conditions facilitated similar participant behaviours. The mean [± standard deviation (SD)] time (seconds) taken to complete the slow, moderate, and fast stages in both conditions was 260 (52.8), 200 (31.0), and 169 (24.5), respectively.
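As a worked illustration of the speed = distance/time calculation, the sketch below derives the stage distance from the track dimensions and converts a stage duration into a treadmill speed. It rests on two assumptions that the text does not state outright: that one lap equals the perimeter of the 24 m by 1 m rectangle (50 m, so 300 m per six-lap stage), and that the values 260, 200, and 169 reported above are mean stage durations in seconds. The function names are hypothetical. Speeds computed from mean durations are only indicative, since each participant's treadmill speed was set from their own overground time.

```python
TRACK_LENGTH_M = 24.0
TRACK_WIDTH_M = 1.0
LAPS_PER_STAGE = 6


def stage_distance_m(length_m: float = TRACK_LENGTH_M,
                     width_m: float = TRACK_WIDTH_M,
                     laps: int = LAPS_PER_STAGE) -> float:
    """Distance walked in one stage, assuming a lap is the rectangle's perimeter."""
    return 2.0 * (length_m + width_m) * laps


def average_speed_m_per_s(distance_m: float, duration_s: float) -> float:
    """Average speed as total distance divided by elapsed time."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return distance_m / duration_s


# Assumed mean stage durations in seconds (see the lead-in above).
mean_durations_s = {"slow": 260.0, "moderate": 200.0, "fast": 169.0}

distance = stage_distance_m()
for stage, duration in mean_durations_s.items():
    speed = average_speed_m_per_s(distance, duration)
    # 1 m/s = 3.6 km/h, a unit commonly shown on treadmill consoles.
    print(f"{stage}: {speed:.2f} m/s ({speed * 3.6:.1f} km/h)")
```

Under these assumptions the three stages correspond to roughly 1.15, 1.50, and 1.78 m/s.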
Video recording
Both conditions were recorded using a Canon VIXIA HF500 camcorder to observe the steps. Video recordings were manually assessed for step counts by an experienced independent observer. To simplify manual step counts, the participants wore a bright, light-coloured calf sleeve to make each step more distinct. The quality of manual assessments was confirmed by a second observer [intraclass correlation coefficients (ICC) were 0.986 (overground) and 0.999 (treadmill); n = 4].
Statistical analysis
Range, absolute error (observed - device steps), per cent relative error [(observed - device steps) × 100/observed steps], correlation, and root mean square error (RMSE) were reported for the Apple Watch and wrist, hip, and ankle-worn Actigraphs during overground and treadmill walking. With this sign convention, a negative error indicates that the device overcounted steps relative to the manual count. The RMSE was used as an unsigned indicator of the overall quality of the step-counting estimates. That is, undercounting and overcounting of steps are considered equally as errors and do not cancel out. Statistical analyses employed SPSS V.25 (IBM Corp., NY, USA) and MATLAB R2017a (MathWorks Inc., MA, USA).
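The error metrics defined above can be sketched in a few lines of Python. This is an illustration only, not the SPSS or MATLAB code used in the study. The function names and the four-participant data set are hypothetical, and the correlation is assumed to be Pearson's r, which the text does not specify.

```python
import math


def absolute_error(observed: float, device: float) -> float:
    """Signed error: observed (manual) steps minus device steps."""
    return observed - device


def percent_relative_error(observed: float, device: float) -> float:
    """Signed error as a percentage of the observed (manual) step count."""
    return (observed - device) * 100.0 / observed


def rmse(observed: list, device: list) -> float:
    """Root mean square error; over- and undercounts do not cancel out."""
    squared = [(o - d) ** 2 for o, d in zip(observed, device)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(x: list, y: list) -> float:
    """Pearson correlation coefficient (assumed correlation type)."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical step counts for four participants (not study data).
observed_steps = [450, 470, 480, 500]
device_steps = [455, 468, 490, 495]

errors = [absolute_error(o, d) for o, d in zip(observed_steps, device_steps)]
mean_error = sum(errors) / len(errors)
print("signed errors:", errors)
print("mean signed error:", mean_error)
print("RMSE:", round(rmse(observed_steps, device_steps), 2))
print("r:", round(pearson_r(observed_steps, device_steps), 4))
```

In this example the signed errors partly cancel in the mean, while the RMSE does not, which is why the two are reported side by side.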
Results
Forty participants (14 F/26 M) between the ages 18-65 years (< 20, 20-29, 30-39, 40-49, ≥ 50 years; 27.5%, 60%, 2.5%, 7.5%, 2.5%, respectively) with a wide BMI range (BMI; 19.43-38.19 kg/m², mean BMI: 25.5 ± 4 kg/m²) completed the study. Most of the cohort (87.5%) were < 30 years old. Range of values, absolute error, correlation, and RMSE are shown for all devices during both conditions at all three walking intensities in Table 1. The average steps taken with all devices overground were 469 (slow), 411 (moderate), and 348 (fast). The treadmill totals were 477 (slow), 425 (moderate), and 359 (fast). These data show that participants were taking a few additional steps on the treadmill protocol to cover the same walking distance, indicating that the two conditions were not very different (ICC = 0.968; n = 40).
Table 1
Manual step counts, absolute errors, relative error and root mean squared error for different stages amongst devices during overground and treadmill conditions
During the overground condition, the Apple Watch had the lowest relative and absolute error (absolute = -2.32, -8.95, and -20.0 steps; relative = -1%, -2% and -5%) compared to the other devices. The Apple Watch had similar results (absolute = -3.35 and -2.25 steps; relative = -1% and -1%) during the on-treadmill protocol for the moderate and fast stages. The ankle Actigraph had a lower error (absolute = -8.3 steps; relative = -2%) on the treadmill for the slow stage than the Apple Watch (absolute = 17.8 steps; relative = 4%) for the same stage. Overall, these results suggest that the Apple Watch seemed to be more accurate than the wrist, hip, and ankle-worn Actigraphs for overground and treadmill walking.
The correlations between devices and observed steps (true value) for all conditions and walking paces are shown in Table 2. During overground conditions, the Apple Watch had the highest correlation (0.8352, 0.9095, and 0.7646) at the slow, moderate, and fast paces, respectively. Similar results were obtained during the treadmill conditions, with the Apple Watch having the highest correlation (0.7376, 0.8973, and 0.8758) at the slow, moderate, and fast paces, respectively. The RMSE was computed for the Apple Watch and the wrist, hip, and ankle-worn Actigraphs to compare predicted vs observed results. The RMSE values for the Apple Watch (overground: 27.57, 46.74 and 47.22; treadmill: 42.03, 18.58 and 19.12) were lower than those of the wrist, hip, and ankle-worn Actigraphs, except for the hip-worn Actigraph at moderate intensity during overground walking. Collectively, the Apple Watch had the lowest overall errors. The lower absolute and relative errors indicated better Apple Watch performance across the cohort, while the lower RMSE values indicated better Apple Watch performance for individual participants.
Table 2
Correlation between overground, treadmill, and manual step count during slow, moderate, and fast walking paces
Figure 1 compares Apple Watch and Actigraph steps during overground and treadmill walking, with device performance shown by error bars representing the SD of individual errors with respect to the manual step counts. The lowest SD was observed for the Apple Watch, indicating more consistent estimations. The video analysis is represented in Figure 1 as the dashed and dotted lines, labelled “Over-Ground ref” and “Treadmill Ref”. The proximity of the overground and treadmill reference lines in Figure 1 shows that the steps taken during the two conditions were not very different, highlighting the equivalence of treadmill and overground walking.
Discussion
The purpose of this study was to analyse the differences in step counts between the Apple Watch and wrist, hip, and ankle-worn Actigraphs during overground and treadmill walking at different intensities and compare the device results with manual step counting. To our knowledge, our study is one of the initial studies to validate the Apple Watch consumer device against Actigraph, a medical-grade monitoring device, at three different walking speeds (slow, moderate and fast) in two separate walking conditions (overground vs. treadmill) while using manual counts as a reference.
The three walking speeds are somewhat representative of real-life scenarios in which the Apple Watch would be typically used. Collectively, the Apple Watch was accurate when measuring step counts during overground and treadmill walking at various intensities for adults (particularly for the younger cohort) with a wide BMI range. Although a few studies have evaluated activity monitors in overground environments, they did not employ video-monitored manual counts but used a criterion device such as a Step Watch to measure steps [17], which is not infallible [18, 19] or without limitations.
The Apple Watch’s performance in evaluating step counts was comparable to and more accurate than the industry standard, medical-grade Actigraph device. Wrist, hip, and ankle-worn Actigraphs have been previously used to validate activity patterns in healthy controls and clinical populations [20, 21]. This study’s results are in line with our earlier work tracking steps using the Apple Watch at fixed speeds on the treadmill, where there was a total error of 0.034% (1.07 steps) when compared with the manual counts obtained from video recordings [9]. Also, in our previous study, the Apple Watch minimally overestimated steps at lower, moderate, and brisk walking speeds and underestimated steps at a faster pace (it seemed to be most accurate at the moderate intensity pace, which is similar to the current study findings). Differences in comparable outputs between slow-paced walking and moderate to vigorous-paced walking were previously recorded [22] and may explain the error found in our study at this speed. Previous research has also shown errors at slow paces when using accelerometers [23].
Previous work [24] recorded adolescent males regularly performing more physical activity than females. The differences in physical activity levels may also influence walking speeds at lower paces. Our study results are consistent with previous reports in determining the usefulness of activity monitors to track health measures such as step counts [9-13]. Our study results might empower consumers and healthcare providers to rely on activity data from the Apple Watch. Furthermore, it may provide an opportunity to influence and encourage their monitored activity. We surmise that the accuracy of the Apple Watch in estimating steps might be due to its built-in proprietary watchOS algorithms.
For a given speed, participants were taking more steps during the treadmill protocol to cover the same walking distance as the overground protocol, which indicates that they were taking shorter strides due to having an altered gait on the treadmill. Previous study findings support our results, showing that when older adults were allowed to choose a preferred (self-selected) walking pace, they walked faster, used longer strides, and had a faster stride rate overground than when they walked on a treadmill [25]. Since the average speeds for the overground and treadmill conditions were matched, steps taken during both conditions were not entirely different, as indicated in Table 1 and shown by the proximity of the overground and treadmill reference lines in Figure 1. Due to the proximity of the two reference lines, it is important to note that our study validates the practicality of using treadmill protocols as a proxy for overground walking. The correlations between the video analysis and the Actigraph and Apple Watch over both conditions are illustrated in Figure 2 and Figure 3.
Figure 2
Comparing total step count of video analysis vs Actigraph (wrist, hip, and ankle) and Apple Watch during overground walking for all three walking stages (slow, moderate, fast)

Figure 3
Comparing total step count of video analysis vs Actigraph (wrist, hip, and ankle) and Apple Watch during treadmill walking for all three walking stages (slow, moderate, fast)
This study shows the practical implications of how the Apple Watch can be used to accurately measure physical activity during overground conditions at different walking paces. With walking being a popular form of achieving physical activity goals and the rise in popularity of smart devices to measure physical activity, the Apple Watch provides a good alternative to pedometers for those seeking to more accurately measure steps. These results also build an important foundation for future validation research using treadmill protocols.
Our study was limited by primarily involving relatively young, healthy volunteers in a controlled setting for both conditions. In addition, instructions were provided to walk on a set track during overground conditions, which may not truly reflect natural walking. Average walking speed, calculated as the average velocity from each stage during the overground walking, was used for the treadmill stages. As a result, the calculated average values may mask the natural variations (initial accelerations and terminal decelerations) in walking speed during each overground stage that generally occur in a real-life scenario. The participants’ field of vision was fairly static (i.e., there was no visual flow as experienced with overground walking) when walking on the treadmill, which might have influenced the results. Another limitation was that the first-generation Apple Watch was used, and there are newer models on the market. Although this study looked at an older Apple Watch model, it gives insight into the accuracy of this device series. Nonetheless, it would be worth looking into the newer models, especially with the advancements in technology since the first model was released.
In conclusion, the Apple Watch is an accurate and credible consumer device that could be used for on or off-treadmill activity monitoring. Step counts during overground and treadmill walking were rather consistent, demonstrating the essential equivalence of treadmill and overground gait mechanics. Coupling wearable technology with constant monitoring and real-time feedback might provide a social incentive to motivate societies and improve physical activity levels. This research can assist in forming the groundwork for further research into the validity of other popular activity monitors, such as the newer generation Apple Watches, Samsung Galaxy Watches, Fitbits, and WHOOP. Future research can also validate activity monitors on various overground conditions, such as hiking, trail running, or walking on surfaces such as sand and mountainous terrain.