
We conducted an online crowdsourcing study in which participants interacted with the virtual coach Kai in up to five sessions between 1 February and 19 March 2024. The Human Research Ethics Committee of Delft University of Technology granted ethical approval for the study (Letter of Approval number: 3683). We preregistered the study in the Open Science Framework (OSF)72, and no changes were made compared to the preregistration.
Study design
We performed a longitudinal study with a micro-randomized design73, which entails assigning an intervention option at random to each participant at each pertinent decision point. The two intervention options were providing and not providing human feedback, chosen with probabilities of 20% and 80%, respectively. The four decision points were the days between each pair of the five sessions with the virtual coach. To assess the effect of the intervention options, participants in sessions 2–5 reported the effort they spent on the activity assigned by the virtual coach as well as how likely they would have been to return to the session had the intervention been unpaid (Fig. 5). Based on the collected data, we performed inferential statistics to determine the effect of human feedback on effort and return likelihood (RQ1). Moreover, we trained an RL model that optimizes the effort people spend on their activities over time. Using this model, we ran human data-based simulations to assess the long-term effects of human feedback under varying settings for the cost of providing feedback (RQ2). Such human data-based simulations are a common way to assess RL models30. Lastly, we compared the optimal policies of RL models that not only optimize the effort spent on activities (i.e., prognosis) but also account for other ethical principles (Table 4), with respect to the human feedback they allocate to different smoker subgroups (RQ3). The weights assigned to the different ethical principles were based on smokers’ preferred principles for allocating human feedback, elicited in our post-questionnaire (Table 4).

Between each pair of sessions, participants had a 20% chance of receiving a human feedback message.
Materials
The materials developed for this study include the virtual coach Kai, 37 preparatory activities for quitting smoking, and human feedback messages.
We implemented the text-based virtual coach Kai by closely following the implementation of the virtual coach Sam74, which was developed for another smoking cessation study and perceived positively overall by smokers12,75. There were two versions of Kai, one for smokers and one for vapers. Below, we describe the smoking version. The only difference in the vaping version was that smoking-related terms in the dialogs were replaced with their vaping counterparts (e.g., “smoker” was replaced by “vaper”). After introducing itself as wanting to prepare people for quitting smoking and becoming more physically active, with the latter possibly aiding the former76,77, Kai explained that one of two human coaches could sometimes send a feedback message between sessions. These human coaches were described as having a background in Psychology, including knowledge of how to help people change their behavior. In each of up to five sessions, Kai collected information on an individual’s current state by asking about their importance and self-efficacy for preparing for quitting, human feedback appreciation, and energy. Afterward, Kai proposed a new preparatory activity. In the next session, Kai asked about the effort people spent on their activity and their experience with it, as well as their likelihood of returning to the session if it was unpaid. People were told that one of the human coaches could read their experience description to write a feedback message, and that more specific descriptions would help write more specific feedback. Kai informed people when they were chosen to receive human feedback after the session. At the end of the session, participants received a reminder message with their activity on Prolific (Supplementary Fig. 2). Like Sam, Kai gave compliments for spending a lot of effort on activities, expressed empathy otherwise, and maintained an encouraging attitude. The Rasa-based implementation of Kai78 and a demo video79 are available online.
The conversation structure is shown in Supplementary Fig. 3.
In each session, Kai proposed a new preparatory activity. This activity was randomly chosen from a set of 37 short activities (e.g., past successes for quitting smoking, role model for others by quitting smoking, visualizing becoming more physically active as a battle) created based on discussions with health experts, the activities of the smoking cessation applications by Michie et al.80 and Albers et al.12, the behavior change techniques by Michie et al.81, and smoking cessation material by organizations such as the National Cancer Institute and the Dutch Trimbos Institute. Since becoming more physically active can make it easier to quit smoking76,77, 17 activities addressed becoming more physically active. One example of an activity is given in Table 3 and all activities can be found in Supplementary Table 3.
Between sessions, participants sometimes received a human feedback message. These messages were written by one of two human coaches, who were Master’s students in Psychology. Following the model by op den Akker et al.82, the human coaches were instructed to write messages that contained the following components: feedback, argument, and suggestion or reinforcement. They also received the general guidelines to refer to things in people’s lives to build rapport, show understanding if people have low confidence, and reinforce people when they are motivated. When writing the feedback, the human coaches had access to anonymized data on people’s baseline smoking and physical activity behavior (i.e., smoking/vaping frequency, weekly exercise amount, existence of previous quit attempts of at least 24 hours, and the number of such quit attempts in the last year), introduction texts from the first session with the virtual coach, previous preparatory activity (i.e., activity formulation, effort spent on the activity and experience with it, return likelihood), current state (i.e., self-efficacy, perceived importance of preparing for quitting, human feedback appreciation), and new activity formulation. All feedback messages ended with a disclaimer that they were not medical advice. A screenshot of how we sent human feedback messages to participants is provided in Fig. 6. All 523 written messages are available online83.

The message ended with a disclaimer that it was not medical advice.
Measures
We collected four primary groups of measures, namely, the effort spent on activities, the likelihood of returning to a session, state features, and participants’ preferred principles for allocating human feedback.
The virtual coach asked participants about the effort they put into their previously assigned activity on a scale from 0 (“Nothing”) to 10 (“Extremely strong”), adapted from Hutchinson and Tenenbaum84 as also done by Albers et al.35.
To determine participants’ return likelihood, the virtual coach asked participants the question “Currently you are taking part in a paid experiment. Imagine this was an unpaid [smoking/vaping] cessation program. How likely would you then have quit the program or returned to this session?”, rated on a scale from −5 (“definitely would have quit the program”) to 5 (“definitely would have returned to this session”). 0 was labeled as “neutral.”
Moreover, we measured five variables (i.e., features) that describe a person’s state in each session: (1) perceived importance based on the question “How important is it to you to prepare for quitting [smoking/vaping] now?”, adapted from Rajani et al.85 and rated on a scale from 0 (“not at all important”) to 10 (“desperately important”), (2) self-efficacy based on the question “How confident are you that you can prepare for quitting [smoking/vaping] now?”, adapted from the Exercise Self-Efficacy Scale by McAuley86 and rated on a scale from 0 (“not at all confident”) to 10 (“highly confident”), (3) human feedback appreciation based on the question “How would you view receiving a feedback message from a human coach after this session?”, rated on a scale from −10 (“very negatively”) to 10 (“very positively”), with 0 labeled as “neutral,” (4) energy based on the question “How much energy do you have?”, rated on a scale from 0 (“none”) to 10 (“extremely much”), and (5) the session number.
We further determined participants’ preferred principles for allocating human feedback by asking them to distribute 100 points across 11 allocation principles after the question, “Based on which principles/rules should the virtual coach decide when a human coach should give feedback to people who are preparing to quit [smoking/vaping]? Assign 100 credits to the principles below, where more credits mean that you are more in favor of a principle.” Nine principles were derived from those presented by Persad et al.39, adapted to the smoking cessation context (Supplementary Table 4). We supplemented these principles with one further formulation of treating people equally (i.e., least amount of human feedback so far) and with the principle of respecting people’s autonomy by prioritizing people who most appreciate receiving human feedback.
Participants
Participants were recruited from the crowdsourcing platform Prolific Academic. People were eligible if they smoked tobacco products or vaped daily, were fluent in English, and had not participated in the conversational sessions of our two previous studies with similar preparatory activities56,57. To pass the prescreening questionnaire, participants further had to give digital informed consent, confirm smoking/vaping daily, and indicate that they were contemplating or preparing to quit smoking/vaping87 and not part of another intervention to quit smoking/vaping. The study was framed as preparation for quitting smoking/vaping for people recruited as daily smokers/vapers. Out of 852 people who started the first conversational session, 500 completed all five sessions, and 449 provided their preferences for allocating human feedback based on different principles in the post-questionnaire. To increase the chance that participants would read the human feedback messages, they were told they might be asked to confirm having read a received message to be invited to the next session. Participants who failed more than one attention check in the prescreening questionnaire were not invited to the first session. Moreover, participants had to respond to the invitations to the sessions and the post-questionnaire within two days. The participant flow is shown in the Supplementary Information. Participants who completed a study component were paid based on the minimum payment rules on Prolific, which require a payment rate of six pounds sterling per hour. Participants were informed that their payment was independent of how they reported on their preparatory activities to account for self-interest and loss aversion biases88. Participants who failed more than one attention check in the prescreening or post-questionnaire were not compensated for that questionnaire.
Participants were from countries of the Organization for Economic Co-operation and Development (OECD), excluding Turkey, Lithuania, Colombia, and Costa Rica, but including South Africa89. Of the 679 participants with at least one interaction sample, 330 (48.60%) identified as female, 335 (49.34%) as male, and 14 (2.06%) provided another gender identity. The age ranged from 19 to 71 (M = 36.30, SD = 11.21). Further participant characteristics (e.g., education level, smoking/vaping frequency) can be found in Supplementary Table 5.
Procedure
Participants meeting the qualification criteria could access the prescreening questionnaire on Prolific, and those who passed the prescreening were invited to the first session with Kai about 1 day later. Invitations to a subsequent session were sent about 3 days after having completed the previous one. Between sessions, participants each time had a 20% chance of receiving a human feedback message. About 3 days after completing the last session, participants were invited to a post-questionnaire in which they were asked about their preferred principles for allocating human feedback, first by means of an open question and then by distributing points across given principles.
Data preparation
We collected all interaction samples of pairs of sessions in which people answered at least the effort, return likelihood, and the first state feature question (i.e., perceived importance) in the next session. Missing values in interaction samples (N = 5) for the remaining state features were imputed with the corresponding feature’s sample population median. Our data and analysis code are publicly available90.
Data analysis for RQ1: short-term effects of human feedback on engagement
First, we wanted to assess whether human feedback positively affects engagement in the short term. For this, we performed Bayesian inferential analyses.
To determine the direct effect of human feedback on the effort people spend on their activities and their return likelihood, we compared samples where people received human feedback to samples where they did not. For each of the two dependent variables (i.e., effort and return likelihood), we fit a model containing a general mean, a random intercept for each participant, and a binary fixed effect for human feedback received after the previous session. We fit both models with diffuse priors based on the ones used by McElreath91 and assessed them by interpreting the posterior probability that the fixed effect for human feedback is greater than zero based on the guidelines by Chechile48. We further report 95% highest density intervals (HDIs).
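These posterior summaries reduce to simple operations on posterior draws. A minimal numpy sketch, assuming draws for the human feedback fixed effect are available as an array after model fitting (the helper names `posterior_prob_positive` and `hdi` are illustrative, not from our analysis code):

```python
import numpy as np

def posterior_prob_positive(draws):
    """Fraction of posterior draws above zero, i.e., P(effect > 0 | data)."""
    return float(np.mean(np.asarray(draws) > 0))

def hdi(draws, cred=0.95):
    """Narrowest interval containing a `cred` fraction of the posterior draws."""
    s = np.sort(np.asarray(draws))
    n_in = int(np.floor(cred * len(s)))      # draws inside the interval
    widths = s[n_in:] - s[: len(s) - n_in]   # width of every candidate interval
    j = int(np.argmin(widths))               # narrowest candidate wins
    return s[j], s[j + n_in]
```

Dedicated libraries (e.g., ArviZ) provide equivalent HDI routines; the sketch only shows what the reported quantities mean.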
Besides the direct effect of human feedback on the effort and return likelihood, there might also be a delayed effect. For example, if human feedback increases a person’s self-efficacy, then the person may spend a lot of effort on future activities even when not receiving additional human feedback. To determine whether having received human feedback leads to a higher effort and return likelihood, we fit two further statistical models. For both dependent variables (i.e., effort and return likelihood), we fit a model containing a general mean, a random intercept for each participant, and a fixed effect for whether participants had received human feedback until then. We again fit both models with diffuse priors and used posterior probabilities and 95% HDIs to assess whether the effect of having received human feedback is positive.
The delayed effect of human feedback might be stronger for people who have received multiple feedback messages. To determine whether having received more human feedback leads to a higher effort and return likelihood, we created two further statistical models by extending the previous two models with a fixed effect for the number of times participants had received human feedback until then. Again, we fit both models with diffuse priors and used posterior probabilities and 95% HDIs to assess whether the effect of multiple human feedback messages is positive.
Data analysis for RQ2: long-term effects of optimally allocated human feedback on engagement—RL model
While our inferential analysis of delayed human feedback effects already looked a few steps into the future, it was based on randomly allocated human feedback. However, in some situations, giving human feedback might also be detrimental in the long run. We therefore used simulations to assess the long-term effects of optimally allocated human feedback based on a person’s state. By optimally allocated human feedback, we mean feedback that is only given in situations (a) where it is ultimately more beneficial than not giving feedback, and (b) where this benefit outweighs the economic cost of giving human feedback.
To study the long-term effects of optimally allocated human feedback, we designed and trained an RL model for deciding when to allocate human feedback. Starting with a base model that maximizes the effort people spend on their activities over time, we add the consideration of human feedback costs and, for RQ3, of other ethical principles for allocating feedback. Figure 7 visualizes our final model.

The arrows indicate which state features are used to predict the different reward functions. The five reward functions can be combined by setting different weights α.
We can define our approach as a Markov decision process (MDP) 〈S, A, R, T, γ〉. The action space A consisted of two actions (i.e., giving human feedback no/yes), the reward function R: S × A → [0, 1] was determined by the self-reported effort spent on activities, T: S × A × S → [0, 1] was the transition function, and the discount factor γ was set to 0.85 to favor rewards obtained earlier over rewards obtained later, as also done in previous work (e.g., refs. 32,35). The finite state space S described the state a person was in and was captured by their perceived importance of and self-efficacy for preparing for quitting smoking/vaping, as well as their appreciation of receiving human feedback. The goal of an MDP is to learn an optimal policy π*: S → Π(A), describing which action to take in each state, that maximizes the expected cumulative discounted reward \(\mathbb{E}\left[\sum_{t=0}^{\infty }\gamma^{t}r_{t}\right]\). The optimal Q-value function \(Q^{*}:S\times A\to \mathbb{R}\) gives the expected cumulative discounted reward for executing action a in state s and following π* in all subsequent states. In the following, we describe each component in more detail.
We considered six features to describe the state space: (1) the perceived importance, (2) self-efficacy, (3) the difficulty of the assigned activity based on the activity difficulty ratings by Albers et al.92, (4) energy, (5) human feedback appreciation, and (6) the session number. The first three features were considered since goal-setting theory posits that goal commitment, facilitated by importance and self-efficacy, and task difficulty are moderators of the effects that goals have on performance93. More precisely, low commitment and high task difficulty might make it harder for people to reach their goals, which may make human feedback more beneficial. We further included energy since it was shown to be an important predictor of the effort people spend on preparatory activities for quitting smoking in a previous study57. Moreover, since the novelty of the intervention may influence people’s motivation to do the activities12, we also captured the session number.
To reduce the size of the state space and thus create a more robust model, we selected three abstracted base state features based on our collected data. Specifically, taking the G-algorithm94 and its adaptation by Albers et al.35 as inspiration, we iteratively selected the feature for which the Q-values for the abstracted feature values differed most. We specified the first selected feature to have three abstracted values and the second and third features to have two. Abstracted features were computed based on percentiles. For example, to create an abstracted feature with two values, we set all values less than or equal to the median to 0 and those greater than the median to 1. Besides reducing the required data, selecting a subset of the state features also has the advantage that the virtual coach would in the future need to ask people fewer questions per session, which is in line with keeping smoker demands to a minimum80. The three selected features were (1) perceived importance with three values, (2) self-efficacy with two values, and (3) human feedback appreciation with two values. The base state space thus had size 3 × 2 × 2 = 12. We refer to the resulting base states with three-digit strings such as 201 (here, perceived importance is high, self-efficacy is low, and human feedback appreciation is high). Supplementary Figs. 5 and 6 show the mean effort and number of samples per combination of values for the three selected features.
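The percentile-based abstraction can be sketched in a few lines; `abstract_feature` is an illustrative helper, with cut points computed from the sample itself:

```python
import numpy as np

def abstract_feature(values, n_levels):
    """Map raw feature values to levels {0, ..., n_levels - 1} via percentile
    cut points. For n_levels == 2 this is a median split: values less than or
    equal to the median map to 0, values above it to 1."""
    values = np.asarray(values, dtype=float)
    # Cut points at the 1/n, 2/n, ... sample percentiles.
    cuts = np.percentile(values, [100 * k / n_levels for k in range(1, n_levels)])
    # side="left" keeps values equal to a cut point in the lower level.
    return np.searchsorted(cuts, values, side="left")
```

With three levels for perceived importance and two each for self-efficacy and feedback appreciation, concatenating the three level digits yields base state codes such as 201.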
The action space was defined by two actions: giving (a = 1) and not giving (a = 0) human feedback.
Just as in the algorithm by Albers et al.35, the base reward signal was based on asking people how much effort they spent on their previous activity on a scale from 0 to 10. Based on the sample population mean effort \(\overline{e}\), the reward r ∈ [0, 1] for an effort response e was computed as follows:
$$r=\left\{\begin{array}{ll}\frac{e}{2\overline{e}}&\text{if }e<\overline{e}\\ 1-\frac{10-e}{2(10-\overline{e})}&\text{if }e>\overline{e}\\ 0.5&\text{otherwise}.\end{array}\right.$$
(1)
The idea behind this reward signal was that an effort response equal to the mean effort was awarded a reward of 0.5, and that rewards for efforts greater and lower than the mean were each equally spaced.
The reward and transition functions were estimated from our data.
Under the base reward alone, human feedback may be allocated to more people than can be economically afforded given budget constraints. To be able to reduce the amount of allocated human feedback, we introduce a human feedback cost c that is included in the reward computation depending on the action a:
$${r}_{c}=\left\{\begin{array}{ll}r&\text{if }a=0\\ r-c&\text{if }a=1.\end{array}\right.$$
(2)
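Equations (1) and (2) translate directly into a short pair of functions; the function names are illustrative, and the sample mean effort ē is passed in as an argument:

```python
def effort_reward(e, mean_effort):
    """Base reward (Eq. 1): the mean effort maps to 0.5, with rewards for
    efforts below the mean equally spaced towards 0 and rewards for efforts
    above the mean equally spaced towards 1 (effort scale is 0-10)."""
    if e < mean_effort:
        return e / (2 * mean_effort)
    if e > mean_effort:
        return 1 - (10 - e) / (2 * (10 - mean_effort))
    return 0.5

def cost_adjusted_reward(r, a, c):
    """Cost-adjusted reward (Eq. 2): subtract the human feedback cost c only
    when feedback is given (a == 1)."""
    return r - c if a == 1 else r
```

For example, with a mean effort of 5, an effort report of 0 yields reward 0, a report of 5 yields 0.5, and a report of 10 yields 1.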
We computed 0.001-optimal policies and corresponding Q* with Gauss-Seidel value iteration from the Python MDP Toolbox. We use π*,c to refer to an optimal policy for a certain cost c.
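For illustration, plain ε-optimal value iteration (rather than the Gauss-Seidel variant from the MDP Toolbox that we actually used) can be sketched as follows; the toy two-state transition and reward numbers are made up, not taken from our data:

```python
import numpy as np

def value_iteration(T, R, gamma=0.85, eps=1e-3):
    """Plain eps-optimal value iteration.

    T: transition probabilities, shape (A, S, S); R: rewards, shape (S, A).
    Returns the greedy policy (one action per state) and the Q-values.
    """
    V = np.zeros(T.shape[1])
    thresh = eps * (1 - gamma) / (2 * gamma)  # standard eps-optimality bound
    while True:
        # Q[s, a] = R[s, a] + gamma * sum_s' T[a, s, s'] * V[s']
        Q = R + gamma * np.einsum("ast,t->sa", T, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < thresh:
            return Q.argmax(axis=1), Q
        V = V_new

# Toy example: action 1 ("give feedback") costs 0.07 reward but moves the
# low-effort state 0 to the high-effort state 1; action 0 leaves states as is.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],   # a = 0: stay in place
              [[0.0, 1.0], [0.0, 1.0]]])  # a = 1: move to / stay in state 1
R = np.array([[0.2, 0.13], [0.9, 0.83]])
policy, Q = value_iteration(T, R)
```

In this toy MDP, the optimal policy gives feedback only in the low-effort state, where the long-term gain outweighs the cost.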
Data analysis for RQ2: long-term effects of optimally allocated human feedback on engagement—analysis steps
First, we assume we have no economic budget constraints and can allocate as much human feedback as we wish (i.e., c = 0). To assess the effects of such unlimited human feedback over time, we ran simulations based on our collected data to compare four different policies with respect to the mean reward per activity assignment over time: (1) the optimal policy π*,0, (2) the policy of always assigning human feedback, (3) a theoretical average policy in which each of the two actions is taken with weight \(\frac{1}{2}\) for each person at each time step, and (4) the policy of never assigning human feedback. To obtain a realistic population, the simulated people were initially distributed across the state features following the distribution we observed in the first session of our study (Supplementary Fig. 7).
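Such a policy comparison can be sketched by propagating a population's state distribution through the estimated transition model; the function name `simulate_mean_reward` and the toy two-state MDP below are illustrative, not our actual simulation code:

```python
import numpy as np

def simulate_mean_reward(policy_probs, T, R, d0, horizon):
    """Mean reward per time step for a population with initial state
    distribution d0 following a (possibly stochastic) policy.

    policy_probs: (S, A) action probabilities; T: (A, S, S); R: (S, A).
    """
    d = np.asarray(d0, dtype=float)
    means = []
    for _ in range(horizon):
        # Expected reward at this step under the policy.
        means.append(float(np.sum(d[:, None] * policy_probs * R)))
        # Propagate the state distribution one step.
        d = np.einsum("s,sa,ast->t", d, policy_probs, T)
    return means

# Toy MDP (numbers made up): a = 1 moves everyone to the high-reward state 1.
T = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.2, 0.13], [0.9, 0.83]])
d0 = [0.5, 0.5]
never = np.array([[1.0, 0.0], [1.0, 0.0]])   # never give feedback
always = np.array([[0.0, 1.0], [0.0, 1.0]])  # always give feedback
```

The average policy from our comparison corresponds to `policy_probs` of 0.5 for each action in every state.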
In practice, budget constraints might limit the amount of available feedback and thus make it impossible to always allocate human feedback according to π*,0. To reduce the amount of allocated human feedback, we added different human feedback costs to the base reward, and assessed the resulting mean reward and amount of allocated human feedback over time. The considered costs were chosen such that the resulting optimal policies π*,c all differ in the number of states that are allocated feedback. We again used as the starting population the distribution of people across the 12 states we observed in our study’s first session.
Data analysis for RQ3: effect of different ethical allocation principles on human feedback received by smoker subgroups
Given that we can only provide limited human feedback, we cannot allocate human feedback to everybody. The RL models we have trained for RQ2 all allocate human feedback to those who will see the largest increase in effort spent on preparatory activities over time because of the feedback. This can be seen as maximizing total benefits according to the allocation principle that Persad et al.39 call prognosis. However, we can also use other ethical principles in our RL model. Here, we now want to assess the effects of incorporating different ethical allocation principles on the subgroups of smokers who receive feedback.
To get a realistic assessment of the effect of incorporating different ethical allocation principles, let us first define a potential live smoking cessation application. Suppose we have an application in which people have up to nine sessions with a virtual coach, after each of which they can get feedback from a human coach. As people sometimes drop out of eHealth applications before completing them18,95, we assume, based on the average percentage of negative return likelihood ratings per session in our study, a 15% chance that people drop out of the application after each session. The spots of people who have either completed all nine sessions or have dropped out are given to new people. These new people are distributed across the 12 base states as in the first session of our study. Taking about six minutes per feedback message, a human coach can give feedback to around 58 people every day. Assuming 166 spots in the application, this amounts to about 35% of people. Therefore, the human feedback costs in our analyses were set such that, on average, about 35% of people receive feedback every day.
To also reward allocating human feedback according to ethical principles other than prognosis, we extended the RL model. Specifically, we created the four auxiliary (i.e., additional) rewards shown in Table 4. We use first-come, first-served to illustrate the effect of treating people equally. Note that the ethical principles of youngest first, instrumental value, and reciprocity can all be represented by setting an individual characteristic-based priority level. To compute these auxiliary rewards, we extended the state space by two features, each with three values: (1) a random individual characteristic-based priority level that remains fixed for each person and (2) time since the last human feedback. Both of these state features only influence the auxiliary reward and not the base reward (i.e., prognosis). Each auxiliary reward raux ∈ [0, 1] is then computed as \({r}_{aux}=\frac{aux-au{x}_{min}}{au{x}_{max}-au{x}_{min}}\), where aux is a person’s value for the measure underlying the auxiliary reward (e.g., the time since the last human feedback) and auxmin and auxmax are the lowest and highest possible values for the measure.
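The auxiliary-reward normalization and the weighted combination of the five reward functions (the weights α in Fig. 7) can be sketched as follows; the function names are illustrative:

```python
def auxiliary_reward(aux, aux_min, aux_max):
    """Min-max normalize an auxiliary measure (e.g., time since the last
    human feedback) to the [0, 1] range."""
    return (aux - aux_min) / (aux_max - aux_min)

def combined_reward(rewards, weights):
    """Weighted sum of the base (prognosis) and auxiliary rewards; if the
    weights sum to 1, the combined reward stays in [0, 1]."""
    return sum(w * r for w, r in zip(weights, rewards))
```

For example, a person halfway through the longest possible wait since their last feedback gets an auxiliary reward of 0.5 for the first-come, first-served principle.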
Using the rewards from Table 4 and the weights given to them by smokers, we compared six policies based on which states they allocate feedback: (1) the optimal policy based on the base reward, (2–5) the four optimal policies for using the base reward together with either first-come, first-served, sickest first, autonomy, or priority with the two rewards weighted based on the smoker-preferred weights, and (6) the optimal policy based on all five rewards weighted according to weights derived from smokers’ preferred principles for allocating human feedback (Table 4). Due to the relatively large drop in reward between human feedback costs of 0.07 and 0.09 observed for our analysis of the long-term effects of limited feedback (Fig. 3a), we set the human feedback cost to 0.07 for the base reward-based optimal policy, which means that after each session around 35% of people get feedback (Supplementary Fig. 8b). Since incorporating auxiliary rewards can change the amount of allocated feedback, we tuned the costs for the other policies such that these policies also allocate feedback to around 35% of people, thus allowing for a fair comparison between policies.