Do Behavioral Observations Make People Catch the Goal? A Meta-Analysis on Goal Contagion

Goal contagion is a social-cognitive approach to understanding how other people’s behavior influences one’s goal pursuit: An observation of goal-directed behavior leads to an automatic inference and activation of the goal before it can be adopted and pursued thereafter by the observer. We conducted a meta-analysis focusing on experimental studies with a goal condition depicting goal-directed behavior and a control condition. We searched four databases (PsychInfo, Web of Science, ScienceDirect, and JSTOR) and the citing literature on Google Scholar, and eventually included e = 48 effects from published studies, unpublished studies, and registered reports based on 4751 participants. The meta-analytic summary effect was small (g = 0.30, 95%CI [0.21; 0.40], τ² = 0.05, 95%CI [0.03, 0.13]), implying that goal contagion might occur for some people when a goal is perceived in behavior, compared to when it is not. However, the original effect seemed to be biased through the current publication system. As shown by several publication-bias tests, the effect could rather be half the size, for example, selection model: g = 0.15, 95%CI [–0.02; 0.32]. Further, we could not detect any potential moderator (such as the presentation of the manipulation and the contrast of the control condition). We suggest that future research on goal contagion makes use of open science practices to advance research in this domain.


Introduction
Goals are essential to many species' existence. Understood as representations of desired states that are attainable through action (Kruglanski & Kopetz, 2009), they determine the upcoming steps of living beings as they strive to achieve something, be it nourishment, sex, company, or a place to hide. Humans are no exception, but beyond these aforementioned basic needs (Jolly, 1976; Maslow, 1943), our complex cognitive structure allows us to incorporate many goals in our daily life, like 'catching the train to the workplace,' 'going grocery shopping,' or 'going jogging after work.' Moreover, we can plan ahead so that our daily goals serve as a means for higher-order goals like 'earning money,' 'being healthy,' or 'keeping in shape,' which usually serve self-regulatory purposes in the long run (Carver & Scheier, 1981, 2012). However, it is also due to our cognitive architecture that we often monitor other people's behavior as this can contain important information (e.g., My colleagues bring their home-cooked meals for lunch; I also want to live healthily!). Consequently, observing other people's goal-directed behavior might affect our own goals.
We can take others as a source to adjust our goals. A social-cognitive approach to this phenomenon is provided by the theory of goal contagion (henceforth, GC), which was introduced by Aarts and colleagues (Aarts, Dijksterhuis, & Dik, 2008; Aarts, Gollwitzer, & Hassin, 2004) more than a decade ago. As much research has been conducted on this topic, we intend to summarize the evidence for GC in a meta-analysis and search for potential moderators. To do so, we will first provide a clear description of GC, based on theoretical introductions and empirical studies in the literature. We need to overcome vague concepts and paradigms to formulate precise guidelines for the later extraction of studies and effects in the meta-analysis.

The process of and evidence for goal contagion
The original authors of the first GC studies based their theoretical approach on the spontaneous causal inferences framework (Hassin, Bargh, & Uleman, 2002), which posits that people make spontaneous inferences about traits. For instance, the observation of a person offering to help someone else likely leads to the inference that this person might be helpful in general. However, Aarts et al. (2004) extended this idea from traits of the observed person to the goals she/he is pursuing. Thus, an observed helpful behavior could be an indicator of both the observed person's trait of being helpful in general and his/her current goal of offering help. The latter is the prerequisite for the GC process.

Brohmer, H., et al. (2021). Do Behavioral Observations Make People Catch the Goal? A Meta-Analysis on Goal Contagion. International Review of Social Psychology, 34(1): 3, 1-15. DOI: https://doi.org/10.5334/irsp.428

The process of GC can be described as follows: People observe others behaving in a certain way and infer their goals quickly and automatically, which may happen outside their conscious awareness (Aarts et al., 2004; Hassin et al., 2005). If this inferred goal has some relevance to the observers, they are inclined to adopt this goal thereafter. This way, GC is actually a two-step process, predicting a goal-directed behavior of an observer, following the automatic inference and activation of that goal through observations. Laurin (2016) even compared GC to a misattribution process, because the automatically inferred goal of the other person is mistakenly attributed to the self and thus directs the observer's thoughts and actions. Hence, this inference step is often operationalized through variables that do not refer to the goal itself, but assess the activation of the goal indirectly (e.g., a lexical decision task that measures how quickly participants respond to goal-related words). This is also theoretically sensible and in line with GC's origin in the spontaneous causal inference framework (Hassin et al., 2002), as described above. 1 Nonetheless, research on GC does not entirely exclude less automatic pathways from the model: indeed, GC research sometimes reports direct (sometimes also referred to as 'explicit') measures of goal inference as manipulation checks or additional dependent variables (DVs) (Dik & Aarts, 2007; Jia, Tong, & Lee, 2014) that make direct reference to the goal (e.g., asking participants 'what the person in the text tries to achieve').
Several studies have presented evidence for the existence of GC. Three aspects become apparent from the empirical literature: First, different research teams have tested a wide array of diverse goals. Therefore, the GC effect was shown for goals ranging from having casual sex (Aarts et al., 2004) to behaving prosocially (Dik & Aarts, 2007) to achieving high scores in a task (Leander & Shah, 2013) to dieting (Lee & Shapiro, 2015). Second, in most of the literature, GC manipulation is accompanied by moderators that might operate in a unique way for some goals but potentially not for others. For instance, an observed person showing extra effort in his or her behavior (as moderating condition) might be beneficial for the observer adopting a prosocial goal (Dik & Aarts, 2007), but not for self-serving goals such as earning money (Corcoran et al., 2018). Third, although goal contagion is often conceptualized as a mediation (where automatic activation mediates the effect between goal observation and goal adoption), most studies focus on either demonstrating the automatic activation of the observed goal or behavioral measures as goal adoption without looking at activation (for exceptions that focus on both, see Corcoran et al., 2018; Dik & Aarts, 2007; Jia et al., 2014). Recently, more labs, including our own lab, have attempted to contribute to the body of research on GC by testing both formerly used and new goals, as well as the two-step model, including moderators (Brohmer et al., 2018; Corcoran et al., 2018; Wessler & Hansen, 2016). Interestingly, these attempts often yielded effects close to zero, although sample sizes were much larger than in previous studies.
Because these studies demonstrated that the GC effect does not seem to be as robust as could be expected from the previously published literature, we reasoned that a meta-analysis on GC would be advantageous. On the one hand, we wanted to 1) summarize the evidence for GC, 2) discern the statistical evidence for the automatic activation process and behavioral measures, and 3) identify further moderating effects that might turn out to be important across goals. Hence, our motivation was to see which goals that people perceive in others are truly affecting their own goals and under which conditions. On the other hand, we also wanted to test and correct for potential publication bias to obtain more accurate effect size estimates, which has proven to be effective in other social-psychological research (Francis, 2012; Kühberger, Fritz, & Scherndl, 2014; Lane & Dunlap, 1978).

Potential moderators
In accordance with the aforementioned three points, we selected the DV that was used (henceforth: DV category) as a first moderator of interest, which has the advantage of being relatively objectively identifiable: it can be automatic goal activation or goal pursuit (including behavioral intention). Automatic activation as indicator of a goal inference is usually measured via variations of the lexical decision or word completion task (e.g., Dik & Aarts, 2007). Goal pursuit contains both behavioral measures and an expression of intention (see supplementary document: https://osf.io/jx7rc/).
Our second moderator of interest differentiates elicited goals to test the idea that some goals might be more contagious than others. There are many dimensions on which goals might differ. We decided to look at goals based on how many people would pursue this goal, which hints at whether a goal can be perceived as quite 'common' (henceforth: common goal). Crucially, this moderator is useful because the GC process is theorized to be strengthened when observers assign a high or positive value to the goal that they infer (Aarts et al., 2004; Brohmer et al., 2018; Corcoran et al., 2018). If a goal is pursued by a majority of people, this indicates a high value in the eyes of many and demonstrates a broader relevance. This in turn could make it more likely from the perspective of a specific individual (such as a study participant) that he/she might also assign a high value to this goal, which should foster the GC process. Therefore, common goals might be more contagious overall. 2
The last two moderators have a more methodological focus and might provide insights into how best to study GC. The third moderator will be the presentation of stimulus material, depicting a goal-directed behavior (henceforth: presentation). This measure is relatively objective, as most stimulus material can be expected to be identifiable as either texts or video clips and animations, these forms being most likely to depict behavior in a standardized form. More vivid materials, such as videos, showed stronger effects in other intervention contexts (e.g., Soetens et al., 2014; Walthouwer et al., 2015). Therefore, we assume that video clips and animations might be more effective in eliciting GC than texts.
Lastly, we are also interested in the extent to which the control conditions for each study might be perceived as neutral or contrary to the goal (henceforth: contrast control). For instance, for an observed prosocial goal in an experimental condition (i.e., someone provides help to another person), a neutral condition could be a situation without a prosocial context (i.e., nobody needs help). A situation contrary to the goal could be when selfish behavior is observed (i.e., someone does not provide help, although he or she could). It is expected that control conditions that are contrary to the goal might result in a stronger GC effect, as the contrary condition might inhibit the goal-directed behavior much more strongly than a neutral control condition.

Method
We defined general inclusion criteria for extracting published articles from databases and specific inclusion criteria to identify relevant studies in those papers. In addition, we developed our coding scheme for the GC studies based on a preliminary coding of five original papers. This was necessary due to differences in reported statistical information, which often affect the extraction of relevant effects. Hence, we used those five studies to gain general experience for the coding and how to deal with limited information. Those three steps (general inclusion of papers, specific inclusion of studies, and the coding of relevant effects) will be described in the following sections. All materials, data, codes, and a PRISMA-guidelines checklist for meta-analyses can be accessed in the accompanying Open Science Framework (OSF) project folder (https://osf.io/mxepy/).

General inclusion criteria for articles and database search
General theoretical eligibility
Prior to the database search, we committed ourselves to a definition and general criteria of GC as the benchmark, which we derived from the literature (e.g., Aarts et al., 2004; Dik & Aarts, 2007) and summarized in three crucial points.
First, for GC to occur, an observer has to observe or read about a behavior by another person that implies a certain goal, but the goal is not explicitly mentioned. This is crucial for GC from a theoretical perspective, as a behavior is the originator of the proposed cognitive process of goal inference. This distinguishes GC from other routes of goal activation, such as goal priming (e.g., Bargh et al., 2001; see also Weingarten et al., 2016), in which the goal itself is presented as a semantic concept. Second, the goal gets activated in the observer based on an inference from the observed behavior. This inference is assumed to happen automatically and outside the observer's conscious awareness (see also De Houwer & Moors, 2010). Third, even though the GC process should ultimately lead to the adoption of the goal by the observer resulting in goal-directed behavior or intentions, we also accepted studies focusing only on goal activation. From a design-specific viewpoint, GC should be demonstrated in an experimental-psychological study. That is, there has to be a goal manipulation including an experimental condition, in which participants observe a goal-directed behavior, and (some sort of) control condition, followed by a measure of automatic goal activation or goal pursuit.
The second and third points are not independent: the inference of the goal in the observer is theorized to occur quickly and automatically outside conscious awareness after the observation (Aarts et al., 2004, pp. 24−25). It has to be reiterated that non-automatic inference is usually of minor interest in GC-related research and therefore direct measures of the outcome rather serve as a manipulation check, which is why we do not include them as a central part of our definition. Only after an automatic inference should goal adoption occur, which is typically measured in participants' goal-directed behavior (i.e., goal pursuit) or intention for goal-directed behavior. Notably, inference is often described as a mediator between observation and adoption of the goal, but as the path from inference to goal pursuit or intention is rarely studied, we will focus on the relationships between goal manipulation and goal inference or goal pursuit.
The theory on GC does not set restrictions on the goals to be elicited. In accordance, the preliminary coding of the five papers revealed the use of a diverse range of goals. Therefore, we set no restriction either, and recognized all kinds of goals, be they 'academic achievements' (Wessler & Hansen, 2016), 'being helpful' (Dik & Aarts, 2007), or 'having casual sex' (Aarts et al., 2004). In the same vein, we expected that in an experimental setting the focal goal toward which the behavior is directed should be rather obvious to strengthen the manipulation (although behaviors are sometimes multifinal, see Shah, Kruglanski, & Friedman, 2003).

Systematic search
After fixing the definition and criteria, we conducted a systematic search between March and April 2018 in four databases for published work, namely PsychInfo, Web of Science, ScienceDirect, and JSTOR. In all four databases, we applied a similar search logic: we looked in the title, abstract, and keywords for the term 'goal' in combination with 'contagion,' 'social learning,' 'modelling,' 'modeling,' 'role model,' 'social standard' and its alterations, 'comparison standard' and its alterations, or 'observational learning' (for specific search syntaxes, see https://osf.io/w8b9m/). The initial systematic search resulted in k = 2821 articles.

Screening of articles
The articles were split equally among three trained student assistants, who applied the criteria of the general theoretical eligibility to separate relevant from irrelevant papers based on titles and abstracts. This culminated in k = 30 articles that remained of interest. Afterwards, the same coders applied the same criteria again, by looking into the 30 articles in more detail. After this screening, 17 articles had to be excluded, as it turned out they were not eligible, and 13 relevant articles (including the five preliminary articles) and one registered report (see section on unpublished studies) were kept.

Additional searches
However, some potentially relevant GC articles did not show up during this search, which is why we decided to additionally browse the citing literature on Google Scholar and the reference lists of articles that we obtained from the systematic search. We also performed another extended search on PsychInfo for studies with adult samples, using the keywords 'observational learning,' 'role model,' and 'social learning.' For this search, we excluded the word 'goal' to make sure we would not miss studies conceptually related to GC, despite not employing the same wording.
These procedures (the additional browsing and extended search) yielded a total of k = 2908 articles and documents (including the ones from the previous paragraph). We again checked the content of the articles, which yielded an additional k = 12 articles. These seemed to be of relevance due to a fitting experimental setup, although they did not necessarily self-identify as being GC related. In total, we found k = 24 articles that were of relevance as they potentially contained studies that would fit in this meta-analysis (see Figure 1).

Specific inclusion criteria for studies in published articles
After coding all studies that were initially theoretically eligible (resulting in a total of e = 96 effects that measured automatic goal activation or goal pursuit; e = 127 when also including explicit inference measures), we proceeded to look into the method sections of all selected studies to see if they also fit our specific inclusion criteria. 3 These criteria encompassed points such as whether there was an identifiable goal the authors wanted to elicit in their experimental design, whether participants were adults of at least 18 years of age, and whether the specific goal the experimenters wanted to trigger was not mentioned in between the manipulation and the measurement of the DV. This last point is crucial for a clear distinction of GC from goal priming because GC includes a goal-inference step, which would be obsolete if the goal is mentioned before the DV is measured. Other points were that the goal-directed behavior of the observed person was not identical to the one shown by the participant, in order to distinguish GC from role modeling (Morgenroth, Ryan, & Peters, 2015) or mimicry (Chartrand & Lakin, 2013), and whether the control condition differed sufficiently from the experimental condition (i.e., by not using an attenuated version of the experimental conditions). Finally, it was important that sufficient statistical information was reported according to our preliminary set criteria for the extraction of effects (see below).
We applied these specific criteria study by study to identify suitable effects (as there is often more than one DV measured per study) and had to exclude 38 effects, leaving e = 58 effects. Some of these effects were taken from the same studies and had to be either combined (e.g., if two effects from the same study were based on automatic activation) or reduced to the preferred effect (i.e., only pursuit was used when there was automatic activation and pursuit measured). Finally, this left us with e = 48 effects for the confirmatory analysis. It has to be noted that some effects from self-identified GC studies had to be eliminated from the confirmatory analysis, for instance, if goal pursuit was too close to the manipulation or the goal manipulation itself was too explicit about the goal. Those individual coding decisions that did not fit the criteria are marked in the accompanying spreadsheet as excluded (see https://osf.io/w8b9m/).

Figure 1: Decision tree for the article search leading to the confirmatory and extended analysis; * including effects that did not fit inclusion criteria, explicit inference, and effects from five preliminarily coded articles; all e = 127 coded effects can be found in the data matrix online.

Unpublished studies and studies from registered reports
We also looked for unpublished studies via the OSF and ProQuest using the same search terms as before (see search syntax in the OSF). Furthermore, we contacted relevant labs via email, asking whether there are more unpublished GC studies available. Neither approach yielded further results. Hence, we could only include effects from our own lab, which were not published at the time of the coding procedure, and published studies from registered reports (henceforth: RRs). As RRs employ the main peer-review procedure before the data collection and therefore before any bias can occur, they have to be treated differently from common publications (see Chambers, 2019). Data and codes for the extraction of effects from these studies, along with preregistrations when available, are also provided online.

Criteria for the extraction of relevant effects
We intended to code effects and variables that can broadly be summarized in four categories and which will be discussed in the following sections: effects relevant for GC (confirmatory effects); effects from all studies that passed the initial criteria of general theoretical eligibility (extended effects); effects hypothesized by the original authors (originally hypothesized effects); and potential moderators. The coding was done for all effects that passed the initial criteria for general theoretical eligibility.

Effects for the confirmatory analysis
We determined clear criteria for the extraction of relevant confirmatory effects. We were interested in DVs that represented automatic activation, goal pursuit or behavioral intention. However, these effects of the DV could be present as either main effects or simple effects in factorial designs (i.e., with independent variables in interaction).
We took the main effect if this was the only manipulation (i.e., a goal vs. control group) or when the original authors expected an attenuation of the GC effect through a second factor as a moderator. This latter case implied that the GC effect was present in both conditions of the second factor, although to a different degree.
There were two situations in which we would consider a simple effect (i.e., the GC effect as observed in one specific condition of factor 2): first, when a knockout effect was expected (i.e., there is the GC effect in the first condition of factor 2, but there is no effect in the second condition of factor 2); and second, when a crossover effect was expected (i.e., there is a reversed effect in the second condition of factor 2; see Giner-Sorolla, 2018). Furthermore, it could be possible that there were more than two conditions present on a factor. If this was the case, whereby the additional condition was a second control condition (e.g., control 1 vs. control 2 vs. goal), we took the effect that was most neutral and least opposed in comparison to the goal condition to ensure that the GC effect was not driven by the opposed control group. If more than two groups were present on the goal factor, whereby the additional condition was a second goal condition (e.g., control vs. goal low vs. goal high), we would aggregate the goal conditions as they both manipulated the goal of interest (see coding scheme: https://osf.io/jy9m3/).
Means, standard deviations (or standard errors), and group sizes were crucial descriptive statistics for the calculation of the effect size Hedges' g, which is the standardized mean difference Cohen's d corrected for positive bias (Hedges, 1981). During the coding phase, we found that reporting standards varied strongly across papers. Therefore, when descriptive statistics were not reported in studies, we based the calculation of the effect sizes on test statistics like t-values, F-values, χ²-values and r-estimates. An automated procedure is provided by Del Re's R package compute.es (Del Re, 2015). 4
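As a rough illustration of the effect-size calculation described above, the following Python sketch computes Hedges' g from group means, standard deviations, and group sizes, and alternatively from an independent-samples t-value. It is not the authors' code (they refer to the R package compute.es); the function names and the example numbers are hypothetical, and the small-sample correction uses the common approximation J = 1 - 3/(4df - 1).

```python
import math

def hedges_g(m1, sd1, n1, m2, sd2, n2):
    """Hedges' g and its approximate sampling variance for two independent groups."""
    df = n1 + n2 - 2
    # Pooled standard deviation across both groups
    sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
    d = (m1 - m2) / sd_pooled            # Cohen's d
    j = 1 - 3 / (4 * df - 1)             # approximate small-sample correction
    # Common large-sample approximation of the variance of d, scaled by J^2
    var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))
    return j * d, j**2 * var_d

def hedges_g_from_t(t, n1, n2):
    """Fallback when only an independent-samples t-value is reported."""
    d = t * math.sqrt(1 / n1 + 1 / n2)
    j = 1 - 3 / (4 * (n1 + n2 - 2) - 1)
    return j * d

# Hypothetical goal vs. control condition (made-up numbers, not from any study)
g, var_g = hedges_g(5.4, 1.2, 30, 5.0, 1.2, 30)
```

Both routes should agree when the t-value stems from the same two groups, which is why test statistics can stand in for missing descriptive statistics.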

Effects for the extended analysis
Using our definition and our conservative inclusion criteria for GC might have a side effect: our attempt at higher precision could work at the expense of variance that could explain moderating effects. That is, because we will probably have to exclude several studies for the confirmatory analysis, we will also reduce our chance of identifying moderating conditions. Therefore, we will conduct an extended, not preregistered analysis using all effects that passed the initial criteria for general theoretical eligibility, e = 96. Again, similar effects from these studies were combined, resulting in a sample of e = 71 effects.

Originally hypothesized effects
The originally hypothesized effects, which are the hypothesized effects in the primary studies by the original authors, were not always equivalent to the relevant effects for this meta-analysis as the former could contain specific interactions with other variables (e.g., other manipulations) that could deactivate or even reverse the goal contagion effect. Therefore, both effects have to be clearly distinguished to avoid certain effects being falsely attributed to GC alone (for a recent example of such a case see Crede, 2019, in response to Cuddy, Schultz, & Fosse, 2018). P- and Z-values for these effects were extracted for additional publication bias tests and power estimations (Schimmack & Brunner, 2017). Results of these tests complement the main results and are reported in the supplementary document (see https://osf.io/jx7rc/).

Moderators
As the GC effect was expected to be heterogeneous across studies, we intended to identify potential moderators of this effect, which we described in the theory section and in the supplementary materials. They encompass DV category (activation vs. pursuit), common goal (number of people expected to pursue the goal), presentation of manipulation material (texts vs. video clips and animations; excluding single pictures), contrast control (neutrality of the control condition), and self- versus other-directed goal (exploratory moderator). Other differentiations, such as whether the goals are more short-term versus more long-term oriented or approach-related versus avoidance-related (implying wins and losses), might also be of interest.
However, due to the relatively low number of studies (see results section), we decided to test a restricted number of moderators, thereby avoiding an inflation of the false-positive error rate.
Agreement for all codings was assessed based on the rating of two trained raters, who applied the specific inclusion criteria for all studies and extracted the relevant effects. The agreement was low for all coded effects, κ = .35, and also for the hypothesized effects, κ = .42, despite our carefully developed coding scheme (https://osf.io/jy9m3/; for details, see supplementary document: https://osf.io/jx7rc/). We conducted the confirmatory and extended analysis using the second rater's codings to see whether the summary effect would be different compared to rater 1. Interestingly, this was not the case: Despite nominal differences in individual estimates across raters, which yielded low reliability scores, the conclusion remained the same (see supplementary Figure S2). In the following, we will report results only from the first rater as the large majority of discussions upon initial disagreement resulted in agreeing with his codings.
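For readers unfamiliar with the agreement statistic reported above, the following Python sketch shows how Cohen's κ can be computed for two raters. How exactly the coded effects were categorized for the κ calculation is not described in this section, so the include/exclude codes below are purely hypothetical, as is the function name.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters assigning categorical codes to the same items."""
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater1)
    # Observed proportion of agreement
    p_obs = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Agreement expected by chance from each rater's marginal frequencies
    c1, c2 = Counter(rater1), Counter(rater2)
    p_exp = sum(c1[k] * c2[k] for k in c1) / n**2
    if p_exp == 1:
        return 1.0  # both raters used one identical category throughout
    return (p_obs - p_exp) / (1 - p_exp)

# Hypothetical include/exclude decisions for eight effects (illustration only)
r1 = ["in", "in", "out", "in", "out", "out", "in", "in"]
r2 = ["in", "out", "out", "in", "in", "out", "in", "out"]
kappa = cohens_kappa(r1, r2)
```

Because κ subtracts chance agreement, raters can agree on a majority of items and still obtain a low value, as in this toy example.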

Confirmatory and extended analysis
We conducted the random-effects meta-analysis using the restricted maximum likelihood (REML) estimator for estimating the between-study variance in true effect size for e = 48 effects (see Figure 1) from the published and unpublished literature. We opted for the random-effects model instead of the equal-effect model, because we wanted to estimate the average true effect size in the population from which the effects were randomly sampled (Borenstein et al., 2010). Those effects represented either a measure of automatic activation or goal pursuit and were based on 4751 participants. The results indicated a small summary effect of GC, Hedges' g = 0.30, 95%CI [0.21, 0.40] 5 (see Figure 2A; see also Figure 2B).
For the previous analyses, we combined intention and behavior/goal pursuit. As a behavioral intention can also be seen as different from actual behavior, we additionally report pure behavioral results for the confirmatory data (e = 26, Figure 2C) and extended data (e = 40, Figure 2D).
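The logic of the random-effects pooling used in this section can be sketched in Python as follows. Note the swap: the analysis reported here used the REML estimator, whereas this sketch uses the simpler closed-form DerSimonian-Laird moment estimator of τ², so its numbers would not exactly match a REML fit. The function name and the input values are made up for illustration.

```python
import math

def random_effects_dl(effects, variances):
    """Random-effects summary with the DerSimonian-Laird estimator of tau^2 (not REML)."""
    k = len(effects)
    if k < 2 or k != len(variances):
        raise ValueError("need at least two effects with matching variances")
    w = [1 / v for v in variances]                       # equal-effect weights
    mean_fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    # Cochran's Q: weighted squared deviations from the equal-effect mean
    q = sum(wi * (yi - mean_fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi**2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                   # truncated at zero
    w_re = [1 / (v + tau2) for v in variances]           # random-effects weights
    mean_re = sum(wi * yi for wi, yi in zip(w_re, effects)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))
    return mean_re, (mean_re - 1.96 * se, mean_re + 1.96 * se), tau2

# Made-up Hedges' g values and sampling variances (illustration only)
g_values = [0.55, 0.10, 0.40, -0.05, 0.30]
g_variances = [0.06, 0.02, 0.05, 0.01, 0.04]
summary, ci, tau2 = random_effects_dl(g_values, g_variances)
```

Adding τ² to each sampling variance flattens the weights, so small studies count relatively more than under the equal-effect model.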

Publication bias and correction
To assess potential publication bias, we first correlated the sample size per study with the size of the effect. The negative and significant correlation (see Figure 3A) hinted at potential publication bias as studies with a smaller sample size yielded larger effects (Kühberger et al., 2014). We split the data into published and unpublished relevant effects, where unpublished effects also included effects from RRs, as most publication bias corrections require traditional publications (without preregistration) only. Effects from both subgroups differed considerably: The summary effect of the published studies was larger, g = 0.42, than the summary effect of the unpublished studies and RRs, g = -0.01, which in turn was very close to zero (i.e., no effect). Interestingly, heterogeneity was much smaller and nonsignificant in the subgroups, indicating limited potential for moderating effects.
We proceeded by funnel-plotting the effect sizes of the published effects against the standard errors of the effects (see Figure 3B). Egger's regression (Egger et al., 1997), which is depicted as the diagonal dashed line in Figure 3a, b = 2.45, t = 4.86, p < 0.001, also suggested that small-study effects were present; one possible cause of small-study effects is publication bias. The estimates for the other three models yielded similar results and can be found in Supplementary Table S1.
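To make the logic of Egger's regression more concrete, the following Python sketch implements its classical form: each effect is divided by its standard error and regressed on precision (1/SE), and an intercept that differs from zero signals funnel-plot asymmetry. This section does not state which parameterization produced the reported coefficient, so the sketch is an assumption in that respect; the function name and the data are made up.

```python
import math

def egger_test(effects, std_errors):
    """Egger's regression: standardized effect (y/se) regressed on precision (1/se).

    Returns the intercept (asymmetry estimate) and its t statistic (df = k - 2).
    """
    k = len(effects)
    if k < 3:
        raise ValueError("need at least three effects")
    x = [1 / se for se in std_errors]                    # precision
    z = [y / se for y, se in zip(effects, std_errors)]   # standardized effect
    x_bar, z_bar = sum(x) / k, sum(z) / k
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (zi - z_bar) for xi, zi in zip(x, z)) / sxx
    intercept = z_bar - slope * x_bar
    # Residual variance and standard error of the intercept (ordinary least squares)
    resid_var = sum((zi - intercept - slope * xi) ** 2 for xi, zi in zip(x, z)) / (k - 2)
    se_intercept = math.sqrt(resid_var * (1 / k + x_bar**2 / sxx))
    return intercept, intercept / se_intercept

# Made-up data in which less precise studies report larger effects
std_errors = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35]
effects = [0.12, 0.25, 0.28, 0.45, 0.48, 0.62]
bias, t_stat = egger_test(effects, std_errors)
```

In this toy data set the intercept is clearly positive, which is the pattern expected when small studies with small effects are missing from the literature.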
Next, we applied several older and more recent correction methods for the effect size estimate (descriptions are provided in Table 1). 6 For both the confirmatory and extended model, all corrections brought the estimate closer to zero. Assuming that the true effect is around g = 0.15, this corresponds to 17 to 22 participants that have to be exposed to a goal contagion manipulation in order to find one person who is actually influenced by the observation compared to a control group. 7
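The translation of g = 0.15 into a number of participants can be illustrated with a short Python sketch. The footnote describing the authors' conversion is not part of this section, so the method below is an assumption: it uses a conversion from a standardized mean difference to a number needed to treat that is commonly attributed to Furukawa and Leucht (2011), together with hypothetical control-group rates of 50% and 20%.

```python
from statistics import NormalDist

def nnt_from_g(g, control_rate):
    """Number needed to treat implied by a standardized mean difference.

    control_rate is the assumed proportion of the control group that shows the
    outcome (a hypothetical input; it is not reported in the text).
    """
    std_normal = NormalDist()
    # Shift the control group's threshold on the latent normal scale by g
    treated_rate = std_normal.cdf(std_normal.inv_cdf(control_rate) + g)
    return 1 / (treated_rate - control_rate)

# Assumed control-group rates of 50% and 20%
for rate in (0.5, 0.2):
    print(rate, round(nnt_from_g(0.15, rate), 1))
```

Under these assumptions the two results fall close to 17 and 22, in line with the range given in the text, although the authors' actual inputs may have differed.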

Results for the moderators
We conducted meta-regressions for all preregistered (i.e., presentation, DV category, common goal, and contrast control) and exploratory moderators (prosocial/cooperative goal vs. self-serving goal) individually and controlling for the other variables. However, results were always similar: There was virtually no evidence that any moderator showed an expected effect (see Figure 4). Only the zero-order effect of contrast control had a slope coefficient different from zero, which was also tiny in size. However, when we accounted for whether the effect came from the published or unpublished literature, this effect vanished as well (for

General Discussion
Observing someone pursuing his or her goals can have a profound effect on ourselves, maybe even to the extent that we adjust our own goals. The theory on GC (Aarts et al., 2004) offers a social-cognitive approach to this phenomenon, according to which observing someone's goal-directed behavior leads to an automatic inference and activation of the goal in the observer and potentially to behavior directed towards a similar goal. Here, we set out to summarize the evidence for GC in a meta-analysis and to identify moderators of this process, based on 48 effects for the confirmatory analysis and 76 effects for the extended analysis. First, we initially found an overall summary effect of Hedges' g = 0.30. However, this effect is small, and it seems to be biased through the current state of the publication system. On average, published studies reported larger effects, g = 0.42, than unpublished studies and RRs, g = -0.01. Publication bias correction methods estimated the effect size to be around half the size of the uncorrected summary effect. The suggestion of a true effect around g = 0.15 was further supported by the extended analyses after correction. All in all, GC appears to have a rather weak effect, as one needs around 20 people to find one person being affected by the observation. Hence, with regard to the GC effect, observing others is not something that can be expected to distract people from their daily activities all the time. Rather, the GC effect, if it exists, might be limited to particular instances in contexts that are hard to pinpoint, which also becomes clear when looking at the moderators.
Second, when looking at the DV category, there was no noteworthy difference between behavioral and automatic-activation outcome measures. This is somewhat surprising, as one could expect stronger effects on activation, which is regarded as a prerequisite for a goal to be adopted. Furthermore, focusing on behavioral goal pursuit only (without intention for behavior) yielded effects similar to those of the overall analysis. Taken together, we found no evidence indicating that the GC effect is stronger or more easily detectable depending on the deployed DV (automatic activation, intention for behavior, or behavioral goal pursuit).
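The figure of "around 20 people" mentioned above can be illustrated with a short calculation. The sketch below is not taken from our analysis scripts; it assumes the common conversion from a standardized mean difference d to the number needed to treat, NNT = 1 / (Φ(d + Φ⁻¹(CER)) − CER), where CER is the control event rate and Φ the standard normal distribution function. The function name nnt_from_d is hypothetical. Under this assumed formula and d = 0.15, a CER of 0.5 gives roughly 17 and a CER of 0.2 roughly 22, the two values given in Note 7.

```python
from statistics import NormalDist


def nnt_from_d(d: float, cer: float) -> float:
    """Number needed to treat for a standardized mean difference d.

    Assumed conversion (not confirmed as the exact method of the article):
    the event rate in the goal condition is Phi(d + Phi^-1(CER)), and the
    NNT is the reciprocal of the difference between the two event rates.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    experimental_event_rate = std_normal.cdf(d + std_normal.inv_cdf(cer))
    return 1.0 / (experimental_event_rate - cer)


if __name__ == "__main__":
    # Bias-corrected effect size discussed in the text
    for cer in (0.2, 0.5):
        print(f"CER = {cer:.1f}: NNT = {nnt_from_d(0.15, cer):.1f}")
```

The NNT depends strongly on the assumed control event rate, which is why a range rather than a single number is reported.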
Third, we did not find evidence for moderating effects, which we discuss in more detail in the supplementary materials (see https://osf.io/jx7rc/). Only the difference between the GC condition and the control condition became slightly more pronounced the more contrary the goals in the two conditions were. But given that several studies used control conditions contrary to the goal condition (e.g., Dik & Aarts, 2007; Laurin et al., 2016), it is surprising that this effect is not much more pronounced. One reason why contrary-goal control conditions are not as effective as one would think could be that they often indirectly imply the actual goal (Moskowitz & Gesundheit, 2009). For instance, reading about participants doing voluntary work (as a goal contrary to earning money; see Aarts et al., 2004) could still activate the goal of earning money in some participants (e.g., Corcoran et al., 2018) and hence reduce the GC effect.
Fourth, the unpublished studies used larger samples than the published studies, which contributed to their higher precision (i.e., smaller confidence intervals). However, it has to be pointed out that the unpublished studies differed on aspects other than larger samples. Foremost, they partially used different goals, such as physical activity, which were not used in any of the published studies. Hence, it is possible that aspects regarding the goal selection contributed to the considerably smaller effects.
As with many meta-analyses in psychology, this one, too, has limitations. An obvious one is the partially low interrater agreement. We intended to ensure that our preregistered coding scheme would produce similar results independent of the raters, as this is currently not the standard in quantitative research (Maassen et al., 2019). This turned out to be difficult because both the relevant effects and the originally hypothesized effects were often not clear enough to the coders, despite extensive training with five preliminarily coded articles as a pilot. Hence, the low scores might partly depend on the experience level of the coders and partly on the large variation of the experimental designs. The latter point should not be underestimated: Different designs result in different effect sizes, which have to be transformed into common effect sizes, which in turn can introduce bias. Concerning the first point, it is important to note that discussions resolved most of the disagreements. Moreover, when we conducted separate meta-analyses for the different raters, the overall effects did not differ, despite differences in individual effects. This result, however, has to be treated with caution, as studies were not randomly assigned to these raters.
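To make the transformation step concrete, the following sketch shows textbook conversions of a t-value and of a correlation coefficient r into a standardized mean difference d, followed by the small-sample correction that yields Hedges' g. These are standard formulas and are shown for illustration only; they are an assumption and not necessarily the exact conversions applied in our coding scheme, and the function names are hypothetical. The r-to-d conversion additionally assumes a point-biserial correlation with roughly equal group sizes, which is one way in which such transformations can introduce bias.

```python
from math import sqrt


def d_from_t(t: float, n1: int, n2: int) -> float:
    """Cohen's d from an independent-samples t-value and the group sizes."""
    return t * sqrt(1.0 / n1 + 1.0 / n2)


def d_from_r(r: float) -> float:
    """Cohen's d from a point-biserial r (assumes roughly equal groups)."""
    return 2.0 * r / sqrt(1.0 - r ** 2)


def hedges_g(d: float, n1: int, n2: int) -> float:
    """Apply the usual small-sample correction J = 1 - 3 / (4 * df - 1)."""
    df = n1 + n2 - 2
    return d * (1.0 - 3.0 / (4.0 * df - 1.0))


if __name__ == "__main__":
    # Hypothetical study: t = 2.0 with 50 participants per condition
    d = d_from_t(2.0, 50, 50)
    print(f"d = {d:.3f}, g = {hedges_g(d, 50, 50):.3f}")
```

Because each conversion rests on its own assumptions, two coders who extract different statistics from the same study can arrive at different values of g.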
An issue related to varying designs that lowered the interrater agreement was the ambiguity of the goals and their respective operationalization as manipulation and DV. This ambiguity is problematic from a theoretical perspective because it might imply that the theory is not specific enough. We illustrate this with achievement goals here, but the same point could be made with other goals. Achievement goals were used several times to demonstrate the GC effect, but the manipulations and outcome variables were treated differently by different authors (i.e., concepts instead of studies were replicated; see Chambers, 2017), which makes it difficult to conclude that achievement goals generally show or do not show a GC effect. For instance, Leander and Shah (2013, Study 3a) had participants read about a student who had either an immediate or a distant deadline for a semester paper (manipulation), which led participants to work with higher or lower persistence on anagram tasks, respectively (DV): the closer deadline exerted a larger GC effect. Tobin, Greenaway, McCulloch, and Crittall (2015, Study 1) used a similar manipulation but had participants write an essay, the quality of which was assessed by different raters as the DV. In both examples, the original authors argued that the behavioral outcome measure was indicative of an achievement goal being activated. In other examples, the manipulation materials also differed (e.g., Dik & Aarts, 2007; Brohmer et al., 2018). It has to be emphasized that a high degree of ambiguity among conceptually similar studies was the rule rather than the exception. This ambiguity was fueled even further by differences between control groups and by additional interacting variables.
Given this variety, it is even more interesting that only moderate heterogeneity across studies was found. In light of this, two interpretations are possible: either the corrected summary GC effect across studies that we extracted is so clear that it truly represents an existing effect, or the summary GC effect is no more than an artifact of consistently selective reporting in the literature, independent of the designs. The strong evidence for publication bias for both the extracted effects and the originally hypothesized effects (see supplementary document) along with the drop in estimated between-study variance after the publication-bias correction rather indicate the latter.
Certainly, we do not intend to dismiss individual studies or even the theory on goal contagion as a whole. In fact, goal contagion remains an elegant approach to explain how people can become inspired by their peers, which needs to be explored further. However, similar to other topics of social psychology (e.g., Friese et al., 2017; O'Donnell et al., 2018; Simmons & Simonsohn, 2017), goal contagion also suffers from many early studies that showed unreasonably large effects with low sample sizes. Future studies should avoid these pitfalls of low power, which increase both false-negative and false-positive findings.
Furthermore, the theory in its current state seems to be underspecified, despite 15 years of research. There are many examples of this underspecification: for instance, it is not clear whether measuring the accessibility of a goal concept already suffices to say that a goal inference took place, whether goal inference can or must occur quickly and automatically, or whether a successful GC process should manifest itself only in goal pursuit or also in behavioral intentions. One reason for this underspecification potentially lies in a body of research without close replications, in which new studies almost always introduced changes in the research designs. Changes in research designs to identify the boundaries of a theory are, of course, important for the development of a theory. But they become problematic when the empirical basis for extended designs is uncertain and thin (see Chambers, 2017, Chapter 3). Moreover, some changes in the research designs of published studies did not correspond well to the original theory, which does not allow for more nuanced conclusions (see Note 8).
As a starting point for future research on goal contagion, we think that the theoretical processes underlying GC must be better specified (e.g., by clearly identifying causal effects or by applying computational modeling techniques; see, e.g., Rohrer, 2018; Smaldino, 2019; Guest & Martin, 2020). Also, one should assume very small GC effects, with standardized mean differences of around 0.15, based on the publication bias methods for the confirmatory and extended analyses. This, of course, corresponds to sample sizes of at least 1102 participants (one-sided t-test, 1-β = 0.80, α error probability = 0.05), which potentially require collaborations between labs to be achieved. Anything below this number does not seem to be reasonable, as the literature does not contain enough reliable information on designs that allow for much smaller sample sizes.
Additionally, applying open science practices, such as data and material sharing and preregistration, will become necessary so that researchers can learn from each other more efficiently.
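The sample size recommendation above can be approximated with a few lines of code. The sketch below is an illustration under stated assumptions, not the power analysis we ran: it assumes a two-group between-subjects design with equal group sizes and uses the normal approximation n per group = 2 · ((z₁₋α + z₁₋β) / d)², whereas the figure of 1102 presumably stems from an exact calculation based on the noncentral t distribution. The function name n_per_group is hypothetical. For d = 0.15, a one-sided test, α = 0.05, and power = 0.80, the approximation gives 550 participants per group (1100 in total), which is slightly below the reported 1102.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80,
                one_sided: bool = True) -> int:
    """Approximate per-group sample size for a two-group mean comparison.

    Uses the normal approximation to the t-test, so the result can be
    slightly smaller than the value from an exact noncentral-t calculation.
    """
    std_normal = NormalDist()
    tail = alpha if one_sided else alpha / 2.0
    z_alpha = std_normal.inv_cdf(1.0 - tail)   # critical value of the test
    z_power = std_normal.inv_cdf(power)        # quantile for the target power
    return ceil(2.0 * ((z_alpha + z_power) / d) ** 2)


if __name__ == "__main__":
    n = n_per_group(0.15)
    print(f"per group: {n}, total: {2 * n}")
```

Switching to a two-sided test (one_sided=False) raises the requirement noticeably, which underlines why multi-lab collaborations may be needed.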

Conclusion
GC is a social-cognitive approach to understanding how observing others can affect our own goal-directed behavior. However, there are indications of publication bias within the published literature, and the most recent studies yielded effects clustering around zero. Potential moderators that could advance the theory on GC could not be identified in this meta-analysis, either. We strongly suggest applying open science practices and determining the required sample sizes based on a power analysis in future research to bring goal contagion back on track.

Notes
1 It has to be noted that the original GC articles employed the term 'implicit' for the automatic activation and inference processes. As the conceptualization of 'implicit' is vague (it could mean 'automatic', 'associative', or 'indirect'; see Corneille & Hütter, 2020), we will stick to the term 'automatic' in this article.
2 Note that in the preregistration, we refer to the 'common goal' moderator as the 'basic goal' moderator, see https://osf.io/zgqub/.
3 Please note that our specific inclusion criteria were extended during the coding of articles. Any changes made to the preregistered coding scheme are documented online by date (see https://osf.io/w8b9m/).
4 Note that correlation coefficients r were included during the coding procedure, as it turned out that some studies reported them rather than regression coefficients with accompanying t-values. Additional exploratory coding that was not considered a priori is described online, and the corresponding exploratory analyses are reported in the supplementary document.
5 A reviewer suggested an empirical Bayes model, which yielded a similar effect, g = 0.31, 95% cred. int. [0.22, 0.41].
6 Note that Trim and Fill has been criticized by methodologists (Terrin et al., 2003; Simonsohn et al., 2014). We also preregistered Orwin's Fail-Safe-N (Orwin, 1983), which has also been criticized and is recommended not to be used anymore (Becker, 2005;.
7 We would like to thank the anonymous reviewer who hinted at using the NNT (number needed to treat) as an intuitively interpretable effect size. We used Control Event Rates of 0.2 and 0.5 to reach 17 and 22 participants, respectively, see Magnusson (2020).
8 For example, several studies (Loersch et al., 2008; Fast & Tiedens, 2010) included emotional and affective manipulations, but did not explicate how these correspond to the theory on GC. See more examples at https://osf.io/dkxsr/.