Reciprocity through ratings: An experimental study of bias in evaluations

Abstract This paper studies the potential for ratings of seller quality to be influenced by side payments to raters. In a laboratory setting, we find that even modest side payments from sellers to raters have large effects, with the type of rating (favorable or unfavorable) given to a seller determined primarily by how large a monetary transfer the seller makes to the rater. Our results demonstrate that side payments can crowd out a rater’s concern for buyers, even in situations where there is no potential for long-term relationship building.


I Introduction
Online commerce is a major component of the economy, accounting for $389 billion in sales in 2016, or 8.0% of all US retail. 1 This is an increase of 14.4% from 2015, compared to only 2.8% growth in total retail. Rating systems are one of the main facilitators of online (and many offline) transactions, allowing consumers to share information about a product's quality with future buyers. The usefulness of that information is not guaranteed, however, as many rating systems do not provide any explicit reward for giving accurate ratings, leaving open the possibility that other, potentially harmful, incentives may influence the decision to rate. This paper experimentally studies the effect of one such incentive: side payments made by sellers in an attempt to influence ratings. We show that such payments can heavily influence rater behavior, decreasing the usefulness of ratings for future consumers.
In many settings, consumer-generated ratings are public goods and are regularly inefficiently under-supplied in free markets (Avery, Resnick, and Zeckhauser 1999). As a result, encouraging the provision of online ratings has been an emphasis in existing ratings research, leading to several explanations as to why consumers provide ratings, and how institutions can encourage more frequent rating (Chen et al. 2010;Li and Xiao 2014). The main challenge when attempting to encourage ratings comes from the need to overcome small, often implicit, costs that discourage consumers from rating, while also maintaining the accuracy of any ratings that are generated. Some incentives, whether external rewards, or internal motivations, can bias ratings sufficiently to diminish or even eliminate all benefits to consumers (Lafky and Wilson 2015).
One motivation to rate is the desire to reciprocate good treatment with good ratings (positive reciprocity) and ill treatment with poor ratings (negative reciprocity). Both negative and positive reciprocity have been well-documented, showing that individuals are willing to forgo earnings to punish or reward behavior based on perceived fairness (Fehr and Gächter 2002; Fehr and Fischbacher 2004; Nikiforakis and Mitchell 2014; Balafoutas, Nikiforakis, and Rockenbach 2014; Nosenzo et al. 2015). 2 Reciprocity has been demonstrated to play a role in rating behavior in the field, in settings such as Airbnb (Fradkin, Grewal, and Holtz 2019) and eBay (Resnick and Zeckhauser 2002). The role of reciprocity is especially visible in two-sided rating environments, in which both buyers and sellers rate one another. When both sides of the market can rate one another, ratings are used strategically to punish or reward other members of the market and to protect one's own reputation (Klein et al. 2006; Klein et al. 2009; Bolton, Greiner, and Ockenfels 2013). Such strategic concerns can damage the informational content of ratings, as raters become more concerned with protecting their own reputation than with accurately describing quality.
One-sided rating environments, which are the focus of this paper, minimize such strategic considerations. Raters in one-sided systems have been shown to exhibit altruistic concern for informing buyers, as well as reciprocity toward sellers based on the seller's choice of quality (Lafky 2014). Reciprocity based on seller quality is not, in itself, troubling as raters face no dilemma over whom to favor with their ratings. If a rater gives an unfavorable rating intended to punish a seller for offering low quality or a favorable rating intended to reward a seller for offering high quality, they are also informing other buyers of the seller's true quality.
A natural, though potentially harmful, way for sellers to exploit reciprocity is through the use of side-payments to raters. Side-payments and gifts have been experimentally investigated in a variety of economic contexts, primarily focused on public sector bribery and corruption, harassment bribes (Abbink et al. 2014), or petty bribes between officials and citizens (Barr and Serra 2010). Similarly, gift-giving has been shown to corrupt principal-agent relationships, with principals more likely to act against an agent's interests following a gift from an interested third party (Malmendier and Schmidt 2017). A rater who desires to reciprocate a seller's side-payment (or lack thereof) may be tempted to do so at the expense of accurate ratings, resulting in two competing objectives: reciprocity toward the seller in response to a side-payment, and a desire to accurately inform future buyers.
At their simplest, side-payments may take the form of a direct monetary bribe from a seller to a rater. In practice, side-payments might be less explicit, embodied by more subtle forms of generosity. For example, a company might send an evaluation copy of a product to a reviewer, allowing the reviewer to keep the item after the review is complete as a subtle form of gift. 3 We ask whether such favorable treatment, separate from the product's underlying quality, will influence the type of rating chosen for the company. Specifically, we consider environments in which a payment toward the rater precedes the choice of rating and cannot be contingent upon the rating given. When side-payments are possible, it becomes difficult to disentangle whether ratings are motivated by reciprocity over quality versus reciprocity over side-payments. To isolate reciprocity over side-payments, we implement a simple, stylized experiment in which the seller's quality is exogenously determined. In other words, we strip away the seller's quality decision, and therefore responsibility for the quality that they offer, leaving them with only the choice of side-payment. 3 We refer specifically to instances in which free products are a form of preferential treatment for reviewers over general customers, such as in the case of the Amazon Vine program. Such exchanges are distinct from, for example, including a bonus item free of charge with every purchase. A company that routinely bundles a product for free to all customers (e.g., free headphones with purchase of a new phone) is essentially just selling a higher-quality product, while reviewers who preferentially receive free demo products have a more favorable experience than can reasonably be expected for the general customer.
By exogenizing quality, we remove the effect of reciprocity over product quality, and are able instead to cleanly observe the relative strength of a rater's reciprocity toward sellers versus their desire to aid future buyers. Because sellers are not responsible for their quality, we do not have our raters "buy" from the seller, meaning that their payoff is not directly affected by the seller's quality. While raters in the field are often themselves buyers who have experienced the seller's quality, this design again helps to isolate the tension raters face between concern for transfer size and concern for future buyers.
As most rating environments exhibit at least a small cost of rating, arising from the opportunity cost of a rater's time, or the nominal mental effort required to evaluate a product (Kamei and Putterman 2018), we examine the effect of side payments in two experimental environments, one in which rating is costly and one in which it is free. While the costly treatment provides a more realistic description of typical rating systems in the field, the free treatment allows us to cleanly observe the tradeoff raters face between concern for buyers and concern for sellers. Although we are primarily interested in the effect of side-payments on ratings, the difference in costs between treatments also allows us to examine the opposite direction, asking whether the ease with which consumers can punish or reward via ratings influences how generous sellers are with side-payments. This is similar to Xiao (2013) who studies the effectiveness of punishment when the decision to punish is either free or actually beneficial to the punisher.
We are not the first to suggest that sellers may manipulate ratings in their favor, for example by fraudulently reviewing themselves or their competitors (Dellarocas 2006;Mayzlin, Dover, and Chevalier 2014), or by exploiting a good reputation to sell low-quality products (Dini and Spagnolo 2009). Our contribution is to use a controlled environment to show that non-contingent side-payments, even without the possibility of future interactions, can heavily influence ratings.
Our results show that the size of the transfer from the seller is the primary determinant of the type of rating sent. That is, simply giving a gift is not sufficient to curry favor with a rater; rather, the magnitude of the gift determines the likelihood of a favorable rating. Raters give favorable ratings when sellers make large transfers and unfavorable ratings for small transfers, regardless of the seller's underlying quality. As a result, we observe many "misleading" ratings that are not in buyers' best interest. We also find that sellers are sensitive to the cost of rating, giving smaller transfers when it is more costly for consumers to provide ratings.
Before discussing our results, we first develop a simple model of rating behavior, initially for strictly self-interested agents, and then for raters with social preferences.

II Theory
Our environment consists of three players: a seller, a rater, and a buyer. The game begins with the seller drawing a quality q from the uniform distribution U[0, q_max], and choosing an amount of money, a ∈ [0, a_max], to transfer to the rater. The seller's quality is exogenously determined in order to isolate rater reciprocity over the size of the transfer, separate from possible reciprocity over the seller's quality. The rater learns the seller's q and a, and then has the option to pay a cost c ≥ 0 in order to send a binary rating r ∈ {G, B} (i.e., Good or Bad), or pay no cost and send the empty message r = ∅.
The buyer next learns what rating, if any, was given, and must choose either the seller or an unknown outside option. If the buyer chooses the seller, the buyer receives a payoff equal to q, while the seller receives a fixed payoff π that does not depend upon q. If the buyer chooses the outside option, the buyer's payoff is a random draw, k, from U[0, q_max], and the seller's payoff is zero. The distribution of the outside option is the same as that of q, meaning that the outside option can effectively be thought of as an alternative, untried seller for the buyer to choose. Denote the utility of the seller, rater, and buyer by U_S, U_R, and U_B, respectively.
If we assume strictly self-interested preferences, the prediction in this environment is straightforward. Raters exhibit no reciprocity based on transfer size, which means that sellers have no ability to influence ratings and therefore always choose transfers of zero. Similarly, raters have no incentive to inform buyers of seller quality. In short, sellers make no transfers, raters do not rate, and buyers receive no information as to whether the seller or outside option is a better choice.
Next, consider the model in which the rater has other-regarding preferences for both the seller and the buyer, given by
U_R = a − c · 1_rate + α(a) · U_S + β · U_B,
where α(a) and β measure the rater's concern for the seller and buyer, respectively, and 1_rate is an indicator function for whether a rating was given. We assume that there is some a* such that α(a) > 0 when a > a* and α(a) < 0 when a < a*, but we do not assume any specific functional form. 4 The rater therefore exhibits either positive or negative concern for the seller, depending upon how generous the seller was with the choice of a. If the seller was (subjectively) generous, then the rater exhibits altruistic concern for the seller; if instead the seller is perceived as being selfish, then the rater prefers to see the seller worse off.
Unlike α(a), the rater's concern for the buyer, β, is not a function but a fixed parameter. This is because the rater does not observe any behavior from the buyer on which to condition positive or negative concern. The rater's payoff is not directly influenced by the seller's exogenously determined quality, and as a result the rater's desire for reciprocity depends only on the transfer, a, and not on the quality, q.
The rater will give a positive rating, r = G, if the utility from rating is greater than the utility from not rating: 5
U_R(r = G) > U_R(r = ∅).
For simplicity, we assume that the buyer treats ratings credulously and randomizes over the choices if no rating is sent. 6 We next substitute for U_S and U_B to get:
a − c + α(a) · π + β · q > a + α(a) · (π/2) + β · (q + E[k])/2,
which can be rewritten as:
α(a) · π + β · (q − E[k]) > 2c. (1)
Intuitively, Equation 1 says that the rater is more likely to give a positive rating the more generous the seller is (larger a), the higher the seller's quality (larger q), or the lower the cost of rating (smaller c). Generosity and quality substitute for one another, so a high-quality seller can get away with being less generous to the rater, whereas a low-quality seller must be more generous to receive a positive rating.
Likewise, the rater will give an unfavorable rating, r = B, if:
−α(a) · π − β · (q − E[k]) > 2c. (2)
4 We assume only that raters exhibit positive reciprocity for sufficiently high transfers, and negative reciprocity for sufficiently small transfers; no specific functional form is required. 5 In order to give a positive rating, the rater actually needs to prefer r = G to both r = ∅ and r = B; however, the condition for preferring r = G to r = ∅ implies the condition for preferring r = G to r = B. Similarly, when we consider the decision to give a negative rating, below, we need only compare r = B to r = ∅. These comparisons are made explicit in the appendix.
6 For the buyer to behave credulously and follow the rater's advice, we only need that E(q|r = G) > E(q), meaning that positively rated sellers have a higher expected value than the outside option, and E(q|r = B) < E(q), meaning that negatively rated sellers have a lower expected value than the outside option. In other words, we require only that ratings indicate the better option more often than not. If these conditions do not hold, the buyer either ignores ratings entirely, or infers the opposite of a rating's literal meaning (e.g., choose the seller with a negative rating, avoid the seller with a positive rating). In both instances we end up with a babbling equilibrium in which no communication occurs and no transfers are made. Babbling is always possible in communication environments, but as we will show later, we do not see evidence for such behavior in our experiment, as buyers overwhelmingly follow rater recommendations.
The intuition conveyed by Equation 2 is similar to before: the rater is more inclined to give a negative rating the less generous the seller is, the lower the seller's quality, or the lower the cost of rating.
When neither Equation 1 nor Equation 2 holds, that is, when the social-preference terms are too weak in either direction to justify the cost of rating, the rater prefers to give no rating. Overall, our model predicts that as the sum of the utilities the rater experiences due to social preferences increases, the rater transitions first from giving a negative rating, to no rating, to eventually giving a positive rating.
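The threshold structure above can be illustrated with a short simulation. The linear form of α(a) and all parameter values below are our own hypothetical choices, not part of the model; the point is only that as the social-preference term grows, the rater's choice moves from a negative rating, to no rating, to a positive rating.

```python
# Illustrative sketch of the rater's decision rule (Equations 1 and 2).
# The linear alpha(a) and all parameter values are hypothetical.

Q_MAX = 16.0
PI = 16.0        # seller's payoff when chosen (illustrative)
E_K = Q_MAX / 2  # expected value of the outside option

def alpha(a, a_star=6.0, slope=0.05):
    """Concern for the seller: positive above a_star, negative below."""
    return slope * (a - a_star)

def rating(a, q, beta=0.2, c=1.0):
    """Return 'G', 'B', or None according to the threshold conditions."""
    s = alpha(a) * PI + beta * (q - E_K)  # social-preference term
    if s > 2 * c:
        return "G"   # Equation 1 holds: positive rating
    if -s > 2 * c:
        return "B"   # Equation 2 holds: negative rating
    return None      # neither holds: no rating

# Holding quality fixed, larger transfers move the rater from B, to no rating, to G.
choices = [rating(a, q=8.0) for a in range(0, 17)]
```

Under these parameters, small transfers trigger punishment, large transfers trigger reward, and a middle band of transfers produces no rating at all.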
We summarize our analysis with four predictions, which we later take to the lab:
Prediction 1. Raters may rate even when it is costly to do so.
Equations 1 and 2 show that both conditional altruism toward the seller and unconditional altruism toward the buyer will motivate raters to pay the cost of rating.
Prediction 2. High-quality sellers make smaller transfers than low-quality sellers.
Due to the tradeoff between quality and transfer size in Equation 1, lower quality sellers need to offer larger transfers than their high-quality equivalents in order to generate the same rating.
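The substitution behind Prediction 2 can be made concrete by solving Equation 1 for the smallest transfer that triggers a favorable rating. The linear α(a) and all parameter values below are our own illustrative assumptions, not estimates from the experiment.

```python
# Minimum transfer for a favorable rating under a hypothetical linear alpha(a):
# alpha(a) = slope * (a - a_star). Solve alpha(a)*pi + beta*(q - E_k) = 2c for a.

PI = 16.0   # seller's payoff when chosen (illustrative)
E_K = 8.0   # expected value of the outside option

def min_transfer(q, a_star=6.0, slope=0.05, beta=0.2, c=1.0):
    """Transfer at which Equation 1 binds, for a seller of quality q."""
    return a_star + (2 * c - beta * (q - E_K)) / (slope * PI)

# Higher quality -> smaller required transfer: quality and transfers substitute.
low_q_transfer = min_transfer(q=4)    # a low-quality seller must pay more
high_q_transfer = min_transfer(q=12)  # a high-quality seller can pay less
```

The decreasing relationship between q and the required transfer is exactly the tradeoff that motivates Prediction 2.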
Prediction 3. Choice of rating depends upon both quality and the size of the transfer. A rating is more likely to be favorable (unfavorable) when quality is higher (lower) or when a larger (smaller) transfer occurs.
As quality or the transfer becomes more extreme (either high or low), the stronger is the rater's concern due to social preferences. Sellers with moderate quality or moderate transfers may not elicit a sufficiently strong level of concern from the rater to overcome the cost of rating.
Prediction 4. Raters will mislead buyers to reward or punish sellers for the size of their transfers.
If reciprocity toward sellers dominates altruism toward buyers, ratings will be given solely to aid sellers and will not reveal actual seller quality. We next evaluate these predictions against experimental data.

III Experimental Design and Procedures
Our experiment was conducted at the Cleve E. Willis Experimental Economics Laboratory at the University of Massachusetts, Amherst using z-Tree experimental software (Fischbacher 2007). A total of 396 subjects were recruited from the student population using the ORSEE recruitment system (Greiner 2015). Each session lasted 45 minutes on average, including time for instructions, participation in the experiment, a demographic questionnaire and being paid in private at the conclusion of the session. Subjects were randomly assigned to cubicles that were separated from each other visually and physically, and subjects were prohibited from speaking. All subjects received a common set of instructions and all questions were answered privately.
We implemented a one-shot between-subject design with two treatments. 7 Participants were randomly assigned to groups of three people. Within each group, participants were randomly assigned the roles of seller, rater and buyer, which we referred to in the experiment as Participant A, Participant B and Participant C, respectively. Every group contained exactly one participant of each type.
Within each group, the seller was randomly assigned an integer drawn from 1 to 16 inclusive. 8 This draw represented seller quality, though it was referred to in the experiment simply as "the X number." Randomly drawing quality provides us with crucial exogenous variation to examine the relationship between quality, transfers and choice of rating. After learning the randomly chosen quality, the seller was given an endowment of $16, and was given the option to transfer any amount of that endowment to the rater in $1 increments. 7 A variety of experiments with punishment, deceit and third parties use one-shot games to remove the reputational concerns of repetition and to increase the salience of the single choice. Fehr and Fischbacher (2004) and Carpenter (2007) use one-shot choices for third-party punishment. Falk and Kosfeld (2006), Ziegelmeyer, Schmelz, and Ploner (2012) and Burdìn, Halliday, and Landini (2018) use a one-shot game in two and threeparty principal-agent (and monitor) games. Gneezy (2005) and Sutter (2009) use a one-shot sender-receiver game to understand deception and communication. 8 The instructions indicated that this number and the outside option described below were drawn from 0 to 16 inclusive, but due to a programming error both were actually drawn between 1 and 16 instead. This error was not detected until after the sessions were completed. The effect on subject behavior should have been minimal, and subjects were not harmed by this change, as it meant that mean quality and therefore payoffs were slightly higher than intended.
The rater learned the amount of money that the seller chose to transfer, as well as the quality that was randomly chosen for the seller. The rater chose either to pay a cost, c ≥ 0, to send a message to the buyer, or to not incur a cost and send no message. The rater had a choice of two messages, either "Choose Participant A" or "Do not choose Participant A." In the discussion below we refer to a message recommending the seller as a "positive rating" and one recommending against the seller as a "negative rating," though we did not use these terms in the experiment.
The buyer learned what message the rater sent, if any, and then chose between the seller and an outside option. In the experiment we referred to the outside option as "the Y number," which was another integer drawn randomly from 1 to 16. The buyer earned $1 times the seller's quality if they chose the seller, and $1 times the outside option otherwise. As before, we denote the seller's transfer by a, which gives us two earnings profiles for each participant. If the rater sends a message:
π_S = $16 − a + $16 · 1{seller chosen}, π_R = a − c, π_B = $1 · X if the seller is chosen and $1 · Y otherwise.
If the rater sends no message, the rater instead earns π_R = a, with the other payoffs unchanged. In the free treatment, c = 0, and therefore rater payoffs collapse to simply π_R = a for both sending and not sending a message.
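The earnings profiles above can be sketched as a small payoff calculator. The dollar amounts follow the design described in the text (a $16 seller endowment, a $16 reward for being chosen, and $1 per unit of quality); the helper function itself is our own illustration, not the experimental software.

```python
# Sketch of the experimental payoffs for one group (seller, rater, buyer).
# Amounts follow the design in the text; the helper itself is illustrative.

def payoffs(x, y, a, rated, seller_chosen, c=0.0):
    """x: seller quality (X number), y: outside option (Y number),
    a: transfer, rated: whether the rater sent a message, c: cost of rating."""
    seller = 16 - a + (16 if seller_chosen else 0)
    rater = a - (c if rated else 0)
    buyer = x if seller_chosen else y
    return seller, rater, buyer

# Example: quality 12, outside option 7, $5 transfer, rating sent in the free
# treatment (c = 0), and the buyer follows a positive rating.
print(payoffs(x=12, y=7, a=5, rated=True, seller_chosen=True, c=0.0))  # (27, 5, 12)
```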
Subject understanding of choice sets and payoffs was assured by four quiz questions at the end of the instructions, followed by three control questions in the experimental software. Once all subjects had answered the quiz and control questions correctly (with opportunities to ask questions privately), subjects made their choices in the sequence described above: first seller, then rater, then buyer. After making their experimental choices, each subject completed a basic demographic survey. The survey was not incentivized and subjects were told that their responses on the survey were not connected to their final payment. After completing the survey, subjects were called to a cubicle in order to privately receive their earnings from the session.

IV Results
Table 1 shows summary statistics for the experiment. There were no statistically significant differences by treatment or across experimental sessions in demographic characteristics. Due to the lower frequency of rating, and the resulting decrease in the number of ratings observed, a larger sample size was necessary in the costly treatment than in the baseline free rating treatment in order to observe a similar number of ratings in each. We therefore ran twice as many sessions in the costly treatment (12 sessions versus 6). As Table 1 shows, a large majority (86%) of sellers transfer positive amounts, and there is no significant difference in the frequency of non-zero transfers between the treatments, with 89% sharing at least $1.00 in the free treatment and 85% doing so in the costly treatment. Consistent with Xiao (2013), sellers are responsive to the likelihood of punishment from raters, sending average transfers of $5.17 in the costly treatment and $6.78 in the free treatment, an increase of 31.14% when rating is free (Mann-Whitney z = 2.67, p = 0.008). Table 2 shows the determinants of transfer size from Tobit (odd columns) and OLS (even columns) regressions, with 18 observations left-censored at 0 and 1 observation right-censored at 16 (* p < 0.10, ** p < 0.05, *** p < 0.01).
As before, sellers share a smaller portion of their endowment when it is costly for raters to give a rating. Transfer size also depends upon the seller's quality, with higher-quality sellers choosing larger transfers, though this effect is relatively modest, with a 1-unit increase in quality resulting in a $0.17 increase in transfer size. This behavior is somewhat puzzling, as it is the opposite of our theoretical prediction that quality and transfer size serve as substitutes. Note that quality is no longer statistically significant in specifications (3) and (4) in Table 2. Restricting our regressions by treatment shows a statistically significant and positive relationship between quality and transfer size in the costly treatment, but no such correlation in the free treatment. In other words, in the costly treatment, high-quality sellers give slightly larger transfers than low-quality sellers, while in the free treatment transfer size does not vary by seller quality. We report the results of the restricted regressions in Table A1 in the appendix. Seller behavior in the free treatment can be explained by sellers who expect raters to have very little concern for buyers, which, as we discuss below, appears to be an empirically sound belief. Seller behavior in the costly treatment is more difficult to explain. A belief that raters are concerned only with transfers should result in sellers being insensitive to their own quality, while a belief that raters are concerned with quality should result in sellers who treat quality and transfer size as substitutes, as explained in the theory section above.
Finally, a seller who believes raters to be concerned only with quality should give no transfer.
We revisit the question of seller behavior in the conclusion, and we control for quality and transfer size in our analyses of rating behavior below.
Our primary interest is in how raters respond after learning seller quality and choice of transfer. We observe in Table 1 that the choice to rate is highly sensitive to costs. More than twice as many raters choose to rate in the free treatment (84.44%) as in the costly treatment (41.38%) (Mann-Whitney z = 4.708, p < 0.001). The increase in ratings moving from costly to free is perhaps less impressive than the fact that almost half of all raters choose to rate in the costly treatment. A large number of raters are still motivated to rate even when it is costly to do so.
Result 1. Many raters (41.38%) rate even when it is costly to do so.
We expand on Result 1 through a series of regressions on the likelihood that a rater will send any rating, with a binary dependent variable that takes the value 1 if the rater sends a rating and 0 if not. Regression results are shown in Table 3, with probit specifications in the odd columns and OLS in the even columns. In each of the specifications there is a large negative effect from the cost of rating and the size of this effect increases as we add more controls, suggesting that the choice to rate is heavily influenced by the presence of a cost. Consistent with Result 1, the regression results show a rater's likelihood of rating drops between 41 and 65 percentage points when rating is costly. In specifications 5 and 6 we also check the predicted tendency to rate more frequently when quality or transfers tend toward high or low levels, and less frequently for moderate levels. We see modest evidence of this behavior, with small positive correlations and mixed statistical significance.
The size of the transfer does not influence the likelihood that a rater rates; however, once we condition upon a rating being sent we find that transfers are highly influential. Negative ratings were preceded by average transfers of $4.14, compared to $7.84 for positive ratings (Mann-Whitney z = -4.26, p < 0.001). For a given rating, the amounts transferred are not statistically significantly different across treatments. Table 4 expands on this simple comparison through a series of regressions examining the choice of rating, conditional on any rating being sent. These regressions include seller quality, which controls for the weak relationship between quality and transfer size, allowing us to focus on the effect of transfer size on choice of rating. We model the choice between the two possible ratings as a binary variable that takes the value 1 for a positive rating and 0 for a negative rating. In every specification we find a statistically significant relationship between transfer size and choice of rating, with each additional dollar transferred leading to approximately a 6 percentage point increase in the likelihood that the rating is favorable. 9 As a robustness check, Tables A3 and A4 in the appendix report the results of ordered probit regressions that combine the choice of rating with the decision to rate at all. The categories are a negative rating (0), no rating (1), and a positive rating (2).
The results remain substantively the same: a larger amount received by the rater corresponds to a lower probability of a negative rating and a higher probability of a positive rating.
Result 2. A rater's choice of rating is determined by the size of transfer from the seller.
Result 2 is our main finding, as it indicates that the type of rating sent is determined entirely by the seller's choice of side-payment, and not by seller quality. The result suggests that side-payments can fully crowd out the rater's desire to accurately describe quality. This behavior is important and possibly troubling, as it suggests that raters' intrinsic motivation to help other consumers may be small relative to other factors. Even with no future rewards at stake, raters are more concerned with reciprocating seller side-payments than with accurately informing buyers. 9 This result is robust to instead using dummy variables for each quartile of amount received, or a dummy variable (with an interaction term with cost) for whether the amount received was above or below the median transfer of $6. In Table 4, robust standard errors are in parentheses and probit coefficients are marginal effects (* p < 0.10, ** p < 0.05, *** p < 0.01).
The strong relationship between transfer size and choice of rating produces many ratings that are not in a buyer's best interest, and a rater's choice to punish or reward a seller often comes at the buyer's expense. We say that a rater misleads a buyer if they provide a favorable rating when quality is less than 8 (an overstatement) or an unfavorable rating when quality is greater than 8 (an understatement). Note that misleading ratings are not, strictly speaking, dishonest, as raters are only making recommendations, and neither of the messages they can send (e.g., "Choose Participant A") is ever technically false. Across both treatments 32.9% of ratings are misleading, with more misleading ratings in the free treatment (39.5%) than in the costly treatment (25.7%), though that difference is not statistically significant.
Result 3. Nearly one-third (32.9%) of all ratings are misleading.
Examples of misleading ratings can be seen in Figure 1, where the dashed vertical line represents the ex-ante expected quality level. Any favorable rating to the left of the vertical line overstates quality, instructing the buyer to choose the seller even though the seller is of below-average quality. Similarly, unfavorable ratings to the right of the vertical line understate quality, instructing the buyer to avoid the seller, despite the seller having above-average quality.
Although we observe that approximately one-third of ratings are misleading, those numbers likely under-represent raters' actual willingness to mislead buyers. We would not expect to see more than half of all ratings be misleading, even if raters were solely motivated by concern for sellers. For example, the act of punishing a seller for a small transfer only results in misleading behavior when the seller is of high quality, as the remaining half of the time the seller is of low quality and an unfavorable rating intended to punish coincidentally helps the buyer to choose their best option. In other words, the actual proportion of the population willing to give misleading ratings is approximately double what we observe, suggesting that many raters are willing to make buyers worse off in order to punish or reward sellers for their side-payments.
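The doubling argument above can be checked with a quick simulation. Assume, purely hypothetically, a population of raters who always rate and condition only on the transfer (rewarding larger transfers, punishing smaller ones), with transfers drawn independently of quality; the fraction of their ratings that is misleading then settles at roughly one half, since the other half of the time the reciprocal rating coincidentally matches quality.

```python
# Monte Carlo sketch: raters who condition ONLY on the transfer.
# All behavioral assumptions here are hypothetical, for illustration.
import random

random.seed(0)
MISLEAD_THRESHOLD = 8  # quality cutoff used in the text

misleading = 0
n = 100_000
for _ in range(n):
    q = random.randint(1, 16)   # seller quality, as in the experiment
    a = random.randint(0, 16)   # transfer, drawn independently of quality
    r = "G" if a > 8 else "B"   # purely transfer-driven rating
    if (r == "G" and q < MISLEAD_THRESHOLD) or (r == "B" and q > MISLEAD_THRESHOLD):
        misleading += 1

share = misleading / n
# With quality independent of the rating, roughly half of all ratings mislead.
```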
It is possible that risk aversion could cause a rater motivated solely by altruism toward buyers to give a favorable rating to a below-average seller. That said, observed rates of misleading ratings are very similar if we change the threshold used to define a misleading rating, either upward to 9 or downward to 7. Additionally, if risk aversion did lead to favorable ratings for below-average sellers, we would expect to see more overstatements than understatements, as there is no similar incentive to negatively rate above-average sellers. This is not the pattern we observe in the data, with no significant difference in the frequency of understatements and overstatements, at 31.25% and 34.15%, respectively (p = 0.80, Mann-Whitney).
Given our experimental parameterization, the seller's reward from being chosen ($16.00) is greater than or equal to the maximum difference in earnings to the buyer from choosing between the seller and the outside option. As a result, if a rater were solely concerned with efficiency (i.e., the sum of the seller's and buyer's payoffs), they would give favorable ratings to all sellers, regardless of quality, resulting in many overstatements. We do not observe such a pattern in the data, as overstatements and understatements are equally common.
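The efficiency argument can be checked directly (a sketch that assumes, purely for illustration, that seller quality X and the outside option Y each take integer values from 1 to 16, consistent with the ex-ante mean of 8.5; the exact supports are not restated here):

```python
# An efficiency-minded rater compares total surplus across the buyer's two
# choices: choosing the seller yields X (buyer) + 16.00 (seller reward),
# while choosing the outside option yields Y for the buyer and nothing for
# the seller.
SELLER_REWARD = 16.00

def favorable_is_efficient(x, y):
    """A favorable rating weakly maximizes total surplus iff X + reward >= Y."""
    return x + SELLER_REWARD >= y

# Because the reward is at least as large as the biggest possible payoff gap
# Y - X, the favorable rating is weakly efficient for every quality level.
assert all(
    favorable_is_efficient(x, y)
    for x in range(1, 17)
    for y in range(1, 17)
)
```

The check confirms that a purely efficiency-minded rater would never rate unfavorably under these assumed supports, so the equal frequency of understatements in the data is inconsistent with efficiency concerns driving ratings.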
Despite the frequency with which ratings are misleading, when a rating is sent, buyers generally follow the recommendation made by the rater. The seller is chosen following a favorable rating 94.7% of the time in the free treatment and 83.3% of the time in the costly treatment (see Table 1). Following a negative rating, 5.3% and 5.6% of sellers are chosen in the free and costly treatments, respectively. If no rating is sent, buyers appear to randomize over choosing the seller and the outside option, with 48.3% of buyers in the free treatment and 47.3% of buyers in the costly treatment choosing the seller. 10 None of the buyer behavior is significantly different between treatments. We summarize these findings with our next result.
Result 4. Buyers generally follow rater recommendations.
Buyers generally trust ratings, but what effect do ratings have on buyers? We first examine ex-ante buyer welfare based on both the empirical frequency with which each type of rating is given, and the expected quality conditional on each type of rating. In other words, within each treatment the expected payoff to a potential buyer is given by:

π_B,T = P_T(Positive Rating) · E_T[X | Positive Rating] + P_T(Negative Rating) · E[Y] + P_T(No Rating) · (1/2)(E_T[X | No Rating] + E[Y]),

where π_B,T is the expected payoff to a credulous buyer in treatment T, P_T is an empirical probability, E_T[X | Positive Rating] and E_T[X | No Rating] are the mean quality following a favorable rating or no rating, respectively, and E[Y] is the unconditional expected value of Y, which does not vary by treatment. We find that π_B,C = 8.74 in the costly treatment and π_B,F = 8.78 in the free treatment, meaning that in both treatments the ex-ante buyer payoff is very close to a seller's ex-ante expected quality of 8.5. This shows that, on average, buyers are just as well off as if they were choosing between their options at random, which is consistent with our finding that ratings are shaped entirely by the size of the seller transfer.

10 In regressions reported in Table A2 of the appendix, we show that the unconditional probability of the buyer choosing the seller does not correlate with the treatment or with receiving a rating of either kind (good or bad). By contrast, conditional on receiving a rating, a positive rating increases the probability that the buyer chooses the seller by approximately 87%.
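The expected-payoff expression above can be evaluated in a few lines (a sketch with hypothetical rating frequencies and conditional means chosen purely for illustration; the experiment's actual values are reported in the tables, not here):

```python
def expected_buyer_payoff(p_pos, p_neg, p_none, q_pos, q_none, e_y):
    """Ex-ante payoff to a credulous buyer: follow the rating when one is
    sent, and randomize 50/50 between seller and outside option otherwise.

    p_pos, p_neg, p_none: empirical rating frequencies (must sum to 1)
    q_pos, q_none: mean seller quality following a favorable / no rating
    e_y: unconditional expected value of the outside option Y
    """
    assert abs(p_pos + p_neg + p_none - 1.0) < 1e-9
    return p_pos * q_pos + p_neg * e_y + p_none * 0.5 * (q_none + e_y)

# Hypothetical inputs for illustration only (not the experiment's values):
print(expected_buyer_payoff(0.4, 0.3, 0.3, 10.0, 8.5, 8.5))
```

Note that when the conditional means all equal the unconditional mean, the expression collapses to that mean, which is the sense in which ratings that carry no quality information leave buyers no better off than random choice.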
Although ratings do not influence buyer outcomes on average, individual ratings can be helpful or harmful to buyers. Table 5 reports final buyer outcomes following helpful ratings (positive ratings for above-average sellers or negative ratings for below-average sellers) and misleading ratings in each treatment, showing the change in welfare that results from each type of rating.
Buyers are consistently better off when ratings helpfully recommend the higher-payoff option than when they are misleading, regardless of the cost of rating. Switching from a misleading rating to a helpful one increases buyer welfare by 43% in the costly treatment and 47% in the free treatment. We summarize our findings on buyer outcomes with our final result.
Result 5. Individual ratings can make buyers better or worse off, but on average ratings have no effect on buyer welfare.

V Conclusion
We find evidence that non-contingent side-payments from sellers significantly influence rater behavior. Not only are ratings influenced by such payments, but in our data those payments are the sole determinant of what type of rating is sent. Despite existing evidence that altruism motivates raters to truthfully inform buyers (Lafky 2014), in our environment we observe no such concern, with raters instead responding solely to the size of side-payments they receive.
Raters are willing to mislead buyers by recommending low-quality sellers who have made large transfers, and recommending against high-quality sellers who have made small transfers.
Our results suggest that those designing or participating in rating systems need to be mindful not only of explicit fraud, but also of biases introduced through preferential treatment of raters. For example, a consumer who is provided with a free review copy of a product may provide a more favorable evaluation of that product than if they were required to return the evaluation copy after they finished their review. While existing research on biased ratings has focused primarily on explicit forms of fraud in which sellers surreptitiously provide ratings for products (Dellarocas 2006;Mayzlin, Dover, and Chevalier 2014), our results suggest explicit fraud is not the only source of intentionally skewed ratings. Implicit, reciprocal relationships unrelated to seller quality may also harm recommendations for future consumers.
Our study implements a simple, stylized experimental environment designed to cleanly isolate one component of the interaction between sellers and raters: the role of side-payments on choice of rating. Our results identify rating behavior when the rater is weighing reciprocity over side-payments versus altruism toward buyers. To isolate subject behavior, we omit some features found in rating environments in the field. First, our setting is a one-shot environment with one-sided ratings, which means that there is no opportunity for individual learning or concern for reputation building over time. Second, seller quality is determined exogenously, allowing us to focus solely on the role of side-payments as the source of reciprocity. Third, our raters do not "purchase" from the seller first-hand, meaning that they are not directly affected by seller quality. In the field, raters are frequently consumers who have already purchased from the seller, and are likely to be influenced by reciprocity over both the size of side-payments and the quality they personally experienced, in addition to potential altruism toward buyers.
Finally, we note that the persuasiveness of side-payments in the field could vary considerably based on context: for example, raters may respond differently to side payments in anonymous online markets than in face-to-face interactions. Similarly, our design has a single buyer, however the number of consumers who observe ratings can vary widely by setting. Increasing the number of buyers might increase the altruistic value of a truthful rating and thereby reduce the relative influence of side-payments.
While our experiment was designed to explain the choices of raters, our results also provide some insights into seller behavior. We find no relationship between quality and the size of transfers when rating is free, but a small positive relationship between the two when rating is costly. The lack of a relationship in the free treatment can be explained by sellers who believe that raters are concerned only with the size of the transfer, and not with underlying quality. More difficult to explain is why higher-quality sellers would give larger transfers, but only when ratings are costly. A future experimental design that identifies seller beliefs about raters, for example by informing sellers about raters' past histories of play, could help to explain the behavior we observed in our study.
Our parsimonious design is helpful for isolating the role of side-payments; however, there are many opportunities for future studies to examine more complex settings. For example, a design that implemented endogenous choice of quality could build off of our results to provide insights into the dynamics of how ratings are used to reciprocate both quality and side-payments simultaneously. Even better would be to conduct similar experiments in the field, especially in a setting in which both seller quality and transfers between sellers and raters could be observed and easily quantified. Alternatively, it would be helpful to examine side-payments in settings with two-sided ratings, or when sellers and raters interact repeatedly, allowing reputation-building and strategic considerations to develop over time.