Everything about online reviews is counter-intuitive. We think that a restaurant with 4.5 stars on Google Maps is typically better than one rated 4.0, but it’s not. These scores barely mean anything. That’s why I sigh whenever someone with a mobile phone tells me we should watch a different movie, eat in another restaurant, or go to a better escape room based only on the online scores. Life isn’t that simple.
But it’s an average of hundreds of opinions!
No, it’s not.
But it’s a social consensus!
Nope, not really.
But all these people can’t be wrong!
Yes, they can.
It doesn’t make sense yet, I know. Don’t worry; I promise it will be crystal clear by the time you finish reading this post series. I will show you the troubling data behind online reviews, ground it in human nature and market incentives, discuss how following the stars robs us of the best experiences in life, and share how we can fight back.
When 60% means 50%
The cardinal sin of star ratings is that they pretend to be numbers when they’re really not. For example, many people I talked to intuitively assume that 3 out of 5 stars is roughly a 60% score, as visualized below:
I fell for this trap too. It’s all too easy to forget you can’t rate something as 0 stars. The scale actually starts at 1, and 3 stars only represent 50% of the perfect score:
Why is it this way? One of the best theories I’ve heard points to the interaction design. Say you want to leave a four-star review: you’d click on the fourth of the five stars the website presents. But where exactly would you click to select 0 stars?
Whatever the reason really is, saying 3 out of 5 when you actually mean 2 out of 4 is just misleading.
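To make the arithmetic concrete, here is a minimal Python sketch of the rescaling. The function name `stars_to_percent` and its default bounds are my own illustrative choices; the only thing taken from the argument above is that the scale starts at 1, not 0.

```python
def stars_to_percent(stars: float, lowest: float = 1.0, highest: float = 5.0) -> float:
    """Return where a rating sits on its scale, as a percentage of the usable range."""
    if not lowest <= stars <= highest:
        raise ValueError(f"rating must be between {lowest} and {highest}")
    # Subtract the lowest possible rating first: on a 1-5 scale only 4 steps exist.
    return 100.0 * (stars - lowest) / (highest - lowest)


# The intuitive (and misleading) reading treats the scale as starting at 0.
naive_percent = 100.0 * 3 / 5

# The reading that accounts for the missing 0-star option.
actual_percent = stars_to_percent(3)

print(naive_percent, actual_percent)
```

With these assumptions, 3 stars comes out as 60% under the naive reading and 50% once the 1-star floor is taken into account, which is the same thing as saying 2 out of 4.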
When 4.0 is more than 4.5
Another problem is that the stars cannot be meaningfully compared with each other.
The pizza place right around the corner is also rated as 4.5, but I’m not too fond of their food. I much prefer the 4.0 sushi place nearby. It is my favorite one, and I would pick it over the 4.6 and 4.9 alternatives any day.
The cherry on top is this McDonald’s sharing a 4.4 rating with Nakamura, a restaurant with three Michelin stars:
What does a good score even mean?
How good is 4.3 out of 5 stars? It depends. On Airbnb, that’s among the bottom 5% of properties [1]; on Google Maps, it’s just above the average [2]; and on Yelp, we’re talking about a top-tier restaurant [2]. The number 4.3 doesn’t mean anything on its own. You need to know how the scores are distributed, and that’s one thing review aggregators rarely share.
Here’s a compilation of average scores I found in various studies:
But let’s go even deeper. If a 4.2 is better than 50% of all restaurants, how good exactly is 4.3 or 4.4? That, again, depends. The ratings on Google Maps are squished to the right side, so a tenth of a star makes an enormous difference:
There are so few restaurants scoring below three that I could argue the scale actually goes from 3 to 5. It’s very different from Yelp, where the ratings are distributed across the entire spectrum:
It looks like a stretched version of the Google Maps one, but it doesn’t mean all scores are distributed like that. There are all sorts of shapes out there. For example, movie scores are distributed very differently across rating sites [3]:
That’s true for hotels, too. Except for holiday rentals, Airbnb and TripAdvisor have quite different charts:
I could keep going, but you get the point. I find it interesting how a score like 4.3 can mean awful, okay, or top-notch and somehow still impact our decisions.
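Here is a small Python sketch of why the same number can land in such different places. The three lists below are made-up toy samples that only mimic the shapes described above (squished to the right, spread out, and top-heavy); they are not data from any platform or study, and `percent_below` is a hypothetical helper name.

```python
from bisect import bisect_left


def percent_below(score: float, all_scores: list[float]) -> float:
    """Return the percentage of scores that are strictly lower than `score`."""
    ordered = sorted(all_scores)
    # bisect_left gives the count of elements strictly smaller than `score`.
    return 100.0 * bisect_left(ordered, score) / len(ordered)


# Made-up toy samples, for illustration only.
squished = [3.9, 4.0, 4.1, 4.2, 4.2, 4.3, 4.4, 4.5, 4.6, 4.8]
spread_out = [1.5, 2.0, 2.5, 3.0, 3.0, 3.5, 3.5, 4.0, 4.3, 4.5]
top_heavy = [4.3, 4.5, 4.6, 4.7, 4.7, 4.8, 4.8, 4.9, 4.9, 5.0]

for name, sample in [("squished", squished), ("spread out", spread_out), ("top-heavy", top_heavy)]:
    print(f"4.3 beats {percent_below(4.3, sample):.0f}% of the {name} sample")
```

In these toy samples, the same 4.3 beats half of the squished list, most of the spread-out list, and nothing at all in the top-heavy one. The score only becomes informative once you know the distribution it came from.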
Some things are both fantastic and terrible
When I saw these distributions, I thought: there’s just no way the same items are rated consistently across different sites. Sure enough, they aren’t.
Take Yelp and Google Maps. Top restaurants are usually different, and if a place scores below the average on one platform, there’s a 20% chance it’s above the average on the other [2].
It gets even worse for hotels. There are significant differences between how the same properties are rated on Booking.com and Expedia.com or Hotels.com, and there’s only a mild correlation between Airbnb and TripAdvisor [1].
And what about movies? Fandango scores somewhat agree with IMDB but not with Metascore or Rotten Tomatoes [3].
There is no shortage of explanations on why that happens. Maybe it’s due to different audiences or different review experiences. Or perhaps it’s due to cultural influences, market incentives, or sheer randomness – all of which I will discuss in the next few parts. For now, I wanted to zoom in on another powerful explanation: the scores are calculated in vastly different ways.
The average rating isn’t even an average
I always thought that a score like 4.4 is an average of all the ratings left by hundreds of reviewers. Turns out, it’s not. Every platform has its own algorithm to compute these scores, and the details are usually secret.
It may seem like a matter of a few design choices on the surface. Maybe the recent scores from the reputable users who used the product should matter more than the older reviews from shady accounts. But even that gets complicated quickly. It could work well for restaurants, but is it a good way of rating movies? And how can you possibly check if someone watched the film before rating it?
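To show how much a couple of such design choices can move a score, here is a rough Python sketch of a weighted average. Everything in it is an assumption made for illustration: the `Review` fields, the one-year half-life, and the halved weight for unverified accounts are hypothetical and do not describe any real platform’s algorithm.

```python
from dataclasses import dataclass


@dataclass
class Review:
    stars: float     # rating on the 1-5 scale
    age_days: float  # how long ago the review was left
    verified: bool   # hypothetical flag: did the reviewer actually use the product?


def plain_average(reviews: list[Review]) -> float:
    """The score most of us assume we are looking at."""
    return sum(r.stars for r in reviews) / len(reviews)


def weighted_score(reviews: list[Review],
                   half_life_days: float = 365.0,
                   unverified_weight: float = 0.5) -> float:
    """Weighted average where older and unverified reviews count less (assumed rules)."""
    total = 0.0
    weight_sum = 0.0
    for r in reviews:
        # A review loses half of its weight every `half_life_days`.
        weight = 0.5 ** (r.age_days / half_life_days)
        if not r.verified:
            weight *= unverified_weight
        total += weight * r.stars
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no reviews to score")
    return total / weight_sum


reviews = [
    Review(stars=5, age_days=0, verified=True),
    Review(stars=1, age_days=365, verified=False),
]
print(plain_average(reviews), weighted_score(reviews))
```

With just these two reviews, the plain average is 3.0 while the weighted score comes out at about 4.2. Same opinions, very different number, and the reader has no way of knowing which rules were applied.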
But the problem is much more profound. Some scores are plain misleading; like on Booking.com, where rating a property as 0 actually comes out as 2.5 [4]. Others are at least suspicious, like on Airbnb, where 95% of listings are rated 4.5 or higher [1].
Then, some overall scores reportedly include more than just the reviews. There are places scoring a mere 4.5 on Google Maps despite having exclusively 5.0 reviews [5]. While we can’t be sure why the 0.5 is missing, it is often attributed to unrelated factors like website traffic, listing completeness, and even opinions from other sites.
Effectively we don’t know what these scores mean. Maybe they are a meaningful reflection of what people are saying, or maybe they are pulled out of a hat. How could we tell?
What went wrong?
There’s a short explanation for all the weird things we’ve seen. Capturing experiences as meaningful numbers is hard, if not impossible. On top of that, all parties participating in reviewing have at least some incentive to manipulate the scores.
As for the long answer, that’s what the upcoming part two will be about. See you next time!
Special thanks to Catalina Muñoz for her suggestions and feedback on this post.
1. Zervas, G., Proserpio, D., & Byers, J. W. (2021). A first look at online reputation on Airbnb, where every stay is above average. Marketing Letters, 32(1), 1-16.
2. Li, H., & Hecht, B. (2021). 3 Stars on Yelp, 4 stars on google maps: a cross-platform examination of restaurant ratings. Proceedings of the ACM on Human-Computer Interaction, 4(CSCW3), 1-25.
3. Whose ratings should you trust? IMDB, Rotten Tomatoes, Metacritic, or Fandango?
4. Eslami, M., Vaccaro, K., Karahalios, K., & Hamilton, K. (2017, May). “Be careful; things can be worse than they appear”: Understanding Biased Algorithms and Users’ Behavior around Them in Rating Platforms. In Proceedings of the international AAAI conference on web and social media (Vol. 11, No. 1, pp. 62-71).
5. How Google Reviews Calculates Value of Star-Ratings; Why does Google show a 4.8 rating when I have all 5-star reviews?