Where art meets consumer data analytics

We build a holistic view of consumers and their journey using data and analytics. We apply those insights to identify how our partners can optimize the impact of their marketing, and create great consumer experiences.



Campaign Explorer

A system that helps us monitor, analyze and explain everything that happens in a marketing campaign and thereby infer what engages people.


Complex Adaptation

A blockchain-based decentralized marketing platform. This project is in stealth mode right now.


Revisiting McPhee’s Theory of Exposure and the Long Tail

Ever since Chris Anderson published The Long Tail: Why the Future of Business Is Selling Less of More (2006), marketers have been pondering the implications of a world where the constraints of traditional brick-and-mortar retailing have fallen away, as online stores can profitably carry products that appeal to only a select few. Anita Elberse, a Harvard Business School professor, linked these post-Internet ideas with McPhee’s (1963) old-school Theory of Exposure in Should You Invest in the Long Tail? (2008), bringing balance to the discussion. McPhee’s theory asserts two principles: (1) the most popular products/services (hereafter just “products”) in the fat head enjoy a natural monopoly among casual users, the majority of users in any product category, because these are the products they can most easily become aware of; and (2) even when casual users become aware of niche long-tail products, they tend to prefer the mass-market fat-head products, because those have been optimized to appeal to a more diverse set of users while niche products are generally optimized for aficionados. These two factors (lesser known and less appealing) put long-tail products in double jeopardy.

We tested McPhee’s theory against Yelp’s review data and found that his ideas were generally supported, but we made some fresh observations in the process. As expected, we found that the venues (most Yelp reviews are for businesses like restaurants and hotels that are more accurately described as venues than products) reviewed by Yelp participants formed a long tail (Figure 1), where the most reviewed businesses are in the fat head. While some have found the head to be separated from the tail in the classic Pareto 80-20 split (80% of venues in the tail), we used a two-step clustering algorithm and found that 90-10 was a more natural split.
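The head/tail split can be illustrated with a sketch. The code below is a minimal stand-in for the two-step clustering we used: it generates a hypothetical Zipf-like distribution of review counts (not the actual Yelp data) and splits it with a simple two-cluster k-means on log counts.

```python
import math

# Hypothetical Zipf-like review counts standing in for the Yelp data
counts = [int(1000 / rank) + 1 for rank in range(1, 2001)]

# Two-cluster 1-D k-means on log counts, a minimal stand-in for the
# two-step clustering algorithm used in the analysis
x = [math.log(c) for c in counts]
c_head, c_tail = max(x), min(x)
for _ in range(50):
    head = [v for v in x if abs(v - c_head) < abs(v - c_tail)]
    tail = [v for v in x if abs(v - c_head) >= abs(v - c_tail)]
    c_head = sum(head) / len(head)
    c_tail = sum(tail) / len(tail)

tail_share = len(tail) / len(x)  # fraction of venues in the long tail
```

On this synthetic distribution the natural split puts the large majority of venues in the tail, consistent with the 90-10 split we observed.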

Figure 1. A two-step clustering algorithm found that 90% of venues reviewed on Yelp are in the long-tail.

The same clustering algorithm was also applied to reviewers, segmenting them by the proportion of niche venues among their reviews. Reviewers fit into four natural clusters, described in Table 1. How was McPhee right? The pure fat-head consumers (cluster 1) are the most satisfied (average star rating of 3.82) and most loyal (5,840 check-ins per business) patrons of the mainstream businesses that make up 97.3% of their experience.
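A minimal sketch of this reviewer segmentation, using made-up niche-proportion data rather than the Yelp sample; a quantile-seeded 1-D k-means stands in for the two-step clustering:

```python
import random

random.seed(7)

def kmeans_1d(xs, k, iters=50):
    """Plain 1-D k-means seeded at quantiles of the data."""
    xs = sorted(xs)
    centers = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in xs:
            nearest = min(range(k), key=lambda j: abs(v - centers[j]))
            groups[nearest].append(v)
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return centers, groups

# Hypothetical proportions of long-tail venues in each reviewer's history,
# drawn around four modes (pure fat head ... pure long tail)
modes = [0.03, 0.25, 0.55, 0.90]
props = [min(1.0, max(0.0, random.gauss(m, 0.04)))
         for m in modes for _ in range(50)]

centers, groups = kmeans_1d(props, 4)
```

With well-separated behavioral modes like these, the four recovered cluster centers land close to the true segment proportions.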

Table 1. There are four natural clusters among Yelp users based on the proportion of long tail businesses they review.

What’s new? Our interesting discovery was the cluster 2 consumers, who are almost as happy (3.77 stars) as the mainstream cluster 1 consumers, but are responsible for a much larger number of reviews, reviews that dominate readers’ assessments of helpfulness. It seems that experiencing roughly 25% niche venues may be a novelty sweet spot that most inspires consumer-generated media, and most enriches consumer knowledge in general. McPhee’s theory therefore seems correct, but it may not tell the whole story of consumer experience in the long tail. Does experience with this mix of venues empower cluster 2 consumers to make better comparisons between the niche and the mainstream? Is there something about the personality of cluster 2 consumers that makes them more skillful review writers? Are they more allocentric? Many unanswered questions, but a fascinating phenomenon nonetheless.


When do reviews adequately portray a product or service?

It’s not always about quantity

Generally the more reviews you have, the more they converge on a consensus assessment of the experience provided (Figure 1). That suggests to the prospective customer that the experience is very predictable and low risk.

Figure 1. Only 35 reviews, but the rating distribution is starting to converge on 4 stars.

But that is not always true. Some experiences are polarizing: some people love them while others hate them. Sometimes random chance brings these two sides together in near-equal numbers, with results that confuse the prospective customer (Figure 2).
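One simple way to separate the converging case from the polarized case is the variance of the star-rating histogram. The review counts below are invented for illustration:

```python
def rating_stats(hist):
    """hist: review counts for 1..5 stars; returns (mean, variance)."""
    n = sum(hist)
    mean = sum((i + 1) * c for i, c in enumerate(hist)) / n
    var = sum(c * ((i + 1) - mean) ** 2 for i, c in enumerate(hist)) / n
    return mean, var

converged = [1, 2, 5, 20, 7]    # piling up around 4 stars (35 reviews)
polarized = [14, 3, 2, 3, 13]   # love-it / hate-it split (35 reviews)

mean_c, var_c = rating_stats(converged)
mean_p, var_p = rating_stats(polarized)
```

A high-variance (or visibly bimodal) histogram is a warning that the average star rating is hiding a polarized consensus.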

Figure 2. Many reviews, but consensus is polarized.

There is another dimension to many review sites (Figure 3), where readers assess the helpfulness of reviews. We found that this information tends to bring clarity even to polarized reviews; however, review sites do not present it in a helpful way.

Figure 3. Review sites have different ways of indicating the helpfulness of reviews, but while they display this information for each review (and sometimes for the object of the review) it is unclear what its implications are.

We found that readers’ assessments of helpfulness gradually peak and then decline over time (Figure 4). It seems that the peak is the point where the reviews finally begin to capture all the information necessary for consumers to make an informed prediction of whether the reviewed object or entity is right for them. We call this point information sufficiency. Since matching expectations with outcomes is the key to customer satisfaction, we recommend that consumers be told when the available reviews might not allow them to make an informed prediction of their outcome. This signal may be best sent with a simple set of icons that indicate whether information sufficiency has been reached (Figure 5).
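Locating the information-sufficiency point in practice amounts to finding the peak of a smoothed helpfulness series. A sketch with invented monthly data:

```python
# Hypothetical mean helpfulness votes per review, by month
helpful = [0.2, 0.35, 0.5, 0.8, 1.1, 1.4, 1.5, 1.2, 0.9, 0.8, 0.7, 0.6]

def smooth(xs, window=3):
    """Trailing moving average to damp month-to-month noise."""
    out = []
    for i in range(len(xs)):
        lo = max(0, i - window + 1)
        out.append(sum(xs[lo:i + 1]) / (i + 1 - lo))
    return out

smoothed = smooth(helpful)
# The month where smoothed helpfulness peaks marks information sufficiency
sufficiency_month = max(range(len(smoothed)), key=smoothed.__getitem__)
```

The trailing window shifts the detected peak slightly later than the raw maximum, which is a conservative choice when signaling sufficiency to consumers.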

Figure 4. The helpfulness of a set of reviews was observed to peak over time. The time at which the peak occurs is denoted the point of information sufficiency. The vertical dotted line matches one in Figure 6.
Figure 5. Since matching expectations with outcomes is the key to customer satisfaction, we recommend that consumers be told when the available reviews might not yet allow them to make an informed prediction of their outcome.

Helpfulness and satisfaction interact

Our final observation on this issue is that information sufficiency seems to mark the minimum of overall customer satisfaction (Figure 6). We believe this occurs because after information sufficiency is reached, prospective customers are better able to predict from the reviews whether they will be satisfied with the experience. Therefore, customers who expect to be unhappy stay away, and those who participate are a self-selected better match for the product or service being offered.

Figure 6. The vertical dotted line matches one in Figure 4 that marks the point of information sufficiency. Customer satisfaction steadily decreases until this point, and increases thereafter as customers who are a good fit for the value offering are better able to self-select into the experience while those who are a poor fit stay away.

The conflict over long-tail search

The short version

Search advertisers receive conflicting advice about which queries they should target. Practitioners often advise focusing on the long tail, those specific queries assumed to identify consumers further into the buying cycle. Alternatively, academic research advises targeting either queries with historically high click-through rates or the most-used queries. However, if you want to maximize click-through, query usage is the wrong thing to focus on; what really matters is having your ad recognized as the most relevant part of the search results. The likelihood of such recognition is maximized when you target queries that are specific enough to match few other products besides your own, thus minimizing competitive interference. However, consumers do seem to reward seeing lots of competitors when they are comparing their options.

The whole story

Marketing practitioners and the academic literature contradict each other in their advice about which keywords and consumer queries search advertisers should target. The academic literature on this subject is nascent, yet confidently recommends that advertisers select keywords that have prompted high click-through rates in the past (Rutz et al. 2012), and that they track and use the currently most-used keywords (Skiera et al. 2010). While these two recommendations seem distinct, both draw advertisers to target the queries most used by consumers, creating a quintessential “red ocean” (vis-à-vis Kim and Mauborgne 2004) of competition for the attention of the same searchers.

On the other hand, search engine marketers (e.g., Fishkin 2009, Hill 2012) widely advise that advertisers not compete exclusively for the opportunity to show ads in response to the most-used queries, but also target the long-tail queries that are the most specific in the consumer intent they communicate. As Figure 1 depicts, these queries are individually low in their use among searchers, yet collectively form the largest category of search query. These practitioners often explain that there are three categories of search query, based on frequency of use, and that these categories correspond to the three stages of the buying cycle, beginning with (1) information gathering. The most-used and most general search queries (sometimes called the fat-head) are believed to be used by those just beginning to contemplate a purchase. For example, a person starting to plan replacing the tires on their car might query “tires” (point A in Figure 1) or the lesser-used English variant “tyres” at point B. As this person starts to consider the range of alternatives open to them, they enter the next stage, (2) shopping. These queries are less often used and more specific; they are believed to be used by those who are assembling a consideration set and evaluating options. For example, after evaluating the results of a “tires” search the consumer will make many follow-up searches, perhaps being more specific about the type of car (“Honda tires”, point C, or “Honda Fit tires”, point D), until they finally settle on a specific product, thus entering the final stage, (3) purchase. These queries are the least used and the most specific: the long tail of the query spectrum. They are believed to be the most desirable queries to target because the searcher is assumed to have decided what they want and is about to make a purchase.
For example, a consumer who has decided to buy a specific size of tire might search “185/55R16” (point E) to find the lowest price right before they make a purchase. Clearly the practitioner perspective has merit, as evidenced by the general increase in cost-per-click (CPC), the price advertisers pay the search engine if their ad gets clicked, as queries get more specific. Conventional economic thinking might lead us to predict that the most frequent queries, those most targeted by advertisers, would have the highest CPC. However, the opposite seems true; if CPC reflects the economic value of queries, then those in the long tail are apparently more valuable, presumably because a sales conversion is more likely to occur. It would be wrong, though, to say that competition has no apparent effect on CPC. In Figure 2 the queries to the right of the reference line are so specific that they are targeted by only one advertiser. In Figure 1 the CPC among these queries was observed to decrease, likely because there was little price-raising competition to show ads in response to them.

Figure 1. Long-tail search and the buying cycle. Search queries are a long-tail phenomenon; some are widely used but most are less so. Some example queries with usage volume and cost-per-click (CPC) at the time of this writing: (A) “tires”, 9.1M, $1.80; (B) “tyres”, 4.1M, $1.90, (C) “Honda tires”, 49.5K, $2.73; (D) “Honda Fit tires”, 2.4K, $3.59; and, (E) “185/55R16”, 480, $2.02. As queries get more specific the usage volume tends to decrease while CPC increases, implying the rare queries are more valuable because they signal purchase propensity. The dashed vertical reference line also appears in Figure 2 to aid the reader in connecting them.
Figure 2. Competitive intensity and query usage. The black dashed line is a locally weighted regression of the plotted points. As queries get more specific, usage among consumers tends to decrease, as does the number of advertisers targeting the query. At the reference line search queries become so specific they only attract one advertiser.

Why does this difference in understanding between practitioners and academics exist? Clearly there are data and logical arguments that support both sides. The problem is that the academic side is focused on a metric rather than the dynamics of the context. Query usage is the wrong thing to focus on; what really matters is having your ad recognized as the most relevant part of the search results. Figure 3A shows the likelihood of such recognition is maximized when competitive interference is minimized, and that likely occurs when you target queries that match few other products besides your own, queries most likely to be in the long tail. Note though that a different dynamic appears to be present among fat-head queries. Figure 3B clearly shows consumers reward the presence of more competition among advertisers in the fat-head. Only practitioners have offered a cogent explanation for this: a fat-head query indicates a consumer is comparing alternatives and thus wants to see lots of options. So how should the advertiser respond? Target both types of query, but do so with content that helps the consumer perform the task at hand. Ads that target the fat-head should be all about showing how your value offering compares to the competition; have the ad lead to a landing page that also does that job. Long-tail queries should be all about making it easy for the consumer to convert, and the same goes for the landing page. Good luck and happy marketing.

Figure 3. Advertiser competition and the long-tail. Part A shows that the highly specific queries in the long tail result in more clicks when there are fewer competitors targeting them. We can surmise that these ads are for products more authentically related to the query and are thus more useful to the consumer. Part B highlights that consumers reward seeing more competitors in the fat-head.

The consumer ecosystem

As one of the many people who downloaded the data Yelp posted in its Kaggle data mining competition, I couldn’t help but start thinking about what other value Yelp could provide beyond the consumer services that are the core of its business. I have an idea for a business intelligence application:

Businesses generally define their market in terms of the consumers within it and the other businesses with which they compete in offering similar value to those consumers. This is a very inward-looking way of defining a market. A better way is to define it through the eyes of your customer. Figure 1 was constructed from some of the Yelp data released for its Kaggle competition, using a methodology described in a later section. It shows a cluster of businesses in Phoenix, AZ that have been classified into the same business categories. On the surface it seems like a strange mix of radio stations and newspapers; however, they are clustered together because they are all categorized as mass media. This depicts a set of competing businesses as typically conceived.

Figure 1: A traditional map of market competitors united by their mass media categorization.

Figure 2 focuses in on one of the businesses in Figure 1, KUPD 98 FM, and depicts its place in a diverse business ecosystem constructed from the other businesses reviewed by listeners who wrote a review of KUPD. The central position of Delux Burger among all these businesses is surprising, counter-intuitive and a potentially valuable insight. However, the true value of the data pattern is that it gives KUPD a deep, multi-faceted insight into the behavior of its listeners. The potential applications of this insight are broad and certainly include arming the KUPD ad sales manager with evidence to show the 71 businesses in Figure 2 the potential value of advertising on KUPD. What might KUPD be willing to pay Yelp for this insight, updated on an ongoing basis? I think that being able to provide this information to any business listed on Yelp would constitute a minimum-viable product.

Figure 2: The business ecosystem of which KUPD is a member. Unlike the traditional market structure depicted in Figure 1, this ecosystem contains no direct competitors (i.e., radio stations). This graphic depicts the businesses that KUPD listeners frequent; they are principally united by their patronage of Delux Burger.

Figures 1 and 2 depict parts of two large networks of businesses listed on Yelp. In Figure 1 that network is created by linking together businesses that were classified into the same business categories (e.g., Discount Store, Nightlife and Music Venues); the more categories a pair of businesses share, the more similar those businesses were considered to be. The overlapping web of categories creates a densely connected business network. The network was then pruned to a minimum-spanning tree, the smallest set of connecting links needed to join all the businesses into a network structure of minimum total length. The simple set of links depicted in Figure 1 is from that minimum-spanning tree. The widely known Girvan-Newman algorithm was then used to find communities among the businesses in the network. These communities are considered to be the true sub-markets within the overall network. Figure 1 depicts one such community of businesses, connected by the minimum-spanning tree.
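A toy version of this pipeline, with made-up businesses and category sets (the real analysis used the Yelp data and Girvan-Newman; here the final split simply removes the longest MST link and takes connected components, a crude stand-in that behaves similarly on a tiny example):

```python
from itertools import combinations

# Hypothetical Yelp-style category sets (not the actual dataset)
cats = {
    "KUPD 98 FM":        {"Mass Media", "Radio Stations"},
    "Power 98.3":        {"Mass Media", "Radio Stations"},
    "AZ Republic":       {"Mass Media", "Print Media"},
    "Delux Burger":      {"Restaurants", "Burgers", "Nightlife"},
    "Hula's Grill":      {"Restaurants", "Bars", "Nightlife"},
    "Crescent Ballroom": {"Nightlife", "Music Venues", "Bars"},
}

def dist(a, b):
    """Jaccard distance: the more categories shared, the shorter the link."""
    return 1 - len(cats[a] & cats[b]) / len(cats[a] | cats[b])

edges = sorted((dist(a, b), a, b) for a, b in combinations(cats, 2))

# Kruskal's minimum-spanning tree via union-find
parent = {n: n for n in cats}
def find(n):
    while parent[n] != n:
        n = parent[n]
    return n

mst = []
for d, a, b in edges:
    ra, rb = find(a), find(b)
    if ra != rb:
        parent[ra] = rb
        mst.append((d, a, b))

# Crude community detection: drop the longest MST link and take the
# resulting connected components (Girvan-Newman would instead iterate
# on edge betweenness)
longest = max(mst)
kept = [e for e in mst if e != longest]
parent = {n: n for n in cats}
for d, a, b in kept:
    parent[find(a)] = find(b)

communities = {}
for n in cats:
    communities.setdefault(find(n), set()).add(n)
```

On this example the split cleanly separates the mass-media businesses from the food-and-nightlife ecosystem.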

A similar procedure was used to generate the network depicted in Figure 2, except in that case businesses were linked because the same people wrote a review of each business. The length of each link was based on the number of reviewers two businesses shared, as well as the number of stars the reviewers awarded each business. Again, the minimum-spanning tree algorithm was used to simplify the network, and Girvan-Newman communities were identified. Figure 2 depicts one such community, except in this case we have a diverse business ecosystem patronized by the same consumers.

As I stated in my introduction, these business ecosystem insights are highly actionable from a managerial perspective. This is particularly true in the way they suggest potential partnerships among businesses that do not compete, yet share the same customers. However, even when viewed from a traditional competitive analysis perspective, the consumers’ view of the market makes it obvious what the real competitive dynamics are. For example, in Figure 2 Delux Burger is the central influence that organizes the ecosystem. There are several other restaurants that, while they only occupy a peripheral position, still attract enough attention to be on the consumer’s radar. What of the other restaurants whose food is utterly unremarkable, neither good nor bad, and thus unworthy of review? Their absence focuses the attention of the competitors in the ecosystem on those who are their real competition. This information should also be a wake-up call for those restaurants that are overlooked. If that insight causes those restaurants to raise their game, then the depiction of Figure 2’s ecosystem might change, prompting all the businesses in the system to want to update their knowledge and business practice on a recurring basis.

I’m working on a web application to display this analysis for all the businesses in this dataset. When it is ready it will be at: http://ecozanti.herokuapp.com/


How much of a market is up for grabs?

In my last blog post I showed you how to use Google search volume to estimate future market share within a product category using Insights for Search, a free Google tool. This time I’ll show you how to estimate how much of a market is composed of consumers looking for a reason to switch brands (i.e., consumers who want you to win them over).

But first, a little marketing theory. Most marketers believe that making their customers extremely satisfied is their goal. Customer satisfaction surveys have generally been replaced by one question: on a scale of 1 to 10, how likely are you to recommend us to friends and family? This is known as the net promoter score (see “The One Number You Need to Grow” in Harvard Business Review, Dec 2003). Customers near the middle of the net promoter scale (i.e., 7-8) are generally your customers only until they can find a vendor or brand they like better. When consumers are thinking about making a purchase they are not fully satisfied with, they will often update their knowledge of alternatives by making a Google search. Let’s look at an example.

Market Share Reporter (MSR) aggregates market share information from a variety of sources for the major product categories. One example is the toilet tissue market, shown in Figure 1. Use the instructions in my last blog entry to find the Google search volume for each of the top 5 toilet paper brands over the last 12 months. You will use Google Insights for Search and probably should use settings similar to those in Figure 2. Figure 3 shows results I recently received from that search.

Figure 1. Toilet tissue market share.
Figure 2. Recommended Insights for Search settings.
Figure 3. Sample results.

If you add the market share percentages for the top 5 brands in Figure 1 you might conclude that they have a lock on 76.5% of the market, leaving little opportunity for another toilet tissue manufacturer. However, no matter what market you name, it’s always possible to estimate the not-entirely-satisfied part of the market by checking the search volume for a phrase like “best toilet paper,” “best tires” or “best NYC barber.” Note the quotes around the example search phrases. Using quotes around your search phrase will ensure your results are not inflated by searches that happen to contain those words in a different order, with a different meaning. What’s the significance of knowing that the “best toilet paper” search phrase occurs more often than 46% of the searches in the Hygiene & Toiletries category? If Charmin’s brand attention of 72 reflects a market share of 23.2%, then “best toilet paper” attention may reflect 14.8% (23.2/72 x 46) of market share “in play.” A less conservative estimate could be gained by averaging the results of the same calculation for the top 5 brands (41.1%). Even the conservative estimate of 14.8% represents a substantial opportunity. If one new brand could capture it all, it would gain instant entry into the top brands of a multi-billion dollar market.
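The arithmetic behind the 14.8% figure, wrapped in a small helper (the numbers are the ones quoted above; the function name is my own invention):

```python
def share_in_play(brand_share_pct, brand_attention, best_attention):
    """Scale 'best X' search attention by a brand's share-per-attention ratio."""
    return brand_share_pct / brand_attention * best_attention

# Charmin: 23.2% share at attention 72; "best toilet paper" attention 46
conservative = share_in_play(23.2, 72, 46)  # roughly 14.8
```

The less conservative 41.1% estimate comes from running the same calculation for each of the top 5 brands and averaging the results.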


Understanding your competitive landscape with Google’s tools

In his good book The 4-Hour Workweek, Tim Ferriss describes how to estimate the revenue from any product with the Google Adwords Keyword Tool. I’ve been thinking: what other valuable marketing information can be gained from Google’s free tools? Take the issue of market share. Let’s say you have an idea for a new niche product but don’t know who might be competing for the same consumer, or how much of the market is already served. You can look at the major industry databases like Market Share Reporter (MSR) or Hoovers. However, unless you are thinking about competing in a major product category (e.g., automobiles, toilet paper, luggage), your market won’t be on their radar. A better way is to follow these steps:

  1. Find out who is bidding on Adwords keywords that are closely related to the product you are thinking about offering (i.e., Adwords that you would bid on to get buyers to your website). As Ferriss says, use the Google Adwords Keyword Tool to find the words consumers use to find products like yours, then search with those keywords and note which companies come up in the paid search results (top section and right margin); and,
  2. Now use another Google tool called Insights for Search (Insights) to see how frequently consumers search for the company names found in step 1. Insights only allows you to look at search history for 5 companies at once, but you can assemble a full list by aggregating results from separate sets of 5 companies. I argue in the paragraphs that follow that you now have a prediction of future market share.
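Step 2’s aggregation can be done by keeping one anchor company in every batch of 5 and rescaling each batch so the anchor’s score matches across batches. A sketch with invented Insights scores:

```python
# Hypothetical relative-attention scores from two Insights batches
# that share the anchor term "honda"
batch1 = {"honda": 80, "yamaha": 60, "suzuki": 40, "kawasaki": 35, "ducati": 10}
batch2 = {"honda": 64, "bmw": 20, "triumph": 12, "ktm": 8, "aprilia": 4}

# Rescale batch2 onto batch1's scale via the shared anchor, then merge
scale = batch1["honda"] / batch2["honda"]
merged = dict(batch1)
merged.update({k: v * scale for k, v in batch2.items() if k != "honda"})
```

Because Insights scores are relative within each query set, the shared anchor is what makes scores from different batches comparable.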

Take a look at the following example. Here I’m searching the names of motorcycle brands. Note the two boxes: one around the drop-down containing the word “Motorcycles” and the other around the bar chart. The drop-down is one of the most useful features of Insights because you can isolate your search to a specific category, so you can be sure your results really represent what you intend them to. Take for example the search term “harley.” In this situation I want to know the prevalence of searches for Harley-Davidson motorcycles. I know that some people will abbreviate their search by using “harley,” so I want to capture that search volume in addition to that of the full brand name. The problem is that some “harley” searches have nothing to do with motorcycles, so to be sure I only get the ones for motorcycles I select the Motorcycles category in the drop-down.

Figure 1. Google Insights for Search

The bar chart is the information I’m looking for. Note how each bar in Figure 1 has a number beside it. (If you try this yourself and don’t see a number, sign up for a free Google account and log in.) The number is a percentile for search frequency. In this case it indicates how often the keyword “harley” is searched for compared to all other searches in the Motorcycle category. The number 70 means that within the Motorcycle category, 70% of all other searches happen less often than searches for “harley.” This number is important to marketers because it is a measure of attention.
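Concretely, the percentile works like this (the search volumes below are invented for illustration):

```python
# Hypothetical search volumes within one category
volumes = {"harley": 700, "yamaha": 600, "suzuki": 500, "kawasaki": 650,
           "ducati": 300, "bmw": 400, "triumph": 200, "ktm": 100,
           "indian": 800, "honda": 900, "victory": 1000}

def percentile_rank(term, vols):
    """Share of the other terms that are searched less often than `term`."""
    others = [v for k, v in vols.items() if k != term]
    return 100 * sum(v < vols[term] for v in others) / len(others)

score = percentile_rank("harley", volumes)  # 7 of the 10 others fall below 700
```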

Now let’s review a little marketing theory. Perhaps the oldest marketing model is the AIDA model of the phases someone goes through when buying a new product: Attention -> Interest -> Desire -> Action. Market share statistics like those from MSR in Figure 2 measure Action, the sales that happened in the past. How can you tell what sales will happen in the future? You look at measures of the earlier phases, and here in Insights we are looking at measures of Attention.

Figure 2. Market share for motorcycle brands.

Two months ago, before the summer riding season started, I used Insights to see where consumer attention was in the motorcycle category. Today I checked it again with results as in Figures 3A and B. If you compare Figures 2 and 3A you can see that pre-season riding attention was predicting a modest increase in market share for Honda, Yamaha and Ducati, seemingly at Harley’s loss. Generally though it seemed that attention was consistent with the 2009 market shares. Now that the season is underway Figure 3B is indicating not only higher attention for all motorcycle brands, but a restoration of Harley to the top dog position. The really interesting observation is the big surge of interest in the European brands, particularly BMW and Ducati. Will this increased attention turn into a transfer of market share? My prediction is “yes,” and probably to the detriment of Yamaha, Kawasaki and Suzuki.

Figure 3A and B. Attention on motorcycle brands.

I started this post by saying I would show you a way to get market share data for any category too small to be on MSR’s or Hoover’s radar. Now you can see that I’m telling you a way to get a prediction of future share. But what’s better: knowledge of the past or prediction of the future? My example was for a major category that is monitored by MSR. I used that example because I wanted to show you that there is consistency between MSR and the free Google attention data. Now go do some marketing!