The History of Copytesting

The trade of advertising is now so near to perfection that it is not easy to propose any improvement.

Samuel Johnson, 1759


It is said that the roots of advertising theory can be traced back to Aristotle’s principles of persuasive rhetoric. Professional advertising research, however, can be verifiably traced to the beginning of the 20th century. It grew largely out of academic research projects and out of the early adoption and development of research methods by advertising agencies studying advertising effectiveness.

On the academic side, Walter Dill Scott, who taught advertising psychology at Northwestern University, conducted a program of advertising research for the Agate Club of Chicago in 1901. Daniel Starch began by teaching an advertising psychology course at the University of Wisconsin in 1909. Edward K. Strong, Jr., a psychologist and researcher at Stanford University, published several early books relevant to research methodology, e.g., Psychological Methods as Applied to Advertising (1913) and The Effect of Size of Advertisements and Frequency of Their Presentation (1914). And George Gallup’s work took shape from research he had done during professorships at the University of Iowa, Drake University and Northwestern University.

On the advertising side, Stanley Resor of J. Walter Thompson was one of the early adopters of copy research and by 1910, advertising research had become part of the portfolio of several leading agencies.

It is interesting to note that, even in those early days of ad research, there was strong debate about research methodology and about issues that still concern us today, viz., memorability of advertisements, message comprehension, believability, executional devices, etc. Also, as is the case today, advertising research involved field, lab and survey techniques.

The following is a modest attempt to trace the key developments in advertising research from the beginning of the twentieth century until today.


Prior to 1920, more than one-third of the general magazine advertisements featured a return coupon. These coupons were mainly used as a means of distributing free product samples to potential users. The advertiser was gratified when a particular piece of copy elicited a huge number of coupon requests.

Stanley Resor of J. Walter Thompson was one of the first to commission a study looking at the rates of coupon returns and advertising effectiveness. Resor even hired Dr. John B. Watson, the famous behaviorist, in 1921 as part of his effort to find out why people bought a particular product. For a long period of time the coupon method was viewed as a means of sampling the brand and testing the efficiency of the copy. Of course, since the advertisement had already been published, this was essentially a post-test assessment of the copy. There were also attempts to evaluate different copy treatments based on the cost per coupon returned.

During the Great Depression, coupon usage was at its height and provided an excellent opportunity for market research. In 1930, Harold Rudolph published the book Four Million Inquiries from Magazine Advertising, which was based on studying coupon returns. Subsequently, Daniel Starch compiled an analysis of millions of coupon responses.

Using coupon response as a means to test copy had the benefit of being based on actual consumer behavior instead of mere intentions. However, it had some severe limitations including the inconsistent usage of coupons by certain customers. A survey by the manufacturer of Ipana toothpaste (Bristol-Myers Co.) discovered that the great majority of coupons were returned by habitual clippers, many of whom were already users of the product. Also, certain categories of coupons (e.g., those for free samples, those that required a nominal charge for a sample, those that requested additional information, those that offered mail-order sales, etc.) yielded different patterns of responses.

As it became increasingly clear that analyses of coupon returns were not adequate to test the quality of an advertisement or its impact on potential buyers, what has been called the Opinion Ratings method took hold. In this method, responses were sought from potential users of the advertised product. It had the additional advantage of obtaining feedback from potential customers before any major media expense was incurred. Its proponents argued that the results were valid because the method analyzed the opinions of likely users instead of random samples of audiences who might or might not be prospects for the product being advertised. However, it was soon found that this approach had its own limitations: it was essentially comparative in nature (between two or more alternatives), and while people might be able to rate one ad over another for believability or even interest, there was doubt that they could rate them for effectiveness. Consequently, the Opinion Ratings approach fell into disfavor in the early 1930s. Variations of this approach continue today, however, and it may well be credited with being the forerunner of many current focus groups.

A seminal development in copy testing was the 1923 publication of a paper titled “Testing the Effectiveness of Advertisements” in the Harvard Business Review by Daniel Starch. Among other things, Starch proposed that in order to be effective, an advertisement must a) attract attention, b) arouse interest, c) be believable, d) be memorable, and, finally, e) produce a response.


Many of the modern copy testing techniques can be traced back to the research work initiated in the 1930s.

In 1931, Dr. George Gallup applied the Recognition method to measure noting and reading of advertising in Liberty Magazine and its competitors. The Gallup approach was innovative as it used actual readers of specific magazine issues as opposed to studies that used students in unnatural, forced exposure situations.

In 1932, after professorships at Iowa, Drake, and Northwestern, where he taught journalism and psychology, Gallup set up the first copy research department at Young & Rubicam. It was at Y&R that he also established a national interviewing staff and launched a Radio Listenership method based on coincidental recruiting procedures. During the late 1930s, dissatisfaction with the Ad Recognition method led Gallup to develop a new testing protocol, which was based on proven, brand-aided recall and the introduction of several additional metrics, including sales point playback and urge to buy. His focus became memorability, communication and persuasion.

In 1933, Daniel Starch began a syndicated service based on the Recognition method. In it, respondents were asked whether they had noted an advertisement in a magazine and, if so, whether they had read most of it. The Starch recognition test was popular because it collected quantifiable information regarding the effectiveness of general consumer magazine advertising, and a small number of interviews could be used to project the cost per reader. Also, the fact that the service was marketed as a very inexpensive method boosted its appeal.

In the early 1930s, Arthur C. Nielsen, Sr. used a Share of Sales criterion to gauge marketing and advertising effectiveness. This was established by auditing sales of products sold through food and drug stores. In 1936, Nielsen attached audiometers to families’ radios. The audiometer, originally developed at MIT, automatically recorded radio operations and allowed researchers to estimate audience size for particular radio broadcasts. This gave rise to what are now known as “Nielsen Ratings,” which project audience size based on household-by-household samples.

The 1930s were also the time when physiological measures were applied to gauge audience response to commercials. In the early 1930s, Dr. Darrell Lucas and Dr. James Weinland of New York University began to test printed display advertisements by connecting students to a psycho-galvanometer, or lie detector. Others used eye-tracking to trace the path of a reader’s eyes in scanning advertisements. While these particular techniques did not take deep root, variations of the approaches still continue.

In 1938, the Advertising Research Foundation (ARF) published a compendium of copy testing methods titled “Copy Testing.” This was the first effort to compile a resource of the then-current knowledge about copy testing.


Advertising on radio reached its peak in the 1940s. Recall testing was especially adapted to broadcast commercials and was applied widely. By 1948, Dr. Gallup had left Y&R to join Dr. Claude Robinson, then head of Opinion Research Corporation, in starting Gallup & Robinson, Inc. (now The G&R Cooperative, LLC), the first advertising research firm.

By then, print testing had evolved to include pre-testing, where the test ad was tipped into a simulated magazine and evaluated using the Recall/Communication/Persuasion method. This method provided additional qualitative and quantitative data and also utilized more stringent controls than did recognition.

In the 1940s, Dr. Ernest Dichter introduced the concept of rapport between the consumer and the message or brand as a factor in driving sales. This was a marked departure from the emphasis on quantitative methods espoused by contemporary researchers, especially Alfred Politz, who sharply criticized Dichter for introducing so-called “pseudo-scientific” concepts. In retrospect, it is clear that copy research greatly benefited from the use of qualitative consumer interviews, personal introspection, and other motivation research techniques developed by Dichter.

Meanwhile, as an outgrowth of the radio research that he was doing at Y&R, Gallup founded another company called Audience Research, Inc., which specialized in radio and movie audience research. To study an audience’s reaction to a movie, Gallup’s venture purchased the exclusive rights to the Hopkins Televote Machine invented by a former Navy officer. Through a series of individually controlled rheostats, this machine could be turned by respondents to indicate liking or disliking as they viewed the movie. The company also obtained additional data using concept tests, cast recognition and ad budget alternatives to predict box office receipts. When television threatened movies’ existence, Gallup exited the business and sold the Hopkins Televote to Disney. The technique resurfaced years later at Audience Studies, Inc. (ASI).


In the early 1950s, Horace Schwerin introduced a syndicated research system that had, at its core, a measure of persuasion along with measures of advertised-brand recall and content recall. Consumers were recruited to a theater and invited to indicate which brands they would choose in a number of categories. After being exposed to the commercials, they were again asked to choose among the brands. This pre-post brand choice design, which was at the core of the Schwerin persuasion measure, has endured in some form or another to this date.

One of the major developments in this period was the introduction of the concept of brand image in advertising. In an influential article in the Harvard Business Review, Gardner and Levy (1955) discussed this concept and explained how intangible symbolisms of advertised products can be used to communicate and appeal to shoppers of different social classes.

In 1955, the ARF carried out the famous Printed Advertising Rating Methods (PARM) study, which investigated recall and recognition tests and found recall to be the better measure of retention of impressions made by advertisements.


In 1961, a landmark study broadly promoting copy testing came from the Association of National Advertisers (ANA). The results were published in a book titled Defining Advertising Goals for Measured Advertising Results, or DAGMAR. The publication argued that, in addition to sales figures, advertising effectiveness should be judged against communication spectrum goals, i.e., Unawareness-Awareness-Comprehension-Conviction-Action.

The DAGMAR model recognized that people’s actions are motivated by cognitions and emotions. It came to be popularly known as the “learn-feel-do” theory and promoted the notion that people who understood advertising claims were more likely to buy than those who were merely aware of the brands.

The DAGMAR position was met with resistance from the motivational researchers, who continued to emphasize the brand imagery aspect of advertising.

However, other researchers contended that advertising was primarily a form of communication whose main objective might be, for example, to establish familiarity with a new product or brand name. In their view, measured evidence of success lay in how effectively advertising communicated the brand, even though it might not result in an immediate spike in sales.

Another interesting concept that developed at this time was involvement. Herb Krugman (1965) defined involvement as “conscious deliberation of message content.” In other words, the more the reader deliberated on the advertising message, the higher the involvement. Conversely, messages that did not require decoding or deliberation in any significant way were considered to be low-involvement.

William Wells found message relevance to be even more important than aesthetics in driving a purchase. He categorized products on the basis of their likeability – as belonging to either “approach” or “avoid” categories – and suggested that the “degree of involvement with the product (as well as risk associated with the purchase) are critical determinants of liking.” Subsequently, Wells et al. (1964) introduced commercial rating scales that enabled consumers to rate their reactions to advertising.


The 1970s were a particularly eventful time for TV commercials, as the rise of independent cable channels facilitated on-air pretesting by driving down the cost of exposure. Also, the technique of testing rough commercials on-air was introduced, making pre-testing economically viable.

Split cable and addressable converter boxes helped marketers determine whether one campaign was able to sell more products than another by using actual behavior (sales) as the criterion measure; matched households could be shown different ads and then tracked for purchase behavior. Despite its promise, this approach proved to be too expensive and slow for most copy testing needs and was not considered appropriate for non-packaged-goods advertising.

This period also saw the increased use of mall intercept – forced exposure testing with increased use of diagnostic questions and target groups – a style of research that has received increased interest today with the advent of online panels.

Millward Brown introduced Tracked Recall in the UK in 1972. This method sought to establish traces of the advertisement by using what is known as the Recall-Cueing methodology, which seeks to track “brand-linked presence of the advertisement” in memory.

Also noteworthy during this period was the development of scales to quantify consumer reactions to TV commercials. The theoretical foundation of these measurement (rating) scales can be traced to Wells, who presented empirical evidence gathered from experiments using a 12-item Emotional Quotient (EQ) Scale (1964). His findings indicated that ads that are both well-liked and meaningful to the respondents also tend to be better recalled and more effective. Subsequently, Wells, Leavitt, and McConville (1971) built on these findings and constructed six Reaction Profile Scales dealing with what they believed to be critical dimensions in viewers’ reactions to commercials, viz., humor, vigor, sensuousness, uniqueness, personal relevance, and irritation. Schlinger (1979) further operationalized the concept and developed a rating instrument called the Viewer Response Profile (VRP) to focus on “how people feel after seeing a commercial rather than what they know.” The quantification of affective responses to commercials was thus an important step forward in copy testing methodology.


In 1982, advertising agencies published a manifesto on copy testing called Positioning Advertising Copy Testing (PACT). It was developed by the top research professionals in ad agencies, who examined their concerns about and experiences with copy testing practices. Out of that examination, a consensus on basic copy testing principles emerged.

These principles did not dictate a specific methodology but provided a framework to guide testing practice. They called for relevance to the ad’s objectives, agreement before testing on how the results will be used, the use of multiple measures, attention to the number of test exposures, consistent samples and degree of finish between comparable ads, and measures that are reliable and valid.

In the 1980s, there was a renewed focus on studying a commercial to determine how its specific elements affect the overall presentation and, consequently, the viewer. However, these findings were criticized by some as being largely diagnostic and not very useful for making evaluative decisions.

1990s to Present

In 1990, the ARF presented the findings of its landmark Copy Research Validity Project. The study took eight years to develop and faced numerous design concerns and significant funding problems. It was the first industry study to empirically show that copy quality alone drives sales results, and it remains the only study to independently test the major copy testing protocols and relate their results to in-market sales performance. The main results indicated that Likeability and Recall were the top predictors of in-market sales performance, with both measures being stronger than any measure of persuasion. Also controversial was its finding that post-only measures of persuasion were stronger than pre-post measures of persuasion.

Another major advancement during this time sprang from advances in the science of neurophysiology, with new technologies for measuring and thus understanding how the brain functions. Perhaps chief among this new learning has been the deepening recognition of how powerful the consumer’s emotional response towards the brand is as a motivator of his or her behavior. Accordingly, copy testing techniques have been augmented to measure implicit affective responses.

In addition, there has been a renewed interest in using physiological measures such as electroencephalography (EEG), facial electromyography (facial EMG), magnetic resonance imaging (MRI), etc., to link implicit physiological responses to specific emotional events, constructs, and devices in an advertisement.

Recently, a new metric, engagement, has also received extensive research attention. The focus is increasingly on whether and to what extent advertising is successful in engaging or involving the audience with the brand, media, and/or advertising. As a result, measures of Love and Likelihood to Recommend have become part of many copy testing protocols.

Further, rapid advances in web technologies have resulted in opportunities to drive down research costs and speed up delivery. Although issues about sample quality and exposure context persist, web-based copy testing is one of the most significant technological developments in copy testing.


Schools of Advertising Research

Advertising research has a long, fascinating, and sometimes contentious history. It has drawn on research from a wide variety of disciplines including psychology, physiology, sociology, linguistics, and statistics. It has benefited from the wisdom and experiences of many practitioners on the academic, advertiser, agency and supplier sides. Trying to predict the human mind remains a venture into largely unknown territory. However, advancing tools and an increased understanding of those tools are improving our ability to distinguish between substandard, standard, and excellent creative content.

While it is entirely appropriate to sum up this evolution in terms of Virginia Slims’ unforgettable slogan, “You’ve come a long way, baby,” perhaps the new media opportunities, the new, easier, and cheaper ways to purchase goods and services, and new communication techniques such as harnessing word-of-mouth make the observation of the Jazz Singer some 80 years ago more pertinent: “You ain’t heard nothin’ yet.”


