* ASQ Improving Evidence Example Do-file
* Henrich R. Greve and Seo Yeon Song
* November 2017
*
* Any edits done to this do-file for later posting should contain named/dated log here:
*
* Using sample of data analyzed in:
* Greve, Henrich R., and Seo Yeon Song. 2017. "Amazon Warrior: How a Platform Can Restructure Industry Power and
* Ecology." Advances in Strategic Management 37:299-335.
*
use Post_authordata.dta, clear
*
* Control variable names placed in global macro
*
global cont "BigFivePublisher SmallorMedium IndiePublisher FictionLiterature ChildrensBooks ComicsGraphic ForeignLanguage SalePrice days_since_published unrated i.saledate"
set more off // Remove scroll stops
set scheme s1mono // Mono scheme is better for journals
*
* Log transforming DV and restricting the analysis to DailyGrossSales > 0
* Including additional variables: "SalePrice" "days_since_published" "year" "quarter"
* Dropping "f_Kloutscore" as it is highly correlated with "lnf_followers" (r=0.946)
*
* g unrated = AverageRating == 0
* g ln_DailyGrossSales = log(DailyGrossSales)
* g days_since_published = saledate - DatePub
* egen pub_id=group(Publisher)
*
* Interesting empirical fact: the market share of the Big Five publishers is decreasing over time.
* Show a stacked bar graph of share over time to illustrate that point. The "quarterly" data extracts that we are
* drawing from are not evenly spaced, so ideally we would have an accurate horizontal axis. However,
* making a bar graph is a lot easier, and we can label the date of each bar to let the reader know.
*
*g BigFiveSales = BigFivePublisher*ln_DailyGrossSales
*g SmallSales = SmallorMediumPublisher*ln_DailyGrossSales
*g IndieSales = IndiePublisher*ln_DailyGrossSales
format saledate %td
graph bar (sum) BigFiveSales SmallSales IndieSales, stack percent nolabel ///
over(saledate, relabel(1 "2/14" 2 "4/14" 3 "7/14" 4 "10/14" 5 "1/15" 6 "5/15" 7 "9/15")) ///
legend(label(1 "Big Five") label(2 "Small") label(3 "Indie"))
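*
* A minimal sketch, not part of the original analysis, for checking the hand-typed bar labels above.
* It lists the distinct extract dates in sorted order, which is the order graph bar uses for a numeric over() variable.
* The list should show seven dates if the relabel() list is complete. The local macro name dates is an illustrative choice.
*
levelsof saledate, local(dates)
foreach d of local dates {
    display %td `d'
}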
* Let's run the analysis right away and show a coefficient plot. This would normally be done after doing the scatter graph
* below, but here we reverse the order because we are using a trick (deleting high-review observations) to improve the graph.
* That means we can't do the analysis afterwards in the same program because the data would be different. The deletion has
* been implemented in the posted data file; that's why all the variable generation statements are commented out.
* The coefficient graph we run below is very informative, and extremely flexible in the formatting. We are using some of
* the options to format it, but there are a lot more available. See the coefplot.pdf file for explanations.
*
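* A minimal sketch, not part of the original do-file: coefplot is a user-written command distributed through SSC
* rather than shipped with Stata, so the lines below install it only if it is not already on the ado-path.
* This assumes the machine has internet access.
*
capture which coefplot
if _rc ssc install coefplot
*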
*egen std_TotalReviews = std(TotalReviews)
*egen std_AverageRating = std(AverageRating)
*egen std_tweet_count = std(tweet_count)
*egen std_f_tenor = std(f_tenor)
reg ln_DailyGrossSales $cont TotalReviews AverageRating tweet_count f_tenor if DailyGrossSales > 0, vce(cluster pub_id)
reg ln_DailyGrossSales $cont std_TotalReviews std_AverageRating std_tweet_count std_f_tenor if DailyGrossSales > 0, vce(cluster pub_id)
coefplot , keep(std_TotalReviews std_AverageRating std_tweet_count std_f_tenor) level(95 99) ci() ciopts(recast(rcap)) xline(0) ///
rename(std_TotalReviews="Reviews" std_AverageRating="Rating" std_tweet_count="Tweets" std_f_tenor="Sentiment") ///
headings(Reviews="{bf: Amazon}" Tweets="{bf: Twitter}")
*
* OK, back to describing the data.
*
* A good start is to take a look at the distributions. In the paper we examine both Amazon reviews and tweets. The
* scattergraphs below are reviews only; tweets can be graphed too but have much weaker effects.
*
histogram TotalReviews if TotalReviews>0 & TotalReviews<=1000, frequency kdensity
histogram tweet_count if tweet_count>0 & tweet_count<=100, frequency kdensity
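*
* A minimal sketch, not part of the original do-file: the histograms above leave out zeros and values above the
* cutoffs (1000 reviews, 100 tweets), so it can help to report how many observations fall outside each window.
* Missing values are excluded explicitly because Stata treats missing as larger than any number.
*
count if TotalReviews == 0
count if TotalReviews > 1000 & !missing(TotalReviews)
count if tweet_count == 0
count if tweet_count > 100 & !missing(tweet_count)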
* Why we think the market share is shifting: Amazon reviews affect sales a lot, and the effect is greater for Indie ebooks.
* This is because the indie books start without the marketing machinery of Big Five books, so each review adds more knowledge.
* This is very easy to show because we can just graph sales as a function of reviews, split by type of publisher.
* To make the graph easy to read we drop the extreme values; otherwise the area with the most points would be squeezed to the
* lower left of the graph, and would be hard to read.
* Show line graph with scattergram for one extract date (20336 is 05sep2015 in Stata's daily date encoding)
*replace ln_DailyGrossSales = . if ln_DailyGrossSales<0
*replace TotalReviews=. if TotalReviews>2500
twoway scatter ln_DailyGrossSales TotalReviews if saledate==20336 & BigFivePublisher==1 || lfit ln_DailyGrossSales TotalReviews ///
if saledate==20336 & BigFivePublisher==1, name(bigfive, replace) title("Big 5")
twoway scatter ln_DailyGrossSales TotalReviews if saledate==20336 & IndiePublisher==1 || lfit ln_DailyGrossSales TotalReviews ///
if saledate==20336 & IndiePublisher==1, name(indie, replace) title("Indie")
graph combine bigfive indie
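*
* A minimal sketch, not part of the original do-file: the difference in slopes between the two panels can also be
* checked numerically with an interaction term. This assumes IndiePublisher is coded 0/1, restricts the sample to
* Big Five and Indie titles on the same date as the graphs above, and leaves out the control variables.
* The coefficient on 1.IndiePublisher#c.TotalReviews is the Indie slope minus the Big Five slope.
*
regress ln_DailyGrossSales c.TotalReviews##i.IndiePublisher ///
    if saledate==20336 & (BigFivePublisher==1 | IndiePublisher==1), vce(cluster pub_id)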
* Here is a graph that is fun to compare with the previous one. Usually residual graphs are the final step; they show how
* much is unexplained after analysis, and can also give warnings of strange distributions of errors (this is textbook stuff).
* This residual graph has a different message. It is so similar to the raw-data graph that it demonstrates how the control
* variables did very little in the analysis: the effect of our theoretical variable overwhelms everything else.
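* Show best model line graph with residual scattergram. TotalReviews is left out of the regression below, so the
* residuals still contain its effect and can be plotted against it.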
*
reg ln_DailyGrossSales $cont AverageRating tweet_count f_tenor if DailyGrossSales > 0
predict double lnsales_res, residuals
twoway scatter lnsales_res TotalReviews if saledate==20336 & BigFivePublisher==1 & lnsales_res<=6 & lnsales_res>=-4, yscale(range(-4 6)) ///
|| lfitci lnsales_res TotalReviews ///
if saledate==20336 & BigFivePublisher==1, name(bigfiveres, replace) title("Big 5 Full Model") legend(off)
twoway scatter lnsales_res TotalReviews if saledate==20336 & IndiePublisher==1 & lnsales_res<=6 & lnsales_res>=-4, yscale(range(-4 6)) ///
|| lfitci lnsales_res TotalReviews ///
if saledate==20336 & IndiePublisher==1, name(indieres, replace) title("Indie Full Model") legend(off)
graph combine bigfiveres indieres