A special kind of relationship and the cool things they do together, starring Adobe Analytics, R and Catalyst


One of the sessions I attended at the Adobe Summit was titled “Advanced analysis with Adobe Analytics and R” and put on by Jessica Langford, Trevor Paulsen, and Randy Zwitch. In this session Randy and Jessica went through a few demos of how they use R and Adobe Analytics in some unique ways. I’ll outline two use cases they talked about: one analysis demo and one data sharing demo.

1. Page Load Time Analysis

Jessica Langford talked about a specific use case where she wanted to measure the impact of page load times on conversions. I’m sure there are many ways to go about this type of analysis, but there is one pretty straightforward method using R and Adobe Analytics.

The first step is to open RStudio and connect to the Adobe API using Randy’s RSiteCatalyst package. Jessica’s demo went through the Data Warehouse instead of the API, but it’s a bit more straightforward to go through the API (the API is limited to 50,000 rows of data, so if you’re working with a larger data set than that, use the Data Warehouse).

After this, you can draw data straight into R, which means you can take advantage of R’s ability to run calculations on fairly large data sets. The next step is to get actual Adobe data into R using some basic queries. To start, Jessica used one metric and two dimensions: page load time by page name and by user. If you were to visualize this in a table, you would see the following data:

Doing this allows you to get fairly granular data from Adobe. You want granular data because you want control of the calculations being made, and you want the calculations to be made in R. *Note: Jessica also excluded users who bounced, because bounces tend to add a lot of noise to the data.

# Remove bouncers (bouncerIndex is a helper from the demo; its definition is not shown in this post) #
visitorRollupDF = visitorRollupDF[-bouncerIndex(visitorRollupDF$totalPageViews), ]
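
As a minimal sketch of the bounce filter, assuming a bounce means a visitor with exactly one page view: the demo's `bouncerIndex` helper isn't shown, so the logical filter below is a stand-in rather than Jessica's actual code, and the data frame is made-up sample data.

```r
# Made-up sample data: one row per visitor (illustrative only)
visitorRollupDF <- data.frame(
  visitorID      = c("v1", "v2", "v3", "v4", "v5"),
  totalPageViews = c(1, 4, 1, 7, 2),
  meanPLT        = c(3.2, 1.8, 6.1, 2.4, 5.3)
)

# Assumption: a bouncer is a visitor with exactly one page view
isBouncer <- visitorRollupDF$totalPageViews == 1

# Keep only the non-bouncers using a logical index
nonBouncersDF <- visitorRollupDF[!isBouncer, ]
print(nonBouncersDF)
```

One reason to prefer a logical index here: a negative integer index silently returns zero rows when there are no bouncers to remove, while the logical version simply keeps everyone.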

Before doing any analysis, the next step is to pull in conversion data. The conversion data is keyed on the user ID, which is important because you need a link between conversions and page load time. *Note: each row in the image below represents a user. Jessica pulled a lot of metrics into R, but not all were necessary. I only describe the ones used in this page load analysis.

After this, Jessica bucketed users into groups based on their average page load time. It’s up to you how you want to define the buckets, but an example would be all users who experienced an average page load time of 1-2 seconds. Now that you have buckets of users, you essentially have cohorts on which you can perform analysis.

# Create binary flag for visitors w/ revenue #
visitorRollupDF$purchaseFlag = revenueFlaggVec(visitorRollupDF$totalRevenue)
# Create discrete buckets of page load times (PLT) #
visitorRollupDF$roundedPLT = createBins(binWidth = 1, visitorRollupDF$meanPLT)
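
The two helpers above (`revenueFlaggVec` and `createBins`) come from the demo and aren't defined in this post. Here is a rough sketch of what they might do, assuming the flag is simply "revenue greater than zero" and the bins round each load time down to the start of its one-second bucket. `createBinsSketch` is a hypothetical stand-in, and the data is made up.

```r
# Made-up per-visitor data (illustrative only)
visitorRollupDF <- data.frame(
  totalRevenue = c(0, 59.99, 0, 120.00, 0),
  meanPLT      = c(0.4, 1.7, 2.2, 4.99, 5.0)
)

# Binary purchase flag: 1 if the visitor generated any revenue, else 0
visitorRollupDF$purchaseFlag <- as.integer(visitorRollupDF$totalRevenue > 0)

# Hypothetical binning helper: round each load time down to the start of its bin
createBinsSketch <- function(binWidth, x) floor(x / binWidth) * binWidth

visitorRollupDF$roundedPLT <- createBinsSketch(binWidth = 1, visitorRollupDF$meanPLT)
print(visitorRollupDF)
```

With this choice a 4.99 second average lands in bucket 4 and a 5.0 second average lands in bucket 5, which matches how the buckets are described later in the post.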

The main part of this analysis looks at conversions for these buckets of users by using a test called the “chi-squared test”. In the words of Jessica, a chi-squared test is “really good at determining relationships between discrete variables”. If you want to learn more about this test, here is a helpful video.

In the case of this analysis, the variables being analyzed are page load time and likelihood to purchase (revenue). The first part of the analysis is to create a contingency table, which can be seen below. *Note: Jessica used a binary function to separate users who did purchase from those who did not. In the rows, 0 represents users who did not purchase and 1 represents users who did. The columns represent the buckets of page load times.
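
For a sense of what that contingency table looks like in R, here is a small sketch using base R's `table()`. The vectors are made-up sample data, and the variable names are my own rather than the ones from the demo.

```r
# Made-up per-visitor data (illustrative only)
purchaseFlag <- c(0, 1, 0, 0, 1, 0, 1, 0)
roundedPLT   <- c(1, 1, 2, 2, 2, 5, 5, 5)

# Rows: purchase flag (0 = no purchase, 1 = purchase)
# Columns: page load time bucket
contingencyTable <- table(purchase = purchaseFlag, loadTimeBucket = roundedPLT)
print(contingencyTable)
```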



After this, you are ready to run your chi-squared test. Go ahead and run the test using a chi-squared test function (you can learn about that here). The first number to look at is the p-value; you essentially want this number to be very low (conventionally below 0.05). If it’s low, there is a statistically significant relationship between the two variables (page load time and likelihood to convert). The second set of numbers to look at is the residuals for each bucket. Residuals measure the difference between the actual value and the expected value. A positive residual means a higher than expected value, and a negative residual means a lower than expected value. As you can see in this table, there is a higher than expected conversion rate for page load times of 2, 3, and 4 seconds. At 5 seconds and above, it switches to lower than expected.
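
Here is a minimal sketch of that step using base R's `chisq.test()`. The counts are invented for illustration and are not the numbers from the talk. Note that the residuals R reports are Pearson residuals, meaning the observed-minus-expected difference is scaled by the square root of the expected count.

```r
# Made-up counts (illustrative only)
# Rows = purchase flag, columns = page load time bucket
counts <- matrix(
  c(900, 850, 880, 940, 970,   # 0 = did not purchase
    100, 150, 120,  60,  30),  # 1 = purchased
  nrow = 2, byrow = TRUE,
  dimnames = list(purchase = c("0", "1"),
                  loadTimeBucket = c("1", "2", "3", "5", "6"))
)

result <- chisq.test(counts)

# A small p-value suggests load time and purchasing are related
print(result$p.value)

# Pearson residuals: (observed - expected) / sqrt(expected)
print(result$residuals)
```

In the "1" (purchased) row, positive residuals mark buckets that convert better than expected and negative residuals mark buckets that convert worse.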

Now, being the amazing analyst that Jessica is, she didn’t stop here. “Handing your boss a table of residuals isn’t the best thing to do”, so it’s better to plot this data and visually draw out the impact. Here Jessica plotted conversion rate by page load time, showing the drastic drop-off after a page takes at least 5 seconds to load.
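
A plot like this takes only a few lines of base R. The sketch below assumes one row per visitor with a 0/1 purchase flag and a load time bucket; the data is made up and the chart type (a bar plot) is my choice, not necessarily what Jessica used.

```r
# Made-up per-visitor data (illustrative only)
visitorRollupDF <- data.frame(
  roundedPLT   = c(1, 1, 1, 1, 2, 2, 2, 2, 5, 5, 5, 5),
  purchaseFlag = c(1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0)
)

# Conversion rate per bucket = mean of the 0/1 purchase flag
convRate <- tapply(visitorRollupDF$purchaseFlag, visitorRollupDF$roundedPLT, mean)

barplot(convRate,
        xlab = "Page load time bucket (seconds)",
        ylab = "Conversion rate",
        main = "Conversion rate by page load time")
```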

Jessica also included a few extra components to show why this is a big deal. She first included a table of the top pages with load times above 5 seconds, to show how many high-volume pages are problematic for the business.

Below this table (my favorite part), Jessica made a quick calculation of the “potential revenue loss” due to these slow pages. She calculated this as the total revenue that would have been generated if users on slower loading pages converted at the same rate as users on faster loading pages, minus the actual revenue. The result is the potential loss, which in this case was $840k. I would be willing to bet the business took immediate action on this problem, mainly because Jessica laid out the problem and impact so clearly for them.
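
To make the arithmetic concrete, here is a rough sketch of that calculation. Every input is a made-up number, and the variable names and the flat revenue-per-conversion figure are my assumptions; the talk's $840k came from real data that I don't have.

```r
# Made-up inputs (illustrative only)
slowVisitors         <- 20000  # visitors whose pages took 5+ seconds to load
fastConvRate         <- 0.05   # conversion rate of visitors on faster pages
slowConvRate         <- 0.02   # conversion rate of visitors on slower pages
revenuePerConversion <- 80     # assumed average revenue per converting visitor

# Revenue if slow-page visitors had converted at the fast-page rate
expectedRevenue <- slowVisitors * fastConvRate * revenuePerConversion

# Revenue those visitors actually generated
actualRevenue <- slowVisitors * slowConvRate * revenuePerConversion

potentialLoss <- expectedRevenue - actualRevenue
print(potentialLoss)
```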

2. Share Your Data

The other cool thing I learned from this talk has to do with how we share our data. Currently, I use a bunch of different methods to share my data. In one day I might share data through an email, an Excel file, a PowerPoint presentation, or a Tableau workbook. All of these have their pros and cons, and I’m not married to any one of them. However, there is a new way of sharing data using R that excites me.

This method uses two tools: R and a web-application framework for R called Shiny. Essentially, Shiny allows you to make interactive dashboards on the web without doing all of the coding yourself. Shiny is not incredibly hard to use, and connects with all of the hard work you’ve done in R.

This solution is exciting to me because it allows your audience to interact with your data without you being there! It’s one thing for my audience to hear me walk through my analysis, but it’s quite another for them to have the data to ‘play’ with. Getting familiar with data requires you to actually get into the data itself and explore. Of course, we are never going to get everyone in a company to do this type of work, which is why Shiny dashboards are so powerful. Shiny allows non-data people to interact with your data and become more familiar with it, without needing a two-week training course or a change in job description.
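
To give a feel for how little code a basic Shiny dashboard needs, here is a minimal sketch. It is not one of the dashboards from the talk: the data is made up, and the input and output IDs (`maxBucket`, `convPlot`) are names I picked for illustration.

```r
library(shiny)

# Made-up data (illustrative only): conversion rate by page load time bucket
convData <- data.frame(
  loadTimeBucket = 1:8,
  convRate       = c(0.040, 0.052, 0.050, 0.047, 0.031, 0.024, 0.019, 0.015)
)

# Page layout: a title, a slider, and a placeholder for the plot
ui <- fluidPage(
  titlePanel("Conversion rate by page load time"),
  sliderInput("maxBucket", "Show load times up to (seconds):",
              min = 1, max = 8, value = 8, step = 1),
  plotOutput("convPlot")
)

# Server logic: redraw the plot whenever the slider changes
server <- function(input, output) {
  output$convPlot <- renderPlot({
    shown <- convData[convData$loadTimeBucket <= input$maxBucket, ]
    barplot(shown$convRate, names.arg = shown$loadTimeBucket,
            xlab = "Page load time bucket (seconds)",
            ylab = "Conversion rate")
  })
}

app <- shinyApp(ui = ui, server = server)
# In an interactive session, launch the dashboard with: runApp(app)
```

The slider is the interactive part: your audience can drag it and watch the chart update, with no R knowledge required.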

Now I didn’t cover everything in this blog post from the talk, but these were some of the highlights for me. Thanks for reading!

I'm a Digital Analyst with an exuberant amount of passion for the digital analytics industry. I work with our analytics team to move clients out of reporting and into actionable insights. I believe in the power of measuring results and hope to one day integrate data-driven philosophies with the potential of social entrepreneurship.