Felix Schönbrodt

Dr. Dipl.-Psych.

Finally! Tracking CRAN packages downloads

[Update June 12: The data.table calls have been improved (thanks to a comment by Matthew Dowle); for a similar approach see also Tal Galili's post]

The guys from RStudio now provide CRAN download logs (see also this blog post). Great work!

I have always wondered how many people actually download my packages. Now I can finally get an answer (… with some anxiety about being disappointed ;-)
Here are the complete, self-contained R scripts to analyze these log data:

Step 1: Download all log files into a subfolder (this step takes a couple of minutes)

## ======================================================================
## Step 1: Download all log files
## ======================================================================

# Here's an easy way to get all the URLs in R
start <- as.Date('2012-10-01')
today <- as.Date('2013-06-10')

all_days <- seq(start, today, by = 'day')
 
# create the target folder (no warning if it already exists)
dir.create("CRANlogs", showWarnings = FALSE)

# only download the files you don't have:
missing_days <- setdiff(as.character(all_days), tools::file_path_sans_ext(dir("CRANlogs"), TRUE))

# build the URLs from the missing days, so that each URL matches its file name
year <- format(as.Date(missing_days), "%Y")
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', missing_days, '.csv.gz')

for (i in seq_along(missing_days)) {
  print(paste0(i, "/", length(missing_days)))
  download.file(urls[i], paste0('CRANlogs/', missing_days[i], '.csv.gz'))
}
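
If a single day fails to download, a plain loop stops right there. Here is a minimal sketch of a more forgiving variant. The helper names log_url() and download_missing() are hypothetical (they are not part of any package), and the fetch argument is an assumption added only so that the loop can be tried with a stand-in function instead of a real download.

```r
# Hypothetical helper: build the log URL(s) for one or more dates.
log_url <- function(days) {
  days <- as.Date(days)
  paste0("http://cran-logs.rstudio.com/", format(days, "%Y"), "/", days, ".csv.gz")
}

# Hypothetical helper: fetch every day that is not yet present in dest_dir.
# Returns (invisibly) the days that could not be fetched.
# fetch defaults to download.file, but any function(url, destfile) will do.
download_missing <- function(days, dest_dir = "CRANlogs", fetch = download.file) {
  dir.create(dest_dir, showWarnings = FALSE)
  have <- sub("\\.csv\\.gz$", "", list.files(dest_dir, pattern = "\\.csv\\.gz$"))
  todo <- setdiff(as.character(as.Date(days)), have)
  failed <- character(0)
  for (day in todo) {
    dest <- file.path(dest_dir, paste0(day, ".csv.gz"))
    ok <- tryCatch({
      fetch(log_url(day), dest)
      TRUE
    }, warning = function(w) FALSE, error = function(e) FALSE)
    if (!ok) {
      unlink(dest)               # drop a partial file so the day is retried next time
      failed <- c(failed, day)
    }
  }
  invisible(failed)
}
```

Calling download_missing(all_days) a second time would then only try the days that are still missing, including those that failed before.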

 
 

Step 2: Combine all daily files into one big data table (this step also takes a couple of minutes…)

## ======================================================================
## Step 2: Load single data files into one big data.table
## ======================================================================
 
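# only the daily logs; skips CRANlogs.RData, which is saved into the same folder below
file_list <- list.files("CRANlogs", pattern = "\\.csv\\.gz$", full.names = TRUE)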

logs <- list()
for (file in file_list) {
    print(paste("Reading", file, "..."))
    logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)
}

# rbind together all files
library(data.table)
dat <- rbindlist(logs)

# add some keys and define variable types
dat[, date:=as.Date(date)]
dat[, package:=factor(package)]
dat[, country:=factor(country)]
dat[, weekday:=weekdays(date)]
dat[, week:=strftime(as.POSIXlt(date),format="%Y-%W")]

setkey(dat, package, date, week, country)

save(dat, file="CRANlogs/CRANlogs.RData")

# for later analyses: load the saved data.table
# load("CRANlogs/CRANlogs.RData")
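
Why can the .csv.gz files be read without unzipping them first? Base R's file connections detect gzip compression when a file is opened for reading, so read.table and read.csv accept the compressed path directly. The sketch below shows the read-and-stack pattern on two tiny files; the rows and the toy_logs folder are made up for illustration and are not real log data.

```r
# Write two tiny gzip-compressed CSV files (made-up rows, not real log data)
toy_dir <- file.path(tempdir(), "toy_logs")
dir.create(toy_dir, showWarnings = FALSE)

write_toy <- function(day, packages) {
  con <- gzfile(file.path(toy_dir, paste0(day, ".csv.gz")), "w")
  on.exit(close(con))
  write.csv(data.frame(date = day, package = packages, stringsAsFactors = FALSE),
            con, row.names = FALSE)
}
write_toy("2013-06-01", c("TripleR", "RSA"))
write_toy("2013-06-02", "RSA")

# read.csv opens the compressed files transparently; rbind stacks the daily tables
files <- list.files(toy_dir, pattern = "\\.csv\\.gz$", full.names = TRUE)
toy <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
toy$date <- as.Date(toy$date)
```

For the real logs, rbindlist from data.table does the stacking step much faster than rbind; the toy version only illustrates the idea.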

 
 

Step 3: Analyze it!

## ======================================================================
## Step 3: Analyze it!
## ======================================================================

library(ggplot2)
library(plyr)

str(dat)

# Overall downloads of packages
d1 <- dat[, length(week), by=package]
d1 <- d1[order(V1), ]
d1[package=="TripleR", ]
d1[package=="psych", ]

# plot 1: Compare downloads of selected packages on a weekly basis
agg1 <- dat[J(c("TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]

ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x  = element_text(angle=90, size=8, vjust=0.5))


agg1 <- dat[J(c("psych", "TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]

ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x  = element_text(angle=90, size=8, vjust=0.5))
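
The weekly aggregation can be adapted to months and to a restricted time frame. A minimal sketch in base R, using a made-up toy data frame in place of dat (the columns date, package and ip_id mirror the log data; the values, and the names from, to and monthly, are assumptions for illustration):

```r
# Toy data standing in for dat (made-up rows)
toy <- data.frame(
  date    = as.Date(c("2013-01-05", "2013-01-20", "2013-02-03", "2013-02-04", "2013-03-01")),
  package = c("RSA", "RSA", "RSA", "TripleR", "RSA"),
  ip_id   = c(1, 1, 2, 3, 4),
  stringsAsFactors = FALSE
)

# restrict to the time frame of interest
from <- as.Date("2013-01-01")
to   <- as.Date("2013-02-28")
sel  <- toy[toy$date >= from & toy$date <= to, ]

# month label (e.g. "2013-01"), then count unique ip_id per month and package,
# the same download definition as in the weekly plots
sel$month <- format(sel$date, "%Y-%m")
monthly <- aggregate(ip_id ~ month + package, data = sel,
                     FUN = function(x) length(unique(x)))
names(monthly)[3] <- "downloads"

# With the data.table from Step 2, the same idea might be written as:
# dat[date >= from & date <= to, .(downloads = length(unique(ip_id))),
#     by = .(month = format(date, "%Y-%m"), package)]
```

The resulting table can be passed to ggplot in the same way as agg1, with month on the x axis instead of week.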

 
 
Here are my two packages, TripleR and RSA. Actually, ~30 downloads per week (from this single mirror) is much more than I expected!

[Figure: weekly downloads of TripleR and RSA]

 

To put things in perspective, here is the same plot with the package psych included:

[Figure: weekly downloads of psych, TripleR and RSA]

Some psychological sidenotes on social comparisons:

  • Downward comparisons enhance well-being, extreme upward comparisons are detrimental. Hence, never include ggplot2 in your graphic!
  • Upward comparisons instigate your achievement motive and give you the drive to get better. Hence, select some packages that are slightly above your own.
  • Of course, things are a bit more complicated than that …

All source code on this post is licensed under the FreeBSD license.


15 Responses to “Finally! Tracking CRAN packages downloads”

  1. Tal Galili says:

    Hi Felix,

    I found your post only after writing the code for my own post here:
    http://www.r-statistics.com/2013/06/answering-how-many-people-use-my-r-package/

    If you’re interested, I’d be happy to include your code in the “installr” package, feel free to send changes/pull-requests to: https://github.com/talgalili/installr/blob/master/R/RStudio_CRAN_data.r

    Cheers,
    Tal

    • FelixS says:

      Hi Tal, I guess many people started working on this simultaneously after the logs were released! :-)

      It’s a good idea to put this functionality into a package. Feel free to use any code from this post (I’ve added an official license statement); when I find the time I’ll send some changes. I think the data.tables functions should be much faster than the rbind approach!

      Cheers,
      Felix

  2. [...] map which highlights the countries based on how much people there use of R. Felix Schonbrodt wrote a great post on Tracking CRAN packages downloads. In the meantime, I’ve started crafting some basic functions for package developers to easily [...]

  3. Matthew Dowle says:

    Hi,
    Very nice!
    Just a few improvements of data.table usage … the rbindlist is good, but then those 5 dat$<- will copy all of dat each time, just like base R. That's what := is for; e.g., dat[,date:=as.Date(date)] should be faster than dat$date <- as.Date(dat$date), times 5. Also in section 3 the idea is to avoid variable name repetition by not needing d1$; e.g., instead of d1 <- d1[order(d1$V1), ] it's just d1 <- d1[order(V1),].
    Hope that's useful.
    It worked first time for me!
    -Matthew

    • FelixS says:

      Thanks Matthew! I’m new to data.tables (What a great package!), and it’s great to learn new tricks. I updated the source code above.

  4. Anthony Damico says:

    thank you

  5. [...] relying on the nice code that Felix Schonbrodt recently wrote for tracking packages downloads, I have updated my installr R package with functions that enables the user to easily download and [...]

  8. [...] Tracking CRAN packages downloads [...]

  9. Natalia says:

    I am new to R and found your post extremely useful! Could you please let me know how I can modify the code to generate a graph that would show monthly downloads from a certain time frame?

  11. Matt Dowle says:

    Hi,

    This article popped up because the code in it appears on the back of one of the entries for the useR! 2014 T-Shirt competition. Noticed that fread isn’t being used (it was quite new at the time). It saves quite a few lines of code and should be a lot faster.

    Try replacing these lines :

    logs <- list()
    for (file in file_list) {
        print(paste("Reading", file, "..."))
        logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
             dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)
    }
    # rbind together all files
    library(data.table)
    dat <- rbindlist(logs)

    with just one line :

    dat <- rbindlist(lapply(file_list, fread))

    Haven't actually tried it on those logs, but this should work and be much faster. Let me know if not. By default it prints a percentage counter as it loads each file, too.

    Matt

    • FelixS says:

      Hi Matt,

      thanks for the hint! It partly works. `read.table` can automatically read the downloaded .csv.gz files, whereas `fread` prints an error when attempting to load them. If I unzip the files first, the one-liner works and gives a 10x speedup (which, however, does not include the time needed to unzip the files externally).

  12. […] own CRAN mirror. That announcement prompted a few people to analyze some of the available data. Felix Schonbrodt showed how to track R package downloads, Tal Galili looked for the most popular R packages, and James Cheshire also created a map […]
