Finally! Tracking CRAN package downloads

[Update June 12: the data.table code has been improved (thanks to a comment by Matthew Dowle); for a similar approach see also Tal Galili’s post]

The guys from RStudio now provide CRAN download logs (see also this blog post). Great work!

I have always asked myself how many people actually download my packages. Now I can finally get an answer (… with some risk of getting frustrated 😉).
Here are the complete, self-contained R scripts to analyze these log data:

Step 1: Download all log files into a subfolder (this step takes a couple of minutes)

## ======================================================================
## Step 1: Download all log files
## ======================================================================
# Here's an easy way to get all the URLs in R
start <- as.Date('2012-10-01')
today <- as.Date('2013-06-10')
all_days <- seq(start, today, by = 'day')
year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')
# only download the files you don't have:
dir.create("CRANlogs", showWarnings = FALSE)
missing_days <- setdiff(as.character(all_days), tools::file_path_sans_ext(dir("CRANlogs"), compression = TRUE))
# index urls via all_days, so the right URL is used even when only some days are missing
for (i in seq_along(missing_days)) {
  print(paste0(i, "/", length(missing_days)))
  download.file(urls[which(all_days == missing_days[i])], paste0('CRANlogs/', missing_days[i], '.csv.gz'))
}
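
As a quick sanity check of the URL scheme (each day's log lives under its year), the address for a single day can be built and inspected without downloading anything:

```r
# Each day's log lives at http://cran-logs.rstudio.com/<year>/<YYYY-MM-DD>.csv.gz
day <- as.Date("2013-06-10")
url <- paste0("http://cran-logs.rstudio.com/", format(day, "%Y"), "/", day, ".csv.gz")
url
# [1] "http://cran-logs.rstudio.com/2013/2013-06-10.csv.gz"
```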

 
 

Step 2: Combine all daily files into one big data.table (this step also takes a couple of minutes …)

## ======================================================================
## Step 2: Load single data files into one big data.table
## ======================================================================
file_list <- list.files("CRANlogs", full.names=TRUE)
logs <- list()
for (file in file_list) {
    print(paste("Reading", file, "..."))
    logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", as.is = TRUE)
}
# rbind together all files
library(data.table)
dat <- rbindlist(logs)
# add some keys and define variable types
dat[, date:=as.Date(date)]
dat[, package:=factor(package)]
dat[, country:=factor(country)]
dat[, weekday:=weekdays(date)]
dat[, week:=strftime(as.POSIXlt(date), format="%Y-%W")]
setkey(dat, package, date, week, country)
save(dat, file="CRANlogs/CRANlogs.RData")
# for later analyses: load the saved data.table
# load("CRANlogs/CRANlogs.RData")
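
A brief note on the `%Y-%W` week label used above: `%W` is the Monday-based week number, and days before a year's first Monday fall into week "00", so labels at the year boundaries are a bit quirky. A quick base-R check:

```r
# %W = Monday-based week number; days before the year's first Monday get "00"
d <- as.Date(c("2013-01-01", "2013-06-10"))  # a Tuesday and a Monday
strftime(as.POSIXlt(d), format = "%Y-%W")
# [1] "2013-00" "2013-23"
```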

 
 

Step 3: Analyze it!

## ======================================================================
## Step 3: Analyze it!
## ======================================================================
library(ggplot2)
library(plyr)
str(dat)
# Overall downloads of packages
d1 <- dat[, length(week), by=package]
d1 <- d1[order(V1), ]
d1[package=="TripleR", ]
d1[package=="psych", ]
# plot 1: Compare downloads of selected packages on a weekly basis
agg1 <- dat[J(c("TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]
ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x  = element_text(angle=90, size=8, vjust=0.5))
agg1 <- dat[J(c("psych", "TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]
ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x  = element_text(angle=90, size=8, vjust=0.5))
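
The same aggregation idea works for other time grains, e.g. months. Here is a minimal, self-contained sketch with a toy stand-in for the log table (hypothetical values; the real `dat` from Step 2 has the same columns):

```r
library(data.table)

# Toy stand-in for the log table (hypothetical values)
dat <- data.table(
  date    = as.Date(c("2013-05-01", "2013-05-15", "2013-06-02", "2013-06-03")),
  package = c("TripleR", "TripleR", "TripleR", "RSA"),
  ip_id   = c(1, 2, 2, 3)
)

# label each download with its month, then count unique IPs per month/package
dat[, month := strftime(as.POSIXlt(date), format = "%Y-%m")]
agg_m <- dat[, length(unique(ip_id)), by = c("month", "package")]
agg_m  # 2013-05/TripleR: 2, 2013-06/TripleR: 1, 2013-06/RSA: 1
```

Plotting then works exactly as in the weekly case, with `x = month` instead of `x = week`.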

 
 
Here are my two packages, TripleR and RSA. Actually, ~30 downloads per week (from this single mirror) is much more than I expected!

[Figure: weekly downloads of TripleR and RSA]

To put things in perspective, here is the same plot with package psych included:

[Figure: weekly downloads of psych, TripleR, and RSA]

Some psychological sidenotes on social comparisons:

  • Downward comparisons enhance well-being; extreme upward comparisons are detrimental. Hence, never include ggplot2 in your plot!
  • Upward comparisons instigate your achievement motive and give you drive to get better. Hence, select some packages that are slightly above your own.
  • Of course, things are a bit more complicated than that …

All source code on this post is licensed under the FreeBSD license.

27 thoughts on “Finally! Tracking CRAN package downloads”

    1. Hi Tal, I guess many people started working on this simultaneously after the logs were released! 🙂
      It’s a good idea to put this functionality into a package. Feel free to use any code from this post (I’ve added an official license statement); when I find the time I’ll send some changes. I think the data.table functions should be much faster than the rbind approach!
      Cheers,
      Felix

  1. Hi,
    Very nice!
    Just a few improvements of data.table usage … the rbindlist is good, but then those 5 dat$<- will copy all of dat each time, just like base R. That's what := is for; e.g., dat[,date:=as.Date(date)] should be faster than dat$date <- as.Date(dat$date), times 5. Also in section 3 the idea is to avoid variable name repetition by not needing d1$; e.g., instead of d1 <- d1[order(d1$V1), ] it's just d1 <- d1[order(V1),].
    Hope that's useful.
    It worked first time for me!
    -Matthew

    1. Thanks Matthew! I’m new to data.tables (What a great package!), and it’s great to learn new tricks. I updated the source code above.

  2. I am new to R and found your post extremely useful! Could you please let me know how I can modify the code to generate a graph that would show monthly downloads from a certain time frame?

  3. Hi,
    This article popped up because the code in it appears on the back of one of the entries for the useR! 2014 T-Shirt competition. Noticed that fread isn’t being used (it was quite new at the time). It saves quite a few lines of code and should be a lot faster.
    Try replacing these lines :
    logs <- list()
    for (file in file_list) {
    print(paste("Reading", file, "…"))
    logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
    dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)
    }
    # rbind together all files
    library(data.table)
    dat <- rbindlist(logs)
    with just one line :
    dat <- rbindlist(lapply(file_list, fread))
    Haven't actually tried it on those logs, but this should work and be much faster. Let me know if not. By default it prints a percentage counter as it loads each file, too.
    Matt

    1. Hi Matt,
      thanks for the hint! It partly works. `read.table` can automatically read the .csv.gz files which have been downloaded.
      `fread` prints an error when attempting to load .csv.gz files. If I unzip the files before, the one-liner works, and has a 10X speedup! (Which, however, does not include the time needed to unzip the files externally).


  5. Thanks for this nice post!
    I have noticed a little bug in Step 1:
    download.file(urls[i], paste0('CRANlogs/', missing_days[i], '.csv.gz'))
    will only work properly if "urls" and "missing_days" coincide (i.e. all the data is being downloaded). Otherwise it might be the case that data from the wrong day is stored in the "right" file.
    A quick and dirty hack:
    download.file(urls[which(all_days == missing_days[i])], paste0('CRANlogs/', missing_days[i], '.csv.gz'))
    seems to solve the issue.

  6. It turned out that my computer memory was insufficiently large for the above approach, so I combined Step 1 and Step 2 into the following:
    library(data.table)
    if (file.exists("CRANlogs.RData")) {
    load("CRANlogs.RData")
    } else {
    dat <- data.table(NULL)
    }
    startdate <- as.Date('2012-10-01')
    enddate <- as.Date('2014-05-15')
    alldays <- seq(startdate, enddate, by = 'day')
    years <- as.POSIXlt(alldays)$year + 1900
    urls <- paste0('http://cran-logs.rstudio.com/', years, '/', alldays, '.csv.gz')
    if (nrow(dat) > 0) lastdate <- max(dat$date) else lastdate <- as.Date("1000-01-01")
    missingdays <- alldays[alldays >= lastdate]
    if (length(missingdays) > 0) {
    for (i in 1:length(missingdays)) {
    geturl <- urls[which(alldays == missingdays[i])]
    cat(paste0(i, "/", length(missingdays), ": Get ", geturl, "…"))
    con <- gzcon(url(geturl))
    txt <- readLines(con)
    d <- data.table(read.table(textConnection(txt),
    header = TRUE, sep = ",", quote = "\"",
    dec = ".", fill = TRUE, comment.char = "",
    as.is = TRUE))
    close(con)
    cat(" Add…")
    d[, date:=as.Date(date)]
    d[, package:=factor(package)]
    d[, country:=factor(country)]
    d[, weekday:=weekdays(date)]
    d[, week:=strftime(as.POSIXlt(date),format="%Y-%W")]
    setkey(d, package, date, week, country)
    dat <- rbind(dat, d) # this is BAD, cf. http://stackoverflow.com/questions/11486369/growing-a-data-frame-in-a-memory-efficient-manner
    cat(" Done.\n")
    }
    }
    save(dat, file="CRANlogs.RData")

    1. Maybe elegant, but slow … it seems better to actually use a tempfile:
      temp <- tempfile()
      download.file(geturl, temp)
      d <- data.table(read.table(temp, header = TRUE, sep = ",", quote = "\"",
      dec = ".", fill = TRUE, comment.char = "", as.is = TRUE))
      unlink(temp)
      instead of
      con <- gzcon(url(geturl))
      txt <- readLines(con)
      d <- data.table(read.table(textConnection(txt),
      header = TRUE, sep = ",", quote = "\"",
      dec = ".", fill = TRUE, comment.char = "",
      as.is = TRUE))
      close(con)

  7. To answer to the question “Is my package being download more or less than the other?”, I add the small following code:
    [code]myPack <- "TripleR"
    myScore <- d1[package==myPack, ]$V1
    aboveMe <- nrow(d1[V1 >= myScore,])
    (myPercent <- 1-aboveMe/nrow(d1))
    ## [1] 0.7140716
    [/code]
    So TripleR is in the "top 30%".

  8. Hi,
    Is there some way this information can just be posted online without people having to download the logs for ALL packages and then just select the ones they are interested in? All I’m interested in is how many downloads (and maybe some date/time information like a graph). There should be a website where you can just view this information for any package; I mean, I’m no computer expert but it seems a bit silly. Maybe CRAN should just have something on their website.
    –Sam

