Felix Schönbrodt

Finally! Tracking CRAN package downloads

[Update June 12: the data.table code has been improved (thanks to a comment by Matthew Dowle); for a similar approach see also Tal Galili's post]

The guys from RStudio now provide CRAN download logs (see also this blog post). Great work!

I have always asked myself how many people actually download my packages. Now I can finally get an answer (… with some risk of frustration ;-)
Here are the complete, self-contained R scripts to analyze these log data:

Step 1: Download all log files into a subfolder (this step takes a couple of minutes)

## ======================================================================
## Step 1: Download all log files
## ======================================================================

# Here's an easy way to get all the URLs in R
start <- as.Date('2012-10-01')
today <- as.Date('2013-06-10')

all_days <- seq(start, today, by = 'day')
 
year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')
 
# only download the files you don't have:
dir.create("CRANlogs", showWarnings = FALSE)
missing_days <- setdiff(as.character(all_days), tools::file_path_sans_ext(dir("CRANlogs"), TRUE))

for (i in seq_along(missing_days)) {
  print(paste0(i, "/", length(missing_days)))
  # look up the URL by date rather than by loop index, so the correct day's
  # data is saved under each file name even when only some days are missing
  day <- missing_days[i]
  download.file(urls[as.character(all_days) == day], paste0('CRANlogs/', day, '.csv.gz'))
}

 
 

Step 2: Combine all daily files into one big data table (this step also takes a couple of minutes …)

## ======================================================================
## Step 2: Load single data files into one big data.table
## ======================================================================
 
file_list <- list.files("CRANlogs", full.names=TRUE)

logs <- list()
for (file in file_list) {
    print(paste("Reading", file, "..."))
    # read.table reads the .csv.gz files directly (decompression is transparent)
    logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)
}

# rbind together all files
library(data.table)
dat <- rbindlist(logs)

# add some keys and define variable types
dat[, date:=as.Date(date)]
dat[, package:=factor(package)]
dat[, country:=factor(country)]
dat[, weekday:=weekdays(date)]
dat[, week:=strftime(as.POSIXlt(date),format="%Y-%W")]

# setting the key sorts the table and enables fast binary-search subsetting
# (used for the dat[J(...)] joins in Step 3)
setkey(dat, package, date, week, country)

save(dat, file="CRANlogs/CRANlogs.RData")

# for later analyses: load the saved data.table
# load("CRANlogs/CRANlogs.RData")

 
 

Step 3: Analyze it!

## ======================================================================
## Step 3: Analyze it!
## ======================================================================

library(ggplot2)
library(plyr)

str(dat)

# Overall downloads per package (each row of the logs is one download)
d1 <- dat[, length(week), by=package]
d1 <- d1[order(V1), ]
d1[package=="TripleR", ]
d1[package=="psych", ]

# plot 1: Compare downloads of selected packages on a weekly basis
# (counts unique downloading ip_ids per week)
agg1 <- dat[J(c("TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]

ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x  = element_text(angle=90, size=8, vjust=0.5))


agg1 <- dat[J(c("psych", "TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]

ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x  = element_text(angle=90, size=8, vjust=0.5))

 
 
Here are my two packages, TripleR and RSA. Actually, ~30 downloads per week (from this single mirror) is much more than I expected!

[Figure: weekly downloads of TripleR and RSA]

 

To put things in perspective, here is the same plot with the psych package included:

[Figure: weekly downloads of psych, TripleR, and RSA]

Some psychological sidenotes on social comparisons:

  • Downward comparisons enhance well-being; extreme upward comparisons are detrimental. Hence, never include ggplot2 in your plot!

  • Upward comparisons activate your achievement motive and give you drive to get better. Hence, select some packages that are slightly above your own (see the sketch after this list).
  • Of course, things are a bit more complicated than that …
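
A minimal sketch of how such comparison targets could be picked from the d1 table of Step 3 (d1 is sorted ascending by download count; the package name is just an example):

myRank <- which(d1$package == "TripleR")  # position of your package in the ranking
d1[(myRank + 1):(myRank + 5), ]           # the five packages ranked directly above it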

All source code on this post is licensed under the FreeBSD license.


21 Responses to “Finally! Tracking CRAN package downloads”

  1. Tal Galili says:

    Hi Felix,

    I found your post only after writing the code for my own post here:
    http://www.r-statistics.com/2013/06/answering-how-many-people-use-my-r-package/

    If you’re interested, I’d be happy to include your code in the “installr” package, feel free to send changes/pull-requests to: https://github.com/talgalili/installr/blob/master/R/RStudio_CRAN_data.r

    Cheers,
    Tal

    • FelixS says:

      Hi Tal, I guess many people started working on this simultaneously after the logs were released! :-)

      It’s a good idea to put this functionality into a package. Feel free to use any code from this post (I’ve added an official license statement); when I find the time I’ll send some changes. I think the data.table functions should be much faster than the rbind approach!

      Cheers,
      Felix

  2. [...] map which highlights the countries based on how much people there use of R. Felix Schonbrodt wrote a great post on Tracking CRAN packages downloads. In the meantime, I’ve started crafting some basic functions for package developers to easily [...]

  3. Matthew Dowle says:

    Hi,
    Very nice!
    Just a few improvements to data.table usage … the rbindlist is good, but those 5 dat$<- assignments will copy all of dat each time, just like base R. That's what := is for; e.g., dat[,date:=as.Date(date)] should be faster than dat$date <- as.Date(dat$date), times 5. Also, in section 3, the idea is to avoid variable name repetition by not needing d1$; e.g., instead of d1 <- d1[order(d1$V1), ] it's just d1 <- d1[order(V1), ].
    Hope that's useful.
    It worked first time for me!
    -Matthew

    • FelixS says:

      Thanks Matthew! I’m new to data.table (what a great package!), and it’s great to learn new tricks. I updated the source code above.

  4. Anthony Damico says:

    thank you

  5. [...] relying on the nice code that Felix Schonbrodt recently wrote for tracking packages downloads, I have updated my installr R package with functions that enables the user to easily download and [...]

  8. [...] Tracking CRAN packages downloads [...]

  9. Natalia says:

    I am new to R and found your post extremely useful! Could you please let me know how I can modify the code to generate a graph that would show monthly downloads from a certain time frame?

  11. Matt Dowle says:

    Hi,

    This article popped up because the code in it appears on the back of one of the entries for the useR! 2014 T-Shirt competition. Noticed that fread isn’t being used (it was quite new at the time). It saves quite a few lines of code and should be a lot faster.

    Try replacing these lines :

    logs <- list()
    for (file in file_list) {
        print(paste("Reading", file, "..."))
        logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
            dec = ".", fill = TRUE, comment.char = "", as.is = TRUE)
    }
    # rbind together all files
    library(data.table)
    dat <- rbindlist(logs)

    with just one line :

    dat <- rbindlist(lapply(file_list, fread))

    Haven't actually tried it on those logs, but this should work and be much faster. Let me know if not. By default it prints a percentage counter as it loads each file, too.

    Matt

    • FelixS says:

      Hi Matt,

      Thanks for the hint! It partly works: `read.table` can automatically read the downloaded .csv.gz files, but `fread` throws an error when attempting to load them. If I unzip the files first, the one-liner works and gives a 10X speedup (which, however, does not include the time needed to unzip the files externally).
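
      A minimal sketch of the unzip-then-fread route (assuming the R.utils package for gunzip()):

      library(data.table)
      library(R.utils)  # for gunzip()
      # decompress each .csv.gz next to the original, keeping the .gz files
      for (f in list.files("CRANlogs", pattern = "\\.gz$", full.names = TRUE)) {
          gunzip(f, remove = FALSE, overwrite = TRUE)
      }
      dat <- rbindlist(lapply(list.files("CRANlogs", pattern = "\\.csv$", full.names = TRUE), fread))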

  12. […] own CRAN mirror. That announcement prompted a few people to analyze some of the available data. Felix Schonbrodt showed how to track R package downloads, Tal Galili looked for the most popular R packages, and James Cheshire also created a map […]

  15. Gregor says:

    Thanks for this nice post!

    I have noticed a little bug in Step 1:

    download.file(urls[i], paste0('CRANlogs/', missing_days[i], '.csv.gz'))

    will only work properly if “urls” and “missing_days” coincide (i.e. all the data is being downloaded). Otherwise it might be the case that data from the wrong day is stored in the “right” file.

    A quick and dirty hack:

    download.file(urls[which(all_days == missing_days[i])], paste0('CRANlogs/', missing_days[i], '.csv.gz'))

    seems to solve the issue.

  16. Gregor says:

    It turned out that my computer's memory was too small for the above approach, so I combined Step 1 and Step 2 into the following:

    library(data.table)

    if (file.exists("CRANlogs.RData")) {
        load("CRANlogs.RData")
    } else {
        dat <- data.table(NULL)
    }

    startdate <- as.Date('2012-10-01')
    enddate <- as.Date('2014-05-15')

    alldays <- seq(startdate, enddate, by = 'day')
    years <- as.POSIXlt(alldays)$year + 1900

    urls <- paste0('http://cran-logs.rstudio.com/', years, '/', alldays, '.csv.gz')

    if (nrow(dat) > 0) lastdate <- max(dat$date) else lastdate <- as.Date("1000-01-01")
    missingdays <- alldays[alldays >= lastdate]

    if (length(missingdays) > 0) {
        for (i in 1:length(missingdays)) {
            geturl <- urls[which(alldays == missingdays[i])]
            cat(paste0(i, "/", length(missingdays), ": Get ", geturl, " ..."))
            con <- gzcon(url(geturl))
            txt <- readLines(con)
            d <- data.table(read.table(textConnection(txt),
                header = TRUE, sep = ",", quote = "\"",
                dec = ".", fill = TRUE, comment.char = "",
                as.is = TRUE))
            close(con)
            cat(" Add ...")
            d[, date:=as.Date(date)]
            d[, package:=factor(package)]
            d[, country:=factor(country)]
            d[, weekday:=weekdays(date)]
            d[, week:=strftime(as.POSIXlt(date), format="%Y-%W")]
            setkey(d, package, date, week, country)

            dat <- rbind(dat, d) # this is BAD, cf. http://stackoverflow.com/questions/11486369/growing-a-data-frame-in-a-memory-efficient-manner
            cat(" Done.\n")
        }
    }

    save(dat, file="CRANlogs.RData")

    • Gregor says:

      Maybe elegant, but slow … it seems better to actually use a tempfile:

      temp <- tempfile()
      download.file(geturl, temp)
      d <- data.table(read.table(temp, header = TRUE, sep = ",", quote = "\"",
          dec = ".", fill = TRUE, comment.char = "", as.is = TRUE))
      unlink(temp)

      instead of

      con <- gzcon(url(geturl))
      txt <- readLines(con)
      d <- data.table(read.table(textConnection(txt),
          header = TRUE, sep = ",", quote = "\"",
          dec = ".", fill = TRUE, comment.char = "",
          as.is = TRUE))
      close(con)

  17. Christophe says:

    To answer the question “Is my package being downloaded more or less than the others?”, I add the following small piece of code:

    myPack <- "TripleR"
    myScore <- d1[package == myPack, ]$V1
    aboveMe <- nrow(d1[V1 >= myScore, ])
    (myPercent <- 1 - aboveMe/nrow(d1))
    ## [1] 0.7140716

    So TripleR is in the “top 30%”.
