Finally! Tracking CRAN package downloads

[Update June 12: The data.table functions have been improved (thanks to a comment by Matthew Dowle); for a similar approach see also Tal Galili’s post]

The guys from RStudio now provide CRAN download logs (see also this blog post). Great work!

I’ve always asked myself how many people actually download my packages. Now I can finally get an answer (with some anxiety about being frustrated 😉).
Here are the complete, self-contained R scripts to analyze the log data:

Step 1: Download all log files into a subfolder (this step takes a couple of minutes)

[cc lang="rsplus" escaped="true"]## ======================================================================
## Step 1: Download all log files
## ======================================================================
# Here's an easy way to get all the URLs in R
start <- as.Date('2012-10-01')
today <- as.Date('2013-06-10')
all_days <- seq(start, today, by = 'day')
year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')

# only download the files you don't have:
dir.create("CRANlogs", showWarnings = FALSE)
missing_days <- setdiff(as.character(all_days),
                        tools::file_path_sans_ext(dir("CRANlogs"), TRUE))

for (i in seq_along(missing_days)) {
  print(paste0(i, "/", length(missing_days)))
  # look up the URL by date, so the right file is fetched
  # even when only some days are missing:
  download.file(urls[which(all_days == missing_days[i])],
                paste0('CRANlogs/', missing_days[i], '.csv.gz'))
}[/cc]

Step 2: Combine all daily files into one big data.table (this step also takes a couple of minutes…)

[cc lang="rsplus" escaped="true"]## ======================================================================
## Step 2: Load single data files into one big data.table
## ======================================================================
file_list <- list.files("CRANlogs", full.names = TRUE)

logs <- list()
for (file in file_list) {
  print(paste("Reading", file, "..."))
  logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
                             dec = ".", fill = TRUE, comment.char = "", as.is = TRUE)
}

# rbind together all files
library(data.table)
dat <- rbindlist(logs)

# add some keys and define variable types
dat[, date := as.Date(date)]
dat[, package := factor(package)]
dat[, country := factor(country)]
dat[, weekday := weekdays(date)]
dat[, week := strftime(as.POSIXlt(date), format = "%Y-%W")]
setkey(dat, package, date, week, country)

save(dat, file = "CRANlogs/CRANlogs.RData")

# for later analyses: load the saved data.table
# load("CRANlogs/CRANlogs.RData")[/cc]

Step 3: Analyze it!

[cc lang="rsplus" escaped="true"]## ======================================================================
## Step 3: Analyze it!
## ======================================================================
library(ggplot2)
library(plyr)

str(dat)

# Overall downloads of packages
d1 <- dat[, length(week), by = package]
d1 <- d1[order(V1), ]

d1[package == "TripleR", ]
d1[package == "psych", ]

# plot 1: Compare downloads of selected packages on a weekly basis
agg1 <- dat[J(c("TripleR", "RSA")), length(unique(ip_id)), by = c("week", "package")]

ggplot(agg1, aes(x = week, y = V1, color = package, group = package)) +
  geom_line() + ylab("Downloads") + theme_bw() +
  theme(axis.text.x = element_text(angle = 90, size = 8, vjust = 0.5))

agg1 <- dat[J(c("psych", "TripleR", "RSA")), length(unique(ip_id)), by = c("week", "package")]

ggplot(agg1, aes(x = week, y = V1, color = package, group = package)) +
  geom_line() + ylab("Downloads") + theme_bw() +
  theme(axis.text.x = element_text(angle = 90, size = 8, vjust = 0.5))[/cc]

Here are my two packages, TripleR and RSA. Actually, ~30 downloads per week (from this single mirror) is much more than I expected!

[Plot: weekly downloads of TripleR and RSA]
 
To put things into perspective, here is the same plot with the package psych included:

[Plot: weekly downloads of TripleR, RSA, and psych]

Some psychological sidenotes on social comparisons:

  • Downward comparisons enhance well-being, while extreme upward comparisons are detrimental. Hence, never include ggplot2 in your graphic!
  • Upward comparisons activate your achievement motive and give you drive to get better. Hence, select some packages that are slightly above your own (see the sketch below).
  • Of course, things are a bit more complicated than that …
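
For instance, here is a toy sketch that picks such comparison targets from the download ranking (it assumes the `d1` table from Step 3; the window of five packages is an arbitrary choice):

[cc lang="rsplus" escaped="true"]# d1 is sorted in ascending order of downloads (V1), see Step 3
myPack <- "TripleR"
myRank <- which(d1$package == myPack)

# the (up to) five packages ranked directly above your own:
d1[(myRank + 1):min(myRank + 5, nrow(d1)), ][/cc]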

All source code in this post is licensed under the FreeBSD license.

27 thoughts on “Finally! Tracking CRAN package downloads”

    1. Hi Tal, I guess many people started working on this simultaneously after the logs were released! 🙂
      It’s a good idea to put this functionality into a package. Feel free to use any code from this post (I’ve added an official license statement); when I find the time, I’ll send some changes. I think the data.table functions should be much faster than the rbind approach!
      Cheers,
      Felix

  1. Hi,
    Very nice!
    Just a few improvements to the data.table usage … the rbindlist is good, but those 5 dat$<- assignments will copy all of dat each time, just like base R. That's what := is for; e.g., dat[, date := as.Date(date)] should be faster than dat$date <- as.Date(dat$date), times 5. Also, in Step 3 the idea is to avoid repeating the variable name by not needing d1$; e.g., instead of d1 <- d1[order(d1$V1), ] it's just d1 <- d1[order(V1), ].
    Hope that's useful.
    It worked first time for me!
    -Matthew

    1. Thanks Matthew! I’m new to data.table (what a great package!), and it’s great to learn new tricks. I updated the source code above.
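      To illustrate the by-reference update on a toy table (a minimal sketch):
      [code]library(data.table)
      dt <- data.table(x = c("1", "2", "3"))
      # base-style assignment copies the whole table:
      # dt$x <- as.integer(dt$x)
      # := modifies the column in place, without copying:
      dt[, x := as.integer(x)][/code]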

  2. I am new to R and found your post extremely useful! Could you please let me know how I can modify the code to generate a graph that would show monthly downloads from a certain time frame?
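
    1. One way to do this, as a sketch building on the Step 2 and Step 3 code (the package name "TripleR" and the cutoff date are placeholder choices): create a month column and aggregate by it instead of by week.
      [code]# optionally restrict the time frame first, e.g.:
      # dat <- dat[date >= as.Date("2013-01-01")]

      # add a month column, then count unique IPs per month:
      dat[, month := strftime(date, format = "%Y-%m")]
      agg_m <- dat[J("TripleR"), length(unique(ip_id)), by = c("month", "package")]

      ggplot(agg_m, aes(x = month, y = V1, group = package)) +
        geom_line() + ylab("Monthly downloads") + theme_bw() +
        theme(axis.text.x = element_text(angle = 90, size = 8, vjust = 0.5))[/code]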

  3. Hi,
    This article popped up because the code in it appears on the back of one of the entries for the useR! 2014 T-Shirt competition. Noticed that fread isn’t being used (it was quite new at the time). It saves quite a few lines of code and should be a lot faster.
    Try replacing these lines:
    logs <- list()
    for (file in file_list) {
    print(paste("Reading", file, "..."))
    logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
    dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)
    }
    # rbind together all files
    library(data.table)
    dat <- rbindlist(logs)
    with just one line:
    dat <- rbindlist(lapply(file_list, fread))
    Haven't actually tried it on those logs, but this should work and be much faster. Let me know if not. By default it prints a percentage counter as it loads each file, too.
    Matt

    1. Hi Matt,
      thanks for the hint! It partly works: `read.table` can automatically read the downloaded .csv.gz files.
      `fread`, however, throws an error when attempting to load .csv.gz files. If I unzip the files beforehand, the one-liner works and gives a 10X speedup! (This does not, however, include the time needed to unzip the files externally.)
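      For reference, the unzipping can also be done from within R (a sketch using the R.utils package, an extra dependency):
      [code]library(data.table)
      library(R.utils)  # provides gunzip()

      gz_files  <- list.files("CRANlogs", pattern = "\\.csv\\.gz$", full.names = TRUE)
      csv_files <- sub("\\.gz$", "", gz_files)
      for (i in seq_along(gz_files)) {
        # keep the .gz originals and skip files that were already decompressed:
        gunzip(gz_files[i], destname = csv_files[i], remove = FALSE, skip = TRUE)
      }

      dat <- rbindlist(lapply(csv_files, fread))[/code]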


  4. Thanks for this nice post!
    I have noticed a little bug in Step 1:
    download.file(urls[i], paste0('CRANlogs/', missing_days[i], '.csv.gz'))
    will only work properly if "urls" and "missing_days" coincide (i.e. all the data is being downloaded). Otherwise it might be the case that data from the wrong day is stored in the "right" file.
    A quick and dirty hack:
    download.file(urls[which(all_days == missing_days[i])], paste0('CRANlogs/', missing_days[i], '.csv.gz'))
    seems to solve the issue.

  5. It turned out that my computer memory was insufficiently large for the above approach, so I combined Step 1 and Step 2 into the following:
    library(data.table)
    if (file.exists("CRANlogs.RData")) {
    load("CRANlogs.RData")
    } else {
    dat <- data.table(NULL)
    }
    startdate <- as.Date('2012-10-01')
    enddate <- as.Date('2014-05-15')
    alldays <- seq(startdate, enddate, by = 'day')
    years <- as.POSIXlt(alldays)$year + 1900
    urls <- paste0('http://cran-logs.rstudio.com/', years, '/', alldays, '.csv.gz')
    if (nrow(dat) > 0) lastdate <- max(dat$date) else lastdate <- as.Date("1000-01-01")
    missingdays <- alldays[alldays >= lastdate]
    if (length(missingdays) > 0) {
    for (i in 1:length(missingdays)) {
    geturl <- urls[which(alldays == missingdays[i])]
    cat(paste0(i, "/", length(missingdays), ": Get ", geturl, "…"))
    con <- gzcon(url(geturl))
    txt <- readLines(con)
    d <- data.table(read.table(textConnection(txt),
    header = TRUE, sep = ",", quote = "\"",
    dec = ".", fill = TRUE, comment.char = "",
    as.is = TRUE))
    close(con)
    cat(" Add…")
    d[, date:=as.Date(date)]
    d[, package:=factor(package)]
    d[, country:=factor(country)]
    d[, weekday:=weekdays(date)]
    d[, week:=strftime(as.POSIXlt(date),format="%Y-%W")]
    setkey(d, package, date, week, country)
    dat <- rbind(dat, d) # this is BAD, cf. http://stackoverflow.com/questions/11486369/growing-a-data-frame-in-a-memory-efficient-manner
    cat(" Done.\n")
    }
    }
    save(dat, file="CRANlogs.RData")

    1. Maybe elegant, but slow … it seems better to actually use a tempfile:
      temp <- tempfile()
      download.file(geturl, temp)
      d <- data.table(read.table(temp, header = TRUE, sep = ",", quote = "\"",
      dec = ".", fill = TRUE, comment.char = "", as.is = TRUE))
      unlink(temp)
      instead of
      con <- gzcon(url(geturl))
      txt <- readLines(con)
      d <- data.table(read.table(textConnection(txt),
      header = TRUE, sep = ",", quote = "\"",
      dec = ".", fill = TRUE, comment.char = "",
      as.is = TRUE))
      close(con)

  6. To answer the question “Is my package being downloaded more or less than the others?”, I added the following small piece of code:
    [code]myPack <- "TripleR"
    myScore <- d1[package == myPack, ]$V1
    aboveMe <- nrow(d1[V1 >= myScore, ])
    (myPercent <- 1 - aboveMe/nrow(d1))
    ## [1] 0.7140716
    [/code]
    So TripleR is in the "top 30%".

  7. Hi,
    Is there some way this information can just be posted online without people having to download the logs for ALL packages and then just select the ones they are interested in? All I’m interested in is how many downloads (and maybe some date/time information like a graph). There should be a website where you can just view this information for any package; I mean, I’m no computer expert but it seems a bit silly. Maybe CRAN should just have something on their website.
    –Sam

