Finally! Tracking CRAN packages downloads
[Update June 12: Data.tables functions have been improved (thanks to a comment by Matthew Dowle); for a similar approach see also Tal Galili's post]
The guys from RStudio now provide CRAN download logs (see also this blog post). Great work!
I always asked myself, how many people actually download my packages. Now I finally can get an answer (… with some anxiety to get frustrated ![]()
Here are the complete, self-contained R scripts to analyze these log data:
Step 1: Download all log files in a subfolder (this steps takes a couple of minutes)
## ====================================================================== ## Step 1: Download all log files ## ====================================================================== # Here's an easy way to get all the URLs in R start <- as.Date('2012-10-01') today <- as.Date('2013-06-10') all_days <- seq(start, today, by = 'day') year <- as.POSIXlt(all_days)$year + 1900 urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz') # only download the files you don't have: missing_days <- setdiff(as.character(all_days), tools::file_path_sans_ext(dir("CRANlogs"), TRUE)) dir.create("CRANlogs") for (i in 1:length(missing_days)) { print(paste0(i, "/", length(missing_days))) download.file(urls[i], paste0('CRANlogs/', missing_days[i], '.csv.gz')) } |
Step 2: Combine all daily files into one big data table (this steps also takes a couple of minutes…)
## ====================================================================== ## Step 2: Load single data files into one big data.table ## ====================================================================== file_list <- list.files("CRANlogs", full.names=TRUE) logs <- list() for (file in file_list) { print(paste("Reading", file, "...")) logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", as.is=TRUE) } # rbind together all files library(data.table) dat <- rbindlist(logs) # add some keys and define variable types dat[, date:=as.Date(date)] dat[, package:=factor(package)] dat[, country:=factor(country)] dat[, weekday:=weekdays(date)] dat[, week:=strftime(as.POSIXlt(date),format="%Y-%W")] setkey(dat, package, date, week, country) save(dat, file="CRANlogs/CRANlogs.RData") # for later analyses: load the saved data.table # load("CRANlogs/CRANlogs.RData") |
Step 3: Analyze it!
## ====================================================================== ## Step 3: Analyze it! ## ====================================================================== library(ggplot2) library(plyr) str(dat) # Overall downloads of packages d1 <- dat[, length(week), by=package] d1 <- d1[order(V1), ] d1[package=="TripleR", ] d1[package=="psych", ] # plot 1: Compare downloads of selected packages on a weekly basis agg1 <- dat[J(c("TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")] ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x = element_text(angle=90, size=8, vjust=0.5)) agg1 <- dat[J(c("psych", "TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")] ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x = element_text(angle=90, size=8, vjust=0.5)) |
Here are my two packages, TripleR and RSA. Actually, ~30 downloads per week (from this single mirror) is much more than I’ve expected!
To put things in perspective: package psych included in the plot:
Some psychological sidenotes on social comparisons:
- Downward comparisons enhance well-being, extreme upward comparisons are detrimental. Hence, do never include
ggplot2into your graphic! - Upward comparisons instigate your achievement motive, and give you drive to get better. Hence, select some packages, which are slightly above your own.
- Of course, things are a bit more complicated than that …
All source code on this post is licensed under the FreeBSD license.
At what sample size do correlations stabilize?
Maybe you have encountered this situation: you run a large-scale study over the internet, and out of curiosity, you frequently check the correlation between two variables.
My experience with this practice is usually frustrating, as in small sample sizes (and we will see what “small” means in this context) correlations go up and down, change sign, move from “significant” to “non-significant” and back. As an example, see Figure 1 which shows the actual trajectory of a correlation, plotted against sample size (I also posted a video of this evolution).
It is simply the order how participants dropped into the study (i.e., data has not been rearranged). In this case, the correlation started really strong (r = .69) and continuously decayed until it’s final r of .26. The light gray lines show some exemplary bootstrapped alternative trajectories.
In this particular case, at least the sign was stable (“There is a positive relationship in the population”, see also “Type-S errors”). Other trajectories in this data set, however, changed signs or their significance status. One correlation even changed from “negative significant” to “positive significant”!
Obviously, the estimate of a correlation stabilizes with increasing sample size. Now I wanted to know: At which sample size exactly can I expect a correlation to be stable? An informal query amongst colleagues revealed estimates between n = 80 and n = 150.
Together with Marco Perugini, I did a systematic analysis of this question. The results of this simulation study are reported here [PDF, 0.39 MB]. In this paper a “corridor of stability” (COS) has been utilized: Deviations from the true value are defined as tolerable as long as they stay within that corridor (see also Figure 1 for a COS of +/- .1). The point of stability (POS) is that sample size from which on a specific trajectory does not leave the COS anymore.
The point of stability depends on the effect size (How strong is the true correlation?), the width of the corridor of stability (How much deviation from the true value am I willing to accept?), and the confidence in the decision (How confident do I want to be that the trajectory does not leave the COS any more?). If you’re interested in the details: read the paper. It’s not long.
The bottom line is: For typical scenarios in psychology, correlations stabilize when n approaches 250. That means, estimates with n > 250 are not only significant, they also are fairly accurate (see also Kelley & Maxwell, 2003, and Maxwell, Kelley, & Rausch, 2008, for elaborated discussions on parameter accuracy).
Additional analyses (not reported in the publication)
Figure 2 shows the distribution of POS values, depending on the half-width of the COS and on effect size rho. The horizontal axis is cut at n = 300, although several POS were > 300. It can be seen that all distributions have a very long tail. This makes the estimation of the 95th quantile very unstable. Therefore we used a larger number of 100’000 bootstrap replications in each experimental condition in order to get fairly stable estimates for the extreme quantiles.
Finally, Figure 3 shows the probability that a trajectory leaves the COS with increasing sample size.
The dotted lines mark the confidence levels of 80%, 90%, and 95% which were used in the publications. The n where the curves intersect these dotted lines indicate the values reported in Table 1 of the publication. For example, if the true correlation is .3 (which is already more than the average effect size in psychology) and you collect 100 participants, there’s still a chance of 50% that your correlation will leave the corridor between .21 and .39 (which are the boundaries for w=.1).
What is the conclusion? Significance tests determine the sign of a correlation. This conclusion can be made with much lower sample sizes. However, when we want to make an accurate conclusion about the size of an effect with some confidence (and we do not want to make a “Type M” error), we need much larger samples.
The full R source code for the simulations can be downloaded here.
References:
Installation of WRS package (Wilcox’ Robust Statistics)
Some users had trouble installing the WRS package from R-Forge. Here’s a method that should work automatically and fail-safe:
# first: install dependent packages install.packages(c("MASS", "akima", "robustbase")) # second: install suggested packages install.packages(c("cobs", "robust", "mgcv", "scatterplot3d", "quantreg", "rrcov", "lars", "pwr", "trimcluster", "parallel", "mc2d", "psych", "Rfit")) # third: install WRS install.packages("WRS", repos="http://R-Forge.R-project.org", type="source") |
WRS cannot be hosted on CRAN, as CRAN demands help files for every user-visible function. This has not been done for WRS (yet). For the time being, this somewhat more complicated installation routine has to be used.
Further thoughts on post-publication peer review (PPPR)
Sanjay Srivastava blogged some interesting thoughts about the process of post-publication peer review (PPPR), reflecting about his own comment on a PLOS ONE publication. I agree that open peer commentaries after publication are one important part of the future of scientific publishing. There were many cases where I wished to have the opportunity to publish such a commentary. In one case, I actually wrote a commentary on a paper published in Management Science – a strange story about managers, age, and testosterone, which received a lot of press coverage. I submitted it as a commentary to the journal, but it was rejected because of “lack of new results”. Now my commentary rests on SSRN and has been downloaded 5 times in 10 months – yippee-yeah! (probably 3 of these 5 are by myself …). But as SSRN does not allow peer commentaries I could not set a link from the original paper to my comment, and nobody finds it.
Other fields of science additionally established a pre-publication open peer review (also called the “pre-print culture”). Many researchers in mathematics or physics publish their preprints on arXiv and harvest open peer commentaries before submitting the manuscript to a peer-reviewed journal.
I believe devoutly that open PrePPR and PostPPR can significantly improve the quality of scientific output. But one crucial requirement indeed is etiquette, as Sanjay pointed out. I don’t want to see shitstorms coming over scientific articles, especially in the case of young scholars who worked hard to get their first paper published. Comments should be written in the spirit of a collaborative enhancement of research, and less in terms of “debunking”. We all are humans and mistakes can occur. Problems should be pointed out in order to strengthen scientific research, but in a friendly and constructive manner.
Researchers who conceive of science as a highly competitive business where claims have to be fortified and defended might have problems with open peer reviews (e.g., the escalation of the “Bargh rampage” [1][2][3]). But if we see science as a collaborative endeavour in search for knowledge, where no model is “right” but only “less wrong”, open peer reviews can be a very helpful tool.
————–
Some further readings:
- Neanderthal sex debate highlights benefits of pre-publication (news blog on nature.com)
- Interesting post by Jeremy Fox (and commenters), argueing for (standard) pre-publication review, from the perspective of marine biology and ecology
- The case for arXiv and a broader conception of peer-reviews (by Philippe Desjardins-Proulx)
The first CREDAM Award for creative data management goes to … the German government!
“If you torture the data long enough, it will confess.”
This aphorism, attributed to Ronald Coase, sometimes has been used in a disrespective manner, as if was wrong to do creative data analysis. This view obviously is misleading. In contrast, we at IRET have a much more positive and humanistic view of data management, and therefore we have made this aphorism to our leading guide in difficult times.
We at IRET have made it to our mission to proliferate and foster creative ways of data analysis. Therefore, we proudly introduce an award in recognition of outstanding data creativity: the CREDAM Award. CREDAM is both an acronym (CREative DAta Management), and a statement: credam (lat.) means “I will believe”, or “I will trust”.
This years CREDAM Award goes to …….. the German government!
A new report on poverty in Germany is going to be published soon. What does the data say?
Year | Overall property in possession of rich households | Overall property in possession of complete lower half |
|---|---|---|
| 1998 | 45% | 3% |
| 2008 | 53% | 1% |
Seems like a pretty clear picture, and in a previous version of the report, the authors concluded (based on this and other data), that “income disparity increased” (see Süddeutsche Zeitung). But that is wrong!! But why is it wrong? Well, that interpretation “does not reflect the opinion of the German government”.
On the pressure of the leader of the minor coalition partner, Philipp Rösler (which currently would be elected by 4% of Germans), this conclusion was re-interpreted. Now, the report comes to the completely opposite conclusion: “income disparity decreases“!
As this is a great example of creative data analysis, which liberates us from restrictive and anally retentive “scientific” procedures, we are happy to award the first CREDAM trophy to the German government, especially Phillip Rösler. Congratulations!
(Maybe we should think about adopting this strategy for scientific reports as well. Given highly flexible approaches of data analysis, conclusions should rather be based on a majority vote of all (co-)authors and reviewers, not on empirical evidence.)
Improved evolution of correlations
As an update of this post: here’s an improved version of “The evolution of correlations”.
From the original post:
This is the evolution of a bivariate correlation between two questionnaire scales, “hope of power” and “fear of losing control”. Both scales were administered in an open online study. The video shows how the correlation evolves from r = .69*** (n=20) to r = .26*** (n=271). It does not stabilize until n = 150.
Data has not been rearranged – it is the random order how participants dropped into the study. This had been a rather extreme case of an unstable correlation – other scales in this study were stable right from the beginning. Maybe this video could help as an anecdotal caveat for a careful interpretation of correlations with small n’s (and with ‘small’ I mean n < 100) …
The right panel now displays the correlation in each step. The horizontal green line is the final correlation that is approached, the curved dotted line shows the marginal correlation that would be significant at that sample size. As the empirical curve always is above this dotted line, it is significantly different from zero in each step.
Evolution of correlations – improved from Felix Schönbrodt on Vimeo.
Here the code that created the movie. It’s not fully self-contained – the function plotReg plots the dual-panel display, dat0, A, and B are parameters passed to this function. You can insert any other function here. The function loops through the rows of a data frame and saves a plot at every step into a subfolder. Finally, the function needs the command line version of ffmpeg, which connects the pictures to a movie.
makeMovie <- function(fname, dat0, A, B, fps=15) { # create a new directory for the pictures dir.create(fname) # create the picture sequence picName <- paste(fname, "/", fname, "_%03d.jpg", sep="") jpeg(picName, width=800, height=450, quality=95) for (i in 15:nrow(dat0)) { print(i) plotReg(A, B, i, keep=15) } dev.off() # delete any existing movie file unlink(paste(fname,".mpg",sep="")) # point system to R's working directory system(paste("cd ", gsub(" ", "\\ ", getwd(), fixed=TRUE))) # show & execute the command line expression for ffmpeg to glue the pictures together print(paste(paste0("ffmpeg -r ", fps, " -i ", fname, "/", fname, "_%03d.jpg -sameq -r 25 ", paste0(fname,".avi")))) system(paste(paste0("ffmpeg -r ", fps, " -i ", fname, "/", fname, "_%03d.jpg -sameq -r 25 ", paste0(fname,".avi")))) } |
Optimizing parameters for an oscillator – Video
Here’s a video how the modFit function from the FME package optimizes parameters for an oscillation. A Nelder-Mead-optimizer (R function optim) finds the best fitting parameters for an undampened oscillator. Minimum was found after 72 iterations, true parameter eta was -.05:
Evolution of parameters in optimization process from Felix Schönbrodt on Vimeo.
More on estimating parameters of differential equations is coming later on this blog!
Things I’ve learned:
- ffmpeg does not like pngs. They are internally converted to jpg in a very low quality and I could not find a way to improve this quality. Lesson learned: Export high quality jpgs from your R function
- Use a standard frame rate for the output file (i.e., 24, 25, or 30 fps)
- My final ffmpeg command:
ffmpeg -r 10 -i modFit%03d.jpg -r 25 -b:v 5000K modelFit.avi- -r 10: Use 10 pictures / second as input
- -i modFit%03d.jpg: defines the names of the input files, modFit001.jpg, modFit002.jpg, …
- -r 25: Set framerate of output file to 25 fps
- -b:v 5000K: set bitrate of video to a high value
- modelFit.mp4: video name and encoding type (mp4)
R-package: Wilcox’ Robust Statistics updated (WRS v0.20)
Rand Wilcox constantly updates the functions accompanying his books on robust statistics. Recently, they have been updated to version 20. The functions are available in the WRS package for R – for installation simply type
install.packages("WRS", repos="http://R-Forge.R-project.org")
In version 0.20, a number of functions dealing with ANCOVA have been added and some others improved. Unfortunately, only very few help files exist for the functions. I would recommend to check out the source code, as most functions have a comment section roughly explaining the parameters. Alternatively, consult Wilcox’ books for descriptions of the functions.
References:
Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing, 3rd Ed. Academic Press.
Wilcox, R. (2012). Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction. CRC Press.
Wilcox, R. R. (2010). Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy. Springer, 2nd Ed.
Wilcox, R. R. (2009). Basics Statistics: Understanding Conventional Methods and Modern Insights. New York: Oxford.
Parse pdf files with R (on a Mac)
Inspired by this blog post from theBioBucket, I created a script to parse all pdf files in a directory. Due to its reliance on the Terminal, it’s Mac specific, but modifications for other systems shouldn’t be too hard (as a start for Windows, see BioBucket’s script).
First, you have to install the command line tool pdftotext (a binary can be found on Carsten Blüm’s website). Then, run following script within a directory with pdfs:
# helper function: get number of words in a string, separated by tab, space, return, or point. nwords <- function(x){ res <- strsplit(as.character(x), "[ \t\n,\\.]+") res <- lapply(res, length) unlist(res) } # sanitize file name for terminal usage (i.e., escape spaces) sanitize <- function(str) { gsub('([#$%&~_\\^\\\\{}\\s\\(\\)])', '\\\\\\1', str, perl = TRUE) } # get a list of all files in the current directory fi <- list.files() fi2 <- fi[grepl(".pdf", fi)] ## Parse files and do something with it ... res <- data.frame() # keeps records of the calculations for (f in fi2) { print(paste("Parsing", f)) f2 <- sanitize(f) system(paste0("pdftotext ", f2), wait = TRUE) # read content of converted txt file filetxt <- sub(".pdf", ".txt", f) text <- readLines(filetxt, warn=FALSE) # adjust encoding of text - you have to know it Encoding(text) <- "latin1" # Do something with the content - here: get word and character count of all pdfs in the current directory text2 <- paste(text, collapse="\n") # collapse lines into one long string res <- rbind(res, data.frame(filename=f, wc=nwords(text2), cs=nchar(text2), cs.nospace=nchar(gsub("\\s", "", text2)))) # remove converted text file file.remove(filetxt) } print(res) |
… gives following result (wc = word count, cs = characgter count, cs.nospace = character count without spaces):
> print(res)
filename wc cs cs.nospace
1 Applied_Linear_Regression.pdf 33697 186280 154404
2 Baron-rpsych.pdf 22665 128440 105024
3 bootstrapping regressions.pdf 6309 34042 27694
4 Ch_multidimensional_scaling.pdf 718 4632 3908
5 corrgram.pdf 6645 40726 33965
6 eRm - Extended Rach Modeling (Paper).pdf 11354 65273 53578
7 eRm (Folien).pdf 371 1407 886
8 Faraway 2002 - Practical Regression and ANOVA using R.pdf 68777 380902 310037
9 Farnsworth-EconometricsInR.pdf 20482 125207 101157
10 ggplot_book.pdf 10681 65388 53551
11 ggplot2-lattice.pdf 18067 118591 93737
12 lavaan_usersguide_0.3-1.pdf 12608 64232 52962
13 lme4 - Bootstrapping.pdf 2065 11739 9515
14 Mclust.pdf 18191 92180 70848
15 multcomp.pdf 5852 38769 32344
16 OpenMxUserGuide.pdf 37320 233817 197571
How to check your package with R-devel
In response to an update to ggplot2 (now verson 0.9.2) I had to make some minor changes to our package TripleR. The CRAN maintainers also asked to …
Please also fix other issues that may be apparent in checks with a current R-devel.
Now, how can this be done? Here’s my workflow on Mac OS (might be slightly different on Win or Linux):
- Install R-develparallel to your existing (stable) R version
- Before installation, run
sudo pkgutil --forget org.r-project.R.Leopard.fw.pkgin the Terminal, otherwise the installer will overwrite your existing version - Rename your R.app and R64.app or move them temporarily into another folder, as the installer of R-devel probably will replace them by new version that are not compatible with your existing stable R version.
- Before installation, run
- Use RSwitch to change the active R version
- Install packages which your own packages depends on; you have to do it from source, as the binaries for the R-devel do not exist:
install.packages("lme4", type="source") - Check your own package using following flag:
R CMD check pkg --as-cran - Check if your package also works on Windows using winbuilder
Furthermore, check whether your package follows the CRAN Repository Policies.
PS: Finally, I managed to get rid of the annoying R CMD check warnings like “no visible binding for global variable ‘x’”. These occured due to the ggplot2 syntax. Here‘s a solution.




