Finally! Tracking CRAN packages downloads

[Update June 12: Data.tables functions have been improved (thanks to a comment by Matthew Dowle); for a similar approach see also Tal Galili's post]

The guys from RStudio now provide CRAN download logs (see also this blog post). Great work!

I always asked myself, how many people actually download my packages. Now I finally can get an answer (… with some anxiety to get frustrated ;-)
Here are the complete, self-contained R scripts to analyze these log data:

Step 1: Download all log files in a subfolder (this steps takes a couple of minutes)

?View Code RSPLUS
## ======================================================================
## Step 1: Download all log files
## ======================================================================
 
# Here's an easy way to get all the URLs in R
start <- as.Date('2012-10-01')
today <- as.Date('2013-06-10')
 
all_days <- seq(start, today, by = 'day')
 
year <- as.POSIXlt(all_days)$year + 1900
urls <- paste0('http://cran-logs.rstudio.com/', year, '/', all_days, '.csv.gz')
 
# only download the files you don't have:
missing_days <- setdiff(as.character(all_days), tools::file_path_sans_ext(dir("CRANlogs"), TRUE))
 
dir.create("CRANlogs")
for (i in 1:length(missing_days)) {
  print(paste0(i, "/", length(missing_days)))
  download.file(urls[i], paste0('CRANlogs/', missing_days[i], '.csv.gz'))
}

 
 

Step 2: Combine all daily files into one big data table (this steps also takes a couple of minutes…)

?View Code RSPLUS
## ======================================================================
## Step 2: Load single data files into one big data.table
## ======================================================================
 
file_list <- list.files("CRANlogs", full.names=TRUE)
 
logs <- list()
for (file in file_list) {
	print(paste("Reading", file, "..."))
	logs[[file]] <- read.table(file, header = TRUE, sep = ",", quote = "\"",
         dec = ".", fill = TRUE, comment.char = "", as.is=TRUE)
}
 
# rbind together all files
library(data.table)
dat <- rbindlist(logs)
 
# add some keys and define variable types
dat[, date:=as.Date(date)]
dat[, package:=factor(package)]
dat[, country:=factor(country)]
dat[, weekday:=weekdays(date)]
dat[, week:=strftime(as.POSIXlt(date),format="%Y-%W")]
 
setkey(dat, package, date, week, country)
 
save(dat, file="CRANlogs/CRANlogs.RData")
 
# for later analyses: load the saved data.table
# load("CRANlogs/CRANlogs.RData")

 
 

Step 3: Analyze it!

?View Code RSPLUS
## ======================================================================
## Step 3: Analyze it!
## ======================================================================
 
library(ggplot2)
library(plyr)
 
str(dat)
 
# Overall downloads of packages
d1 <- dat[, length(week), by=package]
d1 <- d1[order(V1), ]
d1[package=="TripleR", ]
d1[package=="psych", ]
 
# plot 1: Compare downloads of selected packages on a weekly basis
agg1 <- dat[J(c("TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]
 
ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x  = element_text(angle=90, size=8, vjust=0.5))
 
 
agg1 <- dat[J(c("psych", "TripleR", "RSA")), length(unique(ip_id)), by=c("week", "package")]
 
ggplot(agg1, aes(x=week, y=V1, color=package, group=package)) + geom_line() + ylab("Downloads") + theme_bw() + theme(axis.text.x  = element_text(angle=90, size=8, vjust=0.5))

 
 
Here are my two packages, TripleR and RSA. Actually, ~30 downloads per week (from this single mirror) is much more than I’ve expected!Bildschirmfoto 2013-06-11 um 14.11.30

 

To put things in perspective: package psych included in the plot:

Bildschirmfoto 2013-06-11 um 14.11.43

Some psychological sidenotes on social comparisons:

  • Downward comparisons enhance well-being, extreme upward comparisons are detrimental. Hence, do never include ggplot2 into your graphic!
  • Upward comparisons instigate your achievement motive, and give you drive to get better. Hence, select some packages, which are slightly above your own.
  • Of course, things are a bit more complicated than that …

All source code on this post is licensed under the FreeBSD license.

Comments (10) | Trackback

At what sample size do correlations stabilize?

Maybe you have encountered this situation: you run a large-scale study over the internet, and out of curiosity, you frequently check the correlation between two variables.

My experience with this practice is usually frustrating, as in small sample sizes (and we will see what “small” means in this context) correlations go up and down, change sign, move from “significant” to “non-significant” and back. As an example, see Figure 1 which shows the actual trajectory of a correlation, plotted against sample size (I also posted a video of this evolution).

It is simply the order how participants dropped into the study (i.e., data has not been rearranged). In this case, the correlation started really strong (r = .69) and continuously decayed until it’s final r of .26. The light gray lines show some exemplary bootstrapped alternative trajectories.

evolDemo

Figure 1: The evolution of a correlation.

In this particular case, at least the sign was stable (“There is a positive relationship in the population”, see also “Type-S errors”). Other trajectories in this data set, however, changed signs or their significance status. One correlation even changed from “negative significant” to “positive significant”!

Obviously, the estimate of a correlation stabilizes with increasing sample size. Now I wanted to know: At which sample size exactly can I expect a correlation to be stable? An informal query amongst colleagues revealed estimates between  n = 80 and n = 150.

Together with Marco Perugini, I did a systematic analysis of this question. The results of this simulation study are reported here [PDF, 0.39 MB]. In this paper a “corridor of stability” (COS) has been utilized: Deviations from the true value are defined as tolerable as long as they stay within that corridor (see also Figure 1 for a COS of +/- .1). The point of stability (POS) is that sample size from which on a specific trajectory does not leave the COS anymore.

The point of stability depends on the effect size (How strong is the true correlation?), the width of the corridor of stability (How much deviation from the true value am I willing to accept?), and the confidence in the decision (How confident do I want to be that the trajectory does not leave the COS any more?). If you’re interested in the details: read the paper. It’s not long.

The bottom line is: For typical scenarios in psychology, correlations stabilize when n approaches 250. That means, estimates with n > 250 are not only significant, they also are fairly accurate (see also Kelley & Maxwell, 2003, and Maxwell, Kelley, & Rausch, 2008, for elaborated discussions on parameter accuracy).

Additional analyses (not reported in the publication)

Figure 2 shows the distribution of POS values, depending on the half-width of the COS and on effect size rho. The horizontal axis is cut at n = 300, although several POS were > 300. It can be seen that all distributions have a very long tail. This makes the estimation of the 95th quantile very unstable. Therefore we used a larger number of 100’000 bootstrap replications in each experimental condition in order to get fairly stable estimates for the extreme quantiles.

 

POS_distribution

Figure 2: Distribution of POS values, depending on half-width of COS and effect size rho

Finally, Figure 3 shows the probability that a trajectory leaves the COS with increasing sample size.

probLeaving

Figure 3

The dotted lines mark the confidence levels of 80%, 90%, and 95% which were used in the publications. The n where the curves intersect these dotted lines indicate the values reported in Table 1 of the publication. For example, if the true correlation is .3 (which is already more than the average effect size in psychology) and you collect 100 participants, there’s still a chance of 50% that your correlation will leave the corridor between .21 and .39 (which are the boundaries for w=.1).

What is the conclusion? Significance tests determine the sign of a correlation. This conclusion can be made with much lower sample sizes. However, when we want to make an accurate conclusion about the size of an effect with some confidence (and we do not want to make a “Type M” error), we need much larger samples.

The full R source code for the simulations can be downloaded here.

References:

Maxwell, S. E., Kelley, K., & Rausch, J. R. (2008). Sample size planning for statistical power and accuracy in parameter estimation. Annual Review of Psychology, 59, 537–563. doi:10.1146/annurev.psych.59.103006.093735 [PDF]
Schönbrodt, F. D., & Perugini, M. (in press). At what sample size do correlations stabilize? Journal of Research in Personality. doi:10.1016/j.jrp.2013.05.009 [PDF]
—-
Comments (2) | Trackback

Installation of WRS package (Wilcox’ Robust Statistics)

Some users had trouble installing the WRS package from R-Forge. Here’s a method that should work automatically and fail-safe:

?View Code RSPLUS
# first: install dependent packages
install.packages(c("MASS", "akima", "robustbase"))
 
# second: install suggested packages
install.packages(c("cobs", "robust", "mgcv", "scatterplot3d", "quantreg", "rrcov", "lars", "pwr", "trimcluster", "parallel", "mc2d", "psych", "Rfit"))
 
# third: install WRS
install.packages("WRS", repos="http://R-Forge.R-project.org", type="source")

WRS cannot be hosted on CRAN, as CRAN demands help files for every user-visible function. This has not been done for WRS (yet). For the time being, this somewhat more complicated installation routine has to be used.

No comments | Trackback

Improved evolution of correlations

As an update of this post: here’s an improved version of “The evolution of correlations”.

From the original post:

This is the evolution of a bivariate correlation between two questionnaire scales, “hope of power” and “fear of losing control”. Both scales were administered in an open online study. The video shows how the correlation evolves from r = .69*** (n=20) to r = .26*** (n=271). It does not stabilize until n = 150.

Data has not been rearranged – it is the random order how participants dropped into the study. This had been a rather extreme case of an unstable correlation – other scales in this study were stable right from the beginning. Maybe this video could help as an anecdotal caveat for a careful interpretation of correlations with small n’s (and with ‘small’ I mean n < 100) …

The right panel now displays the correlation in each step. The horizontal green line is the final correlation that is approached, the curved dotted line shows the marginal correlation that would be significant at that sample size. As the empirical curve always is above this dotted line, it is significantly different from zero in each step.

Evolution of correlations – improved from Felix Schönbrodt on Vimeo.

 

Here the code that created the movie. It’s not fully self-contained – the function plotReg plots the dual-panel display, dat0, A, and B are parameters passed to this function. You can insert any other function here. The function loops through the rows of a data frame and saves a plot at every step into a subfolder. Finally, the function needs the command line version of ffmpeg, which connects the pictures to a movie.

?View Code RSPLUS
makeMovie <- function(fname, dat0, A, B, fps=15) {
 
	# create a new directory for the pictures
	dir.create(fname)
 
	# create the picture sequence
	picName <- paste(fname, "/", fname, "_%03d.jpg", sep="")
	jpeg(picName, width=800, height=450, quality=95)
	for (i in 15:nrow(dat0)) {
	  print(i)
	  plotReg(A, B, i, keep=15)
	}
	dev.off()
 
	# delete any existing movie file
	unlink(paste(fname,".mpg",sep=""))
 
	# point system to R's working directory
	system(paste("cd ", gsub(" ", "\\ ", getwd(), fixed=TRUE)))
 
	# show & execute the command line expression for ffmpeg to glue the pictures together
	print(paste(paste0("ffmpeg -r ", fps, " -i ", fname, "/", fname, "_%03d.jpg -sameq -r 25 ",  paste0(fname,".avi"))))
	system(paste(paste0("ffmpeg -r ", fps, " -i ", fname, "/", fname, "_%03d.jpg -sameq -r 25 ",  paste0(fname,".avi"))))
}
Comments (5) | Trackback

Optimizing parameters for an oscillator – Video

Here’s a video how the modFit function from the FME package optimizes parameters for an oscillation. A Nelder-Mead-optimizer (R function optim) finds the best fitting parameters for an undampened oscillator. Minimum was found after 72 iterations, true parameter eta was -.05:

Evolution of parameters in optimization process from Felix Schönbrodt on Vimeo.

More on estimating parameters of differential equations is coming later on this blog!

Things I’ve learned:

  • ffmpeg does not like pngs. They are internally converted to jpg in a very low quality and I could not find a way to improve this quality. Lesson learned: Export high quality jpgs from your R function
  • Use a standard frame rate for the output file (i.e., 24, 25, or 30 fps)
  • My final ffmpeg command: ffmpeg -r 10 -i modFit%03d.jpg -r 25 -b:v 5000K modelFit.avi
    • -r 10: Use 10 pictures / second as input
    • -i modFit%03d.jpg: defines the names of the input files, modFit001.jpg, modFit002.jpg, …
    • -r 25: Set framerate of output file to 25 fps
    • -b:v 5000K: set bitrate of video to a high value
    • modelFit.mp4: video name and encoding type (mp4)
No comments | Trackback

R-package: Wilcox’ Robust Statistics updated (WRS v0.20)

Rand Wilcox constantly updates the functions accompanying his books on robust statistics. Recently, they have been updated to version 20. The functions are available in the WRS package for R – for installation simply type
install.packages("WRS", repos="http://R-Forge.R-project.org")

In version 0.20, a number of functions dealing with ANCOVA have been added and some others improved. Unfortunately, only very few help files exist for the functions. I would recommend to check out the source code, as most functions have a comment section roughly explaining the parameters. Alternatively, consult Wilcox’ books for descriptions of the functions.

References:
Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing, 3rd Ed. Academic Press.
Wilcox, R. (2012). Modern Statistics for the Social and Behavioral Sciences: A Practical Introduction. CRC Press.
Wilcox, R. R. (2010). Fundamentals of Modern Statistical Methods: Substantially Improving Power and Accuracy. Springer, 2nd Ed.
Wilcox, R. R. (2009). Basics Statistics: Understanding Conventional Methods and Modern Insights. New York: Oxford.

No comments | Trackback

Parse pdf files with R (on a Mac)

Inspired by this blog post from theBioBucket, I created a script to parse all pdf files in a directory. Due to its reliance on the Terminal, it’s Mac specific, but modifications for other systems shouldn’t be too hard (as a start for Windows, see BioBucket’s script).

First, you have to install the command line tool pdftotext (a binary can be found on Carsten Blüm’s website). Then, run following script within a directory with pdfs:

?View Code RSPLUS
# helper function: get number of words in a string, separated by tab, space, return, or point.
nwords <- function(x){
	res <- strsplit(as.character(x), "[ \t\n,\\.]+")
	res <- lapply(res, length)
	unlist(res)
}
 
# sanitize file name for terminal usage (i.e., escape spaces)
sanitize <- function(str) {
	gsub('([#$%&~_\\^\\\\{}\\s\\(\\)])', '\\\\\\1', str, perl = TRUE)
}
 
# get a list of all files in the current directory
fi <- list.files()
fi2 <- fi[grepl(".pdf", fi)]
 
 
## Parse files and do something with it ...
res <- data.frame() # keeps records of the calculations
for (f in fi2) {
	print(paste("Parsing", f))
 
	f2 <- sanitize(f)
	system(paste0("pdftotext ", f2), wait = TRUE)
 
	# read content of converted txt file
	filetxt <- sub(".pdf", ".txt", f)
	text <- readLines(filetxt, warn=FALSE)
 
	# adjust encoding of text - you have to know it
	Encoding(text) <- "latin1"
 
	# Do something with the content - here: get word and character count of all pdfs in the current directory
	text2 <- paste(text, collapse="\n")	# collapse lines into one long string
 
	res <- rbind(res, data.frame(filename=f, wc=nwords(text2), cs=nchar(text2), cs.nospace=nchar(gsub("\\s", "", text2)))) 
 
	# remove converted text file
	file.remove(filetxt)
}
 
print(res)

… gives following result (wc = word count, cs = characgter count, cs.nospace = character count without spaces):


> print(res)
                                                    filename    wc     cs cs.nospace
1                              Applied_Linear_Regression.pdf 33697 186280     154404
2                                           Baron-rpsych.pdf 22665 128440     105024
3                              bootstrapping regressions.pdf  6309  34042      27694
4                            Ch_multidimensional_scaling.pdf   718   4632       3908
5                                               corrgram.pdf  6645  40726      33965
6                   eRm - Extended Rach Modeling (Paper).pdf 11354  65273      53578
7                                           eRm (Folien).pdf   371   1407        886
8  Faraway 2002 - Practical Regression and ANOVA using R.pdf 68777 380902     310037
9                             Farnsworth-EconometricsInR.pdf 20482 125207     101157
10                                           ggplot_book.pdf 10681  65388      53551
11                                       ggplot2-lattice.pdf 18067 118591      93737
12                               lavaan_usersguide_0.3-1.pdf 12608  64232      52962
13                                  lme4 - Bootstrapping.pdf  2065  11739       9515
14                                                Mclust.pdf 18191  92180      70848
15                                              multcomp.pdf  5852  38769      32344
16                                       OpenMxUserGuide.pdf 37320 233817     197571

Comments (1) | Trackback

How to check your package with R-devel

In response to an update to ggplot2 (now verson 0.9.2) I had to make some minor changes to our package TripleR. The CRAN maintainers also asked to …

Please also fix other issues that may be apparent in checks with a current R-devel.

Now, how can this be done? Here’s my workflow on Mac OS (might be slightly different on Win or Linux):

  1.   Install R-develparallel to your existing (stable) R version
    1.  Before installation, run sudo pkgutil --forget org.r-project.R.Leopard.fw.pkg in the Terminal, otherwise the installer will overwrite your existing version
    2. Rename your R.app and R64.app or move them temporarily into another folder, as the installer of R-devel probably will replace them by new version that are not compatible with your existing stable R version.
  2.   Use RSwitch to change the active R version
  3.   Install packages which your own packages depends on; you have to do it from source, as the binaries for the R-devel do not exist: install.packages("lme4", type="source")
  4.  Check your own package using following flag: R CMD check pkg --as-cran
  5. Check if your package also works on Windows using winbuilder

Furthermore, check whether your package follows the CRAN Repository Policies.

PS: Finally, I managed to get rid of the annoying R CMD check warnings like “no visible binding for global variable ‘x’”. These occured due to the ggplot2 syntax. Here‘s a solution.

Comments (2) | Trackback

Visually weighted/ Watercolor Plots, new variants: Please vote!

Update Oct-23: Added a new parameter add to the function. Now multiple groups can be plotted in a single plot (see example in my comment)

As a follow-up on my R implementation of Solomon’s watercolor plots, I made some improvements to the function. I fine-tuned the graphical parameters (the median smoother line now diminishes faster with increasing CIs, and the shaded watercolors look more pretty). Furthermore, the function is faster and has more features:

  • You can define any standard regression function for the bootstrap procedure.
    • vwReg(y ~ x, df, method=lm)
    • vwReg(y ~ x + I(x^2), df, method=lm)
  • Provide parameters for the fitting function.
    • You can make the smoother’s span larger. Then it takes more points into account when doing the local fitting. Per default, the smoother fits a polynomial of degree two – that means as you increase span you will approach the overall quadratic fit: vwReg(y ~ x, df, span=2)
    • You can also make the smoother’s span smaller, then it takes less points for local fitting. If it is too small, it will overfit and approach each single data point. The default span (.75) seemed to be the best choice for me for a variety of data sets: vwReg(y ~ x, df, span=0.5)
    • Use a robust M-estimator for the smoother; see ?loess for details: vwReg(y ~ x, df, family=”symmetric”)
  • Provide your own color scheme (or, for example, a black-and-white scheme). Examples see pictures below.
  • Quantize the color ramp, so that regions for 1, 2, and 3 SD have the same color (an idea proposed by John Mashey).

At the end of this post is the source code for the R function.

Some picture – please vote!

Here are some variants of the watercolor plots – at the end, you can vote for your favorite (or write something into the comments). I am still fine-tuning the default parameters, and I am interested in your opinions what would be the best default.

Plot 1: The current default

 

Plot 2: Using an M-estimator for bootstrap smoothers. Usually you get wider confidence intervalls.

 

Plot 3:Increasing the span of the smoothers

 

Plot 4: Decreasing the span of the smoothers

Plot 5: Changing the color scheme, using a predefined ColorBrewer palette. You can see all available palettes by using this command: library(RColorBrewer); display.brewer.all()

 

Plot 6: Using a custom-made palette

 

Plot 7: Using a custom-made palette; with the parameter bias you can shift the color ramp to the “higher” colors:

 

Plot 8: A black and white version of the plot

 

Plot 9: The anti-Tufte-plot: Using as much ink as possible by reversing black and white (a.k.a. “the Milky-Way Plot“)

Plot 10: The Northern Light Plot/ fMRI plot. This plotting technique already has been used by a suspicious company (called IRET – never heard of that). I hurried to publish the R code under a FreeBSD license before they can patent it! Feel free to use, share, or change the code for whatever purpose you need. Isn’t that beautiful?

Plot 11: The 1-2-3-SD plot. You can use your own color schemes as well, e.g.: vwReg(y~x, df, bw=TRUE, quantize=”SD”)

 

Any comments or ideas? Or just a vote? If you produce some nice plots with your data, you can send it to me, and I will post a gallery of the most impressive “data art”!

Cheers,

Felix

Which water color plot do you like most?

View Results

Loading ... Loading ...
?View Code RSPLUS
# Copyright 2012 Felix Schönbrodt
# All rights reserved.
# 
# FreeBSD License
# 
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
# 
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#       
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#       
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER `AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
# 
# The views and conclusions contained in the software and documentation
# are those of the authors and should not be interpreted as representing
# official policies, either expressed or implied, of the copyright
# holder.
 
# Version history:
# 0.1: original code
# 0.1.1: changed license to FreeBSD; re-established compability to ggplot2 (new version 0.9.2)
 
## Visually weighted regression / Watercolor plots
## Idea: Solomon Hsiang, with additional ideas from many blog commenters
 
 
# B = number bootstrapped smoothers
# shade: plot the shaded confidence region?
# shade.alpha: should the CI shading fade out at the edges? (by reducing alpha; 0 = no alpha decrease, 0.1 = medium alpha decrease, 0.5 = strong alpha decrease)
# spag: plot spaghetti lines?
# spag.color: color of spaghetti lines
# mweight: should the median smoother be visually weighted?
# show.lm: should the linear regresison line be plotted?
# show.CI: should the 95% CI limits be plotted?
# show.median: should the median smoother be plotted?
# median.col: color of the median smoother
# shape: shape of points
# method: the fitting function for the spaghettis; default: loess
# bw = TRUE: define a default b&w-palette
# slices: number of slices in x and y direction for the shaded region. Higher numbers make a smoother plot, but takes longer to draw. I wouldn'T go beyond 500
# palette: provide a custom color palette for the watercolors
# ylim: restrict range of the watercoloring
# quantize: either "continuous", or "SD". In the latter case, we get three color regions for 1, 2, and 3 SD (an idea of John Mashey)
# add: if add == FALSE, a new ggplot is returned. If add == TRUE, only the elements are returned, which can be added to an existing ggplot (with the '+' operator)
# ...: further parameters passed to the fitting function, in the case of loess, for example, "span = .9", or "family = 'symmetric'"
vwReg <- function(formula, data, title="", B=1000, shade=TRUE, shade.alpha=.1, spag=FALSE, spag.color="darkblue", mweight=TRUE, show.lm=FALSE, show.median = TRUE, median.col = "white", shape = 21, show.CI=FALSE, method=loess, bw=FALSE, slices=200, palette=colorRampPalette(c("#FFEDA0", "#DD0000"), bias=2)(20), ylim=NULL, quantize = "continuous",  add=FALSE, ...) {
	IV <- all.vars(formula)[2]
	DV <- all.vars(formula)[1]
	data <- na.omit(data[order(data[, IV]), c(IV, DV)])
 
	if (bw == TRUE) {
		palette <- colorRampPalette(c("#EEEEEE", "#999999", "#333333"), bias=2)(20)
	}
 
	print("Computing boostrapped smoothers ...")
	newx <- data.frame(seq(min(data[, IV]), max(data[, IV]), length=slices))
	colnames(newx) <- IV
	l0.boot <- matrix(NA, nrow=nrow(newx), ncol=B)
 
	l0 <- method(formula, data)
	for (i in 1:B) {
		data2 <- data[sample(nrow(data), replace=TRUE), ]
		data2 <- data2[order(data2[, IV]), ]
		if (class(l0)=="loess") {
			m1 <- method(formula, data2, control = loess.control(surface = "i", statistics="a", trace.hat="a"), ...)
		} else {
			m1 <- method(formula, data2, ...)
		}
		l0.boot[, i] <- predict(m1, newdata=newx)
	}
 
	# compute median and CI limits of bootstrap
	library(plyr)
	library(reshape2)
	CI.boot <- adply(l0.boot, 1, function(x) quantile(x, prob=c(.025, .5, .975, pnorm(c(-3, -2, -1, 0, 1, 2, 3))), na.rm=TRUE))[, -1]
	colnames(CI.boot)[1:10] <- c("LL", "M", "UL", paste0("SD", 1:7))
	CI.boot$x <- newx[, 1]
	CI.boot$width <- CI.boot$UL - CI.boot$LL
 
	# scale the CI width to the range 0 to 1 and flip it (bigger numbers = narrower CI)
	CI.boot$w2 <- (CI.boot$width - min(CI.boot$width))
	CI.boot$w3 <- 1-(CI.boot$w2/max(CI.boot$w2))
 
 
	# convert bootstrapped spaghettis to long format
	b2 <- melt(l0.boot)
	b2$x <- newx[,1]
	colnames(b2) <- c("index", "B", "value", "x")
 
	library(ggplot2)
	library(RColorBrewer)
 
	# Construct ggplot
	# All plot elements are constructed as a list, so they can be added to an existing ggplot
 
	# if add == FALSE: provide the basic ggplot object
	p0 <- ggplot(data, aes_string(x=IV, y=DV)) + theme_bw()
 
	# initialize elements with NULL (if they are defined, they are overwritten with something meaningful)
	gg.tiles <- gg.poly <- gg.spag <- gg.median <- gg.CI1 <- gg.CI2 <- gg.lm <- gg.points <- gg.title <- NULL
 
	if (shade == TRUE) {
		quantize <- match.arg(quantize, c("continuous", "SD"))
		if (quantize == "continuous") {
			print("Computing density estimates for each vertical cut ...")
			flush.console()
 
			if (is.null(ylim)) {
				min_value <- min(min(l0.boot, na.rm=TRUE), min(data[, DV], na.rm=TRUE))
				max_value <- max(max(l0.boot, na.rm=TRUE), max(data[, DV], na.rm=TRUE))
				ylim <- c(min_value, max_value)
			}
 
			# vertical cross-sectional density estimate
			d2 <- ddply(b2[, c("x", "value")], .(x), function(df) {
				res <- data.frame(density(df$value, na.rm=TRUE, n=slices, from=ylim[1], to=ylim[2])[c("x", "y")])
				#res <- data.frame(density(df$value, na.rm=TRUE, n=slices)[c("x", "y")])
				colnames(res) <- c("y", "dens")
				return(res)
			}, .progress="text")
 
			maxdens <- max(d2$dens)
			mindens <- min(d2$dens)
			d2$dens.scaled <- (d2$dens - mindens)/maxdens	
 
			## Tile approach
			d2$alpha.factor <- d2$dens.scaled^shade.alpha
			gg.tiles <-  list(geom_tile(data=d2, aes(x=x, y=y, fill=dens.scaled, alpha=alpha.factor)), scale_fill_gradientn("dens.scaled", colours=palette), scale_alpha_continuous(range=c(0.001, 1)))
		}
		if (quantize == "SD") {
			## Polygon approach
 
			SDs <- melt(CI.boot[, c("x", paste0("SD", 1:7))], id.vars="x")
			count <- 0
			d3 <- data.frame()
			col <- c(1,2,3,3,2,1)
			for (i in 1:6) {
				seg1 <- SDs[SDs$variable == paste0("SD", i), ]
				seg2 <- SDs[SDs$variable == paste0("SD", i+1), ]
				seg <- rbind(seg1, seg2[nrow(seg2):1, ])
				seg$group <- count
				seg$col <- col[i]
				count <- count + 1
				d3 <- rbind(d3, seg)
			}
 
			gg.poly <-  list(geom_polygon(data=d3, aes(x=x, y=value, color=NULL, fill=col, group=group)), scale_fill_gradientn("dens.scaled", colours=palette, values=seq(-1, 3, 1)))
		}
	}
 
	print("Build ggplot figure ...")
	flush.console()
 
 
	if (spag==TRUE) {
		gg.spag <-  geom_path(data=b2, aes(x=x, y=value, group=B), size=0.7, alpha=10/B, color=spag.color)
	}
 
	if (show.median == TRUE) {
		if (mweight == TRUE) {
			gg.median <-  geom_path(data=CI.boot, aes(x=x, y=M, alpha=w3^3), size=.6, linejoin="mitre", color=median.col)
		} else {
			gg.median <-  geom_path(data=CI.boot, aes(x=x, y=M), size = 0.6, linejoin="mitre", color=median.col)
		}
	}
 
	# Confidence limits
	if (show.CI == TRUE) {
		gg.CI1 <- geom_path(data=CI.boot, aes(x=x, y=UL), size=1, color="red")
		gg.CI2 <- geom_path(data=CI.boot, aes(x=x, y=LL), size=1, color="red")
	}
 
	# plain linear regression line
	if (show.lm==TRUE) {gg.lm <- geom_smooth(method="lm", color="darkgreen", se=FALSE)}
 
	gg.points <- geom_point(data=data, aes_string(x=IV, y=DV), size=1, shape=shape, fill="white", color="black")		
 
	if (title != "") {
		gg.title <- theme(title=title)
	}
 
 
	gg.elements <- list(gg.tiles, gg.poly, gg.spag, gg.median, gg.CI1, gg.CI2, gg.lm, gg.points, gg.title, theme(legend.position="none"))
 
	if (add == FALSE) {
		return(p0 + gg.elements)
	} else {
		return(gg.elements)
	}
}
Comments (34) | Trackback

Amazing fMRI plots for everybody!

Dear valued customer,

it is a well-known scientific truth that research results which are accompanied by a fancy, colorful fMRI scan, are perceived as more believable and more persuasive than simple bar graphs or text results (McCabe & Castel, 2007; Weisberg, Keil, Goodstein, Rawson, & Gray, 2008). Readers even agree more with fictitious and unsubstantiated claims, as long as you provide a colorful brain image, and it works even when the subject is a dead salmon.

 

The power of brain images for everybody

What are the consequence of these troubling findings? The answer is clear. Everybody should be equipped with these powerful tools of research communication! We at IRET made it to our mission to provide the latest, cutting-edge tools for your research analysis. In this case we adopted a new technology called “visually weighted regression” or “watercolor plots” (see here, here, or here), and simply applied a new color scheme.

But now, let’s get some hands on it!

The example

Imagine you invested a lot of effort in collecting the data of 41 participants. Now you find following pattern in 2 of your 87 variables:

You could show that plain scatterplot. But should you do it? Nay. Of course everybody would spot the outliers on the top right. But which is much more important: it is b-o-r-i-n-g!

What is the alternative? Reporting the correlation as text? “We found a correlation of r = .38 (p = .014)”. Yawn.

Or maybe: “We chose to use a correlation technique that is robust against outliers and violations of normality, the Spearman rank coefficient. It turned out that the correlation broke down and was not significant any more (r = .06, p = .708).”.

Don’t be silly! With that style of scientific reporting, there would be nothing to write home about. But you can be sure: we have the right tools for you. Finally, the power of pictures is not limited to brain research – now you can turn any data into a magical fMRI plot like that:

Isn’t that beautiful? We recommend to accompany the figure with an elaborated description: “For local fitting, we used spline smoothers from 10`000 bootstrap replications. For a robust estimation of vertical confidence densities, a re-descending M-estimator with Tukey’s biweight function was employed. As one can clearly see in the plot, there is  significant confidence in the prediction of the x=0, y=0 region, as well as a minor hot spot in the x=15, y=60 region (also known as the supra-dextral data region).”

Magical Data Enhancer Tool

With the Magical Data Enhancer Tool (MDET) you can …

  • … turn boring, marginally significant, or just crappy results into a stunning research experience
  • … publish in scientific journal with higher impact factors
  • … receive the media coverage that you and your research deserve
  • … achieve higher acceptance rates from funding agencies
  • … impress young women at the bar (you wouldn’t show a plain scatterplot, dude?!)

FAQ

Q: But – isn’t that approach unethical?
A: No, it’s not at all. In contrast, we at IRES think that it is unethical that only some researchers are allowed to exploit the cognitive biases of their readers. We design our products with a great respect for humanity and we believe that every researcher who can afford our products should have the same powerful tools at hand.

Q: How much does you product cost?
A: The standard version of the Magical Data Enhancer ships for 12’998 $. We are aware that this is a significant investment. But, come on: You deserve it! Furthermore, we will soon publish a free trial version, including the full R code on this blog. So stay tuned!

 

Best regards,

Lexis “Lex” Brycenet (CEO & CTO Research Communication)
International Research Enhancement Technology (IRET)

Comments (8) | Trackback