Parse pdf files with R (on a Mac)

Inspired by this blog post from theBioBucket, I created a script to parse all pdf files in a directory. Due to its reliance on the Terminal, it’s Mac specific, but modifications for other systems shouldn’t be too hard (as a start for Windows, see BioBucket’s script).

First, you have to install the command line tool pdftotext (a binary can be found on Carsten Blüm’s website). Then run the following script within a directory that contains PDF files:

# helper function: get the number of words in a string, where words are separated by spaces, tabs, newlines, commas, or periods
nwords <- function(x){
    res <- strsplit(as.character(x), "[ \t\n,\\.]+")
    res <- lapply(res, length)
    unlist(res)
}

# sanitize a file name for Terminal usage (i.e., escape spaces, parentheses, and other special characters)
sanitize <- function(str) {
    gsub('([#$%&~_\\^\\\\{}\\s\\(\\)])', '\\\\\\1', str, perl = TRUE)
}

# get a list of all files in the current directory
fi <- list.files()
fi2 <- fi[grepl(".pdf", fi)]


## Parse files and do something with it ...
res <- data.frame() # keeps records of the calculations
for (f in fi2) {
    print(paste("Parsing", f))
   
    f2 <- sanitize(f)
    system(paste0("pdftotext ", f2), wait = TRUE)

    # read the content of the converted txt file
    filetxt <- sub("\\.pdf$", ".txt", f)
    text <- readLines(filetxt, warn=FALSE)
   
    # adjust the encoding of the text - you have to know which encoding your pdfs use (here: latin1)
    Encoding(text) <- "latin1"
   
    # Do something with the content - here: get word and character count of all pdfs in the current directory
    text2 <- paste(text, collapse="\n") # collapse lines into one long string
   
    res <- rbind(res, data.frame(filename=f, wc=nwords(text2), cs=nchar(text2), cs.nospace=nchar(gsub("\\s", "", text2))))
   
    # remove converted text file
    file.remove(filetxt)
}

print(res)

… gives the following result (wc = word count, cs = character count, cs.nospace = character count without spaces):

> print(res)
                                                    filename    wc     cs cs.nospace
1                              Applied_Linear_Regression.pdf 33697 186280     154404
2                                           Baron-rpsych.pdf 22665 128440     105024
3                              bootstrapping regressions.pdf  6309  34042      27694
4                            Ch_multidimensional_scaling.pdf   718   4632       3908
5                                               corrgram.pdf  6645  40726      33965
6                   eRm - Extended Rach Modeling (Paper).pdf 11354  65273      53578
7                                           eRm (Folien).pdf   371   1407        886
8  Faraway 2002 - Practical Regression and ANOVA using R.pdf 68777 380902     310037
9                             Farnsworth-EconometricsInR.pdf 20482 125207     101157
10                                           ggplot_book.pdf 10681  65388      53551
11                                       ggplot2-lattice.pdf 18067 118591      93737
12                               lavaan_usersguide_0.3-1.pdf 12608  64232      52962
13                                  lme4 - Bootstrapping.pdf  2065  11739       9515
14                                                Mclust.pdf 18191  92180      70848
15                                              multcomp.pdf  5852  38769      32344
16                                       OpenMxUserGuide.pdf 37320 233817     197571
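
If you want to keep this summary, a small base-R follow-up (not part of the original script; the file name is arbitrary) could sort the table by word count and write it to disk:

# sort by word count (descending) and save the summary table
res.sorted <- res[order(res$wc, decreasing = TRUE), ]
write.csv(res.sorted, "pdf_wordcounts.csv", row.names = FALSE)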

How to check your package with R-devel

In response to an update to ggplot2 (now version 0.9.2), I had to make some minor changes to our package TripleR. The CRAN maintainers also asked to …

Please also fix other issues that may be apparent in checks with a current R-devel.

Now, how can this be done? Here’s my workflow on Mac OS (might be slightly different on Win or Linux):

  1.  Install R-devel parallel to your existing (stable) R version
    1.  Before installation, run one of these lines in the Terminal:
      sudo pkgutil --forget org.r-project.R.Leopard.fw.pkg
      sudo pkgutil --forget org.r-project.R.mavericks.fw.pkg

      (the installer's messages tell you which one applies to your installed version); otherwise the installer will overwrite your existing version.

    2.  Rename your R.app and R64.app or move them temporarily into another folder, as the R-devel installer will probably replace them with new versions that are not compatible with your existing stable R version.
  2.  Use RSwitch to change the active R version.
  3.  Install the packages your own package depends on; you have to do this from source, as binaries for R-devel do not exist:
    install.packages("lme4", type="source")
  4.  Check your own package using the following flag (see the sketch after this list):
    R CMD check pkg --as-cran
  5.  Check whether your package also works on Windows using winbuilder.
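
For reference, a minimal sketch of how steps 3 and 4 could look in the Terminal ("pkg" and the version number are placeholders, not an actual package name):

    R CMD build pkg
    R CMD check --as-cran pkg_0.1.tar.gz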

Furthermore, check whether your package follows the CRAN Repository Policies.

PS: Finally, I managed to get rid of the annoying R CMD check warnings like "no visible binding for global variable 'x'". These occurred due to the ggplot2 syntax. Here's a solution.
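
I won't reproduce the linked solution here, but one common workaround (not necessarily the one linked above) is to declare the variable names that only appear inside ggplot2's aes() calls as global variables somewhere in your package code; the names below are just examples:

# replace "x", "y", "value" with the names R CMD check complains about
if (getRversion() >= "2.15.1") utils::globalVariables(c("x", "y", "value"))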


Visually weighted/ Watercolor Plots, new variants: Please vote!

Update Oct-23: Added a new parameter add to the function. Now multiple groups can be plotted in a single plot (see the example in my comment).
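
A rough sketch of what that could look like (df1 and df2 are hypothetical data frames with columns x and y, and the options are my own choices, not the example from the comment):

# first group: a complete ggplot object is returned (add defaults to FALSE)
p <- vwReg(y ~ x, df1, spag=TRUE, shade=FALSE)
# second group: with add=TRUE only the plot elements are returned,
# so they can be layered onto the existing plot with '+'
p + vwReg(y ~ x, df2, add=TRUE, spag=TRUE, shade=FALSE, spag.color="darkred")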

As a follow-up on my R implementation of Solomon's watercolor plots, I made some improvements to the function. I fine-tuned the graphical parameters (the median smoother line now diminishes faster with increasing CIs, and the shaded watercolors look prettier). Furthermore, the function is faster and has more features:

  • You can define any standard regression function for the bootstrap procedure.
    • vwReg(y ~ x, df, method=lm)
    • vwReg(y ~ x + I(x^2), df, method=lm)
  • You can provide parameters for the fitting function.
    • You can make the smoother's span larger; it then takes more points into account when doing the local fitting. By default, the smoother fits a polynomial of degree two, so as you increase the span you approach the overall quadratic fit: vwReg(y ~ x, df, span=2)
    • You can also make the smoother's span smaller; it then takes fewer points for the local fitting. If the span is too small, the smoother will overfit and approach each single data point. The default span (.75) seemed to be the best choice for a variety of data sets: vwReg(y ~ x, df, span=0.5)
    • Use a robust M-estimator for the smoother; see ?loess for details: vwReg(y ~ x, df, family="symmetric")
  • Provide your own color scheme (or, for example, a black-and-white scheme). For examples, see the pictures below.
  • Quantize the color ramp, so that the regions within 1, 2, and 3 SD each get a uniform color (an idea proposed by John Mashey). A combined example follows below.
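
For instance, combining some of these options could look like this (df is assumed to be a data frame with columns x and y; the specific palette is my own choice, not from the post):

# custom palette, quantized into 1/2/3-SD bands, with spaghetti lines on top
vwReg(y ~ x, df, quantize="SD",
      palette=colorRampPalette(c("white", "steelblue", "navy"))(20),
      spag=TRUE, spag.color="black", show.median=FALSE)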

At the end of this post is the source code for the R function.

Some pictures - please vote!

Here are some variants of the watercolor plots - at the end, you can vote for your favorite (or write something in the comments). I am still fine-tuning the default parameters, and I am interested in your opinions on what the best default would be.

Plot 1: The current default

 

Plot 2: Using an M-estimator for the bootstrap smoothers. Usually you get wider confidence intervals.

 

Plot 3: Increasing the span of the smoothers

 

Plot 4: Decreasing the span of the smoothers

Plot 5: Changing the color scheme, using a predefined ColorBrewer palette. You can see all available palettes by using this command: library(RColorBrewer); display.brewer.all()

 

Plot 6: Using a custom-made palette

 

Plot 7: Using a custom-made palette; with the parameter bias you can shift the color ramp to the “higher” colors:

 

Plot 8: A black and white version of the plot

 

Plot 9: The anti-Tufte plot: Using as much ink as possible by reversing black and white (a.k.a. "the Milky-Way Plot")

Plot 10: The Northern Light Plot / fMRI plot. This plotting technique has already been used by a suspicious company (called IRET - never heard of them). I hurried to publish the R code under a FreeBSD license before they could patent it! Feel free to use, share, or change the code for whatever purpose you need. Isn't that beautiful?

Plot 11: The 1-2-3-SD plot. You can use your own color schemes as well, e.g.: vwReg(y~x, df, bw=TRUE, quantize="SD")

 

Any comments or ideas? Or just a vote? If you produce some nice plots with your data, you can send them to me, and I will post a gallery of the most impressive "data art"!

Cheers,

Felix

Which watercolor plot do you like most?

# Copyright 2012 Felix Schönbrodt
# All rights reserved.
#
# FreeBSD License
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
# met:
#
# 1. Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#      
# 2. Redistributions in binary form must reproduce the above copyright
# notice, this list of conditions and the following disclaimer in the
# documentation and/or other materials provided with the distribution.
#      
# THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDER ``AS IS'' AND ANY
# EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
# PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
# EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
# PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
# PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF
# LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING
# NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
# SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
#
# The views and conclusions contained in the software and documentation
# are those of the authors and should not be interpreted as representing
# official policies, either expressed or implied, of the copyright
# holder.

# Version history:
# 0.1: original code
# 0.1.1: changed license to FreeBSD; re-established compatibility with ggplot2 (new version 0.9.2)

## Visually weighted regression / Watercolor plots
## Idea: Solomon Hsiang, with additional ideas from many blog commenters


# B: number of bootstrapped smoothers
# shade: plot the shaded confidence region?
# shade.alpha: should the CI shading fade out at the edges? (by reducing alpha; 0 = no alpha decrease, 0.1 = medium alpha decrease, 0.5 = strong alpha decrease)
# spag: plot spaghetti lines?
# spag.color: color of spaghetti lines
# mweight: should the median smoother be visually weighted?
# show.lm: should the linear regression line be plotted?
# show.CI: should the 95% CI limits be plotted?
# show.median: should the median smoother be plotted?
# median.col: color of the median smoother
# shape: shape of points
# method: the fitting function for the spaghettis; default: loess
# bw: if TRUE, use a default black-and-white palette
# slices: number of slices in x and y direction for the shaded region. Higher numbers make a smoother plot, but take longer to draw. I wouldn't go beyond 500.
# palette: provide a custom color palette for the watercolors
# ylim: restrict range of the watercoloring
# quantize: either "continuous", or "SD". In the latter case, we get three color regions for 1, 2, and 3 SD (an idea of John Mashey)
# add: if add == FALSE, a new ggplot is returned. If add == TRUE, only the elements are returned, which can be added to an existing ggplot (with the '+' operator)
# ...: further parameters passed to the fitting function, in the case of loess, for example, "span = .9", or "family = 'symmetric'"
vwReg <- function(formula, data, title="", B=1000, shade=TRUE, shade.alpha=.1, spag=FALSE, spag.color="darkblue", mweight=TRUE, show.lm=FALSE, show.median = TRUE, median.col = "white", shape = 21, show.CI=FALSE, method=loess, bw=FALSE, slices=200, palette=colorRampPalette(c("#FFEDA0", "#DD0000"), bias=2)(20), ylim=NULL, quantize = "continuous",  add=FALSE, ...) {
    IV <- all.vars(formula)[2]
    DV <- all.vars(formula)[1]
    data <- na.omit(data[order(data[, IV]), c(IV, DV)])
   
    if (bw == TRUE) {
        palette <- colorRampPalette(c("#EEEEEE", "#999999", "#333333"), bias=2)(20)
    }

    print("Computing boostrapped smoothers ...")
    newx <- data.frame(seq(min(data[, IV]), max(data[, IV]), length=slices))
    colnames(newx) <- IV
    l0.boot <- matrix(NA, nrow=nrow(newx), ncol=B)
   
    l0 <- method(formula, data)
    for (i in 1:B) {
        data2 <- data[sample(nrow(data), replace=TRUE), ]
        data2 <- data2[order(data2[, IV]), ]
        if (class(l0)=="loess") {
            m1 <- method(formula, data2, control = loess.control(surface = "i", statistics="a", trace.hat="a"), ...)
        } else {
            m1 <- method(formula, data2, ...)
        }
        l0.boot[, i] <- predict(m1, newdata=newx)
    }

    # compute median and CI limits of bootstrap
    library(plyr)
    library(reshape2)
    CI.boot <- adply(l0.boot, 1, function(x) quantile(x, prob=c(.025, .5, .975, pnorm(c(-3, -2, -1, 0, 1, 2, 3))), na.rm=TRUE))[, -1]
    colnames(CI.boot)[1:10] <- c("LL", "M", "UL", paste0("SD", 1:7))
    CI.boot$x <- newx[, 1]
    CI.boot$width <- CI.boot$UL - CI.boot$LL
   
    # scale the CI width to the range 0 to 1 and flip it (bigger numbers = narrower CI)
    CI.boot$w2 <- (CI.boot$width - min(CI.boot$width))
    CI.boot$w3 <- 1-(CI.boot$w2/max(CI.boot$w2))

   
    # convert bootstrapped spaghettis to long format
    b2 <- melt(l0.boot)
    b2$x <- newx[,1]
    colnames(b2) <- c("index", "B", "value", "x")

    library(ggplot2)
    library(RColorBrewer)
   
    # Construct ggplot
    # All plot elements are constructed as a list, so they can be added to an existing ggplot
   
    # if add == FALSE: provide the basic ggplot object
    p0 <- ggplot(data, aes_string(x=IV, y=DV)) + theme_bw()
   
    # initialize elements with NULL (if they are defined, they are overwritten with something meaningful)
    gg.tiles <- gg.poly <- gg.spag <- gg.median <- gg.CI1 <- gg.CI2 <- gg.lm <- gg.points <- gg.title <- NULL

    if (shade == TRUE) {
        quantize <- match.arg(quantize, c("continuous", "SD"))
        if (quantize == "continuous") {
            print("Computing density estimates for each vertical cut ...")
            flush.console()
       
            if (is.null(ylim)) {
                min_value <- min(min(l0.boot, na.rm=TRUE), min(data[, DV], na.rm=TRUE))
                max_value <- max(max(l0.boot, na.rm=TRUE), max(data[, DV], na.rm=TRUE))
                ylim <- c(min_value, max_value)
            }

            # vertical cross-sectional density estimate
            d2 <- ddply(b2[, c("x", "value")], .(x), function(df) {
                res <- data.frame(density(df$value, na.rm=TRUE, n=slices, from=ylim[1], to=ylim[2])[c("x", "y")])
                #res <- data.frame(density(df$value, na.rm=TRUE, n=slices)[c("x", "y")])
                colnames(res) <- c("y", "dens")
                return(res)
            }, .progress="text")

            maxdens <- max(d2$dens)
            mindens <- min(d2$dens)
            d2$dens.scaled <- (d2$dens - mindens)/maxdens  
       
            ## Tile approach
            d2$alpha.factor <- d2$dens.scaled^shade.alpha
            gg.tiles <-  list(geom_tile(data=d2, aes(x=x, y=y, fill=dens.scaled, alpha=alpha.factor)), scale_fill_gradientn("dens.scaled", colours=palette), scale_alpha_continuous(range=c(0.001, 1)))
        }
        if (quantize == "SD") {
            ## Polygon approach
           
            SDs <- melt(CI.boot[, c("x", paste0("SD", 1:7))], id.vars="x")
            count <- 0
            d3 <- data.frame()
            col <- c(1,2,3,3,2,1)
            for (i in 1:6) {
                seg1 <- SDs[SDs$variable == paste0("SD", i), ]
                seg2 <- SDs[SDs$variable == paste0("SD", i+1), ]
                seg <- rbind(seg1, seg2[nrow(seg2):1, ])
                seg$group <- count
                seg$col <- col[i]
                count <- count + 1
                d3 <- rbind(d3, seg)
            }
           
            gg.poly <-  list(geom_polygon(data=d3, aes(x=x, y=value, color=NULL, fill=col, group=group)), scale_fill_gradientn("dens.scaled", colours=palette, values=seq(-1, 3, 1)))
        }
    }
   
    print("Build ggplot figure ...")
    flush.console()
   
   
    if (spag==TRUE) {
        gg.spag <-  geom_path(data=b2, aes(x=x, y=value, group=B), size=0.7, alpha=10/B, color=spag.color)
    }
   
    if (show.median == TRUE) {
        if (mweight == TRUE) {
            gg.median <-  geom_path(data=CI.boot, aes(x=x, y=M, alpha=w3^3), size=.6, linejoin="mitre", color=median.col)
        } else {
            gg.median <-  geom_path(data=CI.boot, aes(x=x, y=M), size = 0.6, linejoin="mitre", color=median.col)
        }
    }
   
    # Confidence limits
    if (show.CI == TRUE) {
        gg.CI1 <- geom_path(data=CI.boot, aes(x=x, y=UL), size=1, color="red")
        gg.CI2 <- geom_path(data=CI.boot, aes(x=x, y=LL), size=1, color="red")
    }
   
    # plain linear regression line
    if (show.lm==TRUE) {gg.lm <- geom_smooth(method="lm", color="darkgreen", se=FALSE)}
       
    gg.points <- geom_point(data=data, aes_string(x=IV, y=DV), size=1, shape=shape, fill="white", color="black")       
       
    if (title != "") {
        gg.title <- theme(title=title)
    }
   

    gg.elements <- list(gg.tiles, gg.poly, gg.spag, gg.median, gg.CI1, gg.CI2, gg.lm, gg.points, gg.title, theme(legend.position="none"))
   
    if (add == FALSE) {
        return(p0 + gg.elements)
    } else {
        return(gg.elements)
    }
}
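
The function definition above is all you need; a minimal end-to-end example with simulated data (not from the original post) could look like this:

# simulate heteroscedastic data and draw the default watercolor plot
set.seed(42)
df <- data.frame(x = rnorm(200))
df$y <- 0.6*df$x + rnorm(200, sd = 0.5 + abs(df$x)/2)
vwReg(y ~ x, df)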