Inspired by this blog post from theBioBucket, I created a script that parses all PDF files in a directory. Due to its reliance on the Terminal it is Mac-specific, but modifications for other systems shouldn't be too hard (as a starting point for Windows, see BioBucket's script).
First, you have to install the command-line tool pdftotext (a binary can be found on Carsten Blüm's website).
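A quick way to check from within R whether pdftotext is actually on your PATH (the "command not found" error in the comments below means it is not) is base R's Sys.which():
[cc lang="rsplus" escaped="true"]
# returns the full path of the pdftotext binary, or "" if it cannot be found
Sys.which("pdftotext")
[/cc]
Then, run the following script within a directory containing PDFs: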
[cc lang="rsplus" escaped="true"]
# helper function: count the words in a string, separated by space, tab, newline, comma, or period
nwords <- function(x) {
    res <- strsplit(as.character(x), "[ \t\n,\\.]+")
    res <- lapply(res, length)
    unlist(res)
}

# sanitize file name for terminal usage (i.e., escape spaces and other special characters)
sanitize <- function(str) {
    gsub('([#$%&~_\\^\\\\{}\\s\\(\\)])', '\\\\\\1', str, perl = TRUE)
}
# get a list of all PDF files in the current directory
fi <- list.files()
fi2 <- fi[grepl("\\.pdf$", fi)]
## Parse the files and do something with their content ...
res <- data.frame() # keeps a record of the calculations
for (f in fi2) {
    print(paste("Parsing", f))

    # convert the pdf into a txt file of the same name
    f2 <- sanitize(f)
    system(paste0("pdftotext ", f2), wait = TRUE)

    # read content of the converted txt file
    filetxt <- sub("\\.pdf$", ".txt", f)
    text <- readLines(filetxt, warn = FALSE)

    # adjust the encoding of the text - you have to know it
    Encoding(text) <- "latin1"

    # do something with the content - here: get word and character counts per pdf
    text2 <- paste(text, collapse = "\n") # collapse lines into one long string
    res <- rbind(res, data.frame(
        filename   = f,
        wc         = nwords(text2),
        cs         = nchar(text2),
        cs.nospace = nchar(gsub("\\s", "", text2))
    ))

    # remove the converted text file
    file.remove(filetxt)
}
print(res)
[/cc]
... gives the following result (wc = word count, cs = character count, cs.nospace = character count without spaces):
[cc lang="rsplus" escaped="true"]
> print(res)
filename wc cs cs.nospace
1 Applied_Linear_Regression.pdf 33697 186280 154404
2 Baron-rpsych.pdf 22665 128440 105024
3 bootstrapping regressions.pdf 6309 34042 27694
4 Ch_multidimensional_scaling.pdf 718 4632 3908
5 corrgram.pdf 6645 40726 33965
6 eRm – Extended Rach Modeling (Paper).pdf 11354 65273 53578
7 eRm (Folien).pdf 371 1407 886
8 Faraway 2002 – Practical Regression and ANOVA using R.pdf 68777 380902 310037
9 Farnsworth-EconometricsInR.pdf 20482 125207 101157
10 ggplot_book.pdf 10681 65388 53551
11 ggplot2-lattice.pdf 18067 118591 93737
12 lavaan_usersguide_0.3-1.pdf 12608 64232 52962
13 lme4 – Bootstrapping.pdf 2065 11739 9515
14 Mclust.pdf 18191 92180 70848
15 multcomp.pdf 5852 38769 32344
16 OpenMxUserGuide.pdf 37320 233817 197571
[/cc]
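To see what the two helper functions do in isolation, here are two quick made-up examples (output shown as comments; run after sourcing the functions above):
[cc lang="rsplus" escaped="true"]
# nwords() splits on spaces, tabs, newlines, commas, and periods, then counts the pieces
nwords("Inspired by this blog post, I created a script.")
# [1] 9

# sanitize() backslash-escapes characters that are special to the shell
sanitize("bootstrapping regressions.pdf")
# [1] "bootstrapping\\ regressions.pdf"
[/cc]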
Thanks for creating this! It is most excellent.
sh: pdftotext: command not found
I am getting the above error while executing the command. Below is the code I am running:
# helper function: get number of words in a string, separated by tab, space, return, or point.
nwords <- function(x){
    res <- strsplit(as.character(x), "[ \t\n,\\.]+")
    res <- lapply(res, length)
    unlist(res)
}
# sanitize file name for terminal usage (i.e., escape spaces)
sanitize <- function(str) {
    gsub('([#$%&~_\\^\\\\{}\\s\\(\\)])', '\\\\\\1', str, perl = TRUE)
}
# get a list of all files in the current directory
fi <- list.files("Sample")
fi2 <- fi[grepl(".pdf", fi)]
fi2
## Parse files and do something with it ...
res <- data.frame() # keeps records of the calculations
for (f in fi2) {
    print(paste("Parsing", f))
    f2 <- sanitize(f)
    system(paste0("pdftotext ", f2), wait = TRUE)
    f2
    # read content of converted txt file
    filetxt <- sub(".pdf", ".txt", f)
    text <- readLines(filetxt, warn=FALSE)
    # adjust encoding of text - you have to know it
    Encoding(text) <- "latin1"
    # Do something with the content - here: get word and character count of all pdfs in the current directory
    text2 <- paste(text, collapse="\n") # collapse lines into one long string
    res <- rbind(res, data.frame(filename=f, wc=nwords(text2), cs=nchar(text2), cs.nospace=nchar(gsub("\\s", "", text2))))
    # remove converted text file
    file.remove(filetxt)
}
print(res)
You need to install the pdftotext command-line tool.
But you could also take a look at the pdftools package: http://ropensci.org/blog/2016/03/01/pdftools-and-jeroen
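For what it's worth, here is a minimal sketch of the same counting loop using pdftools (assuming the package is installed and reusing the nwords() helper from the post; pdf_text() returns one character string per page, so no external binary and no temporary .txt files are needed):
[cc lang="rsplus" escaped="true"]
library(pdftools)

fi2 <- list.files(pattern = "\\.pdf$")
res <- data.frame()
for (f in fi2) {
    pages <- pdf_text(f)                   # character vector, one element per page
    text2 <- paste(pages, collapse = "\n") # collapse pages into one long string
    res <- rbind(res, data.frame(filename = f, wc = nwords(text2),
                                 cs = nchar(text2),
                                 cs.nospace = nchar(gsub("\\s", "", text2))))
}
print(res)
[/cc]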