Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
menu search
person
Welcome To Ask or Share your Answers For Others

Categories

I'm trying to extract data from tables inside some pdf reports.

I've seen some examples using either pdftools and similar packages I was successful in getting the text, however, I just want to extract the tables.

Is there a way to use R to recognize and extract only tables?

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
606 views
Welcome To Ask or Share your Answers For Others

1 Answer

Awsome question, I wondered about the same thing recently, thanks!

I did it, with tabulizer ‘0.2.2’ as @hrbrmstr also suggests. If you are using R > 3.5.x, I'm providing following solution. Install the three packages in specific order:

# install.packages("rJava")
# library(rJava) # load and attach 'rJava' now
# install.packages("devtools")
# devtools::install_github("ropensci/tabulizer", args="--no-multiarch")

Update: After just testing the approach again, it looks like it's enough to just do install.packages("tabulizer") now. rJava will be installed automatically as a dependency.

Now you are ready to extract tables from your PDF reports.

library(tabulizer)

## load report
l <- "https://sedl.org/afterschool/toolkits/science/pdf/ast_sci_data_tables_sample.pdf" 
m <- extract_tables(l, encoding="UTF-8")[[2]]  ## comes as a character matrix
## Note: peep into `?extract_tables` for further specs (page, location etc.)!

## use first row as column names
dat <- setnames(type.convert(as.data.frame(m[-1, ]), as.is=TRUE), m[1, ])
## example-specific date conversion
dat$Date <- as.POSIXlt(dat$Date, format="%m/%d/%y")
dat <- within(dat, Date$year <- ifelse(Date$year > 120, Date$year - 100, Date$year))

dat ## voilà
#    Speed (mph)          Driver                        Car    Engine       Date
# 1      407.447 Craig Breedlove          Spirit of America    GE J47 1963-08-05
# 2      413.199       Tom Green           Wingfoot Express    WE J46 1964-10-02
# 3      434.220      Art Arfons              Green Monster    GE J79 1964-10-05
# 4      468.719 Craig Breedlove          Spirit of America    GE J79 1964-10-13
# 5      526.277 Craig Breedlove          Spirit of America    GE J79 1965-10-15
# 6      536.712      Art Arfons              Green Monster    GE J79 1965-10-27
# 7      555.127 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-02
# 8      576.553      Art Arfons              Green Monster    GE J79 1965-11-07
# 9      600.601 Craig Breedlove Spirit of America, Sonic 1    GE J79 1965-11-15
# 10     622.407   Gary Gabelich                 Blue Flame    Rocket 1970-10-23
# 11     633.468   Richard Noble                   Thrust 2 RR RG 146 1983-10-04
# 12     763.035      Andy Green                 Thrust SSC   RR Spey 1997-10-15

Hope it works for you.

Limitations: Of course, the table in this example is quite simple and maybe you have to mess around with gsub and this kind of stuff.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
thumb_up_alt 0 like thumb_down_alt 0 dislike
Welcome to ShenZhenJia Knowledge Sharing Community for programmer and developer-Open, Learning and Share
...