I am downloading large batches of PDFs from parliaments. I scraped the PDF URLs and am now trying to download them.
To do this, I set up a Debian instance on a university cloud.
It worked fine for most of them, but for 4 parliaments I get a cookie-consent page instead of the document. The result is an HTML page saved with a .pdf file extension, which mainly contains the question of whether I accept cookies.
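A quick way to spot such mislabelled files in a large batch is to look at the first bytes of each download. This is only a minimal sketch: `is_pdf` is a hypothetical helper name, and it assumes that a genuine PDF starts with the bytes `%PDF-`, which is the usual case but may not hold for every file.

```r
# Hypothetical helper: TRUE if the file begins with the PDF header "%PDF-".
# An HTML cookie-consent page saved as .pdf would return FALSE.
is_pdf <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  header <- readBin(con, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Usage (assuming the downloads are in the working directory):
# files <- list.files(pattern = "\\.pdf$")
# files[!vapply(files, is_pdf, logical(1))]
```

Files flagged this way could then be re-downloaded once the cookie problem is solved.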
This error does not happen on either Ubuntu or Windows 10. I assume it works there because I accepted the cookies in the browser on those machines. I switched my code to RCurl and exported the cookies as txt files, based on two answers I found on Stack Overflow.
I used the following example. As mentioned, it works on Windows and Ubuntu, but it also works there without the cookie file.
library(RCurl)
# the PDF to download
appURL <- "http://www.dokumentation.landtag-mv.de/parldok/dokument/44970/eu_ratspraesidentschaft.pdf"
# curl handle that reads cookies from the exported file and follows redirects
curl <- getCurlHandle()
curlSetOpt(cookiefile = "cookiesmv.txt", followLocation = TRUE, curl = curl)
# download the raw bytes and write them to disk
pdfData <- getBinaryURL(appURL, curl = curl)
writeBin(pdfData, "test2.pdf")
To reproduce, here is the cookie file (cookiesmv.txt):
www.landtag-mv.de FALSE / FALSE 1641900313 cookieconsent_status dismiss
www.landtag-mv.de FALSE / FALSE 1641900313 dp_cookieconsent_status {"dp--cookie-statistics":true,"dp--cookie-marketing":true}
www.dokumentation.landtag-mv.de FALSE / FALSE 1641907216 cookieconsent_dismissed yes
www.dokumentation.landtag-mv.de FALSE / FALSE 0 ASP.NET_SessionId ejtlcpjr0saw40ahceu4akb1
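As far as I understand, libcurl expects cookie files in the Netscape format, with one cookie per line and seven tab-separated fields (domain, subdomain flag, path, secure flag, expiry, name, value). A file that lost its tabs or line breaks during export or copying may be ignored. The sketch below checks the file layout only; `check_cookie_file` is a hypothetical helper, not part of RCurl.

```r
# Hypothetical helper: report, for each cookie line, whether it has the
# seven tab-separated fields of the Netscape cookie file format.
check_cookie_file <- function(path) {
  lines <- readLines(path, warn = FALSE)
  # Skip blank lines and comments; lines starting with "#HttpOnly_" are
  # kept because curl treats them as cookies, not comments.
  is_comment <- startsWith(lines, "#") & !startsWith(lines, "#HttpOnly_")
  lines <- lines[nzchar(trimws(lines)) & !is_comment]
  # Count the tab characters on each remaining line.
  n_tabs <- lengths(regmatches(lines, gregexpr("\t", lines, fixed = TRUE)))
  data.frame(
    domain = sub("[ \t].*$", "", lines),
    n_fields = n_tabs + 1L,
    well_formed = n_tabs == 6L,
    stringsAsFactors = FALSE
  )
}

# Usage (assuming the file name from the example above):
# check_cookie_file("cookiesmv.txt")
```

Rows with `well_formed == FALSE` would be the first thing to fix before looking any further into RCurl itself.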
Maybe somebody has insights into where RCurl draws its cookies from...
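One way to see which cookies are actually sent and received would be to turn on verbose output and let curl write its cookies to a separate jar file. This is a rough sketch, not a confirmed fix: `fetch_with_cookie_debug` and the jar file name are made up for illustration, and the assumption that the jar is only written once the handle is released is based on my reading of how RCurl manages handles.

```r
# Hypothetical helper: download one URL with verbose output and a cookie jar,
# so the request headers and the cookies set by the server can be inspected.
fetch_with_cookie_debug <- function(url,
                                    cookie_in = "cookiesmv.txt",
                                    cookie_out = "cookies_received.txt") {
  handle <- RCurl::getCurlHandle(
    cookiefile = cookie_in,    # cookies to send (Netscape format)
    cookiejar = cookie_out,    # where curl stores cookies it ends up with
    followlocation = TRUE,     # follow redirects to a consent page, if any
    verbose = TRUE             # print request and response headers
  )
  data <- RCurl::getBinaryURL(url, curl = handle)
  info <- RCurl::getCurlInfo(handle)
  message("Final URL: ", info$effective.url)
  message("Status:    ", info$response.code)
  # Assumption: the jar is flushed when the handle is garbage collected.
  rm(handle)
  invisible(gc())
  data
}

# Usage (network access required):
# pdfData <- fetch_with_cookie_debug(appURL)
```

Comparing the `Cookie:` request header in the verbose output on Debian with the one on Ubuntu or Windows should show whether the cookie file is being read at all.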
Best regards and thank you in advance. I hope I gave all the necessary info!