I'm writing a crawler in Go that crawls the Trend Micro site for security reports (in PDF form) and shells out to wget to download each PDF it finds. I then use a library I found on GitHub (https://github.com/sajari/docconv) to convert the PDF to text and print that text to the terminal. The code works fine for the first two PDFs, but once it hits the third, it says the PDF was not found, even though it downloads just like the others and lands in the same directory. The program trace is below:
HTTP request sent, awaiting response... 200 OK
Length: 2268867 (2.2M) [application/pdf]
Saving to: ‘rpt-2016-annual-security-roundup-a-record-year-for-enterprise-threats.pdf’
rpt-2016-annual-sec 100%[===================>] 2.16M 1.02MB/s in 2.1s
2019-02-20 15:25:07 (1.02 MB/s) - ‘rpt-2016-annual-security-roundup-a-record-year-for-enterprise-threats.pdf’ saved [2268867/2268867]
2019/02/20 15:25:07 open rpt-2016-annual-security-roundup-a-record-year-for-enterprise-threats.pdf : no such file or directory
exit status 1
I diff'd the two strings (the filename in the error and the name of the file on disk) to see if they differ, but they don't. And again, the first two download just fine. Here's the affected part of the code:
pdfName := strings.Replace(link, "https://documents.trendmicro.com/assets/rpt/", "", -1)
fmt.Printf("PDF name: %s\n", pdfName)

existingPdf := 0
if len(pdfs) > 0 {
	for i := 0; i < len(pdfs); i++ {
		if pdfs[i] == pdfName {
			existingPdf = 1
		}
	}
}

if existingPdf == 0 {
	pdfs = append(pdfs, pdfName)
	command := "wget " + link
	cmd := exec.Command("/bin/bash", "-c", command, ">", "/dev/null", "2>&1")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	cmd.Run()
	res, err := docconv.ConvertPath(pdfName)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(res)
}
EDIT: If you'd like the full code and/or would like to run it, I've put it in pastebin (https://pastebin.com/5JkJDamV)