I'm really new to Python (really, really new). This is the problem I need help solving:
I have a list of image URLs inside a .txt file. There are around 80,000 URLs in this file.
I need to scan all of these images with pytesseract
and save the results inside a .csv file.
I found a solution, but I want to optimize it.
This is what I'm doing now:
- I download all the images to my computer using PowerShell (yes, I'm on Windows).
- After they are all saved in a folder (and this takes a long time), I use the following code (which I found on the internet) to scan all the images and save the extracted text and the image file name to a .csv file:
from PIL import Image
from pytesseract import image_to_string
import pytesseract
import os
import csv


def main():
    # path for the folder for getting the raw images
    path = r"C:\Users\raphaelgomes\Desktop\Projeto OCR - Connect Marketplace\Imagens - Powershell"
    # link to the file in which output needs to be kept
    fullTempPath = r"C:\Users\raphaelgomes\Desktop\Projeto OCR - Connect Marketplace\OCR Checker Python\results\outputFile.csv"
    # iterating the images inside the folder
    for imageName in os.listdir(path):
        inputPath = os.path.join(path, imageName)
        img = Image.open(inputPath)
        # applying ocr using pytesseract for python
        pytesseract.pytesseract.tesseract_cmd = r"C:\Users\raphaelgomes\AppData\Local\Programs\Tesseract-OCR\tesseract.exe"
        text = pytesseract.image_to_string(img, lang="eng")
        # saving the text for appending it to the output file
        # a+ parameter used for creating the file if not present
        # and if present then append the text content
        file1 = open(fullTempPath, "a+")
        # providing the name of the image
        file1.write(imageName + "\n")
        # providing the content in the image
        file1.write(text + "\n")
        file1.close()
    # for printing the output file
    file2 = open(fullTempPath, 'r')
    print(file2.read())
    file2.close()


if __name__ == '__main__':
    main()
The point is: is there a way I can skip the download step I'm currently doing with PowerShell? I'd really appreciate any help with this. The idea is to do the whole process in Python: as I said, I already have all the file links inside a .txt file, so I need Python code that reads them one by one and saves the file name and the text extracted from each image to a .csv file.
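A minimal sketch of that idea, not a tested solution: it assumes the .txt file holds one URL per line, that the last path segment of each URL can serve as the image file name, and that each image is small enough to hold in memory. The function names (`fetch_bytes`, `ocr_bytes`, `file_name_from_url`, `urls_to_csv`) and the column headers are made up for illustration. Each image is read straight from the URL into memory and handed to pytesseract, so nothing is saved to disk first.

```python
import csv
import io
import posixpath
from urllib.parse import urlparse
from urllib.request import urlopen


def fetch_bytes(url, timeout=30):
    """Download one image and return its raw bytes (nothing is written to disk)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def ocr_bytes(data, lang="eng"):
    """Run Tesseract on image bytes held in memory."""
    # Imported here so the other helpers can be used without the OCR stack installed.
    from PIL import Image
    import pytesseract

    with Image.open(io.BytesIO(data)) as img:
        return pytesseract.image_to_string(img, lang=lang)


def file_name_from_url(url):
    """Assumption: the last path segment of the URL is the image file name."""
    return posixpath.basename(urlparse(url).path)


def urls_to_csv(url_file, csv_file, fetch=fetch_bytes, ocr=ocr_bytes):
    """Read URLs line by line, OCR each image, write (file_name, text) rows."""
    with open(url_file, encoding="utf-8") as src, \
            open(csv_file, "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        writer.writerow(["file_name", "text"])
        for line in src:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            try:
                text = ocr(fetch(url))
            except Exception as exc:
                # One bad URL or unreadable image should not stop the whole run;
                # the error is recorded in the text column instead.
                text = f"ERROR: {exc}"
            writer.writerow([file_name_from_url(url), text.strip()])
```

Calling `urls_to_csv("urls.txt", "output.csv")` (both file names are placeholders) would process the whole list. On Windows, `pytesseract.pytesseract.tesseract_cmd` would still need to point at `tesseract.exe` before the call, as in the code above. Using `csv.writer` rather than plain `write` calls keeps multi-line OCR text inside a single quoted cell.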
Thank you very much :)