Hi guys, I am building an ML Classifier dataset by adding a folder of txt files for each document type, so that the classifier can learn to tell the document types apart.
I have been creating these txt files with the PDF to txt activity. Its output is a text file in a loosely structured format that keeps the spacing between lines similar to the native PDF.
However, some of my files are not native PDFs, so I need to convert them using OCR. The output of the OCR activity is not structured; it is one long string. If I add these files, will the lack of structure affect the classifier? Does the layout of the text within the txt file make any difference, or does the classifier only look at the words?
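If structure does turn out to matter, one workaround I could try is flattening both kinds of file to the same form before adding them to the dataset. A minimal sketch in Python of what I mean (this is not part of my workflow, `normalize_whitespace` is just a name I made up, and the sample strings are invented for illustration):

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


# Invented sample resembling layout-preserving PDF to txt output
structured = "Invoice No:     12345\n\n   Date:   2021-01-01\n"

# Invented sample resembling OCR output: one long string
flat = "Invoice No: 12345 Date: 2021-01-01"

# After normalization the two samples contain the same words in the same order
print(normalize_whitespace(structured) == normalize_whitespace(flat))
```

This only removes layout differences. It assumes the OCR text has the same words in the same order, which may not hold for real scans.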
Thanks in advance!