I’m working on a UiPath automation where I need to extract and merge data from two PDF files, PDF A and PDF B, into a single PDF file for each account number. The files are input in a non-fixed order and contain scattered data across various pages.
Details:
PDF A: Contains account numbers and related data.
PDF B: Contains account numbers, related data, and a program date.
The data related to account numbers are not consistently located on the same pages in either PDF.
The challenge is how to effectively consolidate this scattered data from both PDFs into one coherent PDF file per account number.
My logic: I use a DataTable to capture the program date from PDF B, intending to merge this with the data from PDF A for the corresponding account number.
However, I am encountering a significant challenge: the DataTable is nullified when the workflow processes the second PDF, which leads to data loss from the first processed PDF.
Is there an alternative approach or strategy that could better handle the data consolidation from these randomly paginated PDFs?
Use a For Each File in Folder activity to loop through the input files.
Then use a Read PDF Text activity and extract the account number. Next, use one more For Each File in Folder on the output folder where the merged files are saved, and check whether a PDF for the identified account number is already present. If it is, merge the new PDF onto the existing one. If it is not, just copy the file and rename it with the account number.
This ensures that the first file is copied and named after the account number, and the second file with the same account number is merged with the first one.
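The copy-or-merge decision described above can be sketched in a few lines. This is only an illustration of the logic in Python, not UiPath code; the function name `plan_actions` and the `<account>.pdf` target name are assumptions made for the example. In the real workflow, the "already seen" check would be a file-exists check on the output folder.

```python
def plan_actions(files):
    """Decide, per input file, whether to copy it or merge it into an existing output.

    files: list of (file_name, account_number) tuples in processing order.
    Returns a list of (action, file_name, target_name) tuples.
    """
    seen = set()  # stands in for "a PDF for this account already exists in the output folder"
    actions = []
    for file_name, account in files:
        target = f"{account}.pdf"  # hypothetical output name, one file per account
        if account in seen:
            actions.append(("merge", file_name, target))
        else:
            actions.append(("copy", file_name, target))
            seen.add(account)
    return actions
```

Because the decision depends only on whether an output for the account already exists, the order in which PDF A and PDF B arrive does not matter.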
The data in the two PDFs is different and we want all of it; the only common thing is the account number.
**Naming convention (sample):** FD_044554_Customerlnvoice_________Final Pricing May 2024_________x_
where 044554 is the account number and May 2024 is the program date.
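Since the output file name carries both values, they can be read back from it with a regular expression. The sketch below is in Python for illustration. The pattern assumes that every file name starts with `FD_` followed by the account digits and contains the literal text `Final Pricing` followed by a month name and a four-digit year. That is inferred from the single sample above and may need adjusting for other files.

```python
import re

# Assumed layout, based only on the sample name:
# FD_<account digits>_<anything>Final Pricing <Month> <YYYY><anything>
FILE_NAME_PATTERN = re.compile(
    r"^FD_(?P<account>\d+)_.*?Final Pricing (?P<program_date>[A-Za-z]+ \d{4})"
)

def parse_output_name(name):
    """Return (account_number, program_date) from a file name, or None if it does not match."""
    match = FILE_NAME_PATTERN.search(name)
    if match is None:
        return None
    return match.group("account"), match.group("program_date")

sample = "FD_044554_Customerlnvoice_________Final Pricing May 2024_________x_"
print(parse_output_name(sample))
```

The same pattern should also work in a .NET regex, since named groups and lazy quantifiers are supported there as well.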
1.) Both my PDFs are run one by one (the order is not fixed) in the same sequence (using regex).
2.) For example, once PDF B is in the sequence, the AccountNumber and the Program date are extracted (using regex) and the page numbers are counted (using a counter).
3.) All this data is added to a DataTable DT. DT has multiple entries for the same account number, since the data in the PDF is scattered and not present on a single page. If you can suggest a better way for step 3, please do.
4.) Once DT is ready, I run a For Each Row loop over DT.
For every row in DT where the account number is the same, I take the ProgramDate, the PageNumber and that common AccountNumber and try to create a PDF (using Extract PDF).
The range of the Extract PDF is given as, for example, currentrow(pagenumber).
5.) Once the complete DT has been iterated, a single PDF file is created per account number, named as in the sample naming convention above.
6.) The PDF is moved to the completed folder.
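One possible alternative for step 3 is to group the page numbers by account number up front, so that each account ends up with one entry holding all of its pages instead of many separate rows. The Python sketch below only illustrates that idea; the names `group_pages` and `to_range_string` are made up for the example, and the comma-separated range format is an assumption about what the extract step accepts.

```python
from collections import defaultdict

def group_pages(rows):
    """Group scattered rows by account number.

    rows: iterable of (account_number, page_number, program_date_or_None).
    Returns (pages_by_account, program_date_by_account).
    """
    pages = defaultdict(list)
    program_dates = {}
    for account, page, program_date in rows:
        pages[account].append(page)
        if program_date:
            # Keep the first program date found for the account.
            program_dates.setdefault(account, program_date)
    return dict(pages), program_dates

def to_range_string(page_numbers):
    """Build a sorted, de-duplicated, comma-separated page list such as "1,4"."""
    return ",".join(str(p) for p in sorted(set(page_numbers)))
```

With the pages grouped this way, the extraction can run once per account with the full page list, rather than once per row.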
Now, when the second PDF starts running, the DataTable is reinitialized at the start of the sequence, because this workflow is invoked again from the main file, and all the above steps are repeated.
But I do not have the program date for the account number now, because only PDF B has that data.
My doubts:
1.) How can I utilise the naming convention of the previously created PDF files to get the program date?
2.) Or else, how can I use the previously created DataTable to extract the program date for that account number?
Note that the program date is only available in one of the PDFs (PDF B).
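For the second doubt, one option is to create the lookup once in the main workflow and pass it into the invoked workflow, so that it is not reinitialized on each run. The Python sketch below models that idea with a dictionary owned by the caller. The function name and the second account number are made-up sample values, and mapping this to an in/out argument of the invoked workflow is an assumption about how it would be wired in UiPath.

```python
def process_pdf(rows, program_dates):
    """Model of one invocation of the per-PDF workflow.

    rows: list of (account_number, program_date_or_None) found in this PDF.
    program_dates: dict created by the caller and updated in place,
                   so it survives across invocations.
    Returns the program date known so far for each account in this PDF.
    """
    for account, program_date in rows:
        if program_date is not None:
            program_dates[account] = program_date
    return {account: program_dates.get(account) for account, _ in rows}

# The caller owns the lookup, so processing PDF A after PDF B still sees the dates.
program_dates = {}
pdf_b = [("044554", "May 2024"), ("077001", "June 2024")]  # sample values
pdf_a = [("044554", None), ("077001", None)]
result_b = process_pdf(pdf_b, program_dates)
result_a = process_pdf(pdf_a, program_dates)
```

Since the input order is not fixed, PDF A may be processed before PDF B, in which case its dates are still unknown at that point. Deferring the creation of the output files until both PDFs have been scanned avoids that problem.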