I’m working on a UiPath automation where I need to extract and merge data from two PDF files, PDF A and PDF B, into a single PDF file for each account number. The files are input in a non-fixed order and contain scattered data across various pages.
Details:
PDF A: Contains account numbers and related data.
PDF B: Contains account numbers, related data, and a program date.
The data related to account numbers are not consistently located on the same pages in either PDF.
the challenge is how to effectively consolidate this scattered data from both PDFs into one coherent PDF file per account number.
my logic: * I use a DataTable to capture the program date from PDF B, intending to merge this with the data from PDF A for the corresponding account number.
However, I am encountering a significant challenge: the DataTable is nullified when the workflow processes the second PDF, which leads to data loss from the first processed PDF.
Is there an alternative approach or strategy that could better handle the data consolidation from these randomly paginated PDFs?
data on both the pdfs are different and we want all of that data, only common thing is account number.
**naming convention : sample: ** FD_044554_Customerlnvoice_________Final Pricing May 2024_________x_
where 0445544 is the account number and May 2024 is the program date.
1.) both my pdfs are run one by one, (order is not fixed)in the same sequence((using regex)).
2.)for example: once the pdf(B) is there in the sequence → AccountNumber is extracted, Program date is extracted(using regex)and page numbers are counted(using counter).
3.)all this data is added to a datatable DT. (DT has multiple entries for same account number since the data in the pdf is scattered and not present in single page) if you can suggest a better way here.for step 3.
4.) once i have my DT ready, i am running a for each row in DT loop.
for every row in DT where account number is same , I extract the Programdate, PageNumber and that common Accountnumber and try to create a pdf (using Extract pdf).
the range of the Extract pdf is given as for example(currentrow(pagenumber))
5.) once the complete DT is iterated a single pdf file is created per account number as above naming convention. in sample
6.) the pdf is moved to completed number
→
now when the second pdf starts running, the datatable is reintialized at the start of sequence because this workflow is invoked again, from the main file
and
all the above steps are repeated
but i donot have the program date for the account number now, because only pdf B has that data.
My doubts:
1.)so how can i utilise the previosuly creatted pdf files naming convention where i can get the program date ??
2.)or else how can i use the previosuly created datatable to extract program date for that account number?
but program date is only available in one of the pdf-(PDF B).
data on both the pdfs are different and we want all of that data, only common thing is account number.
**naming convention : sample: ** FD_044554_Customerlnvoice_________Final Pricing May 2024_________x_
where 0445544 is the account number and May 2024 is the program date.
1.) both my pdfs are run one by one, (order is not fixed)in the same sequence((using regex)).
2.)for example: once the pdf(B) is there in the sequence → AccountNumber is extracted, Program date is extracted(using regex)and page numbers are counted(using counter).
3.)all this data is added to a datatable DT. (DT has multiple entries for same account number since the data in the pdf is scattered and not present in single page) if you can suggest a better way here.for step 3.
4.) once i have my DT ready, i am running a for each row in DT loop.
for every row in DT where account number is same , I extract the Programdate, PageNumber and that common Accountnumber and try to create a pdf (using Extract pdf).
the range of the Extract pdf is given as for example(currentrow(pagenumber))
5.) once the complete DT is iterated a single pdf file is created per account number as above naming convention. in sample
6.) the pdf is moved to completed number
→
now when the second pdf starts running, the datatable is reintialized at the start of sequence because this workflow is invoked again, from the main file
and
all the above steps are repeated
but i donot have the program date for the account number now, because only pdf B has that data.
My doubts:
1.)so how can i utilise the previosuly creatted pdf files naming convention where i can get the program date ??
2.)or else how can i use the previosuly created datatable to extract program date for that account number?
but program date is only available in one of the pdf-(PDF B).
There are a few things you can do to improve your workflow:
Write the datatable into a CSV/Excel file to save it. Process goes like this:
1.1 At the end of a workflow, use Write CSV or Write Range to save datatable to a file
1.2 At the beginning of the next workflow, read the CSV/Excel to retrieve the datatable with saved data
1.3 Proceed with filling out the datatable with more data
1.4 Back to step 1 - write datatable to CSV/Excel to save it
1.5 I would write these to a new file each time in a temp folder to use for debugging
1.6 Add a short workflow to cleanup temp folder after each run so it doesn’t take up excessive storage space
Multiple entries for same account number is potentially going to add confusion. To avoid this issue, each time you’re about to add data to datatable:
2.1 Use lookup datatable first, find RowIndex of existing account number
2.2 IF RowIndex is -1:
2.2.1 it means there is no existing account number
2.2.2 add data to a new row
2.3 IF RowIndex is any other number (eg. 2)
2.3.1 it means there is existing data
2.3.2 update corresponding row with new data
2.3.3 consider if you want to overwrite data in a cell or append with new data