Help with REGEX to extract Table from PDF

Friends, good day. I need to extract data from 1000 PDFs, with 2 or 3 pages each, in the following format.

s1.pdf (48.6 KB)


I need all the highlighted data (yellow and red), but I am having trouble getting the data highlighted in red correctly.

Can someone please help me with the expression I need for “Intervención Quirurgica Programada”; “Intervención Quirurgica Realizada”; “Código”; “Cant.”; “Descripción”.

I need either of the following two formats.

The file I am working on is the following:
extraccion_pdf.xaml (23.2 KB)

Thanks!!

Hello @carlos_sabino, open the PDF in Chrome and extract the table using data extraction. To open PDFs in Chrome, enable "Allow access to file URLs" for the Chrome extension.


Hi @carlos_sabino

You can use these split actions instead of regex

For ‘Intervención Quirurgica Programada:’

str.Split({"Intervención Quirurgica"+Environment.NewLine + "Programada:"},StringSplitOptions.None)(1).Split({"Intervencion Quirurgica"+Environment.NewLine+"Realizada:"},StringSplitOptions.None)(0).Trim.Replace(Environment.NewLine,"")

For ‘Intervencion Quirurgica Realizada:’

str.Split({"Intervencion Quirurgica"+Environment.NewLine+"Realizada:"},StringSplitOptions.None)(1).Split({"Cirugia Realizada:"},StringSplitOptions.None)(0).Trim.Replace(Environment.NewLine,"")
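Since the original question asked for a regex, here is a rough sketch of the same "text between two labels" idea, written in Python only so it can be run and checked outside Studio. The sample text, the helper name `between` and the three label patterns are assumptions for illustration, not taken from the real PDFs. The patterns accept both accented and unaccented spellings and any whitespace (including line breaks) between the words, which should make them less sensitive to small differences in the extracted text. Unlike the split expressions above, this version joins wrapped lines with a single space.

```python
import re

# Hypothetical sample of the extracted PDF text; the real layout may differ.
sample = (
    "Intervención Quirurgica\nProgramada: TEXTO PROGRAMADO\nSEGUNDA LINEA\n"
    "Intervencion Quirurgica\nRealizada: TEXTO REALIZADO\n"
    "Cirugia Realizada: SI\n"
)

# Label patterns: optional accents, any whitespace between the words.
LABEL_PROG = r"Intervenci[oó]n\s+Quir[uú]rgica\s+Programada:"
LABEL_REAL = r"Intervenci[oó]n\s+Quir[uú]rgica\s+Realizada:"
LABEL_END = r"Cirug[ií]a\s+Realizada:"


def between(text, start, end):
    """Return the text between two label patterns, or None if either is missing."""
    match = re.search(start + r"(.*?)" + end, text, flags=re.DOTALL | re.IGNORECASE)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(match.group(1).split())


programada = between(sample, LABEL_PROG, LABEL_REAL)
realizada = between(sample, LABEL_REAL, LABEL_END)
print(programada)
print(realizada)
```

The same patterns should carry over to a .NET regex with the Singleline and IgnoreCase options, but I have not tested that against the real files.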

For each line item: use the Read PDF Text activity with Preserve Formatting enabled. Below is the XAML for this part.

I first split the main string to get only the rows needed.

Then I looped through each row that matches the row regex. I also identified the rows that contain only the description column and appended them to the previous row.
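The row logic described above can be sketched roughly like this, in Python so it is easy to run on its own. The row pattern (numeric code, numeric quantity, then description), the sample lines and the name `parse_rows` are assumptions for illustration; the regex in the attached XAML may differ.

```python
import re

# Assumed row shape: numeric code, numeric quantity, then the description.
ROW = re.compile(r"^\s*(\d+)\s+(\d+)\s+(\S.*?)\s*$")


def parse_rows(lines):
    """Build one dict per table row, merging description-only lines upward."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        m = ROW.match(line)
        if m:
            rows.append(
                {"codigo": m.group(1), "cant": m.group(2), "descripcion": m.group(3)}
            )
        elif rows:
            # No code/quantity on this line: treat it as a wrapped description.
            rows[-1]["descripcion"] += " " + line.strip()
    return rows


# Hypothetical lines as they might look with formatting preserved.
lines = [
    "1001   2   ITEM A PRIMERA LINEA",
    "           SEGUNDA LINEA",
    "1002   1   ITEM B",
]
rows = parse_rows(lines)
for row in rows:
    print(row)
```

One caveat with this sketch: a wrapped description line that happens to start with two numbers would be read as a new row, so the pattern may need tightening for the real data.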


output:

xaml
Sequence2.xaml (15.5 KB)

Hope you can club all of these together as needed

cheers

Thanks @Anil_G, the file you shared works correctly for “code”, “quantity” and “description”. However, when I uploaded the sample file I inadvertently deleted the surgeon’s data, and now that I have tested with the original file, it only extracts up to the place where the doctor’s name is, which always varies.


P1.pdf (41.4 KB)
P2.pdf (41.4 KB)

Unfortunately I am a beginner with UiPath, and I don’t know how to join the new DataTable with the information that I had been able to capture with regex (“Convenio”; “Internación”; “Edad”; “Sexo”; “Inicio Cirugia”; “Fin Cirugia”). For “Intervención Quirurgica Programada” and “Intervención Quirurgica Realizada”, do I have to create a new Assign with the code that you gave me?

Thank you very much, and sorry for the insistence, I am just starting and this seems very advanced to me, but unfortunately I need it.

Hello friends, I’m still stuck at the same point. The rest of the information can be collected, but I can’t join the process and the strings that @Anil_G uploaded (to extract “codigo”, “cantidad” and “descripcion”) with the process that I built.


extraccion_pdf (2).xaml (21.9 KB)

I watched a lot of videos and read a lot, but I couldn’t find a solution.

I would really appreciate it if someone could help me.

Cheers!

Hi @carlos_sabino

I was a little held up yesterday and couldn’t respond. I modified your XAML to fit the output that I got for the other columns as well. Please check and let me know if you face any issues.

BlankProcess5 - Copy (4).zip (5.9 KB)

Hope this helps

cheers

Dear @Anil_G, first of all, thank you very much for the answer and the time spent. The process works perfectly, and the output comes out as you had arranged it. I had built very similar processes, but they always gave me errors because of my lack of experience with the code.


The only thing that didn’t work was the “programado” and “realizado” strings. The error shown is: Scheduled:_Assign: Index was outside the bounds of the array.

Thank you!!

Hi @carlos_sabino

You are welcome!!

In the Locals panel, please check the extpdf value and see whether the strings I used in the Split calls match it. They may have changed a bit; if so, change them accordingly.
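For context, "Index was outside the bounds of the array" on a split like this usually means the separator was not found: Split then returns a single element, so asking for element (1) fails. A small accent difference in the label, or a line break that is not exactly Environment.NewLine, is enough to cause that. Below is a rough Python sketch of a guarded version that reports which marker is missing instead of failing; the helper name `text_between` and the sample text are assumptions for illustration.

```python
def text_between(text, start_marker, end_marker):
    """Return (value, error); error names the marker that was not found."""
    if start_marker not in text:
        return None, "start marker not found: " + repr(start_marker)
    after = text.split(start_marker, 1)[1]
    if end_marker not in after:
        return None, "end marker not found: " + repr(end_marker)
    value = after.split(end_marker, 1)[0]
    return value.strip().replace("\n", ""), None


# Hypothetical extracted text in which the labels carry no accent.
extpdf = (
    "Intervencion Quirurgica\nProgramada: TEXTO\n"
    "Intervencion Quirurgica\nRealizada: OTRO\n"
)

# Accented start marker: not present in the text, so an error is reported.
value_bad, error_bad = text_between(
    extpdf,
    "Intervención Quirurgica\nProgramada:",
    "Intervencion Quirurgica\nRealizada:",
)

# Marker copied exactly from the text: the value is returned.
value_ok, error_ok = text_between(
    extpdf,
    "Intervencion Quirurgica\nProgramada:",
    "Intervencion Quirurgica\nRealizada:",
)
print(error_bad)
print(value_ok)
```

In Studio, the equivalent check would be a Contains test on each marker before the Split, so the workflow can log the missing label rather than throw.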

cheers