How to extract all the data from a PDF


#1

Hello everyone,

I’m new to RPA, so sorry if the question sounds ridiculous.

My problem is that I want to extract all the data, along with its meaning, from a very long PDF (almost 200 pages) into a data table. I only know how to extract it manually with Data Scraping.

Thank you for your help


#2

On what basis do you want to convert your PDF into rows and columns? For instance, should all the paragraphs on a page form one row?


#3

I want to extract all the data tables (and no paragraphs) present in the PDF file into a single Excel sheet with multiple rows and columns.
But the problem is that some of the data tables can’t be read by the Data Scraping tool.


#4

Hello everyone,

I need your help again.

I want to extract all the data from numerous PDF files, and then reorganize it into one single database.

My loop works pretty well, but I don’t understand why I can’t extract the data.
Inside my loop I’m using: Read PDF Text -> Extract Structured Data -> Excel Application Scope -> Write Range.

Here is my workflow if you want to see it: Demo.xaml (9.5 KB)

Regards
Antoine


#5

Please check this

ExtractMetadata
An XML string that enables you to define what data to extract from the indicated web page.

https://www.uipath.com/activities-guide/extract-structured-data


#6

Thanks for the help, but I already looked at this and didn’t find the answer.

The problem is that when I enter the PDF in the ExtractMetadata property of Extract Structured Data, I get this error: “Unable to cast object of type ‘Newtonsoft.Json.JValue’ to type ‘System.String’.”

I don’t know what to do.


#7

Hi,

I tried something else for my problem: I use Read PDF, then an Assign with Substring, in order to extract the data that I want from the string returned by Read PDF.
But it still doesn’t work.
Can someone look at my workflow?

Specific Data.xaml (13.7 KB)

Regards


#8

Because your Assign activity doesn’t use a new variable on the LHS.
You are assigning to the Read PDF output variable ExtractDT on the LHS, while the Substring you pass on the RHS is correct.
Please create a new variable of type String, use it on the LHS, and give it a try.


#9

The workflow looks fine to me now.

Does Substring give an error or an empty result?

Does Substring give the wrong text?

To troubleshoot, use Write Line to output the PDF text, then perform the substring manually on that text and see what you get.
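
A rough illustration of that check, written in Python rather than as a UiPath workflow. The function name show_context and the width parameter are made up for this sketch; the idea is only to print where a marker sits in the extracted text and what surrounds it, with line breaks made visible.

```python
def show_context(text, marker, width=20):
    """Return (position, surrounding text) for marker, or None if it is absent."""
    pos = text.find(marker)
    if pos == -1:
        # The marker is not in the extracted text at all.
        return None
    start = max(0, pos - width)
    end = min(len(text), pos + len(marker) + width)
    # repr() shows line breaks and tabs as \n, \r and \t instead of hiding them.
    return pos, repr(text[start:end])


if __name__ == "__main__":
    sample = "Bilan\nActif total 12 345 $\nPassif"
    print(show_context(sample, "Actif total"))
```

If the marker is reported as missing, or the position is not where you expected, the indexes passed to Substring are the likely culprit.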


#10

When you say LHS and RHS, do you mean the left and right boxes in the Assign activity?

If that’s it, it doesn’t work.
When I debug the automation (with or without a new variable), an exception of type ArgumentOutOfRangeException is raised.

Maybe it’s my first index that doesn’t work?


#11

yes.
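
For context: in .NET, Substring raises ArgumentOutOfRangeException when the start index is negative (IndexOf returns -1 when the text is not found) or when start index plus length runs past the end of the string. Below is a minimal sketch of the guarded logic, in Python rather than as a UiPath expression. The function name extract_between and the sample markers are assumptions for illustration, not taken from the attached workflow.

```python
def extract_between(text, start_marker, end_marker):
    """Return the text between two markers, or None if either marker is missing."""
    start = text.find(start_marker)
    if start == -1:
        # Start marker not found: in .NET this -1 would make Substring throw.
        return None
    start += len(start_marker)
    # Search for the end marker only after the start marker.
    end = text.find(end_marker, start)
    if end == -1:
        return None
    return text[start:end].strip()


if __name__ == "__main__":
    print(extract_between("Actif total 1 234 $ fin", "Actif total", "$"))
```

The same two checks (marker found, end after start) can be done in the workflow with an If activity before the Assign.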


#12

So, what first index should I use to find a figure that is preceded by “Active total …” and followed by $, knowing that the position and the structure around the figure are not the same in every PDF?

Regards
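
When the position changes from one PDF to the next, a fixed index is fragile; a regular expression that anchors on the label and the trailing $ is one alternative. The sketch below is in Python, and the pattern is an assumption about how the figure appears in the extracted text (label, some filler without digits, a number with optional spaces, dots or commas, then $). It has not been checked against the attached reports.

```python
import re

# Hypothetical pattern: label, filler, figure, then a dollar sign.
TOTAL_PATTERN = re.compile(
    r"Acti(?:f|ve)\s+total"  # label, French or English spelling
    r"[^\d$]*"               # filler that contains no digit and no dollar sign
    r"(\d[\d\s.,]*?)"        # the figure: digits with spaces, dots or commas
    r"\s*\$",                # optional whitespace, then the dollar sign
    re.IGNORECASE,
)


def find_total(text):
    """Return the first figure found after the label as a string, or None."""
    match = TOTAL_PATTERN.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(find_total("Actif total ........ 12 345,6 $"))
    print(find_total("Active total: 1,234.5$ and more"))
```

The figure comes back as text, so thousands separators and decimal commas still need to be normalised before it is converted to a number.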


#13

Hello, it’s me again.

I tried to extract the data with Extract Structured Data and to specify the value that I want in the selector, but it doesn’t work.

Here is my workflow: Data extract.xaml (13.5 KB)

And here are some PDFs from which I want to extract the value of “Actif total” for every year: ra2004_rapport_annuel_fr.pdf (1.2 MB)
ra2014_rapport_annuel_fr.pdf (2.4 MB)
ra1989_rapport_annuel_fr.pdf (3.5 MB)

Regards


#14

Does this get checkboxes as well? I am unable to get the radio button information.