How to extract data from an HTML file

I have an HTML file (think of it as an invoice file) from which I need to extract a lot of information, such as company code, company name, payment type, date, etc. I am able to extract the data, but it takes a lot of time for a single file. Can you please suggest the best way to extract the data so that the extraction is faster?



How are you currently extracting from the HTML file? Using string operations, OCR, or Get Text?

Get Text / data scraping using the Chrome browser.


Convert/print the HTML page to PDF and read from the PDF. This gives comparatively better results.


Converting it to PDF makes it more difficult to extract the data. Any other workaround?

Use get relative text


Relative text will make it even slower, since you have to extract a lot of data; it is an invoice file of an organisation.

Can you provide more context as to what exactly you want to extract?

HTML data extraction should be fairly easy and fast. Are you using selectors or string operations, or are you importing the file and then extracting from it?

Hi, please see the screen below: the parts in green are what we need to extract. FYI, it is fairly structured data (if you look closely). Assume this file contains 150 such entries.

Sorry, I have to remove sensitive info. :slight_smile:


One approach could be to download the web page first, then read it as a string and use string manipulation to get the desired information.

Complicated, but it would definitely be faster.
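As a rough illustration of the string-manipulation idea, here is a minimal Python sketch. The invoice markup is not shown in this thread, so SAMPLE_HTML, the "Label: value" layout and the helper names html_to_text and extract_fields are all assumptions; only the field names come from the original question.

```python
import re
from html import unescape

# Hypothetical invoice fragment; the real markup is not shown in the thread.
SAMPLE_HTML = """
<div class="invoice">
  <span>Company Code:</span> <b>C001</b><br>
  <span>Company Name:</span> <b>Acme &amp; Sons</b><br>
  <span>Payment Type:</span> <b>Wire</b><br>
  <span>Date:</span> <b>2020-01-31</b><br>
</div>
"""

# Field labels taken from the question; their spelling in the file is assumed.
FIELDS = ["Company Code", "Company Name", "Payment Type", "Date"]


def html_to_text(html):
    """Turn <br> into newlines, drop all other tags, decode entities."""
    text = re.sub(r"(?i)<br\s*/?>", "\n", html)
    text = re.sub(r"<[^>]+>", "", text)
    return unescape(text)


def extract_fields(html, fields=FIELDS):
    """Return {label: value} for every 'Label: value' line that is found."""
    text = html_to_text(html)
    result = {}
    for name in fields:
        pattern = r"^\s*" + re.escape(name) + r"\s*:\s*(.+?)\s*$"
        match = re.search(pattern, text, re.MULTILINE)
        if match:
            result[name] = match.group(1)
    return result


if __name__ == "__main__":
    print(extract_fields(SAMPLE_HTML))
```

For a real file, the string would come from reading the saved page once. The whole document is then processed in memory, with no browser interaction per field. Stripping tags with a regex is fragile on messy markup, so treat this as a starting point only.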


If the selectors have a Table/TD, table column or table row attribute, those attributes can be incremented by passing a dynamic selector, and each value can be read that way if scraping is taking too much time.

So there is no need to inspect each element and find the DomPath either.

Since the data is only text, it should be fairly quick.
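The row/column idea can also be tried outside the browser. The sketch below is not the dynamic-selector mechanism itself; it is a plain Python stand-in that reads every table cell into a list of rows, so a value can be picked by row and column index. The sample table and the names TableCollector and parse_table are made up for illustration.

```python
from html.parser import HTMLParser


class TableCollector(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row.

    Nested tables are not handled separately in this sketch.
    """

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_table(html):
    """Return all table rows in the HTML as lists of cell strings."""
    collector = TableCollector()
    collector.feed(html)
    collector.close()
    return collector.rows


# Hypothetical table; the real invoice layout is not shown in the thread.
SAMPLE_TABLE = """
<table>
  <tr><th>Company Code</th><th>Payment Type</th></tr>
  <tr><td>C001</td><td>Wire</td></tr>
  <tr><td>C002</td><td>Cheque</td></tr>
</table>
"""

if __name__ == "__main__":
    rows = parse_table(SAMPLE_TABLE)
    print(rows[2][1])  # row index 2, column index 1
```

Once the rows are in a list, stepping through row and column numbers is just indexing, which plays the same role as incrementing the row/column attribute in a selector.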

Reply if you need more help :slight_smile:

I have a similar question, but maybe slightly different. I am processing HTML files that reside on my computer. I need to extract data that resides in a grid within the file. That being said, I’m looking for the best way to pull this out. Should I use web scraping? Or should I try to iterate through the source HTML (60K lines)? This grid is a small subset of the information within the HTML document, so once I am done pulling that data, I would like to end processing.

Any thoughts/feedback would be much appreciated!
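One possible way to iterate through the source and stop as soon as the grid has been read is sketched below in Python. It assumes the grid is a table with a known id attribute; the id "grid", the sample lines and the names GridReader and read_grid are hypothetical, since the actual file is not shown.

```python
from html.parser import HTMLParser


class GridReader(HTMLParser):
    """Reads the rows of the first <table> with the given id, then sets done."""

    def __init__(self, grid_id):
        super().__init__()
        self.grid_id = grid_id
        self.rows = []
        self.done = False
        self._depth = 0    # > 0 while inside the target table
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table":
            if self._depth:
                self._depth += 1  # nested table inside the grid
            elif dict(attrs).get("id") == self.grid_id:
                self._depth = 1
        elif self._depth == 1:
            if tag == "tr":
                self._row = []
            elif tag in ("td", "th") and self._row is not None:
                self._cell = []

    def handle_data(self, data):
        if self._cell is not None and not self.done:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if self.done or not self._depth:
            return
        if tag == "table":
            self._depth -= 1
            if self._depth == 0:
                self.done = True  # grid finished
        elif self._depth == 1:
            if tag in ("td", "th") and self._cell is not None:
                self._row.append("".join(self._cell).strip())
                self._cell = None
            elif tag == "tr" and self._row is not None:
                self.rows.append(self._row)
                self._row = None


def read_grid(lines, grid_id):
    """Feed lines one at a time and stop reading once the grid is complete."""
    reader = GridReader(grid_id)
    for line in lines:
        reader.feed(line)
        if reader.done:
            break  # the rest of the file is never read
    return reader.rows


# Hypothetical file content, split into lines as a file object would yield it.
SAMPLE_LINES = """<html><body>
<p>header text</p>
<table id="grid">
<tr><td>A1</td><td>B1</td></tr>
<tr><td>A2</td><td>B2</td></tr>
</table>
<p>thousands of further lines</p>
</body></html>
""".splitlines(keepends=True)

if __name__ == "__main__":
    print(read_grid(SAMPLE_LINES, "grid"))
```

With a real file, an open file object can be passed as lines, for example inside `with open(path, encoding="utf-8") as f:` where path points at the local file. Because the loop breaks after the closing table tag, the lines after the grid are not parsed at all.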