How to extract data from html file

md.ahtesham · November 23, 2018, 9:30am

I have an HTML file (consider it an invoice file) from which I need to extract a lot of information, such as company code, company name, payment type, date, etc. I am able to extract the data, but it takes a long time to extract the data from one file. Can you please suggest the best way to extract the data so that extraction is faster?

Regards,
Ahtesham

neonova · November 23, 2018, 9:36am

How are you currently extracting from the HTML file? Using strings, OCR, or Get Text?

md.ahtesham · November 23, 2018, 10:20am

Get Text/data scraping using the Chrome browser.

megharajky · November 23, 2018, 10:56am

Hello,

Convert/print the HTML page to PDF and read from the PDF. This gives better results by comparison.

Thanks,
Meg

md.ahtesham · November 23, 2018, 11:55am

Converting it to PDF makes it more difficult to extract the data. Any other workaround?

AshwinS2 · November 23, 2018, 12:26pm

Hi
Use get relative text

Thanks
Ashwin.s

md.ahtesham · November 23, 2018, 1:19pm

Relative text will make it even slower, since I have to extract a lot of data; it is an invoice file of an organisation.

Raghavendraprasad · November 23, 2018, 1:58pm

Hi,
Can you provide more context as to what exactly you want to extract?

HTML data extraction should be fairly easy and fast. Are you using selectors or string operations, or are you importing the file and then extracting from it?

md.ahtesham · November 23, 2018, 2:59pm

Hi, please see the screenshot below: the parts in green are what we need to extract. FYI, it is fairly structured data (if you observe closely). Assume this file contains 150 such records.

Sorry, I had to remove sensitive info.

Regards,

neonova · November 24, 2018, 7:24am

One approach could be to download the webpage first, then parse it as a string and use string manipulation to get the desired information.

Complicated, but it would definitely be faster.
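
As a rough illustration of the string-parsing idea above, here is a minimal Python sketch written outside UiPath. The thread does not show the real invoice markup, so the sample HTML, the field labels, and the `extract_fields` helper are all assumptions: the sketch supposes each field sits in a two-cell table row (label, then value).

```python
import html
import re

# Hypothetical invoice fragment; the real file's markup is not shown in the thread.
SAMPLE_HTML = """
<table>
  <tr><td>Company Code</td><td>1000</td></tr>
  <tr><td>Company Name</td><td>Acme &amp; Co</td></tr>
  <tr><td>Payment Type</td><td>Wire</td></tr>
</table>
"""

# Match a row that holds exactly two cells: a label followed by a value.
ROW_PATTERN = re.compile(
    r"<tr[^>]*>\s*<td[^>]*>(.*?)</td>\s*<td[^>]*>(.*?)</td>\s*</tr>",
    re.IGNORECASE | re.DOTALL,
)


def extract_fields(page_source: str) -> dict[str, str]:
    """Return a label -> value mapping for every two-cell row in the source."""
    fields = {}
    for label, value in ROW_PATTERN.findall(page_source):
        # Decode entities such as &amp; and trim surrounding whitespace.
        fields[html.unescape(label).strip()] = html.unescape(value).strip()
    return fields


fields = extract_fields(SAMPLE_HTML)
print(fields)
```

The whole file is read once and scanned in memory, so no browser interaction is needed per field. If the real markup nests tags inside the cells, the pattern would need adjusting or a proper HTML parser would be a safer choice.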

Raghavendraprasad · November 29, 2018, 9:14am

Hi,

If scraping is taking too much time and the selectors have a Table/TD, table column, or table row attribute, you can pass a dynamic selector, increment that attribute, and read each value directly.

So there is no need to inspect the element and find the DOM path either.

Since the data is only text, extraction should be fairly quick.

Reply if you need more help.

joel.fuller · May 11, 2021, 8:19pm

I have a similar question, but maybe slightly different. I am processing HTML files that reside on my computer. I need to extract data that resides in a grid within the file, and I’m looking for the best way to pull it out. Should I use web scraping? Should I try to iterate through the source HTML (60K lines)? This grid is a small subset of the information within the HTML document, so once I am done pulling that data, I would like to end processing.

Any thoughts/feedback would be much appreciated!
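
One possible way to approach this outside UiPath is sketched below in Python, using only the standard library's `html.parser`. It feeds the file to the parser line by line and stops as soon as the target table closes, so the rest of the document is never processed. The table id `grid`, the sample markup, and the names `GridParser` and `extract_grid` are hypothetical, and the sketch assumes the grid is a plain `<table>` with no nested tables.

```python
from html.parser import HTMLParser


class GridParser(HTMLParser):
    """Collect cell text from the table with a given id, then flag completion."""

    def __init__(self, table_id: str):
        super().__init__()
        self.table_id = table_id  # hypothetical id of the grid table
        self.in_grid = False
        self.in_cell = False
        self.done = False
        self.rows: list[list[str]] = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "table" and dict(attrs).get("id") == self.table_id:
            self.in_grid = True
        elif self.in_grid and tag == "tr":
            self.rows.append([])
        elif self.in_grid and tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears without a <tr>
                self.rows.append([])
            self.in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if self.in_grid and tag in ("td", "th"):
            self.in_cell = False
        elif self.in_grid and tag == "table":
            self.in_grid = False
            self.done = True  # grid fully read

    def handle_data(self, data):
        if self.in_cell:
            self.rows[-1][-1] += data


def extract_grid(lines, table_id: str) -> list[list[str]]:
    """Feed lines to the parser and stop reading once the grid has closed."""
    parser = GridParser(table_id)
    for line in lines:
        parser.feed(line)
        if parser.done:
            break  # skip the remainder of the document
    return [[cell.strip() for cell in row] for row in parser.rows]


SAMPLE = """<html><body>
<p>Header text that is not needed</p>
<table id="grid">
<tr><th>Item</th><th>Qty</th></tr>
<tr><td>Widget</td><td>3</td></tr>
</table>
<p>Thousands of further lines would follow here</p>
</body></html>"""

grid = extract_grid(SAMPLE.splitlines(keepends=True), "grid")
print(grid)
```

For a file on disk, an open file object can be passed as `lines`, since iterating over it yields one line at a time. If the grid cannot be identified by an `id`, the start condition in `handle_starttag` would need a different test.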
