Extract data from Word

Extract Header from Word.

The Word document has headers, bullet points and sub-bullet points. I need to extract the headers and bullet points.

Please provide a sample input and the expected output.
Cheers

@J0ska

Attached is the input. I have highlighted the lines that should appear in the output.

So the output should look like:
X.Y. SPRINT
[EXTRACT]nxnxnxnxnxn
[EXTRACT]nxnxnxnxnxn
X.Y. SPRINT
[EXTRACT]nxnxnxnxnxn
[EXTRACT]nxnxnxnxnxn

If so, then just extend the "If" statement like this:
If paraText.ToLower().Contains("[extract]") OrElse paraText.ToLower().Contains("sprint") Then

Cheers

@J0ska

That condition gives me every header, even when there is no matching bullet point under it. I need to extract a header only when there is a bullet point containing the word "EXTRACT" under that header section.

So try something like this:

If paraText.ToLower().Contains("sprint") Then
  Dim header As Microsoft.Office.Interop.Word.Paragraph = newDoc.Content.Paragraphs.Add()
  header.Range.FormattedText = para.Range.FormattedText
End If

If paraText.ToLower().Contains("[extract]") Then
  Dim bullet As Microsoft.Office.Interop.Word.Paragraph = newDoc.Content.Paragraphs.Add()
  bullet.Range.FormattedText = para.Range.FormattedText
  If Not IsNothing(header) Then
    header.Range.InsertParagraphAfter()
    header = Nothing
  End If
  bullet.Range.InsertParagraphAfter()
End If

Note that I am writing this without a syntax check, so it may contain errors.

@J0ska

It says "header" is not declared. I think the error is thrown at the lines below. I think header is not declared at this level, right?

header.Range.InsertParagraphAfter()
header = nothing

Thanks
Monika

Sure. There might be some syntax errors or wrong variable scoping.
But the basic logic is to remember every "header", add it to the new document only together with the first "bullet" that follows it, and clear the "header" once it has been added to the new document.
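
To show just that logic without Word in the picture, here is a minimal sketch that works on plain strings instead of paragraphs. The names FilterLines and pendingHeader and the sample lines are made up for illustration and are not part of the workflow code. The remembered header is declared before the loop, so it is visible in both "If" blocks, which is the scoping point raised above.

```vb
Imports System
Imports System.Collections.Generic

Module ExtractSketch
    ' Hypothetical helper: keeps a header line only if an [EXTRACT] line follows it
    ' before the next header appears.
    Function FilterLines(lines As IEnumerable(Of String)) As List(Of String)
        Dim result As New List(Of String)
        ' Declared outside the loop so it survives from one line to the next.
        Dim pendingHeader As String = Nothing

        For Each line As String In lines
            Dim lower As String = line.ToLower()

            If lower.Contains("sprint") Then
                ' Remember the latest header, but do not write it yet.
                pendingHeader = line
            End If

            If lower.Contains("[extract]") Then
                ' Write the remembered header once, then clear it.
                If pendingHeader IsNot Nothing Then
                    result.Add(pendingHeader)
                    pendingHeader = Nothing
                End If
                result.Add(line)
            End If
        Next

        Return result
    End Function

    Sub Main()
        ' Made-up sample input: the second header has no [EXTRACT] bullet under it.
        Dim sample As String() = {
            "1.1. SPRINT",
            "[EXTRACT] first item",
            "other item",
            "1.2. SPRINT",
            "other item",
            "1.3. SPRINT",
            "[EXTRACT] second item"
        }

        For Each line As String In FilterLines(sample)
            Console.WriteLine(line)
        Next
    End Sub
End Module
```

With this sample, the "1.2. SPRINT" header is dropped because it is replaced by the next header before any [EXTRACT] line shows up.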

Cheers

This may work…

' Assumes inputPath and outputPath are String arguments holding the full file paths.
Dim inputObj As Object = CType(inputPath, Object)
Dim outputObj As Object = CType(outputPath, Object)
' Declared before the loop so the values survive from one paragraph to the next.
Dim headerText As Microsoft.Office.Interop.Word.Range = Nothing
Dim header As Microsoft.Office.Interop.Word.Paragraph = Nothing
Dim bullet As Microsoft.Office.Interop.Word.Paragraph = Nothing

If System.IO.File.Exists(inputPath) Then
	Console.WriteLine("Processing: " & inputPath)
	Dim wordApp As New Microsoft.Office.Interop.Word.Application
	wordApp.Visible = True
	
	' Open the source read-only and create an empty document for the extracted lines.
	Dim doc As Microsoft.Office.Interop.Word.Document = wordApp.Documents.Open(FileName:=inputObj, ReadOnly:=True)
	Dim newDoc As Microsoft.Office.Interop.Word.Document = wordApp.Documents.Add()
	
	For Each para As Microsoft.Office.Interop.Word.Paragraph In doc.Paragraphs
		Dim paraText As String = para.Range.Text.Trim()
		' Remember the most recent header, but do not write it yet.
		If paraText.ToLower().Contains("sprint") Then
			headerText = para.Range.FormattedText
		End If
		
		If paraText.ToLower().Contains("[extract]") Then
			' Write the remembered header once, before its first matching bullet.
			If Not IsNothing(headerText) Then
				Console.WriteLine(headerText.Text)
				header = newDoc.Content.Paragraphs.Add()
				header.Range.FormattedText = headerText
				header.Range.InsertParagraphAfter()
				headerText = Nothing
			End If
			' Copy the bullet itself, keeping its formatting.
			Console.WriteLine(para.Range.Text)
			bullet = newDoc.Content.Paragraphs.Add()
			bullet.Range.FormattedText = para.Range.FormattedText
			bullet.Range.InsertParagraphAfter()
		End If
	Next
	
	' Save the result, then close both documents and Word.
	newDoc.SaveAs2(FileName:=outputObj)
	newDoc.Close()
	doc.Close()
	wordApp.Quit()
End If


@J0ska

This is not working for me. Nothing is written to the output file at all.

Hi all,

Can someone help here?

I tested the above code with attached sample file and it works fine.

doc6.doc (32 KB)

@J0ska

That's working for me too. Sorry, my bad. I didn't realize you hadn't written the code that saves the document. I added it and it worked. Thank you so much for your help.

Right, sorry for that. I updated the code just in case someone wants to try it.
Cheers