Extract text from Word docx files with Power Automate

This post explains how to extract text from Microsoft Word docx files using only built in actions in Power Automate. 3rd party actions exist, which are more probably more sophisticated and can certainly make this process easier.

docx files are actually zip files

The first thing to that is important to understand, is that a word docx file is actually a zip file that contains a number of folders and files. The root of the zip folder contains these files:

Image of the root folder of a word docx zip file

The word folder in the root of the zip file contains more files and folders:

Within the word folder, there is a file called document.xml (sometimes documentN.xml) which contains the actual document content, and this is the file which we will parse with Power Automate. My example word document looks like this:

The content of document.xml contains:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
  <w:body>
    <w:p xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordml" w:rsidP="65D7FFCD" w14:paraId="2C078E63" wp14:textId="568EF955">
      <w:pPr>
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
      </w:pPr>
      <w:bookmarkStart w:name="_GoBack" w:id="0"/>
      <w:bookmarkEnd w:id="0"/>
      <w:r w:rsidRPr="65D7FFCD" w:rsidR="648E6FBB">
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
        <w:t xml:space="preserve">How to extract </w:t>
      </w:r>
      <w:r w:rsidRPr="65D7FFCD" w:rsidR="648E6FBB">
        <w:rPr>
          <w:b w:val="1"/>
          <w:bCs w:val="1"/>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
        <w:t xml:space="preserve">text </w:t>
      </w:r>
      <w:r w:rsidRPr="65D7FFCD" w:rsidR="648E6FBB">
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
        <w:t>from a Microsoft Word docx file</w:t>
      </w:r>
      <w:r w:rsidRPr="65D7FFCD" w:rsidR="4ECA4038">
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
        <w:t>.</w:t>
      </w:r>
    </w:p>
    <w:p w:rsidR="65D7FFCD" w:rsidP="65D7FFCD" w:rsidRDefault="65D7FFCD" w14:paraId="77484B81" w14:textId="049555E8">
      <w:pPr>
        <w:pStyle w:val="Normal"/>
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
      </w:pPr>
    </w:p>
    <w:p w:rsidR="4ECA4038" w:rsidP="65D7FFCD" w:rsidRDefault="4ECA4038" w14:paraId="10A0E5FC" w14:textId="67D59CCA">
      <w:pPr>
        <w:pStyle w:val="Normal"/>
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
      </w:pPr>
      <w:r w:rsidRPr="65D7FFCD" w:rsidR="4ECA4038">
        <w:rPr>
          <w:sz w:val="24"/>
          <w:szCs w:val="24"/>
        </w:rPr>
        <w:t>This document explains how to extract text from a Microsoft Word document using standard Power Automate actions. The result isn’t perfect, but it should be good enough for basic usage.</w:t>
      </w:r>
    </w:p>
    <w:sectPr>
      <w:pgSz w:w="12240" w:h="15840" w:orient="portrait"/>
      <w:pgMar w:top="1440" w:right="1440" w:bottom="1440" w:left="1440" w:header="720" w:footer="720" w:gutter="0"/>
      <w:cols w:space="720"/>
      <w:docGrid w:linePitch="360"/>
    </w:sectPr>
  </w:body>
</w:document>

As you can see from the above, the text data is on lines 18,27,34, 41 and 66 of the XML file.

Step 1 – Extract the contents of the Word document

To be able to access the content of document.xml the docx file needs to be extracted first. Use the flow action Extract archive to folder to extract the docx file to a temporary folder. Make sure you set the overwrite option to Yes.

Note: You will not be able to select the word document from the file browser within the action because it filters the available files to show only files with a .zip extension. So you can either:

Rename the docx file to .zip
Put in the file path manually or use dynamic content from a previous step

In my flow, the action looks like this:

Image of a word document being extracted by Power Automate

Step 2 – Filter the output of the extraction

The output of the Extract archive to folder action is an array of objects which contains information about every file extracted from the archive. This output needs to be filtered so that we can get the file Id of the document.xml. So add a filter array action and use the output of Extract archive to folder as the input for the filter. Click the edit in advanced mode link and use this filter expression:

@and(startsWith(item()['Name'], 'document'),endsWith(item()['Name'], 'xml'))

This will filter the array and narrow it down to just the file containing the document contents. Here is how my filter array looks:

Step 3 – Get the file content of document.xml

Add a Get file content action and use this expression for the file:

first(body('Filter_array'))['Id']

It should look like this:

Step 4 – Grab the content of the text elements

Finally, add a compose action and use the following expresison:

xpath(xml(outputs('Get_file_content')?['body']), '//*[name()=''w:t'']/text()')

Here is how it looks in my flow:

The xpath expression will grab each element named w:t and return an array of strings of the content found in those elements. Click here If you’d like to learn more about the structure of a word docx file. The output from my sample document produced the following array:

[
  "How to extract ",
  "text ",
  "from a Microsoft Word docx file",
  ".",
  "This document explains how to extract text from a Microsoft Word document using standard Power Automate actions. The result isn’t perfect, but it should be good enough for basic usage."
]

At this point you can either iterate through the results, or use a simple join expression to create a single string from the results. Here is a screenshot of the entire flow:

Image of a Power Automate Flow that extracts the text content from a Word docx file

As you can see from the above, it is possible to Extract Text from a Word docx file with Power Automate quite easily, and a more sophisticated xpath expression could target specific regions of text required.

Comments

erfan says

September 15, 2021 at 7:54 pm

Is the process similar if I would like to read the content and some particular textfield from an HTML file?
ARUN says

January 17, 2022 at 12:05 pm

Hi Paulie,

could you please provide xpath lines to extract paragraph properties, paragraph Id, Text Font styles?
Rune Holm says

February 4, 2022 at 10:48 pm

G R E A T! Works like a charm 😉
Bradley Brooks says

February 9, 2022 at 5:34 pm

Instead of extracting text, how can I replace text (specifically a place holder in a template doc) and then re-zip it?
Micah says

May 3, 2022 at 3:36 pm

@Bradley Brooks – I’m not certain on how to re-zip, but I’ve used the “replace” expression a few times and it has worked well.
Below is what I used to remove the [] and ” from the extracted text by using the “Compose” action. It would be added after the docxText action.

replace(replace(replace(string(outputs(‘docxText’)), ‘[‘, ”), ‘]’, ”), ‘”‘, ”)
Mark says

December 6, 2022 at 2:46 pm

Hi

I am wondering if the following would be possible?

1. Ctrl + H (find and replace one word with another)

2. Open a folder of Word Docs (saved somewhere like OneDrive as needed)

3. Go through each document in the folder and carry out the specific Ctrl + H command

4. Overwrite the old Word docs with the new ones in the folder

The documents in the folder will change each time, but the folder location can remain constant.

The word being replaced in the Ctrl + H (Microsoft Word shortcut) function will change everytime.

Thanks
mac says

February 1, 2023 at 5:58 pm

Thank you!! Been looking everywhere for this solution and yours is so easy to follow
David F says

March 8, 2023 at 6:45 pm

This works great, thanks! How would I add to the xpath expression to include the Heading format information as well as the document text?

e.g.

I’d want to get the Heading3 value to output to the array also.

I need to capture when each document section starts and ends.
Bob says

October 6, 2023 at 1:00 pm

This works great with a local word document. But is there a way to do this with a word document on sharepoint?

When I try it with a sharepoint document i get:

The template language expression ‘first(body(‘filter_array’))[‘Id’]’ cannot be evaluated because property ‘Id’ cannot be selected.
GED says

January 4, 2024 at 4:27 am

how to put it in csv, without distorting the new lines
AM says

January 30, 2024 at 3:34 pm

Is there a equivalent SharePoint API call to perform same operation using HTTP Action?
AM says

January 30, 2024 at 3:37 pm

Is there a equivalent SharePoint API call to perform same operation using HTTP Action instead of using Extract Folder action?

Additional menu