VX Development Blog - Compiling docx templates with python-docx

Compiling docx templates with python-docx

Overview

In nearly every solution related to document workflow, we are forced to deal with several different document formats. From the development point of view, the best document formats to deal with are csv/xlsx and pdf, for report and receipt/certificate generation respectively. But life is not as easy as we would like it to be, so we need to deal with docx documents as well, and by “dealing with” I mean creating, generating, parsing, and modifying them, and using them as templates for document generation.

Using docx as document template

As our language and environment of choice is Python, we have several good enough libraries for working with each document format mentioned above. Python’s standard csv library covers all needs related to csv reports or data extraction tasks. For pdf document generation we use the open source wkhtmltopdf tool, and xlsx documents are handled with the help of the XlsxWriter and openpyxl libraries. There is no issue with docx libraries either (we will discuss python-docx in this article), unless you are using docx documents as templates for document generation. Here is a simple diagram of the general usage of docx templates in our solutions.

Fig. 1. Typical DOCX usage in our applications

How it was done and what was missing

Document template compilation (Step 2.1 in Fig. 1) was implemented with the template library provided by the wheezy.web framework, which we build our web applications on. It is not much different from Jinja2 or Django templates, so you need valid HTML/XML as a source. To be able to compile the document as XML/HTML, we extracted the "document.xml" file from the docx archive, as shown in the code snippet below.

with zipfile.ZipFile(io.BytesIO(base64.b64decode(docx_document))) as archive:
    template = archive.read('word/document.xml').decode('utf-8')

Fig. 2. docx archive unpacking

After extracting the document.xml file, we passed it together with a dictionary of variables to the template compilation engine and got a compiled XML file as output. To get the compiled docx, we reassembled the zip archive with the compiled document.xml in place of the original.
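The reassembly step is not shown in the original snippets. A minimal sketch of how it could look with the standard zipfile module follows; the function name `replace_document_xml` and its parameters are assumptions made for illustration, not the code used in production.

```python
import io
import zipfile


def replace_document_xml(docx_bytes: bytes, compiled_xml: str) -> bytes:
    """Return a copy of the docx archive with word/document.xml replaced.

    Hypothetical helper: the name and signature are illustrative only.
    """
    source = io.BytesIO(docx_bytes)
    target = io.BytesIO()
    with zipfile.ZipFile(source) as src, \
            zipfile.ZipFile(target, 'w', zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            if item.filename == 'word/document.xml':
                # swap in the compiled template instead of the original part
                dst.writestr(item.filename, compiled_xml.encode('utf-8'))
            else:
                # every other part (styles, media, rels) is copied untouched
                dst.writestr(item, src.read(item.filename))
    return target.getvalue()
```

Zip archives cannot be edited in place, so the sketch writes a fresh archive and copies every part except the one being replaced.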

There are a few disadvantages to this routine:

1. It is complex.

2. You need to know the internal structure of the docx archive.

3. You can only compile text variables into the docx.

But, hey, it has been working for us for 5 years on ~150 installations.

The biggest issue was the third point above. There was a requirement to place not only text variables but also photos and tables, which is impossible when the document is compiled with an HTML template engine.

dompx - the docx domper

After some research on docx manipulation libraries for Python, we decided to give the python-docx library a try. It is well documented and its code structure and style are very clean, so we were able to find answers to our questions very quickly.

We decided to write a small document compilation engine to replace our existing one, with one very bold requirement: existing document templates should work seamlessly on the new engine, and no one should even notice that something has changed under the hood.

With the help of python-docx we came to understand the structure of docx documents (paragraphs, runs, elements, etc.), and breaking a document down into its smallest components became a piece of cake… or, more accurately, it became the semi-recursive generator function shown in Fig. 3.

from typing import Iterable, Iterator

# Document here is the class behind the object returned by docx.Document()
from docx.document import Document
from docx.table import Table
from docx.text.paragraph import Paragraph


def paragraphs(document: Document) -> Iterator[Paragraph]:
    """Paragraph generator for the document

    Parameters
    ----------
    document : Document

    Yields
    ------
    Paragraph

    """
    # first come the document-level paragraphs
    yield from document.paragraphs

    # paragraphs are also hidden in the document-level tables
    yield from table_paragraphs(document.tables)

    # header-level paragraphs (only the first section is inspected)
    header = document.sections[0].header
    yield from header.paragraphs

    # header-level table paragraphs
    yield from table_paragraphs(header.tables)

    # footer-level paragraphs (only the first section is inspected)
    footer = document.sections[0].footer
    yield from footer.paragraphs

    # footer-level table paragraphs
    yield from table_paragraphs(footer.tables)


def table_paragraphs(tables: Iterable[Table]) -> Iterator[Paragraph]:
    """Extract table-level paragraphs which are hidden in the table cells

    Parameters
    ----------
    tables : Iterable

    Yields
    ------
    Paragraph

    """
    for table in tables:
        for col in table.columns:
            for cell in col.cells:
                yield from cell.paragraphs

                # but wait, there's more! what about tables hidden in the
                # table cells?
                yield from table_paragraphs(cell.tables)

Fig. 3. Iterating over document paragraphs
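To see the recursion on its own, without python-docx installed, here is a small stand-alone sketch of the same idea. The `Cell` and `Table` classes are hypothetical stand-ins invented for this illustration (paragraphs are plain strings), not python-docx types.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class Cell:
    """Stand-in for a table cell: its own paragraphs plus nested tables."""
    paragraphs: List[str] = field(default_factory=list)
    tables: List['Table'] = field(default_factory=list)


@dataclass
class Table:
    """Stand-in for a table, flattened to a list of cells."""
    cells: List[Cell] = field(default_factory=list)


def table_paragraphs(tables: Iterable[Table]) -> Iterator[str]:
    for table in tables:
        for cell in table.cells:
            yield from cell.paragraphs
            # recurse into tables nested inside this cell
            yield from table_paragraphs(cell.tables)


inner = Table([Cell(['inner'])])
outer = Table([Cell(['a'], [inner]), Cell(['b'])])
print(list(table_paragraphs([outer])))  # ['a', 'inner', 'b']
```

Because the generator calls itself for every cell, tables nested to any depth are visited, and the paragraphs of a nested table come out right after the paragraphs of the cell that contains it.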

The next major task was detecting variable placeholders and replacing them with text, an image, or a table, as defined by a variable modifier (filter). The general syntax for variable placeholders can be summarized as the expression @{variable_name}!modifier. To detect such placeholders in docx paragraphs (or in runs) we used the following regular expression (Fig. 4).

import re

# regular expression for extracting wheezy.template-like variables and
# expressions from the docx document, e.g. @{data['key']}!ss
token = re.compile(r'(@{?[\w\.\[\]\'\"\(\)]+}?)(![a-z]+)?')

Fig. 4. Super-puper regular expression
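To make the two capture groups concrete, here is a small sketch that runs the pattern from Fig. 4 over a sample string. The helper `find_placeholders` and the sample text are made up for illustration.

```python
import re

# the pattern from Fig. 4, copied unchanged
token = re.compile(r'(@{?[\w\.\[\]\'\"\(\)]+}?)(![a-z]+)?')


def find_placeholders(text: str) -> list:
    """Return (expression, modifier) pairs; modifier is None when absent."""
    return [(m.group(1), m.group(2)) for m in token.finditer(text)]


sample = "Name: @{data['name']}!ss, photo: @{photo}!img, id: @{uid}"
for expr, mod in find_placeholders(sample):
    print(expr, mod)
# @{data['name']} !ss
# @{photo} !img
# @{uid} None
```

Group 1 is the expression including its `@{...}` wrapper, and group 2 is the optional modifier including the leading `!`. Note that the character class contains no space or comma, so with the pattern exactly as written an expression such as `@{data.get('a', 1)}` is cut off at the first comma; supporting such expressions would presumably need a wider character class.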

The next step is replacing variable_name with its corresponding value. Note that the variable name can actually be any valid Python expression, such as data['first_name'] or data.get('first_name', {}).get('en', ''). To compile these expressions we used Python’s built-in eval function against a predefined dataset (Fig. 5).

from typing import Any


def compile_expr(expr: str, data: dict) -> Any:
    """Compile an expression matched by the regexp against data

    Parameters
    ----------
    expr : str
    data : dict

    Returns
    -------
    Any
        the evaluated value: text, an image path or tuple, a list of lists

    """
    try:
        # we need to remove @, {, } symbols to work with pure python expression
        val = eval(expr[1:].strip('{}'), {}, data)  # nosec
    except Exception as ex:
        # throw() and _() are helpers of the surrounding application (not
        # shown here)
        throw(_("Can't generate file due to not well formed template: [{ex}]"),
              params={'ex': str(ex)})

    return val

Fig. 5. Evaluating Python expression against data
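The function in Fig. 5 depends on application helpers, so here is a self-contained sketch of the same evaluation step that can be run as is. The name `evaluate_placeholder`, the use of ValueError in place of the application's error helper, and the sample dataset are all assumptions for illustration.

```python
def evaluate_placeholder(expr: str, data: dict):
    """Evaluate a matched placeholder such as "@{data['name']}" against data."""
    # drop the leading "@" and the surrounding braces, as in Fig. 5
    source = expr[1:].strip('{}')
    try:
        # empty globals, the dataset as locals; builtins remain reachable
        return eval(source, {}, data)
    except Exception as ex:
        # ValueError is a stand-in for the application's own error helper
        raise ValueError(f'not well formed template: [{ex}]') from ex


# hypothetical dataset; the image path is made up
dataset = {'data': {'first_name': 'Ann'}, 'photo': ('/tmp/a.png', 30, 40)}

print(evaluate_placeholder("@{data['first_name']}", dataset))  # Ann
print(evaluate_placeholder("@{photo}", dataset))  # ('/tmp/a.png', 30, 40)
```

The result keeps whatever type the expression evaluates to, which is what lets the same step feed plain text, image and table handlers alike.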

To support various data types, such as images or tables, we implemented several docx injection methods and bound each of them to a modifier used in the placeholder definition. Hence the placeholder for injecting an image into the document looks something like this: @{profile_picture}!img. The variable “profile_picture” should contain the path of the image (the code also accepts a (path, width, height) tuple, with sizes in millimeters), and the !img modifier selects the method that handles image injection. Similarly, the !tbl modifier directs data to the table injection method. Fig. 6 shows the injection methods described above.

from docx.document import Document
from docx.shared import Mm
from docx.text.run import Run


def img(doc: Document, run: Run, expr: str, mod: str, data: dict):
    """Embed image into the document

    Parameters
    ----------
    doc : Document
    run : Run
    expr : str
    mod : str
    data : dict

    """
    # in case of images we don't need expression in the run, hence replacing it
    # with the empty string
    run.text = run.text.replace(f'{expr}{mod}', '')
    # image "value" here should be a path of the image, or a
    # (path, width, height) tuple with the size given in millimeters
    if picture := compile_expr(expr, data):
        path, width, height = None, None, None

        if isinstance(picture, str):
            path = picture
        elif isinstance(picture, tuple):
            path, width, height = picture
        else:
            return

        run.add_picture(
            path,
            width=(width and Mm(width)),
            height=(height and Mm(height))
        )


def tbl(doc: Document, run: Run, expr: str, mod: str, data: dict):
    """Embed table into the document

    Parameters
    ----------
    doc : Document
    run : Run
    expr : str
    mod : str
    data : dict

    """
    # in case of table we don't need expression in the run, hence replacing it
    # with the empty string
    run.text = run.text.replace(f'{expr}{mod}', '')
    if matrix := compile_expr(expr, data):
        # for now we support only strictly structured matrix-like data, e.g.
        # a list of lists; anything else is skipped
        if not isinstance(matrix, list) or not isinstance(matrix[0], list):
            return

        # create a table in document with matrix dimensions
        table = doc.add_table(len(matrix), len(matrix[0]))
        table.style = 'Table Grid'

        # and populate the table with the matrix values
        for ridx, row in enumerate(matrix):
            for cidx, cell in enumerate(row):
                table.cell(ridx, cidx).text = str(cell)

        # move the new table from the end of the document to the
        # placeholder's position
        run.element.addnext(table._tbl)

Fig. 6. Image and table compilers
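How the modifiers are bound to the injection methods is not shown above. One plausible way, sketched here under assumptions, is a dictionary that maps each modifier to a function sharing the `(doc, run, expr, mod, data)` signature of Fig. 6. `FakeRun`, `MODIFIERS`, `compile_run` and the `!up` modifier are hypothetical names invented so the example runs without python-docx.

```python
import re
from dataclasses import dataclass

# the pattern from Fig. 4, copied unchanged
token = re.compile(r'(@{?[\w\.\[\]\'\"\(\)]+}?)(![a-z]+)?')


@dataclass
class FakeRun:
    """Hypothetical stand-in for a python-docx Run; only .text is modelled."""
    text: str


def text(doc, run, expr, mod, data):
    """Default handler: substitute the evaluated value as plain text."""
    value = eval(expr[1:].strip('{}'), {}, data)
    run.text = run.text.replace(f'{expr}{mod}', str(value))


def upper(doc, run, expr, mod, data):
    """Illustrative handler for a made-up !up modifier."""
    value = eval(expr[1:].strip('{}'), {}, data)
    run.text = run.text.replace(f'{expr}{mod}', str(value).upper())


# hypothetical registry: modifier -> injection function; in the real engine
# entries such as '!img' and '!tbl' would point at the functions of Fig. 6
MODIFIERS = {'!up': upper}


def compile_run(doc, run, data):
    # collect the matches first, because the handlers rewrite run.text
    for match in list(token.finditer(run.text)):
        expr, mod = match.group(1), match.group(2) or ''
        handler = MODIFIERS.get(mod, text)
        handler(doc, run, expr, mod, data)


run = FakeRun("Dear @{data['name']}!up, your id is @{uid}.")
compile_run(None, run, {'data': {'name': 'Ann'}, 'uid': 7})
print(run.text)  # Dear ANN, your id is 7.
```

With a registry like this, supporting a new data type would only mean writing one more handler and adding one dictionary entry, while unknown or missing modifiers fall back to plain text substitution.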

Summary

By using python-docx we have simplified our solution, and at the same time we have hugely improved our knowledge of, and capabilities in, docx manipulation. We have also created a framework-agnostic solution for dealing with docx templates.