Nemirwen's corner

Computer graphics, linux, and more from the cyberwitch's hut

Batch PDF Form filling

I just spent an evening playing around with pdftk and its form-filling features, and since the process is a bit involved, I'll try to share my findings below.

Overview

For the purpose of filling PDF forms, pdftk has two commands that interest us: dump_data_fields_utf8 and fill_form. However, fill_form expects a .fdf file as input. That file is basically a very specific, stripped-down kind of PDF file (FDF stands for Forms Data Format) that is not realistically human-writable. For this reason, we have to add a .fdf creation step to our agenda.
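To give an idea of what such a file contains, here is a minimal sketch that builds a bare-bones FDF document by hand. The function names escape_pdf_string and minimal_fdf are made up for this illustration, and the sketch assumes plain ASCAII-free simplicity is not needed: it only handles ASCII names and values. Accented characters need extra encoding work, which is one reason a library is used for the real thing later on.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return (
        text.replace("\\", "\\\\")
        .replace("(", "\\(")
        .replace(")", "\\)")
    )


def minimal_fdf(fields: list[tuple[str, str]]) -> str:
    """Build a bare-bones FDF document from (name, value) pairs (ASCII only)."""
    # Each field is a small dictionary: /T is the field name, /V its value.
    entries = "\n".join(
        f"<< /T ({escape_pdf_string(name)}) /V ({escape_pdf_string(value)}) >>"
        for name, value in fields
    )
    return (
        "%FDF-1.2\n"
        "1 0 obj\n"
        "<< /FDF << /Fields [\n"
        f"{entries}\n"
        "] >> >>\n"
        "endobj\n"
        "trailer\n"
        "<< /Root 1 0 R >>\n"
        "%%EOF\n"
    )


print(minimal_fdf([("Adresse du domicile", "Mos Eisley")]))
```

Even in this tiny form, the nesting and escaping rules show why writing these files by hand for a form with dozens of fields is not much fun.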

In the end, the high-level view of the process looks like the following: field extraction, data entry, .fdf generation, and finally filled PDF creation.

Field extraction

This step is pretty simple, since pdftk does all the work:

pdftk form.pdf dump_data_fields_utf8 > fields.flds

If you have a look inside the generated fields.flds file, you'll get a bunch of field metadata, formatted like the following:

---
FieldType: Text
FieldName: Adresse du domicile
FieldNameAlt: Indiquer l'adresse de votre domicile
FieldFlags: 8388610
FieldJustification: Left
---
FieldType: Button
FieldName: distinction Motif 1
FieldNameAlt: Motif 1
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: Oui
FieldStateOption: Off

In there, the most interesting lines are FieldName, because that's the name that'll identify the field later while we're filling it, and, in the case of a button, FieldStateOption, because those are the available options you'll need to choose from.
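If you would rather inspect that metadata programmatically, the dump format is simple enough to parse in a few lines. This is a rough sketch, assuming every record starts with a "---" line and every other line has the "Key: value" shape shown above; parse_fields is a name invented for the example.

```python
SAMPLE = """\
---
FieldType: Text
FieldName: Adresse du domicile
FieldNameAlt: Indiquer l'adresse de votre domicile
FieldFlags: 8388610
FieldJustification: Left
---
FieldType: Button
FieldName: distinction Motif 1
FieldNameAlt: Motif 1
FieldFlags: 0
FieldValue: Off
FieldJustification: Left
FieldStateOption: Oui
FieldStateOption: Off
"""


def parse_fields(dump: str) -> list[dict]:
    """Parse dump_data_fields_utf8 output into one dict per field."""
    fields = []
    for line in dump.splitlines():
        if line.strip() == "---":
            fields.append({"FieldStateOption": []})  # start of a new record
            continue
        key, sep, value = line.partition(": ")
        if not sep or not fields:
            continue  # skip blank or unexpected lines
        if key == "FieldStateOption":
            fields[-1][key].append(value)  # this key can repeat
        else:
            fields[-1][key] = value
    return fields


for field in parse_fields(SAMPLE):
    print(field["FieldName"], field["FieldStateOption"])
```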

If you are interested in the FieldFlags meanings, I compiled a reference in here.
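As a quick illustration of how those flags work: FieldFlags is a bit field, so decoding it is a matter of testing individual bits. The sketch below uses the bit positions from the field-flag tables of the PDF specification as I understand them; the table layout and the decode_flags name are my own, and the list is not guaranteed to be exhaustive.

```python
# Flags shared by every field type.
COMMON = {1 << 0: "ReadOnly", 1 << 1: "Required", 1 << 2: "NoExport"}

# Flags whose meaning depends on the FieldType reported by pdftk.
BY_TYPE = {
    "Text": {
        1 << 12: "Multiline",
        1 << 13: "Password",
        1 << 20: "FileSelect",
        1 << 22: "DoNotSpellCheck",
        1 << 23: "DoNotScroll",
        1 << 24: "Comb",
        1 << 25: "RichText",
    },
    "Button": {
        1 << 14: "NoToggleToOff",
        1 << 15: "Radio",
        1 << 16: "Pushbutton",
        1 << 25: "RadiosInUnison",
    },
    "Choice": {
        1 << 17: "Combo",
        1 << 18: "Edit",
        1 << 19: "Sort",
        1 << 21: "MultiSelect",
        1 << 22: "DoNotSpellCheck",
        1 << 26: "CommitOnSelChange",
    },
}


def decode_flags(flags: int, field_type: str) -> list[str]:
    """Return the names of the flags set in `flags`, lowest bit first."""
    table = {**COMMON, **BY_TYPE.get(field_type, {})}
    return [name for bit, name in sorted(table.items()) if flags & bit]


print(decode_flags(8388610, "Text"))  # ['Required', 'DoNotScroll']
```

So the address field from the dump above is a required text field that does not scroll.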

Data entry

In order to generate the .fdf file, I chose to use the fdfgen Python library. However, copying all the field names by hand to create the required list is just tedious, so we'll script this with a bit of awk:

#!/usr/bin/awk -f

BEGIN {
    FS = ": ";
    printf "fields = [";
}

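/^FieldName: / {
    # Print everything after the first separator, so names containing ": "
    # survive intact (adapted from https://stackoverflow.com/a/23118210/5309963)
    printf "\n    (\"%s\", \"\"),", substr($0, length($1) + length(FS) + 1);
}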

/FieldStateOption/ {
    printf " # Opt: \"%s\"",$2;
}

END {
    printf "\n]\n";
}

This script could be improved to detect the different FieldType, or even add a comment if the Required flag is set.
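For what it's worth, here is one way those improvements could look, sketched in Python rather than awk so that each record can be read in full before its line is printed. It assumes the same dump format as above and the "Required" bit being the second bit of FieldFlags, as given in the PDF specification; render_fields_py is a name made up for the example.

```python
REQUIRED = 1 << 1  # "Required" bit of FieldFlags

SAMPLE = """\
---
FieldType: Text
FieldName: Adresse du domicile
FieldFlags: 8388610
---
FieldType: Button
FieldName: distinction Motif 1
FieldFlags: 0
FieldStateOption: Oui
FieldStateOption: Off
"""


def render_fields_py(dump: str) -> str:
    """Turn dump_data_fields_utf8 output into the source of a fields.py file."""
    records, current = [], None
    for line in dump.splitlines():
        if line.strip() == "---":
            current = {"options": []}
            records.append(current)
        elif current is not None and ": " in line:
            key, value = line.split(": ", 1)
            if key == "FieldStateOption":
                current["options"].append(value)
            else:
                current[key] = value

    lines = ["fields = ["]
    for rec in records:
        # Comment with the field type, the Required flag, and button options.
        notes = [rec.get("FieldType", "?")]
        if int(rec.get("FieldFlags", "0")) & REQUIRED:
            notes.append("required")
        notes += [f"Opt: {opt!r}" for opt in rec["options"]]
        name = rec.get("FieldName", "")
        # repr() yields a valid Python string literal, quotes included.
        lines.append(f"    ({name!r}, ''),  # " + ", ".join(notes))
    lines.append("]")
    return "\n".join(lines) + "\n"


print(render_fields_py(SAMPLE))
```

Using repr() for the names also takes care of field names that contain quotes, which the awk version does not escape.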

Running the awk script above on the fields.flds file we generated previously should give us a usable Python data file, fields.py:

fields = [
    ...
    ("Adresse du domicile", ""),
    ("distinction Motif 1", ""), # Opt: "Oui" # Opt: "Off"
    ...
]

Now it's just a matter of inputting our values in the second part of each tuple.
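For actual batch work, typing values into fields.py for every form gets old quickly. One possible approach, sketched below, is to keep the generated list as a template and merge a dictionary of values into it. The fill_values helper is hypothetical, as are the sample values; it refuses names that are not in the form, so that typos are caught early.

```python
def fill_values(fields, values):
    """Return a copy of `fields` with values taken from the `values` dict."""
    known = {name for name, _ in fields}
    unknown = set(values) - known
    if unknown:
        raise KeyError(f"not in the form: {sorted(unknown)}")
    # Fields missing from `values` keep whatever the template had.
    return [(name, values.get(name, default)) for name, default in fields]


template = [
    ("Adresse du domicile", ""),
    ("distinction Motif 1", ""),
]

filled = fill_values(
    template,
    {"Adresse du domicile": "Mos Eisley", "distinction Motif 1": "Oui"},
)
print(filled)
```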

fdf generation

Taking a liberal amount of inspiration from the example code of fdfgen, we write the following script, in which the error checking is "left as an exercise to the reader":

#!/usr/bin/env python3

from fdfgen import forge_fdf
from fields import fields
import sys

fdf = forge_fdf(fdf_data_strings=fields)

with open(sys.argv[1], "wb") as f:
    f.write(fdf)

Given the appropriate first argument, running this script should generate a .fdf file suitable for use with pdftk.
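If you want to take a stab at that exercise, a possible starting point is sketched below. The check_fields and main functions are names I made up; the only fdfgen call used is the same forge_fdf(fdf_data_strings=...) as in the script above, and the checks are just the ones that seemed most likely to bite.

```python
import sys


def check_fields(fields):
    """Return a list of problems found in a list of (name, value) tuples."""
    problems = []
    seen = set()
    for index, item in enumerate(fields):
        if not (isinstance(item, tuple) and len(item) == 2):
            problems.append(f"entry {index}: expected a (name, value) tuple")
            continue
        name, _value = item
        if not isinstance(name, str) or not name:
            problems.append(f"entry {index}: field name must be a non-empty string")
        elif name in seen:
            problems.append(f"entry {index}: duplicate field name {name!r}")
        else:
            seen.add(name)
    return problems


def main(argv):
    if len(argv) != 2:
        print(f"usage: {argv[0]} OUTPUT.fdf", file=sys.stderr)
        return 2
    # Imported here so that a missing module gives a readable message.
    try:
        from fdfgen import forge_fdf
        from fields import fields
    except ImportError as exc:
        print(f"cannot import: {exc}", file=sys.stderr)
        return 1
    problems = check_fields(fields)
    if problems:
        print("\n".join(problems), file=sys.stderr)
        return 1
    with open(argv[1], "wb") as f:
        f.write(forge_fdf(fdf_data_strings=fields))
    return 0


# In a real script, finish with: sys.exit(main(sys.argv))
```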

Filled pdf creation

Now all the pieces are coming together! All that is left to do is to tell pdftk to take our brand spanking new .fdf file and use it to fill the original pdf form with all of our precious data:

pdftk form.pdf fill_form fields.fdf output filled_form.pdf

Lo and behold, the result:

Here we see a screenshot of part of an official form that reads "Fait à : Mos Eisley, Le : soir, à : l'apéro" (roughly: "Done at: Mos Eisley, on: evening, at: drinks time").
Look at this magnificent form! Doesn't it look beautiful all filled with data like this?
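To make this properly "batch", the pdftk call can be wrapped in a small loop. Here is a sketch, assuming pdftk is on the PATH and one .fdf file exists per form to produce. The function names and the "_filled" suffix are my own choices; flatten is an optional pdftk output option that merges the field values into the page so they can no longer be edited.

```python
import subprocess
from pathlib import Path


def fill_command(form, fdf, output, flatten=False):
    """Build the pdftk command line used in the step above."""
    cmd = ["pdftk", form, "fill_form", fdf, "output", output]
    if flatten:
        cmd.append("flatten")
    return cmd


def fill_form(form, fdf, output, flatten=False):
    """Run pdftk and raise CalledProcessError if it fails."""
    subprocess.run(fill_command(form, fdf, output, flatten), check=True)


def fill_many(form, fdf_paths, out_dir="."):
    """Fill `form` once per .fdf file, writing NAME_filled.pdf for NAME.fdf."""
    for fdf in fdf_paths:
        output = Path(out_dir) / (Path(fdf).stem + "_filled.pdf")
        fill_form(form, str(fdf), str(output))


print(" ".join(fill_command("form.pdf", "fields.fdf", "filled_form.pdf")))
```

Calling fill_many("form.pdf", sorted(Path("data").glob("*.fdf"))) would then produce one filled PDF per data file, with "data" being a hypothetical directory name.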

And if you ended up here because of a genuine need to fill pdf forms via the CLI, I wish you a lot of luck, and May the Force be with you!

Bibliography

http://www.myown1.com/linux/pdf_formfill.shtml
https://github.com/ccnmtl/fdfgen/