How to create and edit PDF documents in Python


In our previous tutorial, we learned how to read PDF documents in Python and discussed the basics of the PyPDF2 library. While some projects require extracting data from PDF documents, it’s also common to need to create your own PDFs for things like automatic invoice generation or booking confirmations.

A library that you can use to create and edit PDF documents in Python is PyPDF2. It has a large feature set: you can extract text, images, and metadata from a PDF document, as we covered in the previous tutorial. You can also create and edit a PDF document, encrypt and decrypt it, add or remove annotations, and more.

In this tutorial, we will focus on creating and editing PDF documents. Let’s begin.

Creating PDF documents

We use the PdfReader class to read and extract the content of a PDF document, and we use the PdfWriter class to create new PDF files. One limitation of PyPDF2 is that you can only create new PDF files from existing PDF pages (or blank ones); it does not generate new page content such as text from scratch.

We will start by creating a blank page for our PDF file and this requires us to instantiate an object using the PdfWriter() class. This class has a method called add_blank_page() which will create a blank page with the specified size and add it to the existing object.

Page size is specified in default user space units, where 72 units equal 1 inch. With this in mind, we can create an A4 size page by multiplying 8.27 by 72 to get the page width and 11.69 by 72 to get the page height.
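The arithmetic above can be checked with a few lines of plain Python. This is only a sketch: the helper name inches_to_points is hypothetical, and rounding to the nearest whole unit (rather than truncating) is an assumption on my part.

```python
POINTS_PER_INCH = 72  # default user space units per inch


def inches_to_points(inches: float) -> int:
    """Convert inches to whole user space units (hypothetical helper)."""
    # Rounding to the nearest unit is an assumption; the text only says "multiply".
    return round(inches * POINTS_PER_INCH)


# A4 is roughly 8.27 x 11.69 inches.
a4_width = inches_to_points(8.27)
a4_height = inches_to_points(11.69)
print(a4_width, a4_height)  # 595 842
```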

I used the following code to create a blank PDF document using PyPDF2:

It is important to use integer values for page width and height. Otherwise, you may end up with an incorrectly sized PDF document. I used the open() function in Python and specified a filename along with the opening mode. The mode wb+ means the file is opened in binary mode for writing and updating.

Next, I use the write() method to write the content of my_pdf_pages to the doc.pdf file. Sure, you’ll only see a blank page if you open the file now, but we were able to create it using the library.

Remember how we read different pages from a PDF document in the previous tutorial using the pages property? The pages property stores all document pages as a list of page objects. We can extract a specific set of pages and then add them to our newly created PDF using the add_page() method.

Here’s an example where I read the contents of two different PDF books and write some of their pages into a new file in sequence:

Much of the code here is similar to the previous example. The only difference is that instead of the add_blank_page() method, we are using the add_page() method to add a page object to our document. We iterate over the pages with indexes from 1 to 9 and add them one by one to our PdfWriter object called my_pdf_pages. Once all the pages are added, we write them to our file called extracts.pdf.

A few months ago I downloaded a book that I wanted to read. However, only one chapter could be downloaded at a time, and I wanted to merge them all into one document. I did this with a third party service back then, but we can do it just as easily using a few lines of code.

Instead of reading a file one page at a time and then adding each page to our document, we can also add the entire file at once using the append_pages_from_reader() method. This method also takes an optional second parameter: a callback function that is called after each page is added.

Cutting, inserting and concatenating PDF documents

There is another class called PdfMerger in the PyPDF2 library which you can use to create a PDF document in Python. This class offers more advanced features than the PdfWriter class. There are two important methods that we will cover here: append() and merge().

Let’s start with append(). In the previous section, we used the append_pages_from_reader() method from the PdfWriter class to add our book chapters one after another. The advantage of using append() is that it gives you more options and flexibility.

As you can see, this code is much shorter than the one I wrote above to do the same thing. The important difference is that we didn’t have to instantiate a PdfReader object to add chapters. The append() method of the PdfMerger class just needs a filename or a file object.

The append() method takes four different parameters. The first is the file name as we saw above.

The second parameter is a string identifying a bookmark to apply to the start of the included file. We could use it to add the chapter count as a bookmark in our generated document.

The third parameter allows you to add only a specific set of pages to the book rather than the entire chapter. It can be a (start, stop[, step]) tuple giving the start index, the stop index (exclusive), and the step between pages.

When I ran the above code, it created a bookmarked PDF document for each chapter. It also only had the first 10 pages of each chapter.

Let’s say you have a bunch of books but they don’t have an index or a preface at the beginning. The author provides the index as a separate PDF document. How do you add it at the beginning of the books? The append() method won’t help much here, especially if you also want to add content somewhere in the middle of the book. Fortunately, another similar method called merge() is helpful here.

The first line above adds the index document to the beginning of our PdfMerger object, while the second line writes all the merged data into our PDF file.

Adding bookmarks to a PDF document

You may well need to add bookmarks for specific pages of a PDF document for easy access. A handy method for this is add_outline_item(), which is available in both the PdfWriter class and the PdfMerger class. Its two required parameters specify the title and the page number for the bookmark. The title must be a string and the page number must be an integer.

You can also specify a parent structure element as a third parameter to create nested bookmark elements. The next three parameters determine the font color, weight, and style of the bookmark. Here is an example that uses the first two parameters to bookmark the table of contents of Chapter 1.

Final thoughts

In this tutorial, we learned how to create a PDF document in Python and how to add content to the document by adding individual pages or a group of pages. We also learned how to add content at particular locations in our PDF document using the PdfMerger class from the PyPDF2 library.




By LocalBizWebsiteDesign
