How to preserve the PDF layout after translating content from English to French using Python



I am working on a simple application that converts my PDF files containing English text into PDFs containing the French translation. I have a simple proof of concept that iterates over a given file and translates all of its text into French. Now I am stuck on saving the translated French text into a PDF with a structure similar to the original English version.

import PyPDF2
from googletrans import Translator

translator = Translator()

read_pdf = PyPDF2.PdfFileReader(open('any_english.pdf', 'rb'))
write_pdf = PyPDF2.PdfFileWriter()
number_of_pages = read_pdf.getNumPages()

# Extract the text of each page and print its French translation
for i in range(number_of_pages):
    page = read_pdf.getPage(i)
    page_content = page.extractText()
    print(translator.translate(page_content, dest='fr').text)

    # TODO: save the translated French text into a PDF, preserving the structure of the original PDF


All content in the PDF is text, not images.

2 Answers: 

There is no easy way to open, edit, and rewrite PDFs in Python. However, depending on how complex the PDF's structure is, you might have success converting the PDF to HTML, translating the HTML, and then generating a PDF from the translated HTML.

For converting PDF to HTML, there is pdf2html which has a basic Python wrapper.

Once the translation is done, you can reverse this process, with varying degrees of success, using e.g. weasyprint, html2pdf (Mac only), or wkhtmltopdf (requires Qt).
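
For the middle step (translating the HTML without disturbing its markup), here is a minimal sketch using only the standard library. It assumes the PDF has already been converted to an HTML string. The names `TextNodeTranslator`, `translate_html`, and `fake_translate` are hypothetical helpers made up for this illustration, and the dictionary lookup merely stands in for a real translation call such as the `translator.translate(text, dest='fr').text` used in the question.

```python
from html import escape
from html.parser import HTMLParser


class TextNodeTranslator(HTMLParser):
    """Re-emit HTML tags unchanged and pass only visible text through `translate`."""

    SKIP = {"script", "style"}  # contents of these tags are not prose

    def __init__(self, translate):
        super().__init__(convert_charrefs=True)
        self.translate = translate
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        self.out.append(self.get_starttag_text())  # original tag text, attributes intact

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())  # e.g. <br/>

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_data(self, data):
        core = data.strip()
        if self.skip_depth or not core:
            self.out.append(data)  # CSS/JS or pure whitespace: keep as-is
            return
        # Keep surrounding whitespace so inline spacing survives translation
        lead = data[: len(data) - len(data.lstrip())]
        trail = data[len(data.rstrip()):]
        self.out.append(lead + escape(self.translate(core), quote=False) + trail)


def translate_html(html, translate):
    parser = TextNodeTranslator(translate)
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.out)


# Stand-in for a real translator (assumption: swap in your own callable)
FAKE_FR = {"Hello": "Bonjour", "world": "monde"}


def fake_translate(text):
    return FAKE_FR.get(text, text)


print(translate_html('<p class="t">Hello <b>world</b></p>', fake_translate))
```

Because every tag and attribute is written back exactly as it was read, whatever positioning the PDF-to-HTML converter produced is kept. Note that translating each text node separately loses sentence context across inline tags, and French text is often longer than the English it replaces, so some overflow in the regenerated PDF is still likely.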


Basically, you can't directly create a PDF file in a specific format. But you can try writing your data in XHTML format and then converting it to .pdf using xhtml2pdf. Hope this helps with your requirement.
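
A rough sketch of that idea, assuming the translated text is already available as a list of paragraph strings. `build_xhtml` and `write_pdf` are hypothetical helper names, and the output file name is a placeholder. The markup here is deliberately minimal: it places paragraphs one after another and does not reproduce the original page layout, which you would have to rebuild yourself with CSS.

```python
from html import escape


def build_xhtml(paragraphs, title="Document traduit"):
    """Wrap already-translated paragraphs in a minimal XHTML document."""
    body = "\n".join("<p>%s</p>" % escape(p) for p in paragraphs)
    return (
        '<html xmlns="http://www.w3.org/1999/xhtml">\n'
        '<head><meta charset="utf-8"/><title>%s</title></head>\n'
        "<body>\n%s\n</body>\n</html>" % (escape(title), body)
    )


def write_pdf(xhtml, path):
    """Render XHTML to a PDF file; requires the third-party xhtml2pdf package."""
    from xhtml2pdf import pisa  # imported lazily so build_xhtml works without it

    with open(path, "wb") as out:
        status = pisa.CreatePDF(xhtml, dest=out)
    return not status.err  # True when xhtml2pdf reported no errors


doc = build_xhtml(["Bonjour & bienvenue", "Deuxième paragraphe"])
print(doc)
# write_pdf(doc, "any_french.pdf")  # uncomment once xhtml2pdf is installed
```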