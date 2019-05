Convert PDF to text using Calibre (GUI)

It's worth noting that neither of the tools mentioned in this article for extracting text from PDF files can extract the text if the PDF is made of images (for example, scanned book pages or pictures).

Calibre is a free and open source e-book software suite. It supports organizing, displaying, editing, and converting e-books in a wide range of formats. The application runs on Linux, macOS, and Microsoft Windows.

Calibre should be available in your Linux distribution's repositories, and you should be able to install it using whatever software store you have on your system. On Debian, Ubuntu, Linux Mint, Fedora, openSUSE, or Arch Linux, install the calibre package with the distribution's package manager.

Calibre may also be installed on Linux by using the Flathub package (requires setting up Flathub / Flatpak on some Linux distributions). There's yet another way to install Calibre on Linux, explained on the application's downloads page, where you'll also find macOS and Windows binaries.

Now that Calibre is installed on your system, launch it and add the PDF you want to convert to text (or multiple PDFs; Calibre supports batch converting multiple PDF files to text).

There are many options you can tweak in the conversion dialog. For example, you can choose to automatically remove spacing between paragraphs, or insert a blank line between paragraphs. You can also set the character encoding and line ending style (system, unix, windows, old_mac), and even format the output as markdown.

After you're done with the configuration, confirm the dialog to start converting the PDF to text. The converted .txt file can be found in the directory you've set as the Calibre library location, in subfolders named after the author and the book; if the author or book name can't be determined, the subfolder is called "Unknown".

What Calibre lacks in this case is a way to convert only a page or a page range; it can currently only convert entire PDF files to text.

Convert PDF to text using pdftotext (command line)

pdftotext is a command line utility that converts PDF files to plain text.
It has many options, including the ability to specify the page range to convert, maintain the original physical layout of the text as best as possible, set line endings (unix, dos or mac), and even work with password-protected PDF files.

pdftotext is part of the poppler / poppler-utils / poppler-tools package (depending on the Linux distribution you're using). Install this package using your distribution's package manager; in other Linux distributions, look for the poppler / poppler-utils package.

Now that the package is installed, you can convert a PDF file to text by running pdftotext with the -layout option, followed by the name of the PDF file and then the name you want the generated TXT file to have. I recommend using the -layout option for maintaining the original physical layout, but you can try it without it too. Also add the paths before the filenames if needed. If no output text file is specified, pdftotext will give the text file the same name as the original PDF file, with a .txt extension.

To convert only a range of pages, use -f (first page to convert) and -l (last page to convert), each followed by the page number.

You can specify the line ending style too, using -eol followed by unix, dos or mac.

pdftotext doesn't support batch PDF to text conversion (and passing it a wildcard such as *.pdf doesn't work), but you can convert all the PDF files in a folder to text files by using a Bash for loop.

For more options, consult the pdftotext man page and its built-in help.
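
The batch conversion described above could be sketched as the following Bash script. It is a minimal sketch under a few assumptions, not necessarily the exact loop the article originally showed: the helper names txt_name_for and convert_all are hypothetical, and the file names and page numbers in the comments are placeholders.

```shell
#!/usr/bin/env bash
# Sketch: batch-convert every PDF in a folder with pdftotext.
# txt_name_for and convert_all are hypothetical helper names.
#
# Single-file forms discussed in the text (placeholder names and pages):
#   pdftotext -layout input.pdf output.txt     # keep the physical layout
#   pdftotext -layout -f 2 -l 5 input.pdf      # pages 2 to 5 only
#   pdftotext -layout -eol unix input.pdf      # unix line endings

# Derive the output name by swapping a trailing .pdf for .txt.
txt_name_for() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Convert each *.pdf in the given directory (default: current directory).
convert_all() {
    local dir="${1:-.}" file
    for file in "$dir"/*.pdf; do
        # Skip the literal pattern when no PDF matches.
        [ -e "$file" ] || continue
        pdftotext -layout "$file" "$(txt_name_for "$file")"
    done
}

# Run the conversion only when pdftotext is actually installed.
if command -v pdftotext >/dev/null 2>&1; then
    convert_all "${1:-.}"
else
    echo "pdftotext not found; install the poppler utilities first" >&2
fi
```

Saved under a name of your choosing, the script takes the folder to process as an optional first argument. Quoting "$file" keeps file names with spaces intact, and each text file is written next to its source PDF.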