I have a PDF ebook that isn't well formatted. I want to extract all the text into a .tex file and generate a new, nicely formatted PDF document using LaTeX. I've tried doing this two ways:
First
- pdftotext to get a .txt file with all the text
- enscript to generate an RTF file
- unrtf to get a .tex file
This method doesn't really work because I lose all the formatting and paragraph breaks. I basically get one long paragraph in LaTeX.
Second
- pdftohtml generates one big HTML file with the proper formatting in HTML.
- I open the HTML in ooffice and "save as" .txt file
- enscript and unrtf like before to get a .tex file
The problem with this method is that ooffice sometimes refuses to open the whole document, presumably because it's too big. Does anyone know of a better way to get what I want?
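To make the goal concrete, here is a minimal sketch of the kind of conversion I'm after. It skips the RTF step and works directly on the pdftotext output. It assumes (and this may not hold for every PDF) that paragraphs in the .txt file are separated by blank lines and that page breaks show up as form feed characters. The function names `escape_latex` and `text_to_latex` and the choice of the `book` document class are just placeholders for illustration.

```python
import re

# Characters that LaTeX treats specially, mapped to safe replacements.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&",
    "%": r"\%",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
}


def escape_latex(text):
    # Replace every special character in a single pass, so the backslashes
    # and braces introduced by one replacement are not escaped again.
    return re.sub(r"[\\&%$#_{}~^]", lambda m: LATEX_ESCAPES[m.group(0)], text)


def text_to_latex(text):
    # Assumption: form feeds mark page breaks; treat them as paragraph breaks.
    text = text.replace("\f", "\n\n")
    # Assumption: one or more blank lines separate paragraphs.
    paragraphs = re.split(r"\n\s*\n", text)
    body = []
    for para in paragraphs:
        # Join the hard-wrapped lines of a paragraph into a single line.
        joined = " ".join(line.strip() for line in para.splitlines() if line.strip())
        if joined:
            body.append(escape_latex(joined))
    # A blank line between paragraphs is what LaTeX needs to keep them apart.
    return (
        "\\documentclass{book}\n"
        "\\begin{document}\n\n"
        + "\n\n".join(body)
        + "\n\n\\end{document}\n"
    )
```

Reading the pdftotext output into a string, passing it to `text_to_latex`, and writing the result to a .tex file would then keep the paragraph breaks, though headings, italics and other formatting would still be lost.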



