Gentoo Forums
[0916] pdf2htmlEX: converts PDF to HTML w/o losing fmt
coolwanglu
n00b


Joined: 01 Sep 2012
Posts: 6

Posted: Sat Sep 01, 2012 5:10 pm    Post subject: [0916] pdf2htmlEX: converts PDF to HTML w/o losing fmt

[0916 Update]
Added 2 more demo pages:
http://coolwanglu.github.com/pdf2htmlEX/demo/cheat.html
http://coolwanglu.github.com/pdf2htmlEX/demo/geneve.html

* Completely removed Boost
* Relaxed the C++11 dependency; GCC 4.4.6 or later is now supported
* Links are now supported (in-document jumps are accurate to the page)
* Fixed an encoding problem for some fonts
Demo comes first:
http://coolwanglu.github.com/pdf2htmlEX/demo/demo.html

Another (with CJK):
http://coolwanglu.github.com/pdf2htmlEX/demo/chn.html

Home page:
https://github.com/coolwanglu/pdf2htmlEX

There are basically 2 types of pdf-to-html converters:
One is roughly a pdf-to-text converter with a few pre-defined formats in HTML.
The other is a render-everything-as-images converter, which loses all text and generates huge files.

But pdf2htmlEX takes the advantages of both, retaining both text and styling.
Features:
1. Extracts and embeds fonts from the PDF
2. Optimized for the web while keeping the rendering precise
3. Non-text objects are rendered as images
4. Single-file output mode -- I know you hate separated font/image files

To compile & install, you need:
* A recent poppler (>= 0.20.3); make sure '--enable-xpdf-headers' is passed to configure
* The latest git version of fontforge (https://github.com/fontforge/fontforge), because it has a few features/bug fixes I submitted for pdf2htmlEX
* The Boost C++ library (see the project home page for the detailed list of required components)
* cmake
* A GCC that supports C++11
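
For anyone wanting to check the poppler requirement before building, a minimal sketch in Python follows. It compares dotted version strings numerically. The helper names are made up for illustration, and the last function assumes poppler installs a pkg-config file named `poppler`.

```python
import subprocess


def parse_version(text):
    """Turn a dotted version such as '0.20.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def meets_minimum(installed, minimum="0.20.3"):
    """Return True if the installed version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_poppler_version():
    """Ask pkg-config for the poppler version (assumes a 'poppler' .pc file exists)."""
    result = subprocess.run(
        ["pkg-config", "--modversion", "poppler"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Calling `meets_minimum(installed_poppler_version())` then tells you whether the installed poppler is new enough.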

If any of you enjoy this tool and would like to package it for Gentoo, please contact me. Many thanks!

Any suggestions, forks/stars on GitHub, and bug reports are appreciated.


Last edited by coolwanglu on Sun Sep 16, 2012 2:47 pm; edited 1 time in total
avx
Advocate


Joined: 21 Jun 2004
Posts: 2152

Posted: Sun Sep 02, 2012 11:28 am

As promised on the Arch forums, I have now tried it.

As already stated, I was impressed with your demo, and at least for my test document it also worked out pretty well, way better than other tools I've used before.

Problems:
a) It's slow because it only uses a single CPU core. I got:
Code:
pdf2htmlEX test.pdf  185,52s user 1,67s system 99% cpu 3:08,05 total
That was on a Core i7-920, single core in use, with the rest of the machine idle.

b) The files created are rather huge; there's no compression whatsoever. 6.1 MB PDF -> 26.3 MB HTML -> gzip -9 -> 6.3 MB HTML.GZ. Needing about 4x the space is too much. I haven't found a way to open the .html.gz in Opera/Firefox and have it render, though in theory they have the capability; I guess it's because there's no HTTP header involved.

c) While Opera and Firefox load and render the file rather quickly, luakit (webkit) doesn't seem able to handle it; I killed the process after it ran for 2 minutes at 100% CPU (single core) without showing anything. I don't have Chromium on the machine currently; I will build it and test.

d) Scrolling through the resulting document is awfully slow; e.g. mupdf is (not actually measured) at least 3x faster paging through the original PDF.
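
The size figures in b) can be worked out as in the short Python sketch below. `gzip_file` is a hypothetical helper that mirrors `gzip -9` with the standard library (but keeps the original file); it is only there to reproduce the measurement, not something pdf2htmlEX offers.

```python
import gzip
import shutil


def ratio(size_after, size_before):
    """Size of the output relative to the input."""
    return size_after / size_before


def gzip_file(path, level=9):
    """Write path + '.gz' at the given compression level and return its name."""
    out_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(out_path, "wb", compresslevel=level) as dst:
        shutil.copyfileobj(src, dst)
    return out_path


pdf_mb, html_mb, html_gz_mb = 6.1, 26.3, 6.3  # sizes reported in b)
print(f"HTML vs PDF:    {ratio(html_mb, pdf_mb):.1f}x")     # prints 4.3x
print(f"HTML.GZ vs PDF: {ratio(html_gz_mb, pdf_mb):.2f}x")  # prints 1.03x
```

So the uncompressed HTML is a bit over 4x the PDF, while the gzipped HTML is roughly the same size as the PDF.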

Adding to the features I already wished for on the Arch forums: maybe inject a small snippet of JS at the top that allows switching the .css, for example between white-on-black and black-on-white.
_________________
++++++++++[>+++++++>++++++++++>+++>+<<<<-]>++.>+.+++++++..+++.>++.<<+++++++++++++++.>.+++.------.--------.>+.>.
coolwanglu
n00b


Joined: 01 Sep 2012
Posts: 6

Posted: Sun Sep 02, 2012 2:03 pm

Thanks for your review!

a) Yeah, I noticed that; the main reason might be the calls to fontforge.
b) Stream objects in a PDF are often compressed, so maybe it's "fair" to compare a PDF with an html.gz. I also noticed that browsers don't support html.gz, so I didn't implement a gzip feature. But if you publish the HTML on a server, enabling deflate compression there might be enough.
c) I also noticed that, especially on Windows. You may try the parameter "-l 1", which tells pdf2htmlEX to process only the first page, and see if it gets better. But yeah, I need to optimize the generated HTML.
d) I left everything in separate files, so it's easy to change the background color. But for font colors: suppose I change all black to white; what about the other colors? Shall I invert them or leave them alone?
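
On the point in b): a browser only inflates a gzipped page when the response carries a `Content-Encoding: gzip` header, which is why an .html.gz opened from disk does not render. Below is a minimal sketch of a local server that sends that header, using Python's standard `http.server`. The names `headers_for`, `GzipHTMLHandler` and `serve`, and the port, are arbitrary choices for illustration; a real web server would normally do this through its own compression settings.

```python
import http.server
import os


def headers_for(filename):
    """Extra headers for a pre-gzipped HTML file, or an empty dict for anything else."""
    if filename.endswith(".html.gz"):
        return {"Content-Type": "text/html", "Content-Encoding": "gzip"}
    return {}


class GzipHTMLHandler(http.server.SimpleHTTPRequestHandler):
    """Serve *.html.gz as gzip-encoded HTML; everything else is served as usual."""

    def do_GET(self):
        path = self.translate_path(self.path)  # maps the URL to a local file path
        extra = headers_for(path)
        if not extra or not os.path.isfile(path):
            return super().do_GET()
        with open(path, "rb") as f:
            body = f.read()
        self.send_response(200)
        for name, value in extra.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve the current directory on localhost until interrupted."""
    with http.server.ThreadingHTTPServer(("127.0.0.1", port), GzipHTMLHandler) as httpd:
        httpd.serve_forever()
```

With `serve()` running in the directory that holds the file, opening http://127.0.0.1:8000/test.html.gz (file name assumed) should render the page instead of offering a download.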
coolwanglu
n00b


Joined: 01 Sep 2012
Posts: 6

Posted: Sun Sep 02, 2012 3:49 pm

More about the slowness.
Currently fontforge is used to manipulate the fonts: fontforge scripts are generated and executed, which is bound to be slow as there's lots of text parsing.
There is a python-fontforge module, but I'm not sure it would improve the performance much.

I'm now trying to load fontforge.so and call its functions directly. It's somewhat hacky, since fontforge doesn't make those functions public.
I'm also looking for an alternative to fontforge, but haven't found one yet.
avx
Advocate


Joined: 21 Jun 2004
Posts: 2152

Posted: Sun Sep 02, 2012 6:27 pm

No problem, I can wait a little if the result is worth it, which in this case it is.

I haven't looked at the code so far, but would it be possible to render the pages separately and merge them into one file at the end? So one thread for every CPU core, or let the user decide how much to do at once?
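
Roughly, the per-core idea could look like the Python sketch below: split the page range into contiguous chunks and run one converter process per chunk. This is only an illustration. `-l` (last page) is the option mentioned earlier in this thread; `-f` as a first-page option and the positional output file name are assumptions about the command line, and merging the partial outputs into one file is not covered.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def page_chunks(total_pages, workers):
    """Split pages 1..total_pages into at most `workers` contiguous (first, last) ranges."""
    if total_pages < 1:
        return []
    workers = max(1, min(workers, total_pages))
    base, extra = divmod(total_pages, workers)
    chunks, first = [], 1
    for i in range(workers):
        size = base + (1 if i < extra else 0)  # the first `extra` chunks get one more page
        chunks.append((first, first + size - 1))
        first += size
    return chunks


def build_command(pdf, first, last):
    """Command line for one chunk; '-f' and the output argument are assumptions."""
    out = f"part-{first}-{last}.html"
    return ["pdf2htmlEX", "-f", str(first), "-l", str(last), pdf, out]


def convert_parallel(pdf, total_pages, workers=None):
    """Run one process per chunk; the threads only wait for their child process."""
    workers = workers or os.cpu_count() or 1
    commands = [build_command(pdf, first, last)
                for first, last in page_chunks(total_pages, workers)]
    if not commands:
        return []
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode, commands))
```

For a 10-page document and 4 workers, `page_chunks(10, 4)` gives the ranges (1, 3), (4, 6), (7, 8) and (9, 10).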

As for the colors, I'm not really sure how best to handle them; it's just that reading at night with bright backgrounds is tiresome, and it also costs more battery on many mobile devices (though I didn't try whether that's even possible with their limited power).

Update: tried with Chromium from portage, which brings its own version of webkit, and it works; scrolling speed is as bad as in FF/O, but at least it works.
Edit: tried with dwb, built against the same version of webkit as luakit; it works but constantly has a high CPU load even when sitting idle. Scrolling is even slower than in the other browsers.
Edit2: tried again with luakit; after letting it burn the CPU for quite some time, it finally showed up, but scrolling is virtually impossible.
coolwanglu
n00b


Joined: 01 Sep 2012
Posts: 6

Posted: Tue Sep 04, 2012 3:43 am

I'm now linking against fontforge instead of using fontforge text scripts, which improves the performance a lot (well, it depends on how you define 'a lot', but you can feel it's faster).
You can try the latest dev branch in the git repo. You might need to add -lpython2.7 (or your version) if you see lots of 'undefined reference: Py_***' errors, as pkg-config is not configured correctly in fontforge.


Parallelism: yeah, others have asked for it too. Pages seem relatively independent and the PDF is read-only, but there are some dependencies inside pdf2htmlEX. I'll keep this in mind and work on it after finishing a few rendering things, which I think are slightly more important than speed.

Colors: I'll probably look at other PDF viewers and see what logic they use.
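
For the open question of what to do with colors other than black, one simple option is to invert every RGB channel, which maps black to white and white to black. The Python sketch below shows that option only, not necessarily what other PDF viewers do. `invert_hex` and `invert_css_colors` are made-up names, and only 6-digit `#rrggbb` values are handled.

```python
import re


def invert_hex(color):
    """Invert a '#rrggbb' color channel by channel, e.g. '#000000' -> '#ffffff'."""
    value = int(color.lstrip("#"), 16)
    return "#{:06x}".format(0xFFFFFF - value)


def invert_css_colors(css):
    """Invert every 6-digit hex color found in a CSS string."""
    return re.sub(r"#[0-9a-fA-F]{6}\b", lambda m: invert_hex(m.group(0)), css)
```

Plain inversion also flips hues (red becomes cyan), so it may not be what you want for colored figures.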

About the performance issue: how many PDF files have you tried, what kinds (scientific papers, posters or others?), and are they big? Could you please send me a few if they are not private?
All times are GMT