I have a large collection of PDF documents. Finding things in it these days involves a lot of trial and error. I would like to be able to find articles by more than their filename. Ideally, I could build a database and search within the documents themselves. Currently, I serve a copy of these files from my web server (Apache, obviously) using the autoindex module. Is there a relatively easy way to add this search functionality and access it through my website?
I wanted to do something similar a while back, but gave up. You could try extracting the text from each PDF using "pdftotext", and then running a keyword search over the extracted text.
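For what it's worth, here is a rough sketch of that idea in Python, not something I have running in production. It assumes the pdftotext command-line tool is installed and on the PATH; the function names (extract_text, build_index, search, index_directory) are made up for illustration. It builds a simple word-to-filenames index and returns the files that contain every word of a query.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path


def extract_text(pdf_path):
    """Return the text of one PDF via the pdftotext CLI ("-" writes to stdout)."""
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        errors="replace",
        check=False,
    )
    # Scanned or broken PDFs may yield nothing; treat failures as empty text.
    return result.stdout if result.returncode == 0 else ""


def tokenize(text):
    """Lower-case the text and return the set of alphanumeric words in it."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(texts):
    """Map each word to the set of filenames whose text contains it.

    texts is a dict of filename -> extracted text.
    """
    index = defaultdict(set)
    for name, text in texts.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the filenames that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


def index_directory(directory):
    """Extract and index every *.pdf file directly inside directory."""
    texts = {p.name: extract_text(p) for p in Path(directory).glob("*.pdf")}
    return build_index(texts)
```

Something like search(index_directory("/path/to/pdfs"), "some keywords") would then give the matching filenames (the path is a placeholder). A small CGI script could expose the same search through the web server, though scanned PDFs without a text layer will simply come back empty.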
Alternatively, you can give "tellico" a try. I use it to maintain my BibTeX database. It lets you attach metadata such as the abstract and text, all of which is searchable. It's a very nice program.
* kde-misc/tellico
Available versions: 0.12 0.13.1 ~0.13.3 ~0.13.4 0.13.5 ~0.13.6
Installed: 0.13.5
Homepage: http://www.periapsis.org/tellico
Description: A collection manager for the KDE environment
EDIT
It also allows for exporting to XML, HTML etc. It has many useful features.
Let me guess, you picked out yet another colorful box with a crank that I'm expected to turn and turn until OOP! big shock, a jack pops out and you laugh and the kids laugh and the dog laughs and I die a little inside.
Might be a bit of overkill, but have a look at Kat (http://kat.sourceforge.net). It's a desktop search engine (or supposed to be, when it's finished...), but it can index PDF files for fulltext/metadata search.
No ebuild as of yet, though I know there's an old one somewhere on the net. For the little it's worth, here's my adaptation of it for kat-0.5.3:
inherit kde
DESCRIPTION="The open source answer to WhereIsIt and Google Desktop Search"
HOMEPAGE="http://kat.sourceforge.net/"
SRC_URI="mirror://sourceforge/kat/${P}.tar.gz"
I use refbase to manage my literature. I had Endnote libraries which I converted over & have added more manually. I can export BibTeX and Endnote citation files. I had originally thought of full-text PDF searches, but (1) I don't really need it and (2) it is useless for scanned PDFs. A proper literature manager was a good solution for me.
karnesky wrote: I use refbase to manage my literature. I had Endnote libraries which I converted over & have added more manually. I can export BibTeX and Endnote citation files. I had originally thought of full-text PDF searches, but (1) I don't really need it and (2) it is useless for scanned PDFs. A proper literature manager was a good solution for me.
Will refbase import PDFs automagically or do you need to enter each reference manually?
nahpets wrote:Will refbase import PDFs automagically or do you need to enter each reference manually?
How would it import them automagically? Most PDFs don't contain the optional metadata & it is hard to discern the bibliographic information from just the text (assuming it isn't a scanned PDF).
I had a naming scheme for my PDFs (Journal-Vol-Page.pdf) & so filled in the filename when I did a batch import initially. Since the batch import, I have added files manually. I may make a bookmarklet so that it auto-fills from ScienceDirect info, & MODS/Endnote/BibTeX import is being developed.
I looked at refbase, and it is close to what I would like to have: it provides a way to link to the PDF file wherever it may live on a local or remote filesystem. But it seems like more management and overhead than I want - data entry.
As it is, it seems the best way to keep journal PDFs is to give them names like Journal-vXX-pXXXX-keyword1-keyword2-keyword3.pdf and make them all live in a folder that is index-accessible through Apache.
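A naming scheme like that can also be parsed back into fields later, for example to feed a citation manager. Below is a minimal Python sketch under a few assumptions of mine: the journal part contains no hyphens, volume and page are plain digits, and keywords are hyphen-separated. The parse_name function and the example filename are hypothetical.

```python
import re

# Journal-vXX-pXXXX-keyword1-keyword2.pdf, with zero or more keywords.
# Assumes the journal name itself contains no hyphen.
PATTERN = re.compile(
    r"^(?P<journal>[^-]+)-v(?P<volume>\d+)-p(?P<page>\d+)"
    r"(?P<keywords>(?:-[^-]+)*)\.pdf$",
    re.IGNORECASE,
)


def parse_name(filename):
    """Split a filename into journal, volume, page and keywords.

    Returns None when the name does not follow the scheme.
    """
    match = PATTERN.match(filename)
    if match is None:
        return None
    keywords = [k for k in match.group("keywords").split("-") if k]
    return {
        "journal": match.group("journal"),
        "volume": int(match.group("volume")),
        "page": int(match.group("page")),
        "keywords": keywords,
    }


# Hypothetical filename, purely for illustration.
print(parse_name("ActaMater-v52-p1234-diffusion-grain.pdf"))
```

Files that do not match come back as None, so they can be listed and renamed by hand.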
its1louder wrote: I looked at refbase, and it is close to what I would like to have: it provides a way to link to the PDF file wherever it may live on a local or remote filesystem. But it seems like more management and overhead than I want - data entry.
Yeah--data entry sucks. But citation managers are great if you ever have to cite the papers you index.
If you already have or can easily generate a list of references, refbase can import it. That's what I did--massaged Endnote and BibTeX data into the online repository. If you have regular names for your PDFs (as you propose, but perhaps w/out keywords), you'll also be able to link those in. My original Endnote and BibTeX data had been built by myself and others over time, & often it can just be imported from Web of Science or Elsevier or what-not.
With Beagle you can:
- search from a GTK2 GUI client called "best"
- search from the command line
- search from the Beagle webservices thing, where you can use your browser! Enable the webservices USE flag for this.
I currently use Aigaion to organize my citations, basically the PDFs I've been reading. In this LAMP tool, citations can be stored using the BibTeX format of LaTeX. The citations can also be organized in multiple branches of a customizable topic tree. Each citation can be linked to one or more files.
This tool is easy to use and can be used for more than just citations. Other written materials, like books or lecture sheets, can be organized with it as well.
refbase-0.9.0 was just released & data entry sucks a bit less--it can automagically import from any of the common citation file formats (ISI/RIS/Endnote/BibTeX/etc.) & can import by PubMed ID.