Creating searchable pdf database?

pidgas · Post by **pidgas** » Fri May 27, 2005 8:27 pm

I have a large collection of pdf documents. Finding things in the collection these days involves a lot of trial and error. I would like to be able to find articles by more than their filename. Preferably, I could create a database and search within the documents themselves. Currently, I keep a copy of these files available via my web server (apache, obviously) using the autoindex module. Is there a relatively easy way to add this search functionality and access it through my website?

Thanks for the thoughts
Pid

nahpets · Post by **nahpets** » Sat May 28, 2005 7:38 am

I wanted to do something similar a while back, but gave up. You can maybe try extracting the text from each pdf using "pdftotext", and then allowing a keyword search on each PDF.
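For what it's worth, here is a rough sketch of that approach in Python. It assumes pdftotext (from poppler or xpdf) is on the PATH. The function names and the in-memory index are made up for illustration; a real setup would probably keep the index on disk and put a small web form in front of it. Scanned PDFs without a text layer will simply come back empty.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path


def extract_text(pdf_path):
    """Run `pdftotext -enc UTF-8 file.pdf -` and return what it prints to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return result.stdout


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(texts):
    """Map each word to the set of document names containing it.

    `texts` is a dict of {document name: extracted text}.
    """
    index = defaultdict(set)
    for name, text in texts.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every word in the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


def index_directory(directory):
    """Extract and index every *.pdf in a directory (hypothetical helper)."""
    pdfs = sorted(Path(directory).glob("*.pdf"))
    return build_index({p.name: extract_text(p) for p in pdfs})
```

Something like search(index_directory("/var/www/papers"), "grain boundary") would then list the files containing both words; the path is only an example.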

Alternatively, you can give "tellico" a try. I use it to maintain my Bibtex database. It allows you to attach metadata like the abstract and text which is searchable. It's a very nice program.

* kde-misc/tellico
Available versions: 0.12 0.13.1 ~0.13.3 ~0.13.4 0.13.5 ~0.13.6
Installed: 0.13.5
Homepage: http://www.periapsis.org/tellico
Description: A collection manager for the KDE environment

EDIT
It also allows for exporting to XML, HTML etc. It has many useful features.

fnord · Post by **fnord** » Sat May 28, 2005 9:56 am

Might be a bit of an overkill, but have a look at Kat (http://kat.sourceforge.net). It's a desktop search engine (or supposed to be, when it's finished...), but it can index PDF files for fulltext/metadata search.
No ebuild as of yet, though I know there's an old one somewhere on the net, and for the little it's worth here's my adaptation of it for kat-0.5.3:

# Pull in the KDE eclass helpers used below (need-kde, kde_src_compile)
inherit kde

DESCRIPTION="The open source answer to WhereIsIt and Google Desktop Search"
HOMEPAGE="http://kat.sourceforge.net/"
SRC_URI="mirror://sourceforge/kat/${P}.tar.gz"

LICENSE="GPL-2"
SLOT="0"
KEYWORDS="~x86"

IUSE="pdflib"

# poppler is only pulled in when the pdflib USE flag is set
DEPEND=">=dev-db/sqlite-3.2.0
	pdflib? ( app-text/poppler )"

need-kde 3.3

src_compile() {
	# Use the prefix reported by kde-config
	PREFIX="$(kde-config --prefix)"
	kde_src_compile
}

karnesky · Post by **karnesky** » Mon Jun 13, 2005 9:05 pm

I use refbase to manage my literature. I had Endnote libraries which I converted over & have added more manually. I can export bibtex and endnote citation files. I had originally thought of full-text pdf searches, but (1) I don't really need it and (2) it is useless for scanned PDFs. A proper literature manager was a good solution for me.

nahpets · Post by **nahpets** » Tue Jun 14, 2005 11:17 am

karnesky wrote: I use refbase to manage my literature. I had Endnote libraries which I converted over & have added more manually. I can export bibtex and endnote citation files. I had originally thought of full-text pdf searches, but (1) I don't really need it and (2) it is useless for scanned PDFs. A proper literature manager was a good solution for me.

Will refbase import PDFs automagically or do you need to enter each reference manually?

karnesky · Post by **karnesky** » Tue Jun 14, 2005 3:31 pm

nahpets wrote: Will refbase import PDFs automagically or do you need to enter each reference manually?

How would it import them automagically? Most PDFs don't contain the optional metadata & it is hard to discern the bibliographic information from just the text (assuming it isn't a scanned PDF).

I had a naming scheme for my PDFs (Journal-Vol-Page.pdf) & so filled in the filename when I did a batch import initially. Since the batch import, I have added files manually. I may make a bookmarklet so that it auto-fills from ScienceDirect info & MODS/endnote/bibtex import are being developed.

its1louder · Post by **its1louder** » Thu Jul 07, 2005 8:18 pm

I looked at refbase, and it is close to what I would like to have: it provides a way to link to the PDF file wherever it may live on a local or remote filesystem. But it seems like more management and overhead than I want - data entry.

As it is, it seems the best way to keep journal pdfs is to give them names like Journal-vXX-pXXXX-keyword1-keyword2-keyword3.pdf and make them all live in a folder that is index-accessible through apache.
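A rough sketch of how names like that could be picked apart for searching, assuming the journal part itself contains no hyphens. The Article type and the function names are just made up for illustration.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Journal-vXX-pXXXX-keyword1-keyword2-....pdf (keywords are optional)
PATTERN = re.compile(
    r"^(?P<journal>[^-]+)-v(?P<volume>\d+)-p(?P<page>\d+)"
    r"(?P<keywords>(?:-[^-]+)*)\.pdf$",
    re.IGNORECASE,
)


class Article(NamedTuple):
    journal: str
    volume: int
    page: int
    keywords: Tuple[str, ...]


def parse_filename(name: str) -> Optional[Article]:
    """Split a filename into its fields, or return None if it doesn't fit the scheme."""
    match = PATTERN.match(name)
    if match is None:
        return None
    keywords = tuple(k.lower() for k in match.group("keywords").split("-") if k)
    return Article(
        match.group("journal"),
        int(match.group("volume")),
        int(match.group("page")),
        keywords,
    )


def find_by_keyword(names, keyword):
    """Return the filenames whose keyword list contains `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    found = []
    for name in names:
        article = parse_filename(name)
        if article is not None and keyword in article.keywords:
            found.append(name)
    return found
```

Fed with the output of os.listdir() on the shared folder, find_by_keyword gives a crude keyword search without any data entry beyond the naming itself.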

karnesky · Post by **karnesky** » Fri Jul 08, 2005 2:33 am

its1louder wrote: I looked at refbase, and it is close to what I would like to have: it provides a way to link to the PDF file wherever it may live on a local or remote filesystem. But it seems like more management and overhead than I want - data entry.

Yeah--data entry sucks. But citation managers are great if you ever have to cite the papers you index.

If you already have or can easily generate a list of references, refbase can import it. That's what I did--massaged Endnote and bibtex data into the online repository. If you have regular names for your PDFs (as you propose, but perhaps w/out keywords), you'll also be able to link those in. My original Endnote and bibtex data had been built by myself and others over time & often it can just be imported from Web of Science or Elsevier or what-not.

asiobob · Post by **asiobob** » Wed Dec 28, 2005 8:55 am

beagle supports PDF searching.

With beagle you can:
- search from a GTK2 GUI client called "best"
- search from the command line
- search from the beagle webservices thing, where you can use your browser! Use the webservices USE flag for this

www.gnome.org/projects/beagle

wim-x · Post by **wim-x** » Wed Jun 28, 2006 7:33 pm

I currently use Aigaion to organize my citations, basically the PDFs I've been reading. In this LAMP tool, citations can be stored using the BibTeX format of LaTeX. Also, the citations can be organized in multiple branches of a customizable topic tree. Each citation can be linked to one or more files.

This tool is easy to use and can be used for more than just citations. Other written materials, like books or lecture sheets, can be organized with it as well.

karnesky · Post by **karnesky** » Thu Oct 26, 2006 3:43 pm

karnesky wrote: Yeah--data entry sucks.

refbase-0.9.0 was just released & data entry sucks a bit less--it can automagically import from any of the common citation file formats (ISI/RIS/Endnote/BibTeX/etc.) & can import by PubMed ID.
