Creating searchable pdf database?

pidgas
n00b
Posts: 22
Joined: Tue Oct 05, 2004 1:33 pm

Post by pidgas » Fri May 27, 2005 8:27 pm

I have a large collection of PDF documents. Finding things in the collection these days involves a lot of trial and error. I would like to be able to find articles by more than their filename. Ideally, I could create a database and search within the documents themselves. Currently, I keep a copy of these files available via my web server (Apache, obviously) using the autoindex module. Is there a relatively easy way to add this search functionality and access it through my website?

Thanks for the thoughts
Pid

nahpets
Veteran
Posts: 1178
Joined: Sun Oct 05, 2003 11:18 pm
Location: Montreal, Canada

Post by nahpets » Sat May 28, 2005 7:38 am

I wanted to do something similar a while back, but gave up. You could try extracting the text from each PDF with "pdftotext" and then running a keyword search over the extracted text.
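
A minimal sketch of that idea in Python, assuming the pdftotext command-line tool is installed and on the PATH. The function names (pdf_to_text, build_index, search, index_directory) are made up for illustration; this is one possible approach, not something any tool mentioned in this thread ships.

```python
import re
import subprocess
from collections import defaultdict
from pathlib import Path


def pdf_to_text(pdf_path):
    """Return the text of one PDF by calling the pdftotext command-line tool."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(texts):
    """Map each word to the set of filenames whose text contains it.

    texts is a dict of filename -> extracted text.
    """
    index = defaultdict(set)
    for name, text in texts.items():
        for word in tokenize(text):
            index[word].add(name)
    return index


def search(index, query):
    """Return the sorted filenames that contain every word of the query."""
    words = tokenize(query)
    if not words:
        return []
    hits = set(index.get(words[0], set()))
    for word in words[1:]:
        hits &= index.get(word, set())
    return sorted(hits)


def index_directory(pdf_dir):
    """Extract and index every *.pdf file directly inside pdf_dir."""
    texts = {p.name: pdf_to_text(p) for p in sorted(Path(pdf_dir).glob("*.pdf"))}
    return build_index(texts)
```

Building the index once with index_directory() on the directory Apache already serves, and then calling search() from a small web script, would be one way to expose this through a website. PDFs without a text layer (scans) produce no words and therefore never match.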

Alternatively, you can give "tellico" a try. I use it to maintain my BibTeX database. It lets you attach metadata such as the abstract and text, which are then searchable. It's a very nice program.
* kde-misc/tellico
Available versions: 0.12 0.13.1 ~0.13.3 ~0.13.4 0.13.5 ~0.13.6
Installed: 0.13.5
Homepage: http://www.periapsis.org/tellico
Description: A collection manager for the KDE environment
EDIT
It also allows for exporting to XML, HTML etc. It has many useful features.
Let me guess, you picked out yet another colorful box with a crank that I'm expected to turn and turn until OOP! big shock, a jack pops out and you laugh and the kids laugh and the dog laughs and I die a little inside.

fnord
n00b
Posts: 15
Joined: Wed Aug 21, 2002 7:33 pm
Location: Israel

Kat?

Post by fnord » Sat May 28, 2005 9:56 am

Might be a bit of an overkill, but have a look at Kat (http://kat.sourceforge.net). It's a desktop search engine (or supposed to be, when it's finished...), but it can index PDF files for full-text/metadata search.
There is no ebuild as of yet, though I know there's an old one somewhere on the net. For the little it's worth, here's my adaptation of it for kat-0.5.3:

inherit kde

DESCRIPTION="The open source answer to WhereIsIt and Google Desktop Search"
HOMEPAGE="http://kat.sourceforge.net/"
SRC_URI="mirror://sourceforge/kat/${P}.tar.gz"

LICENSE="GPL-2"
SLOT="0"
KEYWORDS="~x86"

IUSE="pdflib"

# Portage needs whitespace inside the parentheses of a USE-conditional group.
DEPEND=">=dev-db/sqlite-3.2.0
	pdflib? ( app-text/poppler )"

need-kde 3.3

src_compile() {
	# Install under the same prefix as the existing KDE installation.
	PREFIX="$(kde-config --prefix)"
	kde_src_compile
}

karnesky
Apprentice
Posts: 218
Joined: Thu Mar 18, 2004 9:07 pm

Post by karnesky » Mon Jun 13, 2005 9:05 pm

I use refbase to manage my literature. I had Endnote libraries which I converted over & have added more manually. I can export BibTeX and Endnote citation files. I had originally thought of full-text PDF searches, but (1) I don't really need it and (2) it is useless for scanned PDFs. A proper literature manager was a good solution for me.
Donate to F/OSS

nahpets
Veteran
Posts: 1178
Joined: Sun Oct 05, 2003 11:18 pm
Location: Montreal, Canada

Post by nahpets » Tue Jun 14, 2005 11:17 am

karnesky wrote: I use refbase to manage my literature. I had Endnote libraries which I converted over & have added more manually. I can export BibTeX and Endnote citation files. I had originally thought of full-text PDF searches, but (1) I don't really need it and (2) it is useless for scanned PDFs. A proper literature manager was a good solution for me.
Will refbase import PDFs automagically or do you need to enter each reference manually?

karnesky
Apprentice
Posts: 218
Joined: Thu Mar 18, 2004 9:07 pm

Post by karnesky » Tue Jun 14, 2005 3:31 pm

nahpets wrote: Will refbase import PDFs automagically or do you need to enter each reference manually?
How would it import them automagically? Most PDFs don't contain the optional metadata & it is hard to discern the bibliographic information from just the text (assuming it isn't a scanned PDF).

I had a naming scheme for my PDFs (Journal-Vol-Page.pdf), so I filled in the filename when I did the initial batch import. Since then, I have added files manually. I may make a bookmarklet that auto-fills from ScienceDirect info, and MODS/Endnote/BibTeX import is being developed.

its1louder
Tux's lil' helper
Posts: 75
Joined: Thu Jul 03, 2003 5:55 am
Location: Santa Barbara CA

Post by its1louder » Thu Jul 07, 2005 8:18 pm

I looked at refbase, and it is close to what I would like to have: it provides a way to link to the PDF file wherever it may live on a local or remote filesystem. But it seems like more management and overhead than I want, namely data entry.

As it is, it seems the best way to keep journal PDFs is to give them names like Journal-vXX-pXXXX-keyword1-keyword2-keyword3.pdf and make them all live in a folder that is index-accessible through Apache.
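
Names in that scheme can be searched without any database. A rough Python sketch, assuming the exact pattern Journal-vXX-pXXXX-keyword1-keyword2.pdf with no hyphens inside the journal name or the keywords; parse_name and find_by_keyword are hypothetical helpers invented for this example.

```python
import re

# Assumed pattern: <journal>-v<volume>-p<page>[-keyword...].pdf
NAME_RE = re.compile(
    r"^(?P<journal>[^-]+)-v(?P<volume>\d+)-p(?P<page>\d+)"
    r"(?P<keywords>(?:-[^-]+)*)\.pdf$",
    re.IGNORECASE,
)


def parse_name(filename):
    """Split a filename into journal, volume, page and keywords.

    Returns None when the name does not follow the assumed pattern.
    Volume and page stay strings so leading zeros are preserved.
    """
    match = NAME_RE.match(filename)
    if match is None:
        return None
    keywords = [k.lower() for k in match.group("keywords").split("-") if k]
    return {
        "journal": match.group("journal"),
        "volume": match.group("volume"),
        "page": match.group("page"),
        "keywords": keywords,
    }


def find_by_keyword(filenames, keyword):
    """Return the filenames whose keyword list contains keyword (case-insensitive)."""
    keyword = keyword.lower()
    hits = []
    for name in filenames:
        info = parse_name(name)
        if info is not None and keyword in info["keywords"]:
            hits.append(name)
    return hits
```

Feeding find_by_keyword() the output of os.listdir() on the indexed folder would give a crude keyword lookup. Files that do not match the pattern are skipped rather than guessed at, so a journal name containing a hyphen would need a different separator.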
These go to eleven.

karnesky
Apprentice
Posts: 218
Joined: Thu Mar 18, 2004 9:07 pm

Post by karnesky » Fri Jul 08, 2005 2:33 am

its1louder wrote: I looked at refbase, and it is close to what I would like to have: it provides a way to link to the PDF file wherever it may live on a local or remote filesystem. But it seems like more management and overhead than I want, namely data entry.
Yeah--data entry sucks. But citation managers are great if you ever have to cite the papers you index.

If you already have or can easily generate a list of references, refbase can import it. That's what I did--massaged Endnote and BibTeX data into the online repository. If you have regular names for your PDFs (as you propose, but perhaps w/out keywords), you'll also be able to link those in. My original Endnote and BibTeX data had been built by myself and others over time & often it can just be imported from Web of Science or Elsevier or what-not.

asiobob
Veteran
Posts: 1375
Joined: Wed Oct 29, 2003 8:13 am
Location: Bamboo Creek

Post by asiobob » Wed Dec 28, 2005 8:55 am

beagle supports PDF searching.

With beagle you can:
- search from a GTK2 GUI client called "best"
- search from the command line
- search from the beagle webservices interface, where you can use your browser (enable the webservices USE flag for this)

www.gnome.org/projects/beagle

wim-x
Tux's lil' helper
Posts: 110
Joined: Fri Nov 26, 2004 3:08 pm
Location: Netherlands

Aigaion

Post by wim-x » Wed Jun 28, 2006 7:33 pm

I currently use Aigaion to organize my citations, basically the PDFs I've been reading. In this LAMP tool, citations can be stored using the BibTeX format of LaTeX. The citations can also be organized in multiple branches of a customizable topic tree, and each citation can be linked to one or more files.

This tool is easy to use and is good for more than just citations: other written materials like books or lecture sheets can be organized with it too.

karnesky
Apprentice
Posts: 218
Joined: Thu Mar 18, 2004 9:07 pm

Post by karnesky » Thu Oct 26, 2006 3:43 pm

karnesky wrote: Yeah--data entry sucks.
refbase-0.9.0 was just released & data entry sucks a bit less--it can automagically import from any of the common citation file formats (ISI/RIS/Endnote/BibTeX/etc.) & can import by PubMed ID.