indexing PDF files?

Pietrzak Karl -- on Jul. 25 2006
Interested in indexing PDF, Microsoft Word, and similar files
Hey everyone!

You may remember me as the guy who posted a few weeks ago because he couldn't get YACS to work; the solution turned out to be increasing the PHP memory limit in php.ini.

Anyways, the YACS installation for my university is coming along beautifully. We're constantly being amazed by the feature set and how we can use it (for example, the per-section configuration options are wonderful for different departments of a college!).

Anyways, I am interested in indexing Microsoft Word, PDF, and other types of files. Other CMSes have this functionality, and I'm wondering whether YACS has it and I just haven't found the documentation, or whether it simply doesn't have it.

I also wanted to say that as a software developer, with the proper help (e.g., finish these files) I can help write the code to make this happen! What do you think, Bernard?

Thanks, and take care!
GnapZ
from Caribbean
2970 posts

on Jul. 29 2006


I'm sorry, but I don't understand what you mean by indexing files... If it is not about the database, I think you want to sort files by type of document and not only by date or name.

If you can develop this feature, it would be nice to add it to the next YACS versions.

Thanks.
GregL
43 posts

inspired from GnapZ on Jul. 29 2006


GnapZ: I think he means indexing PDF files and their content in the search engine. I don't think that is possible yet.
TheAlchemist
19 posts

inspired from GregL on Jul. 29 2006


GregL:

You're exactly right. Thanks for clearing up what I originally said.

In other words, it would be great if YACS looked -inside- PDF files, Microsoft Word files, etc. instead of just the fields of a page.

I don't think this is technically too difficult (pdf2txt for PDF files, antiword for Microsoft Word files, etc.), but I have never done anything of this sort.
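
A minimal sketch of that idea, written in Python rather than the PHP that YACS itself uses: pick a command-line converter by file extension, run it, and hand whatever text it prints to the search index. The command names and arguments are assumptions (pdftotext from Xpdf/Poppler stands in for pdf2txt here, and antiword is called with just the file name), and EXTRACTORS, build_command and extract_text are hypothetical names, not part of YACS.

```python
import shutil
import subprocess
from pathlib import Path

# Assumed mapping from file extension to an external converter command.
# "{path}" is replaced by the file to convert; both tools print text to stdout.
EXTRACTORS = {
    ".pdf": ["pdftotext", "{path}", "-"],  # "-" asks pdftotext to write to stdout
    ".doc": ["antiword", "{path}"],
}


def build_command(path, extractors=EXTRACTORS):
    """Return the converter command for this file, or None if the type is unknown."""
    template = extractors.get(Path(path).suffix.lower())
    if template is None:
        return None
    return [part.replace("{path}", str(path)) for part in template]


def extract_text(path, extractors=EXTRACTORS):
    """Return searchable text for a file, or "" when nothing can be extracted."""
    command = build_command(path, extractors)
    if command is None or shutil.which(command[0]) is None:
        return ""  # unsupported type, or the converter is not installed
    try:
        result = subprocess.run(command, capture_output=True, timeout=60)
    except (OSError, subprocess.TimeoutExpired):
        return ""  # converter could not be started or took too long
    if result.returncode != 0:
        return ""  # converter reported a failure (e.g. a corrupt file)
    return result.stdout.decode("utf-8", errors="replace")
```

Returning an empty string for unknown types or missing tools means an upload would still go through; the file just would not be searchable by its content.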

Thanks all!
GnapZ
from Caribbean
2970 posts

inspired from TheAlchemist on Jul. 29 2006


TheAlchemist: OK, I see now what you mean. This could be a new feature for the next versions of YACS. Let's wait to hear what Bernard thinks about this.
Bernard
from nearby-an-airport
Associate, 6732 posts

inspired from TheAlchemist on Jul. 31 2006


TheAlchemist: Why do you always come up with such difficult questions, guy? Anyway, you are right, and YACS deserves a better indexing scheme for files.

Actually, if you look carefully at line 437 of files/edit.php, you will see a proud comment flagging the place to add such a thing...

I know that some tools are able to extract searchable text from binary files, but I have not tried them yet. So, if you want to start something in this field, please proceed...

Thank you very much for your interest in YACS. Glad to know that it has proven quite useful to your university.
TheAlchemist
19 posts

on Aug. 1 2006


" TheAlchemist: Why do you always come up with such difficult questions, guy? Anyway, you are right, and YACS deserves a better indexing scheme for files. "


Ah, it's my specialty.

" Actually, if you look carefully at line 437 of files/edit.php, you will see a proud comment flagging the place to add such a thing... "


Yup, I see it...

" I know that some tools are able to extract searchable text from binary files, but I have not tried them yet. So, if you want to start something in this field, please proceed... "


Well, first I will be working on LDAP support for YACS, because this is crucial for our needs.

I'll work on extracting text from binary files if I have the time. We'll see.

" Thank you very much for your interest in YACS. Glad to know that it has proven quite useful to your university. "


Of course! I think the university would allow linking to the page once it gets officially deployed. I can't wait to see this CMS a few years from now, when it will (hopefully) hold hundreds of documents.
Bernard
from nearby-an-airport
Associate, 6732 posts

inspired from TheAlchemist on Aug. 1 2006


TheAlchemist: Hundreds, are you kidding? I know some sites with thousands of files. YACS is a serious CMS, ya know...
TheAlchemist
19 posts

inspired from Bernard on Aug. 1 2006


" TheAlchemist: Hundreds, are you kidding? I know some sites with thousands of files. YACS is a serious CMS, ya know... "


Oh, I'm sorry! I misspoke. By "this CMS" I meant "the CMS of my university". Sorry about that. I didn't mean to put YACS down. The existence of large-scale YACS installations was one of the reasons we decided to use it in the first place.
Bernard
from nearby-an-airport
Associate, 6732 posts

inspired from TheAlchemist on Aug. 1 2006


TheAlchemist: YACS already supports XML-RPC for remote authentication, so I suppose you can reuse part of the code for the LDAP stuff... Everything is in function login() of users/users.php. Also, new authentication parameters for LDAP could be handled in the configuration panel for users, at users/configure.php. Good luck and let the Force be with you!

Posted by TheAlchemist on Jul. 25 2006, edited by Bernard on Jul. 25 2006, (popular)