How can I extract metadata from 10,000 PDF documents and store it in my database?

I'm currently building a document-sharing platform, and to attract as many users as possible, I want to pre-populate it with 10,000 documents. The documents are all PDF files. I'm working with Symfony2, but I guess that doesn't change the problem much: how can I automatically extract the metadata I need from these documents (for example, the title and the first 100 words for the description) and insert it into my database? (In my case that means hydrating my entities, but I know how to do that part.)

I guess a crawler is what I'm looking for, but I have no idea where to find one or how to make it work.

Thanks in advance!

Well, since you don't have a real question yet:

  • define which document types/formats you allow
  • search for how to read each document type with PHP (built-in functions, libraries, code snippets)
  • determine the file type of each uploaded document
  • read the files in PHP using the functions and libraries you found
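
For the "determine the file type" step, a minimal sketch could check the file's content rather than trusting its extension. It assumes that a well-formed PDF starts with the signature `%PDF-` (some tools tolerate leading junk bytes, which this check would reject). The function name `isPdfFile` is made up for illustration.

```php
<?php
// Sketch only: decide whether a file looks like a PDF by reading its first
// bytes. isPdfFile() is a hypothetical helper, not part of PHP or Symfony.

function isPdfFile($path)
{
    if (!is_file($path) || !is_readable($path)) {
        return false;
    }

    $handle = fopen($path, 'rb');
    if ($handle === false) {
        return false;
    }

    // Assumption: a valid PDF begins with the five bytes "%PDF-".
    $header = fread($handle, 5);
    fclose($handle);

    return $header === '%PDF-';
}
```

If the fileinfo extension is available, comparing the MIME type reported by `finfo_file()` against `application/pdf` would be an alternative way to do the same check.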

When you have done all this and run into a specific problem, ask a real question ;)
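
For the PDF-only case in the question, one possible way to do the "read the files" step is to call the poppler-utils command-line tools `pdfinfo` and `pdftotext` from PHP. This is a rough sketch, not a tested import script. It assumes both tools are installed and on the PATH, that the server has a Unix-like shell, and that `shell_exec()` is enabled. The function names (`parsePdfInfo`, `firstWords`, `extractPdfMetadata`) are hypothetical, and the three-page limit is an arbitrary choice.

```php
<?php
// Sketch only: get a title and a 100-word description from one PDF.
// Assumes the poppler-utils tools `pdfinfo` and `pdftotext` are installed.

// Turn the "Key:   value" lines printed by pdfinfo into an array.
function parsePdfInfo($output)
{
    $info = array();
    foreach (preg_split('/\R/', $output) as $line) {
        // Split on the first colon only, so values such as dates stay intact.
        $parts = explode(':', $line, 2);
        if (count($parts) === 2 && trim($parts[0]) !== '') {
            $info[trim($parts[0])] = trim($parts[1]);
        }
    }
    return $info;
}

// Return the first $limit whitespace-separated words of $text.
function firstWords($text, $limit = 100)
{
    $words = preg_split('/\s+/u', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    if ($words === false) {
        // preg_split() fails on invalid UTF-8; give up on the description.
        return '';
    }
    return implode(' ', array_slice($words, 0, $limit));
}

// Run both tools on one file and return the fields to store.
function extractPdfMetadata($path)
{
    $arg  = escapeshellarg($path);
    $info = parsePdfInfo((string) shell_exec("pdfinfo $arg 2>/dev/null"));
    // "-l 3" stops after page 3; the final "-" sends the text to stdout.
    $text = (string) shell_exec("pdftotext -l 3 -enc UTF-8 $arg - 2>/dev/null");

    // Many PDFs have no Title field, so fall back to the file name.
    $hasTitle = isset($info['Title']) && $info['Title'] !== '';

    return array(
        'title'       => $hasTitle ? $info['Title'] : pathinfo($path, PATHINFO_FILENAME),
        'description' => firstWords($text, 100),
    );
}
```

Calling `extractPdfMetadata()` in a loop over the files (for example from a console command) would give one array per document, which can then be copied into an entity and persisted. Scanned PDFs without a text layer would produce an empty description with this approach.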