Maintain formatting while reading PDF document
Hello, while reading a PDF document, I want to keep the formatting as it is: bold text should stay bold and italic text should stay italic. I tried this with extractFileText, but without success. How can this be done? Thanks.
Answers (2)
akshatsood
on 15 Jan 2024
I understand that you want to maintain formatting while reading a PDF document. To extract text from a PDF document while preserving formatting such as bold and italic, you would typically need a more advanced PDF processing tool or library that supports rich text extraction. MATLAB's built-in "extractFileText" function does not preserve text formatting, as it is designed to extract plain text.
I hope this helps.
Hassaan
on 15 Jan 2024
MATLAB itself does not provide built-in functions that directly extract formatted text from PDFs, because this requires interpreting the PDF content stream, which can be quite complex due to the nature of PDF formatting. If you want to preserve the formatting, these are the main options:
- External Tools: Use an external tool designed for PDF text extraction that preserves formatting. There are several tools available that can extract text with formatting from PDFs, such as Adobe Acrobat's SDK or other third-party libraries. You can call these tools from MATLAB using the system function or other interfacing methods depending on the tool.
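- PDF to Word: Convert the PDF to a Word document (which preserves formatting) using an external tool or online service, and then read the Word document. Note that extractFileText from the Text Analytics Toolbox returns plain text for Word files as well, so the formatting has to be read by other means, for example from the XML inside the .docx file.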
- Manual Inspection: If you only have a few documents and you're looking for specific formatted text, you might manually inspect the PDF file for the markup of bold and italic text. However, this is not practical for large-scale or automated extraction.
- Custom Scripting with Other Programming Languages: Use a scripting language that has libraries for PDF manipulation (like Python with PyPDF2 or PDFMiner) to extract the text while preserving formatting, and then pass the extracted content to MATLAB if needed.
- Optical Character Recognition (OCR): Use OCR tools that can recognize and preserve text formatting. MATLAB has an ocr function (Computer Vision Toolbox) that can recognize text in images, but it won't retain text formatting. You would need to use a more advanced OCR tool for formatted text extraction.
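For the external-tools option, the call from MATLAB has this general shape:
% status is 0 if the command succeeded; cmdout holds whatever the tool printed
[status, cmdout] = system('command-to-extract-formatted-text-from-pdf');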
Remember to replace 'command-to-extract-formatted-text-from-pdf' with the actual command that invokes your PDF text extraction tool.
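For the PDF-to-Word option, the formatting can be read from the converted file without any Word-specific library, because a .docx file is a ZIP archive whose main text lives in word/document.xml. The following is only a minimal Python sketch under that assumption: the function names docx_runs and _is_on are made up for illustration, and it only looks at bold/italic set directly on a run, not at formatting inherited from styles.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def _is_on(run_props, tag):
    """True if a toggle such as <w:b/> is present and not switched off."""
    if run_props is None:
        return False
    el = run_props.find(W + tag)
    if el is None:
        return False
    # <w:b/> means "on"; <w:b w:val="0"/> (or "false"/"off") means "off"
    return el.get(W + "val", "true") not in ("0", "false", "off")


def docx_runs(docx_file):
    """Return a list of (text, bold, italic) tuples, one per text run.

    docx_file may be a path or a binary file-like object.
    Only direct run formatting is inspected; bold or italic that comes
    from a paragraph or character style is not resolved here.
    """
    with zipfile.ZipFile(docx_file) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    runs = []
    for run in root.iter(W + "r"):
        text = "".join(t.text or "" for t in run.iter(W + "t"))
        if not text:
            continue  # skip runs that carry no text (e.g. images, breaks)
        props = run.find(W + "rPr")
        runs.append((text, _is_on(props, "b"), _is_on(props, "i")))
    return runs
```

Calling docx_runs on the converted file gives one tuple per run, which could be written to a CSV or JSON file and then loaded into MATLAB. How well this works depends entirely on how faithfully the converter maps the PDF fonts to Word run properties.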
For advanced document processing needs that go beyond what MATLAB directly supports, it's usually more effective to use a combination of tools, possibly involving other programming environments that have more specialized libraries for handling PDFs.
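For the scripting option, it helps to know that a PDF usually does not store "bold" or "italic" as flags. The text is simply drawn in a font whose name tends to say so (for example Times-Bold or Helvetica-Oblique). Libraries that expose the font name per character (pdfminer.six does this, as far as I know, through the fontname attribute of its LTChar objects) therefore let you infer the style. The sketch below shows only that inference step in plain Python. The function names and the keyword list are assumptions for illustration, and the input is a hypothetical list of (character, fontname) pairs that you would collect from whichever PDF library you use.

```python
import re
from itertools import groupby

# Subset tag that PDF producers put in front of embedded font names,
# e.g. "ABCDEF+TimesNewRomanPS-BoldMT".
_SUBSET_PREFIX = re.compile(r"^[A-Z]{6}\+")


def style_from_fontname(fontname):
    """Guess (bold, italic) from a PDF font name.

    Heuristic only: fonts with unusual names (or abbreviations such as
    "-It") are not recognised by this keyword check.
    """
    name = _SUBSET_PREFIX.sub("", fontname).lower()
    bold = "bold" in name
    italic = "italic" in name or "oblique" in name
    return bold, italic


def to_markup(chars):
    """Turn (character, fontname) pairs into text with ** and * markers.

    Consecutive characters with the same guessed style are grouped into
    one run; bold runs are wrapped in **, italic runs in *.
    """
    out = []
    for (bold, italic), group in groupby(
        chars, key=lambda c: style_from_fontname(c[1])
    ):
        text = "".join(ch for ch, _ in group)
        marker = ("**" if bold else "") + ("*" if italic else "")
        out.append(f"{marker}{text}{marker}")
    return "".join(out)
```

The marked-up string can then be passed back to MATLAB, for instance by printing it so that it ends up in cmdout of the system call. Treat the keyword matching as a starting point; real documents may need a lookup table for their particular fonts.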
3 Comments
Christopher Creutzig
on 19 Jan 2024
Just for clarification, extractFileText already handles more than 90% of the complexity of parsing the PDF stream you mentioned. The reason it does not return information about font names, bold/italic/roman, position on the page, etc. is that it is designed to read text for subsequent use in text analytics workflows.
Most of that information is, after all, already used internally to arrange the text found correctly before returning a string.