Extracting text from any file is harder than it looks. Extracting formatting is even harder.
We take for granted document processing on an individual scale: double-click the file (or use a simple command-line phrase) and the contents of the file display. But it gets more complicated at scale. Imagine you’re a recruiter searching resumes for keywords or a paralegal looking for names in thousands of pages of discovery documents. The formats, versions, and platforms that generated them could be wildly different. The challenge is even greater when it’s time sensitive, for example if you have to scan all outgoing emails for personally identifiable information (PII) leakages, or you have to give patients a single file that contains all of their disclosure agreements, scanned documents, and MRI/X-ray/test reports, regardless of the original file format.
At Hyland we produce a document processing toolkit that independent software vendors can implement to identify files, extract text, render file content, convert formats, and annotate documents in over 550 formats. These are Document Filters, and any software that interacts with documents will need Document Filters.
One library for 550 formats may seem like overkill, but imagine stringing together dozens of open source libraries, testing each of these libraries each time a new release hits the wild. We give you one dependency, one point of contact if something goes wrong, and one library to deploy instead of dozens.
We started as a company that sold desktop search software called ISYS. The application was built in Pascal for MS-DOS and provided mainframe-level search on PCs. Eventually, other companies, such as Microsoft and Google, started providing desktop search applications for free, and it’s tough to compete with free.
This led us to realize that the sum of the parts was greater than the whole; getting text out of files and delivering the exact location is harder than it seems and relevant to applications other than search. Our customers noticed our strength in text extraction and wanted that as something they could integrate or embed in their software and across multiple platforms.
Realizing that Pascal was not going to meet our needs, we pivoted our engineers to rebuilding the app in C++ over the next year for about half a dozen computing platforms. Since then, we’ve learned a lot about content processing at scale and how to make it work on any platform.
On any platform
When we rewrote our software, one of the key factors was platform support. At the time, Pascal only supported Windows, and while it now supports Mac and Linux, it was and still is a niche language. That wasn’t going to work for us, as we wanted to support big backend processing servers like Solaris and HP-UX. We considered writing it in C, but we would have had to invent a lot of the boilerplate that C++ gave us for free.
We were able to port about 80% of the code from Pascal. The other 20% was new OS abstractions, primarily to replace the Windows API functions we lose on other platforms and to handle the various quirks of each platform. Each compiler makes different assumptions about how to implement C++ code, so we use multiple compilers to see what those assumptions are.
Complicating things was that we not only had to consider operating systems, but CPUs as well. Different CPUs store the bytes of a multi-byte value in different orders, a property called endianness. Intel chips, and ARM chips in their usual configuration, are little-endian, where the least significant byte is stored first. SPARC chips historically used in Solaris machines are big-endian, where the most significant byte is stored first. When you’re reading from a file, you need to know which byte order was used to write it, otherwise you could read things backwards. We make sure to abstract this away so no one needs to figure out the originating chipset before processing a file.
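To make the byte-order problem concrete, here is a minimal Python sketch. It is an illustration only, not Document Filters code, and the helper name `read_u32` is hypothetical. The same four bytes decode to two different numbers depending on the byte order the reader assumes, which is why the order has to come from the file format rather than from the machine doing the reading.

```python
import struct

def read_u32(data: bytes, offset: int, byte_order: str) -> int:
    """Read an unsigned 32-bit integer using an explicitly stated byte order."""
    # "<I" is little-endian, ">I" is big-endian; neither depends on the host CPU.
    fmt = {"little": "<I", "big": ">I"}[byte_order]
    return struct.unpack_from(fmt, data, offset)[0]

raw = bytes([0x00, 0x00, 0x01, 0x00])
print(read_u32(raw, 0, "little"))  # 65536
print(read_u32(raw, 0, "big"))     # 256
```

Stating the order explicitly in every read is one way to keep the result identical on every platform.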
Ultimately, the goal is to have the software run exactly the same on all 27 platforms. Part of the solution to that problem is just writing everything as generically as possible without special code for each platform. The other part is testing. With the conversion to C++, we wrote a lot of new tests in order to exercise as much code as possible on all platforms. Today, we’ve expanded those tests and made error detection much more strict. Lots of files and formats pass through during tests and they need to come through clean.
Search and extract text at scale
The first step to locating or extracting text from a file is finding out what format the file is in. If you are lucky enough to get a plaintext file, then that’s an easy one. Unfortunately, things are rarely easy. There aren’t a lot of standards available for how files are structured; what exists may be incomplete or outdated. Things have changed a lot over the years; Microsoft is actually at the forefront of publishing standards. They publish standards for most of their file types these days, particularly the newer ones.
Many file types can be identified by an initial set of four bytes. Once you have that, you can quickly parse the file. Older MS Office files all had the same four bytes, which presented complications, especially since so many files were in one of the four Office formats. You had to do a little extra detective work. Newer Office files all identify as ZIP files—they are all compressed XML—so once you extract the XML, you start applying known heuristics and following markers. Most XML is self-describing, so those markers can be easy to follow. Others don’t have much of a path at all.
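A rough Python sketch of this kind of signature check follows. It is an illustration under stated assumptions, not how Document Filters works: the `identify` function and both lookup tables are hypothetical, and the tables cover only a handful of well-known signatures. It reads the first four bytes, and when they indicate a ZIP container it looks inside for the part names that distinguish the newer Office formats.

```python
import io
import zipfile

# First four bytes of a few well-known formats (a tiny, illustrative table).
SIGNATURES = {
    b"%PDF": "pdf",
    b"PK\x03\x04": "zip",
    b"\xd0\xcf\x11\xe0": "ole2",  # container shared by the older Office formats
    b"\x89PNG": "png",
}

# Part names that tell the ZIP-based Office formats apart.
OOXML_MARKERS = {
    "word/document.xml": "docx",
    "xl/workbook.xml": "xlsx",
    "ppt/presentation.xml": "pptx",
}

def identify(data: bytes) -> str:
    """Return a coarse format label based on the signature and, for ZIPs, the contents."""
    kind = SIGNATURES.get(data[:4], "unknown")
    if kind != "zip":
        return kind
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return "zip"  # signature matched but the archive is damaged
    for marker, ooxml_kind in OOXML_MARKERS.items():
        if marker in names:
            return ooxml_kind
    return "zip"

# Build a minimal stand-in for a .docx in memory and identify it.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")
print(identify(buffer.getvalue()))  # docx
print(identify(b"%PDF-1.7\n"))      # pdf
```

The older Office formats would need the same kind of second step: the `ole2` label only says which container was used, not which application wrote it.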
Binary file types are harder. Some of the work here is reverse engineering and making sure you basically have enough files that are a representative sample set. But once you know the pattern, then detecting the file is absolutely predictable. We don’t use any machine learning or AI techniques to identify files because of this. The challenge is working out what the pattern is and what pattern a given file fits.
Identifying files is the very first thing that we do, so it has to be fast. One slow detection can impact everything and take us from sub-millisecond times per document to 15 milliseconds per document. When you’re trying to crank through 40,000 documents in a minute, that’s a lot.
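The arithmetic behind those numbers can be sketched in a few lines of Python. The function names are hypothetical, and the model is a deliberate simplification that assumes identical workers running in parallel with no other overhead.

```python
import math

MS_PER_MINUTE = 60_000

def per_document_budget_ms(documents_per_minute: int, workers: int = 1) -> float:
    """Average time each worker may spend per document to sustain the target rate."""
    return MS_PER_MINUTE * workers / documents_per_minute

def workers_needed(documents_per_minute: int, ms_per_document: float) -> int:
    """Smallest number of parallel workers that sustains the target rate."""
    per_worker_per_minute = MS_PER_MINUTE / ms_per_document
    return math.ceil(documents_per_minute / per_worker_per_minute)

print(per_document_budget_ms(40_000))  # 1.5
print(workers_needed(40_000, 15))      # 10
print(workers_needed(40_000, 0.5))     # 1
```

Under these assumptions, a single worker has 1.5 ms per document to hit 40,000 documents a minute. At 0.5 ms per document one worker is enough; at 15 ms it takes ten.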
We gain a lot of speed from specializing in text search and extraction as a pure back-end system. Alternate methods have used something like LibreOffice to process documents as a headless word processor. End-user applications have graphic elements and other features that you don’t care about. In a high-traffic environment, that could mean 50 copies of LibreOffice running as separate processes across multiple machines, each eating up hundreds of MB. If that crashes it could bring down vital business processes with it. It’s not uncommon to see server farms running LibreOffice for conversions that could be replaced with a single back-end process such as Document Filters. That’s before considering the other workarounds to process all the other file types you might need such as spreadsheets, images, and PDFs.
By focusing on processing text at a high volume, we can help clients that need to process incoming and outgoing emails, looking for data loss and accidental PII leakages. These products need to scan everything going in or out. We call it deep inspection. We crack apart every level of an email that could have text. Zipping something and renaming the extension is not enough to trick it. Attaching a PDF inside a Word document inside an Excel document is also not enough. These are all files that contain text, and security needs to scan all of it without delaying the send. We won’t try to crack an encrypted file, but we can flag it for human review. All this is done so quickly that you won’t notice a delay in the delivery of critical email.
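The idea of inspecting by content rather than by file name can be sketched with Python's standard `zipfile` module. This is a simplified illustration, not the Document Filters implementation: the `walk` function is hypothetical and only handles ZIP-based containers, whereas a real inspector has to understand many container types. It descends into anything that is structurally a ZIP archive, whatever its extension says, and flags encrypted entries instead of trying to open them.

```python
import io
import zipfile

def walk(data: bytes, path: str = "", depth_left: int = 8):
    """Yield (path, payload, status) for every leaf, descending into ZIP containers."""
    stream = io.BytesIO(data)
    if not zipfile.is_zipfile(stream):
        yield path, data, "leaf"  # not a container: hand it to text extraction
        return
    if depth_left == 0:
        yield path, data, "depth-limit"  # guard against deeply nested archives
        return
    with zipfile.ZipFile(stream) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            child = f"{path}/{info.filename}"
            if info.flag_bits & 0x1:  # bit 0 of the flags marks an encrypted entry
                yield child, b"", "encrypted"  # flag for human review
                continue
            yield from walk(archive.read(info), child, depth_left - 1)

# A ZIP archive renamed to look like a Word document, inside another ZIP.
inner = io.BytesIO()
with zipfile.ZipFile(inner, "w") as archive:
    archive.writestr("note.txt", "account 12345")
outer = io.BytesIO()
with zipfile.ZipFile(outer, "w") as archive:
    archive.writestr("report.docx", inner.getvalue())

for found_path, payload, status in walk(outer.getvalue()):
    print(found_path, payload, status)  # /report.docx/note.txt b'account 12345' leaf
```

The depth limit is a design choice for the sketch: files of unknown origin can nest archives many levels deep, so the walk needs a bound.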
We can process text so quickly because we built in C++ and run natively on the hardware; targeting native binaries also gives us the greatest flexibility, since we can be embedded in applications written in a wide variety of languages. On top of that, all that work identifying file formats pays off. When scanning a file, we load as little as possible into memory, just enough to identify the format. Then we move to processing, where we ignore any information we don’t need to spot text; we don’t need to load Acrobat forms and crack that stuff apart. Plus we let you throw as much hardware at the problem as you have. Say you are running a POWER8 machine with 200 cores: you can run 400 threads and it won’t break a sweat. You will want a lot of memory if you’re processing that many documents in parallel.
Make it look good
Our clients weren’t content with just searching and extracting text; they also wanted to display it in web browsers. Around 2009, people wanted to convert documents to HTML. When extracting text, the software doesn’t care about whether something is bolded or paginated—we just want the text.
Fortunately, all that work we did in understanding file types paid off here. We knew how to spot text, the markers that indicated each type, but now we had to understand the full file structure. Suddenly, bold, italics, tables, page breaks, and tabs vs. spaces become a lot more important. Our first iteration of HTML rendering, now called Classic HTML, created an unpaginated free flowing version of the file with as much formatting as we could pull. If you’ve ever looked at the HTML generated by MS Word, you know that creating HTML that accurately reflects a document is complicated.
There are seven billion people on the planet and all of them create a Word document differently. Even within Word or open source .docx editors like OpenOffice, you move an element and suddenly the formatting disappears. We had to test out all of the possible behaviors in the specifications, and still we figured out some bugs by trial and error.
We had one bug where Windows and Mac versions were producing different shades of blue. It was consistent across Office documents: PowerPoint and Excel documents all showed the same two shades of blue. Sometimes it comes down to different system defaults and fonts on different platforms. Sometimes the answer is completely subjective as to what the definition of blue is or whether a line wraps before or after a word. In cases like that, you have to pick one of the cases to propagate; one of them is right, but it’s hard to suss out exactly which one. There are no absolutes.
File format specifications, typically published by the vendor, don’t always help here either. We’ve seen a property change, while the spec doesn’t clarify how that affects the formatting of the document. Then, when testing a thousand page document, we find a bug on page 342, and our collective hearts sink a little bit. In cases like these, we know it’s going to take a while to sort out what’s causing it, then prove it over millions of iterations.
For all the trouble that Word documents give us, at least there’s structure; you know a table is a table. PDFs have none of that. They are probably the hardest to deal with because they focus on how a document is drawn on a screen. Technically, characters can be placed individually anywhere on a page, so determining column breaks, tables, and other formatting features requires looking at their rendered position on a screen.
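A toy Python sketch shows why position matters. Everything in it is an assumption made for illustration: the `Glyph` record, the `glyphs_to_lines` function, and both tolerance values are hypothetical, and real PDF layout analysis is far more involved. The sketch takes characters that arrive in arbitrary order with page coordinates, groups them into lines by baseline, sorts each line left to right, and inserts a space wherever the horizontal gap is large.

```python
from dataclasses import dataclass

@dataclass
class Glyph:
    char: str
    x: float      # left edge, in page units
    y: float      # baseline; in PDF coordinates, larger y is higher on the page
    width: float

def glyphs_to_lines(glyphs, line_tolerance=2.0, space_gap=1.5):
    """Rebuild lines of text from individually positioned characters."""
    lines = []  # each entry: (baseline_y, [glyphs on that line])
    for glyph in sorted(glyphs, key=lambda g: (-g.y, g.x)):  # top to bottom
        if lines and abs(lines[-1][0] - glyph.y) <= line_tolerance:
            lines[-1][1].append(glyph)
        else:
            lines.append((glyph.y, [glyph]))
    result = []
    for _, row in lines:
        row.sort(key=lambda g: g.x)  # left to right within the line
        text = row[0].char
        for prev, cur in zip(row, row[1:]):
            if cur.x - (prev.x + prev.width) > space_gap:
                text += " "  # a wide gap is treated as a word break
            text += cur.char
        result.append(text)
    return result

page = [
    Glyph("b", 30, 680.0, 5),
    Glyph("i", 15, 700.0, 3),
    Glyph("a", 10, 680.5, 5),
    Glyph("H", 10, 700.0, 5),
]
print(glyphs_to_lines(page))  # ['Hi', 'a b']
```

Columns and tables make this much harder, because a wide gap might be a word break, a column gutter, or a cell boundary.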
Pre-internet, everyone had to create everything themselves. They made their own formats in the dark. Everyone wrote binaries differently. And PDFs, while they are getting better, can always reveal a new bug, no matter how large a corpus of test data we have.
Open source software and an increased focus on accessibility concerns have changed formats a lot. PDFs have started including more formatting information to accommodate screen readers. Open source software needs to understand file formats, so more information is published and file producers have started making their files easier to understand.
The next step after understanding document formats was to take these files and produce paginated output that looks near-pixel-perfect compared to the source application. All that information we learned about file formats let us create what we call Paginated HD Renditions. Paginated output means the output looks similar to what you would get if you printed the document. That’s reading and extracting text from 550 formats, and creating fully formatted and paginated HD Renditions for over 100 formats, combined with a full markup and annotation API that can create native annotations and export to one of over 20 formats.
We’ve talked a lot about Word and PDF documents, because that’s what most people use. But we also can read in exotic file formats, like MRI and CT scan files. This has a significant application in medical situations where you may want to concatenate them with other medical forms, then output a PDF complete with the doctor’s annotations. Want to throw us multiple documents from different file formats? Go ahead, we’re not limited to 1:1 input to output, we will ingest the data, understand it, and return it as a single file type of your choice.
Don’t forget security
As we moved our product away from being a desktop search application, we’ve had to increase our focus on security. If a consumer-grade product crashes, it impacts a single user. But if an embedded piece of software crashes, it could take the rest of the program, and possibly the entire server, down with it. These crashes and exploits could open the host system up to further mischief. Over the years, we did get hit with a few surprises and got burnt.
What may be common today certainly wasn’t in the early 2000s. Static analysis, unit tests with high code coverage, compiler sanitizers, CVE scans, and fuzz testing are all must-haves.
We process files of unknown origins and quality. These files might come from a third-party that doesn’t strictly follow specifications, so they might be corrupt, or they might be maliciously crafted to trigger vulnerabilities.
Strict adherence to coding and security best practices only gets you so far. Testing, both active and passive, is a constantly running background task that helps us in our efforts to detect and gracefully handle the unexpected.
Each release is verified with 100K+ files to ensure no regressions or performance degradations. Each nightly build runs over 40K unit tests. Fuzz tests number in the tens of millions. And of course, third-party libraries are scanned for vulnerabilities nightly.
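For readers unfamiliar with fuzz testing, here is a minimal mutation-based sketch in Python. It is a toy under stated assumptions, not our test harness: the `DEMO` record format, `parse_record`, `mutate`, and `fuzz` are all invented for illustration. The loop corrupts a valid input at random and checks that the parser either succeeds or rejects the input with the one error it is designed to raise. Any other exception propagates and fails the run.

```python
import random
import struct

def parse_record(data: bytes) -> bytes:
    """Toy parser: 4-byte magic, 2-byte big-endian length, then the payload."""
    if len(data) < 6 or data[:4] != b"DEMO":
        raise ValueError("bad header")
    (length,) = struct.unpack_from(">H", data, 4)
    payload = data[6:6 + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return payload

def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Corrupt a copy of the seed: overwrite a few random bytes, sometimes truncate."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, 4)):
        data[rng.randrange(len(data))] = rng.randrange(256)
    if rng.random() < 0.3:
        del data[rng.randrange(len(data)):]
    return bytes(data)

def fuzz(seed: bytes, iterations: int, rng_seed: int = 0) -> int:
    """Return how many mutated inputs were rejected cleanly."""
    rng = random.Random(rng_seed)
    rejected = 0
    for _ in range(iterations):
        try:
            parse_record(mutate(seed, rng))
        except ValueError:
            rejected += 1  # graceful rejection is the expected failure mode
    return rejected

valid = b"DEMO" + struct.pack(">H", 5) + b"hello"
print(parse_record(valid))  # b'hello'
print(fuzz(valid, 10_000) <= 10_000)  # True, and no unexpected exception escaped
```

Fixing the random seed makes a failing input reproducible, which matters once a run turns up a crash.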
We’ve lived and breathed file types for decades, and seen the complexities that go into simply finding and extracting text. Some of the largest software companies in the world leverage Document Filters for their document processing needs, processing terabytes of information hourly. Our team of engineers is always monitoring new and changing file-types so consumers of Document Filters are well prepared for the future.
If you’re starting a new project, feel there’s room for improvement with your current tools, or don’t want to worry about the complexities of document processing, you can always learn more by checking out our code samples or requesting an evaluation at DocumentFilters.com.
As someone who’s hacking together a very pale imitation of this with the intent to provide a free and open-source Reader View-esque HTML converter for a variety of formats people post web-published amateur fiction in, you have my undying respect.
Whilst I’m sure your text extraction is very cool, I also bet there’s a myriad of ways to bypass it too. Having worked at a large email processing organisation, we also parsed attachments, especially to prevent confidential document leakage, and even employed fuzzy hashing, most of which I coded. But even I realised there are still many ways to bypass even the most complex document inspection techniques. We even went down the road of converting images to prevent steganography leaks, but even then, if you know enough, you can beat that too. So whilst I can applaud the concept and the work you’ve done in this area, I’m pretty sure it wouldn’t take much work for someone worth their salt to devise a way to beat your text search whilst still not triggering human inspection. Give me your library, and we’ll see!
I built something very similar, but it’s a managed API to let you search any file (image, video, audio, etc.) using Lucene:
@Corey Kidd and Ben Truscott, or whomever can solve our extraction problem:
Our automated platform requires clients to input images and text (primarily restaurant menus). We would like our clients to do so by simply entering their website address, and then having their images and text extracted. REQUIREMENTS: Extract from various websites (menus/product: images, product name, product description, product price). RSVP
Finally, mankind needs to understand the era we are living in, and say goodbye to lowercase as a different typeface. Lowercase letters are the same, just smaller. Goodbye to brea-king words at the end of lines, using CR LF, when CR is enough, abusing accented and local characters, when the same words can be written using the basic set. Goodbye to characters typed as single lines, to differentiate between 1, i and L. Goodbye to medieval measuring units, each foot and is different, to XZY and other variants of axis layout, when XYZ already existed, to winter and summer time shift, to left side driving, to using AHCDEFG musical notation, to JPEG color profiles, to various monetary units and money at all. Most of these differences came into being as a result of mistakes, such as misspelling, mishearing, forgetting.
By the way, Microsoft and Google do not search for free. Our private data are the most valuable goods.