Advancing Intelligent Document Rotation Correction at Hyperscience

September 5, 2024

3 min read

By Chuyao Shen

Setting the Stage

At the core of the Hyperscience platform lies Intelligent Document Processing, a crucial component that drives accurate information extraction from documents. One of the key preprocessing steps in this workflow is image correction, which ensures that document images are properly aligned and machine readable. Effective image correction not only enhances extraction accuracy but also minimizes the need for manual intervention in downstream tasks.

In our image correction process, we employ a two-step approach. First, a deskew model aligns the document image to the nearest 90-degree angle. Next, a rotation correction model adjusts the image to its correct upright position. From a machine learning standpoint, this rotation detection task is a classic example of an image classification problem.
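Reduced to angle arithmetic, the two-step flow can be sketched as follows. The skew estimator and the four-way rotation classifier are the two models in the pipeline; the function names here are illustrative stand-ins, showing only how their outputs combine:

```python
# Sketch of the two-step correction logic, reduced to angle arithmetic.
# The skew estimate and the 4-way class come from the two models described
# above; names and the example values are illustrative.

ROTATION_CLASSES = (0, 90, 180, 270)

def deskew_adjustment(estimated_skew: float) -> float:
    """Step 1: rotation (degrees) that snaps the page to the nearest
    multiple of 90 degrees."""
    nearest = round(estimated_skew / 90.0) * 90.0
    return nearest - estimated_skew

def upright_adjustment(predicted_class: int) -> int:
    """Step 2: rotation that brings a page classified as rotated by
    `predicted_class` degrees back to upright (0 degrees)."""
    assert predicted_class in ROTATION_CLASSES
    return (360 - predicted_class) % 360

# A page scanned at 93.5 degrees: deskew by -3.5 to reach 90, then the
# classifier predicts 90 and we rotate by a further 270 (i.e. -90).
print(deskew_adjustment(93.5))   # -3.5
print(upright_adjustment(90))    # 270
```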

The Journey So Far

In the past, we relied on a heuristic approach involving multiple steps to achieve image correction. The design of this multi-step architecture was driven by the need to balance accuracy and inference speed. First, a lightweight letter segmentor extracted letter patches from the document. This step effectively removed large regions without text, significantly reducing the resolution of the input fed into the convolutional neural network (CNN) classifier. As a result, even a tiny CNN architecture could achieve high classification accuracy.
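In this multi-step design, each letter patch receives its own rotation prediction from the tiny CNN, and the per-patch predictions are then combined into a page-level answer. A minimal sketch of that aggregation step, using majority voting as an illustrative combination rule (the production rule may differ):

```python
from collections import Counter

def aggregate_patch_votes(patch_predictions):
    """Combine per-letter-patch rotation predictions (0/90/180/270)
    into a single page-level rotation via majority vote."""
    if not patch_predictions:
        raise ValueError("no letter patches found on the page")
    counts = Counter(patch_predictions)
    angle, _ = counts.most_common(1)[0]
    return angle

# Most letter patches on this page were classified as rotated 180 degrees.
print(aggregate_patch_votes([180, 180, 0, 180, 90]))  # 180
```

Note that this dependence on letter patches is exactly what makes the pipeline fragile on text-sparse pages: with few patches, a handful of misclassified letters can flip the vote.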

The following diagram illustrates the detailed steps of how we performed image correction on synthetic SSN cards:

We benchmarked our pipeline against the popular open-source document rotation detection model, PP-LCNet, using four of our internal Latin evaluation datasets:

Current Innovations

The previously discussed pipeline worked well for the majority of documents we process. However, we’ve observed several cases where the model’s performance begins to decline:

  1. Performance Issues with Text-Sparse Documents: The model struggles with documents that contain minimal text, such as SSN cards, driver’s licenses, and birth certificates.
  2. Difficulty Handling Noisy Backgrounds: Due to the limitations of the segmentor module, the model cannot accurately process documents with textured or complex backgrounds.
  3. Challenges with Mixed Language Documents: For instance, some Arabic letters resemble upside-down Latin letters, which confuses the letter rotation detection model.
  4. Thresholding Challenges: The model's confidence scores are difficult to threshold effectively, leading to inconsistent results.

With the release of MobileNetV4 earlier this year, we were able to build a new page-level rotation detection model that addresses all of the above limitations while maintaining comparable inference times. When compared to other leading efficient models, MobileNetV4 demonstrates strong performance:

Moreover, the new page-level rotation detection model allows us to eliminate the complicated hierarchy in our previous pipeline. Now, a single model is sufficient to make predictions for all types of documents and languages. In our final implementation, we found that MobileNetV4_conv_medium more than meets the requirements for the document rotation detection task. When comparing the new page-level rotation detection model to the old letter rotation detection pipeline, the results speak for themselves:

Looking Ahead

Hyperscience is actively exploring Vision-Language Model (VLM) solutions to further push the boundaries of document processing. By employing a single multi-modal model that can handle every component of the pipeline, we envision the possibility of eliminating the image correction step entirely. With a VLM fine-tuned to interpret rotated images directly, issues like upside-down documents would no longer be a bottleneck, allowing for a more seamless and efficient document processing experience.