One of the roles of Machine Learning Engineers at Hyperscience is to help Product Managers assess the feasibility of new products. To do so, we run a discovery phase where the goal is to build a quick prototype of the solution in order to 1) get some signal on the complexity of the problem and its solution, 2) identify the main challenges and open questions ahead of time, and 3) evaluate the time and effort required to get to a fully-fledged solution deployable within the Hyperscience Platform.
Depending on the complexity and novelty of the product, this prototyping phase can take anywhere from several weeks to several months, but we strive to always follow the same approach: collect representative datasets, review the existing literature and pick a model, adapt that model to our problem, define evaluation metrics, and analyze the errors to surface the remaining challenges.
In this blog post, we present a prototype that we worked on earlier this year as a solution to the problem of information extraction from dense, text-heavy documents. In each document, we want to extract multiple fields, and we assume that each field has one single value in the document. For example, a document could be a Warranty Deed, and a field to extract could be the Grantee's address; in that case, we assume that there is only one Grantee in the Warranty Deed and that this Grantee has only one address. Note that this is different from classic Named Entity Recognition tasks, because such documents could contain multiple addresses, but we only want to extract the one that corresponds to our definition.
There are two interesting aspects that make this problem both unique and challenging: the documents are long, dense, and text-heavy, and each field is assumed to have exactly one value per document, unlike in classic Named Entity Recognition.
In the rest of this blog post, we will dive into each of the prototyping steps outlined above and showcase how our team approached them in the context of information extraction from dense documents.
The first step of prototyping is to collect datasets representative of the problem at hand. To do so, we list the characteristics we want the documents in those datasets to have: dense, text-heavy pages; documents spanning multiple pages; and multiple fields of various types to extract.
We mostly rely on public, open-sourced datasets that were published to facilitate ML research in the fields of Document Understanding and 2D Information Retrieval. We also utilize internal, private datasets.
Public datasets are a great fit for us in this case for a number of reasons. To start, they already come with ground truth annotations, so minimal effort is required to use them. Secondly, other research groups have already worked and published papers on these datasets, so we can also gather model ideas and performance estimates from their results.
We found two datasets, recently released by ApplicaAI, another ML company that tackled a very similar problem to ours: Kleister NDA, made of US non-disclosure agreements, and Kleister Charity [1], made of annual financial reports of charitable foundations in the UK.
On average, Kleister NDA documents are six pages long, and Kleister Charity documents are 22 pages long. Kleister NDA has 254 training and 83 validation documents; Kleister Charity has 1729 and 440, respectively.
Each dataset has multiple fields to extract with various data types such as dates, numbers, or freeform text. Multiple factors influence the extraction difficulty, one of which is whether the values are isolated on the page, within a paragraph, or within a table. For example, effective_date or report_date are relatively easy date fields, while addresses are generally harder.
The topic of information extraction from documents is an active area of research where the most successful approaches are built on top of BERT [2] or RoBERTa [3], adding some location information to the model to take into account the 2D nature of the input data. Here, 2D refers to the fact that the location of the text on the page provides useful information. Treating the problem as purely 1D with a traditional BERT approach can significantly hurt performance, as shown in the Lambert paper [4]. An example of this is extracting a field from a table, where adding location information helps the model understand the concepts of columns and rows.
One of the foundational papers of 2D BERT models is LayoutLM [5], developed by Microsoft. The main ideas are to add 2D embeddings for each segment of text on the page, as well as image features extracted from the feature maps of a CNN. The image features can be used to tell, for instance, whether the text is bold or the font is larger.
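To make the 2D-embedding idea more concrete, here is a minimal sketch of our own (not LayoutLM's actual code): each coordinate of a token's bounding box indexes a learned embedding table, and the resulting vectors are summed into the usual BERT input embeddings.

import torch
from torch import nn

class Position2DEmbedding(nn.Module):
    # illustrative sketch only; LayoutLM's real embedding layer has more components
    def __init__(self, hidden_size: int, max_position: int = 1024) -> None:
        super().__init__()
        # x and y coordinates get their own embedding tables
        self.x_embedding = nn.Embedding(max_position, hidden_size)
        self.y_embedding = nn.Embedding(max_position, hidden_size)

    def forward(self, bbox: torch.Tensor) -> torch.Tensor:
        # bbox: (batch, seq_len, 4) integer (x0, y0, x1, y1), normalized to [0, max_position)
        return (
            self.x_embedding(bbox[..., 0])
            + self.y_embedding(bbox[..., 1])
            + self.x_embedding(bbox[..., 2])
            + self.y_embedding(bbox[..., 3])
        )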
Some variants of this approach were later proposed by ApplicaAI with Lambert [4], and by Microsoft itself, improving on the original implementation with LayoutLMv2 [6].
To build our prototype, we decided to use LayoutLM for the following reasons: its pretrained weights and implementation are publicly available, and it reports strong results on tasks very similar to ours.
We treat the problem of information extraction as an entity tagging problem: we have one class per field, plus one more class for the "no field" option, and our decoding layer is a simple softmax over these classes. We use the implementation already available in Hugging Face:
from dataclasses import dataclass
from typing import Optional
import torch
from torch import nn
from transformers import LayoutLMForTokenClassification
from ml_framework.zoo.unstructured_nlp.config import LAYOUT_LM_PRETRAINED_STR

@dataclass
class ModelPrediction:
    logits: torch.Tensor = None       # (batch_size, MAX_WINDOW_SIZE, num_labels)
    predictions: torch.Tensor = None  # (batch_size, MAX_WINDOW_SIZE)
    loss: torch.Tensor = None

class FieldExtractionModel(nn.Module):  # wrapper class name is our own reconstruction
    def __init__(self, num_labels: int) -> None:
        super().__init__()
        self.model = LayoutLMForTokenClassification.from_pretrained(
            LAYOUT_LM_PRETRAINED_STR, num_labels=num_labels
        )

    @staticmethod
    def post_process_logits(logits: torch.Tensor) -> torch.Tensor:
        return torch.argmax(logits, dim=2)  # most likely class per token

    def forward(self, input_ids, bbox, attention_mask, labels=None) -> ModelPrediction:
        # passing the labels will automatically compute Sparse Cross Entropy loss
        out_model = self.model(
            input_ids=input_ids, bbox=bbox, attention_mask=attention_mask, labels=labels
        )
        return ModelPrediction(
            logits=out_model.logits,
            predictions=self.post_process_logits(out_model.logits),
            loss=out_model.loss,
        )
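For example, with four fields to extract, we would instantiate the wrapper above with five labels, the extra one being the "no field" class (FieldExtractionModel is our naming from the reconstruction above):

model = FieldExtractionModel(num_labels=5)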
In order to make it work for our problem, we had to tweak it slightly, namely with two additions: fuzzy matching of the ground truth to produce token-level labels, and windowing to handle documents longer than the model's 512-token limit. We describe both below.
The ground truth (GT) annotations provided with the Kleister datasets are strings of text with no localization information in the document. Since we use a tagging approach, we need to find all the tokens corresponding to each field's ground truth in the text. To match the tokens with the ground truth, we "fuzzy match" the text: we look for all the occurrences of the GT text in the document, allowing for some amount of mistakes. It is necessary to allow some mistakes since they could be transcription mistakes from the OCR engine, or normalization mistakes (e.g. the ground truth is "$19" and the segment is "$19,"). This tolerance to mistakes needs to be adjusted depending on the field type. For example, for pure freeform text (e.g. a company name), we can allow for a few mistakes, but for numbers we need to be much stricter in the matching (e.g. if the ground truth is "2019" we don't want to match "2018").
import math
from typing import List
import numpy as np
# assumed edit-distance helper; any Levenshtein implementation works here
from Levenshtein import distance as levenshtein_distance

def fuzzy_match_tags(
    gt_words: List[str], text_words: List[str], match_similarity_threshold: float
) -> np.ndarray:
    # 1 marks tokens that belong to a fuzzy match of the ground truth, 0 otherwise
    tagged_segments = np.zeros(len(text_words), dtype=int)
    num_gt_words = len(gt_words)
    num_text_words = len(text_words)
    num_allowed_char_errors = math.floor(
        (1.0 - match_similarity_threshold) * sum(len(gt_word) for gt_word in gt_words)
    )
    # total edit distance of the ground truth window at each candidate start position
    total_gt_window_errors = np.zeros(num_text_words - num_gt_words + 1, dtype=int)
    for gt_word_idx, gt_word in enumerate(gt_words):
        gt_word_errors = np.array(
            [
                levenshtein_distance(gt_word, text_word)
                for text_word in text_words[
                    gt_word_idx: (gt_word_idx + num_text_words - num_gt_words + 1)
                ]
            ]
        )
        total_gt_window_errors += gt_word_errors
    prev_start_word_idx = None
    for candidate_start_word_idx in np.nonzero(
        total_gt_window_errors <= num_allowed_char_errors
    )[0]:
        # snap to the lowest-error start position within the window
        best_start_word_idx_in_window = total_gt_window_errors[
            candidate_start_word_idx: candidate_start_word_idx + num_gt_words
        ].argmin()
        start_word_idx = candidate_start_word_idx + best_start_word_idx_in_window
        # avoid tagging overlapping occurrences twice
        if prev_start_word_idx is None or start_word_idx >= prev_start_word_idx + num_gt_words:
            tagged_segments[start_word_idx: start_word_idx + num_gt_words] = 1
            prev_start_word_idx = start_word_idx
    return tagged_segments
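To illustrate the tolerance adjustment on the "$19" example from above (a toy call against the reconstructed function):

# one character error is allowed at a 0.6 similarity threshold: floor(0.4 * 3) = 1
fuzzy_match_tags(["$19"], ["total", "$19,", "due"], match_similarity_threshold=0.6)
# -> array([0, 1, 0])

At a stricter threshold of 0.9, no character errors would be allowed, so the trailing comma would prevent any match.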
Like BERT, LayoutLM takes a maximum of 512 tokens as input. To be able to reuse its out-of-the-box implementation, we need to chunk the input document (a.k.a. windowing). We treat that as a preprocessing step where we split the document into multiple sub-documents of 512 tokens, with some overlap between consecutive chunks (e.g. 20%). Before we can chunk the document, we need to order the segments on the page, a process called text serialization. Although text serialization is a much bigger problem that can be very complex in general, we took the simple approach of ordering the segments from left to right and top to bottom. This was sufficient for the Kleister documents, given their fairly simple layout. Note that in other cases this approach would not work; for multi-column documents, the correct serialization would read each column from top to bottom before moving on to the next.
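As a minimal sketch (our own illustration; the Segment type and its field names are assumptions), this simple serialization is just a sort over segment coordinates, with the vertical position quantized so that segments on roughly the same line are read left to right:

from typing import List, Tuple

def serialize_segments(segments: List["Segment"], line_tolerance: float = 5.0) -> List["Segment"]:
    def sort_key(segment: "Segment") -> Tuple[float, float]:
        # segments within line_tolerance of each other share the same primary key
        return (round(segment.start_y / line_tolerance), segment.start_x)
    return sorted(segments, key=sort_key)

With the segments ordered, the chunking itself looks like the following (the Document fields are partly our reconstruction):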
from dataclasses import dataclass
from typing import List, Optional, Union

@dataclass
class BoundingBox:
    start_x: Union[float, int]
    start_y: Union[float, int]
    end_x: Union[float, int]
    end_y: Union[float, int]

@dataclass
class Document:
    token_ids: List[int]
    token_boxes: List[BoundingBox]  # field name is our reconstruction
    token_labels: Optional[List[int]] = None

def chunk_document(
    document: Document, chunk_size: int, overlap_ratio: float
) -> List[Document]:
    num_tokens = len(document.token_ids)
    overlapped_num_tokens = int(chunk_size * overlap_ratio)
    start, stop = 0, min(chunk_size, num_tokens)
    doc_chunks = []
    while True:
        chunk = Document(
            token_ids=document.token_ids[start:stop],
            token_boxes=document.token_boxes[start:stop],
            token_labels=document.token_labels[start:stop] if document.token_labels else None,
        )
        doc_chunks.append(chunk)
        if stop >= num_tokens:
            break
        # slide the window forward, keeping some overlap with the previous chunk
        start = stop - overlapped_num_tokens
        stop = start + chunk_size
    return doc_chunks
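For example, with chunk_size=512 and overlap_ratio=0.2, consecutive chunks share int(512 * 0.2) = 102 tokens, so each window advances by 410 tokens and a 2,000-token document yields five chunks.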
The model is trained on each chunk independently, and at inference time, we pool together the predictions from all chunks of the same document to compute our metrics (we expand on metrics in the next section). To output a single prediction per field, we take the predicted entity with the highest confidence score across all the chunks, where the confidence score of an entity is simply the average of the probabilities of its tokens.
First, we need to define the concept of an "entity": we group consecutive tokens predicted to belong to the same class into an "entity", i.e. a sequence of token_ids in a chunk. For a given class, we aggregate all these entities across all chunks of a document, which gives us a list of predicted entities and a list of target entities (using the fuzzy-matched labels). We then keep the top-1 predicted entity with the highest confidence score as the only predicted entity for the class, and we take the target entity that overlaps the most with it as the only target entity for the class.
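A minimal sketch of this top-1 aggregation (all names here are our own; the spans would come from grouping consecutive identically-tagged tokens):

from typing import Dict, List, Optional, Tuple
import torch

Entity = Tuple[int, int, float]  # (start_token_idx, end_token_idx, confidence)

def pick_top_entities(
    chunk_probs: List[torch.Tensor],                   # per chunk: (num_tokens, num_labels) softmax output
    chunk_entities: List[List[Tuple[int, int, int]]],  # per chunk: (label, start, end) spans
    num_labels: int,
) -> Dict[int, Optional[Entity]]:
    best: Dict[int, Optional[Entity]] = {label: None for label in range(num_labels)}
    for probs, entities in zip(chunk_probs, chunk_entities):
        for label, start, end in entities:
            # entity confidence = average probability of its tokens for that label
            confidence = probs[start:end, label].mean().item()
            if best[label] is None or confidence > best[label][2]:
                best[label] = (start, end, confidence)
    return best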
The second detail to notice is that the model's performance can be evaluated in multiple "spaces": the token space, i.e. the wordpiece tokens the model actually tags, and the word space, i.e. the final text we output after decoding and normalizing those tokens. We derive three levels of metrics from that: token level in the token space, entity level in the token space, and entity level in the word space. For each level, we compute recall, precision, and F1 scores.
Each of these three levels provides useful information as to where the bottleneck is in the system. For example, if the model is very good at the “token level” but is worse at the “entity-level in the token-space”, it could mean that the model sometimes misses one or two tokens in entities (e.g. the ground truth entity is “The Automation Company” and the model predicts “Automation Company”).
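As an illustration, a token-level F1 for a single field could be computed like this (a simplified sketch; the entity-level variants compare spans instead of individual tokens):

import numpy as np

def token_level_f1(predicted: np.ndarray, target: np.ndarray, label: int) -> float:
    # predicted / target: one class id per token of the document
    pred_mask, target_mask = predicted == label, target == label
    true_positives = np.logical_and(pred_mask, target_mask).sum()
    precision = true_positives / max(pred_mask.sum(), 1)
    recall = true_positives / max(target_mask.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-8)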
We report F1 scores at each of these levels for each field. We notice that the performance varies greatly across fields, and multiple factors can explain the discrepancy:
OCR / Normalization Errors: This is the case for "charity_number", for example, where the token-space metrics are very high. This means that the model tags the correct tokens, but the word-space metric is much lower, signifying that there is an issue when decoding / normalizing tokens. This was confirmed by looking at some of the "false" predictions.
Missing Tokens in the Prediction: This is the case for "charity_name", where we see a drop from the token-level metrics to the entity-level token-space metrics, due to a few tokens missing in some entities.
Failing Fuzzy Matching: This is the main cause of bad performance for most of the other fields. The very simple fuzzy matching that we first implemented suffers from two shortcomings.
Using an off-the-shelf model and implementing a few modules on top of it, the Hyperscience team was able to quickly build a first prototype that achieves decent performance on multiple public datasets. Most importantly, we identified the main challenges we need to solve if we want to turn this simple research prototype into an actual product that can be used at scale by our clients.
Interestingly, most of these challenges could be fixed manually for the sake of getting better results on an academic dataset, for example when publishing a paper. In the context of industrial research, where the goal is to build a product that can be deployed and work on any dataset at scale, fixing these issues is a more complex and longer-term problem. Flagging all of these challenges was, however, very beneficial, and was one of the main goals of this prototyping phase.
Author’s Note: Thanks to Siyuan Xiang and Louis Duperier for their article review. All of the aforementioned code can be found on the Hyperscience GitHub.
Romain Sauvestre is an ML Engineering Manager at Hyperscience based in France. Connect with Romain on LinkedIn.