Inside Look: Prototyping a New ML Product at Hyperscience

Romain Sauvestre | Posted on September 22, 2021

Overview of Prototyping at Hyperscience 

One of the roles of Machine Learning Engineers at Hyperscience is to help Product Managers assess the feasibility of new products. To do so, we run a discovery phase whose goal is to build a quick prototype of the solution in order to 1) get some signal on the complexity of the problem and its solution, 2) identify the main challenges and open questions ahead of time, and 3) evaluate the time and effort required to get to a fully fledged solution deployable within the Hyperscience Platform.

Depending on the complexity and novelty of the product, this prototyping phase can take anywhere from several weeks to several months, but we strive to always follow the same approach:

  1. Collect some datasets: they should be representative of the main use cases of the new product.
  2. Build a first baseline: look for models available in the literature and tweak them to our needs, or create a new one from scratch if it’s faster.
  3. Define metrics with the Product team: understand what metrics make sense for our clients. Also decide on “success criteria” that will be evaluated at the end of the prototyping phase.
  4. Train and evaluate a first prototype, iterate.

In this blog post, we will present a prototype that we worked on earlier this year as a solution to the problem of information extraction from dense and text-heavy documents. In each document, we want to extract multiple fields, and we assume that each field has only one single value in the document. For example, a document could be a Warranty Deed, and a field to extract could be the Grantee’s address. In that case, we assume that there is only one Grantee in the Warranty Deed and that this Grantee has only one address. Note that this is different from classic Named Entity Recognition tasks because such documents could contain multiple addresses, but we only want to extract the one that corresponds to our definition.

There are two interesting aspects to this problem that make it both unique and challenging:

  • Text heavy documents require deep language understanding, i.e. BERT-like models.
  • Unlike input data in the vast majority of tasks in NLP such as Named Entity Recognition or Machine Translation, documents are not 1D text since they have a layout (paragraphs, headers, etc) that constitutes helpful features one should not discard in the modeling.

In the rest of this blog post, we will dive into each of the prototyping steps outlined above and showcase how our team approached them in the context of information extraction from dense documents.

Data Collection

The first step of prototyping is to collect datasets representative of the problem at hand. To do so, we list some characteristics of the documents we want in those datasets:

  • Documents should consist of multiple pages (10+).
  • Documents should contain multiple fields to extract (5+).
  • Extracting these fields requires a good understanding of the text itself and should not rely only on simple regexes or location features (e.g. field should not always be located in the top left hand corner of the page).

We mostly rely on public, open-sourced datasets that were published to facilitate ML research in the fields of Document Understanding and 2D Information Retrieval. These datasets were introduced in previous papers and often come with ground truth annotations. We also utilize internal, private datasets.

Public datasets are a great fit for us in this case for a number of reasons. To start, they already have ground truth annotations, so they require minimal effort to use. Secondly, other research groups have already worked on and published papers about these datasets, so we can also gather some model ideas and performance estimates.

We found two datasets, released recently by ApplicaAI™ (another ML company), that tackle a very similar problem to ours: Kleister NDA, made of US non-disclosure agreements, and Kleister Charity1, made of annual financial reports of charitable foundations in the UK.

On average, Kleister NDA documents are six pages long, and Kleister Charity documents are 22 pages long. For Kleister NDA and Charity we have respectively 254/83 and 1729/440 documents in the training/validation sets.

Each dataset has multiple fields to extract with various data types such as dates, numbers, or freeform text. Multiple factors influence the extraction difficulty, one of which is whether the values are isolated on the page, within a paragraph, or within a table. For example, effective_date or report_date are relatively easy date fields, while addresses are generally harder.

Source: Table 2, Gralinski, et. al. (https://arxiv.org/pdf/2003.02356.pdf)

Modeling Approach 

The topic of information extraction from documents is an active area of research where the most successful approaches are built on top of BERT2 or RoBERTa3, adding some location information to the model to take into account the 2D nature of the input data. In this instance, 2D refers to the fact that the location of the text on the page provides useful information. Treating the problem as purely 1D with a traditional BERT approach can significantly hurt performance, as shown in the Lambert paper4. An example of this can be seen when extracting a field from a table, where adding location information helps the model understand the concepts of columns and rows.

One of the foundational papers of 2D BERT models is LayoutLM5, which was developed by Microsoft™. The main ideas are to add 2D embeddings for each segment of text on the page, as well as image features extracted from the feature maps of a CNN. The image features can help the model understand, for example, whether the text is in bold or whether the font is bigger.

Source: Figure 2, Xu, et. al. (https://arxiv.org/pdf/1912.13318.pdf)
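
To make the layout-embedding idea more concrete, here is a heavily simplified sketch of how 2D position embeddings can be added on top of word embeddings. This is not LayoutLM’s actual implementation (it ignores the image features, 1D positions, and width/height embeddings); the embedding sizes and the coordinate quantization are assumptions of this illustration:

import torch
from torch import nn


class SimplifiedLayoutEmbedding(nn.Module):
    # Toy illustration only: sum word embeddings with embeddings of the
    # quantized bounding-box coordinates (x0, y0, x1, y1) of each token.
    def __init__(self, vocab_size: int, hidden_size: int, max_2d_position: int = 1024) -> None:
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, hidden_size)
        self.x_embeddings = nn.Embedding(max_2d_position, hidden_size)
        self.y_embeddings = nn.Embedding(max_2d_position, hidden_size)

    def forward(self, token_ids: torch.Tensor, bboxes: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len); bboxes: (batch, seq_len, 4), coordinates quantized to [0, 1000]
        return (
            self.word_embeddings(token_ids)
            + self.x_embeddings(bboxes[..., 0])  # left
            + self.y_embeddings(bboxes[..., 1])  # top
            + self.x_embeddings(bboxes[..., 2])  # right
            + self.y_embeddings(bboxes[..., 3])  # bottom
        )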

Some variants of this approach were later proposed by ApplicaAI™ with Lambert4, or by Microsoft™ improving on their original implementation with LayoutLMv26.

To build our prototype, we decided to use LayoutLM for the following reasons:

  • Simpler approach that gives close to state-of-the-art results.
  • Results with this model are published in the literature on the Kleister™ datasets, so they can serve as a benchmark for our prototype.
  • Implementation and pre-trained weights are open-sourced by Microsoft™ on the Hugging Face™ hub.
  • The tokenizer used by Microsoft™ is also available on Hugging Face™ for preprocessing of the input.
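
As a minimal sketch of this last point, the publicly available tokenizer can be loaded from the Hugging Face™ hub and used to prepare word-level inputs. The checkpoint name and the example words and boxes below are illustrative assumptions, and bounding boxes are expected on LayoutLM’s 0–1000 grid:

from transformers import LayoutLMTokenizer

# Publicly available checkpoint on the Hugging Face hub (illustrative choice).
tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

words = ["Charity", "Commission", "Annual", "Report"]
boxes = [[50, 40, 120, 60], [130, 40, 260, 60], [270, 40, 340, 60], [350, 40, 420, 60]]

token_ids, token_boxes = [], []
for word, box in zip(words, boxes):
    word_tokens = tokenizer.tokenize(word)
    token_ids.extend(tokenizer.convert_tokens_to_ids(word_tokens))
    # each sub-word token inherits the bounding box of its word,
    # normalized to a 0-1000 grid as expected by LayoutLM
    token_boxes.extend([box] * len(word_tokens))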

We treat the problem of information extraction as an entity tagging problem. We have one class per field, plus one more class for the “no field” option. Our decoding layer is a simple softmax over the number of fields to extract. We use the already available implementation in Hugging Face™:

from dataclasses import dataclass

import torch
from torch import nn
from transformers import LayoutLMForTokenClassification

from ml_framework.zoo.unstructured_nlp.config import LAYOUT_LM_PRETRAINED_STR


@dataclass
class ModelPrediction:
    # (batch_size, MAX_WINDOW_SIZE, num_labels)
    logits: torch.Tensor = None
    # (batch_size, MAX_WINDOW_SIZE)
    predictions: torch.Tensor = None
    loss: torch.Tensor = None


class UnstructuredFieldIDModel(nn.Module):
    def __init__(self, num_labels: int) -> None:
        super().__init__()
        self.model = LayoutLMForTokenClassification.from_pretrained(
            LAYOUT_LM_PRETRAINED_STR, num_labels=num_labels
        )

    @staticmethod
    def post_process_logits(logits: torch.Tensor) -> torch.Tensor:
        return torch.argmax(logits, dim=2)

    def forward(
        self,
        token_ids: torch.Tensor,
        positions_normalized: torch.Tensor,
        attention_mask: torch.Tensor,
        token_type_ids: torch.Tensor,
        labels: torch.Tensor,
    ) -> ModelPrediction:
        # passing the labels will automatically compute Sparse Cross Entropy loss
        out_model = self.model(
            input_ids=token_ids,
            bbox=positions_normalized,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            labels=labels,
        )
        return ModelPrediction(
            logits=out_model.logits,
            predictions=self.post_process_logits(out_model.logits),
            loss=out_model.loss,
        )
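
To give a sense of the expected input shapes, here is a hypothetical smoke test of the wrapper with random inputs. It assumes the class and its config constant are importable; the number of labels, window size, and the fixed bounding box are illustrative, and boxes must lie on LayoutLM’s 0–1000 grid:

import torch

num_labels = 6  # e.g. 5 fields + 1 "no field" class (illustrative)
model = UnstructuredFieldIDModel(num_labels=num_labels)

batch_size, window_size = 2, 512
token_ids = torch.randint(0, 30000, (batch_size, window_size))
# every token gets the same dummy box here, quantized to the 0-1000 grid
positions_normalized = torch.tensor([100, 100, 200, 120]).repeat(batch_size, window_size, 1)
attention_mask = torch.ones(batch_size, window_size, dtype=torch.long)
token_type_ids = torch.zeros(batch_size, window_size, dtype=torch.long)
labels = torch.zeros(batch_size, window_size, dtype=torch.long)

out = model(token_ids, positions_normalized, attention_mask, token_type_ids, labels)
print(out.logits.shape)       # (2, 512, 6)
print(out.predictions.shape)  # (2, 512)
print(out.loss)               # scalar cross-entropy loss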

In order to make it work for our problem, we had to tweak it slightly, namely with two additions:

  • Ground Truth Fuzzy Matching
  • Document Chunking

Ground Truth Fuzzy Matching

The ground truth (GT) annotations provided with the Kleister™ datasets are strings of text with no localization information in the document. Since we use a tagging approach, we need to find all the tokens corresponding to each field’s ground truth in the text. In order to match the tokens with the ground truth, we “fuzzy match” the text: we look for all the occurrences of the GT text in the document, allowing for a certain amount of mistakes. It is necessary to allow some mistakes, since there could be transcription mistakes from the OCR engine or normalization mismatches (e.g. the ground truth is “$19” and the segment is “$19,”). This tolerance to mistakes needs to be adjusted depending on the field type. For example, for pure freeform text (e.g. a company name), we can allow for a few mistakes, but for numbers we need to be much more strict in the matching (e.g. if the ground truth is “2019” we don’t want to match “2018”).

import math
from typing import List

import editdistance
import numpy as np


class FuzzyMatcher:

    @staticmethod
    def fuzzy_match_freeform(
        gt_words: List[str], text_words: List[str], match_similarity_threshold: float
    ) -> np.ndarray:
        tagged_segments = np.zeros(len(text_words), dtype=int)
        num_gt_words = len(gt_words)
        num_text_words = len(text_words)
        num_allowed_char_errors = math.floor(
            (1.0 - match_similarity_threshold) * sum(len(gt_word) for gt_word in gt_words)
        )
        total_gt_window_errors = np.zeros(num_text_words - num_gt_words + 1, dtype=int)
        for gt_word_idx, gt_word in enumerate(gt_words):
            gt_word_errors = np.array(
                [
                    editdistance.eval(gt_word, text_word)
                    for text_word in text_words[
                        gt_word_idx: (gt_word_idx + num_text_words - num_gt_words + 1)
                    ]
                ]
            )
            total_gt_window_errors += gt_word_errors

        prev_start_word_idx = None
        for candidate_start_word_idx in (
            total_gt_window_errors <= num_allowed_char_errors
        ).nonzero()[0]:
            best_start_word_idx_in_window = total_gt_window_errors[
                candidate_start_word_idx: candidate_start_word_idx + num_gt_words
            ].argmin()
            start_word_idx = candidate_start_word_idx + best_start_word_idx_in_window
            if prev_start_word_idx is None or start_word_idx >= prev_start_word_idx + num_gt_words:
                tagged_segments[start_word_idx: start_word_idx + num_gt_words] = 1
                prev_start_word_idx = start_word_idx

        return tagged_segments
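
To illustrate the behaviour on a made-up example, the ground truth “Automation Company” is still matched despite an OCR error in the document text:

gt_words = ["Automation", "Company"]
text_words = ["The", "Automation", "Compamy", "shall", "not", "disclose"]

# with a 0.8 similarity threshold, up to 3 character errors are tolerated here
tags = FuzzyMatcher.fuzzy_match_freeform(gt_words, text_words, match_similarity_threshold=0.8)
print(tags)  # [0 1 1 0 0 0]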

Document Chunking 

Like BERT, LayoutLM takes a maximum of 512 tokens as input. To be able to reuse its out-of-the-box implementation, we need to chunk the input document (a.k.a. windowing). We treat this as a preprocessing step where we split the document into multiple sub-documents of 512 tokens, with some overlap between consecutive chunks (e.g. 20%). Before we can chunk the document, we need to order the segments on the page, a process called text serialization. Although text serialization is a much bigger problem that can often be very complex, we took a simple approach of ordering the segments from left to right and top to bottom (a minimal sketch of this ordering appears after the chunking code below). This approach was sufficient for the Kleister™ documents, given that they have a fairly simple layout. Note that in other cases this approach would not work; for multi-column documents, the correct serialization would be to read from top to bottom in the first column, then repeat for each subsequent column.

from dataclasses import dataclass
from typing import List, Union, Optional


@dataclass
class BoundingBox:
    start_x: Union[float, int]
    start_y: Union[float, int]
    end_x: Union[float, int]
    end_y: Union[float, int]


@dataclass
class Document:
    name: str
    token_ids: List[int]
    token_type_ids: List[int]
    attention_mask: List[int]
    positions: List[BoundingBox]
    token_labels: Optional[List[int]] = None


def chunk_document(
    document: Document, chunk_size: int, overlap_ratio: float
) -> List[Document]:
    num_tokens = len(document.token_ids)
    overlapped_num_tokens = int(chunk_size * overlap_ratio)
    start, stop = 0, min(chunk_size, num_tokens)

    doc_chunks = []
    while True:
        chunk = Document(
            name=document.name,
            token_ids=document.token_ids[start:stop],
            token_type_ids=document.token_type_ids[start:stop],
            attention_mask=document.attention_mask[start:stop],
            positions=document.positions[start:stop],
            token_labels=document.token_labels[start:stop],
        )
        doc_chunks.append(chunk)

        if stop >= num_tokens:
            break
        start = stop - overlapped_num_tokens
        stop = start + chunk_size

    return doc_chunks
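
As mentioned above, the segments must be serialized before chunking. Here is a minimal sketch of the simple top-to-bottom, left-to-right ordering, reusing the BoundingBox dataclass; representing segments as (text, box) pairs and the line_tolerance parameter used to group segments into lines are assumptions of this sketch:

from typing import List, Tuple


def serialize_segments(
    segments: List[Tuple[str, BoundingBox]], line_tolerance: float = 5.0
) -> List[Tuple[str, BoundingBox]]:
    # order segments top to bottom, then left to right; segments whose vertical
    # positions fall within the same `line_tolerance` bucket are treated as one line
    return sorted(
        segments,
        key=lambda segment: (round(segment[1].start_y / line_tolerance), segment[1].start_x),
    )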

The model is trained on each chunk independently, and at inference time, we pool together the predictions from each chunk of the same document to compute our metrics (we expand on the metrics in the next section). In order to output a single prediction per field, we take the predicted entity with the highest confidence score across all the chunks. The confidence score of an entity is defined simply as the average of the probabilities of each token in the entity.
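
A minimal sketch of this pooling step, assuming entities have already been extracted from each chunk’s predictions; the Entity container and its fields are hypothetical names used only in this illustration, with per-token probabilities taken from a softmax over the model’s logits:

from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Entity:
    # hypothetical container: a run of consecutive tokens predicted as the same field
    field_class: int
    token_ids: List[int]
    token_probs: List[float]  # probability of `field_class` for each token

    @property
    def confidence(self) -> float:
        # entity confidence = average of its tokens' class probabilities
        return sum(self.token_probs) / len(self.token_probs)


def pool_chunk_predictions(entities_per_chunk: List[List[Entity]]) -> Dict[int, Entity]:
    # keep, for each field, the single highest-confidence entity across all chunks
    best: Dict[int, Entity] = {}
    for chunk_entities in entities_per_chunk:
        for entity in chunk_entities:
            if entity.field_class not in best or entity.confidence > best[entity.field_class].confidence:
                best[entity.field_class] = entity
    return best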

Metrics

First, we need to define the concept of an “entity”: we group consecutive tokens belonging to the same class into an “entity”. In other words, an entity is a sequence of token_ids within a chunk. For a given class, we aggregate all these entities across all chunks of a document. We therefore have a list of predicted entities and a list of target entities (using the fuzzy-matched labels). We then keep the top-1 entity with the highest confidence score as the only predicted entity for a class, and we take the target entity that overlaps the most with the predicted entity as the only target entity for that class.
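
A short sketch of this grouping step; treating class 0 as the “no field” class is an assumption of this illustration:

from typing import List, Tuple

NO_FIELD_CLASS = 0  # assumption: class 0 is the "no field" class


def group_tokens_into_entities(
    token_ids: List[int], predicted_labels: List[int]
) -> List[Tuple[int, List[int]]]:
    # group consecutive tokens sharing the same (non-background) predicted class
    # into (field_class, token_ids) entities
    entities: List[Tuple[int, List[int]]] = []
    prev_label = NO_FIELD_CLASS
    for token_id, label in zip(token_ids, predicted_labels):
        if label == NO_FIELD_CLASS:
            prev_label = NO_FIELD_CLASS
            continue
        if label == prev_label:
            entities[-1][1].append(token_id)  # extend the current entity
        else:
            entities.append((label, [token_id]))  # start a new entity
        prev_label = label
    return entities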

The second detail to notice is that the model’s performance can be evaluated in multiple “spaces”:

  • Tokens’ space: The model acts as a tagger and assigns a label to each token. We can then compare these predicted labels to the ground truth labels. The metrics in the token space can also be called “localization” metrics, because they rely on the localization of the predictions to determine whether or not they are correct. 
  • Words’ space: Using the tokenizer, we can decode the tokens, converting them back to words. This is what we ultimately care about because, in the end, our customers want to extract words from a document. Note that in this case we compare the predictions and the true labels in the “word space” and, as a result, these metrics will be impacted by OCR errors or normalization errors.

We derive three levels of metrics from this and, for each level, we compute recall, precision, and F1 scores:

  • Token-level in the token-space: We compare each token_id of the predicted entity to the token_ids of the target entity.
  • Entity-level in the token-space: We compare the predicted entity, i.e. a sequence of token_ids, to the target entity, i.e. another sequence of token_ids.
  • Entity-level in the word space: We decode the predicted entity, converting it back to words, and then compare it with the original ground truth. 

Each of these three levels provides useful information as to where the bottleneck is in the system. For example, if the model is very good at the “token level” but is worse at the “entity-level in the token-space”, it could mean that the model sometimes misses one or two tokens in entities (e.g. the ground truth entity is “The Automation Company” and the model predicts “Automation Company”).
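
As an illustration, here is a minimal sketch of the token-level metrics in the token space, treating the predicted and target entities as multisets of token_ids (a simplifying assumption of this sketch):

from collections import Counter
from typing import List, Tuple


def token_level_precision_recall_f1(
    predicted_token_ids: List[int], target_token_ids: List[int]
) -> Tuple[float, float, float]:
    # overlap between the predicted entity's token_ids and the target entity's token_ids
    overlap = sum((Counter(predicted_token_ids) & Counter(target_token_ids)).values())
    precision = overlap / len(predicted_token_ids) if predicted_token_ids else 0.0
    recall = overlap / len(target_token_ids) if target_token_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

On the example above, assuming one token per word, the prediction “Automation Company” against the ground truth “The Automation Company” would score a precision of 1.0 but a recall of only 2/3.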

Results

We report F1-scores at different levels for each field:

We notice that the performance varies greatly across fields and multiple factors can explain the discrepancy.

OCR / Normalization Errors: This is the case for “charity_number”, for example, where the “token space” metrics are very high, meaning that the model tags the correct tokens, but the “word space” metrics are much lower, signifying that there is an issue when decoding / normalizing the tokens. This was confirmed when looking at some “false” predictions:

Missing Tokens in the Prediction: This is the case for “charity name”, where we see a gap between the “entity-level token space” and the “token level” metrics, due to a few tokens missing in some entities:

Failing Fuzzy Matching: This is the main cause of bad performance for most of these fields. The very simple fuzzy matching that we first implemented suffers from two shortcomings:

  1. It does not utilize the context around the words it matches. For example, the words “New York” could have different semantic meanings depending on their context: the state where a company is headquartered, the jurisdiction that governs an agreement between two companies, or part of the residence address of a company representative. From an NER perspective, these three occurrences should be tagged the same. In the case of information extraction, however, they represent different fields, and tagging all of them will confuse the model and prevent it from correctly learning the concept of “jurisdiction”, for example. This is something we experienced firsthand in Kleister NDA. A prime example of this can be seen here:
  • Correct Fuzzy Matching:
  • Incorrect Fuzzy Matching:
  2. The vanilla implementation of fuzzy matching assumes that the provided ground truth can be matched almost exactly in the text. However, this is rarely the case in practice, for various reasons. One such reason is abbreviations. For example, in Kleister NDA, we saw a lot of cases where the ground truth is “XXX Ltd” and in the text we only find “XXX Limited”. Another example is the representation of numerical values: the ground truth can be “1000” while in the text it is written as “1k” or “1 thousand”. This is the main reason for the low performance on numerical fields such as “spending_annually” and “income_annually”, where our fuzzy matching fails to find a lot of the ground truth text during training, which introduces a lot of noise in our training data.

Conclusion

Using an off-the-shelf model and implementing a few modules on top of it, the Hyperscience team was able to quickly build a first prototype that achieves decent performance on multiple public datasets. Most importantly, we identified the main challenges we need to solve if we want to turn this simple research prototype into an actual product that can be used at scale by our clients. 

Interestingly, most of these challenges could be fixed manually for the sake of getting better results on an academic dataset, for example when publishing a paper. In the context of industrial research, however, where the goal is to build a product that can be deployed and work on any dataset at scale, fixing these issues becomes a more complex and longer-term problem. Flagging all these challenges early is very beneficial, and it was one of the main goals of this prototyping phase.

Author’s Note: Thanks to Siyuan Xiang and Louis Duperier for their article review. All of the aforementioned code can be found on the Hyperscience GitHub.

###

Romain Sauvestre is an ML Engineering Manager at Hyperscience based in France. Connect with Romain on LinkedIn.

  1. https://arxiv.org/abs/2003.02356
  2. https://arxiv.org/abs/1810.04805
  3. https://arxiv.org/abs/1907.11692
  4. https://arxiv.org/abs/2002.08087
  5. https://arxiv.org/abs/1912.13318
  6. https://arxiv.org/abs/2012.14740
