By David Liang, Lead Product Manager, Machine Learning
According to a recent IDC report, 90% of new enterprise content is unstructured, and a big challenge for companies is extracting relevant information from this content to drive business-critical processes. In addition, conversations with our customers have shown that much of this unstructured content is text-dense, long-form content that is challenging to automate with traditional approaches. To help our enterprise customers make sense of this long-form content, Hyperscience is proud to launch a new proprietary model called Long-Form Extraction.
What is long form content and why is it hard to process?
From our conversations with customers, we define long form content along three dimensions:
Length of Documents – While traditional document processing focuses on shorter structured and semi-structured documents, many of the unstructured documents that customers want to automate are longer, ranging from 50 to over 200 pages. Every industry has long form document types that are critical to its operations, and each type presents its own processing challenges. Some examples include:
- Insurance: Policy binders and claims investigation reports
- Healthcare: Patient medical histories and clinical trial documentation
- Financial Services: Loan agreements and regulatory compliance reports
- Legal: Contracts, case files, and court filings
- Manufacturing: Technical manuals and product specification documents
- Government: Legislative bills and public policy reports
- Education: Research dissertations and accreditation reports
Text-Dense Content – The unstructured content our customers want to process is typically very text-dense, with few visual cues or formatting that can be used to find relevant information. Often, the specific information of interest is embedded within a sentence or paragraph. As an example, legal documents such as warranty deeds or lease agreements (see image below) have very few formatting cues that point to the relevant information embedded within their text.
IDC, Worldwide Intelligent Document Processing Software Forecast, 2024-2028, July 2024, Matt Arcaro & Amy Machado, IDC #US52445224
Longer Extraction Requirements – The ability to extract entire paragraphs or pages from long-form documents enables organizations to automate tasks that historically required human intervention, such as analyzing complex risk disclosures or contract terms. By leveraging automation for these time-consuming processes, enterprises can accelerate decision-making, reduce operational costs, and minimize compliance risks. This not only frees up valuable human resources but also creates a competitive advantage by allowing firms to respond faster to market demands, offer more personalized services, and stand out when speed and accuracy are key differentiators.
The characteristics of long form content described above make it difficult to process with traditional automation approaches such as Intelligent Document Processing (IDP) and Robotic Process Automation (RPA) solutions. Even newer technologies like Large Language Models (LLMs) can struggle with this content, because the length and complexity of entire documents make it difficult for generative models to answer accurately and can lead to hallucinations. As a result, many enterprises still rely on human keyers to process long form, unstructured content as part of their business-critical workflows. These keyers may be in-house resources or, more commonly, outsourced labor from business process outsourcers. Either way, they add significant cost, inefficiency, and potential delays to workflows, and they introduce more room for human error in business-critical processes.
Hyperscience’s approach to long form content
As a part of our R40 release, we have launched our Long Form Extraction Model – a purpose-built deep learning model to extract text from long form content. Our customers have already seen success using this model to help automate processing of documents such as legal contracts and medical records. Some key features that set this model apart from other solutions are:
- This model uses deep learning and advanced natural language processing (NLP) capabilities to learn the language of your business. The simple interface of Hyperscience makes it easy for business users to upload and annotate samples in order to train a powerful model capable of sifting through dense and unstructured content to support downstream processes. These NLP capabilities enable the model to interpret complex language patterns, adapt to domain-specific terminology, and accurately extract critical data from lengthy documents, regardless of format or structure.
- Users can define the data format or length to extract for each field. Using our layout configuration, they specify how each piece of relevant information should be extracted, whether it is a date, one or more addresses, or a longer text clause.
- The model supports advanced use cases where organizations need to extract multiple occurrences of the same information from one document. This capability is especially useful for documents like warranty deeds, which may contain multiple grantees or grantors. Unlike other solutions, including LLMs, that require post-processing to handle these repeated fields, Hyperscience can accurately identify and extract each occurrence without needing to split the output into discrete chunks, streamlining the process of comprehending complex, long form content.
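To make the multiple-occurrence idea concrete, here is a minimal Python sketch of what working with repeated fields could look like downstream. It is an illustration only: the `FieldOccurrence` type, the `group_occurrences` helper, and the sample values are hypothetical and do not represent the Hyperscience output format or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldOccurrence:
    """One extracted value. Hypothetical structure, not a Hyperscience type."""
    field_name: str
    text: str
    page: int


def group_occurrences(occurrences):
    """Group extracted values by field name, preserving document order."""
    grouped = defaultdict(list)
    for occ in occurrences:
        grouped[occ.field_name].append(occ)
    return dict(grouped)


# Made-up sample values for a warranty deed with two grantors and one grantee.
sample = [
    FieldOccurrence("grantor", "Jane Doe", page=1),
    FieldOccurrence("grantor", "John Doe", page=1),
    FieldOccurrence("grantee", "Acme Holdings LLC", page=2),
]

grouped = group_occurrences(sample)
grantor_names = [occ.text for occ in grouped["grantor"]]
```

Because each occurrence arrives as its own record, a downstream step can iterate over every grantor directly instead of parsing one combined string.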
While this model can be used standalone to unlock the data contained within your long form content, our customers are also able to use Hyperscience to create powerful end-to-end workflows that pair the output of this model with generative AI models in the Hyperscience Flow Studio. In the following section, we highlight how one customer combined our Long Form Extraction Model with a Large Language Model to help improve their understanding of 10-K documents.
Long Form Extraction Model in action
A financial institution approached us with a problem: how could Hyperscience help them speed up their processing of 10-K documents, which can be well over 100 pages, to enable faster strategic decisions from their analysts? One of their primary challenges was extracting risk factors from these documents, because both the number of risk factors and the length of each one vary from document to document. Below is an example of a risk factor from the Cisco 2024 10-K document, which is only one of over 30 different risk factors outlined.
Example of risk factors from Cisco’s 2024 10-K
In order to address their use case, we trained and deployed a Long Form Extraction Model that specializes in extracting each risk factor detailed in a document. However, when reviewing the output with the customer, we learned their end goal was actually to summarize each risk factor into a smaller, more easily understandable piece of text. As a result, we worked with the customer to deploy a Hyperscience LLM block into their flow and fed the output of our Long Form Extraction Model into the LLM for summarization purposes.
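The extract-then-summarize pattern described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the customer's flow or the Hyperscience LLM block: `summarize_risk_factors` is a hypothetical helper, the sample text is made up, and `first_sentence` is a stand-in for a real LLM call so the example stays runnable.

```python
from typing import Callable, List


def first_sentence(text: str) -> str:
    """Placeholder summarizer (assumption): returns the first sentence only.

    A real flow would call an LLM here instead.
    """
    head, sep, _ = text.partition(". ")
    return head + "." if sep else text


def summarize_risk_factors(
    risk_factors: List[str], summarize: Callable[[str], str]
) -> List[str]:
    """Summarize each extracted risk factor on its own, skipping blank entries."""
    return [summarize(rf.strip()) for rf in risk_factors if rf.strip()]


# Made-up stand-ins for the text spans an extraction step would return.
extracted = [
    "Supply constraints may delay shipments. This could reduce revenue in some quarters.",
    "   ",
    "Currency fluctuations may affect results.",
]

summaries = summarize_risk_factors(extracted, first_sentence)
```

Passing each risk factor to the summarizer separately keeps every prompt short and focused on one extracted span, rather than asking a generative model to reason over the entire 100+ page document at once.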
As a next step, we are now exploring how this customer can leverage the Hyperscience Hypercell for Generative AI product to create powerful generative experiences for their analysts by giving them the ability to query against historical risk factors across all of the companies in their portfolio. That ability could help our customer better identify patterns, predict future risks, and make more informed strategic decisions across their entire portfolio, ultimately setting a new standard for risk analysis in financial markets.
Hyperscience’s Long Form Extraction Model is enabling our customers to build differentiating experiences by speeding up the traditionally time-consuming processing of long form content. To learn more about how Hyperscience can help you unlock your long form content, you can watch our R40 webinar or schedule a meeting with us today.