How to Split a Multi-Document PDF Using JavaScript and Google Cloud Document AI

11 October 2024

Introduction

In this tutorial, I will guide you through the process of splitting a PDF that contains multiple documents using JavaScript, Google Cloud’s Document AI, and the pdf-lib library. This is useful when a single PDF holds several documents, each with its own page numbering (e.g., “Page 1 of 2” for the first document, “Page 1 of 3” for the second document, etc.). Document AI extracts the page number data, and then we split the PDF accordingly.

Step 1: Understanding the Problem

Consider a PDF with multiple documents, each identified by page numbers:

  • The first document has 2 pages, labeled “Page 1 of 2”, “Page 2 of 2”.

  • The second document has 3 pages, labeled “Page 1 of 3”, “Page 2 of 3”, “Page 3 of 3”.

We’ll use OCR (Optical Character Recognition) to extract these page numbers and split the PDF into separate files for each document.


Step 2: Setting Up Google Cloud Document AI

To OCR the page numbers, we will use Google Cloud Document AI’s Custom Extractor.

1. Create a Google Cloud Account if you don’t have one.

2. Set up Document AI by searching for it in the GCP Console.

3. Create a custom processor.

4. Select Custom Extractor as the processor type.

5. Upload Training Documents: Upload sample PDFs to train the processor.

6. Create Labels: Annotate the page numbers and total page count fields, creating two labels: page_no and page_total. For optimal accuracy, label at least 100 pages across 20 documents.

7. Train and Deploy the model.

Step 3: Extracting Page Numbers from the PDF Using Document AI

Once the processor is trained and deployed, you can extract labeled data like page numbers and total pages from the PDF. Here’s how we do it in JavaScript:

				
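// Requires the official client: npm install @google-cloud/documentai
const {DocumentProcessorServiceClient} = require('@google-cloud/documentai').v1;
const client = new DocumentProcessorServiceClient();

// The statements below use await, so they must run inside an async function.
// projectId, location, processorId, and s3Url come from your own configuration.
const name = `projects/${projectId}/locations/${location}/processors/${processorId}`;

// Download the PDF and base64-encode it for the request.
const buffer = await getTheArrayBufferFromPdfUrl(s3Url);
const encodedPdf = Buffer.from(buffer).toString('base64');

const request = {
  name,
  rawDocument: {
    content: encodedPdf,
    mimeType: 'application/pdf',
  },
};

// Run the trained processor and read the labeled entities (page_no, page_total).
const [result] = await client.processDocument(request);
const {document} = result;
const {entities} = document;
const pages = formatData(entities);
const pagesToSplit = getPdfPagesToSplit(pages);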
				
			

The formatData helper organizes the extracted entities into a structured array containing each page’s number and total page count.
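
The body of formatData is not listed in this tutorial. As a rough sketch only, it might look like the following. It assumes each entity carries type, mentionText, and pageAnchor.pageRefs[0].page (a zero-based page index that may be absent for the first page), and that the output objects use the keys number, page_no, and page_total. The real helper may differ.

```javascript
// Hypothetical sketch of formatData; the tutorial's actual helper is not shown.
function formatData(entities) {
  const pages = [];
  for (const entity of entities) {
    // Only the two labels created during training are relevant here.
    if (entity.type !== 'page_no' && entity.type !== 'page_total') continue;

    // Assumption: the zero-based page index may be a string or missing (page 0).
    const ref = entity.pageAnchor?.pageRefs?.[0];
    const index = Number(ref?.page ?? 0);

    pages[index] = pages[index] || { number: index };
    pages[index][entity.type] = Number.parseInt(entity.mentionText, 10);
  }

  // Fill pages that had no labels so array position still matches page order.
  for (let i = 0; i < pages.length; i++) {
    pages[i] = pages[i] || { number: i };
  }
  return pages;
}
```

Keeping one entry per page matters because the boundary logic in the next step treats the array position as the page position in the PDF.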

Step 4: Identifying Document Boundaries

We then determine the starting and ending pages for each document inside the PDF:

				
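// Convert per-page labels into 1-based { start, end } ranges, one per document.
const getPdfPagesToSplit = (pages) => {
  const pdfPages = [];
  let count = 0;     // 1-based position of the current page in the PDF
  let skipCount = 0; // pages left in the document that was already recorded

  for (const page of pages) {
    count++;
    if (skipCount) {
      skipCount--;
      continue;
    }

    // page_total may arrive as a string, so convert it to a number first.
    const total = Number(page.page_total);
    if (total === 1) {
      pdfPages.push({ number: +page.number + 1, start: count, end: count });
    } else if (total > 1) {
      skipCount = total - 1;
      pdfPages.push({ number: +page.number + 1, start: count, end: count + total - 1 });
    }
  }

  return pdfPages;
};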
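
The function above trusts the page_total read on the first page of each document. If the extractor misses that label, the page is left out and the following ranges can shift. As a hedged alternative (not the method used in this tutorial), boundaries can be derived from page_no alone: a new document starts wherever page_no is 1. The name splitOnFirstPages is hypothetical, and the sketch assumes pages holds one entry per PDF page, in order.

```javascript
// Alternative boundary detection (an assumption, not the tutorial's method):
// start a new document whenever page_no is 1, otherwise extend the current one.
function splitOnFirstPages(pages) {
  const docs = [];
  pages.forEach((page, index) => {
    const position = index + 1; // 1-based position in the PDF
    if (Number(page.page_no) === 1 || docs.length === 0) {
      docs.push({ start: position, end: position });
    } else {
      docs[docs.length - 1].end = position;
    }
  });
  return docs;
}
```

For the example PDF from Step 1 (a 2-page document followed by a 3-page document), this yields the ranges 1 to 2 and 3 to 5, even if one page_total label is missing.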


Step 5: Splitting the PDF Using pdf-lib

Once we have the start and end pages, we can split the PDF using pdf-lib:

				
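const {PDFDocument} = require('pdf-lib');

// Helper: inclusive list of integers from start to end.
const range = (start, end) =>
  Array.from({ length: end - start + 1 }, (_, i) => start + i);

const extractPdfPage = async (arrayBuff, pageToSplit) => {
  const pdfSrcDoc = await PDFDocument.load(arrayBuff);
  const pdfNewDoc = await PDFDocument.create();
  // copyPages expects zero-based indices, while start and end are 1-based.
  const indices = range(pageToSplit.start - 1, pageToSplit.end - 1);
  const pages = await pdfNewDoc.copyPages(pdfSrcDoc, indices);
  pages.forEach(page => pdfNewDoc.addPage(page));

  // save() resolves to a Uint8Array with the bytes of the new PDF.
  return pdfNewDoc.save();
};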

				
			

Here, pdf-lib copies and saves the pages of each document as a new PDF.

Step 6: Upload or Download the Split PDFs

Now, we can take the split PDFs from SplittedPdfs and either upload them to a cloud service or download them to the user’s machine:

				
const SplittedPdfs = [];
for (const pageToSplit of pagesToSplit) {
  // buffer is the source PDF's ArrayBuffer loaded in Step 3.
  const splittedPdf = await extractPdfPage(buffer, pageToSplit);
  SplittedPdfs.push(splittedPdf);
}
// Now you can use SplittedPdfs as per your needs.
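
As one possible way to use the results, the sketch below writes each split PDF to a local folder with Node's built-in fs module. The helper name saveSplitPdfs, the output folder, and the document-N.pdf naming are assumptions for illustration. Uploading to a cloud service would replace the write call with that service's client.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: writes each split PDF to outDir as
// document-1.pdf, document-2.pdf, ... and returns the file paths.
function saveSplitPdfs(pdfs, outDir) {
  fs.mkdirSync(outDir, { recursive: true });
  return pdfs.map((bytes, i) => {
    const filePath = path.join(outDir, `document-${i + 1}.pdf`);
    // bytes is expected to be the Uint8Array returned by pdf-lib's save().
    fs.writeFileSync(filePath, bytes);
    return filePath;
  });
}
```

For example, saveSplitPdfs(SplittedPdfs, './output') would create one file per detected document.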

				
			

Conclusion

This tutorial demonstrates how to split a multi-document PDF using JavaScript, Document AI, and pdf-lib. We covered setting up Document AI, extracting page numbers, and splitting the PDF based on those page numbers. With these steps, you can easily implement this feature in your own applications.

Shaheryar Ahmed

Software Engineer at Qavi Technologies