Saturday, November 12, 2022

Run grobid as AWS lambda

grobid is the most popular software for extracting data from scholarly PDF documents. It calls pdfalto to convert the PDF into XML, then uses machine learning models to extract information such as the authors and abstract.
On the downside, grobid is implemented as a traditional Java web application, and size has never been a concern for that kind of deployment. The current stable version 0.7.2 is about 366MB zipped, far beyond the 250MB unzipped size limit AWS Lambda imposes if you want to run it there.
Fortunately, much of grobid's distribution is not needed here, so it can be stripped down to a much smaller size.
First of all, the runtime settings should be:
Runtime: Java 8 on Amazon Linux 2 (pdfalto crashes on Amazon Linux 1)
Handler: Handler::handleRequest
Architecture: x86_64
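If you prefer the CLI over the console, the same settings can be applied with aws lambda create-function. The function name, role ARN, memory size, and timeout below are placeholders, not values from this setup:

```shell
aws lambda create-function \
    --function-name title-extract \
    --runtime java8.al2 \
    --architectures x86_64 \
    --handler Handler::handleRequest \
    --memory-size 3008 \
    --timeout 120 \
    --role arn:aws:iam::123456789012:role/lambda-exec-role \
    --zip-file fileb://build/distributions/title_extract-1.0.0.zip
```

java8.al2 is the runtime identifier for "Java 8 on Amazon Linux 2"; the plain java8 runtime runs on Amazon Linux 1, where pdfalto crashes.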

The Lambda itself is just some simple Java code:

import java.util.*;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import org.grobid.core.*;
import org.grobid.core.data.*;
import org.grobid.core.factory.*;
import org.grobid.core.utilities.*;
import org.grobid.core.engines.Engine;
import org.grobid.core.main.GrobidHomeFinder;

public class Handler implements RequestHandler<Map<String, String>, String> {

    @Override
    public String handleRequest(Map<String, String> event, Context context) {
        // grobid-home is supplied by a Lambda layer, mounted under /opt
        String pGrobidHome = "/opt/grobid-home";
        GrobidHomeFinder grobidHomeFinder = new GrobidHomeFinder(Arrays.asList(pGrobidHome));
        GrobidProperties.getInstance(grobidHomeFinder);

        String url = event.get("url");
        Engine engine = GrobidFactory.getInstance().createEngine();
        // /tmp is the only writable filesystem inside Lambda
        String pdfPath = engine.downloadPDF(url, "/tmp/", "test.pdf");

        BiblioItem resHeader = new BiblioItem();
        // extract the header (title, authors, abstract, ...) as TEI XML
        String tei = engine.processHeader(pdfPath, 0, resHeader);
        return tei;
    }

    public static void main(String[] args) {
        Map<String, String> event = new HashMap<>();
        event.put("url", "https://www.medrxiv.org/content/10.1101/2021.10.02.21264468v1.full.pdf");
        String tei = (new Handler()).handleRequest(event, null);
        System.out.println(tei);
    }
}

And here is the build.gradle file:

version '1.0.0'

apply plugin: 'java'

sourceCompatibility = 1.8

repositories {
    mavenCentral()
    maven { url "https://grobid.s3.eu-west-1.amazonaws.com/repo/" }
}

dependencies {
    implementation 'com.amazonaws:aws-lambda-java-core:1.2.1'
    implementation 'com.amazonaws:aws-lambda-java-events:3.11.0'
    runtimeOnly 'com.amazonaws:aws-lambda-java-log4j2:1.5.1'
    implementation 'org.grobid:grobid-core:0.7.2'
}

task buildZip(type: Zip) {
    from compileJava
    into('lib') {
        from configurations.runtimeClasspath
    }
}

build.dependsOn buildZip

After running './gradlew clean build', we need to remove some large but rarely used jars from the build:

zip -d build/distributions/title_extract-1.0.0.zip lib/jruby-complete-9.2.13.0.jar
zip -d build/distributions/title_extract-1.0.0.zip lib/scala-library-2.10.3.jar

grobid-home needs to be deployed as a Lambda layer, where it will be mounted at /opt/grobid-home. Do the following to reduce its size:

Edit grobid.yml and change temp to "/tmp"
rm -rf grobid-home/lib/
rm -rf grobid-home/pdf2xml
rm -rf grobid-home/scripts
rm -rf grobid-home/sentence-segmentation/
...

Finally, compress grobid-home and upload it to S3.