Run grobid as an AWS Lambda function
grobid is one of the most widely used tools for extracting data from scholarly PDF documents. It calls pdfalto to parse the PDF into XML, then uses machine learning models to extract information such as the authors and abstract.
On the downside, grobid is implemented as a traditional Java web application, where distribution size is rarely a consideration. The current stable version, 0.7.2, is about 366MB zipped, well above AWS Lambda's 250MB limit on unzipped deployment size.
Fortunately, there is a lot of stuff in grobid's distribution that we may not need, so it can be stripped down to a much smaller size.
First of all, the runtime settings should be:
Runtime: Java 8 on Amazon Linux 2 (pdfalto crashes on Amazon Linux 1)
Handler: Handler::handleRequest
Architecture: x86_64
The lambda itself is just some simple Java code:
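import java.util.*;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import org.grobid.core.*;
import org.grobid.core.data.*;
import org.grobid.core.factory.*;
import org.grobid.core.utilities.*;
import org.grobid.core.engines.Engine;
import org.grobid.core.main.GrobidHomeFinder;
// The listing is cut off after "RequestHandler"; the type parameters and body below are an assumed minimal sketch.
public class Handler implements RequestHandler<Map<String, String>, String> {
    public String handleRequest(Map<String, String> event, Context context) {
        // grobid-home comes from the layer, which Lambda extracts under /opt
        GrobidProperties.getInstance(new GrobidHomeFinder(Arrays.asList("/opt/grobid-home")));
        Engine engine = GrobidFactory.getInstance().createEngine();
        BiblioItem header = new BiblioItem();
        engine.processHeader(event.get("pdfPath"), 0, header); // "pdfPath": hypothetical key with a local PDF path; 0 = no consolidation
        return header.getTitle();
    }
}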
And here is the build.gradle file:
version '1.0.0'

apply plugin: 'java'

sourceCompatibility = 1.8

repositories {
    mavenCentral()
    maven { url "https://grobid.s3.eu-west-1.amazonaws.com/repo/" }
}

dependencies {
    implementation 'com.amazonaws:aws-lambda-java-core:1.2.1'
    implementation 'com.amazonaws:aws-lambda-java-events:3.11.0'
    runtimeOnly 'com.amazonaws:aws-lambda-java-log4j2:1.5.1'
    implementation 'org.grobid:grobid-core:0.7.2'
}

task buildZip(type: Zip) {
    from compileJava
    into('lib') {
        from configurations.runtimeClasspath
    }
}

build.dependsOn buildZip
After running './gradlew clean build', we will need to remove some large but rarely used jars from the build:
zip -d build/distributions/title_extract-1.0.0.zip lib/jruby-complete-9.2.13.0.jar
zip -d build/distributions/title_extract-1.0.0.zip lib/scala-library-2.10.3.jar
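To find which jars are worth removing, it can help to list the largest entries in the zip first. The following is a minimal sketch using only java.util.zip; the class name LargestEntries and its methods are hypothetical helpers, not part of grobid or the AWS SDK. Whether a given jar is actually safe to remove still has to be checked by running the function.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of grobid or the AWS SDK.
class LargestEntries {

    // Returns up to n file entries of the zip, ordered by uncompressed size, largest first.
    static List<ZipEntry> largest(Path zip, int n) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            List<ZipEntry> entries = new ArrayList<>(Collections.list(zf.entries()));
            entries.removeIf(ZipEntry::isDirectory);
            entries.sort(Comparator.comparingLong(ZipEntry::getSize).reversed());
            return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the zip, e.g. the one under build/distributions
        for (ZipEntry e : largest(Paths.get(args[0]), 10)) {
            System.out.printf("%8.1f MB  %s%n", e.getSize() / (1024.0 * 1024.0), e.getName());
        }
    }
}
```

Run it with the path to the zip as the only argument; it prints the ten largest entries with their uncompressed sizes.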
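grobid-home needs to be deployed as a Lambda layer. Layers are extracted under /opt, so it will be mounted at /opt/grobid-home. Do the following to reduce its size:
Edit grobid.yml and change temp to "/tmp", the only writable directory in the Lambda environment.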
rm -rf grobid-home/lib/
rm -rf grobid-home/pdf2xml
rm -rf grobid-home/scripts
rm -rf grobid-home/sentence-segmentation/
...
Finally, compress grobid-home and upload it to S3.
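Since the 250MB limit applies to the unzipped function together with its layers, it may be worth checking the combined size before deploying. The sketch below sums the uncompressed entry sizes of the function zip and the layer zip; the class name UnzippedSizeCheck is a hypothetical helper, and the limit is assumed to be 250 * 1024 * 1024 bytes.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of grobid or the AWS SDK.
class UnzippedSizeCheck {

    // Assumed quota: 250 MB unzipped for the function plus all of its layers.
    static final long LIMIT_BYTES = 250L * 1024 * 1024;

    // Sums the uncompressed size of every entry in one zip archive.
    static long unzippedBytes(Path zip) throws IOException {
        try (ZipFile zf = new ZipFile(zip.toFile())) {
            long total = 0;
            for (ZipEntry e : Collections.list(zf.entries())) {
                total += e.getSize();
            }
            return total;
        }
    }

    // True when the archives together stay below the assumed limit.
    static boolean fits(Path... zips) throws IOException {
        long total = 0;
        for (Path zip : zips) {
            total += unzippedBytes(zip);
        }
        return total < LIMIT_BYTES;
    }

    public static void main(String[] args) throws IOException {
        // Arguments: the function zip followed by the layer zip(s).
        Path[] zips = new Path[args.length];
        for (int i = 0; i < args.length; i++) {
            zips[i] = Paths.get(args[i]);
        }
        System.out.println(fits(zips) ? "fits within the limit" : "too large");
    }
}
```

Pass the function zip and the grobid-home layer zip as arguments. If the result is "too large", more of grobid-home or more jars need to go.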