Monday, April 01, 2024

A book in progress

Recently, I have been working on a book that aims to teach kids how computers work. I chose to use a storytelling approach to make it more appealing to the younger generation. The topics covered will include many advanced concepts that are traditionally not seen until college and beyond.

The book’s title is “Lost Language of the Machines.” I am pleased to announce that the first chapter is almost ready, and you can read it here:

Lost Language of the Machines

Additionally, the book is open source, and the code can be found on GitHub:

GitHub Repository

Thank you for your interest, and I hope you enjoy the journey into the fascinating world of computers!

Saturday, November 12, 2022

Run grobid as AWS lambda

grobid is one of the most popular tools for extracting data from scholarly PDF documents. It calls pdfalto to parse the PDF into XML, then uses machine learning models to extract information such as authors and abstract.
On the downside, grobid is implemented as a traditional Java web application, where distribution size has never been much of a concern. The current stable version, 0.7.2, is about 366MB zipped, which is well over AWS Lambda's 250MB unzipped size limit.
Fortunately, there is a lot in grobid's distribution that we may not need, so it can be stripped down to a much smaller size.
First of all, the runtime settings should be:
Runtime: Java 8 on Amazon Linux 2 (pdfalto crashes on Amazon Linux 1)
HandlerInfo: Handler::handleRequest
ArchitectureInfo: x86_64

The Lambda function itself is just some simple Java code:

import java.util.*;
import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import org.grobid.core.*;
import org.grobid.core.data.*;
import org.grobid.core.factory.*;
import org.grobid.core.utilities.*;
import org.grobid.core.engines.Engine;
import org.grobid.core.main.GrobidHomeFinder;

public class Handler implements RequestHandler<Map<String, String>, String> {

    @Override
    public String handleRequest(Map<String, String> event, Context context) {
        String pGrobidHome = "/opt/grobid-home";
        GrobidHomeFinder grobidHomeFinder = new GrobidHomeFinder(Arrays.asList(pGrobidHome));
        GrobidProperties.getInstance(grobidHomeFinder);

        String url = event.get("url");
        Engine engine = GrobidFactory.getInstance().createEngine();
        String pdfPath = engine.downloadPDF(url, "/tmp/", "test.pdf");

        BiblioItem resHeader = new BiblioItem();
        String tei = engine.processHeader(pdfPath, 0, resHeader);
        return tei;
    }

    public static void main(String[] args) {
        Map<String, String> event = new HashMap<>();
        event.put("url", "https://www.medrxiv.org/content/10.1101/2021.10.02.21264468v1.full.pdf");
        String tei = (new Handler()).handleRequest(event, null);
        System.out.println(tei);
    }
}

And here is the build.gradle file:

version '1.0.0'

apply plugin: 'java'

sourceCompatibility = 1.8

repositories {
    mavenCentral()
    maven { url "https://grobid.s3.eu-west-1.amazonaws.com/repo/" }
}

dependencies {
    implementation 'com.amazonaws:aws-lambda-java-core:1.2.1'
    implementation 'com.amazonaws:aws-lambda-java-events:3.11.0'
    runtimeOnly 'com.amazonaws:aws-lambda-java-log4j2:1.5.1'
    implementation 'org.grobid:grobid-core:0.7.2'
}

task buildZip(type: Zip) {
    from compileJava
    into('lib') {
        from configurations.runtimeClasspath
    }
}

build.dependsOn buildZip

After running './gradlew clean build', we will need to remove some large but rarely used jars from the build:

zip -d build/distributions/title_extract-1.0.0.zip lib/jruby-complete-9.2.13.0.jar
zip -d build/distributions/title_extract-1.0.0.zip lib/scala-library-2.10.3.jar
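
To decide which jars are worth removing, it can help to list the largest entries in the archive first. The following is a minimal sketch, assuming Info-ZIP's unzip is installed (its 'unzip -l' listing has four columns: length, date, time, name). The largest_jars helper name is made up for illustration, and size alone does not tell you whether a jar is safe to drop.

```shell
# Read an `unzip -l` listing on stdin and print "size name" for the
# N largest jars under lib/ (N defaults to 10).
largest_jars() {
    awk '$4 ~ /^lib\/.*\.jar$/ { print $1, $4 }' | sort -k1,1nr | head -n "${1:-10}"
}

# Usage (archive name taken from the commands above):
# unzip -l build/distributions/title_extract-1.0.0.zip | largest_jars 5
```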

grobid-home needs to be deployed as a Lambda layer; it will be mounted at /opt/grobid-home. Do the following to reduce its size:

Edit grobid.yml and change temp to "/tmp"
rm -rf grobid-home/lib/
rm -rf grobid-home/pdf2xml
rm -rf grobid-home/scripts
rm -rf grobid-home/sentence-segmentation/
...

Compress grobid-home and upload it to S3.
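
The compress-and-upload step could look roughly like the sketch below. It assumes the AWS CLI is installed and configured; the bucket name, layer name and the publish_grobid_layer helper are placeholders, not values from my actual setup. Lambda extracts layer archives under /opt, so zipping the grobid-home directory itself is what makes it show up at /opt/grobid-home.

```shell
# Placeholders: replace with your own bucket and layer name.
BUCKET="${BUCKET:-my-lambda-artifacts}"
LAYER_NAME="${LAYER_NAME:-grobid-home}"

publish_grobid_layer() {
    # Keep the top-level grobid-home/ directory in the archive so that it
    # is extracted to /opt/grobid-home inside the Lambda environment.
    zip -qr grobid-home.zip grobid-home &&
    aws s3 cp grobid-home.zip "s3://$BUCKET/grobid-home.zip" &&
    aws lambda publish-layer-version \
        --layer-name "$LAYER_NAME" \
        --content "S3Bucket=$BUCKET,S3Key=grobid-home.zip" \
        --compatible-runtimes java8.al2
}

# Usage (run from the directory that contains grobid-home/):
# publish_grobid_layer
```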

Thursday, June 28, 2012

ActiveRecord::Fixtures.create_fixtures and RI_ConstraintTrigger error

When you run ActiveRecord::Fixtures.create_fixtures with the postgresql adapter, it requires the db user (in config/database.yml) to be a superuser. So if the user was created with the 'createuser -S' command (which creates a non-superuser), ActiveRecord::Fixtures.create_fixtures will fail with a permission denied error related to RI_ConstraintTrigger. There is some useful discussion here: https://github.com/matthuhiggins/foreigner/issues/61
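
As far as I understand it, the underlying reason is that fixture loading disables referential integrity with ALTER TABLE ... DISABLE TRIGGER ALL, and PostgreSQL only lets a superuser disable the internally generated RI_ConstraintTrigger triggers. A quick way to check and fix the role is sketched below. It assumes you can connect as the 'postgres' admin role with psql and that the role name is trusted input; the helper names and the 'myapp' role are made up for illustration.

```shell
# Print "t" if the given PostgreSQL role is a superuser, "f" otherwise.
role_is_superuser() {
    psql -U postgres -tAc "SELECT rolsuper FROM pg_roles WHERE rolname = '$1'"
}

# Grant superuser to the role (only sensible on development/test databases).
grant_superuser() {
    psql -U postgres -c "ALTER ROLE \"$1\" SUPERUSER"
}

# Usage, with a hypothetical role name:
# [ "$(role_is_superuser myapp)" = "t" ] || grant_superuser myapp
```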

Thursday, April 12, 2012

Delete Multiple Objects from s3

Deleting s3 objects used to be slow if you had lots of files in an s3 bucket: you had to issue a delete request for each file. On Dec 7, 2011, Amazon announced Multi-Object Delete, a new API that allows users to delete multiple objects with a single web request.

As this API is relatively new, the popular ruby s3 libraries, including rightaws, do not support it yet. So I wrote a ruby script to do it; the source code is here:
https://gist.github.com/2361625

And it can be used as a command line tool:

$ruby s3_multi_object_delete.rb bucket s3id s3key key1 key2

Or, if you want to use it in your Rails project, just drop it into the 'config/initializers' folder.

It is based on happening, although the code can be ported to rightaws easily. I picked happening as it may achieve better performance due to higher concurrency.
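
For reference, a Multi-Object Delete call is a single POST to the bucket's ?delete subresource, with an XML body listing the keys (up to 1000 per request) and a Content-MD5 header. The sketch below only builds the body and the header value; it is not the code from the gist, and the request still has to be signed and sent. It assumes keys contain no characters that need XML escaping and that openssl is available; the function names are made up for illustration.

```shell
# Print the XML request body for the given object keys.
# Assumption: keys contain no &, < or > (no XML escaping is done here).
delete_request_body() {
    printf '<Delete>'
    for key in "$@"; do
        printf '<Object><Key>%s</Key></Object>' "$key"
    done
    printf '</Delete>'
}

# Print the base64-encoded MD5 of stdin, as used for the Content-MD5 header.
content_md5() {
    openssl dgst -md5 -binary | openssl base64
}

# Usage:
# delete_request_body key1 key2 > body.xml
# content_md5 < body.xml
```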

Monday, April 09, 2012

HTML5 for Desktop Application Development

Thursday, August 11, 2011

Inspect Ruby Process From gdb

If you have been developing in Ruby and Ruby on Rails for any length of time, it is very likely that you have encountered a 'Segmentation fault' from Ruby without any information about where the error is coming from.

Normally, a gem written as a C extension is to blame. If pure Ruby code hits an error, a nice stack trace with line numbers will point you in the right direction, but that is not the case for code in a C extension. You won't find more information about the error unless you fire up gdb.

Assuming the code that crashes is bug.rb, instead of doing 'ruby bug.rb', just do:
$gdb ruby
(gdb) run bug.rb

Once the crash is reproduced, type:
(gdb) bt

'bt' means 'backtrace'. A C call stack will show up, and you will find some valuable information there. If 'bt' does not give you the function names and line numbers you want, it normally means gdb cannot locate the source code, and you can use the 'dir' command to help it (Ruby's source code is freely available; just download the version that matches the binary):
(gdb) dir /root/downloaded/ruby-1.8.6-p111

If the program that crashed is a rake task, do something similar, since Rake itself is a Ruby script:
$gdb ruby
(gdb) run /usr/bin/rake -T

If you have a daemon written in Ruby, sometimes you might see that it hangs (does nothing) or uses too many resources (100% cpu usage). To inspect a live process with process id 12345, do this:
$gdb ruby
(gdb) attach 12345
(gdb) bt
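
The same inspection can be done without an interactive session, which is handy when a daemon only hangs occasionally and you want to capture the backtrace from a script. This is a sketch, assuming a gdb that supports the -p, -batch and -ex options; the ruby_bt name is made up for illustration.

```shell
# Attach to a running process, print its C backtrace, then detach and exit.
ruby_bt() {
    gdb -p "$1" -batch -ex bt
}

# Usage:
# ruby_bt 12345 > backtrace.txt
```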

For Rails applications, a bug may only happen in the production environment. Before you enter 'run', use this to set up the environment:
(gdb) set environment RAILS_ENV = production

Here are some tips to get more from a C stack trace.
You can call a C function with the 'call' command in the gdb shell; yep, that makes C feel like an interpreted language. For example, say you have a stack trace entry like this:
rb_call0 (klass=47110398570320, recv=47110462241640, id=22977,
oid=<value optimized out>, argc=0, argv=0x7fffee0453c8, body=0x2ad8be3946c8,
flags=<value optimized out>) at eval.c:5998
You can find out what 'klass' is by using its memory address:
(gdb) call rb_class2name(47110398570320)
Or you can look up a Ruby function's name:
(gdb) call rb_id2name(22977)

The GDB Quick Reference and the 'Extending Ruby' chapter of the 'Programming Ruby' book are great references if you want to know more!

Monday, January 12, 2009

How to debug or test cron script

If your simple cron script (in /etc/cron.daily, /etc/cron.hourly, etc.) does not run as expected, do the following:

1. be sure to have '#!/bin/sh' as the first line
2. make sure the script's name does not contain '.'; for example, you should rename 'yourscript.sh' to 'yourscript'.
3. use absolute paths if you need to read/write files in the script
4. try running the script in the shell to make sure there is no obvious problem
5. 'run-parts --test /etc/cron.hourly/' will tell you which scripts will run; you should see your script in the list
6. next try to run it like cron does, for example:
cd / && run-parts --report /etc/cron.hourly
7. cron uses syslog for logging. Check your syslog config file (/etc/syslog.conf) to see where the log is, and check the log for errors.
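
Step 2 is the one that bites most often. On Debian and Ubuntu, run-parts by default only runs files whose names consist entirely of ASCII letters, digits, underscores and hyphens, so a quick check like the following can confirm that a name is acceptable. The is_run_parts_name helper is made up for illustration and does not cover the --lsbsysinit or --regex modes.

```shell
# Succeed if run-parts (default mode) would accept the given file name:
# only ASCII letters, digits, underscores and hyphens, and not empty.
is_run_parts_name() (
    LC_ALL=C
    case "$1" in
        ''|*[!A-Za-z0-9_-]*) false ;;
        *) true ;;
    esac
)

# Usage: report files in a cron directory that run-parts would skip.
# for f in /etc/cron.hourly/*; do
#     is_run_parts_name "$(basename "$f")" || echo "skipped: $f"
# done
```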

Ref:
[1] Why are cron.hourly files not running? SOLVED
https://lists.ubuntu.com/archives/ubuntu-users/2007-July/118973.html
