Serverless Article Extract Service using Node.js & AWS Lambda

Firefox offers a Reader Mode, and the library that powers it is called Readability. In this article, we will learn how to put this library behind a serverless function and use it as an API.

You can read more about Readability here: https://github.com/mozilla/readability. The project describes itself as “A standalone version of the readability library used for Firefox Reader View”. The library is licensed under the Apache License 2.0.

What does it do?

The function fetches the full HTML of a URL and passes it to the Readability library for parsing. If there is no error, the cleaned HTML and an excerpt are returned.

The file below is pretty much the entire code for the extractor.

'use strict';

// HTTP client
const axios = require('axios').default;

// Readability, JSDOM and DOMPurify
const { JSDOM } = require('jsdom');
const { Readability } = require('@mozilla/readability');
const createDOMPurify = require('dompurify');
const DOMPurify = createDOMPurify(new JSDOM('').window);

// Not too happy to allow iframe, but it's the only way to get YouTube videos
const domPurifyOptions = {
    ADD_TAGS: ['iframe', 'video']
};

module.exports.extract = (event, context, callback) => {
    axios
        .get(event.url)
        .then((response) => {
            // Passing the URL lets JSDOM resolve relative links in the page
            const dom = new JSDOM(response.data, {
                url: event.url
            });
            // parse() returns null when no article content can be found
            const parsed = new Readability(dom.window.document).parse();
            if (!parsed) {
                throw new Error('Readability could not extract an article from ' + event.url);
            }
            console.log('Fetched and parsed ' + event.url + ' successfully');
            return callback(null, {
                statusCode: 200,
                headers: {'Content-Type': 'text/html'},
                body: {
                    url: event.url,
                    content: DOMPurify.sanitize(parsed.content, domPurifyOptions),
                    excerpt: parsed.excerpt || ''
                },
            });
        })
        .catch((error) => {
            console.log(error);
            // Report a failure status and only the error message, not the whole error object
            const response = {
                statusCode: 500,
                body: JSON.stringify({
                    error: 'Error while fetching the content',
                    details: error.message
                }),
            };
            return callback(null, response);
        });
};
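The handler passes event.url straight to the HTTP client, so a missing or malformed URL only surfaces as a fetch error. As a rough sketch, a small guard could reject such input up front. The helper name isFetchableUrl is hypothetical and not part of the original repository; it relies only on the URL class built into Node.js.

```javascript
'use strict';

// Hypothetical helper (not in the original repository): returns true only
// for absolute http(s) URLs, so bad input can be rejected before any request.
function isFetchableUrl(candidate) {
    if (typeof candidate !== 'string') {
        return false;
    }
    try {
        const parsed = new URL(candidate);
        return parsed.protocol === 'http:' || parsed.protocol === 'https:';
    } catch (err) {
        // The URL constructor throws a TypeError on unparseable input
        return false;
    }
}
```

The handler could call this before axios.get and return an error response right away when it yields false.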

You can check the entire source code here: https://github.com/imshashank/article-extract-engine. To deploy it yourself, first clone the repository:

git clone https://github.com/imshashank/article-extract-engine.git

Then cd into the repository:

cd article-extract-engine

Make sure the Serverless Framework is installed and your AWS credentials are set up with the appropriate permissions. To deploy the stack, simply run:

serverless deploy

You should see an output like this:

Serverless: Packaging service...
Serverless: Excluding development dependencies...
Serverless: Creating Stack...
Serverless: Checking Stack create progress...
........
Serverless: Stack create finished...
Serverless: Uploading CloudFormation file to S3...
Serverless: Uploading artifacts...
Serverless: Uploading service article-extract-engine.zip file to S3 (5.58 MB)...
Serverless: Validating template...
Serverless: Updating Stack...
Serverless: Checking Stack update progress...
...............
Serverless: Stack update finished...
Service Information
service: article-extract-engine
stage: dev
region: us-east-1
stack: article-extract-engine-dev
resources: 6
api keys:
  None
endpoints:
functions:
  extract: article-extract-engine-dev-extract
layers:
  None
Serverless: Deprecation warning: Resolution of lambda version hashes was improved with better algorithm, which will be used in next major release.
            Switch to it now by setting "provider.lambdaHashingVersion" to "20201221"
            More Info: https://www.serverless.com/framework/docs/deprecations/#LAMBDA_HASHING_VERSION_V2

Toggle on monitoring with the Serverless Dashboard: run "serverless"

This indicates that the stack was deployed. You should now be able to see a new CloudFormation stack as well as a new AWS Lambda function named “article-extract-engine-dev-extract”.

To invoke the function from the CLI, just run:

serverless invoke --function=extract --log --data '{ "url": "https://system.camp/startups/understanding-kpis-for-mobile-apps-and-how-to-measure-kpis/" }'

Here the URL is passed in the "url" field of the payload. The response contains the cleaned HTML content and an excerpt, followed by the function's log output:

{
    "statusCode": 200,
    "headers": {
        "Content-Type": "text/html"
    },
    "body": {
        "url": "https://system.camp/startups/understanding-kpis-for-mobile-apps-and-how-to-measure-kpis/",
        "content": "<div class=\"page\" id=\"readability-page-1\"><div>\n\t\n\n<p>KPIs are the ultimate indicator for how well you Mobile app is doing. KPI stands for Key Performance Indicator. The first rule of KPIs is that they need to be the “key indicators” of your business model and .......",
        "excerpt": "How to create a financial model for a mobile app? How to measure KPIs? What are KPIs? Learn all this and more..."
    }
}
--------------------------------------------------------------------
START RequestId: 1737de56-085d-4e65-9562-e70d54ef4dd5 Version: $LATEST
2021-09-22 14:38:57.912 (+02:00)	1737de56-085d-4e65-9562-e70d54ef4dd5	INFO	Fetched and parsed https://system.camp/startups/understanding-kpis-for-mobile-apps-and-how-to-measure-kpis/ successfully
END RequestId: 1737de56-085d-4e65-9562-e70d54ef4dd5
REPORT RequestId: 1737de56-085d-4e65-9562-e70d54ef4dd5	Duration: 1353.61 ms	Billed Duration: 1354 ms	Memory Size: 1024 MB	Max Memory Used: 162 MB

You can also invoke the AWS Lambda function using the AWS SDK in any language by passing a payload with a "url" field, like this:

{
  "url": "https://system.camp/startups/understanding-kpis-for-mobile-apps-and-how-to-measure-kpis/"
}
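As an illustration, here is a minimal sketch of such an SDK call from Node.js. It assumes the AWS SDK for JavaScript v3 package @aws-sdk/client-lambda is installed and credentials are configured. The function name and region come from the deploy output shown earlier. The helper names buildInvokeParams, decodeResult and extractArticle are made up for this example.

```javascript
'use strict';

// Function name taken from the deploy output; adjust it for another stage
const FUNCTION_NAME = 'article-extract-engine-dev-extract';

// Build the parameters for a Lambda Invoke call
function buildInvokeParams(url) {
    return {
        FunctionName: FUNCTION_NAME,
        Payload: Buffer.from(JSON.stringify({ url })),
    };
}

// Decode the raw payload bytes returned by Lambda into the handler's result
function decodeResult(payloadBytes) {
    const result = JSON.parse(Buffer.from(payloadBytes).toString('utf8'));
    // On failure the handler returns its body as a JSON string
    if (typeof result.body === 'string') {
        result.body = JSON.parse(result.body);
    }
    return result;
}

// Assumption: @aws-sdk/client-lambda is installed; it is loaded lazily here
// so the helpers above can be used without it
async function extractArticle(url, region = 'us-east-1') {
    const { LambdaClient, InvokeCommand } = require('@aws-sdk/client-lambda');
    const client = new LambdaClient({ region });
    const response = await client.send(new InvokeCommand(buildInvokeParams(url)));
    return decodeResult(response.Payload);
}
```

Calling extractArticle with an article URL should resolve to an object whose body holds either the url, content and excerpt fields or an error description.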

You can also invoke the function through the AWS Lambda console by using the above payload as a test event.

To delete the entire stack, run:

serverless remove

This code was originally part of the Pipfeed app and is now freely available under the Apache License.