{"id":2482,"date":"2021-09-22T13:33:30","date_gmt":"2021-09-22T13:33:30","guid":{"rendered":"https:\/\/system.camp\/?p=2482"},"modified":"2021-09-22T13:33:31","modified_gmt":"2021-09-22T13:33:31","slug":"serverless-article-extract-service-using-node-js-aws-lambda","status":"publish","type":"post","link":"https:\/\/system.camp\/aws\/serverless-article-extract-service-using-node-js-aws-lambda\/","title":{"rendered":"Serverless Article Extract Service using Node.js & AWS Lambda"},"content":{"rendered":"\n

Firefox offers a Reader Mode, and the library that powers it is called Readability. In this article, we will learn how to put this library behind a serverless function and use it as an API.<\/p>\n\n\n\n

You can read more about Readability here: https:\/\/github.com\/mozilla\/readability<\/a>. Its authors describe it as “A standalone version of the readability library used for Firefox Reader View”. The library is licensed under the Apache License 2.0.<\/p>\n\n\n\n

What does it do?<\/h3>\n\n\n\n

The function fetches the entire HTML of a URL and then passes it to the Readability library for parsing. If there is no error, the sanitized article HTML and an excerpt are returned along with the URL.<\/p>\n\n\n\n

The file below is pretty much the entire code for the extractor.<\/a><\/p>\n\n\n\n

'use strict';\n\n\/\/ HTTP client\nconst axios = require('axios').default\n\n\/\/ Readability, dom and dom purify\nconst {JSDOM} = require('jsdom')\nconst { Readability } = require('@mozilla\/readability');\nconst createDOMPurify = require('dompurify')\nconst DOMPurify = createDOMPurify((new JSDOM('')).window)\n\n\/\/ Not too happy to allow iframe, but it's the only way to get youtube videos\nconst domPurifyOptions = {\n    ADD_TAGS: ['iframe', 'video']\n}\n\nmodule.exports.extract = (event, context, callback) => {\n    axios\n        .get(event.url)\n        .then((response) => {\n            \/\/ Passing the URL lets jsdom resolve relative links in the page\n            const dom = new JSDOM(response.data, {\n                url: event.url\n            })\n            \/\/ parse() returns null when no article can be extracted\n            const parsed = new Readability(dom.window.document).parse();\n            if (!parsed) {\n                throw new Error('Could not extract an article from ' + event.url)\n            }\n            console.log('Fetched and parsed ' + event.url + ' successfully')\n            return callback(null, {\n                statusCode: 200,\n                headers: {'Content-Type': 'text\/html'},\n                body: {\n                    url: event.url,\n                    content: DOMPurify.sanitize(parsed.content, domPurifyOptions),\n                    excerpt: parsed.excerpt || ''\n                },\n            });\n        })\n        .catch((error) => {\n            console.log(error)\n            \/\/ Report failures with a non-200 status so callers can detect them\n            const response = {\n                statusCode: 500,\n                body: JSON.stringify({\n                    error: 'Error while fetching the content',\n                    details: error.message\n                }),\n            };\n            return callback(null, response);\n        });\n};\n<\/code><\/pre>\n\n\n\n

You can check the entire source code here: https:\/\/github.com\/imshashank\/article-extract-engine<\/a><\/p>\n\n\n\n

git clone https:\/\/github.com\/imshashank\/article-extract-engine.git<\/code><\/pre>\n\n\n\n

Then cd into the repository:<\/p>\n\n\n\n

cd article-extract-engine<\/code><\/pre>\n\n\n\n

Make sure you have AWS credentials set up and have provided the appropriate permissions. To deploy the stack simply run:<\/p>\n\n\n\n

serverless deploy<\/pre>\n\n\n\n

You should see an output like this:<\/p>\n\n\n\n

Serverless: Packaging service...\nServerless: Excluding development dependencies...\nServerless: Creating Stack...\nServerless: Checking Stack create progress...\n........\nServerless: Stack create finished...\nServerless: Uploading CloudFormation file to S3...\nServerless: Uploading artifacts...\nServerless: Uploading service article-extract-engine.zip file to S3 (5.58 MB)...\nServerless: Validating template...\nServerless: Updating Stack...\nServerless: Checking Stack update progress...\n...............\nServerless: Stack update finished...\nService Information\nservice: article-extract-engine\nstage: dev\nregion: us-east-1\nstack: article-extract-engine-dev\nresources: 6\napi keys:\n  None\nendpoints:\nfunctions:\n  extract: article-extract-engine-dev-extract\nlayers:\n  None\nServerless: Deprecation warning: Resolution of lambda version hashes was improved with better algorithm, which will be used in next major release.\n            Switch to it now by setting \"provider.lambdaHashingVersion\" to \"20201221\"\n            More Info: https:\/\/www.serverless.com\/framework\/docs\/deprecations\/#LAMBDA_HASHING_VERSION_V2\n\nToggle on monitoring with the Serverless Dashboard: run \"serverless\"<\/code><\/pre>\n\n\n\n

This indicates that the stack was deployed, and you should now be able to see a new CloudFormation stack as well as a new AWS Lambda function with the name “article-extract-engine-dev-extract”.<\/p>\n\n\n\n

To invoke the function from the CLI, just run:<\/p>\n\n\n\n

serverless invoke --function=extract --log --data '{ \"url\": \"https:\/\/system.camp\/startups\/understanding-kpis-for-mobile-apps-and-how-to-measure-kpis\/\" }'<\/code><\/pre>\n\n\n\n

Here the URL is passed as the url field of the JSON payload, and you will get a response with the extracted content and an excerpt, followed by the function logs.<\/p>\n\n\n\n

{\n    \"statusCode\": 200,\n    \"headers\": {\n        \"Content-Type\": \"text\/html\"\n    },\n    \"body\": {\n        \"url\": \"https:\/\/system.camp\/startups\/understanding-kpis-for-mobile-apps-and-how-to-measure-kpis\/\",\n        \"content\": \"<div class=\\\"page\\\" id=\\\"readability-page-1\\\"><div>\\n\\t\\n\\n<p>KPIs are the ultimate indicator for how well you Mobile app is doing. KPI stands for Key Performance Indicator. The first rule of KPIs is that they need to be the \u201ckey indicators\u201d of your business model and .......\",\n        \"excerpt\": \"How to create a financial model for a mobile app? How to measure KPIs? What are KPIs? Learn all this and more...\"\n    }\n}\n--------------------------------------------------------------------\nSTART RequestId: 1737de56-085d-4e65-9562-e70d54ef4dd5 Version: $LATEST\n2021-09-22 14:38:57.912 (+02:00)\t1737de56-085d-4e65-9562-e70d54ef4dd5\tINFO\tFetched and parsed https:\/\/system.camp\/startups\/understanding-kpis-for-mobile-apps-and-how-to-measure-kpis\/ successfully\nEND RequestId: 1737de56-085d-4e65-9562-e70d54ef4dd5\nREPORT RequestId: 1737de56-085d-4e65-9562-e70d54ef4dd5\tDuration: 1353.61 ms\tBilled Duration: 1354 ms\tMemory Size: 1024 MB\tMax Memory Used: 162 MB<\/code><\/pre>\n\n\n\n
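Note that the handler returns the body as a plain object on success but as a JSON string on failure. A caller may want to normalize the two cases. The snippet below is a minimal sketch of that idea; the helper name readExtractResponse is hypothetical and not part of the repository.

```javascript
'use strict';

// Hypothetical helper (not part of the repository): normalizes the
// response returned by the extract function.
// On success the handler returns `body` as an object; on failure it
// returns `body` as a JSON string with an `error` field.
function readExtractResponse(response) {
    const body = typeof response.body === 'string'
        ? JSON.parse(response.body)
        : response.body;

    if (!body || body.error) {
        // Surface the handler's error message to the caller
        throw new Error(body && body.error ? body.error : 'Empty response body');
    }

    return {
        url: body.url,
        content: body.content,
        excerpt: body.excerpt || '',
    };
}
```

With this in place, a caller can use the returned content and excerpt directly and handle failures with a regular try/catch.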

You can also invoke the AWS Lambda function using the AWS SDK in any language by passing a payload with a url field like this:<\/p>\n\n\n\n

{\n  \"url\": \"https:\/\/system.camp\/startups\/understanding-kpis-for-mobile-apps-and-how-to-measure-kpis\/\"\n}<\/code><\/pre>\n\n\n\n
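As an illustration, here is a rough sketch of such a call from Node.js. It assumes the AWS SDK for JavaScript v3 package @aws-sdk/client-lambda is installed and that the function was deployed to us-east-1 under the name shown in the deploy output above. The helper names buildInvokeParams, decodePayload and extractArticle are made up for this example.

```javascript
'use strict';

// Build the parameters for a synchronous Lambda invocation.
// The payload is the same JSON document shown above: { "url": "..." }
function buildInvokeParams(functionName, articleUrl) {
    return {
        FunctionName: functionName,
        InvocationType: 'RequestResponse',
        Payload: Buffer.from(JSON.stringify({ url: articleUrl })),
    };
}

// Lambda returns the result as raw bytes; decode them back into an object
function decodePayload(payload) {
    return JSON.parse(Buffer.from(payload).toString('utf8'));
}

// Assumption: function name and region are taken from the deploy output;
// adjust them to match your own stack.
async function extractArticle(articleUrl) {
    // Loaded lazily so the helpers above work without the SDK installed
    const { LambdaClient, InvokeCommand } = require('@aws-sdk/client-lambda');
    const client = new LambdaClient({ region: 'us-east-1' });
    const params = buildInvokeParams('article-extract-engine-dev-extract', articleUrl);
    const result = await client.send(new InvokeCommand(params));
    return decodePayload(result.Payload);
}
```

Calling extractArticle with an article URL should resolve to the same response object that the CLI prints, provided your credentials allow invoking the function.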

You can also invoke the function through the AWS Lambda console by passing the above payload as a test event.<\/p>\n\n\n\n

To delete the entire stack, run:<\/p>\n\n\n\n

serverless remove<\/code><\/pre>\n\n\n\n

This code was originally part of the Pipfeed app and is now available for free under the Apache License.<\/p>\n","protected":false},"excerpt":{"rendered":"

Firefox offers a Reader Mode and the library that powers it, is called Readability. In this article, we will learn how to put this library behind a serverless function and use it as an API. You can read more about readability here: https:\/\/github.com\/mozilla\/readability. It is…<\/p>\n","protected":false},"author":1,"featured_media":2483,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_mi_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0},"categories":[41,64,73,34,35],"tags":[10,26,18,74],"aioseo_notices":[],"_links":{"self":[{"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/posts\/2482"}],"collection":[{"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/comments?post=2482"}],"version-history":[{"count":1,"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/posts\/2482\/revisions"}],"predecessor-version":[{"id":2484,"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/posts\/2482\/revisions\/2484"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/media\/2483"}],"wp:attachment":[{"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/media?parent=2482"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/categories?post=2482"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/system.camp\/wp-json\/wp\/v2\/tags?post=2482"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}