Andre's Blog
Perfection is when there is nothing left to take away
From ASP.Net to Node.js

I originally wrote this blogs application in 2008 in ASP/JScript, thinking that a JavaScript-like language would age better than VBScript, but soon realized that while that might be true for the language, classic ASP itself didn't have a lot of life left in it. This prompted me to rewrite the blogs in ASP.Net/JScript in 2009. This time I thought my choice of the framework was quite clever and would surely outlast my needs for a blog.

ASP.Net has indeed done remarkably well since 2009, but JScript didn't do nearly as well and Microsoft quietly dropped it from the platform at some point, so my choice of JavaScript as the server-side language for my blogs needed another revision. Needless to say, Node.js was really the only choice to consider, so it was an easy decision.

SQL vs. NoSQL

The original application was working against a MySQL database running on a local VM, so all operations against this database were incredibly fast, even though I crammed it in a micro instance. For the new application I decided to switch it up a bit and run it against a hosted Mongo DB instance, in hopes that what I would lose on a slower network link would be recovered by faster database operations.

Database schema conversion was fairly straightforward, but queries got pretty convoluted quite fast because Mongo DB does not have a straightforward 1-to-1 join syntax and such joins against multiple collections require some aggregation pipeline trickery to mold result arrays into usable application data.

For example, I keep categories in a separate collection and join blog posts and categories to construct a post with a text category name. In SQL this query would be as simple as the following.

select posts.post, categories.category, posts.category_id
  from posts join categories on
    posts.category_id = categories.category_id
  where posts.post_id = ? and posts.blog_id = ?

The exact same result would be generated by this aggregation pipeline in Mongo DB:

[
    {$match: {
        post_id: post_id,
        blog_id: blog_id
    }},
    {$lookup: {
        from: "categories",
        let: {category_id: "$category_id"},
        pipeline: [
            {$match: {
                $expr: {$eq: ["$category_id", "$$category_id"]}
            }},
            {$project: {
                _id: 0,
                name: "$category"
            }}
        ],
        as: "category"
    }},
    {$project: {
        category_id: 1,
        post: 1,
        category: {$arrayElemAt: ["$category", 0]}
    }}
]

That's a lot of aggregation magic for a simple 1-to-1 join, just to populate post.category.name.

Typical NoSQL advice for this type of query is to include the category as text in the post document. Doing so, however, would not only make it harder to tell how many categories there are in the system, but would also require scanning all documents and updating all matching ones whenever a category needs to be renamed. Imagine this being done not in a simple blogs application, but against an online store database with hundreds of thousands of products that need to be updated from Clothing to Apparel.

I created a suggestion for Mongo DB to make 1-to-1 joins simpler, in hopes that maybe one day some Product Owner gives it a thought.

https://feedback.mongodb.com/forums/924280-database/suggestions/44143983-provide-straightforward-syntax-for-1-to-1-joins-in

Another problematic area for me was that Mongo DB transactions are not really transparent, like they are in any SQL database, where one connection acquires various locks while running database operations, and other connections block until locked rows, pages, etc., become available.

Mongo DB sort of fakes lock waits by retrying the same operation a few times with a cadence of delays, and then fails and rolls back everything that was done up to that point. Some drivers even automate this by having the application register a callback that will rerun all statements in a transaction.
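The retry behavior described above can be pictured with a generic sketch like the one below. It is not how the Mongo DB driver is implemented; the function name, the delay values and the error handling are assumptions made only to illustrate retrying with a cadence of delays before giving up.

```typescript
// Generic illustration only; all names here are hypothetical.
const sleep = (ms: number): Promise<void> =>
    new Promise((resolve) => setTimeout(resolve, ms));

// Runs the operation once and then once more after each configured delay.
// If every attempt fails, the last error is rethrown to the caller.
async function retryWithDelays<T>(
    operation: () => Promise<T>,
    delaysMs: number[]
): Promise<T> {
    let lastError: unknown;

    for (let attempt = 0; attempt <= delaysMs.length; attempt++) {
        try {
            return await operation();
        } catch (error) {
            lastError = error;
            // wait before the next attempt, if there is one left
            if (attempt < delaysMs.length) {
                await sleep(delaysMs[attempt]);
            }
        }
    }

    // at this point the caller would roll back whatever was done so far
    throw lastError;
}
```

A caller would pass the whole unit of work as the operation, which is why every statement in it has to be safe to run more than once.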

For this simple application I decided to avoid transactions altogether and manually fix whatever data integrity violations I run into, but for a real application this would be a bit of a challenge to solve.

Auto-Incremented Fields

SQL databases offer a couple of ways to maintain keys - numeric auto-increment columns and UUID columns. The latter is similar to Mongo DB's _id field typed as ObjectId, but there is nothing available that would resemble auto-incremented numeric values.

I did switch some auto-increment identifiers to ObjectId, but wanted to keep post and category identifiers as numeric values for usability and it took some thinking to figure out how to increment each safely for concurrent requests. I ended up keeping the largest value for each field in a blog document and used findOneAndUpdate with the option to return an incremented field value after a modification, which got me a unique value for each concurrent query.
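A rough sketch of this counter idea is shown below. The field names, the interface and the in-memory stand-in are hypothetical and exist only to make the example runnable without a database; it is not the code from the blogs application.

```typescript
// Hypothetical counter fields kept in a blog document.
type CounterField = "last_post_id" | "last_category_id";

interface BlogCounters {
    // Atomically adds 1 to the field and resolves with the value after the update.
    // With the Node driver 4.x or later this is assumed to map to something like
    //   blogs.findOneAndUpdate({blog_id}, {$inc: {[field]: 1}}, {returnDocument: "after"})
    // Option names and the result shape differ between driver versions.
    incrementAndGet(blogId: number, field: CounterField): Promise<number>;
}

// In-memory stand-in for the database, for illustration only.
class InMemoryCounters implements BlogCounters {
    private readonly values = new Map<string, number>();

    async incrementAndGet(blogId: number, field: CounterField): Promise<number> {
        const key = `${blogId}:${field}`;
        // the read and the write happen in one step, with no await in between
        const next = (this.values.get(key) ?? 0) + 1;
        this.values.set(key, next);
        return next;
    }
}

// Each caller gets its own value, even when requests overlap.
async function nextPostId(counters: BlogCounters, blogId: number): Promise<number> {
    return counters.incrementAndGet(blogId, "last_post_id");
}
```

The important part is that the increment and the read of the new value are a single operation on the database side, so two concurrent requests can never observe the same number.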

Single-Page vs. Multi-Page Application

Single-Page Applications (SPA) have been all the rage in recent years and I gave a good deal of thought to whether I wanted to keep the blogs as a collection of pages or to embrace the new SPA approach, and decided to stay in the Multi-Page Application (MPA) camp.

My main reason was that I don't like those pages that keep jumping in front of you as their REST API requests flood the server trying to get all resources at once, as opposed to being able to construct most of the page on the server via local interfaces, and then make asynchronous requests in the background to update real-time page content, as needed.

I also never was quite thrilled about URLs being manipulated in the browser to match the active SPA view. I see it as a better approach for multiple pages to be identified by their specific URLs and having dynamic content to reflect live changes for the currently rendered content.

I chose Pug and EJS for my view template engines. Pug was new to me and I wanted to try it out, and it turned out to be a nice clean way to maintain HTML. EJS worked out better for XML content, so XML structure is more visible, and it also defaults to XML encoding in its output tags, which makes generating XML output simpler.

I particularly liked that Pug templates not only support the usual includes, but also provide a more elaborate way to extend base templates and maintain replaceable blocks, with a fallback base block, similar to how the virtual function mechanism works.

Express vs. NestJS

When I started thinking about Node.js as the next platform for the blogs application, I had Express and NestJS in mind to choose from.

NestJS implements a web application framework on top of a base framework, such as Express, and adds a lot of functionality in a TypeScript friendly way, so it is easier to pass application-specific classes through the framework, compared to middleware in Express just jamming application-level constructs into Express objects.

For example, various authentication implementations in Express add req.user to keep track of the authenticated user, but there is no way to reference it cleanly in TypeScript, so a cast is required to obtain an authentication-specific reference to a user object. The same can be said about any values added to the request object by middleware components.
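The cast in question looks roughly like the sketch below. The request and user types are stand-ins defined here to keep the example self-contained; they are not the actual Express typings, and the helper names are hypothetical.

```typescript
// Stand-in for a framework request object.
interface BaseRequest {
    path: string;
    // authentication middleware attaches a loosely typed value here
    user?: unknown;
}

// Application-specific user type (hypothetical).
interface BlogUser {
    userId: string;
    displayName: string;
}

// Runtime check that the attached value has the expected shape.
function isBlogUser(value: unknown): value is BlogUser {
    return typeof value === "object" && value !== null &&
        typeof (value as BlogUser).userId === "string" &&
        typeof (value as BlogUser).displayName === "string";
}

// Keeps the type narrowing in one place, so route handlers don't repeat it.
function getBlogUser(req: BaseRequest): BlogUser | undefined {
    return isBlogUser(req.user) ? req.user : undefined;
}
```

Every value added to the request by middleware would need a similar helper or an explicit cast at the point of use.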

NestJS allows TypeScript decorators in route handler method parameters, which keeps these casts hidden behind decorators, and, in general, is structured to keep application implementation within the framework of modules, controllers and services, all wired together via instantiate/import/export declarations and dependency injection.

After careful consideration, however, I still ended up using Express for a couple of reasons.

NestJS didn't quite fit the service structure I wanted because it relies too heavily on dependency injection. This makes the structure of application components less descriptive than I would like, because any controller or service constructor can get any available class just by listing it as a parameter, and the dependency injector just goes ahead and wires them together, whether it fits the application design or not. I wanted to structure my routers and services in a more traditional way, similar to how an OS service application would be designed, which guarantees all dependencies and the start-up and shutdown order, and improves maintainability via predictable service callbacks, such as on-start, on-idle, on-stop and on-clean-up.
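One way to picture this kind of structure is the minimal sketch below, in which services are listed explicitly and are started and stopped in a fixed order. The interface and class names are hypothetical, only the on-start and on-stop callbacks are covered, and it is not the code used in the blogs application.

```typescript
// A service with predictable life cycle callbacks (names are hypothetical).
interface Service {
    name: string;
    onStart(): Promise<void>;
    onStop(): Promise<void>;
}

class ServiceHost {
    private readonly services: Service[];
    private readonly started: Service[] = [];

    // services are listed in dependency order, e.g. database before HTTP
    constructor(services: Service[]) {
        this.services = services;
    }

    // starts services one at a time, in the listed order
    async start(): Promise<void> {
        for (const service of this.services) {
            await service.onStart();
            this.started.push(service);
        }
    }

    // stops only the services that were started, in reverse order
    async stop(): Promise<void> {
        while (this.started.length > 0) {
            const service = this.started.pop()!;
            await service.onStop();
        }
    }
}
```

Because the order is spelled out in one place, a reader can tell which service depends on which without tracing constructor parameters.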

Another problem with NestJS for me was their attitude towards user-reported issues. For example, when NestJS handles authentication failures, it drops all application-provided information and responds with a malformed 401 response lacking the mandatory WWW-Authenticate header. I logged an issue against NestJS, which got immediately closed with commenting disabled.

https://github.com/nestjs/nest/issues/7011

It's absolutely fine for a maintainer to choose not to fix something based on their vision of the project, but cutting a conversation like this was a red flag for me. NestJS is a very capable framework, no question about it, but it just wasn't for me.

All in all, I see Express as a framework that enables creativity by not forcing users into a particular application structure. I especially like chainable routers, which are well thought-through and keep area-specific application routers reasonably isolated, but still allow subordinate routers to access parent route parameters. For example, a blog post router handling path /post/25 chained into a blog router handling path /blog/1/ within URI path /blog/1/post/25 is able to obtain the blog number from the parent part of the path. Otherwise, each of these routers is completely unaware of one another.

Performance

The original application was designed and implemented based on the assumption that the database access will be very quick, which allowed me to avoid having to cache various application entities. I followed the same practice with this application and also implemented concurrent database queries to run in parallel via Promise.all and, based on my experience with Mongo DB deployed on a local server, I had high hopes for response times being just as fast as in the ASP.Net application running against a local MySQL database.
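Running independent queries in parallel via Promise.all looks roughly like this. The query functions and their timings are made up and are simulated with timers, so the sketch runs without a database.

```typescript
interface Post { postId: number; title: string; }
interface PostComment { commentId: number; text: string; }

// resolves with the given value after a delay, standing in for a network call
const delayed = <T>(value: T, ms: number): Promise<T> =>
    new Promise((resolve) => setTimeout(() => resolve(value), ms));

// Hypothetical queries; a real application would call the database here.
function findPost(postId: number): Promise<Post> {
    return delayed({ postId, title: "From ASP.Net to Node.js" }, 20);
}

function findComments(postId: number): Promise<PostComment[]> {
    return delayed([{ commentId: 1, text: `comment on post ${postId}` }], 20);
}

async function loadPostPage(postId: number): Promise<{ post: Post; comments: PostComment[] }> {
    // both queries are started before either one is awaited, so the total
    // wait is close to the slower query rather than the sum of the two
    const [post, comments] = await Promise.all([
        findPost(postId),
        findComments(postId)
    ]);
    return { post, comments };
}
```

This only helps for queries that do not depend on each other's results; a query that needs the output of another one still has to wait for it.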

Much to my surprise, once I got the first pages working against the hosted database, I observed the response time at a jaw-dropping one second or more for some pages. I did expect some drop in performance with Mongo DB being hosted on a different network and requiring a few additional queries to cover for the lack of SQL-like joins, but this was worse than I could ever imagine.

Connection Pool

I spent some time capturing and looking at CPU profiling data and realized that the Mongo DB driver v3.7.3 for Node does not have a connection pool built in. I worked quite a bit with a C driver for Mongo DB and it comes with a dedicated connection pool, so not seeing one in the interface I assumed there was one behind the scenes, and it was just a bad assumption. The new version of the Node driver has a connection pool class, but for the version I was using I needed to roll my own, which is what I did.
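The idea of a hand-rolled pool can be sketched generically as below. This is not the code used in the blogs application and does not use any Mongo DB driver API; the class name, the create callback and the size limit are all assumptions made for illustration.

```typescript
// A small pool that reuses up to maxSize resources (generic illustration).
class SimplePool<T> {
    private readonly idle: T[] = [];
    private readonly waiters: Array<(resource: T) => void> = [];
    private readonly create: () => Promise<T>;
    private readonly maxSize: number;
    private created = 0;

    constructor(create: () => Promise<T>, maxSize: number) {
        this.create = create;
        this.maxSize = maxSize;
    }

    async acquire(): Promise<T> {
        // reuse an idle resource if there is one
        const resource = this.idle.pop();
        if (resource !== undefined) return resource;

        // otherwise create a new one while under the limit
        if (this.created < this.maxSize) {
            this.created++;
            try {
                return await this.create();
            } catch (error) {
                this.created--;
                throw error;
            }
        }

        // at the limit: wait until some other caller releases a resource
        return new Promise<T>((resolve) => this.waiters.push(resolve));
    }

    release(resource: T): void {
        // hand the resource to the oldest waiter, or park it as idle
        const waiter = this.waiters.shift();
        if (waiter) waiter(resource);
        else this.idle.push(resource);
    }

    // runs the work with a pooled resource and always returns it to the pool
    async use<R>(work: (resource: T) => Promise<R>): Promise<R> {
        const resource = await this.acquire();
        try {
            return await work(resource);
        } finally {
            this.release(resource);
        }
    }
}
```

A production pool would also need to deal with broken connections, idle timeouts and shutdown, none of which is covered here.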

What a difference avoiding extra connections makes - my one-second response times dropped to about 300-500 milliseconds. This was great, but I still wanted better performance, so I started looking into conditional HTTP requests based on either ETag or last modification time.

ETag

I quickly ruled out ETag because there was no way to track blog posts along with their comments with a consistent value for an ETag. I also didn't want to implement the commonly misused approach of returning a response hash, because one would need to generate a response to get the hash, which defeats the purpose of avoiding generating the whole page for caching. So I opted for the last modification time validation approach.

Much to my surprise, even though I didn't set ETag, I still could see the ETag headers in the network traffic. Looking closer into this revealed that Express is doing the very thing I was trying to avoid: it hashes the response and drops it if the hash matches the one in the request. Talk about doing steps just for the sake of doing steps. If it takes 500 ms to generate a response and some time for a round trip HTTP request/response to communicate freshness validation information, dropping a compressed 1K response makes cached responses just as slow as full responses or even slower, as hashing is not cheap.

Fortunately, there is a setting in Express to disable ETag generation, which is turned off by calling app.set("etag", false). The bad news, however, is that Express does the same thing with last modification time freshness validation, except that there is no way to turn it off.

Last Modification Time

Response freshness based on last modification time for complex resources may not be as simple to compute as when one serves a file, because such resources may have multiple components.

For example, a blog post has creation and modification times, just like a file, but a blog post page also has comments, which can be modified or deleted, so the modification time of a blog post page needs to be computed with all of these time stamps being considered. Things get even more complex when showing a list of posts for a blog, which becomes quite a challenge if one wants to figure out whether blog post modifications are visible in excerpts or not, as it would affect freshness calculations.
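For the simpler case of a single blog post page, the calculation could look like the sketch below. The field and function names are hypothetical, and deleted comments are not covered, since a deletion would need its own time stamp recorded somewhere, such as on the post itself.

```typescript
// Hypothetical shape shared by posts and comments.
interface Timestamped {
    createdAt: Date;
    modifiedAt?: Date;
}

// The page is as fresh as its most recently changed component.
function pageLastModified(post: Timestamped, comments: Timestamped[]): Date {
    let latest = (post.modifiedAt ?? post.createdAt).getTime();

    for (const comment of comments) {
        const changed = (comment.modifiedAt ?? comment.createdAt).getTime();
        latest = Math.max(latest, changed);
    }

    // HTTP dates have one-second resolution, so milliseconds are dropped
    return new Date(Math.floor(latest / 1000) * 1000);
}

// True if the client's copy, dated by If-Modified-Since, is still current.
function isNotModified(ifModifiedSince: string | undefined, lastModified: Date): boolean {
    if (!ifModifiedSince) return false;
    const since = Date.parse(ifModifiedSince);
    // an unparsable header is treated as if it was not sent
    return !Number.isNaN(since) && lastModified.getTime() <= since;
}
```

The value is in computing the time stamp from a couple of cheap queries, so that a 304 response can be sent without rendering the page at all.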

Some of these calculations are too complex to pursue, and after testing the readily available time stamps it would be a reasonable compromise to send back a full response, even though it could have the same modification time, so there is no need to go through response details to obtain a more precise modification time. Doing so doesn't violate the HTTP RFC, even though it does go against the recommended implementation:

The origin server SHOULD NOT perform the requested
method if the selected representation's last modification date is
earlier than or equal to the date provided in the field-value;
instead, the origin server SHOULD generate a 304 (Not Modified)
response, including only those metadata that are useful for
identifying or updating a previously cached response.

https://datatracker.ietf.org/doc/html/rfc7232#section-3.3

Problem is, Express gets in the way and drops the response based on just the last modification time and there was no way to disable this behavior, even though it's the exact same logic as what the ETag setting controls. I created an issue for Express, in hopes that a similar setting would be added for checking response modification time:

https://github.com/expressjs/express/issues/4753

Despite the good initial discussion, I didn't manage to communicate that Express is not a black-box web server, but rather a framework upon which web applications are built. Maintainers of those applications interact with HTTP just as much as Express does and may need to implement more sophisticated caching than what is available out of the box, so being able to turn off some of the built-in features is a very important part of a usable framework.

This behavior remains broken and 304 responses may take as long as full responses because Express may ignore response data, but when modification time caching works as intended, responses are now returned within the 100 millisecond range.

As the last touch, I needed to add max-age=0, must-revalidate into Cache-Control to ensure browsers always submit HTTP requests to validate the last modification time instead of using some caching heuristics.
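The response headers for this validation scheme could be put together along these lines. The function name is hypothetical and the result is a plain object, so it is not tied to any framework.

```typescript
// Builds the headers that make browsers revalidate on every request.
function validationHeaders(lastModified: Date): Record<string, string> {
    return {
        // toUTCString produces the HTTP date format, e.g. Tue, 15 Nov 1994 08:12:31 GMT
        "Last-Modified": lastModified.toUTCString(),
        // the cached copy is stale immediately and must be checked with the server
        "Cache-Control": "max-age=0, must-revalidate"
    };
}
```

With these headers in place, the browser is expected to send If-Modified-Since on each request and reuse its cached copy only after a 304 response.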

Package Management

I used to deploy the ASP.Net blogs application as a ZIP package, which was expanded and copied into place via a deployment script. Ironically, despite a very robust packaging system, Node applications intended to run as top-level local applications cannot be deployed via npm and still require archiving tools to unpack them in-place and deployment scripts to set them up.

A top-level local application is not the same as a global application, although the two are commonly confused in discussions for reasons that are not quite clear to me. A global application is installed as a system-wide utility, such as the TypeScript compiler or the PM2 process manager, so it can be run from the command line as tsc or pm2. A local application is deployed into an arbitrary directory, and multiple instances of a local application can be installed at the same time and run in parallel with different configurations.

For example, I could install and run an application called abc on different ports like this.

mkdir app1 && tar xzf abc-1.2.3.tgz -C app1 --strip-components=1
mkdir app2 && tar xzf abc-1.2.3.tgz -C app2 --strip-components=1
app1/devops/make-config port=3010
app2/devops/make-config port=3020
pm2 start app1/server.js --name app1
pm2 start app2/server.js --name app2

I even tried to make a feature suggestion for npm, but ran into the same global vs. local discussion.

https://github.com/npm/feedback/discussions/267

For some reason people can't seem to understand the difference between system-wide package managers, such as APT and DNF, and project-level package managers, such as npm and NuGet. Granted, npm can install scripts globally as well as locally, but most application packages running as web servers would be installed as top-level local applications either via Git clone or from an archive.

Ironically, npm will even create a bogus top-level local application for you if you run something like npm install lodash in an empty directory. It just escapes me why there wouldn't be a simple syntax along the lines of npm install --local-app myapp to grab an application package from a repository, install its dependencies and run all the usual package life cycle scripts.

Another packaging problem I needed to solve was to be able to consume sub-project type npm packages in a way that was quick to update and test within the main project. I looked at npm workspaces and while they looked promising at first, I concluded that juggling symbolic links, bundled dependencies and running npm commands selectively against one or more workspaces wasn't what I was looking for in sub-project management.

What I wanted was to install a sub-project the same way locally and in production, which would not require any source changes even if I chose to start using a build-aware package repository, like Artifactory, at some later time.

In the end I set up a build script in a sub-project that would create a package named to include a sequential build number in the parent directory of my main project, which I would install as a file via npm. This worked quite nicely not only in development, but also in production deployments because npm handled file references in package.json without complaining about build numbers.

For example, one of the sub-projects used in this app, let's call it webapp-framework for the purposes of this example, would be developed in its own arbitrary directory. When any changes were made in the sub-project, I would run a build script from webapp-framework, which ran the TypeScript compiler, incremented a sequential build number, made an npm package, renamed it to include the build number and copied the package to the parent directory of the main application. All that needed to be done then was to remove the previous version of that sub-project and install the new build, like this:

npm remove webapp-framework
npm install ../webapp-framework-0.3.0+25.tgz
tsc
node app.js

This made it easy to keep both projects in sync and keep track of all builds, along with build history capturing commit identifiers and subjects. The sub-project package installed this way was recorded in package.json as a file dependency, so in order to deploy this in production I just needed to copy all packages into the same directory above the intended application directory and run the installation script.

With all these bits combined, my deployment script can effectively be distilled to this sequence of commands.

pm2 stop blogs

rm -rf "$APP_DIR"
mkdir -p "$APP_DIR"

tar -xzf "blogs-1.2.3+$BUILD_NUMBER.tgz" -C "$APP_DIR" --strip-components=1

# install dependencies in a subshell, so the current directory is unchanged
(cd "$APP_DIR" && npm install)

pm2 start "$APP_DIR/app.js" --name blogs

Once new functionality is verified to be working as intended, the commit associated with the deployed build number is tagged as a release in the Git repository and all version strings are changed to the next version in the source.

Backward Compatibility

One more thing I wanted to do was to make sure that some of the most important ASP.Net URLs would redirect to the new URLs, and to do it without having to litter application source with ASP.Net URI paths. Naturally, I started looking into Nginx rewrite syntax, and because query arguments could be arranged in any order, it took me some time to figure out how to structure rewrite rules.

Here's the final syntax I ended up with for one of the scripts, to keep it short.

location = /showblog.aspx {
    # an exact-match path may be changed immediately
    rewrite . /;

    # capture all arguments before query is reset
    set $cid $arg_cid;
    set $mid $arg_mid;
    set $uid $arg_uid;

    # start fresh with an empty query
    set $args '';

    # append all non empty arguments
    if ($uid != '') {
      rewrite . /$uid/;
    }
    if ($cid != '') {
      rewrite . $uri?cid=$cid;
    }
    if ($mid != '') {
      rewrite . $uri?mid=$mid;
    }

    return 301 https://$host$uri$is_args$args;
}
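To make the intent of these rules easier to follow, the same mapping is restated below as a small function. It is only an illustration of what the redirect is meant to produce, not something Nginx runs; the function name is hypothetical, the order of query arguments may differ from what Nginx emits, and values are decoded and re-encoded here, whereas Nginx passes them through as received.

```typescript
// Maps an old /showblog.aspx URL to the new URL structure.
function redirectShowBlog(requestUrl: string): string {
    const url = new URL(requestUrl);

    // capture arguments regardless of the order in which they appear
    const uid = url.searchParams.get("uid") ?? "";
    const cid = url.searchParams.get("cid") ?? "";
    const mid = url.searchParams.get("mid") ?? "";

    // the user identifier moves from the query into the path
    const path = uid !== "" ? `/${uid}/` : "/";

    // the remaining non-empty arguments stay in the query
    const query = new URLSearchParams();
    if (cid !== "") query.set("cid", cid);
    if (mid !== "") query.set("mid", mid);

    const queryString = query.toString();
    return `https://${url.host}${path}${queryString !== "" ? "?" + queryString : ""}`;
}
```

For instance, a request for /showblog.aspx?mid=7&uid=andre&cid=3 on example.com maps to https://example.com/andre/?cid=3&mid=7 in this sketch.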

Case Sensitivity

One thing that completely escaped me when I deployed this application on Linux was that Linux file systems are case-sensitive and it took me a couple of weeks before I started noticing 404 errors piling up. Turns out that I had a mix of .JPG vs. .jpg and other similar character case mismatches between HTML and referenced images. Something to keep in mind when migrating from an NTFS to a Linux file system or to AWS S3.
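A check for this kind of mismatch could be as simple as the sketch below, which compares paths referenced in HTML against the names that actually exist in storage. The function name and the way both lists are obtained are assumptions; collecting them from markup and from the file system is left out.

```typescript
interface CaseMismatch {
    referenced: string;
    actual: string;
}

// Finds references that match an existing file only if character case is ignored.
function findCaseMismatches(referenced: string[], actual: string[]): CaseMismatch[] {
    const exact = new Set(actual);
    const byLowerCase = new Map<string, string>();
    for (const name of actual) {
        byLowerCase.set(name.toLowerCase(), name);
    }

    const mismatches: CaseMismatch[] = [];
    for (const reference of referenced) {
        // an exact match works on any file system
        if (exact.has(reference)) continue;

        const candidate = byLowerCase.get(reference.toLowerCase());
        if (candidate !== undefined) {
            mismatches.push({ referenced: reference, actual: candidate });
        }
    }
    return mismatches;
}
```

Running something like this before a migration would surface the references that work on a case-insensitive file system and turn into 404 errors on a case-sensitive one.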

Conclusion

It took me a couple of weeks to do most of the rewrite for this blogs application, about a week to fix various bugs after deploying it and a few evenings throughout this year of preparatory work to figure out how to structure components and write various bits and pieces. The initial blog in classic ASP took me just a few days to implement, but then again, it was about a thousand lines of code and markup (I wish I had kept better code history back then). The current application has 5189 lines of code and markup at the time of this writing. For comparison, the last line count for the ASP.Net application is about 3449 lines of code and markup.

Node.js is pretty awesome, with functionality gems ranging from tiny components to giant frameworks maintained by talented individual developers and development teams. Granted, Node deployments do look like a patchwork quilt of a myriad of packages glued together, compared to more monolithic applications, like those built on ASP.Net, but that patchwork quilt tends to be less expensive to implement and run, at least initially, before one finds themselves in need of replacing some of the components that are no longer maintained.

Moving from Windows running IIS/ASP.Net/MySQL to Linux running Node.js with a hosted Mongo database gave me about 30% savings in AWS. This alone is quite nice, but what is even more remarkable is that with the selection of readily available Node.js packages I was able to implement in days some of the features that I have been putting off for years because I would need to write wrappers for low-level libraries to use them within ASP.

Going forward, I do anticipate that maintenance effort will be higher, compared to working with software developed by larger vendors, but considering that all my previous attempts to write an application in a way that would make it trivial to migrate to the next platform didn't quite work out the way I hoped, maybe this easier-to-start approach will serve me better. Time will tell.
