#+TITLE: (New) Static Site Generation #+DESCRIPTION: Migration from Hugo to org-mode, some scripts, and pandoc #+TAGS: Emacs #+TAGS: Org-mode #+TAGS: GNU/Linux #+TAGS: Bash #+TAGS: Make #+TAGS: Pandoc #+DATE: 2019-03-03 #+SLUG: static-site-generation #+LINK: blog-git https://git.devnulllabs.io/blog.kennyballou.com.git/ #+LINK: golang https://golang.org #+LINK: hugo https://gohugo.io/ #+LINK: wiki-markdown https://en.wikipedia.org/wiki/Markdown #+LINK: org-mode https://org-mode.org #+LINK: org-manual https://orgmode.org/manual/ #+LINK: org-mode-publish https://orgmode.org/manual/Publishing.html#Publishing #+LINK: wiki-rst https://en.wikipedia.org/wiki/ReStructuredText #+LINK: justin-abrah-org-publish https://justin.abrah.ms/emacs/orgmode_static_site_generator.html #+LINK: panchekha-org-publish https://pavpanchekha.com/blog/org-mode-publish.html #+LINK: evenchick-org-publish https://www.evenchick.com/blog/blogging-with-org-mode.html #+LINK: ogbe-org-publish https://ogbe.net/blog/blogging_with_org.html #+LINK: pandoc https://pandoc.org #+BEGIN_PREVIEW For a few years, I've been using [[hugo][Hugo]] for blog generation. Recently, I've decided I wanted to take static site generation into a different direction. Specifically, I wanted to use a different source markup and I wanted to write my own tool set for generating the actual HTML. #+END_PREVIEW We'll walk through the motivation of changing the content into a different format and changing the generation process into a completely custom set of scripts. ** Motivation When I first set down to build a blog 5 years ago, I had a pretty basic set of requirements. - Lightweight - Native Markdown Support - Minimal Dependencies [[hugo][Hugo]] met all of these requirements quite well. The templating engine is fairly simplistic; It supports Markdown; It's a written in [[golang][Go]], therefore, only the built artifact is necessary for site generation. If it fits so well, why change? It worked well for what I was asking, however, as I wrote more and time went on, the features of [[hugo][Hugo]] became more and more complex and created a mismatch of how I wanted to express the text in the markup. I've felt this was going in a direction I did not need nor wanted. More specifically, the issues start surfacing more in the [[wiki-markdown][Markdown]] side. [[wiki-markdown][Markdown]] simply lacks some features in its markup that is corrected by blocks of HTML inline and other hacks that are standard in only some specific "flavor" or translator implementation. [[hugo][Hugo]] attempts to patch this over with its implementation of ~shortcode templates~, however, these still felt unnatural. The final nail was discovering [[org-mode][Org-mode]]. I liked the weight of [[wiki-markdown][Markdown]], but I didn't like its lack of features when needed. I liked [[wiki-rst][rStructured Text]], however, I felt it was always too heavy for the documents I was working on. In finally giving Emacs a full try (another blog post), I discovered [[org-mode][Org-mode]]. It was the exact middle weight I was looking for between [[wiki-markdown][Markdown]] and [[wiki-rst][reST]]. Thus, I was in search of a new tool for generating static HTML from a set of source files written in [[org-mode][Org-mode]]. [[org-mode][Org-mode]] (within Emacs) has a native publish mode, and I had discovered [[justin-abrah-org-publish][several]] [[panchekha-org-publish][posts]] on how [[ogbe-org-publish][people]] are doing [[evenchick-org-publish][exactly]] this. However, [[org-mode-publish][Org-publish]] isn't exactly what I was looking for. Therefore, let's revise the current list of requirements: - Makefile driven - Does not require Emacs to generate That is, I wanted a ~Makefile~ that could generate the site contents. Furthermore, and no less importantly, I wanted the ~Makefile~ to not include lines like ~Emacs --quick --batch ...~. This obviously creates a bit of a challenge since [[org-mode][Org-mode]] is an Emacs mode. I decided I could probably generate the content myself with a few scripts and invocations to [[pandoc][Pandoc]]. ** High-Level Implementation The core of the implementation of the new site generation is blog posts written in [[org-mode][Org-mode]], processed by several shell scripts, using [[pandoc][Pandoc]] to perform the translation from raw [[org-mode][Org-mode]] markup to HTML, all of which is orchestrated by a ~Makefile~. I'm not going extol [[org-mode][Org-mode]]'s capabilities in this post. There's plenty of resources on it already, no greater authority than the [[org-manual][Org-mode Manual]] itself. There is, in fact, some limitations of [[org-mode][Org-mode]] due to the choices of not allowing the generation to include ~Emacs~ itself. Along the tour of the implementation, it's important to note a guiding principle in the conversion was not breaking existing links. That is, I was and am satisfied with the folders and slug usage for posts and I didn't want the new version to break existing links. ** Detailed Implementation The easy part is generating each post. This is simply an ~index.html~ in the correct folder. The majority of the complexities stem from the summaries and main ~index~ page. *** Post Content Generation To generate a blog post's ~index.html~ page, we consider the following ~make~ target: #+BEGIN_SRC makefile blog_dir = $(shell $(SCRIPTS_DIR)/org-get-slug.sh $(1)) TEMPLATE_FILES:=$(wildcard templates/*.html) define BLOG_BUILD_DEF $(BUILD_DIR)$(call blog_dir,$T): mkdir -p $$@ $(BUILD_DIR)$(call blog_dir,$T)/index.html: $T \ $(TEMPLATE_FILES) \ Makefile \ | $(BUILD_DIR)$(call blog_dir,$T) $(SCRIPTS_DIR)/generate_post_html.sh $$< > $$@ endef $(foreach T,$(POSTS_ORG_INPUT),$(eval $(BLOG_BUILD_DEF))) #+END_SRC This definition is fairly opaque now. However, the definition will expand for each post when the ~foreach~ macro expands. For example, when run, the following targets will be defined for this post: #+BEGIN_SRC makefile $(BUILD_DIR)/blog/2019/03/static-site-generation: mkdir -p $@ $(BUILD_DIR)/blog/2019/03/static-site-generation/index.html: posts/static-site-generation.org \ $(TEMPLATE_FILES) \ Makefile \ | $(BUILD_DIR)/blog/2019/03/static-site-generation $(SCRIPTS_DIR)/generate_post.html $< > $@ #+END_SRC This will create the correct directory for each post, e.g., ~/blog/2019/03/static-site-generation~, and place the translated HTML into this directory as ~index.html~. #+BEGIN_QUOTE Note: it doesn't actually translate to ~$(TEMPLATE_FILES)~. During the expansion of the definition, the variable ~$(TEMPLATE_FILES)~ is similarly expanded. This is acceptable, however, since it's a static list of files and has no bearing on which post's target is being expanded. #+END_QUOTE The ~generate_post.sh~ script is fairly basic: #+BEGIN_SRC bash #!/usr/bin/env bash # Generate HTML for blog post ORGIN=${1} PROJ_ROOT=$(git rev-parse --show-toplevel) source ${PROJ_ROOT}/scripts/site-templates.sh source ${PROJ_ROOT}/scripts/org-metadata.sh DISPLAY_DATE=$(date -d ${DATE} +'%a %b %d, %Y') SORT_DATE=$(date -d ${DATE} +'%Y %m %d ') cat ${HTML_HEADER_FILE} cat ${HTML_SUB_HEADER_FILE} echo -n "

${TITLE}

" echo -n "
" echo -n '' echo -n "

${DISPLAY_DATE}

" pandoc --from org \ --to html \ ${ORGIN} cat ${HTML_FOOTER_FILE} #+END_SRC The ~org-metadata.sh~ script, reads the [[org-mode][Org-mode]] preamble, lines starting with ~#+~, and puts them into different variables available for other scripts. For example, the ~TITLE~, ~DATE~, ~TAGS~ are pulled out and used to generate the title section of each post. Furthermore, some templates are pulled in to generate the headers and footers of each page. The templates are written directly in HTML and really serve only to simplify each page with otherwise largely duplicated content. *** Summary Page Generation The summary page is a bit more involved to generate. A few questions had to be answered before it was possible: how to generate the summary text? And how to sort and order posts? To answer the first question, I dug into how [[hugo][Hugo]] was generating these summaries. It turns out, it really only takes the first couple hundred characters and calls it the "summary". This depends largely on the content of each post to actually describe the post in the first couple hundred characters. Obviously, this led to some awkward results, especially with links and section headings mixed in. To achieve similar results, it /would/ be fairly easy to write a script to simply take the first few hundred characters after the preamble and output this into something to be collected for the summary page. However, a better solution is available since we are taking full control over the generation process. Namely, we can put the preview content into a specific [[org-mode][Org-mode]] block to be parsed out and used explicitly for this purpose. If the summary for a post is only a sentence or two, the summary generation process won't then start reading extra text, if the summary requires a little more detail, it won't be cut short by the arbitrary read limit. To generate the preview content, the ~generate_post_preview.sh~ script is used: #+BEGIN_SRC bash #!/usr/bin/env bash # Generate HTML post summary tags ORGIN=${1} PROJ_ROOT=$(git rev-parse --show-toplevel) source ${PROJ_ROOT}/scripts/org-metadata.sh echo "${LINKS}" echo "${PREVIEW}" #+END_SRC The ~LINKS~ variable is included in this file because we are generating an intermediate file for [[pandoc][Pandoc]] to generate the summary content. Without the ~LINKS~, any links included in the preview section would be broken. The second question actually turns out to be pretty easy in practice: we parse the ~#+ DATE:~ line from the preamble and prepend it to the summary content. From the ~org-metadata.sh~ script: #+BEGIN_SRC bash file:org-metadata.sh ORIGIN=${1} DATE=$(awk -F': ' '/^#\+DATE:/ { printf "%s", $2}' ${ORGIN}) #+END_SRC Then, from the ~generate_post_summary_html.sh~ script: #+BEGIN_SRC bash file: generate_post_summary_html.sh #!/usr/bin/env bash # Generate HTML post summary tags ORGIN=${1} GENERATED_PREVIEW_FILE=${2} PROJ_ROOT=$(git rev-parse --show-toplevel) source ${PROJ_ROOT}/scripts/org-metadata.sh DISPLAY_DATE=$(date -d ${DATE} +'%a %b %d, %Y') SORT_DATE=$(date -d ${DATE} +'%Y %m %d ') PREVIEW_CONTENT=$(cat ${GENERATED_PREVIEW_FILE} | pandoc -f org -t html) echo -n "${SORT_DATE}" echo -n '
' echo -n "

${TITLE}

" echo -n "
${DISPLAY_DATE}
" echo -n "
$(echo ${PREVIEW_CONTENT})
" echo -n '' echo -n '" echo "" #+END_SRC Finally, this is all put together with the ~generate_index_html.sh~ script: #+BEGIN_SRC bash #!/usr/bin/env bash # Generate index.html page INPUT_FILES=${@} PROJ_ROOT=$(git rev-parse --show-toplevel) source ${PROJ_ROOT}/scripts/site-templates.sh cat "${HTML_HEADER_FILE}" echo "" cat "${HTML_SUB_HEADER_FILE}" cat ${INPUT_FILES} | sort -r -n -k1 -k2 -k3 | awk -F' ' '{print $4}' echo "" cat "${HTML_FOOTER_FILE}" #+END_SRC Specifically, the following line is of interest with respect to properly sorting: #+BEGIN_SRC bash cat ${INPUT_FILES} | sort -r -n -k1 -k2 -k3 | awk -F' ' '{print $4}' #+END_SRC Use the tab-separated date fields from before, and use them to sort each of the post summaries onto the ~index.html~ page. *** RSS/XML Generation I also wanted to keep the RSS/XML feeds going. However, as it turns out, generating the RSS feed was achieved by performing essentially the same steps used for generating the summary ~index.html~ page. ** Future Work There is a fairly obvious limitation of the summary page generation, but only really obvious if I write more content. There was and is no current archive page. Moreover, _all_ posts are put into the ~index.html~ summary page. If/when more posts are written and published, a solution for the first page will be necessary. However, this was necessary regardless of whether the blog is generated using [[hugo][Hugo]] or generated via the new process. ** Parting Thoughts Like many projects, this was started because I personally was dissatisfied with the current state of options. However, that said, I did not write these scripts to be used directly for someone else. I'm not sure I would necessarily recommend this approach to someone else, unless, of course, they wanted to do it to learn or to otherwise take control of their content. That said, I hope this captures the essence of the scripts, their major functions, and the motivations behind them. The scripts are available, WITHOUT WARRANTY, under the [[gnu-gpl][GNU General Public License (version 3)]]. If you have questions or comments, feel free to reach out to me.