The Nightmare: A Horde of Ancient HTML Files
So there I was, at my job, staring into the abyss of a massive pile of HTML files. And not just any HTML: some of these bad boys were from the HTML 1.0 era. You know, back when <font> tags were cool, tables ruled the web, and <blink> was somehow acceptable.
The goal? Take all of this ancient, inconsistent HTML and normalize it into consistently formatted PDFs.
Simple, right? Nope.
Some documents were pristine; others were crime scenes of mismatched <div>s, inline styles, and <table> layouts that made modern CSS cry.
Parsing it all into something usable was going to take a serious plan.
Step 1: Fixing and Cleaning the Old HTML
The first step was extracting usable content while cleaning up horrible, outdated practices. This meant:
- Removing junk tags (<blink>, <marquee>, <script>, <style>).
- Fixing unmatched tags (because some files had random <div> openings with no closures).
- Handling encoding nightmares (ISO-8859-1 mixed with UTF-8).
- Detecting HTML bleeding into output (sometimes, raw HTML was mistakenly treated as content).
- Finding content even if <div>s were broken or misused.
- Logging warnings for human review (because you never know what horrors lurk in old HTML).
Python Code for Cleaning Old HTML
We used BeautifulSoup for fixing the HTML, along with some custom logic to detect malformed structures.
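Treat the following as a minimal sketch of that cleanup pass, not the exact code we shipped. It assumes BeautifulSoup 4 with the built-in html.parser backend. The helper names (decode_html, clean_html), the logger name, and the leak-detection regex are made up for illustration. It also assumes that <blink> and <marquee> should be unwrapped (text kept) while <script> and <style> are dropped along with their contents, and it picks one encoding per file, which is a simplification.

```python
import logging
import re

from bs4 import BeautifulSoup

logger = logging.getLogger("html_cleanup")  # hypothetical logger name

# Escaped markup such as "&lt;b&gt;" ends up as a literal "<b>" inside a text node,
# which is a hint that raw HTML was treated as content somewhere upstream.
LEAKED_TAG = re.compile(r"</?[a-zA-Z][^<>]*>")


def decode_html(raw):
    """Decode bytes as UTF-8, falling back to ISO-8859-1 (which never fails)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        logger.warning("Not valid UTF-8; falling back to ISO-8859-1")
        return raw.decode("iso-8859-1")


def clean_html(raw, source="<unknown>"):
    """Return (cleaned_html, warnings) for one document given as bytes."""
    # The parser itself repairs the tree: unclosed tags get closed here.
    soup = BeautifulSoup(decode_html(raw), "html.parser")

    # Assumption: keep the text inside <blink>/<marquee>, drop only the tags.
    for tag in soup.find_all(["blink", "marquee"]):
        tag.unwrap()
    # Assumption: <script>/<style> contents are never wanted in the output.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()

    warnings = []
    for text in soup.find_all(string=True):
        if LEAKED_TAG.search(text):
            message = f"{source}: possible raw HTML in content: {text.strip()[:60]!r}"
            warnings.append(message)
            logger.warning(message)

    # prettify() only serializes the (already repaired) tree.
    return soup.prettify(), warnings
```

Calling clean_html(path.read_bytes(), source=str(path)) on each file would return the repaired markup plus a list of warnings to queue for human review.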
What This Does:
- Removes <blink> and <marquee> (because no one needs those anymore).
- Fixes broken tags (BeautifulSoup's parser closes unclosed tags when it builds the tree, and BeautifulSoup.prettify() writes the repaired markup back out).
- Detects if raw HTML is leaking into content.
- Logs warnings so we can manually review problematic files.
Step 2: Extracting Content Even with Broken <div>s
Some HTML files had zero structure: <div>s opened and never closed, or used for things they were never meant for.
To extract content even if divs are broken, we used:
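As a rough sketch of the idea (not the original listing), the snippet below swaps in the standard library's html.parser instead of a full tree builder. It reacts to tags as events and never needs them to balance, so a missing </div> cannot hide any text. The class name TextExtractor, the function extract_text, the set of block-level tags, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text without relying on tags being properly nested."""

    SKIP = {"script", "style"}  # contents are never shown to readers
    BLOCK = {"div", "p", "br", "li", "tr", "td", "table", "h1", "h2", "h3"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")  # block tags act as line breaks

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.chunks.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_text(html):
    """Return non-empty lines of text, with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.chunks).splitlines())
    return [line for line in lines if line]


# Made-up sample: two <div>s opened, one closed, <p> and <font> left dangling.
messy = "<div><font size=2>Title<div>First paragraph<p>Second <b>bold</div> tail</font>"
print(extract_text(messy))
```

The trade-off is that this keeps only text and rough line breaks. Anything that depends on real document structure (headings, tables) would still need a tree-based parser.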
Step 3: Generating PostScript from the Fixed HTML
Now that we have cleaned HTML content, it's time to generate PostScript.
Simple Example: Basic Text Output
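To give a feel for what basic text output involves, here is a small sketch that builds a one-page PostScript program from a list of text lines. The function names (ps_escape, simple_postscript) and the default coordinates are assumptions: they presume a US Letter page (612 x 792 points) with a one-inch left margin, plain ASCII text, and no page-break handling.

```python
def ps_escape(text):
    """Escape backslashes and parentheses, which are special in PostScript strings."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def simple_postscript(lines, font="Times-Roman", size=12, left=72, top=720, leading=14):
    """Return a PostScript program that prints each line below the previous one."""
    out = ["%!PS", f"/{font} findfont {size} scalefont setfont"]
    y = top
    for line in lines:
        # moveto sets the current point; show paints the string there.
        out.append(f"{left} {y} moveto ({ps_escape(line)}) show")
        y -= leading  # PostScript's origin is bottom-left, so text flows downward in y
    out.append("showpage")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(simple_postscript(["Hello from 1995", "Parentheses (like these) need escaping"]))
```

Escaping matters more than it looks: an unbalanced parenthesis in the source text would otherwise leave the PostScript string unterminated and break the whole page.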
Step 4: Complex PostScript with Layout
For the real project, the PostScript output had columns, images, and formatted text. Here's a more advanced version:
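The sketch below covers only part of that: a bold title, a horizontal rule, and wrapped body text. Columns and images are left out. The function name layout_postscript, the font sizes, and the coordinates are assumptions. Line wrapping uses a fixed character count, because real font metrics are not available from Python here.

```python
import textwrap


def ps_escape(text):
    """Escape backslashes and parentheses, which are special in PostScript strings."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def layout_postscript(title, paragraphs, left=72, right=540, top=720,
                      body_size=11, leading=14, wrap_chars=90):
    """Return a one-page PostScript program: title, rule, then body paragraphs."""
    rule_y = top - 10
    out = [
        "%!PS",
        "/Helvetica-Bold findfont 18 scalefont setfont",
        f"{left} {top} moveto ({ps_escape(title)}) show",
        "0.5 setlinewidth",
        # Horizontal rule under the title, spanning the text width.
        f"{left} {rule_y} moveto {right} {rule_y} lineto stroke",
        f"/Times-Roman findfont {body_size} scalefont setfont",
    ]
    y = rule_y - 2 * leading
    for para in paragraphs:
        # Approximate wrapping by character count (an assumption, not true metrics).
        for line in textwrap.wrap(para, width=wrap_chars):
            out.append(f"{left} {y} moveto ({ps_escape(line)}) show")
            y -= leading
        y -= leading  # blank line between paragraphs
    out.append("showpage")
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(layout_postscript("Quarterly Report", ["First paragraph.", "Second paragraph."]))
```

A fuller version would measure strings with stringwidth inside PostScript (or with font metrics on the Python side) and start a new page when y drops below the bottom margin.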
What This Does:
- Uses Helvetica-Bold for the title
- Uses Times-Roman for content
- Draws a horizontal line for formatting
- Ensures proper text placement
Step 5: Converting to PDF
Once we had PostScript files, we converted them to PDFs using:
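For illustration, here is one way to drive that conversion from Python, assuming the ps2pdf wrapper that ships with Ghostscript is on the PATH. The function names (ps2pdf_command, convert_to_pdf) are hypothetical, and the sketch passes no extra Ghostscript options.

```python
import shutil
import subprocess
from pathlib import Path


def ps2pdf_command(ps_path, pdf_path=None):
    """Build the argument list for: ps2pdf input.ps output.pdf"""
    ps_path = Path(ps_path)
    # Default to the same file name with a .pdf extension.
    pdf_path = Path(pdf_path) if pdf_path else ps_path.with_suffix(".pdf")
    return ["ps2pdf", str(ps_path), str(pdf_path)]


def convert_to_pdf(ps_path, pdf_path=None):
    """Run ps2pdf and return the path of the PDF it was asked to write."""
    if shutil.which("ps2pdf") is None:
        raise RuntimeError("ps2pdf (part of Ghostscript) was not found on PATH")
    cmd = ps2pdf_command(ps_path, pdf_path)
    # check=True raises CalledProcessError if Ghostscript reports a failure.
    subprocess.run(cmd, check=True)
    return Path(cmd[-1])
```

Passing the arguments as a list (instead of a shell string) avoids quoting problems with file names that contain spaces, which old archives tend to have.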
This gave us clean, consistently formatted PDFs with zero surprises, exactly what we needed.
Key Ideas
- Old HTML is a mess; parsing it requires patience (and sometimes therapy).
- HTML rendering is inconsistent, so PostScript was chosen for absolute control over formatting.
- Python can fix broken HTML, detect issues, and extract content even from disasters.
- PostScript allows precise placement of text, images, and layout.
- Converting PostScript to PDF is straightforward, making it a great pipeline for document conversion.
References
- PostScript Language Reference Manual (Adobe)
- ps2pdf Documentation (Ghostscript)
- BeautifulSoup Documentation (Python)
- Parsing HTML in Python (lxml & html.parser)