OSS Rabbitholes

This weekend and the surrounding weekdays are PyCon! I've been neglecting a lot of open source work & maintenance that I used to be oh so familiar with, so I decided to take this time to cross some easy tickets off my list.

urxvt(1) man formatting oddity

When I type in man urxvt, mentions of urxvtperl(3) within it are only half-bolded in my default pager. This caught my attention as a low-hanging contribution to maybe tidy up their docs a slight amount.

Precursor: get the source code

urxvt is kind of an obscure terminal emulator; they do things their own way & I absolutely love them for that.

Their repository is hosted in CVS.

Specifically, as outlined on http://software.schmorp.de/pkg/rxvt-unicode.html, you can "clone" the repository using:

cvs -z3 -d :pserver:anonymous@cvs.schmorp.de/schmorpforge co rxvt-unicode

I don't know how to use cvs so I opted to not do this :D

Instead, I tinkered around with git a bit to figure out the git cvsimport command:

git cvsimport -C urxvt -r cvs -k -v -d :pserver:anonymous@cvs.schmorp.de/schmorpforge rxvt-unicode

Discovering this took longer than I had hoped; thanks to this Stack Overflow answer for spelling out the invocation for me: https://stackoverflow.com/a/11490134.

Also, running that command took roughly two hours so that was fun.

Attempt #1, modify the man page!

The offending manual file is located in doc/rxvt.1.man.in. The .in suffix hints that there's some preprocessing going on before I see the final output, but that's extraneous for our current goals.

Man pages are written in troff, a typesetting language that looks pretty esoteric. But figuring out what to change to get what I wanted was pretty easy:

- @@RXVT_NAME@@\fBperl\fR\|(3)
+ \fB@@RXVT_NAME@@perl\fR\|(3)

Just have to move the \fB font escape so that the bold covers the entire name, including the prefix.

man can be run on the file as-is, weird un-preprocessed artifacts and all:

man doc/rxvt.1.man.in

But to build the files, first configure the entire project with ./configure in the base directory, then run make all from within the doc/ directory.
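For a feel of what the build changes about that line, here is a rough Python sketch. It assumes the preprocessing boils down to swapping the @@RXVT_NAME@@ placeholder for the configured program name (urxvt here); the preprocess helper is hypothetical and is not the project's actual build step.

```python
# Hypothetical stand-in for the build's preprocessing step: the placeholder
# comes from the man page source, the substitution logic is an assumption.
PLACEHOLDER = "@@RXVT_NAME@@"


def preprocess(roff_line, name="urxvt"):
    """Swap the placeholder for the configured program name."""
    return roff_line.replace(PLACEHOLDER, name)


before = r"@@RXVT_NAME@@\fBperl\fR\|(3)"  # only "perl" sits between \fB and \fR
after = r"\fB@@RXVT_NAME@@perl\fR\|(3)"   # the whole name sits between \fB and \fR

print(preprocess(before))  # urxvt\fBperl\fR\|(3)
print(preprocess(after))   # \fBurxvtperl\fR\|(3)
```

In the first line the bold starts partway through the final name, which is the half-bolded reference that started all this.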

Attempt #2, notice that the man page is autogenerated

At some point I realized that there is a sibling doc/rxvt.1.pod. A bit more digging (from within the Makefile) led me to find that the doc/*.man.in files are generated from the *.pod files:

%.tbl: %.pod
	$(srcdir)/podtbl <$< >$@

%.1.man.in: %.1.tbl
	$(POD2MAN) -s1 <$< >$@

%.3.man.in: %.3.tbl
	$(POD2MAN) -s3 <$< >$@

%.7.man.in: %.7.tbl
	$(POD2MAN) -s7 <$< >$@

I don't know too much about pod/tbl, but upon initial search these look to be perl-isms, pod standing for "Plain Old Documentation".

The file is also much cleaner, with the generated @@RXVT_NAME@@\fBperl\fR\|(3) stemming from this:

@@RXVT_NAME@@perl(3)

But the resulting question is: where does the formatting come from?

Aside: standards & patterns within this file

As part of this contribution, I wanted to make sure that I was following the existing conventions for highlighting/distinguishing man page references within this man page.

Using the search (\d) from within vim, here's a subset of what I found:

In conclusion, I found no rhyme or reason and got more excited to maybe contribute some semblance of order to this obscure file.
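A rough way to redo that survey outside of vim, as a Python sketch. The pattern is deliberately loose (any non-space run followed by a single digit in parentheses), the helper names are made up, and the sample line is stitched together from references mentioned in this post rather than copied from the pod file.

```python
import re
from collections import defaultdict

# Loose on purpose: any run of non-space characters followed by "(digit)".
CANDIDATE = re.compile(r"(\S+?)\((\d)\)")


def classify(token):
    """Name the POD formatting code at the start of a reference, if any."""
    m = re.match(r"([A-Z])<", token)
    return m.group(1) + "<>" if m else "plain"


def group_references(text):
    """Group man-page-looking references by the markup style they use."""
    groups = defaultdict(list)
    for m in CANDIDATE.finditer(text):
        groups[classify(m.group(1))].append(m.group(0))
    return dict(groups)


# Illustrative input only, not a line from rxvt.1.pod.
sample = "see @@RXVT_NAME@@perl(3), I<xterm>(3) and L<crontab(5)>"
for style, refs in group_references(sample).items():
    print(style, refs)
```

Pointing group_references at the contents of doc/rxvt.1.pod would give a per-style tally to eyeball for consistency.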

Aside: pod2{man,html,xhtml,tbl(?)}

rxvt-unicode has a very well formatted manpage located here: http://pod.tst.eu/http://cvs.schmorp.de/rxvt-unicode/doc/rxvt.1.pod

Additionally, the doc/Makefile has a %.html: %.tbl rule, so we can build html files!

This is quite a boon: when investigating the man page inter-reference formatting issue, seeing how the different output formats render the same source can lead to hints!

Unfortunately, urxvt's pod to html converter is pod2xhtml, which doesn't exist on my machine, nor is it packaged in Gentoo's default repository.

A replacement of s/pod2xhtml/pod2html/ gave me a quick working setup though, and I continued down my path!

Unfortunately, pod2html doesn't seem to auto-format the man links at all! This signals to me that what these tokens "are" is under-specified, and that the pod machinery could use some more hinting.

L<>

I don't know how I found it, but I stumbled onto this addendum to a Stack Overflow answer:

Btw, UNIX man pages work right out of docs:

L<crontab(5)>

This brings up http://man.he.net/man5/crontab

This is!! Exactly what I want! A properly documented way to link to man pages without implicit rules trying to auto-detect things!

I hastily surrounded some of the links I was working with, resulting in:

- @@RXVT_NAME@@perl(3)
+ L<@@RXVT_NAME@@perl(3)>

and after running my makefile amalgamation, make clean alldocclean alldoc rxvt.1.html all (with the s/pod2xhtml/pod2html/ modification), I got exactly what I was looking for: a properly formatted man page reference with linking included!

Kind of.


man_url_prefix

The resulting autogenerated man page reference URL directs to: http://man.he.net/man3/urxvtperl.

Which 404s.

I'm not exactly sure what the bar to get a man page up on http://man.he.net is, but apparently urxvt doesn't make it. There are plenty of other online man page providers that do include it though, a list:

There's no shortage of options. An alternative approach could be to figure out how to get man.he.net to index urxvt's man pages too, but that requires dealing with people & bureaucracy, and I go down these rabbitholes to deal with software.

Digging into pod2html, it is somewhat configurable and lets you change the website that man page references link to. This is done through the man_url_prefix "variable". There doesn't seem to be any way to set it from the command line, so this distraction has kind of led to a dead end.

I've been making heavy use of grep.app recently, and plugging man_url_prefix into it turns up 9 (nine) total uses across all of GitHub. I'm not sure this setting has actually been used in any real capacity.
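To make the role of that prefix concrete, here is a small Python sketch of how a link of the shape seen above could be assembled. The default prefix is inferred from the generated URLs in this post; the man_page_url function and the example.org prefix are hypothetical, and this is not pod2html's actual code.

```python
import re

# Inferred from the generated links above (e.g. http://man.he.net/man5/crontab).
DEFAULT_MAN_URL_PREFIX = "http://man.he.net/man"


def man_page_url(target, prefix=DEFAULT_MAN_URL_PREFIX):
    """Build prefix + section + "/" + page from a target like "crontab(5)"."""
    m = re.fullmatch(r"([^(]+)(?:\((\d+)\))?", target)
    if m is None:
        return None
    page = m.group(1)
    section = m.group(2) or "1"  # assumption: fall back to section 1
    return prefix + section + "/" + page


print(man_page_url("crontab(5)"))    # http://man.he.net/man5/crontab
print(man_page_url("urxvtperl(3)"))  # http://man.he.net/man3/urxvtperl
# A different provider would only need a different prefix (placeholder host):
print(man_page_url("urxvtperl(3)", prefix="https://example.org/man"))
```

Only the prefix changes between providers under this scheme, which is why a single setting would be enough if it could be reached from the command line.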

pod2xhtml

I quickly dug myself out of pod2html hackery, since none of it actually moves me towards my goal: urxvt's doc builder doesn't use pod2html, it uses pod2xhtml!

pod2xhtml is much worse off. It doesn't exist in gentoo's repo tree because it hasn't been touched in over a decade (last update: 2010). It uses a legacy link parser that unfortunately dashes my hopes of improving the html generated manual along with the man page -- it doesn't autolink to online man page references.

By default the L<> wrapped man links just turn into this html:

<cite>urxvtperl</cite>(3)

Which, honestly. Isn't the worst. This mimics some manually crafted I<xterm>(3)s found within the page, so I consider it an acceptable modification.

A consideration to be made for the future though: pod2html works and generates a pretty identical looking html file. Perhaps it's worth porting over at some point? http://pod.tst.eu seems to be running a cgi script providing realtime pod2xhtml; not sure who owns it, but that's how urxvt's documentation is being rendered currently.

With a bit of work, urxvt could definitely transition over to a pod2html / statically served html man page setup.

The actual fix: podlators

The current iteration of pod2man lives within podlators.

The crux of the issue that started this all comes from this block of regex:

# Change references to manual pages to put the page name in bold but
# the number in the regular font, with a thin space between the name and
# the number.  Only recognize func(n) where func starts with an alphabetic
# character or underscore and contains only word characters, periods (for
# configuration file man pages), or colons, and n is a single digit,
# optionally followed by some number of lowercase letters.  Note that this
# does not recognize man page references like perl(l) or socket(3SOCKET).
if ($$self{GUESSWORK}{manref}) {
    s{
        \b
        (?<! \\ )                                   # rule out \e0(1)
        ( [A-Za-z_] (?:[.:\w] | \\-)+ )
        ( \( \d [a-z]* \) )
    } {
        '\f(BS' . $1 . '\f(BE\|' . $2
    }egx;
}

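Specifically, the recognition heuristic only matches a portion of @@RXVT_NAME@@perl(3), with or without the L<> construct: @ is not a word character, so the match can only begin at perl, and the @@RXVT_NAME@@ prefix is left out of the bold.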
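To see the partial match concretely, here is a Python port of that regex. It is a sketch: Python's re module stands in for Perl's engine, the bold_manrefs name is made up, and the replacement writes \fB and \fR directly instead of the \f(BS and \f(BE markers used in the original.

```python
import re

# Same pattern as the Perl block above, in Python's verbose syntax.
MANREF = re.compile(
    r"""
    \b
    (?<!\\)                             # rule out \e0(1)
    ( [A-Za-z_] (?: [.:\w] | \\- )+ )   # page name
    ( \( \d [a-z]* \) )                 # section, e.g. (3) or (3pm)
    """,
    re.VERBOSE,
)


def bold_manrefs(text):
    """Bold the guessed page name, leave the section in the regular font."""
    return MANREF.sub(
        lambda m: r"\fB" + m.group(1) + r"\fR\|" + m.group(2), text
    )


print(bold_manrefs("urxvtperl(3)"))          # \fBurxvtperl\fR\|(3)
print(bold_manrefs("@@RXVT_NAME@@perl(3)"))  # @@RXVT_NAME@@\fBperl\fR\|(3)
```

The second line reproduces the half-bolded output from the generated man page: the guess runs while the placeholder is still in the text, so it never sees the final name.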

The resulting patch to podlators circumvents this guesswork when within the L<> construct and just generally bolds the contents of the link (when not a URL), special-casing the man reference type to not bold the suffixed section number.
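The behaviour described above can be sketched in Python as follows. This is only a loose rendition of the description, not the podlators patch itself: format_link, the URL check, and the section pattern are all assumptions made for illustration.

```python
import re

# Assumed shape of a man reference: anything, then "(digit + optional letters)".
MAN_TARGET = re.compile(r"(.+)(\(\d[a-z]*\))")
# Assumed URL test: a scheme followed by "://".
URL = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://")


def format_link(target):
    """Format the contents of an L<> link without any guesswork."""
    if URL.match(target):
        return target  # URLs are left alone
    m = MAN_TARGET.fullmatch(target)
    if m:
        # Man reference: bold the name, keep the section number regular.
        return r"\fB" + m.group(1) + r"\fR\|" + m.group(2)
    return r"\fB" + target + r"\fR"  # any other link: bold all of it


print(format_link("@@RXVT_NAME@@perl(3)"))  # \fB@@RXVT_NAME@@perl\fR\|(3)
print(format_link("Pod::Man"))              # \fBPod::Man\fR
print(format_link("http://man.he.net/man5/crontab"))
```

Because the link is explicit, the placeholder prefix ends up inside the bold, which is the output I was trying to hand-edit into the man page back in attempt #1.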

& the resulting patch to rxvt-unicode is trivial. Through a mailing list, too :D