Wednesday, 13 June 2012

TRF comments and posts: migration and statistics

Quick note: Please feel more than free to register with Disqus and post testing comments at the Stuxnet playground blog. If you already have a Disqus account, sign in with "D" for "Disqus". If you don't, click a check box and register. The e-mail you use during registration will only be visible to me (and I will not abuse it, and not even notice it, for that matter); if you really need to hide it, let me say that the e-mail won't be used so it probably doesn't have to be a fully real one.
On October 1st, the "fast" Echo comments will disappear from this blog and all other websites in the world. The employees of Echo simply found out that they're too greedy and the services including comment systems of small consumers aren't bringing them enough profit. And they're just not too good and/or motivated to maintain such comment systems. My current plan is to switch to DISQUS, an alternative comment system that seems to enjoy many advantages of Echo.

See, for example, The Smallest Minority, a fellow blogspot blog that has switched from Echo to DISQUS. That owner hasn't been able to preserve all the Echo comments but this is not DISQUS' fault as I am going to discuss momentarily.

It seems to me that I have in principle understood how to carry out – and collected all the required data for – the successful import of the 70,000+ Echo comments to the new system. This has eaten many hours of my time and is likely to eat additional ones. Because this is really about the preservation of your work, not just mine, you may financially contribute via PayPal (using either the red heart or the green piglet) if you think that this is work that shouldn't be sponsored purely from my resources.

The detailed text below will tell you something about the amount of data and the formatting.




The Reference Frame has about 4,750 blog entries at this point. In general, each blog entry allows "slow" comments internally served by Blogger.com as well as "fast" comments served by Echo. Most of the comment activity occurs in the fast comments, of course. (Only slow comments are accessible from the mobile template.)

The previous comparison may be quantified. The number of slow comments on this whole blog is just 10,000 or so – about two comments per blog entry on average; the number of fast comments exceeds 70,000, which is roughly 15 comments per blog entry on average. More than 18,000 fast comments are "replies" so they must remember the identifier of the parent comment, too. I hope that DISQUS will respect this structure.

The Blogger.com "slow" comments have been around since the beginning of TRF in October 2004. I added the Haloscan comments a year or so later. Another year or two later, Haloscan was bought by JS-Kit, another company that gradually got renamed to Echo and changed the procedures by which the comments are associated with the web pages etc.

Because of this chaos and because the transitions have never quite "assimilated" the previous systems and conventions, there exists a substantial amount of incoherence in the formatting of the 100-megabyte XML into which I may export all the Echo comments. Only the comments from recent years contain the information about the Blogger's numerical "postID" and only an even smaller subset of the Echo comments also preserves the information about the URL on motls.blogspot.com.

The Smallest Minority blog mentioned above wasn't able to find out these data in the XML file – so the import of the comments into DISQUS, however successful it claimed to be, couldn't possibly lead to the correct association of the comments with the blog entries. The required information just wasn't there!

But I think that I have downloaded some extra secret files that allowed me to write down the full dictionaries. If you open fast comments, you get a URL like this one for a noncommutative geometry. If you study the URL, you will see that the only variable part is a number after %2F (this is the percent-encoded way to write the slash "/" inside a URL) and before the ampersand sign ("&"). The number in between is referred to as the "path" in the Echo context and "blog.postID" in the Blogger.com context. When you click on "Post a comment" (the slow one) at the bottom of this blog entry, you will get the following URL
http://www.blogger.com/comment.g?blogID=8666091&postID=1833053477481953977
which implies that the "blog.postID" or "path" of this blog entry is
1833053477481953977
If you care, 8666091 is the blogID of this whole blog.
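
For readers who want to reproduce this extraction, here is a minimal sketch in Python (used purely for illustration; the actual work described here was done in Mathematica and PowerShell). The Blogger URL is the one quoted above. The Echo-style URL is a made-up placeholder on example.invalid, since the real Echo URL is not reproduced in this text, and both function names are hypothetical.

```python
import re
from urllib.parse import urlparse, parse_qs, unquote

def post_id_from_blogger_url(url):
    """Return the postID query parameter of a Blogger comment URL, or None."""
    query = parse_qs(urlparse(url).query)
    values = query.get("postID")
    return values[0] if values else None

def path_from_encoded_url(url):
    """Return the digits sitting between '%2F' and '&' in a URL, or None."""
    match = re.search(r"%2F(\d+)&", url)
    return match.group(1) if match else None

# URL quoted in the post.
blogger_url = "http://www.blogger.com/comment.g?blogID=8666091&postID=1833053477481953977"

# Hypothetical stand-in for an Echo "fast comments" URL (assumed shape only).
echo_url = ("http://example.invalid/comments?ref="
            "http%3A%2F%2Fexample.invalid%2F1833053477481953977&opt=1")

print(post_id_from_blogger_url(blogger_url))
print(path_from_encoded_url(echo_url))
print(unquote("%2F"))  # percent-decoding %2F gives the slash
```

Both functions return the identifier as a string rather than an integer, which avoids any precision trouble if the 19-digit number is later passed through tools that use floating-point numbers.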

The Echo comments pages can't be saved, at least not conveniently, because they're "dynamically generated". The page is almost empty to start with and the elements – the comments – are later added by JavaScript routines. However, I was able to discover that these Echo comments use a rather compact URL with the comments data such as this one. You may see that the only variable part of the URL is the "path" again; the resulting page looks like JavaScript code that sets various variables.

I was able to download all these 4,750 or so "comments data" pages via Mathematica – which I use as my nearly universal programming language these days. It created a neat 130-megabyte folder. A funny thing about these files is that while they lack some information about the comment, including dates and many other things, they contain the "jsid" codes of the comments.

It seems a verified fact to me that each of the 70,000 Echo "fast" comments on this blog is equipped with a unique "jsid" identifier of the form
jsid-1267854875-305.
This is the only identifier of a comment you may reliably find in the XML. So what I need for a successful readdressing of the XML file is to create several dictionaries. For each "jsid", I have to find out the "path" i.e. the "blog.postID" and the motls.blogspot.com-based URL. Optionally, it's also good to have a neat and readable "title" for a "jsid" code.
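
The "jsid" to "path" dictionary can be sketched as follows. This is only an illustration in Python under stated assumptions: the per-thread "comments data" pages are assumed to be already loaded into a dict keyed by the "path", the sample page contents are invented, and `build_jsid_to_path` is a hypothetical helper name. Only the shape of the jsid itself (jsid-digits-digits) comes from the example above.

```python
import re

# Shape taken from the example jsid-1267854875-305.
JSID_PATTERN = re.compile(r"jsid-\d+-\d+")

def build_jsid_to_path(pages):
    """Map every jsid found in a thread's page to that thread's path.

    pages: dict mapping path (postID as a string) -> text of the
    downloaded comments-data page for that thread (assumed layout).
    """
    jsid_to_path = {}
    for path, text in pages.items():
        for jsid in JSID_PATTERN.findall(text):
            # A jsid should belong to exactly one thread; flag conflicts.
            previous = jsid_to_path.setdefault(jsid, path)
            if previous != path:
                raise ValueError(f"{jsid} appears in {previous} and {path}")
    return jsid_to_path

# Invented sample data, for illustration only.
pages = {
    "1833053477481953977": 'a = "jsid-1267854875-305"; b = "jsid-1267854999-12";',
    "1111111111111111111": 'c = "jsid-1300000000-7";',
}

mapping = build_jsid_to_path(pages)
print(mapping["jsid-1267854875-305"])
print(len(mapping))
```

The same pattern extends to the other dictionaries: once the "path" is known for a jsid, the URL and the title can be looked up from dictionaries keyed by the "path".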

I believe that all these dictionaries have been created (they will be updated when the migration approaches the "real life process"). To do so, I had to pick all the relevant substrings from 4,750 of the TRF backup XML files – substrings that carry the information about the "blog.postID", the URL, and also the filename in the TRF backup directory. PowerShell, an extended DOS, was helpful, especially because of its grep-like command, "select-string".

Everything else has been done by Wolfram Mathematica. It has processed all the relevant substrings in the lines picked out by PowerShell – lines with the blog.postIDs, URLs, titles, and many other things. Make no mistake about it: there were surprises and exceptions almost at every step. The incoherent and incomplete "permalink" information in the exported Echo XML file was the first bad surprise. I also had to deal with unwanted linebreaks in the output generated by PowerShell (this has been fixed as I increased the maximum length of the line).

And when I was isolating the substrings to find the "blog.postID" for various filenames, I had another surprise. It seemed sensible to think that the ID starts after the characters "post-" in the TRF backup XML file. However, it turned out that I also have blog entries about post-Nazi and post-normal scientific things in the title, or whatever it exactly was, so adjustments had to be made to the algorithm. (This was just an example of a subtlety; there have been many.)
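
The "post-" pitfall is easy to demonstrate. The sketch below is in Python and rests on assumptions: the sample line is invented, it assumes the ID appears in the backup file as "post-" followed by a long run of digits, and the threshold of ten or more digits is an arbitrary choice for illustration, not the adjustment that was actually made.

```python
import re

# Assumption: a real post ID is "post-" followed by at least 10 digits,
# which rules out ordinary words such as "post-Nazi" or "post-normal".
POST_ID_PATTERN = re.compile(r"post-(\d{10,})")

def extract_post_id(text):
    """Return the first long digit run following 'post-', or None."""
    match = POST_ID_PATTERN.search(text)
    return match.group(1) if match else None

# Invented sample line with a misleading title before the real ID.
sample = ("<title>A post-normal science story</title>"
          "<id>blog-8666091.post-1833053477481953977</id>")

# Naive approach: take whatever follows the first "post-".
naive = sample.split("post-")[1]
print(naive[:20])               # starts with "normal science", not an ID

print(extract_post_id(sample))  # the digits after the second "post-"
```

Requiring digits after "post-" is one way to skip the titles; anchoring the match on the surrounding markup would be another.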

It's not my goal to overwhelm you with technicalities – and there have been many, indeed. The Mathematica program does lots of operations with strings and it's a lot of boring details. While I dislike this kind of work composed of subtleties, you may object that a trained string theorist should have no problems with finding substrings of strings in various files describing 70,000 comments. All of it boils down to string theory, of course, much like everything else in this Universe or multiverse except for Radiohead.

With the dictionary, I should be able to fix the gaps in the 100-megabyte XML file exported by Echo so that DISQUS will be able to associate the comments with the threads. This will be a piece of work that hasn't been tried before – by me or, chances are, by anyone else in the world – and may create new challenges. If you import the Echo XML file as a variable x into Mathematica, a permalink associated with a particular comment may be obtained e.g. as x[[2, 3, 1, 3, 12, 3, 4, 2, 2, 2]]. No kidding, no exaggeration. That's quite some nesting: it's an array of arrays of arrays of arrays of arrays of arrays of arrays of arrays of arrays of arrays, a ten-dimensional array, just like in string theory, of course. ;-) Most of these dimensions are redundant; the actual physical field is 2-dimensional, labeled by the thread (blog entry) and the number of the comment within the thread. The translation between the 2-dimensional "world sheet" and the 10-dimensional conventional "spacetime" is just a pile of conventions. I hope that it's enough to find and fix the title and permalink fields.
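
For readers unfamiliar with Mathematica's double-bracket indexing, here is a small Python sketch of what an expression like x[[2, 3, ...]] does: it descends one nesting level per index, counting from 1. The helper name `part` and the tiny nested list are invented for illustration; the real imported XML is far larger and ten levels deep.

```python
from functools import reduce

def part(expr, *indices):
    """Mimic Mathematica's x[[i, j, ...]] on nested Python lists.

    Mathematica counts from 1 while Python counts from 0, hence i - 1.
    """
    return reduce(lambda node, i: node[i - 1], indices, expr)

# Invented miniature of a nested import: four levels instead of ten.
nested = [
    "root",
    [
        ["a"],
        ["b", ["permalink", "http://example.invalid/post.html"]],
    ],
]

print(part(nested, 2, 2, 2, 2))
```

Each index in the chain only selects a branch of the tree, which is why just two of the ten "dimensions" carry real information: which thread, and which comment inside it.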

It's been roughly tested that the import of the XML file into DISQUS mostly works except that no comments ever appear anywhere – the permalinks are wrong. If they're fixed, things should work. All of it is theory at this point but when theory is done carefully, it sometimes works well in practice, too.
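
In outline, fixing the permalinks means walking over every comment in the XML, looking up its jsid in the dictionary, and writing the correct URL into the permalink field. The Python sketch below shows the idea only. The element names `export`, `comment`, `id` and `permalink` are hypothetical placeholders and not the real Echo export schema, and the URLs are invented.

```python
import xml.etree.ElementTree as ET

def fix_permalinks(xml_text, jsid_to_url):
    """Fill in or correct the permalink of each comment with a known jsid.

    Returns the rewritten XML as a string and the number of comments changed.
    Element names are assumptions, not the actual Echo format.
    """
    root = ET.fromstring(xml_text)
    fixed = 0
    for comment in root.iter("comment"):
        url = jsid_to_url.get(comment.findtext("id"))
        if url is None:
            continue  # unknown jsid: leave the comment untouched
        node = comment.find("permalink")
        if node is None:
            node = ET.SubElement(comment, "permalink")
        if node.text != url:
            node.text = url
            fixed += 1
    return ET.tostring(root, encoding="unicode"), fixed

# Invented sample: one empty permalink, one missing entirely.
sample = """<export>
  <comment><id>jsid-1267854875-305</id><permalink></permalink></comment>
  <comment><id>jsid-1300000000-7</id></comment>
</export>"""

lookup = {
    "jsid-1267854875-305": "http://example.invalid/2012/06/a.html",
    "jsid-1300000000-7": "http://example.invalid/2012/06/b.html",
}

fixed_xml, count = fix_permalinks(sample, lookup)
print(count)
```

For a 100-megabyte file, loading the whole tree into memory as done here may still be workable, but a streaming parser would be the more cautious choice.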

Right now, I am actually thinking about switching to a one-comment-system regime. And the single surviving comment system will be none of those that already exist on this blog at this point! ;-) We're talking about DISQUS, the frontrunner as of today. That's quite a revolution but carefully masterminded revolutions sometimes work, too. I plan to replace the "slow" Blogger.com comments by DISQUS 2012 once the testing is completed, which may be as early as tomorrow. Extra links to "fast comments" will be somewhere but I will encourage the users to switch to the DISQUS threads and incorporate the Echo comments later.

This DISQUS page on Echo imports also reminds me to fill in some author's name for comments that don't have any author field – and be sure that Echo simply exports the comments so that the anonymous ones are not affiliated with any author, not even an anonymous one. Also, the avatars and other things may be expected not to work. DISQUS sometimes offers commenters the option to claim ownership of previous comments posted from the same IP address or something like that so maybe I should leave it to the individual commenters.

Even if the import procedure ultimately fails, I think that I have all the data to build a brand new XML file with the 70,000 comments from scratch, in any format you want.

It seems likely that objects such as attached images will be lost, much like the "likes" (and I am not sure what the smileys will look like). It would be a huge extra amount of work to find out how the images and likes are parameterized and how they could be included in the DISQUS comments, if it is possible at all. DISQUS also has advantages over Echo. For example, it shouldn't suffer from the bug that prevents the visitors from national domains such as motls.blogspot.cz from seeing the global dot-com discussions.

As you can see, it's a lot of work and the result is bound to be imperfect. But I still feel it's better to do an imperfect job than to allow a complete extinction of the comments. Many people have written things that may deserve to be remembered over those 7-8 years... If you have something important to say, don't hesitate.
