Divide by zero
 Tuesday, June 03, 2008
Web log file parsing with C#

In the past week I have been doing a lot of analysis on the web analytics reports generated by AWStats (a good Open Source product; don't be put off by the arrogant "download Firefox" message) and by a commercial product that we have an Enterprise license for. I won't name this product, as I am very disillusioned with it: its accuracy is very questionable.

In order to determine how accurate the statistics were, I chose one day's log file for a reasonably busy website. This site is mainly a Monday to Friday website with peaks around 10 AM to 4 PM, and another small peak from 6 PM to 8 PM. Weekends see lower usage, so I used a Sunday as this would be the smallest log I could reasonably analyse. The log contained just over 100,000 entries. In order to analyse it, I used Visual Studio 2008, filtering using regular expressions. After a while I realised I would like to review these stats on an ongoing basis, so I wrote a small log parser. There is an excellent free parser available from Microsoft called Log Parser 2.2; however, I wanted direct control, because I also want to create some graphs. My first version creates a beautiful graph using the excellent Open Source Silverlight data visualisation component called Visifire. I haven't included that code in this download as I wanted to clean it up and make it more re-usable and extensible.

This tool is not particularly robust as it is just a throwaway utility, but I tried to make it a bit extensible for future requirements. The log files I am parsing are Apache log files, so I designed a simple ILogFormatReader interface and created an ApacheLogParser implementation. This doesn't populate all the fields, but it's easy to see how it works and to finish off the implementation if more information is necessary.

The main issue with parsing log files is how the fields are separated. In this case the log file is space separated. Any fields that have spaces in them are surrounded by double quotes or square brackets. Another consideration is what to do if a log entry comes in with an invalid format. I didn't worry about this too much, as the web logs should parse well if the first line is correct. If this code were to go into production then of course that would have to be handled properly.
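The splitting rule described above can be sketched as a small tokenizer. This is only an illustration, not the code from the download: the class name LogLineSplitter is made up for the example, and it assumes quoted fields never contain escaped quotes.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the downloadable parser.
public static class LogLineSplitter
{
    // Splits one log line on spaces, treating "..." and [...] as single fields.
    // Assumption: delimited fields do not contain escaped quotes or nested brackets.
    public static List<string> Split(string line)
    {
        var fields = new List<string>();
        var current = new StringBuilder();
        char closer = '\0';     // non-zero while inside quotes or brackets
        bool hasField = false;  // true once the current field has started

        foreach (char c in line)
        {
            if (closer != '\0')
            {
                if (c == closer) closer = '\0';  // end of the delimited field
                else current.Append(c);          // spaces are kept inside the field
            }
            else if (c == '"') { closer = '"'; hasField = true; }
            else if (c == '[') { closer = ']'; hasField = true; }
            else if (c == ' ')
            {
                if (hasField) fields.Add(current.ToString());
                current.Clear();
                hasField = false;
            }
            else { current.Append(c); hasField = true; }
        }

        if (hasField) fields.Add(current.ToString());
        return fields;
    }
}
```

Fed a typical Apache access log line, this returns the host, identity, user, timestamp, request, status and size as separate strings, with the timestamp and request kept whole despite the spaces inside them.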

Not all log files are created equal

Each web server can be customised to record different information in the log files. Additionally, proxy servers can modify what is sent through to the web server, so it is important to check the format of any log file before parsing. The log files I am parsing use an Apache log file format. More information can be found on the Apache log files page and the Microsoft Log file formats in IIS (6.0) page.

For those interested, I found the statistics in AWStats very close to what I believe they should be, although there were some inaccuracies that I couldn't explain. To be fair, we are running an old version of AWStats, so I assume that these might have been addressed in a newer version. The very expensive commercial application we use reports approximately twice the bandwidth that it should, and does not understand how to handle the JSESSIONID tacked on to the end of URLs in some JSP applications. It confuses the real resource with the session ID, and we get very inaccurate statistics as a result.
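As an illustration of the JSESSIONID problem, a parser can normalise the requested resource before counting it. The sketch below is an assumption about how that might be done, not code from the download; it assumes the session ID appears as a ";jsessionid=..." path parameter, which is how Java servlet containers usually rewrite URLs.

```csharp
using System;

// Hypothetical helper, not part of the downloadable parser.
public static class ResourceNormaliser
{
    // Removes a ";jsessionid=..." path parameter and keeps any query string,
    // so that every hit on the same page is counted under one resource.
    public static string StripSessionId(string requestPath)
    {
        int start = requestPath.IndexOf(";jsessionid=", StringComparison.OrdinalIgnoreCase);
        if (start < 0) return requestPath;  // nothing to strip

        int query = requestPath.IndexOf('?', start);
        return query < 0
            ? requestPath.Substring(0, start)
            : requestPath.Substring(0, start) + requestPath.Substring(query);
    }
}
```

With this in place, /cart.jsp;jsessionid=ABC123?item=5 and /cart.jsp?item=5 are reported as the same resource.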

There is some minimalistic reporting to give an idea of how to use the parser. The output looks like the image here.

There is a download included below. I would be interested to hear if anyone finds this useful. I hope to update this application at some stage to include some graphing output. I might use Visifire as mentioned earlier, or possibly Microsoft Excel. I think Excel is the better option, as it would allow for richer manipulation of the reporting after the log files have been parsed. Excel has such powerful features that it would require good justification not to use it as a reporting mechanism. I think that AWStats doesn't do its reports justice by presenting them as it does; everyone's reporting requirements are different.

Download

The application can be downloaded here:

LogParsing.zip (7 KB)


Tuesday, June 03, 2008 11:33:35 PM (AUS Eastern Standard Time, UTC+10:00)   #    Comments [0]  Downloads | Webmaster
 Tuesday, May 27, 2008
How to add a provider to the Internet Explorer search bar

Internet Explorer 7 and above has a search bar with a default provider of Live Search. It also contains a link to other search providers such as Microsoft and Wikipedia. It is very easy to add a new provider to this search bar. It can be done manually, but this article describes how to change a web page to enable auto-detection of a search provider, plus how to create a link to that provider. Firefox 2 and above also has a search bar that supports this feature.

The search provider requires an XML file which adheres to the OpenSearch Description schema. It is easy enough to create this by hand, but if you want Internet Explorer to create it for you then you can use the Find more providers option in the search bar dropdown. On the page it displays, enter the URL of a search on your site for the word TEST in place of a real search keyword. This will only work for sites that use GET rather than POST for search queries. Once you have entered the URL, there is a View XML option which will display the necessary XML. It will look like this:

<?xml version="1.0" encoding="UTF-8"?>
<OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/">
  <ShortName>blog.focas.net.au</ShortName>
  <Description>blog.focas.net.au provider</Description>
  <InputEncoding>UTF-8</InputEncoding>
  <Url type="text/html"
       template="http://blog.focas.net.au/SearchView.aspx?q={searchTerms}"/>
</OpenSearchDescription>

The ShortName will show up in the search bar. This file should be saved somewhere on your website. It doesn't matter where, as it is the metadata in the pages that will point to its location.
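If the description file needs to be produced for several sites, it can also be generated from code. The following is a minimal sketch using LINQ to XML; the class and method names are invented for the example, and it only writes the four elements shown above.

```csharp
using System.Xml.Linq;

// Hypothetical helper for generating an OpenSearch description document.
public static class OpenSearchDescription
{
    static readonly XNamespace Ns = "http://a9.com/-/spec/opensearch/1.1/";

    // urlTemplate must contain the {searchTerms} placeholder.
    public static XDocument Build(string shortName, string description, string urlTemplate)
    {
        return new XDocument(
            new XDeclaration("1.0", "UTF-8", null),
            new XElement(Ns + "OpenSearchDescription",
                new XElement(Ns + "ShortName", shortName),
                new XElement(Ns + "Description", description),
                new XElement(Ns + "InputEncoding", "UTF-8"),
                new XElement(Ns + "Url",
                    new XAttribute("type", "text/html"),
                    new XAttribute("template", urlTemplate))));
    }
}
```

Calling Build(...) and then Save(path) on the result writes the file to wherever it is to be served from.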

On any page where you want Internet Explorer to automatically detect the search provider, you need to add some metadata into the head section of the page. This will look like this:

<link title="blog.focas.net.au search"
      rel="search"
      type="application/opensearchdescription+xml"
      href="http://blog.focas.net.au/blog.focas.net.au.searchprovider.xml" />

The rel attribute with the value search tells Internet Explorer that there is a search provider. The type is a MIME type referring to the OpenSearch Description XML format. The href is the location of the XML file you saved earlier.

Once this is in the page, when you browse to the page Internet Explorer will change the drop-down icon colour on the search toolbar to orange, like this:


Figure 1: Search provider glowing when it has discovered a search provider

By clicking on the drop-down, the provider will be displayed. It has not been installed at this stage, but is available whenever the page is viewed:


Figure 2: Search provider displayed in the search provider drop-down

To allow the user of a page to install the search provider in Firefox 2+ or Internet Explorer 7+, you need to create a link on a page that calls some JavaScript. The window.external.AddSearchProvider method must be called in order for the link to work. The following script is a modified version of a script on the Mozilla Developer Center:

function installSearchEngine(url) {
  if (window.external && ("AddSearchProvider" in window.external)) {
    // Firefox 2 and IE 7, OpenSearch
    window.external.AddSearchProvider(url);
  } else {
    // No search engine support (IE 6, Opera, etc).
    alert("Sorry, your browser doesn't provide search engine support");
  }
}

This page has a search provider in its metadata, so you can see the effect in the search box if using IE7+. Alternatively you can try installing the search engine by clicking here.


Tuesday, May 27, 2008 10:26:21 PM (AUS Eastern Standard Time, UTC+10:00)   #    Comments [0]  Metadata | Webmaster
 Saturday, May 17, 2008
Image Gallery using metadata

This is a small application that takes all the images in a directory and creates a lightbox style AJAX image gallery that is web ready.  It reads the metadata in the picture to extract the title, description, keywords and rating. Primarily I wrote this to experiment with some C# 3 features, such as LINQ.

There are no options in the program. It's fairly simple: point it at an input directory and an output directory, click Make the gallery! and it does its stuff. It's not overly sophisticated; it doesn't generate thumbnails, it just uses width and height attributes on the img tags. All the necessary support files will be copied across to the output directory. A page called index.html is generated and automatically displayed when complete.

Metadata collection

Metadata is collected using the System.Windows.Media.Imaging namespace, which is part of the Windows Presentation Foundation. When I tested this it worked well on Windows Vista, but when I tested it on a machine running Windows XP SP2, I got a "codec not available" error when accessing the metadata. I got around this by installing Microsoft Photo Info, which is a fantastic utility for XP that incorporates read/write access to image metadata. It has Explorer integration, and I highly recommend it if you like adding metadata to your images. It can also be very helpful for those who have upgraded from Windows Vista to Windows XP.

The code to access the metadata is straightforward:

// fii is assumed to be the FileInfo for the image; caption is declared by the caller.
using (Stream stream = fii.OpenRead()) {
    BitmapDecoder decoder = BitmapDecoder.Create(stream, BitmapCreateOptions.None, BitmapCacheOption.Default);
    BitmapFrame frame = decoder.Frames[0];
    BitmapMetadata metadata = (BitmapMetadata)frame.Metadata;

    // Use the title as the caption, appending the subject when there is one.
    caption = metadata.Title;
    if (metadata.Subject != null) {
        caption += " - " + metadata.Subject;
    }
}

I tried closing the stream before reading the metadata, hoping that the metadata would be cached, but I suspect the decoder uses lazy loading, because an error was thrown as soon as I accessed the metadata.

LINQ seemed a good idea for filtering the files. There may be a better way but this worked just fine.

string[] files = Directory.GetFiles(directoryToProcess);

var query = from f in files
            where (new string[] { ".jpg", ".png", ".gif", ".jpeg" }).Contains(Path.GetExtension(f).ToLower())
            select f;

return query;

Having the list of file extensions as the first part of the where clause didn't please me, but that is just aesthetics.
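One way to tidy this up, sketched here as a suggestion rather than as the code in the download, is to hoist the extension list into a named, case-insensitive set so the where clause reads naturally and ToLower() is no longer needed. The class and member names are invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical alternative to the inline extension array.
public static class ImageFileFilter
{
    // Compared case-insensitively, so ".JPG" and ".jpg" both match.
    static readonly HashSet<string> ImageExtensions =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".jpg", ".jpeg", ".png", ".gif" };

    public static IEnumerable<string> SelectImages(IEnumerable<string> files)
    {
        return from f in files
               where ImageExtensions.Contains(Path.GetExtension(f))
               select f;
    }
}
```

It would be called as ImageFileFilter.SelectImages(Directory.GetFiles(directoryToProcess)).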

Javascript/CSS

The lightbox effect is achieved by using the MooTools JavaScript framework, and another library and example from phatfusion which creates the lightbox. This program just encapsulates the HTML generation,  metadata extraction and file copying.

Scope for improvement

There is a lot of scope for improvement. The class layout is fairly simple. Interfaces could be added, and a plug-in approach to allow for different LightBox or similar effects. As this was more of an experiment than a robust utility I didn’t get too precious about such design considerations. I hope it helps a few people to make a gallery for themselves.

Focas.NET.ImageGallery.zip (84.1 KB)
Saturday, May 17, 2008 10:11:52 PM (AUS Eastern Standard Time, UTC+10:00)   #    Comments [0]  Downloads | Metadata | Webmaster
 Sunday, May 11, 2008
Wikipediaise - a C# VSTO Word addin

Wikipediaise - What is it?

Wikipediaise is a Visual Studio Tools for Office (VSTO) addin for Microsoft Word, developed in Microsoft Visual Studio 2008 and written in C#. It was designed to hyperlink acronyms and jargon to Wikipedia.

I do a lot of technical documentation for my work, and the IT industry being what it is, the documents end up with a ridiculous number of acronyms. To make life easier, we usually put an abbreviation section at the top of the document, but this is a time-consuming process to go through every time, so I automated it. Additionally, I added another method which will seek out the first occurrence of an abbreviation or acronym and hyperlink it. First I will describe how this works, then how to use and customize the functionality.

Initially Wikipedia was used as the reference point, as it is an excellent reference point for technical information. After a while it became clear that many acronyms were better documented elsewhere, or in internal company documents, so I added the ability to use alternative reference sources.

Note that although I refer to acronyms, the addin is good for jargon and technical terms as well.

The following images show a before and after shot of a simple document, and then the same document with an acronym table inserted at the top.


Figure 1 - Before shot of a Word document with acronyms and jargon to be hyperlinked


Figure 2 - The same document after it has been hyperlinked


Figure 3 - The same document, hyperlinked, and with an acronym table inserted at the beginning

How it works

The application comes with an embedded XML file with a set of pre-defined acronyms. This serves as an example only. The application will look in the %mydocuments% folder for a file called wikipediaise.dic. If this file exists it will override the embedded file, so the application can be customized for most requirements.

Format of the XML file

There are two elements available in the wikipediaise.dic file shown below.

Table 1 - Elements available in wikipediaise.dic

Element | Description | Comment
excludeStyle | Lists a Word style to be excluded from the process. | This could be a built-in style or a user-defined style. If a word is in this style it will not be hyperlinked.
entry | Contains a mandatory key term that will be searched for. Optional attributes will be described later. | This text will be searched for in a case-sensitive manner. If the term is found in the middle of a word, it will still be matched. For this reason, position longer superset acronyms earlier, e.g. place https before http.

The excludeStyle element has no attributes, so it just looks like this:


Figure 4 - excludeStyle element example

The entry element has attributes, which are described below.

Attribute name | Mandatory | Description | Comment
key | Yes | The term that will be searched for and hyperlinked. | Case sensitive. Position longer superset acronyms earlier, e.g. place https before http.
wikipediaEntry | No | Only used for entries in Wikipedia where the page name is not the same as the key, e.g. the entry in Wikipedia for Apache has a page name of Apache_HTTP_Server. |
description | No | Used as a tooltip when a hyperlink is created in Word. It will also be used in the acronym table if that feature is used. |
url | No | An alternative URL if Wikipedia is not to be the source of reference. |

Table 2 - entry element attributes
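To make the attribute rules concrete, here is a rough sketch of how an entry could be resolved to a link target. It is not the addin's actual code: the class names and the Wikipedia base URL are assumptions for the example. It also shows an alternative to ordering the file by hand, which is to sort the keys longest first so that https is always tried before http.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical model of one entry element; only Key is mandatory.
public sealed class AcronymEntry
{
    public string Key { get; set; }
    public string WikipediaEntry { get; set; }
    public string Description { get; set; }
    public string Url { get; set; }
}

public static class AcronymLinker
{
    // Assumed base address; the real addin may build its links differently.
    const string WikipediaBase = "http://en.wikipedia.org/wiki/";

    // url wins if present, then wikipediaEntry, then the key itself.
    public static string ResolveUrl(AcronymEntry entry)
    {
        if (!string.IsNullOrEmpty(entry.Url)) return entry.Url;

        string page = string.IsNullOrEmpty(entry.WikipediaEntry) ? entry.Key : entry.WikipediaEntry;
        return WikipediaBase + Uri.EscapeDataString(page.Replace(' ', '_'));
    }

    // Longer keys first, so a superset acronym is matched before its subset.
    public static IEnumerable<AcronymEntry> LongestFirst(IEnumerable<AcronymEntry> entries)
    {
        return entries.OrderByDescending(e => e.Key.Length);
    }
}
```

For the Apache example in the table, an entry with key Apache and wikipediaEntry Apache_HTTP_Server resolves to the Apache_HTTP_Server page rather than the page for the key.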

Focas.NET.wikipediaise.zip (21.4 KB)
Sunday, May 11, 2008 5:29:29 PM (AUS Eastern Standard Time, UTC+10:00)   #    Comments [3]  Downloads | VSTO | Word 2007
 Wednesday, April 23, 2008
Googlebot frequency

While researching how Google crawls websites, I found this great piece of information on the Google Webmasters site.

I am certain Google doesn't crawl every website every few seconds. Does this actually mean that while Googlebot is crawling a website, it won't access that site more often than once every few seconds for the duration of the crawl? Is this to avoid looking like a potential DoS attack? I think the wording could be clearer here!

For reference, the above image was taken from this URL: http://www.google.com/support/webmasters/bin/answer.py?answer=34439&ctx=sibling


Wednesday, April 23, 2008 2:54:15 PM (AUS Eastern Standard Time, UTC+10:00)   #    Comments [0]  Webmaster
 Monday, March 10, 2008
Windows Vista grievance

I just tried to install Microsoft PowerShell on a Windows Vista Home Basic installation. OK, so it is a power application, and on a home machine, but there are some scripts I really wanted to run here. But guess what, it won't install on Vista Home Basic. I checked the system requirements at the PowerShell home page and sure enough, no mention of Vista Home Basic.

But what I think is lousy about this is that when you go to the Choose an edition page of the Vista site, it doesn't say anywhere that you cannot run PowerShell on Vista Home Basic. I find this a little sneaky. I can understand the logic behind it: perhaps if I want to use a power application, I shouldn't run it on a Home Basic installation. However, let's be honest up front about it.

Well, let's play the game and upgrade to Ultimate or even Home Premium. That is a fairly simple option as the control panel explains; in fact it states "You can learn more about editions of Windows Vista, or you can upgrade immediately." How cool is that! All I have to do is click the button, purchase my upgrade, and I will have a shiny new bells and whistles Vista edition, and now I can run PowerShell.

Unfortunately it's not that simple. The upgrade options fire up a Windows Anytime Upgrade web page to begin the process. Here I can select my billing location and proceed. Unfortunately the only options are to bill to the United States or Canada. Here in Australia we actually have the internet and are capable of online shopping. Why is it so difficult to provide an upgrade option online? So I can't really upgrade instantly as advertised.

It looks like I will have to put this off unless I feel like going to a shop to upgrade. I will be upgrading, but due to the experiences with various versions of Vista, I will be upgrading to Windows XP Service Pack 2.


Monday, March 10, 2008 9:28:08 PM (AUS Eastern Standard Time, UTC+10:00)   #    Comments [0]  PowerShell | Vista
 Monday, December 17, 2007
Handy PowerShell commands

These are some PowerShell commands I have created that I find really handy. Some of the first ones are just helper methods for aiding more useful scripts. These are stored in my PowerShell profile located at:
%userprofile%\my documents\WindowsPowerShell\Microsoft.PowerShell_profile.ps1.  There is a special variable to point to this file, called $profile. It is easy to edit my PowerShell profile by typing: notepad $profile which will open up the profile in notepad. I actually use NotePad2 which I have renamed to n2 to make it easy to use from the run prompt, command line, or PowerShell. To use NotePad2, I type n2 $profile.

To make it easier later, I have set an alias to run Internet Explorer. I don't usually use this alias directly in PowerShell, but it does get used in some cmdlets later. I call the alias ie:

set-alias ie "${env:programfiles}\Internet Explorer\iexplore.exe"

Now the cmdlet I use a lot. This one just retrieves the latest item from a podcast feed, and plays it in the default Internet Explorer media player. This cmdlet looks like this:

function play-Podcast($url) {
    ie ([xml](new-object net.webclient).DownloadString($url)).rss.channel.item[0].enclosure.url
}

This takes a URL and retrieves the contents as a string, turns it into an XML object, then navigates to the url attribute of the first item's enclosure (the equivalent of the XPath /rss/channel/item[1]/enclosure/@url, since XPath counts from 1). It then passes that to the Internet Explorer alias set earlier, and the effect is to play the latest podcast entry.
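For readers more at home in C#, the same lookup can be sketched with LINQ to XML. This is only an illustration of the navigation step, working on a feed that has already been downloaded as a string; the class and method names are invented for the example.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical C# equivalent of the lookup inside play-Podcast.
public static class PodcastFeed
{
    // Returns the url attribute of the first item's enclosure,
    // or null if the feed has no items or the first item has no enclosure.
    public static string LatestEnclosureUrl(string rssXml)
    {
        XDocument feed = XDocument.Parse(rssXml);

        XElement firstItem = feed.Root      // the <rss> element
            .Elements("channel")
            .Elements("item")
            .FirstOrDefault();

        return firstItem?.Element("enclosure")?.Attribute("url")?.Value;
    }
}
```

Unlike the PowerShell one-liner, this version returns null rather than failing when the feed has no enclosure, which makes it easier to report a sensible message.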

My colleague David laughs at me for this, he thinks I should just use a podcast client, or just use a bookmark in a browser, but I find this handy. Now I can set other cmdlets so I can hear my favourite podcasts easily. I like a security podcast by Patrick Gray called Risky Business.  The cmdlet I have created is:

function risky-Business() {
    play-Podcast http://www.itradio.com.au/security/?feed=rss2
}

One other cmdlet that is handy saves the podcast for offline listening. To achieve this I have a cmdlet called persist-Podcast, which makes use of another helper cmdlet called persist-Url. This takes a URL and a file name, and saves the resource to that file. persist-Podcast just calls this cmdlet with the URL of the feed item. First is persist-Url:

function persist-Url($url, $file) {
    (new-object net.webclient).DownloadFile($url, $file)
}

This is straightforward, using the methods used earlier, in addition to the DownloadFile method of the webclient object. Now, to pass the URL of the podcast enclosure to this method, we get:

function persist-Podcast($url, $file) {
    # Look up the enclosure URL once, then hand it to persist-Url.
    $enclosureUrl = ([xml](new-object net.webclient).DownloadString($url)).rss.channel.item[0].enclosure.url
    persist-Url $enclosureUrl $file
}

And to call this, the following can be entered into PowerShell:

persist-Podcast "http://www.itradio.com.au/security/?feed=rss2" m:\riskybusiness.mp3

 


Monday, December 17, 2007 8:11:24 AM (AUS Eastern Standard Time, UTC+10:00)   #    Comments [1]  PowerShell
 Wednesday, December 05, 2007
Adding buttons to the Word 2007 ribbon at runtime

The ribbon in Word 2007 is a great feature, and it can be customized fairly easily using Visual Studio or other tools. As far as I am aware though, it is impossible to add buttons at run time. This would be a great feature, one that is missed from the earlier versions of Word.

There is a way around it, although it doesn't provide the same functionality of adding buttons at will. When a Word 2007 Add-In loads, if it has a custom ribbon, then the ribbon's GetCustomUI method will be called. This by default returns a string, which is the XML defined in the Visual Studio designer. By modifying this method, extra buttons can be added, but be aware that this method is only ever called once, when the Add-In loads. Of course you could just define the buttons in the XML in the first place, but the method I outline here is good for a plug-in scenario, where you don't know in advance what buttons might need to be added.

To cater for a plug-in scenario, you can define extra buttons in an external configuration file that is read at start up. This can then be parsed and added to the ribbon's XML text in the GetCustomUI method. This can be used to add in any of the button styles available to the ribbon. You still need to implement handlers, which I will discuss in a later post. In this example, I will add a simple button that just uses the Microsoft happy face image.

First, I have defined a helper method which just returns a Stream from the resource file which contains the XML representation of the ribbon. The GetCustomUI method returns a string, and by default just calls the standard GetResourceText method which is created when you add a ribbon.

Stream ribbonStream = GetResourceStream("WordAddIn1.Ribbon1.xml");

I haven't included the code for the GetResourceStream method; it is straightforward and easy to create. Now, the Stream is loaded into an XmlDocument. This is because we are going to add a node later, so we need a read/write object.

XmlDocument ribbonDocument = new XmlDocument();

XmlNamespaceManager nsmgr = new XmlNamespaceManager(ribbonDocument.NameTable);

nsmgr.AddNamespace("r", "http://schemas.microsoft.com/office/2006/01/customui");

ribbonDocument.Load(ribbonStream);

Note the use of an XmlNamespaceManager. The ribbon XML declares a default namespace, so the XPath query used below needs a prefix mapped to that namespace in order to match any elements. If you are going to work with Office Open XML then you should get used to using this object. In the ribbon designer you need to define somewhere to place your custom buttons. For this example I have defined a group that I will add buttons to. This is defined in the XML like this:

<group id="grpCustomButtons"

               label="Custom buttons" />

An XPath query will now be run to get a reference to this XmlNode.

string xpath = "r:customUI/r:ribbon/r:tabs/r:tab[@idMso='TabAddIns']/r:group[@id='grpCustomButtons']";

XmlNode nodeCustomButtons = ribbonDocument.SelectSingleNode(xpath, nsmgr);

Now the custom buttons can be added to the nodeCustomButtons object. In this example I have just hard coded this, but ideally you would have helper methods that would look for a configuration file, load it and dynamically create the buttons from the information found there.

XmlNode nodeCustomButton = ribbonDocument.CreateElement("button",ribbonDocument.DocumentElement.NamespaceURI);

XmlAttribute att = ribbonDocument.CreateAttribute("id");

att.Value = "cb1";

nodeCustomButton.Attributes.Append(att);

               

att = ribbonDocument.CreateAttribute("label");

att.Value = "Custom button";

nodeCustomButton.Attributes.Append(att);

 

att = ribbonDocument.CreateAttribute("imageMso");

att.Value = "HappyFace";

nodeCustomButton.Attributes.Append(att);

 

att = ribbonDocument.CreateAttribute("size");

att.Value = "large";

nodeCustomButton.Attributes.Append(att);

This button will now be added to the nodeCustomButtons that was retrieved earlier.

nodeCustomButtons.AppendChild(nodeCustomButton);

And all that is left is to return the XML as a string as the method requires.

return ribbonDocument.OuterXml;

Figure 1 - The custom button displayed in the Add-Ins tab
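The hard-coded steps above can be generalised along the lines suggested earlier, with the button details coming from somewhere outside the code. The following is a minimal sketch under some assumptions: ButtonDefinition and RibbonXmlBuilder are names made up for the example, the definitions are passed in as objects rather than read from a configuration file, and the group id is assumed to be a simple identifier with no quote characters.

```csharp
using System.Collections.Generic;
using System.Xml;

// Hypothetical description of one button, e.g. loaded from a plug-in's configuration.
public sealed class ButtonDefinition
{
    public string Id { get; set; }
    public string Label { get; set; }
    public string ImageMso { get; set; }
}

public static class RibbonXmlBuilder
{
    const string RibbonNamespace = "http://schemas.microsoft.com/office/2006/01/customui";

    // Appends one large button per definition to the named group and returns the new XML.
    public static string AddButtons(string ribbonXml, string groupId, IEnumerable<ButtonDefinition> buttons)
    {
        var doc = new XmlDocument();
        doc.LoadXml(ribbonXml);

        var nsmgr = new XmlNamespaceManager(doc.NameTable);
        nsmgr.AddNamespace("r", RibbonNamespace);

        // groupId is assumed not to contain quote characters.
        XmlNode group = doc.SelectSingleNode("//r:group[@id='" + groupId + "']", nsmgr);
        if (group == null) return ribbonXml;  // no such group, leave the ribbon unchanged

        foreach (ButtonDefinition b in buttons)
        {
            XmlElement button = doc.CreateElement("button", RibbonNamespace);
            button.SetAttribute("id", b.Id);
            button.SetAttribute("label", b.Label);
            button.SetAttribute("imageMso", b.ImageMso);
            button.SetAttribute("size", "large");
            group.AppendChild(button);
        }

        return doc.OuterXml;
    }
}
```

GetCustomUI could then return RibbonXmlBuilder.AddButtons(GetResourceText("WordAddIn1.Ribbon1.xml"), "grpCustomButtons", definitions), where definitions is whatever list the plug-ins supplied at start up.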

What comes next? This code only creates a custom button. It doesn’t create a handler for the button, so it is a useless button. What needs to be done next is to add a handler for this button. This code is based on a plug-