On Lexers, Syntax Highlighting and a good suggestion.

by Mike

In my last post I mentioned that I am working on a repository for storing and sharing code snippets. In the comments, Jason Gedge (a very smart man, I might add) suggested that I store code snippets as XML and then use XSL/XSLT to generate the markup for the syntax highlighting. This is one of the many ideas I have thrown around, and after some more thought I think it is the one I am going to go with.

That being said, to make things more difficult for myself, I think I am going to write a lexer.
From the Wikipedia article:

In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. Programs performing lexical analysis are called lexical analyzers or lexers. A lexer is often organized as separate scanner and tokenizer functions, though the boundaries may not be clearly defined.
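
To make that definition a bit more concrete, here is a minimal sketch of a regex-based lexer in Python. It is only an illustration of the idea, not the lexer I will end up with: the `Token` type, the `TOKEN_SPEC` rules and the `tokenize` function are all hypothetical names, and the "grammar" is a toy one that covers just a handful of token kinds.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str  # e.g. "keyword", "number", "comment"
    text: str  # the exact characters matched


# Hypothetical toy grammar: ordered (kind, regex) pairs. Earlier rules win,
# so "keyword" must come before "name".
TOKEN_SPEC = [
    ("comment", r"#[^\n]*"),
    ("string", r"\"[^\"\n]*\"|'[^'\n]*'"),
    ("number", r"\d+(?:\.\d+)?"),
    ("keyword", r"\b(?:def|return|if|else|for|while)\b"),
    ("name", r"[A-Za-z_]\w*"),
    ("space", r"\s+"),
    ("other", r"."),  # catch-all so no character is dropped
]

# One master pattern with a named group per token kind.
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(source: str) -> Iterator[Token]:
    """Turn a sequence of characters into a sequence of tokens."""
    for match in MASTER.finditer(source):
        yield Token(match.lastgroup, match.group())


if __name__ == "__main__":
    for token in tokenize("x = 42 # the answer"):
        print(token)
```

Because whitespace and the catch-all "other" rule are kept as tokens, joining the token texts back together reproduces the original source exactly, which matters when the goal is to store and redisplay a snippet rather than compile it.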

So the intent is for my lexer to take grammars I write for individual programming languages (it would be nicer if I could find an existing set of grammars for common languages and just use those), tokenize each snippet according to its language's grammar, and then generate the XML from the result before I store it in the database.
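
As a rough sketch of the "tokens in, XML out" step, something like the following could work. The element and attribute names (`snippet`, `token`, `language`, `kind`) are placeholders I made up for illustration, not a settled schema, and the token list is hard-coded here in place of real lexer output.

```python
import xml.etree.ElementTree as ET


def tokens_to_xml(language, tokens):
    """Wrap (kind, text) pairs in a hypothetical <snippet>/<token> XML layout."""
    root = ET.Element("snippet", {"language": language})
    for kind, text in tokens:
        element = ET.SubElement(root, "token", {"kind": kind})
        element.text = text  # ElementTree escapes <, > and & on output
    return ET.tostring(root, encoding="unicode")


# Stand-in for what a lexer might produce for the source text "if a<b".
tokens = [
    ("keyword", "if"),
    ("space", " "),
    ("name", "a"),
    ("other", "<"),
    ("name", "b"),
]
xml_text = tokens_to_xml("python", tokens)

if __name__ == "__main__":
    print(xml_text)
```

The resulting string is what would go into the database. Since every token keeps its exact text, the original snippet can always be recovered by concatenating the token contents.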

Then I will use XSL/XSLT (I think this is what I will do) to transform my XML into XHTML. It won’t accomplish what Jason suggested, which was taking in generic code and outputting it in multiple languages, but I think it should solve the problem of sane code storage and syntax highlighting.
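
To show the shape of that transformation, here is a small Python stand-in. To be clear, this is not XSLT: it uses the standard library's ElementTree to do by hand what a simple stylesheet would do, mapping each token to a `span` whose `class` is the token kind. The input layout (`snippet` and `token` elements with `language` and `kind` attributes) is an assumed placeholder schema, and `snippet_to_xhtml` is a hypothetical name.

```python
import xml.etree.ElementTree as ET


def snippet_to_xhtml(xml_text):
    """Map each <token kind="..."> to <span class="..."> inside a <pre>."""
    snippet = ET.fromstring(xml_text)
    pre = ET.Element("pre", {"class": "code " + snippet.get("language", "")})
    for token in snippet.iter("token"):
        span = ET.SubElement(pre, "span", {"class": token.get("kind", "other")})
        span.text = token.text
    return ET.tostring(pre, encoding="unicode")


# Assumed stored form of the snippet "return 1".
sample = (
    '<snippet language="python">'
    '<token kind="keyword">return</token>'
    '<token kind="space"> </token>'
    '<token kind="number">1</token>'
    "</snippet>"
)
html = snippet_to_xhtml(sample)

if __name__ == "__main__":
    print(html)
```

With markup like this, the actual colors live in a CSS file keyed on the class names, so changing the highlighting theme would not require touching the stored XML.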

Needless to say, this should be interesting. I have never written a lexer before, so I can only imagine how it will turn out! As long as it gets the job done and it's relatively fast (light), I will consider it a success.