On Lexers, Syntax Highlighting and a good suggestion.
by Mike
In my last post I mentioned that I am working on a repository for storing and sharing code snippets. In the comments, Jason Gedge (a very smart man, I might add) suggested that I store the snippets as XML and then use XSL/XSLT to produce the markup for syntax highlighting. This is one of the many ideas I have thrown around, and after some more thought I think it is the one I am going to go with.
That being said, to make things more difficult on myself I think I am going to write a lexer.
From the Wikipedia article:
In computer science, lexical analysis is the process of converting a sequence of characters into a sequence of tokens. Programs performing lexical analysis are called lexical analyzers or lexers. A lexer is often organized as separate scanner and tokenizer functions, though the boundaries may not be clearly defined.
So the intent is for my lexer to take grammars I write for individual programming languages (it would be nicer if I could find an existing set of grammars for common languages and just use them), tokenize each snippet according to the grammar for its language, and generate the XML from the resulting tokens before I store it in the database.
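To make the idea concrete, here is a minimal sketch of what such a lexer might look like in Python. Everything in it is an assumption for illustration: the token rules are a tiny made-up "grammar" for a Python-like language, and the `snippet`/`token` element names are hypothetical, not a format I have settled on. It combines the rules into one regular expression with named groups and writes the tokens out with the standard library's `xml.etree.ElementTree`.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical token rules for a tiny Python-like language.
# Order matters: earlier rules win when several could match.
TOKEN_RULES = [
    ("comment", r"#[^\n]*"),
    ("string", r"\"[^\"\n]*\"|'[^'\n]*'"),
    ("number", r"\d+(?:\.\d+)?"),
    ("keyword", r"\b(?:def|return|if|else|for|while|import)\b"),
    ("name", r"[A-Za-z_]\w*"),
    ("space", r"\s+"),
    ("other", r"."),  # catch-all so no character is ever dropped
]

# One combined pattern; each rule becomes a named group.
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_RULES))


def tokenize(source):
    """Yield (token_type, text) pairs covering the whole input."""
    for match in MASTER.finditer(source):
        yield match.lastgroup, match.group()


def to_xml(source, language):
    """Wrap every token in a <token type="..."> element (hypothetical format)."""
    root = ET.Element("snippet", {"language": language})
    for kind, text in tokenize(source):
        token = ET.SubElement(root, "token", {"type": kind})
        token.text = text
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(to_xml("def f(x): return x + 1  # add one", "python"))
```

Supporting another language would then mean supplying a different rule list rather than new code, which is roughly what I mean by writing a grammar per language.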
Then I will use XSL/XSLT (I think this is what I will do) to transform my XML into XHTML. It won’t accomplish what Jason suggested, which was taking in generic code and outputting it in multiple languages, but I think it should solve the code storage and syntax highlighting problem in a sane way.
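To show the shape of that transformation step, here is a rough stand-in written in Python rather than XSLT (Python's standard library has no XSLT processor, so this is not the stylesheet itself). It assumes a hypothetical stored format of `<snippet language="...">` containing `<token type="...">` elements, and maps each token to an XHTML `<span>` whose `class` is the token type, so a stylesheet can colour it. An XSLT template matching `token` would do the same job.

```python
import xml.etree.ElementTree as ET

# Hypothetical stored snippet; the element and attribute names are assumptions.
SAMPLE = (
    '<snippet language="python">'
    '<token type="keyword">return</token>'
    '<token type="space"> </token>'
    '<token type="number">1</token>'
    '</snippet>'
)


def snippet_to_xhtml(snippet_xml):
    """Turn the token XML into a <pre> of <span class="TYPE"> elements."""
    snippet = ET.fromstring(snippet_xml)
    pre = ET.Element("pre", {"class": "code " + snippet.get("language", "")})
    for token in snippet.iter("token"):
        span = ET.SubElement(pre, "span", {"class": token.get("type", "other")})
        span.text = token.text
    return ET.tostring(pre, encoding="unicode")


if __name__ == "__main__":
    print(snippet_to_xhtml(SAMPLE))
```

The nice part of this split is that the stored XML never changes when the highlighting colours do; only the CSS (or the stylesheet) has to.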
Needless to say, this should be interesting. I have never written a lexer before, so I can only imagine how it will turn out! As long as it gets the job done and it’s relatively fast (light), I will consider it a success.
Comments
You can write a trivial lexer in no time at all, but it’ll most likely be relatively inefficient. By trivial I’m thinking one that tries all possibilities from the current character. Rather brute force.
One suggestion is to write up your own lexer-like “generator” based off of some format. That way you can reuse the code over all the languages you support. For example, http://en.wikipedia.org/wiki/Flex_lexical_analyser is something I’ve used in the past, and you could use similar ideas.
Also, if you’re open to using existing software, I have another suggestion. Since you’re using Python/Django, http://pygments.org/ would be an option for lexers. If not, even checking out its source might give you some ideas.
Finally, if you’re just lookin’ for some learnin’, then implementing one yourself is good stuff. If you’d really like to get in all kinds of learning, consider building a (finite) state machine from something similar to the input that Flex uses.
Have fun!
P.S. Thanks for the compliments. You’re too kind
I have actually checked out Pygments and it did look good. I’m in no rush to get the project done (and I don’t know what I am going to use it for), so at this point I am pretty sure I am going to code everything myself.
As for using an FSM (or FSA, I guess), I have been considering that as well. Reading each character in the text only once would mean it will be fast, and fast and small are what I am going for. So that is definitely an option I am considering.
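For what it's worth, here is a rough sketch of the single-pass idea, with a hand-written transition table instead of one generated from Flex-style input. The states, character classes and token names are all made up for illustration. Each character is classified exactly once; when the table has no transition for the current state, the current token ends and a new one begins.

```python
def classify(ch):
    """Map a character to a (hypothetical) character class."""
    if ch.isalpha() or ch == "_":
        return "name"
    if ch.isdigit():
        return "number"
    if ch.isspace():
        return "space"
    return "punct"


# (state, character class) -> next state. A missing entry means the
# current token is finished and the machine restarts from "start".
TRANSITIONS = {
    ("start", "name"): "name",
    ("start", "number"): "number",
    ("start", "space"): "space",
    ("start", "punct"): "punct",
    ("name", "name"): "name",
    ("name", "number"): "name",  # digits are allowed inside identifiers
    ("number", "number"): "number",
    ("space", "space"): "space",
}


def fsm_tokenize(source):
    """Return (token_type, text) pairs, looking at each character once."""
    tokens = []
    state, start = "start", 0
    for i, ch in enumerate(source):
        cls = classify(ch)
        nxt = TRANSITIONS.get((state, cls))
        if nxt is None:
            # No transition: emit the token so far, begin a new one here.
            tokens.append((state, source[start:i]))
            start = i
            nxt = TRANSITIONS[("start", cls)]
        state = nxt
    if state != "start":
        tokens.append((state, source[start:]))  # flush the final token
    return tokens


if __name__ == "__main__":
    print(fsm_tokenize("x1 = 42+y"))
```

A real version would need more states (strings, comments, multi-character operators), but the loop itself would stay the same; only the table grows.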