Tuesday, December 08, 2009

Basic high-level design of a text-based matrix parser…

In this post I will introduce the basic architecture for a matrix-based text file/stream parser that can handle both fixed-width and delimited formats. The matrix parser can also handle 1-2-1, 1-2-M and M-2-1 relationships in the text stream, as well as mixed-format files that combine column-based entries, CSV sections and multiple ragged fixed-width layouts.
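To make the two basic formats concrete, here is a minimal Python sketch of pulling fields out of one fixed-width line and one delimited line. The function names, the sample record and the column widths are made up for illustration; they are not part of the parser described in this post.

```python
import csv
import io


def split_fixed_width(line, widths):
    """Cut a line into fields at the given column widths, stripping padding."""
    fields = []
    pos = 0
    for width in widths:
        fields.append(line[pos:pos + width].strip())
        pos += width
    return fields


def split_delimited(line, delimiter=","):
    """Split one delimited line, honouring quoted fields."""
    return next(csv.reader(io.StringIO(line), delimiter=delimiter))


# Hypothetical record: a 4-character id, a 10-character name, a 6-character city.
fixed_line = "0001" + "Smith".ljust(10) + "London"
fixed_fields = split_fixed_width(fixed_line, [4, 10, 6])

# The same kind of record in delimited form, with a quoted field containing a comma.
delimited_fields = split_delimited('0001,"Smith, J",London')
```

A mixed-format file would need both strategies in the same stream, which is why the parser has to know which format applies to which part of the file.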

The concept is in reality quite simple and, if you think about it after looking at the architecture diagram below, quite similar to the structure used when creating complex reports in tools like Crystal Reports.

Here is a text breakdown of the key parts of a matrix parser.

  • Scanner
  • Token Extractor
  • Matrix structure made up of
    • Blocks of rows that can be of CSV, fixed or column meta-type, made up of
      • Rows, which can contain tokens or be made up of
        • Columns, which can contain
          • another Block or the data to parse
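The nesting in the list above can be sketched roughly as follows in Python. This is only an illustration under my own assumptions: the class and field names (Matrix, Block, Row, Column, kind, content) and the leaf_values helper are hypothetical and are not taken from the class diagram. The point is that a column holds either raw data or another block, which is what allows a 1-2-M header/detail relation to be expressed.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Column:
    name: str
    # Either the raw data to parse or a nested Block (hypothetical field name).
    content: Union[str, "Block", None] = None


@dataclass
class Row:
    columns: List[Column] = field(default_factory=list)
    tokens: List[str] = field(default_factory=list)  # row-level tokens, unused here


@dataclass
class Block:
    kind: str  # assumed labels: "csv", "fixed" or "column"
    rows: List[Row] = field(default_factory=list)


@dataclass
class Matrix:
    blocks: List[Block] = field(default_factory=list)


def leaf_values(block):
    """Walk a block depth-first and collect the raw data held in its columns."""
    values = []
    for row in block.rows:
        for column in row.columns:
            if isinstance(column.content, Block):
                values.extend(leaf_values(column.content))
            elif column.content is not None:
                values.append(column.content)
    return values


# A 1-2-M example: one CSV header row whose last column holds a fixed-width detail block.
detail = Block("fixed", [Row([Column("item", "widget")]),
                         Row([Column("item", "gadget")])])
order = Block("csv", [Row([Column("order_id", "1001"), Column("lines", detail)])])
matrix = Matrix([order])
```

Calling leaf_values(matrix.blocks[0]) walks the header row first and then descends into the nested detail block.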

Below is a high-level class diagram for a simplistic matrix-based text parsing library/API (the scanning/parsing/tokenizing part is left out).

 

[Image: high-level class diagram of the matrix parser]

I will put up a full architecture later, when I walk through how the scanner/parser and token extractor work.
