For the past year I have been working on a data management product that has to deal with a ridiculous amount of data and data transactions.
The product is aimed at the finance industry and is designed to work with financial market data and financial reference data feeds, some of which can have millions of entries per file in extreme cases.
And of course, the more feeds you process and match-merge, the more data transactions (insert, update, removal, indexing, relational mapping and querying) you generate, so the amount of data you process can grow exponentially.
Now the issue is simply that there is only so much processing you can do at certain points of a transactional process before you hit some pretty major latency and processor performance issues. Shortly after those are resolved, you tend to hit physical infrastructure issues.
A case in point: using SQL Server bulk copy on an optimized table, you can import data at incredible speed. However, if you need to process that data, mark it up in some way and merge it with other data, performance will nosedive. Add basic auditing/logging support and it will drop again.
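To make that contrast a bit more concrete, here is a minimal sketch in Python. It uses the standard library's `sqlite3` module as a stand-in for SQL Server, and the table names (`prices`, `merged`, `audit`) and the "suspect price" mark-up rule are invented purely for illustration. It only shows how much extra work per row the second path does. It is not a benchmark and not how the product described here is built.

```python
import sqlite3

def bulk_load(conn, rows):
    """Load every row in one call, with no per-row processing."""
    conn.executemany("INSERT INTO prices (symbol, price) VALUES (?, ?)", rows)
    conn.commit()

def load_with_processing(conn, rows):
    """Load rows one at a time: mark each up, merge it, then audit it."""
    for symbol, price in rows:
        # Mark-up step (hypothetical rule): flag non-positive prices.
        status = "suspect" if price <= 0 else "ok"
        # Merge step: insert the symbol or overwrite its existing row.
        conn.execute(
            "INSERT OR REPLACE INTO merged (symbol, price, status) VALUES (?, ?, ?)",
            (symbol, price, status),
        )
        # Audit step: one extra write for every row processed.
        conn.execute(
            "INSERT INTO audit (symbol, action) VALUES (?, ?)",
            (symbol, "merged"),
        )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE prices (symbol TEXT, price REAL);
    CREATE TABLE merged (symbol TEXT PRIMARY KEY, price REAL, status TEXT);
    CREATE TABLE audit  (symbol TEXT, action TEXT);
""")

rows = [("AAA", 10.5), ("BBB", 0.0), ("AAA", 11.0)]
bulk_load(conn, rows)             # one statement for the whole feed
load_with_processing(conn, rows)  # two statements per row, plus the mark-up logic
```

The first function hands the whole feed to the database in one go. The second does a mark-up, a merge and an audit write for every single row, which is where the time goes once the feed has millions of entries.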
Most people resort to one of two main architectures in light of this:
- BRBR – Bloody Row by Row Processing
  - This is the most flexible way of processing large data sets and offers many ways to improve speed, but it is still a long way behind batch processing in all but a couple of scenarios.
  - BRBR can be optimized in the following ways:
    - Multi-threading
    - Parallel processing
    - Multi-database server
    - Cloud and grid computing and processing
      - This is pretty new and not that mature yet
    - Custom data storage and retrieval based upon the data and how it is obtained
      - There are a few of these out there, including a few very powerful and fast data management and ETL-like tools
  - BRBR can suffer greatly based upon:
    - Skill of engineers
    - Database technology
    - Database latency
    - Network latency
    - Development tool used
- Batch Processing
  - If you are able to use just a single database server, you can look at the various ways of doing batch processing.
  - Batch processing can be optimized by:
    - Having a great, god-like DBA
    - Database tool selection
    - Good cloud/grid computing support
  - Batch processing can suffer greatly from:
    - Any need to audit data processing that is done in batch. Most of the time you may have to run another batch process or a BRBR process to create audit data that could be generated more easily at the point of change in a BRBR-based system
    - Database dependency
    - System upgrades
    - Visibility into data processing/compliance
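The auditing difference between the two approaches can be sketched roughly as follows. Again this is Python with `sqlite3` standing in for a real database server, and the `master`, `staging` and `audit` tables and their sample data are assumptions made up for the example. The batch version needs a second set-based statement just to produce audit rows, while the row-by-row version writes the audit record at the point of change.

```python
import sqlite3

def make_db():
    """Build a small in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE master  (id INTEGER PRIMARY KEY, price REAL);
        CREATE TABLE staging (id INTEGER, price REAL);
        CREATE TABLE audit   (id INTEGER, old_price REAL, new_price REAL);
        INSERT INTO master  VALUES (1, 10.0), (2, 20.0), (3, 30.0);
        INSERT INTO staging VALUES (1, 10.0), (2, 21.5), (4, 99.0);
    """)
    return conn

def merge_batch(conn):
    """Batch: one set-based update, plus a separate pass to build audit data."""
    # Audit pass: must run before the update, while old values still exist.
    conn.execute("""
        INSERT INTO audit (id, old_price, new_price)
        SELECT m.id, m.price, s.price
        FROM master m JOIN staging s ON s.id = m.id
        WHERE m.price <> s.price
    """)
    # Update pass: apply every staged price in a single statement.
    conn.execute("""
        UPDATE master
        SET price = (SELECT s.price FROM staging s WHERE s.id = master.id)
        WHERE id IN (SELECT id FROM staging)
    """)
    conn.commit()

def merge_row_by_row(conn):
    """BRBR: handle one staged row at a time and audit at the point of change."""
    staged = conn.execute("SELECT id, price FROM staging").fetchall()
    for row_id, new_price in staged:
        old = conn.execute(
            "SELECT price FROM master WHERE id = ?", (row_id,)
        ).fetchone()
        if old is None or old[0] == new_price:
            continue  # unknown id or unchanged price: nothing to do
        conn.execute("UPDATE master SET price = ? WHERE id = ?", (new_price, row_id))
        conn.execute(
            "INSERT INTO audit (id, old_price, new_price) VALUES (?, ?, ?)",
            (row_id, old[0], new_price),
        )
    conn.commit()

def snapshot(conn):
    """Return the master and audit tables as sorted lists for comparison."""
    master = conn.execute("SELECT id, price FROM master ORDER BY id").fetchall()
    audit = conn.execute(
        "SELECT id, old_price, new_price FROM audit ORDER BY id"
    ).fetchall()
    return master, audit

batch_db, brbr_db = make_db(), make_db()
merge_batch(batch_db)
merge_row_by_row(brbr_db)
batch_result = snapshot(batch_db)
brbr_result = snapshot(brbr_db)
```

Both paths end up with the same master and audit data here. The point is where the audit work happens: in the batch version it is a separate statement that has to be ordered carefully around the update, whereas in the row-by-row version it falls out of the loop at the cost of several statements per row.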
That's it for now. Kind of a messy post.
I will clean this up later and add a post or two on advances in cloud computing in relation to very large database processing issues.