Is Hand-Coded ETL the Way to Go? Absolutely Yes, or Absolutely No, Depending…

This article originally ran in May 31, 2003 edition of Intelligent Enterprise magazine, written by Gary Nissen, edited by Ralph Kimball.

Recently, there was a question posted on a data warehousing e-mail list that is frequently asked, yet always surprises me. The question asked whether it is acceptable or recommended to hand-code the Extract-Transform-Load (ETL) logic for a data warehouse or if tool-based ETL is always preferred.

My standard answer is that it is perfectly appropriate to custom develop the ETL code. In the right situation, custom developed ETL can be cheaper and more maintainable. But in other situations, a tool-based approach is better. Just like any other technology decision, it depends. It all comes down to understanding the scope and requirements of the project.

I guess what surprises me is that this decision is no different than any other technology decision in any other development project. You simply consider all significant requirements, balance your options and make a business decision. In other words, just follow your standard project plan and it will lead you to the right decision.

A simple project plan for an ETL system

In my experience, every well-run project follows steps similar to these in a reasonable and efficient manner.

  1.  Requirements are written and agreed to with enough detail so that all parties generally understand each other, what is in scope and what is not in scope. It is also a good idea to identify some of the tougher deliverables and agree to a general methodology to address each. The principal parties to an ETL system besides the ETL system developers are the original data providers (such as the production DBAs), the data cleaning and loading specialists, the presentation system DBAs who mainly build star schemas, and the application developers who in turn serve the business community.
  2. Design decisions are agreed to by the ETL development team, then documented. The goal is to design the simplest possible solution that delivers the requirements. The design document should include technology decisions, data flows and development standards.
  3. ETL transformation modules and data flows are developed according to the designs. Throughout the process, the developers continually conduct tests to ensure that requirements are being met.
  4. Documentation is written to support DBAs, IT managers, and future developers.
  5. Data warehouse DBAs conduct acceptance testing by working with the system to ensure that it delivers what was asked for and that what was asked for meets the actual needs.
  6. A training and support system is put in place to provide user guidance and feedback for improvements.

Most people see this list and assume that each step is done to the extreme and, therefore, will take too much time and effort. That will doom a project just as quickly as skipping these steps. Each step must be done just enough to successfully move on to the next step. Anything else is wasted effort.

Just how simple is it?

I use these guidelines on my projects to keep the process simple enough so that no one feels the need to cheat. These guidelines don’t fit every time, but probably work for 80% of our projects.

  • Requirements should be 3-5 pages, no more than 10. Gather requirements in 4-8 hours of meetings or interviews. Ask things like “What are your top 3 administrative requirements for this ETL system” and “What are the most difficult things you currently do and cannot do, but wish you could?”. You should complete your requirements document in about 5 calendar days. Anything more might indicate that your project is too ambitious, not fully understood or you’re trying to do too much too early.
  • Design documents should be 5-20 pages and take about a week for the development team to complete. The data flows for each table in your data warehouse take up most of this document. Each target table can be described in one or two pages. Use a diagramming tool if possible.
  • Developer testing should include a clean install, a complete configuration, loading a representative set of data sources, and delivery of required final data structures within specified  processing windows. Known “little surprises” like corrupted data should be introduced into the testing on the second or third iteration. This process should be completed several times through the development phase. If possible, it should be automated so it can be run more frequently.
  • Developers should either write the documentation or be very involved in the writing of the documentation. The documentation should be brief and simple, focusing on the primary users of the system – database administrators, IT managers, and future developers. Take time to describe what each of them needs to know. But, again, don’t go overboard.
  • During acceptance testing, the target ETL administrative users are expected to spend significant time using the system the way they expect to after delivery. I always make a point of saying, “We take our responsibilities seriously and we expect the same from you”.
  • If everyone did their job well, there should only be a few things found during acceptance testing. Address the issues, allow the target users to re-test, and deploy.

What about the hand coding question?

Look back at the original question. Do you now see why it surprises me? You can’t know whether it is appropriate to hand-code the ETL system until you understand the scope of the project. The development team makes this decision after the project scope and requirements are well understood. At this point, you may still have questions regarding the right ETL development method. But the questions will be more detailed and specific.

For example:

  • We need to make an ETL system that only takes data from one packaged application and will be distributed to many customers. As part of this system, the ETL needs to have a lot of configuration options. I don’t think that I can address this with a tool-based ETL system. Can I?
  • We are building a small data warehouse with a little database and a small budget. The ETL programming seems straightforward and we have programmers on staff that can handle it. The tools available all cost about twice our development estimates. We think we should hand-code the ETL to save money. Everyone tells us we NEED a tool. Are we missing something big here?
  • We have designed a relatively large data warehouse. Even though the ETL processes are pretty complex, we are inclined to write these processes by hand because the tools cost so much. Do the tools really save me enough time to pay for themselves? Where does the big payback come from?

So the point is that you need to understand your project and requirements before this decision can be made. Once you understand your project, you should consider the advantages of hand-coded and tool-based ETL development and choose the best fit. Be advised, however, that hand-coded ETL is not a highly regarded approach. Don’t let the industry bias persuade you. Try to evaluate your options independently and choose the best fit for your project.

Advantages of tool-based ETL

Tool-based ETL systems enjoy several well-known advantages:

  1. Simpler, faster, cheaper development. The ETL tool cost will make up for itself in projects that are large enough or sophisticated enough.
  2. Many ETL tools have integrated metadata repositories that can synchronize metadata from source systems, target databases and other BI tools.
  3. Most ETL tools enforce a consistent metadata driven methodology that all developers must follow.
  4. Data flow and data lineage documentation can often be automatically generated from a metadata repository.
  5. ETL tools have connectors pre-built for many source and target systems. So, if you are working with many sources or targets, you will find many advantages in a tool-based approach.
  6. Most ETL tools deliver good performance even for very large data sets. If your ETL data volume is really large, or it will be large in a year or two, you should seriously consider a tool-based option.
  7. Some ETL tools help you conduct change impact analysis further down the road.

Advantages of hand-coded ETL

However, hand-coded ETL development provides its own unique advantages:

  1. Automated unit testing tools are available in a hand-coded system, but not with a tool-based approach. For example, the JUnit library ( is a highly regarded and well-supported tool for unit testing Java programs. There are also similar packages for other languages. Another common approach to unit testing is to use a scripting language, like TCL or Python, to setup test data, run an ETL process and verify the results. Automating the testing process through one of these methods will significantly improve the productivity of your QA staff and the quality of your deliverables.
  2. Object-oriented programming techniques help you to make all of your transformations consistent for error reporting, validation, and metadata updates. You may not have this degree of control with the parameter setting and snippet programming style of ETL tools.
  3. Metadata can be more directly managed by hand-coded systems. Actually, metadata driven hand-coded systems are much more flexible than any alternative.
  4. A brief requirements analysis of an ETL system quickly points you towards file-based processing, not database stored procedures. File-based processes are more direct. They are simply coded, easily tested and well understood.
  5. A tool-based approach will limit you to the tool vendor’s abilities and their unique scripting language. But, a hand-coded system can be developed in a common and well-known language. In fairness, all the ETL tools allow “escapes” to standard programming languages in isolated modules.
  6. Hand-coded ETL provides unlimited flexibility, if that is indeed what you need. You can literally do anything you want. There are many instances where a unique approach or a different language can provide a big advantage.

Why are so many people against hand-coded ETL?

Many people in the data warehouse industry believe that ETL processes should not be hand-coded. They say that hand-coded systems are more difficult to maintain, developers won’t produce documentation and everything slowly migrates to a convoluted mess. But, the problem is not that hand-coded systems deliver poor results. Poor project management and the lack of enforced discipline deliver poor results. It just so happens that when a project is not adequately analyzed or managed, the hand-coded route is frequently taken but the project is not managed as a serious software development effort should be managed. The real costs and issues aren’t understood, so the ETL tool cost always looks out of proportion to the perceived need.

But, just because poorly run projects are hand-coded, doesn’t mean that tool-based ETL is always the right choice either. There are no silver bullets that will make your project successful. Well, except maybe hard work and a disciplined approach.

Speak Your Mind