I have been reading and thinking a lot about DevOps recently, specifically in the area of development/test/deployment automation and how it would be best applied to building analytic solutions.
Like metadata, DevOps is, I believe, coming into its prime, with advances in open source and a resurgence in programming combining to provide all of the enabling components needed to build an automated delivery pipeline.
In a simple delivery pipeline, code moves through a number of stages before final deployment into a production environment.
Code is created by a developer or generated by a tool and committed into a source code repository. The change is detected and triggers the packaging process.
QA checks are performed and a deployment package is built. This contains the set of changes (the delta) that must be applied to the production environment to deploy the new code.
The package is then deployed into a series of test environments where testing is executed, with the package automatically promoted from one environment to the next based on the “green light” results of the previous round of testing.
Test results are monitored throughout the entire process, and failures are detected and reported back to the developer immediately for correction. The faster you can find and report bugs, the easier it is for the developer to fix the issue, as it will still be “top of mind” and require the least effort to remedy.
Finally, packages are deployed into production, either automatically (continuous delivery) or as part of a scheduled release.
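To make those stages concrete, here is a minimal sketch of such a pipeline in Python. The stage names, functions and always-green test runner are purely illustrative assumptions, not a reference to any particular CI tool:

```python
# Minimal sketch of an automated delivery pipeline: each stage must
# "green light" before the package is promoted to the next environment.
STAGES = ["package", "unit_test", "integration_test", "uat"]

def run_stage(stage: str, package: str) -> bool:
    """Run one stage and return True on a green-light result."""
    print(f"Running {stage} for {package}...")
    return True  # placeholder: invoke the real test runner here

def deliver(package: str, continuous_delivery: bool = False) -> None:
    for stage in STAGES:
        if not run_stage(stage, package):
            print(f"{stage} failed: report back to the developer immediately")
            return
    if continuous_delivery:
        print(f"Deploying {package} to production")
    else:
        print(f"Queuing {package} for the next scheduled release")

deliver("sales_load_v42", continuous_delivery=True)
```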
An automated delivery pipeline like this looks simple on the surface, but as soon as you peel back the covers and look into the detail you quickly realise that there is a lot of complexity contained within.
Some of the challenges are technical in nature (“How do I generate delta changes to a database object?”), while others are firmly in the design/architecture realm (“How do I perform continuous integration on a monolithic data maintenance process?”).
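To illustrate the first of those challenges, here is a hedged sketch of generating delta changes for a table definition. The table and column names are invented, and a real tool would also need to handle dropped columns, type changes, dependencies and data migration:

```python
# Illustrative sketch of the "delta changes" problem: derive the ALTER
# statements needed to move a table from its current definition to the
# target one. Only added columns are handled here.
current = {"id": "INTEGER", "name": "VARCHAR(100)"}
target = {"id": "INTEGER", "name": "VARCHAR(100)", "email": "VARCHAR(255)"}

def column_deltas(table, current, target):
    return [
        f"ALTER TABLE {table} ADD COLUMN {column} {datatype};"
        for column, datatype in target.items()
        if column not in current
    ]

print(column_deltas("customer", current, target))
# ['ALTER TABLE customer ADD COLUMN email VARCHAR(255);']
```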
Whilst I am not going to be able to explore all of these issues within this article, I would like to discuss the core principle that I believe is the key to solving this complex puzzle: modular design.
Modular design is certainly not a new concept; indeed, it has been used in traditional software development for many years. Microservices are a great example of modular design in action, and I believe they will play a far greater role in analytic solution development in the future.
Many analytic solutions (e.g. a data warehouse or data lake) will have a modular design to some degree, but most do not extend the concept down to a low enough level to enable the benefits of continuous integration and delivery that DevOps automation can provide.
A monolithic process design encapsulates multiple functions within a single object, commonly a script or piece of SQL code.
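As a hypothetical illustration, the sketch below shows such an object: filtering, key resolution, a business rule and the final apply all buried in one function. All names and logic are invented for the example:

```python
# Hypothetical monolithic object: four distinct functions buried in a
# single piece of code, with the result written straight to the target.
def load_customer_sales(rows, target):
    for row in rows:
        if row["status"] != "active":                   # filter input rows
            continue
        row["customer_key"] = hash(row["customer_id"])  # resolve keys
        row["net_amount"] = row["amount"] * 0.9         # apply business rule
        target.append(row)                              # apply changes
    return target
```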
To test this object, we must supply object definitions for all inputs, test data for all inputs and expected results for the output.
Testing this single component does not present any particular challenges when done in isolation; however, when integration test requirements are considered, the limitations of monolithic design become apparent.
Consider the case where the output of this process is consumed by two downstream processes.
For integration testing, we must also test the downstream processes to ensure that they still produce the expected results. The testing scope has now increased significantly, and this may reduce the frequency of our integration testing, depending on the elapsed time of executing the full suite of tests required.
Organisations that have implemented automated deployment pipelines often report that integration testing is only done overnight, as the end-to-end elapsed time is measured in hours.
When implementing automated deployment into an existing environment, this is going to be the starting position, as you must incorporate the existing monolithic processes and implement improvements over time.
A modular process design can be thought of as a monolithic, complex process decomposed into a series of simple, atomic functions. In the agile world this is analogous to taking a user story and decomposing it into a number of tasks.
We now have four separate objects implementing the previous monolithic process.
In general, there will be fewer inputs to each object, as only the inputs necessary for that specific function are required.
Output objects will tend to be reused in subsequent functions, and there will be an increased number of working/temporary objects to store intermediate results.
The number of objects and associated artifacts (code, scripts, object definitions, test data, etc.) has increased, while the complexity of each artifact has decreased.
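To make this concrete, here is a minimal sketch of the earlier monolithic example decomposed into four atomic functions. The function names anticipate those used below (“resolve keys”, “apply business rule”, “apply changes”), and the logic remains invented for illustration:

```python
# The same process decomposed into four atomic functions. Intermediate
# results would land in working/temporary objects; only apply_changes
# produces output that other processes consume.
def filter_active(rows):
    return [r for r in rows if r["status"] == "active"]

def resolve_keys(rows):
    return [{**r, "customer_key": hash(r["customer_id"])} for r in rows]

def apply_business_rule(rows):
    return [{**r, "net_amount": r["amount"] * 0.9} for r in rows]

def apply_changes(rows, target):
    target.extend(rows)  # the only externally visible output
    return target
```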
Managing the artifacts is a perfect candidate for automation. Many of the artifacts can be generated using templates and customised with parameters.
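As a hedged sketch of that idea, the snippet below stamps out a working-table definition from a template; the template text and parameters are invented for the example:

```python
from string import Template

# A working-table definition generated from a template and customised
# with parameters; any artifact type could be produced the same way.
DDL_TEMPLATE = Template("CREATE TABLE work_$function ($columns);")

ddl = DDL_TEMPLATE.substitute(
    function="resolve_keys",
    columns="customer_id INTEGER, customer_key BIGINT",
)
print(ddl)  # CREATE TABLE work_resolve_keys (customer_id INTEGER, customer_key BIGINT);
```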
Unit testing now has a much reduced scope: if we make a change to the business rule, we just need to test the inputs and output objects of that function.
Each individual test will be simpler, and it will be easier to expand the test coverage. The elapsed time of testing will be shorter, allowing more frequent testing.
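For example, a reduced-scope unit test for the business rule only needs that function's input and expected output. This repeats the illustrative apply_business_rule function from the earlier sketch so the example is self-contained:

```python
# The illustrative business rule from the modular sketch above.
def apply_business_rule(rows):
    return [{**r, "net_amount": r["amount"] * 0.9} for r in rows]

# Reduced-scope unit test: one input, one expected output, nothing else.
def test_apply_business_rule():
    rows = [{"amount": 100.0}]
    expected = [{"amount": 100.0, "net_amount": 90.0}]
    assert apply_business_rule(rows) == expected

test_apply_business_rule()
```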
How does this impact the scope of integration testing? In our example, the impact on external processes should be constrained to the case where we change the “apply changes” function, as that is the only point where output is produced ready for use by other processes. In this case we must integration test that function and any consumers.
Let us assume that the output table of our changed function is used in both the “resolve keys” and “apply business rule” functions of the dependent processes.
We now have what I call a “minimum testable component” for integration testing which ensures that all dependent processes are tested, but also keeps the scope of the testing to a minimum.
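Here is a minimal sketch of deriving that component, assuming a hand-maintained map of which downstream functions consume each output; the process and function names mirror the example above:

```python
# Minimum testable component: the changed function plus every function
# that consumes its output. Purely internal functions have no external
# consumers, so changing them never widens the integration-test scope.
CONSUMERS = {
    "process_a.apply_changes": [
        "process_b.resolve_keys",
        "process_c.apply_business_rule",
    ],
    "process_a.apply_business_rule": [],  # internal, no consumers
}

def minimum_testable_component(changed):
    return {changed, *CONSUMERS.get(changed, [])}

print(minimum_testable_component("process_a.apply_changes"))
# {'process_a.apply_changes', 'process_b.resolve_keys', 'process_c.apply_business_rule'}
```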
There will tend to be fewer permanent source objects involved in the testing, the functions under test are simple, the individual tests are simple but comprehensive, and the elapsed time will be the minimum possible.
This is the road to the holy grail of continuous integration testing, where the end-to-end elapsed time of the testing fits within a small enough window that the testing can be performed on demand, for every change committed into the source code repository.
Continuous integration enables continuous delivery, where every change has the potential to flow automatically through the pipeline and be deployed into production.
Organisations I have seen implement this do not let all changes deploy to production automatically, nor do they let all changes proceed with only the minimum testable component for integration testing; that is an aspirational goal rather than normal practice.
Changes are rated according to risk: high, medium and low. Low-risk changes can be automatically deployed to production, provided all testing succeeds. Medium-risk changes have some manual governance checks applied before deployment, and may trigger integration testing with expanded scope. High-risk changes are typically implemented in a scheduled release, with supporting resources (developers and operations) close at hand.
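A sketch of that routing policy follows, with the risk ratings and actions taken from the description above; the change IDs and messages are invented:

```python
# Risk-rated routing for changes coming out of the pipeline.
def route_change(change_id, risk, tests_passed):
    if not tests_passed:
        return f"{change_id}: failed testing, report back to the developer"
    if risk == "low":
        return f"{change_id}: deploy automatically to production"
    if risk == "medium":
        return f"{change_id}: expanded integration tests, then manual governance checks"
    return f"{change_id}: schedule for the next release with developers and operations on hand"

print(route_change("CHG-101", "low", tests_passed=True))
print(route_change("CHG-102", "high", tests_passed=True))
```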
The goal is to minimise the risk of each change. Over time, development teams will realise that they can deploy much faster by implementing many simple, low-risk changes rather than a smaller number of monolithic, high-risk changes.
Following modular design techniques will help you to maximise the number of low-risk changes, enabling agility in the delivery of new functionality within an analytics environment that has traditionally been viewed (and built) as a large monolithic application.
Modular design will also help you to start to unravel the complexities of automating your deployment pipeline, with every small step forward providing benefits through increased code quality, a reduction in production defects and faster time to market for new functionality.