
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


May 29, 2016

Simplified Analytics

Digital Transformation in Banking - my POV

The digital banking landscape has never been more dynamic than it is today.  The number of people going into branches to do their banking is falling dramatically. Customers are changing the way...


May 28, 2016

Omer Brandis

Scale out your existing MySQL landscape with Scalebase

In a nutshell, ScaleBase is to MySQL what Greenplum DB is to PostgreSQL: it makes it possible to create an MPP database based on MySQL. You can use it to scale out your existing MySQL applications without changing your code, and/or create new MPP databases for handling big data.
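To illustrate the core idea (this is a generic sketch of hash-based shard routing, not ScaleBase's actual mechanism or API, and the shard names are invented), an MPP layer in front of MySQL essentially routes each query to one of several independent database nodes by key:

```python
# Hypothetical illustration of hash-based shard routing, the basic idea
# behind transparently scaling out a single-node database into an MPP setup.
import hashlib

SHARDS = ["mysql-shard-0", "mysql-shard-1", "mysql-shard-2"]

def shard_for(routing_key: str) -> str:
    """Map a routing key (e.g. a customer id) to a shard deterministically."""
    digest = hashlib.md5(routing_key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]
```

Because the routing is deterministic, the layer can send the same key to the same node every time while the application continues to speak plain SQL.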

Data Digest

The Issue of Data Ownership Explained: Could Chief Data Officers Have Got it All Wrong?

Ahead of the coming Chief Data Officer Forum, Financial Services, happening on June 22-23, 2016, we spoke to Marc Alvarez, Chief Data Officer at Mizuho Securities USA and one of the distinguished speakers at the event, who shared his insights on the key issues gripping CDOs in the financial services industry. We spoke about the challenges faced in this highly regulated environment, the tangible risks involved when one fails to utilize data, and the sticky issue of data ‘ownership’.

CDO Forum: In financial services, data is pervasive in everything a business does. How can CDOs establish their remit when every part of the business deals with data?

Marc Alvarez: It’s not really an optional exercise in today’s financial services industry. There is a very strong and to some extent increasing regulatory mandate pushing this requirement – regulators are seeking to better understand and have transparency into the data operations within a firm in order to better measure possible risk. By definition this is a highly quantitative function, in line with other activities such as quarterly reporting of earnings and other performance reports.

Basic statistics tells you that accurate analysis is dependent on the quality and completeness of the data used in the analysis. So the need for improving data quality is a natural extension of the remit from regulators for improved analysis. By the way, this is an evolution that has been ongoing for decades – as technology becomes more sophisticated and economically affordable, more sophisticated statistical and quantitative methods can be brought to bear to better benefit the firm and its clients.

I think this is the key touch point for any organization, whether in the capital markets or not. Removing friction in the ability to access data content to feed empirical analysis is the natural outcome. Looking at the data footprint within a firm and identifying key pain points is the place to start – then moving to establish controls and improve quality of service and the user experience is likely to generate the visibility to launch a broader program. Metrics around data quality, usage, and, yes, cost provide the right basis for better managing content, leading to more and better analysis, better satisfaction of regulatory requirements, and better-informed customers, investors, and staff within the firm.

The key, though, is to recognize and communicate that this is a journey and not a destination. Industry experience has shown that the appetite for improved quantitative and statistical analysis is insatiable, and an equal effort needs to go into providing reliable and accurate input. It’s not, and never will be, a one-off project.

CDO Forum: Because of data ownership issues, the CDO’s relationship with other senior executives can sometimes be difficult. What do you think is the key to partnering productively with other executives?

Marc Alvarez: I really don’t find the concept of “ownership” particularly meaningful or useful in the data management context. In today’s capital markets, a wide variety of data is used and re-used across the organization – it’s a fundamental resource like air or water in my opinion. We simply can’t run our business without it just like a blast furnace can’t run without oxygen or a sailboat without wind (or water for that matter).

That means we have multiple uses for the same resource. The real value to the firm is in deploying infrastructure that makes accessing and consuming the resource as efficient as possible. In today’s day and age that means establishing some form of service level to meet these requirements at enterprise scale. In the past, this could be handled by the IT team on a project-by-project basis, but the demands are rapidly exceeding that capability and the approach is likely not sustainable.

I think the key is to get dialogue going at a senior level within the firm and undertake an honest assessment of the firm’s capabilities to meet the growing demand. More IT spend is likely part of the answer, but it’s also becoming increasingly clear that firms need to build new competencies in the areas of data acquisition, management, administration, accounting and so on if they are going to not only compete, but thrive in an increasingly globalized and digitized economy.

Senior executives are well versed in working with data since they have to report results on a regular basis. So it’s not a foreign topic for them (at least it shouldn’t be). They get the potential upside to be won from thinking of new ways to deal with data supply at scale within the firm… once you cross that particular threshold, it really becomes a question of setting priorities and execution, both areas where financial firms in particular excel.

CDO Forum: Is it fair to say that the heavy regulatory burden that finance companies face means they have been slower to find the value in their data than those in other industries? How do you think this can be overcome?

Marc Alvarez: I would disagree that financial firms have been too slow to come to value their data. If anything, financial firms have been at the forefront of consuming and managing data content for decades. Data content – whether it’s prices, corporate actions, news, symbology, reference data, or whatever – is a fundamental input to a financial firm’s business operations… plus there is the universe of content the firm generates itself. If anything, I think financial firms are at the front of the curve in deploying data content to everyday business operations. Just look at the amount of investment Silicon Valley is making in funding firms that sell to financial services!

And while regulatory reform is proving to be a major overhead to the business (we can debate just how effective it’s proving), I think it’s clear that financial firms are very well equipped to meet these challenges. The difference is that the impetus is coming from regulation and the scale of reporting and analysis is unlike anything we’ve seen before.

However, I’m fairly optimistic on this point – the financial industry is uniquely placed, with its long history of working with data content and technology, to have the skills to meet these demands. And if the new regulatory framework does what it claims it’s supposed to do, we’ll be living in a world of reduced risk and volatility for investors to deal with. That has to be a good thing.

CDO Forum: What do you think are the risks for a financial services institution that fails to utilize its data in a meaningful way?

Marc Alvarez: Well, the obvious one is the risk of being fined and the associated impact on the firm’s reputation.

More important are the opportunity costs. Optimizing the manner in which the firm acquires, integrates and applies data content is fundamental to providing customers with an agile service line. The faster we can develop and leverage analytics, the better we can serve our customers across all our lines of business. Technology is core to today’s business and in particular it is driving more robust and sophisticated statistical analysis. I think we’ve arrived at what the journalists like to call an “inflection point” – the combination of technology and available data content is equipping firms and investors alike to perform far more detailed statistical analysis, developing entirely new investment strategies while at the same time modelling and managing the corresponding risk.

I think it’s becoming clear that this is the new normal - business is demanding new and more sophisticated analytics – it’s not an option. Failing to improve in data management will be a major constraint on a firm’s ability to compete and grow.

It will be interesting to see how the new entrants in the FinTech space address data management – if there are any new ways to skin this particular cat, that’s one place to look for them. If they break trail and come up with new, more efficient methods that avoid the challenges facing longer established firms (including the regulation hurdle), then we should see still more competition in an already hyper-competitive industry.


CDO Forum: With the current scarcity of talent and high demand for data professionals, can CDOs still build an effective team?

Marc Alvarez: So this is where the “C” comes into “CDO” – building teams, recruiting staff, improving competencies… these are all tasks that come with any leadership position. I don’t think there’s anything special when it comes to data management; however, finding and growing leaders in any discipline is hard to do, that is true.

The common lament you hear is that data professionals can be somewhat hard to find these days. However, I think we may need to start by broadening our search parameters. The data vendor community has addressed many of these challenges over the years (full disclosure, I recently left a data vendor firm) and we are seeing an increase in education in the area. So I suspect we will have to address the current skills shortage via a variety of recruiting and training methods – in fact, I expect defining these programs and attracting the right talent will form a big part of the CDO role. 

The real challenge will be retaining the staff you get up to speed – people in data management roles tend to be highly numerate, analytical and excellent communicators… there are plenty of roles in today’s economy that place a premium on those skills. So turnover is, unfortunately, to be expected. On the plus side, if you’re just starting your career, data management is a great place to consider – it opens lots of doors and the work really is quite interesting.

CDO Forum: What do you think is the key to turning a team into ‘data evangelists’, and how do you engage employees across the business to manage and utilize data?

Marc Alvarez: As with real estate, only 3 things – communicate, communicate, communicate!!

The challenge I find is that everybody in the industry presumes they know what we’re talking about when we use the phrases “data management” or “data governance” or “data” whatever. However, the reality is that if you ask 5 different people for their definitions, you are almost certainly going to get 20 different answers!! Putting my old Product Manager’s hat on, this indicates to me a need to set the scope of the discussion and agree on the terms of engagement. That means one aspect of the CDO role is to lead some internal marketing efforts to get the firm up to a common and agreed level of understanding.

There are a lot of ways to do this – other aspects of our business like IT Security expend a lot of effort in communicating complex concepts and requirements. So I think that’s the model to work with – creating an expectation that any one “evangelist” is going to be effective just doesn’t sound right to me. I think setting goals and producing visible results goes a lot further than anything else.  Do that, then the marketing comes easily.

CDO Forum: What do you think is the single most important quality that CDOs must possess to succeed in Financial Services companies?

Marc Alvarez: Be fearless! You need to be comfortable going across a lot of functions including IT, regulation, risk management, finance and so on. There is no one playbook to follow for any given firm and I think you have to be comfortable with a fair amount of uncertainty. So you have to have a strong will and communicate a very complex set of tasks, goals, and most important of all, benefits to the business. I think that adds up to having the courage to take on some pretty big issues and enlist the resources of the firm – so it’s not a job for you if you like things nice and quiet!

Hear more from the leading voices in Financial Services at the coming Chief Data Officer Forum, Financial Services on 22-23 June 2016 in New York City. Join over 100 CDOs, CAOs, and other data leaders from leading global and national financial institutions. For more information, visit

May 27, 2016

Revolution Analytics

Because it's Friday: The history of Japan, in 9 minutes, with jingles

This short video (via Vox) presents the entire history of Japan (yes, since the dawn of man on the islands!) without pausing for breath, and it's hilarious: I don't have much in the way of knowledge...

Big Data Tech Law Blog

Beyond Breach: Challenges in Cybersecurity & Coverage

By Jon Neiditz and David L. Cox Some of the biggest threats to cybersecurity involve controlling, damaging and interrupting systems, … Continue reading Beyond Breach: Challenges in Cybersecurity & Coverage

The post Beyond Breach: Challenges in Cybersecurity & Coverage appeared first on Big Data Tech Law.


Will privacy fears undermine the future of machine learning?

Artificial intelligence (AI) and machine learning technology present an interesting meeting point between our fear of the unknown and our fear of being known (that is, fear of our private information being exposed and known by others.) The erosion of privacy and rise of intelligent machines is actually a common theme in science fiction. But while reality still has a lot of catching up to do before we can call Skynet’s customer support or play cards with Agent Smith, many people have expressed genuine concern over the implications of modern technology – especially regarding their privacy.

Revolution Analytics

An object has no name

No, it's not a Jaqen H'ghar quote. Recently, Hadley Wickham tweeted the following image: While this image isn't included in Hadley's Advanced R book, he does discuss many of the implications there....


May 26, 2016

Revolution Analytics

Some Impressions from R Finance 2016

by Joseph Rickert R / Finance 2016 lived up to expectations and provided the quality networking and learning experience that longtime participants have come to value. Eight years is a long time for a...


May 25, 2016

Revolution Analytics

Predictive Maintenance for Aircraft Engines

Recently, I wrote about how it's possible to use predictive models to predict when an airline engine will require maintenance, and use that prediction to avoid unpleasant (and expensive!) delays for...

The Data Lab

Scotland’s Innovation Centres at TEDx Glasgow: Driving Innovation for a Disruptive World

Innovation is vital to business and helps to drive growth. It defines how we transform the things we do, how we make more from the services we offer and how we improve the things we make. Innovation is the successful exploitation of ideas to make our businesses ever more competitive across international markets and at home. In most, if not all of these markets, it is the key differentiator, which shapes Scotland’s competitive advantage and helps us win market share.

How we innovate has changed. It is no longer happening as an add-on for businesses but as a fundamental part of their plans for success. It is more likely to come from creative collaborations and if Scotland wants to innovate successfully then it is essential that we find the right partners – ones that not only share our vision but also our passion for success.

With the backing of £120m from the Scottish Funding Council, the Innovation Centres are designed to encourage innovation through industry-led collaboration with academia. They blend academic creativity and invention with industrial insights, providing an environment in which creativity of design, engineering and technology are combined to achieve results that bring tangible benefits to the Scottish economy. Each centre has specialist knowledge, so whatever the challenge, the ICs play a significant role in helping industry to deliver effective and transformational solutions. From sensing to stratified medicine, digital health to data, aquaculture to oil and gas, and biosciences to construction, there are eight to choose from – each one designed to drive further growth in an area of key economic importance to Scotland.

Revolutionary or evolutionary, Scotland’s Innovation Centres exist to help take your idea to the next level. Perhaps you’re looking for help with design, engineering and technology. Or maybe you want help managing your costs and risks, reducing time to market or turning an idea into a viable commercial product or service.  Our connections can help. We know the leading companies breaking new ground in your niche area. We know the academics carrying out specialist research in your field. And we know the funding streams available. We also have a proven track record of bringing it all together to help get new ideas and innovations off the ground.

If your business values innovation as a growth opportunity and sees merit in a collaborative approach, then one of Scotland’s eight Innovation Centres could be the perfect partner. Each is a place where creativity of design, engineering and technology are combined to achieve results that bring tangible benefits to Scottish industry. They can provide help at any point along the journey from raw idea through to market entry.

This year, the Innovation Centres are proud to be partners of TEDx Glasgow. The theme of TEDxGlasgow 2016, ‘A Disruptive World’, considers how individuals and organisations can encourage disruption as a force for good, to make step-changes in thinking, actions and the wider communities. We feel a partnership with TEDx, a platform that showcases and supports the very best of ideas and current innovative work generated in Scotland, is a great way of joining forces to support our country’s most brilliant innovators.

As part of our involvement, the Innovation Centres are sponsoring a new space within TEDx this year that is all about innovation. We are delighted to introduce the “Innovation Avenue”, an interactive space that will showcase some of Scotland’s innovative technologies, where attendees will be able to explore, discover, and play with some impressive Scottish tech innovations, and hopefully be inspired to innovate themselves.

We will also host two “Innovation Lab” sessions, where we want to hear from you. We will invite TEDx guests to come along, participate in these interactive sessions, and share their ideas on how we can help drive innovation in Scotland even further.

To find out more about how we can help your next big idea soar, visit


Scotland Data Science and Technology Meetup Launch in Glasgow

The Data Lab and MBN Solutions are launching a new network to support Glasgow’s growing community of data science enthusiasts by expanding the Scotland Data Science and Technology Meetup group to Glasgow, ahead of the TEDx Glasgow 2016 event.

The Data Science Meetup will be held on June 2nd in Glasgow’s Tontine Building, serving as the launch day for future meetups. The guest speakers for the inaugural event are Gillian Doherty from The Data Lab, Callum Murray from Amiqus, Grant Smith from Barrachd, Paul Forrest from MBN Solutions and Andrew Audrey, a consultant.

Attendance at the launch event is free and you can register your attendance here.


Google Plus

When outsourcing Analytics offshore makes sense

Thanks to its broad applicability, data analytics has rapidly become a critical business function for modern organisations. But with expertise in the field in short supply and high demand, companies with an identified need for data analytics are looking beyond their traditional borders to monetise their information assets.

Forrester Research predicts that a third of businesses will “pursue data science through outsourcing and technology”  as organisations become less process-driven and look to their data to find new opportunities for innovation. And with globalisation and technological advancements making outsourcing a realistic and practical option for businesses, this trend is set to gain momentum. With this in mind, let's take a look at why an organisation would even consider outsourcing their analytics capabilities in the first place.

Data Digest

JUST RELEASED: Chief Data Officer Forum Europe - Speaker Presentations

On 9th-11th May 2016, over 120 leading CDOs and senior data and analytics executives met at the Millennium Hotel, Mayfair, to network and engage in an exchange of ideas to bolster their respective strategies.

The general consensus is that the CDO position is indeed gaining global momentum, with Gartner predicting  that 50% of all companies in regulated industries will have a CDO by 2017.

Hear more from the leading Chief Data Officers at the coming Chief Data Officer Forum organized by Corinium Global Intelligence. Join the Chief Data Officer Forum on LinkedIn here.

Teradata ANZ

Jack Bauer, An Analytical Real-Time Marketer, And A Snarling Tiger. Who’s Your Money On?

So, there you are relaxing in your Asian grassland holiday villa; daydreaming you’re Jack Bauer battling against real-time odds in an old episode of 24, when this snarling tiger leaps through an open window.

What do you do? Throw a chair at it? Wave a kitchen knife in its general direction? What?

The simple answer is… pray. You can’t rely on knee-jerk reactions to deal effectively with this kind of intense, in-the-moment situation. A little planning goes a long way.

It’s the same with real-time marketing. Big Data analytics can help you predict and plan for threats and opportunities, and make appropriate just-in-time decisions.

Going, going, gone

Real-time marketing implies a sequence of steps – business-event data, analytical insights and decisions based on context. You’re looking to reduce:

  • data latency (time lag before the availability of fresh data)
  • analytical latency (time lag before access to recent intelligence and insights)
  • decision latency (time lag for recommending contextual actions), so that you can make a personalised and relevant offer to the customer.

Why? Because if a customer arrives at the target location, or calls your contact centre and you don’t capture the event and deliver the intelligence before the customer leaves, that golden opportunity is gone for good. Any subsequent offer is likely to be a waste of time, so real time (just-in-time) becomes your promotional deadline.
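These three latencies can be made concrete by timestamping each stage of the event pipeline and measuring the gaps. A minimal sketch (the event structure and field names here are hypothetical, not tied to any particular product):

```python
# Measure data, analytical, and decision latency for one business event
# by recording a timestamp (in seconds) at each stage of the pipeline.
from dataclasses import dataclass

@dataclass
class EventTimings:
    occurred: float   # when the business event happened
    ingested: float   # when fresh data became available
    analysed: float   # when intelligence/insight was derived
    decided: float    # when a contextual action was recommended

    def latencies(self) -> dict:
        return {
            "data": self.ingested - self.occurred,
            "analytical": self.analysed - self.ingested,
            "decision": self.decided - self.analysed,
            "total": self.decided - self.occurred,
        }
```

The "total" figure is the number that matters commercially: it has to be smaller than the time the customer spends at the location or on the call.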

Reality check on real-time, local-area marketing

The weak link in your real-time marketing and location intelligence efforts could be your ability to get targeted prospects to react to offers, in real time. Consequently, you might consider near-real-time or just-in-time marketing. In other words, if instant conversion on a local-area marketing campaign doesn’t happen inside the target location, does it really matter if the offer is delivered in real time?

If the offer has an expiry date of a few hours or a few days, is it worth worrying about real-time delivery?

For instance, knowing that a sports event is scheduled for the coming weekend and that your customer often passes the host stadium, you could send a Subway discount QR coupon to his cell phone a couple of days in advance. This would increase the likelihood of conversion, while optimising ROI on the location-intelligence solution.

Tips for better real-time action

First, profile your customers to work out when they’re most likely to react to offers. Geofencing based on recency and frequency of locations of interest to the customer, as well as lifetime value, will help segment and augment targeted leads. Geospatial analytics with in-database capabilities will help meet real-time marketing performance requirements.

Based on strategic intelligence gained from these kinds of in-database capabilities, you can prepare compelling offers, targeting prospects, driving them to the target location and allowing for ‘just-in-time’ conversion. Real-time streaming and Complex Event Processing will help minimise data latency and provide continuous intelligence when a customer is at geo-fence locations, to enable just-in-time conversion.
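The geo-fence membership test at the heart of this is simple to sketch. The coordinates and radius below are made up for illustration; a production system would push this into in-database geospatial functions rather than application code:

```python
# Haversine-based geofence check: is the customer within `radius_m`
# metres of a location of interest (e.g. the host stadium)?
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def in_geofence(customer, fence_centre, radius_m=500.0):
    """customer and fence_centre are (lat, lon) tuples."""
    return haversine_m(*customer, *fence_centre) <= radius_m
```

When the check fires for a targeted segment, the streaming/CEP layer can trigger the prepared offer while the customer is still nearby.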

Even in markets where privacy is a concern, case studies have shown that a wide range of business planning and optimisation benefits can be obtained through local-area marketing, including:

  • Collective customer preferences, behaviours, and spend patterns can be used to expand retail stores into catchment areas where under-utilised infrastructure (e.g. cell towers) exists.
  • You can develop local-market price plans by combining census demographic data (by statistical area) with customer-segment data.
  • You can identify commuter-line cell towers suffering network congestion, prioritising network investment to ease cash flow and improve ROI.

The ever-increasing deluge of digital data is driving organisations to deploy broader real-time monitoring, alerting, and interactive decision-making solutions, to improve business operations in local areas. Big Data analytics play a significant part in enabling this initiative, and that’s really important.

After all, you never know when a tiger is going to jump through your window. Do you?

The post Jack Bauer, An Analytical Real-Time Marketer, And A Snarling Tiger. Who’s Your Money On? appeared first on International Blog.


May 24, 2016

Silicon Valley Data Science

Building Data Systems: What Do You Need?

In previous posts, we’ve looked at why organizations transforming themselves with data should care about good development practices, and the characteristics unique to data infrastructure development. In this post, I’m going to share what we’ve learned at SVDS over our years of helping clients build data-powered products—specifically, the capabilities you need to have in place in order to successfully build and maintain data systems and data infrastructure. If you haven’t looked at the previous posts, I would encourage you to do so before reading this post, as they’ll provide a lot of context as to why we care about the capabilities discussed below. Please view this post as a guide, laid out in easily-visible bullet points for quick scanning.

A few key points before we start:

  • To reiterate what was discussed in earlier blog posts, the points discussed in the sections below are shaped by automation, validation, and instrumentation: the concepts that drive successful development architecture.
  • Much of what is covered in this post comes from continuous delivery concepts. For a more detailed overview of continuous delivery, please take a look at existing sources.
  • Whether you hold these capabilities explicitly in house or not depends on the constraints and aspirations of your organization. They can (and often do) manifest themselves as managed services, PaaS, or otherwise externally provided.
  • The implementation specifics of these capabilities will be determined by the constraints and goals unique to your organization. My hope is that the points below highlight what you should focus on.


Data engineers must be as conscious of the specifics of the physical infrastructure as they are of the applications themselves. Though modern frameworks and platforms make the process of writing code faster and more accessible, the volume, velocity, and variety of modern data processing mean that conceptually abstracting away the scheduling and distribution of computation is difficult.

Put another way, engineers need to understand the mechanics of how the data will be processed, even when using frameworks and platforms. SSD vs. disk, attached storage or not, how much memory, how many cores – these are decisions that data engineers have to make in order to design the best solution for the targeted data and workloads. All of this means reducing friction between developer and infrastructure deployment is imperative. Below are some important things to remember when thinking about how to enable this:

  • Infrastructure monitoring and log aggregation are imperative as the number of nodes used increases throughout your architecture.
  • The focus should be on repeatable, automated deployments.
  • Infrastructure-as-code allows for configuration management through a similar toolchain as application code and provides consistency across environments.
  • For a number of reasons, many organizations will not allow developers to directly provision infrastructure and will typically have a more specialized operations/network team to handle those responsibilities. If this is the case then providing clear, direct infrastructure deployment request processes for the developer with the ability to validate is necessary.
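As a toy illustration of the repeatable-deployment and infrastructure-as-code points (the spec format and field names below are invented, not any real tool's syntax), the environment definition lives in version-controllable data and is validated before anything is provisioned:

```python
# Minimal infrastructure-as-code sketch: a declarative node spec that is
# validated up front, so deployments are repeatable rather than ad hoc.
NODE_SPEC = {
    "role": "worker",
    "cores": 16,
    "memory_gb": 64,
    "storage": "ssd",   # data engineers care about SSD vs. spinning disk
    "count": 8,
}

REQUIRED_KEYS = {"role", "cores", "memory_gb", "storage", "count"}
VALID_STORAGE = {"ssd", "disk"}

def validate_spec(spec: dict) -> list:
    """Return a list of problems; an empty list means the spec is deployable."""
    problems = [f"missing key: {k}" for k in REQUIRED_KEYS - spec.keys()]
    if spec.get("storage") not in VALID_STORAGE:
        problems.append(f"unknown storage type: {spec.get('storage')}")
    if spec.get("count", 0) < 1:
        problems.append("count must be >= 1")
    return problems
```

Because the spec is plain data under version control, it flows through the same review and toolchain as application code, which is the consistency-across-environments point above.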


As the level of target scope increases in the testing sequence, testing for data infrastructure applications begins to deviate from traditional applications. While unit testing and basic sanity checks will look the same, the distributed nature of many data applications makes traditional testing methodologies difficult to fully replicate. Below are the specific issues:

  • As with any application development, code reviews, code coverage checks, code quality checks, and unit tests are imperative.
  • Using sample data and sample schemas becomes more important since pipelines mean the data becomes the integration point.
  • Having as much information as possible about the data and associated metadata is critical for developers to reason fully about how to build and test the application.
  • Developers should be able to access a subset, a sample, or at worst a schema sample (in order to generate representative fake data).
  • Metadata validation is as important as data validation.
  • Duplicating environments in order to establish an appropriate code promotion process is equally important but harder: locally duplicating distributed systems takes multiple processes and actually replicating the distributed nature of the cluster setup is tough without a lot of work. Typically, this issue can be mitigated by investing in infrastructure automation, to be able to deploy the underlying platforms for testing in multiple environments rapidly.
  • The scale of the data often prohibits complete integration tests outside of smoke tests in the production cluster – testing a 10-hour batch job, for example, is not practical.
  • Performance testing will be iterative. The cost of duplicating the entire production environment is often prohibitive, so performance tuning will need to take place in something close to a prod environment. An alternative to this is having push-button/automated system to spin up instances just for performance tuning.
  • Further complicating performance testing is the fact that resource schedulers are often involved.
  • Running distributed applications often means multiple processes are creating logs. It’s important to enable your developers to diagnose issues by implementing log aggregators and search tools for logs.
  • Since certain issues only manifest themselves when concurrency is introduced, testing in concurrent mode should be done as early as possible. This means making sure developers are able to test concurrency in their local environments (some frameworks allow for this, e.g. using Spark’s local mode with multiple threads).
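
To illustrate the schema-sample point above (giving developers at least a schema from which to generate representative fake data), here is a minimal sketch in Python. The schema, field names, and value ranges are invented for illustration; a real setup would derive them from actual metadata such as an Avro or Hive schema.

```python
import random
import string

# Hypothetical schema sample: field name -> (type, generator hints).
# In practice this would be derived from real metadata, not hand-written.
SCHEMA = {
    "user_id": ("int",    {"min": 1, "max": 1_000_000}),
    "country": ("enum",   {"values": ["US", "IN", "DE", "BR"]}),
    "amount":  ("float",  {"min": 0.0, "max": 500.0}),
    "comment": ("string", {"max_len": 32}),
}

def fake_value(ftype, hints, rng):
    """Generate one schema-conforming value of the given type."""
    if ftype == "int":
        return rng.randint(hints["min"], hints["max"])
    if ftype == "float":
        return round(rng.uniform(hints["min"], hints["max"]), 2)
    if ftype == "enum":
        return rng.choice(hints["values"])
    if ftype == "string":
        n = rng.randint(1, hints["max_len"])
        return "".join(rng.choices(string.ascii_lowercase, k=n))
    raise ValueError(f"unsupported type: {ftype}")

def fake_records(schema, n, seed=42):
    """Generate n schema-conforming records for local pipeline testing."""
    rng = random.Random(seed)  # seeded so test runs are reproducible
    return [
        {name: fake_value(ftype, hints, rng)
         for name, (ftype, hints) in schema.items()}
        for _ in range(n)
    ]

records = fake_records(SCHEMA, 100)
```

Because the generator is seeded, the same fake dataset can be regenerated on every developer machine and in CI, which makes pipeline tests deterministic.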


Dependency management for distributed applications is HARD. It’s necessary to maintain consistency not only across promotion environments (dev, test, QA, prod), but also across the clustered machines within each of those environments. The distributed nature of many of the base technologies, coupled with the prevalence of frameworks in the Big Data ecosystem, means that when it comes to dependency management, organizations have to decide to either 1) cede management of shared libraries to the platform (usually the operations team) and make sure that developers can maintain version parity; or 2) cede control to developers to manage their own dependencies. Some more specifics below:

  • While the polyglot nature of data infrastructure development will tempt developers toward manual packaging and manual deployment (e.g. on an edge node), packaging standards should be enforced regardless of language or runtime. Choose a packaging strategy for the set of technologies at hand and establish an automated build process.
  • Understand the impact that maintaining multiple languages and runtimes has on your build process.
  • Pipelines themselves need to be either managed using something like Oozie (in the Hadoop ecosystem) or reliably managed through automated scripting (e.g. using cron).
  • For a traditional application you can version all configurable elements (source, scripts, libs, OS configs, patch levels, etc.), but with the current state of the technology in enterprise Big Data, multiple applications run on a single software stack (e.g. a CDH distro). This means the change set for an app's configurations cannot lie entirely within that app's repo. At best, the configurations are spread across separate repos, but with managed stacks like CDH, configuration versioning is typically handled internally by the platform software itself.
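
Where the first option above is chosen (platform-managed shared libraries with developer version parity), the parity check itself is easy to automate as a CI step. Below is a minimal sketch; the manifest format and the library versions are hypothetical examples, not a real cluster inventory.

```python
# Minimal sketch of a version-parity check between a platform manifest
# (what ops has deployed to the cluster) and a developer's pinned
# dependencies. The contents of both dicts are hypothetical examples.
platform_manifest = {"spark": "1.6.1", "hadoop": "2.6.0", "avro": "1.7.7"}
dev_pins = {"spark": "1.6.1", "hadoop": "2.7.1", "kafka": "0.9.0"}

def parity_report(platform, dev):
    """Return (mismatched, dev_only) so a CI step can fail the build on drift."""
    # Libraries pinned by both sides but at different versions.
    mismatched = {
        lib: (platform[lib], dev[lib])
        for lib in platform.keys() & dev.keys()
        if platform[lib] != dev[lib]
    }
    # Libraries developers depend on that the platform does not provide;
    # these must be bundled with the application artifact instead.
    dev_only = sorted(dev.keys() - platform.keys())
    return mismatched, dev_only

mismatched, dev_only = parity_report(platform_manifest, dev_pins)
```

Wiring a check like this into the automated build makes version drift between developers and the cluster a build failure rather than a runtime surprise.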


As with testing and build, automated deployments and release management processes are crucial. Below are some things to consider:

  • The use of resource managers in modern data systems means that whatever deployment process is in place must account for resource requests or capacity scheduling, as well as related feedback to the development team.
  • The cost of maintaining multiple large clusters often makes fully duplicating the prod environment prohibitive. Therefore, you will have to deal with mismatched cluster sizes or logically separate multiple “environments” on the same cluster.
  • Performance and capacity testing will be iterative. It’s often difficult to judge deterministically ahead of time the exact resource configuration needed to optimize performance. Therefore, even in prod, the rollout will require multiple steps.
  • It’s appropriate for packaged artifacts that maintain jobs or workflows to live on edge nodes. Source code should not.


Commercial data infrastructure technologies (Hadoop distros and the like) are often complex integrations of multiple distributed systems. Operations processes must account for a multitude of configurations and monitoring of distributed processes. Some additional points:

  • Upgrade strategy (i.e. whether or not to stay on the latest supported versions of technologies) is important: in a production environment we have to verify that regression tests pass and that developers are in sync regarding the new versions, preferably through explicit dependency management automation. Complicating things, many platforms are multi-system integrations, which makes it all the more important to test against, and stay on, supported version sets. In general, the bias should be toward the latest supported version set, since feature sets expand and evolve quickly, but the most important thing is to establish a verifiable, versionable process that can be rolled back if necessary.
  • With Hadoop distros or other frameworks, operations infrastructure often incorporates execution time reporting (e.g. Spark dashboard), in order to validate and diagnose applications.
  • Many platforms, especially commercial offerings, provide UIs for configuration management. While they are useful to get information about the system, configurations should be managed through an automated, versionable process.


Having an explicit strategy for both resource management and dependency management is critical for reconciling developer productivity, performance at scale, and operational ease. Below you’ll find some of the key points to consider:

  • The feature set of specific distributions and component versions affects functionality. Modern data infrastructure development and Big Data are emerging practices, so the feature set of specific frameworks expands and evolves quickly, but commercially supported distributions move at a slower pace. It’s important to explicitly define and communicate the versions of software you will be using, so that developers understand the capabilities that they can leverage from certain frameworks.
  • Never begin development against a newer version of a framework or API unless you have a plan for rolling out the update in production. Since you should be deploying to production as early as possible, the two would happen almost concurrently.
  • Data platforms often aim for an org-wide, cross-functional audience set. This means that you will need to have a way to manage authorization and performance constraints/bounds for different groups. Resource managers and schedulers typically provide the mechanisms for doing this, but it will be up to your organization to make the decision as to who gets what (Have fun!).
  • Using resource managers will be your best bet from an operations standpoint to align your business needs to the performance profile of specific jobs/apps. The simplest model for this is essentially a queue, but it can, and most times should, take the form of group-based capacity scheduling. The trade-off here is that organizational coordination/prioritization is a must.
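
To make the group-based capacity scheduling idea concrete, here is a toy allocator in Python. The group names, capacity fractions, and demand figures are invented for illustration; real schedulers (e.g. YARN's Capacity Scheduler) are far more sophisticated, with queue hierarchies, preemption, and elasticity.

```python
# Toy illustration of group-based capacity scheduling: each group gets a
# guaranteed fraction of total cluster capacity, and leftover capacity is
# handed to groups whose demand exceeds their guarantee.
def allocate(total_units, capacities, demand):
    """capacities: group -> fraction of cluster; demand: group -> units requested."""
    # First pass: each group receives at most its guaranteed share.
    grants = {g: min(demand.get(g, 0), int(total_units * frac))
              for g, frac in capacities.items()}
    spare = total_units - sum(grants.values())
    # Second pass: redistribute spare capacity, largest guarantee first.
    for g in sorted(capacities, key=capacities.get, reverse=True):
        unmet = demand.get(g, 0) - grants[g]
        take = min(unmet, spare)
        grants[g] += take
        spare -= take
    return grants

# Hypothetical queues: analytics is over-subscribed, etl is under-subscribed.
caps = {"analytics": 0.5, "etl": 0.3, "adhoc": 0.2}
grants = allocate(100, caps, {"analytics": 80, "etl": 10, "adhoc": 30})
```

Even this toy version shows the organizational trade-off in the bullet above: someone has to decide the capacity fractions, and under-used guarantees become spare capacity that the policy must assign to somebody.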


Having available developers with deep knowledge of your technology stack is critical to success with data efforts. In addition, it’s necessary to provide those developers with the tools to do their job:

  • Desktop virtualization/containerization tools like VirtualBox, Vagrant, or Docker will allow developers to more easily “virtualize” the production environment locally.
  • Developers should be able to duplicate the execution environments locally as closely as possible. If the execution machines are Linux-based and your workstations run Windows, tools like VirtualBox, Vagrant, or Docker will be necessary.
  • You must align training plans with the roadmap of the platform/product. For example, if you plan on using modern data infrastructure, it is essential for developers to understand distributed systems principles.
  • You must allocate the appropriate development and operation personnel (or equivalent managed services) for the lifespan of the product/platform.

Product Management

The platform aspect of data infrastructure development calls for strong coordination of consumer needs, developer enablement, and operational clarity. As such, it’s important to engage all relevant parties as early as possible in the process. Validate results against the consumer early and begin establishing the deployment process as soon as the first iteration of work begins. Other tips:

  • Address operations and maintenance concerns as early as possible by beginning to deploy ASAP during development in order to iterate and refine any issues.
  • Much of data work is inherently iterative, reinforcing the need for feedback support systems such as monitoring tools at the operational level and bug/issue trackers at the organizational level.
  • Like most modern software systems, getting feedback from end users is essential.
  • Establishing a roadmap for the data platform is imperative. Data platforms often aim for an organization-wide, cross-functional audience set. It is imperative to have a strategy on prioritizing consumer needs and onboarding new users/LoB/groups. The feature requests of the consumers must be treated as a prioritized set.
  • Availability, performance, and similar requirements have a large impact on design due to the complexities of distributed systems in general and current Big Data technologies especially, so these things should be thought about as early as possible.
  • Any technology platforms or distribution used in production must have licensed support.

There are a lot of capabilities to think about, and it can seem daunting. Remember that at the highest level, you are aiming to give your teams as much visibility as possible into what’s happening through instrumentation, implementing processes that provide validation at every step, and automating the tasks that make sense. Of course, we are here to help.

Hopefully, the points highlighted above will be useful as you are establishing your development and operations practices. Anything you think we missed? Let us know your thoughts below in the comments.

The post Building Data Systems: What Do You Need? appeared first on Silicon Valley Data Science.

Roaring Elephant Podcast

Episode 16 - Interview part two with Sumeet Singh - Senior Director, Cloud and Big Data Platforms @ Yahoo!

Hopefully you enjoyed the first part of our interview with Sumeet. Here is part two, where we go into more detail about Yahoo's use of Hadoop, with lots of interesting topics coming up, including the splintering of the ecosystem, governance, and much more. Read more »
Big Data University

This Week in Data Science (May 24, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

The post This Week in Data Science (May 24, 2016) appeared first on Big Data University.

Data Digest

The Data Analytics Disconnect and How to Avoid It!

Despite the professed benefits, there remains a real uncertainty around how to leverage the technology and ecosystem to achieve business value from big data analytics. Many stakeholders understand the urgency around analytics for the strategic success and competitive edge of their organisation, but fail to understand how to extract significant value from data.

We regularly hear how analytics investment is revealing customer needs and informing future product development and innovation. How operating costs are being significantly reduced, processing time and services optimised, with insights providing more effective decision making and planning. Clearly, organisations with targeted actions can maximise returns and identify potential business risks.

Indeed many organisations are excelling at leveraging data. They are raising the profile of data and analytics, and promoting the ability to interact with and serve customers in new ways, using technology platforms in which data and information drive the way they operate. The next 5-10 years are going to see global transformation: advancing technology, the impact of the IoT, the maturation of cloud, and the development of human talent that will grow to operate in this new environment. But if we listen carefully, we hear about the disconnect: a lack of confidence from those companies who are failing to make inroads, with questions around investment in structures, existing talent, and the available commercial solutions.

Building Confidence

Gartner predicts that by 2019 only half of those in the Chief Data and Analytics Officer role will be successful. Certainly the role of the CDAO is to add value, but those who fail in the CDAO function will have failed to manage stakeholder expectations. The strategy has to be a carefully managed approach that is effectively communicated and underwritten by well-developed organisational support. The strategic purpose of the CDAO is turning data into knowledge to make better decisions, but managing stakeholder expectations has to be their number one priority.

As you build the strategic ecosystem of the future, it has to deliver value not only for the future but for today. It has to show the value and demonstrate that it is being driven throughout the business areas to provide impact and confidence. Building the strategy and the assets has to enable a ‘fast-fail’, learn quickly and move forward environment. The business certainly expects the data analytics strategy to ‘succeed fast’.


Re-Writing the Legacy

The adoption of agile principles through the development of models, data, and business value activities is also complementary to any big data analytics programme. For many organisations, though, legacy is prevalent throughout their technology, data and culture. Smaller, newer companies have developed with data as a core activity. Larger organisations operate in reverse: legacy companies with legacy cultures now operating in a big data analytics world. The business is critically still running off the old systems and culture; however, if you can tap the potential for change and exploit it, the business will respond and the analytics value will quickly gain stock. These legacy aspects have to be addressed to ensure the success and sustainability of big data analytics.

Internal and Expanding Capability

The lead protagonists are the analytics team, who should represent a blend of skills. Many question whether to train up or recruit externally, but the fact is that BAU will inevitably evolve and expand with the rise of self-service analytics, text search, AaaS and AI. These advances will only bring value if the CDAO can embed a new mind-set, so the organisation starts to value data as an asset and make data-driven decisions. The broader organisational culture is a key aspect.

So to Re-cap How to Avoid this Disconnect:

  • Gain meaningful support and understanding of stakeholder expectations
  • Transform from a legacy environment to a modern technology and data environment
  • Build internal talent capability
  • Implement data governance across the organisation
  • Build predictive and prescriptive analytics capability
  • Adopt a strategic partner as a specialist in the data and analytics space to help progress and move things forward
  • Impact the culture in a holistic and meaningful way

With the wealth of data in the next few years there is going to be a tidal wave of innovation. For many organisations, it will be the strategic plan they set out now and the level of big data analytics investment that determines how they will respond to this new business environment.  The success of data analytics in each organisation will be determined by how the leader in the CDAO role has managed that strategic plan.

For more on the disconnect and how to overcome it, join us at the Chief Data and Analytics Officer Forum Singapore, 27-28 July, where Dr Partha Dutta, Principal Advisor Business Intelligence & Analytics from Rio Tinto, will be sharing his insight on how to get the most out of your data.

By Kate Tappin:

Kate Tappin  is the Content Director for the Chief Data and Analytics Officer Forum Singapore. Consulting with the industry about their key challenges and finding exciting and innovative ways to bring people together to address those issues - the CDAO Forum is launching in Singapore, 27-28 July 2016. For enquiries email:
Teradata ANZ

Just what’s in my data warehouse – and what should I do with it?

The analytics ecosystem today is evolving into a data fabric, where data is located in different places and processed in place with tools and technologies ideally suited to the specific type of analytics being performed – with all of this being transparent to the end users.

But that is a recent development; for many years we had only one tool in the toolbox: the data warehouse. The data warehouse was often the only place in an organisation where data could come together, and where the production applications did not entirely consume the platform, enabling many users to access the warehouse and perform ad-hoc analytics, prototyping and analytic experiments.

Over time the data warehouse becomes filled up with an enormous variety of data and processing, which can cause challenges that will likely sound familiar to you… The data warehouse is running out of capacity, my queries are slow, it takes too long to add new data, it costs too much. These are common challenges today so you are not alone!

If you have a good look at any data warehouse, you will find that it will be holding a wide variety of data, will be performing many types of analytics and will have many different user groups. Given the modern analytics ecosystem being a data fabric, comprised of different platforms, tools and technologies, do they all need to stay on the data warehouse? No they don’t.


The analytics ecosystem is a sustainable environment for answering questions. Facts (data) are stored, context is applied (questions and relationships) and insights are delivered (answers). All of these things happen in different localities, different places. Data is stored in a variety of platforms – Relational Databases, Hadoop, NoSQL and others. Context is applied through data models and code. Insights are delivered through web pages, reports and dashboards. All of these platforms, tools and technologies are very good at some things, and not so good for others. Recognising the capabilities of the different components of your analytic ecosystem is key to starting to answer the “does this data and processing need to stay on the data warehouse?” question.

Understanding capabilities enables you to perform analytics to see what fits, and what doesn’t. For example, if you have a process that does many complex joins, then it makes sense to place that on a platform that has a strong “join” capability. If you try to migrate that process to another platform that has a low “join” capability, it will be very difficult and/or time consuming to perform that migration, indeed it may require extensive redevelopment of the original process to make it “fit” into the capabilities of the target platform. So while it is still possible to migrate a process that doesn’t fit into the target platform, the cost and benefit must be carefully weighed up to make a properly informed decision.
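
The capability matching described above can be made concrete with a simple fit score. The Python sketch below rates each platform per capability and scores a workload by its weakest required capability; the platform names, capability names, and ratings are purely illustrative, not vendor benchmarks.

```python
# Sketch of the capability-fit analysis described above: platforms are
# rated 0-5 on each capability, workloads declare which capabilities they
# depend on, and a workload's fit score on a platform is its weakest
# required capability there. All names and ratings are illustrative.
PLATFORMS = {
    "warehouse": {"join": 5, "text_analytics": 2, "scan": 4},
    "hadoop":    {"join": 2, "text_analytics": 5, "scan": 5},
}

def fit_score(platform, required_capabilities):
    """A workload is only as well served as its weakest required capability."""
    return min(PLATFORMS[platform].get(cap, 0) for cap in required_capabilities)

def best_platform(required_capabilities):
    """Pick the platform with the highest fit score for this workload."""
    return max(PLATFORMS, key=lambda p: fit_score(p, required_capabilities))

# A join-heavy process fits the warehouse; a text-analytics process fits Hadoop.
join_heavy = best_platform(["join", "scan"])
text_heavy = best_platform(["text_analytics", "scan"])
```

Taking the minimum rather than an average reflects the migration-cost argument in the text: one poorly supported capability can force extensive redevelopment, no matter how well the platform handles everything else.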

Let’s say that we have another process that performs complex text analytics, and that process is currently on the data warehouse – a relational database, where this is a resource intensive process with complex SQL. We also have another Hadoop platform which has a high capability for text analytics. In this case the process is a good candidate for migration. We can move that processing onto a more capable platform, potentially simplifying the code and improving performance, whilst also freeing up the resources it used to consume on the data warehouse, allowing other processing to benefit. Migrating a process that fits well into the target platform reduces risk, minimises effort (some level of redevelopment will always be required when migrating between different technologies) and optimises cost.

Processing (Context) is applied to data (Facts), so both must be considered. If you move data to another platform, then you must also move the processing that uses that data. Data is rarely used in isolation; rather, groups of data (i.e. multiple tables) are joined together. So if you wish to move the processing then you also need to move the logical “set” of data required by the processing. Not all data needs to be moved; some may be replicated to the other platform (and kept in sync), so you need to consider the tools and mechanisms to enable this, and how to operationalise it. If the frequency of processing is low, then using data virtualisation technologies (e.g. Teradata QueryGrid or Presto) is an option, avoiding physical replication of the data.

While this initially doesn’t appear to help answer the “Just what’s in my data warehouse?” question, the thought process and analytics you perform to understand what you can move off the data warehouse, and what needs to stay, will give you a lot of the answers. At the same time, you will have a much better understanding of why some data and processing has to stay on the data warehouse – as it fits into the capabilities of that platform.

You will also build up a view of opportunities to migrate data and processing that doesn’t fit on the data warehouse onto a more capable platform, allowing you to make informed, data driven decisions about how you evolve your analytic ecosystem and address the challenges faced within your data warehouse environment.

The post Just what’s in my data warehouse – and what should I do with it? appeared first on International Blog.


May 23, 2016

Revolution Analytics

Feather: fast, interoperable data import/export for R

Unlike most other statistical software packages, R doesn't have a native data file format. You can certainly import and export data in any number of formats, but there's no native "R data file...


Revolution Analytics

Principal Components Regression in R: Part 2

by John Mount Ph. D. Data Scientist at Win-Vector LLC In part 2 of her series on Principal Components Regression Dr. Nina Zumel illustrates so-called y-aware techniques. These often neglected methods...


What to do When You Can’t Patch a Vulnerability

The Verizon DBIR has a lot to say about vulnerabilities. One of the more interesting topics is the large number of 2015 vulnerability exploits that were more than a year old. In a footnote the DBIR authors comment that “Those newly exploited CVEs, however, are mostly – and consistently – older than one year.” The data show that more than 90% of exploited vulnerabilities in 2015 were more than one-year-old and nearly 20% were published more than 10 years ago.



This data is consistent from year-to-year. In 2014, more than 95% of exploited CVEs were more than a year old. As you would expect, most of the remediated CVEs in 2015 were recent. It appears that over 70% of all closed CVEs in 2015 were no more than two years old and over 95% of all closed CVEs were within five years of their original publication.



You would expect that most of the CVEs closed in any year would be recent as older vulnerabilities would have received patches published in previous years. And, when you get past about five years, you are much more likely to encounter software versions that are no longer supported. New patches are not arriving for that end-of-life software you are still running. “This gets at a core and often ignored vulnerability management constraint – sometimes you can’t fix a vulnerability – be it because of a business process, a lack of a patch, or incompatibilities,” conclude the DBIR authors.

SAP Vulnerabilities Expose the Scope of the Problem

The king of the enterprise resource planning (ERP) market, SAP, provides a good example of this situation. On May 11th, US-CERT published an alert confirming that 36 organizations were affected by a vulnerability first identified by researchers at Onapsis earlier in the year. This may not seem like widespread impact, but consider that these likely represent 36 of the world’s largest companies facing severe consequences. The alert states, “Exploitation of the Invoker Servlet vulnerability gives unauthenticated remote attackers full access to affected SAP platforms, providing complete control of the business information and processes on these systems, as well as potential access to other systems.”

A Dark Reading interview with Onapsis director of research Ezequiel Gutesman concludes, “the average enterprise takes 18 months to patch these systems.” Why so long? The hosts of the Defensive Security podcast, Jerry Bell and Andrew Kalat, offer a simple explanation. Every instance of SAP is customized during implementation. Patches from SAP, no matter how expertly crafted, risk breaking those customizations that SAP didn’t include in their core product. Now consider that companies often rely on SAP to run back-office financials, fulfillment, and procurement. Those customizations could impact a company’s ability to run payroll, ship product, or order supplies for production. An extended downtime of the system could bring business operations to a halt. When you consider this situation, you can see why some companies are slow or reluctant to implement every patch immediately. It’s complicated.

This is a recurring theme on the Defensive Security podcast. Compromises of unpatched vulnerabilities are often known risks that information security teams cannot or are not allowed to patch for a variety of business reasons. Risk must often be addressed through means other than patching in these instances. The SAP example highlights that this isn’t just a concern for legacy software, PLCs and SCADA systems that have always-on operational requirements. Even your newest software can pose substantial challenges around vulnerability management.

Detection Capabilities Are Critical

This is yet another reason why robust breach detection capabilities are so critical. There is a lot of discussion about the importance of detection and response due to increased attacker skills in social engineering and penetrating well protected and fully patched endpoints. We understand that it is hard to prepare for unknown risks from skilled hackers. However, detection and response capabilities are equally important because we also have many known risks like un-remediated vulnerabilities that we must vigilantly monitor for compromise.

The added complication is that we know SIEM and endpoint protection solutions consistently miss attacks. This allows attackers to dwell for several months inside the network before discovery. We know that leads to lateral movement, data exfiltration and heightened risk leading to financial, operational and reputational impacts to the business.

Identifying Patterns of Compromise

That is precisely why enterprises are turning to information security analytics solutions. Tools like IKANOW help narrow the time between compromise and detection by curating threat intelligence to match IOCs to assets with higher predictability and also identify unusual patterns of behavior. Existing security tools too often miss these compromises because they cannot process enough network and application data and filter through the noise to notice the tiny footprints left by attackers. Nor are they designed to integrate with the wide variety of threat intelligence content and use it for rapid analysis.

IKANOW has the required scalability, speed and an open architecture that enables rapid detection where other systems fail. It also enables enterprises to identify clusters of assets and identify growing risk across the group. Even if a single asset may not appear to be under attack, small changes across a number of assets as a group can indicate malicious activity. IKANOW’s risk scorecards expose these incidents when other systems view them as benign. In practice, this might mean clustering un-patchable assets and the systems they interact with as a group instead of looking at each individually.

The un-patchable vulnerability is a fact of everyday life for many information security teams. Segmenting networks and end-of-life plans are solid strategies to employ over time to limit the risk of damage. Tools like information security analytics can also make these situations far easier to manage with an acceptable risk profile.

The post What to do When You Can’t Patch a Vulnerability appeared first on IKANOW.


May 21, 2016

Solaimurugan V.

Big Data / Data Analytics Jobs

Big Data Hadoop Jobs in India and around the world
  • Big Data Analyst | 5+ years | May 2016 | Chennai, India
  • Big Data - Principal Software Engineer | 5+ years | May 05, 2016 | Humana, Irving, Texas, USA
  • Hadoop and Spark Developer | 5-7 years | May 05, 2016 | CSC India, Dubai, UAE
  • Data Specialist: Advanced Analytics | 4-10 years | May 01, 2016 | IBM, Bangalore, India
  • Hadoop Developer (DWH) | 5-9 years | April 12, 2016 | Csi Software Pvt Ltd, Chennai, Tamil Nadu, India
  • Sr. Data Scientist | April 10, 2016 | Allstate, Northbrook, IL, USA
  • Sr. Hadoop Developer | 4-8 years | April 03, 2016 | Swathi Business Solutions, Chennai, India (KL, Malaysia)
  • Java Hadoop Lead (Big Data) | 7-10 years | April 03, 2016 | Innominds, Hyderabad, India
  • Big Data Developer / Intern for Big Data | April 03, 2016 | Frgma Data, Bengaluru, India
  • Analyst 1 - Apps Prog | April 01, 2016 | Chennai, Tamil Nadu, India
  • Hadoop Data Engineer | March 31, 2016 | Chennai, Tamil Nadu, India
  • Bigdata Developer | March 31, 2016 | Chennai, India

Simplified Analytics

Digital Assistants – Siri, Cortana, Alexa, Google Now?

In October 2011, Apple gave voice to the iPhone 4s through Siri, the digital assistant to whom you could talk when in need of some information. Since then the Intelligent Personal Assistants have...


May 20, 2016

Forrester Blogs

Linux vs Unix Hot Patching – Have We Reached The Tipping Point?

The Background - Linux as a Fast Follower and the Need for Hot Patching No doubt about it, Linux has made impressive strides in the last 15 years, gaining many features previously associated with...


Revolution Analytics

Microsoft R Open 3.2.5 now available

Microsoft R Open 3.2.5 is now available for download. There are no changes of note in the R language engine with this release (R 3.2.5 was largely just a version number increment). There's lots new...


Revolution Analytics

Because it's Friday: The Illegal Magnet Machine

This marble-based Rube Goldberg machine video is strangely compelling: it's simply made, but full of lots of elegant little flourishes, like the uphill "cannon" made of several magnetic balls. The...


May 19, 2016

Silicon Valley Data Science

One Year Later, Observations on the Big Data Market

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

Back in 2014, we discussed what the market looked like on our first birthday. As we hit three years, it seems like an appropriate time to look back on those observations. First, though, some thoughts on where things stand now.

We find ourselves transitioning from a market of early adopters, heading to one of the early majority in select industries, such as financial services and retail. As we teach about big data and data science at industry conferences, we meet an increasing number of enterprise leaders who are at the point of creating their strategy around data, and are innately aware of the importance of data as an asset.

There has never been more widespread awareness of the value of data science to business. The concept of artificial intelligence has again risen to the public consciousness: an identifiable way of explaining the role of data in creating value, rather than just analytics and reporting. Business leaders now have a way to understand that using data better involves changing processes, and the way they interact with customers, partners and suppliers. IDC has forecast the market for big data technology and services to grow at a ~23% CAGR, reaching nearly $50B in 2019.

While we are moving towards broader market sophistication around data, there’s time yet before writing “data science apps” becomes an everyday reality for customers. Implementation is often still focused on infrastructure assembly as a first step, followed by development of customer applications that leverage data science capabilities. The software vendor landscape continues to show significant turbulence, with many similar players and messages, amplified by large investments and marketing budgets. Inevitably, we expect there to be rounds of consolidation, and for a few core enterprise data platform alternatives to emerge.

The most important first step remains putting a solid strategy in place. Data is only valuable if you do something with it. Our Data Strategy Position Paper has more details, and we’d love to hear from you in the comments on what you’re seeing in the market.

2014 Thoughts

We recently celebrated our first birthday here at Silicon Valley Data Science. Alongside building our own venture, we also believe that our clients should be creating Experimental Enterprises that adapt quickly to succeed.

We’ve had the good fortune to work with some amazing clients this year while providing our services. We have built data-driven capabilities that are core to the business for many of our customers.

We worked with an online retailer to develop their Data Strategy to provide them with a roadmap for future business and technology decisions. Through a collaborative effort between Silicon Valley Data Science and key stakeholders at the customer, we guided them on how to use data technologies to accelerate their growth plans.

Another customer — an innovative media & entertainment company — asked us to provide Architecture guidance on which analytical and data architectures to use to enable their Personalization agenda. We rapidly assessed their current state as well as their near-term aspirations to provide an architectural plan.

We’ve also helped a variety of customers create new capabilities and develop products by deploying our data science and engineering teams in Agile Build teams. We’ve been able to rapidly develop software for our customers, working under tight timeframes with highly productive teams.

Through all of the work that we’ve done and conversations we’ve had with clients, there are three observations that I can share on the big data market:

  • Techniques and approaches from one domain are quite often relevant in other domains. We think cross-pollination is one of the best ways to innovate and get to analytic results faster.
  • Pursuing problems by placing blended teams of architects, data scientists, and engineers has led to better productivity and amazing results.
  • Our customers are asking the right questions about how data can be strategic for their business, not just trying to prove that they can deploy a new technology.

We are eager to help our customers build experimental enterprises that take the best of today’s technology to iterate and adapt their businesses for maximum impact. It’s not enough to just build a giant data repository; the question is how to build an effective platform that allows you to rapidly innovate with business-relevant applications.

We started Silicon Valley Data Science a year ago to teach people about our view on how to use data and ultimately help our customers with their data-driven aspirations. It’s been an incredible year of growth and opportunity, and I’m looking forward to what we make happen in year two. Please feel free to contact us if you’d like to discuss your business and build an experimental enterprise.

The post One Year Later, Observations on the Big Data Market appeared first on Silicon Valley Data Science.


Chris Morgan to Speak at Cyber Maryland Conference October 20-21, 2016

Date and time: Oct. 20-21, exact date and time TBA
Location: Hilton Baltimore, Baltimore, MD
Event Website

IKANOW’s Chris Morgan was invited to speak on the Cyber Defense Toolbox panel. Each panelist will have 5-8 minutes to present on a product, followed by Q&A. We understand that at last year’s conference, this was one of the more popular sessions.

Session abstract: The recent onslaught of cyber-attacks has left many organizations re-evaluating what’s in their toolbox to help combat cyber-crimes.  Having the right armor for pre- and post-cyber-attack strategies is the key to survival. Learn how these five innovative products can help you defend your network in real-time. Hear a panel of product development experts and technologists provide insight on next generation tools designed to protect business and personal assets.

Moderator: Dr. Avi Rubin, Computer Science Professor, Johns Hopkins University. (learn more here, and here).

Conference background: As the Nation’s epicenter for cybersecurity, Maryland convenes local, national and international entrepreneurs, investors, academics, students, enterprises and government officials each year at the CyberMaryland Conference. The two-day event is a cyber conference, job fair, NSA sponsored hack-a-thon and Hall of Fame dinner.  The conference is in its sixth year and draws approx. 2,300 business, government and academic leaders. You can read more at

The post Chris Morgan to Speak at Cyber Maryland Conference October 20-21, 2016 appeared first on IKANOW.

Data Digest

Will the Chief Data Officer still exist 10 years from now?


“How did God create the world in 6 days? Because he didn’t have legacy systems.” - CDO Forum, Europe 2016 speaker

This one statement resonated with me because people are inherently resistant to change, and that inertia increases when we are dealing with corporate structures and hierarchies which date back over centuries.

However, the digital revolution has deeply disrupted organisational workings from both a tactical and strategic standpoint. The pace of innovation is becoming more rapid than ever before, compelling businesses to adapt or fail. Chief Data Officers (CDOs) and Chief Data Scientists (CDSs) have begun to appear in organisations hoping to innovate and capitalise on the data that is entrenched in every aspect of their business functions - and with good reason. A recent survey by Forrester Research concluded that the ‘top performers’ with 10% annual revenue growth were 65% more likely to appoint a CDO than ‘low performers’ with less than 4% revenue growth.

Although the statistics appear to suggest a plethora of benefits, in some cases the CDO has been met with an air of prudence and sometimes rejected altogether. In spite of this, there have been some great successes, with CDOs holding responsibilities for compliance, best practice data governance and evangelising the value of data, shifting the organisation to data-driven decision making. But can a CDO’s mission ever be accomplished? If so, how do we define this success story and, perhaps more importantly, what is the next step?

The early days of CDO

Early CDO appointments had strong roots within compliance and were created out of necessity for organisations to achieve effective Data Governance and management of their data assets. This is further cemented by the sheer prevalence of CDOs within heavily regulated industries, such as financial services, and Gartner’s prediction that 50% of all companies in regulated industries will have a CDO by 2017. Ironically, these are also some of the oldest institutions, inhibited by legacy infrastructures as well as cultural mind-sets.

Contrast this with some of today’s leading tech giants, e-commerce and app-driven businesses which are able to experience a level of agility not afforded to their more traditional counterparts. These organisations have prospered in the digital age and have had data and analytics ingrained so deep within their cultural fabric that there is no need for a CDO.

Legislation has typically lagged behind technology, a lag exacerbated by the exponential growth of technology observed in recent years. However, the EU Data Protection Directive, due for finalisation in 2017, aims to take into account the “vast technology changes of the last 20 years”. What impact will these legislative revisions have on organisations born in the digital age? Who will be responsible for this compliance, and will this likely come in the form of a CDO?

How to keep a seat at the executive table

CDOs must continue to evolve in order to maintain their seat at the executive table. Typically this evolution passes through specific benchmarks, such as data quality or compliance, improving processes, driving a data culture and then leveraging analytics and insights to draw real business value.  A CDO from a large hotel group stated that their plan for the next 3 years includes the “delivery of hot analytics, mainly data visualisation tools, as well as influence growth hacking and deliver bottom line models, thanks to advanced analytics and data science”. It could be argued that these are typically responsibilities held by a CDS or Chief Analytics Officer (CAO). Perhaps in the coming years we will see an increased amalgamation of sorts between these roles.

We must not forget the impact of an organisation’s internal stakeholders. Much of a CDO’s staying power is determined by their ability to meet the needs of their internal customers. I spoke with a CDO from the healthcare sector who stated, “you [the CDO] need to know your super-customers really well to address their needs, pain points and hopes. The best way to ensure this is through empowering your customers (co-workers, internal customers).” This was echoed by another CDO from a large publishing firm who defined his position by stating: "My role is about measuring how people use data and how we can support it centrally."

The only thing certain

Will the CDO role as it is defined now still exist 10 years from now? Who knows what the future has in store for the CDO. One thing is certain, though: data will continue to be a persistent and ever-present aspect of modern-day business. Developments in the Internet of Things will rapidly increase the amount of data available, and those who begin implementing the architectures to prepare for this wave of information will succeed.

Power lies in the enabler who empowers an organisation to fully grasp the value of its data to create a competitive advantage but, perhaps more importantly, to solve the strategic and operational problems faced by the business. This may not always lie in the hands of one individual or even one specific job title.

By Andrew Odong:

Andrew Odong is the Content Director US/Europe for the CDO Forum. Andrew produced the CDO Forum, Europe 2016, researching with the industry the opportunities and key challenges for enterprise data leadership and building an interactive, discussion-led platform to bring people together to address those issues – the CDO Forum has become a global series, having launched on five continents. For enquiries email:

Forrester Blogs

Mobilize The Internet Of Things

Businesses can obtain major benefits -- including better customer experiences and operational excellence -- from the internet of things (IoT) by extracting insights from connected objects and...


Curt Monash

Surveillance data in ordinary law enforcement

One of the most important issues in privacy and surveillance is also one of the least-discussed — the use of new surveillance technologies in ordinary law enforcement. Reasons for this neglect...


Curt Monash

Governments vs. tech companies — it’s complicated

Numerous tussles fit the template: A government wants access to data contained in one or more devices (mobile/personal or server as the case may be). The computer’s manufacturer or operator...


Curt Monash

Privacy and surveillance require our attention

This year, privacy and surveillance issues have been all over the news. The most important, in my opinion, deal with the tension among: Personal privacy. Anti-terrorism. General law enforcement. More...


May 18, 2016


Video: How Does Someone Determine the Data Sources to Harvest

We specialize in harvesting and curating custom datasets for our customers. This means that our customers have access to a completely unique dataset created for them unlike any other dataset in the world. But sometimes our customers don’t exactly know what sources and websites to harvest. Our latest educational video explains how we work with […] The post Video: How Does Someone Determine the Data Sources to Harvest appeared first on BrightPlanet.

Read more »

Curt Monash

I’m having issues with comment spam

My blogs are having a bad time with comment spam. While Akismet and other safeguards are intercepting almost all of the ~5000 attempted spam comments per day, the small fraction that get through are...

Jean Francois Puget

What Is Machine Learning?

Can you explain to me what machine learning is?  I often get this question from colleagues and customers, and answering it is tricky.  What is tricky is giving the intuition behind what machine learning is really useful for. 

I'll review common answers and give you my preferred one. 

Cognitive Computing

The first category of answer to the question is what IBM calls cognitive computing.  It is about building machines (computers, software, robots, web sites, mobile apps, devices, etc) that do not need to be programmed explicitly.   This view of machine learning can be traced back to Arthur Samuel's definition from 1959:

Machine Learning: Field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel is one of the pioneers of machine learning.  While at IBM he developed a program that learned how to play checkers better than him.

Samuel's definition is a great definition, but maybe a little too vague.  Tom Mitchell, another well regarded machine learning researcher, proposed a more precise definition in 1998:

Well posed Learning Problem: A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Let's take an example for the sake of clarity.  Let's assume we are developing a credit card fraud detection system.  The task T of that system is to flag credit card transactions as fraudulent or not.  The performance measure P could be the percentage of fraudulent transactions that are detected.  The system learns if the percentage of fraudulent transactions that are detected increases over time.  Here the experience E is the set of already processed transaction records.  Once a transaction is processed, we know whether it was fraudulent or not, and we can feed that information to the system for it to learn.

Note that the choice of the performance measure is critical.  The one we chose is too simplistic.  Indeed, if the system flags all transactions as fraudulent, then it achieves 100% performance, yet it would be useless!  We need something more sensible, like detecting as much fraud as possible while flagging as few honest transactions as possible as fraud.  Fortunately, there are ways to capture this double goal, but we won't discuss them here.  The point is that once we have a performance metric, we can tell whether the system learns from experience.
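As a toy sketch of this "double goal" (the transaction labels and helper function below are invented for illustration, not from the original post), precision and recall are two standard metrics that capture it:

```python
def precision_recall(actual, predicted):
    """Precision: of the transactions flagged as fraud, how many really were fraud.
    Recall: of the real frauds, how many were flagged. Labels: 1 = fraud, 0 = honest."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    flagged = sum(predicted)
    frauds = sum(actual)
    precision = true_pos / flagged if flagged else 0.0
    recall = true_pos / frauds if frauds else 0.0
    return precision, recall

# A system that flags every transaction gets perfect recall but poor precision,
# which exposes the flaw in the "percentage of frauds detected" metric alone.
actual = [0, 0, 1, 0, 1, 0, 0, 0, 0, 1]   # 3 real frauds out of 10
flag_all = [1] * 10                         # the "flag everything" system
p, r = precision_recall(actual, flag_all)   # p = 0.3, r = 1.0
```

A sensible metric has to balance both numbers, which is exactly why the naive metric in the text fails.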

Machine Learning Algorithms

The above definitions are great as they set a clear goal for machine learning.  However, they do not tell us how to achieve that goal.  We should make our definition more specific.  This brings us to the second category of definitions, which describe machine learning algorithms.  Here are some of the most popular ones.  In each case the algorithm is given a set of examples to learn from. 

  • Supervised Learning.   The algorithm is given training data which contains the "correct answer" for each example.  For instance, a supervised learning algorithm for credit card fraud detection would take as input a set of recorded transactions.  For each transaction, the training data would contain a flag that says if it is fraudulent or not. 
  • Unsupervised Learning.  The algorithm looks for structure in the training data, like finding which examples are similar to each other, and group them in clusters. 
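To make the contrast concrete, here is a deliberately tiny, hypothetical sketch of both settings (the data and helper functions are invented for illustration):

```python
# Supervised: training data pairs each transaction amount with a fraud label (1 = fraud).
train = [(20, 0), (35, 0), (50, 0), (900, 1), (1200, 1)]

def classify(amount):
    """Predict the label of the nearest labelled training example (1-nearest-neighbor)."""
    nearest = min(train, key=lambda ex: abs(ex[0] - amount))
    return nearest[1]

# Unsupervised: only amounts, no labels; group values that lie close together.
def cluster(values, gap=100):
    """Greedy 1-D clustering: start a new cluster whenever the gap to the
    previous value exceeds the threshold."""
    values = sorted(values)
    clusters = [[values[0]]]
    for v in values[1:]:
        if v - clusters[-1][-1] <= gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

label = classify(1000)                          # 1: nearest labelled example is fraudulent
groups = cluster([20, 35, 50, 900, 1200])       # [[20, 35, 50], [900], [1200]]
```

The supervised function needs the "correct answer" in its training data; the unsupervised one discovers structure (clusters) from the amounts alone.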

We have more concrete definitions, but still no clue about what to do next.

Machine Learning Problems

If defining categories of machine learning algorithms isn't good enough, then can we be more specific?  One possible way is to refine the task of machine learning by looking at classes of problems it can solve.  Here are some common ones:

  • Regression. A supervised learning problem where the answer to be learned is a continuous value.  For instance, the algorithm could be fed with a record of house sales with their price, and it learns how to set prices for houses.
  • Classification. A supervised learning problem where the answer to be learned is one of finitely many possible values.  For instance, in the credit card example the algorithm must learn how to choose the right answer between 'fraud' and 'honest'.  When there are only two possible values, we say it is a binary classification problem.
  • Segmentation. An unsupervised learning problem where the structure to be learned is a set of clusters of similar examples.  For instance, market segmentation aims at grouping customers in clusters of people with similar buying behavior.
  • Network analysis. An unsupervised learning problem where the structure to be learned is information about the importance and the role of nodes in the network.  For instance, the PageRank algorithm analyzes the network made of web pages and their hyperlinks, and finds the most important pages.  This is used in web search engines like Google.  Other network analysis problems include social network analysis.
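As a rough illustration of the network analysis case, a minimal power-iteration PageRank on an invented three-page graph might look like this (a sketch, not the production algorithm):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict of node -> list of outgoing links."""
    nodes = list(links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1 - damping) / n for node in nodes}
        for node, outs in links.items():
            if outs:
                share = rank[node] / len(outs)
                for target in outs:          # each page passes rank to pages it links to
                    new[target] += damping * share
            else:                            # dangling node: spread its rank evenly
                for target in nodes:
                    new[target] += damping * rank[node] / n
        rank = new
    return rank

# Page C is linked to by both A and B, so it ends up the most "important".
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

No labels are involved: the algorithm learns the importance of each node purely from the link structure, which is what makes it unsupervised.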

The list of problem types where machine learning can help is much longer, but I'll stop here because this isn't helping us that much.  We still don't have a definition that tells us what to do, even if we're getting closer.

Machine Learning Workflow

The issue with the above definitions is that developing a machine learning algorithm isn't enough to get a system that learns.  Indeed, there is a gap between a machine learning algorithm and a learning system.  I discussed this gap in Machine Learning Algorithm != Learning Machine, where I derived this machine learning workflow:


A machine learning algorithm is used in the 'Train' step of the workflow.  Its output (a trained model) is then used in the 'Predict' part of the workflow.  What differentiates a good machine learning algorithm from a bad one is the quality of the predictions we get in the 'Predict' step.  This leads us to yet another definition of machine learning:

The purpose of machine learning is to learn from training data in order to make as good as possible predictions on new, unseen, data.

This is my favorite definition, as it links the 'Train' step to the 'Predict' step of the machine learning workflow. 
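A minimal sketch of this Train/Predict separation (the numbers and the trivially simple "model" are invented for illustration) could look like:

```python
def train(examples):
    """'Train' step: learn an average price per square foot from labelled
    (surface, price) pairs. The returned dict is the trained model."""
    total_price = sum(price for _, price in examples)
    total_surface = sum(surface for surface, _ in examples)
    return {"price_per_sqft": total_price / total_surface}

def predict(model, surface):
    """'Predict' step: apply the trained model to new, unseen data."""
    return model["price_per_sqft"] * surface

# Train once on historical sales, then predict for a house we have never seen.
model = train([(2000, 270000), (3500, 510000), (1500, 240000)])
estimate = predict(model, 2500)
```

The algorithm lives entirely in `train`; everything downstream only ever sees the model it produced, which is exactly the gap between an algorithm and a learning system.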

One thing I like about the above definition is that it explains why machine learning is hard.  We need to build a model that defines the answer as a function of the example features.  So far so good.  The issue is that we must build a model that leads to good predictions on unforeseen data.  If you think about it, this seems like an impossible task.  How can we evaluate the quality of a model without looking at the data on which we will make predictions?  Answering that question is what keeps researchers in machine learning busy.  The general idea is that we assume unforeseen data is similar to the data we can see: if a model is good on the data we can see, then it should be good on unforeseen data.  Of course, the devil is in the details, and relying blindly on the data we can see can lead to a major issue known as overfitting.  I'll come back to this later, and I recommend reading Florian Dahms' What is "overfitting"? in the meantime.

A Simple Example

Let me explain the definition a bit.  Data comes in as a table (a 2D matrix) with one example per row.  Examples are described by features, with one feature per column.  There is a special column which contains the 'correct answer' (the ground truth) for each example.  The following is an example of such data set, coming from past house sales:

Name      Surface  Rooms  Pool    Price
House1      2,000      4     0  270,000
House5      3,500      6     1  510,000
House12     1,500      4     0  240,000


There are 3 examples, each identified by a name and described by 3 features: the surface, the number of rooms, and the presence of a pool. The target is the price, shown in the last column.  The goal is to find a function that relates the price to the features, for instance:

price = 100 * surface + 20,000 * pool + 15,000 * num_room

Once we have that function, then we can use it with new data.  For instance, when we get a new house, say house22 with 2,000 sq. feet, 3 rooms, and no pool, we can compute a price:

price(house22) = 100 * 2,000 + 20,000 * 0 + 15,000 * 3 = 245,000

Let's assume that house22 is sold at 255,000.  Our predicted price is off by 10,000.  This is the prediction error that we want to minimize.  A different price formula may lead to more accurate predictions.  The goal of machine learning is to find a price formula that leads to the most accurate predictions for future house sales.
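The arithmetic above can be checked directly in code, using the same candidate formula and numbers as the text:

```python
def price(surface, rooms, pool):
    """The candidate pricing model from the text:
    price = 100 * surface + 20,000 * pool + 15,000 * num_room"""
    return 100 * surface + 20000 * pool + 15000 * rooms

predicted = price(2000, 3, 0)   # house22: 2,000 sq ft, 3 rooms, no pool
actual = 255000                 # the price house22 actually sold at
error = actual - predicted      # the prediction error we want to minimize
# predicted = 245,000 and error = 10,000, matching the text
```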

In practice, we look for formulas that provide good predictions on the data we can see, i.e. the table above.  I say formulas, but machine learning is not limited to formulas; machine learning models can be much more complex.  The point is that a machine learning model can be used to compute a target (here, the price) from example features.  The goal of machine learning is to find a model that leads to good predictions in the future.


Some of the definitions listed above are taken from Andrew Ng's Stanford machine learning course.  I recommend this course (or the updated version available for free on Coursera) for those who want to dive deeper into machine learning. 

I found a more formal statement of my favorite definition in this presentation by Peter Prettenhofer and Gilles Louppe (if a reader knows when this definition was first used, please let me know):


Revolution Analytics

User Groups and R Awareness

by Joseph Rickert For quite a few years now we have attempted to maintain the Revolution Analytics' Local R User Group Directory as the complete and authoritative list of R user groups. Meetup groups...


David Corrigan

When it Comes to Customer Data, You Are Richer Than You Think

Most organizations think they don’t use customer data effectively.  To an extent, they are right.  88% of customer data is not used in most organizations.  That’s a staggering statistic.  It’s also...


Mission Project Rescue – Saving a Software Development Project

“Failure is the opportunity to begin again more intelligently.” Henry Ford

We take great inspiration from this statement by the auto manufacturing visionary. So much so that we have made it the central tenet of one of the pillars of our software development practice: Project Rescue. First, the facts.

  • Less than 1 in 3 software technology projects over the last 12 months were completed on time and within budget according to research by the Standish Group.
  • Geneca reported that 75% of all business and IT executives believed their software projects would fail, even before these projects got off the ground.
  • HBR reported that 16% of projects overshot their budgets by 200% and their schedule by 70%.
  • Failed projects cost money – Gallup reported that the US economy loses between $50 billion and $150 billion annually due to IT projects that failed.

Presumably, there is no room for argument that technology projects within enterprises are falling short of expectations in large numbers. Definitions of what constitutes failure vary, but essentially it could mean that the effort cost significantly more than initially planned, that it took much longer than anticipated to reach completion, or, in many ways most damaging, that the entire initiative failed to deliver the expected business value. In the third scenario the problem is compounded: money is lost, the opportunity may have passed with time, and essentially the entire effort is wasted. What does this mean for the enterprise facing such a failure? Our contention is that there is still hope, and even seemingly lost causes can be salvaged. We believe that almost all IT projects can be saved and revived to fulfill the need they were conceived to serve. That said, rescuing a project goes beyond mere survival: the rescue is complete only when the project achieves the business value it was designed for.

Many causes have been put forward to explain why projects fail. These include reasons like undisciplined project management practices, poor governance, inadequate support, insufficient technical experience, functional issues or even a lack of focus on quality assurance. While these are all valid reasons, more often than not, in our experience most projects fail simply because very few people have seen a project done right. The good news is, we believe that, in most cases, rescuing a project is not like recovering from a nuclear holocaust where nothing can be salvaged. Instead, we believe, it’s a series of few steps done right.

Understanding the technology part of this equation is, no doubt, important. We believe, though, that this is virtually a hygiene factor: most technology platforms, tools, and approaches are great in their own way and, when put to the task, have the ability to deliver to specifications. Put differently, it is rare for a project to run aground solely because of the technology choice. That being the case, our focus in Mission Project Rescue is usually elsewhere.

John Johnson said, “First, solve the problem. Then, write the code.” The first thing we look to do is to reinforce the integration of the business requirements into the architecture and the design. In the end, the project has to deliver business value, and the process of ensuring that starts right from gathering the requirements and building them centrally into the vision of the software. A significant challenge we seek to address here is simplifying the business process and how it maps to the software. A 2015 Gartner study of 50 large projects that were publicly accepted as being “complete failures” cited an inability “to address complexity in the business process” as the most important reason for their failure.

Bjarne Stroustrup said, “The most important single aspect of software development is to be clear about what you are trying to build.” Once the business requirements have found appropriate representation in the architecture we look to bring strong software development practices to bear on the conversion of those specifications into code. It is not always about which is the latest technology fad or software development approach dominating the blogosphere – our approach is to stay tuned to the technology world at all times but to choose conservatively and execute rigorously to revive the project.

A quote we saw somewhere went something like “Quality control is implemented to detect and correct problems when they occur, quality assurance is implemented to prevent problems from occurring.” That describes very well one of the key pillars of Mission Project Rescue. We seek to use stringent quality assurance processes across the entire development lifecycle to ensure that the code that we turn out not only meets the specifications we worked to but also the eventual expectations of the internal and external customers of the enterprise when they get to touch and feel what has been created.

No one is making the claim that saving an IT project that is spiralling out of control is easy. That being said, though, we specialize in coming to the rescue when projects are in jeopardy and in bringing them back to life. From that viewpoint, some of the insights we have gained may not play to traditional wisdom, but take it from us, they do work!


How Machine Learning is Boosting Sales for one Food Retailer

Machine learning is helping brands narrow the divide between their products and consumers in ways that would have appeared almost magical only ten years ago. From Amazon's personal product recommendations based on past purchases and browsing habits, to Netflix's uncanny ability to suggest just the right movie title according to your taste in film, data-driven insights are helping companies speak to the preferences of individual customers, who are demanding more personalisation in their products and engagements. This has moved data analytics from novelty status to an integral part of the marketing strategy, as brands discover new opportunities to communicate their unique selling points.

Revolution Analytics

Spark 2.0: more performance, more statistical models

Apache Spark, the open-source cluster computing framework, will soon see a major update with the upcoming release of Spark 2.0. This update promises to be faster than Spark 1.6, thanks to a run-time...


May 17, 2016

Rob D Thomas

The Fortress Cloud

In 1066, William of Normandy assembled an army of over 7,000 men and a fleet of over 700 ships to defeat England's King Harold Godwinson and secure the English throne. King William, recognizing his...


Revolution Analytics

Principal Components Regression in R, an operational tutorial

by John Mount, Ph.D., Data Scientist at Win-Vector LLC. Win-Vector LLC's Dr. Nina Zumel has just started a two-part series on Principal Components Regression that we think is well worth your time. You can...

Silicon Valley Data Science

Noteworthy Links: Hadoop Edition

As you may have heard, Hadoop is 10 this year. In celebration, here are some posts we think you’ll find interesting.

Doug Cutting on Hadoop turning 10—The co-creator of Hadoop talks a bit about the tech’s history, and what he sees in the future. A key theme is the importance, and inevitability, of open source technology.

Know your business needs for Hadoop—Diving into the data-driven world can be exciting, but SVDS CTO John Akred stresses the need for solid business plans. The process can be overwhelming to consider, and there are pitfalls to avoid.

Five things to know about Hadoop vs Spark—A quick rundown of what each tech does, and how they compare. Let us know if you disagree on their best uses.

Here’s 5 resources to help you become a certified Hadoop developer—Their list includes Cloudera itself, and the popular site Udemy. Have you used any of these to further your Hadoop skills?

A Spark of Genius? How the Hadoop Ecosystem Evolves—This interview with DMRadio looks at how Spark is affecting the experts, and what they think about it. A featured guest is our own Principal Engineer, Richard Williamson.

Want more? Sign up for our newsletter


The post Noteworthy Links: Hadoop Edition appeared first on Silicon Valley Data Science.

Data Digest

Taking the Pulse of Chief Data Officers in Africa

In the lead-up to the inaugural Chief Data Officer Forum Africa, I have been surveying the speaker faculty to get insights into where their focus lies, as well as their thoughts on the CDO role as it stands in South Africa right now.

The results of this survey will give you some idea of what they will discuss at the event early next month.

To give you some context, 13 speakers have completed the survey to date. Of those:

  • 3 are official (titled) CDOs
  • 2 perform the CDO role but their organisation doesn't use the title
  • 6 are from financial services
  • 2 are from telecoms & media
  • The remainder cover digital transport, retail, scientific research and consumer goods

The first key question asked was:

"Why did you decide to get involved as a speaker in the Chief Data Officer Forum Africa?"

The 3 most insightful responses to this question were:

1."This is a good opportunity to connect with like minded individuals and exchange insights. It is also an opportunity to shape the local industry that is still growing in data maturity" - Magan Naidoo, Group Data Manager, Dimension Data

2."To share my experiences and learnings with the wider community of data change agents as well as gain new learnings from others" - Yasaman Hadjibashi, Chief Data Officer, Barclays Africa

3."The CDO Forum is a great platform to engage with individuals from different spheres who have a vested interest in the effective management of data as an asset. Also, I will be exposed to various challenges experienced within other sectors and use this opportunity to measure our data maturity against other organisations" - Kaizer Manasoe, Data Governance Specialist, Standard Bank

The next question is probably the most important as it demonstrates the attitude towards the CDO role.

"How would you describe your company's view of the CDO role?"

  • 48% of those surveyed said "My company is aware of the role but is not actively working towards creating one."
  • 22% said "I am one so the company sees value in the role."
  • 15% said "Effectively I am one but my company doesn't use the title."
  • 15% said "My company is aware of the role and is actively working towards creating one." 
I was keen to know why South African companies weren't actively working towards having a CDO, especially since they are aware of the role.

"If your company is not working toward having a CDO, why?"

  • 2 of the 6 (who gave this answer) said "Data & analytics is not mature enough yet."
  • 2 of the 6 said "Can't figure where to place the role in the organisation"
  • 1 said "Don't use or believe in using the C-title outside of traditional roles"
  • 1 said "No understanding of exactly what a CDO would do"
I'm willing to bet that of the 100+ people that have registered for the event, the majority would have very similar answers to the speakers. The objective of the event is to help companies overcome the barriers to having a CDO through information sharing.

The next question delved into where the speakers were applying most of their focus.

"What is your biggest focus when it comes to data and/or analytics?"

Speakers could choose multiple answers to this question. Results were:
  • "Building a culture of data centricity across the whole organisation" came up 11 times
  • "Monetising data through analytics and insights" came up 7 times
  • "Improving the quality of data across the organisation" came up 6 times
  • "Embedding robust data management and governance frameworks" came up 6 times
Other than building a data centric culture, each focus area is dependent on technology investments.

"What are the biggest technology investments you expect to make in the next 12 to 18 months?"

Like the previous question, respondents could select multiple answers.
  • "Predictive analytics, advanced analytics and big data" came up 12 times
  • "Data governance, data quality and data management" came up 10 times
  • "ETL & data warehousing" came up 4 times
  • "Data visualisation" came up 3 times
  • "Data security, data breach" and "Consulting services" came up once each
Clearly, these companies are getting their data in order to drive business performance through analytics and insights.

The good news is Corinium is launching the Chief Analytics Officer Forum Africa in September 2016.

The second to last question was used to get insight into what needs to happen to drive the growth of CDOs.

"What do you think will drive the proliferation of CDOs in South Africa?"

The 3 most insightful responses were:

1. "Increased awareness and appreciation of having accountability at the C- level by an appropriately experienced person. Pressure to manage data related costs and leverage data as an asset" - Magan Naidoo, Group Data Manager, Dimension Data
2. "Pressure and results from big corporates that have shown the value in what a CDO can bring to the table. Companies will soon learn that the traditional CIO role does not give you a competitive advantage in the market" - Morne van der Westhuizen, Head of Data Analytics, Zoona

3. "The increasing data management and governance challenges presented by the advent of Big data and Analytics; The ongoing challenge that most CIOs face for not having the capacity to focus on data" - Junior Muka, Data Architect - Business Intelligence, Woolworths
Finally, I wanted to know what each speaker hopes to get out of their conference experience.

"What do you hope to get out of the event? What take-aways are you looking for?"

I have chosen these 3 responses although all were insightful:

1. "Exciting new connections from across Africa and learning from new as well as aspiring data leaders, who want to make a material difference for their companies, customers, and the African continent" -  Yasaman Hadjibashi, Chief Data Officer, Barclays Africa

2. "Because data and information as a valuable asset is a growing realisation in Africa, the opportunity to network with others dealing with the same challenges and difficulties will give us comfort that we are headed in the right direction" - PJ Bezuidenhout, Chief Data Officer, Wesbank

3. "Improved definition and shared understanding of the role and responsibilities of a CDO; Practical ways for growing toward such a role and influencing your organization to actively work toward creating such a role" - Junior Muka, Data Architect - Business Intelligence, Woolworths

Once all speakers and delegates have been surveyed, we will publish a detailed report on the South Africa CDO environment. Clearly, there is a lot of activity in this space and some deep thinking. All that is needed now is a concerted effort to push the value of data, analytics and leadership roles up the corporate ladder.

If 13 surveyed speakers generated this much insight, think how much there is to be gained by spending up to 4 days with over 100 data and analytics leaders.

By Craig Steward: 

Craig Steward is the Content Director for Corinium’s CDO and CAO Forums in Africa. His research is uncovering the challenges and opportunities that exist for CDOs and CAOs and the Forums will bring the market together to map the way forward for these important roles. For more information contact Craig on
Big Data University

This Week in Data Science (May 17, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming & Recent Data Science Events

Cool Data Science Videos

The post This Week in Data Science (May 17, 2016) appeared first on Big Data University.


May 16, 2016

Teradata ANZ

Connect-the-dots to ensure “Great” Customer Experience – Part 2

In Part 1 of this blog series, we looked at how hyper-personalisation is fundamental to delivering “Great” Customer Experience at each and every interaction. Businesses need to “see the world as customers do” in order to appreciate the full customer context – i.e. each journey has to be managed end-to-end rather than as a series of disconnected interactions.

Yet nowadays consumers pretty much dictate where and how they liaise with service providers. Increasingly, most of this is done in the digital realm and anonymously. Most organisations currently have little to no visibility of these exchanges, especially in Social Media. Hence the question arises: ‘How can a business gain this end-to-end perspective while it controls only parts of the customer’s buying journey?’


Contrary to common practice of simply “buying a new tool”, it is the underlying data and analytics ecosystem that enables an organisation to connect-the-dots. This blog highlights three main components and processes that should be considered in order to achieve systematic execution on a massive scale. Our guiding principle here is to build appropriate agility in the company’s data and analytics functions capable of meeting any challenge that the rapidly changing competitive landscape throws at it.

1. Identity Registry holding the “Master Identifier”

Cookie IDs are pervasive and readily accessible and many companies have already used them to link with internal data to recognise specific entities. However, the experience of a mega-major North American financial institution has shown that using cookies alone is insufficient. They only provide the required precision and contextual granularity for less than half of its customer base.

To expand the coverage to as many individuals as possible, an enterprise-level process is needed that can merge real-world identity data with digital information collected incrementally. Accuracy is further enhanced when online activities of every anonymous visit are tracked. Over time, actual behavior may reveal personal information that eventually provides the strongest possible “Master Identifier” connection.

Establishing a comprehensive Identity Registry is necessary to infer a unique “digital fingerprint” for each customer and addressable prospect. This Identity Registry forms the foundation for the next key component – i.e. categorising interaction history as a way to join previously unjoin-able data.
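As a minimal illustration of the linking idea (the column names and data here are hypothetical; a real registry design depends on the organisation's systems), anonymous cookie activity can be joined to known customers with a left join, so unmatched visits remain visible as anonymous:

```python
import pandas as pd

# Hypothetical web-activity log keyed by cookie ID (illustrative data)
visits = pd.DataFrame({
    'cookie_id': ['c1', 'c2', 'c3'],
    'page': ['/loans', '/cards', '/loans'],
})

# Hypothetical registry mapping cookie IDs to a master customer identifier,
# accumulated incrementally (e.g. after a login or a form submission)
registry = pd.DataFrame({
    'cookie_id': ['c1', 'c3'],
    'master_id': ['CUST-001', 'CUST-002'],
})

# A left join keeps anonymous visits (master_id stays NaN) next to identified ones
linked = visits.merge(registry, on='cookie_id', how='left')
print(linked)
```

The fraction of rows with a populated master_id is exactly the "coverage" figure discussed above.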

2. Unified Interaction History

To gain a competitive advantage, a business has to anticipate the customer’s needs, expectations, and desires during each part of a buying or service journey. In order to do so, the company must know what happened in its dealings with the customer beforehand. Chaining together seemingly disparate interactions into related sessions helps to unearth the critical moments for proactive intervention – e.g. when a customer truly enters the sales funnel or encounters difficulties.

A Unified Interaction History is most reliable when it is based on the Identity Registry mentioned earlier. It would standardise the detection of these critical moments-of-truth across the enterprise regardless of the touch points used by each customer. This integrated repository would allow the organisation to methodically analyse how customers navigate across different touch points as they go about searching for information, compare features and prices, seek recommendations and advice, etc.
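One common way to chain interactions into sessions (a sketch with made-up events; the 30-minute inactivity threshold is an arbitrary assumption, not something prescribed by the post) is to sort events per customer and start a new session whenever the gap between consecutive events exceeds the threshold:

```python
import pandas as pd

# Illustrative interaction log: customer touch points with timestamps
events = pd.DataFrame({
    'customer': ['A', 'A', 'A', 'B'],
    'ts': pd.to_datetime(['2016-05-01 09:00', '2016-05-01 09:10',
                          '2016-05-01 11:00', '2016-05-01 09:05']),
})

events = events.sort_values(['customer', 'ts'])
gap = events.groupby('customer')['ts'].diff()
# A new session starts on a customer's first event or after >30 min of inactivity
new_session = gap.isna() | (gap > pd.Timedelta(minutes=30))
events['session'] = new_session.cumsum()
print(events)
```

Customer A's first two events fall into one session, the 11:00 event starts a second, and customer B gets a session of their own.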

3. Advanced Analytics Collaboration

The Unified Interaction History in itself is only an aggregate of information. It must be available in a timely manner as input for advanced analytics to reveal insights into different customer journeys. Traditional analytics and BI solutions struggle to provide the necessary alacrity and efficiency to address the demands imposed by the competitive digital market place.

New-generation ‘Big Data’ analytics is required to expose crucial junctions that proactively highlight unforeseen road blocks and/or opportunities to deliver bespoke incentives to lift sales conversions. At the same time, to achieve such agility at a scale that will outpace the competition, complex data science ‘discovery’ work has to be accessible to non-technical business people. In other words, sophisticated analytical functions should be prepackaged as ‘Apps’. The business users should be able to access their library of these Apps and apply them as and when required.

For example, instead of waiting for a data scientist to conduct the analysis, a Customer Experience Manager can easily perform a path analysis herself with one of these pre-built Apps. The App may highlight certain “hotspots” on a website that ‘force’ customers to telephone for help from a customer service rep. This intelligence becomes a timely source of relevant context in data-driven marketing communication and personalised service messages. It also provides better coordination across the enterprise and helps to deliver “Great” Customer Experience…first time, every time!

Finally, it is worth noting that an agile framework for rapid iteration and deployment will be needed to continuously refine the company’s data and analytics ecosystem. To maintain the competitive advantage, innovators have started to experiment with Social Media and other off-domain interactions to further enrich their ‘unified customer experience view’. In fact, more and more anecdotes suggest that Social Media is shaping purchasing decisions almost as much as TV. Glimpses of consumers’ sentiment and likelihood to buy can now be gleaned from Social Media postings. By incorporating unstructured data, in multiple formats, from these increasingly important channels into the company’s data and analytics ecosystem, the company will be in a significantly better position to complete the customer buying journey’s jigsaw puzzle.

Winning in the fast-paced digital marketplace will require good decisions to be made at every level. The foundation for good decisions is an agile data and analytics ecosystem as described in this blog.

The post Connect-the-dots to ensure “Great” Customer Experience – Part 2 appeared first on International Blog.

Revolution Analytics

Documentation for Microsoft R Server now online

If you've been thinking about trying the big-data capabilities of Microsoft R Server but wanted to check out the documentation first, you're in luck: the complete Microsoft R Server documentation is...


May 14, 2016

Jean Francois Puget

Tidy Data In Python

Is your data tidy or messy?  If you are not sure how to answer this question, don't worry, you'll understand it in a minute.  This question has to do with an issue that keeps data scientists (or statisticians, or machine learners, pick your favorite) busy: data preparation.  It is well known that 80% of the time spent on a data science project goes into data preparation, and as little as 20% into actually learning from the data (or modeling it). 

What is this data preparation about?  Well-known steps include dealing with missing values.  Should they be replaced by 0, or by some average value?  Shouldn't we rather get rid of the observations with missing values?  Another popular topic is what to do with outliers, i.e. values that are way apart from most values in the data.  These could be measurement errors, or transcription errors, in which case they should be ignored.  But they could also be meaningful, in which case we should keep them by all means. 

Tidy and messy data sets

I could go on, but there is a fundamentally different data preparation step that transforms data to make it easy to process by statistical or machine learning packages.  That transformation was a bit fuzzy for me until recently.  I knew when I was seeing data that it needed to be transformed, and I could transform it, but I never spent time defining precisely what this transformation is about.  Then the light came with this beautiful article:

 Tidy Data, by Hadley Wickham.

I recommend reading it, but I'll try to convey the idea here.  Wickham defines two types of data sets.  The tidy datasets are made of tables with observations in rows, variables in columns.  Actually there is a third condition about data being normalized that we will ignore for now.  We will revisit it in the last section of this post.  In machine learning terms, tidy datasets are matrices with features as columns, and examples as rows.  Anything else is called a messy dataset. 

Let's look at an example for the sake of clarity.  This example, and all examples in this blog entry are taken from Hadley's article. 

The data contains the result of two treatments applied to a set of people.  A rather natural way to report these results is depicted below.


First Last Treatment A Treatment B
John Smith NaN 2
Jane Doe 16 11
Mary Johnson 3 1


Sometimes, the transposed view may be preferred:


First John Jane Mary
Last Smith Doe Johnson
Treatment A NaN 16 3
Treatment B 2 11 1


While these are compact ways to present data, neither of them are suited for easy analysis.  For instance, there is a missing value shown as NaN (Not a Number).  Let's say we want to ignore the corresponding observation.  How do we do that?  The only way is to loop over a 2D matrix and test each time if the value is a number or not.  This is not convenient. 

A much better way is to reorganize data in a tidy form, with one observation per row.  Here, an observation is one value for a treatment for a person:


First Last treatment result
John Smith Treatment A NaN
Jane Doe Treatment A 16
Mary Johnson Treatment A 3
John Smith Treatment B 2
Jane Doe Treatment B 11
Mary Johnson Treatment B 1


Ignoring the observation with a missing value is easy: we remove the row for it:


First Last treatment result
Jane Doe Treatment A 16
Mary Johnson Treatment A 3
John Smith Treatment B 2
Jane Doe Treatment B 11
Mary Johnson Treatment B 1


We can then proceed with further analysis without having to worry about missing values.  This is a tidy data set.  The previous ones were messy data sets.
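With pandas, dropping the row with the missing result is a one-liner on the tidy frame. A self-contained sketch recreating the tidy table above:

```python
import pandas as pd
import numpy as np

# The tidy version of the treatment data: one observation per row
tidy = pd.DataFrame({
    'First': ['John', 'Jane', 'Mary', 'John', 'Jane', 'Mary'],
    'Last': ['Smith', 'Doe', 'Johnson', 'Smith', 'Doe', 'Johnson'],
    'treatment': ['Treatment A'] * 3 + ['Treatment B'] * 3,
    'result': [np.nan, 16, 3, 2, 11, 1],
})

# One observation per row, so filtering the missing value is a single call
clean = tidy.dropna(subset=['result'])
print(clean)
```

Compare this with the 2D loop-and-test we would need on the messy form.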

The rest of this post will go through most of the examples used by Wickham in his article to show how to turn messy data sets into tidy ones.  Hadley uses the R programming language for that.  He actually is the main author of many popular R packages for data preparation and data visualization. His packages make it easy to get tidy datasets from messy ones.

We will use Python to do the same if possible.  All the code we use is available in a notebook on github and nbviewer.

The code for the above example is the following.  First, we create the messy data set.

import pandas as pd
import numpy as np

messy = pd.DataFrame({'First' : ['John', 'Jane', 'Mary'], 
                      'Last' : ['Smith', 'Doe', 'Johnson'], 
                      'Treatment A' : [np.nan, 16, 3], 
                      'Treatment B' : [2, 11, 1]})

The transpose view is easy to get via the dataframe's T attribute:

messy.T

The tidy version is easily obtained via the melt() function.

tidy = pd.melt(messy,
               id_vars=['First', 'Last'],
               var_name='treatment',
               value_name='result')

This function is quite powerful.  Let's have a closer look at it.

A simple melt example

Understanding how the melt() function works is key for turning data into tidy data.  Wickham provides this simple example to explain the melt process.  We first create a pandas dataframe for it.

messy = pd.DataFrame({'row' : ['A', 'B', 'C'], 
                      'a' : [1, 2, 3],
                      'b' : [4, 5, 6],
                      'c' : [7, 8, 9]})

This yields


  a b c row
0 1 4 7 A
1 2 5 8 B
2 3 6 9 C


This dataset has three variables (or features). One is stored in the row column.  The second one appears as column names (a,b, and c).  The last one is stored as entries in the table.  As Wickham puts it (I modified the R names into Python names):

"To tidy it, we need to melt, or stack it. In other words, we need to turn columns into rows. While this is often described as making wide datasets long or tall, I will avoid those terms because they are imprecise. Melting is parameterised by a list of columns that are already variables, or id_vars for short. The other columns are converted into two variables: a new variable called variable that contains repeated column headings and a new variable called value that contains the concatenated data values from the previously separate columns."

We therefore use melt() with the row column as the id_vars.

pd.melt(messy, id_vars='row')

It yields


  row variable value
0 A a 1
1 B a 2
2 C a 3
3 A b 4
4 B b 5
5 C b 6
6 A c 7
7 B c 8
8 C c 9


The row column stays as a column.  All the other column names are now values for the new variable column.  And the original entries in the table are now the values of the new value column. 

It is possible to rename the new columns with additional arguments.

tidy = pd.melt(messy, id_vars='row', var_name='dimension', value_name='length')



  row dimension length
0 A a 1
1 B a 2
2 C a 3
3 A b 4
4 B b 5
5 C b 6
6 A c 7
7 B c 8
8 C c 9


The pivot function is almost the inverse of the melt function.

messy1 = tidy.pivot(index='row',columns='dimension',values='length')



dimension a b c
A 1 4 7
B 2 5 8
C 3 6 9


This is close, but not identical to the original dataframe, because row is used as index.  We can move it back to a column with reset_index().  We should also remove the name dimension used for columns.

messy1.reset_index(inplace=True)
messy1.columns.name = None

It yields a dataframe that is identical to the original one, up to some column reordering.


  row a b c
0 A 1 4 7
1 B 2 5 8
2 C 3 6 9
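The round trip can be checked end to end. A self-contained sketch (note we clear the columns name with None so the two frames compare equal):

```python
import pandas as pd

messy = pd.DataFrame({'row': ['A', 'B', 'C'],
                      'a': [1, 2, 3],
                      'b': [4, 5, 6],
                      'c': [7, 8, 9]})

# Melt, then pivot back, undoing pivot's index and column-name side effects
tidy = pd.melt(messy, id_vars='row', var_name='dimension', value_name='length')
messy1 = tidy.pivot(index='row', columns='dimension', values='length')
messy1.reset_index(inplace=True)
messy1.columns.name = None

# After aligning column order, the round trip preserved the data exactly
pd.testing.assert_frame_equal(messy1[['row', 'a', 'b', 'c']], messy)
print('round trip OK')
```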

Column headers are values, not variable names

The melt function is key, but it is not always sufficient, as we will see with additional examples used by Wickham.  The data for the first example of problems is depicted below.  We store it as a dataframe in the messy Python variable. 


  religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k
0 Agnostic 27 34 60 81 76 137
1 Atheist 12 27 37 52 35 70
2 Buddhist 27 21 30 34 33 58
3 Catholic 418 617 732 670 638 1116
4 Don't know/refused 15 14 15 11 10 35
5 Evangelical Prot 575 869 1064 982 881 1486
6 Hindu 1 9 7 9 11 34
7 Historically Black Prot 228 244 236 238 197 223
8 Jehovah's Witness 20 27 24 24 21 30
9 Jewish 19 19 25 25 30



As Wickham describes it: "This dataset explores the relationship between income and religion in the US. It comes from a report produced by the Pew Research Center, an American think-tank that collects data on attitudes to topics ranging from religion to the internet, and produces many reports that contain datasets in this format.

This dataset has three variables, religion, income and frequency. To tidy it, we need to melt, or stack it."

Again, the melt() function is our friend.  We sort the result by religion to make it easier to read.

tidy = pd.melt(messy, id_vars = ['religion'], var_name='income', value_name='freq')
tidy.sort_values(by=['religion'], inplace=True)

It yields (we only show the first 5 rows via the head() function).


  religion income freq
0 Agnostic <$10k 27
30 Agnostic $30-40k 81
40 Agnostic $40-50k 76
50 Agnostic $50-75k 137
10 Agnostic $10-20k 34

Multiple variables stored in one column

This example is a little trickier.  We first read the input data as a data frame. This data is available at

I've cloned it, so it is in my local data directory.

Reading it is easy with the pandas built-in function read_csv().  We remove the new_sp_ prefix appearing in most columns, and we rename a couple of columns as well.  We restrict it to year 2000 and drop a couple of columns to stay in sync with Wickham's paper.

messy = pd.read_csv('data/tb.csv')
messy.columns = messy.columns.str.replace('new_sp_','')
messy.rename(columns = {'iso2' : 'country'}, inplace=True)
messy = messy[messy['year'] == 2000]
messy.drop(['new_sp','m04','m514','f04','f514'], axis=1, inplace=True)

Printing the first few rows of the first 11 columns yields:

  country year m014 m1524 m2534 m3544 m4554 m5564 m65 mu f014
10 AD 2000 0 0 1 0 0 0 0 NaN NaN
36 AE 2000 2 4 4 6 5 12 10 NaN 3
60 AF 2000 52 228 183 149 129 94 80 NaN 93
87 AG 2000 0 0 0 0 0 0 1 NaN 1
136 AL 2000 2 19 21 14 24 19 16 NaN 3

The melt() function is useful, but it is not enough here.  Let's use it anyway.

molten = pd.melt(messy, id_vars=['country', 'year'], value_name='cases')
molten.sort_values(by=['year', 'country'], inplace=True)

It yields


  country year variable cases
0 AD 2000 m014 0
201 AD 2000 m1524 0
402 AD 2000 m2534 1
603 AD 2000 m3544 0
804 AD 2000 m4554 0
1005 AD 2000 m5564 0
1206 AD 2000 m65 0
1407 AD 2000 mu NaN
1608 AD 2000 f014 NaN
1809 AD 2000 f1524 NaN


This molten dataframe makes it easy to remove the values where the age is mu.  However, it is not really tidy as the variable column encodes two variables: sex and age range.  Let's process the dataframe to create two additional columns, one for the sex, and one for the age range.  We then remove the variable column.

tidy = molten[molten['variable'] != 'mu'].copy()
def parse_age(s):
    s = s[1:]
    if s == '65':
        return '65+'
    return s[:-2]+'-'+s[-2:]

tidy['sex'] = tidy['variable'].apply(lambda s: s[:1])
tidy['age'] = tidy['variable'].apply(parse_age)
tidy = tidy[['country', 'year', 'sex', 'age', 'cases']]

It yields


  country year sex age cases
0 AD 2000 m 0-14 0
201 AD 2000 m 15-24 0
402 AD 2000 m 25-34 1
603 AD 2000 m 35-44 0
804 AD 2000 m 45-54 0
1005 AD 2000 m 55-64 0
1206 AD 2000 m 65+ 0
1608 AD 2000 f 0-14 NaN
1809 AD 2000 f 15-24 NaN
2010 AD 2000 f 25-34 NaN
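A benefit of this tidy layout: aggregations that would need custom loops over the wide form become one-line groupby operations. A minimal sketch with stand-in data (the counts below are illustrative, not the real WHO figures):

```python
import pandas as pd
import numpy as np

# Small stand-in for the tidy TB data: one observation per row
tidy = pd.DataFrame({
    'country': ['AD', 'AD', 'AE', 'AE'],
    'year': [2000] * 4,
    'sex': ['m', 'f', 'm', 'f'],
    'age': ['0-14', '0-14', '0-14', '0-14'],
    'cases': [0, np.nan, 2, 3],
})

# Total cases per sex; NaN values are skipped automatically by sum()
totals = tidy.groupby('sex')['cases'].sum()
print(totals)
```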

Let's look at a trickier example.

Variables are stored in both rows and columns

This example is really tricky.  We store the following data as a dataframe in the messy Python variable.


  id year month element d1 d2 d3 d4 d5 d6 d7 d8
0 MX17004 2010 1 tmax NaN NaN NaN NaN NaN NaN NaN NaN
1 MX17004 2010 1 tmin NaN NaN NaN NaN NaN NaN NaN NaN
2 MX17004 2010 2 tmax NaN 27.3 24.1 NaN NaN NaN NaN NaN
3 MX17004 2010 2 tmin NaN 14.4 14.4 NaN NaN NaN NaN NaN
4 MX17004 2010 3 tmax NaN NaN NaN NaN 32.1 NaN NaN NaN
5 MX17004 2010 3 tmin NaN NaN NaN NaN 14.2 NaN NaN NaN
6 MX17004 2010 4 tmax NaN NaN NaN NaN NaN NaN NaN NaN
7 MX17004 2010 4 tmin NaN NaN NaN NaN NaN NaN NaN NaN
8 MX17004 2010 5 tmax NaN NaN NaN NaN NaN NaN NaN NaN
9 MX17004 2010 5 tmin NaN NaN NaN NaN NaN NaN NaN NaN


As Wickham describes it: "It has variables in individual columns (id, year, month), spread across columns (day, d1-d31) and across rows (tmin,tmax) (minimum and maximum temperature). Months with less than 31 days have structural missing values for the last day(s) of the month. The element column is not a variable; it stores the names of variables."

Most of the values are missing.  However, filtering the NaN values isn't possible in this messy form. We need to melt the dataframe first.  We reindex the dataframe, but this step isn't mandatory.  I just prefer to have my rows numbered consecutively in the dataframe, but keeping the original indices may be valuable in other circumstances.

molten = pd.melt(messy, 
                 id_vars=['id', 'year', 'month', 'element'],
                 var_name='day')
molten = molten.dropna()
molten = molten.reset_index(drop=True)

It yields.


  id year month element day value
0 MX17004 2010 2 tmax d2 27.3
1 MX17004 2010 2 tmin d2 14.4
2 MX17004 2010 2 tmax d3 24.1
3 MX17004 2010 2 tmin d3 14.4
4 MX17004 2010 3 tmax d5 32.1
5 MX17004 2010 3 tmin d5 14.2


This dataframe is not in tidy form yet. First, the column element contains variable names. Second, the columns year, month, day represent one variable: the date. Let's fix the latter problem first.

def f(row):    
    return "%d-%02d-%02d" % (row['year'], row['month'], int(row['day'][1:]))
molten['date'] = molten.apply(f,axis=1)
molten = molten[['id', 'element','value','date']]

It yields


  id element value date
0 MX17004 tmax 27.3 2010-02-02
1 MX17004 tmin 14.4 2010-02-02
2 MX17004 tmax 24.1 2010-02-03
3 MX17004 tmin 14.4 2010-02-03
4 MX17004 tmax 32.1 2010-03-05
5 MX17004 tmin 14.2 2010-03-05


Now we need to move the values in the element column to be the name of two new columns. This is the opposite of a melt operation.  As we have seen above, this is the pivot operation.

tidy = molten.pivot(index='date',columns='element',values='value')

It yields


element tmax tmin
2010-02-02 27.3 14.4
2010-02-03 24.1 14.4
2010-03-05 32.1 14.2


Wait a minute.

Where is the id?

One way to keep the id column, is to move it to an index with the groupby() function, and apply pivot() inside each group.

tidy = molten.groupby('id').apply(pd.DataFrame.pivot,
                                  index='date',
                                  columns='element',
                                  values='value')
It yields


  element tmax tmin
id date    
MX17004 2010-02-02 27.3 14.4
2010-02-03 24.1 14.4
2010-03-05 32.1 14.2


We are almost there. We simply have to move id back as a column with reset_index().

tidy.reset_index(inplace=True)

It yields:
It yields.


element id date tmax tmin
0 MX17004 2010-02-02 27.3 14.4
1 MX17004 2010-02-03 24.1 14.4
2 MX17004 2010-03-05 32.1 14.2


Et Voila!

Multiple types in one table

Wickham uses yet another dataset to illustrate further issues with messy data.  It is an excerpt from the Billboard top hits for 2000.  We store the following data in the messy Python variable.


  year artist track time date entered wk1 wk2 wk3
0 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 87 82 72
1 2000 2Ge+her The Hardest Part Of ... 3:15 2000-09-02 91 87 92
2 2000 3 Doors Down Kryptonite 3:53 2000-04-08 81 70 68
3 2000 98^0 Give Me Just One Nig... 3:24 2000-08-19 51 39 34
4 2000 A*Teens Dancing Queen 3:44 2000-07-08 97 97 96
5 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 84 62 51
6 2000 Aaliyah Try Again 4:03 2000-03-18 59 53 38
7 2000 Adams,Yolanda Open My Heart 5:30 2000-08-26 76 76 74


The first columns give the artist performing the song, the title of the song, its duration, and the date it entered the top hits.   Columns wk1, wk2, etc. represent the rank of a given song in the weeks after it entered the top hits. 

This dataframe is messy because there are several observations per row, in the columns wk1, wk2, wk3.  We can get one observation per row by melting the dataframe.

molten = pd.melt(messy, 
                 id_vars=['year', 'artist', 'track', 'time', 'date entered'],
                 var_name='week',
                 value_name='rank')
molten.sort_values(by=['date entered', 'week'], inplace=True)

It yields


  year artist track time date entered week rank
5 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 wk1 84
13 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 wk2 62
21 2000 Aaliyah I Don't Wanna 4:15 2000-01-29 wk3 51
0 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 wk1 87
8 2000 2,Pac Baby Don't Cry 4:22 2000-02-26 wk2 82


We can clean the dataset further, first by turning week into a number.  Second, we need the starting date of the week for each observation, instead of the date the track entered.

from datetime import datetime, timedelta

def increment_date(row):
    date = datetime.strptime(row['date entered'], "%Y-%m-%d")
    return date + timedelta(7) * (row['week'] - 1)
molten['week'] = molten['week'].apply(lambda s: int(s[2:]))
molten['date'] = molten.apply(increment_date, axis=1)
molten.drop('date entered', axis=1, inplace=True)

It yields


  year artist track time week rank date
5 2000 Aaliyah I Don't Wanna 4:15 1 84 2000-01-29
13 2000 Aaliyah I Don't Wanna 4:15 2 62 2000-02-05
21 2000 Aaliyah I Don't Wanna 4:15 3 51 2000-02-12
0 2000 2,Pac Baby Don't Cry 4:22 1 87 2000-02-26
8 2000 2,Pac Baby Don't Cry 4:22 2 82 2000-03-04


This is a tidy dataset by the definition we gave above, with one observation per row and one feature per column.  This form is suitable for most statistical and machine learning packages.  Yet, it is not in the 'purest' tidy form.  Indeed, some data is repeated, like the artist name, song name, and song duration.  We can remove this redundancy by using more than one table for the data.  This is called data normalization in the relational database community.  For Wickham, normalization is required for truly tidy datasets.

Let us first get the information pertaining to each song.  We restrict ourselves to the year, artist, track, and time columns.  And we only need to keep the first row for each combination of year, artist, and track. 

tidy_track = molten[['year','artist','track','time']]\
             .groupby(['year','artist','track']).first()\
             .reset_index()\
             .reset_index()
tidy_track.rename(columns = {'index':'id'}, inplace=True)

The first call to reset_index() moves the columns that were used in groupby() back to columns.  The second one adds a new column that will serve as the id. 

It yields our first new table.


  id year artist track time
0 0 2000 2,Pac Baby Don't Cry 4:22
1 1 2000 2Ge+her The Hardest Part Of ... 3:15
2 2 2000 3 Doors Down Kryptonite 3:53
3 3 2000 98^0 Give Me Just One Nig... 3:24
4 4 2000 A*Teens Dancing Queen 3:44
5 5 2000 Aaliyah I Don't Wanna 4:15
6 6 2000 Aaliyah Try Again 4:03
7 7 2000 Adams,Yolanda Open My Heart 5:30


The second table is obtained by adding the id column to the original table via a merge operation, then restricting to the columns about weekly ranks.

tidy_rank = pd.merge(molten, tidy_track, on='track')
tidy_rank = tidy_rank[['id', 'date', 'rank']]

This yields:


  id date rank
0 5 2000-01-29 84
1 5 2000-02-05 62
2 5 2000-02-12 51
3 0 2000-02-26 87
4 0 2000-03-04 82
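To convince ourselves that no information was lost in the split, the two tables can be joined back on the id column.  Here is a minimal sketch with toy data; the two frames below are hypothetical miniature stand-ins for tidy_track and tidy_rank:

```python
import pandas as pd

# Hypothetical miniature versions of the two normalized tables
tidy_track = pd.DataFrame({
    'id': [0, 5],
    'artist': ['2,Pac', 'Aaliyah'],
    'track': ["Baby Don't Cry", "I Don't Wanna"],
})
tidy_rank = pd.DataFrame({
    'id': [5, 5, 0],
    'date': ['2000-01-29', '2000-02-05', '2000-02-26'],
    'rank': [84, 62, 87],
})

# Joining on id recovers the denormalized view with the repeated song data
denorm = pd.merge(tidy_rank, tidy_track, on='id')
print(denorm[['artist', 'track', 'date', 'rank']])
```

The default inner join is enough here because every rank row carries an id that exists in the track table.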


This concludes our little exercise.  I hope it shows that pandas provides data tidying features that are flexible enough to match what Wickham showcased in his article.  All the code for this blog entry is available in a notebook on github and nbviewer.



May 13, 2016

Revolution Analytics

Because it's Friday: The time-travelling jukebox

If you're looking for some musical nostalgia this weekend, look no further than How Music Taste Evolved, from design firm Polygraph (and with a hat tip to Pogue). Choose any month from the past six...


Revolution Analytics

What's in Pasta Carbonara?

Apparently, people have strong feelings about how pasta carbonara should be made. A 45-second French video showing a one-pot preparation of the dish with farfalle instead of spaghetti and...


Simplified Analytics

Sentiment analysis in the age of Digital Transformation

Sentiment Analysis is the process of determining whether an information or service provided leads to positive, negative or neutral human feelings  or opinions. It is essentially, the process of...



How Web Data Harvesting Can Be Used to Combat Counterfeiting

The World Trademark Review recently published a startling commentary on a study in an article titled,  “We are failing”: study reveals $461 billion international trade in counterfeit and pirated goods. The article details the failings of companies when it comes to combatting counterfeiting online. In this post, we hope to cover how harvesting web data […] The post How Web Data Harvesting Can Be Used to Combat Counterfeiting appeared first on BrightPlanet.

Jean Francois Puget

Data Science Automation

Will data scientists disappear soon?  I am asking the question as I see more and more papers about why data scientists may be a parenthesis in history.  The latest I read is Will The 'Best Job Of 2016' Soon Become Redundant? by Bernard Marr. To his point, there are indeed a number of software and cloud services aiming at automating data science.  Marr cites IBM Watson Analytics as a great example of this.  I tend to agree, and not only because I am an IBM employee.  Watson Analytics does automate data science: you upload data to it, and the data gets analyzed automatically for you.  This makes data science usable by non-specialists.

Does this mean that data scientists are useless now?

I don't think so.

First of all, there are many flavors of data science, and Watson Analytics automates one of them.  More precisely, it automates what I would label as data mining.  It implements a simple workflow:

[Workflow diagram]

Note that simplicity is a plus here, not a defect.  Making data science simpler is what makes it usable by non-specialists.  Note also that the above is an oversimplification of Watson Analytics and may not represent it faithfully.  I encourage readers to give it a try; I bet you'll be positively surprised by the natural language interface and the great visualization capabilities.  Anyway, for the sake of this discussion, I think we can abstract it to the above workflow.

Another flavor of data science is machine learning.  I discussed at length how machine learning goes beyond data analysis in Machine Learning Algorithm != Learning Machine.  Here is the workflow I derived:

[Workflow diagram]

Watson Analytics could be seen as a way to automate some of it, namely the "train" step.  There are other efforts aiming at automating this step in the context of machine learning, including, but not limited to: IBM Cognitive Assistant To Data Scientists, TPOT, LICO, and The Automatic Statistician, plus a number of commercial offerings I can't really list here.  A Google search on "machine learning automation" or "data science automation" yields new links every week.  These efforts focus on the data preparation and train steps.  They aim at automating three tedious tasks of the machine learning workflow:

  • Feature engineering, i.e. the data transformation steps that make machine learning easier.  This includes treatment of missing values, dimensionality reduction (e.g. principal component analysis), new feature creation, etc.
  • Model selection.  Which type of machine learning algorithm should be used?  Linear regression? Decision trees? Support vector machines?  Deep learning?
  • Hyperparameter optimization.  Most machine learning algorithms come with a set of parameters (e.g. learning rate, regularization coefficients).  Which values should be set for these parameters?

These three tasks can be automated in principle because they are essentially a trial and error process.  For each of these tasks, there are a number of options a data scientist can use.  Selecting among them often amounts to just trying them on some training data and seeing what quality of prediction they give on some validation data.

The difficulty comes from the combinatorial nature of the task.  If you have 10 different ways to do feature engineering, 10 different algorithms to consider, and 4 parameters with 10 possible values each for each algorithm, then you end up with 1 million possibilities to try.  Each possibility is one feature engineering algorithm followed by one machine learning algorithm with one value for each of its 4 parameters.  This combination defines one machine learning pipeline. 

The goal is to find the best possible pipeline among the million possible pipelines.  A brute force approach isn't doable by hand, and hardly doable at all in an automated way.  But one could see this as an optimization problem, and try to be clever about how to explore this large number of possible machine learning pipelines.  This is basically what the efforts listed above try to do.
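The idea can be sketched in a few lines of plain Python: describe the (toy) pipeline space and use random search instead of brute force.  The scoring function below is a made-up stand-in for training a real pipeline and measuring it on validation data; the dimensions match the 10 x 10 x 10^4 example above:

```python
import random

# Toy pipeline space: 10 feature-engineering options, 10 algorithms,
# and 10 values for each of 4 hyperparameters -> 1,000,000 pipelines.
DIMS = [10, 10, 10, 10, 10, 10]
n_pipelines = 1
for d in DIMS:
    n_pipelines *= d
print(n_pipelines)  # 1000000

def score(pipeline):
    # Stand-in for "train the pipeline, measure quality on validation data".
    # A real system would fit a model here; we just return a deterministic
    # pseudo-random score per pipeline.
    return random.Random(hash(pipeline)).random()

def random_search(n_trials, seed=0):
    # Sample a small fraction of the space instead of brute-forcing it.
    rng = random.Random(seed)
    candidates = [tuple(rng.randrange(d) for d in DIMS)
                  for _ in range(n_trials)]
    return max(candidates, key=score)

best = random_search(200)
print("best pipeline:", best, "validation score:", round(score(best), 3))
```

Smarter strategies (Bayesian optimization, evolutionary search as in TPOT) replace the uniform sampling in random_search with something that learns from previous trials, but the overall shape of the loop is the same.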

So, aren't we on the verge of machine learning automation?  I'd say yes and no.  Yes, lots of the tedious part of machine learning is going to be automated.  Selecting the right learning rate will no longer be a key human skill.  But I do think there is room for smart data scientists.  Feature engineering can be quite tricky and I doubt all of it can be automated.

Interpreting the quality of a machine learning algorithm's output is also a challenge: it can't be summarized by a single number.  We said above that selecting the best machine learning pipeline is an optimization problem.  It is, more precisely, a multi-objective optimization problem.  Indeed, there are many ways to quantify the quality of the predictions made by a machine learning pipeline.  Common evaluation metrics include the F1 score, area under the ROC curve, recall, precision, accuracy, etc. The fact that there are many evaluation metrics simply indicates that human judgement is key, and will remain key.

From the above, my advice to data scientists is to prepare for a shift in their practice: if their value comes from their skill at selecting machine learning algorithm parameters, then they should worry.  If their skills are about how to map a business problem into a machine learning problem, and drive the machine learning workflow all the way to a deployed application, then they are on the safe side.



May 12, 2016

Silicon Valley Data Science

Talking About the Caltrain

Date: May 6, 2016
Location: SVDS HQ
Speaker(s): Harrison Mebane and Christian Perez


On May 6th, SVDS was honored to host an Open Data Science Conference (ODSC) Meetup in our Mountain View headquarters. The goal of ODSC is to bring together the global data science community to help foster the exchange of innovative ideas and encourage the growth of open source software. They have several upcoming conferences, and you might see us there!

Data Engineer Harrison Mebane and Data Scientist Christian Perez presented on a long-running internal project here at SVDS: how we observe and predict Caltrain delays, process data using a data platform built from open source components, and deliver those insights to riders via our mobile app. Our audience was a diverse crowd of students, data scientists, and engineers.

The Caltrain project is designed to show how disparate data sources can be pulled together to create a robust data-driven application. We also use the project to try out new tools and refine our methodologies. It helps that it’s an interesting (albeit hard) problem to solve (especially if you ride the train as much as our team does). Harrison and Christian discussed our network of audio and video sensors directly observing the track, analysis of Caltrain-related tweets to detect catastrophic events, and using data from Caltrain’s near-real-time feed to find patterns and make predictions about future delays.


Takeaways (Making data applications is hard!)

Bringing together data from many different sources is a non-trivial task. How can we best combine audio and video feeds to get the most accurate real-time detection? How do we use Twitter to discern the state of the system as a whole? How can we use past arrival data to predict future performance? Each one of these questions is difficult to answer, and combining all of these data sources into a comprehensive view for an end user is harder still.

While we don’t claim to have completely solved the challenges above, a few themes have emerged:

  • Use the right tools. We live in an era where it seems like several new data projects are released every week, many of them open-source. This is both a blessing and a curse — you have lots of choices, but … you have lots of choices. We use Kafka for production-ready communication between different data sources, Python and Spark for streaming analytics, and HBase and Impala for data persistence and ad-hoc analysis. There are many other good choices out there. If you are curious as to why we went with these, just ask us!
  • More data sources are (usually) better. There are myriad advantages to collecting data from several sources. It makes your data applications more robust to failure, it gives you a multi-faceted view of the system you are analyzing, and it allows for better predictions than any one individual source could give you. Of course, if you don’t have an appropriate data platform in place, or you haven’t gotten around to analyzing the data you already have, adding additional sources may just end up being a distraction. Make sure you know how to use the data you are collecting.
  • The ability to visualize is crucial. Whether it’s making plots for exploratory analysis or putting up a dashboard to track data as it comes in, it’s hard to overstate the importance of visualizations. A good summary plot can open up new avenues for analysis, and a well-designed dashboard can alert developers to ingestion problems as they are happening. They also make cool presentation slides!


Next steps

As difficult as it can be, designing a data-driven app for Caltrain riders has been a labor of love. Our Caltrain Rider app is currently available on iOS and Android. Follow our blog (you’ll find other Caltrain posts here and here) and the app for updates as we continue to refine our predictions and put them into production!

The post Talking About the Caltrain appeared first on Silicon Valley Data Science.


Customer Value Management tips from Steve Jobs

In his final keynote speech at the 2011 Apple Worldwide Developer’s Conference, Steve Jobs remarked that, “If the hardware is the brain and the sinew of our products, the software is its soul.” Jobs’ intimate understanding of and vision for his products stands out as one of the key reasons behind Apple’s success. His notoriously protective stance on his company vision and the extent of his involvement in the conception, design and development of his products right up until their anticipated release is legendary. But the man behind Forbes’ most valuable brand of 2015  also knew a little something about value creation and customer value management.


Why Do Big Data Initiatives Fail?

Late in 2015 PwC and Iron Mountain published a report on “How organizations can unlock value and insight from the information they hold”. The report was based on an exhaustive survey of 1800 top executives at medium and large companies. The results were quite deflating for the supporters of big data and analytics. 43% of the organizations surveyed said they got “little tangible benefit from their information” and a further 23% said they “derived no benefit whatsoever”. That’s 2 out of every 3 organizations reporting disappointing impact from their data-related initiatives. Richard Petley, director of PwC Risk and Assurance, gave voice to the concern, saying, “Data is the lifeblood of the digital economy, it can give insight, inform decisions and deepen relationships. Yet when we conducted our research very few organizations can attribute a value and, more concerning, many do not yet have the capabilities we would expect to manage, protect and extract that value.”


With so much talk of organizations adopting big data and analytics initiatives, this would suggest that such initiatives are not delivering; failing, in other words. So why do big data initiatives fail? Depending on who you talk to, you will get different answers to this question. From our position as a software-development-focused company that helps client organizations put big data and cloud-focused initiatives into place, here are four factors we would like to highlight.


  1. Disconnect between the data and the systems that use it: Clearly there is no problem with the volume of data. It is flooding into the organization from all sides – every customer interaction, or even intent, is monitored, logged and made available for analysis. The same is true of data relating to operational efficiency and even the broader ecosystem in which the organization operates. But is the relevant data being collected, and is it being channeled for deriving the insights that matter? Beyond that, we have observed substantial gaps between the maths that the data scientists apply to the data and the software and systems that then have to apply these mathematical algorithms to derive insights the organization can use. A bridge is needed between the science in the data and the computing technologies of big data; without it, the data cannot deliver.
  2. Not designing for business impact: Ironically, many big data and analytics initiatives are doomed to fail even before they start. In these cases, the problem is not in how the initiative is implemented but rather in how it is envisioned. What are the key business questions you want your data to answer for you? What is the business impact you are seeking? Framing and then answering these questions accurately will help you define what data to collect, how to collect it and what treatment to put it through to get those insights. Starting without this end in mind will make your big data initiative a costly experiment of the IT Department in everything but name.
  3. The technology tangle: In most instances organizations know their business but do not have visibility into the technology landscape. This gets in the way of putting together a coherent and complete solution that turns input into insight, and often prompts inappropriate “fad of the day” technology choices. More than once we have found initiatives running aground because of the technology choice, with significant time expended in trying to prove the technology works rather than in making the technology work for you. Technology choices are many, and we always recommend taking a considered decision keeping your specific business needs in mind, including factors like future scalability and security, and only after reasonable trials.
  4. The money shot: In case you had not noticed, big data initiatives cost money – sometimes lots of it. Discussions about big data sometimes seem to ignore the fact that apart from the costs associated with the software, there are likely to be significant costs in hardware and infrastructure – all that data has to be stored, secured, transmitted and treated somewhere. While using the cloud could reduce this burden (and shift costs from the Capex to the Opex budget), this cost is still not trivial. Beyond this, there are costs associated with the changes the organization will have to make to systems and processes, training costs and even hiring for new skill-sets. This is why big data and analytics initiatives need the sponsorship of the very top management in the organization. How else to get the money and organizational commitment required to pull this off?

Admittedly this list is not complete – our view is obviously coloured by our experiences and there are a whole bunch of other factors that can make or break big data initiatives. We do think though that it is important to stay positive and believe. A Capgemini report from last year showed that only 8% of the managers surveyed said their own big data initiatives were “very successful” and 27% thought their initiatives were “successful”. On our part, we would like to focus on the 60% of managers in that same report who believed that over the next 3 years, big data would change the world, including the industries they themselves were in. For organizations that have similar hopes from their big data initiatives, we would suggest a long hard look at the factors listed here!