
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


October 25, 2016

Data Digest

How NOT to do Customer Experience initiatives


In this 3-minute video, Peter Strohkorb talks about how businesses can easily get their drive towards customer centricity, customer experience and customer satisfaction wrong. Hint: it does not start with Net Promoter scoring.

If you don't want to watch the video, there is a transcript for you below.


Hello, I’m Peter Strohkorb,

I have a lot of my clients and prospects talk to me about sales and marketing team alignment, but really, if you think about it, it’s all about customer centricity.

So if we can help sales and marketing teams to align more effectively to focus on the customer then they will automatically want to work together. So that’s no secret there.

But a lot of people then come and say, “Well, if it’s about customer centricity shouldn’t we look at something like Net Promoter Score or at Customer Satisfaction Reviews and that sort of thing?”

And I always tell them this: Only conduct Net Promoter Score exercises or customer satisfaction reviews if your organization is actually ready to respond to the feedback that you’re getting from your clients.


Nothing is more annoying to a client than to be asked their opinion, “Hey, what do you think of our business?” And then we go: “Well, we think this and this and this, and it would be better if you did that and that, and we have competitors doing this, but you’re not.”, and so on. And “You know, I’m hoping that it is good feedback for you.”

So then, if we take that feedback in but we are not in a position to change anything in response to it, what is our customer going to think about how valuable we find their feedback? As a customer, I would think: “Well, you don’t really care about my opinion if I see no change (in response) to the feedback that I’m giving you.”

So therefore, there’s a logical sequence to getting this right: If you want to embrace customer centricity and you want to embark on a journey towards better customer experience, then you better start inside the organization and make sure that the organization is actually ready to give that response to the feedback, and to give that customer experience that you’re looking for.

You better start inside the organization and make sure that the organization is actually ready.

So how do you know whether you are ready? Well, there’s another tool that we endorse, called the MRI or Market Responsiveness Index, that helps you actually measure the readiness of your organization to be customer centric.

So there are seven or eight (depending on the metric you use) parameters that we can measure within your organization, both inward looking and outward looking to a customer, to give you a score against. Like I said, seven or eight parameters in terms of how ready your teams within your organization are to even be customer centric.

And this is a fantastic exercise to benchmark your organization, and to see where are our opportunities, where are the challenges, what will take a bit longer, where do we need to put our efforts, and then to keep measuring the progress that you’re making as an organization towards customer centricity.

Now, at the stage when you are there and the organization is agile enough to respond to the customer’s feedback, then you can really embrace customer centricity and you can embrace things like Net Promoter Score or Customer Satisfaction reviews. But the important thing is to get the organization ready first, and to know when you’re ready, and how to improve continuously thereafter.

So, if you’re interested in any of these things, here is a link to more information. Also, feel free to contact us and have an obligation-free conversation about what we can do for you, and how you can benefit.

Thanks very much! I’m Peter Strohkorb. I’ll talk to you soon, bye-bye.

This article was posted with permission from the author. See the original post here.

By Peter Strohkorb:

Peter Strohkorb is the CEO of Peter Strohkorb Consulting International, a published author, an international speaker and executive mentor, as well as an Executive MBA guest lecturer. Peter developed the 5-Step OneTEAM Method®, which is the only structured program to lift business performance through superior and sustainable collaboration between your two most customer-facing and revenue-generating functions, namely Marketing and Sales.


October 24, 2016

Revolution Analytics

Webinar: Changing Lives with Data Science and R at Microsoft

If you didn't have a chance to catch my presentation at the Machine Learning and Data Science Summit, I'll be reprising an updated version of the talk in a live webinar on Tuesday, November 1. I'll...

Cloud Avenue Hadoop Tips

Hadoop/MR vs Spark/RDD WordCount program

Apache Spark provides an efficient way of solving iterative algorithms by keeping the intermediate data in memory. This avoids the overhead of reading/writing the intermediate data to and from disk, as is the case with MR.

Also, when the same operation is run again and again, the data can be cached in memory and fetched from there without recomputing it, as sketched below. MR is stateless: if a program/application in MR has been executed 10 times, the whole data set has to be scanned 10 times.
Also, as the name suggests, MR supports only Map and Reduce operations, and everything else (join, groupBy etc.) has to be fitted into the Map and Reduce model, which might not be the most efficient way. Spark supports a number of other transformations and actions besides just Map and Reduce, as mentioned here and here.
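
As a rough illustration of caching (not part of the original post), the PySpark snippet below marks a filtered RDD to be kept in memory, so the second action reuses the in-memory data instead of re-reading and re-filtering the input file. The input path (reused from the WordCount example below) and the filter terms are placeholders.

from pyspark import SparkContext

sc = SparkContext("local[*]", "CachingExample")

lines = sc.textFile("hdfs://localhost:9000/user/bigdatavm/input")

# cache() keeps the filtered RDD in memory after it is first computed.
errors = lines.filter(lambda line: "ERROR" in line).cache()

# First action: reads the file, filters it and materializes the cached RDD.
print(errors.count())

# Second action: reuses the cached RDD; the input file is not read again.
print(errors.filter(lambda line: "timeout" in line).count())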

Spark code is also more compact than the equivalent MR code. Below is the program for performing WordCount using Python in Spark.
from pyspark import SparkContext

logFile = "hdfs://localhost:9000/user/bigdatavm/input"

sc = SparkContext("spark://bigdata-vm:7077", "WordCount")

textFile = sc.textFile(logFile)

# Split each line into words, pair each word with 1 and sum the counts per word.
wordCounts = textFile.flatMap(lambda line: line.split()).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Transformations are lazy; an action such as collect() triggers the computation.
for word, count in wordCounts.collect():
    print(word, count)

The same program in the MR model using Hadoop is a bit more verbose, as shown below.
package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper extends
      Mapper<Object, Text, Text, IntWritable> {

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends
      Reducer<Text, IntWritable, Text, IntWritable> {

    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Here we looked at WordCount, which is similar to a HelloWorld program in terms of simplicity, and a good way to get started with a new concept/technology/language. In future blogs, we will look into some more advanced features of Spark.

David Corrigan

The Post-it Notes of the Big Data World

This past weekend I decided to tackle the mess in my basement. In my last post, I compared the modern data lake to my “thing lake” in the basement – a random collection of stuff that may, or may not, be...


October 22, 2016

Simplified Analytics

Watch these Movies for Big Data Analytics & Machine Learning

Business analytics and Big Data have not only got the business and technology industry excited, but have also influenced many movie-makers over the last few decades. It would be a big miss for data...


October 21, 2016

Revolution Analytics

Because it's Friday: Hitchcock vs Kubrick

We've seen (or rather heard) plenty of music mashups, but here's a movie mashup: Hitchcock's Rear Window augmented with various characters and scenes from Stanley Kubrick movies: The Red Drum Getaway...


Revolution Analytics

Election 2016: Tracking Emotions with R and Python

Temperament has been a key issue in the 2016 presidential election between Hillary Clinton and Donald Trump, and an issue highlighted in the series of three debates that concluded this week....


Curt Monash

Rapid analytics

“Real-time” technology excites people, and has for decades. Yet the actual, useful technology to meet “real-time” requirements remains immature, especially in cases which call...


October 20, 2016

Revolution Analytics

R Tools for Visual Studio 0.5 now available

R Tools for Visual Studio, the open-source Visual Studio add-in for R programmers, has a new update available for download. RTVS 0.5 makes it easier to run R within SQL Server 2016 as a stored...

Cloud Avenue Hadoop Tips

Intersection of Big Data and IOT

Lately I have been blogging about IoT and Big Data. They go hand in hand, with IoT devices feeding data to Big Data platforms in the Cloud, which store and analyze the data and feed the results back to the IoT devices. One such example is the Nest thermostat (Nest was bought by Google): it gathers the room temperature, sends it over Wi-Fi to the Cloud, where the analytics is done, and the result is fed back to the Nest again.

This is not something new; it has been around for quite some time. But conditions are now moving towards wider adoption of Big Data and IoT: cheap sensors, the Cloud, and cheaper and faster internet connections. Here is an article from ZDNet on ten practical examples of the intersection of Big Data and IoT. For those who are interested in learning more, here (1, 2) are a few more references from ZDNet.

I am yet to try it out, but here is a detailed article from IBM Developer Works on building a temperature sensor using an Arduino Uno, putting the data in the Cloud and finally visualizing the data in real time. Here is another article on how to integrate IoT with Azure.
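
As a rough sketch of the idea (not the code from the IBM article), the Python snippet below reads temperature readings from an Arduino over a serial port and posts them to a cloud HTTP endpoint. The serial port name, the endpoint URL and the one-reading-per-line format are assumptions for illustration only.

import json
import serial    # pyserial
import requests

PORT = "/dev/ttyACM0"                             # assumed serial port of the Arduino
ENDPOINT = "https://example.com/api/temperature"  # placeholder cloud endpoint

arduino = serial.Serial(PORT, 9600, timeout=5)

while True:
    line = arduino.readline().decode("utf-8").strip()
    if not line:
        continue
    try:
        reading = float(line)   # assumes the Arduino prints one numeric reading per line
    except ValueError:
        continue
    # Push the reading to the Cloud for storage and real-time visualization.
    requests.post(ENDPOINT,
                  data=json.dumps({"temperature": reading}),
                  headers={"Content-Type": "application/json"})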

Silicon Valley Data Science

We Need a New Data Architecture: What Next?

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

You’ve realized that your business needs a new data architecture, but what next? Data systems serve more stakeholders than ever before; new technologies constantly become available, and competitors are moving faster than ever. How do you make the right decisions and manage the risks of moving to a new data platform? The key is in understanding the attributes of a good modern data architecture, and in adopting a data-centered view of your business.

The challenge of moving forward

Enterprise IT faces a pressing need to understand the new data architectures required to support business. The demand is universal, due to massive volumes of data from internet and mobile applications; the need to generate competitive advantage from new types of information, such as streams from Internet of Things sensors, images, or social media; and the expectation that we create the kinds of friendly online analytical products that users are now familiar with thanks to Google, Facebook, and other web services.

It’s clear from the explosion of interest in newer platforms and technologies that the old tools and licensing costs don’t work to meet new business needs. Open source, cloud, and scale-out distributed systems create a new cost model. IT needs to know how and when to use platforms such as Hadoop, Spark, AWS, Azure, or Google Cloud.

The path forward isn’t so clear, however. It’s reasonable to be worried about getting your fingers burnt. For example, some early adopters rode the NoSQL wave with enthusiasm, but discovered that no single type of database met their needs: many such databases come with hefty requirements for extra programming. They have since reverted to relational databases or, more commonly, adopted a hybrid architecture.

The world has moved into a business technology space-race era. It’s not enough to support stable business processes; IT must also support innovation and iteration. Early adopters in that race reap the rewards of trying audacious new things—but they also bear the non-trivial costs when some of those things explode.

As a modern business, your challenge is to move data infrastructure towards creating a platform that sustains today’s business needs, the innovation process, and future use cases—all while managing the risks of the unknown, and delivering valuable results to the business.

Managing the journey towards new data technologies

At SVDS, we hone our approach through R&D, then export the learnings of early adopters in a way that makes sense at enterprise scale. Adopting and engineering new data platforms is an inescapable requirement for most businesses, and we have devised a method for creating modern data architectures that work, even in the face of rapid change.

A good modern data architecture:

  • supports current and future technical capabilities,
  • considers fit with existing architecture,
  • selects the appropriate technology platforms, and
  • delivers a plan for implementation.

Being adaptable and future-proof means that you need to spend a lot of time considering what we call the “data value chain”—the stages of data as it enables your business: discovery, ingest, processing, persistence, integration, analysis, and exposure. Your current and future requirements at these stages often have the largest influence on technology selection.

For example, the need for real-time analytics not only mandates certain performance requirements for data processing, but also requires service guarantees from data ingest, and has an effect on how the results of those analytics are exposed back to the organization.

By thinking data-first, rather than application-first, you can avoid the costly data silos that prevent so many businesses from leveraging their data. And when new technologies emerge, as is inevitable in today’s fast-evolving market, you have a strong vantage point from which to judge them.

Business comes first

By far the biggest factor in any architecture analysis is the needs of the business itself. An effective modern CIO creates an enabling platform for the business to innovate and build upon. The IT mindset must transition away from policies governing users, towards creating tools that enable them.

That’s why you just can’t “get a Hadoop” and make everybody happy: simply installing a tool and kicking the tires doesn’t translate into a serious-minded exploration of your present and future data needs. In other words: modern data architectures are not a shrink-wrap problem. If it were that easy, then the business advantages wouldn’t be so radical.

Though challenging to get right, the ultimate benefit of a modern data platform is to foster innovation and make business more agile. If you don’t transform into a data-driven company, you become deeply vulnerable to data-driven competitors. To dive deeper into the kind of business advantage a modern data architecture makes possible, read my essay on building an Experimental Enterprise.

The post We Need a New Data Architecture: What Next? appeared first on Silicon Valley Data Science.

Data Digest

Will Artificial Intelligence help Big Data deliver on its promise?

One of the major trends I have been researching recently has been the shift in interest towards Artificial Intelligence (AI) in its multiple forms and guises, and the potential it has to analyse vast quantities of data and quickly derive actionable insights. As we all know, AI, Machine Learning and Deep Learning are not new. However, there has been huge investment in the space in recent years, and the ability to automatically apply complex mathematical calculations to Big Data – over and over, faster and faster – is a recent development. With steady advances in digitisation and cheap computing power, it is no wonder people are excited about the possibilities.

One of the areas of AI that gets the most attention is Deep Learning. Researchers have been attempting to train algorithms since the 1970s, but limitations, be they computational or data-related, slowed that progress. Whilst the algorithms we use today were, for the most part, created decades ago, we were unable to use them effectively. It wasn’t until new technology became available that they could be applied to massive amounts of data, cheaply and quickly enough that they had the chance to live up to their full potential and help Big Data live up to its promise.

One area which will be interesting to observe is the relationship between Data Scientists and AI. As AI and Machine Learning progress and evolve, some of the more basic and straightforward tasks that Data Scientists perform routinely will become automated, yielding great gains in productivity. AI is certainly not going to replace Data Scientists any time soon, and can in fact be a massively helpful tool to utilise; however, how will they view it: friend or foe? Could this also be one of the many ways that the industry can combat the talent deficit, automating the more basic tasks and reserving the more complicated Data Science processes for the Data Scientists?

Today, AI enables computers to communicate with humans, autonomously drive cars, write and publish sports match reports, beat humans at board games and find terrorist suspects. The possibilities are endless and will no doubt change the ways in which we live our lives in the future. Not only does AI present new possibilities in our day-to-day lives, but also within wider-reaching areas such as national cyber security, and special projects to combat human trafficking and arms dealing, such as the collaboration between NASA and DARPA. When speaking with an expert recently, we discussed an algorithm which predicts the likelihood of a criminal reoffending and the ways in which it was being used in a courtroom setting to help judges determine a sentence and future parole opportunities. There are also huge opportunities within healthcare, and with recent technological advancements, in some instances we are able to predict whether an individual will develop a certain disease before they even show any symptoms, just by analysing different aspects of their lives. The possibilities for this technology are endless, and whilst for some this is truly exciting, for others it is a step too close towards a Minority Report, I, Robot, 2001: A Space Odyssey type of future.

No matter what your philosophical view of our future, increasingly, the focus on AI/Machine Learning in analytics corresponds to the next logical step, which is gaining advanced insights from Big Data, the ability to accurately predict outcomes, improve productivity, and gain competitive advantage. Whilst it’s taken a few years to build the right infrastructure to store and process massive amounts of data, this was just the first step. Now, AI/Machine Learning is driving us forward and the combination of Big Data and AI will present incredible opportunities and drive innovation across almost all industries.

AI, Machine Learning and more will be discussed at the Chief Analytics Officer Forum, Fall on October 5-7 in New York. Join us on October 5 for our dedicated Pre-Conference Focus day Machine Learning, Deep Learning and AI for Strategic Innovation and hear about the ways in which leading companies are using AI in innovative ways within their companies. For more information, visit

By Vicky Matthews:

Vicky Mathews is the Content Director US/Europe for the CAO Forum. Vicky is the organiser of the CAO Forum, Fall, consulting with the industry about their key challenges and trying to find exciting and innovative ways to bring people together to address those issues. For enquiries email:

Data Digest

Executives Still Relying on Gut, Not Gigabytes in Planning for Future

At Corinium’s recent CDAO Melbourne event, we were thrilled to have on board John Studley, Lead Partner for Data and Analytics and Danielle Malone, Director for Insight Analytics at PwC. We caught up with them at the conference to find out more about the intriguing figures announced in their recent report, ‘Big Decisions.’ The global report found that Australian organisations lagged when it came to leveraging data to make business decisions that drive the enterprise strategy. The report states that “61% of Australian organisations admitted that their decision-making process is only ‘somewhat’ guided by data.” We asked John to shed some light on why that might be.

Corinium: In PwC’s latest report on unlocking data possibilities with advanced analytics and machine learning, some really surprising statistics are revealed, demonstrating how far behind Australia is when it comes to leveraging data analytics to drive business decisions. Why do you think that is?

John Studley, PwC: Two main reasons. It’s clear to me that many Australian companies still have a cultural blind spot with data analytics. There is a pervasive psychology in this country of self-sufficiency being a sign of strength. It reminds me of the classic King Gee “I know boats” commercial. So as a result there’s a lot of talk and too little serious action. Where it is occurring, it’s predominantly siloed and under resourced. The second reason is that most senior executives are gun shy from past poor experiences with large IT projects. Relatively low data and technology literacy of Boards and senior executives meant they were often caught out relying on others for very costly and poorly governed waterfall IT programs. Perversely this has made them overcautious in adopting next generation analytics applications despite the cost and utility advantages of cloud computing, processing power and open source tools.

Corinium: Who is topping the international leader board when it comes to data and predictive analytics?

John Studley, PwC: Probably not who you think. The leaders tend to be in the heavy industries including oil and gas, mining, rail, aeronautics and energy. They learnt the skills from the need to optimise highly calibrated equipment, prevent maintenance downtime and improve safety. They look forward and use data quite aggressively. I would say the most successful offshore retail banks are leading with Retail and Consumer organisations close behind them. But within each industry it is pretty mixed. In Australia, it’s mistakenly more fashionable to look back than to look forward. Count the number of staff producing valueless monthly reports and chasing their tails resolving ad hoc queries of what happened. In my view there’s a whole industry of unnecessary work out there collating aggregated non-actionable guff.  I would stop that and redirect it to analytical insight.

It’s still difficult to find sufficient internal funding in Australia for serious data analytics innovation – which is why we are now at least two to three years behind.

Corinium: Why is it that big international retailers are a step ahead of Australia?

John Studley, PwC: Necessity is the mother of all invention. Where organisations operate in highly competitive markets you will find innovation to get ahead. Our market is smaller and less competitive with only two or three major players in each industry. That’s why we pay more for books, shoes and downloads. Bill Gates once predicted that the winners will be those that harness the power of information and do a better job of it than their competitors. And it’s simple, if you know more, then you can do more. Which is why the smarter organisations have invested significantly more in building their data analytics capability and trying some new things. The survey confirmed that it’s still difficult to find sufficient internal funding in Australia for serious data analytics innovation – which is why we are now at least two to three years behind.

Corinium: Within Australia what types of companies are really driving the data analytics agenda for improved performance that you have seen?

John Studley, PwC: The best innovation is coming from smaller organisations that are developing proprietary technology applications. They can move faster to develop and then partner with larger players to distribute or deploy. Let’s not forget the Amazons and Microsofts of this world have made the rampant acquisition of small innovators part of their strategy for decades. I have seen a few of the local banks investing here in building seriously good capability and like what the miners have done on remote operations and optimisation. Certain Government agencies are also leading which is very encouraging as they will become the role models. Everyone can do it, it just takes courage and leadership commitment to the opportunity data analytics provides for their customers and staff.

Corinium: People are often telling us that they are interested in learning more about the latest artificial intelligence technologies but are not sure how to apply them to their own business problems. What would be your advice for companies wanting to get started with advanced analytics and machine learning?

John Studley, PwC: My advice would be don’t wait, dive in straight away. Identify the one or two serious things that will make a difference and kick off a pilot. Iterate the solution and scale what works. Kill it after six weeks if you need to and start another. My brother told me a story this week: five years ago he couldn’t decide which particular type of investment property to buy. He realises now that it probably didn’t matter which one he picked, as it would have been a great investment either way. I would say the same about advanced data analytics and machine learning. Find some partner organisations you trust to work with, pin your ears back and just get started.

Corinium: Thanks John. To find out more, visit:   

As more and more people within businesses talk about the value of data-driven insights, there are more C-suite level roles being created… different skillsets are required. And as this role gains popularity within the business, so too does the amount of pressure to deliver results – and quickly.

Corinium: Danielle, your presentation in Melbourne was entitled, “Championing Trust – Ensuring the CDAO Remains a Strategic Force.” You talked about the ‘perception of unabated trust’ around the CDAO – why do you think this exists?

Danielle Malone, PwC: The key point I addressed is that unabated trust for people in these roles no longer exists, or is in decline. Historically, the Head of Analytics (for example) was valued for his/her technical prowess. As more and more people within businesses talk about the value of data-driven insights, more C-suite level roles are being created. And as these newly created ‘c-suite’ roles emerge, different skillsets are required. And as this role gains popularity within the business, so too does the amount of pressure to deliver results – and quickly. This can be challenging for CDAOs within the business – how does one manage c-suite counterparts who expect fast results, when key elements of one’s role (e.g. data governance) take time to get right? This challenges the concept of unabated trust.

Corinium: What strategies do you propose for CDAOs to better manage their trust networks across the organisation?

Danielle Malone, PwC: Quite simply, do not discount the importance of this. I regularly see people deprioritise the importance of building trust networks and gaining support for the long term (i.e. sponsorship), whether that be when they start out in a role, or when they are 1-2 years into it. It is easy to rely too heavily on technical capability to pull you through, or to deprioritise networks in favour of other things you may deem more important.

To increase the odds of making it through the first two years of a c-suite role, relationship trust will be key. People will judge you on your behaviour, not your intentions. And this starts with great leadership - are you an enabler of progress, or a true leader? I recently listened to a powerful podcast via Six Pixels of Separation titled ‘Super bosses with Sydney Finkelstein’ (it can be found on the Stitcher app), and found it to be accurate in its description of how the right leadership can help foster organic and unabated trust both within your team and across the organisation.

And trust is a relationship built between two or more people. I have seen various tools of influence. I have witnessed the application of game theory in the context of working relationships – recognising and planning for the choice between co-operation and competition. In order to achieve a mutually productive outcome, you are better off co-ordinating strategies with those in your networks, because if each of you pursues the individual payoffs, one or both are likely to fail.

Another pursuit on trust is the concept of a Working Council. It’s an engagement model I have seen work to build trust and secure outcomes from others within the c-suite. It isn’t a general project council, or a project steering committee. It’s an enduring partnership between a number of wise investors. Investors in the success of data and analytics some might say. The information pack that I distributed to all attendees focuses on some practical points on how these may best be run within your organisations.

These are all very basic strategies, but in a busy world it is easy to place these things aside in favour of immediate deadlines.

Corinium: What are the common challenges faced by the CDAO?

Danielle Malone, PwC: The key word is ‘common’, because challenges are heavily dependent on the tenure of the individual, the organisation, the industry, or otherwise. With this in mind, common themes I see are:

  1. The right leader for the role 
  2. Clarity of role/value proposition you will bring to the table 
  3. Reporting structure 
  4. Support for the long term (i.e. sponsorship)
  5. Budgetary responsibilities 
  6. Ability to manage the pace of external drivers (e.g. regulatory change, digital expansion, etc.)

Data Digest

JUST RELEASED: Chief Analytics Officer, Fall 2016 - Speaker Presentations

The Chief Analytics Officer, Fall event brought together more than 200 Chief Analytics Officers, Data Leaders, Senior Analytics Experts and Innovators in New York on October 5-7. The three-day conference was filled with networking, high-level insight and discussion, addressing the hottest topics and challenges faced by CAOs and Senior Data & Analytics professionals.


The Snakes and Ladders of Customer Loyalty

Your customers go through numerous milestones in their journey through your business: the initial interest, the first purchase and opening of an account, (hopefully) paying their accounts on time, maybe signing up for your loyalty programme (and being comfortable telling you more about themselves).

Teradata ANZ

Advanced Analytics in Audit

I am fortunate enough to be on a voyage of discovery in the world of analytics courtesy of Teradata. When I joined Teradata nearly 18 months ago I had a good grasp of how analytics were used in IT audit and had used them regularly in engagements. I was even fortunate enough to have a data analytics team work with me in one organisation. These analytics proved to be excellent at investigating specific audit topics and applied very well to the usual HR and credit card work. The analytics we did were useful in investigating misuse of corporate networks, which I often referred to as “trawling for porn”. However, while the audit tools available were valuable and excellent at ensuring defensible evidence of findings, they were limited to traditional structured data sources, and to some degree so was my thinking.

Now on my journey, I have been fortunate enough to work alongside talented data scientists who have shown me what the new generation of tools, the power of integrated and/or federated data, as well as the potential of open source and unstructured data can do. The Big Data technologies are opening up a new world of opportunity, allowing the combination of traditional corporate structured data sets with non-traditional unstructured data.

While auditing I had believed the nirvana for data analytics was to be able to conduct continuous monitoring of data feeds to detect events such as control breaches for investigation. While this is still desirable, I have learnt that this is only the beginning. Continuous monitoring is great at determining when an event occurs; however, wouldn’t it be better to be able to predict and avoid an undesirable event where possible, rather than taking action to resolve it once it has already happened? This brings in the concept of Predictive Analytics, a term Gartner uses in an optimisation path for Analytics. So now, for me, being able to identify the events that lead to an undesirable occurrence or behaviour on a continuous stream of data is the new nirvana.

Advanced Analytics is the evolution of traditional analytics: it is not limited to samples, it uses human/machine classification, provides alternative analysis and is premised on predictive rather than preventative outcomes. Improvements in technology platforms mean they are now not only faster and able to store more data, they also provide the capability to analyse unstructured data that it was not previously practical to analyse in audit. Techniques used by data scientists such as path analysis, graph engines and even psycholinguistic tools that can profile written text provide approaches I could only dream of previously. While a data scientist would tell me many of these techniques are not new, I wonder how many in the audit community know their potential for their own work. As a community that lives and breathes on identifying and managing risk, it appears to me that the application of advanced analytics in an audit program will provide significant benefit.

The audit function of an organisation is commonly limited in resourcing, and as such is hindered in applying new technology tools and sourcing data. However, the good news is that their organisations are moving towards more predictive analytics as a normal course of doing business. These new capabilities also represent an opportunity for the audit function to leverage, with the right influence of course.

It is impossible to explain the tools and techniques of advanced analytics in detail here, and I have only scratched the surface of the potential. However, from my learnings so far, I encourage my audit colleagues who aren’t already considering what advanced analytics could mean for their audit programme: there is real potential beyond the traditional data sources to gain insight into risk for an organisation and to detect control breaches and behaviours in ways not considered before.

The post Advanced Analytics in Audit appeared first on International Blog.


October 19, 2016

The Data Lab

Open Data Training: we loved it, we want more!

 Urban Tide Open Data Training

Building on the lessons learned and experience gained delivering the open data training pilot, Urban Tide launched a new set of open data training sessions and workshops, each of which has been developed based on a full year of training feedback.


The Open Data Journey often starts with a good workshop

The Scottish Government's training pilot was recognised as a success by Urban Tide, as the workshops fulfilled their aim of aiding Scottish public sector organisations on their journey to understanding more about Open Data, becoming familiar with its benefits, challenges and technical aspects, and eventually taking steps towards developing Open Data strategies and publication plans for their organisations. The results of the post-workshop evaluation showed that delegates recognised the importance, benefits and opportunities that open data can deliver, with a lot of people saying that after learning more about open data it was not as overwhelming as it first seemed. Almost a quarter of the respondents said that as a result of the training they could immediately take steps to improve their open data publication.


Open Data Advocacy - the birth of Open Data Champions

Open data advocacy was marked as an immediate action item by 38% of the participants, whether as internal and external engagement, collaboration, or developing an open data strategy or publication plan. Delegates found it important to brief their teams and engage them in open data, as well as to start securing senior buy-in.

Steven Revill, Urban Tide Chief Operating Officer and Principal Trainer: “It was fantastic to talk about open data with so many talented people from many interesting organisations. Although we delivered the training, the attendees made the workshops so effective because the breadth of knowledge among them meant there were always new insights to consider.”

Workshop after workshop participants agreed that one of the most useful aspects was to hear and learn from each other, and to see where other organisations were on the path toward open data. 

Workshop participants also expressed the need for more training opportunities for themselves and their peers.


Urban Tide’s new open data training programme

Urban Tide developed a training series with a variety of workshops, ranging from crash-course-style introductory training to one- and two-day workshops with a hands-on session where delegates can use their own data to learn how to open, cleanse, validate, visualise and publish data with Urban Tide’s guidance and support.

Initial sessions are being held in October, November and December in Edinburgh and Glasgow with more sessions throughout 2017 across Scotland. If you have any questions or want to talk about the open data training, visit Urban Tide’s website or send an email or tweet @UrbanTide.


Data Digest

The Role Of Chief Data Officers in a World Of Data Utopia

Utopia is defined as “an imagined place or state of things in which everything is perfect.” However, perfection is a rather elusive and frivolous concept. What is important is the very concept itself, which is ultimately what you hope to achieve with one of your organisation’s greatest assets.

Close your eyes, and imagine a place where information from data lakes cascades seamlessly down the various trenches of your enterprise infrastructure to service the needs of your internal consumers. Where the data captured is already cleansed, accurate and fit for purpose. Your customer information is readily accessible and liberated from the confines of disparate siloes into one fully integrated data warehouse.

Now I want you to imagine the obstacles holding you back from effective Data Governance, falling like dominoes, paving the way for innovation and the ability to capitalise on robotics, artificial intelligence, machine learning and self-service analytics that’s as intuitive as a Google search. Data science, analytics, management and operations all perfectly aligned with your business strategy and driving value for your organisation. Is this data utopia?

Data Utopia and the Chief Data Officers

The dictionary defines utopia as “an imagined place or state of things in which everything is perfect.” However, perfection is a rather elusive and frivolous concept. So when I refer to a “data utopia”, this is likely to conjure up an array of connotations amongst Chief Data Officers and data executives alike, influenced by their own strategic thinking, experiences and their organisation’s prime objectives. The latter, I believe, should be the greatest factor in shaping data initiatives, as data strategy should be fundamental to optimising business outcomes.

Which brings me back to the notion of data as an asset that can provide numerous benefits and competitive advantages. However, without a strategic vision to get this asset working to supplement your goals and catalyse value creation further, data runs the risk of becoming a liability.

Why organisations must have CDOs

IBM’s CDO Playbook states that organisations with a CDO are 1.7 times more likely to have a big data and analytics strategy, and most Chief Data Officers agree that analytics and data science will form the pillars of an organisation’s journey to becoming truly data-driven. This is emphasised by Fabrice Otano, Chief Data Officer at Accor, who told me that “delivering hot analytics, mainly data visualisation based tools and communication” is amongst his key objectives. Gary Goldberg, Chief Data Officer at Mizuho International, further cemented the importance of analytics at our CDO Roundtable by stating, “The CDO should drive the entire data strategy including Data Analytics work”. He then went on to state that “One of the key challenges for a CDO is to ensure that data is made accessible and easy to use. Users need to be given the tools to extract value from the data”. Hence, if CDOs and their organisations wish to reach data utopia, is data analytics democratisation the holy grail?

Data democratisation is by no means a new concept; however, moving beyond a centralised data function and working towards the ubiquitous use of data analytics across the organisation certainly has its benefits, with Gartner’s hype cycle for emerging technologies predicting a plateau of productivity in relation to what they call “citizen data science” and “advanced analytics with self-service delivery” in the next 2 to 5 years.

Is Data Utopia an ambitious theory?

Although data analytics democratisation does appear attractive, Graeme McDermott, Chief Data Officer at Addison Lee, explored the pros and cons when asked his thoughts on the impact on business performance and the drawbacks: “(Data analytics democratisation) It empowers end users to visualise and ask questions of the data to support business challenges. (Drawbacks) The drop off in numeracy in graduates sees more and more people needing hand holding through the most basic of analysis? I still expect CDOs to preside over multiple data sources and tools…commercially it won’t always make sense to do the right thing and centralise data with federated access from single tool.”

I had the pleasure of discussing the role of the CDO with Jon Catling, Director of Global Data Architecture at Las Vegas Sands Corp. where he articulated the CDO as “a transient job, one that sooner rather than later must go away. It serves the purpose of enabling the manifestation of a concept.”

Perhaps the idea of data utopia is a naïve and overly ambitious theory; however, what is important is the very concept itself, which is ultimately what you hope to achieve with perhaps one of your organisation’s greatest assets. What does that look like?

As you re-imagine data utopia, does the Chief Data Officer exist at all or are they perhaps the vehicle which allowed the organisation to reach the promised land?

Join us at Chief Data Officer, Europe 2017

If you wish to learn more and hear directly from Chief Data Officers and data executives as they discuss their perspectives on maturing the enterprise-wide data transformation, join the Chief Data Officer, Europe 2017 taking place in Central London from the 21st to 23rd February 2017. If you wish to get in touch, you can email me at: or LinkedIn message.

By Andrew Odong:

Andrew Odong is the Content Director for the CDO Forum. Andrew is currently producing the CDO Forum, Europe 2017, researching with the industry about the opportunities and key challenges for enterprise data leadership and innovating an interactive discussion-led platform to bring people together to address those issues – the CDO Forum has become a global series having been launched in five continents. For enquiries email:

October 18, 2016


BrightPlanet

Just Launched: Deep Web University Monthly Blog Subscription

We do the research and the writing so you can reap the rewards. Starting this month, every month we’ll compile the highlights from our Deep Web University posts. We send a collection of blog posts straight to our subscribers’ email inboxes so they can get the most value out of the available intelligence. Sign up and […] The post Just Launched: Deep Web University Monthly Blog Subscription appeared first on BrightPlanet.

Jean Francois Puget

Spark Summit Europe


Please join me at the Spark Summit next week (Oct 25-27) in Brussels.  This is one of the yearly events where the Spark community gathers.  More details can be found at:

The Meetup

I will be talking about Machine Learning with my colleague Nick Pentreath at the meetup we organize right after the summit, on Thursday night.  Location details below:

  1. Spark and Machine Learning meetup:

    Brussels. October 27th from 6:30pm to 9:30pm (Brussels time) 


Here is the Agenda:

18:50 Introduction & update on  Data4Good Hackathon ( 

— Philippe Van Impe, Founder, European Data Innovation Hub & Brussels Data Science Community.

— Berni Schiefer, IBM Fellow


19:00 Creating an end-to-end Recommender System with Spark ML

Nick Pentreath, Jean-François Puget,

There are many resources available for building basic recommendation models using Spark. But how does a practitioner go from the basics to creating an end-to-end machine learning system, including deployment and management of models for real-time serving? In this session, we will demonstrate how to build such a system based on Spark ML and Elasticsearch. In particular, we will focus on how to go from data ingestion to model training to real-time predictive system.  

19:45 Lightning Talks

10-minute Spark and machine learning talks. 

1. Data Science as a Team Sport
Today, data science is very often an individual sport: data scientists and data engineers choose their own tools or flavor and work on their own.
Learn how Data Science Experience can make data science a team sport, bringing data scientists and data engineers together to make data science and machine learning available to everyone.

Presenter: Juergen Schaeck - IBM

2.  Telco data stream simulation, processing and visualization 

Koen will discuss the development of a prototype for processing data coming from cell towers, executed for a telco operator in the Middle East. The added difficulty was that the customer could not provide real data. In the end he developed a data generator in Scala/Akka, a data processor with Spark Streaming, and a visualization front-end with Node.js.

Presenter: Koen Dejonghe - Eurocontrol

3.  Hyperparameter Optimization - when scikit-learn meets PySpark
Spark is not only useful when you have big data problems. If you have a relatively small data set, you might still have a big computational problem. One such problem is the search for optimal parameters for ML algorithms.
Normally, a data scientist has a laptop with 4 cores (8 threads), which means a grid search will take some time. However, if you use Spark, it opens up the possibility of having the grid search carried out on a cluster with a higher degree of parallelism (a rough sketch of this idea follows after this item).

Presenter: Sven Hafeneger - IBM
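
As a rough sketch of the idea above (not the presenter's code), the snippet below distributes a scikit-learn grid search with PySpark: each Spark task evaluates one parameter combination via cross-validation. The dataset, model and parameter grid are illustrative placeholders.

from pyspark import SparkContext
from sklearn.datasets import load_iris
from sklearn.model_selection import ParameterGrid, cross_val_score
from sklearn.svm import SVC

sc = SparkContext(appName="DistributedGridSearch")

# Small illustrative dataset, broadcast once to all executors.
X, y = load_iris(return_X_y=True)
data = sc.broadcast((X, y))

grid = list(ParameterGrid({"C": [0.1, 1, 10], "gamma": [0.01, 0.1, 1]}))

def evaluate(params):
    X, y = data.value
    return cross_val_score(SVC(**params), X, y, cv=3).mean(), params

# One Spark task per parameter combination; keep the best mean CV score.
results = sc.parallelize(grid, len(grid)).map(evaluate).collect()
best_score, best_params = max(results, key=lambda r: r[0])
print(best_score, best_params)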

4. A data scientist, a BI expert and a big data engineer walk into a bar: how 3 different worlds come together with Spark

Because of its general-purpose nature, Spark is being used by a wide variety of data professionals, each with their own backgrounds. The data warehouse / data lake of a large organisation is a spot where those 3 worlds collide. We've experienced the good, the bad and the ugly of those encounters first hand. In this lightning talk, we share what each group can learn from the others, how they can collaborate, and which are the recipes for disaster.

Presenter:  Kris Peeters - Data Minded

5.  Writing Spark applications, the easy way: how to focus on your data pipelines and forget about the rest

Even though Spark offers intuitive and high-level APIs, writing production-ready Spark data pipelines involves non-trivial challenges for data scientists without expert background in software development and devops matters. In this short talk, I'll present how we tackled these issues at Real Impact Analytics, by developing an intuitive framework for writing dataflows, offering convenient data exploration and testing facilities, while hiding devops-related complexity. 

Presenter:    Pierre Borckmans - Real Impact Analytics

6. A very brief introduction to extending Spark ML for custom models: Talk + Demo

Spark ML pipelines, inspired by scikit-learn, have the potential to make our machine learning tasks much easier. This talk looks at how to extend Spark ML with your own custom model types when the built-in options don't meet your needs.

Presenter: Holden Karau - Spark Technology Center, IBM


20:45 Networking & Refreshments


The Summit

There are also a number of presentations and events hosted by my IBM colleagues. You'll find a complete list below. And of course there are a number of great talks from speakers other than IBM. This is a unique opportunity to catch up with the vibrant Spark community. Please join us!


“Scaling Factorization Machines on Spark Using Parameter Servers”
Wednesday, October 26, 13:45 – 14:15

by Nick Pentreath

Factorization machines are a relatively new class of model, that are extremely powerful as they are able to efficiently capture arbitrary order interactions between features. FMs are becoming increasingly popular in settings with large amounts of sparse data, including recommender systems and online advertising. Furthermore, with appropriate feature engineering, they can mimic most commonly used factorization-based models for collaborative filtering. However, one drawback of FMs is that, even though they are relatively efficient to train, they can still be difficult to scale to very large feature dimensions. This talk will explore scaling up FMs on Spark, using the Glint parameter server built on Akka. Rather than a general exploration of parameter server architectures, the focus will be on specific technical aspects of training factorization machines, with code examples and performance analysis and comparisons. It will also cover integration with Spark DataFrames and ML pipelines for feature engineering and cross-validation. Example code will be available as open source.


“From Single-Tenant Hadoop to 3000 Tenants in Apache Spark: Experiences from Watson Analytics”
Wednesday, October 26, 16:05 – 16:35

by Alexander Lang

IBM Watson Analytics for Social Media is using a pipeline for deep text analytics and predictive analytics based on Apache Spark. This session describes our journey from our predecessor product, which used Hadoop in environments dedicated per tenant, to a system based on Apache Spark (both “core” and streaming), Kafka and ZooKeeper that serves more than 3000 tenants. We will describe our thought process, our current architecture, as well as the lessons we’ve learned since we put the environment into production in December 2015. Key takeaways are: – Changes to design, development and operations thinking required when going from single-tenancy to multi-tenancy – Architecture of a multi-tenant Spark solution in production – Orchestration of several Spark apps within a common data pipeline – Benefits of Apache Spark, Kafka and ZooKeeper in a multi-tenant data pipeline architecture


“From machine learning to learning machines: Creating an end-to-end cognitive process with Apache SparkTM”

Thursday, October 27, 10:00 – 10:10

By Dinesh Nirmal

Many people think of machine learning as something that begins with data and ends with a model. But machine learning in practice is actually a continuous process that begins with an application and never ends. Apache Spark has made many parts of this process dramatically easier. As an active member of the Apache Spark Community, we have recognized – through hosting meet-ups, advisory boards, and working with clients – the challenges that practitioners face in closing the loop and adapting automatically to changing business environments. Over the last 12 months we contributed over 25,600 lines of code to Apache Spark including Spark ML, SparkR, and PySpark, and we’ve brought Apache SystemML to 356,000 lines of code, laying the groundwork for machine learning in business solutions and in particular for an end-to-end machine learning framework. In this keynote, I will share our recent progress and where we are headed with machine learning – towards a comprehensive vision for more effectively supporting continuous machine learning.


“SparkOscope: Enabling Apache Spark Optimization Through Cross-Stack Monitoring and Visualization”
Thursday, October 27, 14:20 – 14:50

by Yiannis Gkoufas

During the last year we have been using Apache Spark to perform analytics on large volumes of sensor data. These applications need to be executed on a daily basis, therefore, it was essential for us to understand Spark resource utilization. We found it cumbersome to manually consume and efficiently inspect the CSV files for the metrics generated at the Spark worker nodes. Although using an external monitoring system like Ganglia would automate this process, we were still plagued with the inability to derive temporal associations between system-level metrics (e.g. CPU utilization) and job-level metrics (e.g. job or stage ID) as reported by Spark. For instance, we were not able to trace back the root cause of a peak in HDFS Reads or CPU usage to the code in our Spark application causing the bottleneck. To overcome these limitations we developed SparkOscope. Taking advantage of the job-level information available through the existing Spark Web UI and to minimize source-code pollution, we use the existing Spark Web UI to monitor and visualize job-level metrics of a Spark application (e.g. completion time). More importantly, we extend the Web UI with a palette of system-level metrics of the server/VM/container that each of the Spark job’s executor ran on. Using SparkOScope, the user can navigate to any completed application and identify application-logic bottlenecks by inspecting the various plots providing in-depth timeseries for all relevant system-level metrics related to the Spark executors, while also easily associating them with stages, jobs and even source code lines incurring the bottleneck. Github: Demo:

“Spark SQL 2.0 Experiences Using TPC-DS”
Thursday, October 27, 17:15 – 17:45

by Bernie Schiefer

This talk summarizes the results of using the TPC-DS workload to characterize the SQL capability, performance and scalability of Apache Spark SQL 2.0 at the multi-Terabyte scale in both single user dedicated and multi-user concurrent execution modes. We track the evolution of Spark SQL across versions 1.5, 1.6 and 2.0 to underscore the pace of improvement in Spark SQL capability and performance. We also provide best practices and configuration tuning parameters to support the concurrent execution of the 99 TPC-DS queries at scale. The key takeaways include 1) See the substantial progress made by Spark SQL 2.0 2) Understand what TPC-DS is and why it has become the preferred workload of SQL on Hadoop systems. 3) Experimental results supporting the optimized execution of multi-user, multi-terabyte TPC-DS-based workloads 4) Tuning and configuration changes used to attain excellent performance of Spark SQL.


Big Data University

This Week in Data Science (October 18, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

The post This Week in Data Science (October 18, 2016) appeared first on Big Data University.


October 17, 2016

Revolution Analytics

Estimating the value of a vehicle with R

by Srini Kumar, Director of Data Science at Microsoft. We tend to think of R and other such ML tools only in the context of the workplace, to do “weighty” things aimed at saving millions. A little...


Revolution Analytics

The Team Data Science Process

As more and more organizations are setting up teams of data scientists to make sense of the massive amounts of data they collect, the need grows for a standardized process for managing the work of...


Revolution Analytics

Make tilegrams in R with tilegramsR

In this busy election season (here in the US, at least), we're seeing a lot of maps. Some states are red, some states are blue. But there's a problem: voters are not evenly distributed throughout the...

Cloud Avenue Hadoop Tips

Getting started with Apache Spark

With so much noise around Apache Spark, let's look into how to get started with Spark in local mode and execute a simple Scala program. A lot of complex combinations are possible, but we will look at the minimum steps required to get started with Spark.

Most Big Data software is developed with Linux as the platform, and porting to Windows has been an afterthought. It is interesting to see how Big Data on Windows will morph in the future. Spark can run on both Windows and Linux, but we will take Linux (Ubuntu 14.04 64-bit Desktop) into consideration. So, here are the steps:

1) Download and install Oracle VirtualBox as mentioned here.

2) Download and install Ubuntu as mentioned here as a guest OS.

3) Update the patches on Ubuntu from a terminal and reboot it.
sudo apt-get update; sudo apt-get dist-upgrade
4) Oracle Java doesn't come with Linux distributions, so has to be installed manually on top of Ubuntu as mentioned here.

5) Spark has been developed in Scala, so we need to install Scala.
sudo apt-get install scala
6) Download Spark from here and extract it. Spark built against Hadoop 1.x or 2.x will work, because HDFS is not being used in this context.

7) From the Spark installation folder, start the Spark shell:
bin/spark-shell
8) Execute the below commands in the shell to load a text file and count the number of lines in it.
val textFile = sc.textFile("")
textFile.count()

What we have done is install Ubuntu as a guest OS, install Spark on it, and finally run a simple Scala program in Spark local mode. There are more advanced setups, like running a Spark program against data in HDFS, or running Spark in standalone, Mesos or YARN mode. We will look at those in future blogs.
Data Digest

Data Management? Take it with a bit of philosophy

Of one thing rest assured, I am not a fan of those posts that offer a numbered list of advice. You know, those "7 ways to double your salary", "9 things you should never say in the boardroom", "5 actions to become a CEO in one month" etc...

Nevertheless, without numbering them and thanks to my scholastic background, below I am offering data-scented quotes from famous philosophers of the past, because I believe that the quest for "good data" (feel free to give "good" whatever interpretation makes you content) is ultimately a quest for truth, and in the midst of the chaos that sometimes surrounds us, I find refuge in the collective wisdom matured over centuries of trying to answer life's fundamental questions. Here we go…

Know thyself (Socrates): my beloved mantra and foundation of everything. If you don't know what kind of company yours is, how agile and resilient it is or pretends to be, who your people are, where your data is and how good it is, then managing, changing, protecting and representing data will always be fraught with unpredictability at best and danger at worst. Change creates a continuously moving target, pulling the carpet from under your feet, but I believe that if you zoom out of the chaotic stream of changes affecting your environment every day, you should be able to recognise patterns stable enough to help in your journey.

Of each particular thing ask: what is it in itself? What is its nature? (Marcus Aurelius) Or rather "to call a spade a spade", asking the powerful questions and being "objective", avoiding subjective interpretations or, even, not being afraid to point out that the emperor had had a wardrobe malfunction, is a mental skill close to meditation, one that has to be cultivated and practiced. Beware: it might not make you popular, but it is the solid basis you need to build a successful data strategy. So dress it well with the best tact and diplomacy, but be relentless.

Simplicity is the ultimate sophistication (Leonardo Da Vinci). How many times has someone had to come to you and, waving the bedazzling flag of "simplification", peeled down to nothing a decently working process/idea/project? Simplification is important and plays a pivotal role in data management, but like every change it requires accurate design, and given that it is demonstrated that it is much easier to be complex than it is to be simple (maybe entropy is the culprit here), one should not expect that simplicity requires less effort. E=mc2 is an incredibly simple and elegant formula, but it is the epitome of a lifetime of studies.

Even the finest sword plunged into salt water will eventually rust (Sun Tzu). Ok, maybe not strictly speaking a philosopher, but he is one of my favourites. In my old days as a Six Sigma convert, I would have bored you with the "Sigma Shift" effect: old habits, like dandelions, are difficult to eradicate completely if you don't take care of them by properly embedding, controlling and supporting whatever change you implement, be it a data governance process, a set of quality procedures, or the task list for your newly appointed Data Stewards. No matter how good and slick the design was, fatigue and old habits will always deteriorate the quality of your solution. So govern! Governance is the sharpening stone of your strategy.

Be a philosopher! But, amidst all your philosophy, be still a man! (David Hume). I will stop you right there! David Hume was born 304 years ago, so the "man" in his words is not chauvinism, but rather a collective noun meaning "human being". To me it clearly resonates with the fact that the critical piece of the puzzle you need to put in all your frameworks, policies and procedures is the one related to the true subject of all our theories: the human being! If I can spell "ontology" but I can't connect with people and help them to do their job better, well, I should probably be doing something else.

Get more insight from Roberto Maranca and 60 more senior data executives speaking at Chief Data Officer Europe 2017, taking place on 21-23 February in London.

By Roberto Maranca:

Roberto Maranca is Chief Data Officer at GE Capital and a speaker at Chief Data Officer Europe 2017. He is a highly accomplished senior business leader with a wealth of experience across the B2B, financial services, banking, professional, SME and global corporate sectors, commercially aware, with a broad range of technology management experience and a passion for Data Management.

October 16, 2016

M. Kinde

Donald Trump and the Truth Bubble, Part 1 – The Misinformed

“Wherever the people are well-informed they can be trusted with their own government.” — Thomas Jefferson to Richard Price, 1789 Thomas Jefferson’s support of a free press and education for the...


October 15, 2016

Simplified Analytics

Using Data Science for Predictive Maintenance

Remember a few years ago there were two recall announcements from the National Highway Traffic Safety Administration for GM & Tesla – both related to problems that could cause fires. These caused tons...


October 14, 2016

Revolution Analytics

Because it's Friday: Earthrise

I didn't know until reading this Bad Astronomy post that there were satellites orbiting the moon, but Japan had a probe from 2007 to 2009 mapping the lunar surface and gravity. The data from that...


October 13, 2016

Big Data University

How to run a successful Data Science meetup

Meetups have become very popular in the past few years, but it is not easy to get a meetup community going successfully. Every now and then different meetups just ‘die’. Like many other community initiatives, you need to nurture your community and provide value. Though having some funds to run the meetups definitely helps, it is not necessarily the main factor for success.

The Big Data University (BDU) Toronto Meetup recently surpassed 6,200 members, and it currently sits as the #9 largest Data Science meetup worldwide, where the top spots are based out of New York and San Francisco. Most if not all of these Data Science meetups started one to three years before ours, though, and the audience is not exactly the same.

In the case of the BDU-Toronto meetup, what started back in 2014 as a pilot has turned out to be one of our most successful initiatives in Toronto and other cities in Canada and around the world. Just like BDU itself, the BDU meetup started in a similar fashion to a startup company: little to no funding, many people opposed to creating it, and no established venue. Our first meetup attracted about 25 people. Today, without any marketing other than posting the events, we get at least 100 people attending per meetup! Just like a startup, we pivoted a few times: the meetup name changed, and the topic/interest was narrowed down.

There are several reasons why the BDU meetup has been very successful, especially in Toronto, where it started. First of all, you know a meetup is successful when you see a healthy number of new members joining weekly; but moreover, if you’ve had a chance to attend our events, you will see first hand the excitement and engagement of the community. In most of our meetups, about 90% of attendees stay all the way until the meetup is finished (typically around 9pm). Last week we held our ultimate proof that our meetup is rocking! On a Friday evening of a long weekend, we scheduled a meetup to discuss *volunteer* opportunities in Data Science. We booked the IBM downtown Toronto venue, and we got 200 RSVPs and probably more than 100 people showing up! We started the event with a big “Wow!”

In my opinion, these are the reasons why the BDU-Toronto meetup is succeeding:

  • It focuses on quality of delivery and content.
    Our main presenters, Polong Lin (IBM Data Scientist) and Saeed Aghabozorgi (IBM Senior Data Scientist) spend a lot of time preparing. They are eager and proud to present and share their knowledge. They choose great examples. They also vet any other presenters from the community to ensure they are good presenters and have good materials themselves.
  • It delivers value to attendees: Everyone always learns something
    The person who arrives to our meetup when it starts is not the same as the one who leaves at the end of the event. This person is ‘transformed’ with new skill!
  • It uses a hands-on approach to learning.
    We use the Data Scientist Workbench (DSWB) for our interactive learning sessions. Minor to no setup required!
  • It is held in a vibrant area of the city, with easy access
    The downtown area in Toronto is vibrant and has easy access to public transportation (subway), is close to universities, colleges, startup companies, large financial companies, incubators, institutes, hospitals, research centres, and entertainment!
  • It teaches hot technologies and concepts about Data Science, Big Data, and Analytics.
    These are hot areas these days that everyone, no matter the background, should learn. By the way a quick way to get started is by taking the Data Science 101 course at BDU. In less than a month this course has had more than 8000 registrations!
  • It has an active organizer team
    We are a small team of organizers, but we have almost mastered how to put together these events.
  • Events are scheduled frequently
    Frequent events mean the momentum is never lost
  • We are not afraid to try new things.
    We’ve already supported mini-hackathons, hackathons, a Data Science meetup for developers, and this past summer we sponsored a Data Science bootcamp.
  • We partner
    We have partnered with universities (University of Toronto, Ryerson U), university data science clubs, colleges (Lambton College), institutions (STEM Fellowship), other meetup groups (R Users Group).
    This helps us with promotion, awareness, and finding other presenters and topics.
  • We attend to community requests
    We get the community going by replying to their inquiries or issues. We have also recently set up a Slack group to get the community talking among themselves.
  • We offer free pizza and pop
    We offer free pizza and pop at the beginning of the meetup to encourage people to arrive early and so people are not hungry while learning once the meetup starts!
  • It is free!
    The event, the materials, the software, the food … all free!

People feel attending the meetup is time well spent. We know there is room to improve, especially in finding a good venue with good internet that can host us permanently. We also know our LiveStream is not very reliable, and that our registration system is not the best. We are working on all of these fronts, but our budget is a limiting factor!

We are proud to have started BDU meetups in other cities. The last part of the URLs below tells you where they are located:

If you would like to start your own BDU meetup in your city, please contact me! Having a local active organizer is very important. We can help you get established, but we need your long-term commitment.

If you are in Toronto, I hope to see you at our next meetup this October 20th! We’ve sponsored a Data Science for High School course (with credit in Ontario, Canada) and we will talk about this. As mentioned earlier, we are not afraid to try new things!

The post How to run a successful Data Science meetup appeared first on Big Data University.

Silicon Valley Data Science

Streaming Video Analysis in Python

Editor’s note: This post is part of our Trainspotting series, a deep dive into the visual and audio detection components of our Caltrain project. You can find the introduction to the series here.

At SVDS we have analyzed Caltrain delays in an effort to use real time, publicly available data to improve Caltrain arrival predictions. However, the station-arrival time data from Caltrain was not reliable enough to make accurate predictions. In order to increase the accuracy of our predictions, we needed to verify when, where, and in which direction trains were going. In this post, we discuss our Raspberry Pi streaming video analysis software, which we use to better predict Caltrain delays.

Platform architecture for our Caltrain detector

In a previous post, Chloe Mawer implemented a proof-of-concept Caltrain detector using a webcam to acquire video at our Mountain View offices. She explained the use of OpenCV’s Python bindings to walk through frame-by-frame image processing. She showed that using video alone, it is possible to positively identify a train based on motion from frame to frame. She also showed how to use regions of interest within the frame to determine the direction in which the Caltrain was traveling.

The work Chloe describes was done using pre-recorded, hand-selected video. Since our goal is to provide real time Caltrain detection, we had to implement a streaming train detection algorithm and measure its performance under real-world conditions. Thinking about a Caltrain detector IoT device as a product, we also needed to slim down from a camera and laptop to something with a smaller form factor. We already had some experience listening to trains using a Raspberry Pi, so we bought a camera module for it and integrated our video acquisition and processing/detection pipeline onto one device.


The image above shows the data platform architecture of our Pi train detector. On our Raspberry Pi 3B, our pipeline consists of hardware and software running on top of Raspbian Jessie, a derivative of Debian Linux. All of the software is written in Python 2.7 and can be controlled from a Jupyter Notebook run locally on the Pi or remotely on your laptop. Highlighted in green are our three major components for acquiring, processing, and evaluating streaming video:

  • Video Camera: Initializes PiCamera and captures frames from the video stream.
  • Video Sensor: Processes the captured frames and dynamically varies video camera settings.
  • Video Detector: Determines motion in specified Regions of Interest (ROIs), and evaluates if a train passed.

In addition to our main camera, sensor, and detector processes, several subclasses (orange) are needed to perform image background subtraction, persist data, and run models:

  • Mask: Performs background subtraction on raw images, using powerful algorithms implemented in OpenCV 3.
  • History: A pandas DataFrame that is updated in real time to persist and access data.
  • Detector Worker: Assists the video detector in evaluating image, motion and history data. This class consists of several modules (yellow) responsible for sampling frames from the video feed, plotting data and running models to determine train direction.

Binary Classification

Caltrain detection, at its simplest, boils down to a simple question of binary classification: Is there a train passing right now? Yes or no.


As with any other binary classifier, the performance is defined by evaluating the number of examples in each of four cases:

  1. Classifier says there is a train and there is a train, True Positive
  2. Classifier says there is a train when there is none, False Positive
  3. Classifier says there is no train when there is one, False Negative
  4. Classifier says there is no train when there isn’t one, True Negative

For more information on classifier evaluation, check out this work by Tom Fawcett.
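As a concrete illustration (this example is ours, not from the original post), the four outcomes can be tallied from a list of ground-truth labels and detector outputs with a few lines of Python:

def confusion_counts(y_true, y_pred):
    #count the four binary classification outcomes for a "train / no train" detector
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)          #train present, train detected
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)      #no train, but one was detected
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)      #train present, but missed
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)  #no train, none detected
    return tp, fp, fn, tn

y_true = [1, 1, 0, 0, 1, 0]   #made-up ground truth: was a train actually passing?
y_pred = [1, 0, 0, 1, 1, 0]   #made-up detector output
print(confusion_counts(y_true, y_pred))   #(2, 1, 1, 2)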

After running our minimum viable Caltrain detector for a week, we began to understand how our classifier performed, and importantly, where it failed.

Causes of false positives:

  • Delivery trucks
  • Garbage trucks
  • Light rail
  • Freight trains

Darkness is the main cause of false negatives.

Our classifier involves two main parameters set empirically: motion and time. We first evaluate the amount of motion in selected ROIs. This is done at five frames per second. The second parameter we evaluate is motion over time, wherein a set amount of motion must occur over a certain amount of time to be considered a train. We set our time threshold at two seconds, since express trains take about three seconds to pass by our sensor located 50 feet from the tracks. As you can imagine, objects like humans walking past our IoT device will not create large enough motion to trigger a detection event, but large objects like freight trains or trucks will trigger a false positive detection event if they traverse the video sensor ROIs over two seconds or more. Future blog posts will discuss how we integrate audio and image classification to decrease false positive events.
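To make those two empirical parameters concrete, here is a minimal sketch (ours, with illustrative values, not the project's actual code) of how a per-frame motion threshold and a time threshold combine at five frames per second:

from collections import deque

FPS = 5
MOTION_THRESHOLD = 1000.0                       #illustrative: minimum motion level inside the ROIs
MIN_TRAIN_SECONDS = 2                           #motion must be sustained this long to count as a train
window = deque(maxlen=FPS * MIN_TRAIN_SECONDS)  #holds the last two seconds of per-frame decisions

def update(motion_level):
    #returns True only when motion has exceeded the threshold for the full time window
    window.append(motion_level > MOTION_THRESHOLD)
    return len(window) == window.maxlen and all(window)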

While our video classifier worked decently well at detecting trains during the day, we were unable to detect trains (false negatives) in low light conditions after sunset. When we tried additional computationally expensive image processing to detect trains in low light on the Raspberry Pi, we ended up processing fewer frames per second than we captured, grinding our system to a halt. We have been able to mitigate the problem somewhat by using the NoIR model of the Pi camera, which lets more light in during low light conditions, but the adaptive frame rate functionality on the camera didn’t have sufficient dynamic range out of the box.

To truly understand image classification and dynamic camera feedback, it is helpful to understand the nuts and bolts of video processing on a Raspberry Pi. We’ll now walk through some of those nuts and bolts—note that we include the code as we go along.

PiCamera and the Video_Camera class

PiCamera is an open source package that offers a pure Python interface to the Pi camera module, allowing you to record images or video to a file or stream. After some experimentation, we decided to use PiCamera in a continuous capture mode, as shown below in the initialize_camera and initialize_video_stream functions.

class Video_Camera(Thread):
    def __init__(self,fps,width,height,vflip,hflip,mins):
        Thread.__init__(self)
        self.fps = fps
        #bounded deque holding roughly 'mins' minutes of frames (names and sizing not shown in the original snippet are assumptions)
        self.input_deque = deque(maxlen=fps*60*mins)
        self.initialize_camera(width,height,vflip,hflip)
        self.initialize_video_stream()
    def initialize_camera(self,width,height,vflip,hflip):
        #requires: import picamera as pc; import picamera.array
        self.camera = pc.PiCamera(resolution=(width,height),framerate=self.fps)
        self.camera.vflip, self.camera.hflip = vflip, hflip
    def initialize_video_stream(self):
        self.rawCapture = pc.array.PiRGBArray(self.camera, size=self.camera.resolution)
    def run(self):
        #This method is run when the command start() is given to the thread
        for f in self.camera.capture_continuous(self.rawCapture, format='bgr', use_video_port=True):
            #add frame with timestamp to input queue
            self.input_deque.append({'time': datetime.datetime.now(), 'frame_raw': f.array})
            self.rawCapture.truncate(0)
The camera captures a stream of still image RGB pictures (frames). The individual frames are then output as a NumPy array representation of the image. (Note: Careful readers might notice that the format saved is actually BGR, not RGB, because OpenCV uses BGR for historical reasons.) This image is then placed into the front deque, a double-ended queue, for future processing (as shown below). By placing the image into a deque, we can just as quickly access recently taken images from the front of the deque as older images from the rear. Moreover, the deque allows calculation of motion over several frames, and enforces a limit on the total images stored in memory via the maxlen argument. By constraining the length of the deque we minimize the memory footprint of this application. This is important, as the Raspberry Pi 3 only has 1 GB of memory.
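As a tiny illustration of why the bounded deque matters (the values here are made up), a deque created with maxlen silently discards the oldest entry once it is full, so memory use stays constant no matter how long the camera runs:

from collections import deque

frames = deque(maxlen=3)              #keep only the three most recent items
for i in range(5):
    frames.append('frame_%d' % i)
print(list(frames))                   #['frame_2', 'frame_3', 'frame_4'] - the oldest frames were dropped
print(frames[-1])                     #newest item, O(1) access from the front of the deque
print(frames[0])                      #oldest retained item, O(1) access from the rear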


Threading and task management in Python

As you may have noticed, the Video_Camera class subclasses a thread from the Python threading module. In order to perform real time train detection on a Raspberry Pi, threading is critical to ensure robust performance and minimize data loss in our asynchronous detection pipeline. This is because multiple threads within a process (our Python script) share the same data space with the main thread, facilitating:

  • Communication of information between threads.
  • Interruption of individual threads without terminating the entire application.
  • Most importantly, individual threads can be put to sleep (held in place) while other threads are running. This allows for asynchronous tasks to run without interruption on a single processor, as shown in the image below.

For example, imagine you are reading a book but are interrupted by a freight train rolling by your office. How would you be able to come back and continue reading from the exact place where you stopped? One option is to record the page, line, and word number. This way your execution context for reading a book is these three numbers. If your coworker is using the same technique, she can borrow the book and continue reading where she stopped before. Similar to reading a book with multiple people, or asynchronously processing video and audio signals, many tasks can share the same processor on the Raspberry Pi.
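To illustrate the idea with a toy example (ours, not the project's code), two threads can share one deque: a 'camera' thread appends items while a 'sensor' thread consumes them, and each sleeps briefly so the other can run on the same processor:

import time
from collections import deque
from threading import Thread

shared = deque(maxlen=100)

class Producer(Thread):
    def run(self):
        for i in range(10):
            shared.append(i)          #stand-in for a captured frame
            time.sleep(0.01)          #sleeping lets the other thread run

class Consumer(Thread):
    def run(self):
        processed = 0
        while processed < 10:
            if shared:
                shared.popleft()      #stand-in for frame processing
                processed += 1
            else:
                time.sleep(0.005)     #nothing to do yet, so yield the processor

p, c = Producer(), Consumer()
p.start(); c.start()
p.join(); c.join()
print('done')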

Real time background subtraction and the Video_Sensor class

Once we were collecting and storing data from the PiCamera in the input_deque, we created a new thread, the Video_Sensor, which asynchronously processes these images independently of the Video_Camera thread. The job of the Video_Sensor is to determine which pixels have changed values over time, i.e. motion. To do this, we needed to identify the background of the image (the non-moving objects in the frame) and the foreground of the image (the new or moving objects in the frame). After we identified motion, we applied a 5×5 pixel kernel filter to reduce noise in our motion measurement via the cv2.morphologyEx function.

class Video_Sensor(Thread):
    def __init__(self,video_camera,mask_type):
        Thread.__init__(self)
        self.video_camera = video_camera
        self.mask_type = mask_type   #self.mask is built from the Mask class shown further below
        self.kernel = np.ones((5,5),np.uint8)   #5x5 kernel used for morphological noise reduction
    def apply_mask_and_decrease_noise(self,frame_raw):
        #apply the background subtraction mask
        frame_motion = self.mask.apply(frame_raw)
        #apply a morphology operation (opening, an assumption) with the 5x5 kernel to decrease noise
        frame_motion_output = cv2.morphologyEx(frame_motion, cv2.MORPH_OPEN, self.kernel)
        return frame_motion_output

Real time background subtraction masks

Chloe’s post demonstrated that we could detect trains with processed video feeds that isolate motion, through a process called background subtraction, by setting thresholds for the minimum intensity and duration of motion. Since background subtraction must be applied to each frame and the Pi has only modest computational speed, we needed to streamline the algorithm to reduce computational overhead.

Luckily, OpenCV 3 comes with multiple open source packages that were contributed by the OpenCV community (their use also requires installing opencv_contrib). These include background subtraction algorithms that run optimized C code with convenient Python APIs:

  • BackgroundSubtractorMOG2: A Gaussian Mixture-based Background/Foreground Segmentation algorithm developed by Zivkovic and colleagues. It models each background pixel by an optimized mixture of K Gaussian distributions. The weights of the mixture represent the time proportions that those colors stay in the scene. The probable background colors are the ones which stay longer and are more static.
  • BackgroundSubtractorKNN: KNN involves searching for the closest match of the test data in the feature space of historical image data. In our case, we are trying to discern large regions of pixels with motion and without motion. An example of this is below, where we try to discern which class (blue square or red triangle) the new data (green circle) belongs to by factoring in not only the closest neighbor (red triangle), but the proximity threshold of k-nearest neighbors. For instance, if k=2 then the green circle would be assigned to the red triangle class (the two red triangles are closest); but if k=6 then the blue square class would be assigned (the closest 6 objects are 4 blue squares and only 2 red triangles). If tuned correctly, KNN background subtraction should excel at detecting large areas of motion (a train) and should reduce detection of small areas of motion (a distant tree fluttering in the wind).

We tested each and found that BackgroundSubtractorKNN gave the best balance between rapid response to changing backgrounds, robustly recognizing vehicle motion, and not being triggered by swaying vegetation. Moreover, the KNN method can be improved through machine learning, and the classifier can be saved to file for repeated use. The cons of KNN include artifacts from full-field motion, limited tutorials, incomplete documentation, and that BackgroundSubtractorKNN requires OpenCV 3.0 or higher.

class Mask():
    def __init__(self,fps):
        self.fps = fps
    def make_KNN_mask(self,bgsKNN_history,bgsKNN_d2T,bgsKNN_dS):
        #history = frames used to model the background, dist2Threshold = squared distance threshold, detectShadows = shadow flag
        mask = cv2.createBackgroundSubtractorKNN(history=bgsKNN_history,
                                                 dist2Threshold=bgsKNN_d2T,
                                                 detectShadows=bgsKNN_dS)
        return mask

Dynamically update camera settings in response to varied lighting

The PiCamera does a great job at adjusting its exposure settings throughout the day to small changes, but it has limited dynamic range, which causes it to struggle with limited illumination at night. Below you can see the motion we detected from our sensor over 24 hours, where the spikes correspond to moving objects like a CalTrain.

If we were using a digital camera or phone, we could manually change the exposure time or turn on a flash to increase the motion we could capture after sunset or before sunrise. However, with an automated IoT device, we must dynamically update the camera settings in response to varied lighting. We also picked a night-vision compatible camera without an infrared (IR) filter to gather more light in the ~700-1000 nm range, where normal cameras only capture light from ~400-700 nm. This extra far-red to infrared light is why some of our pictures seem discolored compared to traditional cameras.

We found that through manual tuning, there were exposure parameters that allowed us to detect trains after sunset (aka night mode), but we had to define a routine to do automated mode switching.

In order to know when to change the camera settings, we record the intensity mean of the image, which the camera tries to keep around 50% max levels at all times (half max = 128, i.e. half of the 8 bit 0-255 limit). We observed that after sunset, the mean intensity dropped below ~1/16 of max, and we were unable to reliably detect motion. So we added a feature to poll the mean intensity periodically, and if it fell below 1/8th of the maximum, the camera would adjust to night mode. Similarly, the camera would switch back to day mode if the intensity was greater than 3/4th of the maximum.

After we change the camera settings, we reset the background subtraction mask to ensure that we did not falsely trigger train detection. Importantly, we wait one second between setting camera settings and triggering the mask, to ensure the camera thread is not lagging and has updated before the mask is reset.

class Video_Sensor(Thread):
    def vary_camera_settings(self,frame_raw):
        intensity_mean = frame_raw.ravel().mean()  #8 bit camera, so values run 0-255
        #adjust camera properties dynamically if needed, then reset mask
        #(mode handling reconstructed from the description above; switch_camera_mode is an assumed helper
        # that applies the new exposure settings, waits one second, then resets the mask)
        if (intensity_mean < (255.0/8)) and (self.camera_mode == 'day'):
            print 'Night Mode Activated - Camera'
            self.switch_camera_mode('night')
        if (intensity_mean > (255.0*3/4)) and (self.camera_mode == 'night'):
            print 'Day Mode Activated - Camera'
            self.switch_camera_mode('day')
        return intensity_mean,self.mask

Real-time detection of trains with the Video_Detector class

At this point, the video sensor is recording motion in a frame 5 times per second (5 FPS). Next, we needed to create a Video Detector for detecting how long and what direction an object has been moving through the frame. To do this, we created three ROIs in our frame that the train passes through. By having three ROIs, we can see if a train enters from the left (southbound) or right (northbound). We found that having a third center ROI decreases the effect of noise in an individual ROI, thereby improving our ability to predict train directionality and more accurately calculate speed.

We then created a circular buffer to store when individual ROIs have exceeded the motion threshold. The length of this motion_detected_buffer is set as the minimum time corresponding to a train, multiplied by the camera FPS (we set this to two seconds, so the motion_detected_buffer has a length of ten). We added logic to our Video_Detector class that prevents a train from being detected more than once within a cool-down period, so that slow-moving trains are not registered as more than one train; a simplified sketch of that logic follows below. Additionally, we used a frame sampling buffer to keep a short-term record of raw and processed frames for future analysis, plotting or saving.
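A simplified sketch of that cool-down logic (our illustration; names and values are not the project's actual ones):

FPS = 5
COOLDOWN_SECONDS = 30          #illustrative: long enough for a slow train to clear the frame
last_detection_frame = None

def register_detection(framenum):
    #returns True only for the first detection inside each cool-down window
    global last_detection_frame
    if last_detection_frame is not None and (framenum - last_detection_frame) < COOLDOWN_SECONDS * FPS:
        return False               #still in cool-down: same train, not a new one
    last_detection_frame = framenum
    return True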

Using all of these buffers, the Video_Detector class creates ROI_to_process, an array that stores the time, the motion from the three ROIs, whether motion was detected, and the direction of the train motion. The video below shows the normal video of the train passing, and the background-subtracted versions (MOG2 in the middle pane, KNN in the right pane).

def run(self):
        while self.kill_all_threads!=True:
            #update the history dataframe and adjust the frame number pointer
            self.history.iloc[self.framenum % self.history.shape[0]] = self.roi_data
            self.framenum += 1
            #add frames and data to the sampler (sampling call omitted in the original snippet)

Persist processed data to pandas DataFrame

With relevant train sensor and detector data stored in memory, we chose to use pandas DataFrames to persist this data for future analysis. Pandas is a Python package that provides fast and flexible data structures designed to work efficiently with both relational and labeled data. Similar to using SQL for managing data held in relational database management systems (RDBMS), pandas makes importing, querying and exporting data easy.

We used the History class to create a DataFrame that loads time, sensor, and processed detector data. Since the Pi has limited memory, we implemented the History DataFrame as a limited length circular buffer to prevent memory errors.

class History():
    def __init__(self,fps):
        self.fps = fps
        self.setup_history()
    def setup_history(self):
        #create a fixed-length pandas dataframe used as a circular buffer
        #(column names and buffer length here are illustrative; the original snippet was truncated)
        columns = ['time','roi_1','roi_2','roi_3','motion_detected','direction']
        self.history = pd.DataFrame.from_records([(None,)*len(columns)]*(self.fps*60), columns=columns)

This allowed us to easily retrieve and analyze train detection events. Shown below are 17 frames (3.4 seconds of data at 5 FPS) for a northbound Caltrain.


Detector_Worker class

Once we were persisting data in a DataFrame, we visualized the raw sensor and processed detector data. This required additional processing time and resources, and we did not want to interrupt the video detector. We therefore used threading to create the Detector_Worker class. The Detector_Worker is responsible for plotting video, determining train direction, and returning sampled frames to the Jupyter notebook or file system. Shown below is the output of the video plotter. On the top left is one raw frame of video, and on the bottom right is one KNN-background-subtracted motion frame. The two right frames have the three ROIs overlaid onto the image.


Train direction

The last step was to deduce train direction. In order to accurately detect train direction within our streaming video analysis platform, we iterated through several methods.

  • Static ‘Boolean’ Method: Track the motion level in each individual ROI and then select north/south depending on which ROI exceeded the threshold first. We found that this static boolean method does not work well for express trains, which triggered the north- and south-facing detectors simultaneously at low frame rates.
  • Streaming ‘Integration’ Method: This method involved summing the historical levels of motion in each ROI and determining direction by which ROI had the highest sum. We found that this method was too reliant on accurate setting of ROI position, and broke down if the camera was ever moved. The problem this created was that if an ROI never became fully saturated with motion, the maximum mean intensity it could reach for a true positive could be lower than the mean of an ROI due to non-train motion.
  • Streaming ‘Curve-Fit’ Method: We next tried to combine the boolean and integration method with a simple sigmoid model of motion across the frame. If average motion across the three ROIs exceeded our motion threshold, we empirically fit a sigmoid curve to determine when the train passed in time (i.e. at 50% max motion). If the data was noisy and curve fitting failed, we revert back to the static boolean method. Moreover, our curve-fit method allows determination of train speed if the real distance between the ROIs is known.
class Detector_Worker(Thread):
    def curve_func(self,x, a, b, c):
        #Sigmoid function
        return -a/(c + np.exp(b * -x))
    def alternate_km_map(self,ydata,t,event_time):
        #determine empirically where the ROI sensor hits 50% of the max value
        #(returning the matching timestamp from t is an assumption; the original return value was truncated)
        max_value = max(ydata)
        for i in range(0,len(ydata)):
            #if the value is above half of the max value, return the time at which that happened
            if ydata[i] > max_value/2.0:
                return t[i]
        #if the value never exceeds half of the max value
        #return the end of the time series
        return t[-1]
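For completeness, here is a sketch of how such a sigmoid can be fitted with SciPy (our example on synthetic data; the initial guesses and array sizes are illustrative, not the post's actual values):

import numpy as np
from scipy.optimize import curve_fit

def curve_func(x, a, b, c):
    #same sigmoid form as in the Detector_Worker above
    return -a / (c + np.exp(b * -x))

t = np.linspace(0, 3.4, 17)                     #17 frames at 5 FPS
ydata = curve_func(t, 5.0, 4.0, 1.0)            #synthetic motion profile generated from the model itself
popt, pcov = curve_fit(curve_func, t, ydata, p0=[1.0, 1.0, 1.0])
print(popt)                                     #should recover approximately a=5.0, b=4.0, c=1.0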

Below is an example of a local southbound train passing our Mountain View office. If you’d like to learn more about analyzing time series data, please see Tom Fawcett’s post on avoiding common mistakes with time series data.



You should now have an idea of how to design your own architecture for streaming video processing on an IoT device. We will cover more of this project in future posts, including determination of train speed. Importantly, other false positives like light rail vehicles or large trucks that pass in front of the camera also trigger the sensor. By having a secondary data feed, i.e. audio, we have a second input to determine if a train is passing, using both visual and audio cues.

Later in the Trainspotting series we will also cover how to reduce false positives from freight trains using image recognition. Keep an eye out for future posts in our series, and let us know below if there’s something in particular you’d like to learn about.

The post Streaming Video Analysis in Python appeared first on Silicon Valley Data Science.

Data Digest

10 Key Takeaways from the Chief Analytics Officer, Africa 2016

So often people attend conferences, take loads of notes and then forget about them when they get back to the office. BAU takes over. A lot of value from the event gets lost because other priorities take over. Delegates are not alone...this often happens to me too.

In an effort to combat this I had our post-event survey modified to include the following question:

“What was the one big takeaway you got from the event?"

The motivation behind this was to get each delegate to deliberately think about the one thing they got from the event that will make a difference to their data analytics efforts. The survey is sent out a few days after the event so delegates need to reflect on the sessions and discussions.

In no particular order...

The Top 10 Takeaways from CAO Africa 2016:
  1. Data analytics is not a silver bullet!
  2. Although there are advanced uses of data and analytics, very few companies in South Africa actually have something like an exec level data or analytics position
  3. Data analytics is a fast growing sector in business and becoming more and more valuable as businesses start to understand and buy into analytics
  4. Consumers of information are not all the same, especially pertaining to self-service
  5. Fail forward in this space. Start small, grow credibility and scale thereafter
  6. Data analytics connects business to key objectives of revenue generation, cost management and client management
  7. Take the time to truly understand the insights derived from your data
  8. Building organisational trust in data analytics doesn't happen on its own. It needs a plan and careful execution
  9. Analytics can bring you closer to the business to solve problems together
  10. The talent shortage in South Africa is going to continually drive fierce competition for top-end talent amongst companies that take data analytics seriously
I think these 10 takeaways are a good reflection of the maturity of the South African analytics environment - there are people leading the charge but a lot of work needs to be done to make it a 'go-to' in business.

I'd like to thank all those who attended CAO Africa 2016 either as a delegate, speaker or sponsor. The contributions made by everyone during the presentations and discussion groups help drive the role forward. It's important that a community is developed where best practice can be shared and ideas created by peers.

By Craig Steward:

Craig Steward is the Managing Director for EMEA responsible for developing Corinium’s C-level forums and roundtables across the region. One of Craig’s major objectives is to provide the data & analytics community with informative and valuable information based on research and interactions with senior leaders across EMEA. Contact Craig on
Teradata ANZ

Hadoop, utilities and that pesky bandwagon

A colleague of mine recently asked for help in a workshop with a major global utility customer. As part of his preamble, he described the utility as being in the “trough of disillusionment” phase with their Hadoop implementation. I wasn’t surprised.

Now let’s be clear: this isn’t a Hadoop-bashing blog. Teradata has great relationships with the big names in Hadoop; our ecosystem features Hadoop strongly; and I’m entirely convinced that there’s a place for the technology in the modern, data-driven utility. No, this is a blog about believing the hype. And where that can get you.

Many utilities today have Hadoop implementations at one level or another. Very few have achieved anything worthy of note with that implementation. French networks business Enedis are an exception, it has to be said. But most….well….there’s not really much to say. Why is that? Here are three reasons:

The bandwagon effect
A problem that amazes me every time I encounter it is that utility companies all over the world seem happy to “get some Hadoop” before they have any idea what they might use it for. For example, a utility company I know well will go to tender for “some Hadoop” in the next few months. Their favourite consultancy / systems integrator assures me that the client has no idea what for. They just know they need some. Because how will they be seen as innovative if they don’t have any? Part of this keenness might be due to my second point, below.

The engineering mind-set
In the main, utilities are asset management businesses,[1] run by engineers. Engineers really like new technologies. They like stuff they can play around with. Especially if it’s free. (Which clearly, it isn’t.) And they tend to think that “hey, it can’t be so hard”[2]. Especially since engineers probably already run the SCADA and associated OT systems, often entirely outside the traditional IT domain. Which leads on nicely to my third point.

Lack of skilled resources
Actually, being a Hadoop Mahout [3] isn’t so easy. It’s about managing – and deriving value from – huge amounts of data in an open-source distributed file system. And the state of the art is changing every day. Utilities don’t really have guys with those sorts of skills hanging on hooks waiting for stuff to do. And even if they hire a team, it can be difficult to keep hold of them, given the continued attractiveness of Hadoop skills on a CV. Like it or not, the utility industry is not the most exciting place for the young Big Data specialist.

In summary then: Hadoop is having a hard time in many utilities because, to a great extent, they don’t really know what to do with it. And even if they did, they don’t have the right people to do it. Simple as that. And here’s the kicker: that’s not Hadoop’s fault. It’s your fault.

Hadoop really is a useful technology. But it’s not the answer to all your hopes and dreams. And it doesn’t make you innovative as soon as you spin up a cluster. Yes, you can store huge data sets very cheaply on it. You can even offload certain workloads and storage to it from other systems. Sometimes even from…gasp… Teradata. But it’s not a straight-up replacement for anything. No matter what the hype, or what the unscrupulous vendor says.

Instead, Hadoop is a key part of a wider data & analytics ecosystem. Something that by definition has other components too. Gartner like to call that ecosystem the Logical Data Warehouse. Teradata call it the Unified Data Architecture. It doesn’t matter what you call it. If it’s done right, it’s a seamlessly integrated combination of tools and technologies that meet your data & analytics needs. Needs such as overall cost, of course. But also security; accuracy; availability; usability; the opportunity to discover entirely new insights; and to serve the CEO at the same time as the Data Scientist (if you have them) or the Analyst if you don’t. And all points in between.

If you work for a utility struggling with Hadoop, or about to go to Tender for “some Hadoop”, perhaps you ought to help them think again. Perhaps you should consider how your utility might avoid that trough of disillusionment by taking a different approach. One that looks at the problems they’re trying to solve first and the technology second. If that sounds like it might have some merit….well, we can help. And we’d love to talk to you.



[1] Yes, I know they have customers too. But let’s face it – most still call them consumers and see them as a necessary evil.
[2] I know. I’m one of them.
[3] Elephant handler. It’s all elephant-related in the Hadoop world. No, seriously.

The post Hadoop, utilities and that pesky bandwagon appeared first on International Blog.


October 12, 2016

Revolution Analytics

Tutorial: Scalable R on Spark with SparkR, sparklyr and RevoScaleR

If you'd like to manipulate and analyze very large data sets with the R language, one option is to use R and Apache Spark together. R provides the simple, data-oriented language for specifying...


Revolution Analytics

Upcoming Practical Data Science courses in London, Chicago, Zurich, Oslo and Stockholm

If you'd like to learn how to run R within Azure Machine Learning and SQL Server, you may be interested in these upcoming 4-day Practical Data Science courses, presented by Rafal Lukawiecki from...

Cloud Avenue Hadoop Tips

ASF (Apache Software Foundation) as a standards body

What is ASF all about?

For many, Apache is synonymous with the Apache HTTP server, which is the backbone for serving web pages. But there is much more to Apache. It's a non-profit organization (ASF - Apache Software Foundation) which provides an environment and a platform in which different companies and individuals work in an open and collaborative fashion towards the common goal of developing good software. Open means that all the work (architecture, design, coding, testing, documentation etc.) happens in the open and there are no secrets. Anyone can also download the code, make some changes, compile it and push it back.

It's possible for different companies and individuals like you and me to improve the code and contribute it back to the ASF. To maintain the quality of the software, there is a process in place where a project committer will check the quality of the code contributed by someone and then add it to the code repository. The advantage of working in this model is that any improvements made by an individual or a company can be immediately absorbed by someone else. This is what working in a collaborative fashion means.

There are a lot of Big Data projects under the ASF, like Hadoop, Hive, Pig, HBase and Cassandra, and a lot of non-Big Data projects, like Tomcat, Log4J, Velocity and Struts. The projects usually start with Incubator status and then some of them move to TLP (Top Level Project) status. The code for the different projects can be accessed in a read-only way from here.

How is Apache promoting standards?

OK, now we know how the Apache process works. The different companies and individuals work towards the common goal of creating good software. Now, let's look into why standards are important in software and how Apache promotes standards.

Those who travel internationally and carry at least one electronic item face the problem of different socket layouts and voltages in different countries. The plugs just don't fit into the sockets, hence the need to carry multiple adapters. If there were an international standard for socket layout and voltage, we wouldn't face this problem.

The same applies to software standards. Software standards allow interoperability across different software stacks. As an example, a program written against one piece of software can be easily ported to another if the standards are followed. One example is the JEE standards: an EJB written for JBoss can be easily ported to WebSphere with minimal or no changes.

In the case of Big Data stacks, the different Big Data companies take the software from the Apache Software Foundation and improve on it. Some of the improvements include better documentation, better performance, bug fixes, and better usability in terms of installation/monitoring/alerting.

The Apache code is the common base for the different Big Data distributions. For this reason, the Apache code base and the different distributions like HDP, CDH etc. provide more or less the same API to program against. In this way, Apache is acting as a standards body. For example, a MapReduce program written against the HDP distribution can be run against the other distributions with minimal or no changes.

Although Apache is not formally a standards body, it is still acting as one. Usually, a standards body is formed to develop standards and a reference implementation for them. This is a painstaking process, and some of those standards may never see the light of day. The advantage of working in the Apache fashion is that de facto standards are developed indirectly and very quickly.

Correlation does not imply causation

A popular phrase tossed around when we talk about statistical data is “there is correlation between variables”. However, many people wrongly consider this to be the equivalent of “there is causation between variables”.
VLDB Solutions

Teradata Setup on AWS

Teradata MPP Setup on AWS

As previously mentioned, Teradata MPP (multi-node) setup on AWS is a tad more involved than the previous SMP (single node) setup. Well, as we’re kind-hearted folks here at VLDB, we decided to show you how it’s done. Well, how we did it anyway.

Let’s see if, armed with an Amazon account and a credit card, you can have Teradata up and running on AWS in under an hour, as claimed by Stephen Brobst at the recent Teradata Partners conference.

Tick-tock, here goes…

Teradata AWS Marketplace

First of all, login to the AWS Management Console and select AWS Marketplace (under ‘Additional Resources’ on the right), then search for ‘Teradata Database Enterprise Edition’ in the AWS Marketplace to go to the Teradata Database Enterprise Edition (1-32 nodes) page. From here you will have to subscribe to Teradata if you have not already done so.



Select the required ‘Region’ – in this case ‘EU(Ireland)’ – then click the ‘Continue’ button on the ‘AWS Marketplace Teradata Database Enterprise Edition (1-32 nodes)’ page to go to the ‘Launch on EC2’ page.

Launch Teradata on EC2


Set the following as required:

  • Version – currently only ‘’ available.
  • Region – ‘EU (Ireland)’ is the closest region for the UK.
  • Deployment – local or EBS storage and VPC options.

Click ‘Launch with CloudFormation Console’ to go to the ‘Create Stack/Select Template’ page.

AWS Cloud Formation Stack Template


Ensure you are in the correct AWS region (top right corner next to ‘Support’ drop-down) then click on ‘Next’ to go to the ‘Create Stack/Specify Details’ page.

AWS Cloud Formation Stack Details

Set the following & leave other values as default:

  • Stack name – the name of the stack that will be created in CloudFormation.
  • System Name – the name of the Teradata instance.
  • DBC Password – the password for Teradata DBC user.
  • Number of Nodes – the number of EC2 nodes required (1-32).
  • Instance & Storage Type – select the EC2 instance type and storage type/size required.
  • Availability Zone – choose availability zone from list.
  • Remote Access From – specify the CIDR block (IP range) from which SSH/DBMS access is allowed. An open CIDR (any IP allowed) can be used if required to test the setup process.
  • AWS Key Pair – a pre-existing key pair must be specified.

See the AWS EC2 documentation for help on choosing EC2 instance & storage types. Apart from the number of nodes, this is the biggest driver of the cost of your Teradata on AWS stack.
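If you would rather script this step than click through the console, the same stack can in principle be launched with boto3; a rough sketch (the template URL and parameter keys below are placeholders and must match the actual Teradata CloudFormation template):

import boto3

cf = boto3.client('cloudformation', region_name='eu-west-1')   # EU (Ireland)

response = cf.create_stack(
    StackName='vldb-teradata-demo',
    TemplateURL='https://s3.amazonaws.com/your-bucket/teradata-template.json',   # placeholder URL
    Parameters=[
        # Parameter keys are illustrative; use the keys defined in the real template.
        {'ParameterKey': 'SystemName', 'ParameterValue': 'tdaws'},
        {'ParameterKey': 'NumberOfNodes', 'ParameterValue': '1'},
    ],
)
print(response['StackId'])

# Block until provisioning reaches CREATE_COMPLETE (typically 20-30 minutes)
cf.get_waiter('stack_create_complete').wait(StackName='vldb-teradata-demo')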

Click on ‘Next’ to go to the ‘Create Stack/Options’ page.

AWS Cloud Formation Stack Options


Nothing needs to be set/changed in this page…unless you think otherwise.

Click ‘Create’ to proceed with the stack/instance creation which is monitored via CloudFormation.

AWS Cloud Formation Stack Review

It generally takes between 20 and 30 minutes to provision a single m4.4xlarge EC2 instance with 5TB of EBS storage. The process is the same irrespective of the number of nodes.

There are lots of steps to go through as part of the provisioning. Once complete, the status will change to ‘CREATE_COMPLETE’:


Teradata on AWS Up and Running

Once the stack is up and running the next stage is to connect via SSH, Teradata Tools & Utilities (TTU) and a desktop SQL client. This is quite a big topic in itself and will be covered in a separate blog post.

So, to get back to Mr Brobst, we think it is possible to be up and running with Teradata on AWS in under an hour, but only if the node count is low, and only if you follow a guide that somebody has prepared…such as this one.


The post Teradata Setup on AWS appeared first on VLDB Blog.


October 11, 2016

Revolution Analytics

Watch the world warm with this animated globe, created with R

Due to anthropogenic climate change, the average global temperature has increased steadily over the past decade or so. While we're all familiar with the hockey-stick line chart of rising temperature,...



Media Monitoring vs. Data Harvesting: What’s the Difference?

BrightPlanet often gets lumped in with companies like Radian6, uberVU or Sysomos, but our services are nothing like those of a media monitoring company. No offense to those businesses, but data harvesting is a completely different service for a more involved customer. Here’s a look at the major differences between media monitoring and data harvesting, […] The post Media Monitoring vs. Data Harvesting: What’s the Difference? appeared first on BrightPlanet.

Big Data University

This Week in Data Science (October 11, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

Cool Data Science Videos

The post This Week in Data Science (October 11, 2016) appeared first on Big Data University.


Understanding machine learning #3: Confusion matrix – not all errors are equal

One of the most typical tasks in machine learning is classification. It may seem that evaluating the effectiveness of such a model is easy. But it depends...
Data Digest

JUST RELEASED: Chief Analytics Officer Forum Africa 2016 - Speaker Presentations

The Chief Analytics Officer (CAO) Forum Africa has been designed to bring the senior analytics community together to discuss the most critical data and analytics challenges for areas including finance, human resources, sales and marketing, the supply chain, risk, investment and most importantly, the customer.

Research conducted by Corinium Global Intelligence shows that South African organisations are beginning to realise the value that lies within their data but face numerous challenges to realising this value as they are not yet mature enough in their data & analytics journey to move to a CAO structure.


October 10, 2016

Revolution Analytics

Make ggplot2 graphics interactive with ggiraph

R's ggplot2 package is a well-known tool for producing beautiful static data visualizations that you can include in a printed report. But what if you want to include a ggplot2 graphic on a webpage...

Cloud Avenue Hadoop Tips

What is Docker all about?

There has been a lot of noise about Docker, and there has been a raft of announcements (1, 2, 3) from different companies about support for Docker in their products. In this blog, we will look at what Docker is all about, but before that we will look into what virtualization and LXC are. In fact, Docker-Big Data integration can also be done.

What is Virtualization?

Virtualization allows multiple operating systems to run on a single machine. For this to happen, virtualization software like Xen, KVM, Hyper-V, VMware vSphere or VirtualBox has to be installed. Oracle VirtualBox is free and easy to set up. Once VirtualBox has been installed, multiple guest OSes can be run on top of it, as shown below.
On top of the guest OS, applications can be installed. The main advantage of the above configuration is that the applications are isolated from each other. Also, resources can be allocated to each guest OS, so a single application can't dominate the underlying hardware resources completely and starve the other applications of resources. The main disadvantage is that each guest OS gets its own kernel and file system, and hence consumes a lot of resources.

What is LXC?

LXC (Linux Containers) provides OS-level virtualization and doesn't need a complete OS to be installed, as is the case with virtualization.
The main advantage of LXC is that containers are lightweight, and there is little overhead in running applications on top of LXC instead of directly on top of the host OS. LXC also provides isolation between the different applications deployed in the containers, and resources can be allocated to them.

LXC can be thought of as lightweight machines which consume fewer resources, are easy to start, and on which applications can be deployed.

How does Docker fit into the entire thing?

LXCs are all interesting, but it's not that easy to migrate an application on top of LXC from one environment to another (let's say from development to QA to production), as there are a lot of dependencies between the application and the underlying system.

Docker provides a platform, so that applications can be be built and packaged in a standardized way. The application built, will also include all the dependencies and so there is less friction moving from one environment to another environment. Also, Docker will try to abstract the resources required for the application, so that that application can run in different environments without any changes.

For those from a Java/JEE background, Docker applications can be considered similar to an EAR file which can be migrated from one environment to another without any changes as long as the proper standards are followed while creating the EAR file. The Java applications can make JNDI calls to discover services by name and so are not tied to a particular resource.
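
As a rough, hypothetical illustration of that “package once, run anywhere there is a Docker daemon” idea, here is a minimal sketch using the Docker SDK for Python (the docker package); the image name and command are placeholders, and a running local Docker daemon is assumed:

    # Run the same packaged image, unchanged, on a laptop, a QA box or production;
    # only a Docker daemon is required on the host.
    import docker  # pip install docker

    client = docker.from_env()                      # connect to the local Docker daemon
    logs = client.containers.run(
        "python:3.9-slim",                          # placeholder image bundling the app's dependencies
        ["python", "-c", "print('hello from an isolated container')"],
        remove=True,                                # clean up the container once it exits
    )
    print(logs.decode())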

BTW, here is some nice documentation on Docker.
The Data Lab

The Data Lab welcomes 90 future Data Scientists at the 2016/17 MSc launch event

The Data Lab MSc 2016/17 cohort

The Data Lab MSc  has been developed specifically to build a high-quality pool of ‘data talent’ in Scotland that is equipped with the skills, experience, networks, creativity and confidence to succeed in employment, as entrepreneurs and as researchers.

The Data Lab’s MSc 2016/17 intake is more than doubling to 90 places funded across 7 Scottish universities - up from 40 places funded across 3 universities last year. This extended commitment underlines the growing need to deliver work-ready data skills into Scotland’s business community. Four of the courses selected for The Data Lab MSc are brand-new courses for the 2016/17 academic year in the South-West of Scotland.

All students on The Data Lab MSc are offered a paid, credit-bearing industrial summer placement towards the end of their course. This year we are working with MBN Solutions to deliver the industrial placements programme.

But we are going one step further this time. We have incorporated a challenge-driven learning initiative on top of each institution’s specialist data course that brings together the students from all seven universities in mixed teams for three intensive weekend workshops. Each group of budding data science talent will be tasked with solving real-life issues and generating economic impact for Scotland through data analysis. Through the ambitious Challenge Competition, the cohort of data postgraduates will be charged with using datasets unique to Scotland to develop new insights, ideas, products, services or processes that will deliver social or economic benefit.

The year-long data Challenge Competition, which will be delivered in collaboration with Product Forge, is a requirement for all Data Lab funded MSc students, and the best performing team will be awarded a prize. Teams will use data from open data sites such as the Scottish Government’s Open Data Platform and the Urban Big Data Centre. Students will address issues in three key areas:

  • Identifying problems within business and troubleshooting to deliver a suitable solution.
  • Devising a product or insight to support a start-up business.
  • Analysing data and relevant insights to create a compelling story that is in the public interest in Scotland.

Three weekend-long hackathon events are scheduled which will see teams work in intensive environments to resolve the issues. They will receive guidance and support from The Data Lab as they apply their analytical skills to the datasets.

During the launch event, the 2015/16 placements programme prizes were awarded to the best student and best host organisation:

Best Student:

Best organisation:

Have a look at some of the testimonials from students and organisations that took part in last year's placement programme:


The MSc programme is core to the Data Lab’s aim to unlock the estimated £17bn value of data to Scotland and generate 248 high-value jobs and a pipeline of data science talent to stem the skills gap. Find out more about The Data Lab MSc.


Google Plus

Curt Monash

Notes on anomaly management

Then felt I like some watcher of the skies When a new planet swims into his ken — John Keats, “On First Looking Into Chapman’s Homer” 1. In June I wrote about why anomaly...


Revolution Analytics

In case you missed it: September 2016 roundup

In case you missed them, here are some articles from September of particular interest to R users. The R-Ladies meetups and the Women in R Taskforce support gender diversity in the R community....


October 08, 2016

Revolution Analytics

Because it's Friday: Dear Data

In 2014, information designers Stefanie Posavec (in London) and Giorgia Lupi (in New York) embarked on a year-long "slow data" project. Every week for a year, each of them would collect data on a...

Cloud Avenue Hadoop Tips

Consolidating the blogs

I have a few other websites (around Ubuntu) and (around Big Data, IoT and related technologies) besides the current one. Over time it has become difficult to maintain multiple sites, and I had not been updating the other sites on a regular basis.

So, I decided to consolidate the sites into (the current site). In this process, I will be moving the blogs from the other sites to this site in a phased manner.

Happy reading !!!!
Cloud Avenue Hadoop Tips

Big Data Webinars

Big Data is a moving target, with new companies, frameworks and features getting introduced all the time, and it's getting more and more difficult to keep pace with it all. There is nothing more relaxing than sitting in a chair and watching webinars on some of the latest technologies.
Brighton Beach by appoose81 from  Flickr under CC
So, here (1) is a calendar (XML, ICAL, HTML) with a few of the upcoming webinars around Big Data. I will be populating more events in the calendar as they get planned. Those interested can import the calendar into Thunderbird, Outlook or some other calendar application and stay updated on webinars in the Big Data space.

If you are interested in including any webinar around Big Data in the calendar, then let me know at
Cloud Avenue Hadoop Tips

Wanted interns to work on Big Data Technologies

We are looking for interns to work with us on some of the Big Data technologies at Hyderabad, India. The pay would be appropriate. The intern should preferably be from a Computer Science background, be really passionate about learning new technologies and be ready to stretch a bit. Under our guidance, the intern would install, configure and tune everything from the Linux OS all the way up to Hadoop and the related Big Data frameworks on the cluster. Once the Hadoop cluster has been set up, we have a couple of ideas which we would be implementing on the same cluster.
The immediate advantage is that the intern would be working on one of the current hot technologies and would have direct access to us to know and learn more about Big Data. Based on the requirements, appropriate training would be given around Big Data, and the work being done by the intern will definitely help in getting them through the different Cloudera Certifications.

BTW, we are looking for someone who can work with us full time and not part time. If you or anyone you know is interested in taking an internship, please send an email with your CV at
Cloud Avenue Hadoop Tips

Looking for guest bloggers at

The first entry on this blog was posted on 28th September, 2011. Initially I started blogging as an experiment, but lately I have been having fun and really enjoying it.

Not only has the traffic to the blog been increasing at a very good pace, but I have also been making quite a few acquaintances and getting a lot of nice and interesting opportunities through the blog: offers to write a book, an article, to blog on other sites, and more.

I am looking for guest bloggers for this blog. If you or someone you know is interested, then please let me know

a) a bit about yourself (along with LinkedIn profile)
b) topics you are interested in writing about on this blog
c) references to articles written in the past, if any
I don't want to put a lot of restrictions around this, but here are a few:

a) the article should be authentic
b) no affiliate or promotional links to be included
c) the article can appear elsewhere after 10 days with a back link to the original

I am open to any topics around Big Data, but here are some of the topics I would be particularly interested in:

a) a use case on how your company/startup is using Big Data
b) using R/Python/Mahout/Weka for some interesting data processing
c) integrating different open source frameworks
d) comparing different open source frameworks with similar functionalities
e) ideas and implementation of pet projects or POC (Proof Of Concepts)
f) best practices and recommendations
g) views/opinions on different open source frameworks

As a bonus, if a post gets published here it will also include a brief introduction about the author and a link to his/her LinkedIn profile, which should give the author some good publicity.

If you are a rookie and writing for the first time, that shouldn't be a problem. Everything begins with a simple start. Please let me know at if you are interested in blogging here.
Cloud Avenue Hadoop Tips

Snap Circuits Jr. from Elenco for kids

Lately I have been getting hands-on with the Raspberry Pi and Arduino prototyping kits to get started with IoT (Internet of Things). Being a graduate in Electrical and Electronics Engineering helped me get started quickly.

I built a car with 4 DC motors and controlled it using a TV remote control, which emits an IR (infrared) signal. The IR signal was captured by an IR sensor on the car, and based on the button pressed on the remote control an Arduino program controlled the speed of the motors. This made the car move in different directions. It was not too complicated, but it was fun to build.

All the time I was building the car, my son used to tag along with me, asking questions about what the different parts were and how they function. To keep his interest going, I bought him a Snap Circuits Jr. SC-100 and he was more than happy to build different circuits with it. Snap Circuits comes in different models, like the Snap Circuits SC-300, the Snap Circuits PRO SC-500 and finally the Snap Circuits Extreme SC-750. The Snap Circuits kits can be upgraded easily; there is a Snap Circuits UC-30 Upgrade Kit from SC-100 to SC-300, among others.

The Snap Circuits parts are a bit rugged and don't get damaged very easily; the only thing which needs to be taken care of is the polarity of the components when connecting them. There is no need to solder, as the different components snap together using push buttons. Batteries are the only items which need to be bought separately. It comes with a manual/guide explaining the different components, troubleshooting tips, circuit diagrams etc., which is really nice.

As the name says, connecting them is a snap, and fun. I would very much recommend it for anyone who wants to get their kids started with electronics.

Simplified Analytics

Mobile enablement in Digital age

Gone are the days when we used to carry big fat wallet filled with cash, coins, multiple credit cards, business cards, travel tickets, movie tickets, personal notes, papers with names, numbers and...


October 06, 2016

Silicon Valley Data Science

With Data, Ask “What” Before “How”

At SVDS, we encourage our clients to distinguish between what they want to accomplish with data and how they’re going to accomplish it. The what should focus on strategic business outcomes, whereas the how is about the tools, techniques, and architecture required to drive those outcomes. At the Strata + Hadoop World conference in New York last week, we gave two tutorial sessions:

In addition to half-day tutorials, the conference offered an impressive 16 tracks of session talks. A lot of them focused on the tools that everyone is excited about right now: there was a clear emphasis on Spark, and plenty of sessions about Hadoop and its various ecosystem components such as HDFS, as well as some sessions on Google tools like BigQuery and TensorFlow. But I was more curious about the what—the goals people are using data science to accomplish, and the value they’re creating.

Three-dimensional visualizations for interaction

Data visualization is a topic near and dear to my heart, and I’ve been fascinated for years by the ways in which our brains inherently interpret certain kinds of visual information (or cognitive perception). The size of an object and its position in space are the two most significant ways we determine something’s importance, and enlisting these properties to represent the relationships between data points is fraught with challenges.

This is true whether you’re working in two dimensions or three, but the transition from page or screen to the entire environment around you is particularly complicated. It was inspiring, then, so learn from Brad Sarsfield, Principal Software Architect in Data Science, how the Microsoft HoloLens is addressing some of these challenges in his session, “Holographic data visualizations: welcome to the real world.”

I admit that I went in full of skepticism and low expectations, but I left feeling very impressed. His team has clearly put a lot of thought into the entire user experience—not only how to represent data with an additional dimension but also how to make sure the user can manipulate it easily. Instead of simply moving a 2D experience into virtual reality and slapping on a bit of depth perception, they’ve carefully curated the entire feature set. Gone are traditional menus, which work great in two dimensions but are incredibly awkward in three. New are motion cues that help demonstrate relative depth, as well as “handles” for graphs and charts so they can be easily grabbed and rotated or otherwise manipulated in space. Transitions have been carefully crafted to avoid creating a feeling of motion sickness.

While there is still room for additional polish and fine-tuning, this was the first demonstration of 3D data visualization that gave me a feel for an entirely new realm of what is possible: interacting with information in the full three dimensions that we’re used to living in every day.

Targeted information for first responders

Bart van Leeuwen is a Data Architect who runs his own company, Netage, but he’s also a volunteer firefighter. His talk, “Smart Data for Smarter Firefighters,” explained how hard it can be to get the necessary information in a timely manner. When the alarm goes off at the station, firefighters are in the truck and on the move right away, and then there may only be a few precious minutes on the road in which to learn a wide variety of information such as:

  • Traffic conditions that may determine the fastest route to the fire
  • Building conditions or structures that may determine their fire-fighting strategy
  • Occupancy or other information about who may be inside
  • Special circumstances, such as chemicals stored at the scene, that may determine whether typical water hoses can be used

He described how fire trucks are currently stocked with multiple devices to try to provide this information in real time—but also demonstrated how little time there really is to look at them all. And once the fire truck has reached the scene, as he put it: “No one wants to see you stand around and look at an iPad for two minutes. They expect you to get in there.”

The data science challenge here is to turn big data—geographic data, traffic data, blueprint data, residency data, business data, and more—into small data: targeted information that can be absorbed by first responders on demand and on the fly in a matter of moments. This is the challenge that van Leeuwen is working to address, and it’s one that can make a very real difference to everyday people.

Transparent algorithms for crime prevention

Brett Goldstein is a Managing Partner at Ekistic Ventures, “a fund that is actively cultivating a portfolio of disruptive companies that bring new solutions to critical urban problems.” He has previously served as the Commissioner and Chief Data Officer for the Department of Innovation and Technology in the City of Chicago, and before that as Director of the Predictive Analytics Group, Counterterrorism and Intelligence Division, for the Chicago Police Department. His talk, “Thinking Outside the Black Box: the imperative for accountability and transparency in predictive analytics,” focused on the critical issue of bias in predictive policing models.

Goldstein argued that the familiar exhortation to “show your work”—the one we all rolled our eyes at in high school math classes—has never been so important as it is with algorithms, using predictive policing as a prime example. He shared some test results for the Chicago algorithm for classifying the risk-level of crime commission by various people, which showed both false positives for people of color and false negatives for white people at rates over 40%.

The ability to explain how a model works, and to test its accuracy according to different parameters, is critical to achieving public accountability and improving that model. When it comes to something like predictive policing, this kind of algorithmic transparency is what makes the difference in ensuring both justice and the successful prevention of actual crime.

How about you?

These are a few of the sessions that jumped out at me as particularly interesting and significant. Were you at Strata + Hadoop World this year? What jumped out at you? I’d love to hear about your favorite sessions in the comments—or about what you’re doing with data in your own organization.

To see videos from Strata + Hadoop World, check out their YouTube playlist. And if you’re interested in the slides from our tutorials, you can request them here.

The post With Data, Ask “What” Before “How” appeared first on Silicon Valley Data Science.

Data Digest

Pixar's 22 Rules for Phenomenal Storytelling: A CAO's Guide

Picture the scene. You and your team of analysts, statisticians and data scientists have just developed a new model that predicts the buying behaviours of your most premium clients. You believe the insights will grow revenue from this segment by 40% and reduce churn by 20%. The model works because the data set was good and the team had a moment of genius.

You head over to the CMO (or whoever asked for the insights). You sit them in a room and deliver an amazing presentation based on propensity scores and other technical information.

The mood falls flat. They don't get it, they don't believe it and they're not going to use it.

What went wrong?

Recently, Corinium ran the Chief Analytics Officer Forum in Johannesburg, and the majority of the discussion centered on the ability of analytics professionals to tell the story of their insights, and, importantly, to do so in a way that engages the business and incites action.

You've got to speak to your audience in a way they understand!

I've written an article about this in the past, but while sitting in Annie Symington's presentation I was introduced to "Pixar's 22 Rules for Phenomenal Storytelling", a guide that Annie uses consistently when she and her team deliver insights to the rest of the business, alongside infographics, storyboards and other creative devices.

Never mind the fact that I liked the concept; the delegates at the event totally bought into the idea that they need to look outside of their own worlds for ideas on how to bring analytical insights to life.

Rule #8 resonated with a number of other presentations: "Finish your story, let go even if it’s not perfect. In an ideal world, you have to move on and do better next time."

Finish your story, let go even if it’s not perfect. In an ideal world, you have to move on and do better next time.

Put a different way, "done is better than perfect." Rather get a model that is 95% accurate out in a short time frame than wait two months for one that is 98% accurate. Take your story to the business! Get the business moving.

Since I have already written an article entitled "Business Storytellers - The New Job Title for Analytics Professionals", the point of this article was simply to introduce you to Pixar's rules: an introduction to a different way of thinking.

By Craig Steward:

Craig Steward is the Managing Director for EMEA responsible for developing Corinium’s C-level forums and roundtables across the region. One of Craig’s major objectives is to provide the data & analytics community with informative and valuable information based on research and interactions with senior leaders across EMEA. Contact Craig on

How Marketers use Machine Learning in Retail

Machine learning is revolutionising how companies are capitalising on Big Data to develop their marketing strategies. While the term encompasses a broad spectrum of technologies and approaches, in a marketing context it can be used to improve targeting, response rates and overall marketing ROI. To put it simply, machine learning involves the automated analysis of large volumes of data – such as consumer spending habits and purchasing behaviour, as well as demographic information – and using a mathematical algorithm and a computer to identify patterns and trends. The algorithm then tests predictions based on historical campaign data and learns from the predictions it gets right. With time, these algorithms become highly accurate as more data from campaign results is added.
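
As a hypothetical sketch of that workflow (not any particular vendor's implementation), the few lines of Python below train a simple response model on historical campaign data and then score a new audience by purchase propensity; the file names and column names are invented for illustration:

    # Minimal propensity-modelling sketch: learn from past campaign responses,
    # then rank new customers by predicted likelihood to respond.
    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    history = pd.read_csv("historical_campaigns.csv")           # hypothetical file
    features = ["avg_monthly_spend", "visits_last_90d", "age"]  # hypothetical columns
    X_train, X_test, y_train, y_test = train_test_split(
        history[features], history["responded"], test_size=0.2, random_state=42)

    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("hold-out accuracy:", model.score(X_test, y_test))

    new_audience = pd.read_csv("new_audience.csv")              # hypothetical file
    new_audience["propensity"] = model.predict_proba(new_audience[features])[:, 1]
    print(new_audience.sort_values("propensity", ascending=False).head())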

Teradata ANZ

Machine Learning and the colours of Haute Couture…

Machine Learning and Haute Couture may not be what you expect to hear in the same sentence. But can Machine Learning help the creative side of fashion?

There are plenty of examples where machine learning is applied within fashion retailing to do predictive stock assortments, trend forecasting and much more, but I was interested in the more creative side of fashion.

One example is Google’s Muze project. It hasn’t exactly received universal acclaim and OMG! personally I would not be seen dead in it! This is not a work of a Central Saint Martins’ graduate!

One can nonetheless still appreciate the complexity of mathematics and algorithms that went into it. Its quite brilliant in respect of getting a machine to make an art form, but I don’t expect neural networks to replace a creative director any time soon.

So what about other “creative” tasks? Can a machine be any good at them? Can we look at the creative output and understand how to spot trends and patterns?

Fashion magazines spend a significant amount of effort to provide pictures from the latest fashion shows with predictions for the next season’s fashionable colours and styles. Here is an aggregate example of the trends from recent shows.


I was curious as to how this is done: machine or human? How long does it take? A little googling turned up a quote from the Pantone Colour Institute’s executive director, Ms Eiseman, revealing that they rely on the intuitive and numeric abilities of their team members, who attend the fashion shows: “When you’re seeing the colours and the clothing coming down the runway, you get a pretty good picture of what value the colour is and what tonality it is. We also record them. In the end, it really is a question of numbers – how many people are using these variations of colours”.

Then I decided to collate and estimate the data available for this challenge. There are 4 major fashion weeks in September alone – New York, Paris, Milan and London. Each of these features over 120 shows per week, and each show on average showcases 15 outfits. That makes approximately 7200 outfits. A team has to “intuitively” document around 240 outfits per day, every day, for the whole month, attending on average 15.8 shows per day, or 5.2 hours of shows daily, not counting travelling time between them. I hope they have a big team, or everything blurs!
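
The arithmetic behind those figures is easy to sanity-check in a couple of lines of Python:

    # Sanity check of the back-of-the-envelope figures quoted above.
    weeks, shows_per_week, outfits_per_show = 4, 120, 15
    outfits = weeks * shows_per_week * outfits_per_show   # 4 * 120 * 15 = 7200 outfits
    outfits_per_day = outfits / 30                        # ~240 outfits to document per day
    print(outfits, outfits_per_day)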

So can this process be automated to some extent? Computer algorithms for colour quantisation have been studied since the 1970s, and there are plenty of choices: K-means, PCA, etc. So I decided to try a “standard” K-means to see how easy it is. It is worth mentioning that in this exercise the entire image is processed, not just the outfit. Why? Because the lighting, the colour of the runway, the background and even the colour of the model’s hair are all part of the overall ambience that a designer is trying to convey.

K-means is probably one of the simplest machine-learning algorithms to understand. Every colour can be represented in the RGB model – a vector in 3 dimensions – and K-means groups similar vectors together based on their proximity to each other.
One major drawback of K-means is that it requires a predefined number of clusters in which to group similar colours.
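
The original analysis was done in R, but here is a rough Python sketch of the same idea; the image file name and the choice of k are placeholders:

    # Colour quantisation with K-means: every pixel is an RGB vector in 3D space,
    # and the cluster centres become the picture's colour palette.
    import numpy as np
    from PIL import Image
    from sklearn.cluster import KMeans

    img = np.asarray(Image.open("runway_look.jpg").convert("RGB"))   # placeholder file
    pixels = img.reshape(-1, 3).astype(float)     # one row per pixel: R, G, B

    k = 8                                         # number of palette colours to extract
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(pixels)

    palette = km.cluster_centers_.round().astype(int)            # k representative colours
    shares = np.bincount(km.labels_, minlength=k) / len(pixels)  # fraction of pixels per colour
    for colour, share in sorted(zip(palette.tolist(), shares), key=lambda t: -t[1]):
        print(f"RGB {colour}: {share:.1%} of the image")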

So how many clusters should I request? I guess it depends on what we are trying to achieve. I had two ideas:
1. Try to estimate the number of major colours and produce a colour chart
2. Try to find out what the most popular colours are.

To answer the first question, one approach is to let a machine “decide” the “best” number of colours to extract. There are several techniques that allow you to do that; for the purpose of this blog I used a change point detection algorithm to find out whether there is a significant change in clustering quality as the number of clusters grows.
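
The article relies on a change point detection algorithm for this step; as a cruder stand-in, here is a hypothetical Python sketch that watches how the K-means inertia (within-cluster sum of squares) improves as k grows and stops when the gains become small:

    # Crude "let the machine pick k" heuristic: keep adding clusters while each
    # extra cluster still buys a sizeable drop in inertia. This is an elbow-style
    # approximation, not the specific change point detection method used here.
    import numpy as np
    from sklearn.cluster import KMeans

    def suggest_k(pixels, k_max=15, min_gain=0.05):
        inertias = [KMeans(n_clusters=k, n_init=5, random_state=0).fit(pixels).inertia_
                    for k in range(1, k_max + 1)]
        gains = -np.diff(inertias)                  # improvement from each extra cluster
        useful = np.where(gains > min_gain * gains[0])[0]
        return int(useful[-1]) + 2 if len(useful) else 1   # +2 maps the diff index back to k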

Here are the results from the Autumn-Winter Couture collection 2016.



And no fashion show is complete without a celebrity; Celine Dion is here with one of my favourites, Giambattista Valli.

To begin to answer the second question, we don’t even need machine learning; we can simply extract every pixel colour from an image and find the most “popular” colours.

For example, below is the histogram of the 50 most prominent colours based on Giambattista’s picture above. This is a simple group-and-count.
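
Again, the article's code is in R; a Python equivalent of the same group-and-count is only a few lines (the file name is a placeholder):

    # Count exact pixel colours and keep the 50 most frequent ones.
    from collections import Counter
    import numpy as np
    from PIL import Image

    img = np.asarray(Image.open("giambattista_look.jpg").convert("RGB"))  # placeholder file
    counts = Counter(map(tuple, img.reshape(-1, 3)))
    for rgb, n in counts.most_common(50):
        print(rgb, n)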

The R code is only a few lines long, and processing time for each picture on my laptop is about 3 seconds, so a show of 15 outfits takes roughly 45 seconds.

OK, this is a pet project with only a few pictures, but what about the volume of pictures from each fashion week and each season?

Below is a back of the envelope calculation.


There are 9900 pictures to process per season (Summer/Winter). My laptop will struggle. A more powerful engine like Teradata Aster would be a much better choice.

I don’t claim that K-means is the only and best algorithm to use for colour quantisation, or that machines are better than humans, but it does produce comparably good results in a fraction of the time and cost. With this data, trends can be highlighted and we can reverse-engineer the images and outfits that best represent these trends to summarise the season.

Then I realised the downside of this project: I have talked myself out of a ‘research’ trip to Paris, which obviously would have required attendance at the after-show parties and much champagne!

The post Machine Learning and the colours of Haute Couture… appeared first on International Blog.