
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


January 17, 2017

Revolution Analytics

Git Gud with Git and R

If you're doing any kind of in-depth programming in the R language (say, creating a report in Rmarkdown, or developing a package) you might want to consider using a version-control system. And if you...

Big Data University

This Week in Data Science (January 17, 2017)

Here are some stories from this week in Data Science and Big Data. Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Cool Data Science Videos

Teradata ANZ

Talking about real-time analytics? Be clear about what’s on offer

The inexorable increase in competition around the globe has led to an explosion of interest in real-time and near real-time systems.

Yet despite all this understandable attention, many businesses still struggle to define what “real-time” actually means.

A merchandiser at a big box retailer, for example, may want a sales dashboard that is updated several times a day, whereas a marketing manager at a mobile Telco wants the capability to automatically send offers to customers within seconds of them tripping a geo-fence. Her friend in capital markets trading, meanwhile, may have expectations of “real-time” systems that are measured in microseconds.

Since appropriate solutions to these different problems typically require very different architectures, technologies and implementation patterns, knowing which “real-time” we are dealing with really matters.

Before you start, think about your goals

Real-time systems are usually about detecting an event – and then making a smart decision about how to react to it. The Observe-Orient-Decide-Act or “OODA loop” gives us a useful model for the decision-making process. Here are some tips for business leaders about how to minimise confusion when engaging with IT at the start of a real-time project:

  1. Understand how the event that we wish to respond to will be detected. Bear in mind that this can be tough – especially if the “event” we care about is something that should happen but does not, or one that represents the conjunction of multiple events from across the business.
  2. Clarify who will be making the decision – man, or machine? Humans have powers of discretion that machines sometimes lack, but are much slower than a silicon-based system, and can only make decisions one at a time. If we choose to put a human in the loop, we are normally in “please-update-my-dashboard-faster-and-more-often” territory.
  3. Be clear about decision-latency. Think about how soon after a business event you need to take a decision and then implement it. You also need to understand whether decision-latency and data-latency are the same. Sometimes a good decision can be made now on the basis of older data; sometimes you need the latest, most up-to-date information to make the right choices.
  4. Balance decision-sophistication with data-availability. Do you need to use more, potentially older, data to take a good decision, or can you make a “good enough” decision with less data? Think that through.

Can you win at both ends?

Let’s consider what is required if you want to send a customer an offer in near real-time when she is within half-a-mile of a particular store or outlet. If the offer is triggered solely because she has tripped a geo-fence, then all that is required is information about where the customer is now.

But you will certainly need access to other data if you want to know whether the same offer has been made to her before and how she responded. Or if you want to know which offers customers with similar patterns of behaviour have responded to in the last six months. That additional data is likely to be stored outside the streaming system.

Providing a more sophisticated and personalised offer to this customer will cost the time it takes to fetch and process that data, so “good”, here, may be the enemy of “fast”. We might need to choose between “OK right now” or “great, a little later”.  That trade-off is normally very dependent on use-case, channel and application.

Rigging the game in your favour

Of course, I can try to rig the system – by working out beforehand the next-best actions for a variety of scenarios I can foresee, instead of retrieving the underlying data and crunching it in response to events I have just detected. With this kind of preparation, I can at least try to be both fast and good.

But then the price I pay is reduced flexibility and increased complexity. And the decision is based on the data from our previous interactions, not the latest data.

All these options come with different costs and benefits and there is no wrong answer – they are all more or less appropriate in different scenarios.  But make sure that you understand your requirements before IT starts evaluating streaming and in-memory technologies for a real-time system.

The post Talking about real-time analytics? Be clear about what’s on offer appeared first on International Blog.


January 16, 2017

Revolution Analytics

The fivethirtyeight R package

Andrew Flowers, quantitative editor, announced at last week's RStudio conference the availability of a new R package containing data and analyses from some of their data...

Data Digest

Want an honest measure of your customer centricity? Try this.

When Senior Executives are asked “How Customer-Centric is your company?”, there are typically two answers:

  1. We are very Customer-Centric
  2. We are way behind on this
The strange thing is that in both cases, many of them are wrong.

Those that think they are very Customer-Centric are often siloed, disconnected and slow, but refer to their own personal approach as Customer-Centric. Those that think they are way behind are often ahead of the curve – or at least on par with their peers in the industry – and are just being hard on themselves, or don't really have a benchmark to compare against. So, given this, what are some simple ways to gauge Customer Centricity?

Benchmark against Competitors

It can be hard to get hold of competitor information – not impossible, and easier than you might think if you really go looking – but, at best, you can only get hold of their metrics and see how you compare. After all, who wants to tell their competitors what they are doing or working on?

Having said that, there are things that affect CX that you can have access to. Website responsiveness and UX, enquiry responsiveness, converted churn, third party research, social analytics, online reviews and personally buying their product, to name a few. These are things you can easily glean to form an intelligent opinion as to how you compare relative to your peers/industry.

Gather Feedback

For those who have driven their CX towards Data, you likely have a more realistic view of what is being achieved. In my opinion (see what I did there?), CX and Analytics must go hand in hand. Feedback and a simple measure like Net Promoter Score (NPS) should be part of what you are capturing.

Whilst NPS is only a part of what you should be capturing, it can give great insight into potential risks and opportunities.
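As a concrete illustration, NPS is just the percentage of promoters (scores 9-10) minus the percentage of detractors (scores 0-6). A minimal sketch, with made-up survey scores:

```python
def nps(scores):
    """Net Promoter Score from 0-10 survey responses."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100.0 * (promoters - detractors) / len(scores)

survey = [10, 9, 9, 8, 7, 6, 10, 3, 9, 8]   # made-up responses
print(nps(survey))  # 5 promoters, 2 detractors out of 10 -> 30.0
```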

Pay Attention to Social Media

A former colleague of mine, Alistair Clemett of The Customer Experience Company summarises this here better than I could. Depending on the type of business you are in, there is likely more insight to be gained from Social Media than surveys. Both have their faults but can be some of your most powerful ways to measure CX.

Measure Responsiveness

For me, there are three major elements here:

  • Website - Page load, forms, UX, etc.
  • Enquiry - When someone asks a question or registers to find information, how quickly do you respond? Think in seconds rather than the hours or days many companies think in.
  • Service Recovery - How simply and quickly do you solve your customer's problems?

Try this at home!

There are many more options, but these can be some of the simplest areas where you can get started. With or without these options, there is still one more thing you should be doing (and never stop doing) to get an honest, unfiltered measure of your customer centricity: Experience being a customer for yourself.

Look for information on your site, fill in forms, inquire, buy (if you can), speak to agents, use the product and generally see things from your customer’s perspective by being a customer. You will often find some of the simplest fixes this way.

If you really are curious you will find answers you otherwise wouldn't have.

For updates on Data, Analytics, Customer, Digital Innovation, follow Corinium on Twitter @coriniumglobal and Instagram @coriniumglobal

By Ben Shipley:

Ben Shipley is the Partnerships Director at Corinium Global Intelligence. You may reach him at  Twitter: @benjaminshipley LinkedIn: 

Data Digest

5 Ways to Start Your Data Governance Framework Right

Chief Data Officer at UNSW, Kate Carruthers, shares her top tips for getting started with data and information governance.

Corinium: We are looking forward to hearing from you at the CDAO Sydney event speaking on your data governance journey at UNSW. Establishing a data governance framework is inevitably challenging. What are the key cornerstones of any successful data governance framework?

Kate Carruthers: Clarify your mandate. Get your policies and procedures sorted out early. An official policy clarifies your mandate for running the data governance program and can assist in obtaining buy-in.  My starting point was a definition:

“Data governance is the organisation and implementation of policies, procedures, structure, roles, and responsibilities which outline and enforce rules of engagement, decision rights, and accountabilities for the effective management of information assets.”

Source: John Ladley, Data Governance: How to Design, Deploy and Sustain an Effective Data Governance Program, 2012

Set up an effective governance structure. This seems like an obvious thing, but many organisations struggle with it. Getting the right structure set up and the right people involved is critical to success. I have set up a Data Governance Steering Committee (DGSC), which has oversight of the entire program, with cross-organisational executive involvement, and it has been very important in obtaining credibility. The DGSC is supported by another Committee, which takes a more hands-on, day-to-day role in deciding how we manage data across the organisation. We also work closely with IT, Privacy, Procurement, and Legal to ensure that they are involved in the data governance program.


Make a start. Typically, in a large organisation, it can be daunting to consider data governance and to know where to start. Find an area of the organisation that has some willing people and just get started. This lets you demonstrate success and leverage that success to get the next area of the organisation involved.

Take inspiration from other organisations. Don’t feel the need to invent data governance from scratch. Talk to other practitioners – they’re usually delighted to find a fellow traveler. Find groups where data and information governance folks hang out, like The Data Governance Institute: The DGI, Information Governance ANZ or the Data Management Association Australia (DAMA). The kind folks at DG @ Stanford University were particularly helpful to me in the early days.

Ignore the vendors. There are a plethora of vendors who say they have solutions for data governance. Ignore them. It is not about the tools, it is about practice and culture.

Corinium: You’ve once said that when it comes to building a culture for data governance, “work with the willing and win hearts.” What are your top tips for achieving a truly data-driven culture?

Kate Carruthers: We are still very early on in the journey towards a data-driven culture at UNSW. However, a combination of a good policy and standards framework, together with the right tools to enable people to understand and manage their data are key foundations.

Corinium: Tell us your view on data ownership. Do you agree that a territorial stance on data ownership is one of the greatest challenges in establishing the CDO role?

Kate Carruthers: Not at all. The Chief Data Officer role at UNSW is part of the business and I’m here to help data owners across the organisation to truly own and understand their data.  Another key relationship is with IT, and I work very closely with them to ensure that we build effective governance for the business.

Corinium: What do you see as the key benefits to an organisation in having a CDO role?

Kate Carruthers:
Increasingly, data is seen as a critical corporate asset that needs to be managed effectively. Having someone who is responsible for facilitating discussions about how the organisation determines its use of new, existing, and legacy information assets is critical. Additionally, CDOs can lead the debate on digital ethics, privacy and other regulatory issues relating to data.

Corinium: What do you believe are the top 3 qualities needed to be a successful CDO?

Kate Carruthers: Solid understanding of the business, ability to listen to and understand the needs of business and IT colleagues, and an understanding of data and related technologies.

Corinium: What are your key priorities for investment in the next year?

Kate Carruthers: At UNSW, we’re all about delivering on the UNSW 2025 Strategy, and there is a huge array of projects kicking off. In the business-as-usual space, however, I am particularly concerned with the Data & Information Governance program and implementing the Cybersecurity Program, including rollout of the Information Security Management System and the Data Classification process. These are the key foundations for our strategic data initiatives. We are also looking at our next-generation analytics to support the 2025 Strategy.

Corinium: What is the one piece of advice you would give to a CDO who is assuming the role for the very first time?

Kate Carruthers: Take your time to understand the context and get to know the business priorities.

Corinium: What are the biggest trends you envisage dominating the data analytics space over the coming year?

Kate Carruthers: The legacy data warehouses will linger, but Hadoop and predictive analytics will continue their growth. I’m expecting to see tools for managing real-time data streaming go mainstream – we’ve already experimented with Amazon Kinesis during 2016. And of course, more cloud-based offerings as vendors try to catch up.

To hear more from Kate, reserve your seat at Chief Data and Analytics Officer, Sydney, taking place on 6-8 March 2017.

For more information visit:  


January 14, 2017

Simplified Analytics

This is how Analytics is changing the game of Sports!!

Analytics and Big Data have disrupted many industries, and now they are on the edge of scoring major points in sports. Over the past few years, the world of sports has experienced an explosion in the...


January 13, 2017

Revolution Analytics

Because it's Friday: Code Burn

I was unaware of the work of Jenn Schiffer until recently. At the risk of giving away the joke, she writes satire for coders. Some of her best pieces include: A Call For Web Developers To Deprecate...


Revolution Analytics

Microsoft R Server tips from the Tiger Team

The Microsoft R Server Tiger Team assists customers around the world to implement large-scale analytic solutions. Along the way, they discover useful tips and best practices, and share them on the...


Mario Meir-Huber

Banking System

Standfore – Banking system

The world of finance is one of the most sensitive sectors of the economy, as it dictates how money flows and how trades are made. For that reason, banking systems have to be given the utmost priority, as this ensures that users are able to trust the system.
The Standfore banking system is designed, prototyped, tested and built on the premise that secure information flows enhance daily business transactions, as well as giving businesses and governments confidence that their money will be handled with discretion. With a firewall securing and filtering the data that goes in or out of the system, you can rest assured that unauthorized intrusions will be thwarted and your system administrators alerted in the event of any suspicious or fraudulent-looking activity.
Security plays a huge role in most modern systems, as the threat of attack is always present: malware, information theft in transit, denial-of-service attacks and many other varied attacks could cripple a bank and halt its operations. If information is stolen while in transit to the online banking system, there has to be a way of ensuring that it is useless when it falls into the wrong hands. Encrypting all data in motion with strong digital keys prevents snooping, ensuring that all communications with the banking system are verifiable and secure.
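Standfore's actual mechanisms are not described here, but the "verifiable" part of such communications can be illustrated with a message authentication code from the Python standard library. The key and message below are placeholders; a real system would layer this under TLS with proper key management.

```python
import hashlib
import hmac

key = b"shared-secret-key"                            # placeholder key
message = b"transfer:acct123->acct456:amount=100.00"  # placeholder message

# The sender attaches an HMAC-SHA256 tag to the message.
tag = hmac.new(key, message, hashlib.sha256).hexdigest()

# The receiver recomputes the tag and compares in constant time;
# any tampering with the message (or a wrong key) makes the check fail.
expected = hmac.new(key, message, hashlib.sha256).hexdigest()
ok = hmac.compare_digest(tag, expected)
print(ok)  # True
```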
The banking platform software also needs to be fast, giving customers confidence that no matter what time of day or night they log into the system, their account details will be secure and transactions will be easy, quick and efficient. Building the system on reliable infrastructure is one of the ways to ensure that service delivery is quick and dependable, so that customers and other partners come to trust the system and use it more often.
To keep track of all the transactions it handles, the banking system uses databases that encrypt all of their content, making it unintelligible to outsiders while keeping the integrity of the information intact. This way, customers can be assured that their sensitive information is stored in digital vaults that will withstand all kinds of attacks. This is what the Standfore banking solutions software has been designed and built for in the modern digital age.

The post Banking System appeared first on Techblogger.


January 12, 2017

Revolution Analytics

Education Analytics with R and Cortana Intelligence Suite

By Fang Zhou, Microsoft Data Scientist; Hong Ooi, Microsoft Senior Data Scientist; and Graham Williams, Microsoft Director of Data Science Education is a relatively late adopter of predictive...


January 11, 2017

Revolution Analytics

In case you missed it: December 2016 roundup

In case you missed them, here are some articles from December of particular interest to R users. Power BI now has a gallery of custom visualizations built with R. Chicago's Department of Public...


January 10, 2017

Revolution Analytics

The anatomy of a useful chart: NOAA's flood forecasts

With thanks to NOAA's incredible data gathering and forecasting activities, I've been obsessed with this chart for the past few days: We used to live near the Napa river where this river gage is...

Jean Francois Puget

A Nice Optimization Problem From Santa Claus


Kaggle is a site best known for hosting machine learning competitions.  However, once a year, the Kaggle team runs an optimization competition on some problem Santa Claus could face. 

This year's competition is a stochastic optimization problem: we are asked to optimize some outcome when the data is only known with some uncertainty.  Many real-world problems are of this form. For instance, optimizing store replenishment and inventory levels takes sales forecasts as input.  By definition, future sales are only known up to some prediction uncertainty.  In a case like this, one can optimize for the worst case: find a replenishment plan that minimizes the likelihood of stock-outs.  I could go on and expand on this, but let's get back to the Kaggle competition for now.

The Problem

This year's competition description is the following:

♫ Bells are ringing, children singing, all is merry and bright. Santa's elves made a big mistake, now he needs your help tonight ♫

All was well in Santa's workshop. The gifts were made, the route was planned, the naughty and nice list complete. Santa thought this would finally be the year he didn't need Kaggle's help with his combinatorial conundrums. At last, the Claus family could take the elves and reindeer on that well deserved vacation to the South Pole.

Then, with just days until the big night, Santa received an email from a panicked database admin elf. Attached was a server log with the six least jolly words a jolly old St. Nick could read:


One of the North Pole elf interns had mistakenly deleted the weights for all of the inventory in the workshop! Santa didn't have a backup (remember, this is a guy who makes a list and checks it twice) and, without knowing each present's weight, he didn't know how he would safely pack his many gift bags. Gifts were already on their way to the sleigh packing facility and there wasn't time to re-weigh all the presents. It was once again necessary to summon the holiday talents of Kaggle's elite.

Can you help Santa fill his multiple bags with sets of uncertain gifts? Save the season by turning Santa's uncertain probabilities into presents for good little boys and girls.

The data section contains additional information:

Santa has 1000 bags to fill with 9 types of gifts. Due to regulations at the North Pole workshop, no bag can contain more than 50 pounds of gifts. If a bag is overweight, it is confiscated by regulators from the North Pole Department of Labor without warning! Even Santa has to worry about throwing out his bad back.

Each present has a fixed weight, but the individual weights are unknown. The weights for each present type are not identical because the elves make them in many types and sizes.

It then provides a way to compute the weight distributions of each gift type.  Details are available on the Kaggle site; just follow the above link. 

It is definitely a stochastic optimization problem: we are asked to optimize the weights of the gifts Santa can distribute, when these weights are only known via probability distributions.  I decided to approach this as a stochastic cutting-stock problem.  All the code used here is available in a notebook on the Kaggle site.  The code can be run for free by anyone using the DOcplexcloud service, or with the freely available academic version of CPLEX if you are eligible for it.

The Model

One way to approach the competition is to look for a solution structure that has a good chance of yielding a good submission. A solution structure is defined by a number of bag types, plus the number of occurrences of each bag type. A bag type is defined by the number of gifts of each type it contains – for instance, 3 blocks and 1 train.

We can focus on bag types because all bags have the same capacity (50 pounds).

There is a finite number of possible bag types. We define one random variable per bag type: the weight of a bag of that type.

All we need is an estimate of the expected value and the variance of each possible bag type. Then we use two properties to find a combination of bags that maximizes a combination of expected value and standard deviation:

  • the expected value of a sum of random variables is the sum of the expected values of the random variables
  • the variance of a sum of independent random variables is the sum of the variances of the random variables
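These two properties are easy to check empirically. A quick sketch with made-up distributions (standard library only): the mean of the sum lands near the sum of the means, and likewise for the variance.

```python
import random
import statistics

random.seed(0)
N = 200_000

# Two independent "gift weight" distributions (parameters are made up).
a = [random.gauss(5.0, 2.0) for _ in range(N)]         # mean 5, variance 4
b = [random.gammavariate(2.0, 1.5) for _ in range(N)]  # mean 3, variance 4.5

s = [x + y for x, y in zip(a, b)]
print(round(statistics.fmean(s), 2))      # close to 5 + 3 = 8
print(round(statistics.pvariance(s), 2))  # close to 4 + 4.5 = 8.5
```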

We estimate the mean and variance of each bag type via the law of large numbers: we use Monte Carlo simulation (with 1M samples) and compute the mean and variance of the simulated results. 
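That estimation step might look like the following sketch. The per-gift samplers here are hypothetical (the real distributions are given on the Kaggle page), and overweight bags are scored as zero since they are confiscated.

```python
import random
import statistics

random.seed(42)

# Hypothetical per-gift weight samplers; the real distributions are on Kaggle.
samplers = {
    "train": lambda: max(0.0, random.gauss(10.0, 3.0)),
    "blocks": lambda: random.triangular(1.0, 5.0, 2.0),
}

def simulate_bag(bag, n=100_000, cap=50.0):
    """Monte Carlo estimate of the mean and variance of a bag type's score.
    A bag over the 50-pound cap scores zero (it is confiscated)."""
    totals = []
    for _ in range(n):
        w = sum(samplers[g]() for g, k in bag.items() for _ in range(k))
        totals.append(w if w <= cap else 0.0)
    return statistics.fmean(totals), statistics.pvariance(totals)

mean, var = simulate_bag({"train": 1, "blocks": 3})  # roughly 18 pounds on average
```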

While simple, this approach is way too expensive to run.  We improved running time in two ways:

  • limiting the number of bags to those that are inside the Pareto frontier (details below),
  • precomputing distributions for bags made of one gift type and reusing them for more complex bag types.

Let me expand a bit on the Pareto frontier idea.  Let's consider two bags for the sake of clarity:

  1. Three blocks, one train
  2. Three blocks, one bike, and one train

The second bag is obtained by adding one gift to the first.  We can compute the expected weight for each of these bags. If the expected weight is lower for the second bag than for the first, then the second bag can be ignored.  Why?  Because it uses more gifts for a lower value.  More generally, if a bag cannot be extended with an additional gift without lowering the expected value, then it is Pareto optimal.

Given this, we start with an empty bag and add one gift at a time in every possible way, until the expected value of the bag decreases. When this happens, we discard the newly created bag, as it uses more items and yields a lower expected value.  This results in about 40,000 bag types.
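The generate-and-prune loop can be sketched as below. The `expected_value` function stands in for the Monte Carlo estimate; here it is a deterministic toy with made-up gift weights, where an overweight bag scores zero.

```python
# Toy expected weights per gift type (made up); real values come from simulation.
GIFT_MEAN = {"train": 10.0, "blocks": 3.0, "bike": 20.0}

def expected_value(bag):
    w = sum(GIFT_MEAN[g] * k for g, k in bag.items())
    return w if w <= 50.0 else 0.0   # overweight bags are confiscated

def enumerate_bags():
    seen = {(): 0.0}   # bag type (as a sorted item tuple) -> expected value
    frontier = [{}]
    while frontier:
        bag = frontier.pop()
        value = seen[tuple(sorted(bag.items()))]
        for g in GIFT_MEAN:
            child = dict(bag)
            child[g] = child.get(g, 0) + 1
            key = tuple(sorted(child.items()))
            if key in seen:
                continue
            cv = expected_value(child)
            if cv > value:            # the extra gift improved the bag: keep it
                seen[key] = cv
                frontier.append(child)
            # otherwise prune: more gifts for a lower value
    return seen

bags = enumerate_bags()
```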

Optimizing Expected Value

The next step is to solve the optimization problem.  As said before, it is a cutting-stock problem.

The mathematical formulation is as follows.
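In symbols (a reconstruction consistent with the variable definitions and the code that follow, not the author's exact typeset model):

```latex
\begin{aligned}
\max\quad & \mathit{mean} \\
\text{s.t.}\quad
  & \sum_{i=1}^{n} g_{ij}\, x_i \le \mathit{cap}_j, \qquad j = 1,\dots,m \\
  & \sum_{i=1}^{n} x_i \le 1000 \\
  & \sum_{i=1}^{n} \mathit{mean}_i\, x_i \ge \mathit{mean} \\
  & x_i \in \mathbb{Z}_{\ge 0}, \qquad i = 1,\dots,n
\end{aligned}
```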



  • n is the number of bag types
  • m is the number of gift types
  • mean_i is the expected weight of bag type i
  • g_ij is the number of gifts of type j in bag type i
  • cap_j is the number of available gifts of type j
  • x_i is an integer decision variable that takes value v if bag type i is used v times
  • mean is a continuous variable that represents the expected value of the solution structure

This defines a mixed integer problem with linear constraints (MIP). 

Solving it is pretty straightforward with a state-of-the-art MIP solver like CPLEX.  I used the recent DOcplex package to call it from Python. The code is close to the above mathematical formulation.

from docplex.mp.model import Model

def mip_solve(gift_types, bags, nbags=1000):
    # bags: data frame of bag types; allgifts (global): available count per gift type
    mdl = Model('Santa')
    rbags = range(bags.shape[0])
    x_names = ['x_%d' % i for i in rbags]
    x = mdl.integer_var_list(rbags, lb=0, name=x_names)
    mean = mdl.continuous_var(lb=0, ub=mdl.infinity, name='mean')
    # maximize the expected value of the solution structure
    mdl.maximize(mean)
    for gift in gift_types:
        # use no more gifts of each type than are available
        mdl.add_constraint(mdl.sum(bags[gift][i] * x[i] for i in rbags) <= allgifts[gift])
    # at most nbags bags overall
    mdl.add_constraint(mdl.sum(x[i] for i in rbags) <= nbags)
    mdl.add_constraint(mdl.sum(bags['mean'][i] * x[i] for i in rbags) >= mean)
    mdl.parameters.mip.tolerances.mipgap = 0.00001
    s = mdl.solve(log_output=True)
    assert s is not None
    x_val = s.get_values(x)
    mean_val = s.get_value(mean)
    print('mean:%.2f' % mean_val)
    bags['used'] = x_val
    return bags[bags['used'] > 0]

bags is a data frame containing all bag types.  We return the portion of it that contains only the bag types that are used.

Solving this MIP yields solution structures with an expected value around 35,540 pounds.  Results depend on how we estimate the expected value of each bag type: the more simulation runs, the more accurate the estimate, but the more time it takes to generate all bag types.

I thought finding the optimal expected value would be good enough to win the competition, but I was really wrong.  My first submission scored about 35,880 pounds, and as I write, the top score is close to 37,000 pounds.

How could that happen?  Isn't my solution the optimal one?  It is, but in a probabilistic sense: it is the best one on average.  The issue is that the competition isn't about finding the best solution on average.  The goal is to find the best solution given the actual (hidden) weights of the gifts. 

Optimizing Mean and Variance

One way to improve the result is to generate many solutions from the same solution structure, albeit using different gifts.  For instance, if the solution structure contains one bag made of one train and three blocks, a first solution could include the bag [train_1, blocks_3, blocks_8, blocks_12].  In a second solution, the same bag could be [train_2, blocks_4, blocks_9, blocks_13].  The expected value for both bags is the same, but the actual values will be different, because the weights of individual gifts are different: the weight of train_1 is not the same as the weight of train_2.

Given that we can generate many solutions from a given solution structure, how can we improve the value of the best possible one?  One way is to favor solution structures with larger variance.  If two solution structures have the same expected value, then the one with the larger variance is more likely to generate higher-value submissions.  (It is also more likely to generate lower-value submissions.)
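A sketch of how one might stamp out concrete solutions from a structure; the `type_index` gift-naming convention below is an assumption about the submission format.

```python
from itertools import count

def materialize(structure):
    """Turn a solution structure - a list of (bag type, times used) pairs -
    into concrete bags with distinct gift IDs."""
    counters = {}   # one running index per gift type
    bags = []
    for bag_type, times in structure:
        for _ in range(times):
            bag = []
            for gift, k in bag_type.items():
                c = counters.setdefault(gift, count())
                bag.extend('%s_%d' % (gift, next(c)) for _ in range(k))
            bags.append(bag)
    return bags

# Two bags of the same type get different gifts:
bags = materialize([({"train": 1, "blocks": 3}, 2)])
# [['train_0', 'blocks_0', 'blocks_1', 'blocks_2'],
#  ['train_1', 'blocks_3', 'blocks_4', 'blocks_5']]
```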

The question is how to do that with a solver like CPLEX.

Well, the standard deviation of the solution structure is the square root of its variance, and its variance is the sum of the variances of all bags in it.  The mathematical formulation is therefore a slight extension of the previous one:
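In symbols (a reconstruction matching the variable definitions and the code that follow):

```latex
\begin{aligned}
\max\quad & \mathit{mean} + \alpha \cdot \mathit{std} \\
\text{s.t.}\quad
  & \sum_{i=1}^{n} g_{ij}\, x_i \le \mathit{cap}_j, \qquad j = 1,\dots,m \\
  & \sum_{i=1}^{n} x_i \le 1000 \\
  & \sum_{i=1}^{n} \mathit{mean}_i\, x_i = \mathit{mean} \\
  & \sum_{i=1}^{n} \mathit{var}_i\, x_i = \mathit{var} \\
  & \mathit{std}^2 \le \mathit{var} \\
  & x_i \in \mathbb{Z}_{\ge 0}, \qquad i = 1,\dots,n
\end{aligned}
```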



  • alpha is the relative importance of standard deviation in the objective function
  • n is the number of bag types
  • m is the number of gift types
  • mean_i is the expected weight of bag type i
  • var_i is the variance of the weight of bag type i
  • g_ij is the number of gifts of type j in bag type i
  • cap_j is the number of available gifts of type j
  • x_i is an integer decision variable that takes value v if bag type i is used v times
  • mean is a continuous variable that represents the expected value of the solution structure
  • std is a continuous variable representing the standard deviation of the solution structure
  • var is a continuous variable representing the variance of the solution structure

This problem contains a quadratic constraint, making it a quadratically constrained mixed integer problem (QCMIP).  Again, solving it with CPLEX is rather easy; the code becomes:

def qcpmip_solve(gift_types, bags, alpha, nbags=1000):
    mdl = Model('Santa')
    rbags = range(bags.shape[0])
    x_names = ['x_%d' % i for i in rbags]
    x = mdl.integer_var_list(rbags, lb=0, name=x_names)
    var = mdl.continuous_var(lb=0, ub=mdl.infinity, name='var')
    std = mdl.continuous_var(lb=0, ub=mdl.infinity, name='std')
    mean = mdl.continuous_var(lb=0, ub=mdl.infinity, name='mean')
    mdl.maximize(mean + alpha * std)
    for gift in gift_types:
        mdl.add_constraint(mdl.sum(bags[gift][i] * x[i] for i in rbags) <= allgifts[gift])
    mdl.add_constraint(mdl.sum(x[i] for i in rbags) <= nbags)
    mdl.add_constraint(mdl.sum(bags['mean'][i] * x[i] for i in rbags) == mean)
    mdl.add_constraint(mdl.sum(bags['var'][i] * x[i] for i in rbags) == var)
    # quadratic constraint linking std to var: std cannot exceed sqrt(var)
    mdl.add_constraint(std**2 <= var)
    mdl.parameters.mip.tolerances.mipgap = 0.00001
    s = mdl.solve(log_output=True)
    assert s is not None
    x_val = s.get_values(x)
    mean_val = s.get_value(mean)
    std_val = s.get_value(std)
    bags['used'] = x_val
    print('mean:%.2f' % mean_val, 'std:%.2f' % std_val)
    return bags[bags['used'] > 0]

Solving this QCMIP with alpha=2 yields solution structures with an expected value around 35,525 pounds and a standard deviation around 333 pounds.  The expected value is a bit lower, but the standard deviation is much larger.  With a solution structure like this, I had about a 1/4 chance of getting a submission above 36,400 pounds by the end of the competition (about 90 submissions left at that time).  This looked much better, but the truth is that people have found ways to generate far higher values, as shown by the current leaderboard of the competition.  I am also able to generate better solutions, but don't count on me to disclose how before the competition ends ;)

There is a caveat in the second approach.  Can you see it?

The caveat is about how we prune the generation of candidate bags.  We need to modify the pruning to take the new objective function into account.  When we generate bag2 by adding one gift to bag1, we should compare mean(bag1) + alpha * std(bag1) with mean(bag2) + alpha * std(bag2).  If the former is higher than the latter, then we can safely drop bag2 from further consideration.
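The pruning test can be sketched as follows; `mean` and `std` here stand for whatever bag-scoring functions the generation code uses (hypothetical names, not from the post):

```python
def dominated(bag1, bag2, alpha, mean, std):
    """Return True if bag2 can be safely dropped from further
    consideration: bag1 already scores strictly higher under the
    mean + alpha * std objective."""
    return mean(bag1) + alpha * std(bag1) > mean(bag2) + alpha * std(bag2)
```

Any scoring functions with the right signature can be plugged in; the point is only that the comparison now involves the full objective, not the mean alone.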


What I found interesting is using a QCMIP to optimize the maximum of the values we can get when generating solutions from one solution structure.  This is significantly different from usual stochastic optimization problems.  Indeed, in my experience, people are usually interested in finding solutions that are good on average (as in my first model), or that minimize the worst-case scenario.  Here we are asked to maximize the best-case scenario.  That is very unusual.





Techniques for Assigning Dates to Web Content: What Was the Publish Date?

When making sense of a web page’s raw text, one of the ideal pieces of metadata is the “publish date.” Assigning dates to web content attributes the documents, and all other pieces of intelligence found within that document, to a specific time period. This helps the data analyst quickly drill-down into the data by date […] The post Techniques for Assigning Dates to Web Content: What Was the Publish Date? appeared first on BrightPlanet.

Big Data University

This Week in Data Science (January 10, 2017)

Hello all! My name is Janice Darling and I will be taking over this column from Cora.
Here is a roundup of the news this week in Data Science and Big Data.

Don’t forget to subscribe to keep up-to-date with developments in Big Data and Data Science!

Interesting Data Science Articles and News

Cool Data Science Videos

The post This Week in Data Science (January 10, 2017) appeared first on Big Data University.


January 09, 2017

Revolution Analytics

What can we learn from StackOverflow data?

StackOverflow, the popular Q&A site for programmers, provides useful information to nearly 5 million programmers worldwide with its database of questions and answers — not to mention the...


#PostTruth – what does it mean in the world of Data Science?

If I were to sum up our purpose at Principa, it would be “to help clients make informed decisions using data, analytics and software”.  As information grows, so does the opportunity to make better decisions.  Data helps you understand your customer better.  That’s our mantra.  That’s our ethos.  That’s who we are.


January 07, 2017

Simplified Analytics

What are Microservices in Digital Transformation?

Today’s organizations feel the fear of becoming a dinosaur every day. New disrupters are coming into your industry and turning everything upside down. Customers are more demanding than ever and...


January 06, 2017

Revolution Analytics

Three reasons to learn R today

If you're just getting started with data science, the Sharp Sight Labs blog argues that R is the best data science language to learn today. The blog post gives several detailed reasons, but the main...


Revolution Analytics

Because it's Friday: The camera might not lie, but sometimes it fibs

Photography is my favourite art form: it's more than just capturing a scene in the frame. A good photograph tells a story, chosen and delivered by the photographer. But sometimes that story isn't...


January 05, 2017

Revolution Analytics

Analyzing emotions in video with R

In the run-up to the election last year, Ben Heubl from The Economist used the Emotion API to chart the emotions portrayed by the candidates during the debates (note: auto-play video in that link)....

Silicon Valley Data Science

Imbalanced Classes FAQ

We previously published a post on imbalanced classes by Tom Fawcett. The response was impressive, and we’ve found a good deal of value in the discussion that took place in the comments. Below are some additional questions and resources offered by readers, with Tom’s responses where appropriate.

Questions and clarifications

Which technique would be best when working with the Predictive Maintenance (PdM) model?

This is a somewhat vague question, so I’ll have to make some assumptions about what you’re asking. Usually the dominant problem with predictive maintenance is the FP rate, since faults happen so rarely. You have so many negatives that you need a very low (e.g., <0.05) FP rate or you’ll spend most of your effort dealing with false alarms.

My advice is:

  1. Try some of these techniques (especially the downsampled-bagged approach that I show) to learn the best classifier you can.
  2. Use a very conservative (high threshold) operating point to keep FPs down.
  3. If neither of those gets you close enough, see whether it’s possible to break the problem in two, so that the “easy” false alarms can be disposed of cheaply and only the remaining ones need more expensive (human) intervention.
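The conservative operating point in step 2 amounts to simple score thresholding; a minimal sketch (the `scores` list and the 0.95 cutoff are purely illustrative, not from the post):

```python
def predict_conservative(scores, threshold=0.95):
    """Flag a fault only when the classifier score clears a high
    threshold, trading recall for a low false-positive rate."""
    return [score >= threshold for score in scores]
```

Raising `threshold` suppresses false alarms at the cost of missing more real faults, which is exactly the trade-off being recommended here.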

Can you suggest a modeling framework that targets high recall at low FP rates?

I’m not quite sure what “modeling framework” refers to here. High recall at low FP rates amounts to (near) perfect performance, so it sounds like you’re asking for a framework that will produce an ideal classifier on any given dataset. I’m afraid there’s no single framework (or regimen) that can do that.

If you’re asking something else, please clarify.

Why did over- and undersampling affect variance as they did in the post? Shouldn’t the (biased) sample variance stay the same when duplicating the data set, while there’d be no asymptotic difference when using undersampling?

A fellow reader stepped in to help with this question:

You’re very close: the key is the n-1 in the denominator. When you duplicate every point in a dataset, the mean stays the same and the numerator of the variance formula doubles, but the denominator grows from n-1 to 2n-1, so the sample variance itself decreases.

Mathematically, variance is defined as E( [X – E(X)]^2 ), where E() is the mean function (typically just sum everything up and divide by n), but when finding the variance of a sample, instead of taking the straight-up mean of the squared differences as the last step you need to sum everything up and divide by n-1. (It can be shown that dividing by n underestimates the variance, on average.)

Suppose some dataset consists of 5 points, and say the numerator is sum([X – E(X)]^2) = Y, so the variance is Y/4. Now duplicate the data, creating dataset Z: you have 10 points and the numerator of the formula is sum([Z – E(Z)]^2) = 2Y. But now the variance is 2Y/9, which is smaller than Y/4.

With well-behaved data, this does not affect much in practical terms.
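The effect is easy to check numerically; a minimal sketch using Python’s standard statistics module (the five-point dataset is made-up example data):

```python
import statistics

data = [1, 2, 3, 4, 5]
doubled = data * 2  # every observation duplicated once

# Sample variance divides by n - 1: duplicating the data doubles the
# numerator (the sum of squared deviations, 10 -> 20) but grows the
# denominator from n - 1 = 4 to 2n - 1 = 9, so the estimate shrinks.
print(statistics.variance(data))     # 10 / 4 = 2.5
print(statistics.variance(doubled))  # 20 / 9 ≈ 2.22
```

The mean stays at 3 in both cases; only the sample variance drops, exactly as the formula predicts.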

Additional tools and references

The post Imbalanced Classes FAQ appeared first on Silicon Valley Data Science.


Principa's Top 10 Data Analytics Blog posts for 2016

We take pride in our ability to predict - from the results of the 2015 Rugby World Cup and the 2016 Oscars to predicting profitable customers and customer churn. However, there is no denying that 2016 was a year full of shocking, unexpected events - from Brexit and the US election results to the acrimonious break-up of "Brangelina" (shocking!) and the sad loss of some very talented artists.


January 04, 2017

Revolution Analytics

The Flexibility of Remote and Local R Workspaces

by Sean Wells, Senior Software Engineer, Microsoft The mrsdeploy R package facilitates Remote Execution and Web Service interactions from your local R IDE command line against a remote Microsoft R...

Jean Francois Puget

Installing XGBoost on Mac OSX

OSX is much better than Windows, isn't it?  That's common wisdom, and it seemed to be confirmed once more when I installed XGBoost on both operating systems.  Before I dive in, let me briefly describe XGBoost.  It is a machine learning algorithm that yields great results in recent Kaggle competitions.  I decided to install it on my laptops: an old PC running Windows 7, and a brand new MacBook Pro running OSX.  I thought the OSX installation would be a no-brainer compared to the Windows one, which I described in Installing XGBoost For Anaconda on Windows.

Reality is a bit different: the OSX installation isn't as smooth as it seems.  To be precise, the default OSX installation of XGBoost runs in single-threaded mode, as explained in these instructions.

Why is this a problem?  Because XGBoost is a machine learning algorithm, and running it can be time consuming.  I am currently working on a dataset with only about 100k rows (samples), and tuning XGBoost on my old Windows laptop (a Lenovo W520) takes about 2 hours.  What surprised me is that it takes 7 hours on my brand new MacBook Pro!  That is odd, given that both machines have quad-core Intel i7 CPUs and that the Mac's clock speed is higher.  Add to this the premium price of the Mac, and you can see why I was really surprised.

I further observed that other CPU-intensive tasks are faster on the MacBook Pro.  Something was definitely wrong, but the culprit was easy to spot: XGBoost runs single-threaded on OSX.

Before I explain how to enable multi-threading for XGBoost, let me point you to this excellent Complete Guide to Parameter Tuning in XGBoost (with codes in Python).  I found it useful as I started using XGBoost.  And I assume you could be interested if you have read this far ;)

Back to XGBoost: the installation instructions do explain how to get the multi-threaded version of XGBoost.  Unfortunately, they did not work for me.  The following is what worked for me; I am sharing it in case it helps others.  I had to perform the following steps:

  • Get Homebrew if it is not installed yet.  It is a very useful open source package manager for OSX.  Installing it is straightforward: open a terminal, then paste and execute the command available on the Homebrew home page. I reproduce it here for convenience:
    /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  • Get gcc with OpenMP support.  Just paste and execute the following command in your terminal once the Homebrew installation is complete.
    brew install gcc --without-multilib    
    This automatically downloads and builds gcc.  It can take a while; it took about 30 minutes for me.  Be patient.
  • Get XGBoost.  Go to wherever you want in your filesystem, say <directory>.  Then type and execute the git clone command:
    cd <directory>
    git clone --recursive https://github.com/dmlc/xgboost
    This downloads the XGBoost code into a new directory named xgboost.
  • Next, build XGBoost.  By default, the build process uses the default compilers, cc and c++, which do not support the OpenMP option used for XGBoost multi-threading. We need to tell the build system to use the compilers we just installed.  That's the step that was missing from the installation instructions on the XGBoost site. 
    There are various ways to do it; here is the one I used. 
  • Go to where we downloaded XGBoost
    cd <directory>/xgboost
  • Then open make/config.mk and uncomment these two lines

export CC = gcc
export CXX = g++

  • Depending on your g++ installation, you may need to change the above two lines into:
    export CC = gcc-6
    export CXX = g++-6
  • We then build with the following commands.
    cd <directory>/xgboost
    cp make/config.mk .
    make -j4
  • Once the build is finished, we can use XGBoost from its command line.  I am using Python, hence I performed this final step.  You may need to enter the admin password to execute it.
    cd python-package; sudo python setup.py install

This concludes the installation. 

I tested it with my Anaconda distribution with Python 3.5.  It worked fine, and I could run XGBoost.  The speedup from multi-threading is noticeable, and my MacBook Pro is now faster than my old PC.

Updated on July 16, 2016.  Makefile changed in xgboost, making it easier to use gcc.

Updated on Jan 4, 2017. Updated the gcc and g++ declarations in the makefile.  The original way didn't work on some g++ installations.  Thanks to Brandon Mitchell, who spotted the issue.


January 03, 2017

Revolution Analytics

The biggest R stories from 2016

It's been another great year for the R project and the R community. Let's look at some of the highlights from 2016. The R 3.3 major release brought some significant performance improvements to R,...


January 01, 2017


December 30, 2016

Revolution Analytics

Because it's Friday: Goodbye, 2016

Between the deaths of beloved heroes and heroines, the civil unrest and political upheavals, and a slew of natural disasters, 2016 wasn't the greatest year. If you made a movie about it, this is what...


Revolution Analytics

Power BI custom visuals, based on R

You've been able to include user-defined charts using R in Power BI dashboards for a while now, but a recent update to Power BI includes seven new custom charts based on R in the custom visuals...


Simplified Analytics

Do you know what is powerful real-time analytics?

In today's Digital age, the world has become smaller and faster.  Global audio & video calls, which were once available only in corporate offices, are now available to the common man on the...

InData Labs

AI is changing the face and voice of customer service as we know it.

Deep learning is a game changer in modern customer service. Learn what is behind the DeepMind neural network that generates the most natural speech signals, which can be used to make communication with a customer care representative even more pleasant.

The post AI is changing the face and voice of customer service as we know it. appeared first on InData Labs.


December 29, 2016

Revolution Analytics

Using R to prevent food poisoning in Chicago

There are more than 15,000 restaurants in Chicago, but fewer than 40 inspectors tasked with making sure they comply with food-safety standards. To help prioritize the facilities targeted for...


Revolution Analytics

Combine choropleth data with raster maps using R

Switzerland is a country with lots of mountains, and several large lakes. While the political subdivisions (called municipalities) cover the high mountains and lakes, nothing much of economic...


December 27, 2016

Big Data University

This Week in Data Science (December 27, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

The post This Week in Data Science (December 27, 2016) appeared first on Big Data University.


December 26, 2016

InData Labs

Data Science Competition 2017

InData Labs welcomes all the participants of the Data Science Competition! Take the challenge, show you have the fire and join our R&D Data Science Lab!

The post Data Science Competition 2017 appeared first on InData Labs.


December 25, 2016

Revolution Analytics

Parallelizing Data Analytics on Azure with the R Interface Tool

by Le Zhang (Data Scientist, Microsoft) and Graham Williams (Director of Data Science, Microsoft) In data science, to develop a model with optimal performance, exploratory experiments on different...


Revolution Analytics

The Basics of Bayesian Statistics

Bayesian Inference is a way of combining information from data with things we think we already know. For example, if we wanted to get an estimate of the mean height of people, we could use our prior...


December 24, 2016

Simplified Analytics

Fail fast approach to Digital Transformation

Digital Transformation is changing the way customers think about & demand new products or services. Today, bank accounts are opened online, insurance claims are filed online, a patient’s health is...


December 23, 2016

Revolution Analytics

Because it's Friday: A Christmas Destiny

This video isn't CGI. It's more like a Machinima Christmas pantomime: carefully selected costumes, choreographed performances, and sharp editing of some long takes in the world of Destiny, one of my...


Revolution Analytics

Merry ChRistmas!

Christmas day is soon upon us, so here's a greeting made with R. Each frame is a Voronoi tessellation: about 1,000 points are chosen across the plane, each of which generates a polygon comprising the...

VLDB Solutions

Xmas Wish List

Teradata MPP Setup on AWS

All We Want For Christmas

It’s that time of year…yes Christmas (Xmas for short), most definitely not <yuck>’holidays'</yuck>.

There’s far too much chocolate in the VLDB office as you might expect. Geeks & chocolate are a winning combo, so it won’t last long.

Moving on…office Xmas decorations – check. Office Xmas party – check. Silly jumpers – check. Secret Santa – check. Loud Xmas music – check.

As we draw ever nearer to *finally* going home for Xmas, our thoughts turn to what VLDB’s Xmas list to Santa might look like…so, here goes…

Dear Santa, can you please make sure clients understand at least some of the following:

  1. Data warehouse systems aren’t a side-project you can pay for with left over funding from another project. Real funding, sponsorship, requirements & commitment are required. Subject Matter Experts (SMEs) or Business Analysts (BAs) will need to provide guidance to get an analytic solution delivered. Technology & designers/developers on their own won’t get very far.
  2. High praise from the likes of Gartner doesn’t mean a particular technology is a good fit for your organisation. Figure out your needs/wants/desires afore ye go looking to buy shiny new tech. Thou shalt not believe all thou hears at conferences. It’s the job of tech companies, VCs, analysts & conference organisers to whip up excitement (see kool-aid). They’re not on the hook for delivery.
  3. Accurate estimates for design/build/test are only possible if analysis is carried out. Either you do it, pay us to do it, or accept estimates with wide tolerances.
  4. Quality Assurance (QA) is not the same as unit testing. It’s a real thing that folks do. Lots of them. No really!
  5. CSV files with headers and trailers are a perfectly acceptable way to build data interfaces. Lots of very large organisations are guilty of clinging on to this ‘unsexy’ approach. It ‘just works’.
  6. You’ll probably need a scheduler to run stuff. cron is not a scheduler. Nor is the DBA.
  7. If you could have ‘built that ourselves in SQL Server in 5 days’ you would have already done so.
  8. Don’t focus on our rate card. Focus on the project ROI. Oh, wait, you haven’t even thought about ROI. Doh!
  9. Yes, we can get deltas out of your upstream applications without crashing the system. It’s what we do. We’ll even prove it.
  10. If you want us to work on site we’ll need desks to sit at, preferably next to each other. We’re picky like that 😉

Have a great Xmas & New Year Santa,

Love from all at VLDB

Have a great Xmas & New Year, and here’s to 2017.


December 22, 2016

Revolution Analytics

Take a Test Drive of the Linux Data Science Virtual Machine

If you've been thinking about trying out the Data Science Virtual Machine on Linux, but don't yet have an Azure account, you can now take a free test drive -- no credit card required! Just visit the...

Silicon Valley Data Science

Techniques and Technologies: Topology and TensorFlow

On December 7, 2016, we hosted a meetup featuring Dr. Alli Gilmore (Senior Healthcare Data Scientist at One Medical), and Dr. Andrew Zaldivar (Senior Strategist in Trust & Safety at Google). Despite the drizzle and gloom outside, the atmosphere of the room was bright and buzzing. The lively audience engaged with both speakers throughout their talks, lending the event the feeling of an intimate small group discussion among peers.

Dr. Gilmore spoke about the user experiences that come with applying machine learning algorithms. Carefully considering the experience of using a particular machine learning algorithm is what will make artificial intelligence more productive and useful to people. She walked through using the unsupervised Mapper topological data analysis algorithm to group similar types of medical claims, discussed the varied reactions of subject matter experts to its outputs, and envisioned a more interactive and satisfying version of the process.

Dr. Zaldivar illuminated the path to harnessing TensorFlow‘s powerful capabilities without the complex configuration by using a set of high-level APIs called TFLearn. He showed us how to quickly prototype and experiment with various classification and regression models in TensorFlow with only a few lines of code, as well as how to access other useful functionality in the TF package.


Unsupervised Topological Data Analysis

Gilmore presentation

After an introduction to topological data analysis, Dr. Gilmore summarized the reactions of domain experts to unsupervised clustering algorithms: they find the results difficult to interpret, and they are underwhelmed by how little they can contribute to the grouping process. It may be unsatisfying if interpreting the clusters feels like a guessing game, if there are seemingly duplicate groups, or even if the groups are really obvious. Similarly, it’s frustrating when people want to but can’t contribute their expertise. They may also want to reinforce the model’s results when it does something well, but it’s not necessarily easy to tell the system to do more of a particular thing.

How can we overcome the drawbacks that accompany unsupervised methods? Put a human in the loop! Make using the algorithm a positive and fruitful experience by leveraging what people can do confidently while avoiding things that are hard. For example, users can likely explain which features are relevant (this is what they know and care about), but they may have a difficult time describing how many groups should exist in the data. Let them influence the algorithm on these kinds of terms, perhaps by providing labels for the grouping process via exemplar selection, as well as by propagating labels through a question–answer feedback loop from machine to human and back. I’m sure every data scientist has imagined the day when they can more colloquially interact with an algorithm to get better results, even if the majority of today’s feedback only involves cursing that falls on deaf ears.

Practical TensorFlow

Dr Zaldivar presenting

Dr. Zaldivar took the audience through the steps required to build a relatively simple convolutional neural network (CNN) using the low-level TensorFlow Python API. It took four slides of code to cover all of the setup, which required a lot of expertise to implement but demonstrated how specific one can be if needed. He contrasted this with implementing a deep neural network in just four lines of code using functions from the TFLearn module. He recommended running models at the highest level of abstraction first and only digging down into the details if performance is suboptimal. After all, more lines of code means more to debug if something goes wrong.

Peeking under the hood at the underlying architecture, we got a brief overview of the graphical nature of TF networks. At the lowest level, functional operations like multiply and add are nodes in a graph, and tensors (the data) flow through the graph. Operations become larger as TF is abstracted up to TFLearn, which has a similar level of abstraction to Keras. In this high-level API, many TFLearn models should already be familiar to anyone who has used scikit-learn-style fit/predict methods.

Falling somewhere between the core TF API and TFLearn is another module called TF-Slim, whose API can implement a CNN in far fewer lines of code than the initial approach. Slim focuses on larger operations but can intertwine with the low-level API to give greater control than TFLearn. With the extensible capabilities of this module, you can also fine-tune a pre-trained model on your own dataset, providing yet another way to get up and running quickly with state-of-the-art networks like Inception-ResNet-v2.

Next steps

You can find Dr. Gilmore’s slides here, and Dr. Zaldivar’s slides here. The decks contain a number of links to resources related to their talks—the interested reader is encouraged to peruse the slides to find gems related to the interactive machine learning field, topological data analysis, logging and monitoring capabilities in TensorFlow, additional built-in neural networks, Jupyter notebook examples, and tutorials. We’ve also put recordings of Dr. Gilmore’s and Dr. Zaldivar’s presentations on YouTube.

SVDS offers services in data science, data engineering, and data strategy. Check out our newsletter to learn more about the company and current projects, and to hear about future meetups hosted at our offices.

The post Techniques and Technologies: Topology and TensorFlow appeared first on Silicon Valley Data Science.


December 21, 2016

Revolution Analytics

Introducing the AzureSMR package: Manage Azure services from your R session

by Alan Weaver, Advanced Analytics Specialist at Microsoft Very often data scientists and analysts require access to back-end resources on Azure. For example, they may need to start a virtual machine...

Big Data University

This Week in Data Science (December 20, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

The post This Week in Data Science (December 20, 2016) appeared first on Big Data University.

Teradata ANZ

Is Collaboration killing Creativity?

If collaboration is good, is more collaboration better?

Project management methodologies that have been successful in production-centric environments (e.g., agile, DevOps, lean) are increasingly being deployed in big data projects. However, big data projects are a combination of production work and creative work.

Software engineering and development is arguably production-centric and well suited to optimisation workflows. Science, on the other hand (and research in general), explores how to reach a long-term goal. Outcomes of the scientific process are highly non-linear; significant results arrive in much the same fashion as an artist’s creative breakthrough.

Data science is no exception: it is a creative endeavour, not a production one.

To maximise the potential of data science teams, one should provide an environment that is suitable for creativity. Fortunately, that’s a well-researched area with evidence-based answers; unfortunately, these findings are often ignored.

Adequate collaboration is the most critical enabler of creativity, but not all collaboration principles are equal: how many works of art, such as paintings or books, are the products of teamwork? In short, not that many [1].

Creative “outbursts”, in research or artistic pursuits, follow a common pattern: mentally reaching out to novel (and apparently unrelated) ideas to solve a problem or fulfill a vision, until they coalesce into what, to an outsider, appears as an epiphany.

Artists’ collaborative life is well documented: from circles of philosophers to Andy Warhol’s Factory and the close connections between painters and writers in European capitals a few centuries ago. The most creative people experience a mixture of solitary work and external influences: the collaborative aspect has less to do with creating the work and more with the inspiration it provides.

A number of studies on aspects as diverse as the ideation process [2], the quality and success of Broadway shows [3], and communication vs. productivity in the workplace [4] all demonstrate the same two points: too much close collaboration is harmful, as it naturally leads to cliques, groupthink, and echo chambers, while too little contact with “the outside world” also hampers creativity.

In the realm of research, academic or otherwise, that form of collaboration has existed for a long time: a personal space to create (the fast-disappearing office), and a collective space to make face-to-face contact [5] and exchange ideas (the fast-disappearing workplace cafeteria, external seminars or conferences).

Instead of following these findings, which had long been best practices, recent trends have almost obliterated them: open offices, small kitchens replicated across floors, restricted travel budgets, and constant collaborative meetings with a core team are stifling innovation. Indeed, pretty much every department or company I have visited over the past few years has showcased environments with similar answers to similar problems. This is not about a skills shortage; this is about buzzword-driven project methodologies adopted without understanding their context [7] or looking at the evidence.

On the optimistic side, the recent emergence of “people analytics” as an area of focus may offer solutions to re-ignite the true innovation that leads to significant competitive advantage in data science. Indeed, the research is already there, and the most promising answers involve collaborative network graphs.

Among the key features of interest: successful creative projects and companies are composed of people who have a low local clustering coefficient [8] and a short average path length [9], i.e., people whose collaborative and conversational networks are compact but not inter-related.

Left: A highly connected graph of short paths, forming almost a clique (LCC = 0.66, APL = 1.1). This type of collaboration, occurring when everyone works tightly with everyone else and no one outside, leads to unproductive “groupthink”

Middle: a graph of long paths and limited inter-connectivity (LCC= 0, APL = 2.05). This type of collaboration, occurring when people only work with closely related trusted parties, can lead to “echo chambers”. Note that these types of paths are often in fact disconnected.

Right: a graph containing short path lengths and limited inter-connectivity (LCC = 0.05, APL = 1.5). The combination of short paths (easy access to diverse people) and close collaboration on a small scale is beneficial to the creative process
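Both metrics can be computed directly from an adjacency list; here is a minimal sketch in Python (the three-node example graph is made-up illustration data, not taken from the figures):

```python
from collections import deque

# Undirected graph as an adjacency dict (hypothetical example: a 3-clique).
graph = {
    'a': ['b', 'c'],
    'b': ['a', 'c'],
    'c': ['a', 'b'],
}

def local_clustering(adj, v):
    """Fraction of v's neighbour pairs that are directly connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i, a in enumerate(nbrs)
                  for b in nbrs[i + 1:] if b in adj[a])
    return 2.0 * links / (k * (k - 1))

def average_path_length(adj):
    """Mean shortest-path length over connected node pairs, via BFS."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

# In a clique both metrics sit at their extremes, as in the left-hand figure.
print(local_clustering(graph, 'a'))   # 1.0
print(average_path_length(graph))     # 1.0
```

On real collaboration data one would run these over each person's network; libraries such as networkx provide equivalent (and faster) implementations.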

You can’t manage what you can’t measure. Fortunately, you can measure and quantify the structure of internal collaboration within an organisation. With that information, you can manage teams, projects or departments to maximise the inventiveness and creativity of knowledge workers, which results in more significant findings and outcomes.

It’s not that: more collaboration => better outcomes
But: better collaboration => more outcomes



[1] Music is an exception here as there are at least two distinct areas: songwriting and composing

[2] Andrew T. Stephen, Peter Pal Zubcsek, and Jacob Goldenberg (2016) Lower Connectivity Is Better: The Effects of Network Structure on Redundancy of Ideas and Customer Innovativeness in Interdependent Ideation Tasks. Journal of Marketing Research: April 2016, Vol. 53, No. 2, pp. 263-279.

[3] Brian Uzzi and Jarrett Spiro (2005) Collaboration and Creativity: The Small World Problem. American Journal of Sociology: September 2005

[4] Alex Pentland (2013) Beyond the Echo Chamber. Harvard Business Review: November 2013

[5] Unscripted face-to-face communication is overwhelmingly more conducive to engagement and idea sharing [6]

[6] Alex Pentland (2012) The new science of building great teams. Harvard Business Review: April 2012

[7] Consider the difference between the original agile manifesto and its current incarnation.

[8] A number between 0 and 1 that measures the proportion of a person’s contacts who also know each other. If everyone knows everyone else, the network is called a clique.

[9] The average number of people a person has to “go through” to contact everyone in the network

The post Is Collaboration killing Creativity? appeared first on International Blog.


December 20, 2016


Guest Post: Was Santa involved in WikiLeaks too?

We partnered with Basis Technology to show how their technology Rosette Text Analytics could be utilized with ours in a fun, Christmas-themed example that they published on their blog. We harvested data from WikiLeaks and curated the data to find Christmas-related mentions. Find out what we were able to uncover. You can read it here. The post Guest Post: Was Santa involved in WikiLeaks too? appeared first on BrightPlanet.


Revolution Analytics

Mixed Integer Programming in R with the ompr package

Numerical optimization is an important tool in the data scientist's toolbox. Many classical statistical problems boil down to finding the highest (or lowest) point on a multi-dimensional surface: the...
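The post's subject, mixed integer programming, restricts some decision variables to whole numbers, which is what makes dedicated solvers (such as those ompr wraps in R) necessary. As a language-agnostic illustration only, here is an invented toy model solved by brute-force enumeration rather than a real solver:

```python
from itertools import product

# Toy mixed integer program: maximise 3x + 4y
# subject to 2x + 5y <= 15, with x and y non-negative integers.
# The bounds below are chosen to cover the whole feasible region.
best_value, best_point = None, None
for x, y in product(range(8), range(4)):
    if 2 * x + 5 * y <= 15:  # feasibility check
        value = 3 * x + 4 * y
        if best_value is None or value > best_value:
            best_value, best_point = value, (x, y)

print(best_point, best_value)  # (7, 0) 21
```

Real solvers prune this search with branch-and-bound instead of enumerating every point, which is why they scale to problems with thousands of variables.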


Revolution Analytics

Interactive decision trees with Microsoft R

Even though ensembles of trees (random forests and the like) generally have better predictive power and robustness, fitting a single decision tree to data can often be very useful for: understanding...

Data Digest

Beyond Data: Transforming into an Analytical Organisation

Editor's Note: In this blog, we curate relevant and remarkable content for the data and analytics community. Here's an interesting piece written by José Antonio Murillo Garza, which was originally published by Northwestern’s JIMC.

Analytics within a traditional organisation often starts as a top management initiative aimed at increasing productivity. Implementation faces several challenges, and top management sponsorship can quickly fade away. Banorte, the largest Mexican financial group, is well on its way to successfully transforming into an analytical organisation. This experience offers five lessons for starting out on such a transformation:

  1. Gain credibility by delivering short-term results that assure long-term sponsorship from top management.
  2. Set the right incentives for the organisation to embrace analytics, avoiding rivalry between analytics and the business lines.
  3. Do not take for granted that analytic projects are a priority for the whole organisation — hold the analytics group accountable for end-to-end implementation.
  4. Beyond quantitative skills, analytics team members require the ability to build consensus across different stakeholders within the organisation.
  5. The contribution from analytic initiatives must be measured.
A traditional organisation that aims to become an analytical firm with the capacity to deepen and extend its relationship with its customers starts its journey with some leaders envisioning a future for the company. It is not uncommon to find contrasting visions within top management about what the future should be — some subscribe to the old Texas adage, “If it ain’t broke, don’t fix it.” This should not be a surprise since companies have limited resources, and there are competing projects that respond to existing customer demands. Hence, when the organisation embarks on the transformation path, it is clear that the analytical camp prevailed, but this does not mean that the traditional camp was convinced. Top management composition is not static, and suddenly the forward-thinking camp that once prevailed to launch the transformational analytic initiative might be outnumbered. Assuring continuity requires the analytic team to gain credibility with both camps of the organisation. A high short-run ROI does the trick. Analytics in Banorte during its first year of operations contributed 10% of the group’s total net profit, gaining credibility and resources to advance with medium-term projects.

A high yield from analytics is a necessary but insufficient condition. Analytics and the business lines must establish a partnership that will face some hurdles. Analytics will disrupt the way business has been conducted, and rivalry between groups might arise. Both groups speak different languages — one group has business experience while the other talks with statistics and models. The business lines have many concerns — devoting scarce resources to unproven projects, the credit they will get in case of success, or the blame they will share if failure occurs. These concerns must be addressed by the design of an incentive scheme that aligns the interests of analytics and the business lines. Banorte solved this problem by setting a shadow target for analytics that did not rival the business lines targets, but rather analytic projects helped business lines attain their targets.

Non-rivalry between groups does not guarantee that the business lines and other stakeholders will enthusiastically embrace analytic initiatives to change the way they have been working. Stakeholders within the organisation are unfamiliar with analytic initiatives, and they certainly have other projects top of mind. It is of the utmost priority for analytics to find partners willing to champion initiatives that can set an example to the rest of the organisation. Analytics will need to invest some of its top management sponsorship capital in assuring end-to-end implementation of the chosen projects. The transformational effort at Banorte initially focused on the credit card business because of the project’s value and the willingness of the business partners.

An analytics team obviously requires quantitative skills, and the organisation has to make a substantial investment in data and technology. However, these prerequisites are not enough to transform an institution. Banorte’s experience has shown that two soft skills are vital — the ability to build consensus and to have good team players who respect the business acumen of their counterparts. Business lines expect the quants to be a partner, not a lecturer.

Finally, rooting analytics within the organisation requires measurement. It is not uncommon for an organisation on its path to transformation to undertake several initiatives at the same time, and it may prove hard to disentangle the contribution of each. The analytics team should not assume that results are self-evident — a detailed report to top management must be produced periodically. Furthermore, it is easy to stray from value by undertaking only projects from willing partners. At Banorte, measurement has assured sponsorship and has kept analytics on track.


Hear more from José Antonio Murillo Garza and other distinguished speakers at the Chief Analytics Officer, Spring happening on May 2-5, 2017 in Scottsdale, Arizona. For more info, visit

By Dr. José A. Murillo Garza:

José Antonio Murillo Garza is the Chief Analytics Officer for Grupo Financiero Banorte S.A.B. de C.V. José also serves as general director of analysis at Banco Mercantil Del Norte, S.A. Prior to this, José was the director of price analysis for regional economies and information at Banco de México. He holds a degree and Doctorate in Economics from Rice University.

December 19, 2016

Teradata ANZ

Pooling data and analytics to power greater efficiency in water utilities

Although water and power utilities face very similar pressures, they have tackled them with very different levels of success. Both sets of utilities face a common battle to reduce network operating costs, beat off increased competition and meet the increasingly complicated requirements of the regulators.

Yet it is the power sector organisations that so far have been cleverer about using their wealth of data to deal with these challenges. They have become truly data-driven by taking a strategic approach to data integration and analytics, whereas their counterparts in the water industry are still hampered by holding data in individual silos.

Time then, for water companies to learn from the more mature power operators and recognise that there are four main areas where data integration and analytics will bring major gains:

  1. Asset management – which includes optimising maintenance and operation, along with improved planning for new assets. It also encompasses supply chain optimisation.
  2. Customer management – through better service levels and the reduction of complaints, with further gains from increased efficiency of water management and other green initiatives.
  3. Regulatory performance – chiefly through analysing and measuring performance against current regulatory KPIs.
  4. Regulatory modelling – where there are major gains in mapping out future requirements and resources.

Unfortunately, it will be extremely difficult for water utilities to obtain any of these gains while their data remains closely guarded by the individual departments who are using it to answer their own business questions. This is a huge missed opportunity, given the constant flow of high-value data water utilities have from their customer relationship systems, geographic information systems, operational systems, telemetry, asset registers, regulatory data and good old-fashioned spreadsheets.

Following the example of the power companies, it is time for water utilities to abolish the silos and integrate all this data with external information from weather and environmental monitoring systems. By making the data accessible to everyone in the business they can give themselves infinitely more valuable datasets.

Boosting revenue and operations

Regulatory data, for example, can be analysed alongside sensor data from machines to help achieve better operation of assets and greater revenue, while simultaneously minimising regulatory fines. Effective integration also means that when the network is hit by severe weather disruption, customers affected can be quickly identified and informed of the problem and what is being done to fix it.
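The severe-weather scenario above is, at heart, a join between two previously siloed datasets. A minimal sketch of the idea (all names, zones and fields invented for illustration):

```python
# Hypothetical integrated datasets: a customer register and live sensor alerts.
customers = [
    {"id": 1, "name": "A. Smith", "zone": "N3"},
    {"id": 2, "name": "B. Jones", "zone": "S7"},
    {"id": 3, "name": "C. Wong",  "zone": "N3"},
]
sensor_alerts = [
    {"zone": "N3", "event": "main burst", "severity": "high"},
]

# Join on supply zone: every customer in a zone with an active alert
# is flagged for a proactive notification.
alert_zones = {a["zone"]: a["event"] for a in sensor_alerts}
to_notify = [
    (c["name"], alert_zones[c["zone"]])
    for c in customers if c["zone"] in alert_zones
]
print(to_notify)  # A. Smith and C. Wong, both in zone N3
```

In practice this join would run inside the integrated data platform rather than in application code, but the principle — affected customers identified the moment the sensor data lands — is the same.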

At the other end of the market, integration means water utilities can better scope and provide bespoke packages to major customers in manufacturing or retail, who now expect a variety of enhanced services.

The right mix of analytics

The analytics required to achieve these benefits are both traditional and advanced, all running on a single integrated view of data.

Advanced analytics is used to produce more accurate, useful and timely answers where the problem or question is well defined but challenges remain around integration and the capacity to analyse large volumes of data quickly.

Discovery analytics, by contrast, mines data for insights – putting data together to look for patterns without preconceptions, rather than asking specific questions. Data scientists work with business experts to determine what these patterns indicate and how the previously “unknown unknowns” can yield benefits.
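A minimal sketch of the discovery idea — scanning every pair of integrated measures for strong relationships, without deciding in advance which ones matter. The dataset and column names here are invented for illustration, and Pearson correlation stands in for the richer pattern-mining a real platform would offer:

```python
from itertools import combinations

# Hypothetical integrated columns from a utility dataset (invented values).
data = {
    "rainfall_mm": [5, 0, 12, 3, 20, 1],
    "pump_energy": [50, 30, 80, 40, 110, 32],
    "complaints":  [2, 1, 3, 1, 5, 1],
}

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Scan all pairs without preconceptions; flag strong links for a data
# scientist and a business expert to interpret together.
for a, b in combinations(data, 2):
    r = pearson(data[a], data[b])
    if abs(r) > 0.8:
        print(f"{a} ~ {b}: r = {r:.2f}")
```

A flagged pair is only a candidate pattern — whether a rainfall-to-pump-energy link is a known physical effect or a previously “unknown unknown” is exactly the judgement the business experts bring.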

The right outcome

Once the water utilities learn these lessons from the power and other asset-based industries, they can get to work on small-scale projects and focus on what will give them a quick return-on-investment. This way they will enable a wider strategic programme that quickly becomes self-funding.

Importantly, they will also be laying the foundations for a strategic programme that kicks aside departmental and cultural boundaries to truly unleash the power of data.


The post Pooling data and analytics to power greater efficiency in water utilities appeared first on International Blog.

InData Labs

A short guide to neural networks. How to master them and become famous.

I am pretty sure you have heard that Artificial Intelligence (AI) is involved in the creation of very interesting things nowadays. It helps to fight cancer, create artwork and disrupt the world economy. From the finance perspective, AI is riding a wave of hype as well. Investors are actively looking for AI projects, and news about startups acquired by enterprises appears every day.

The post A short guide to neural networks. How to master them and become famous. appeared first on InData Labs.


December 18, 2016

Curt Monash

Introduction to and CrateDB and CrateDB basics include: makes CrateDB. CrateDB is a quasi-RDBMS designed to receive sensor data and similar IoT (Internet of Things) inputs. CrateDB’s creators were...


December 17, 2016

Simplified Analytics

Want to know how to choose Machine Learning algorithm?

Machine Learning is the foundation for today’s insights on customers, products, costs and revenues; it learns from the data provided to its algorithms. Some of the most common examples of machine...


December 16, 2016

Mario Meir-Huber

Internet Of Things

Internet Of Things Company

IOT - Company

In general, life is becoming smoother, with smartness in every corner, from homes to cities, cars to watches and much more. Most traditional devices, like washing machines and refrigerators, are now connected to smartphones. Both consumers and manufacturers reap the benefits of a connected world. Buyers and sellers will not meet only at the point of sale in future; there will be a continual exchange of information between companies and their customers even as clients continue to enjoy their products. Users of connected devices are guaranteed a higher quality of life, with comfort as well as security and fun. The Internet of Things provides companies with an opportunity to increase their production efficiency by minimizing overhead costs and maximizing output.

The Internet of Things has the following main advantages.

Assures customer satisfaction

The Internet of Things allows companies to respond promptly to users’ requirements since information is transferred quickly. This enables companies to adapt their products and services to the specific needs of customers. The quality of goods and services can be improved in a short span of time because there is real-time feedback.

Significant cost reduction

Implementing Internet of Things smart home technologies in business can significantly reduce costs. For instance, employing smart energy grids reduces power consumption and ensures a lower cost of energy. On the other hand, remote monitoring and maintenance optimize the amount of manpower required, and consequently costs are reduced.

Increased sales

Accuracy and objectivity in decision making are achieved by systematic data collection and automatic analysis. The personalized and targeted system gives more efficiency and, consequently, increases the total volume of business sales.


Improved safety

Sensors and video cameras can significantly improve safety and reduce physical threats. This is possible because management can respond promptly to dangerous occurrences so that they are dealt with accordingly.

Improved business opportunities and new potential

The Internet of Things can aid the diversification or expansion of a business, for instance by providing new extended service options, such as monitoring for the prevention of incidents.

Major trends at IoT company Qulix

Big Data

The Internet of Things involves collecting large volumes of data. Therefore the right solutions for data storage and analysis will be in high demand to prevent data overload. This will ensure that data is processed in a timely manner, analyzed, and conclusions drawn from it.
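One common way to keep sensor volumes manageable and timely, as a sketch of the storage-and-analysis point above, is to aggregate raw readings into time windows before storing them. The readings and window size here are invented for illustration:

```python
from collections import defaultdict

# Raw readings: (timestamp in seconds, value). In practice these would
# stream in continuously from connected devices.
readings = [(0, 2.0), (12, 4.0), (40, 6.0), (65, 8.0), (70, 10.0)]

WINDOW = 60  # aggregate into one-minute buckets

buckets = defaultdict(list)
for ts, value in readings:
    buckets[ts // WINDOW].append(value)  # integer division picks the window

# Store one average per window instead of every raw reading.
summary = {w: sum(vs) / len(vs) for w, vs in sorted(buckets.items())}
print(summary)  # {0: 4.0, 1: 9.0}
```

The trade-off is resolution for volume: the raw stream can be discarded or archived cheaply while analysis runs against the compact summaries.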

Cloud computing and data protection

Connected devices are prone to cyber security attacks, making secure connections, secure data storage and overall safety a priority for Internet of Things technology.

Technical Know-How

Experienced and highly qualified personnel will be required to integrate new Internet of Things technology into existing systems. The digital transformation has just begun, and it is leading us to a safer and more sustainable future.

The post Internet Of Things appeared first on Techblogger.