
Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.


December 09, 2016

Revolution Analytics

Because it's Friday: Angst in four panels

I like my comedy dry and dark, a niche that David Thorne's 27b/6 has always served admirably. But I recently discovered (thanks to JT on Facebook) Jake Likes Onions, a comic series reminiscent of...


Revolution Analytics

The Value of R's Open Source Ecosystem

I was thrilled to be invited to speak at the Monktoberfest conference, held this past October in Portland, Maine. Not only have I been a great fan of the analysis from the Redmonk team for many...

Data Digest

Do you have the ability to tell stories with your data?

"The ability to take data…understand it, process it, extract value from it, visualize it, and communicate it, that’s going to be a hugely important skill in the next decades."  - Dr. Hal R. Varian, Chief Economist, Google, 2009 

In today’s rapidly advancing analytical world, I think it is fair to say that Dr. Varian was and still is very right.

Embedding an analytics strategy into an organisation doesn’t fail because we don’t know what to do or how to execute it. It often fails to resonate because we fail to create a story that is meaningful to the organisation and able to maintain interest long enough for capability to build to a critical mass.

In the past five years there has been a huge drive to recruit data and analytics experts; however, this has tended to focus on those with mathematical and economic skills rather than those with strong communications backgrounds. The latter should, in theory, be better equipped to communicate their data visually, i.e. to tell a story.

So why is Data Storytelling so vital? 

When you package up your insights as a data story, you create a channel for your data to become memorable, persuasive and engaging, encouraging your colleagues to buy into your data strategy, use it and champion it. By creating a story that people can understand, you make the acceptance and adoption of any changes to the business that much easier.


In April 2017, Greg Nichelsen of Data Speaks Up! will be coming to London to share his expertise on the Art of Storytelling at the Chief Analytics Officer Europe. Join him and discuss:

  • Effective strategies for bringing analytics to life through storytelling and visualisation
  • How to create a compelling voice: demonstrating how analytics will drive business outcomes and achieve business strategy
  • The importance of using industry case studies and tangible results to show the outcomes your company can expect – Making it real!
To read more about what is on offer at the event, including the full speaker line-up for Chief Analytics Officer Europe, please click here  


December 08, 2016

Revolution Analytics

R Consortium Projects Update

The R Consortium has already funded 8 projects (and 3 more just in July) proposed by the R community, and the call for proposals for yet more projects is now open. If you have an idea for a project...

Silicon Valley Data Science

Agile Data Science Teams Deliver Real World Results

Data science is an exciting, changing field. Curious minds and enthusiastic investigators can often get bogged down by algorithms, models, and new technology. If we’re not careful, we forget what we’re actually here to do: solve real problems. And if what we do is just theory, what’s the point?

To be relevant and useful, the day-to-day activities of data scientists must

  • prioritize use of technology, so that it produces the best results;
  • design technology and products with the consumer in mind; and
  • collaborate well with partners and customers.

In short, data science should result in real applications. The reality of this is multi-faceted. One important problem is managing data teams to get to that real world result. By using agile data science methods, we help data teams do fast and directed work, and manage the inherent uncertainty of data science and application development.

In this post, I’ll look at the practical ingredients of managing agile data science.

What are agile data science teams and why do we need them?

It’s a fact that data science results are probabilistic and unpredictable. At the start of a project, it can often look like there’s an obvious route from A to B. When you get started, it’s never that simple. Agile teams do away with strict planning and go into projects with a creative mindset; they embrace uncertainty instead of shying away from it.

This comes in handy when a roadblock pops up: traditionally run data science teams can get stuck deciding on their options, while flexible agile data science teams are more likely to find a new solution. Unpredictability and the need to adapt quickly to problems doesn’t scare them; it excites them.

At the same time, the agile planning method focuses hard on application to the customer’s problem. Otherwise, it’s easy for us to get lost down the rabbit hole of stringent rules about hypotheses, models, and results. In the latter scenario, we end up producing things that work—that validate our hypotheses—but that have little application to the real world scenario we’re producing them for. Wasting time is not good for us or our customers.

Key concepts

There are some key concepts that underpin the agile method we employ at SVDS. Collectively they provide us with the goals for a project, the top level strategy for investigation, and day-to-day action plans.

  • The charter—Why are we doing this project? What outcomes or conclusions do we hope to reach? What does my customer need at the end of this project?
  • Investigation themes—How do I gather and understand this data? What can I directly observe? What can I implement to help me understand the data?
  • Epics—Break down the investigation themes into one or more work plans.
  • Stories—Units of work that make up epics. These are concrete activities that can be completed in a given amount of time.

It’s great to have a method, but it helps to see how it’s used to solve a real problem. At SVDS, we used this method to create a system that tells train riders when the Caltrain is running late to a stop, and its approximate time of arrival. Let’s dive into how that worked.

A Caltrain example

I’ll give a brief overview of our Caltrain work, but if you want to learn more check out our project page. The point of this project, its charter, was to create an app that would tell the user when the Caltrain was running late, and how long it would be until it arrived at a designated stop. The Caltrain system has its own app, but it was inaccurate and didn’t tell riders whether a train was late or how late it was. No one likes being late for work, so we wanted to create a solution for riders.

The next step was to define the investigation themes, which started with the question: “how do I know the train is late?”

The epics portion included all the smaller questions and tasks required to answer the big questions posited in the investigation themes. Epics included undertakings such as “develop a working model for the Caltrain system under regular working conditions,” and “classify catastrophic events in the system that prevent the regular working model from applying.”

As the epics were broken down into units of work, the stories emerged. Example stories included “can I accurately and consistently use Twitter to find data on the train’s late times?” and “can I identify the direction of the train using video?”

Sprints, standups, and review meetings

Given the breakdown of work into epics and stories, how do you manage its execution and planning? This happens through sprints, standups, and review meetings. You’ll find that different agile practitioners have differing spins on these meetings, but the fundamentals are all similar.

Sprints. Stories are completed in sprints: set chunks of time, typically two weeks, devoted to working on tasks with the goal of producing new results. A sprint starts with sprint planning, where we’ll decide which stories to pursue.

Standups. Each day during the sprint, the team gathers in a standup meeting. Here they report their progress, say what they’re going to do next, and coordinate to remove blockers where people are stuck on their work. Standups, as their name suggests, aren’t for long discussions or problem solving. The point is to get information out there quickly, and set up any further discussion. Optionally, a customer stakeholder may attend these standup meetings. Alternatively, we schedule one or two updates with them separately each week. It’s important to keep them closely involved with progress.

Review meetings. The last step in the sprint process is to hold a review meeting, where the team presents and evaluates results. Customer stakeholders also attend this meeting. We present the work, and the group discusses it. Is it good enough? Should we keep working on it? Will it be useful, or should we abandon it now? We typically combine our review meetings, which are an assessment of the work done, with our sprint retrospective meetings, which are an assessment of how the work was done.

This keeps the group from spending long periods of time working on things that won’t actually benefit customers. If the work is incomplete, you discuss how to move forward. If it’s finished, you discuss what you learned from the experience. Agile teams are always learning from previous work.


Agile data science teams work in a way that is adaptable, collaborative, and produces usable results. They subscribe to the idea that data science can be creative and innovative. They embrace the unknown instead of making assumptions, and they don’t waste time beating their heads against a wall on things that aren’t working.

Agile teams are the future of data science, the creative teammates who work together to make things that are useful, and answer real world problems. The future is fickle, and we must be flexible to succeed.

Editor’s note: We are grateful for the contributions of Amber McClincy and Edd Wilder-James.

The post Agile Data Science Teams Deliver Real World Results appeared first on Silicon Valley Data Science.

Forrester Blogs

Business Intelligence Skills

So you have gone through the Discover and Plan phases of your Business Intelligence (BI) strategy and are ready to staff your BI support organization. What skills, experience, expertise and qualifications...

Teradata ANZ

Getting personal – how retailers must use data to make campaigns work

E-commerce hands retailers bags of important data about customers, but there is always the nagging question about whether they are making the best use of it.

We are not just talking about segmenting customers according to their age, gender and clothing size for marketing purposes, which is now pretty commonplace. No, retailers can really boost their return on investment by going even deeper into every minute detail of how customers spend their time either in store or browsing and shopping online. A marketing communications strategy can be hugely improved by using data about the amount of time a customer spends on a retailer’s website, the pages they visit and the items they add and remove from their shopping baskets.


Greater relevance to customers

For example, the “value-hunter” customers, who spend most of their time in the sale or discount area of a site or who always sort items by price, can be given advance visibility of online sales or targeted solely with paper sale catalogues. This reduces marketing costs and ensures material is always relevant. On the other hand, “in-trend” customers who are mostly interested in the new section of the online store can be sent promotional emails with a more aspirational feel about what is available for the new season.
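The “value-hunter” and “in-trend” rules above are simple enough to state in code. As a toy sketch only (the session fields and thresholds below are illustrative assumptions, not any retailer’s actual data model):

```python
def segment_customer(session):
    """Assign a marketing segment from simple browsing signals."""
    total = max(session["total_time"], 1)  # guard against division by zero
    # Mostly in the sale/discount area, or always sorting items by price
    if session["time_in_sale_pages"] / total > 0.5 or session["sorts_by_price"]:
        return "value-hunter"  # advance visibility of sales, paper catalogues
    # Mostly browsing the "new in" section of the store
    if session["time_in_new_pages"] / total > 0.5:
        return "in-trend"      # aspirational new-season promotions
    return "general"
```

Real campaigns would layer rules like these with the demographic segments and channel preferences discussed in the rest of the post.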

Dividing customers up this way may seem obvious, but by applying multiple layers of segmentation data, retailers and marketers can start to make highly accurate predictions about the kinds of offers and alerts that will be most effective in capturing their target consumers. They can then move on and identify each customer’s preferred communication channel, be it the postal service, e-mail, SMS and so forth, so they can deliver successful promotional campaigns by making sure each message provides real value.

It certainly doesn’t end there, either. A retailer can work out which device the customer prefers to use and employ the information to its advantage. For instance, a customer may frequently browse the catalogue on their smartphone and yet only make purchases via a PC. This presents a perfect opportunity to send the customer a link to the retailer’s app on the App Store or Play Store.


Making it personal to cut out waste

Segmenting customers in this way means retailers only send emails to those who are most likely to respond or spend. Instead of wasting time and money on emails that end up in the customer’s spam folder, promotional campaigns deliver far greater return on investment.

The days of “scatter shot” promotions – sending out flurries of impersonal promotional messages and offers in the hope that even a small percentage sticks – are certainly numbered. Consumers are now becoming increasingly unresponsive to these sorts of campaigns, relegating them to the junk folder or rapidly scrolling through to find the “unsubscribe” button.

Yet, besides reducing overall marketing spend by targeting resources where they have the greatest impact, it is also time retailers embraced the use of advanced analytics to reduce the number of customers defecting to competitors. These techniques watch and learn from customer behaviour and indicate when someone is about to move on, allowing such defections to be headed off with timely and personalised offers via email, text message or through the web portal.


The big gains from pulling it together

To make all this work in today’s intensely competitive market, retailers need a solution that provides a wholly integrated approach to planning, developing and managing their customer communications across multiple channels, product lines and business locations. Once they have it, they can start up both traditional and non-traditional campaigns, including highly sophisticated multiple-step dialogues and event-based promotions. They will also be able to optimise all their customer communications to deliver a highly-effective blend of messages and promotional offers exclusively for each customer. These will be based on priority and the availability of resources within a specified period.

When the gains are so obvious, there can no longer be any doubt that the time has come for retailers to put the terabytes of data they have about campaigns and customer purchase-histories to good use. By doing so, they will deliver major improvements in the effectiveness of their marketing communications and find they can engage in far more valuable interactions with their customers.

The post Getting personal – how retailers must use data to make campaigns work appeared first on International Blog.

Teradata ANZ

Is your DbFit?

In my previous blog I talked about modular design enabling continuous integration and automated testing through to delivery. Many customers I have worked with find the immediate challenge is how to do the testing, especially as it applies to a data warehouse or lake.

In this blog I will look at how to set up simple, modular tests, in line with, and supporting, the modular design paradigm discussed previously. The tool I will use is DbFit, a test-driven development framework for databases which runs on top of FitNesse, a wiki-based testing framework. DbFit supports a good variety of databases, including Teradata, and both DbFit and FitNesse are open source.

While the mechanics of the testing will be specific to DbFit, the concepts and design will be applicable to any tool you have access to, as the functional elements are still the same.

To move towards our goal of continuous delivery (and even if you don’t want to go all the way, taking steps in that direction is still very valuable for streamlining your deployment process), we need to ensure that the testing is comprehensive but also minimal, so that it can be executed and completed in the minimum elapsed time possible. Modular design assists us here by restricting the scope of our testing, reducing the number of objects to create and load with test data, and the number of tests to execute and verify.

The code in our example is part of Teradata’s EDW Decoded service offering, where we apply modern analytic techniques to your Teradata environment to determine opportunities to optimize the ecosystem, including offloading suitable processing and data to other platforms such as Hadoop. You can read about this here.

The scenario we will use is testing some code (SQL) which selects data from a view containing joins and aggregation, and outputs the result to a table. This is a common scenario in data processing and simple enough to show us the concepts without being overly complex.

You can see that in order to execute the code being tested, i.e. the “Insert Select”, we need to meet a number of dependencies.

These are:
• We can connect to the database
• Database(s) exist to contain the test objects
• A user exists with the correct permissions
• The output table exists
• The input tables exist
• The input view exists
• The input tables contain test data

These dependencies form our “recipe” for what I will refer to as our functional test. We will be testing the functionality of the process, ensuring that it executes successfully and produces the correct results.

To meet the dependencies, we require a number of “ingredients” which will be used in the functional test. These are unit tests, as they also need to execute correctly and produce the correct result, but at a lower atomic/unit level.

The source code for the objects is stored within scripts, normally executed by bteq (Teradata’s batch utility for executing SQL). The scripts are tokenized, so that they can be executed in different environments without code change by having the tokens replaced with environment-specific values.

DbFit allows you to execute queries against a database, but does not know about bteq or tokens, so we must add command line execution as a requirement, to allow us to execute our source code.

Our ingredient list looks like this:
• Connect to the Teradata database
• Execute a command line statement
• Execute a DDL script (e.g. table, view, stored procedure) using the command line
• Replace tokens in the script before execution
• Load test data using DbFit
• Test expected results using DbFit

We will assume that creating a database and user for the tests and assigning permissions are done as a one-off task before testing begins, but these could easily be added if required.

Our ingredient list is small, so we only need a few templates to be developed which we can then replicate for all of the specific objects within the test scope. In a real world application, this is the perfect scenario for automated generation. Develop the template(s) initially, test thoroughly and then automate the creation of specific tests using the template. By breaking the tests down into the simplest atomic/unit test cases, you increase the ability to automate the creation of them, and at the same time you increase the flexibility and re-usability of those test cases, enabling reuse of these ingredients in many different recipes.

For executing the command line statements, we will use the CommandLineFixture code found here.

FitNesse is wiki-based, so our tests are built up from web pages. You can read about the different page types here. We will be using a test suite page for our “recipe” to specify the unit tests to execute, and test pages as our “ingredients” to define the unit tests. Static pages can be used to group and organize the unit tests.

The test suite will follow the high level pattern of setup, execute tests and finally tear down. This ensures that our tests are repeatable. Test results are stored by FitNesse and are used to investigate failures.

Setup the Test
You can specify setup pages to be automatically included both at the Suite level and the individual test level. For our suite setup, we want to load our library files (java .jar files) and import the command line fixture.
The page is named “SuiteSetUp” and is created under the parent (root) page, so that it is inherited by all of the children pages.

The wiki code is:
!path lib/*.jar

!|Import |CommandLineFixture|

|Comment |
|set classpath & import fixtures|


The other setup page we will use is a test SetUp page. This will be executed for all of the tests at the same level of the wiki hierarchy, and is how we will get DbFit to connect to Teradata to execute SQL for loading test data and testing results.

The wiki code is:


!|Connect Using File||




We can see above the inherited page “SuiteSetUp”, so we know the required libraries are included. The first line tells DbFit that we are executing tests against Teradata. The second line tells DbFit to connect using a file; this allows us to store the username and password details securely on the test server, rather than have them exposed on a web page.

The contents of the file are the JDBC connection string.

Execute Tests
The first thing we need to do is create our database objects by executing the DDL scripts. Creating a table is done using this page:


The wiki code is:
!|CommandLineFixture |
| title | Execute DDL | |
| command | /root/edw_decoded/ ${DATABASE_NAME} ${DDL_PATH}/QueryUse.sql ${LOGIN_FILE} | result |
| contains| result.stdout | (return code) = 0 |

The test page for creating the “QueryUse” table is in a hierarchy, under the parent “EdwDecoded”, “DatabaseObjects”, and “DatabaseTables” pages. This allows us to organize the unit tests to make it easier to find specific tests, and also allows us to inherit variables, making the template more generic (and thus easier to create automatically).

The ${DATABASE_NAME} and ${LOGIN_FILE} variables are set on the “EdwDecoded” page, so that they are used consistently across all tests and test suites.

The wiki code is:
Database used for testing:
!define DATABASE_NAME {edw_decoded_dbfit}

Script Login file used for testing:
!define LOGIN_FILE {/mnt/hgfs/edw_decoded/DDL/logon.txt}

!contents -R2 -g -p -f -h

The ${DDL_PATH} variable differs based on the object type, so tables, views, stored procedures etc. are stored in different directories. Since we are creating a table, we define this variable in the “DatabaseTables” page.

The wiki code is:
!define DDL_PATH {/mnt/hgfs/edw_decoded/DDL/02_Tables}
!contents -R2 -g -p -f -h

Due to inheritance of the variables, the only customized part of the test page is the name of the script itself, that is “QueryUse.sql”. So some simple code to scan the directory of your source code repository, get the file names, and then create test pages from the template is very easy to write, allowing you to automate the creation of many test pages.
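As a sketch of that generation code, the following Python mirrors the CommandLineFixture page above. The @SCRIPT@ placeholder is our own token for the template, and it relies on FitNesse storing each page as a directory containing a content.txt file:

```python
import os

# Template for one test page; @SCRIPT@ is our own placeholder token,
# everything else is the wiki code from the CommandLineFixture page above.
PAGE_TEMPLATE = """!|CommandLineFixture |
| title | Execute DDL | |
| command | /root/edw_decoded/ ${DATABASE_NAME} ${DDL_PATH}/@SCRIPT@ ${LOGIN_FILE} | result |
| contains| result.stdout | (return code) = 0 |
"""

def render_page(script_name):
    """Fill the template for one DDL script, e.g. QueryUse.sql."""
    return PAGE_TEMPLATE.replace("@SCRIPT@", script_name)

def generate_test_pages(ddl_dir, pages_dir):
    """Write one FitNesse test page per .sql script found in ddl_dir.

    FitNesse stores each page as a directory holding a content.txt file,
    so QueryUse.sql becomes <pages_dir>/QueryUse/content.txt.
    """
    created = []
    for name in sorted(os.listdir(ddl_dir)):
        if name.endswith(".sql"):
            page = os.path.splitext(name)[0]
            os.makedirs(os.path.join(pages_dir, page), exist_ok=True)
            with open(os.path.join(pages_dir, page, "content.txt"), "w") as fh:
                fh.write(render_page(name))
            created.append(page)
    return created
```

Pointing this at the 02_Tables directory would emit one test page per DDL script, identical apart from the script name.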

Taking this further, you could add a trigger to your source code control system (e.g. Git) on commit, so that when a new DDL file is committed to the repository, the test generation is automatically executed, creating and executing the test (and rejecting the commit if it fails). This concept embodies some of the fundamental principles of DevOps and continuous deployment.
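A minimal sketch of such a trigger, written here as a Python pre-commit hook (a real hook would live at .git/hooks/pre-commit; run_suite_for is an assumed wrapper around regenerating the page and running the FitNesse suite):

```python
import subprocess

def staged_ddl_files(diff_output):
    """Pick the DDL scripts out of `git diff --cached --name-only` output."""
    return [f for f in diff_output.splitlines() if f.endswith(".sql")]

def hook_main(run_suite_for):
    """Return 0 to allow the commit, non-zero to reject it."""
    # Ask git which files are staged for this commit
    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    for path in staged_ddl_files(staged):
        if run_suite_for(path) != 0:
            print(f"DbFit tests failed for {path}; rejecting commit.")
            return 1
    return 0
```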

The command line fixture executes a “helper” shell script and inspects the standard output to ensure that it contains a success message (return code = 0); this is the test that ensures the script executed successfully.

The helper script performs the token replacement(s) using sed, and feeds the script and login file to bteq to execute the DDL.

The helper shell script is:


# Assign the arguments passed by the test fixture
DATABASE_NAME=$1; DDL_FILE=$2; LOGIN_FILE=$3
# Detokenise the sql file, add the login details and execute with bteq
sed s/\$BASE_NODE/${DATABASE_NAME}/g ${DDL_FILE} | cat ${LOGIN_FILE} - | bteq

We use a helper script because we want to keep the actual test cases very simple: a case of the right tool for the right job. It is much easier to use a small shell script to perform token replacement and stream the login file details and script contents into bteq than to try to put this into the actual command line fixture within FitNesse.

The shell script is easy to modify to support other tokens or requirements specific to different scripts, or multiple shell scripts used to support legacy code or multiple standards. This is the principle of isolating change, increasing the resilience of the test suite by simplifying individual changes and also increasing flexibility.

This command line template is used for creating the tables and view required for our test suite.

Now that we have our objects created, we need to load test data. This is done using DbFit to clean the table (i.e. delete any existing records), insert the test data, and test that the records were inserted correctly.

The data loading test page looks like this:


And the wiki code is:
!4 Clean any existing data
|Clean |
|table |

!4 Insert test data
|Insert |${DATABASE_NAME}.queryGroup |
|1 |Group_1 |? |abc |
|2 |Group_2 |? |abd |
|3 |Group_3 |? |efg |

!4 Confirm records inserted
|Query|SELECT COUNT(*) AS c1 FROM ${DATABASE_NAME}.queryGroup|
|c1 |
|3 |

!4 Commit transaction to persist test data

The DbFit methods “Clean”, “Insert”, “Query” and “Commit” are used to clean the table, insert the data, perform a query to ensure results are as expected, then commit the changes to the database, as we will be using this data in subsequent tests.

The test data template will require the most manual preparation, to ensure that the developer creates a meaningful set of test data, with referential integrity, that covers all of the known test cases. The set of test data is expected to be enriched over time, as it is best practice to ensure that any failures, both in subsequent testing and in production, are used to create further test cases (and the data required to test them). This increases test coverage and reduces the risk of repeat failures being introduced into the environment.

With objects created and test data loaded, we can execute the process.


The wiki code is:
!4 Execute the transform script
!| CommandLineFixture |
|title |Execute DDL | |
|command |/root/edw_decoded/ ${DATABASE_NAME} ${DDL_PATH}/Populate_SumQGAffinity.sql ${LOGIN_FILE}|result |
|contains|result.stdout |(return code) = 0|

!4 Test result data
!|Inspect query|Select * from ${DATABASE_NAME}.sum_qgaffinity|

We use a command line fixture to execute the bteq script, our transform code, and DbFit’s “Inspect Query” to get the contents of the table returned into the test results.

Additional tests would normally be included for the number of records, expected values and so on, but for simplicity’s sake I have just included the inspect in this example.

All of the unit test pages are put together into a suite, which allows you to pick specific tests and the sequence in which to execute them.

Our test suite page is:


The wiki code is:
!3 Ensure dependencies are met

!4 Create Tables
!see .EdwDecoded.DatabaseObjects.DatabaseTables.QueryUse
!see .EdwDecoded.DatabaseObjects.DatabaseTables.QueryTableUse
!see .EdwDecoded.DatabaseObjects.DatabaseTables.TableUse
!see .EdwDecoded.DatabaseObjects.DatabaseTables.QueryGroup
!see .EdwDecoded.DatabaseObjects.DatabaseTables.SumQgAffinity

!4 Create Views
!see .EdwDecoded.DatabaseObjects.DatabaseViews.QueryGroupAffinity

!4 Load Test Data
!see .EdwDecoded.TestData.QueryUse
!see .EdwDecoded.TestData.QueryTableUse
!see .EdwDecoded.TestData.TableUse
!see .EdwDecoded.TestData.QueryGroup

!4 Execute the transform script
!see .EdwDecoded.RuntimeScripts.PopulateSumQgAffinity

I have set up the hierarchy so that the test suite page does not have any children, so we use the FitNesse “!see” command to include other test pages in the suite. This allows us to pick the exact unit tests we need in our suite, without worrying about extra tests outside our scope being included automatically.

The functional test lends itself to automatic creation as well, albeit more complicated, as you would need to parse the code to extract the dependencies. This is made (much) easier by having well-defined coding standards, and rejecting the commit of any code that does not meet them. While initially annoying for the developer, the benefits you gain from having known code quality and standards vastly outweigh any drawbacks, and they enable increased automation, which in turn reduces testing and deployment timeframes.
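As a hedged sketch of that parsing step, assuming a coding standard that guarantees unquoted database.table references, a regular expression is enough to pull the dependencies out of a script:

```python
import re

# Matches database.table after FROM, JOIN or INSERT INTO; $ is legal in
# Teradata object names, hence its presence in the character class.
TABLE_REF = re.compile(
    r"\b(?:FROM|JOIN|INSERT\s+INTO)\s+([\w$]+\.[\w$]+)", re.IGNORECASE)

def extract_dependencies(sql):
    """Return the distinct database.table names a script reads or writes."""
    return sorted({m.group(1) for m in TABLE_REF.finditer(sql)})
```

The dependency list then maps directly onto the “!see” entries of the generated suite page; a production version would also need to resolve views to their underlying tables, which is where well-defined standards pay off.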

Test Tear Down
Now that the testing is complete, we need to clean up after ourselves so that the test is fully repeatable.
This is performed using a “SuiteTearDown” page.


The wiki code is:


!|Connect Using File||




We simply connect to Teradata and delete the database used for our testing.

To execute test suites in parallel, simply use separate databases for those tests. Following the approach outlined here, each database will be small in size and reusable, so having many of them to perform testing in parallel is easily done. Having a separate test database per developer is a good approach, allowing developers to test their code in isolation, while still having comprehensive test coverage (minimum scope, maximum coverage).

By taking a small set of simple, reusable templates, we have been able to put together a comprehensive test suite, allowing us to setup a test environment with all necessary objects and data to perform our tests, then clean up after the testing is complete.

At the same time, we have started to build a reusable test library, with the “ingredients” able to be reused for other functional tests. The effort required to add new functional tests will reduce over time, as reuse increases. Manual effort can be significantly reduced using automation, allowing developers and testers to focus on the important task of getting the test data correct and fit for purpose.

In a full DevOps deployment pipeline, the testing tool is just one component of a tool suite which must be integrated together. I hope I have shown you that it is possible to start out with just one tool, get effective results and start the culture change (adopting coding standards): the first few steps along the DevOps journey. Once comfortable with test automation, additional tools can be integrated to provide better functionality and automate other stages in the deployment pipeline. Getting the foundations correct, and implementing incrementally, are keys to being successful in DevOps.

The post Is your DbFit? appeared first on International Blog.


December 07, 2016

Big Data University

New York Data Science Bootcamp And Validated Badges

This post is about our New York Data Science Bootcamp. Interested? Find data science bootcamps in your city.

The Data Science Bootcamp journey for Big Data University started in China back in September. At the time, we were in Beijing for the CDA (Certified Data Analyst) Summit, which had an attendance of more than 3,000 data analysts, data scientists, data engineers, students, and professors. After that, we held a four-day Data Science Bootcamp in Guadalajara, Mexico, with 80 participants.

More recently, IBM’s Big Data University held a four-and-a-half-day, beginner-friendly, hands-on data science bootcamp in New York City. It took place from Tuesday, November 29th to Saturday, December 3rd at the Zicklin School of Business, Baruch College, CUNY.

BDU’s Chief Data Scientist, Saeed Aghabozorgi, PhD, delivered the Data Science bootcamp with the aid of two fellow BDU data scientists, Candi Halbert and Hima Vasudevan. Forty-five participants from the corporate and academic worlds took part in the bootcamp.

We are very grateful for the help and support of Maithily Erande (Lecturer, Marketing Analytics, Dept. of Marketing & International Business) in arranging this New York data science bootcamp. (Maithily is an analytics expert and a Product Manager / Vice President for Product Engineering with several Fortune 100 clients, including Apple, Microsoft, Dun & Bradstreet, JP Morgan Chase, Cisco, and Sesame Street.)

Introducing Validated Badges

At this bootcamp, Saeed gave an overview of data science and Big Data University, plus an introduction to the R language, data analysis, data visualization, machine learning, and working with big data using SparkR. There was also a capstone project which gave participants hands-on experience of a real-world data science project. For this project, participants used the Data Scientist Workbench, a virtual lab environment that allowed them to perform lab exercises directly in their browsers.

For the first time ever, participants who passed the exam were eligible not only for a certificate but for a Validated Badge as well. This badge is reserved for Big Data University students who pass the test for a course in person (so that we can guarantee that they were the ones who took the exam). These badges, like our regular badges, can be shared with others using social media, on your resume, etc.

We, at BDU, are greatly enjoying these bootcamps. The feedback received so far has been equally enthusiastic. If you are interested in attending future Data Science Bootcamps worldwide, please refer to our Data Science Bootcamps link. We hope to see you there!

The post New York Data Science Bootcamp And Validated Badges appeared first on Big Data University.

Revolution Analytics

Microsoft R Server 9.0 now available

Microsoft R Server 9.0, Microsoft's R distribution with added big-data, in-database, and integration capabilities, was released today and is now available for download to MSDN subscribers. This...

Big Data University

This Week in Data Science (December 06, 2016)

Here’s this week’s news in Data Science and Big Data.

Data Trends 2017

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

The post This Week in Data Science (December 06, 2016) appeared first on Big Data University.

Data Digest

My First 90 Days as Chief Data Officer

Ahead of Corinium’s Chief Data and Analytics Officer, Sydney conference in March 2017, we caught up with Mark Hunter, Chief Data Officer of Sainsbury’s Bank (UK). It was great to get a snapshot of Mark’s experience in his new role and of the similar challenges faced by data and analytics professionals in Australia and the UK.

Corinium: You are relatively new to the Chief Data Officer role. For those about to embark on a similar journey, what have been the key lessons learned in the first 90 days that you would like to share?

Mark Hunter: Although CDO is a technical role, the most important thing to remember in the first 90 days is that “it’s all about people”. Invest time in listening intently to what people are willing to share with you, including your team, the broader data community, stakeholders and your company’s leaders. This should give you a good understanding of where to focus your attention.

Although CDO is a technical role, the most important thing to remember in the first 90 days is that “it’s all about people”.  

Corinium: You will be speaking on “How Architectures Need to Evolve in Order to Leverage New Innovations in Analytics” – a really interesting topic given the staggering amount of talk around machine learning and artificial intelligence. How can you make your company machine learning ready?

Mark Hunter: Unfortunately many firms are not quite analytics-ready, never mind machine learning! During my career I’ve seen firms struggle to come to terms with the fact that analytics results need to find their way onto production data platforms. The issue is that many production data platforms were built to support a static set of requirements, which is the absolute opposite of analytics.

Corinium: What are the biggest challenges people face when it comes to getting the data landscape right?

Mark Hunter: The biggest challenge is how to mainline analytics. By this I mean that we need to build data platforms that are adaptive to change and welcome analytics as a first-class citizen.

Corinium: What are the key factors to consider in how architectures need to evolve?

Mark Hunter:
The biggest factor to consider is how to take friction out of the system. The types of friction I’ve been focussing on are data, analytics and people: how do we make it easy to add new data, and new types of data, onto the platform; how do we make it easy for analytics results to find their way onto the platform; and how do we structure our people so that everyone is aligned and working towards a common vision?

Corinium: Why do organisations need a ‘logical data warehouse’?

Mark Hunter: Data is evolving; we are capturing an increasing volume and variety of data. It is crucial that our data platforms cater for all of our data and analytics requirements, hence the need to evolve towards a logical data warehouse.

Meet Mark and an amazing line-up of over 60 speakers at the Chief Data and Analytics Officer, Sydney, taking place on 6-8 March 2017.

For more information visit:  

December 06, 2016

The Data Lab

Learning Journey: Strata + Hadoop San Jose, March 2017

Strata + Hadoop World, San Jose 2017

Strata + Hadoop World is a 4-day immersion in the most challenging problems, intriguing use cases, and enticing opportunities in data today. Over a hundred of the most interesting people in data will take the stage to share their expertise and ideas, including The Data Lab’s CEO Gillian Docherty.

The Data Lab is offering 8 fully-paid passes for the event, and SDI is complementing the offer with travel & accommodation support in the form of a flat-rate grant up to a maximum of £600 – based on support levels for USA and a maximum of 5 nights in-market. In addition, we will organise a full visit program to include possible meetings and networking events with relevant organisations.

Companies should note that places are limited and these will be allocated based on the relevance for both the applying organisation, and their representative. This Learning Journey is targeted at senior leaders at CTO level or equivalent. Applications will be assessed with regard to their fit with the objectives of the Learning Journey and their ability to derive benefits / insights for Scotland from their participation. Participating companies will also need to commit to a post-event survey regarding outcomes delivered.

To apply, please complete the enclosed application form and e-mail it to, with the subject “San Jose Application” by 5pm on Tuesday 31st January 2017.



Revolution Analytics

dplyrXdf 0.90 now available

by Hong Ooi, Sr. Data Scientist, Microsoft Version 0.90 of the dplyrXdf package has just been released. dplyrXdf is a package that brings dplyr pipelines and data transformation verbs to Microsoft R...


December 05, 2016

Revolution Analytics

In case you missed it: November 2016 roundup

In case you missed them, here are some articles from November of particular interest to R users. Microsoft R Open 3.3.2, based on R 3.3.2, has been released for Windows, Mac and Linux. A new, free...

The Data Lab

Mission Christmas: You can help us help Santa make it to every house in Edinburgh

Mission Christmas Cash for Kids

One in five children are living in poverty right now, and Cash for Kids Mission Christmas is working hard to make sure every child has a happy Christmas morning. In 2015 they raised over £773,200 in gifts and cash and supported over 14,200 local children.

This year, The Data Lab has signed up as a Mission Christmas drop off point. The toys and gifts received through supporters such as you will go a really long way in keeping the magic of Christmas alive for children who would normally experience Christmas as just another day.

We're asking you to buy a present that we can give to a disadvantaged local child to make their Christmas morning special. We need new and unwrapped gifts suitable for children and young people aged 0-18 years. You can donate at our Edinburgh office (15 South College Street, Edinburgh EH8 9AA) by close of business Tuesday 13th December. If we are too far from you but you would like to donate, there are over 600 drop off points across the region.

Everything we can do to support Cash for Kids Mission Christmas really does mean the world to the kids they support, and it will put huge smiles on the faces of children who are going through a really rough time. This is a time of year when every child deserves to be happy, opening presents by the tree. With your help, we can make that happen. Santa will make it to every house in Edinburgh, the Lothians and Fife this year, that’s the Mission.

Together, we can make a difference.

Merry Christmas!


Teradata ANZ

Banks are now part of the Internet of Things (IoT) and need the right tools to excel

In retail banking, as in other sectors, new types of data are emerging as banks shift shape to meet changes in consumer demand, head off threats from Fintech insurgents and adapt to the arrival of new payment mechanisms such as Apple Pay.

Indeed, banks have already entered the Internet of Things (IoT), as mobile devices such as smartphones and wristbands are increasingly used to pay for goods and services.

Being thrust into this new environment should cause banks to reassess how they use data and analytics to solve business challenges. The question is whether they are equipped to seize the opportunity. Using all this new data to uncover lucrative insights into customers and their behaviour is not straightforward: complex sources and formats hinder consumption, requiring new analytical techniques to find patterns and trends.

The advance of IoT in payments

With consumers having more payment options at their disposal, including peer-to-peer systems, there is greater pressure on banks to use their new data to re-fashion customer relationships. A good example is the collaboration between a chain of coffee shops and a telecommunications provider: after detecting the proximity of customers through their smartphones, the coffee shop company sends them an offer, allowing drinks and snacks to be picked up without queuing. As this level of service becomes more common, younger customers in particular are bound to expect something similar from banks.

The banks will also have to change their ways to combat the threats from Apple Pay and online banking operators. There is a real danger that such tech innovators and related third parties will dominate the customer relationship by effectively owning the interface. Without this direct contact with consumers, banks stand to lose highly valuable opportunities for marketing to individuals.

IoT data as a tool of effective competition

To combat these threats, banks will need to consider using the streams of information flowing from devices in the same way that manufacturers, maintenance providers and utility companies do. For banks, IoT data comes from devices such as customers’ smartphones, mobile apps, ATMs and online. To extract the nuggets, this data has to be integrated with all of the insights generated from more traditional channels, sparking a revolution in how banks understand their customers’ needs and habits.

Yet there is another big hurdle to overcome. While cheaper storage methods have enabled the collection of large volumes of data, banks still have notoriously disparate systems that inhibit innovation. Banks need to abolish these divisions, so that data can be queried and integrated using analytics at scale, enabling them to make the crucial link between the customer journey and the individual. This means working out in near real-time who they are, the products they use and their lifetime value. This is the context that allows a bank to assess the true customer experience.

Integration of this nature is indispensable. In manufacturing, for example, a set of sensor readings from a machine is virtually useless if not accompanied by the machine attributes such as its age, warranty, length of service, last maintenance point and so forth. Data integration such as this is the key to unlocking value for banks just as much as it is for electricity generators or motor manufacturers.

IoT data is now there for the banks to tap into and extract an enormous amount of value from. Yet it will require astute management so that analysts have the flexibility to integrate and continue uncovering hugely precious insights.

The post Banks are now part of the Internet of Things (IoT) and need the right tools to excel appeared first on International Blog.


December 03, 2016

Simplified Analytics

Digital Transformation helping to reduce patient's readmission

Digital Transformation is helping all the corners of life and healthcare is no exception. Patients when discharged from the hospital are given verbal and written instructions regarding their...


December 02, 2016

Revolution Analytics

Because it's Friday: Border flyover

President-elect Trump has famously pledged to build a wall along the US-Mexico border, but what would such a wall actually look like? This short film directed by Josh Begley follows the path of the...


Revolution Analytics

Stylometry: Identifying authors of texts using R

Few people expect politicians to write every word they utter themselves; reliance on speechwriters and spokespersons is a long-established political practice. Still, it's interesting to know which...

Data Digest

How the Evolution of Learning & Development Transformed the role of Chief Learning Officers (CLO)

After months of research with over 50 Chief Learning Officers (CLOs) and de facto Learning & Development (L&D) executives in preparation for Corinium’s Chief Learning Officer Forum USA, I found that the fundamental principles of L&D have not changed substantially since their implementation nearly 20 years ago. However, these exceptional leaders are constantly looking for ways to innovate and keep ahead of the curve, developing more creative, thorough and inclusive initiatives to fulfil those traditional ideas.

It is evident that corporations still seek to champion and support the professional development of their workforce in order to maintain their predominant position within a specific market. Nonetheless, highly competitive environments have propelled substantial changes in the fundamental tasks that every CLO faces at a modern corporation. For instance, corporations now expect these executives to: (1) effectively develop leadership programmes focussed on business delivery; (2) facilitate the transition from legacy methods to modern technologies to further promote employee training, engagement and retention; and (3) overcome the generational gap when implementing new tools and L&D programmes.

Key Principles of Learning & Development 

In order to understand how these new challenges have transformed the CLO role, let us look at two main contemporary principles of any L&D initiative: diversity and inclusion. Owing to old cultural traditions, some old-fashioned corporate values were shaped by underlying notions of segregation and discrimination. These old social institutions, which shaped the L&D programmes of the past, have gradually given way to more contemporary ideas of “people’s development” and “workforce inclusion”. The task of transforming those notions falls directly under the CLO's responsibility. These experts understand that there is a direct correlation between better professional performance and the implementation of programmes that recognise and promote diversity and inclusion.

Certainly, the key is understanding that diversity is not only a concept applicable to race and gender; it also encompasses age, sexual orientation, faith, social background, language, amongst many other things. Likewise, inclusion does not refer only to minorities and/or outsiders integrating with established social groups, but also to fundamental ideas of recognition and respect.

The key is understanding that diversity is not only a concept applicable to race and gender; it also encompasses age, sexual orientation, faith, social background, language, amongst many other things.

The importance of the right technological solution

It should come as no surprise that these corporations, who will join us next March in New York City, have embraced these two concepts thoroughly and applied them as core values of their L&D programmes, promoted by their CLOs. The vast majority have publicly committed to improving the presence of minority groups within the leadership structure, closing the gender gap in the executive office, and promoting respect for difference amongst their local and international workforces. Similarly, they have acknowledged the vital role that generational exchange plays in furthering leadership initiatives, and the importance of openly discussing topics related to race, gender, background, faith and others.

However, all these initiatives have been supported by the right technological solutions, which enable and grant access to these programmes for a wider audience. Whether by offering faster, easier and cheaper access to information or by promoting contemporary principles of inclusion and diversity, technology has crucially transformed the way corporations interact with their employees.

Those L&D programmes that were once designed to target a select group of individuals have now become accessible to anyone. The opportunity to access information anywhere and anytime has democratised the way people learn and develop their personal and professional hard and soft skills. Nowadays the training, readiness and reskilling of personnel is carried out without segregation or segmentation, thanks to inclusive corporate policies that promote contemporary values of respect and social recognition, and technological platforms that allow anyone to access L&D programmes at any time.

Clearly, none of these initiatives or programmes could be considered perfect, nor the full answer to the demands raised by global corporations. Nonetheless, they show how CLOs in the US have embraced these concepts and tried to construct a more solid policy of inclusion for a vast and diverse workforce within flexible organisational structures.

Understanding the significance of these concerns and the importance of public initiatives for debating these and other key aspects, Corinium Intelligence has proudly created the Chief Learning Officers Forum USA, taking place on March 7-8, 2017 in New York City, for all CLOs and L&D experts in the US. The event promises to be the place where more than 100 L&D experts will gather to discuss the challenges they face daily and the clever solutions they have produced to implement those diversity and inclusion values. Please get in contact and let us welcome you at Convene, 101 Park Avenue, next March!

By Alejandro Becerra:

Alejandro Becerra is the Content Director for LATAM/USA for the CDAO and CCO Forum. Alejandro is an experienced Social Scientist who enjoys exploring and debating with senior executives about the opportunities and key challenges for enterprise data leadership, to create interactive discussion-led platforms to bring people together to address those issues and more. For enquiries email:
Data Digest

JUST RELEASED: Chief Data Scientist, USA - Speaker Presentations | #CDSUSA

On 16th - 17th November 2016, Corinium launched the inaugural Chief Data Scientist, USA; the premier conference for high-level data science practitioners to get a detailed roadmap for developing the leadership role for data science. The event set out to assist anyone looking to fully exploit the data science capability within their organization.

Using an interactive format, the forum brought together over 100 senior-level data science peers to share their latest innovations, best practices, challenges and use cases, as well as to facilitate conversations and connections. Alongside keynote presentations from our senior speaker line-up, our informal discussion groups, in-depth masterclasses and networking sessions provided the opportunity to take away new ideas and information that deliver real benefits to attendees' companies.


December 01, 2016

Revolution Analytics

Using R to Gain Insights into the Emotional Journeys in War and Peace

by Wee Hyong Tok, Senior Data Scientist Manager at Microsoft How do you read a novel in record time, and gain insights into the emotional journey of main characters, as they go through various trials...

Silicon Valley Data Science

Big Data is About Agility

As a buzzword, the phrase “big data” brings many things to mind, but to understand its real potential, look to the businesses creating the technology. Google, Facebook, Microsoft, and Yahoo are driven by very large customer bases, a focus on experimentation, and a need to put data science into production. They need the ability to be agile, while still handling diverse and sizable data volumes.

The resulting set of technologies, centered around cloud and big data, has brought us a set of capabilities that can equip any business with the same flexibility, which is why the real benefit of big data is agility.

We can break this agility down into three categories: purchasing and resource acquisition, architectural factors, and development.

Buying agility

Linear scale-out cost. A significant advantage of big data technologies such as Hadoop is that they are scalable. That is, when you add more data, the extra cost of compute and storage is approximately linear with the increase in capacity.

Why is this a big deal? Architectures that don’t have this capability will max out at a certain capacity, beyond which costs get prohibitive. For example, NetApp found that in order to implement telemetry and performance monitoring on their products, they needed to move to Hadoop and Cassandra, because their existing Oracle investment would have been too expensive to scale with the demand.

This scalability means that you can start small, but you won’t have to change the platform when you grow.
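The economics can be sketched with a toy cost model. The figures below are invented purely for illustration (they are not from NetApp or any vendor); the point is only the shape of the two curves:

```python
# Toy cost model: linear scale-out versus a capacity-limited platform whose
# per-TB cost climbs as utilisation approaches a hard ceiling.
def linear_cost(tb, cost_per_tb=300):
    """Scale-out platforms: each extra TB costs roughly the same."""
    return tb * cost_per_tb

def constrained_cost(tb, base=300, ceiling=500):
    """Capacity-limited platform: each TB gets dearer as the ceiling nears."""
    return sum(base / (1 - t / ceiling) for t in range(tb))

for tb in (100, 250, 400):
    print(f"{tb} TB: linear ${linear_cost(tb):,.0f}  "
          f"constrained ${constrained_cost(tb):,.0f}")
```

At low volumes the two platforms cost about the same; near the ceiling the constrained platform's marginal cost becomes prohibitive, which is the "max out" effect described above.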

Opex vs capex. Many big data and data science applications use cloud services, which offer a different cost profile to owning dedicated hardware. Rather than getting lumbered with a large capital investment, using the cloud turns compute resources into an operational cost. This opens up new flexibility. Many tasks, such as large, periodic Extract-Load-Transform (ELT) processes, just don’t require compute power 24/7, so why pay for it? Additionally, data scientists can now leverage the elasticity of cloud resources: perhaps following up a hypothesis needs 1,000 compute nodes, but just for a day. That was never possible before the cloud without a huge investment: certainly not one anybody would have made for a single experiment.
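The 1,000-nodes-for-a-day point is easy to see with back-of-envelope arithmetic. All prices here are invented for illustration; real cloud rates and server prices vary widely:

```python
# Back-of-envelope opex-vs-capex comparison for a one-day, 1,000-node
# experiment. Every figure below is hypothetical.
nodes, hours = 1000, 24
cloud_rate = 0.50                      # hypothetical $/node-hour on demand
rental = nodes * hours * cloud_rate    # opex: pay only for the experiment
server_capex = 5000                    # hypothetical purchase price per server
purchase = nodes * server_capex        # capex: own the hardware outright
print(f"rent for a day: ${rental:,.0f} vs buy outright: ${purchase:,.0f}")
```

Even with generous assumptions, renting the cluster for a day costs a small fraction of buying it, which is why elastic experiments only became practical with the cloud.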

Ease of purchase. A little while ago I was speaking to a CIO of a US city, and we were discussing his use of Amazon’s cloud data warehouse, Redshift. Curious, I inquired which technical capability had attracted him. It wasn’t a technical reason: it turned out he could unblock a project he had by using cloud services, rather than wait three months for a cumbersome purchase process from his existing database company.

And it’s not just the ability to use cloud services that affects purchase either: most big data platforms are open source. This means you can get on immediately with prototyping and implementation, and make purchase decisions further down the line when you’re ready for production support.

Architectural agility

Schema on read. Hadoop turned the traditional way of using analytic databases on its head. When compute and storage are at a premium, the traditional Extract-Transform-Load way of importing data made sense. You optimized the data for its application—applied a schema—before importing it. The downside there is that you are stuck with those schema decisions, which are expensive to change.

The plentiful compute and storage characteristics of scale-out big data technology changed the game. Now you can pursue Extract-Load-Transform strategies, sometimes called “schema on read.” In other words, store data in its raw form, and optimize it for use just before it’s needed. This means that you’re not stuck with one set of schema decisions forever, and it’s easier to serve multiple applications with the same data set. It enables a more agile approach, where data can be refined iteratively for the purpose at hand.
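The schema-on-read idea can be sketched in a few lines of Python: raw records are stored untouched, and each application applies its own schema at read time. The field names and record shapes below are invented for illustration:

```python
# Schema on read: store raw data as-is; each consumer imposes its own schema.
import json

raw_store = [
    json.dumps({"ts": "2016-12-01T10:00:00", "user": "a1",
                "amount": "12.50", "extra": {"channel": "web"}}),
    json.dumps({"ts": "2016-12-01T10:05:00", "user": "b2", "amount": "7.00"}),
]

def read_for_billing(raw):
    """Billing view: needs only the user and a numeric amount."""
    rec = json.loads(raw)
    return {"user": rec["user"], "amount": float(rec["amount"])}

def read_for_channels(raw):
    """Marketing view: same raw data, different schema, tolerant of gaps."""
    rec = json.loads(raw)
    return {"user": rec["user"],
            "channel": rec.get("extra", {}).get("channel", "unknown")}

billing = [read_for_billing(r) for r in raw_store]     # one schema...
channels = [read_for_channels(r) for r in raw_store]   # ...another, same data
```

Because the raw store is never reshaped for any one consumer, adding a third view later requires no migration, only a new read function.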

Rapid deployment. The emergence of highly distributed commodity software has also necessitated the creation of tools to deploy software to many nodes at once. Pioneered by early web companies such as Flickr, the DevOps movement has ensured that we have technologies to safely bring new versions of software into service many times a day, should we wish. No longer do we have to make a bet on three months into the future with software releases, but new models and ways of processing data can be introduced—and backed out—in a very flexible manner.

Faithful development environments. One vexing aspect of development, exacerbated by deploying to large server clusters, is the disparity between the production environment software runs in and the environment in which a developer or data scientist creates it. It’s a source of continuing deployment risk. Advances in container and virtualization technologies mean that it’s now much easier for developers to use a faithful copy of the production environment, reducing bugs and deployment friction. Additionally, technologies such as notebooks make it easier for data scientists to operate on an entire data set, rather than just a subset that will fit on their laptop.

Developer agility

Fun. Human factors matter a lot. Arcane or cumbersome programming models take the fun out of developing. Who enjoys SQL queries that extend over 300 lines? Or waiting an hour for a computation to return? One of the key advantages of the Spark analytics project is that it is an enjoyable environment to use. Its predecessor, Hadoop’s MapReduce, was a lot more tedious to use, despite the advances it brought. The best developers gravitate to the best tools.

Concision. As big data technologies advance, the amount of code required to implement an algorithm has shrunk. Early big data programs needed a lot of boilerplate code, and their structures obscured the key transformations that the program implemented. Concise programming environments mean code is faster to write, easier to reason about, and easier to collaborate over.

Easier to test. When code moves from targeting a single machine to a scaled computing environment, testing becomes difficult. The focus on testing that has come from the last decade of agile software engineering is now catching up to big data, and Spark in particular incorporates testing capabilities. Better testing is vital as data science finds its way as part of production systems, not just as standalone analyses. Tests enable developers to move with the confidence that changes aren’t breaking things.
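The testing point can be sketched in plain Python: keep the data transformation a pure function, so it can be unit-tested on a tiny sample before it runs at cluster scale. The function, field names and sample data below are illustrative, not from any of the projects mentioned:

```python
# A pure transformation function is trivially testable on a small sample,
# independent of the cluster it will eventually run on.
def normalise_amounts(records):
    """Drop malformed rows and coerce amounts to floats."""
    out = []
    for rec in records:
        try:
            out.append({"user": rec["user"], "amount": float(rec["amount"])})
        except (KeyError, ValueError):
            continue  # malformed row: skip rather than crash the whole job
    return out

sample = [
    {"user": "a1", "amount": "10.0"},
    {"user": "b2", "amount": "oops"},   # malformed: non-numeric amount
    {"amount": "3.0"},                  # malformed: missing user
]
cleaned = normalise_amounts(sample)
assert [r["user"] for r in cleaned] == ["a1"]
assert cleaned[0]["amount"] == 10.0
```

With the logic isolated like this, a change to the transformation is verified in milliseconds by the assertions, giving the confidence to move quickly that the paragraph above describes.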


Any technology is only as good as the way in which you use it. Successfully adopting big data isn’t just about large volumes of data, but also about learning from its heritage—those companies which are themselves data-driven.

The legacy of big data technologies is an unprecedented business agility: for creating value with data, managing costs, lowering risk, and being able to move quickly into new opportunities.

Editor’s note: Our CTO John Akred will be at Strata + Hadoop World Singapore next week, talking about the business opportunities inherent in the latest technologies, such as Spark, Docker, and Jupyter Notebooks.

The post Big Data is About Agility appeared first on Silicon Valley Data Science.

Data Digest

What the C-Suite Thought Leaders in Big Data & Analytics are saying about Corinium [VIDEO]

Corinium events are different. Our extensive experience has led us to develop a new format that gives you the chance to not only hear from the leading minds in the industry, but to contribute to the critical thinking being shared.

"We all learn differently...and this format [discussion groups] let's us learn by participating. And that is the thing we will remember.  It's personal.”

We connect C-Suite executives in the Data, Analytics and Digital space and focus them into Discussion Groups where genuine progress can be made. Each group is dedicated to the topic of greatest importance to you and includes the most knowledgeable people on the subject.

After our expert Co-Chairs initiate the discussion, you are invited to offer your own experiences or questions, sparking a free-flowing and open conversation in which over 65% of the participants typically contribute.

CEP America, Georgia Tech & McGraw Hill Education explain how Corinium Discussion Groups give senior-level executives a rare opportunity to engage with one another in such a personal format. Barriers are discussed, decisions are made on how to move forward and, more importantly, challenges are solved together.

SAP, PwC, TimeXtender & Caserta Concepts are discussion group facilitators. PwC and SAP lead discussion groups all around the world at Corinium C-Suite events, and continue to do so because of the value of these interactions.

What our participants had to say:

We all learn differently...and this format [discussion groups] let's us learn by participating. And that is the thing we will remember.  It's personal.”
Dipti Patel, Chief Data and Analytics Officer, CEP America

'It really takes the coming together of people like these [attendees of the Chief Analytics Officer Fall] to tell us: this is the problem we need to come together to solve.'
John Sullivan, Director of Innovation, North America Center of Excellence, SAP

'I think the interactions here are much better than at the other conferences I have attended.'
Oliver Halter, Partner, PwC

'The people at the Corinium events are the decision-makers and are the people that push the industry forward. If that benefits the industry, great. If that benefits us too, also great.'
Romi Mahajan, Chief Commercial Officer, TimeXtender

To learn more, visit    
Data Digest

CCO Forum: What Customer Experience Leaders were talking about

What an amazing event these last two days at the Chief Customer Officer Sydney. I feel very privileged to have spent so much time speaking with some of the best minds in Customer Experience.

I noticed during my conversations that there are some common themes for what is expected for the future, what is working in the present and what is holding back progress.

1. Lead the pack

Mark Reinke of Suncorp set up his presentation by talking about the extinction of the mammoth: as a child you are taught about this and assume it was quite a sudden extinction. In reality, it happened very slowly, as the mammoth did not adapt to the changing environment.

There seemed to be some shared sentiment that it was not only important to adapt to avoid extinction, but that the first-mover advantage was significant when it comes to customer experience opportunities.

There seemed to be some shared sentiment that it was not only important to adapt to avoid extinction, but that the first-mover advantage was significant when it comes to customer experience opportunities.

2. Alignment with IT is a priority

The jealousy was almost unanimous upon finding out that Richard Burns of Aussie leads both the CX and IT teams, helping to increase the velocity of CX change projects. This brought into view an issue many at the forum were struggling with: IT alignment.

I heard in many conversations that avoiding the delays caused when enterprise infrastructure cannot support the next phase of CX transformation was top of the agenda.

To many, the solution is ensuring that IT teams are included in forward planning and that contingencies are in place to resource projects that take advantage of a "moment in time opportunity" as one attendee put it, without having to compromise on projects essential to the "Roadmap".

Whilst there are certainly some strategies to solve this problem, internal agility and alignment continue to be a challenge for most companies.

3. Agility and expertise of partners/vendors

When it comes to transformation and digitisation, many companies do not have the skills in-house to be able to ensure the success of an implementation. There is an expectation that the Vendor they end up choosing will bring this to the table along with the product. The feeling was that the initial success and speed of delivery came down to choosing a partner that was able to provide this at a high level.

Whilst there were certainly examples where this was not the case, I heard some amazing feedback on this front about some of the Vendors in the room.

As well as the expertise, the ability to change course and assist quickly with one of those "moment in time" opportunities was what created the stickiness in some of their closest relationships.

4. Using Data to inform transformation and product development opportunities

This was perhaps the most interesting theme to me as a self-professed data enthusiast: while many companies are still struggling to harness their own internal data to understand customer behaviour, many others are looking to use both internal and external data sources to understand what their customers will need predictively rather than historically.

The ability to use these insights to create better products and provide better service seems to be the goal for many.

In many cases, however, attendees suggested that (back to point 2) their IT infrastructure was not ready to support this and would be a few years away, unless they could find a cost-effective solution to skip a couple of steps.

These certainly were not all of the themes, but a good representation of the conversations I had across the two days.

Thank you to everyone who shared so willingly. It was an amazing learning experience and I can't wait to do it all again in Melbourne.

Thank you especially to our wonderful Sponsors and Presenters who supported the event and the attendees with their insights, knowledge and solutions. The feedback on how you assisted and educated across the event was spectacular.

By Ben Shipley: 

Ben Shipley is the Partnerships Director at Corinium Global Intelligence. You may reach him at   
Teradata ANZ

Turbo Charge Enterprise Analytics with Big Data

We have been showing off the amazing artworks drawn from the numerous big data insight engagements we’ve had with Teradata, Aster and Hadoop clients. Most of these were new insights that answered business questions never before considered.

While these engagements have demonstrated the power of the new analytics enabled by big data, they have had limited penetration into the wider enterprise analytics community. I have observed significant investment in big data training and the hiring of fresh data science talent, but the new analytics remain a boutique capability, not yet leveraged across the enterprise.

Perhaps we need to take a different tack. Instead of changing the analytical culture to embrace big data, why not embed big data into existing analytical processes? Change the culture from within.

How exactly do you do that? Focus big data analytics on adding new data points, then make those data points available through the enterprise's current data deployment and access processes. Feed the existing machinery with the new data points to turbo-charge the existing analytical insight and execution processes.

A good starting area is the organisation’s customer events library. This library is a database of customer behaviour-change indicators that trigger action and provide context for a marketing intervention. For banks, this would be significant deposit events (e.g. three standard deviations above the last five months' average deposit); for telcos, significant dropped calls. Most organisations have a version of this in place, with dozens of pre-defined intervention data points alongside customer demographics. These data points support over 80% of the actionable analytics currently performed to drive product development and customer marketing interventions.
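
A rule like the deposit trigger mentioned above is simple to express in code. The sketch below is purely illustrative (the function name, threshold and deposit figures are invented for this example, not taken from any client system), using only the Python standard library:

```python
# Sketch of one "customer event" rule from an events library: flag a
# deposit as significant when it sits more than three standard
# deviations above the customer's recent average deposit.
from statistics import mean, stdev

def significant_deposit(history, new_deposit, n_sigmas=3.0):
    """Return True if new_deposit exceeds the historical mean by
    more than n_sigmas standard deviations."""
    if len(history) < 2:
        return False  # not enough history to estimate variability
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return new_deposit > mu
    return (new_deposit - mu) / sigma > n_sigmas

# Five months of a customer's deposits, then an unusually large one.
recent = [2100.0, 1950.0, 2200.0, 2050.0, 2000.0]
print(significant_deposit(recent, 2150.0))   # an ordinary deposit
print(significant_deposit(recent, 12000.0))  # a trigger for intervention
```

In practice such a rule would run over the events library's own tables, but the shape of the computation is the same: a per-customer baseline, a dispersion estimate, and a threshold that turns a raw transaction into an actionable event.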

What new data points can be added? Life events that provide context to the customer’s product behaviour remain a significant blind spot for most organisations, e.g. closing a home loan due to divorce, or refinancing to a bigger house because of a new baby. The Australian Institute of Family Studies has identified a number of these life events.

Big data analytics applied to combined traditional, digital and social data sources can produce customer profile scores that become data points for the analytical community to consume. The scores can be recalculated periodically, and the changes become events themselves. With these initiatives, you have embedded big data into your existing enterprise analytical processes and moved closer to the deeper understanding needed for pro-active customer experience management.

We have had success with our clients in building some of these data points. Are you interested?

The post Turbo Charge Enterprise Analytics with Big Data appeared first on International Blog.


November 30, 2016

Data Digest

How to Put Data at the Heart of Everything We Do

Ahead of Corinium’s CDAO Sydney event, we spoke with keynote speaker Darren Abbruzzese, General Manager, Technology Data at ANZ, to find out more about the bank's group data transformation program. We also gauged his views on data as a tool for competitive advantage, its importance in the financial services industry, and the role of the Chief Data Officer, including its longevity and evolution.

Meet Darren and our amazing line up of over 60 speakers at the Chief Data and Analytics Officer Forum, Sydney taking place on the 6-8 March, 2017.

For more information visit: 

Corinium: You will be speaking on “data at the heart of everything we do” – how have ANZ’s data needs changed in the last 18 months?

Darren Abbruzzese: Providing customers with a fantastic experience is absolutely critical in developing and maintaining deep and engaging relationships. The minimum expectation around what is a ‘great customer experience’ is being continually lifted by interactions our customers have every day with companies like Uber, Facebook, Google and the like. These companies are really raising the bar on what a great digital experience is all about. Customers then come to their bank and expect a similar level of experience. We are embracing that challenge and working to deliver a standout customer experience, both digital and through our branches. We can only achieve that if we make the best use of our data. A great customer experience, one that is tailored to the individual and their specific needs, can only be successful if we use the data we share between us and the customer to its full extent.  This helps us create a unique and engaging experience for our customers, and it will lead to a more engaging relationship than they’ve had in the past.

The minimum expectation around what is a ‘great customer experience’ is being continually lifted by interactions our customers have every day with companies like Uber, Facebook, Google and the like.

Corinium: Digital channels must play a huge part in that. What are your top technology investments in the coming year?

Darren Abbruzzese: Big data and fast data are major priorities. The amount of data we produce as a bank is exploding and we need to ensure we’ve got the tools to harness and make use of that data, which is where our Big Data capability comes in. But having a scaled infrastructure and Hadoop capability is not in itself enough. As a customer, I want real-time information and I want it relevant for my specific interaction. This is where fast data comes in. Moving away from earlier batch-based patterns and towards real-time capture and exposure of data into our internal and customer-facing channels is a key pillar to our strategy of developing a digital bank.

Corinium: How will ANZ’s group data transformation program cater to the diverse enterprise needs?

Darren Abbruzzese: As a large organisation, servicing millions of customers across a multitude of countries and segments we certainly have a diverse set of needs when it comes to data. Trying to deliver to the needs of the organisation via individual solutions won’t get us very far. Instead, our approach is to shape our delivery around solving common bank-wide problems, and being lean in our approach so we can learn, react and move at pace. Some of those common problems are things like data sourcing and collecting millions of data points from hundreds of platforms on a daily basis and consolidating that into joined up, usable models. If we solve that common problem we will make consumption of data via reporting easier and faster. Having a view about a medium-term architecture is also critical. While the technology in the data landscape is moving fast, the capability we need to help deliver our strategy is clear. Building common assets in line with that architecture will help us move at pace and solve for individual business or project needs.

Corinium: What are the leadership challenges educating and synchronising a global data team?

Darren Abbruzzese: It really comes down to being clear on our purpose and ensuring our staff understands what we are trying to achieve and why. If everyone is clear on what will make us successful, and how as a team we will add value to the organisation, then day-to-day decision making will be faster and more aligned to strategy. That’s easier said than done though and requires a lot of work.  Putting our purpose in a PowerPoint pack and emailing it around isn’t going to cut it.  We need to continually reinforce our purpose through concrete measures such as our organisational structure, operating model, architecture, delivery processes and KPIs. We can talk about purpose, but if we embed it in the way we work then it will become part of the culture. 

Corinium: What’s your opinion on the view that the CDO role is just a flash in the pan job title that will eventually become merged or lost amongst the web of new C-suite titles?

Darren Abbruzzese: Banks traditionally have seen their deposits and loans as strategic assets, and also their customers, staff and technology.  In each of these cases there’s been solid management structures established in recognition of the importance of these assets to the future of the organisation.  So we have a bunch of C-suite roles to lead these functions. Data has emerged as a core, strategic asset of any organisation that needs to be curated, managed, protected and leveraged just like any other strategic asset. Data needs a clear organisational strategy – what will we use data for and how will it help make us successful? It needs to be managed and protected. Whether it needs a Chief Data Officer really depends on each organisation and how they operate. That might be a role absolutely critical in raising the profile and importance of data, or it might be something the CEO themselves will define and drive. In other cases, generally where data is already more maturely managed, it has already been well-embedded into the organisation on many levels.

Data has emerged as a core, strategic asset of any organisation that needs to be curated, managed, protected and leveraged just like any other strategic asset. 

Corinium: What do you consider to be the key building blocks to establishing an effective data governance framework?

Darren Abbruzzese: Two key pillars: the first is the age-old problem of “garbage in / garbage out”. Start with the key data elements for your organisation and put in place processes to govern the collection, checking and cleaning of data right from the source and throughout its lifecycle. Someone needs to be accountable for this process or it won’t happen, or it will happen for a few months and then fall away. The second pillar is ensuring you have really strong information security and user access management in place. Nothing will destroy the credibility of your data program, and perhaps even your organisation itself, more than suffering a data breach via internal or external means. It's an important asset, so protect it.

Corinium: What do you believe to be the most common form of ‘bad data’ and what effect can that have on an organization?

Darren Abbruzzese: Poor data quality management can have a really detrimental impact. If poor data capture and management processes allow inaccurate, or wrong, or misleading data to pollute your key information assets then all the investments you’ve made into building a data capability will be for nothing. Your reporting won’t be trusted and you’ll spend countless hours trying to explain it. Your analytical efforts will return misleading signals potentially leading to sub-optimal or downright disastrous decisions. Your data program will lose credibility – as will you – as you’ll be left to explain the bad outcomes, even though you may not have controlled the input. So clearly, data quality right from the start is really important.  Again, it comes down to data being a core asset of high value and needs to be treated as such.   

Corinium: The financial services industry understands the value and power of data, why do you think that is? How do you see that developing in the next 3-5 years, with particular reference to the use of analytics?

Darren Abbruzzese: Banks have appreciated the value of data and what that means for their businesses for a long time. It's only in the last few years, as the tools and capability have started to mature, that banks have begun to make better use of their data and do it at scale. I believe we are only at the start of this journey. Banks have been pretty good at developing digital channels for their customers, and these are the predominant ways that customers now interact with their bank for simple transactions. But looking forward, it is blending data into all of our channels that will drive the next great leap forward. Using analytics to really understand the needs of the individual customer, recognising what they need and are likely to need in the future, and building that into their mobile and desktop interface in an engaging way is where banks will go next, and it’s pretty exciting.

Join Darren and 200 attendees at the Chief Data and Analytics Officer Forum, Sydney taking place on 6-8 March 2017.

For more information visit:   


November 29, 2016

Revolution Analytics

Free online course: Analyzing big data with Microsoft R Server

If you're already familiar with R, but struggling with out-of-memory or performance problems when attempting to analyze large data sets, you might want to check out this new EdX course, Analyzing Big...

Big Data University

This Week in Data Science (November 29, 2016)

Here’s this week’s news in Data Science and Big Data. Machine Learning

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

  • Data Science Bootcamp – This is a beginner-friendly, hands-on bootcamp, where you will learn the fundamentals of data science from IBM Data Scientists Saeed Aghabozorgi, PhD and Polong Lin.

The post This Week in Data Science (November 29, 2016) appeared first on Big Data University.

Silicon Valley Data Science

When Decisions Are Driven by More Than Data

I have been thinking a lot lately about storytelling, especially in the wake of the election and the ensuing discussion of the role played by fake news sites, echo chambers, and filter bubbles. Stories can lead us astray. They can reinforce what we want to hear rather than tell us the truth. And many times, two different sets of people will listen to the same story and then leave with completely different takeaways.

Another thing I think a lot about is data visualization—which, as we know it, is barely a couple of centuries old: it arguably started with William Playfair in the 1780s. But data visualization (and modern data science) are really just another phase in a tradition of human storytelling that goes all the way back to cave paintings.

We have always used stories to help each other make better decisions. Look at the parables in religious texts or the folk tales we tell to children. We use data in the same way, to help drive better decision-making.

The thing about stories is: they’re about people. They’re about explaining the things that happen to people, and they’re tools that we use to spread the history and shared values of groups of people (which is why they become so powerful in partisan election scenarios, for example).

In other words, stories are all about relationships, and so are data. It’s not the data points, or the nodes, that matter—it’s the edges between them that paint the trend lines and allow us to say something about our future. Or to put it more poetically, as Rebecca Solnit does:

The stars we are given. The constellations we make. That is to say, stars exist in the cosmos, but constellations are the imaginary lines we draw between them, the readings we give the sky, the stories we tell.

Why visual storytelling matters

A quartet of graphs, each with the same diagonal blue line showing identical statistical averages. Orange circles show that the data points on each graph actually form very different patterns: one forms an arc, one a mostly diagonal line with one extreme outlier, one a mostly vertical line with an even more extreme outlier, and one a loosely diagonal line.

Graphs by Wikipedia user Schutz, CC BY-SA 3.0

You may already be familiar with Anscombe’s Quartet. It’s four sets of data points (x-y pairs), and all four sets have the same statistical properties. So if you’re only thinking about them mathematically, they appear to be identical in nature. But as soon as you graph them, you see immediately that they’re very different, and you can detect with your eye a relationship in each dataset that you wouldn’t see if you were looking at them with math alone.
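
The "identical statistical properties" claim is easy to verify directly. Using Anscombe's published values and only the Python standard library, each of the four datasets has the same x-mean, y-mean and Pearson correlation to two decimal places:

```python
# Anscombe's quartet: four x-y datasets with matching summary statistics
# but radically different shapes when plotted.
from statistics import mean

x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]   # shared by sets 1-3
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

def pearson(xs, ys):
    """Pearson correlation coefficient, from first principles."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

for xs, ys in [(x123, y1), (x123, y2), (x123, y3), (x4, y4)]:
    # Each line prints the same summary: mean x = 9, mean y = 7.5, r = 0.82
    print(round(mean(xs), 2), round(mean(ys), 2), round(pearson(xs, ys), 2))
```

Only a plot (or the eye) separates the four: the arithmetic alone cannot, which is exactly the quartet's point.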

Storytelling matters because it’s innate in us as human beings, and it’s how we learn about the world and think about the decisions we should make as people living in relationship to that world. Visual storytelling matters because it reveals relationships that we might not be able to understand any other way.

A black background with bluish lines tracing the routes of commercial airline traffic: there are so many routes that the shape of the USA is clearly visible. Major cities appear as bright spots of connecting lines.

An image from Aaron Koblin’s “Flight Patterns” project.

There’s another thing that happens when you begin to make stories visible—to draw constellations—which is that meta patterns and structures emerge. Take, for example, Aaron Koblin’s visualization of North American air traffic, “Flight Patterns.” Making those routes visible and layering them together in a composite reveals not only the locations of major cities, but also the shape of an entire continent, without the need for any geographic map as a base layer. The relationships between the data points end up describing more than just themselves, in part because of where the data isn’t: the emptiness describes the oceans.

Why you as a storyteller matter

We all tend to think of data as infallible, as black-and-white, but once you understand that it’s not the data you’re presenting, it’s the relationships among the data, then you can see that you as the designer and storyteller are bringing something important to the process.

You must be very deliberate about what of yourself you put into your visualizations. You are naturally going to have an opinion, and that will likely inform the story you tell. Stories are powerful things. And it’s not unusual for people to become so attached to a particular story that they insist on drawing their constellations in ways that ignore the position of the stars. Just think of the historical inaccuracies in almost every Hollywood war film ever made, especially Mel Gibson’s.

Steve Jobs standing on stage to the left of a giant screen, which shows a colorful 3D pie chart. The pie chart has a green slice center front, which is labeled 19.5%. It appears to be much larger than another, purple slice in center-back, which is labeled 21.2%.

3D pie charts use foreshortening to create the illusion of three dimensions, but that same effect also distorts the data.

Data, like stories, can also have the “ending changed,” so to speak. If I showed you this image from a keynote Steve Jobs gave at MacWorld in 2008 and asked you, without reading the numbers, which section is bigger, green or purple, you would surely answer “green.” But of course, once you do read the numbers, you can see that the visualization is misleading. From a storytelling perspective, this is the same thing as changing the outcome of a battle because you felt like it.

You, as the data scientist or the data visualization designer, have an incredible impact on how the story gets told, and therefore on how decisions get made. Draw your constellations carefully, and use well the power to expose relationships and meta structures. Our ability to make good decisions—and even our futures—depend on it.

Editor’s note: Julie will be presenting her “From Data to Decisions” tutorial at TDWI Austin next week, which will elaborate on how designers can influence decision-making with data visualizations. Find more information, and sign up for her slides, here

The post When Decisions Are Driven by More Than Data appeared first on Silicon Valley Data Science.

Revolution Analytics

Microsoft R Open 3.3.2 now available

Microsoft R Open 3.3.2, Microsoft's enhanced distribution of open source R, is now available for download for Windows, Mac, and Linux. This update upgrades the R language engine to version 3.3.2,...

The Data Lab

Online Learning Funding Call

Online Learning Call

We know that Scotland needs more people with the skills to manipulate, organise and analyse an ever increasing amount and variety of data. This demand is not confined to traditional computing industries. Data skills are required in agriculture, tourism, construction, energy and most industry sectors.

To achieve The Data Lab’s vision to showcase Scotland as an international leader in Data Science, and to train a new generation of data scientists, we are looking to fund the development of online courses which contribute to the requirement for more flexible high-quality data science training and education. These may take the form of MOOCs or closed application online courses.

The Data Lab will provide between £30,000 and £50,000 for the development of up to three online courses for this call. These must be delivered fully online, which means there should be no mandatory in-person requirement for the learner.  You can select one of the following three course types:

  1. MOOC (Massive Open Online Courses) – Typically delivered for free using a pre-existing popular MOOC platform (up to £30k)
  2. Online Training – Paid for learning content that is focused on equipping Scottish Industry with the data science training they need (up to £30k)
  3. MOOC and Online Training – A combination of 1 and 2. This could be a MOOC with additional paid-for advanced learning content that meets the data science training needs of Scottish Industry. Here is an example. (Up to £50k)

If you are interested, please complete and submit our online Expression of Interest form. You must submit your completed application form by 17:00 on Monday 13th February 2017.

Before submitting your Expression of Interest please read our application guidance.

Further online guidance on how to build a MOOC can be found through many different sources. Here are some examples that might be useful from The University of Glasgow, The University of Edinburgh and



November 28, 2016

The Data Lab

The Data Lab Hosts 2nd Executive Away Day

Executive Away Day November 2016

Over dinner, we heard some fascinating stories about people’s personal experiences and practical advice on driving value from data. Hannah Fry was our captivating after dinner speaker who got us all thinking about how data affects us every day. The conversation went on long into the night and it was an excellent way to warm our brains up for the next day.

The Away Day presenters continued the trend set by Hannah with a series of enlightening and entertaining insights. Mark Priestly, F1 commentator and guru, shared the journey he went on with McLaren as they embedded data scientists into the team and how that changed the way decisions were made. I can’t do justice to the way he presented the material, so clear your diaries for DataFest next March, when both Hannah and Mark will be back in Scotland.

Our next two speakers were Steve Coates from Brainnwave and Inez Hogarth from brightsolid. Steve and Inez shared their experiences, in a start-up and in an established company respectively, of the value that data scientists can bring to organisations. It was great to see the passion they had for investing in people and setting them up for success working in partnership with businesses. I highly recommend getting along to any event where you see Inez or Steve speaking so that you can hear their experiences first-hand and, more importantly, get a chance to engage them in conversation.

Our last two speakers were two people I have had the privilege to work with in previous roles. Chris Martin from Waracle and Callum Morton from NCR both spoke about the role data plays in their companies. Chris is a fascinating speaker and raised many superb points including the fact that data is not a silver bullet, it is just one of the many things you need to master to create an environment for customer innovation. Callum brought to life the sweet spot between a compelling strategy and rapid execution. It was great to listen to Callum share his experiences and his views on the future. Especially powerful was the fact that only 9 months ago Callum was sitting in the audience at our last Executive Away Day, and now here he was leading the conversation based upon real world experiences.

I also gave a short talk on Data as a strategic differentiator as part of opening the second day. I was delighted and relieved to hear some of the points I raised brought to life by our excellent speakers. But best of all for me was the feedback we got on the day from attendees who found it engaging, practical and enjoyable. Most people I spoke to said that the speakers had stimulated their thought processes and that they were returning to work enthused, armed with practical advice and with a much richer network. As Gillian our CEO said, the attendees are all now part of The Data Lab gang and we will be there to support them as they continue on their own data journeys. 

Find out about our next Executive Education course: "Big Data Demystified", delivered in Partnership with the Institute of Directors (IoD) on 7 February 2017.


Jean Francois Puget

What’s in CPLEX Optimization Studio 12.7?

Here is a guest post by my colleague Paul Shaw on the latest release of CPLEX Optimization Studio. That release generated some buzz at the latest INFORMS conference because of its support for Benders decomposition. However, Benders decomposition isn't the only novelty in this release, as Paul explains. Paul's post originally appeared here. This release is also available to all academics and students for free, as my other colleague Xavier Nodet explains here. Xavier Nodet has also posted a detailed presentation of CPLEX Optimization Studio 12.7 on SlideShare.

CPLEX Optimization Studio 12.7 is here! It was officially announced on Tuesday 8th November on the IBM website and will be available from November 11th. I’ll go over the main features here.

First of all, performance has been improved under default settings in both CPLEX and CP Optimizer. For CPLEX, the most significant gains can be seen for MIP, MIQP, MIQCP and nonconvex QP and MIQP models. For CP Optimizer, the main performance improvement is for scheduling models, but combinatorial integer models should see some improvement too.

Benders’ decomposition
For certain MIPs with a decomposable structure, CPLEX can apply a Benders’ decomposition technique which may be able to solve the problem significantly faster than with regular Branch-and-Cut. This new Benders’ algorithm has a number of levels of control indicated through the new “benders strategy” parameter. This parameter specifies how the problem should be split up into a master problem and a number of subproblems. The levels work as follows:

  • User level: This level gives you full control over the specification of the master problem and the subproblems. To do this, you need to annotate your model. Annotations are a new concept added in CPLEX 12.7 to associate values with CPLEX objects like objectives, variables, and constraints. Models can be annotated through the APIs or specified in annotation files. To specify the master and subproblems, you give an annotation to each variable. A variable with annotation 0 belongs to the master, and one with annotation k ≥ 1 belongs to subproblem k.
  • Workers level: Here, CPLEX will take the given annotation and try to further break up the subproblems into independent parts. In particular, this level lets you think about the separation of variables into “in the master” and “in a subproblem” without having to worry about how the subproblems are set out. For example, you can annotate either the variables in the master or those in the subproblems as you desire. CPLEX will then automatically break up the subproblem variables into independent subproblems if possible.
  • Full level: The fully automatic level has the following behavior. First, CPLEX will assume that all the integer variables of the problem will go into the master, with all the continuous variables being placed in the subproblems. Then, as for the Workers level, CPLEX will attempt to refine the subproblem decomposition by breaking it into independent parts if possible.

By default, “benders strategy” uses an automatic level which behaves as Workers if a decomposition is specified and runs regular branch-and-cut if no decomposition is specified.

Modeling assistance
With 12.7, CPLEX can be asked to issue warnings about modeling constructs which, although valid, may contribute to performance degradation or numerical stability. To turn on these warnings, the “read datacheck” parameter should be set to the value 2. Here is an example of the type of warning that CPLEX can issue:

CPLEX Warning 1042: Detected a variable bound constraint with large coefficients. Constraint c8101, links binary variable x934 with variable x2642 and the ratio between the two is 1e+06. Consider turning constraint into an indicator for better performance and numerical stability.
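
To make the suggestion in the warning concrete, here is a rough sketch in CPLEX LP-file notation of the two formulations (the constraint and variable names are taken from the warning message above; the 1e+06 bound itself is illustrative):

```
\ Big-M form: x2642 can be positive only when binary x934 = 1,
\ but the 1e+06 coefficient ratio can hurt numerical stability.
c8101: x2642 - 1000000 x934 <= 0

\ Indicator form: the same logic with no large coefficient.
c8101: x934 = 0 -> x2642 <= 0
```

The indicator version lets the solver enforce the implication logically rather than through a huge coefficient, which is what the warning is nudging you towards.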

Interactive improvements
CP Optimizer now includes an interactive interface similar to that of CPLEX. You can load and save models in CPO format, change parameters, run propagation, solve, refine conflicts and so on. You can type “help” at the prompt to get information on the facilities available.

Evaluating variability
Both the CPLEX and CP Optimizer interactive shells now provide a way of easily examining the performance and variability of the solver on a particular model instance. In both interactive shells, the command tools runseeds [n] will run an instance n times (default is 30) with the current parameter settings, varying only the random seed between runs. Information on each run is displayed. For example, here is the output of tools runseeds on a CPLEX model where the time limit has already been set to 8 seconds.

====== runseeds statistics of 30 runs

    exit  sol    objective     gap  iteration      node   runtime   dettime
run code stat        value     (%)      count     count   seconds     ticks
  1    0  108          ---     ---     419468     39808      8.00   4195.20
  2    0  108          ---     ---     514998     49440      8.00   4369.46
  3    0  101            0    0.00      69242      8453      1.55    981.88
  4    0  108          ---     ---     518059     37514      8.00   4460.28
  5    0  101            0    0.00     123420     16923      4.52   3692.25
  6    0  108          ---     ---     511910     46768      8.01   4550.99
  7    0  108          ---     ---     459329     41168      8.00   4260.86
  8    0  108          ---     ---     431280     32714      8.01   4215.08
  9    0  108          ---     ---     379432     36441      8.00   3992.18
 10    0  108          ---     ---     377233     27091      8.01   3710.86
 11    0  108          ---     ---     333495     20436      8.01   3620.59
 12    0  101            0    0.00      21160      2993      0.44    225.97
 13    0  101            0    0.00     124943     14762      4.55   3714.96
 14    0  101            0    0.00     113538     12581      4.43   3641.18
 15    0  108          ---     ---     549655     46617      8.00   4606.69
 16    0  108          ---     ---     447175     26129      8.00   4007.73
 17    0  101            0    0.00      38622      5525      1.42    843.57
 18    0  108          ---     ---     413188     43561      8.00   4158.66
 19    0  101            0    0.00     490580     41267      7.83   4324.97
 20    0  108          ---     ---     499872     38093      8.00   4394.23
 21    0  101            0    0.00        292         0      0.14     86.63
 22    0  108          ---     ---     450731     47616      8.00   4373.14
 23    0  101            0    0.00     101645      9091      1.57    982.50
 24    0  108          ---     ---     520977     50566      8.00   4412.67
 25    0  108          ---     ---     501371     45112      8.00   4371.53
 26    0  108          ---     ---     496808     39674      8.00   4352.48
 27    0  108          ---     ---     414917     44412      8.00   4345.97
 28    0  101            0    0.00     427330     31227      6.97   4120.19
 29    0  108          ---     ---     481120     44447      8.00   4547.70
 30    0  101            0    0.00     465275     40579      7.69   4222.06

Exit codes:
      0 : No error

Optimization status codes:
                 objective     gap  iteration      node   runtime   dettime
                     value     (%)      count     count   seconds     ticks
    101 : integer optimal solution (11 times)
     average:            0    0.00     179641     16673      3.74   2439.65
     std dev:            0    0.00     185898     14571      2.89   1772.79
    108 : time limit exceeded, no integer solution (19 times)
     average:         ---      ---     459001     39874      8.00   4260.33
     std dev:         ---      ---      58792      8317      0.00    267.13

This is an instance where either a solution is found and proved optimal, or none is found. Looking at the breakdown by status code shows that the optimal is found and proved in 11 out of the 30 runs and a timeout happens without a solution being found on the remainder of the runs. When the optimal is proved, it happens in 3.74 seconds on average.
Conflict refinement and tuning tools have also been moved to the “tools” sub-menu.

CP Optimizer warm start
CP Optimizer warm starts can now be represented in the CPO file format in the “startingPoint” section. This makes it possible to store starting points portably and persistently, and to share them with others or with IBM support. Moreover, this means you can solve problems with a warm start using IBM Decision Optimization on Cloud.

Here is an example of how to specify a starting point:

x = intVar(1..10);
y = intVar(1..10);
x + y < 12;
itv = intervalVar(optional, end=0..100, size=1..5);
startOf(itv, 10) == x;
endOf(itv, 1) == y;

startingPoint {
  x = 3;
  itv = intervalVar(present, size=4, start=7);
}

When you export a model on which a starting point has been set with the APIs, a "startingPoint" section is automatically generated containing the starting point.

Additionally, CP Optimizer's warning message mechanism has been evolved to include additional information on starting points, particularly when the starting point contains inconsistent information (such as values which are not possible for particular variables due to domain restrictions).
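The kind of inconsistency being flagged can be illustrated with a tiny plain-Python sketch (hypothetical helper, not CP Optimizer's API) that checks starting-point values against variable domains:

```python
def check_starting_point(domains, start):
    """domains maps variable -> (lo, hi); start maps variable -> value.
    Returns one warning string per value outside its variable's domain."""
    warnings = []
    for var, value in start.items():
        lo, hi = domains[var]
        if not lo <= value <= hi:
            warnings.append(
                "value %s for %s is outside [%s, %s]" % (value, var, lo, hi))
    return warnings
```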

Piecewise linear functions
The support for piecewise linear functions in CPLEX has been extended and is now available both in the C API and in the file formats. Here's an example of how to specify a piecewise linear function in the LP file format.

Subject To
  f: y = x 0.5 (20, 240) (40, 400) 2.0

Here, we specify a piecewise linear function f and the constraint y = f(x). The function f consists of three segments. For x < 20, the slope of the function is 0.5; then there is a segment between the two points (20, 240) and (40, 400); finally, for x > 40, the slope is 2.0.
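To check the reading of that definition, here is a small plain-Python sketch (not CPLEX) evaluating f:

```python
def f(x):
    """Piecewise linear: slope 0.5 up to the point (20, 240), a middle
    segment from (20, 240) to (40, 400), and slope 2.0 beyond (40, 400)."""
    if x <= 20:
        return 240 + 0.5 * (x - 20)
    if x <= 40:
        return 240 + (400 - 240) / (40 - 20) * (x - 20)  # slope 8 here
    return 400 + 2.0 * (x - 40)
```

In particular, f(20) = 240 and f(40) = 400, matching the two breakpoints.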

Additional Parameters
A new parameter has been introduced to control so-called RLT cuts, based on the Reformulation Linearization Technique. These cuts apply when the optimality target parameter is set to 3: that is, when solving a nonconvex QP or MIQP instance to global optimality. The parameter is named CPXPARAM_MIP_Cuts_RLT in the C API, and is accessible as set mip cuts rlt in the interactive optimizer. Possible values are -1 (off), 0 (let CPLEX decide, default), 1 (moderate), 2 (aggressive) and 3 (very aggressive).

A new effort level for MIP starts is available. This effort level CPX_MIPSTART_NOCHECK specifies that you want to inject a complete solution that you know is valid for the instance. CPLEX will skip the usual checks that are done, which can save time on some models. If the solution is not actually valid for the model, the behavior of CPLEX is undefined, and may lead to incorrect answers.

Please enjoy using and exploring the new features of CPLEX Studio 12.7!


What type of Machine Learning is right for my business?

Machine Learning is by no means a new thing. Back in 1959, Arthur Samuel’s self-training checkers algorithm had already reached “amateur status” – no mean feat for the time. This article is intended to shed some light on the two different types of Machine Learning that one can encounter, which may be useful if you are thinking of entering into this space and are unsure as to which avenue is appropriate for your business.


November 26, 2016

Simplified Analytics

Product recommendations in Digital Age

By 1994 the web had arrived, bringing the power of the online world to our doorsteps. Suddenly there was a way to buy things directly and efficiently online. Then came eBay and Amazon in...


November 25, 2016

Simplified Analytics

What is Deep Learning?

Remember how we started recognizing fruits, animals, cars and, for that matter, any other object by looking at them in our childhood? Our brains get trained over the years to recognize these...

Teradata ANZ

Is your business ready to learn from the record-breakers?

In sport, as in business, there is the constant interplay between marginal gains and game-changing innovations.

Take the 100m freestyle swim: records have been broken year on year, but every so often we see not just a record broken but an outstanding accomplishment like Albert Vande Weghe’s in 1934. In one stroke he changed the nature of competitive swimming with his underwater somersault ‘Flip Turn’. Further change followed in 1976 with the introduction of pool gutters at the Montreal Olympics, which captured excess water, resulting in less friction and faster times. The next big advance came in 2008, with the advent of low-friction swimwear that enabled athletes to move through the water with even greater speed.

The relevance of this story to business analytics is that just like athletes in training, data scientists make incremental improvements every day, and yet every so often comes one of those momentous, game-changing innovations.

Prescriptive analytics are the catalyst

As organisations increasingly seek to drive value from historical insights, they can start predicting the future, ensuring that positive predictions are fulfilled and that negative outcomes are avoided. This is how prescriptive analytics can influence the future.

Achieving this nonetheless requires a shift away from statistical and descriptive ways of looking at data, towards considering events and interactions. Applying contextual analytics to these events and interactions allows us to investigate modes of behaviour, intentions, situations, and influences.

Vast amounts of money are being spent by organisations on the creation of data ecosystems, enabling them to capture, store, and archive large volumes of data at an unprecedented scale, in a cost-efficient manner. They are responding to the headlines about Big Data, but unfortunately many end up with fragmented architectures and data silos that thwart their ability to interrogate data and create value.

Gartner predicts that by 2018, 70 per cent of Hadoop deployments will fail to meet cost savings and revenue-generation objectives due to skills and integration challenges.

Should data ecosystems be built?

The answer is firmly “yes”. The age of infrastructure opened the door to the use of analytics for extracting value from data. Essentially, the resulting insights make business decisions more accurate and intelligent and because of that, the focus has shifted. Now it is the business team and not the IT department that leads data and analytics initiatives, demanding more value from data-plus-insights capable of creating commercial opportunities and solving problems.

It must be recognised that the value of storing and organising data depends on what you do with it. Business teams want to ask questions that cross data silos; questions that account for customer, product, channel, and marketing in combination. This amounts to a fundamental realignment of priorities and means that in future, many of our data professionals will no longer be technical specialists. Instead, they will be business-focused individuals using data, analytics, and technology as key enablers.

The Olympic spirit – higher, stronger, faster

Unsurprisingly, the monetisation of data and analytics will be a big differentiator. Gartner’s strategic prediction states that by 2018, more than half of large, global organisations will compete using advanced analytics and proprietary algorithms, causing the disruption of entire industries.

Without an underlying strategic framework – the organisation, the people, the processes, and the execution – businesses will drown in data. Only a judicious mix of analytics can help business leaders make decisions with confidence and intelligence, and sharpen the competitive edge.

The fact is that in any given organisation, data analysts beaver away making incremental improvements to their analytics ‘personal best’. Yet, as in the Olympics, it is the “Fosbury Flop” moments, and the Bob Beamon breakthroughs that live in the memory.

It is only such record-shattering leaps forward – like prescriptive analytics – that are capable of changing corporate thinking. Or, more precisely, transform the whole data-driven nature of business competition.


The post Is your business ready to learn from the record-breakers? appeared first on International Blog.


November 24, 2016

Data Digest

Activewear brand evolves from sports clothing to wearable fitness tech | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by Bev Terrell was originally posted on SiliconANGLE blog.  

While most people think of Under Armour as the supplier of sportswear and sports footwear, it also owns the digital apps MapMyFitness, MyFitnessPal and Endomondo. As such, the company has taken the first steps this year toward more investment in wearable technologies, called “Connected Fitness,” so that people can track their exercise, sleep and nutrition throughout the day.

Chul Lee, head of Data Engineering and Data Science at Under Armour Inc., joined Jeff Frick (@JeffFrick), co-host of theCUBE*, from the SiliconANGLE Media team, during the Chief Data Scientist, USA event, held in San Francisco, CA, to discuss aspects of the Chief Data Scientist role and Under Armour’s move into wearable fitness technology.

Being your own advocate

During a panel discussion held earlier in the day, Frick noted that Lee had brought up the point that in addition to being a scientist, a CDS also has to be a salesperson to sell the role, engage the business units and help them understand what they’re doing, at the right level.

Lee expanded on that by saying, “I learned, through many experiences and many years of failing, that there was an ‘ah-ha’ moment where I had to start communicating and being a salesperson.”

He also explained that data scientists tend to think they have to unpack the ‘black box’ of whatever project they are working on at the time and try to explain everything that they are doing to everyone. Data scientists feel the pressure to talk about the science aspect of projects and how it is done, rather than focusing on the value they’re trying to deliver to their customers.

Lee has found that all that is needed is to explain your project at a high level to coworkers and make sure they understand and are supportive of that.

Data is in sports clothes, too

Frick asked about how Under Armour got started with its Connected Fitness services, the software services arm built around the Endomondo, MyFitnessPal and MapMyFitness apps.

“The way we start thinking about shoes and shirts is that, OK, you need to enter an experience around shoes and shirts,” he said, adding that because data is everywhere, in every sector, they asked why shouldn’t it be in fitness clothes, too.

Watch the complete video interview below:

Data Digest

The Gawker effect: Can deep learning go deep enough to write tomorrow’s headlines? | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by R. Danes was originally posted on SiliconANGLE blog.  

When now-defunct Gawker revealed its use of analytics in content decisions, many old media types shook their heads; algorithms must not replace human judgement in journalism, they warned. But some believe a happy medium is possible: Data can be sourced and analyzed to inform content writers while leaving them with the final say on what readers see.

Haile Owusu, chief data scientist at Mashable, said that this space where data meets human knowledge workers is fertile ground for innovation. He told Jeff Frick (@JeffFrick), host of theCUBE, from the SiliconANGLE Media team, during the recent Chief Data Scientist, USA event that data practitioners do their best work in tandem with “people who are not especially quantitative, who are expecting — and rightfully so — expecting to extract real, concrete, revenue based value, but are completely in the dark about the details.”

Digital research assistant

Owusu explained how Mashable assists writers with data without encroaching on their judgement. The company utilizes an accumulated history of viral hits, its Velocity Technology Suite and its CMS.

“What we found is that writers are able to distill from sort of a collection of greatest hits — filtered by topic, filtered by time window, filtered by language key words — they are able to incorporate that collected history into their writing,” he said, adding that it does not simply fetch more clicks, but actually improves the quality and depth of their writing.

Two heads are better than one

Owusu stated that deep learning neural networks are able to grok the nuances of data in an almost human manner.

“They’ve allowed us to do feature extraction on images and text in a way that we hadn’t been able to before, and there has been a significant improvement in our ability to do predictions along these lines,” he concluded.

Watch the complete video interview below:

Data Digest

Show me the money: selling inexact data science to tight-fisted investors | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by R. Danes was originally posted on SiliconANGLE blog.  

What industry takes risk management more seriously than finance? Corporate and personal investors want a gold-embossed sure thing when they sink their cash into a venture. So techies who come to them with an untested data analytics toy will likely find them tough customers. Some folks in the finance world are out to dispel anxieties by educating investors on why data sometimes picks a winner and why it may fail.

Jeffrey Bohn, chief science officer at State Street Global Exchange, said the confusion lay in the problem of data quality. He also believes that companies still do not have enough hands on deck to separate the wheat from the chaff.

“You still find 70 to 80 percent of the effort and resources focused on the data preparation step,” he told Jeff Frick (@JeffFrick), host of theCUBE*, from the SiliconANGLE Media team, during the recent Chief Data Scientist, USA event.

Data scientists spread too thin

According to Bohn, more data stewards are needed to select quality data and to free up analysts to innovate and find solutions.

“I’ve had problems where you have great models, but data quality produced some kind of strange answers,” he explained. “And then you have a senior executive who looks at a couple of anecdotal pieces of evidence that suggest there are data quality issues, and all of a sudden they want to trash the whole process and go back to more ad hoc, gut-based decision making.”

The best and the rest

Bohn argued that to increase data quality, companies need to start culling from a greater number of sources.

“We’ve recently been very focused these days on trying to take unstructured data — so this would be text data, it might be in forms on PDFs or html document or text files — and marry that with some of the more standard structured or quantitative data,” he said.

Watch the complete video interview below:

Data Digest

Data science: What does it mean and how is it best applied? | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by Gabriel Pesek was originally posted on SiliconANGLE blog.  

While data science (along with its associated tools, utilities and applications) is a hot topic in the tech world across innumerable facets of industry, there’s still a large degree of uncertainty as to just what data science means, and how it can best be applied.

At the Chief Data Scientist, USA event in San Francisco, CA, Assad M. Shaik, head of Enterprise Data Science for Cuna Mutual, joined Jeff Frick (@JeffFrick), co-host of theCUBE*, from the SiliconANGLE Media team, to talk about data science as it’s commonly seen and as it really is.

Improving understanding

As Shaik explained, his focus at the event was to clarify what data science actually is. Noting “a lot of confusion” about whether it’s a new name for analytics, advanced analytics or something else entirely, his goal is to speak frankly and clearly enough for attendees to gain a better understanding of the challenges and goals encountered by data scientists on a daily basis.

As part of this examination, he’s looking at both the experiences encountered by customers and the revenue growth that can be gained by applying analytics to those customers. He’s also providing some insight into the new skill expectations being encountered by data scientists and the reasons behind those changes.

In Shaik’s estimation, sales skills and a grasp of marketing have become “essential” for a data science group to find success. He attributes this mainly to the development from “IT and the business, if you just go back a few years,” to a centralization of data teams by other organizations in the search for additional value within their data.

Solving real problems

One of the biggest questions savvy data science groups can ask, in Shaik’s mind, is: “How can we help you meet the corporate and the business area goals using data science?” By looking for concrete problems to which the team can apply their tools and research, more informative conclusions can be drawn from the experience.

In a similar development line, Shaik shared his thoughts on companies such as Uber and Airbnb, which are using data science to evolve from traditional models. To him, the most important part of these companies is the way that they’re applying their data to the problems of an existing industry standard and leading the rest of that industry along with the need to innovate and keep up with the times.

As the conversation came to a close, Shaik also shared how much he enjoys the conference. In his experience at the event, “The biggest thing is the networking. I get to meet the people from the different industry sectors, with a similar background in the data science, and understand how they are doing what they are doing in the data science field, [while] sharing my perspective with them. It’s a fabulous event.”

Watch the complete video interview below:

Data Digest

Crowd-sourcing online talent to win a million-dollar competition | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by Nelson Williams was originally posted on SiliconANGLE blog.  

Talent shows have become a television favorite across the world. On the production side, they’re cheap and relatively easy to throw together. Viewers love seeing new acts and voting for their favorites. Still, only so many people can audition for these shows, and even fewer can make the trip to wherever filming takes place. The online world of user-generated media has a solution.

To learn more about an online talent show in the works, Jeff Frick (@JeffFrick), co-host of theCUBE*, from the SiliconANGLE Media team, visited the Chief Data Scientist, USA event in San Francisco, CA. There, he spoke with Roman Sharkey, chief data scientist at Megastar Millionaire.

The data science of talent

The conversation opened up with a look at Megastar Millionaire itself. Sharkey described the show as the world’s first online talent platform. Currently in beta testing, he expected Megastar Millionaire to go global sometime next year. He mentioned how the winner would be determined by votes and video shares, with a celebrity judging panel reigning over the finals.

Sharkey stated that as a data scientist, his role with the company was twofold. First, he was responsible for analytics, collecting data to extract information from users. Second, there was the machine learning part. A major project within the company involved obtaining new performers by detecting real talent in videos online outside the show and then inviting those people to join.

“The system is already really accurate, and its accuracy is improving,” Sharkey said.

Testing and business

At the moment, Megastar Millionaire is still testing its technology. Sharkey explained it’s keeping the number of performers low while testing out the platform, with about 200 to 250 people in the beta competition. On the business side, the company is working with funding from investors and is listed on the Australian stock exchange.

Sharkey pointed out the company is working on things no one has done in practice so far. His goal was to find new ways to accomplish tasks through data science. As for Megastar Millionaire itself, users can find its app in the Apple App Store and Google Play.

Watch the complete video interview below:

Data Digest

Data confluence: handling the scale of distributed computing | #CDSUSA

Editor's Note: Recently, at the Chief Data Scientist, USA, we had the pleasure of being joined by @SiliconANGLE Media, Inc. (theCUBE) to interview some of our attendees about the world of Data Science. Watch the video below. This article written by Nelson Williams was originally posted on SiliconANGLE blog.  

We live inside an explosion of data. More information is being created now than ever before. More devices are networked than ever before. This trend is likely to continue into the future. While this makes data easy to collect for companies, it also presents the challenge of sheer scale. How does a business handle data from millions, possibly billions, of sources?

To gain some insight into the cutting edge of distributed data collection, Jeff Frick (@JeffFrick), co-host of theCUBE*, from the SiliconANGLE Media team, visited the Chief Data Scientist, USA event in San Francisco, CA. There, he met up with Sam Lightstone, distinguished engineer and chief architect for data warehousing at IBM.

The discussion opened with a look at a recently announced concept technology called “Data Confluence.” Lightstone explained that data confluence was a whole new idea they’re incubating at IBM. It came from a realization that vast amounts of data are about to come upon business from distributed sources like cellphones, cars, smartglasses and others.

“It’s really a deluge of data,” Lightstone said.

The idea behind data confluence is to leave the data where it is. Lightstone described it as allowing the data sources to find each other and collaborate on data science problems in a computational mesh.

Using the power of processors at scale

Lightstone mentioned a great advantage of this concept, being able to bring hundreds of thousands, even millions of processors to bear on data where it lives. He called this a very powerful and necessary concept. Such a network must be automatic if it is to scale for hundreds of thousands of devices.

The complexities of such a system are too much for humans to deal with. Lightstone stated his goal was to make this automatic and resilient, adapting to the state of the devices connected to it. He related that with data confluence, they hoped to tap into data science for Internet of Things, enterprise and cloud use cases.

Watch the complete video interview below:

Jean Francois Puget

Using Python Subprocess To Drive Machine Learning Packages

A lot of state of the art machine learning algorithms are available via open source software.  Many open source packages are designed to be used via a command line interface.  I much prefer to use Python, as I can mix many packages together, and I can use a combination of Numpy, Pandas, and Scikit-Learn to orchestrate my machine learning pipelines.  I am not alone, and as a result, many open source machine learning packages provide a Python API. 

Most, but not all.  For instance Vowpal Wabbit does not support a Python API that works with Anaconda.  A more recent package, LightGBM, does not provide a Python API either. 

I'd like to be able to use these packages and other command line packages from within my favorite Python environment.  What can I do?

The answer is to use a very powerful Python package, namely subprocess.  Note that I am using Python 3.5 with Anaconda on a MacBook Pro.  What follows runs as well on Windows 7 if you use commands available in a Windows terminal, for instance using dir instead of ls.  Irv Lustig has checked that the same approach runs fine on Windows 10, see his comment at the end of the blog.

The first thing to do is to import the package:

import subprocess

We can then try it, for instance by listing all the meta information we have on a given data file named Data/week_3_4.vw:["ls", "-l", "Data/week_3_4.vw"], stdout=subprocess.PIPE).stdout

This yields

b'-rw-r--r--  1 JFPuget  staff  558779701 Aug 11 12:40 Data/week_3_4.vw\n'

 runs a command in a subprocess, as its name suggests.  The command is passed as the first argument, here a list of strings.  I could have passed a single string such as "ls -l Data/week_3_4.vw".  The Python documentation says it is preferable to break the command into as many substrings as possible.

The command outputs a CompletedProcess object that can be stored for later use.  We can also use it immediately to retrieve the output of our command.  For this we need to pipe the standard output of the command to the stdout property of the object returned by  This is done with the second argument stdout=subprocess.PIPE.

A similar example from Windows 7 (I guess Windows 10 would be the same):["dir"], stdout=subprocess.PIPE, shell=True).stdout

will output a string containing the content of the default directory for your Python script.  Note we must use the shell=True argument in this case.  It first launches a shell, then runs the command in that shell.

Let's now run Vowpal Wabbit.  We assume that the Vowpal Wabbit executable vw is accessible in our system path.  One way to check it is to just type vw in a terminal.  The following code snippet runs it with the above data file as input:

cp =['vw -d Data/week_4_5.vw'],
                    stdout=subprocess.PIPE,
                    stderr=subprocess.STDOUT,
                    shell=True,
                    universal_newlines=True,
                    check=True)

Let's look at the code we wrote.  We store the CompletedProcess object returned by the command in a variable for later use.  We then print the stdout property of that object.  We redirect the standard error as well as the standard output, with the argument stderr=subprocess.STDOUT. This redirects the standard error to the standard output, which in turn is redirected to the standard output of the cp object. 

We want to run Vowpal Wabbit in a shell, as this is what it expects.  This is done via the shell=True argument.

The universal_newlines=True argument tells Python to decode the output as text, so that new lines are treated as new lines.  If it is not set, then the printed output will be jammed together. 

Last, the check=True argument is set to true in order to trigger a Python exception if the sub process command return code is different from 0.  This is the only way to make sure that the command executed properly.

This code prints:

Num weight bits = 18
learning rate = 0.5
initial_t = 0
power_t = 0.5
using no cache
Reading datafile = Data/week_4_5.vw
num sources = 1
average  since         example        example  current  current  current
loss     last          counter         weight    label  predict features
0.480453 0.480453            1            1.0   0.6931   0.0000        6
0.263166 0.045879            2            2.0   0.6931   0.4790        6
0.389123 0.515081            4            4.0   1.0986   1.1152        6
0.303324 0.217524            8            8.0   1.3863   1.5294        6
0.216799 0.130274           16           16.0   1.6094   1.4195        6
0.332574 0.448349           32           32.0   2.9444   1.4557        6
0.431952 0.531331           64           64.0   1.3863   1.5907        6
0.832383 1.232813          128          128.0   1.7918   2.7743        6
0.686496 0.540610          256          256.0   1.0986   1.6892        6
0.586707 0.486917          512          512.0   0.0000   1.8906        6
0.743715 0.900723         1024         1024.0   1.0986   1.9852        6
0.793468 0.843222         2048         2048.0   2.6391   2.2505        6
0.707151 0.620834         4096         4096.0   2.1972   2.0928        6
0.699104 0.691057         8192         8192.0   0.6931   1.4820        6
0.497394 0.295684        16384        16384.0   1.6094   1.4555        6
0.374966 0.252538        32768        32768.0   2.9444   2.5621        6
0.305165 0.235363        65536        65536.0   2.5649   2.0110        6
0.266821 0.228478       131072       131072.0   1.3863   0.8909        6
0.278243 0.289665       262144       262144.0   1.6094   1.2183        6
0.263753 0.249263       524288       524288.0   1.0986   1.4362        6
0.252903 0.242053      1048576      1048576.0   1.0986   1.0863        6
0.255260 0.257616      2097152      2097152.0   1.0986   0.9340        6
0.270311 0.285362      4194304      4194304.0   1.0986   1.6406        6
0.293451 0.316591      8388608      8388608.0   0.6931   0.5203        6

finished run
number of examples per pass = 11009593
passes used = 1
weighted example sum = 11009593.000000
weighted label sum = 17739622.688639
average loss = 0.301545
best constant = 1.611288
total feature number = 66057558

This is a typical Vowpal Wabbit output. 

The above code looks nice and handy, but it has a major drawback: it prints the output only once the subprocess command completes.  It does not let you see the output of the command as it runs.  This can be frustrating when the underlying command takes time to complete, and as machine learning practitioners know, training a machine learning model can take a very long time.

Fortunately for us, the subprocess package provides ways to communicate with the subprocess.  Let's see how we can harness this to print the output of the subprocess command as it is generated.  Instead of the run function, we use the Popen constructor with the same arguments, except for check=True; indeed, check=True only makes sense if you run the subprocess command to completion.

We then read the output line by line and print it.  There is a catch however: we need to stop at some point.  Looking at the above output, we see that Vowpal Wabbit terminates its output with a line that starts with total.  We check for this, stop reading from the Popen object, and print whatever output remains.  We use rstrip to remove the trailing newline, as print already adds one.  An alternative would be to replace print(output.rstrip()) with print(output, end='').

proc = subprocess.Popen(['vw -d Data/week_3_4.vw'],
                        stdout=subprocess.PIPE,
                        stderr=subprocess.STDOUT,
                        shell=True,
                        universal_newlines=True)
while True:
    output = proc.stdout.readline()
    print(output.rstrip())
    if 'total' in output:
        break

remainder = proc.communicate()[0]
print(remainder)

The output of this code is the same as above, except that each line is printed immediately, without waiting for completion of the sub process. 

You can use the same approach to run LightGBM or any other command-line package from within Python.
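As a reusable sketch, the loop above can be wrapped in a small helper.  The function name stream_command and its stop_marker parameter are my own invention for illustration, not part of subprocess:

```python
import subprocess

def stream_command(cmd, stop_marker=None):
    """Run a shell command, printing its output line by line.

    If stop_marker is given, stop the line-by-line loop once a line
    contains it, then print whatever output remains in one go.
    """
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            shell=True,
                            universal_newlines=True)
    while True:
        output = proc.stdout.readline()
        if output == '':  # EOF: the command has finished
            break
        print(output.rstrip())
        if stop_marker and stop_marker in output:
            break
    remainder = proc.communicate()[0]
    if remainder:
        print(remainder.rstrip())
    return proc.returncode
```

For the Vowpal Wabbit run above, this would be called as stream_command('vw -d Data/week_3_4.vw', stop_marker='total').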

I hope this is useful to readers, and I welcome comments on issues or suggestions for improvement.


Update on November 24, 2016.  Added that the code works fine on Windows 10, as pointed out by Irv Lustig.  Also tested on Windows 7.


Data Digest

Forrester: Empowered customers drive deeper business transformations in 2017

Businesses today are under attack, but it’s not by their competitors. They are under attack from their customers. Three years ago, Forrester identified a major shift in the market, ushering in the age of the customer. Power has shifted away from companies and towards digitally savvy, technology-empowered customers who now decide winners and losers.

Our Empowered Customer Segmentation shows that consumers in Asia Pacific are evolving — and becoming more empowered — along five key dimensions. These five key shifts explain changing consumption trends and lead to a sense of customer empowerment: Consumers are increasingly willing to experiment, reliant on technology, inclined to integrate digital and physical experiences, able to handle large volumes of information, and determined to create the best experiences for themselves.

At one end of the spectrum are Progressive Pioneers, who rapidly evolve and feel most empowered; at the other end we find Reserved Resisters, who are more wary of change and innovation. While the segments are globally consistent and apply across markets, we see significant differences when comparing countries. Our analysis of metropolitan online adults in Australia found that a third of them fall into the most empowered segments — Progressive Pioneers and Savvy Seekers. Highly empowered customers will switch companies to find new and exciting experiences. In this environment, being customer-obsessed and constantly innovating are the only ways to remain competitive.

Organisations in Australia understand this new environment and have started leveraging digital technologies to better engage and serve their B2C and/or B2B empowered customers. While important, most of these investments remain cosmetic in nature. Being customer-obsessed requires much more than a refreshed user experience on a mobile app. It requires an operational reboot. To date, few organisations have started the hard transformation work of making their internal operations more agile in service of these customers. To win the business of these empowered customers, digital initiatives in 2017 will have to move from tactical, short-term initiatives to broader and deeper functional transformation programs.

Being customer obsessed requires much more than a refreshed user experience on a mobile app. It requires an operational reboot. 

Customer obsession requires harnessing every employee, every customer data point, and every policy in the organisation. Eventually, companies will have to assess and address six key operational levers — technology, structure, culture, talent, metrics, and processes — derived from the four principles of customer obsession: customer-led, insights-driven, fast, and connected. Done well, customer obsession promises to help your organisation win, serve, and retain customers with exceptional and differentiated customer experiences.

Join Michael Barnes, VP and research director at Forrester, at Chief Customer Officer, Sydney on Tuesday 29th November, where he will speak on how to Transform Marketing Into A Customer-Driven Effort. To find out more, visit


The Top Predictive Analytics Pitfalls to avoid

Predictive Analytics can yield amazing results.  The lift that can be achieved by basing future decisions on observed patterns in historical events can far outweigh anything that can be achieved by relying on gut feel or anecdotal evidence.  There are numerous examples that demonstrate the possible lift across all industries, but a test we did recently in the retail sector showed that applying stable predictive models gave us a five-fold increase in the take-up of the product when compared against a random sample.  Let’s face it, there would not be so much focus on Predictive Analytics, and in particular Machine Learning, if it were not yielding impressive results.

Teradata ANZ

Service Recovery that Deepens Relationship & Brand Loyalty

Optimising customers’ revenue contribution depends heavily on a company’s ability to deepen and effectively maintain loyalty, along with emotional attachment to its brand. There has been plenty of rhetoric around Customer Experience Management as the strategy to achieve this competitive edge. Yet the fact remains that the vast majority of such initiatives either concentrate solely on cross-sell/up-sell marketing or are Voice-of-Customer (VOC) service-related surveys. These are laudable efforts but are unlikely to result in sustainable differentiation when they are not part of a coordinated customer-level dialogue.

What has been proven to deliver a superior business outcome is the ability to engage with “One Voice” when communicating with customers, especially after negative experiences. For example, only a handful of customers would do more business with a company when their complaints remain unresolved. Therefore, at a minimum, these customers should be excluded from promotional marketing until after satisfactory resolution. Ideally a system should be in place to automatically replace a cross-sell message with a service recovery one for the affected customers.

According to BCG[1] , regardless of the channel they started in, most consumers would seek human assistance (usually via telephone) if they do not get their problem resolved. There would already be a degree of annoyance right from the start of such service calls, especially if customers have to retell background from the beginning. Ideally the service rep should already know the service failures and breakpoints from an earlier interaction. The key capabilities differentiator here is data integration for just-in-time analytics specific to each customer’s context. Even better is to avoid the negative experience in the first place – i.e. develop the capabilities to detect and predict potential servicing and quality issues that erode customer satisfaction.

A company that only proactively contacts customers to sell will very likely condition more and more recipients to switch off and disengage from these communications. A different approach is needed to succeed with a Customer Experience Management strategy. In order to inculcate a customer-centric mindset and systematically deliver bespoke servicing across the entire customer base, an organisation will need to align processes and introduce new performance metrics (e.g. customer’s share of wallet) to drive the appropriate content of automated communication management capabilities. Successfully folding service recovery into its broader marketing dialogue would stand the company in a good position to take advantage of a sales opportunity as and when it emerges for each customer.



[1] BCG, “Digital Technologies Raise the Stakes in Customer Service”, May-16

The post Service Recovery that Deepens Relationship & Brand Loyalty appeared first on International Blog.


November 23, 2016

Curt Monash

DBAs of the future

After a July visit to DataStax, I wrote The idea that NoSQL does away with DBAs (DataBase Administrators) is common. It also turns out to be wrong. DBAs basically do two things. Handle the database...


Curt Monash

MongoDB 3.4 and “multimodel” query

“Multimodel” database management is a hot new concept these days, notwithstanding that it’s been around since at least the 1990s. My clients at MongoDB of course had to join the...

Data Digest

Is Latin America really ready for Big Data and Analytics?

When one tries to understand the rapidly evolving market of Big Data and analytics globally, there is always a tendency to compare the most advanced markets with those that are currently at the beginning of their data and analytics journey, such as Latin America. In our research, though, it was clear that there is a much greater level of expertise and knowledge than is perceived from the outside.

We recently interviewed Giulianna Carranza (GC) of Yanbal, David Delgado (DD) of Banorte, and Victor Barrera (VB) of AXA Seguros, three of the most prominent CDOs in Latin America. During our conversations, we explored their role, competencies and future predictions about the Big Data and Analytics market in the region.

Corinium: What are the top qualities that any CDO in Latin America must have?

GC: The ideal CDO must have fundamental skillsets to drive forward any business strategy within an organisation in Latin America. For instance, he/she must possess innovation leadership, tactical acumen, technological knowledge and strategic thinking in order to promote the adoption of new ideas and processes around data. This executive must also have the ability to guide the members of his/her team and other corporate silos to build solid strategies based on past and present data stories as well as future goals.

DD: A CDO must have a complete and integral business acumen and vision. He/she also needs to be an integrator, business process analyst and a great strategist.

VB: Any CDO in Latin America must have and promote strategic values such as empathy, curiosity and the agility to drive forward any data strategy. This executive also needs to learn and understand how to build a solid data enterprise aligned with the organisation’s values and future objectives, to really lead a data-driven transformation from within; not only from an IT but also from a business perspective. Finally, it is imperative for a CDO to have exceptional leadership skills to engage all silos, including IT, and functions around a clear idea of leveraging data as a vital asset.

It is imperative for a CDO to have exceptional leadership skills to engage all silos, including IT, and functions around a clear idea of leveraging data as a vital asset.

Corinium: What recommendations would you give to any professional seeking this career path?

GC: There is some hard stuff to be considered by any future CDO, in terms of a simple “professional self-assessment”: (1) complete knowledge of the organisation’s operative structure and behaviour; (2) sufficient theoretical and practical knowledge of the latest technological tendencies and developments in both digital and analytics tools; (3) enjoying working out of their “comfort zone” and seeing the professional path of a CDO as a journey of continuous development and improvement; and finally, (4) passion and commitment.

DD: The future CDO needs to have great analytical skills, the ability to break traditional paradigms about customer and brand interactions in order to recognise their evolution through data, and an innate inquisitive mind to understand the complex processes and problems around data and transform them into easy-to-apply business insights.

VB: As I mentioned in the previous point, any CDO requires a great deal of empathy, curiosity and adaptability. Any “data-driven transformation” must be understood as a fundamental change of paradigms at the basic corporate structure. What’s more, it is vital for any CDO in the region to have a strategic thought leadership mentality and knowledge of how to build a solid Data strategy to be able to develop a consistent corporate transformation within the foreseeable future, instead of seeking a swift technological adaptation to a rapidly changing market. In short, the CDO needs to be a leader, able to encourage both the IT and BI teams to work together, with the same goal in mind, to properly deliver a consistent data driven transformation.

Corinium: What are the biggest challenges that any CDO will face in the next few years in the region?

DD: Firstly, how to cope with the velocity at which data is produced locally and all the administrative challenges this implies for generating business insights. Also, the complexity of using new data management platforms, such as Hadoop, and the workforce training needed to implement them.

Culturally speaking, Latin American corporations should implement working models that allow organisations to properly follow what the data indicates.

GC: Culturally speaking, Latin American corporations should implement working models that allow organisations to properly follow what the data indicates. Also, we have important professional skillset gaps, so we need to deploy a full new organisational model around analytics in LATAM, including Learning & Development, to properly exploit existing and future data.

We also need to change our traditional “Ad Hoc Technology” approach of a “unique-analysis-based platform” and adopt more tailored solutions that support different silos in performing in concert with a data strategy (there are obvious expenses involved, but it is imperative to consider them in order to achieve real business outcomes).

It is very important to consolidate the organisational positioning of CDOs in the region. When anyone asks me: Would the CDO be the next CEO? My answer is always: Yes! It is the next natural step!

VB: Change management will be the biggest. It may sound simple, but if all related tasks are not carried out as required, any Data Strategy will fail.

Corinium: How do you see the current Big Data and Analytics market in Latin America and what predictions could you make for it in the next 4 years?

DD: It is a market that has developed rapidly in the last couple of years with regards to its conceptualisation and initial implementation in the region. Clearly, it will grow exponentially now that local corporations have recognised how vital it is to embrace the digital era, and everything it represents, to understand customers’ behaviour, data collection, data monetisation and data administration. Several Big Data tools will be deployed by more organisations, faster, to gain a competitive advantage.

GC: I believe there will be a substantial consolidation of Small Data, Data Intelligence, Digital Analytics, amongst many others, in the region and that they will generate a fundamental transformation of the traditional management structure. Real-time Analytics concepts will represent an important challenge for current or future CDOs –Gartner, for instance, is now talking about hybrid operations models for it: on-going details + consolidates.

From a vendor perspective, I believe they need to adopt a more “consultancy” approach, rather than a simple “sale of products” one, as they have the opportunity to cement integral technological foundations to support CDO’s performances and initiatives for the future in Latin America.

VB: From my perspective, Latin America is not yet ready for Analytics and Big Data. There are fundamentals to be achieved before considering the above: (1) uniformity on Data integration; (2) real business questions’ identification; (3) tools to identify how data, both internal and external, will help answer those questions; and (4) development of statistics models to answer them. It is currently possible to apply some analytical tools to find specific solutions, but not from a wider organisational perspective, given that the data is not really integrated in a unique format.

When anyone asks me: Would the CDO be the next CEO? My answer is always: Yes! It is the next natural step!

To hear from these experts and many more, join us at Chief Data & Analytics Officer, Central America 2017. Taking place on January 24-25, 2017 in Mexico City, CDAO, Central America is the most important gathering of CDOs and CAOs in the region, to further and promote the dialogue concerning data and analytics and their untapped potential. For more information, please visit:  


November 22, 2016

Big Data University

This Week in Data Science (November 22, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

  • Apache Spark – Hands-on Session – Come join speakers Matt McInnis and Sepi Seifzadeh, Data Scientists from IBM Canada as they guide the group through three hands-on exercises using IBM’s new Data Science Experience to leverage Apache Spark!

The post This Week in Data Science (November 22, 2016) appeared first on Big Data University.

Silicon Valley Data Science

Embracing Experimentation at AstroHackWeek 2016

There is something very freeing about experimentation—the ability to fail without fear, and move on. At SVDS, we encourage experimenting as part of our agile practices.

The last week of August, I saw experimentation in action at AstroHackWeek 2016, which is billed as a week-long summer school, unconference, and hackathon. It is part of the Moore-Sloan Data Science initiative being sponsored by three universities: UC Berkeley, New York University, and the University of Washington.


While the first half of each day of AstroHackWeek was spent on more traditional lectures, I would like to focus on the “hacks.” Operating as a form of hackathon, the hacks were well organized and structured.

The first day after lunch, everyone participating in the unconference stood up, introduced themselves, and either proposed a hack idea or mentioned some skills they had that others might find useful for their own hacks. While the event and attendees were rooted in astrophysics, everyone was encouraged to explore any idea they liked, which helped contribute to that atmosphere of being open to the unexpected.

Then, everyone simply self-organized and worked on whatever projects they found interesting. Each hack group ended their days by giving a short recap of what was accomplished, what failed to work, and calls for help. I want to note that second piece: the encouragement to speak to what failed, or to ask for help from others, created a very positive environment for trying new and difficult projects.

Structuring experimentation

I found it striking how important having structure around the freeform process of proposing and working on “hacks” was to the success of the week. The rapid reassessment and evaluation of each of the hacks, and the deliberate calling out of what did not work and why, reminded me of the daily standups in our agile data science projects at SVDS. During a project, we meet in the morning to quickly mention what happened yesterday, what is planned for today, and whether anything is blocking progress.

The engagement from the whole group at the unconference was incredibly high—people stayed late every night at different locations (from bars to restaurants to nearby houses) to continue working on their hacks. Hacks ranged from creating a “Made At AstroHackWeek” GitHub badge in ten minutes to an analysis investigating exoplanet periods from sparse radial velocity measurements (this one is currently being written up for submission to a journal).

You can find the full list of hacks here (and all the materials here), but I’ll link to a couple of my favorites:

Concluding thoughts

Experimentation is one of the engines that drives scientific inquiry. The rapid turnaround on hack projects throughout AstroHackWeek was of a different kind than is typical in academia, and felt more similar to an agile project. The freedom to fail, the ability to iterate quickly, and the cross-pollination among researchers from different astrophysics disciplines made for a powerful and productive week.

What have you experimented with lately? Let us know in the comments, or check out our agile build case study with

The post Embracing Experimentation at AstroHackWeek 2016 appeared first on Silicon Valley Data Science.

The Data Lab

Bizvento - Knowledge Extraction for Business Opportunities

Since 2013, Glasgow-based Bizvento has been developing innovative business software for the event management industry. The Scottish start-up has developed a mobile software platform specifically for professional event organisers which lets them manage all aspects of an event in one place. It also gives event organisers real-time data analysis capability.

After establishing a successful business in providing mobile solutions, Bizvento realised the potential value of the data gathered by its apps at events in the higher and further education sectors, specifically at college and university open days. The app is designed to provide information to prospective students at these events around available courses and programmes. Based on this information, students then have the opportunity to attend talks and information sessions about what’s on offer. This information is recorded in the app and analysed by Bizvento.

The introduction of tuition fees in the UK has created a £20 billion market for the higher and further education sector. Bizvento saw the potential to offer universities and colleges across the country the ability to predict and forecast the number of prospective student applicants using reliable data.

To do this, Bizvento created project KEBOP (Knowledge Extraction for Business Opportunities) and approached The Data Lab for partnership, strategic support and grant funding to realise this opportunity. The Data Lab then facilitated the academic partnership between Bizvento and the University of Glasgow based on the data analysis requirements. 

KEBOP is made up of a suite of sophisticated analysis tools capable of extracting actionable information from two main sources, namely the usage logs of the Bizvento technologies and the registration data of the Bizvento users. 

In the case of the usage logs, the analysis tools adopt information theory to model the behaviour of the users and to identify classes of usage. In particular, the analysis tools manage to discriminate between average usage patterns (those adopted most frequently by the users), peculiar usage patterns (those that appear less frequently while being correct) and wrong usage patterns (those that correspond to incorrect usages of the app). Furthermore, the tools allow the analysis of user engagement as a function of time. This has shown that usage dynamics tend to change abruptly at specific points of time rather than continuously over long periods.

In the case of the registration data, the key point of the KEBOP approach is the integration of basic information provided by the users (name and postcode) and publicly available data about sociologically relevant information (gender, status, education level, etc.).

The main outcome of KEBOP is a suite of data analysis technologies capable of making sense of the digital traces left by the users of Bizvento products at academic open days. 

The usage logs and specifically the registration data captured and analysed by KEBOP technologies will identify the main factors that determine participation in a large-scale event, not only in terms of the likelihood of attending the event but also in terms of preferences for different aspects of the event itself (e.g., the choice of specific sessions in a conference). The experiments have been performed over data collected at the Open Days of the University of Glasgow and show that the most important factors underlying the participation of prospective students in the Open Days are as follows: education level, unemployment rate and average income in the area where an individual lives. Advancing on this, the analysis tools also show the interplay between gender, social status and the choice of the subject of study.

Through the analysis of rich data sets, Bizvento can reliably predict the number of prospective student applicants to universities and colleges throughout the UK. This information can be used by higher and further education institutions to inform their student recruitment processes and forecast levels of applications and interest in specific courses and programmes.


November 21, 2016

Jean Francois Puget

The Machine Learning Workflow



I have been giving two talks recently on the machine learning workflow, discussing pain points within it and how we might address them. The first was at Spark Summit Europe in Brussels, the other at MLConf in San Francisco.

You can find videos and slides for each below.  The main message is that the machine learning workflow is not that simple.


MLConf, San Francisco

That was a great event.  I was in very good company with top presenters from a number of prominent companies, as you can see from the speakers page.  One key takeaway (not a surprise for me) is that machine learning is not all about deep learning.  Sure, deep learning is used, but other techniques such as factorization machines and gradient boosted decision trees play a significant role in some very visible applications of machine learning as well. 

I encourage readers to take a look at the videos of MLConf presentations. Here is information about mine:

My Abstract:

Why Machine Learning Algorithms Fall Short (And What You Can Do About It): Many think that machine learning is all about the algorithms. Want a self-learning system? Get your data, start coding or hire a PhD that will build you a model that will stand the test of time. Of course we know that this is not enough. Models degrade over time, algorithms that work great on yesterday’s data may not be the best option, new data sources and types are made available. In short, your self-learning system may not be learning anything at all. In this session, we will examine how to overcome challenges in creating self-learning systems that perform better and are built to stand the test of time. We will show how to apply mathematical optimization algorithms that often prove superior to local optimization methods favored by typical machine learning applications and discuss why these methods can create better results. We will also examine the role of smart automation in the context of machine learning and how smart automation can create self-learning systems that are built to last.

Watch the presentation on YouTube

See the slides on SlideShare

Spark Summit Meetup, Brussels

At the recent sold-out Spark & Machine Learning Meetup in Brussels, I teamed up with Nick Pentreath of the Spark Technology Center  to deliver the main talk of the meetup: Creating an end-to-end Recommender System with Apache Spark and Elasticsearch.

Nick did most of the talk, presenting how to build a recommender system.  I talked for about 10-15 minutes at the end, discussing the machine learning workflow and typical pain points within it.

Watch the presentation on Youtube

See the slides on SlideShare