 

Planet Big Data logo

Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.

 

July 30, 2016


Simplified Analytics

How to deploy Data Science projects?

In today's digital age, data science has become the top skill and the sexiest job of the century. Data science projects do not have a nice, clean life-cycle with well-defined steps like software...

...
 

July 29, 2016

 

July 28, 2016

Silicon Valley Data Science

Structured Streaming in Spark

Editor’s note: Andrew recently spoke at StampedeCon on this very topic. Find more information, and his slides, here.

Spark 2.0 (just released yesterday) has many new features—one of the most important being structured streaming. Structured streaming allows you to work with streams of data just like any other DataFrame. This has the potential to vastly simplify streaming application development, just as the transition from RDDs to DataFrames did for batch. Code reuse between batch and streaming is also made possible, since they use the same interface. Finally, since structured streaming uses the Catalyst SQL optimizer and other DataFrame optimizations like code generation, it has the potential to increase performance substantially.

In this post, I’ll look at how to get started with structured streaming, starting with the “word count” example.

Note: Structured streaming is still in alpha, so please don't use it in production yet.

Word Count

Word count is a popular first example from back in the Hadoop MapReduce days. Using a Spark DataFrame makes word count fairly trivial. With the added complexity of ordering by count, it looks something like the code below.

Note that we are using the SparkSession entry point (available as spark), which is new in Spark 2.0, along with some functionality from Datasets.
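If you are not running in the spark-shell (which provides spark for you), a session can be created roughly as follows. This is a minimal sketch; the application name and the local master setting are illustrative assumptions, not part of the original example.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
 .appName("StructuredWordCount") // any name will do
 .master("local[*]") // assumption: running locally; omit when submitting to a cluster
 .getOrCreate()

import spark.implicits._ // enables .as[String] and the $"..." column syntax used below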

val df = spark.read.text("/some/data/")

val counts = (df.as[String] // Convert to Dataset[String]
 .flatMap(_.split(" ")) // Split into words
 .groupBy("value") // Group by word (column is "value")
 .count() // Count each group
 .orderBy($"count".desc)) // Sort by the count in descending order

counts.write.csv(...)

If you wanted to accomplish the same thing in the old Spark streaming model you would be writing quite a bit of code because it relied on the older RDD interface and required you to manually track state. However, with structured streaming we can do the following.

val df = spark.readStream.text("/some/data/")

val counts = (df.as[String] // Convert to Dataset[String]
 .flatMap(_.split(" ")) // Split into words
 .groupBy("value") // Group by word (column is "value")
 .count() // Count each group
 .orderBy($"count".desc)) // Sort by the count in descending order

val query = (counts.writeStream
 .outputMode("complete")
 .format("console")
 .start())

As you can see, no change to the logic is required. What has changed is that we used readStream instead of read to get the input DataFrame, and writeStream instead of write to do output. The output is also a bit different; in batch mode we can persist the output in a number of ways, such as the CSV used above, but since structured streaming is in alpha it is still lacking in output options. Above, I'm using the console output option, which prints the result of each batch to stdout.
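One detail not shown above: start() returns a StreamingQuery handle and the call returns immediately, so a standalone application normally blocks on the query. A minimal sketch:

query.awaitTermination() // block until the query is stopped or fails
// query.stop() // or stop it explicitly, e.g. from another thread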

Window Changes

If you are familiar with traditional Spark streaming you may notice that the above example is lacking an explicit batch duration. In structured streaming the equivalent feature is a trigger. By default it will run batches as quickly as possible, starting the next batch as soon as more data is available and the previous batch is complete. You can also set a more traditional fixed batch interval for your trigger. In the future more flexible trigger options will be added.
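For example, to approximate the old fixed batch interval you can attach a processing-time trigger when starting the query. A sketch reusing the counts DataFrame from the word count example; the 10-second interval is an arbitrary choice:

import org.apache.spark.sql.streaming.ProcessingTime

val query = (counts.writeStream
 .outputMode("complete")
 .format("console")
 .trigger(ProcessingTime("10 seconds")) // start a new batch roughly every 10 seconds
 .start())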

A related consequence is that windows are no longer forced to be a multiple of the batch duration. Furthermore, windows needn't be based only on processing time anymore; we can handle events that may have been delayed or arrived out of order and window by event time. Suppose our input stream had a column event_time that we wanted to do windowed counts on. Then we could do something like the following to get counts of events in a one-minute window:

df.groupBy(window($"event_time", "1 minute")).count()

Note that the window is just another column in the DataFrame and can be combined with other existing columns for more complex aggregations:

df.groupBy($"action",window($"event_time", "1 minute")).agg(max("age"), avg("value"))

You may rightly wonder how this is possible without having an ever-increasing number of aggregation buffers sticking around so that we can process arbitrarily late input. In the current alpha version there is no optimization for this. In the future, though, the concept of discard delay will be added, which will allow you to specify a threshold beyond which to discard late data so that aggregations can eventually be finalized and not tracked.

Sources

Currently, only the file-based sources work, but there are some details you need to be aware of:

  • File formats that normally infer schema (like csv and json) must have a schema specified (a sketch follows this list).
  • New files must be added to the input directory in an atomic fashion; usually this would be accomplished by moving the file into the directory after creation.
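For example, a streaming JSON source needs its schema supplied up front. A minimal sketch; the field names and types are made up for illustration:

import org.apache.spark.sql.types._

val schema = new StructType() // assumed fields; replace with whatever your files contain
 .add("event_time", TimestampType)
 .add("action", StringType)
 .add("value", DoubleType)

val events = spark.readStream
 .schema(schema) // required: streaming csv/json sources do not infer the schema
 .json("/some/data/")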

There is also a socket source, which is only suitable for testing and demos; see the examples included in the release for usage. A Kafka source is also planned for release shortly after Spark 2.0.

Sinks

Before I discuss sinks, we first need to explore output modes. There are three modes: append, complete, and update. Append can only be used when there is no aggregation taking place. Complete outputs everything at the end of every batch. Update will do an update in place to a mutable store, such as a SQL database. Currently only append and complete are implemented.

The current options for sinks are extremely limited. Parquet output is supported in append mode, and there is also a foreach sink available for you to run your own custom code that works in append or complete modes. Additionally, there are console and memory sinks that are suitable for testing and demos.
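To give a feel for the foreach sink, here is roughly what plugging in custom code looks like; a sketch only, with println standing in for whatever your own writer would do with each row:

import org.apache.spark.sql.{ForeachWriter, Row}

val query = (counts.writeStream
 .outputMode("complete")
 .foreach(new ForeachWriter[Row] {
   def open(partitionId: Long, version: Long): Boolean = true // open a connection here if needed
   def process(row: Row): Unit = println(row) // handle one output row
   def close(errorOrNull: Throwable): Unit = () // clean up
 })
 .start())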

Conclusion

Structured streaming is still pretty new, but this post should have given you an overview of why it’s an exciting feature. Have you tried it out at all? Let us know in the comments.

The post Structured Streaming in Spark appeared first on Silicon Valley Data Science.

Principa

Three ways to manage credit risk governance in a volatile economic climate

Credit companies are facing an increasingly volatile global financial climate. One need look no further than the impact the unexpected Brexit result has had on the global market. And if that's not enough, the highly accelerated pace of technological development means that companies need to always be prepared to update their processes and methodologies to accommodate ever-changing client needs and to mitigate risk.

Data Digest

The CDO Journey: Building a case for the Chief Data Office(r)

(This article is addressed to C-suite executives and business leaders looking to build a CDO organization within their company.)

If you don’t know where you are going, any road will get you there.
- Lewis Carroll in “Alice in Wonderland”

If you are someone who has been following the evolution of strategies, new technologies, architectural innovation, best practices, success stories and a host of vendor-driven publicity in the data and analytics space, it is very easy to be influenced and assume that this is the direction you want your company to grow into. You may even want to initiate some of these cool projects and proceed with the creation of a CDO role in your organization.

There is nothing wrong with wanting to embrace the advancements in the data space to position the business for growth, but that decision cannot be based on someone else's story. Before you or your company start down the path of setting up the CDO, you need to understand what your unique story is: a unique story crafted from a combination of problems and opportunities. That is, your case should reflect the collective scenarios of the need to solve existing problems as well as a larger desire to embrace and grow from opportunities in the marketplace. What usually succeeds in crafting this story is a blend of factual and creative exercises to identify ‘Musts’, ‘Shoulds’ and ‘Nice-to-haves’.

Thinking through the following questions can help with articulating your story better when it comes to making a formal business case.

Are the volume, variety and velocity of your data and analytics needs complex and critical enough to justify a CDO? Based on your specific needs, all that is required could be a few tweaks to existing processes, particularly when you already have a mature organization supporting them. For example, some companies have a small data footprint when it comes to variety, a very deep requirement in volume, and a specific set of functions to manage where complexity is low to medium. Do these companies need a CDO? Maybe, maybe not. It totally depends on the specific situation and what the intended expectations are.

Beyond solving immediate problems, what are the long-term expectations for data and analytics? Independent of your immediate drivers, there are extended benefits that can be realized through the CDO. If the immediate needs are driven by regulatory/compliance mandates, what opportunities exist for extending the purpose of this function beyond those needs to support overall business growth? In fact, regulatory mandates and regulators are a CDO's best friends, as the primary expectations and principles from these agencies are fully aligned with how data and analytics functions should be rightfully managed – with the right level of controls, ease of change, comprehensiveness of coverage and adequate management oversight. Make this the standard yardstick for managing data across the company and you will automatically succeed in managing the entire business through actionable insights based on reliable data and analytics, not to mention the heightened level of efficiency created in the process.

Can you or your company support a function that operates across organizational boundaries? Where the function lines up is irrelevant, as the application of the processes, procedures and toolsets is company-wide. Typically, CDOs are aligned within Risk, Finance or Technology, and in some cases under Marketing or a Strategic Initiatives area. There is a cultural component involved here that needs to be addressed in almost all CDO initiatives. For a CDO function to be effective, its reach needs to extend across organizational boundaries. When it is truly addressed this way, it is easy to extend the applicability and realize further benefits that could not be perceived initially.

How are you going to support the organization and its leader to succeed? Who will this function report into? The easiest and quickest way to ensure that the CDO initiative fails is to align it with a C-level leader who does not understand the function (even at a high level), does not value it, cannot see its strategic benefits to the organization, or has conflicting interests such that the person would rather see it fail than succeed. In the previous point, I discussed that it does not matter where the function aligns, but it is critical that it is aligned within an area where the greatest support can be provided. Also critical is ensuring that adequate support is provided on items related to funding, resources, independent authority, strategic evolution and reporting.

Once the question “Does your company really need a CDO?” has been answered based on the above-mentioned points, the next step is to sell the case for the CDO, and for this, the full range of benefits needs to be established. This will be covered in greater detail in future articles; however, some of the most obvious are listed below.

  • Establishing comprehensive data governance through a business glossary, data lineage, business/technical metadata, and data certification.
  • Handling data privacy, data classification, access control and the entire range of information lifecycle management.
  • Robust and reliable data quality through a structured process of data quality rule definition, assessment, monitoring and reporting.
  • Data issue identification, inventory, research and remediation activities that may start with data but, if properly implemented, can be extended to cover critical aspects of managing the business – across policies, procedures, alerts, broken/missing controls, process gaps and improvements, and efficiency creation.
  • Managing data as an asset – addressing platforms, sourcing, processing and consumption within the company and across vendor relationships as applicable, as well as increased coverage from the inclusion of new sources – internal, external and social media.
  • Promotion of analytics internally, enhancing customer engagement and experience, predictive behaviour modeling for risk mitigation and upsell opportunities, and operationalizing analytics through front-line systems (digital, including cross-channel evolution, mobile, robotic process automation, and enhanced data exchange possibilities across entities).
  • R&D, new product development, market expansion, M&A or diversification into new industries.

I have included a lot of benefits, and an initial response could be to challenge them since they cover such a wide footprint. Establishing a CDO is an opportunity to truly embrace the management of data as an asset. Data is as important as people or financial assets for a company's survival and growth. Unfortunately, not many companies and leaders have embraced this notion yet. The first three or four benefits are the obvious ones, but if data is well managed as a business-critical function, there should be no doubt that the extended benefits can be realized.

Get creative and reap the boundless benefits. Isn't it time you started your Chief Data Office?

Data Digest

Chief Data and Analytics Officers Tackle Big Data Issues in South Africa


We gathered emerging thought leaders in the data and analytics space to talk about the key issues facing the South African market.

Annie Symington, HOD: Analytics, Multichoice
Annie Symington is Head of Analytics for MultiChoice, a Naspers-owned company in South Africa. Annie holds responsibility for data analytics and insights across the Consumer, Content, Retention and Sales business lines, aimed at driving improved organisational decisions and communications through the use of insightful data publications.

With almost a decade in the marketing industry (across telecoms and media organisations), she has strived to change the company view from the static normal to one of using all available data to better understand the customer and their specific behaviours.

Richard van der Wath, Chief Data Officer, MyBucks
I am a highly skilled individual with above-average problem solving and leadership skills. I have in-depth knowledge of the science of extracting knowledge from data, as well as of making predictions by building computer models and simulations of complex dynamic systems. I have worked in a wide range of fields, from financials through microbiology to industrial processes, and I really excel at crossing domain boundaries and interacting with people from diverse backgrounds.

Danie Jordaan, Chief Data Officer, PSG
I am passionate about providing great service to clients. Service takes many forms in PSG, anything from assisting Advisers with Infrastructure to the more satisfying tasks of building innovative ways of enabling Advisers through technology.

I have a particular interest in CRM systems as they are the culmination of many different technology initiatives to enable business. CRM is not a nice-to-have anymore - it is a ticket to the game. Clients expect you to know them and service them as you agreed. But it is not only clients expecting this; it is the regulations as well. And that is where CRM excels - if you apply it right, it turns regulations (which seem cumbersome to many) into opportunity.

I love spending time with Advisers. They make substantial differences in the lives of their clients, and I see it as a personal goal to assist and enable them.

In order to add value in PSG I have combined my technology background with financial planning training. Currently I am extending this by also studying for an MBA. This has put me in a unique position to understand technology and balance what is possible with what is required by Advisers and the regulator.

Wicus du Preez, Head: Business Analytics, AIG

Data scientist with well-rounded expertise in advanced analytics, innovation, data strategy development as well as marketing optimization.

Strong strategic thinker with proven record of execution and driving bottom line results. Adept at building talent, leading teams and mobilizing cross-functional groups in multicultural settings. A proven business professional with strong analytical skills and demonstrated success globally in both management and functional leadership.

Specialities include:
Customer segmentation, Propensity modeling, Analytics & data strategy, Customer insight, Business intelligence, Automated reporting & visualization solutions, Business & marketing optimization and efficiencies.

Yudhvir Seetharam
, Head of Analytics, FNB
I am responsible for the analytics department in the FNB Business segment. Our segment banks all types of businesses, from start-ups to JSE listed entities. My job enables me to think creatively, analytically and with a business oriented goal in mind. My team of 10 analysts ensures that we maintain the innovative edge over our competitors by effectively using Big Data to enhance FNB's market position. I am a regular public speaker on topics of Big Data and Entrepreneurship.

Academically, I am reading towards my Ph.D in Behavioural Finance at the University of the Witwatersrand. I have also obtained my various RPE (Registered Person Exams) qualifications, enabling me to give financial advice. As a lecturer of investments and risk management, I am able to bridge the gap between theory and practice.

FNB Business Analytics (previously FNB Business Banking and FNB Commercial Banking) covers all MI and analytics that form and influence our customer's lifecycle. The department of 10 analysts looks after reporting and analytics of several business areas, namely: Growth and Sales; Operations, Legal, Risk and Compliance; and Transactional behaviour.

My role involves the decision making, execution and management of this analytics space to ensure we meet our internal and external stakeholder needs.


Articulating insights to business in a way they understand and therefore use is a major challenge in the analytics profession. What initiatives are you using in your business to overcome these challenges and effectively narrate the story to business?

Danie Jordaan (DJ): This is a significant challenge. Traditionally PSG had a very flat structure, so developers interacted with business directly. Documentation was lacking, however, as developers worked from word-of-mouth requirements and produced no documentation. We have appointed BAs and PMs to alleviate the problem. This has improved the paperwork, but for some reason there still seems to be a disconnect between what business wants and what developers do. I believe the answer is not as simple as adding personnel to a problem. It is about adding the correct personnel, who are willing and able, and keeping them so as not to lose corporate knowledge.

Annie Symington (AS): Keeping it simple and close to the customer is imperative. It's important for business to understand what the customers are doing and preferring. We deliver our information using visually pleasing and simple graphics, slides, etc. Our goal is to storyboard the information at all times.

Richard van der Wath (RW): We have a strong focus on data and algorithm visualisation via live dashboards and in reports. To achieve this, it is important to use the right tools as well as having experts with know-how about data visualisation and visual design.

Wicus du Preez (WP): Increased usage of visual representation tools and user-selectable interactive reporting. Closer collaboration with business representatives allows feedback in plain language.

Yudhvir Seetharam (YS): Analytics and data are now a C-suite initiative, included as part of the strategy for enhancing all areas of interaction with the customer. Projects ranging from sales to risk and operations are being created to ensure that we fully leverage both internal and external data to enhance our customers' banking experience. Further, in the drive to maintain market share, innovation in the area of “beyond banking” is also being investigated. We have also recently integrated the research function into analytics – providing an end-to-end value chain to our internal stakeholders.

How would you describe the relationship between the data owners in your business and the analytics owners? And what is your desired state for this relationship?

DJ: It should be symbiotic. Data owners need analytics for decision making purposes, so if they do not involve analytics in system changes they jeopardize their ability to make decisions from their own data.

AS: The relationship is acceptable to a certain extent. As the analytics owners within our business consume transformed information once the data has been made available through the EDW (enterprise data warehouse) environment, there is less involvement between the data owners and the analytics owners.

Ideally, it would be better for the team to be closer as this would assist in understanding the impact to the customer better (in terms of processes etc) or understanding why the data presents itself the way it does.

RW: A close working relationship between Data Science and Tech/IT is imperative in order to ensure and maintain data integrity as well as for deployment of models. The ideal relationship is where Tech can monitor and maintain data integrity and Data Science can just focus on analytics and model building.

WP: Separated. It is a compromise, but otherwise analytics resources get sidetracked with DBA maintenance issues. Ideally, data ownership should sit within IT, with a separate analytics team that can prescribe data requirements.

YS: It differs per segment in the bank, but typically data and analytics are centralised to provide good communication between the two areas. Analysts need to have a basic understanding of both areas in order to enhance the offering of each respective department. Having these functions centralised in FNB Business allows both teams to work together instead of duplication of work in, say, obtaining data.



What is your approach to building use cases in your organisation - in both a data and analytics environment - to drive trust and investment in your work?

DJ: The developer and the business need to communicate very closely to ensure that the use case is exactly what the business requires. It is of immense value if a BA or head of analytics has previously held a business role at that specific company. It makes the translation between business and developer language much easier.

AS: Previously the analytics area was part of the Marketing and Sales function, so it was easy to understand and build the use cases.

RW: We focus on the low hanging fruit and have an agile approach to rapidly build and deploy models that derive actionable insights from data. It is important to show how value derived from data analytics translates into bottom line savings as early as possible.

WP: Use case development is declining. Focus is shifting onto newer agile approaches.

YS: We typically interact with our business owners in order to build and identify scenarios where analytics can be deployed. By sitting with our stakeholders (as opposed to a central bank wide department), we are “closer to the ground”, enabling the team to work more effectively towards understanding the needs of their stakeholders.

What are the big focus areas for you in the data & analytics space over the next 12 to 18 months? Can you share some of your key strategic objectives?

DJ: Unifying data across PSG.

AS: Deliver more insights across the African continent.  Lots of focus on retention and customer engagement activities.  Integrate social and digital data.

RW: Expanding the use of prescriptive analytics and incorporating more alternative (unstructured) data sources into models.

WP: Machine learning, telematics in big data environment.

YS: We would like to get a holistic view (not just a business view) of our customer – to effectively “ask once” in terms of driving our interactions with our customers. Second, there is a focus on deploying analytics decision making closer to our sales staff – what tools can we use to enable our sales staff to make better sales decisions (in real time)?

There is a distinct lack of skills in the South African market. How are you overcoming this challenge and what do you think needs to be done to increase the number of young people moving into data, analytics and data science?

AS: We have an internal grad programme that we tap into, and it has helped us get access to young graduates. We also tap into other grad programmes as much as possible – NWU, SAS, etc. The benefit is you get someone fresh out of university, with great energy and no bad ‘data’ habits. We train them and invest a lot in them, so they are definitely filling a gap for us. The challenge is that it does create a bit of a gap between junior- and senior-level analysts. However, as we spend more time leveraging these skills, our analysts will continue to grow.

RW: An effective retention strategy: making sure problems remain interesting, a flexible work environment, and supporting training and conferences. Training courses and degrees tailored to data science career paths can help close the skills gap.

WP: Training and transitioning staff from the BI/MIS environment over to analytics. Utilizing spare capacity in actuarial environments. Awareness of the career field seems to be a current blockage; perhaps highlighting the field at university open days could assist.

YS: In my team there is a fairly diverse skill set in terms of qualifications. I have found that it is equally (if not more) important to have an understanding of business as opposed to the “technicals” (stats/maths).


Hear more from other leading Chief Data and Analytics Officers at the coming Chief Analytics Officer Forum Africa organized by Corinium. For more information, visit www.caoforumafrica.com

 

July 27, 2016

The Data Lab

Do you know what makes a Data Scientist? No worries, AdzunaDataBot is at your service

Data Scientist Word Cloud

The Data Science job market is still fairly nascent, and the question of what makes a Data Scientist has not been answered conclusively. This is what led us to conceive the AdzunaDataBot! The project was initiated with the two-fold goal of building an understanding of the Data Science job market in the UK and monitoring it on a continuous basis, as well as building expertise in product development on one of the major cloud solution providers by doing a pilot project. AdzunaDataBot gathers jobs data from Adzuna, a UK-based job board aggregation website, stores and processes it on a cloud platform, and presents it visually in an easily interpretable format for interested users. Adzuna's data store can be accessed through Adzuna's web API, which can be queried by keywords and provides a rich variety of information regarding each job ad posted on the different job boards Adzuna aggregates.

While it may be true that a 'complete' data scientist would (should?) have all the skills mentioned above, few among us can claim to have reached that pinnacle. Most aspiring Data Scientists begin their careers in one area of speciality and build their skills in the other areas as they progress. But how do we prioritize these skills by the order in which they are valued in the job market today? We, at The Data Lab, looked to take a Data Science approach to this Data Science problem by analysing actual jobs data to see what the market says, and voilà, AdzunaDataBot!

Not only will this data from AdzunaDataBot be useful to individuals who want to make smarter career choices, it will also be very useful for program coordinators at universities, skills academies and bootcamps, helping them correctly identify the different kinds of data science positions and tailor each of their programs to better provide the required skills to their students. And this goes to the heart of the core mission of The Data Lab, which is to drive collaboration between Scottish industry, the public sector and academia to exploit the value of Data Science together. Training people up with the right skill sets is the first step in ensuring Scottish industry is in the best position to exploit the techniques of Data Science effectively.

A public preview of the AdzunaDataBot is available here.

 

An API a day, and with a cloud solution to play, makes an easy data product today

APIs, or Application Programming Interfaces to give them their full credentials, are becoming increasingly common on the web, with all kinds of services wanting to build an ecosystem around their product by enticing developers with the 'cool factor' of their API. Web APIs offer a well-defined way to programmatically access the underlying data which powers many of the services we use on the web today. The push towards a more open data culture has further driven the adoption of open APIs. While Facebook and Twitter might be the first ones to come to mind, they are by no means the only ones to offer access to their data mines. All kinds of services offer API access to their data, including flight pricing engines, job boards, hotel booking services and auction websites, among many others. Developers of yesteryear might nostalgically look back on their days scraping data from HTML pages, but Data Scientists are not complaining! The ability to access clean datasets from APIs now allows us to spend more time building the data product, which after all is the more interesting and potentially lucrative bit.

 

Cloudy days ahead

So with this in mind, we started this project with the aim of collecting data from the Adzuna web API, storing it, and building a data product around it. And we decided to build this solution on the cloud to guarantee consistency and interoperability between platforms. A few different cloud solutions were evaluated, including PaaS solutions like IBM Bluemix and Heroku and an IaaS solution, AWS EC2. While each platform has its advantages, we chose AWS as our cloud solution because we wanted to start with a simple setup without too many fancy services attached to it. AWS EC2 is simple to configure and get started with, and it has great documentation to get unfamiliar developers up to speed quickly. The AWS free tier, which is available to anyone for a 12-month introductory period, was sufficient for this task.

 

Implementation

The AdzunaDataBot was implemented completely in R. The infrastructure components from AWS included a free-tier EC2 Linux box and a MySQL database to store the data. To configure the EC2 environment for running R, we followed this very easy-to-follow blog post by Amazon. The API call returns a JSON object which can easily be read into an R dataframe. Since making a call to the API and returning a dataframe object is a core piece of functionality which can be leveraged across many different applications, it was implemented as an R package, called “adzunar”, which has been released separately on GitHub. This allowed us to experiment with the search terms and the results while abstracting away the details of actually making the API calls. We set up a job to query the Adzuna API on a daily basis, with the keywords “data science”, and store the results in the MySQL DB. The free tier of the MySQL DB on Amazon RDS comes with 750 hours of usage and 20GB of storage, which is more than sufficient for the prototype we built. This data is then queried on a daily basis to render an HTML page using flexdashboard, a super cool publishing tool available for R. Flexdashboard gives non-web developers (like us) the ability to simply render R plots (like ggplot2) onto a beautiful HTML page with just a few lines of code!

The complete code for this implementation is available on GitHub for any interested Data Scientists out there. This project is still a work in progress, so contributions are most welcome!

 

Nuggets from AdzunaDataBot – Programmers are in demand

The development version of the app has already yielded some pretty useful information regarding the Data Science job market in the UK.

We know that:

  • The top five skills mentioned are Python, statistics, Java, Hadoop and Spark. Programmers/Data Engineers are clearly in demand
  • London forms the most significant hub for Data Science jobs in the UK
  • The median salary on offer is £49k per annum
  • There is a large variance in the salaries on offer, starting from £20k all the way to £200k
  • The most popular buzz word in use among all the job adverts is “analytics”

For future work, we can look to split the analysis by experience level to identify the skills required for entry-level data scientists vs those required for experienced hires. 

 

Teradata ANZ

Creative Business Analytics — The New Black?

I know two things about Henry Ford.

One, he was forever telling customers “Any colour as long as it’s black”. And two, when he wasn’t stipulating decor preferences, he was claiming that “If I had asked people what they wanted, they would have said faster horses”.

Now, whether his opinions were prompted by an unequivocally practical approach to product development or sheer pig-headedness, the Ford Model T, the world’s first mass-produced automobile, changed not only the motor industry but the personal and professional horizons of ordinary people.

And we see this time and again with large, innovative companies. For instance, Apple – standard bearers for commercial creativity – changed the music industry with the ‘1000 songs in your pocket’ iPod and levelled the playing field for independent labels and artists with its digital music store, iTunes.

Thinking differently

Which is all well and good. But when you're working with data day-to-day, it's all too easy to fall into the trap of thinking that incremental improvement – little steps that make small enhancements to the model you're building or the analytics you're juggling – is enough. It's progress of a kind, yes, but there's so much more to be gained by thinking outside of the box and taking a leap of faith.

After all, you can’t discover if you might be the next Olympic swimmer without, first, jumping into the water.

Getting creative with data

Whether you’re mapping customer journeys, identifying and delivering the next-best action/offer, or trying to understand the voice of the customer, analytics can always be enhanced by viewing the project through creative lenses. Global companies do this as a matter of course:

  • eBay’s carefully designed and tailored digital platform is under close, and constant, scrutiny. At any one time, they have 100 different versions of their website; each absorbing as much data as possible to clarify design decisions, from button shapes and colours to advice on photography for sellers.
  • Netflix fuelled its success by using large volumes of data to create hit online shows such as House of Cards. In fact, the confidence to commit $100 million for two 13-episode seasons came from discovering that Kevin Spacey fans also liked David Fincher and British political dramas.
  • LinkedIn monetise data by deploying creative and innovative new data products (that analyse connections and recommend new links, jobs, careers, etc.) on a daily basis.
  • Obama’s number-crunching Big Data election campaign targeted voters with more accuracy than ever before.

Two sides of the same coin

Clearly, creativity and data analytics are not mutually-exclusive disciplines. They work together to deliver maximum benefit by allowing those involved to think outside of the box and make real step-changes.

Look at the design of marketing communications. No matter how creatively conceived the message, if it’s sent to a misinformed or ‘woolly’ target audience, no one will read it. Unless data is used to target the right people and the right channels in the right way, the marketing department will burn through budgets in no time, to scant effect.

On the flipside, data guides companies to the right customer segment, describing the behaviours and attributes of individuals in that segment, and informing a more personalised promotional message. But any list of prospects, identified and segmented according to customer analytics, will be unlikely to respond to a pedestrian narrative. Creativity and data analytics go hand-in-hand.

A healthier business diet

Data scientists tell stories with data – the process of discovering insights, initiated and sustained by inquisitive minds translating data patterns and trends into essential tactical and strategic roughage for the c-suite.

Creativity informs data analytics from the very beginning, challenging data scientists with question after question. “What are the assumptions I am making?” “Are there alternative ways of visualising the data to uncover patterns?” “Which external datasets could be integrated to add further value and highlight new correlations?” “Can I borrow analytical techniques from another field or industry?” And as we drill through the data, the narrative tightens, increasing project credibility with each new insight.

I love the thought that this intensely inquisitive creativity links me and my fellow data scientists to Henry Ford. Because if the Model T visionary were still moving and shaking the motor industry today, I'm sure he'd be insisting:

“Any business analytics model… as long as it’s creative”.

This post first appeared on Forbes TeradataVoice on 29/01/2016.

The post Creative Business Analytics — The New Black? appeared first on International Blog.

 

July 26, 2016


BrightPlanet

VIDEO: How We Identify Dark Web Sites

The TOR network is an anonymous internet that can contain powerful information relating to the illicit sale of goods and personal identifying information. Managing and identifying TOR network websites is a challenge as sites are changing constantly. In this video, we cover the process we use to stay on top of sites on the Dark Web so […] The post VIDEO: How We Identify Dark Web Sites appeared first on BrightPlanet.

Read more »
Big Data University

This Week in Data Science (July 26, 2016)

Here’s this week’s news in Data Science and Big Data.

  • Light Based Communication

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

The post This Week in Data Science (July 26, 2016) appeared first on Big Data University.

Teradata ANZ

My Golf, Big Data and the Cloud

The reason I'm blogging about this is that I see a strong similarity between my golf aspirations and the frenzy around big data, open source and cloud.

My starting premise has always been – Enterprises with good information management practices will leverage data better, regardless of what technology comes before them.

But like bad golfers buying new gear to fix a swing problem, there will be no improvement unless the underlying weak business practices are addressed. I often hear from business users “We can go to cloud and open source so we don’t have to plan for capacity and it will be cheaper. We can avoid expensive and lumpy capital expenditure too”.

There are so many potential underlying weak business practices driving this statement which will not go away when you go to the cloud. Let me paint you a picture here.

At the heart of this statement is a lack of visibility of the business value being generated by the enterprise from the data assets it currently uses. This leads to poor understanding of the business risk of not managing the asset effectively. In turn, the lack of value recognition manifests itself through the urgency and rigor placed on data governance and data stewardship effort. This by the way is not only for structured data, but for ALL data.

It is also commonplace to have no formal and structured capacity planning process with the required committed funding mechanism. Often, the platform capacity is only addressed when significant business disruption is already occurring and business continuity issues are becoming a significant concern. This is best described as a forced upgrade cycle.

Combine the poor planning and the forced upgrade cycle mentioned above and we have an unpredictable cost and service delivery situation. THIS IS NOT GOING AWAY WHEN YOU GO TO CLOUD.

I admit this is an extreme example but you need to accept that elements of this situation exist in parts for many information management organisations today.

What’s the answer? Research good business practices. Find out what business practices work best in traditional, open source and cloud-based environment before committing further to buying a new set of tools.

Having said all of that, I have again renewed my aspiration in golf by buying a new set of clubs – this time it's single-length irons (yes, the same length: 8-iron length from sand wedge down to 5-iron). I know it's not the clubs but a whole raft of underlying bad practices... but hey... my birthday is coming up and I got approval from the wife :-).

The post My Golf, Big Data and the Cloud appeared first on International Blog.

 

July 25, 2016

Data Digest

Actuaries and Data Scientists – Match Made In Heaven or Hell?



In my previous article, titled “What is the reality of the Chief Data Officer role within the insurance sector?”, I was joined by leading Chief Data Officers and Chief Analytics Officers who provided their thoughts on the status of data and analytics leadership and transformation within this industry. In the report, I touched upon the role of the Actuary and their business of calculating risk through stringent statistical and mathematical processes; however, we did not explore their obscure relationship with Data Scientists. The foundations of actuarial science date back to the 17th century, whilst the term “actuary” was first coined by Equitable Life for its chief executive officer in 1762. The Society of Actuaries (SOA) defines this position as “a business professional who analyses the financial consequences of risk. Actuaries use mathematics, statistics and financial theory to study uncertain future events, especially those of concern to insurance and pension programs.”
The emergence of data science has been a direct response to the rise of Big Data, but of course data is useless without the ability to transform it into actionable insights. Insurers now have access to a plethora of structured and unstructured data sources churned up by telematics, wearable technology, social media and the Internet of Things. It is evident that both Actuaries and Data Scientists wish to effectively predict future outcomes. However, the methodology by which each party accomplishes this feat is rather different, as is the context within which each role operates.
Once again, we conducted a survey with the Chief Data Officer Forum, Insurance speaker faculty to better understand their thoughts on the relationship between these two critical roles.

How do you view the relationship between the more traditional Actuaries and Data Scientists and/or CDOs?

Eric Huls, Chief Data Scientist, Allstate

The two roles complement each other. Actuaries have a deep understanding of the insurance business as well as the statistical knowledge and skills to provide foundational analytic capabilities that every insurance company should have. Analytic capabilities only proliferate when you add data scientists into the mix. Data scientists often bring a unique crop of methods and techniques that advances analytics across all business areas. The fact that you see both roles in our industry is a testament to the value insurance companies are placing in analytics.

TJ Houk, Chief Data Officer, Trupanion

There are a wide range of relationships.  In many cases, they are totally separate, they get no synergies from each other, and they actually compete.  At Trupanion, the actuaries and data scientists sit on the same team, collaborate, and build each other’s skills.  Both parties will be more effective if they collaborate, and that will ultimately lead to competitive advantages for teams that achieve that.

Guizhou Hu, VP, Chief of Decision Analytics, Gen Re


Data Scientists have to collaborate with traditional Actuaries in order to make meaningful business decisions. Most executives in insurance are trained actuaries.



Meghan Anzelc, VP, Predictive Analytics Program Lead, Zurich North America

There is some overlap between traditional actuarial roles and the roles of data scientists. Where analytics is used in areas such as pricing, often the actuaries and data scientists work closely together. Typically the data scientists have more experience and skill in manipulating large data sources, dealing with unstructured data, and using more sophisticated statistical or machine learning techniques. The actuaries typically have deeper insurance knowledge, knowledge of internal system data, and understanding of the concerns and constraints the business is facing. Each group complements each other and together can create better solutions to the business problems at hand.

Heather Avery, Director, Business Analytics, Aflac

The relationship between the more traditional Actuaries and Data Scientists and/or CDOs is still maturing.  The culture of an organization will play a major role in ensuring that these functions identify distinct roles and responsibilities – and work together to achieve success.  The relationship has to be viewed as a partnership, as both roles can contribute unique perspectives and actionable insights to drive value for an organization.

Brahmen Rajendra, AVP, Data Warehouse and Business Intelligence, Endurance

The relationship here is one of mutual benefit, as the Actuaries understand their line of business in detail, whereas the CDO/CAO is the bridge to harnessing the analytics industry.

It is clear that the majority of respondents agree that although the relationship between Data Scientists and Actuaries is in its infancy, it is critical that both parties cooperate in order to maximise the potential value of the raw materials that are leveraged to create actionable insights. Although Data Scientists are gaining great traction for their use of rigorous data and analytics processes, this is rendered almost useless without strong business acumen and context. Actuaries can provide this knowledge, backed up with centuries of refined familiarity with the insurance business model.

However, centuries of legacy processes can breed stagnation and great company inertia. Thus, actuaries should be agile and embrace data science and innovation, or we may see the two roles at odds, competing for relevance.

By Andrew Odong

Andrew Odong is the Content Director for the Inaugural Chief Data Officer Forum, Insurance 2016. For more insights into the relationship between Data Scientists and Actuaries in insurance, join us on September 15th in Chicago.


Data Digest

Actuaries and Data Scientists – Match Made In Heaven or Hell?



In my previous article titled “What is the reality of the Chief Data Officer role within the insurance sector?”, I was joined by leading Chief Data Officers and Chief Analytics Officers who provided their thoughts on the status of data and analytics leadership and transformation within this industry. In the report, I touched upon the role of the Actuary and their business in calculating risk through stringent statistical and mathematical processes, however, we did not explore their obscure relationship with Data Scientists. The foundations of actuarial science can be dated back to the 17th century, whilst the term “actuary”  was first coined by Equitable Life for its chief executive officer in 1762. The Society of Actuaries (SOA) defines this position as “a business professional who analyses the financial consequences of risk. Actuaries use mathematics, statistics and financial theory to study uncertain future events, especially those of concern to insurance and pension programs.”
The emergence of data science has been in direct response to the rise of Big Data, but of course data is useless without the ability to transform it into actionable insights. Insurers now have access to a plethora of structured and unstructured data sources churned up by telematics, wearable technology, social media and the Internet of Things. It is evident that both Actuaries and Data Scientists wish to effectively predict future outcomes. However, the methodology by which each party accomplishes this feat is rather different as well as the context by which the role operates.
Once again, we conducted a survey with the Chief Data Officer Forum, Insurance speaker faculty to better understand their thoughts on the relationship between these two critical roles.

How do you view the relationship between the more traditional Actuaries and Data Scientists and/or CDOs?

Eric Huls, Chief Data Scientist, Allstate

The two roles complement each other. Actuaries have a deep understanding of the insurance business as well as the statistical knowledge and skills to provide foundational analytic capabilities that every insurance company should have. Analytic capabilities only proliferate when you add data scientists into the mix. Data scientists often bring a unique crop of methods and techniques that advances analytics across all business areas. The fact that you see both roles in our industry is a testament to the value insurance companies are placing in analytics.

TJ Houk, Chief Data Officer, Trupanion

There are a wide range of relationships.  In many cases, they are totally separate, they get no synergies from each other, and they actually compete.  At Trupanion, the actuaries and data scientists sit on the same team, collaborate, and build each other’s skills.  Both parties will be more effective if they collaborate, and that will ultimately lead to competitive advantages for teams that achieve that.

Guizhou Hu, VP, Chief of Decision Analytics, Gen Re


Data Scientists have to collaborate with traditional Actuaries in order to make meaningful business decisions. Most executives in insurance are trained actuaries.



Meghan Anzelc, VP, Predictive Analytics Program Lead, Zurich North America

There is some overlap between traditional actuarial roles and the roles of data scientists. Where analytics is used in areas such as pricing, often the actuaries and data scientists work closely together. Typically the data scientists have more experience and skill in manipulating large data sources, dealing with unstructured data, and using more sophisticated statistical or machine learning techniques. The actuaries typically have deeper insurance knowledge, knowledge of internal system data, and understanding of the concerns and constraints the business is facing. Each group complements each other and together can create better solutions to the business problems at hand.

Heather Avery, Director, Business Analytics, Aflac

The relationship between the more traditional Actuaries and Data Scientists and/or CDOs is still maturing.  The culture of an organization will play a major role in ensuring that these functions identify distinct roles and responsibilities – and work together to achieve success.  The relationship has to be viewed as a partnership, as both roles can contribute unique perspectives and actionable insights to drive value for an organization.

Brahmen Rajendra, AVP, Data Warehouse and Business Intelligence, Endurance

The relationship here is one of mutual benefit as the Actuaries understand in detail their line of business whereas the CDO/CAO is the bridge to harnessing the analytical  industry.
It is clear that the majority of respondents are in agreement that although the relationship between Data Scientists and actuaries is in its infancy, it is critical that both parties cooperate in order to maximise the potential value of the raw materials that are leveraged to create actionable insights. Although Data Scientists are gaining great traction for their use of rigorous data and analytics processes, this is rendered almost useless without strong business acumen and context. Actuaries can provide this knowledge, backed up with centuries of refined familiarity with the insurance business model.

However, centuries of legacy processes can breed stagnation and great company inertia. Thus, actuaries should be agile and embrace data science and innovation or we may see the two roles at odds, competing for relevance.

By Andrew Odong

Andrew Odong is the Content Director for the Inaugural Chief Data Officer Forum, Insurance 2016For more insights into the relationship between Data Scientists and Actuaries in insurance, join us on September 15th in Chicago.


Data Digest

Actuaries and Data Scientists – Match Made In Heaven or Hell?



In my previous article titled “What is the reality of the Chief Data Officer role within the insurance sector?”, I was joined by leading Chief Data Officers and Chief Analytics Officers who provided their thoughts on the status of data and analytics leadership and transformation within this industry. In the report, I touched upon the role of the Actuary and their business in calculating risk through stringent statistical and mathematical processes, however, we did not explore their obscure relationship with Data Scientists. The foundations of actuarial science can be dated back to the 17th century, whilst the term “actuary”  was first coined by Equitable Life for its chief executive officer in 1762. The Society of Actuaries (SOA) defines this position as “a business professional who analyses the financial consequences of risk. Actuaries use mathematics, statistics and financial theory to study uncertain future events, especially those of concern to insurance and pension programs.”
The emergence of data science has been in direct response to the rise of Big Data, but of course data is useless without the ability to transform it into actionable insights. Insurers now have access to a plethora of structured and unstructured data sources churned up by telematics, wearable technology, social media and the Internet of Things. It is evident that both Actuaries and Data Scientists wish to effectively predict future outcomes. However, the methodology by which each party accomplishes this feat is rather different as well as the context by which the role operates.
Once again, we conducted a survey with the Chief Data Officer Forum, Insurance speaker faculty to better understand their thoughts on the relationship between these two critical roles.

How do you view the relationship between the more traditional Actuaries and Data Scientists and/or CDOs?

Eric Huls, Chief Data Scientist, Allstate

The two roles complement each other. Actuaries have a deep understanding of the insurance business as well as the statistical knowledge and skills to provide foundational analytic capabilities that every insurance company should have. Analytic capabilities only proliferate when you add data scientists into the mix. Data scientists often bring a unique crop of methods and techniques that advances analytics across all business areas. The fact that you see both roles in our industry is a testament to the value insurance companies are placing in analytics.

TJ Houk, Chief Data Officer, Trupanion

There are a wide range of relationships.  In many cases, they are totally separate, they get no synergies from each other, and they actually compete.  At Trupanion, the actuaries and data scientists sit on the same team, collaborate, and build each other’s skills.  Both parties will be more effective if they collaborate, and that will ultimately lead to competitive advantages for teams that achieve that.

Guizhou Hu, VP, Chief of Decision Analytics, Gen Re

Data Scientists have to collaborate with traditional Actuaries in order to make meaningful business decisions. Most executives in insurance are trained actuaries.


Meghan Anzelc, VP, Predictive Analytics Program Lead, Zurich North America

There is some overlap between traditional actuarial roles and the roles of data scientists. Where analytics is used in areas such as pricing, often the actuaries and data scientists work closely together. Typically the data scientists have more experience and skill in manipulating large data sources, dealing with unstructured data, and using more sophisticated statistical or machine learning techniques. The actuaries typically have deeper insurance knowledge, knowledge of internal system data, and understanding of the concerns and constraints the business is facing. Each group complements each other and together can create better solutions to the business problems at hand.

Heather Avery, Director, Business Analytics, Aflac

The relationship between the more traditional Actuaries and Data Scientists and/or CDOs is still maturing.  The culture of an organization will play a major role in ensuring that these functions identify distinct roles and responsibilities – and work together to achieve success.  The relationship has to be viewed as a partnership, as both roles can contribute unique perspectives and actionable insights to drive value for an organization.

Brahmen Rajendra, AVP, Data Warehouse and Business Intelligence, Endurance

The relationship here is one of mutual benefit: the Actuaries understand their line of business in detail, whereas the CDO/CAO is the bridge to harnessing the analytics industry.
It is clear that the majority of respondents agree that, although the relationship between Data Scientists and Actuaries is in its infancy, it is critical that both parties cooperate in order to maximise the value of the raw data used to create actionable insights. Although Data Scientists are gaining traction for their rigorous data and analytics processes, that rigour counts for little without strong business acumen and context. Actuaries can provide this knowledge, backed by centuries of refined familiarity with the insurance business model.

However, centuries of legacy processes can also breed stagnation and company inertia. Actuaries should therefore be agile and embrace data science and innovation, or we may see the two roles at odds, competing for relevance.

By Andrew Odong

Andrew Odong is the Content Director for the Inaugural Chief Data Officer Forum, Insurance 2016. For more insights into the relationship between Data Scientists and Actuaries in insurance, join us on September 15th in Chicago.


 

July 24, 2016


Simplified Analytics

Why Open Source is gaining momentum in Digital Transformation?

Once upon a time in IT, using open source simply meant Linux instead of Windows, or maybe MySQL instead of Oracle. Now, there is such a huge diversity of open source tools, and almost every leading...

...

 

July 22, 2016


Rob D Thomas

A Practical Guide to Machine Learning: Understand, Differentiate, and Apply

Co-authored by Jean-Francois Puget (@JFPuget) Machine Learning represents the new frontier in analytics, and is the answer of how many companies can capitalize on the data opportunity. Machine...

...
VLDB Solutions

Database Benchmarking

TPC Database Benchmarks

As most readers of data warehouse related blogs will no doubt know, the Transaction Processing Performance Council (TPC) define database benchmarks to allow comparisons to be made between different technologies.

Amongst said benchmarks is the TPC-H ad hoc decision support benchmark, which has been around since 1999, with the latest version being v2.17.1. This is the most relevant TPC benchmark for those of us focussed on data warehouse platforms.

Benchmark Scale

TPC-H benchmarks are executed at a given scale point ranging from 1GB up to 100TB:

  • TPCH1 = 1GB
  • TPCH10 = 10GB
  • TPCH30 = 30GB
  • TPCH100 = 100GB
  • TPCH300 = 300GB
  • TPCH1000 = 1,000GB/1TB
  • TPCH3000 = 3,000GB/3TB
  • TPCH10000 = 10,000GB/10TB
  • TPCH30000 = 30,000GB/30TB
  • TPCH100000 = 100,000GB/100TB

Benchmark Data

Data used to populate the TPC-H schema is generated using the DBGEN utility which is provided as part of the TPC-H download.

Database Schema

The schema used in the TPC-H benchmark is retail based and contains the following tables:

  • customer
  • orders
  • lineitem
  • part
  • partsupp
  • supplier
  • nation
  • region

Each table contains either a fixed number of rows or a number related to the scale factor in use.

Unsurprisingly, the ‘lineitem’ table contains the most rows, which is 6,000,000 x scale factor e.g. 600,000,000 rows for scale factor 100 (TPCH100).
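
As a quick illustration (my own sketch, not part of the TPC-H kit), the row count can be derived directly from the scale factor:

// Expected 'lineitem' row count for a given TPC-H scale factor,
// using the 6,000,000 rows-per-scale-factor figure quoted above.
def lineitemRows(scaleFactor: Long): Long = 6000000L * scaleFactor

// lineitemRows(100) == 600000000, i.e. 600 million rows at TPCH100.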

Column names and data types are provided for all tables.

For those using MPP databases such as Teradata, Greenplum or Netezza you’ll have to decide your own data distribution strategy.

Benchmark Queries

The queries that are executed against the populated schema are generated using the QGEN utility which is provided as part of the TPC-H download. Minor query modifications are allowed to take into account differences between DBMS products.

There are a total of 22 ‘SELECT’ queries (‘query stream’) that must be executed against the populated retail schema at a given scale factor. Each query corresponds to a business question. There are also 2 refresh functions (‘refresh stream’) that add new sales and remove old sales from the database.

Database Load Time

The elapsed time required to generate & load the data to the database must be recorded. The time to execute other supporting tasks such as table creation, index creation & statistics collection must also be recorded.

Performance Tests

Once the data is loaded the real fun and games can begin. The performance test consists of both single-user power and multi-user throughput tests.

The power test consists of the first refresh function followed by the 22 query set and lastly the second refresh function.

The throughput test consists of a minimum number of concurrent runs of the 22 queries (‘streams’), as determined by the scale factor:

  • SF1 = 2
  • SF10 = 3
  • SF30 = 4
  • SF100 = 5
  • SF300 = 6
  • SF1000 = 7
  • SF3000 = 8
  • SF10000 = 9
  • SF30000 = 10
  • SF100000 = 11

The throughput test is run in parallel with a single refresh stream.

The set of 22 queries is run in a sequence specified in Appendix A of the TPC-H guide and is dependent on the number of streams.

The timing of all queries is measured in seconds.

Benchmark Metrics

The metrics captured by the power and throughput tests are as follows:

  • composite query-per-hour (QphH@Size)
  • price/performance ($/QphH@Size)

Detailed explanations as to how these metrics are computed are available in section 5 of the TPC-H guide. There are some lovely equations in there for maths geeks to enjoy!
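
As a rough sketch of how the two tests combine (my paraphrase, not the authoritative definition; section 5 of the TPC-H specification has the full equations), the composite metric is the geometric mean of the power and throughput figures, and price/performance simply divides the total system price by it:

// Composite query-per-hour metric: the geometric mean of the power and
// throughput metrics at a given scale ("Size"). Paraphrased, not verbatim
// from the specification.
def qphh(powerAtSize: Double, throughputAtSize: Double): Double =
  math.sqrt(powerAtSize * throughputAtSize)

// Price/performance: total system price divided by QphH@Size.
def pricePerformance(systemPrice: Double, qphhAtSize: Double): Double =
  systemPrice / qphhAtSize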

Benchmark Results

Vendors that submit results are required to provide an executive summary in addition to a full disclosure report. All of the shell scripts, SQL, log files etc are also provided as supporting files. There are no secrets here!

Benchmark performance results have been submitted at the following scales:

  • SF100 (100GB)
  • SF300 (300GB)
  • SF1000 (1TB)
  • SF3000 (3TB)
  • SF10000 (10TB)
  • SF30000 (30TB)
  • SF100000 (100TB)

Perhaps interestingly, the fastest system at all scale factors is currently the Exasol database running on Dell PowerEdge servers. We’ll let you peruse the TPC web site to see how they did it.

Benchmark Use

The TPC-H benchmark is primarily of interest to technology vendors to show us what they’ve got in their locker.

Here at VLDB we’re data consultants, not tech vendors, so why the TPC-H interest?

Well, it is possible to use a cut-down version of the TPC-H benchmark to assess the real-world capability of various database platforms.

Why believe the hype when you can test database performance with a consistent set of queries against a real-world(ish) schema and usable data?

Well, that’s exactly what we’ve been doing for several years now.

Specifically, we use a set of benchmark tests to assess cloud database performance. Also, as cloud platforms are said to suffer from the ‘noisy neighbour’ problem, we run benchmark tests over extended periods to measure how database performance varies over time on cloud platforms.

Some of the results are interesting…very interesting 🙂

 

The post Database Benchmarking appeared first on VLDB Blog.

Data Digest

Leading content and conference producer, Corinium, embarks on a brand refresh: “It’s all about connected thinking”, says CEO


Corinium, the leader in content and conferences for the emerging C-suite in data, analytics and digital innovation, recently embarked on a brand refresh. This is in line with the growing importance of Chief Data Officers (CDOs), Chief Analytics Officers (CAOs), Chief Data Scientists (CDSs) and other emerging leaders in today’s information-driven and inter-connected global economy.


“There is no denying that the growth of big data has been unprecedented, and as a result, the growth of CDOs, CAOs, CDSs has been phenomenal. As an active community leader in this space, we were quite astonished at the pace of growth that we’re seeing and my sense is this is just the tip of the iceberg”, says Charlie James, CEO of Corinium.

Indeed, Corinium’s community of over 20,000 active members, who consume its content and attend its conferences across six continents including North America, Europe, Africa, Asia and Australia, is proof of that. With Gartner predicting that 90% of large organizations will have a Chief Data Officer by 2019, the key challenge will not be the availability of knowledge and information but the ability to make it actionable.

“When virtually all the world’s information is at everyone’s fingertips, the key issue that will be faced by future leaders is how to use all the available information and turn it into insight. That’s where ‘connected thinking’ comes in. With our brand refresh, our vision is to drive inclusivity, content and dialogue – all defining facets of the company – and make it a central theme of everything we do”, explains Charlie James during an interview at Corinium’s London headquarters.

With our brand refresh, our vision is to drive inclusivity, content and dialogue – all defining facets of the company – and make it a central theme of everything we do.

As part of this brand exercise, Corinium has successfully migrated the websites of the following flagship conferences happening this year, bearing the new look and brand proposition. Click on the link to visit the website.

$400 Gift for Conference Delegates

As a special incentive for Data Digest readers, Corinium will take $400 off its regular conference pass pricing. Just enter DATADIGEST as the discount code when you register.

For more information about Corinium or if you have any enquiries about attending a conference or sponsorship, contact alexis.efstathiou@coriniumintelligence.com
Big Data University

Welcome to the new BDU!

After several months of hard work, we are proud to unveil our web site with a new look and feel!  We hope you like it!

Though you may have already seen it through our beta.bigdatauniversity.com site, we now have completely migrated all user information (including information about completion certificates previously earned), and we now have enabled final exams and badges.

Here is a summary of all changes made:

  • Migration from the Moodle learning management system to OpenEdx
  • Migration of all user information
  • Improved course structure to include Review Questions at the end of Modules
  • New badges offered, and the ability to claim level-1 badges immediately after satisfying minimum requirements
  • New courses!
  • New learning paths!

The old Moodle platform will still be available as our archive (archive.bigdatauniversity.com) for at least one year, though NO new registrations will be accepted. If you were working on a course on the old platform, and were close to finishing it, you will have until Friday, August 5th, 2016 (12 noon EST) to complete the course in order to get a badge and a completion certificate.  After that, you can still see the course content, and even take the Final Exam; however, the corresponding badge and completion certificate will not be issued.  Badges offered in the old system will still be valid, but will no longer be issued. (See our FAQs for answers to questions about the old and new sites.)

More changes to come

We are not done with the site. We are continually changing our site, including improvements to all our courses, more learning paths, better user profiles, and more!

We want to hear from you!

Feedback or issues can be provided through the Support button, on the right side of the screen.

Happy Learning!

Big Data University Team

The post Welcome to the new BDU! appeared first on Big Data University.

 

July 21, 2016

Silicon Valley Data Science

Why You Need a Data Strategy

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

This is a great time for big data in business. There’s a widespread awareness of the value of big data analytics, and plenty of use cases that demonstrate its potential: understand your customer, optimize your supply chain, provide personalized app and media experiences.

No wonder that business leaders are asking their teams to look at big data. Anyone who’s read the business press on big data knows that the plan is simple—

    1. Get Data
    2. ???
    3. Profit!

Yet in the real world, there’s a big and unanswered question: what goes into step 2? That’s what data strategy is all about.

While it would be great for everyone if you could just “buy a Hadoop” and skip straight to “Profit!”, in reality there’s a lot of work involved, and 95% of it is unique to your business. How do you determine the steps of a big data project, and ensure it delivers results early?

Best practice suggests that successful companies have started with a pilot, and set out on a 2-3 year learning curve. But we think you can do better. Thanks to our years of data experience across diverse industries, we’ve arrived at the practice of data strategy.

Effective data strategy integrates a wide range of factors: the goals of the business and product teams, the capabilities and workflow of the analytics and engineering teams, and existing systems and data. Working through these, we can identify a roadmap of new capabilities, prioritized meaningfully. You get the benefit of focused scope with immediate return, but you can also build towards a long-term architecture.

Today’s business environment is digitized—look at mobile apps, 3-D printing, and IoT, as just a few examples. In other words, data today has to serve the strategic imperatives of what a business is doing. When we say strategic imperatives, we mean those things that define an organization by what they want to achieve. You could minimize what you do with data, but you would be left behind. You would be a taxi firm when Uber starts, or an old-fashioned TV network subsumed by Netflix or Amazon. A data strategy is about embedding data-driven decision making, about being able to create data applications inside a company to help it achieve its strategic goals.

We have a quick, two question, quiz that we use when teaching strategy in person:

  1. Do you feel that the technology leadership (the CIO, for example) prioritizes their IT investments according to the ambitions of the business as a whole?
    When we ask this, about 40% of our audience puts their hands up. That may be a good amount, but still means the majority don’t think their IT spending is supporting the ambitions of the business.
  2. Can you clearly say how your investments in data technology have impacted the business?
    We find that maybe 10-15% of the audience raise their hands here.

If you answer no to either or both of these questions, then you need a data strategy.

We do accept that data strategy is a term that has been overloaded. Conventional data strategy is often all about looking inward at the systems you have, how to reduce risk, and how to implement governance. All valid things to do, but they don’t look to value. In fact, these well-meaning efforts can actually impede progress. It can frequently take three months to move a data set into a warehouse so it can be exploited, and three months is a long time in business.

Instead of just thinking about what you need to do to data, we think about what to do with data. For example, how can we use predictive analytics to identify our VIP customers and then generate great recommendations for them? Or, how can we use data to automate processes that take days, and turn them into hours?

There’s more to it, of course, which is why we speak on this topic at various events—look to the list at your left for more information on our upcoming presentations.  In the meantime, check out these resources:

 

The post Why You Need a Data Strategy appeared first on Silicon Valley Data Science.

Jean Francois Puget

Overfitting In Mathematical Optimization

Have you ever heard of overfitting?  I bet you did if you are using machine learning one way or another.  Avoiding overfitting is a key concern for machine learning practitioners and researchers, and it led to the development of many techniques to avoid it, such as cross validation and regularization.  I claim that OR practitioners, especially those using mathematical optimization solvers, should be equally worried about overfitting.

In a nutshell, overfitting in machine learning happens when the model you trained is too good to be true on the training data: your model predicts the training data very well, but performs poorly on new data.  An extreme case of overfitting is rote learning: in that case your model achieves 100% performance on data it has already seen, and probably won't do any better than a random guess on new data.  You can read Overfitting In Machine Learning if you want to know more.

Overfitting In General

How could overfitting be relevant to optimization?

Let me abstract overfitting a bit in order to answer that question.  We need a few ingredients:

  1. Algorithm.  It could be a machine learning algorithm, but it could be another algorithm, say a sort algorithm. 
  2. Training Data. A set of data sets to which the algorithm can be applied.  For instance, a set of arrays of numerical values to be sorted.  In machine learning, training data would be labelled examples, and the algorithm would produce a model that computes labels from example features.  This data is called training data.
  3. Evaluation metric.  For instance the time it takes to sort arrays.  In a machine learning setting we would use metrics that are related to the error between predicted values and actual values, like accuracy, F1 score, roc-auc, etc.
  4. Parameters that can be set for the algorithm.  Differing parameters would result in differing results.  For instance, if our sort algorithm is a bucket sort, the number of bins is a parameter for that algorithm.  Different values for this parameter could lead to different running times of the algorithm, i.e. different values of the evaluation metric.  For machine learning, algorithm parameters are called hyper parameters. Examples of hyper parameters are learning rate(s) and regularization weight(s).  

Our goal is to select the parameters that lead to the best value for the evaluation metric on new, unforeseen, data sets.  These data sets are called test data.

For our sorting algorithm, finding the best parameter settings amounts to finding the number of bins that yields the best performance on new, unforeseen, arrays.  In machine learning, finding good parameter settings is called hyper parameter optimization (HPO).  HPO finds the values of hyper parameters that lead to the best evaluation metric on new, unforeseen, examples.

The catch is that we don't have the new, unforeseen, data when we select parameters, hence we cannot compute the evaluation metric on it.  As explained in Overfitting In Machine Learning, the solution is to assume that the training data is representative of the new, unforeseen, data.  Then we get a simpler problem:

Find parameter settings that lead to the best evaluation metric for the training data.

These parameter settings would then yield the best possible evaluation on new, unforeseen, test data.

The above is true only if we made a significant assumption.  Let me state it again:

We assume that the training data is representative (close to) the data to which the algorithm will be applied in the future.

If this assumption is violated, then the best parameter setting for the training data may not be the best parameter setting for test data.  Violating this assumption results in overfitting: we get a good evaluation metric on training data, and a poor one on test data.
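
To make that concrete, here is a minimal, self-contained sketch (my own illustration using the bucket-sort analogy from earlier, not anything from CPLEX): the bucket count is selected on a set of "training" arrays and then evaluated on unseen "test" arrays drawn from a different size distribution. If the training arrays are not representative, the winner on the training set may be far from the best choice on the test set, which is exactly the overfitting described above.

import scala.util.Random
import scala.collection.mutable.ArrayBuffer

// Toy algorithm with one tunable parameter: bucket sort over values in [0, 1).
def bucketSort(xs: Array[Double], numBuckets: Int): Array[Double] = {
  val buckets = Array.fill(numBuckets)(ArrayBuffer.empty[Double])
  xs.foreach(x => buckets(math.min((x * numBuckets).toInt, numBuckets - 1)) += x)
  buckets.flatMap(_.sorted)
}

// Evaluation metric: average running time in milliseconds over a set of arrays.
def avgMillis(data: Seq[Array[Double]], numBuckets: Int): Double = {
  val t0 = System.nanoTime()
  data.foreach(xs => bucketSort(xs, numBuckets))
  (System.nanoTime() - t0) / 1e6 / data.size
}

val rng = new Random(42)
// "Training" arrays are small; "test" arrays are much larger, i.e. the
// training data is not representative of the data seen later.
val train = Seq.fill(50)(Array.fill(1000)(rng.nextDouble()))
val test  = Seq.fill(20)(Array.fill(200000)(rng.nextDouble()))

val candidates = Seq(2, 16, 128, 1024)
val bestOnTrain = candidates.minBy(b => avgMillis(train, b))

println(s"Best bucket count on training arrays: $bestOnTrain")
candidates.foreach { b =>
  println(f"buckets=$b%4d -> ${avgMillis(test, b)}%.2f ms per test array")
}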

Overfitting In Optimization

Let us now look at mathematical optimization solvers.  These are mostly a collection of clever implementations of well known algorithms such as simplex, log-barrier (interior point), and branch&bound for MIP.  These implementations come with a number of variants that are controlled via public parameters and hidden parameters: different parameter settings lead to different running times.  For instance, in a MIP solver, you will have quite a few parameters, including on/off switches for presolve reductions and cut families. 

One of the main focuses of solver developers is improving performance, i.e. finding ways to solve optimization problems faster and faster.  One way to do this is to select the best default values for parameter settings. This fits the framework described above.  Using CPLEX MIP as an example:

  1. Algorithm.  CPLEX MIP.
  2. Training Data. A set of MIP instances.
  3. Evaluation metric. Geometric average of running times (see the sketch after this list).
  4. Parameters. Public and hidden parameters for CPLEX MIP.
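
As a small aside (my own snippet, not CPLEX code), the geometric average used as the evaluation metric can be computed as the exponential of the mean log running time; unlike an arithmetic mean, it is not dominated by a single pathologically slow instance.

// Geometric average of running times (in seconds, say).
def geometricMean(runTimes: Seq[Double]): Double =
  math.exp(runTimes.map(math.log).sum / runTimes.size)

// geometricMean(Seq(1.0, 10.0, 100.0)) == 10.0, whereas the arithmetic mean is 37.0.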

Overfitting happens if the training data is not representative of the new data to which CPLEX MIP will be applied.  In optimization terms:

We overfit if the MIP instances we use to select best parameter settings are not representative of the MIP instances our users will have.

In order to avoid this, we try to have a set of MIP instances that are representative of our users’ MIP instances.  We do it in a very straightforward way: we collect instances from our users (with their explicit permission, of course).  Having done that for decades ensures we have a reasonably representative set containing thousands of models.

What about our users? OR practitioners usually spend time tuning the solver they use so that problems get solved faster.  They do so by collecting a set of problem instances, then using manual tuning or automated tuning tools that select parameters leading to better performance.  If the problem set they use is not representative of the future problems the solver will be applied to, then they will suffer from overfitting.

We can further generalize overfitting to the modeling part of optimization.  This part is somewhat the equivalent of feature engineering in machine learning.  It is how to translate a business problem into a mathematical optimization problem.  I gave an example of optimization modeling in Step By Step Modeling Of PuzzlOr Electrifying Problem.  The point is that the same business problem can be represented in many different ways as an optimization problem.  Some ways will lead to better performance than others.  It is therefore no surprise that OR practitioners spend significant time trying various models in order to find the one leading to the best performance.  Here again there is a risk of overfitting.  Indeed, think of the model as one (major) parameter to the algorithm:

  1. Algorithm.  CPLEX MIP for instance.
  2. Training Data. A set of business problem instances.
  3. Evaluation metric. Geometric average of running times. 
  4. Parameters. A set of models of the business problem.

Users could suffer from overfitting if the training data is not representative of the new data to which CPLEX MIP will be applied.  In optimization terms:

Users overfit if the business problem instances used to select best model are not representative of the business problem instances users will have.

This is often overlooked as there is pressure to show best possible performance during the development of models.

Let me conclude with yet another way overfitting can creep in.  It is when one uses a public benchmark to infer a solver performance comparison.  There are public benchmarks for mathematical optimization solvers, like MIPLIB.  CPLEX is one of the top performers on these public benchmarks.  Yet we could do better.  Indeed, it would be tempting to use these public benchmarks as the training data when we select CPLEX default parameter settings: this would yield even better benchmark performance.  However, given that these benchmarks usually contain a small number of instances that are not representative of our users' problems, we would be overfitting.  Another consequence is that you should not use these public benchmarks as an indication of how various solvers would behave on your problems.  There is only one way to know this: try the various solvers on a set of problems that are representative of the problems you'll need to solve later.  Anything that departs from this will lead to overfitting and disappointing performance in the future.

 

The Data Lab

Summer Internship Stories at The Data Lab

The Data Lab Interns

Here are their stories:

 

Amy Ramsay - Funding Intern 

Born and bred in Edinburgh, I fled the nest to study Psychology and Biology at the University of St Andrews, where I am entering my final year. Research, data collection and statistics are a large part of my course resulting in a new-found appreciation for quantitative methods. Whilst learning about this is an advantageous skill, one thing I questioned was the real life application of these techniques. I stumbled across The Data Lab when I was exploring internship options for the summer. What interested me was how The Data Lab encourages cross-collaboration between academics and companies to utilise data science to help drive innovation. This answered my question of how what I’ve learnt at university applies in a genuine business situation.

As the Funding Researcher Intern, I am responsible for scoping the UK funding landscape and with Brexit, how access to EU funding will be affected. I am also involved with our Skills programme, where my tasks include MOOC and data apprenticeship research as well as investigating opportunities for The Data Lab's MSc and EngD courses. Without wanting to sound cliché, every day at The Data Lab I learn something new - whether it be domain specific about my research, enhancing my technical skill in LucidChart and R or enhancing my soft skills. My plan for the future is unclear but my internship at The Data Lab has allowed me to gain invaluable experience of a working environment, as well as helping me determine what type of job is right for me.

 

Perry Gibson - Project Analyst Intern 

Fumbling for meaning in a fast moving world, I find myself a Glasgow lad, at a midpoint in an Informatics degree at the University of Edinburgh.  I had been floating around some of the excellent Scotland Data Science & Technology Meetups over the past year, and heard that one of the hosts, The Data Lab, was offering internships.  Eager to get some more real-world experience in a field I loved, I took up the role of Project Analyst intern.

I am responsible for examining individual projects and the pipeline as a whole, and documenting my findings to give the team & myself a better idea of The Data Lab’s portfolio spread.  Developing an R Shiny application that pulls information from our CRM system and creating interactive visualisations has allowed me to refine my time-management and problem solving skills, and should be a useful tool even after I have moved on.

Where next?  In short, I don’t know.  I’ll continue to attend events in Edinburgh, and learn about things that pique my interest.  I chose Informatics since it offers me a broad choice of interesting fields; I’ve been particularly interested in Machine Learning and other areas of AI. However, subjects such as networking à la IPFS & blockchain technologies are also appealing to me.  Whatever I do, I feel more aware of modern business and project processes thanks to The Data Lab, and I’m glad I got the chance to be part of such a dynamic organisation.

 

Rachel Kilburn - Data Analyst Intern 

I’m Rachel Kilburn, a Data Scientist intern here at The Data Lab. Currently I’m half-way through a Mathematics Master’s course at Newcastle University, studying a range of applied, pure and statistics courses. My main responsibilities consist of investigating The Data Lab’s website, Twitter and LinkedIn analytics. This involves collaborating with Trang, the Marketing intern, to produce weekly, monthly and collated reports. In the 5 weeks I’ve been here my R skills have increased dramatically and I have learned to use new R packages such as ggplot, reshape and dplyr. I have been introduced to Google, Twitter and LinkedIn analytics and the world of Shiny. Regularly, I am scheduling Skype meetings with Trang in Aberdeen, which has improved my communication and time management skills. I have also gained an appreciation for the effort that goes into producing simple computer functions that we take for granted.

As for my future, I’m not sure where my career will take me. I really love the subject I’m studying and so I’d like to have a job where I am able to use the skills I’ve learned. I applied for this internship to get a better idea of the career options that are available to me and get an insight into the big data industry to see if this is a career path I might want to explore. I have really enjoyed my time here thus far and look forward to using the new skills I’m learning in the future.

 

Trang Vu - Marketing & Website Analyst Intern 

While waiting for my Master application response from RGU, I was fortunate to get the Marketing and Website Analyst intern position at The Data Lab for what could be my last summer in the UK. As a government-funded organization, The Data Lab is radically different from the other places I have worked at. Our objective is to bring benefits to the community and economy of Scotland through data innovation. 

My daily responsibilities mainly involve working with The Data Lab’s website and social media channels, maximising the use of analytical tools including Google, Twitter, LinkedIn Analytics and Webmaster Tools. I have been working closely with Rachel, our Data Analyst intern, to produce a holistic report on trends and changes in our marketing landscape. Activities such as events and advertisements are recognised to have a significant influence on the performance and engagements of the website and social media. I am also responsible for enhancing the appeal of the website by designing infographics and improving search engine optimisation (SEO). By using Webmaster Tools, I have been helping the Google search ranking of the website improve gradually using sitemap and metadata. 

During my time here, my analytical and communication skills have improved significantly through various tasks. Moreover, I have the opportunity to utilise my strong numerical skills to produce comprehensive reports. With the inspiration from The Data Lab, I have realised my desire to engage with innovative projects involving commerce and technology. I plan to then use these skills to contribute to the economy and community of Vietnam, my home country.

 

The Data Lab team is proud to have these bright and talented interns on board for the summer. We are already experiencing the benefits of their contribution to the organisation, and hope to continue helping them to sharpen their knowledge and skills in preparation for their future careers.

 

The Data Lab also helps MSc students from our programme to find placement opportunities in industries such as oil and gas, healthcare, digital technologies and finance. Find more about The Data Lab MSc programme here.

 

 

July 20, 2016

Teradata ANZ

Can Real-Time Event Analytics Really Trump Doctor Who, The Sonic, And The Tardis?

What is it with the dear old Doctor?

Turn your back for five minutes and he looks like someone else, completely. And all that multi-dimensional travelling – backwards and forwards in time to check up on this or that in the Tardis. It must be affecting his analytical faculties because his solutions are pretty one-dimensional. Shame that, in spite of all the gizmos, he doesn’t have the equivalent of real-time event analytics to help him predict some of the scrapes he gets himself (and his colleagues) into.

Now I think of it, traditional business analytics are a lot like Doctor Who. Or, should I say, the Tardis… stuck in reverse. They report what’s already happened, dwelling on questions like: “have sales increased or decreased?”, “how many new customers joined the company last quarter?”, and “what was the revenue from new product launches?”.

Eyes front

Insightful as this is, continually looking over your shoulder won’t help the business move forward. To have a real impact on the next quarter and the quarter after that, you need to go beyond simply reporting recent history. Examining why events happened and how you can use that knowledge to create actionable insights – that’s the real benefit of modern, shapeshifting analytics: event analytics. They give businesses the qualitative tools to deliver the kind of revenue-generating, cost-cutting, or customer-experience improvements stakeholders demand.

Why-oh-why-oh-why?

Within analytics, traditional quantitative approaches don’t make the cut. Quantitative focuses on counting, summing, and averaging what’s happened. But without context, there’s no insight into human involvement, sentiment, or a hundred-and-one other critical factors.

This is where event analytics unlock real business value. Knowing why sales were down, why customers stopped buying a product, and why complaints increased is the springboard for future performance. Event analytics help you move beyond quantitative numbers to a qualitative understanding of situations. That way, you can tailor your tactical and strategic actions to control business outcomes.

Event-Based Analytics

Event analytics start with the creation of a single analytical data set that defines and captures events from all channels (related to all products and services, for all customers). This enterprise event repository provides a view of how your customers experience your business. Events can be defined in a number of ways. For instance, an encounter involving a specific action and sentiment such as ‘Negative Call-Centre Payment Experience’. The event could be defined by analysing the customer call and finding that the customer’s payment was refused because the amount was above their telephone-banking limit.

Another event might help analyse a customer’s longer-term behaviour. For instance, ‘Poor Network Experience’, where a telecommunications provider’s customer has endured a series of dropped calls due to poor network reception over the past three days. Such events not only capture actions but customer perception, brand loyalty, awareness, and engagement, too. In fact, they measure and qualify the customer experience.

This core enterprise event repository becomes the baseline for a huge range of analytics, providing the potential to ask and answer many questions. The traditional approach of analysing and counting single events is insightful. However, by looking for patterns and associations between events you can get a much deeper understanding of what’s happening, and why. Furthermore, by using predictive techniques in machine learning you can begin to anticipate the next event.
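
As a sketch of the kind of pattern query this enables (my own illustration in Spark, not the specific tooling this post has in mind, and assuming a hypothetical events table with customer_id, event_type and event_ts columns), you might look for customers whose ‘Poor Network Experience’ events were later followed by churn:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("event-patterns").getOrCreate()
import spark.implicits._

// Hypothetical enterprise event repository: one row per customer event.
val events = spark.table("events")

// First poor-network-experience event per customer.
val networkIssues = events
  .filter($"event_type" === "Poor Network Experience")
  .groupBy($"customer_id")
  .agg(min($"event_ts").as("first_issue_ts"))

// Churn events.
val churn = events
  .filter($"event_type" === "Churn")
  .select($"customer_id", $"event_ts".as("churn_ts"))

// Customers who churned after a poor network experience: an association
// between events rather than a count of a single event type.
val churnAfterIssue = networkIssues
  .join(churn, "customer_id")
  .filter($"churn_ts" > $"first_issue_ts")

println(churnAfterIssue.count())

Patterns like this are the raw material for the predictive step described next.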

Shaping your future with real-time action

What events can help you predict churn? And, more importantly, how can you take real-time action to bring potentially-churning customers back into the fold?

Event analytics keep your finger on the consumer pulse, guiding you towards making the right decisions at the right time. Once you’ve detected tell-tale event patterns that lead to a positive or negative outcome, you can keep an eye out for further occurrences. Then you can take evasive action and tackle the underlying issues before a customer ever gets to a negative outcome. These tailored actions not only benefit the business but also enhance the customer relationship.

Event analytics move the focus from making the ‘right numbers’ from your customers – quarter-by-quarter – to developing long-term customer relationships capable of triggering behaviours that lead to increased and more sustainable profitability.

In other words, think of Real-Time Event Analytics as a kind of supercharged Sonic Screwdriver. A commercial game-changer even more prescient than level-4000 Gallifreyan technology.

This post first appeared on Forbes TeradataVoice on 25/05/2016.

The post Can Real-Time Event Analytics Really Trump Doctor Who, The Sonic, And The Tardis? appeared first on International Blog.

Data Digest

Do or Die: 5 Data Analytics Project Pitfalls you Must Avoid

“If I had asked people what they wanted, they would have said faster horses.” Henry Ford


I’ve seen this quote referenced so many times with regard to data and analytics technologies, and it seems perfectly apt. With the rise of Big Data and analytics we have seen a surge of investment in the various tools and technologies – en masse. Yet not everyone is delivering the innovative business insights they were anticipating.  Open source technology has reduced the barriers to entry, making it all the more tempting to implement these tools in a “me too” style. Implementing the same tools as the rest of the crowd and trying to do it better is not likely to benefit you unless there is a clear need for your business.

During my recent research in developing the Chief Data and Analytics Officer Forum, Melbourne,  I came across some of the key reasons why organisations are unable to leverage their data for innovation.

Top 5 Issues to Address:

1.      Lack of an enterprise-wide strategy. A recent post about the data disconnect touched on the importance of a carefully managed data analytics strategy. Data strategy must be effectively communicated across the business and underwritten by well-developed organisational support in order for it to become an inherent part of the way an organisation operates.

2.      Lacking the right skillset. People are often searching for the perfect blend of IT and business experience and there is much debate around whether that skillset should be recruited or built internally. Not having the right skillset at the right time can be fatal for an analytics project.

3.      Disparate information systems. Uniting all relevant data from various legacy systems and differing technologies is a very common challenge. In order for your business insights to be meaningful, they need to be derived from one source of the truth. Crucially, data governance and data quality must underpin every data analytics project.

4.      Failure to identify the business problem. A point that is always highlighted at our conferences is that you must first identify the business need and then design the analytics project. Collecting data in the hope that something meaningful will emerge that will be of use for the business is an incredibly inefficient way of operating – you need to first know what problem you are trying to solve.

5.     Need for a ‘fast-fail’ analytics culture. Building a ‘fast-fail’, learn-quickly, move-forward culture will reap the greatest rewards.  Pilot your project before scaling up: starting small and failing fast minimises the economic loss and proves the project’s viability before you commit to scale.

The pitfalls are many but the general consensus is that an iterative approach to data analytics projects is a must. Don’t be disheartened by failure – expect it.  Focus on the business problem and start asking the right questions in order to tailor your project. If Henry Ford had asked his customers about their day-to-day needs, perhaps he would have got a different answer.


You can hear more on this topic at our Chief Data & Analytics Officer Forum in Melbourne this September. Phil Wilkenden, ME Bank and Richard Mackey, Department of Immigration & Border Protection will be hosting a discussion group addressing how to understand the challenges of data and analytics projects.

By Monica Mina:

Monica Mina is the Content Director for the CDAO Forum Melbourne. Monica is the organiser of the CDAO Forum Melbourne, consulting with the industry about their key challenges and trying to find exciting and innovative ways to bring people together to address those issues – the CDAO Forum APAC takes place in Sydney, Melbourne, Canberra, Singapore and Hong Kong. For enquiries, email: monica.mina@coriniumintelligence.com.
Data Digest

What is the reality of the Chief Data Officer role within the insurance sector?


The insurance sector has been grappling with data for decades and is perhaps one of the oldest institutions to practice what we would now consider to be Data Science and analytics. The inherently risk averse nature of Insurance has historically demanded Actuaries to implement rigorous statistical and mathematical processes to enable the quantification of risk into a packaged, saleable product. Although insurance organisations have long been aware of the applications of data for business gain, there has been renewed interest in Big Data given the diversification of new sources now available through the exponential growth in technology.

This awakening has resulted in the elevation of data to a strategic core asset and laid the ground for the adoption of Chief Data Officers (CDOs) as well as Chief Analytics Officers (CAOs) across a multitude of sectors wishing to capitalise on the benefits that come with being more data-centric. However, the often traditional structure of insurance companies, combined with a highly siloed way of doing business, has slowed the rapid adoption of CDOs/CAOs seen in other industries. So what is the reality, or perhaps the perceived reality, of the role of CDOs and CAOs within insurance?

In the lead up to the inaugural Chief Data Officer Forum, Insurance, I conducted a survey with our speaker faculty to better understand their thoughts on the current status of the CDO and CAO roles.

Where do you think the Insurance sector currently stands when it comes to the maturity of the CDO/CAO role?

Heather Avery, Director, Business Analytics, Aflac

“While Gartner estimates that 90% of large organizations will have a CDO by 2019, the insurance industry has largely disregarded this trend and instead opted for including related CDO roles into chief risk officer, chief analytics officer, and/or CIO roles.  In essence, the demand for CDO duties is met through a myriad of existing roles.  This can be thought of as common practice in the insurance industry and I don’t see this changing in the near term.”

Guizhou Hu, VP, Chief of Decision Analytics, Gen Re

“Most life insurance companies still do not have CDO or CAO yet. Data management are treated as an IT function and analytics are done mostly by traditional Actuary using Excel and Access.
Generally speaking, insurance company CDO/CAO roles are much better defined or matured in the marketing sector. While CDO/CAOs are mostly non-existent in traditional underwriting and actuarial pricing or valuation.”

Meghan Anzelc, VP, Predictive Analytics Program Lead, Zurich North America

“I would say that the insurance sector is still in the early stages for these roles. CDO and CAO roles are relatively new across all industries. Some insurance carriers have created these roles, others have continued to use the organizational structures already in place (such as having analytics teams sit within each function), and others have taken a more hybrid approach, where there is some centralization of analytics talent and some sitting with each functional area.”

Eric Huls, Chief Data Science Officer, Allstate Insurance Company

“The maturity of the CAO role really follows the maturity of the analytics function within a company. Over the last few years we’ve seen all industries, including insurance, begin to lean heavily on data science to stay ahead of the competition and drive strategic decision making. Now that the appreciation for Data Science has taken root, companies are relying on the CAO and CDO roles to set the strategic vision, not only for tomorrow, but for several years into the future. This is a good thing, as the CAO and CDO will be at the forefront of innovation, and the catalyst for driving a data-driven culture within their organization. They’re in the early stages of maturity, but growing quickly. Very soon, I anticipate that these roles will no longer be optional, but rather one of the most critical voices within an enterprise.”

TJ Houk, Chief Data Officer, Trupanion

“The insurance sector is positioned to lead in maturing the CDO & CAO roles.  Effectiveness as a CDO and CAO is largely dependent on the extent to which there is already a culture of using and understanding data.  Traditionally, insurance has had that discipline, so it’s positioned to expand on it.  It’s the industry’s opportunity; we’ll see if it adapts quickly enough to take advantage of it.”

In conclusion, there is a general consensus that the CDO and CAO roles are still in their infancy in insurance. In essence, insurers must compound their data assets and build infrastructures which allow for environments of innovation and analytics to survive, and ultimately thrive in today’s increasingly connected world of business. Legacy systems and processes have resulted in the accountability of data exploitation sometimes falling into  the hands of more traditional job titles, however, this does not necessarily mean that the responsibilities of  a CDO/CAO aren’t being realised.

So must Insurers appoint a CDO or CAO to further their journey of data centricity? Or should the emphasis be placed on the fulfillment of the responsibilities held by such a position?

By Andrew Odong: 

Andrew Odong is the Content Director for the Inaugural Chief Data Officer Forum, Insurance 2016. If you wish to learn more about the status of the Chief Data Officer & Chief Analytics Officer within the Insurance sector, join us on the 15th of September in Chicago, Illinois. 
Data Digest

Big Opportunities for Big Data in the Asia Pacific


It goes without saying that big data will be a game changer for businesses and governments around the world. The rate of investment and proliferation of big data adoption transcends across industries and the enablers for business success are being re-written thanks to rapidly evolving analytics capability. From predictive analytics and cognitive computing to machine learning and AI, the future for governments and businesses within the data driven economy is filled with opportunity and promise!

For the Asia-Pacific region, organisations are slowly taxiing onto the runway of big data and analytics maturity.  Most are yet to take-off with full scale implementation and the vast majority are currently taking an ad-hoc and opportunistic approach. For cities such as Hong Kong, rooted as one of the world’s leading hubs for financial services and home to an array of data rich industries including banking, communications and media, transport and logistics as well as a burgeoning start-up scene, the opportunities to capitalise on these data assets are immense. 

There are however 3 key challenges that need to be collectively addressed to develop the foundations for future success.

1. Leadership and Vision

Organisations and businesses in the region need to demonstrate leadership and vision if they intend to reap not only the customer-focused benefits of data but also top-line growth and bottom-line results.  For instance, Hong Kong, with its mature economy and competitive business environment, relies heavily on the economics around investment and ROI when making commercial business decisions.  With industries such as financial services and real estate forming the bedrock of the city’s economy, it’s understandable that a conservative approach to business would be the predominant mindset within the city’s leading c-level community. If, however, Hong Kong, and indeed the region, wishes to accelerate the benefits of big data and analytics adoption, it will need to embrace a greater degree of openness toward innovation and shift its position on risk aversion. 
Additionally, the key drivers for big data implementation will have to come from the business side of the organisation rather than the IT side if companies truly seek to achieve success with their big data and analytics initiatives.  A common perception within the region is that big data falls under the purview of the ‘IT team’ and so it’s up to them to push the case and implement adoption.  The reality of course is that if organisations are to adopt big data initiatives at an enterprise level, leaders at the c-level need to develop the vision, culture, technology and processes holistically required if they’re serious about reaping the transformational benefits of big data. Building a data culture alone takes time and is a journey in itself. Business leaders need to be at the forefront of this movement leading the charge.

The Chief Data and Analytics Officer Forum Hong Kong will address this issue as a whole with perspectives from government, industry and business all present to deliver their viewpoints. 

2. Ecosystem for Talent Creation

Another area that presents a challenge for the Asia Pacific region but significant opportunities for its citizens is the need for developing an ecosystem for talent creation.  When it comes to building the data workforce, the region is 3 to 5 years behind Europe and the US with Singapore’s Infocomm Development Authority alone projecting a shortfall of nearly 30,000 cybersecurity, data analytics and applications development professionals by 2017.  The key to addressing this challenge will be the need to bring policy makers, public sector organisations, academic institutions and private sector players together to create an environment where technology and data talent can flourish and thrive. 
Similarly, now is also the time for organisations to be strategic in positioning themselves at a competitive advantage as the land grab for talent intensifies over the coming decade.  Building, attracting and retaining the next generation of data scientists, data professionals and analytics experts will be the key to long term success and those that fail to act will be left to languish.

Bertrand Chen, Lead Data Scientist at Asia Miles, will deliver a keynote presentation at the Chief Data Analytics Officer Forum Hong Kong and will articulate his viewpoints on the keys to building the right data science team.  With many organisations currently chasing the elusive ‘all-in-one’ data scientist who possesses both the technical and business skillsets, Bertrand will offer an alternative roadmap built around 4 key roles within any data team.

3. Regulatory Reforms

Finally, on a macro level, regulatory reforms around compliance, privacy and security play a crucial part in nurturing the environment for big data acceptance in the Asia Pacific.  Technological innovation continues to outpace the regulatory landscape. Challenges associated with data protection, cross border data sharing, data security and data breaches are all significant issues that need to be confronted. Dr Henry Chang from the Hong Kong Privacy Commissioner for Personal Data will be one of the many speakers who will address these issues at the Chief Data Analytics Officer Forum Hong Kong.


You can hear about all these issues and more at the Chief Data and Analytics Officer Forum Hong Kong scheduled for the 4th and 5th of October. With over 25 speakers from a cross-section of industries and sectors, this conference will comprehensively address the big data opportunities and challenges of the Asia Pacific region.

By Fasih Qureshi:


Fasih Qureshi is the Content Director for the CDAO Forum Hong Kong. Fasih is the organiser of the CDAO Forum Hong Kong, consulting with the industry about their key challenges and trying to find exciting and innovative ways to bring people together to address those issues – the CDAO Forum APAC takes place in Sydney, Melbourne, Singapore and Hong Kong. 
For enquiries, email: info@coriniumintelligence.com
Data Digest

Building a team with capabilities to deliver actionable insights


Kyle Wierenga is the Director of Analytics and Measurement at Aimia Inc. and will be presenting at two Corinium events later this year - Chief Analytics Officer Forum Africa and Data & Analytics Leaders Forum UAE

I spent some time with Kyle understanding the roles he's had, what he's focused on currently and what he'll be covering at the Forums. 

Kyle, can you please give us an introduction into yourself and your role at Costco?

My role with Costco was to build an Advanced Analytics team to build decision support. I built the team capabilities from a focus on descriptive ad hoc analytics to strategic predictive modeling and prescriptive insights. My current role with Aimia is Director of Analytics and Measurement. In this role I am focused on building a team with the capabilities to deliver actionable insights to our customers quickly. Helping our customers react to changes in their customer base buying patterns quickly will help them to respond in the most effective way with the hope of strengthening that relationship.

Perhaps you can give us a brief history of the journey Costco has taken in terms of data analytics. South African/Middle Eastern companies are, generally, at the very beginning of their journey into data analytics and it would be valuable to understand what they’re likely to face

I think the journey at Costco was similar to most companies investing in analytics hoping to gain a competitive advantage. Most of our reporting had been in Excel with some conditional formatting to make the most important details stand out. We moved to dashboards that help our users see bigger trends and consume data much easier. This helped speed up decisions and raised the quality of those decisions. We also started increasing our predictive modeling practice through which we were able to show some very good ROI, this gave the program great traction with the executive team.

Most people assume that companies in the US are at a very mature stage of data analytics. Is this the true reality or is there still work to be done?

I have attended many conferences, talking to many people representing their companies’ analytics programs; here’s what I learned. Each industry seems to be at a different level of maturity with their programs. For instance banking and insurance rely heavily on risk analysis so they are at what I would call a high level of analytics maturity. Their business depends on getting this right. Other sectors like retail are in a highly competitive market and are at mixed levels of maturity. Costco was an industry laggard in analytics but I feel like we brought them up to the middle of the pack where Costco is comfortable being positioned. Retailers like Amazon, Walmart, and Target are on the cutting edge of analytics in retail and are receiving the benefits.

What we’ve found is that South African/Middle Eastern companies face very similar challenges to data & analytics professionals in other markets Corinium works in. What do you feel are the 3 biggest challenges companies in the US face with regards to data analytics?

The first and most significant challenge I ran into was receiving executive buy-in. Costco is a successful company, which I believe makes change more difficult. "Sure, analytics might help me, but I'm doing really well, so why do I need analytics to help my decision making?" Getting the point across that we are trying to help them be more successful, not replace their industry knowledge with a machine, was key. Second is gaining the approval to hire the people and then finding the people with the right skills. Third is assessing the right tools for the organization's level of maturity and winning purchasing approval.

Your case study/discussion group and workshop at CAO Forum Africa/Data & Analytics Leaders Forum UAE is centered around building organizational trust in analytics. Can you expand on how you achieved this at Costco and other companies you’ve worked at?

I believe the key is getting to know your executives and really understanding the Key Performance Indicators they are tasked to deliver on for the company. For instance, if marketing has a goal to reduce customer churn, how can we deliver insights that help them achieve higher KPIs? In this example we were able to do a much more accurate churn analysis report, pair it with Member Lifetime Value analysis, and come up with a marketing list they could take action on to be more successful. I also worked on HR analytics for a shoe/apparel company to address the problem of employee turnover in their call center. By utilizing analytics we were able to find the greatest cause of turnover and build a program to address it. This lowered the turnover percentage by almost 5 percent and built trust in the analytics program.

What can attendees expect to take away from your workshop? What are the 'actionable insights'?

My hope would be that those attending the workshop walk away with ideas for how they can build more support from their executives for a robust analytics program in their company. The actionable insights will come from the many case studies I have gathered while building several successful analytics programs. I hope to have some great discussions around the problems attendees are facing in their own programs so we can explore possible solutions together.

By Craig Steward:

Craig Steward is the Content Director for the Chief Analytics Officer Forum Africa taking place 26-29 September in Sandton, South Africa and the Data & Analytics Leaders Forum UAE taking place 9-12 October in Dubai, UAE. Join Craig and other data and analytics professionals later this year.
Data Digest

Organising for Analytics Success - Centralising vs. Decentralising


Over the course of the past 3 months I've conducted in-depth research with senior analytics professionals in South Africa. The research group has been spread across a number of industries and a variety of company sizes.

One of the key topics that has come up time and again is that of how to organise for analytics success. I suppose there are myriad ways this concept -  'organise for analytics success' - can be interpreted. And it's clear that there is no one answer as it ultimately depends on each individual business.

Given that Corinium focuses on Chief Analytics Officers, what I was specifically looking for was whether or not a CAO is critical to analytics success. Especially given that, in South Africa, there is only one titled Chief Analytics Officer.

It stands to reason that a small company has to have their analytics centralised. There's likely to be only one person looking after analytics/insights/BI. So, we knock them out of the conversation.

As we know, the analytics team needs to have an acute understanding of the business and the business unit they are working in. To be able to build models and derive insights, it's important that there is some context to the objectives of the business unit as well as the problem the analytics team is solving for.

It's based on this premise, then, that many Heads of Analytics (and similar) believe that analytics has to be decentralised. Deploy a Head of Analytics into each business unit, allow them to work alongside the business owners and build insights with specific knowledge of the customer and the product.

This structure makes perfect sense, except when you take into account that there is a distinct shortage of people who can build advanced analytical models, understand the business, and lead a team while engaging effectively with the business.

Finding one strong, effective Head of Analytics is a big ask. Finding enough to satisfy all your business units is a near impossible task. 
So then, does it make sense to focus on finding that one unicorn? That Head of Analytics that is strong across all areas - technical and business. Call them a CAO or don't - that's not the debate here. The debate is whether or not this central person can control a group of Data Scientists, Statisticians, Actuaries etc. and deploy them into business units as and when they're needed.

In a perfect world you might have a CAO who has true C-level standing with a mandate, influence and accountability that has a team of Analytics Heads stationed in each business unit. The CAO would be responsible for the overall analytics strategy for the business and would also have a direct relationship with the Chief Data Officer (this is in a perfect world, remember?) to ensure that organisational data is of the highest quality and easily shared across the whole organisation.

Have I answered the question? Do you now know how to organise for analytics success? Probably not. I think more research still needs to be done, but I also think there is no one-size-fits-all solution to this.

I'm interested in your thoughts. Let's start the conversation here and perhaps some ideas will come out - for and against both arguments - that can be incorporated in a hybrid model.

By Craig Steward:

Craig Steward is the Managing Director EMEA for Corinium Global Intelligence and Content Director for the Chief Analytics Officer Forum Africa taking place 26-29 September in Sandton, South Africa and the Data & Analytics Leaders Forum UAE taking place 9-12 October in Dubai, UAE. Join Craig and other data and analytics professionals later this year.




Teradata ANZ

Mobile App Fishing For Retailers

Amazing, isn’t it.

Back in December (when asked to write this blog) I was brimming with ideas. Now, all those great insights are just echoes and as soon as I put finger to keyboard, I’m lost.

The fact is, I haven’t a clue where to start.

Much the same could be said about the retail industry. Technology is transforming the way customers shop; how they engage with retailers and vice versa. There’s a lot going on, but it’s not always obvious how to join the revolution.

‘Customer journey’ is the business focus and retailers are having to come to terms with the way new digital capabilities are affecting bricks-and-mortar stores, as they try to create a personal and consistent, multi-channel shopping experience.

Cutting through the digital noise

Data analytics help retailers understand shopping habits and behaviours in detail. They pinpoint the best time and place to communicate personalised offers and messages, ensuring that customers navigate seamlessly across the various channels to their purchases.

Many retailer initiatives are still in their infancy, though. Different technologies – particularly mobile and in-store digital enablement – have been piloted but not yet implemented on a strategic basis. The fact is, there's so much going on, and the pace of change is accelerating so quickly, that retailers can't get a handle on what to do first.

All together now

Recently, I spoke at a retail-innovation forum on this very subject: ‘Using Customer Insight to Optimise the Customer Journey and Experience’. The presenter before me, a mobile network provider in Indonesia, talked about how they had exploited mobile technology and advanced analytics to enhance the Indonesian fishing industry (from retail to fishing – not such a leap as you might imagine). What they did was as practical as it was creative, and I think we can all learn a lot from them.

In essence, they created mFish; a mobile application which uses mobile broadband services (available out at sea, which is a challenge) to create a connected community. This community promotes sustainable fishing practices, encourages efficiency and, importantly, strives to make day-to-day life as safe as possible for fishermen.


The Transformers v The Procrastinators

mFish allows everyone in the fishery supply chain to navigate, share their location, and keep up-to-date with the latest weather conditions. Plus, fishermen have access to trade tools that link them with markets, together with real-time information on demand and pricing. mFish also enables mobile money transactions via its mobile wallet, allowing seamless interchanges between sellers and buyers.

“So, what’s that got to do with me?”, you ask. Well, think about it. A whole industry has been transformed by technology, creating a digitally-enabled business model. It fully embraced the concept and delivered a whole new way of working based on that digital business model. The initiative has been thought through, implemented, and adopted in daily commerce, while the retail industry has yet to get off the ground, contenting itself with cycling through pre-flight checks for much of the same technology.

There’s a lesson here for everyone. Digital can revolutionise your business. Think it through from the customer’s perspective. Make a start. You’ll be surprised what can be achieved.

Get in the groove, Daddy-o

Taking the story a step further, my son is studying Marine Biology at Bangor University. When I told him about mFish, it sparked a major discussion on how new technologies and advanced analytical capabilities were going to change the world. To cut a long story short, he told me to get with the program. He said he’d been using Python and R to analyse the results of his experiments for ages, and went on and on about how easy they were to use (pretty remarkable seeing as he flunked Maths).

And there was I, the old man, assuming I was opening his eyes to a whole new way of thinking. I remember PLAN, Fortran, and paper-tape readers; Wow! Was that 30 years ago?

Anyhow, enough of that. I must start this blog…

Next time, I’ll take you on a hilarious journey of a data geek – from what seems like the stone age, to the present day… JP

This post first appeared on Forbes TeradataVoice on 04/05/2016.

The post Mobile App Fishing For Retailers appeared first on International Blog.

Data Digest

Here Comes the Juggernaut, the Public Cloud


For a long time, I was not convinced about the power of the public cloud. Naturally! I, like many others, thought that it was a sideshow, one which would mostly cater to startups and some medium-sized companies. However, I discovered I could not have been further from the truth.

The cloud, from the very beginning, has been a confusing term, at least to me, because it grew from two different sources: Salesforce and Amazon Web Services. Salesforce.com's original CRM portal became the definition of Software as a Service (#SaaS). What made Salesforce famous as a cloud offering was the creation of the Force.com platform, upon which different applications could be built.

The mighty conquest by AWS started with the introduction of its first weapon in 2006, Simple Storage Service, or S3. This started a brand new battle, which would later grow into a huge war: the cloud war. The belligerents were Amazon on one side and old-guard infrastructure providers like Dell, HP, EMC and Cisco on the other. As you can probably guess, all the casualties happened on the non-Amazon side. The second weapon AWS introduced was Elastic Compute Cloud, or EC2, to power compute and a new assault on the old guard; and with this, weapon after weapon and battle after battle, Amazon kept winning. Of course!

Amazon, being smart, did not start by targeting enemy strongholds (aka big enterprise customers) but smaller outposts (aka SMBs and startups). These customers were relatively insignificant to the big players. So, for a long time a lot of folks (me included!) thought that Amazon's war was an SMB war and would never expand into bigger territories.

Obviously Jeff Bezos, the commander of Amazon's army, had something else in mind. He had already won the retail war a few years back. The memory of the casualties and bloodshed in the retail war still haunts surviving retailers. That war left countless graveyards, among them Circuit City, Sports Authority and Borders.

The second phase of war was fought on two fronts. One was an assault on enterprise strongholds, and the second was the introduction of a new weapon class called PaaS, or Platform as a Service.

I call Amazon's PaaS offerings the Kirkland phenomenon. It is where Costco puts a cheaper, competing Kirkland product next to a vendor's product, cannibalizing vendor revenue while providing more choices to the buyer. I in no way mean that AWS offerings are inferior (nor are Kirkland's). On the contrary, they are simpler, and in that sense, superior. Amazon's PaaS offerings are not targeted at destroying a few whales the way its IaaS offerings are; they are targeted at a large number of platform vendors, both large proprietary ones like Teradata, SAP and Oracle, and companies who build their platforms around open source (open core). The open-source platforms make this very interesting, as they were set up to target proprietary software and its vendors. Now both are being threatened by simple services built on AWS.

At present, a lot of clients are shying away from these PaaS offerings because they do not want to get tied to a specific cloud vendor, but this will not last long. I saw the same phenomenon with database vendors in the early 2000s, when clients would not even use a database-specific driver like OCI. Within a few years that changed, once companies realized that they were already an Oracle (or some other DB vendor's) shop. In the next two to three years you are going to hear the same for the public cloud: this company is an AWS shop, or an Azure shop, and sometimes a Google Cloud shop (for a change). Once that happens, companies will prefer the simpler, already integrated PaaS offerings from these cloud vendors.

There are already competing services available, like AWS Kinesis and Azure Event Hubs for Kafka, AWS EMR and Azure HDInsight/Data Lake for Hadoop/Spark, and many others. All of these services have been built from the ground up with the cloud in mind, according to Andy Jassy, CEO of AWS, and I could not agree more. Most of these services do not have the rich functionality provided by platform vendors, but most clients do not really need richer functionality. Most of the features of any platform or software go unused.

As for the second front of this stage of the war, the enterprises: they had different concerns, which were stopping them from moving to the cloud. These concerns revolved around security, data governance and other enterprise features. Now AWS and the other cloud vendors have crossed the inflexion point where they are believed to be as secure as on-prem installations, if not more so.

Another point (there is always another point!) is price: all three cloud vendors have a huge advantage here because they are cash-rich thanks to their other businesses. It means they can wage a pricing war that will not make any of them blink, but will drive the rest of the traditional (and even cloud!) vendors into the ground. Amazon has maintained very low margins for a very long time, and the stock market has not complained. So yep, there is a track record out there.

Are other vendors putting up no defense? In fact they are, in the form of the private cloud and the hybrid cloud. However, the private cloud is dead on arrival, and the hybrid cloud I consider a migration strategy to the public cloud rather than a sustainable solution.

So now, back to the future. When these players have won the war, what will the new world look like? How will the survivors align? First, the biggest winners are going to be customers, just as in the retail war. Second come the beneficiaries who, although not exactly winners, stand to gain: consulting companies like us. But that will depend upon how fast we align ourselves to serve the new masters.

Once this rebuilding is done, the IT world will be simpler, leaner, richer in functionality, and more beautiful. Or maybe I am starting to sing the praises of the new masters too soon!

P.S.: This blog was influenced, in part, by the great insights John Furrier and Dave Vellante shared on SiliconANGLE Media.

By Rishi Yadav:

Rishi Yadav is President & CEO at InfoObjects (Big Data Analytics, IoT, Data Science).

Curt Monash

Notes on vendor lock-in

Vendor lock-in is an important subject. Everybody knows that. But few of us realize just how complicated the subject is, nor how riddled it is with paradoxes. Truth be told, I wasn’t fully...

...

Curt Monash

Notes from a long trip, July 19, 2016

For starters: I spent three weeks in California on a hybrid personal/business trip. I had a bunch of meetings, but not three weeks’ worth. The timing was awkward for most companies I wanted to...

...
 

July 19, 2016

Big Data University

This Week in Data Science (July 19, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

The post This Week in Data Science (July 19, 2016) appeared first on Big Data University.

Roaring Elephant Podcast

Episode 20 - Dave's Hadoop Summit San Jose 2016 Retrospective - Part 2

In this second part, we discuss the sessions that Dave attended at the San Jose Hadoop Summit and we go in depth on some related topics. Since we ran over an hour with the main topic, and we did not want to make this a three-parter, we decided to forgo the questions from the audience just this one time... Read more »
Teradata ANZ

When Perfection is the Enemy of Good

While consulting on machine-learning and data-mining initiatives with companies around Australia and New Zealand, I commonly come across an objection to the proposals we are putting forward. This objection is that the model we are seeking will not be 100% accurate – that is to say, it will not be the perfect model.


This objection is entirely valid and at the same time entirely irrelevant. It is true to say that even the data a company collects are never ‘perfect’. Take a marketing attribution project, for example. The aim is to capture digitally the entire decision-making process which led to a successful conversion, and then to attribute the value of that sale to the events which preceded it. The data generated as a customer interacts with a company captures some aspects of the decision-making process, but it doesn’t perfectly represent the entirety of the journey – and it never will. Humans are complex psychological and emotional beings and much of our decision making is subliminal – hidden even from ourselves. Data which records customer touch points and multi-channel interactions will not embody the entirety of this process.

Furthermore, the attribution model we choose will not perfectly attribute the value of the outcome to the various steps in the process. Whether we use last click, first click, exponential decay, weighted uniform distribution or some combination thereof, the model may be a good approximation, but it will never be perfect. The well-known statistician George Box even went as far as to say that “All models are wrong”.
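To make the comparison concrete, here is a small illustrative sketch (not any vendor's implementation) of how three of these rules would split the value of a single conversion. The touchpoint names, conversion value and half-life are all invented for the example.

conversion_value = 100.0
touchpoints = ["display_ad", "email", "paid_search", "direct"]  # hypothetical journey, oldest first

def last_click(points, value):
    # All credit goes to the final touchpoint before conversion.
    return {p: (value if i == len(points) - 1 else 0.0) for i, p in enumerate(points)}

def first_click(points, value):
    # All credit goes to the touchpoint that started the journey.
    return {p: (value if i == 0 else 0.0) for i, p in enumerate(points)}

def exponential_decay(points, value, half_life=2.0):
    # Credit halves for every `half_life` steps a touchpoint sits before the conversion.
    weights = [0.5 ** ((len(points) - 1 - i) / half_life) for i in range(len(points))]
    total = sum(weights)
    return {p: value * w / total for p, w in zip(points, weights)}

for model in (last_click, first_click, exponential_decay):
    print(model.__name__, model(touchpoints, conversion_value))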

So does that mean that we should not bother with attribution? No. Because Box also said that “some models are useful”. Rather than comparing the model we can produce with an ideal model which can never exist, we should compare it with the status quo. We should then do a business case evaluation to calculate the ROI on undertaking the model development process. If the model has a good likelihood of improving on the status quo and the effort involved in developing and running the model doesn’t outweigh the expected improvement, it is worth building the model.

This pragmatic approach to model development is a tried and tested means by which companies can reap the benefit of machine learning and advanced analytics and exploit the value inherent in their data assets.

The post When Perfection is the Enemy of Good appeared first on International Blog.

 

July 18, 2016

VLDB Solutions

An Introduction to Primary Indexes and Distribution Keys

Primary Indexes and Distribution Keys

It’s All About Data Distribution.

As experts in Massively Parallel Processing (MPP), here at VLDB Solutions we talk regularly about 'Primary Indexes' (PI) and 'Distribution Keys' (DK). They are integral to the architectures of Teradata and Greenplum respectively, and correctly identifying and employing them is 'key' to maximising the performance of both MPP systems. But how do they work?

Data Distribution

Before we examine each in detail, it is first important to understand how data is stored and accessed on an MPP system, and how the distribution of data helps a system achieve true parallelism.

Within an MPP system, data is partitioned across multiple servers (referred to as AMPs in Teradata, and Segments in Greenplum). These servers 'share nothing' – they each process only their own share of the data required by a query, and do not process data located on other servers in the system. If data is not partitioned across all of the available servers, then those servers without data sit idle during the processing of a workload, and the full power of MPP has not been harnessed. The data is considered 'skewed', and therefore the query will be skewed too.

Primary Indexes and Distribution Keys are, as the name suggests, the key by which data is distributed across the servers. They are designated at a table level within the database, turning a column, or a selection of columns, into the key for each row of data.

If the Primary Index or Distribution Key for each row of data within the table is unique, then the rows are distributed across the servers in a ‘round robin’ manner. When the data is queried, each server has an equal share of the workload, and the system’s full power has been harnessed; the system has achieved true parallelism.

If the Primary Index or Distribution Key is not unique, rows with duplicate Primary Indexes or Distribution Keys are grouped on the same server. In a table where there are many matching key values, this can lead to skewed data and therefore skewed performance.

For example, if an MPP system had 10 available servers, but a table was created with a Primary Index / Distribution Key on a column containing only two distinct values (1/0, True/False, Yes/No, etc.), then that data would only be distributed to two servers, as the rows with matching key values would be stored together. When that data is queried, only those servers with data can process the workload; the remaining eight servers sit idle.
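The effect is easy to simulate. The sketch below is not how Teradata or Greenplum actually hash rows; it simply mimics the idea with Python's built-in hash to show how a unique key spreads rows across ten servers while a two-valued key piles everything onto two of them.

from collections import Counter

NUM_SERVERS = 10

def distribution(rows, key):
    # Count how many rows land on each server when hashed on `key`.
    return Counter(hash(row[key]) % NUM_SERVERS for row in rows)

rows = [{"order_id": i, "is_returned": i % 2} for i in range(10000)]

print(distribution(rows, "order_id"))    # unique key: roughly 1,000 rows on each of the 10 servers
print(distribution(rows, "is_returned")) # two values: all 10,000 rows on just 2 servers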

Unique and Non-Unique Primary Indexes

As mentioned earlier, data distribution in a Teradata system is governed by a table’s Primary Index. Primary Indexes can be designated as either Unique or Non-Unique, depending on the column, or selection of columns, that has been chosen as the Primary Index. If a Primary Index is determined as a Unique Primary Index (UPI), then duplicate values are no longer allowed within the chosen column / selection of columns, and will not be loaded to the system. However, those rows that are loaded will be distributed evenly across the AMPs, and the system will achieve parallelism when that table data is queried. If a Primary Index is determined as Non-Unique Primary Index (NUPI), then duplicate values will be allowed within the column / selection of columns, but they will be grouped on the same AMP when the data is distributed.

Distribution Keys and Distributed Randomly

The Distribution Key is how data is distributed on a Greenplum system. Unlike Teradata, the key is not declared as unique or non-unique; it merely is, or is not. Again, as with Teradata, table data with a unique Distribution Key is distributed ‘round robin’ across the Segments; and duplicate Distribution Key values are grouped together on the same segment.

However, Greenplum tables can also be designated as ‘distributed randomly’. In this case, column data is not used for distribution, and each row is distributed to the Segments in the same ‘round robin’ manner as when using a unique Distribution Key.

 How to Choose Primary Indexes and Distribution Keys

As should now be clear, correctly identifying which column, or selection of columns, to use as a Primary Index or Distribution Key is integral to the performance of an MPP system. In a Relational Database Management System (RDBMS), a table's Primary Key is often a natural candidate to become the Primary Index / Distribution Key – a column of values uniquely identifying each row could easily be used as a UPI for best distribution.

However, there will be times when a table does not have a single column of unique values, but where ‘round robin’ distribution is still desired – ‘reference’ or ‘dimension’ tables, for instance. On a Greenplum system, this could be achieved by distributing randomly; but on a Teradata system, it would be necessary to identify a selection of columns, where the combination of data within those columns would be unique on a row-by-row basis.

The post An Introduction to Primary Indexes and Distribution Keys appeared first on VLDB Blog.

 

July 16, 2016

Jean Francois Puget

Installing XGBoost on Mac OSX

OSX is much better than Windows, isn't it?  That's common wisdom, and it seemed to be confirmed once more when I installed XGBoost on both operating systems.  Before I dive in, let me briefly describe XGBoost.  It is a machine learning algorithm that yields great results in recent Kaggle competitions.  I decided to install it on my laptops: an old PC running Windows 7, and a brand new MacBook Pro running OSX.  I thought the OSX installation would be a no-brainer compared to the Windows one, as explained in Installing XGBoost For Anaconda on Windows.

Reality is a bit different, and the OSX installation isn't as smooth as it seems.  To be accurate, the default OSX installation of XGBoost runs in single thread mode, as explained in these instructions.

Why is this a problem?  Because XGBoost is a machine learning algorithm, and running it may be time consuming.  I decided to install it on my computers to give it a try.  I am currently working on a dataset with only about 100k rows (samples), and tuning XGBoost on my old Windows laptop (a Lenovo W520) takes about 2 hours.  What surprised me is that it takes 7 hours on my brand new MacBook Pro!  That is a bit weird, given they both have Intel i7 quad-core CPUs, and given that the Mac's clock speed is higher.  Add to this the premium price of the Mac, and you can see why I was really surprised.

I further observed that other CPU-intensive tasks are faster on the MacBook Pro.  Something was definitely wrong, but the culprit was easy to spot: it is all about XGBoost being single-threaded on OSX.

Before I explain how to enable multi threading for XGBoost, let me point you to this excellent Complete Guide to Parameter Tuning in XGBoost (with codes in Python).  I found it useful as I started using XGBoost.  And I assume that you could be interested if you read this far ;) 

Back to XGBoost: the installation instructions do explain how to get the multi-threaded version of XGBoost.  Unfortunately, they did not work for me.  The following is what worked for me; I am sharing it in case it helps others.  I had to perform the following steps:

  • Get Homebrew if it is not installed yet.  Indeed, this is a very useful open source installer for OSX.  Installing it is straightforward: open a terminal, then paste and execute the instruction available on the Homebrew home page. I reproduce it here for convenience:
    /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"
  • Get gcc with OpenMP.  Just paste and execute the following command in your terminal, once the Homebrew installation is complete.
    brew install gcc --without-multilib    
    This automatically downloads and builds gcc.  It can take a while, it took about 30 minutes for me.  Be patient.
  • Get XGBoost.  Go to where you want in your filesystem, say <directory>.  Then type the git clone command and execute it:
    cd <directory>
    git clone --recursive https://github.com/dmlc/xgboost 
    This downloads the XGBoost code into a new directory named xgboost.
  • Next step is to build XGBoost.  By default, the build process will use the default compilers, cc and c++, which do not support the OpenMP option used for XGBoost multi-threading.  We need to tell the build system to use the compiler we just installed.  That's the step that was missing from the installation instructions on the XGBoost site.
    There are various ways to do it; here is the one I used.  I uncommented the following lines in make/config.mk:
  • Go to where we downloaded XGBoost
    cd <directory>/xgboost
  • Then open make/config.mk and uncomment these two lines

export CC = gcc
export CXX = g++

  • We then build with the following commands.
    cd <directory>/xgboost
    cp make/config.mk .
    make -j4
    
  • Once the build is finished, we can use XGBoost with its command line.  I am using Python, hence I performed this final step.  You may need to enter the admin password to execute it.
    cd python-package; sudo python setup.py install

This concludes the installation. 

I tested it with my Anaconda distribution with Python 3.5.  It worked fine, and I could run XGBoost.  The speedup thanks to multi-threading is noticeable, and my MacBook Pro is now faster than my old PC.
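If you want to check that the rebuilt package really is using all your cores, one quick (and admittedly rough) test is to train the same small model with different nthread settings and compare the wall-clock time.  The data below is random and serves only as a timing workload.

import time
import numpy as np
import xgboost as xgb

# Random data purely as a timing workload.
X = np.random.rand(100000, 50)
y = (np.random.rand(100000) > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

for nthread in (1, 4):
    params = {"objective": "binary:logistic", "max_depth": 6, "nthread": nthread}
    start = time.time()
    xgb.train(params, dtrain, num_boost_round=50)
    print("nthread=%d took %.1f seconds" % (nthread, time.time() - start))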

Updated on July 16, 2016.  Makefile changed in xgboost, making it easier to use gcc.
 


Simplified Analytics

Digital disruption in Customer Experience

In an age of digital disruption, great customer experience has become do or die.  Digital technologies such as Analytics, Mobile, Cloud, Gamification, Cognitive computing, Artificial...

...
 

July 15, 2016

Knoyd Blog

Attention movie fans! Learn how to make the most of your geeky San-Francisco-film-locations tour!

In this edition of the Knoyd blog we will take a look at movie locations in San Francisco. Using the Google Places API and the IMDb API, we selected places in “The Golden City” that every movie fan should visit while they are in town.

The original dataset was downloaded from SF OpenData site, which provides many datasets about San Francisco. Apart from the already mentioned movie locations, you can find there, for example, information on all exhibitions hosted by the San Francisco airport, the Mobile Food Facility Permits, the Aircraft Noise Complaint Data, the Air Traffic Passenger Statistics, etc.

Our base dataset included the following columns:

  • Title (name of the movie)
  • Release Year
  • Locations (identification of the location)
  • Fun Fact (if available)
  • Production Company
  • Distributor
  • Director
  • Writer
  • Actor 1 (the protagonist of the movie)
  • Actor 2 (the secondary protagonist of the movie (if available))
  • Actor 3 (the tertiary protagonist of the movie (if available))

A location feature, which uniquely identifies the place, was included. However, the longitude and latitude information was missing, so we were not able to plot these locations onto a map right away. We found the geo coordinates for all the places using the Google Places API, and plotted them on the map using the Python library gmplot.

Map of all movie locations in San Francisco.
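The post does not include the lookup code itself, so here is a minimal sketch of those two steps, assuming you supply your own key (the GOOGLE_API_KEY placeholder) and using the Places text-search endpoint plus gmplot for the map. The three sample location names are taken from the dataset above.

import requests
import gmplot

GOOGLE_API_KEY = "YOUR_KEY_HERE"  # placeholder
PLACES_URL = "https://maps.googleapis.com/maps/api/place/textsearch/json"

def geocode(location_name):
    # Return (lat, lng) for the best text-search match, or None if nothing is found.
    params = {"query": location_name + ", San Francisco, CA", "key": GOOGLE_API_KEY}
    results = requests.get(PLACES_URL, params=params).json().get("results", [])
    if results:
        loc = results[0]["geometry"]["location"]
        return loc["lat"], loc["lng"]
    return None

locations = ["City Hall", "Bay Bridge", "Pier 7 - The Embarcadero"]  # a few sample rows
coords = [c for c in (geocode(name) for name in locations) if c]

gmap = gmplot.GoogleMapPlotter(37.7749, -122.4194, 13)  # centred on San Francisco
gmap.scatter([lat for lat, _ in coords], [lng for _, lng in coords], marker=True)
gmap.draw("movie_locations.html")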

Next, we focused only on the places where more famous movies were shot. To determine these movies, we used the IMDb API. All information about the movies, including the average rating and the overall number of votes, was downloaded from imdb.com. The top movies shot in San Francisco with respect to the average rating were:

Movie | Rating | No. of votes
Forrest Gump | 8.8 | 1,234,615
Sense8 | 8.4 | 63,164
All About Eve | 8.3 | 82,126
Looking | 8.3 | 10,696
I Remember Mama | 8.3 | 3,857

On the other hand, movies with the biggest number of votes on IMDb were:

Movie | Rating | No. of votes
Forrest Gump | 8.8 | 1,234,615
Indiana Jones and the Last Crusade | 8.3 | 509,609
Dawn of the Planet of the Apes | 7.6 | 313,938
Ant-Man | 7.4 | 301,246
Godzilla | 6.5 | 299,385

Using the combination of these ratings and votes, we selected the top 7 movies: Forrest Gump, Indiana Jones and the Last Crusade, Dawn of the Planet of the Apes, Ant-Man, The Game, Godzilla, and The Graduate. These movies are associated with 36 movie locations across San Francisco.

In order to do a graph analysis on these locations, we needed a way to set up the edges of the graph. For this, we used the handy Google API, with which you can compute driving, cycling or walking travel times between any two geographical locations. These travel times were taken as the edge values between each pair of locations (nodes or vertices). We built the graph using the places from the top 7 movies, with cycling and driving times between them. Furthermore, only 2 edges, those to the 2 closest places, were created for each location. The logic behind this is to avoid using paths with long travel times.
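The construction code is not shown in the post, but the logic described above could look roughly like this with networkx, assuming a travel_time(a, b) helper that wraps the Google travel-time lookup.

import networkx as nx

def build_location_graph(locations, travel_time):
    # locations: list of place names; travel_time(a, b): minutes between two places.
    graph = nx.Graph()
    graph.add_nodes_from(locations)
    for a in locations:
        nearest = sorted((travel_time(a, b), b) for b in locations if b != a)
        for minutes, b in nearest[:2]:  # connect each place only to its two closest neighbours
            graph.add_edge(a, b, weight=minutes)
    return graph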

Below you can see the visualization and simple analysis of the graph. Edges of this graph were created from the cycling distances between locations.

Visualization of the graph.

As the next step in analyzing the movie locations, we looked into the betweenness of each location. Betweenness is equal to the number of shortest paths, from all vertices (in our case the locations) to all others, that pass through a given node. A location with high betweenness has a large influence on the flow of people through the network of top movie locations, under the assumption that people always look for the shortest path. We compared the locations with the highest betweenness using edges based on driving as well as cycling times. We came up with the following results:

Places with the highest betweenness when cycling:

  1. Bank of America Building (555 California Street)
  2. 301 Howard Street
  3. Embarcadero & Washington
  4. Mission & Beal
  5. Bay Bridge

Places with the highest betweenness when driving:

  1. Bank of America Building (555 California Street)
  2. Washington Street & Waverly Place (Chinatown)
  3. City Club (155 Sansome Street)
  4. Embarcadero & Washington
  5. Bay Bridge

We can see that two places changed when using driving distances: 301 Howard Street and Mission & Beal were replaced by Washington Street & Waverly Place (Chinatown) and City Club (155 Sansome Street) respectively. This means that movie fans are more likely to pass through San Francisco's Chinatown if they travel around by car.
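For reference, with a graph built as sketched earlier, the weighted betweenness ranking is essentially a one-liner in networkx, with the travel-time edge weights acting as distances:

import networkx as nx

def top_betweenness(graph, k=5):
    # Shortest paths are computed using the travel-time edge weights.
    scores = nx.betweenness_centrality(graph, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:k]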

Finally, we looked into the Traveling Salesman Problem (TSP) and applied it to our dataset. The TSP is the optimization problem of finding the shortest possible route that visits each place in a given set. Using a random start and an iterative algorithm, we came up with a single route that any movie fan should take if they want to visit all of the interesting places from famous movies.
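The exact algorithm is not published in the post; one simple "random start plus iteration" heuristic along those lines is repeated nearest-neighbour construction from random starting points, keeping the best tour found (again assuming a travel_time(a, b) helper):

import random

def tour_length(tour, travel_time):
    return sum(travel_time(tour[i], tour[i + 1]) for i in range(len(tour) - 1))

def nearest_neighbour_tour(places, travel_time, start):
    # Greedily walk to the closest unvisited place.
    tour, remaining = [start], set(places) - {start}
    while remaining:
        nxt = min(remaining, key=lambda p: travel_time(tour[-1], p))
        tour.append(nxt)
        remaining.remove(nxt)
    return tour

def approximate_tsp(places, travel_time, restarts=100):
    best = None
    for _ in range(restarts):
        candidate = nearest_neighbour_tour(places, travel_time, random.choice(places))
        if best is None or tour_length(candidate, travel_time) < tour_length(best, travel_time):
            best = candidate
    return best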

You can see a visualization of the optimal route below. Google supports only 10 locations per route, so four layers were created. The markers in each layer run from A to J, with J overlapping A from the next layer. The beginning of each layer is marked with a number for better readability.

Visualization of the optimal route between famous film locations in SF.

This is the optimal route itinerary, starting at Harrison Street (The Embarcadero) and ending at Mission & Beal:

0(A)      Harrison Street - The Embarcadero (The Game)
1(B)      Mission & Fremont St (Godzilla)
2(C)      301 Howard Street (The Game)
3(D)      Bay Bridge (The Graduate)
4(E)       Administration Building - Treasure Island (Indiana Jones and the Last Crusade)
5(F)       California & Powell (Dawn of the Planet of Apes)
6(G)      Pier 1 (Godzilla)
7(H)      Broadway between Powell and Davis (Ant-Man)
8(I)        California & Davis St (Godzilla)
9(J-A)    City Hall (Dawn of the Planet of the Apes)
10(B)     Potrero & San Bruno (Godzilla)
11(C)     Alioto Park (Dawn of the Planet of the Apes)
12(D)     Market between Stuart and Van Ness (Ant-Man)
13(E)      Eddy & Taylor St. (Godzilla)
14(F)      420 Jones St. at Ellis St. (Ant-Man)
15(G)     Post & Jones St. (Godzilla)
16(H)     Presidio - Golden Gate National Recreation Area (The Game)
17(I)       Conzelman Rd at McCollough Rd and down Conzelm... (Ant-Man)
18(J-A)   Mason & California Streets - Nob Hill (The Game)
19(B)      Broadway & Columbus (Godzilla)
20(C)     Sacramento & Front St. (Godzilla)
21(D)      Pier 7 - The Embarcadero (Godzilla)
22(E)      Embarcadero & Washington (Godzilla)
23(F)      Bush & Kearny (Godzilla)
24(G)     California St from Mason to Kearny (Dawn of the Planet of the Apes)
25(H)      Kearney & Pine St. (Godzilla)
26(I)        Stockton & Clay St (Godzilla)
27(J-A)    University Club (Dawn of the Planet of the Apes)
28(B)       Pine between Kearney and Davis (Ant-Man)
29(C)       Washington Street & Waverly Place - Chinatown (The Game)
30(D)      Columbus between Bay and Washington (Ant-Man)
31(E)       Bank of America Building - 555 California Street (The Game)
32(F)       City Club - 155 Sansome Street (The Game)
33(G)      Grant between Bush and Broadway (Ant-Man)
34(H)      Pine St. & Davis St (Godzilla)
35(I)        Mission & Beal (Godzilla)

And on the more detailed map of San Francisco city center:

Visualization of the optimal route between famous film locations in SF.

If you are interested, you can check out the other data sources from the City by the Bay by yourself - we are definitely going to do that.

 

July 14, 2016

Silicon Valley Data Science

Brain Monitoring with Kafka, OpenTSDB, and Grafana

Here at SVDS, we’re a brainy bunch. So we were excited when Confluent announced their inaugural Kafka Hackathon. It was a great opportunity to take our passion for data science and engineering, and apply it to neuroscience.

We wondered, “Wouldn’t it be cool to monitor our brain wave activity? And process those signals to control devices like home appliances, light switches, TV’s, and drones?“ We didn’t end up having enough time to implement mind control of any IoT devices during the 2-hour hackathon. However, we did win 2nd place with our project: streaming brainwave EEG data through Kafka’s new Streams API, storing the data on OpenTSDB with Kafka’s Connect API, and finally visualizing the time series with Grafana. In this post, we’ll give a quick overview of how we did all this, reveal the usage of Confluent’s unit testing utilities, and, as a bonus, we’ll show how it’s done in Scala.

Please note that this is not meant to be a production-ready application. But, we hope readers will learn more about Kafka and we welcome contributions to the source code for those who wish to further develop it.

Installation Requirements

All source code for our demo application can be found in our GitHub repository.

We used the Emotiv Insight to collect brainwave data, but the more powerful Epoc+ model should work, too. For those who don’t have access to the device, a pre-recorded data sample CSV file is included in our GitHub repository, and you may skip ahead to the “Architecture” section.

In order to collect the raw data from the device, you must install Emotiv’s Premium SDK which, unfortunately, isn’t free. We’ve tested our application on Mac OS X, so our instructions henceforth will reference that operating system.

Once you’ve installed the Premium SDK, open their “EEGLogger.xcodeproj” example application.


Assuming you have Xcode installed, the example application will open in Xcode. If you have the Insight instead of the Epoc+, you will need to uncomment a few lines in their Objective-C code. Go to the “getNextEvent” method in the “ViewController.mm” file and uncomment the lines of code for the Insight, and comment out the lines of code for the Epoc+.


Next, save your changes and press the “play” button in Xcode to run their example app on your Mac. Power on your Insight or Epoc+ headset and the example app will soon indicate that it’s connected (via Bluetooth). Once the Bluetooth connection to the headset is established, you’ll see log output in Xcode. That’s your cue to inspect the raw EEG output in the generated CSV file found in your Mac at the path “~/Library/Developer/Xcode/DerivedData/EEGLogger-*/Build/Products/Debug/EEGDataLogger.csv”.

Press the “stop” button in Xcode for now.


Architecture

To give you a high level overview of the system, the steps of the data flow are:

  1. Raw data from the Emotiv headset is read via Bluetooth by their sample Mac app and appended to a local CSV file.
  2. We run “tail -f” on the CSV file and pipe the output to Kafka’s console producer into the topic named “sensors.”
  3. Our main demo Kafka Streams application reads each line of the CSV input as a message from the “sensors” topic and transforms them into Avro messages for output to the “eeg” topic. We’ll delve into the details of the code later in this blog post.
  4. We also wrote a Kafka sink connector for OpenTSDB, which will take the Avro messages from “eeg” topic and save the data into OpenTSDB. We’ll also describe the code for the sink connector in more detail later in this blog post.
  5. Grafana will regularly poll OpenTSDB for new data and display the EEG readings as a line graph.

Note that, in a production-ready system, brain EEG data from many users would perhaps be streamed from Bluetooth to each user’s mobile app that in turn sends the data into a data collection service in your cloud infrastructure. Also note that, for simplicity’s sake, we did not define partition keys for any of the Kafka topics in this demo.

Running the Demo

Before running the demo, you’ll need git, scala, sbt and Docker (including docker-machine & docker-compose) installed. If you’re running Mac OS X, you can install them with Homebrew.

To run the demo, clone our GitHub repository so that you have a copy of the source code on your computer, and follow the instructions in the README, which will ask you to run the “build_and_run_demo.sh” shell script. The script will handle everything, but keep an eye on its output as it runs. By default, the script plays the pre-recorded data in a loop. If you wish to see the raw data from your Emotiv headset, run the script with the “-e” flag.

  1. If your Docker machine isn’t already up and running, you may need to start it and re-initialize Docker-specific environment variables.
  2. The shell script will clone source code repositories from GitHub and build JAR files.
  3. Then it’ll download Docker images.
  4. After the Docker images are downloaded, the Docker containers will be spawned on your machine and Kafka topics will be created. You should see that the “sensors” and “eeg” topics are available.
  5. You’ll be asked to open the URL to your Docker container for Grafana.
  6. You’ll need to login with “admin” as the username and “admin” as the password.
  7. Then you’ll be redirected to the administrative web page for configuring the database connection to OpenTSDB. The only thing you might need to change is the IP address in the URL text input field. The IP address is your Docker machine’s IP address and should match the one in your web browser. Here in the screenshot below, my Docker IP address is 192.168.99.100, so there is nothing for me to update.
  8. Click the green “Save & Test” button and you should see a success dialog indicating that Grafana can indeed connect to OpenTSDB.
  9. Go back to your terminal and press the “enter” key to start streaming the example EEG data through the system. The URL to your EEG dashboard on Grafana will appear.
  10. Finally, go to the URL, and you’ll see the brain wave data in your web browser.

Source code details

We’ll highlight key pieces of information we had to figure out, in many cases by looking at Confluent’s source code, since they weren’t well publicized in either Kafka’s or Confluent’s documentation.

Streams application

Importing libraries

To use Confluent’s unit testing utility classes, you’ll need to build a couple of JAR files and install them in your local Maven repository, since they are not yet (at the time of this writing) published in any public Maven repository.

In the build.sbt file, the unit testing utility classes are referenced in the “libraryDependencies” section with either “test” and/or “test” classifier “tests” for each relevant library. Your local Maven repository is referenced in the “resolvers” section of the build.sbt file.

Installing the JAR files into your local Maven repository requires cloning a couple of Confluent’s GitHub repositories and running the “mvn install” command, which is handled in the “test_prep.sh” shell script. Note that we’re using a forked version of Confluent’s “examples” GitHub repository because we needed additional Maven configuration settings to build a JAR file of the test utility classes.

Avro class generation

Avro is used to convert the CSV input text and serialize it into a compact binary format. We import sbt-avrohugger in “build.sbt” and “project/plugins.sbt” for auto-generating a Scala case class from the Avro schema definition, which represents an OpenTSDB record. The Avro schema follows OpenTSDB’s data specification—metric name, timestamp, measurement value, and a key-value map of tags (assumed to be strings). The generated class is used in the Streams application during conversion from a CSV line to Avro messages in the “eeg” output Kafka topic, where the schema of the messages are enforced by Confluent’s schema registry.

Unit testing

Note the use of the “EmbeddedSingleNodeKafkaCluster” and “IntegrationTestUtils” classes in the test suite. Those are the utility classes we needed to import from Confluent’s “example” source code repository earlier. For testability, the Streams application’s topology logic is wrapped in the “buildAndStartStreamingTopology” method.

Command line options

After seeing a few examples in its documentation, we found Scopt to be a convenient Scala library for defining command line options. The first block of code in the main Streams application uses Scopt to define the command line parameters.

Different input and output serdes

The default Kafka message key and value are assumed to be Strings, hence the setting of the StreamsConfig key and value serdes (serializers/de-serializers) to the String serde. However, since we want to output Avro messages to the “eeg” topic, we override the outbound serde when the “.to()” method is called.

Avro serde and schema registry

Serializing and de-serializing Avro requires Confluent’s KafkaAvroSerializer and KafkaAvroDeserializer classes. Their constructors require the schema registry client as a parameter, so we instantiate an instance of the CachedSchemaRegistryClient class.

EEG CSV format and downsampling

The data originating from the EEG CSV file includes measurements from each sensor of the headset. Each of those sensor readings is converted into an Avro object representing an OpenTSDB record, with the name of the sensor included in the OpenTSDB metric name.

Although there is a CSV column for original event’s timestamp, it’s not a Unix timestamp. So the system processing time (i.e., “System.currentTimeMillis()”) is used for simplicity’s sake.

The very first column of the CSV is the “counter”, which cycles through the numbers from 0 through 128. We downsample by filtering on the counter since the demo Docker cluster setup can’t handle the full volume of the data stream.

Sink connector

This section highlights some key points in our sink connector source code. We’ll mention what was needed to define custom configuration properties for connecting to a particular OpenTSDB host as well as settings to the OpenTSDB server itself.

The “taskConfigs” method

When defining a Kafka Connector, each task created by the connector receives a configuration. Even if your connector doesn’t have any task configuration settings, the “taskConfigs” method must return a list that contains at least one element, even if it’s an empty configuration Map class instance, in order for tasks to be instantiated. Otherwise, your connector won’t create any tasks and no data will be written to OpenTSDB.

Defining config settings for OpenTSDB host & port

As you may have inferred from the Docker file, the sink connector settings are in a properties file that are read when the Kafka Connect worker starts. We’ve defined the property keys for the OpenTSDB host and port, plus their default values, in the “OpenTsdbConnectorConfig” class. The default values are overridden in the properties file to match the host and port defined in the main docker-compose.yml configuration file. The property settings are propagated to each Kafka Connect task via the “props” parameter in the overridden “start()” method of the sink task class.

Writing to OpenTSDB

OpenTSDB’s documentation recommends inserting records through its HTTP API. We used the Play framework’s WS API to send HTTP POST requests containing data read from the Avro messages in the “eeg” Kafka topic. Currently, the task class has no error handling for cases where the HTTP request returns an erroneous response or times out.
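The connector itself is written in Scala, but the OpenTSDB HTTP API is language-agnostic. As a rough, language-neutral sketch, a record translated from the “eeg” topic essentially becomes a JSON document POSTed to /api/put, along these lines (the metric name and tag are illustrative placeholders; the host is the Docker machine IP from earlier):

import time
import requests

OPENTSDB_URL = "http://192.168.99.100:4242/api/put"  # Docker machine IP and default OpenTSDB port

datapoint = {
    "metric": "eeg.af3",                   # hypothetical per-sensor metric name
    "timestamp": int(time.time() * 1000),  # processing time, in milliseconds
    "value": 4120.5,
    "tags": {"source": "demo"},            # OpenTSDB requires at least one tag
}

response = requests.post(OPENTSDB_URL, json=[datapoint])  # /api/put accepts a list of data points
response.raise_for_status()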

OpenTSDB configuration changes needed for Connector

There are a couple of configuration settings in the OpenTSDB server itself that we needed to override. Both are set to true.

tsd.storage.fix_duplicates
This needs to be set to “true” because different measurements from the same sensor may have the same timestamp assigned during processing time.

tsd.http.request.enable_chunked
Chunk support should be enabled so that large batches of new data points will be processed if the large HTTP request is broken up into smaller packets.

Kafka Connect Standalone & Distributed Properties

In the Kafka Connect worker’s Docker file, note that the CLASSPATH environment variable must be set in order for the Kafka Connect worker to find the OpenTSDB connector JAR file.
Also, it includes property files for both standalone and distributed modes, but only standalone mode is enabled in the Docker image.

Grafana

Lastly, the visualization settings and connection settings to OpenTSDB are pre-loaded in its Docker container image.

Conclusion

We’ve given a whirlwind tour of our brain EEG streaming application, highlighting the usage of Confluent’s unit testing utility classes and many other components in the system. I’d like to thank my SVDS colleagues Matt Mollison, Matt Rubashkin, and Ming Tsai, who were my teammates in the Kafka Hackathon.

Would you like to see further developments? Please let us know in the comments below. Your feedback and suggestions for improvement are welcome.

The post Brain Monitoring with Kafka, OpenTSDB, and Grafana appeared first on Silicon Valley Data Science.

Data Digest

JUST RELEASED: Chief Data Officer Forum Financial Services - Speaker Presentations


The Chief Data Officer Forum, Financial Services brought together over 100 CDOs, CAOs, and other data leaders from leading global and national financial institutions, with those attending benefiting directly from the experience and insight on offer. Alongside keynote presentations from our senior speaker line-up, our informal discussion groups, an in-depth masterclass and networking sessions provided delegates with the opportunity to take away the new ideas and information they needed to deliver real benefits to their companies. Topics covered included: data governance, regulatory compliance, data-centric change management, and extracting business value from your data through effective analytics.

Hear more from the leading Chief Data Officers at the coming Chief Data Officer Forum organized by Corinium Global Intelligence. Join the Chief Data Officer Forum on LinkedIn here.

 

July 13, 2016


BrightPlanet

VIDEO: What is a Deep Review All About

It can be a challenge proving the ROI of collecting external web data to executive teams. Companies that aren’t yet ready for a full commitment to Data-as-a-Service and are currently testing out how external web data can be used for their business, often begin with a Deep Review. A Deep Review is the best way to […] The post VIDEO: What is a Deep Review All About appeared first on BrightPlanet.

Read more »

David Corrigan

Getting Value from “Digital Garbage”

Durham Region in Ontario, Canada has been deadlocked for years in a debate over garbage incineration.  There seem to be equal amounts of proponents and detractors for this controversial plan.  But...

...
Teradata ANZ

What You Need To Do To Get Big Data To Work For You

You probably don’t need to know that yet another report has been published about the benefits of big data. Tech giants like Google, Facebook and eBay are in on it, using a mix of bespoke, freeware and licensed technologies to monetize internal data assets by combining them with freely available big data. Even Dilbert has something to say about it!

But how are organizations implementing solutions that would enable them to tackle the magnitude and unleash the potential of this big data opportunity?


Long Road Ahead

Typically, senior executives sanction a large pot of funding to get a big data platform created expeditiously. Organizations soon realise that they also need to deploy analytics to make sense of this data.

Some organizations embark on an “agile” program. It might have a platform that includes Hadoop of some flavor for distributed storage, a data framework such as Apache Spark to handle machine learning and real-time streaming, and many other disparate moving parts.

The result? After a year or two and a few million dollars, a workable big data platform is unveiled.

Unfortunately, this is often too little, too late. Why? The organization has lost critical time and resources. They have handed the advantage to competitors who took a different tack.

Run With Big Data Analytics

Those who have been successful have pursued a very different strategy and approach – one that allows the infrastructure to follow the needs of successful pilot projects. Crucially, this approach ensures that the big data platform is funded by the analytics it enables.

Now, how does this work in practice? Really pretty much in the same way as for operational analytics, except that we will be incorporating big data alongside operational data!

The Four-Step Approach

  1. Identify a few pilot projects that have a strong business case and require external sources of big data. For instance, you might want to see whether you can leverage any insights from the tweets about your organization. You could pilot a project using sentiment analysis to uncover the topics and positive or negative feelings that your customers have towards your business (a minimal scoring sketch follows this list).
  2. Prioritise these projects in terms of value to business and ease of implementation. Initial successes will act as proof-points, enabling you to build out the skillsets and resources within your organization to be able to tackle bigger and more difficult analytics.
  3. Evaluate big data technologies on a short sharp engagement. This testing can be done internally if the expertise exists or with external consultants focusing on the analytics projects most likely to succeed and with high business value.
  4. Continue the identify, prioritise and test process over a few cycles. This gives you time to understand what your organization’s big data needs are, and provides valuable input for the eventual delivery of a ‘fit for purpose’ big data technology platform.
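
To make the sentiment-analysis pilot in step 1 a little more concrete, here is a minimal sketch of how such a scoring job might look on the kind of Spark platform mentioned above. The input path, word lists, and scoring rule are illustrative assumptions rather than anything from the original post; a real pilot would use a curated sentiment lexicon or a proper sentiment model.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

val spark = SparkSession.builder.appName("TweetSentimentPilot").getOrCreate()
import spark.implicits._

// Illustrative word lists -- swap in a curated sentiment lexicon for real use.
val positive = Set("great", "love", "happy", "excellent")
val negative = Set("bad", "hate", "slow", "broken")

// Hypothetical input location: one tweet per line.
val tweets = spark.read.text("/pilot/tweets/").as[String]

// Score each tweet as (# positive words) - (# negative words).
val scored = tweets.map { tweet =>
  val words = tweet.toLowerCase.split("\\s+")
  (tweet, words.count(w => positive.contains(w)) - words.count(w => negative.contains(w)))
}.toDF("tweet", "score")

// Average sentiment of the sample: positive or negative on balance?
scored.agg(avg($"score").as("avg_sentiment")).show()

Even something this crude is enough to compare sentiment before and after a campaign, which is exactly the kind of quick, measurable proof-point this four-step approach is looking for.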

Need More Convincing?

Surprisingly, this evolutionary approach takes no more time than the approach of first spending two years to deploy a big data analytics platform before using it for business benefit. Better than that, at no point during the journey is the organization ignoring its operational analytic needs.

There is even an added advantage. This gives an organization time to embed and integrate ‘big data thinking’ in the organization. This is something that happens incrementally – you can’t expect an organization’s ability to analyze data and use those insights to go from novice to experienced in one fell swoop. This is achieved organically in the evolutionary approach.

Clearly, this is a much better way to move into big data!

This post first appeared on Forbes TeradataVoice on 06/05/2016.

The post What You Need To Do To Get Big Data To Work For You appeared first on International Blog.


Simplified Analytics

5 mistakes to avoid in Digital Transformation

Digital Transformation is happening everywhere you look. It is impacting businesses of any size, in any industry, any market and every geography. Many organizations recognize the importance of...

...
The Data Lab

e-Placement Scotland about to reach 1500 placements

The Data Lab had a great experience working with e-Placement Scotland to organise the placements for The Data Lab MSc programme. As part of their course, students have the opportunity to undertake a paid placement in industry. Recognising e-Placement Scotland’s expertise in engaging with employers and helping them to benefit from working with students and course leaders, The Data Lab turned to e-Placement Scotland to bring employers into the project.

e-Placement Scotland was able to develop a bespoke campaign throughout and beyond the technology sector to identify companies working on data problems with a real and pressing need for targeted input and expertise. e-Placement Scotland was able to get the ‘offer’ out to industry and work with key companies to help develop areas where universities and their masters students could make tangible inputs to data problems, delivering significant and valuable returns for all parties.

The initiative was a resounding success, with 70% of participating students attaining a full time paid placement within a range of private and public sector organisations across healthcare, financial services, media and oil and gas.  With big names including Scottish Government, NHS, Waracle, Standard Life and DC Thomson on board, as well as a range of fast-growing SMEs, it’s no surprise that in 2016 the programme will double in size, bringing several more Scottish Universities into the fold.

 

Through advertising with e-Placement Scotland, you will have access to:

  • A database of over 4000 students seeking placement opportunities
  • A dedicated matching service, ensuring we find the right student for you

Placements can be full time or part time at any point in the year. e-Placement Scotland are flexible to your requirements and will work with you to help find your perfect match. And now you also have the opportunity to benefit from free job advertisements until they reach 1500 placements.

 

If you're interested in learning more, please contact Jamie Duncan on 01506 472 200, email jamie.duncan@scotlandis.com, or read the employer guide.

 

Find out more about The Data Lab MSc.

 

 

Data Digest

In data and analytics, momentum trumps trajectory. Here’s why.


This article was originally published in CIO Review.

No one can dispute that data has significant value for organizations. We see it every day in how some companies are using data to successfully deliver better customer experiences. This can take many shapes, from better products and services that companies create based on collected and analyzed customer behavior, to personalizing customer experiences.

Examples abound: we have all read how Disney is creating magical experiences by leveraging data collected through its MagicBands. Netflix has used viewership data to design and produce new series that are adjusted to viewer behavior and preferences. Entire companies, such as The Climate Corporation, have been built (and successfully sold) on data.

As easy as it may seem, many companies are still struggling to make data and analytics work for them. At a few recent conferences I have participated in, most of the companies I talked to were struggling with the fundamentals of data management: reining data in, getting business units aligned with data solutions, and, more importantly, getting data solutions adopted and used. Making progress is complicated even further by the noise created in the marketplace by things like big data, machine learning, and the internet of things. A lot of these terms have been hijacked by vendors, making them seem like silver-bullet solutions. The perception has been created that by acquiring these technologies alone, companies can solve all their challenges and start implementing solutions right away: add water and that is it. Something similar is happening with the people side of the equation: companies are hiring data scientists, equipping them with technology, and hoping for the best.

The perception has been created that by acquiring these technologies alone, companies can solve all their challenges and start implementing solutions right away: add water and that is it.

With all this said, what works? In my experience, companies need to focus on a few things - what I call the fundamentals:

  • Keep a balance between people, process, and technology when designing and implementing data and analytics solutions;
  • Implement data solutions that are aligned with business needs; and,
  • Implement solutions in an agile way, in small iterations that deliver business value quickly.
Let’s take a look at each of these points in detail.

Balance between people, process, and technology

Nothing sounds more like a cliché in the technology world than “keep the balance between people, process, and technology.” However, in my experience this is one of the most fundamental elements companies need to take into account when delivering data and analytics solutions, and one that is often and easily overlooked. In the data and analytics space, the lack of balance between these three elements manifests itself in many ways. Companies that want to join the data and analytics party purchase large amounts of technology in the shape of BI tools, database appliances, Hadoop clusters, or many other similar components. The belief is that by purchasing and deploying that technology, useful solutions will come out of it. In this case, the technology dimension is completely out of balance. In my experience, companies already have plenty of technology with which they can get going. Furthermore, cloud computing nowadays offers easy access to technology that can be consumed on demand and allows companies to start without large investments.

In my experience, the driver of this aimless purchase of technology can be traced back to business leaders requesting that companies jump into data and analytics; technology teams react by acquiring technology, or in many cases business teams purchase technology themselves, which then gets dumped into IT’s hands to manage. The best way to avoid this scenario is for technology teams to elevate the conversation with their business counterparts, focusing on what they want to accomplish rather than on the business telling them what to do. A recent article on LinkedIn captures this concept nicely.

The lack of balance can also manifest itself on the people side. Companies have attempted to get going by hiring armies of data scientists, thinking that business value will be delivered just by hiring them. The reality is that in many cases companies end up with groups of really smart people creating fantastic data and analytics solutions that are disconnected from real business needs. In other words, solutions are created for which problems need to be found. Nowadays, there are plenty of ways to start small in this regard. For example, there are plenty of firms offering data scientists on a consulting basis. Organizations should define what it is they want to accomplish and partner with these firms to start small, with a prototype-style approach.

On the process side of the people, process, and technology equation, lack of balance manifests itself when solutions are implemented without taking into account the changes that business processes need to go through in order for the data solutions to be adopted. A simple example from my experience: we once implemented a sophisticated forecasting algorithm that reduced process time from weeks to a few hours, but the business team consuming the results of the forecast wasn’t ready for the solution. We assumed that simply by producing a better forecast in a more efficient way (too much emphasis on technology), the business team would be able to adapt to it and “run with it” (ignoring the process changes).

Alignment with business needs

Similarly to the points above, teams leading data and analytics work need to have a laser focus on the business needs of the company and on how data solutions can help address those needs. This requires data teams to be in close sync with the business teams, focusing conversations on understanding what the real business needs are. Many times, the relationship between business and data teams is transactional in nature, putting data teams in an “order-taking” role. Data teams need to elevate themselves out of this position, focusing the relationship on delivering high-value business solutions. As simple as it sounds, it is a critical area that can make a big difference in the success of data teams.

Agile solution delivery

Recently, I sat in a session at a data and analytics conference led by Jared Souter, CDO, First Republic Bank. One of the key points he shared with us was the need for data teams to understand that “momentum forward is more important than perfect trajectory”. I really liked the simplicity with which he explained it. Perfectly in line with the points I have highlighted above, data teams need to ensure that value is delivered quickly, in an agile way that allows business teams to realize concrete results in the short term. This can and has to be done without losing sight of the long-term vision of where the data and analytics team wants to be in the future: short-term gains with a long-term view. I have been part of efforts, and have seen many projects, that aim to deliver the perfect solution and take a significant amount of time to do so. What usually happens is that the solution delivered may be the right solution at the wrong time (too late). By the time the solution is delivered, the business context has changed, rendering the solution useless. Delivering small, quick solutions also has the added benefit of allowing business teams to face less change when adopting them.

Data teams need to ensure that value is delivered quickly, in an agile way that allows business teams to realize concrete results in the short term.

This is a time in which companies can become more competitive by using data and analytics. Technology is no longer a barrier and as business processes have become more digital, the amount of data available has increased significantly. Organizations that focus on the right priorities, in the right way, will be able to realize large business benefits. All that is needed is a little bit of common sense and a strategic view of what the ultimate business goal should be.

By Juan Gorricho: 

Juan Gorricho is the Chief Data & Analytics Officer at Partners Federal Credit Union – The Walt Disney Company. He is also a member of the Chief Analytics Officer Forum Advisory Board.

 

July 12, 2016

Big Data University

This Week in Data Science (July 12, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

The post This Week in Data Science (July 12, 2016) appeared first on Big Data University.

Ronald van Loon

Are CEO’s Missing out on Big Data’s Big Picture?


Big data allows marketing and production strategists to see where their efforts are succeeding and where they need some work. With big data analytics, every move you make for your company can be backed by data and analytics. While every business venture involves some level of risk, with big data that risk can be reduced substantially, thanks to information and insights on market trends, customer behaviour, and more.

Unfortunately, however, many CEOs seem to think that big data is available to all of their employees as soon as it’s available to them. In one survey, nearly half of all CEOs polled thought that this information was disseminated quickly and that all of their employees had the information they needed to do their jobs. In the same survey, just a little over a quarter of employees responded in agreement.

Great Leadership Drives Big Data

In entirely too many cases, CEOs look at big data as something that spreads in real-time and that will just magically get to everyone who needs it in their companies. That’s not the case, though. Not all employees have access to the same data collection and analytics tools, and without the right data analysis and data science, all of that data does little to help anyone anyway.

In the same study that we mentioned above, of businesses with high-performing data-driven marketing strategies, 63% had initiatives launched by their own corporate leaders. Plus, over 40% of those companies also had centralized departments for data and analytics. The corporate leadership in these businesses understood that simply introducing a new tool to their companies’ marketing teams wouldn’t do much for them. They also needed to implement the leadership and structure necessary to make those tools effective.

Great leaders see big data for what it is – a tool. If they do not already have a digital strategy – including digital marketing and production teams, as well as a full team for data collection, analytics, data science, and information distribution – then they make the moves to put the right people in the right places with the best tools for the job.

Vision, Data-Driven Strategy, and Leadership Must Fit Together

CEOs should see vision, data-driven strategy, and leadership as a three-legged chair. Without any one of the legs, the chair falls down. Thus, to succeed, a company needs a strong corporate vision. The corporate leadership must have this vision in mind at all times when making changes to strategy, implementing new tools and technology, and approaching big data analytics.

At the same time, marketing and production strategies must be data-driven, and that means that the employees who create and apply these strategies must have full access to all of the findings of the data collection and analysis team. They must be able to make their strategic decisions based directly on collected data on the market, customer behaviour, and other factors.

To do all this, leadership has to be in place to organize all of these strategic initiatives and to ensure that all employees have everything they need to do their jobs and move new strategies forward.

Have you implemented a digital strategy for your business? What’s changed since you’ve embraced your strategy, and what are your recommendations for strategy and data-driven technology for business owners and executives like yourself?

Let us know what you think and how you’ve used your digital strategy to set your business apart from the competition.


To learn more about the world of Entrepreneurship & Data Science follow Bob Nieme on Twitter or connect with him on Linkedin
CEO – O2MC I/O prescriptive computing

 


Connect with author Ronald van Loon to learn more about the possibilities of Big Data
Co-author, Director at Adversitement

Ronald

Ronald helps data-driven companies generate business value with best-of-breed solutions and a hands-on approach. He has been recognized as one of the top 10 global influencers by DataConomy for predictive analytics, and by Klout for Data Science, Big Data, Business Intelligence and Data Mining. He is a guest author on leading Big Data sites, a speaker/chairman/panel member at national and international webinars and events, and runs a successful series of webinars on Big Data and on Digital Transformation. He has been active in the data (process) management domain for more than 18 years, has founded multiple companies and is now director at Adversitement, a leader in Big Data & data process management solutions. He has a broad interest in big data, data science, predictive analytics, business intelligence, customer experience and data mining. Feel free to connect on Twitter or LinkedIn to stay up to date on success stories.


The post Are CEO’s Missing out on Big Data’s Big Picture? appeared first on Ronald van Loons.

Teradata ANZ

The future of energy is….data

Tesla’s recent bid to acquire SolarCity and in effect become a new kind of energy provider has generated a huge amount of copy in recent days. Some cite it as another key milestone on the way to a new renewable energy world. Some cite it as the beginning of the end of the traditional utility model. (Or perhaps another nail in the coffin, as the end of the traditional utility model has been nigh for some time now…) And yet others have used the speculation and hype about this bid to remind us of the technical and financial realities of moving to an entirely new energy landscape. Because in truth, it’s not really all that easy; it’s not here yet; and for now it’s mostly not actually commercially viable either.

All of those positions have merit. Even the doom-and-gloom ones. But looking across all of them, one thing is entirely clear. Our future energy landscape remains hugely uncertain.

Now, that’s all fine for the chattering / blogging classes. But what about the utility businesses that need to navigate such an uncertain future while keeping the lights on; running a viable business; and keeping energy costs manageable for their customers?

It’s all too easy to say “who cares?”, given how little anyone loves their utility company. But you should care. After all, you do want the lights to come on every time you flick that switch, don’t you? The future may well mean the death of the traditional utility business. And I have no attachment – emotional, financial or otherwise – to which particular company is part of that future and which isn’t. But we do actually have to get to that future first. As a subset of the recent blogs on Tesla will tell us, we can’t just wave a wand and move to a new model overnight, leaving the old utility dinosaurs and their 100+ years of critical national infrastructure behind. It doesn’t work like that. No, really, it doesn’t. These traditional utility companies will have to be there to take us – at least part of the way – along this journey to the future energy landscape.


If they want to be part of it, utilities already know they must change, adapt, innovate to claim a place at the heart of energy provision in the 21st Century and beyond. I’ve previously talked about how that will need three key enablers:

1. New leadership;
2. An entirely new kind of customer focus; and
3. A shift to a data-driven approach across every part of their new and still-forming 21st century business.

Let’s explore that data-driven point a little further.

There’s plenty of evidence to show that in any business environment, companies that take a data-driven approach will outperform their rivals. But that’s even more the case when presented with an uncertain future. Who will be your new competitors? What will customers want from you? What new lines of business are likely to be successful? What new regulation is coming your way and how will it affect your strategy?

Any and all of these questions can be answered by any method you like. Gut feel sometimes works. Anecdotal evidence-based opinions seem to have kept some businesses ticking along in the past. The ever-popular HiPPO (Highest Paid Person’s Opinion) can be pretty good at times. After all, if these methods really didn’t work at all, an awful lot of senior people would be out of a job by now.

But times have changed. In the traditional, slow-moving, often Government-funded, monopoly utility market, strategy-by-gut-feel (or, being generous, let’s call it strategy by years of experience) was pretty much OK. Things didn’t change too much, so gut feel / experience often led to decisions to do the same as we did last year… last decade… last century. And it worked. Plus, with all that Government money (or at least a monopoly position), nobody was really counting the effectiveness of every last cent anyway, right? Today though, Tesla is stealing your business. Not metaphorically. Really. You can’t just do what you always did before. It’s losing you money. What use is your experience now?


Now is the time to embrace the data available to you and the analytics you can perform on it. Enedis (formerly ERDF) are doing it already. Like them, take the opportunity to understand your customers like never before; maintain and operate your assets at peak efficiency; plan your new business ventures with a solid foundation of data & information. The Internet of Things is already delivering rich new sources of data and rich new business opportunities.

If today’s utility wants to be a part of tomorrow’s energy landscape, one thing is certain: the future of energy is all in the data.

The post The future of energy is….data appeared first on International Blog.

 

July 11, 2016

Teradata ANZ

The current state of Predictive Analytics

During the last Strata conference in London, I had the pleasure of sharing some thoughts on the current state and the challenges of predictive analytics with Jenn Webb, O’Reilly Radar’s managing editor for the Design, Hardware, Data, Business, and Emerging Tech spaces.

We touched on a number of subjects related to Data Science, Machine Learning and their applications: the advent of predictive APIs fueled by big data and machine learned models, the advantages and limits of deep learning, and the current and future applications of predictive analytics to financial services and marketing.

Click on the video below to watch the interview.


The post The current state of Predictive Analytics appeared first on International Blog.

 

July 09, 2016


Simplified Analytics

Digital Transformation helping Smart Cities flourish

A smart city is simply a community that harnesses digital technologies such as the Internet of Things, Big Data Analytics, Mobility, Drones & Wearables to improve the quality of life of people,...

...
 

July 07, 2016

Algolytics

Approximation or Classification – which one to choose?

Among the many decisions you’ll have to make when building a predictive model is whether your business problem is a classification or an approximation task. […]
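
As a rough illustration of that distinction (this sketch is not from the Algolytics article), the choice mostly comes down to the label you are predicting: a discrete category calls for a classifier, while a continuous quantity calls for an approximation (regression) model. In Spark ML, which features elsewhere on this page, the two cases might look like the following; the column names and toy data are assumptions made purely for the example.

import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ClassifyOrApproximate").getOrCreate()

// Toy training data: in classification the label is a discrete class
// (e.g. churned = 1.0 / stayed = 0.0); in approximation it is a continuous value.
val training = spark.createDataFrame(Seq(
  (1.0, Vectors.dense(0.2, 3.1)),
  (0.0, Vectors.dense(1.5, 0.4)),
  (1.0, Vectors.dense(0.1, 2.8)),
  (0.0, Vectors.dense(1.8, 0.2))
)).toDF("label", "features")

// Classification: predicts a class plus a probability for each class.
val classifier = new LogisticRegression().fit(training)
classifier.transform(training).select("prediction", "probability").show()

// Approximation (regression): predicts a continuous number directly.
val approximator = new LinearRegression().fit(training)
approximator.transform(training).select("prediction").show()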

David Corrigan

The Barber of Brooklin

I sat in the waiting chair for 40 minutes while he finished with another client. “Hey Dave, how’ve ya been?” “Nothing to complain about.” I sat in the barber’s chair and without hesitation he said...

...
Data Digest

JUST RELEASED: Chief Analytics Officer Forum Canada - Speaker Presentations


On June 21-22, over 80 Chief Analytics Officers and senior data leaders from across Canada met in Toronto for the inaugural CAO Forum, Canada. The expert speaker line-up included companies such as change.org, WSIB, BMO Financial Group, Kobo, AIG Insurance and many others.

No matter what your industry, Big Data and analytics are a key concern for the future of your business. With the ever-growing wealth and breadth of data available to your organization, it is of critical importance that you can not only store and manage that data but also utilize it to derive useful and actionable insight, giving you a competitive advantage in your marketplace.

Hear more from the leading Chief Analytics Officers at the coming Chief Analytics Officer Forum organized by Corinium Global Intelligence. Join the Chief Analytics Officer Forum on LinkedIn here.












