 


Planet Big Data is an aggregator of blogs about big data, Hadoop, and related topics. We include posts by bloggers worldwide. Email us to have your blog included.

 

September 28, 2016

Big Data University

BDU China initiatives


Every time I travel to China, I can’t stop thinking that the entire population of Canada probably fits in a single Chinese city! To serve such a large population, Chinese officials, professionals, and workers are used to doing things fast. Really fast. From the moment I applied for my Chinese visa to checking in at airports in China, I’m often amazed at the speed and efficiency with which they operate. Technological advances and adoption have followed the same pace. AliPay and WeChat Pay have been mainstream in China for some time, while Apple Pay in the West has yet to take off.

With such a great and vibrant population, we could not do something small to relaunch BDU in China. On September 3rd and 4th, BDU sponsored the CDA (Certified Data Analyst) Summit in Beijing, China, attended by more than 3,000 data analysts, data scientists, data engineers, students, and academics. The event put BDU in the minds and hearts of the community we want to reach. There were 6,382 people registered for the event, of which 3,221 checked in, and others watched online. We participated in the keynote, a panel, and 4 breakout sessions.

Keynote:

  • “Great opportunities ahead for Data Scientists” by Yan Yong Ji (Y.Y) Director, Analytics Platform Services on Cloud, IBM China Development Lab (CDL).
  • “BDU initiatives in China” by me (Raul F. Chong)

Panel:

This was a mixed panel (not just mixed backgrounds but also languages!), with the theme “Getting started with Data Science”. I was honored to be the moderator and to have my colleagues Saeed Aghabozorgi (BDU Chief Data Scientist) and Henry Zeng (IBM China Senior Data and Solutions Architect) with me on stage, along with four other panelists representing industry, academia, and startup companies. We managed to cover interesting questions in English and Chinese (with live translation), from comparing the Chinese and US data science outlooks to clarifying the distinction between the terms “Data Analyst”, “Data Scientist”, and “Data Engineer”.

Breakout sessions:

  • Smarter Traffic (Henry Zeng)
  • Data science: Competition to beat humans (Saeed Aghabozorgi)
  • Data science: Methodology, tools and skills (Saeed Aghabozorgi)
  • Data science: From university to Big Data University (Saeed Aghabozorgi)

The team also participated in media interviews and had a booth where flyers and small gifts were provided.


Announcements:

At the event, the following announcements were made:

With WeChat dominating social media and communication in China, we focused on launching and promoting our official BDU WeChat account. While at the conference, this newly created account grew to more than 1,000 subscribers! If you have not yet done so, please subscribe:

BDU WeChat QRCode
Video recording is available here:
http://e.vhall.com/395340718
 
I’m looking forward to continued collaboration with our existing BDU Ambassadors, and new partnerships for the rest of the year and 2017!

The post BDU China initiatives appeared first on Big Data University.


Revolution Analytics

Using R to detect fraud at 1 million transactions per second

In Joseph Sirosh's keynote presentation at the Data Science Summit on Monday, Wee Hyong Tok demonstrated using R in SQL Server 2016 to detect fraud in real-time credit card transactions at a rate of...

...
The Data Lab

Urban Tide's time with a Big Data MSc student

Urban Tide

 

TELL US ABOUT YOUR BACKGROUND AND YOUR DEGREE PROGRAMME

I graduated from a BSc course in Computing Science back in 2007 and worked in the oil and gas industry for 8 years as a systems engineer and information management engineer. In those 8 years a lot has changed in the computing world and the big data revolution has taken place. I was looking at the next step in my career and big data was perfectly aligned, so I wanted to get involved. I hoped to get really deep into it, so I looked around at the big data related university courses in Scotland (and some in England too) and I felt that the MSc Big Data course at the University of Stirling was well structured and covered the right mix of technologies that I wanted to learn about and that would be applicable in industry. My employer kindly gave me a year off work and some financial assistance to take the MSc Big Data course and it's been a fantastic experience. I've learned about NoSQL databases like MongoDB and Cassandra, distributed big data processing with Hadoop MapReduce and Apache Spark, machine learning with neural networks, decision trees and Bayesian networks, graph theory and evolutionary optimisation techniques... the list goes on! There's so much to learn about in the field of big data - I still feel like I've only just scratched the surface. Having said that, I am ready now to get back into industry and start developing and innovating with these new skills, using real data for real world challenges, and the project at Urban Tide is an excellent example of that.

 

WHY DID YOU WANT TO WORK WITH URBAN TIDE?

The Data Lab, which has helped to develop and sponsor the big data courses in Scotland, has teamed up with e-Placement Scotland to provide a number of placements in companies around Scotland where we can do our dissertation projects over the summer. It's been a great opportunity to get out of the academic scene and back into industry to work on projects within a business context, using real data with the objective of generating value. There were about 40 placements arranged for the three universities involved with The Data Lab this year and we were allowed to select our three preferred placements to apply for. The companies interviewed their candidates and Urban Tide offered me the place.

 

WHY DID THE PROJECT AT URBAN TIDE INTEREST YOU?

Well there were a lot of interesting projects to choose from but the one at Urban Tide appealed to me for a number of reasons.  I liked the idea of working in a young startup company because it would be very different to the context of the multinational organisation that I was familiar with.  The topic of smart cities was intriguing and I'm sure that this will be a growth industry with lots of opportunities for technological innovation.  I also wanted to experience living in Edinburgh which is a really cool city with loads of character, and the company is located in Codebase which is full of very trendy young startup companies in the technology space and there's lots going on to get involved with.  But most of all I was attracted by the technology stack that Urban Tide is using to build their U~SMART product.  U~SMART is being built from the ground up with big data analytics in mind and I think that Urban Tide has made some very good choices in their technology stack which was exactly aligned with the products that I wanted to develop a deeper understanding of and become more skilled in.  I'm very pleased with how my skills have developed in this project and my only regret is that there isn't more time to go even deeper with the machine learning component.

 

WHAT'S NEXT FOR YOU AFTER THE PROJECT IS FINISHED?

The plan is to take the new skills and knowledge gained on this course back to the company that I worked for over the 8 years prior to the course starting, in a new job role that will focus on developing and delivering the big data analytics solutions. However the oil and gas industry is going through a very tough time at the moment with continuing redundancies, so if that plan doesn't work out then I'd love to develop my own startup company (C L Data Tech) doing big data consulting in Scotland. It's a very exciting time to be in this field and it's moving so quickly - there are a lot of opportunities to work on exciting and challenging projects and there's lots of innovation going on. I'll hopefully get a chance to work with Urban Tide again in the future - which would be really great because there's a very good balance of young and enthusiastic, very bright, people there with fresh ideas along with more experienced people with sound wisdom and leadership skills. I have no doubt that, if things continue as they are, the U~SMART product that Urban Tide is building will be a real success... and I hope that my dissertation project will, in some way, contribute to that success.

 

Data Digest

Will machines help humans make big decisions in future? Chief Analytics Officers weigh in


In a report recently released by PwC entitled PwC’s Data and Analytics Survey 2016: Big Decisions™, it was revealed that “we’re at an inflection point where artificial intelligence can help business make better and faster decisions.” The report “shows that most executives say their next big decision will rely mostly on human judgment, minds more than machines. However, with the emergence of artificial intelligence, we see a great opportunity for executives to supplement their human judgment with data-driven insights to fundamentally change the way they make decisions.”

In our discussions with notable thought leaders in this space for the upcoming Chief Analytics Officer Forum Fall on 5-7 October in New York, we got deeper insight into how this trend is felt and viewed on the ground. For example, John France, Head of Sales Operations & Analytics at VALEANT PHARMACEUTICALS, sees that the opportunities are limitless. He said, “if there was a machine that could scan you at home and provide an instant reading on your health (heart, blood pressure, cholesterol, diet, etc.) and then deliver an action plan to correct it, such as what to eat that day, a workout regime, what meds to take, etc., that could really help save lives.”

On the other hand, John Lodmell, VP, Credit & Data Analytics, ADVANCE AMERICA, believes that the increasing collection of geographic information from cell phones and fitness trackers will open up a lot of big data opportunities around movement and traffic patterns. “If you think back to vehicle traffic studies putting “car counters” across the road to track every time a car crossed a certain point, we now have much richer data and collection techniques that are not fixed to tracking crossing a single point, but overall movement. I remember hearing years ago that a large retail store was tracking traffic patterns within its stores to see where customers went. It provided them tons of useful information around where to place promotional items and how to make things more convenient for customers to find. I can only imagine that type of information being used for urban planning or store marketing,” he said.

Read our full interviews by clicking on the links below:

Andrea L. Cardozo - Pandora 
Ash Dhupar - PCH 
Cameron J. Davies - NBC Universal 
Christina Hoy - WSIB 
Dipti Patel-Misra - CEP America 
Eric Daly - Sony Pictures 
Inkyu Son - Nexlend 
Jason Cooper - Horizon 
John France -  Valeant 
John Lodmell - Advance America 
Nikhil Aggarwal - Standard Chartered     


To learn more about Chief Analytics Officer Forum Fall, visit www.chiefanalyticsofficerforum.com 
 

September 27, 2016

Big Data University

This Week in Data Science (September 27, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

New in Big Data University

  • Text Analytics – This course introduces the field of Information Extraction and how to use a specific system, SystemT, to solve your Information Extraction problem.
  • Advanced Text Analytics – This course goes into details about the SystemT optimizer and how it addresses the limitations of previous IE technologies.

The post This Week in Data Science (September 27, 2016) appeared first on Big Data University.


BrightPlanet

How to Find the ‘Signal’ in the Noise of Open Source Data

Finding ‘signal’ through the noise of open source information is ultimately what will drive value to your organization. Whether this is to support sales, investment decisions, or the fight against fraud, corruption, IP theft, or terrorism, it all depends on identifying the ‘signal’ in the data. This article is intended to start a conversation about […] The post How to Find the ‘Signal’ in the Noise of Open Source Data appeared first on BrightPlanet.

Read more »

Revolution Analytics

The Financial Times uses R for Quantitative Journalism

At the 2016 EARL London conference, senior data-visualisation journalist John Burn-Murdoch described how the Financial Times uses R to produce high-quality, striking data visualisations. Until...

...
Data Digest

The Vibrant Big Data Market in Latin America


Predictions about the growth of the data analytics and Big Data market in Latin America have been quite positive in recent years. Gartner, Frost & Sullivan, and many other leaders in the field have estimated growth of close to 40% in the acquisition of solutions and the implementation of advanced data analytics tools over the next 4 years.

Certainly, the financial and insurance sectors are leading this phenomenon, whether out of the need to comply with regulatory agencies regarding the use and handling of information or through a clear initiative to monetize their data assets. However, other local sectors are already well advanced in modernizing and implementing corporate strategies to exploit their data.

For example, industries such as marketing, telecommunications, retail, and consumer goods manufacturing are showing greater interest in the possibilities that advanced analytics and Big Data offer them in economic terms over the medium and long term. This trend is directly reflected, on the one hand, in the emergence of a new role within the executive structures of these companies (Chief Data Officer or Chief Analytics Officer) and, on the other, in the implementation of technological modernization processes to improve real-time data capture and analysis.


Despite the importance of these initiatives, and after more than 50 direct conversations with senior executives in the region, it is clear that there is still a long way to go before the commercial benefits in any of these sectors become truly visible. The main reasons range from (1) technological modernization projects still being in their early stages; (2) the consolidation of private and public professional development programs; and (3) the divergence between local needs and the solutions available in the market.

Interestingly, the first two aspects have been quickly recognized, and several public and private players have joined forces to accelerate results. Technology infrastructure modernization and training projects are already at a very advanced stage of development. The third, however, is still far from what it could be. It is clear that the local and international solutions market is still adjusting to this organizational change, and vendors are often surprised by how knowledgeable the new CDOs and CAOs are about Big Data and data analytics.

It seems that few companies recognize that these regional leaders are the right, direct counterparts with whom they can establish potential commercial alliances. Likewise, few have taken the time to understand what their real needs are, what the best forms of joint collaboration might be, and what the shortest and most economical path to implementing lasting data exploitation programs could be.

For this reason, Corinium Global Intelligence has opened the first forum for direct discussion in the region, giving both sides the opportunity to meet and discuss the best solutions to resolve this gap. On January 24-25, 2017, the first Chief Data & Analytics Officer, Central America will take place at the Marquis Reforma in Mexico City, bringing together more than 100 specialists to discuss the operational and strategic aspects of implementing programs around the commercial exploitation of data. We look forward to seeing all the regional experts next year in Mexico!

By Alejandro Becerra:

Alejandro Becerra is the Content Director for LATAM/USA for the CDAO and CCO Forum. Alejandro is an experienced social scientist who enjoys exploring and debating with senior executives the opportunities and key challenges of enterprise data leadership, and creating interactive, discussion-led platforms that bring people together to address those issues and more. For enquiries email: alejandro.becerra@coriniumintelligence.com
Jean Francois Puget

CPLEX Is Free For Students (And Academics)


 

As part of our effort to make optimization pervasive, we made our mathematical optimization products free for academic use six years ago. Four years ago we removed license files, enabling teachers, researchers and university staff to use CPLEX offline. We are now making a further effort by allowing any student to use CPLEX for free.

 

We have also streamlined the registration process on the new Academic Initiative site.  If you use an email address provided by your institution, then registration on the new site should be instantaneous in most cases.  Mail and phone support for registration is available as well if needed.
In order to download it, please follow instructions on these pages:


We hope this will foster the use of state of the art mathematical optimization for both operations research projects and data science projects.

Thank you for using CPLEX.

Teradata ANZ

Understanding Your Customer’s Journey

Many of my recent discussions on the role of analytics in achieving business outcomes have been focussed on the customer and how companies can use analytics to better understand the unique set of interactions and transactions that make up the customer journey.

Designer Shoe Warehouse (DSW), a leading shoe retailer in the US, connects with customers and inspires front-line employees through analytics. In a case study video, Kelly Cook, Vice President of Customer Strategy and Engagement, talks about how DSW is benefiting from analytics: “… Customers have no problem telling you everything you could do better. Knowing what you need to do for customers allows you to understand what data is needed, so you’ll know what to attack first.”

Accenture has published its Retail Consumer Research 2016, and I simply love the cover page of the Executive Summary, which states, “Retail customers are shouting – are you adapting?”

In an earlier blog, I talked about The Connected Consumer and the fact that all consumers today leave breadcrumbs across all channels during their journeys with you. There are three sets of capabilities required to be able to listen, adapt and interact with your customers:

  • Connected Data – Enable insights from all types of data sources, offline, online and real time, by providing the ability to build complete customer profiles that power data-driven decisions.
  • Connected Analytics – Right time analytics and decisioning capabilities at all channels. Provide the ability to see and understand potential and actual outcomes.
  • Connected Interactions – Provide the ability to manage a consistent customer experience across all channels by continuing to enhance the integration between all online and offline customer interactions.

It is important to note that while the target capabilities are the same, each company will have different starting points and existing capabilities, as well as its own priorities and requirements, that must be factored in when designing the overall business solution.

bonprix, one of Germany’s largest clothing merchants, is able to micro-segment groups of customers based on shopping behaviour data such as reactions to coupon offers, rankings and website filters. Analysing the interactions helps bonprix determine how to tailor future digital and traditional marketing campaigns. Along with a significant uplift in key markets from the 1.5 billion targeted emails sent each year, bonprix has gained the ability to leverage existing data for improved forecasting, a deeper understanding of product return rates, fraud reduction, budget optimisation and other key business needs. As a result, analytics is providing a benefit back to the company – and to its customers.

As it should be, understanding the customer journey comes down to being connected with your customers. Connecting data, analytics and interactions helps you maintain and grow your relationships with your customers. In today’s world there are many more ways for your customers to transact and interact with you. Have you adapted how you listen and respond?

Learn more about Teradata’s Customer Journey Analytic Solution, a complete set of capabilities for discerning the behavioural paths of each individual customer, determining the next best interaction and delivering a consistent, personalised brand experience through every channel and touch point.

The post Understanding Your Customer’s Journey appeared first on International Blog.


Simplified Analytics

The Good, The Bad & The Ugly of Internet of Things

The greatest advantage we have today is our ability to communicate with one another. The Internet of Things, also known as IoT, allows machines, computers, mobile or other smart devices to...

...
 

September 26, 2016

Big Data University

Introducing Two New SystemT Information Extraction Courses

This article on information extraction is authored by Laura Chiticariu and Yunyao Li.

We are all hungry to extract more insight from data. Unfortunately, most of the world’s data is not stored in neat rows and columns. Much of the world’s information is hidden in plain sight in text. As humans, we can read and understand the text. The challenge is to teach machines how to understand text and further draw insights from the wealth of information present in text. This problem is known as Text Analytics.

An important component of Text Analytics is Information Extraction. Information extraction (IE) refers to the task of extracting structured information from unstructured or semi-structured machine-readable documents. It has been a well-known task in the Natural Language Processing (NLP) community for a few decades.

Two New Information Extraction Courses

We just released two courses on Big Data University that get you up and running with Information Extraction in no time.

The first one, Text Analytics – Getting Results with System T introduces the field of Information Extraction and how to use a specific system, SystemT, to solve your Information Extraction problem. At the end of this class, you will know how to write your own extractor using the SystemT visual development environment.

The second one, Advanced Text Analytics – Getting Results with System T goes into details about the SystemT optimizer and how it addresses the limitations of previous IE technologies. For a brief introduction to how SystemT will solve your Information Extraction problems, read on.

Common Applications of Information Extraction

The recent rise of Big Data analytics has led to reignited interest in IE, a foundational technology for a wide range of emerging enterprise applications. Here are a few examples.

Financial Analytics. For regulatory compliance, companies submit periodic reports about their quarterly and yearly accounting and financial metrics to regulatory authorities such as the Securities and Exchange Commission. Unfortunately, the reports are in textual format, with most of the data reported in tables with complex structures. In order to automate the task of analyzing the financial health of companies and whether they comply with regulations, Information Extraction is used to extract the relevant financial metrics from the textual reports and make them available in structured form to downstream analytics.

Data-Driven Customer Relationship Management (CRM). The ubiquity of user-created content, particularly on social media, has opened up new possibilities for a wide range of CRM applications. IE over such content, in combination with internal enterprise data (such as product catalogs and customer call logs), enables enterprises to understand their customers to an extent never possible before.

Beyond demographic information about individual customers, IE can extract important information from user-created content, allowing enterprises to build detailed profiles of their customers, such as their opinions towards a brand/product/service, their product interests (e.g. “Buying a new car tomorrow!” indicates the intent to buy a car), and their travel plans (“Looking forward to our vacation in Hawaii” implies intent to travel), among many other things.

Such comprehensive customer profiles allow the enterprise to manage customer relationships tailored to different demographics at fine granularity, and even to individual customers. For example, a credit card company can offer special incentives to customers who have indicated plans to travel abroad in the near future and encourage them to use credit cards offered by the company while overseas.

Machine Data Analytics. Modern production facilities consist of many computerized machines performing specialized tasks. All of these machines produce a constant stream of system log data. Using IE over the machine-generated log data, it is possible to automatically extract individual pieces of information from each log record and piece them together into information about individual production sessions. Such session information permits advanced analytics over machine data, such as root cause analysis and machine failure prediction.
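
As a rough illustration of this idea, here is a minimal sketch in plain Python (not SystemT itself; the log format and field names are invented for the example) that extracts structured fields from hypothetical machine log lines and groups them into per-session records:

import re
from collections import defaultdict

# Hypothetical log format, e.g. "2016-09-26 10:01:12 session=42 event=START machine=A7"
LOG_PATTERN = re.compile(
    r"(?P<ts>\S+ \S+) session=(?P<session>\d+) event=(?P<event>\w+) machine=(?P<machine>\w+)"
)

def extract_sessions(log_lines):
    """Extract structured records from raw log lines and group them by session id."""
    sessions = defaultdict(list)
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if match is None:
            continue  # skip lines that do not match the expected format
        record = match.groupdict()
        sessions[record["session"]].append(record)
    return sessions

logs = [
    "2016-09-26 10:01:12 session=42 event=START machine=A7",
    "2016-09-26 10:07:45 session=42 event=ERROR machine=A7",
    "2016-09-26 10:09:02 session=42 event=END machine=A7",
]
for session_id, events in extract_sessions(logs).items():
    print(session_id, [e["event"] for e in events])

Real extractors are of course far richer than a single regular expression, but the output has the same shape: structured records ready for downstream analytics such as failure prediction.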

A Brief Introduction to SystemT

SystemT is a state-of-the-art Information Extraction system. SystemT allows a variety of information extraction algorithms to be expressed declaratively, and automatically optimizes them for efficient runtime execution. SystemT started as a research project at IBM Research – Almaden in 2006 and is now commercially available as IBM BigInsights Text Analytics.

At a high level, SystemT consists of the following three major parts:

1. Language for expressing NLP algorithms. AQL (Annotation Query Language) is a declarative language that provides the powerful primitives needed in IE tasks, including:

  • Morphological Processing, including tokenization, part-of-speech detection, and finding matches of dictionaries of terms;
  • Other Core primitives, such as finding matches of regular expressions, performing span operations (e.g., checking if a span is followed by another span) and relational operations (unioning, subtracting, and filtering sets of extraction results);
  • Semantic Role Labeling primitives providing information at the level of each sentence, of who did what to whom, where and in what manner;
  • Machine Learning Primitives to embed a machine learning algorithm for training and scoring.
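
For readers new to rule-based information extraction, here is a minimal conceptual sketch in plain Python of two of the primitives listed above, dictionary matching and regular-expression matching, plus a simple span operation. It only illustrates the ideas; it is not AQL or SystemT syntax, and the example text, dictionary and spans are invented:

import re

text = "Contact Jane Doe at 555-123-4567 or visit our Hawaii office."

# Dictionary matching: find spans of known terms from a dictionary.
locations = {"Hawaii", "Beijing", "Edinburgh"}
dictionary_matches = [
    (m.start(), m.end(), m.group())
    for term in locations
    for m in re.finditer(re.escape(term), text)
]

# Regular-expression matching: find phone-number-like spans.
phone_matches = [
    (m.start(), m.end(), m.group())
    for m in re.finditer(r"\d{3}-\d{3}-\d{4}", text)
]

def followed_by(span_a, span_b, max_gap=40):
    """Span predicate: does span_b begin within max_gap characters after span_a ends?"""
    return 0 <= span_b[0] - span_a[1] <= max_gap

print("Dictionary matches:", dictionary_matches)
print("Phone matches:", phone_matches)
# Example span operation: is a phone number mentioned shortly after the person's name?
name_span = (8, 16)  # span of "Jane Doe" in the example text
print("Phone follows name:", any(followed_by(name_span, p[:2]) for p in phone_matches))

In AQL these building blocks are declared rather than hand-coded, which is what lets the system optimize how they are combined.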

2. Development Environment. The development environment provides facilities for users to construct and refine information extraction programs (i.e., extractors). The development environment supports two kinds of users:

  • Data scientists who may not wish to learn how to code can develop their extractors in a visual drag-and-drop environment loaded with a variety of prebuilt extractors that they can adapt for a new domain and build on top of. The visual extractor is converted behind the scenes into AQL code.


  • NLP engineers can write extractors directly in AQL. A simple example AQL statement is shown below. The language itself looks a lot like SQL, the language for querying relational databases, and many software developers’ familiarity with SQL helps them learn and use AQL.

[Figure: an example AQL statement]

3. Optimizer and Runtime Environment. AQL is a declarative language: the developer declares the semantics of the extractor in AQL in a logical way, without specifying how the AQL program should be executed. During compilation, the SystemT Optimizer analyzes the AQL program and breaks it down into specialized individual operations that are necessary to produce the output.

The Optimizer then enumerates many different plans, or ways in which individual operators can be combined together to compute the output, estimates the cost of these plans, and chooses one plan that looks most efficient.

This process is very similar to how SQL queries are optimized in relational database systems, but the optimizations are geared towards text operations, which are CPU-intensive, as opposed to the I/O-intensive operations of relational databases. This improves developer productivity: developers only need to focus on what to extract, and can leave the question of how to do it efficiently to the Optimizer.
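
A toy way to picture cost-based plan selection (a conceptual sketch only, not SystemT’s actual optimizer; the operators, costs and selectivities are made up): enumerate candidate orderings of the same operators, estimate a cost for each, and keep the cheapest. Running a cheap operator first and letting it filter the input tends to win, just as in relational query optimization.

# Each candidate plan lists operators in execution order with a rough per-document
# cost estimate and an estimated selectivity (fraction of documents that survive).
candidate_plans = {
    "regex_first": [("regex", 5.0, 0.10), ("dictionary", 1.0, 0.50)],
    "dictionary_first": [("dictionary", 1.0, 0.50), ("regex", 5.0, 0.10)],
}

def estimated_cost(plan):
    """Sum operator costs, discounting later operators by earlier selectivity."""
    total, surviving = 0.0, 1.0
    for _name, cost, selectivity in plan:
        total += surviving * cost
        surviving *= selectivity
    return total

best = min(candidate_plans, key=lambda name: estimated_cost(candidate_plans[name]))
for name, plan in candidate_plans.items():
    print(name, "estimated cost:", round(estimated_cost(plan), 2))
print("chosen plan:", best)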

Given a compiled extractor, the Runtime Environment instantiates and executes the corresponding physical operators. The runtime engine is highly optimized and memory efficient, allowing it to be easily embedded inside the processing pipeline of a larger application. The Runtime has a document-at-a-time execution model: it receives a continuous stream of documents, annotates each document, and outputs the annotations for further application-specific processing. The source of the document stream depends on the overall application.
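
To make the document-at-a-time model concrete, here is a minimal sketch in plain Python (not the actual SystemT runtime): a compiled extractor is modelled as a function applied to each document as it arrives, and the resulting annotations are handed to application-specific processing.

from typing import Callable, Dict, Iterable, List

def run_document_at_a_time(
    documents: Iterable[str],
    extractor: Callable[[str], List[Dict]],
    handle_annotations: Callable[[str, List[Dict]], None],
) -> None:
    """Annotate each document as it arrives and pass the results downstream."""
    for doc in documents:
        annotations = extractor(doc)          # run the compiled extractor on one document
        handle_annotations(doc, annotations)  # application-specific processing

# Toy usage: a trivial "extractor" that flags documents mentioning the word 'error'.
def find_errors(doc: str) -> List[Dict]:
    return [{"type": "Error", "text": doc}] if "error" in doc else []

docs = ["all good here", "error: sensor timeout", "error: overheated spindle"]
run_document_at_a_time(docs, find_errors, lambda d, a: print(d, "->", a))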

Advantages of SystemT

SystemT handles gracefully requirements dictated by modern applications such as the ones described above. Specifically:

  • Scalability. The SystemT Optimizer and Runtime engine ensure high-performance execution of extractors over individual documents. In our tests with many different scenarios, SystemT extractors run extremely fast on a variety of documents, ranging from very small documents such as Twitter messages of 140 bytes to very large documents of tens of megabytes.
  • Expressivity. AQL enables developers to write extractors in a compact manner, and provides a rich set of primitives to handle both natural language text (in many different languages) as well as other kinds of text such as machine-generated data or tables. A few AQL statements may be able to express complex extraction semantics that would require hundreds or thousands of lines of code. Furthermore, one can implement functionalities not yet natively available in AQL via User Defined Functions (UDFs). For instance, developers can leverage AQL to extract complex features for statistical machine learning algorithms, and in turn embed the learned models back into AQL.
  • Transparency. As a declarative language, AQL allows developers to focus on what to extract rather than how to extract it. It enables developers to write extractors in a much more compact manner, with better readability and maintainability. Since all operations are declared explicitly, it is possible to trace a particular result and understand exactly why and how it was produced, and thus to correct a mistake at its source. As a result, AQL extractors are easy to comprehend, debug and adapt to a new domain.

If you’d like to learn more about how SystemT handles these requirements and how to create your own extractors, enroll today in Text Analytics – Getting Results with System T and then Advanced Text Analytics – Getting Results with System T.

The post Introducing Two New SystemT Information Extraction Courses appeared first on Big Data University.


Revolution Analytics

Deep Learning Part 3: Combining Deep Convolutional Neural Network with Recurrent Neural Network

by Anusua Trivedi, Microsoft Data Scientist This is part 3 of my series on Deep Learning, where I describe my experiences and go deep into the reasons behind my choices. In Part 1, I discussed the...

...
 

September 25, 2016


Simplified Analytics

Why Data Scientist is top job in Digital Transformation

Digital Transformation has become a burning question for all the businesses and the foundation to ride on the wave is being data driven. DJ Patil & Thomas Davenport mentioned in 2012 HBR article,...

...
 

September 23, 2016


Revolution Analytics

Because it's Friday: Illusions at the Periphery

This fantastic optical illusion has been doing the rounds recently, after being tweeted by Will Kerslake: There are twelve black dots in that image, but I bet you can only see one or two of them at a...

...

Revolution Analytics

Microsoft R at the EARL Conference

Slides have now been posted for many of the talks given at the recent Effective Applications of the R Language (London) conference, and I thought I'd highlight a few that featured Microsoft R. Chris...

...
 

September 22, 2016

Silicon Valley Data Science

Noteworthy Links: September 22 2016

We’re at Enterprise Dataversity this week in Chicago, and next week we’ll be in NYC for Strata + Hadoop World. In the midst of this busy September, here are some articles we’ve come across and enjoyed.

MoMA Exhibition and Staff Histories—This open data set contains all exhibitions at MoMA from 1929–1989 (1,788 in all).

Agate: A Data Analysis Library for Journalists—Agate is a new Python library that claims to optimize “for the performance of the human who is using it.” Let us know if you’ve tried it out.

GitHub’s Project Management Tool—GitHub has a new, Trello-esque tool called Projects. We love new tools, and are intrigued by this one. Are you going to switch to Projects?

Tube Heartbeat—This visualization shows the “pulse” of London’s Underground, which is strangely relaxing to watch.

Creativity and Data Visualization—A group of artists is turning data into art through an exhibition called Visualizing the Invisible.

Want to keep up with what we’re up to? Sign up for our newsletter to get updates on blog posts, conferences, and more.

The post Noteworthy Links: September 22 2016 appeared first on Silicon Valley Data Science.

 

September 21, 2016

The Data Lab

Thomas Blyth, Business Development Executive

Thomas has over 16 years of experience in Business Development in the Oil & Gas Industry working with both small and large multi-national technology companies in the UK and Internationally.

Principa

Making the move from Predictive Modelling to Machine Learning

Everyone wants to learn more about how Machine Learning can be used in their business. What’s interesting, though, is that many companies may already be using Machine Learning to some extent without really realising it. The lines between predictive analytics and Machine Learning are actually quite blurred. Many companies will have built up some Machine Learning capability using predictive analytics in some area of their business. So if you use static predictive models in your business, then you are already using Machine Learning, albeit of the static variety.

The move from Predictive Modelling to Machine Learning can be easier than you think. However, before making that move you need to keep two key considerations in mind to ensure that you benefit from all that machine learning has to offer and that your predictive analytics system remains a trustworthy tool that lifts your business rather than harming it: Retraining Frequency and the Consequence of Failure.

 

September 20, 2016


Revolution Analytics

Welcome to the Tidyverse

Hadley Wickham, co-author (with Garrett Grolemund) of R for Data Science and RStudio's Chief Scientist, has focused much of his R package development on the un-sexy but critically important part of...

...

Revolution Analytics

Linux Data Science Virtual Machine: new and upgraded tools

The Linux edition of the Data Science Virtual Machine on Microsoft Azure was recently upgraded. The Linux DSVM includes Microsoft R, Anaconda Python, Jupyter, CNTK and many other data science and...

...

Rob D Thomas

The End of Tech Companies

“If you aren’t genuinely pained by the risk involved in your strategic choices, it’s not much of a strategy.” — Reed Hastings Enterprise software companies are facing unprecedented market pressure....

...

Revolution Analytics

How to choose the right tool for your data science project

by Brandon Rohrer, Principal Data Scientist, Microsoft R or Python? Torch or TensorFlow? (or MXNet or CNTK)? Spark or map-reduce? When we're getting started on a project, the mountain of tools to...

...

BrightPlanet

ACFE Fraud Conference Canada Recap: OSINT to Strengthen Risk Management

We promoted Tyson’s presentation at last week’s ACFE Fraud Conference in Montreal on our blog and now, we gathered some of his thoughts coming out of the event. From Tyson: The ACFE (Association of Certified Fraud Examiners) did an amazing job hosting and the venue was spectacular. If you have never been to Montreal, you need […] The post ACFE Fraud Conference Canada Recap: OSINT to Strengthen Risk Management appeared first on BrightPlanet.

Read more »
Big Data University

This Week in Data Science (September 20, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

New in Big Data University

  • Data Science Fundamentals Learning Path – When a butterfly flaps its wings, what happens? Does it fly away and move on to another flower, or is there a spike in the rotation of wind turbines in the British Isles? Come be exposed to the world of data science, where we are working to create order out of chaos that will blow you away!

The post This Week in Data Science (September 20, 2016) appeared first on Big Data University.

 

September 19, 2016


Revolution Analytics

YaRrr! The Pirate's Guide to R

Today is Talk Like A Pirate Day, the perfect day to learn R, the programming language of pirates (arrr, matey!). If you have two-and-a-bit hours to spare, Nathaniel Phillips has created a video...

...
Teradata ANZ

Why segment customers in a Big Data world?

Gary Comer, founder of mail order clothing retailer Lands’ End, once said “Think one customer at a time and take care of each one the best way you can”. The only way to implement this in the early 1960s, in the days of limited data and computing power, was to segment consumers into subsets with common needs, interests or priorities and then target them appropriately.

Fifty years on, there is an abundance of behavioral data about each customer: both transactional data indicating past responses to campaigns as well as interactions. Customer values and opinions are also shared on social media networks. Most importantly, there are now scalable supervised learning technologies that can link and analyse all of this granular data to create accurate predictive models. These changes have given marketers the ability to understand each customer’s unique needs and priorities, enabling accurate targeting of a single individual rather than a segment. Yet, as Rexer’s 2007 data miner survey shows, 4 out of 5 data miners still conduct segmentation analyses, i.e., unsupervised learning on data with sufficient information to perform supervised learning. And this is more frequently the case for those working with CRM/Marketing data, in other words, for targeting customers.

So why are most marketers still persisting with targeting segments rather than an individual?

There are a number of reasons for this including:
1. Choice of analytic platforms,
2. Campaign funding structures and
3. The fact that simplistic segmentation models are easier to sell to the senior management.

1. Choice of analytic platforms for CRM

The Figure below shows the results of the 2015 KDnuggets poll on computing resources for analytics and data mining. A whopping 85% of all data miners still use PC/laptop for their analysis (even though they may also use other platforms).

[Figure: 2015 KDnuggets poll on computing resources for analytics and data mining]

Now, treating each person as a unique individual requires understanding their preferences, needs and current priorities. A person is targeted if and only if all of their behavior, social conversations and current events of relevance indicate that the campaign would be of interest. Such focused targeting requires building specific predictive models for each marketing campaign, taking into account all of the customer’s transactions and interactions in all channels as well as any events of relevance.

Clearly, analysing such vast amounts of data in a timely manner is difficult or impossible with the PC/laptop, the analytic platforms used by most CRM analysts. Hence, the use of less resource-intensive approaches to support their campaigns, namely, segments based on the small demographic, economic and lifestyle datasets.

[Figure: gain chart for a model-targeted campaign]

2. Campaign funding structures

Predictive models for campaigns enable targeting of a very small population to achieve a high response rate (RR). The accompanying gain chart is typical of what is possible with models. Targeting 0.5% of the population in 200 such campaigns results in a contact rate of 100% (0.5% X 200) and an average RR of 40% (10% of 2% X 200).

Contacting the same number of people in fewer campaigns reduces the overall effectiveness. Yet, the minimum volumes needed to justify funding for campaigns mean that marketers have to contact more people in each campaign. This reduces the effectiveness of predictive models to the same level as that of simplistic segmentation models.

3. Segmentation is easier to sell 

When there are only five to seven non-overlapping segments, they can be explained to senior management with compelling visuals. Catchy segment titles such as “Indulgent traveler”, “Fashionista professional” and “Senior sippers” evoke images in our minds and it is then possible to mobilise funding for an entire program around that segmentation.

In comparison, predictive models that crunch 1000s of variables to then spit out likelihood scores are distinctly unexciting. Further, the process of crunching needs automation of CRM and investment in data science and that is not the province of the marketing staff controlling the funding. Marketing would much rather spend it on a marketing program based on a different subjective segmentation strategy.

So, how do we move away from just segmentation for CRM to embracing the most effective technique? The answer is to examine the data available to solve the business problem. Use segmentation if there is insufficient historical information to learn a predictive model and the only option is to use undirected learning.

Thus, segmentation would be the technique of choice when:
• Launching products that are completely different from anything previously sold,
• Exploring new markets with very different geo-demographics and
• Designing new products by understanding gaps in current offerings in current markets through a market research done across the whole population, not just the customer base.

For all other campaigns predictive models would be the choice. This combination of supervised learning for regular campaigns and segmentation for new ones will ensure long term viable CRM.
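
As a minimal sketch of that decision rule (assuming pandas and scikit-learn, with an invented customer table and column names), the snippet below fits a supervised propensity model when past campaign responses are available and falls back to unsupervised segmentation when they are not:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical customer table: behavioural features plus, for past campaigns,
# a 0/1 "responded" column. All names and values are invented for the example.
customers = pd.DataFrame({
    "spend_last_90d": [120, 40, 300, 15, 220, 80],
    "visits_last_90d": [4, 1, 9, 1, 6, 3],
    "responded": [1, 0, 1, 0, 1, 0],   # drop this column for a brand-new product
})
features = customers[["spend_last_90d", "visits_last_90d"]]

if "responded" in customers:
    # Sufficient history: supervised learning, score each individual customer.
    model = LogisticRegression().fit(features, customers["responded"])
    customers["propensity"] = model.predict_proba(features)[:, 1]
else:
    # No history (new product or new market): fall back to segmentation.
    customers["segment"] = KMeans(n_clusters=2, n_init=10).fit_predict(features)

print(customers)

The point is not the particular algorithms but the decision: individual-level propensity scores whenever the historical labels exist, segments only when they do not.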

The post Why segment customers in a Big Data world? appeared first on International Blog.

Data Digest

Driving the CX Agenda: Who’s Behind the Wheel?


“Have a very good reason for everything you do” – Laurence Olivier

How does your customer experience look under the glare of your customers' expectations? Olivier’s sentiment cries out for justification, to put the thought process behind every business decision impacting the customer up in neon lights for reflection. What would be revealed?

According to PwC’s 2016 Global CEO Survey ‘Customers remain the top priority, with 90% of CEOs indicating they have a high or very high impact on their business strategy’. But how is that translating into existing CX strategy? The Survey states that ‘customer behaviour, in particular, has become more complicated as values and buying preferences evolve.’

Undoubtedly this rapidly evolving environment makes customer centricity a cornerstone, but who has stepped out from the shadows to ensure it stays firmly in the spotlight?

Customer advocates, Chief Customer Officers, are active in the boardroom, championing the cause of the customer and putting in place the strategy to promote change, inter-discipline collaboration, organisational alignment and customer-centric decision making.

NAB announced in July 2016 that they are creating not one but three Chief Customer Officer roles.

However, the role of a customer advocate will look very different across organisations, and backgrounds vary significantly between individuals. When we take a closer look at how Chief Customer Officers have arrived at their destination, we get a better flavour of the complex nature and diverse remit of the CCO role.

For example Julie Batch was appointed as the Chief Analytics Officer at Insurance Australia Group (IAG) in July 2014. By December 2015, Ms. Batch was heading up IAG’s Customer Labs as Chief Customer Officer, responsible for developing customer propositions and marketing strategies. For IAG, customer experience strategy is intrinsically linked to driving product innovation through data and insights. A natural progression for a CAO.

For Carsales.com.au, their Chief Customer Officer, Vladka Kazda, was Chief Marketing Officer at the company for over five years before arriving at the CCO position. Ms Kazda owned and influenced customer experience at every level during her journey to CCO, so it was a logical move.

For others, a natural rise in the ranks via customer experience roles has seen them awarded the CCO role. Damian Hearne, Chief Customer Officer at Auswide Bank, has excelled in the leadership qualities required of a CCO to unite across silos and move the business from delivering an uncoordinated experience to a reliable, deliberate and preferred customer experience.

Mark Reinke, Chief Customer Experience Officer, Suncorp, has also united the critical elements of customer, data and marketing. The customer listening path is critical but alone, it cannot deliver. It needs the proactive and innovative advocate with the leadership skills to drive initiatives. 

CCOs often have a broad remit, but primarily the requirement to develop the competency to operationalise the brand promise: looking at the language, prioritisation and decision-making, bringing together operating groups, transforming the collaboration process, and implementing the customer experience design. There is both faith and science behind the Chief Customer Officer.

Ultimately everyone in the business is involved in putting the customer first, but employee customer advocates are only fostered from a successful customer centric culture. What metrics are being used to measure the impact and success of a CCO?

To learn more about driving change, overcoming the challenges and critically measuring the success of the CCO, join Julie Batch, Vladka Kazda, Damian Hearne, and Mark Reinke as they share their insights at Chief Customer Officer Sydney, 28-29 November 2016

Learn how other organisations are addressing their CX challenges, learn about new approaches and strategies, whilst making new connections with industry peers. Join The Chief Customer Officer Forum LinkedIn group here.
 

September 18, 2016


Simplified Analytics

What is Cognitive Computing?

Although computers are better for data processing and making calculations, they were not able to accomplish some of the most basic human tasks, like recognizing Apple or Orange from basket of fruits,...

...
 

September 16, 2016


Revolution Analytics

Because it's Friday: A big chart about climate change

The problem with representing change on a geologic timescale is just that: scale. We humans have only been around for a tiny fraction of the planet's history, and that inevitably colours our perceptions of changes that occur over millennia. That's one of the things that make the climate change debate so difficult: it's hard to represent the dramatic changes in the climate over the last 200 years in the context of climate history. (Deliberate FUD also has a lot to do with it.) It's difficult just to chart temperatures over that timescale, because the interesting part to us (human history) gets lost in the expanse of time. As a result, most representations of climate data are either compressed or truncated, which dampens the impact.

Randall Munroe has figured out a clever way to demonstrate the dramatic impact of modern climate change in a recent issue of XKCD: simply plot the history of global temperature since the last ice age in one really, really tall chart -- liberally decorated with the usual XKCD humour, of course:

[Excerpt from the XKCD climate timeline chart]

That's just one tiny excerpt of the chart; you really need to click through and scroll to the end to appreciate its impact.

That's all from us here at the blog for this week. Have a great weekend, and we'll see you back here on Monday. Enjoy!



Revolution Analytics

Reflections on EARL London 2016

The Mango Solutions team have done it again: another excellent Effective Applications of R (EARL) conference just wrapped up here in London. The conference was attended by almost 400 R users from...

...
Data Digest

Top 10 Takeaways at the Chief Data & Analytics Officer Melbourne 2016


Feeling empowered and inspired after a fantastic three-day conference in Melbourne last week with over 200 data enthusiasts! The topics were vast and the speakers kept everyone engaged with their wealth of knowledge and stories shared. A massive thank you to all our speakers, sponsors and attendees – we learnt lots and had a lot of fun!

Here are my top 10 takeaways from the conference:



1. On leveraging open data for social good: we can all be superheroes! Thank you Jeanne Holm, City of Los Angeles for your inspiring stories of open data being used to improve the world we live in.


2. On building a culture for data governance: “Work with the willing and win hearts” Kate Carruthers, Chief Data Officer, University of New South Wales

3. On establishing a data quality framework: create a team brand that represents value add to your organisation, and keep your dq metrics simple and powerful - Michelle Pinheiro, IAG



4. On ensuring the success of your data analytics projects: agile, agile, agile, AGILE!

5. On IT and business alignment– it’s simple, says World Vision International's John Petropoulos, if your partner doesn’t get it, you need to re-write it!
   
6. Big data personalisation = world class, machine learning predictive model @Woolworths
   
7. On data privacy – and where is that creepy line? A key point for consideration from Brett Woolley, NAB is content vs intent of personal information used.
   
8. On developing a cost model for data governance… you really ought to check out Gideon Stephanus du Toit’s presentation: http://bit.ly/2cvewwW
   
9. On leveraging machine learning for safer flights into Queenstown – a very cool use case from Mark Sheppard, GE Capital



10. On marketing analytics: “don’t fall for vanity metrics” Geoff Kwitko, Edible Blooms

Thanks again all, and we look forward to catching up in Sydney on 6-8 March 2017!

To discuss the Chief Data & Analytics Officer Sydney 2017 event and speaking/ sponsorship opportunities, please get in touch: monica.mina@coriniumintelligence.com   


By Monica Mina:

Monica Mina is the organiser of the CDAO Melbourne, consulting with the industry about their key challenges and trying to find exciting and innovative ways to bring people together to address those issues. For enquiries, monica.mina@coriniumintelligence.com.
 

September 15, 2016

Silicon Valley Data Science

Jupyter Notebook Best Practices for Data Science

Editor’s note: Welcome to Throwback Thursdays! Every third Thursday of the month, we feature a classic post from the earlier days of our company, gently updated as appropriate. We still find them helpful, and we think you will, too! The original version of this post can be found here.

The Jupyter Notebook is a fantastic tool that can be used in many different ways. Because of its flexibility, working with the Notebook on data science problems in a team setting can be challenging. We present here some best-practices that SVDS has implemented after working with the Notebook in teams and with our clients—and that might help your data science teams as well.

The need to keep work under version control, and to maintain shared space without getting in each other’s way, has been a tricky one to meet. We present here our current view into a system that works for us—and that might help your data science teams as well.

Overall thought process

There are two kinds of notebooks to store in a data science project: the lab notebook and the deliverable notebook. First, let's look at the organizational approach to each kind of notebook.

Lab (or dev) notebooks:

Let a traditional paper laboratory notebook be your guide here:

  • Each notebook keeps a historical (and dated) record of the analysis as it’s being explored.
  • The notebook is not meant to be anything other than a place for experimentation and development.
  • Each notebook is controlled by a single author: a data scientist on the team (marked with initials).
  • Notebooks can be split when they get too long (think turn the page).
  • Notebooks can be split by topic, if it makes sense.

Deliverable (or report) notebooks

  • They are the fully polished versions of the lab notebooks.
  • They store the final outputs of analysis.
  • Notebooks are controlled by the whole data science team, rather than by any one individual.

Version control

Here’s an example of how we use git and GitHub. One beautiful new feature of Github is that they now render Jupyter Notebooks automatically in repositories.

When we do our analysis, we do internal reviews of our code and our data science output. We do this with a traditional pull-request approach. When issuing pull-requests, however, the differences between updated .ipynb files are not rendered in a helpful way. One solution people tend to recommend is to commit the conversion to .py instead. This is great for seeing the differences in the input code (while jettisoning the output). However, when reviewing data science work, it is also incredibly important to see the output itself.

For example, a fellow data scientist might provide feedback on the following initial plot, and hope to see an improvement:

[Figure: initial plot with a poor fit]

[Figure: improved plot with a better fit]

The plot on the top is a rather poor fit to the data, while the plot on the bottom is better. Being able to see these plots directly in a pull-request review of a team-member’s work is vital.

See the Github commit example here.

Note that there are three ways to see the updated figure (options are along the bottom).

Post-save hooks

We work with many different clients, and some of their version control environments lack these nice rendering capabilities. There are options for deploying an instance of nbviewer behind the corporate firewall, but sometimes that still is not an option. If you find yourself in this situation, and you want to maintain the above framework for reviewing code, we have a workaround. In these situations, we commit the .ipynb, .py, and .html of every notebook in each commit. Creating the .py and .html files can be done simply and automatically every time a notebook is saved by editing the jupyter config file and adding a post-save hook.

The default jupyter config file is found at: ~/.jupyter/jupyter_notebook_config.py

If you don’t have this file, run jupyter notebook --generate-config to create it, then add the following text:

c = get_config()
### If you want to auto-save .html and .py versions of your notebook:
# modified from: https://github.com/ipython/ipython/issues/8009
import os
from subprocess import check_call
def post_save(model, os_path, contents_manager):
    """post-save hook for converting notebooks to .py scripts"""
    if model['type'] != 'notebook':
        return # only do this for notebooks
    d, fname = os.path.split(os_path)
    check_call(['jupyter', 'nbconvert', '--to', 'script', fname], cwd=d)
    check_call(['jupyter', 'nbconvert', '--to', 'html', fname], cwd=d)
c.FileContentsManager.post_save_hook = post_save

Run jupyter notebook and you’re ready to go!

If you want the .html and .py files to be saved only when using a particular “profile,” it’s a bit trickier, as Jupyter no longer uses the notion of profiles.
First, create a new profile via the command line:

export JUPYTER_CONFIG_DIR=~/.jupyter_profile2
jupyter notebook --generate-config

This will create a new directory and file at ~/.jupyter_profile2/jupyter_notebook_config.py. Then run jupyter notebook and work as usual. To switch back to your default profile, set the environment variable back (by hand, with a shell function, or in your .bashrc): export JUPYTER_CONFIG_DIR=~/.jupyter.

Now every save to a notebook updates identically-named .py and .html files. Add these in your commits and pull-requests, and you will gain the benefits from each of these file formats.

Putting it all together

Here’s the directory structure of a project in progress, with some explicit rules about naming the files.

Example directory structure

- develop # (Lab-notebook style)
 + [ISO 8601 date]-[DS-initials]-[2-4 word description].ipynb
 + 2015-06-28-jw-initial-data-clean.html
 + 2015-06-28-jw-initial-data-clean.ipynb
 + 2015-06-28-jw-initial-data-clean.py
 + 2015-07-02-jw-coal-productivity-factors.html
 + 2015-07-02-jw-coal-productivity-factors.ipynb
 + 2015-07-02-jw-coal-productivity-factors.py
- deliver # (final analysis, code, presentations, etc)
 + Coal-mine-productivity.ipynb
 + Coal-mine-productivity.html
 + Coal-mine-productivity.py
- figures
 + 2015-07-16-jw-production-vs-hours-worked.png
- src # (modules and scripts)
 + __init__.py
 + load_coal_data.py
 + figures # (figures and plots)
 + production-vs-number-employees.png
 + production-vs-hours-worked.png
- data (backup-separate from version control)
 + coal_prod_cleaned.csv
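If it helps, a filename following the lab-notebook convention above can be generated with a couple of lines of Python; this is just a convenience sketch (the initials and description are whatever the author chooses), not part of the workflow itself.

import datetime
import re

def lab_notebook_name(initials, description):
    """Build '[ISO 8601 date]-[initials]-[short description].ipynb'."""
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return "{}-{}-{}.ipynb".format(datetime.date.today().isoformat(), initials, slug)

# lab_notebook_name("jw", "initial data clean")
# -> e.g. "2015-06-28-jw-initial-data-clean.ipynb" (date = today)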

Benefits

There are many benefits to this workflow and structure. The first and primary one is that they create a historical record of how the analysis progressed. It’s also easily searchable:

  • by date (ls 2015-06*.ipynb)
  • by author (ls 2015*-jw-*.ipynb)
  • by topic (ls *-coal-*.ipynb)

Second, during pull-requests, having the .py files lets a person quickly see which input text has changed, while having the .html files lets a person quickly see which outputs have changed. Having this be a painless post-save-hook makes this workflow effortless.

Finally, there are many smaller advantages of this approach that are too numerous to list here—please get in touch if you have questions, or suggestions for further improvements on the model! For more on this topic, check out the related video from O’Reilly Media.

The post Jupyter Notebook Best Practices for Data Science appeared first on Silicon Valley Data Science.

The Data Lab

Calling all Data Scientists to join our "Office Hours" initiative

We would like to invite data scientists and academics to our offices at 15 South College Street, Edinburgh EH8 9AA, to come together and discuss problems, share insights, and meet like-minded technicians in a relaxed and informal environment. From 9:30, attendees will each give a very brief presentation to the group (around 5 minutes each), and are afterwards welcome to use our offices for follow-up discussions, networking, or simply to do their day job!

Please follow the link to the application form. For this first iteration, spaces are limited to fifteen, so places will be allocated at The Data Lab’s discretion. We intend to run these days regularly with increased capacity across various locations in Scotland, so please give us some information on topics of interest to help us plan future days.

If you have any queries please do not hesitate to contact us at science.group@thedatalab.com.

 

Principa

What is Machine Learning?

Here's a blog post covering some of the most frequently asked questions we get on Machine Learning and Artificial Intelligence, or Cognitive Computing. We start off with "What is Machine Learning?" and finish off with addressing some of the fears and misconceptions of Artificial Intelligence.

So, what is machine learning? A simple search on Google for the answer will yield many definitions that leave most non-analytical people confused and entering more "What is..." queries into Google. So, I asked our Head of Marketing to try his hand at defining Machine Learning in the simplest way he could: explain Machine Learning to someone you've just met at a social gathering. Here's his definition, a "Machine Learning for Beginners" definition if you will.

 

September 14, 2016


David Corrigan

Now You See Me, Now You Don’t

The Trials & Tribulations of the Anonymous Customer I bought an office chair from an office retailer a few months ago.  Seeing as I was buying something I wanted vs. something I needed...

...
Jean Francois Puget

Machine Learning From Weather Forecast

Using weather data in machine learning is promising.   For instance, everyone knows that weather forecast influences buying patterns, be it for apparel, food, or travel.  Wouldn't it be nice to capture weather forecast effect on these?

All we need to do is use weather data in addition to the other data we have, then use our favorite machine learning toolbox.

It sounds simple, but there is a catch. I'll try to explain what it is below.

For the sake of simplicity we will assume we want to predict future sales, but what follows applies to any situation where we want to use weather forecast as part of a machine learning application.

If you omit weather data, then sales forecasting is a classical problem for which several statistical and machine learning techniques can be applied, for instance ARIMA.  These techniques deal with time series: past sales data is ordered by time, and the model extrapolates the time series.  In a nutshell, the model finds trends in past sales data and applies those trends to the current data.

One way to assess model accuracy is to run it against historical data and compare predictions with actual sales.  For this you must use historical data that was not used when creating the model; see Overfitting In Machine Learning if it is not clear why.

Assuming you use held-out historical data, you would, for each week w, run your model with all weeks w_i such that w_i < w as input, then compare the output of the model with the actual sales at week w. If your model is good, you would get something like this, where the predicted values are close to the actual values:

[Figure: predicted vs. actual weekly sales]

You can then use the model to predict future sales.
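To make that backtest concrete, here is a minimal Python sketch of the week-by-week procedure, assuming weekly sales live in a pandas Series and that statsmodels is installed; ARIMA is used only because it is mentioned above as one possible technique.

import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

def rolling_backtest(weekly_sales, order=(1, 1, 1), min_history=52):
    """For each held-out week w, fit on all earlier weeks and predict week w."""
    predictions = {}
    for i in range(min_history, len(weekly_sales)):
        history = weekly_sales.iloc[:i]                      # all weeks before week w
        fitted = ARIMA(history, order=order).fit()
        one_step = np.asarray(fitted.forecast(steps=1))[0]   # forecast for week w
        predictions[weekly_sales.index[i]] = float(one_step)
    return pd.Series(predictions)

# predicted = rolling_backtest(sales)                        # sales: weekly pd.Series
# error = (predicted - sales.loc[predicted.index]).abs().mean()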

If you want to predict with weather forecast, your model will take two time series as input:

  • Past sales
  • Past weather forecast

and it will output sales forecast for the coming week.

The issue is that past forecasts aren't generally available.  Most weather data providers store past observed weather, but they don't store past forecasts because it would require far more storage capacity.

The usual way to get around the lack of past weather forecasts is to approximate them using past observed weather: the weather forecast for past week w becomes the observed weather for week w.

While this seems appealing, it creates an issue. Indeed, by using actual weather in place of the weather forecast, we assume a perfect forecast.  But when we use our model on current data, the input will be a current weather forecast, which is unlikely to be perfect.  Our model will assume it is perfect anyway, and may rely on it too much.  This can negatively impact our sales forecast accuracy.

The cure is to really use past forecasts.  Unfortunately, as noted above, this data generally isn't stored.  One way to cope with its absence is to reconstruct it by running weather forecasting models on past weather data: for each week in the past, run the weather forecast model with the previous weeks as input and store the result.  This is what our weather forecast team at IBM does using Deep Thunder technology (wikipedia link).  It is the rich man's solution: expensive, but quite effective.

If you cannot reconstruct past weather forecasts, a poor man's solution is to add noise to past weather data when you use it as a proxy for the weather forecast.  That way you no longer assume perfect weather forecasts, and your model will probably be better off.
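As a rough illustration of that poor man's solution, here is a small Python sketch that perturbs observed weather before using it as a stand-in for a forecast; the noise level is a made-up parameter that you would ideally calibrate against the typical error of the forecast provider you plan to use at prediction time.

import numpy as np

def pseudo_forecast(observed_weather, noise_std=2.0, seed=0):
    """Degrade observed values so the model cannot learn to trust a 'perfect' forecast.
    noise_std is a hypothetical error level (e.g. degrees for temperature)."""
    rng = np.random.default_rng(seed)
    return observed_weather + rng.normal(0.0, noise_std, size=len(observed_weather))

# train on pseudo_forecast(past_temperatures) rather than on past_temperatures themselves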

Choose whichever solution you want, but do not use raw past weather as a proxy for past weather forecasts; it will lead to disappointing results.



Revolution Analytics

How an algorithm behind Deep Learning works

There are many algorithms behind Deep Learning (see this comparison of deep learning frameworks for details), but one common algorithm used by many frameworks is Convolutional Neural Networks (CNNs)....

...
Teradata ANZ

Hey! Are Your Colleagues Cheating The Business?


If your organisation is ignoring its Big Data potential, your colleagues are cheating the business because data-driven companies get a much higher return on investment than their competitors.

Fact.

And the key to maximising your data potential also happens to be the fourth ‘V’ in Big Data: ‘Value’. Or rather, ‘Value’ spliced with ‘Time-to-Market’.

Now, for the majority of businesses, data-driven competition presents a massive challenge. Particularly if they take the traditional, often tortuous approach to the development of products and services. Typically, it can take a ‘last-gen’ IT team anything from six months to a year (or more) to deliver the goods, leaving their hamstrung c-suite helpless in the face of more agile pace setters.

The quickest route to business value

In that kind of torpid environment, scope creep often leads to widespread under-delivery (numbers, KPIs, insights, etc.). Also, the team have to face up to the thorny question of what happens if development fails to confirm the original business case. Or what if the investment isn’t written off immediately (a real dollar-burner)?

Clearly, a fresher, more vigorous approach is required. A radical approach where innovation is the catalyst for a sustainable reduction in development costs, deadlines, and the time-to-market for insights, as well as an increase in business value.

Because in markets where Big Data and IoT, as well as disruptive technologies and business models, are triggering unprecedented change, accelerating the time-to-market for your data insights can make the difference between success and failure.

It pays to know what you don’t know… yet

Therefore, you need to know what’s worth and not worth pursuing as soon as possible so you can divert resources to the most profitable and productive business areas.

This calls for a kind of thrash-metal approach to R&D: researching deeper and developing faster than ever before. Which is not as crazy as it sounds. This discovery-driven approach follows the same fail-fast principles adopted by life sciences companies (new drug development), McDonald’s (new recipes), and most other organisations developing new and uniquely successful IP.

Try. Fail. Repeat. Try. Fail. Repeat. Try, fail, and repeat until you find a data-driven business case that carries enough real value to warrant operationalisation.

Delivering actionable insights – faster

Okay, so a fail-fast approach provides a springboard for improved and sustainable profitability. The trouble is it also causes great upheaval, and that can hit hard. Your whole operation could need transforming.

Scary, huh? Actually, it’s not as difficult to get your head around as you might think. Especially as data science is on hand to help you discover the business value buried in your data.

And that’s the point. To get more business value, you need to know your data in detail. Then visual analytics can turbo-charge your knowledge, providing clarity and a shorter time-to-market en route to a brand new rapid-insight development cycle.

Creating a culture of analytics

The quicker you can identify the business value in your data, the greater (potentially) the ROI. To this end, visual analytics cut straight to the money shot, offering an easy method of fast tracking discovery and breaking down data complexity. And creative visualisations of the results, like the images in the Art of Analytics collection*, make the implications more readily understood.

That said, a method is just a method and to flourish, your organisation needs to establish a pioneering culture; a culture of analytics which encourages the whole team to think outside the box. That means assigning the right people and embedding the right analytical and fail-fast processes, while enabling them with the right technology, the right methodology, and the right incentives (no one should be afraid of failing).

The Art of Analytics

The Art of Analytics* is a series of pictorial data visualisations drawn from real-world use cases and presented as works of art in their own right. Alongside each image are details of the practical benefits gained by the organisation concerned. These stunning artworks are the product of intensive analytical work aimed at creating extra business value and solving real business problems. The visualisation process involves interrogating diverse types and quantities of data with a tailored mix of analytical tools and methods.

Check out the practical benefits of the Art of Analytics in ‘Visual Analytics Beat Even Oprah For Working Out Complex Relationships. #1: Les Misérables’ – the next in this series of Art of Analytics blogs.

This post first appeared on Forbes TeradataVoice on 27/05/2016.

The post Hey! Are Your Colleagues Cheating The Business? appeared first on International Blog.

 

September 13, 2016


BrightPlanet

How to Find and Harvest Dark Web Data from the TOR Network

The Internet is constantly changing, and that’s more apparent on the TOR Network than anywhere else. In this section of the Dark Web, you can see a URL one minute and it’ll be gone the next. These fly-by-night URLs make it challenging to collect TOR data, but it’s important to stay on top of Dark […]

The post How to Find and Harvest Dark Web Data from the TOR Network appeared first on BrightPlanet.

Read more »
Teradata ANZ

How about trying complex Data Analytics Solutions for size and fit, before you splash the cash?

The pressure to deliver Alpha – potential above the market average – is immense. And that’s not surprising because in a quick-change disruptive world, Alpha consistency puts an organisation at the sharp edge of its digitally-driven market. If it can be done quickly, that is.

One of the biggest stumbling blocks to creating breakthrough insights that drive value though, has been the time it takes to realise or monetise the business potential in data. But what if I said that instead of waiting months or even years for project results you could deliver Alpha, fully, within 6-10 weeks?

And what if I told you we could predict the business value of analytic solutions before you shell out a ton of money on the technology and other resources?

What would that be worth to your organisation? To you, personally?

The RACE for business value in data

Predicting outcomes is complicated. A myriad of things that complicate the deployment and business use of analytic solutions need to be taken into consideration, such as new data sources (including IoT sensor data) and new analytic techniques. Yet, in spite of this giant basket of variables, the potential ROI and strategic business impact of any analytic solution is expected to be totted-up and delivered to the table before any money changes hands.

Which is where Teradata’s RACE approach comes in. Teradata’s agile, technology-agnostic process, RACE (Rapid Analytic Consulting Engagement), has been developed to complement both agile development (e.g. CRISP) and agile methodology (e.g. SCRUM).

Crossing the Business – IT divide

The RACE process also soothes a number of old wounds. Business departments pass their needs and ideas onto their analysts, who simply respond. And, whereas IT departments have their processes, business and their analysts don’t really have a way of streamlining business value identification before hitting IT with a development request.

Often (surprise, surprise), business departments don’t understand the analytical potential of data. At the same time, neither the analysts nor the IT department understand business processes and ideas. Consequently, business thinks IT is too slow; IT feel they are not taken seriously and have no clue about how the business is really run.

One of the great things about RACE is that it fuses business and IT together through its leadership and commitment model. At the same time it enables both sides to learn intensively from one another.

RACEing involves three primary phases:

  1. Align – together, business and IT identify and align the highest-potential-value use cases, and validate the availability of key data assets to support the use case.
  2. Create – data scientists load and prepare the data developing new, or applying existing, analytic models to the selected use cases. This phase involves rapid iterations with the business to ensure the analytic insights hit the right business targets.
  3. Evaluate – business and the analysts / data scientists analyse the results and document the potential ROI of deploying the analytic use cases at scale, as well as developing a deployment recommendation.

RACE leverages multi-genre analytics to generate new business insights, reducing time to market (it takes an average of 6 weeks to validate the ROI of the new business insights) and minimising deployment risk (generated insights act as a prototype for operationalisation). Oh yes, it identifies the Alpha by validating use-case business potential, too.

And the upshot is that you begin each project with a clear ROI roadmap which answers three burning business questions: “How?”, “Where?”, and “What will it be worth?”.

What’s not to like?

The post How about trying complex Data Analytics Solutions for size and fit, before you splash the cash? appeared first on International Blog.

Big Data University

This Week in Data Science (September 13, 2016)

Here’s this week’s news in Data Science and Big Data.

Don’t forget to subscribe if you find this useful!

Interesting Data Science Articles and News

Upcoming Data Science Events

Cool New Courses

The post This Week in Data Science (September 13, 2016) appeared first on Big Data University.

Teradata ANZ

DevOps Decoded: Modular Design

I have been reading and thinking a lot about DevOps recently, specifically in the area of development/test/deployment automation and how it would be best applied to building analytic solutions.

Like metadata, I believe DevOps is coming into its prime, with the advancements in open source and resurgence in programming combining to provide all of the enabling components to build an automated delivery pipeline.

In a simple delivery pipeline you will have a number of stages that the code moves through before final deployment into a production environment.


Code is created by a developer or generated by a tool and committed into a source code repository. The change is detected and triggers the packaging process.

QA checks are performed and a deployment package is built. This contains the set of changes (delta) which must be applied to the production environment to deploy the new code.

The package is then deployed into a series of environments where testing is executed and the package is automatically promoted throughout the test environments based on the “green light” results of the previous round of testing.

Test results are monitored throughout the entire process and failures are detected and reported back to the developer immediately for correction. The faster you can find and report on bugs, the easier it is for the developer to fix the issue, as it will still be “top of mind” and requires least effort to remedy.

Finally packages are deployed into production, either automatically (continuous delivery) or as part of a scheduled release.
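As a purely schematic sketch of that flow, the Python below strings the stages together; the stage functions, environment names, and risk ratings are stand-ins for illustration, not the API of any particular CI/CD tool.

TEST_ENVIRONMENTS = ["system-test", "integration", "uat"]

def build_package(commit):
    print("QA checks + building delta package for", commit)
    return {"commit": commit}

def deploy(package, environment):
    print("deploying", package["commit"], "to", environment)

def run_tests(package, environment):
    print("running test suite in", environment)
    return True                                    # a "green light" result

def run_pipeline(commit, risk="low"):
    package = build_package(commit)
    for env in TEST_ENVIRONMENTS:                  # promote only on green light
        deploy(package, env)
        if not run_tests(package, env):
            print("failure: report back to the developer immediately")
            return False
    if risk == "low":
        deploy(package, "production")              # continuous delivery
    else:
        print("queueing", package["commit"], "for a scheduled release")
    return True

run_pipeline("change-123", risk="low")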

An automated delivery pipeline like this looks simple on the surface, but as soon as you peel back the covers and look into the detail you quickly realise that there is a lot of complexity contained within.

Some of the challenges are technical in nature (“how do I generate delta changes to a database object?”), while others are firmly in the design/architecture realm (“how do I perform continuous integration on a monolithic data maintenance process?”).

Whilst I am not going to be able to explore all of these issues within this article, I would like to discuss the core principle which I believe is the key to solving this complex puzzle – modular design.

Modular design certainly is not a new concept; indeed, it has been used in traditional software development for many years. Microservices are a great example of modular design in action, and I believe they will play a far greater role in analytic solution development in the future.

Many analytic solutions (e.g. Data Warehouse, Data Lake, etc.) will have a modular design to some degree, but most do not extend the concept down to a low enough level to enable the benefits of continuous integration and delivery that DevOps automation can provide.

The monolithic process design encapsulates multiple functions within a single object, commonly a script or piece of SQL code.

To test this object, we must supply object definitions for all inputs, test data for all inputs and expected results for the output.

Testing this single component does not provide any particular challenges when done in isolation, however when integration test requirements are considered the limitations of monolithic design become apparent.

Consider the case where the output of this process is consumed by 2 downstream processes.

For integration testing, we must also test the downstream processes to ensure that they still produce expected results. The testing scope has now increased significantly and this may reduce the frequency of our integration testing, based on the elapsed time of executing the full suite of tests required.

Organisations that have implemented automated deployment pipelines often report that integration testing is only done overnight, as the end-to-end elapsed time is “hours”.

When implementing automated deployment into an existing environment this is going to be the starting position, as you must incorporate the existing monolithic processes and implement improvements over time.

A modular process design can be thought of as a monolithic, complex process decomposed into a series of simple, atomic functions. In the agile world this is analogous to taking a user story and decomposing it into a number of tasks.

We now have four separate objects implementing the previous monolithic process.

In general, there will be a reduced number of inputs to the object, as only the inputs necessary for that specific function are required.

Output objects will tend to be reused in subsequent functions, and there will be an increased number of working/temporary objects to store intermediate results.

The number of objects and associated artifacts (code, scripts, object definitions, test data, etc…) has increased, while the complexity of each artifact has decreased.

Managing the artifacts is a perfect candidate for automation. Many of the artifacts can be generated using templates and customised with parameters.

Unit testing now has a much reduced scope: if we make a change to the business rule, we just need to test its inputs and output objects.

Each individual test will be simpler, and it will be easier to expand the test coverage. The elapsed time of testing will be shorter, allowing more frequent testing.
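To make the decomposition tangible, here is a toy Python sketch; the function names echo the labels used in this post's example (resolve keys, apply business rule, apply changes), but the data shapes and the rule itself are invented purely for illustration.

def extract_delta(source_rows, existing_keys):
    """Atomic function: keep only rows not already present in the target."""
    return [r for r in source_rows if r["id"] not in existing_keys]

def resolve_keys(rows, key_map):
    """Atomic function: translate source ids to surrogate keys."""
    return [{**r, "sk": key_map[r["id"]]} for r in rows]

def apply_business_rule(rows, threshold=100):
    """The single rule under change: flag large transactions."""
    return [{**r, "large": r["amount"] > threshold} for r in rows]

def apply_changes(target, rows):
    """Publish the result for downstream consumers."""
    target.extend(rows)
    return target

def test_apply_business_rule():
    """Testing the changed rule now needs only its own inputs and output."""
    out = apply_business_rule([{"id": 1, "amount": 150}], threshold=100)
    assert out[0]["large"] is True

Changing the threshold in apply_business_rule now touches one small function and one small test, rather than the whole monolithic script.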

How does this impact the scope of integration testing? The impact on external processes in our example should be constrained to the case where we change the “apply changes” function, as that is the only point where output is produced ready for use by other processes. In this case we must integration test that function and any consumers.

Let us assume that the output table of our changed function is used in both the “resolve keys” and “apply business rule” functions of the dependent processes.

We now have what I call a “minimum testable component” for integration testing, which ensures that all dependent processes are tested, but also keeps the scope of the testing to a minimum.

There will tend to be fewer permanent source objects involved in the testing, the functions tested are simple, the individual tests are simple but comprehensive, and the elapsed time will be the minimum possible.

This is the road to the holy grail of continuous integration testing, where the end to end elapsed time of the testing fits within a small enough window that the testing can be performed on demand, for every change committed into the source code repository.

Continuous integration enables continuous delivery, where every change has the potential of automatically flowing through the pipeline to be deployed into production.

Organisations I have seen that have implemented this do not let all changes automatically deploy to production, or indeed let all changes proceed with only the minimum testable component for integration testing; that is an aspirational goal rather than normal practice.

Changes are rated according to risk – high, medium & low. Low risk changes can be automatically deployed to production, given all testing succeeds. Medium risk changes have some manual governance checks applied before deployment, and may trigger off integration testing with expanded scope. High risk changes are typically implemented in a scheduled release, with supporting resources (developers and operations) close at hand.

The goal is to minimise the risk of each change. Over time development teams will realise that they can deploy much faster by implementing many simple, low risk changes rather than a smaller number of monolithic high risk changes.

Following modular design techniques will help you to maximise the number of low risk changes, enabling agility in delivery of new functionality within an analytics environment which traditionally has been viewed (and built) as a large monolithic application.

Modular design will also help you to start to unravel the complexities of automating your deployment pipeline, with every small step forward providing benefits through increased code quality, reduction in production defects and faster time to market for new functionality.

The post DevOps Decoded: Modular Design appeared first on International Blog.

 

September 12, 2016


Revolution Analytics

2016 Data Science Salary Survey results

O'Reilly has released the results of the 2016 Data Science Salary Survey. This survey is based on data from over 900 respondents to a 64-question survey about data-related tasks, tools, and the...

...

Revolution Analytics

Volunteer to help improve R's documentation

The R Consortium, in its most recent funding round, awarded a grant of $10,000 to The R Documentation Task Force, whose mission is to design and build the next generation R documentation system....

...
Principa

How Marketers can use Machine Learning to boost Customer Loyalty

Thanks to mobile technology, wearable devices, social media and the general pervasiveness of the internet, an abundance of new customer information is now available to marketers. This data, if leveraged optimally, can create opportunities for companies to better align their products and services to the fluctuating needs of a demanding market space.


Simplified Analytics

Digital Transformation - Top 5 challenges to overcome

The Digital Tsunami is moving at a rapid pace, encompassing all aspects of business and society. It touches every function of a business from purchasing, finance, human resources, operations, IT and...

...
Data Digest

17 Quotes on Big Data and Analytics that Will Open Your Eyes to Reality


There are times when perception is not a clear representation of reality. Take for example the topic of Big Data and Analytics. The perception is that this ushers in a brave new world where there is actionable intelligence, on-demand data and sexy graphs and charts popping up on our computer screens on the fly. While this could be a reality for some, this is certainly not the case for many – at least, not yet.

In the course of our conversations with noted Chief Data Officers, Chief Analytics Officers and Chief Data Scientists for our conferences and events, some priceless gems of knowledge have been uncovered. Knowledge that can only come from people who’ve actually been on the front lines and understand Big Data and Analytics, warts and all. Here are 17 quotes that will inspire you in your own journey and open your eyes to reality.

1. The reality that ‘real-time’ is not necessarily good all the time.


2. The reality that data governance is an absolute must.


3. The reality that personalization is the name of the game.


4. The reality that CDOs/CAOs/CDSs need to be leaders more than anything.


5. The reality that acceptance to data-driven decision making takes consistent effort.


6. The reality that a CEO buy-in is absolutely important.


7. The reality that the success of Big Data relies heavily on people.


8. The reality that change can only happen when you change yourself.


9. The reality that data and analytics cannot succeed if it’s used as a tool for punishment.


10. The reality that data analytics is not an option. It’s a must.


11. The reality that cool is in.


12. The reality that data breach comes from all angles.


13. The reality that quick wins must happen for long term gain to be sustained.


14. The reality that ‘data ownership’ is passé in today’s environment.


15. The reality that privacy must be ensured at all times.


16. The reality that Big Data will be available to everyone and not just a few select individuals.


17. The reality that expectations today are greater than ever before.


To learn more about Big Data, Analytics and Digital Innovation or to attend our upcoming conferences and meet the leading Chief Data Officers, Chief Analytics Officers and Chief Data Scientists, visit www.coriniumintelligence.com 
 

September 09, 2016


Revolution Analytics

Because it's Friday: The Happy Files

We've looked before at how performing a cheerful song in a minor key makes it mournful. Now, here's the other side of that same coin: the X-File title theme, played in a major key, sounds downright...

...

Revolution Analytics

A predictive maintenance solution template with SQL Server R Services

by Jaya Mathew, Data Scientist at Microsoft By using R Services within SQL Server 2016, users can leverage the power of R at scale without having to move their data around. Such a solution is...

...

Revolution Analytics

The palettes of Earth

Take a satellite image, and extract the pixels into a uniform 3-D color space. Then run a clustering algorithm on those pixels, to extract a number of clusters. The centroids of those clusters then make a representative palette of the image. Here's the palette of Chicago:

[Image: the palette of Chicago]

The R package earthtones by Will Cornwell, Mitch Lyons, and Nick Murray — now available on CRAN — does all this for you. Pass the get_earthtones function a latitude and longitude, and it will grab the Google Earth tile at the requested zoom level (8 works well for cities) and generate a palette with the desired number of colors. This Shiny app by Homer Strong uses the earthtones package to make the process even easier: it grabs your current location for the first palette, or you can pass in an address and it geolocates it for another. That's what I used to create the image above. (Another Shiny app by Andrew Clark shows the size of the clusters as a bar chart, but I prefer the simple palettes.) There are a few more examples below, and you can see more in the earthtones vignette. If you find more interesting palettes, let us know where in the world you found them in the comments.

 

[Image: the palette of Broome, Australia]


[Image: the palette of the middle of Qatar]

Will Cornwell (github): earthtones
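earthtones itself is an R package, but if you want to experiment with the general pixel-clustering idea in Python, a rough sketch might look like the following. It assumes Pillow and scikit-learn are installed, uses plain RGB rather than a perceptually uniform color space for brevity, and works on any image file you supply ("satellite.png" here is a placeholder).

import numpy as np
from PIL import Image
from sklearn.cluster import KMeans

img = Image.open("satellite.png").convert("RGB").resize((200, 200))   # downsample for speed
pixels = np.asarray(img, dtype=float).reshape(-1, 3)                  # one row per pixel (R, G, B)

kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(pixels)
palette = kmeans.cluster_centers_.round().astype(int)                 # cluster centroids = the palette
print(palette)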


Revolution Analytics

In case you missed it: August 2016 roundup

In case you missed them, here are some articles from August of particular interest to R users. An amusing short video extols the benefits of reproducible research with R. A guide to implementing a...

...
 

September 08, 2016


BrightPlanet

Tyson Johnson Session Speaker at ACFE Fraud Conference Canada

Our own Tyson Johnson will be speaking at next week’s ACFE (Association of Certified Fraud Examiners) Fraud Conference Canada. The conference is taking place September 11-14 in Montreal, QC. Tyson’s session is titled Building an Online Anti-Fraud Open Source Monitoring Program. In the session he’ll cover how fraud examiners are becoming more adept at using open sources […]

The post Tyson Johnson Session Speaker at ACFE Fraud Conference Canada appeared first on...

Read more »

Revolution Analytics

The elements of scaling R-based applications with DeployR

If you want to build an application using R that serves many users simultaneously, you're going to need to be able to run a lot of R sessions simultaneously. If you want R to run in the cloud, you...

...
Silicon Valley Data Science

Image Processing in Python

Editor’s note: This post is part of our Trainspotting series, a deep dive into the visual and audio detection components of our Caltrain project. You can find the introduction to the series here.

The first step in developing our Caltrain project was creating a proof of concept for the image processing component of the device we used to detect passing trains. We’re big fans of Jupyter Notebooks at SVDS, and so we’ve created a notebook to walk you through that proof of concept.

Check out the notebook here

You can also download the .ipynb version and do some motion detection of your own. Later blog posts in this series will cover making this algorithm robust enough for real-time deployment on a Raspberry Pi.
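If you want a taste of the idea before opening the notebook, here is a bare-bones frame-differencing sketch in Python, assuming OpenCV is installed and a camera (or video file) is available. It is not the SVDS notebook itself, just the simplest possible version of motion detection.

import cv2

cap = cv2.VideoCapture(0)                          # or a path to a video file
ok, first = cap.read()
prev = cv2.GaussianBlur(cv2.cvtColor(first, cv2.COLOR_BGR2GRAY), (21, 21), 0)

for _ in range(300):                               # look at a few hundred frames
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.GaussianBlur(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), (21, 21), 0)
    delta = cv2.absdiff(prev, gray)                # pixel-wise difference between frames
    _, mask = cv2.threshold(delta, 25, 255, cv2.THRESH_BINARY)
    if cv2.countNonZero(mask) > 5000:              # crude "something moved" trigger
        print("motion detected")
    prev = gray

cap.release()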

Let us know any questions in the comments below, or share which pieces of the Caltrain project you’re most interested in. If you’d like to keep up with our activities, please check out our newsletter.

The post Image Processing in Python appeared first on Silicon Valley Data Science.

Principa

From Credit-Worthy to Target-Worthy: How Predictive Scoring is being used in Marketing

As a Marketer or Customer Engagement professional, imagine the cost-savings if you knew who in your database or lead list were likely to be the most profitable customers or most likely to respond? Would you bother mailing a list of a million contacts if you knew that only 100,000 of those contacts were “worthy” of your campaign and very likely to respond?

Innovation is not necessarily the invention of something new, but be the result of finding a new use for an existing product, service, methodology or practice. Take the use of predictive scoring in Marketing.

Data Digest

Chief Analytics Officer Survey: 57% say 'culture' a key barrier in advancing data and analytics strategy


We have recently conducted a survey of the Chief Analytics Officer Forum attendees to find out some of the key issues facing them and their solutions investment plan in the next 12-24 months. In this survey, it was revealed that 57% of respondents found ‘Driving cultural change’ as the biggest barrier to advancing data and analytics strategy. This was closely followed by ‘Integration of new technology with legacy systems’ (52%) and ‘Getting buy in from business units’ (41%).

The result of this survey tallies with key industry findings. In an article entitled ‘4 strategies for driving analytics culture change’ published by CIO.com, it was boldly declared that “Culture change is hard.” It continues: “The solution lies in a mix of tooling and analysis and information delivery architecture. Often, culture changing strategies can fall flat because they approach the problem from purely a tooling perspective. Vendors offering such tools paint a rosy picture of how the right tool can change the culture and behavior within an organization. However, the problem often is more complex.”


Meanwhile, the Chief Analytics Officer Survey also looked at the respondents’ investment plans and a whopping 70% of all respondents (81% of whom are C-Suite Decision Makers) revealed that they plan to invest in Data Analytics solutions in the next 12-24 months, followed by Predictive Analytics solutions (64%) and Business Intelligence tools (62%). It’s also interesting to note that 68% of them choose solutions at Conferences and Events.

More of our findings are in the infographic below:


 For more information on how you can join our upcoming events and conferences or if you're interested in sponsorship opportunities, visit www.coriniumintelligence.com   
Teradata ANZ

Analyzing your analytics

‘Eat your own dog food’ and ‘practice what you preach’. I love these sayings and I strongly support the idea behind them. If you believe in something, prove it. In the world of data and analytics, we’ve been able to ignore this wisdom for a long time. The time for change has come though! We are going to analyze the analytics.

In fact, understanding how data is used and what kind of analysis is done might just become a key capability in successfully utilizing company data. Of course, analytics have been the subject of analysis for a long time now. Most organizations have nice trending overviews of their data growth, the daily run times of the ETL jobs, and how many queries were run per application, per user, and per department. We need to take it to the next level though – just like we augmented the traditional (and still very important) dashboards and KPIs with self-service BI and advanced analytics.

Growth and self-service
Because of the rise in self-service analytics, a more efficient analysis of data usage is necessary. The sort of trending analysis mentioned above was sufficient to get a good idea of what’s going on when the realm of data was controlled by data engineers, with strict processes and managed governance. But the more you open your data to savvy business analysts, data scientists, app builders and others, the more you lose sight of who is doing what with which data and for which purpose. And, by all means: we don’t want to burden our creative people with a pile of administrative paperwork. That’s where the metadata comes in.

Follow the trail
All this experimental work leaves its traces. When analyzed properly, these traces can tell a lot about what’s going on. They can provide insight into which data is used the most, and in combination with what other data. Do people perform their own additional transformations when using the data? Is data simply exported, or is it used the way it was initially intended? These insights can also shed light on who will get angry if some data is deleted.
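As a toy example of following that trail, here is a Python sketch that mines a query log for table usage and co-usage; the CSV file name and its "sql" column are hypothetical, and a real query log will need its own export and parsing.

import re
from collections import Counter
from itertools import combinations

import pandas as pd

log = pd.read_csv("query_log.csv")                 # hypothetical export with a "sql" column

table_pattern = re.compile(r"(?:from|join)\s+([\w\.]+)", re.IGNORECASE)

table_counts = Counter()
pair_counts = Counter()                            # which tables are used together
for sql in log["sql"]:
    tables = sorted(set(table_pattern.findall(sql)))
    table_counts.update(tables)
    pair_counts.update(combinations(tables, 2))

print(table_counts.most_common(10))                # most-used tables
print(pair_counts.most_common(10))                 # most common table combinations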

If possible, all analytics should be performed pro-actively. If the results are made available in an open, searchable way, everyone can profit – this kind of information can be valuable to a wider range of professionals than just the DBAs and the data engineers. If an analyst discovers that everyone else is using a different join condition than he does, it might be good for him to find out why this is the case.

Homegrown solution
As far as I know, the number of organizations that have already started analyzing their analytics is still limited. In the few active cases, homemade solutions are used to manage the analytics. This shows that there is still a long way to go. Nonetheless, there are already a couple of tools popping up addressing this need. A personal favorite of mine is Alation. This tool manages to bring more value from the available metadata. An example of how this is done is the use of Google’s PageRank algorithm to determine which tables are most important. Besides the automated analysis of logs and the like, it also aims to make the results useful by augmenting them with lots of user input. This makes it a very useful tool for collaboration – another key aspect for being successful in the world of unlimited analytical freedom.
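To illustrate the PageRank idea in general (and not how Alation works internally), here is a small Python sketch that ranks tables by running PageRank over a made-up graph of tables that appear together in queries, assuming networkx is installed; the co-usage counts could come from log mining like the sketch above.

import networkx as nx

# Hypothetical co-usage counts: (table_a, table_b, number of queries joining them)
co_usage = [
    ("sales", "customers", 120),
    ("sales", "products", 95),
    ("customers", "web_visits", 40),
    ("products", "suppliers", 12),
]

G = nx.Graph()
for a, b, weight in co_usage:
    G.add_edge(a, b, weight=weight)

scores = nx.pagerank(G, weight="weight")           # higher score = more "central" table
for table, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(table, round(score, 3))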

Investing in data
Earlier, I stated that analyzing your own analytics is necessary to get a competitive edge out of using your data. Without this input, it will be impossible to efficiently manage the provisioning and management of all your data. As the move towards an analytical ecosystem with multiple platforms/techniques keeps gaining more steam – and with it the increasing complexity of organizing data – you need all the help you can get. Using these analytics to find out which data is used often and which isn’t, will help you to decide which data should be moved to your cheaper storage platform. Finding out on which data users apply much extra transformation logic can tell you where more modeling is needed. And last but not least, data that is frequently copied or exported may simply reside on a platform with the wrong capabilities.

Just like companies use web analytics to continuously improve their website (based on what they learn from analyzing the web visits), companies need to improve their data by examining the way it is being used. Truly understanding how your data is used requires facts and smart analytics. Lots of them.

 

The post Analyzing your analytics appeared first on International Blog.

decor