A BIG DATA PROJECT IS FULL OF PITFALLS
Many pitfalls, and not only on the IT side! During my visit to the Big Data Salon, held in Paris on March 6 and 7, 2017, I was able to attend 14 feedback sessions, which were both informative and practical. Here is a summary of that visit, built around the five crucial points you should not neglect when you get started.
#DATAMASTER: FORMATTING THE DATA, THAT’S 80% OF THE PROJECT!
Yes, you read that correctly. 80% of the time is spent doing ETL (Extract, Transform, Load) in order to obtain a workable data set for statistical analysis. This time is needed to map the data sources, retrieve them, analyse their structures (from an IT point of view), then clean and format the data and arbitrate between contradictory or inconsistent values.
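To give an idea of what cleaning, formatting and arbitrating look like in practice, here is a minimal sketch in Python. Everything in it is hypothetical and chosen for illustration only: the two sources (`crm_rows`, `billing_rows`), the field names, and the arbitration rule (the most recently updated non-empty value wins). A real project would need rules agreed with the business.

```python
from datetime import date

# Hypothetical extracts from two sources describing the same customers.
crm_rows = [
    {"id": "001", "email": " Alice@Example.com ", "city": "paris", "updated": "2017-01-15"},
    {"id": "002", "email": "bob@example.com", "city": "", "updated": "2016-11-02"},
]
billing_rows = [
    {"id": "1", "email": "alice@example.com", "city": "Lyon", "updated": "2017-02-20"},
    {"id": "3", "email": "not-an-email", "city": "Lille", "updated": "2017-03-01"},
]


def clean(row):
    """Clean and format one raw row; return None if it is unusable."""
    email = row["email"].strip().lower()
    if "@" not in email:  # deliberately crude validity check
        return None
    return {
        "id": int(row["id"]),                         # "001" and "1" become the same key
        "email": email,
        "city": row["city"].strip().title() or None,  # empty string means unknown
        "updated": date.fromisoformat(row["updated"]),
    }


def merge(*sources):
    """Arbitrate contradictions: the most recent non-empty value wins."""
    rows = []
    for source in sources:
        for raw in source:
            row = clean(raw)
            if row is not None:
                rows.append(row)

    merged = {}
    # Oldest first, so that newer rows overwrite older values field by field.
    for row in sorted(rows, key=lambda r: r["updated"]):
        current = merged.setdefault(row["id"], {})
        current.update({k: v for k, v in row.items() if v is not None})
    return merged


customers = merge(crm_rows, billing_rows)
for customer_id, record in sorted(customers.items()):
    print(customer_id, record)
```

Even in this toy case, three decisions had to be made: how to reconcile identifiers, what counts as an invalid row, and which source wins when two disagree. Multiply that by the number of fields and sources, and the 80% figure becomes easier to believe.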
And this comes with the Kiss Cool® double effect!
First effect, the “ping pong”:
- Who should be entrusted with this task? A data scientist freshly graduated from Polytechnique? A C++ developer with 10 years of Front Office experience? A third-grade intern? In general, the ADR (Average Daily Rate) of a data scientist is higher than that of a developer. Is this data scientist ready to spend more than three quarters of their time doing ETL? And yet, qualitatively speaking, who can do it better?
- How can all this time be justified to the project sponsors, who are usually not computer scientists and are unaware of how poor their data quality is? Lulled by the enchanting promises of Big Data solution vendors, these sponsors are convinced that they are sitting on a treasure and that all that remains is to reap the promised ROI.
Then, the second effect:
- The attention of the business can drop sharply, because your project then enters what can be called a tunnel. How should you communicate to keep the decision-makers interested in your project without exacerbating their impatience?
- If your project uses perishable data (for example, daily stock exchange prices), can you deliver data that is ready for analysis and that still makes sense, and still has business value, once the algorithm is ready?
- Will the TTM (Time To Market) be met once the actual weight of this work is taken into account?
Moral: the ROI (return on investment) calculation for a Big Data project must not underestimate this crucial data qualification stage, which can be fatal if it drags on.
#CYBERSECURITY: SKIPPING SECURITY, NO WAY!
Most companies only start to worry about security in one of two situations:
- Either when external or internal compliance has shown a red card.
- Or, and this is the worst case, when hackers (or not very friendly state agencies) get hold of the data, which is then called a data breach.
These obligations are reinforced by the GDPR (the European Union’s General Data Protection Regulation).
From May 25, 2018, you will have to report any data leak to the CNIL within 72 hours. For the record, fines can reach 4% of the company’s annual worldwide turnover or EUR 20 million, whichever is higher. Security must therefore be at the heart of the Big Data infrastructure. And I did write “must”. We have to stop believing that security can wait until after go-live. It must be taken into account by everyone on the project from the design stage (privacy by design) and throughout the project’s life (accountability).
In summary, your 3 priorities: Security, Security AND SECURITY!
#PROFILING: COPY THE NSA, BAD STRATEGY!
This is another topic on which authorities like the CNIL could hold you accountable. If you cross your internal data with external data (open data…), you can potentially obtain a finer level of detail on the profiles of your customers, subscribers and prospects, and your algorithms can then be considered profiling. Here again, you are subject to the GDPR.
As you know, since the French Data Protection Act (Loi Informatique et Libertés), there is a lot of information that must not go into data analysis, such as religion, skin colour and many others. Moreover, you need to be extremely careful with data sets that come from other countries, such as the USA, which do not have the same regulations and are less restrictive about analysing religious or ethnic data, among others.
You should always work hand in hand with the legal department, compliance and, tomorrow, the DPO (Data Protection Officer) to avoid a misstep that could prove very costly, both financially and in terms of image…
#TRANSPARENCY: A MODEL IS NOT A BLACK BOX!
More and more choices affecting people will be made tomorrow by algorithms. You need to understand the implications of these models and be able to explain them.
For example, deep learning (neural networks with several hidden layers) is very powerful, but it is still not easy to explain in detail how it works. Scientists are still carrying out theoretical research to understand it better and therefore master it better. To learn more, read: Why does deep and cheap learning work so well?
Users will increasingly ask to understand the choices made by these algorithms. Remember the high school students who demanded to know the algorithm behind post-baccalaureate admissions. This is part of society’s demand for greater transparency, and it is entirely legitimate.
Another famous example is the model developed during a contest organized by Netflix to improve its video recommendations. Admittedly, the winning team won $1 million. However, their model never went into production. Blame a cost of industrialization that was far too high! To understand why, see: Why Netflix Never Implemented the Algorithm That Won the Netflix $1 Million Challenge
One last point, and not the least: the trap of confusing correlation with causality. Sometimes you find correlations without being able to explain them. A firm working for a railway company reached this conclusion: incidents on train doors are correlated with incidents on train brakes. The data scientist could not explain the relationship between these two types of failure, which affect parts with no mechanical or functional connection. If you work on predictive maintenance, you have to be very careful about this, and it may even keep you awake at night…
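One classic way such a correlation can arise is a hidden common factor. The sketch below, in Python, simulates this on purely synthetic data. It is not the railway company’s data, and the common factor used here (`mileage`) is a hypothetical of mine, not the explanation of the case above. It only shows that two series with no direct link can look strongly correlated, and that the correlation fades once the common factor is accounted for.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What remains of ys after removing a least-squares linear fit on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return [(y - mean_y) - slope * (x - mean_x) for x, y in zip(xs, ys)]


rng = random.Random(42)  # fixed seed so that the run is reproducible

# Hypothetical hidden factor: how heavily each train has been used (0 to 1).
mileage = [rng.random() for _ in range(2000)]

# Synthetic incident scores: each depends on mileage plus its own noise,
# and neither depends on the other.
door_incidents = [5 * m + rng.gauss(0, 1) for m in mileage]
brake_incidents = [5 * m + rng.gauss(0, 1) for m in mileage]

raw = pearson(door_incidents, brake_incidents)
partial = pearson(residuals(door_incidents, mileage),
                  residuals(brake_incidents, mileage))

print(f"correlation doors/brakes:           {raw:.2f}")
print(f"after removing the effect of usage: {partial:.2f}")
```

In a real project you rarely know the hidden factor in advance, which is exactly why the people who know the business must be involved in validating the model.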
A good algorithm is an algorithm that can be explained.
#TEAM: DO NOT SEARCH FOR THE FIVE-LEGGED SHEEP!
Do not waste your time looking for a data scientist who can do everything alone in a corner. A Big Data project requires complementary skills. The following profiles will be involved over the successive stages:
- Architects: they will select and qualify the technical solutions and propose an architecture
- Developers: they will handle the data ETL and the industrialization needed to go into production
- Data scientists: they will analyse the data and propose models
- Business experts: they will validate the model, or invalidate it if there is no tangible explanation in terms of causality
- IT security specialists, to prevent data leaks (RSSI, the IT Security Officer)
- Compliance profiles, both legal and technical, to put the right safeguards in place (Risk Manager, Compliance Officer, and tomorrow the highly anticipated CDO, Chief Data Officer)
A Big Data project faces many pitfalls, of different kinds. You need to build a team with diverse profiles and a strong ability to communicate internally. Objective: keep the project from being shut down. Main threats: a poorly run internal project (deadline and budget overruns, loss of interest from decision-makers…) in a poorly controlled regulatory environment (CNIL, European Union, users…).
In summary, the 5 prerequisites:
#Deliver formatted data within a limited time
#Do not forget about security
#Always work with compliance
#Be able to explain your model
#Know how to mix profiles to build a strong team
And you, what do you think? Do you have any experience to share in the comments after reading this article? And if you like, we can talk about it over a cup of coffee.