This blog post is part of the Big Data Week interview series, featuring speakers at the Big Data Week conference as well as influencers in the Big Data space. We asked our participants to answer a list of questions based on the main topics of the conference agenda.
In this interview we chatted with Ben Lorica, Chief Data Scientist @OReillyMedia and Director of Content Strategy @strataconf. We wanted to get Ben’s view on one of the hot topics of our conference, the data scientist role, as well as his take on the domains where Big Data innovation is happening and on what 2016 will bring for the data market.
What matters is what you want to do with the data
Big Data is such a buzzword today, mainly because it applies to every business domain, and some people react very strongly to hearing it. Still, although we had lots of data even 100 years ago, the amount of data generated has boomed in the past few years.
What does Big Data stand for, in your opinion?
Ben Lorica: “People now kind of take Big Data for granted in many ways, in the sense that they have to cope with it at some point. In terms of definition, I consider that the old one still stands: Big Data is an intersection of many factors, like data sets of different types (variety), real time and then massive volume – an umbrella of techniques and approaches. Sometimes you take the volume into account, but sometimes it is the variety that is very important (structured/semi-structured data).
I think that, besides the definition, what matters is what you want to do next with this data: being able to extract value. Because even with small-scale data people still need to derive information, whether it’s for real-time or near-real-time reporting or for something more complex.”
Smart Cities – using data to empower people
Since we are saying that what you do with data matters more than the definition, what innovative uses of Big Data have you seen lately? From your position at O’Reilly you get to see a lot of cases around the world and discover a lot of local innovations.
Which areas do you think benefit the most from Big Data today?
Ben Lorica: “One of the areas I consider interesting is Smart Cities, which I see as a concrete manifestation and application of the Internet of Things. The widespread availability of sensors, combined with high-speed communication networks, constitutes a real instrument for gathering data. Add to that real-time big data platforms that can ingest massive amounts of data at a high rate. People are also improving the algorithms that extract intelligence from this information. Transportation and transit in cities would benefit greatly from this. Face-to-face meetings, which are still one of the most powerful ways of communicating, would become easier if transit and transportation in cities could be optimized. Smart Cities is about empowering citizens to be more mobile, so I believe it is one of the exciting areas of Big Data.
Staying on the subject of Smart Cities, even collecting the city information in one place allows you to do amazing things like pattern mining and correlation analysis. Some cities, like New York, have started to do this. Data lakes also come into play here, gathering all the information in one place.”
Other areas of innovation?
Ben Lorica: “Metadata – the power of understanding your data. Metadata might seem boring, and it has long been used in big data warehouses, but now people have started talking about the need for open and vendor-neutral metadata services. There are some startups emerging in this field, as well as some nice projects coming out of UC Berkeley.
Another interesting area is structured data extraction, where you take unstructured data and try to extract parts of it and convert them into a database table. Imagine tools that take web pages, mine all of the structured information and turn it into a queryable format. There is a startup called ClearCut Analytics that does amazing things in this domain.”
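To make the idea of structured data extraction concrete, here is a minimal sketch using only Python's standard library. It is not ClearCut Analytics' method or any particular product's approach; the sample page, the table name `readings` and the helper names are all invented for this illustration. It pulls the rows of an HTML table out of a page and loads them into an in-memory SQLite database so they can be queried with SQL.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical page fragment, invented for this illustration.
SAMPLE_HTML = """
<html><body>
<h1>City sensors</h1>
<table>
  <tr><th>station</th><th>reading</th></tr>
  <tr><td>Broadway</td><td>42</td></tr>
  <tr><td>Canal St</td><td>17</td></tr>
</table>
</body></html>
"""


class TableExtractor(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row currently being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside table cells (headings, whitespace) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_rows(html):
    """Return the table rows found in the page, header row first."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


def load_into_sqlite(rows):
    """Load the extracted rows (minus the header) into an in-memory table."""
    _header, *records = rows
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (station TEXT, reading INTEGER)")
    conn.executemany(
        "INSERT INTO readings VALUES (?, ?)",
        [(station, int(reading)) for station, reading in records],
    )
    return conn


conn = load_into_sqlite(extract_rows(SAMPLE_HTML))
# The page content is now queryable like any other database table.
high = conn.execute(
    "SELECT station FROM readings WHERE reading > 20"
).fetchall()
print(high)
```

Real web pages are far messier than this sample, so production tools have to deal with nested tables, missing cells and inconsistent layouts; the sketch only shows the basic page-to-table step.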
The data scientist job is one you mostly grow into
There is a long-running debate, in Europe at least, about the data scientist job description and the scarcity of trained people to meet growing market demand. There are also voices saying that a data scientist is not always needed and that, in some cases, having the right tools could help companies understand their data and find insights in it. What is your opinion on this?
Ben Lorica: “First we have to distinguish between two jobs: the data engineer and the data scientist. Data engineers are the ones who build and maintain the data and the infrastructure. These are hot jobs; without those people we don’t really have data, because there is no infrastructure to put that data in.
Once those people are in place, you can start hiring data scientists.
In the US at least, this means a class of individuals who can bridge multiple skill sets: machine learning and analytics, visualization, statistics (so they can understand the patterns in the data), and some programming (so they can acquire and prepare data on their own). Increasingly, at least in the US, the term data scientist has also come to mean people with strong communication skills: people who are not just technically prepared in the above skill sets but who can also work with product groups and managers, present results, understand the line of business and know how to interact with domain experts. It’s a job that you mostly grow into: you start from one set of skills, then you develop others, and then you have to learn how to interact with people and develop your presentation skills.”
But is this skill set replaceable by a tool?
Ben Lorica: “There is an interesting trend in the sense that there are tools that allow business analysts to do some of the things a traditional data analyst would do, so business people can run some analyses without knowing the details. In companies, data will become a little more democratized, in the sense that a lot more people will be empowered to do some of the things a data scientist would do.”
2016 will be more about applications
We are approaching the end of 2015 and already looking ahead to 2016. From your point of view, what will 2016 look like?
Ben Lorica: “I hope 2016 will be more about applications. For example, on the open source side there are tools like Spark, Kafka, Cassandra and Hadoop. Each of them is really great on its own, but piecing them together into an architecture is the challenge. Once they are integrated, building applications would be much easier.
Another trend is the public cloud providers: not just AWS, but also Azure and Google Compute Engine. They currently have components for building real-time intelligent applications and will make it easier for companies to piece things together. In this case companies don’t even need sysadmins or DevOps people to manage these things for them. They can focus on building apps rather than looking at the technology, and on building solutions for their problems.”
Interview conducted by Valentina Crisan, program coordinator, Big Data Week