I’ve seen things you people wouldn’t believe. Pivot tables on fire off a dashboard. R scripts glittering in the dark near the Hadoop cluster.
But (with apologies to Rutger Hauer for hijacking his amazing monologue) I’ve also seen a lot of data science being done technology-first without taking people or processes into account, and I thought I’d lay down some notions that stem from my experience steering customers and partners through these waters.
As a budding corporate anthropologist, (recovering) technical director and international cat herder, I am often amazed at how much emphasis is placed on technical skills and tooling rather than on actually building a team that works.
And as an engineer by training (albeit one with a distinctively quantitative bent), I am fascinated by the number of opinions out there on the kind of technology, skill sets and even the kind of data required to make a data science team successful, because there’s actually very little hard data on which of those are the critical factors.
So I’m going to take a step back from the tech and science involved and look at the way the process should work, and some of the things you should consider when running a data science team regardless of your background.
People, Processes, Technology
A few years back, a former CTO of mine hammered into me that excellence is a process, and the motto stuck with me because he meant “excellence” in the sense of both personal and team growth rather than riding the tech hype or getting aboard the Six Sigma train.
Mind you, tooling and technology are critical, but you have to look at the wider picture.
Take deep learning, for instance: TensorFlow might be the go-to library at the moment, but Keras will give you a nicer abstraction that also lets you leverage CNTK as a back-end and possibly get faster turnaround when iterating on a problem, so I’d argue it is the higher-level tool that you (and your team) should invest in.
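As a concrete illustration of that flexibility (true of Keras 2.x at the time of writing; `train.py` is a hypothetical training script), the back-end can be swapped without touching your model code, either per-run via an environment variable or permanently in Keras’s config file:

```shell
# Run the same hypothetical training script against different back-ends:
KERAS_BACKEND=tensorflow python train.py
KERAS_BACKEND=cntk python train.py

# Or set the default back-end in ~/.keras/keras.json:
# { "backend": "cntk", "image_data_format": "channels_last" }
```

That one-line swap is exactly the kind of iteration-speed lever a higher-level abstraction buys you.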
If you take the long view, running the gamut from purely statistical/regressive approaches to RNNs implies a deep commitment not just in terms of learning the science behind them, but also in understanding where they fit within the range of challenges you have to address.
And believe me, choosing tools is not the challenge you need to tackle first - what you should tackle first is your team, and then the context in which it operates.
The First Mistake
The first mistake organizations (and managers) make is thinking that the data scientists reporting to you are your whole team.
No matter how much people go on about matrix management and the need for cross-functional teams, there is a natural human tendency to sort people (and things) into nice, tidy bins, and when you have to motivate and drive people, there’s an added bias involved. After all, as a manager, your primary role is to make sure the team you’ve been assigned works cohesively, and in data science these days (especially in companies new to the field) there’s also a need to prove your worth.
And by that I mean the team’s worth - you might be a whiz in your own right, but your job is to make sure your team delivers, and that the goalposts and expectations are clearly defined both inside and outside your direct reports.
So your actual team comprises stakeholders of various kinds - product owners, management, and (just as importantly) everyone else in technical roles, because what you do (and the insights you obtain) inevitably impacts the rest of the business and how it’s built/implemented/deployed/etc. You don’t exist in a vacuum, but rather are the conduit between what data you have (or, more often, don’t) and what the business needs to improve (and I’m deliberately avoiding the reverse flow here, which is when you’re tasked by the business to improve something that’s already implemented).
I’ve seen a very similar thing happen before during the Big Data hype, and the way we tackled that successfully was by setting up “pods” of people to address each specific problem - each pod being composed of the usual triumvirate of a data scientist (who is usually a direct report to you), an implementer (who might or might not be) and a domain expert (who is usually a product owner or a business stakeholder).
I use the term “implementer” above because depending on the issue you’re tackling, the problem domain might require:
- quick iteration on data conversion (in which case you’d fill that role with a developer or DBA)
- putting together a data visualization (a front-end developer or a designer)
- or figuring out how to deploy a model at scale (an architect or devops whiz)
In any case, the net effect is that in more formal organizations you will find yourself having to mix and match schedules with your management peers, so it helps if you can communicate clearly about what the overall goals are and what sort of skills you need to tackle a particular challenge.
Living up to the role, inside and out
You have plenty of hands-on experience, your team looks up to you, and you let yourself get involved in all sorts of discussions regarding architecture, feature engineering, algorithm selection and model evaluation - and that is fine and good, except that management has nothing to do with that.
Running a team requires you to go out of your comfort zone and juggle priorities, commit to deadlines, steer people’s careers and all the messy, unscientific trappings that come with leading people and delivering results in a business environment.
The key thing here is to avoid falling prey to impostor syndrome - remember, you got the job for a reason, right? And being a manager doesn’t mean you stop doing science work - in fact, you will likely be doing a lot more science work than you usually do (but at a higher level), simply because you need to understand what your extended team is doing, identify pitfalls or roadblocks, and steer people in the right direction.
And to do that, you need to learn how to communicate effectively - not just inside your team, but outside it, and quite often to people who don’t have the same kind of background (technical or otherwise).
Putting processes in place
With work coming in and all your pods in a row, your team starts building a pipeline of challenges to go through (usually with multiple sub-challenges as you start drilling down on things). So how do you farm those challenges out to your team while keeping everyone happy?
Well, before we tackle that, we need to take a step back and think about how people will most likely be spending their time.
There are essentially two constraints involved in scaling up a data science team, and they both boil down to time: time spent understanding the problem and working out a solution, and time spent implementing and rolling it out.
In practice, what generally happens is that:
- 80% of your team’s time will be spent wringing meaning out of the data: hunting down datasets, doing ETL and doing initial feature selection. That first part is by far the least glamorous portion of the work, but it ties in well with the human mind’s need to get an intuitive feel for the data and the problem domain, so you should step out of the way. By all means do daily stand-ups (if that’s your thing), sit in with the team and discuss what they’re doing, but try to bring your own expertise to bear only when asked or required - don’t micromanage people, but help them take the long view and turn what they’re doing into a repeatable process.
- The remaining 20% of the time is usually spent figuring out how to make your data and models available to the rest of the company - and this is where most data science teams skimp.
As it happens, there is a lot more to delivering data science than churning out reports and dashboards. So you (as a manager) should expect to spend a lot of your own time (probably up to 80% of it, in the early days) going over that remaining 20% and working with your team to turn what they did into a repeatable, measurable process, either by leveraging continuous integration tools to compare models between iterations or by defining checks and balances: What features were added to the model? How does it perform against new datasets? How fast will it become stale, given the rate at which your data (or your business process) is updated?
A good manager knows there is a balance to be struck between the “fail fast” approach and not rushing into things - by all means have people go and experiment with new methods (that’s a big part of keeping people happy), but define the yardstick you’re going to measure the results with: Do we get faster turnaround out of the new model or tools? Does it produce noticeably better results without degrading runtime performance? Do we get side benefits, like reducing bias or preventing overfitting?
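One lightweight way to make that yardstick concrete is a small check a continuous integration job could run whenever a candidate model is trained. This is just a sketch - the metric names (`auc`, `latency_ms`) and thresholds are illustrative assumptions, not tied to any particular framework:

```python
# Minimal sketch of a model "yardstick" a CI job could run after each
# training iteration. Metric names and thresholds are illustrative.

def compare_models(baseline, candidate, min_gain=0.01, max_latency_regression=0.10):
    """Return (accept, reasons): accept the candidate only if it improves
    the primary metric without regressing runtime performance too much."""
    reasons = []
    gain = candidate["auc"] - baseline["auc"]
    if gain < min_gain:
        reasons.append(f"AUC gain {gain:.3f} below threshold {min_gain}")
    latency_change = (candidate["latency_ms"] - baseline["latency_ms"]) / baseline["latency_ms"]
    if latency_change > max_latency_regression:
        reasons.append(f"latency regressed by {latency_change:.0%}")
    return (not reasons, reasons)

baseline = {"auc": 0.81, "latency_ms": 40.0}
candidate = {"auc": 0.84, "latency_ms": 42.0}
accept, reasons = compare_models(baseline, candidate)
print(accept)  # True: +0.03 AUC for only a 5% latency increase
```

The point is less the specific thresholds than the fact that they are written down and applied the same way to every iteration.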
It might take considerably longer in the beginning depending on your company, resources and processes, but the idea here is that as you start delivering solutions you will be building a set of APIs, infrastructure or datasets that other people will consume, defining a roadmap for those, and iterating upon them - so the bridges you build with other teams will be invaluable here.
Growing your team, with science!
Soon enough, you’ll figure out who has the knack or expertise to tackle specific kinds of problems. Instead of assigning them to everything that looks like a nail, though, take those people and pair them with someone else who’s never done it before.
Have them broaden their horizons and, again, turn that into a repeatable process, but also present it to their peers. Do step in to steer the discussion, but remember that personal growth comes from learning new things and passing them on, and that your team will be more effective (and happy) if processes are clear to everyone and if they value those processes as an integral part of their work.
Don’t get caught up in processes as efficiency, or start tallying up KPIs for their own sake - rather, think of processes as adding structure (and thus meaning) to the work you’re all doing, and take a leaf (or two) from the Kaizen handbook.
Make Data Science part of your company culture
Once everything else is in place, the best way to make sure the organization understands your team’s role involves reaching outside it - which means leveraging your communication skills yet again to:
- Foster a data-centric culture in other teams, making it plain that it is not really about storing heaps of raw data, but about making sure the datasets you have are clearly identified and easy to get at (with the usual caveats about personally identifiable information and proper data hygiene in that respect).
- Agree on common data representations (or bridge formats) that can be exchanged with minimum development/integration overhead, and on APIs for other teams to access the trained models your team produces.
- Address the really hard problems, like moving from batch-oriented processes to event streaming. Fraud detection, recommendation engines, and other staples businesses rely on require instant access to data, and (speaking from experience) there is nothing like dealing with streaming data, both technically and from a business perspective.
- Understand what the business wants. Nothing that is really necessary is actually impossible (even if it seems hard given the hype around data science these days), and you will spend a lot of time patiently steering people back to the realm of feasibility - but remember, you were chosen as a manager because you are good at building bridges, in all respects.
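To make the “bridge format” point above a bit more concrete, here is a minimal sketch (all field names are hypothetical) of a versioned, self-describing prediction record that a scoring API could emit and that other teams could consume with nothing but a JSON parser - no need to pull in your modeling stack:

```python
import json
from dataclasses import dataclass, asdict

# Hypothetical bridge format: a versioned prediction record that
# downstream teams can parse with any off-the-shelf JSON library.
@dataclass
class Prediction:
    schema_version: str   # lets consumers handle format changes gracefully
    model_id: str         # which trained model produced this score
    entity_id: str        # the business object being scored
    score: float
    features_used: list   # feature names only - no raw (possibly sensitive) values

def to_bridge_json(pred: Prediction) -> str:
    # Stable key ordering makes diffs and caching friendlier downstream.
    return json.dumps(asdict(pred), sort_keys=True)

pred = Prediction("1.0", "churn-2017-07", "customer-42", 0.87,
                  ["tenure_months", "support_tickets"])
print(to_bridge_json(pred))
```

Agreeing on something this small up front spares you a lot of per-team integration glue later, and the explicit `schema_version` field gives you room to evolve the format without breaking consumers.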
Above all, don’t freak out - you’re still doing data science
Even if some of the above doesn’t come naturally to you at first, don’t worry. You’ll be fine, as long as you keep re-training your own mental model of what role your team (and you) have to play in the larger picture.
And rest assured that you will be able to spend a lot of time doing actual data science, if only because most of the business and team-related aspects outlined above evolve a lot more slowly than you’d expect - there’s no gradient descent algorithm for optimizing human organizations, and, all things considered, I think that’s a good thing.
This essay originally appeared on LinkedIn on July 2nd, and then a week later on Medium.