Don’t drown in work. Time management and priority management tips for Data Scientists

A Reddit post reminded me that Data Scientists often struggle with time management and priority management. There is a customer expectation of fast turnaround, especially in Analytics. Data is complex. Supporting tools are only starting to catch up with their engineering equivalents. This post shares time management and priority management tips that I have learned while building data science teams. The temptation is to lean on automation and ‘better tools’, but the reality is that discipline and assertiveness will have the biggest effect.

Three challenges to effective time management and priority management for data scientists

The main challenges to effective time management and priority management for data scientists are:

  • Allowing stakeholders to invade team time.
  • Allowing stakeholders to invade priorities.
  • Lacking internal processes and discipline, which causes high communication and coordination overhead and reduces the team’s effectiveness.

Time management and priority management tips

Preventing invasion of team time

These tips are straightforward and aim to move disruptions to controlled times on the team’s schedule.

  • Blocking team technical time in calendars allows data scientists to engage in several hours of focused work.
  • No-meeting blocks, or even no-meeting days, again allow productive data science to be done.
  • Office hours help stakeholders who have questions meet with a team member on the team’s schedule.
  • An equivalent of ‘first line support’ allows dedicated team members to respond to quick-fire requests without the whole team being disrupted. Done on a rotation, this is an efficient way for mature teams to defend their core working time.

Preventing invasion of team priorities

Firstly, it is impossible to prioritise without a searchable list of the active work and incoming work the team faces. Many teams are pulled in different directions because they cannot communicate their current work and their priorities to stakeholders.

  • Workflow tracking can be as simple as a spreadsheet or as complex as modern workflow tracking software depending on the size of the team, the nature of the work and the number of stakeholders.
  • Maintain backlogs of work that the team is aware of but has not yet started. This allows you to measure and communicate the work on the team’s plate as well as constructively discuss what should be done next.
  • Educate on how to say ‘no’. This is difficult, involves cultural change and is often something junior team members struggle with. It is much easier to say ‘no’ when you can be clear on the day’s current priorities, when the work waiting on the backlog can be discussed, and when there are other avenues to a solution such as the office hours and first line support mentioned above.

These tips require discipline from the team to always write up requests, as well as training in how to write requests clearly so that the deliverable is understood.
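As a minimal illustration of the workflow tracking mentioned above, the ‘spreadsheet’ need not be more than a flat file with a few well-chosen columns. The sketch below is hypothetical: the field names and the CSV layout are my own suggestion, not a prescribed standard.

    import csv
    from datetime import date
    from pathlib import Path

    # Hypothetical column names for a lightweight request log; the exact
    # fields are a suggestion, not a standard.
    FIELDS = ["id", "received", "requester", "deliverable", "priority", "status"]

    def log_request(path, requester, deliverable, priority="medium"):
        """Append a newly written-up request to the team's tracking file."""
        path = Path(path)
        new_file = not path.exists()
        # The line count of an existing file is header + n rows, which is
        # exactly the next id when ids start at 1.
        next_id = 1 if new_file else len(path.read_text().splitlines())
        with path.open("a", newline="") as f:
            writer = csv.writer(f)
            if new_file:
                writer.writerow(FIELDS)
            writer.writerow([next_id, date.today().isoformat(),
                             requester, deliverable, priority, "backlog"])

A call such as log_request("backlog.csv", "finance", "Monthly churn figures split by region") makes the request, its requester and its place in the queue visible to anyone asking when their work will be done.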

Improving team internal processes

Even with the tools of prioritisation and time management in place, data science teams, by the nature of their work, are often far less effective than they could be. Changing data, changing understanding of business processes, a cultural lack of awareness of version control and release management, and a habit of ‘solo’ development in notebooks and personal spaces all contribute to burdensome internal communication. The incorrect reaction is often bespoke configuration documentation to try to keep the team in sync.

Convention is a far more effective approach than configuration for coordinating the team’s activities and outputs. Fortunately, Guerrilla Analytics offers principles and practices that allow data scientists to adopt conventions and operate more effectively with minimal overhead.
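As a sketch of what convention over configuration can look like in practice, suppose the team agrees that every deliverable lives in its own uniquely numbered work product folder. The helper below is an illustration of that idea under my assumed naming scheme (wp0001, wp0002, …), not an implementation prescribed by the book.

    from pathlib import Path

    def next_work_product(root):
        """Create the next numbered work product folder under root.

        Assumes a convention where every deliverable lives in its own
        folder named wp0001, wp0002, ... so any output can be traced
        back to the code and data that produced it, with no bespoke
        configuration documentation to maintain.
        """
        root = Path(root)
        root.mkdir(parents=True, exist_ok=True)
        existing = sorted(p.name for p in root.glob("wp[0-9]*") if p.is_dir())
        next_num = int(existing[-1][2:]) + 1 if existing else 1
        new_wp = root / f"wp{next_num:04d}"
        new_wp.mkdir()
        return new_wp

Because the convention is encoded in a small helper rather than in documentation, nobody needs to read a configuration document to know where the latest deliverable lives.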

Data Science Workflows – A Reality Check

Data Science projects aren’t a nice clean cycle of well-defined stages. More often, they are a slog towards delivery with repeated setbacks. Most steps are highly iterative between your Data Science team and IT, or between your Data Science team and the business. These setbacks are caused by disruptions. Recognising this and identifying the cause of these disruptions is the first step in mitigating their impact on your delivery with Guerrilla Analytics.

[Figure: the Data Science workflow]


The Situation

Doing Data Science work in consulting (both internal and external) is complicated. This is for a number of reasons that have nothing to do with machine learning algorithms, statistics and math, or model sophistication. The causes of this complexity are far more mundane.

  • Project requirements change often, especially as data understanding improves.
  • Data is poorly understood, contains flaws you have yet to discover, and IT struggles to create the required data extracts for you.
  • Your team and the client’s team will have a variety of skills and experience.
  • The technology available may not be ideal, because of licensing costs and the client’s IT landscape.

The discussion of Data Science workflows does not sufficiently represent this reality. Most workflow representations are derived from the Cross-Industry Standard Process for Data Mining (CRISP-DM) [1].

[Figure: the CRISP-DM process diagram [1]]

Others report variations on CRISP-DM such as the blog post referenced below [2].

[Figure: a Data Science workflow variation from [2]]

It’s all about disruptions

These workflow representations correctly capture the high-level stages of Data Science, specifically:

  • defining the problem,
  • acquiring data,
  • preparing it,
  • doing some analysis and
  • reporting results.

However, a more realistic representation must acknowledge that at pretty much every stage of Data Science, a variety of setbacks or new knowledge can return you to any of the previous stages. You can think of these setbacks and this new knowledge as disruptions. They are disruptions because they necessitate modifying or redoing work instead of progressing directly to your goal of delivery. Here are some examples.

  • After some early analyses, a data profiling exercise reveals that part of your data extract has been truncated. It takes you significant time to check that you did not corrupt the file yourself when loading it. Now you have to go all the way back to the source and get another data extract.
  • On creating a report, a business user highlights an unusual trend in your numbers. On investigation, you find a small bug in your code that, when repaired, changes the contents of your report and requires it to be re-issued.
  • On presenting some updates to a client, you agree together that there is no value in the current approach and a different one must be taken. No new data is required, but you must now shape the data differently to apply a different kind of algorithm and analysis.

The list goes on. The point here is that Data Science on anything beyond a toy example is going to be a highly iterative process where, at every stage, your techniques and approach need to be easily modified and re-run so that your analyses and code are robust to all of those disruptions.
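As an illustration of the kind of cheap, re-runnable check that would catch the truncated extract in the first example above, something like the sketch below can be run after every load. The manifest format is hypothetical; I am assuming the source system can supply expected row counts alongside the extract.

    import csv

    def check_row_counts(manifest_path, loaded_counts):
        """Flag datasets whose loaded row count falls short of the source's.

        manifest_path points to a hypothetical CSV of (dataset, expected_rows)
        supplied with the extract; loaded_counts maps each dataset name to
        the row count observed after loading.
        """
        suspect = []
        with open(manifest_path, newline="") as f:
            for dataset, expected in csv.reader(f):
                actual = loaded_counts.get(dataset, 0)
                if actual < int(expected):
                    suspect.append((dataset, int(expected), actual))
        return suspect

Run routinely, a check like this turns significant time spent wondering whether you corrupted the file yourself into a one-line report.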

The Guerrilla Analytics Workflow

Here is what I term the Guerrilla Analytics workflow. You can think of it like the game of Snakes and Ladders where any unlucky move sends you back down the board.

[Figure: the Guerrilla Analytics workflow]

The Guerrilla Analytics workflow considers Data Science as the following stages from source data through to delivery. I’ve also added some examples of typical disruptions at each of these stages.

Extract: taking data from a source system, the web, or front-end system reports. Example disruptions:
  • incorrect data format extracted
  • truncated data
  • changing requirements mean different data is required

Receive: storing extracted data in the analytics environment and recording appropriate tracking information. Example disruptions:
  • lost data
  • a file system mess of old data, modified data and raw data
  • multiple copies of data files

Load: transferring data from its receipt location into an analytics environment. Example disruptions:
  • truncation of data
  • no clear link between the data source and loaded datasets

Analytics: the data preparation, reshaping, modelling and visualization needed to solve the business problem. Example disruptions:
  • changing requirements
  • incorrect choice of analysis or model
  • dropping or overwriting records and columns so that numbers cannot be explained

Work Products and Reporting: the ad-hoc analyses and formal project deliverables. Example disruptions:
  • changing requirements
  • incorrect or damaged data
  • code bugs
  • incorrect or unsuccessful analysis

This is just a sample of the disruptions that I have experienced in my projects. I’m sure you have more to add too and it would be great to hear them.
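Several of the Receive and Load disruptions above (lost data, multiple copies of data files, no clear link between data source and loaded datasets) can be blunted by recording minimal tracking information at the moment data arrives. The sketch below shows one hypothetical way of doing this; the receipt naming and log fields are illustrative assumptions, not a prescribed format.

    import csv
    import hashlib
    import shutil
    from datetime import datetime, timezone
    from pathlib import Path

    def receive_data(src_file, receipts_root, source_system, log_path):
        """File an incoming extract under a unique receipt id and log it.

        Raw files are copied, never modified in place; each receipt gets
        its own numbered folder, and the log entry's checksum later ties
        loaded datasets back to the exact file they came from.
        """
        root = Path(receipts_root)
        root.mkdir(parents=True, exist_ok=True)
        receipt_id = f"r{sum(1 for p in root.iterdir() if p.is_dir()) + 1:04d}"
        dest = root / receipt_id
        dest.mkdir()
        shutil.copy2(src_file, dest)

        digest = hashlib.sha256(Path(src_file).read_bytes()).hexdigest()
        with open(log_path, "a", newline="") as f:
            csv.writer(f).writerow([receipt_id,
                                    datetime.now(timezone.utc).isoformat(),
                                    source_system, Path(src_file).name, digest])
        return receipt_id

The checksum is what keeps the ‘which file did this number come from?’ conversation short: any loaded dataset can be traced back to exactly one received file.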

Further Reading

You can learn more about disruptions, and practical tips for making your Data Science robust to them, in my book Guerrilla Analytics: A Practical Approach to Working with Data.

References

[1] Wikipedia, ‘Cross-Industry Standard Process for Data Mining’, https://en.wikipedia.org/wiki/Cross_Industry_Standard_Process_for_Data_Mining, accessed 2015-02-14.

[2] Communications of the ACM blog, ‘Data Science Workflow: Overview and Challenges’, http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/fulltext.