
Don’t drown in work. Time management and priority management tips for Data Scientists

A Reddit post reminded me that Data Scientists often struggle with time management and priority management. There is a customer expectation of fast turnaround – especially in Analytics. Data is complex, and supporting tools are only starting to catch up with their engineering equivalents. This post shares the time management and priority management tips I have learned building data science teams. The temptation is to lean on automation and ‘better tools’, but the reality is that discipline and assertiveness have the biggest effect.

Three challenges to effective time management and priority management for data scientists

The main challenges to effective time management and priority management for data scientists are:

  • Allowing stakeholders to invade team time.
  • Allowing stakeholders to invade priorities.
  • A lack of internal processes and discipline, which causes high communication and coordination overhead and reduces the team’s effectiveness.

Time management and priority management tips

Preventing invasion of team time

These tips are straightforward and aim to move disruptions to controlled times on the team’s schedule.

  • Blocking team technical time in calendars allows data scientists to engage in several hours of focused work.
  • No-meeting blocks, or even no-meeting days, again allow productive data science to be done.
  • Office hours help stakeholders who have questions meet with a team member on the team’s schedule.
  • An equivalent of ‘first line support’ allows dedicated team members to respond to quick-fire requests without the whole team being disrupted. Done on a rotation, this is an efficient way for mature teams to defend their core working time.

Preventing invasion of team priorities

Firstly, it is impossible to prioritise without a searchable list of the active work and incoming work the team faces. Many teams are pulled in different directions because they cannot communicate their current work and their priorities to stakeholders.

  • Workflow tracking can be as simple as a spreadsheet or as complex as modern workflow tracking software depending on the size of the team, the nature of the work and the number of stakeholders.
  • Maintain a backlog of the work that the team is aware of but hasn’t started yet. This allows you to measure and communicate the work on the team’s plate, as well as constructively discuss what should be done next.
  • Educate the team on how to say ‘no’. This is difficult, involves cultural change and is often something junior team members struggle with. It is much easier to say no when you can be clear on the day’s current priorities, when the work waiting on the backlog can be discussed, and when there are other avenues to a solution such as the office hours and first line support mentioned above.

These tips require discipline from the team to always write up requests, as well as training in how to write requests clearly so that the deliverable is understood.

Improving team internal processes

Even with the tools of prioritisation and time management in place, data science teams, by the nature of their work, are often far less effective than they could be. Changing data, a changing understanding of business processes, a cultural lack of awareness of version control and release management, and a habit of ‘solo’ development in notebooks and personal spaces all contribute to burdensome internal communication. The incorrect reaction is often bespoke configuration documentation to try to keep the team in sync.

Convention is a far more effective approach than configuration for coordinating the team’s activities and outputs. Fortunately, Guerrilla Analytics offers principles and practices that help data scientists adopt conventions allowing them to operate more effectively with minimal overhead.

Building a Data Science Capability: Inspirational Keynote at Data Leaders Summit Europe

I’ve just delivered the inspirational keynote at Data Leaders Summit Europe 2018. There was lots of great engagement and feedback. In particular, it seems people liked a clear definition of what data science actually is and the practical steps (and mis-steps) I took in building a capability at Sainsbury’s.

Continue reading “Building a Data Science Capability: Inspirational Keynote at Data Leaders Summit Europe”

13 Steps to Better Data Science: A Joel Test of Data Science Maturity

Data Science teams have different levels of maturity in terms of their ways of working. In the worst case, every team member works as an individual. Results are poorly explained and impossible to reproduce. In the best case, teams reach full scientific reproducibility with simple conventions and little overhead. This leads to efficiency and confidence in results and minimal friction in productionising models. It is important to be able to measure a team’s maturity so that you can improve your ways of working and so you can attract and retain great talent. This series of questions is a Joel Test of Data Science Maturity. As with Joel’s original test for software development, all questions are a simple Yes/No and a score below 10 is cause for concern. Depressingly, many teams seem to struggle around a 3.

Continue reading “13 Steps to Better Data Science: A Joel Test of Data Science Maturity”

Data Science jargon buster – for Data Scientists

Bamboozled. That’s your customers’ reaction to the Data Scientists in your organisation. Data Scientists need to communicate without jargon so customers understand, believe and care about their recommendations. Here is a Data Science jargon buster to help with communicating data science project results.

Continue reading “Data Science jargon buster – for Data Scientists”

Reproducible Data Science: faster iterations, reviews and production

Data Science involves applying the scientific method to the discovery of opportunities and efficiencies in business data. An essential part of the scientific method is reproducibility. Reproducible Data Science is essential for scientific credibility but also improves your Data Science efficiency in 3 key ways – faster iterations, reviews and pushes to production.
If you start to apply the 7 Principles of Guerrilla Analytics your teams will quickly achieve reproducibility and benefit from these efficiencies.

Continue reading “Reproducible Data Science: faster iterations, reviews and production”

To Become A Data Scientist, Focus On Competencies before Skills

Too often, the path to becoming a Data Scientist focuses on the technology skills in vogue rather than more permanent competencies. Competencies are a more general combination of skills, behaviours and knowledge. You can have great PowerPoint skills and create beautiful slides yet still be a terrible communicator. It is competencies that are most important when you build a Data Science career that is robust to changing trends in skills like languages and technology platforms. This post describes the most important competencies for being successful in data science.

Continue reading “To Become A Data Scientist, Focus On Competencies before Skills”

The Rigour of Science is Essential for Successful Data Science in Business

The rigour of Science is essential for successful Data Science in business. The scientific method helps drive successful data science projects in business. This post will show you how.

Continue reading “The Rigour of Science is Essential for Successful Data Science in Business”

Data Science – A Definition And How To Get Started

Confusion, hype, failure to start. Data Science has huge potential to change an organisation. But many organisations become mired in the associated cultural, technological and people change. Data Science is delivered as an interesting report rather than a driver of change. Data Science identifies algorithms that are run in the safety of a lab but never make it into production.

This is my keynote talk from the Polish Data Science with Business Conference.


Continue reading “Data Science – A Definition And How To Get Started”


10 Data Science Capabilities Your Team Needs (and the Tools to Support Them)

I’ve recently had several people ask me about tools for Data Science both online and at conference talks. This post lists some of the tools I use and the capabilities they provide.

When writing Guerrilla Analytics: A Practical Approach to Working with Data I deliberately avoided mention of tools. People can be dogmatic about tools and I thought this would be a distraction from the book’s core message around principles for doing effective Data Science in dynamic real-world projects.

That said, people do want some guidance in what can be a very overwhelming and fast-moving field. Managers want to know what to buy and where to invest training. Junior data scientists, students, dev ops engineers and system administrators want to know what to learn. I will focus on the important capabilities for a Data Science team and the tools I have found useful for enabling those capabilities.

1. Version control with Git and git-flow

Capability: Typically you will go through many iterations of your code and the work products your code produces. It quickly becomes impossible to track changes and reproduce earlier work without some code version control tool. This is only exacerbated when your team size is >1.

Tool: Git is a great version control system and the effort to learn its command line interface is a very worthwhile investment.

Git is incredibly flexible. However, this can lead to confusion and inconsistency in how it is applied. Git-flow is a set of scripts that automate much of what you will need to do in Git, subject to a particular branching convention that happens to be very helpful for Data Science.
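To make that concrete, here is a minimal sketch of the git-flow branching convention driven from Python using the GitPython package (an assumption on my part, as are the project, file and branch names). It mirrors the convention rather than calling the git-flow scripts themselves.

```python
from pathlib import Path
from git import Repo  # assumption: pip install GitPython

repo = Repo.init("analysis_project")  # new local repository
Path("analysis_project/wrangle.py").write_text("# wrangling code\n")
repo.index.add(["wrangle.py"])  # stage the new script
repo.index.commit("Add initial wrangling script")  # first commit

# git-flow's convention: day-to-day work happens on feature branches
# cut from a long-lived 'develop' branch.
develop = repo.create_head("develop")
feature = repo.create_head("feature/customer-churn", develop.commit)
feature.checkout()
```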

2. Wrangling and persisting data with PostgreSQL

Capability: Even if your data is small enough to fit in memory, reproducing work involves re-running all your scripts to rebuild that in-memory state before you can pick up where you left off. Other team members have to do the same. This is painful and inefficient. You therefore need to persist your work (raw data, intermediate datasets and work products).

Tool: A database gives you a way to persist your workings and intermediate datasets as well as share them with team members. Pick a database that is performant and flexible. I use PostgreSQL. It has an amazing set of features and this flexibility is what you want when doing Data Science.
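As a rough sketch of what persisting your workings can look like, assuming pandas, SQLAlchemy and the psycopg2 driver (the connection string, file, table and column names are all placeholders):

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for your own PostgreSQL instance
engine = create_engine("postgresql://analyst:secret@localhost:5432/projectdb")

# Persist the raw data once; the team queries it rather than re-importing
raw = pd.read_csv("transactions.csv")
raw.to_sql("transactions_raw", engine, if_exists="replace", index=False)

# Persist intermediate datasets too, so work can resume where it left off
cleaned = raw.dropna(subset=["amount"])
cleaned.to_sql("transactions_clean", engine, if_exists="replace", index=False)

# Any team member picks up from the persisted state
df = pd.read_sql("SELECT * FROM transactions_clean", engine)
```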

3. Wrangling and visualizing data with Pandas, Matplotlib and Seaborn

Capability: Getting your head around your data and preparing it for a variety of algorithms is probably the most time-consuming and important part of the Data Science life cycle. Some preparations are more easily done outside of a database, e.g. some natural language processing. Visualizing the data is really important here too.

Tool: Pick a programming language that has great data reshaping and visualization capabilities. If you work in Python, Pandas is a powerful set of data structures and algorithms for wrangling. Seaborn and Matplotlib are good places to start for visualization. And don’t waste time trying to get all these things to work together. Just use Continuum’s excellent Anaconda distribution.
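A minimal sketch of wrangling and visualizing with these libraries (the file and column names are hypothetical):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Hypothetical sales extract: one row per order
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Wrangling: derive a month column and aggregate revenue per region
orders["month"] = orders["order_date"].dt.to_period("M").astype(str)
monthly = orders.groupby(["month", "region"], as_index=False)["revenue"].sum()

# Visualizing: a quick look at revenue trends by region
sns.lineplot(data=monthly, x="month", y="revenue", hue="region")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
```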

Read: Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython

4. Documentation with Markdown

Capability: Data Science is useless without communication (to your customer and within your team). You could just write a report as a Word document. There’s nothing wrong with that and it’s a format your business customers will expect. However, it would be great to have documentation that is easy to version control and can be kept close to your project code.

Tool: Markdown is a nice platform-neutral way to document your project. Because it’s plain text it’s easy to version control (see above). And if your report isn’t too complicated you can convert it to Word from Markdown. Win.
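If you want to script that conversion, here is a minimal sketch assuming the pypandoc package and a local pandoc installation (the file names are placeholders):

```python
import pypandoc  # assumption: pip install pypandoc, plus pandoc itself

# Convert a version-controlled Markdown report into the Word format
# your business customers expect.
pypandoc.convert_file("report.md", "docx", outputfile="report.docx")
```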

5. Fast file manipulation, cleaning, and summarising at the command line

Capability: You get hundreds of data files. You get huge files in strange formats with broken delimiters. You want to chop these up, patch them together, change their encodings, unravel XML etc etc. No, trying to open the file in a text editor or spreadsheet is not the answer.

Tool: This is best done at a powerful command line. Linux is worth learning.

Read: Data Science at the Command Line: Facing the Future with Time-Tested Tools

6. Story telling with Jupyter Notebooks

Capability: Data Science is difficult to communicate. It’s often a slightly meandering journey with dead ends, back-tracking, unexpected insights leading to new research avenues etc. When updating your customer, you need to walk them through some of this journey using narratives interleaved with graphics and tabular data. Code files won’t do. Duplicating into PowerPoint is a lot of extra work for a quick interim update.

Tool: Jupyter allows all of the above in presentation quality. The close interleaving of analysis and documentation helps other team members join a project. And it reduces duplication when you decide it’s time to stop coding and start updating your customer.
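You can even script the hand-over. Here is a minimal sketch using nbformat and nbconvert, which ship with a standard Jupyter install (the notebook name is a placeholder):

```python
import nbformat
from nbconvert import HTMLExporter

# Render the analysis notebook as a standalone HTML page for a customer
# update, with no copy-and-paste into slides.
nb = nbformat.read("customer_update.ipynb", as_version=4)
body, _resources = HTMLExporter().from_notebook_node(nb)

with open("customer_update.html", "w", encoding="utf-8") as f:
    f.write(body)
```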

7. Build automation with Luigi

Capability: Eventually, your understanding and your code start to consolidate. There are some core datasets. They go through some agreed preparatory steps. There are some reports and algorithm datasets that you want to lock down and reproduce several times during the project. Manually running all those code files is a pain.

Tool: Build automation tools allow you to automate tasks such as executing code files, creating documentation, importing and exporting data etc etc. I’ve used command line scripts (see above) and software build tools like Ant for this automation. More sophisticated tools like Luigi are now reaching a level of maturity where you could consider them for your team too.
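A minimal Luigi sketch of such a pipeline (the tasks, file paths and data are hypothetical):

```python
import luigi


class CleanTransactions(luigi.Task):
    """Hypothetical first step: produce the agreed clean dataset."""

    def output(self):
        return luigi.LocalTarget("data/transactions_clean.csv")

    def run(self):
        with self.output().open("w") as f:
            f.write("id,amount\n1,9.99\n")  # real cleaning logic goes here


class BuildReport(luigi.Task):
    """Depends on the clean data; Luigi runs CleanTransactions first."""

    def requires(self):
        return CleanTransactions()

    def output(self):
        return luigi.LocalTarget("output/report.csv")

    def run(self):
        with self.input().open() as src, self.output().open("w") as dst:
            dst.write(src.read())  # real reporting logic goes here


if __name__ == "__main__":
    # One command reproduces the whole pipeline instead of running
    # each code file by hand.
    luigi.build([BuildReport()], local_scheduler=True)
```

Because each task declares its output, Luigi skips steps whose outputs already exist, which is exactly the ‘lock down and reproduce’ behaviour described above.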

8. Workflow tracking with JIRA

Capability: What the hell is everybody doing? Where did that data come from? Where is the conversation with the system SME that led to that business rule? Where is the deliverable from 2 weeks ago and who sent it to which customer?

Tool: Workflow tracking tools like JIRA help answer all the above questions. Look for a tool that is customizable as Data Science doesn’t need all the detail of a large scale software development project. Do make sure you track where your data is coming from and what deliverables are going out the door (see Guerrilla Analytics).

9. Packaging it all up with Vagrant

Capability: The diverse nature of Data Science activities leads to a correspondingly diverse set of tools, as you’ve seen above. When you get things working, you would rather not break them, and you would rather not force every team member to go through the same painful installations and configurations and risk inconsistency.

Tool: Vagrant and other ‘dev ops’ tools allow you to define your tech setups and their configuration in program code. What does that mean? It means that you can build your entire technology stack and configure it by running some code. It also means that the installation of all your tools and their configuration can be version controlled. As your technology stack evolves, update your code and issue a new release to your team. If you trash your technology or need to move to other servers, everything you need to reproduce your environment has been captured and you should be back up and running in minutes.

Read: Vagrant: Up and Running

10. Putting it all together – Operations with Guerrilla Analytics

I’ve covered a lot here. How do you put this all together without choking a team in conventions, rules, tools etc? How do you reduce Data Science chaos and continue to deliver iteratively and at pace? That’s where Guerrilla Analytics can help.

Have a read around this blog, check out the book and get in touch with any questions!