My Journey To Becoming a Data Scientist

In early 2015 I went to Nepal with some friends to hike to Everest Base Camp. We hiked for days and days, often in silence, and it gave me time to reflect on my career and life. Life was good; my career was growing stagnant. About six months earlier I had become very interested in data science and AI, but I didn’t know how to chart a path to that career. So, on my return flight from Lukla to Kathmandu I definitively decided…I would quit my job.

Within a week of returning to my job I set up a meeting with my boss to hand in my resignation. I was going to take a year off as a sabbatical and ardently dive into data science. I had been toying around with data science, but now…I wanted to be serious. No more hobby. I showed up to his office with my resignation in hand (he doesn’t know this story but now will). He had some questions for me first, so my resignation lay face down on his desk between us for a moment. “We need some help in the data science area,” he said. “I know you have an interest in this. Would you be willing to help out? We’re not sure how yet. It’s a bit new to us all.”

I furtively retrieved my resignation.

That was early 2015, and since then I’ve taken many different paths; many were dead ends and some were successes. I had little to no guidance; I made a lot of mistakes and wasted a lot of time. I would like to outline for you what worked for me. For many developers starting this process, it is confusing…what do I do first, second, next? Everyone’s path will be different…I’m just saying that in hindsight, if I could go back to early 2015, I would slap myself and say “do this!”

Statistics is NOT optional

While you do not need a PhD in mathematics or statistics to use Machine Learning algorithms, you do need at least a solid basic knowledge of statistics. I was misled at the beginning of my journey by many opinions and blogs to the contrary. There are a lot of wonderful free programs and tutorials out there for statistics. I felt that the free PDF OpenIntro Statistics was the best, and I wish I had worked through it in the beginning instead of a year later, when I realized I needed a more solid stats foundation. Jason Brownlee from Machine Learning Mastery has a fantastic perspective on this.

Start with Chapter One

As an experienced developer I felt that I could skip things. Why learn the basics of programming in R or Python? I know what a class is, I know what Object Oriented Programming is, I know what functions are; I’ll just jump to chapter 10. No no no no…my biggest mistake. I fought with R packages and Python libraries and compatibility for longer than I care to admit. Start at chapter 1…it may seem boring but there is stuff in there you need to know. Take a basic course in the language of your choice and start on page 1.

Pick a Language and then get good at it

In my arrogance, I picked both Python and R. They are both wonderful languages and I wanted to do them both. Truth is…I became mediocre at both when I could have been exceptional at one. I have since focused my energy on R, and all that Python work, while not worthless, is no longer my focus. You can always pick up another language later, but at the start, learning two at once is difficult. I should have known this: I speak Polish, Russian, German and (of course) English, and I never attempted to learn any of them simultaneously. The same should be true of programming languages. Perhaps there are people out there who can learn multiple languages simultaneously…I’m not one.

Lean on the developer community only when in need

Both Python and R have incredibly rich communities. Just about any question that you run across will be answered with a simple Google query, which often ends up at Stack Overflow. However, try as much as you can to solve your question on your own. Read the package documentation first before looking for a direct answer. Don’t knee-jerk Google every question you have, as this will just make you a good googler and not a good developer. When you are truly stuck, there are plenty of resources out there to help you along.

YouTube is more than cat videos

I’m going to be honest here: I didn’t really understand how rich YouTube channels could be. For instance, SentDex has a channel with hundreds of videos in a code-along format for Python. He describes things in a very clear and easy-to-follow manner. He has an entire series of 57 videos on Machine Learning, and it includes Convolutional Neural Networks and TensorFlow. Note: He’s the reason I am still doing more Python than I should. Another channel I found very beneficial was Econometricsacademy, with numerous R and stats tutorials, also done in a very easy-to-follow and understandable way.

As soon as you can … ML

When I was 15 I wrote my first Apple IIe program using PEEKs, POKEs and CALLs. That was 35 years ago, and I remember to this day what it was like to see so few lines of code do so much. I distinctly remember sitting back in my chair and saying, “Wow!” I hadn’t had that experience again until I wrote my first R Machine Learning program using the caret package. And…I again sat back in my chair and said “Wow!” As soon as you are ready, give yourself some joy and write an ML program. You won’t regret it, and it will fire your imagination and desire. When you are ready for this phase, the best book I’ve run across (and I’ve purchased a LOT) is Applied Predictive Modeling by Max Kuhn and Kjell Johnson.

Podcasts are wonderful

Podcasts and stories of data science make your journey feel like part of the real world. Coding and studying in your home office can be an isolating experience (especially if you are not yet able to apply your skills on the job). My favorite (by quite a measure) is Linear Digressions, followed by Data Skeptic and then Partially Derivative. There are also numerous data-related episodes in This American Life (iconic), Radiolab and Freakonomics Radio.

Weka is more than a small bird from New Zealand

I never took Weka seriously until recently, and then I said to myself, “Where have you been all my life?” I would not recommend starting here, but keep it in mind once you have a basic understanding of Machine Learning algorithms. It is a fantastic exploratory tool and one more tool in your Data Scientist toolbox. I read a critical comment about Weka on a blog somewhere: “Why would you use this instead of R?” Well, and here is a fundamental concept for your journey in Data Science, you will need many tools for different tasks. Weka complements that tool set greatly. It is in no way a substitute for R. When you are ready, there is a fantastic tutorial by the University of Waikato’s Professor Ian Witten.

I hope you find this helpful in planning your own journey into Data Science. I could go on and on and keep thinking of different things to add so I’ll force myself to stop. Please respond with resources you’ve found helpful in your data science practice and/or stories of what approaches you’ve found helpful. It is still a long journey for us all. After all…I’ve not even tapped into TensorFlow yet!

In the words of Ian Witten from the University of Waikato, “Off ya go now.”
