How does a total beginner start to learn machine learning if they have some knowledge of programming languages?

I work with people who write C/C++ programs that generate GBs of data, people who manage TBs of data distributed across giant databases, people who are top-notch programmers in SQL, Python, and R, and people who have set up organization-wide databases working with Hadoop, SAP, Business Intelligence, etc.
My advice to anyone and everyone would be the following:
  1. Learn all the basics from Coursera, but if I really have to compare what you would get out of Coursera compared to the vastness of data science, let us say ~ Coursera is as good as eating a burrito at Chipotle Mexican Grill. You certainly can satiate yourself, and you have a few things to eat there.
  2. The pathway to value-adding data science is really quite deep, and I consider it equivalent to a five-star buffet offering 20 cuisines and some 500 different recipes.
  3. Coursera is certainly a good starting point, and one should certainly go over these courses, but I personally never paid any money to Coursera, and I could easily learn a variety of things bit by bit over time.
  4. Kaggle is a really good resource for budding engineers to look at various other people’s ideas and build on them.


My own learning came from actually building things. I started with SQL, then I learned Python, then R, and then many libraries in Python and R. Then I learned HTML, decent GUI programming using VBScript, and C# programming. Then I learned scikit-learn. Finally, I talked to various statisticians at my workplace whose day-in, day-out job is to derive conclusions from data, and in the process I learned JMP/JSL scripting. I learned a lot of statistics along the way.
Here is the overall sequence of how I progressed.
The first thing I want to inspire anyone and everyone to do is learn the “science”. Data science is 90% science and 10% managing data. Without knowing the science, and without knowing what you want to achieve and why you want to achieve it, you will not be able to use whatever you learn on Coursera in any way. I can almost guarantee you that.
I have seen my friends going through some of those courses, but at the end of the day, they do not build anything, they do not derive correct conclusions, and they do not really “use” anything that they learn. More than that, they do not even really use the skills they acquire.
The way all this happened to me is as follows:
  1. I dived deep into data, understood their structure, understood their types. I understood why we were even collecting all those data, how we were collecting them, how we were storing them, and how we were processing them before storing them.
  2. I learned how data could be handled effectively with these programming languages. I learned to clean the data, process them as much as I wanted to, and plot them in every possible way I could. Just plotting the data took me hours and hours, seeing how various plots could show the data in one way compared to another.
  3. I learned from my friends who manage databases how they did that and what went in the background. I learned the structures of the database tables.
  4. Then I learned how to make relevant plots and calculate the return on investment for doing anything. Here is where data science started coming together. There is no plot that I cannot plot. Basically, every plot I saw on the internet, I learned how to plot. This is extremely important, and this is what will lead you to storytelling.
  5. Then I learned automating things, and this is really amazing, because you would be able to do a few things automatically, which would save you a lot of time.
  6. Automation came really easily with Python, R, VBscript, C# programming.
    I can tell you that, roughly speaking, there is nothing that is not automated for me. I have a computer program for anything and everything, and most of my things are done by a button click, or let's say a few button clicks.
  7. Then I learned report writing. What I learned is that I had to send a lot of data and plots to others by email. And believe me, people have no time and no interest. But if you make colorful plots, write a coherent report demonstrating what you want to say, and pack a lot of powerful information into a few really colorful plots, you can make a case.
  8. Then I learned storytelling. What this simply means is that you should be able to tell the vice president of the company what the topmost problems of your division are. And the way you derive these conclusions is by creating engaging plots that tell a story. Without this, you will not be able to convince anyone. People are not interested in numbers. All they remember is names, places, things, inspiration, and why someone wants to do something. A true data scientist is also a true presenter of the data.
  9. Then I read every possible blog on the internet to see how others were doing these things. How people were writing their programs, how they were creating various plots, how they were automating things and so on. I also derived a lot of ideas from how someone used their skills to do an amazing project. This is a really nice way to see how others imagine. Then you can borrow their imagination and build things, and eventually as things are easier for you, you would begin imagining things yourself.
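To make the data-cleaning step in the list above a bit more concrete, here is a minimal sketch using only the Python standard library. The column names, the sample values, and the `clean_rows` helper are all made up for illustration; real data will need rules that come from knowing what the values mean.

```python
import csv
import io

# Hypothetical raw export: one blank reading and one mistyped reading ("7o.2").
RAW = """machine,temperature_c
A,71.5
B,
C,7o.2
A,69.8
"""


def clean_rows(text):
    """Split CSV text into (good_rows, rejected_rows).

    A row is rejected when its temperature is missing or is not a number.
    """
    good, rejected = [], []
    for row in csv.DictReader(io.StringIO(text)):
        value = (row["temperature_c"] or "").strip()
        try:
            good.append({"machine": row["machine"], "temperature_c": float(value)})
        except ValueError:
            # Keep the bad row so it can be reported instead of silently dropped.
            rejected.append(row)
    return good, rejected


good, rejected = clean_rows(RAW)
print(len(good), "usable rows;", len(rejected), "rows to report back")
```

Keeping the rejected rows, rather than throwing them away, is a deliberate choice here: they are what you would send back to whoever maintains the data source.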
Just take a look at the number of blogs available to you from where you can learn a lot of things.
I have gone through many of these blogs, and I have read them in depth. This took weeks of effort and multiple Saturdays and Sundays experimenting with data and programming languages.
My most frequently used websites:
I would now give you a more comprehensive approach, so that you have a lot of inspiration to hold on to.

What does a typical engineer’s job look like, and how can data science help along those lines?
  1. Decision making: In my job, I have several decisions to make and several actions to take in a day. In addition, I have various stake holders to update, various people to give guidance to, various data sets to look at, and various tools and machines to handle. Some of these machines are physical machines making things, and some others are simply computer programs and software platforms creating settings for these machines.
  2. Data: Most of the data we have is on various servers which are distributed across various units, or is on some shared drive, or on some hard disk drive available on a server.
  3. Databases: These database servers can be used to get data with SQL, by direct data pull, or by grabbing the data some other way (say, copying by FTP), sometimes even manually copying and pasting into Excel, CSV, or Notepad. Usually we have multiple methods to do a direct data pull from the servers. There are various SQL platforms such as TOAD, Business Intelligence, and even in-house built platforms.
    1. SQL can be learned easily using these platforms, and one can create plenty of SQL scripts.
    2. You can even create scripts that can write scripts.
    3. I would inspire you to learn SQL, as it is one of the most used languages for just getting data.
  4. Data again: The data on these databases can be highly structured, or somewhat unstructured - such as human comments or so on.
    1. These data can often have a fixed number of variables, or varying number of variables.
    2. Sometimes data can be missing too, and sometimes they can be incorrectly entered in the databases.
      1. Every time something like this is found, an immediate response is sent to the database managers, and they correct the bugs if there are any in the system.
      2. Usually, before setting up a whole giant database project, multiple people get together and discuss what the data should look like, how they should be distributed into various tables, and how the tables should be connected.
      3. Such people are true data scientists, as they know what the end user is going to want on a daily basis over and over.
      4. They always try to structure the data as much as possible, because it makes it very easy to handle it.
  5. Scripting and scheduling: Using multiple scripts that are scheduled to run at specific times, or sometimes set up to run on an ad hoc basis, I get and dump data in various folders on a dedicated computer. I have a decently large HDD to store a lot of data.
    1. Usually I append new data to existing data sets, and purge older data in a timely way.
    2. Sometimes I have programs running with sleep commands that, at scheduled times, merely check something quickly and go back to sleep.
  6. More scripting: Furthermore, there are multiple scripts that are set up to crunch these data sets and create a bunch of decisions from them.
    1. Cleaning data and creating valuable pivot tables and plots are among the biggest time sinks for anyone trying to get value out of this.
    2. To achieve something like this, you first have to understand your data inside and out, and you should be very capable of doing all sorts of hand calculations, generating Excel sheets, and visualizing data.
    3. Science: What I would inspire you with is that before you do data science, do the science, learn the physics behind your data, and understand it inside and out. Say you work in the T-shirt industry: you should know every aspect of a T-shirt inside and out, you should have access to all possible information around T-shirts, and you should know very well what the customers want and like, without even looking at any of the data.
    4. Without understanding the science, data-science is valueless, and trying to achieve something with it may be a fruitless effort.
    5. Caveats: I have seen plenty of people not even knowing what to plot against what.
      1. The worst I have seen is that people plot just about some random variables against each other and they derive conclusions out of them.
      2. True, correlations exist in many things, but you should always ask whether there is any causation.
      3. Example: There is a significant correlation between the number of Nobel laureates and the per-capita chocolate consumption of various countries. But is it causation? Maybe not!
  7. Back to programs: There is usually a sequence in which all the scripts run, and create all sorts of tables, and plots to look at.
    1. Some scripts are sequential, whereas some programs are mere executables. Executables usually are written for speed, and C, C++, C# etc can be used for them.
    2. Scripts can be written in Python, VB etc.
  8. Decision making: When certain {If/Then} conditions are met, more computer programs self trigger and run more data analysis.
  9. Data science: This usually unfolds into a lot of statistics, classification, regression.
    1. Here is where machine learning comes in. One can use programming languages such as Python or R to do this.
    2. Based on the machine learning algorithms’ results, more computer programs are run and more plots are generated or more programs are triggered.
  10. Plotting: Ultimately, a lot of plots are stored in a coherent fashion for humans to make decisions.
  11. Self sustaining reports: The reports are self triggering, self sustained programs that tell me what to do.
  12. The feeling of being ironman: I usually look at the results from all the reports in 10 mins, and make decisions on what to do next for many hours. Every now and then I look at the reports again to re-define the decisions or change them on the fly if this has to be done.
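The chain described in the list above (pull data with SQL, summarize it, apply an if/then rule, purge old data) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the real servers, and the `readings` table, its columns, the sample values, the threshold, and all function names are hypothetical.

```python
import sqlite3
from collections import defaultdict

# In-memory SQLite stands in for the real database server (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (day TEXT, machine TEXT, defects INTEGER)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("2024-01-01", "A", 2),
        ("2024-01-01", "B", 9),
        ("2024-01-02", "A", 3),
        ("2024-01-02", "B", 12),
    ],
)


def pull(conn, since):
    """Data pull: fetch rows dated on or after `since` with plain SQL."""
    cur = conn.execute(
        "SELECT day, machine, defects FROM readings WHERE day >= ? ORDER BY day",
        (since,),
    )
    return cur.fetchall()


def pivot_total(rows):
    """Pivot: total defects per machine."""
    totals = defaultdict(int)
    for _day, machine, defects in rows:
        totals[machine] += defects
    return dict(totals)


def machines_to_review(totals, limit):
    """If/then rule: flag machines whose total exceeds `limit`."""
    return sorted(m for m, total in totals.items() if total > limit)


def purge_older_than(conn, cutoff):
    """Purge: delete rows dated before `cutoff` and return how many were removed."""
    cur = conn.execute("DELETE FROM readings WHERE day < ?", (cutoff,))
    return cur.rowcount


rows = pull(conn, "2024-01-01")
totals = pivot_total(rows)
flagged = machines_to_review(totals, limit=10)
removed = purge_older_than(conn, "2024-01-02")
print(totals, flagged, removed)
```

In a real setup each function would likely live in its own scheduled script, and the flagged list would trigger further analysis or a report rather than a `print`.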
What are the advantages of doing all this?
  1. First of all, when a computer does something, it would do it at a much faster speed than a human.
  2. A computer will do it tirelessly, and endlessly.
  3. Computer programs need a sufficient amount of training and multiple levels of testing with varying inputs, but once all that is done, the program will keep doing that job forever, until either the sample space itself changes or something changes drastically in the input.
  4. By programming it to the level that the entire output is set on a dashboard, it is very easy to see what the order of the projects should be.
How do you now create value from something like this?
  1. One should always stand behind the science! By knowing your data as well as possible, you will be able to prioritize the implementation of your projects.
  2. The decisions you make and the actions you take will be harder, better, faster, stronger.
  3. You would be able to derive conclusions and generate some lean sigma projects.
  4. You would be able to update the stakeholders well ahead of time, and be able to be on the top of your projects.
  5. You would be able to focus only on the science aspect instead of just trying to manually create plots.
  6. You would be able to find trends in your data more easily, and make a call one way or the other when the data favor one choice over another.
  7. Last but not least, you can reduce human effort significantly and automate all the things for you.
    1. I even have scripts that push buttons for me or fill up forms for me.
    2. I have several image analysis programs that analyze images and make decisions on the fly without humans looking at them.
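Finding trends is where the earlier caveat about correlation and causation matters most. The sketch below uses made-up numbers in which two series are both driven by a third variable; the Pearson correlation between them comes out close to 1 even though neither one causes the other. The variable names and values are purely illustrative.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up data: temperature drives both of the other series.
temperature = [18, 21, 25, 28, 31, 34]
ice_cream_sales = [2 * t + 5 for t in temperature]
sunburn_cases = [3 * t - 10 for t in temperature]

r = pearson(ice_cream_sales, sunburn_cases)
print(f"correlation between sales and sunburn: {r:.3f}")
```

The number alone cannot tell you that temperature is the real driver. Only knowing the science behind the data can.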
I hope this answer is detailed enough to give you some insight into what you can work on.

Comments

  1. Awesome questions! Machine learning is a great field to get into; not only is it highly sought after by employers, it also helps you understand the world in a new way.

    Most machine learning algorithms are based heavily in math, and are made possible by programming. Here are the basic things I would suggest picking up as you tackle machine learning:

    Matrix Algebra: Matrix algebra is really important when you start working with large amounts of data; here’s a good online matrix algebra class from MIT: Linear Algebra
    Statistics: It’s been argued that machine learning is really just computer-aided statistics. I’m not sure if I totally agree with that, but having a basis in statistics will help you wrap your head around a lot of the simpler learning algorithms (e.g. regression). I haven’t taken this course specifically, but I’ve heard good things about Udacity’s statistics offering: Elementary Statistics Course Online
    Calculus: I know, now it sounds like I’m just listing off every math class I know, but as I said, machine learning is math-heavy. You don’t need that much calculus, but having a basic grasp of what a derivative is will be really helpful. This page is pretty simple, but if you can get through it and feel like you understand what’s going on, you’re in good shape (at least to start): The Definition of the Derivative
    Programming: Of course you’ll have to program in order to actually implement learning algorithms, and it’s good to know a general purpose programming language. You said you have experience with Java and Python and those are great. If you didn’t I would recommend picking up Python through CodeCademy: Python
    MatLab: It’s important to know how to program in general, but it’s also really helpful to be familiar with MatLab; you can effectively study machine learning in another language (e.g. Python), but so many of the resources for beginners use MatLab. If you are in college you can probably get MatLab for free through your institution. If not, I would suggest trying out Octave; it’s fairly similar to MatLab, and it’s free.
    Basic Learning Algorithms: Finally to the fun stuff. To get a feel for the basics I would strongly suggest you check out Andrew Ng’s Coursera course on machine learning. It’s well made, and very accessible. In it he draws on all the things in this list; although he briefly introduces each of these subjects, it’ll be a lot easier if you have a foundation in all of them before tackling machine learning: Machine Learning - Stanford University | Coursera
    If you can make your way through this list, by the end you should at least be familiar with the field of machine learning, and be prepared to figure out what you want to learn next. Good luck!
