Top 10 Data Science Topics and Areas
by Mahmut on December 15, 2020
In this decade, data science has emerged as a leading field of study because of the numerous opportunities it offers in terms of business and financial solutions. In addition, expertise in these areas puts you in a good position to secure a job in the private or public sector, or as a consultant in your area.
In this article, I am going to talk about the top 10 data science topics and areas.
What is Data Science
Everybody can define data science in their own way, but to explain it as simply as possible:
Data Science is the process of deriving useful insights from huge amounts of data to solve real-world problems.
Data science involves data and some science. The name came up in the 80s and 90s, when some professors looking into the statistics curriculum thought it would be better to call it data science. I see data science as one's attempt to work with data to find answers to the questions one is exploring. In a nutshell, it's more about data than it is about science. If you have data and curiosity, and you are working with data, manipulating it, and exploring it, then the very exercise of analyzing data and trying to get answers from it is data science.
Data science is relevant today because we have tons of data available. We used to worry about a lack of data; now we have a data deluge. In the past, we didn't have algorithms; now we have algorithms. In the past, software was expensive; now it's open source and free. In the past, we couldn't store large amounts of data; now we can store gazillions of datasets for a fraction of the cost. So the tools to work with data, the very availability of data, and the ability to store and analyze data are all cheap, all available, all ubiquitous. It's here.
For anyone just getting started on their data science journey, the range of technical options can be overwhelming. There is a dizzying amount of choice when it comes to programming languages. Each has its own strengths and weaknesses and there is no one right answer to the question of which one you should learn first.
The language you choose to learn will depend on the things you need to accomplish and the problems you need to solve. It will also depend on what company you work for, what role you have, and the age of your existing application.
We will put most of our focus on the top three Data Science languages: Python, R, and SQL.
Python is a powerhouse language. It is by far the most popular programming language for data science.
According to the 2019 Kaggle Data Science and Machine Learning Survey, 75% of the more than 10,000 respondents from around the world reported that they use Python on a regular basis.
Glassdoor reported that in 2019 more than 75% of data science positions listed included Python in their job descriptions.
When asked which language an aspiring data scientist should learn first, most data scientists say Python.
If you already know how to program, then Python is great for you because it uses clear, readable syntax. You can do many of the things you are used to doing in other programming languages but with Python, you can do it with less code. If you want to learn to program, it’s also a great starter language because of the huge global community and wealth of documentation.
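To make the "less code" point concrete, here is a minimal sketch (the sentence and word counts are invented for illustration): counting word frequencies, a task that takes noticeably more boilerplate in many other languages, fits in a few readable lines of Python.

```python
from collections import Counter

# Count how often each word appears in a sentence.
text = "data science is the science of learning from data"
counts = Counter(text.split())

print(counts.most_common(2))  # -> [('data', 2), ('science', 2)]
```

The standard library does the heavy lifting, which is typical of how Python keeps programs short.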
Python is useful for many situations, including data science, AI and Machine Learning, web development, and IoT devices like Raspberry Pi.
Large organizations that use Python heavily include IBM, Wikipedia, Google, Yahoo!, CERN, NASA, Facebook, Amazon, Instagram, Spotify, and Reddit.
If you want to learn Python, I created a free Python Tutorial for those who want to learn Python. Click here to go to my Python Tutorial.
Like Python, R is free to use, but while Python is an open-source project, R is a GNU project: it's actually free software. So if Python is open source and R is free software, what's the difference? Both open source and free software commonly refer to the same set of licenses; many open-source projects use the GNU General Public License, for example. Both support collaboration, and in many cases (but not all) the terms can be used interchangeably. The Open Source Initiative (OSI) champions open source, while the Free Software Foundation (FSF) defines free software. Open source is more business-focused, while free software is more focused on a set of values.
Back to why you should learn R. Because this is a free software project, you can use the language in the same way that you contribute to open source, and it allows for public collaboration and private and commercial use. Plus, R is another language supported by a wide global community of people passionate about making it possible to use the language to solve big problems.
It’s most often used by statisticians, mathematicians, and data miners for developing statistical software, graphing, and data analysis. The language’s array-oriented syntax makes it easier to translate from math to code, especially for someone with no or minimal programming background. According to Kaggle’s Data Science and Machine Learning Survey, most folks learn R when they’re a few years into their data science career, but it remains a welcoming language to those who don’t have a software programming background. R is popular in academia but companies that use R include IBM, Google, Facebook, Microsoft, Bank of America, Ford, TechCrunch, Uber, and Trulia.
SQL is a bit different from the other languages we’ve covered so far. First off, it’s formally pronounced “ess cue el,” although some people say “sequel.” While the acronym stands for “Structured Query Language,” many people do not consider SQL to be like other software development languages because it’s a non-procedural language and its scope is limited to querying and managing data. While it is not a “data science” language per se, data scientists regularly use it because it’s simple and powerful. Another couple of neat facts about SQL: it’s much older than Python and R, by about 20 years, having first appeared in 1974. And, SQL was developed at IBM.
This language is useful in handling structured data; that is, the data incorporating relations among entities and variables. SQL was designed for managing data in relational databases. Here you can see a diagram showing the general structure of a relational database. A relational database is formed by collections of two-dimensional tables; for example, datasets and Microsoft Excel spreadsheets. Each of these tables is then formed by a fixed number of columns and any number of rows.
Knowing SQL will help you do many different jobs in data science, including business and data analyst, and it’s a must in data engineering. When performing operations with SQL, you access the data directly. There’s no need to copy it beforehand. This can speed up workflow executions considerably. SQL is the interpreter between you and the database. SQL is an American National Standards Institute, or “ANSI,” standard, which means if you learn SQL and use it with one database, you will be able to easily apply that SQL knowledge to many other databases.
Mathematics is the bedrock of any contemporary discipline of science. Almost all the techniques of modern data science, including machine learning, have a deep mathematical underpinning.
Here are my suggestions for the topics to study to be at the top of the game in data science.
Linear algebra is a field of mathematics that is widely used in various disciplines.
A good understanding of linear algebra really enhances the understanding of many machine learning algorithms. Foremost, to really understand deep learning algorithms, linear algebra is essential.
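As a small illustration of why linear algebra matters here (a sketch with made-up numbers, assuming NumPy is available): a linear model's predictions for an entire dataset are just one matrix-vector product.

```python
import numpy as np

# 3 samples, 2 features each
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([0.5, -1.0])  # learned weights

# One matrix-vector product yields a prediction for every sample.
y_hat = X @ w
print(y_hat)  # -> [-1.5 -2.5 -3.5]
```

The same idea, stacked and composed, is what a deep learning layer computes, which is why linear algebra keeps reappearing in machine learning.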
Statistics and machine learning are two closely related areas of study. Statistics is an important prerequisite for applied machine learning, as it helps us select, evaluate and interpret predictive models.
Calculus (and its closely related counterpart, linear algebra) has some very narrow (but very useful) applications to data science.
It is an important field in mathematics and it plays an integral role in many machine learning algorithms. If you want to understand what’s going on under the hood in your machine learning work as a data scientist, you’ll need to have a solid grasp of the fundamentals of calculus.
Glassdoor analyzed data from data scientist job postings on Glassdoor and found that SQL is listed as one of the top three skills for a data scientist. Before you step into the field of data science, it is vitally important that you set yourself apart by mastering the foundations of this field. One of the foundational skills that you will require is SQL.
SQL is a powerful language used for communicating with databases. Every application that manipulates any kind of data needs to store that data somewhere, whether it's big data, a simple table with a few rows, a big database spanning multiple servers, or a mobile phone running its own small database.
Here are some of the advantages of learning SQL for someone interested in data science.
Data is one of the most critical assets of any business, and it is collected and used practically everywhere. Your bank stores data about you: your name, address, phone number, account number, et cetera. Your credit card company and your PayPal account also store data about you. Data is important, so it needs to be secure, and it needs to be stored and accessed quickly.
So, what is a database? Databases are everywhere and used every day, but they are largely taken for granted. A database is a repository of data: a program that stores data and provides the functionality for adding, modifying, and querying that data. There are different kinds of databases for different requirements.
The data can be stored in various forms. When data is stored in tabular form, it is organized in tables with columns and rows, like in a spreadsheet. That's a relational database. The columns contain properties of the item, such as last name, first name, email address, and city. A table is a collection of related things, like a list of employees or a list of book authors. In a relational database, you can form relationships between tables.
A set of software tools for the data in the database is called a database management system or DBMS for short. The terms database, database server, database system, data server, and database management systems are often used interchangeably. For relational databases, it’s called a relational database management system or RDBMS. RDBMS is a set of software tools that controls the data such as access, organization, and storage. And RDBMS serves as the backbone of applications in many industries including banking, transportation, health, and so on.
Examples of relational database management systems are MySQL, Oracle Database, DB2 Warehouse, and DB2 on Cloud. For the majority of people using a database, there are five simple commands: create a table, insert data to populate the table, select data from the table, update data in the table, and delete data from the table. Those are the building blocks of SQL for data science.
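Those five building blocks can be tried without installing anything, using Python's built-in sqlite3 module and an in-memory database (the table and column names below are purely illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway in-memory database
cur = conn.cursor()

cur.execute("CREATE TABLE employees (name TEXT, city TEXT)")           # create
cur.execute("INSERT INTO employees VALUES ('Ada', 'London')")          # insert
cur.execute("UPDATE employees SET city = 'Paris' WHERE name = 'Ada'")  # update
rows = cur.execute("SELECT name, city FROM employees").fetchall()      # select
print(rows)  # -> [('Ada', 'Paris')]
cur.execute("DELETE FROM employees WHERE name = 'Ada'")                # delete
conn.close()
```

The same statements, with vendor-specific variations, work against MySQL, Oracle Database, or DB2, which is the point of SQL being an ANSI standard.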
Data analysis is the process of gathering, cleaning, analyzing, and mining data, interpreting the results, and reporting the findings. With data analysis, we find patterns within data and correlations between different data points, and it is through these patterns and correlations that insights are generated and conclusions are drawn.
Data analysis helps businesses understand their past performance and informs their decision-making for future actions. Using data analysis, businesses can validate a course of action before committing to it, saving valuable time and resources and ensuring greater success.
Data visualization is a way to show complex data in a form that is graphical and easy to understand. This can be especially useful when one is exploring the data and getting acquainted with it. Also, since a picture is worth a thousand words, plots and graphs can be very effective at conveying a clear description of the data, especially when disclosing findings to an audience or sharing the data with peer data scientists.
They can be very valuable when it comes to supporting any recommendations you make to clients, managers, or other decision-makers in your field.
The matplotlib package is the most well-known library for data visualization, and it's excellent for making highly customizable graphs and plots. It was created by John Hunter, a neurobiologist who was part of a research team analyzing electrocorticography (ECoG) signals. The team was using proprietary software for the analysis, but they had only one license and were taking turns using it. To overcome this limitation, John set out to replace the proprietary software with a MATLAB-based version that could be used by him and his teammates, and extended by multiple investigators. As a result, Matplotlib was originally developed as an ECoG visualization tool, and just like MATLAB, it was equipped with a scripting interface, pyplot, for quick and easy generation of graphics.
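A minimal Matplotlib sketch, assuming the library is installed (the data points are invented; the Agg backend renders to a file so no display is needed):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display required
import matplotlib.pyplot as plt

years = [2016, 2017, 2018, 2019, 2020]
values = [5, 8, 12, 18, 27]

fig, ax = plt.subplots()
ax.plot(years, values, marker="o")  # a single line with point markers
ax.set_title("A minimal line plot")
ax.set_xlabel("Year")
ax.set_ylabel("Value")
fig.savefig("plot.png")  # write the figure to an image file
```

Everything on the figure, from colors to tick labels, can be customized from this same object-oriented interface.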
Machine learning is the subfield of computer science that gives “computers the ability to learn without being explicitly programmed.”
Let me explain what I mean when I say “without being explicitly programmed.”
Assume that you have a dataset of images of animals such as cats and dogs, and you want to have the software or an application that can recognize and differentiate them. The first thing that you have to do here is to interpret the images as a set of feature sets. For example, does the image show the animal’s eyes? If so, what are their sizes? Does it have ears? What about tails? How many legs? Does it have wings?
Prior to machine learning, each image would be transformed into a vector of features. Then, traditionally, we had to write down rules or methods to get computers to be intelligent and detect the animals, but this approach failed. Why? As you can guess, it needed a lot of rules that were highly dependent on the current dataset and not generalized enough to detect out-of-sample cases. This is where machine learning entered the scene. Machine learning allows us to build a model that looks at all the feature sets and their corresponding animal labels, and learns the pattern of each animal.
The result is a model, built by a machine learning algorithm, that detects the animals without being explicitly programmed to do so. In essence, machine learning follows much the same process that a four-year-old child uses to learn, understand, and differentiate animals. Machine learning algorithms, inspired by the human learning process, iteratively learn from data and allow computers to find hidden insights. These models help us in a variety of tasks, such as object recognition, summarization, and recommendation.
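The cats-and-dogs idea can be sketched with a 1-nearest-neighbour classifier, one of the simplest learning algorithms (the feature values below are entirely made up for illustration):

```python
# Each animal is a feature vector: [ear length, tail length, weight].
# A 1-nearest-neighbour "model" classifies a new animal by finding
# the most similar training example -- no hand-written rules.
training = [
    ([2.0, 1.0, 4.0], "cat"),
    ([2.5, 1.2, 4.5], "cat"),
    ([4.0, 2.5, 20.0], "dog"),
    ([4.5, 3.0, 25.0], "dog"),
]

def distance(a, b):
    # Euclidean distance between two feature vectors
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(features):
    _, label = min(training, key=lambda ex: distance(ex[0], features))
    return label

print(predict([2.2, 1.1, 5.0]))   # -> cat
print(predict([4.3, 2.7, 22.0]))  # -> dog
```

Notice that nothing in the code mentions cats or dogs explicitly; the labels come entirely from the data, which is what "without being explicitly programmed" means.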
You can go to my How to Start Learning Machine Learning article.
Neural Networks and Deep Learning
Computer scientists attempt to mimic real neurons and how our brains actually function. Decades ago, a neural network would have some inputs coming in; they would be fed into different processing nodes that would transform and aggregate them, and then perhaps pass them on to another layer of nodes.
So a neural network is a computer program that mimics how our brains use neurons and synapses to process things, building complex networks that can be trained.
This neural network starts out with some inputs and some outputs, and you keep feeding the inputs in to see what kinds of transformations will produce those outputs. You keep doing this over, and over, and over again, so that the network converges: given these inputs, the transformations eventually produce these outputs.
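That feed-forward-compare-adjust loop can be shrunk to a single artificial neuron. The sketch below (training data, learning rate, and iteration count are all illustrative choices) learns the logical AND function by nudging its weights over and over until its outputs converge:

```python
import math

# Inputs and target outputs for logical AND.
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w = [0.0, 0.0]  # weights
b = 0.0         # bias
lr = 0.5        # learning rate

def forward(x):
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1 / (1 + math.exp(-z))  # sigmoid activation

for _ in range(5000):              # repeat over, and over, and over
    for x, target in data:
        out = forward(x)
        error = out - target
        # gradient of the squared error with respect to each weight
        grad = error * out * (1 - out)
        w[0] -= lr * grad * x[0]
        w[1] -= lr * grad * x[1]
        b -= lr * grad

print([round(forward(x)) for x, _ in data])  # -> [0, 0, 0, 1]
```

A real deep network stacks many layers of such neurons and computes the gradients with backpropagation, but the training loop is the same idea.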
Deep learning is a specialized subset of machine learning that uses layered neural networks to simulate human decision-making. Deep learning algorithms can label and categorize information and identify patterns. It is what enables AI systems to continuously learn on the job and improve the quality and accuracy of results by determining whether decisions were correct.
Artificial neural networks, often referred to simply as neural networks, take inspiration from biological neural networks, although they work quite a bit differently. A neural network in AI is a collection of small computing units called neurons that take incoming data and learn to make decisions over time.
Neural networks are often many layers deep, which is why deep learning algorithms become more efficient as datasets increase in volume, whereas other machine learning algorithms may plateau as data increases.
In this digital world, everyone leaves a trace. From our travel habits to our workouts and entertainment, the increasing number of internet-connected devices that we interact with on a daily basis record vast amounts of data about us. There’s even a name for it: Big Data.
Ernst and Young offer the following definition: “Big Data refers to the dynamic, large and disparate volumes of data being created by people, tools, and machines. It requires new, innovative, and scalable technology to collect, host, and analytically process the vast amount of data gathered in order to derive real-time business insights that relate to consumers, risk, profit, performance, productivity management, and enhanced shareholder value.”
There is no one definition of Big Data, but there are certain elements that are common across the different definitions, such as velocity, volume, variety, veracity, and value. These are the V’s of Big Data.
Velocity is the speed at which data accumulates. Data is being generated extremely fast, in a process that never stops. Near or real-time streaming, local, and cloud-based technologies can process information very quickly.
Volume is the scale of the data or the increase in the amount of data stored. Drivers of volume are the increase in data sources, higher resolution sensors, and scalable infrastructure.
Variety is the diversity of the data. Structured data fits neatly into rows and columns in relational databases, while unstructured data is not organized in a pre-defined way, like tweets, blog posts, pictures, numbers, and video.
Variety also reflects that data comes from different sources, machines, people, and processes, both internal and external to organizations. Drivers are mobile technologies, social media, wearable technologies, geo technologies, and many more.
Veracity is the quality and origin of data and its conformity to facts and accuracy. Attributes include consistency, completeness, integrity, and ambiguity. Drivers include cost and the need for traceability. With the large amount of data available, the debate rages on about the accuracy of data in the digital era.
Value is our ability and need to turn data into value. Value isn’t just profit. It may have medical or social benefits, as well as customer, employee, or personal satisfaction. The main reason that people invest time to understand Big Data is to derive value from it.
Data mining is the process of automatically searching and analyzing data, discovering previously unrevealed patterns. It involves preprocessing the data to prepare it and transforming it into an appropriate format. Once this is done, insights and patterns are mined and extracted using various tools and techniques ranging from simple data visualization tools to machine learning and statistical models.
Data mining often combines various sources of data, including enterprise data that is secured by an organization and carries privacy concerns; sometimes multiple sources are integrated, including third-party data, customer demographics, financial data, and so on.