A Beginner’s Guide to Data Analysis in Python

For a beginner, the number of libraries and tools available in a programming language can be overwhelming. Python is a popular language with a robust ecosystem for data analysis and processing. This article walks through the libraries, tools, and concepts you will need to get started with data analysis in Python.

Prerequisites

Before we dive into data analysis in Python, there are fundamental skills that you must have. This guide assumes that you already have a good understanding of basic programming concepts such as variables, loops, and functions. If you lack a solid foundation in these concepts, take a beginner's course in Python before attempting data analysis.

Libraries

Pandas

When it comes to data analysis in Python, Pandas is the de facto standard library. It suits any dataset that can be represented as a table, and it provides data structures for representing data along with tools for cleaning, reshaping, manipulating, and plotting it.

A Pandas dataframe is a two-dimensional table of rows and columns, and each column can have its own data type. The following example shows how to import the Pandas library and create a dataframe:

import pandas as pd

data = {
    "name": ["Python", "Java", "C++"],
    "popularity": [2, 1, 3]
}

df = pd.DataFrame(data)
print(df)

This will output:

     name  popularity
0  Python           2
1    Java           1
2     C++           3
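To make the point about per-column data types concrete, here is a minimal sketch that inspects the same dataframe. The variable name popularity is only illustrative, and the exact type reported for the text column depends on the installed Pandas version.

```python
import pandas as pd

# Same data as in the example above
data = {
    "name": ["Python", "Java", "C++"],
    "popularity": [2, 1, 3],
}
df = pd.DataFrame(data)

# Each column carries its own data type (text vs. integer here)
print(df.dtypes)

# Selecting a single column by name returns a one-dimensional Series
popularity = df["popularity"]
print(popularity.max())

# shape is (number of rows, number of columns)
print(df.shape)
```

Checking dtypes right after creating or loading a dataframe is a quick way to catch numeric columns that were accidentally read as text.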

NumPy

NumPy is another powerful library, suited to scientific computing and data analysis. It provides tools for vector and matrix operations as well as statistical computation. Its core data structure, the ndarray, is an n-dimensional array in which every element has the same type, unlike a Pandas dataframe, whose columns can have different types. The following example creates an ndarray:

import numpy as np

arr = np.array([1, 2, 3])
print(arr)

The output will be:

[1 2 3]
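The vector, matrix, and statistical tools mentioned above can be sketched in a few lines. The arrays and variable names below are made up for illustration.

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

# Arithmetic on arrays is applied element by element
total = a + b      # array([11, 22, 33])
scaled = a * 2     # array([2, 4, 6])

# The @ operator computes a dot product for 1-D arrays
dot = a @ b        # 1*10 + 2*20 + 3*30 = 140

# ... and a matrix product for 2-D arrays
m = np.array([[1, 2], [3, 4]])
product = m @ m    # [[7, 10], [15, 22]]

# Basic statistics are available as array methods
mean = a.mean()    # 2.0
std = a.std()      # population standard deviation

# Mixing integers and floats still yields a single element type
mixed = np.array([1, 2.5])
print(mixed.dtype)  # float64
```

Because these operations work on whole arrays at once, they replace most explicit Python loops over numeric data.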

Matplotlib

Matplotlib is the most widely used visualization library in Python. It offers a broad range of plot types, from simple line graphs and scatter plots to histograms and bar charts. The following example plots four points:

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

plt.plot(x, y, 'ro')
plt.xlabel('X')
plt.ylabel('Y')
plt.show()

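The output will be a plot of four red circular markers. The 'ro' format string means red ('r') circles ('o') with no connecting line; use 'r-' instead to draw a line graph.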
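The third argument to plt.plot() is a format string that combines a color with a marker or line style. As a minimal sketch, the example below draws the same data twice, once as unconnected red circles ('ro') and once as a solid blue line ('b-'). It saves the figure to a file instead of opening a window; the file name format_strings.png is an arbitrary choice.

```python
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [1, 4, 9, 16]

fig, ax = plt.subplots()

# 'ro': red circles, no connecting line
ax.plot(x, y, 'ro', label='markers only')

# 'b-': solid blue line, no markers
ax.plot(x, y, 'b-', label='line only')

ax.set_xlabel('X')
ax.set_ylabel('Y')
ax.legend()

# Write the figure to disk; the file name is arbitrary
fig.savefig('format_strings.png')
```

In an interactive session, calling plt.show() instead of fig.savefig() displays the figure in a window.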

Data Analysis in Python

The pandas library provides many tools that are very useful for data analysis. In this section, we will discuss some of these tools.

Data Access

Loading data with pandas is simple, as shown in the following example:

import pandas as pd

df = pd.read_csv("data.csv")
print(df)

The read_csv function creates a pandas dataframe from a CSV file (here, a file named data.csv in the current working directory). The resulting dataframe can then be filtered, transformed, and summarized like any other.
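To try read_csv without a file on disk, the CSV text can be wrapped in an in-memory buffer. This is a minimal sketch: the sample rows are made up, and io.StringIO simply stands in for a real file such as data.csv.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a real file
csv_text = """name,age
Bob,25
Jane,32
Mike,18
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

# head(n) returns the first n rows, useful for a quick look at large files
print(df.head(2))

# The header row becomes the column names
print(list(df.columns))
print(df.shape)
```

Printing head() rather than the whole dataframe keeps the output readable when the file has many rows.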

Data Filtering

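We can filter a dataframe using boolean indexing. A condition such as df['age'] > 20 is evaluated for every row and produces a column of True/False values; indexing the dataframe with that result keeps only the rows where the condition is True. Here is an example: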

>>> import pandas as pd
>>> data = {
...     'name': ['Bob', 'Jane', 'Mike', 'Zac'],
...     'age': [25, 32, 18, 10]
... }
>>> df = pd.DataFrame(data)
>>> df[df['age'] > 20]
   name  age
0   Bob   25
1  Jane   32
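Conditions can also be combined. The sketch below reuses the same data and shows the intermediate boolean column, then combines two conditions with & (and) and | (or). The variable names mask, young_adults, and extremes are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Jane', 'Mike', 'Zac'],
    'age': [25, 32, 18, 10],
})

# The condition alone yields one True/False value per row
mask = df['age'] > 20
print(mask.tolist())  # [True, True, False, False]

# Combine conditions with & (and); each condition needs its own parentheses
young_adults = df[(df['age'] >= 18) & (df['age'] < 30)]
print(young_adults['name'].tolist())  # ['Bob', 'Mike']

# Combine conditions with | (or)
extremes = df[(df['age'] < 15) | (df['age'] > 30)]
print(extremes['name'].tolist())  # ['Jane', 'Zac']
```

The parentheses matter because & and | bind more tightly than comparison operators in Python, and the keywords and/or do not work element-wise on Pandas columns.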

Data Manipulation

Pandas provides many tools for data manipulation, including the following:

Grouping

The groupby() method splits a dataframe's rows into groups that share the same value in a given column, so that an aggregate such as the mean can be computed for each group.

>>> import pandas as pd
>>> data = {
...     'name': ['Bob', 'Jane', 'Mike', 'Zac', 'Zara'],
...     'age': [25, 32, 18, 10, 24],
...     'gender': ['M', 'F', 'M', 'M', 'F']
... }
>>> df = pd.DataFrame(data)
>>> df.groupby('gender')['age'].mean()
gender
F    28.000000
M    17.666667
Name: age, dtype: float64
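Several aggregates can be computed per group in one call with agg(). This is a minimal sketch on the same data; the variable name summary is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Bob', 'Jane', 'Mike', 'Zac', 'Zara'],
    'age': [25, 32, 18, 10, 24],
    'gender': ['M', 'F', 'M', 'M', 'F'],
})

# One row per group, one column per aggregate
summary = df.groupby('gender')['age'].agg(['count', 'mean', 'max'])
print(summary)

# Individual results can be looked up by group label and aggregate name
print(summary.loc['F', 'mean'])  # 28.0
print(summary.loc['M', 'max'])   # 25
```

The result is itself a dataframe indexed by group, so it can be filtered, sorted, or plotted like any other.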

Visualizations

Pandas integrates well with Matplotlib: dataframes have a plot() method, built on Matplotlib, for creating graphs quickly, and dataframe columns can be passed directly to Matplotlib functions for finer control. The following example takes the second approach:

import pandas as pd
import matplotlib.pyplot as plt

data = {
    'name': ['Bertie', 'Sandra', 'Chris', 'Peter', 'Amy', 'Lana', 'Mila'],
    'age': [28, 30, 33, 25, 35, 40, 20],
    'salary': [60000, 90000, 80000, 50000, 120000, 100000, 55000]
}

df = pd.DataFrame(data)

plt.scatter(df.age, df.salary, color='r')
plt.title('Age vs Salary')
plt.xlabel('Age')
plt.ylabel('Salary')
plt.grid(True)
plt.show()
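For comparison, here is a minimal sketch of a similar chart produced through the dataframe's own plot() method. It saves the figure to a file rather than opening a window; the file name age_vs_salary.png is an arbitrary choice.

```python
import pandas as pd
import matplotlib.pyplot as plt

data = {
    'name': ['Bertie', 'Sandra', 'Chris', 'Peter', 'Amy', 'Lana', 'Mila'],
    'age': [28, 30, 33, 25, 35, 40, 20],
    'salary': [60000, 90000, 80000, 50000, 120000, 100000, 55000]
}
df = pd.DataFrame(data)

# plot() returns a Matplotlib Axes object that can be customized further
ax = df.plot(kind='scatter', x='age', y='salary',
             color='r', title='Age vs Salary', grid=True)
ax.set_xlabel('Age')
ax.set_ylabel('Salary')

# Write the figure to disk; the file name is arbitrary
ax.figure.savefig('age_vs_salary.png')
```

Because plot() returns a regular Matplotlib Axes, everything shown earlier for Matplotlib still applies, and calling plt.show() displays the figure in an interactive session.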

Conclusion

In conclusion, Python provides a vast array of tools and libraries that simplify data analysis. We have covered the main libraries, tools, and concepts needed to get started: Pandas for tabular data, NumPy for numerical arrays, and Matplotlib for visualization. This is just the tip of the iceberg, since each of these libraries has far more functionality than a single article can cover and can take years to master, but it should provide a good starting point for beginners.