Data Science for Beginners: How to Read Excel Files with Python
Data science is a field that combines statistical analysis with computer science to gain insights and knowledge from complex data sets. It has become a popular topic in recent years with the growing need for data-driven decision making in various industries. One of the fundamental skills in data science is the ability to work with data in different formats, including Excel files. In this article, we will explore how to read and process Excel files using Python.
What is an Excel file?
An Excel file is a spreadsheet created by Microsoft Excel software. It contains tables of data organized in rows and columns, with each box in the table called a cell. Each cell can hold a value, formula, or function that performs calculations with other cells.
Excel files can contain various data types, including text, numbers, dates, and formulas. They are commonly used for organizing, analyzing, and visualizing data in various fields, including finance, engineering, science, and business.
Why use Python for reading Excel files?
Python is a powerful programming language widely used in data science and machine learning. It offers a rich set of libraries and tools for working with different data formats, including Excel files. Some of the advantages of using Python for reading Excel files are:
– Python is free and open-source
– Python is cross-platform (works on Windows, Mac, and Linux)
– Python is easy to learn and use
– Python offers a wide range of libraries for data manipulation and analysis
– Python can handle large data sets efficiently
How to read Excel files with Python?
To read Excel files with Python, we first need to install a Python module called “pandas”. Pandas is a popular library for data manipulation and analysis, especially for working with tabular data like Excel files. It provides a set of functions and data structures for reading, writing, and processing data in various formats.
To install pandas, we can run the following command in a command prompt or terminal. The command also installs openpyxl, which pandas relies on to read “.xlsx” files:
```
pip install pandas openpyxl
```
Once pandas is installed, we can start working with Excel files in Python. To read an Excel file, we can use the `read_excel()` function from pandas. It takes the path of the Excel file as input and returns a pandas DataFrame object, which is a two-dimensional table of data.
Here’s an example of reading an Excel file named “data.xlsx” located in the current directory:
```python
import pandas as pd
# Load the first sheet of data.xlsx into a DataFrame
df = pd.read_excel("data.xlsx")
# Show the first five rows
print(df.head())
```
This code reads the “data.xlsx” file and prints the first five rows of the DataFrame using the `head()` function. The output should look like this:
```
   ID      Name  Age  Gender
0   1      John   30    Male
1   2      Jane   25  Female
2   3  Samantha   27  Female
3   4     Alice   32  Female
4   5      Fred   28    Male
```
In this example, the Excel file contains a table with columns “ID”, “Name”, “Age”, and “Gender”. Each row represents a person’s data, such as their ID, name, age, and gender.
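After loading a file, it often helps to check what pandas actually read before doing anything else. The sketch below is only an illustration: it rebuilds the five sample rows shown above in memory (an assumption, so that it runs without “data.xlsx”) and prints the table size, the column names, and the data type inferred for each column.

```python
import pandas as pd

# The five sample rows printed above, rebuilt in memory so the snippet
# runs without data.xlsx (an assumption made for illustration).
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Name": ["John", "Jane", "Samantha", "Alice", "Fred"],
        "Age": [30, 25, 27, 32, 28],
        "Gender": ["Male", "Female", "Female", "Female", "Male"],
    }
)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names taken from the header row
print(df.dtypes)         # data type pandas inferred for each column
```

The same three checks work unchanged on a DataFrame returned by `read_excel()`.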
We can also use the `sheet_name` parameter of the `read_excel()` function to specify a particular sheet in the Excel file. For example, if the Excel file has multiple sheets, we can read the second sheet named “Sheet2” with the following code:
```python
import pandas as pd
# Load only the sheet named "Sheet2"
df = pd.read_excel("data.xlsx", sheet_name="Sheet2")
print(df.head())
```
This code reads the second sheet in the “data.xlsx” file and prints its first five rows.
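The `sheet_name` parameter also accepts a zero-based position, or `None` to load every sheet at once. The following sketch shows both forms. It first writes a small two-sheet workbook to a temporary folder so that it can run on its own; the file name “demo.xlsx” and the sheet contents are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Build a small two-sheet workbook so the example is self-contained.
# The file name and the sheet contents are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
with pd.ExcelWriter(path) as writer:
    pd.DataFrame({"ID": [1, 2]}).to_excel(writer, sheet_name="Sheet1", index=False)
    pd.DataFrame({"ID": [3, 4]}).to_excel(writer, sheet_name="Sheet2", index=False)

# An integer selects a sheet by zero-based position: 1 is the second sheet.
second = pd.read_excel(path, sheet_name=1)

# sheet_name=None loads every sheet into a dict keyed by sheet name.
all_sheets = pd.read_excel(path, sheet_name=None)

print(second)
print(list(all_sheets.keys()))
```

Selecting by position is handy when the sheet names are not known in advance, while selecting by name keeps working if the sheets are reordered.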
How to process Excel files with Python?
Once we have read an Excel file into a pandas DataFrame, we can perform various operations on the data, such as filtering, sorting, grouping, and plotting.
For example, to filter the data for female persons only, we can use the following code:
```python
import pandas as pd
df = pd.read_excel("data.xlsx")
# Keep only the rows whose Gender value is "Female"
female_df = df[df["Gender"] == "Female"]
print(female_df)
```
This code selects only the rows where the “Gender” column has the value “Female” and creates a new DataFrame called “female_df”. The output should look like this:
```
   ID      Name  Age  Gender
1   2      Jane   25  Female
2   3  Samantha   27  Female
3   4     Alice   32  Female
```
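Filters can also combine several conditions. As a rough sketch, the code below keeps only the females younger than 30. It rebuilds the five sample rows in memory (an assumption, so that it runs without “data.xlsx”), and the age threshold is chosen purely for illustration.

```python
import pandas as pd

# The five sample rows from this article, rebuilt in memory
# (an assumption so the snippet runs without data.xlsx).
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Name": ["John", "Jane", "Samantha", "Alice", "Fred"],
        "Age": [30, 25, 27, 32, 28],
        "Gender": ["Male", "Female", "Female", "Female", "Male"],
    }
)

# Each condition needs its own parentheses; & means "and", | means "or".
young_women = df[(df["Gender"] == "Female") & (df["Age"] < 30)]
print(young_women)
```

The parentheses matter because `&` binds more tightly than `==` and `<` in Python, so leaving them out raises an error or gives the wrong result.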
Similarly, we can sort the data by the “Age” column in descending order using the following code:
```python
import pandas as pd
df = pd.read_excel("data.xlsx")
# Sort from the highest age to the lowest
sorted_df = df.sort_values(by="Age", ascending=False)
print(sorted_df)
```
This code sorts the DataFrame by the “Age” column in descending order (highest to lowest) and creates a new DataFrame called “sorted_df”. For the five rows shown earlier, the output looks like this:
```
   ID      Name  Age  Gender
3   4     Alice   32  Female
0   1      John   30    Male
4   5      Fred   28    Male
2   3  Samantha   27  Female
1   2      Jane   25  Female
```
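Sorting by more than one column works the same way. The sketch below, which again rebuilds the five sample rows in memory as an assumption, sorts by “Gender” first and then by “Age” from highest to lowest within each gender.

```python
import pandas as pd

# The five sample rows from this article, rebuilt in memory
# (an assumption so the snippet runs without data.xlsx).
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Name": ["John", "Jane", "Samantha", "Alice", "Fred"],
        "Age": [30, 25, 27, 32, 28],
        "Gender": ["Male", "Female", "Female", "Female", "Male"],
    }
)

# Pass lists to sort by several columns; each column gets its own direction.
sorted_df = df.sort_values(by=["Gender", "Age"], ascending=[True, False])

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... after sorting.
sorted_df = sorted_df.reset_index(drop=True)
print(sorted_df)
```

Resetting the index is optional. Without it, each row keeps the label it had before sorting, as in the output above.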
We can also group the data by a particular column and calculate summary statistics for each group. For example, to group the data by the “Gender” column and calculate the average age for each group, we can use the following code:
```python
import pandas as pd
df = pd.read_excel("data.xlsx")
# agg() takes a dict that maps a column name to an aggregation
grouped_df = df.groupby("Gender").agg({"Age": "mean"})
print(grouped_df)
```
This code groups the DataFrame by the “Gender” column and calculates the mean age for each group. The output should look like this:
```
              Age
Gender           
Female  28.000000
Male    29.333333
```
In this example, we use the `groupby()` function to group the DataFrame by the “Gender” column. We then use the `agg()` function to apply an aggregation function (mean) to the “Age” column for each group. Finally, we print the resulting DataFrame that shows the average age for each gender.
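Several statistics can be computed in one pass with named aggregation, where each keyword names an output column and pairs an input column with an aggregation. This is a minimal sketch: it uses only the five sample rows printed earlier, rebuilt in memory as an assumption, so its numbers can differ from results computed on the full file. The output column names `mean_age`, `oldest`, and `people` are made up for illustration.

```python
import pandas as pd

# The five sample rows from this article, rebuilt in memory
# (an assumption so the snippet runs without data.xlsx).
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Name": ["John", "Jane", "Samantha", "Alice", "Fred"],
        "Age": [30, 25, 27, 32, 28],
        "Gender": ["Male", "Female", "Female", "Female", "Male"],
    }
)

# Named aggregation: output_column=(input_column, aggregation)
summary = df.groupby("Gender").agg(
    mean_age=("Age", "mean"),
    oldest=("Age", "max"),
    people=("ID", "count"),
)
print(summary)
```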
We can also plot the data in a pandas DataFrame using various graphs and charts. For example, to create a bar chart showing the total number of persons for each gender, we can use the following code:
```python
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_excel("data.xlsx")
# Count the rows in each gender group
grouped_df = df.groupby("Gender").agg({"Name": "count"})
# Draw the counts as a bar chart and label it
grouped_df.plot(kind="bar", legend=False, rot=0, color=["blue", "pink"])
plt.title("Number of Persons by Gender")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.show()
```
This code groups the DataFrame by the “Gender” column and counts the number of persons in each group using the “count” aggregation. We then create a bar chart using the `plot()` method, with formatting options for the bar colors, the title, and the axis labels. Finally, we show the chart using the `show()` function. The output should look like this:
![Bar chart of persons by gender](https://1.bp.blogspot.com/-vaSe30cRpAs/YIZqKxPaslI/AAAAAAAADqc/2DUHGL0PhQcpAxoPJGcFhiV3h_zY6TZYQCLcBGAsYHQ/w1200-h630-p-k-no-nu/Tutorial%2BData%2BMining%2B%252813%2529.png "Bar chart of persons by gender")
This chart shows that there are three females and three males in the data set.
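When only the counts are needed, `value_counts()` on a single column is a shorter route than `groupby()`. The sketch below uses just the five sample rows printed earlier, rebuilt in memory as an assumption, so its counts reflect those five rows rather than the full file.

```python
import pandas as pd

# The five sample rows from this article, rebuilt in memory
# (an assumption so the snippet runs without data.xlsx).
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Name": ["John", "Jane", "Samantha", "Alice", "Fred"],
        "Age": [30, 25, 27, 32, 28],
        "Gender": ["Male", "Female", "Female", "Female", "Male"],
    }
)

# Count how often each value appears in the Gender column.
counts = df["Gender"].value_counts()
print(counts)
```

The result is a pandas Series, so calling `counts.plot(kind="bar")` would draw it as a bar chart in the same way as before.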
FAQ
Q: Can we read Excel files in Python without using pandas?
A: Yes, we can. Other Python modules such as openpyxl (for “.xlsx” files), xlrd (for legacy “.xls” files), and xlwt (for writing “.xls” files) can read or write Excel files directly. However, pandas is usually more convenient for tabular data because it loads a sheet straight into a DataFrame that is ready for analysis.
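As a rough sketch of the pandas-free route, the code below reads a workbook with openpyxl alone. It first creates a tiny workbook in a temporary folder so that it can run on its own; the file name “demo.xlsx” and the two data rows are made up for illustration.

```python
import os
import tempfile

from openpyxl import Workbook, load_workbook

# Create a tiny workbook first so the example is self-contained.
# The file name and the rows are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
wb = Workbook()
ws = wb.active
ws.append(["ID", "Name"])
ws.append([1, "John"])
ws.append([2, "Jane"])
wb.save(path)

# Read it back: each row comes out as a plain tuple of cell values.
sheet = load_workbook(path).active
rows = list(sheet.iter_rows(values_only=True))
print(rows)
```

Here the header row is just the first tuple, so separating it from the data is left to us. That bookkeeping is what `read_excel()` handles automatically.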
Q: How can we write data to an Excel file using Python?
A: We can write data to an Excel file using the `to_excel()` function from pandas. It takes the path of the Excel file as input and writes the DataFrame to a sheet in the file. For example, to write the “female_df” DataFrame from the previous example to a new Excel file named “female_data.xlsx”, we can use the following code:
```python
import pandas as pd
df = pd.read_excel("data.xlsx")
female_df = df[df["Gender"] == "Female"]
# Write the filtered rows to a new file, without the row index
female_df.to_excel("female_data.xlsx", index=False)
```
This code writes the “female_df” DataFrame to a new sheet in the “female_data.xlsx” file. The `index=False` parameter tells pandas not to write the row indexes to the file.
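To place several DataFrames in one workbook, `to_excel()` can be given a `pd.ExcelWriter` instead of a path. The sketch below writes one sheet per gender and then reads the file back to check the result. It rebuilds the five sample rows in memory and writes to a temporary folder; both are assumptions made so that the example runs on its own, and the file name “by_gender.xlsx” is made up.

```python
import os
import tempfile

import pandas as pd

# The five sample rows from this article, rebuilt in memory
# (an assumption so the snippet runs without data.xlsx).
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "Name": ["John", "Jane", "Samantha", "Alice", "Fred"],
        "Age": [30, 25, 27, 32, 28],
        "Gender": ["Male", "Female", "Female", "Female", "Male"],
    }
)

# Hypothetical output file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "by_gender.xlsx")

# One ExcelWriter can receive several DataFrames, each on its own sheet.
with pd.ExcelWriter(path) as writer:
    for gender, group in df.groupby("Gender"):
        group.to_excel(writer, sheet_name=gender, index=False)

# Read every sheet back to confirm what was written.
sheets = pd.read_excel(path, sheet_name=None)
print({name: len(table) for name, table in sheets.items()})
```

The `with` block saves and closes the file when it ends, so no separate save call is needed.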