What is the fastest way to apply string operation on whole dataframe column?

Issue related to:Data Manipulation, Automation

Module used: Pandas 

Issue link:https://www.reddit.com/r/learnpython/comments/g5y4pa/what_is_the_fastest_way_to_apply_string_operation/

Issue:

I have a csv tabular data around 3M rows, in the following structure:

row Column 1
1 [978]
2 [2341]
3 234
4 213
5 23
30000000 23

Some of the raw data is corrupted.

Like the first two row which have a number store in list type in which I am only interest to the number inside.

All I want is to extract the number 978 , 2341 out of the square brackets.

The whole column is read as object to dataframe initially, and now I want to extract the number and cast whole column back to number.

I have try to use a for loop with condition to convert the item one by one but it is very slow.

Is there any more efficient method ?

Short answer:

Use the following function:

df["Column 1"] = df["Column 1"].astype(str).str.replace("[\]\[]",'').astype(int)

Explanation:

The user would like to:

  • Familiarize our self with the data
  • Select the column known as "Column 1"
  • Identify and remove data that is in [square brackets]
  • Commit the changes 

I will begin by creating a csv file in a similar structure to the one above and save this file off asBook1.csv

To tackle this task we will use string manipulation to get rid of the brackets 

Familiarize our self with the data

In this project we will be using the library "pandas" as the functions that we want to use are readily available so we not need to fiddle with loops. We will begin by importing pandas;

import pandas

Our next step is to read the file - We will use the function pandas.read_csv() and pass the name of our CSV file which in this case is Book1.csv. This will be saved in the variable df:

df = pandas.read_csv("Book1.csv")

Select the column known as "Column 1"

To get hold of the first column we will use df["Column 1"]

We now need to make sure we are taking the column in as a series of string - we can use the DataFrame.astype() function In our case the DataFrame is df["Column 1"] and the parameter we are passing is str . This will overwrite the variable df["Column 1"] :

df["Column 1"] = df["Column 1"].astype(str)

Identify and remove data that is in [square brackets]

For the strings we need to remove:

  • The open bracket [
  • The close bracket ]
  • or both brackets together if their is no number inside []

To do this we will use the function series.str.replace() the series will be df["Column 1"].

In addition, we are passing two parameters:

The first parameter will be what we are removing in our case '[\]\[]' . The \ is similar to 'or' so what we are saying is [ or ] or both brackets.

The second parameter will be what the replacement will be in our case ' '. Just a space to remove it.

This will be saved to the variable df["Column 1"]

What we now have is:

df["Column 1"] = df["Column 1"].str.replace("[\]\[]",'')

Now that we have a column of numbers it is best to convert our string back to integers - we can use DataFrame.astype() function again but this time passing int and save this variable to df["Column 1"]:

df["Column 1"] = df["Column 1"].astype(int)

Commit the changes

Now that we have got the changes we want the only step is to save our data - We will use the function DataFrame.to_csv() In our case the dataframe will be df and we will pass the parameter of our new file name in this case I will call it Book2.csv. We also need to pass the parameter index=False so that we have no index column 

df.to_csv('Book2.csv',index=False)

When we run this code we will have a file called Book2.csv with the formatted data we want 

Our code in the end looks like this:

import pandas

df = pandas.read_csv("Book1.csv")
df["Column 1"] = df["Column 1"].astype(str)
df["Column 1"] = df["Column 1"].str.replace("[\]\[]",'')
df["Column 1"] = df["Column 1"].astype(int)

df.to_csv('Book2.csv',index=False)

We can simplify the code to the following:

import pandas

df["Column 1"] = df["Column 1"].astype(str).str.replace("[\]\[]",'').astype(int)

df.to_csv('Book2.csv',index=False)