A data-cleaning question, looking for help

Question: Read the data from https://www.digitalanalytics.id.au/static/files/artists-spotify.csv. Create a new variable named 'Popularity_cat' based on popularity. Code a popularity of 0-50 as 'low popularity' and a popularity of 51-100 as 'high popularity'. Save the cleaned and processed dataframe as 'artists-spotify-clean.csv'.

The original wording of the question is:

  • Create a new variable named 'Popularity_cat' based on popularity. A popularity score of 0-50 is coded as a value 'low popularity', 51-100 is coded as a value 'high popularity'.
  • Save the cleaned and processed dataframe as 'artists-spotify-clean.csv'

This is what I have so far. The code for this question starts at line 19.

import pandas as pd

df = pd.read_csv('https://www.digitalanalytics.id.au/static/files/artists-spotify.csv',sep=';')

print(df.info())

print(df.duplicated().sum())
print(df[df.duplicated()])
df = df.drop_duplicates()

print(df.isnull().sum())
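isolatemissing = df.isnull().any(axis=1)  # rows with at least one missing value
print(df[isolatemissing])
df = df.dropna()  # dropna() returns a new DataFrame, so assign it back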

df = df.sort_values(by=['popularity'], ascending=False)
print(df[['popularity']].head(20))

#def['x'] = df ['popularity'].str[-1:]

def Popularity_cat(x):
    # 0-50 is coded as 'low popularity', 51-100 as 'high popularity'
    if x <= 50:
        return 'low popularity'
    return 'high popularity'

df['Popularity_cat'] = df['popularity'].apply(Popularity_cat)

print(df[['popularity','Popularity_cat']])

df.to_csv('artists-spotify-clean.csv',sep=';',index=False)

 

You don't need to define a function. Using np.where() or a list comprehension is simpler.
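
A minimal sketch of both suggestions, assuming the popularity column is numeric and every value lies between 0 and 100. The small DataFrame here is hypothetical sample data standing in for the real CSV, and cat_list is just an illustrative name.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data in place of the real CSV
df = pd.DataFrame({'popularity': [0, 50, 51, 100]})

# Option 1: np.where picks the first label where the condition is True,
# and the second label everywhere else
df['Popularity_cat'] = np.where(df['popularity'] <= 50,
                                'low popularity', 'high popularity')

# Option 2: a list comprehension that builds the same labels row by row
cat_list = ['low popularity' if p <= 50 else 'high popularity'
            for p in df['popularity']]

print(df)
```

Either result can be assigned to df['Popularity_cat']. One thing to watch: a missing popularity value fails the `<= 50` comparison, so both versions would label it 'high popularity'. Dropping or filling missing values first avoids that.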