现有一个矩阵(tsv文件),行名为ENSG开头+数字编号
前五行五列情况如图
思路:字符串分割后分组,分组后比较排序后筛选剔除
import pandas as pd
df = pd.read_csv('xxxx.tsv', sep='\t')
df['tempOne'] = df['A'].apply(lambda x: x.split(".")[0])
df['tempTwo'] = df['A'].apply(lambda x: int(x.split(".")[1]))
df_new = df.iloc[df.groupby('tempOne').apply(lambda o: o['tempTwo'].idxmax())]
df_new.drop(['tempOne', 'tempTwo'], axis=1, inplace=True)
源数据:
文件发我邮箱383817842@qq.com,我给你搞
import pandas as pd
import pandasql as ps
df = pd.DataFrame({"A": ['ENSG0002.14', 'ENSG0002.8', 'ENSG0003.18', 'ENSG0006.1'],
"B": ['B4', 'B5', 'B6', 'B7'],
"C": ['C4', 'C5', 'C6', 'C7'],
"D": ['D4', 'D5', 'D6', 'D7'],
"F": ['F4', 'F5', 'F6', 'F7']}
)
sql = """SELECT * FROM df as t
WHERE
NOT EXISTS(SELECT 1 FROM df WHERE
SUBSTR(A,1,INSTR(A,'.')-1)= SUBSTR(T.A,1,INSTR(T.A,'.')-1)
AND
CAST(SUBSTR(A,INSTR(A,'.')+1, LENGTH(A)-INSTR(A,'.')) as int) >CAST(SUBSTR(t.A,INSTR(t.a,'.')+1,LENGTH(t.A)-INSTR(t.A,'.')) as int)
)
"""
f = lambda x: ps.sqldf(x, globals())
print(f(sql))
--result
A B C D F
0 ENSG0002.14 B4 C4 D4 F4
1 ENSG0003.18 B6 C6 D6 F6
2 ENSG0006.1 B7 C7 D7 F7
使用pandas 列取最大值