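I have an XML file, test.xml, whose main content is a series of Hit records. I want to extract the contents of two tags from each Hit and write them to a new file, res.txt. Ideally the script would be run as bash xml-info.sh test.xml or python xml-info.py test.xml and would generate res.txt automatically.
A Hit record has the basic structure shown below. Using a shell script or a Python script, I want to extract two fields from the XML file, Hit_accession and Hsp_hseq: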
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|SRA|SRR6675308.2504034.2</Hit_id>
<Hit_def>2504034</Hit_def>
<Hit_accession>SRR6675308.2504034.2</Hit_accession>
<Hsp_align-len>50</Hsp_align-len>
<Hsp_qseq>GFRKLPMGVGLSPFLLAQFTSSLASMVRRNFPHCMVFAYMDDVVLGAKSV</Hsp_qseq>
<Hsp_hseq>GFRKIPMGVGLSPFLLAQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV</Hsp_hseq>
<Hsp_midline>GFRK+PMGVGLSPFLLAQFTS++ S+VRR FPHC+ F+YMDDVVLGAKSV</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
The output file res.txt should have the following format: the first line is the Hit_accession prefixed with >, and the second line is the Hsp_hseq content as-is.
>SRR6675308.2504034.2
GFRKIPMGVGLSPFLLAQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV
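The requested layout is a header line starting with > followed by the sequence on the next line, which matches the usual FASTA convention. A minimal sketch of just the formatting step is shown here; `format_record` is a hypothetical helper name that does not appear in the original post.

```python
def format_record(accession: str, hseq: str) -> str:
    """Return one two-line record: '>accession' followed by the sequence."""
    # Strip surrounding whitespace so stray newlines from the XML text
    # do not end up inside the header or the sequence line.
    return f">{accession.strip()}\n{hseq.strip()}\n"


if __name__ == "__main__":
    record = format_record(
        "SRR6675308.2504034.2",
        "GFRKIPMGVGLSPFLLAQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV",
    )
    print(record, end="")
```

Keeping the formatting in one small function makes it easy to change the layout later (for example, wrapping long sequences) without touching the parsing code.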
Test file test.xml:
<?xml version="1.0"?>
<!DOCTYPE BlastOutput PUBLIC "-//NCBI//NCBI BlastOutput/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_BlastOutput.dtd">
</Parameters>
</BlastOutput_param>
<BlastOutput_iterations>
<Iteration>
<Iteration_iter-num>1</Iteration_iter-num>
<Iteration_query-len>838</Iteration_query-len>
<Iteration_hits>
<Hit>
<Hit_num>1</Hit_num>
<Hit_id>gnl|SRA|SRR6675308.2504034.2</Hit_id>
<Hit_def>2504034</Hit_def>
<Hit_accession>SRR6675308.2504034.2</Hit_accession>
<Hit_len>150</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>50</Hsp_align-len>
<Hsp_qseq>GFRKLPMGVGLSPFLLAQFTSSLASMVRRNFPHCMVFAYMDDVVLGAKSV</Hsp_qseq>
<Hsp_hseq>GFRKIPMGVGLSPFLLAQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV</Hsp_hseq>
<Hsp_midline>GFRK+PMGVGLSPFLLAQFTS++ S+VRR FPHC+ F+YMDDVVLGAKSV</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>2</Hit_num>
<Hit_id>gnl|SRA|SRR10821940.27046739.1</Hit_id>
<Hit_def>27046739</Hit_def>
<Hit_accession>SRR10821940.27046739.1</Hit_accession>
<Hit_len>150</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_num>1</Hsp_num>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>50</Hsp_align-len>
<Hsp_qseq>GFRKLPMGVGLSPFLLAQFTSSLASMVRRNFPHCMVFAYMDDVVLGAKSV</Hsp_qseq>
<Hsp_hseq>GFRKIPMGVGLSPFLLAQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV</Hsp_hseq>
<Hsp_midline>GFRK+PMGVGLSPFLLAQFTS++ S+VRR FPHC+ F+YMDDVVLGAKSV</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
<Hit>
<Hit_num>3</Hit_num>
<Hit_id>gnl|SRA|SRR10821940.8209197.2</Hit_id>
<Hit_def>8209197</Hit_def>
<Hit_accession>SRR10821940.8209197.2</Hit_accession>
<Hit_len>150</Hit_len>
<Hit_hsps>
<Hsp>
<Hsp_query-frame>0</Hsp_query-frame>
<Hsp_hit-frame>-1</Hsp_hit-frame>
<Hsp_gaps>0</Hsp_gaps>
<Hsp_align-len>50</Hsp_align-len>
<Hsp_qseq>GFRKLPMGVGLSPFLLAQFTSSLASMVRRNFPHCMVFAYMDDVVLGAKSV</Hsp_qseq>
<Hsp_hseq>GFRKIPMGVGLSPFLLAQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV</Hsp_hseq>
<Hsp_midline>GFRK+PMGVGLSPFLLAQFTS++ S+VRR FPHC+ F+YMDDVVLGAKSV</Hsp_midline>
</Hsp>
</Hit_hsps>
</Hit>
</Iteration_hits>
<Iteration_stat>
<Statistics>
<Statistics_db-num>1266261802</Statistics_db-num>
<Statistics_db-len>1968308817</Statistics_db-len>
</Statistics>
</Iteration_stat>
</Iteration>
</BlastOutput_iterations>
</BlastOutput>
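Put test.xml in the same directory as xml-info.py and run the following code:
import argparse
from lxml import etree

parser = argparse.ArgumentParser()
# nargs='?' makes the positional argument optional, so the default can apply
parser.add_argument('f', nargs='?', default='test.xml')
parser.add_argument('-o', default='res.txt')
args = parser.parse_args()

# Let lxml open the file itself so that the encoding is detected by the parser
root = etree.parse(args.f).getroot()

lines = []
# Pair each Hit_accession with the Hsp_hseq values inside the same Hit,
# so that a Hit with several HSPs cannot shift the pairing
for hit in root.iter('Hit'):
    acc = hit.findtext('Hit_accession', default='')
    for hseq in hit.xpath('.//Hsp_hseq/text()'):
        lines.append('>' + acc + '\n' + hseq)

with open(args.o, 'w', encoding='utf-8') as fw:
    fw.write('\n'.join(lines) + '\n')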
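As an alternative sketch that needs only the Python standard library, the same extraction can be done with `xml.etree.ElementTree.iterparse`, which processes the file incrementally and so may suit large BLAST outputs better. The function names `iter_hits` and `write_fasta` are illustrative and not part of the original answer; the sketch assumes the input is a complete, well-formed BLAST XML file and emits one record per Hsp when a Hit contains several.

```python
import sys
import xml.etree.ElementTree as ET


def iter_hits(xml_path):
    """Yield (accession, hseq) for every Hsp of every Hit in the file."""
    for _event, elem in ET.iterparse(xml_path, events=("end",)):
        if elem.tag != "Hit":
            continue
        accession = (elem.findtext("Hit_accession") or "").strip()
        # A Hit may hold several Hsp elements; emit one record for each.
        for hsp in elem.iter("Hsp"):
            hseq = (hsp.findtext("Hsp_hseq") or "").strip()
            if accession and hseq:
                yield accession, hseq
        # Drop the finished <Hit> subtree to keep memory use low.
        elem.clear()


def write_fasta(xml_path, out_path="res.txt"):
    """Write all records to out_path and return the number written."""
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for accession, hseq in iter_hits(xml_path):
            out.write(f">{accession}\n{hseq}\n")
            count += 1
    return count


# Runs only when an input path is given, e.g.: python xml-info.py test.xml
if __name__ == "__main__" and len(sys.argv) > 1:
    print(write_fasta(sys.argv[1]), "records written to res.txt")
```

Reading the accession and the sequences from the same `Hit` element keeps the two values paired by construction, instead of relying on two separately collected lists having the same length.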
Contents of the resulting res.txt:
>SRR6675308.2504034.2
GFRKIPMGVGLSPFLLAQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV
>SRR10821940.27046739.1
GFRKIPMGVGLSPFLLAQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV
>SRR10821940.8209197.2
GFRKIPMGVGLSPFLLAQFTSAICSVVRRAFPHCLAFSYMDDVVLGAKSV
If this helps, please accept the answer.
Using a shell script to read XML attribute values and node values
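#FUNCTION: GetNodeValue
#DESC    : Get xml node value
#INPUT   : 1-XmlFilePath 2-NodeName
#OUTPUT  : node value
#NOTE    : needs GNU sed (\n in the replacement, \s) and an awk that accepts a multi-character RS
function GetNodeValue
{
    if [ $# -ne 2 ]; then
        echo " error: expected exactly 2 arguments"
        echo " USAGE: $0 XmlFilePath NodeName"
        echo "   XmlFilePath  xmlfile path  type[${HOME}/config/datasource/bmp-xa-ds.xml]"
        echo "   NodeName     nodename      type[xa-datasource-property]"
        echo " e.g.: $0 ${HOME}/config/datasource/bmp-xa-ds.xml 'xa-datasource-property name=\"URL\"'"
        return 1
    fi
    CurrentTime=$(date +"%Y%m%d%H%M%S")
    tmpfile="$$_$CurrentTime"
    FilePath=$1
    NodeName=$2
    # Tag name without attributes, used to recognise the closing tag
    NodePre=$(awk -v Node="$NodeName" 'BEGIN {split(Node,NodeAdd," ");print NodeAdd[1]}')
    FLAG=0
    : > "$tmpfile"
    # The source lost its backticks and truncated the lines marked (*);
    # those parts are reconstructed from context and should be treated as assumptions.
    # Put every tag and text node on its own line, then copy everything
    # between <NodeName> and </NodePre> into the temporary file. (*)
    sed 's/>/>\n/g' "$FilePath" | sed 's/</\n</g' | sed '/^\s*$/d' | while read -r line
    do
        ISFIRST=$(echo "$line" | sed -n "/<$NodeName>/p")    # (*)
        if [ "x$ISFIRST" != "x" ]; then
            FLAG=1
        fi
        if [ ${FLAG} -eq 1 ]; then
            echo "$line" >> "$tmpfile"
        fi
        ISSEC=$(echo "$line" | sed -n "/<\/$NodePre>/p")     # (*)
        if [ "x$ISSEC" != "x" ]; then
            FLAG=0
        fi
    done
    # Join the lines, split records on the closing tag, and print the text after the last '>' (*)
    awk '{ORS=""}{print $0}' "$tmpfile" | awk -v End="</$NodePre>" 'BEGIN{FS=">";RS=End}{print $NF}' | sed '/^\s*$/d'
    rm -f "$tmpfile"
}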
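For comparison with the line-oriented sed/awk approach, which can be fragile when the XML contains comments, CDATA sections, or unusual whitespace, here is a minimal Python sketch of the same node-value lookup using a real XML parser. `get_node_values` is a hypothetical name chosen for this example, and unlike the shell function it takes a bare tag name without attributes.

```python
import xml.etree.ElementTree as ET


def get_node_values(xml_path, node_name):
    """Return the non-empty, stripped text of every element named node_name."""
    tree = ET.parse(xml_path)
    values = []
    for elem in tree.iter(node_name):
        text = (elem.text or "").strip()
        if text:
            values.append(text)
    return values
```

Calling `get_node_values("test.xml", "Hit_accession")` on a well-formed file would return the accession of each Hit in document order.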
#FUNCTION: GetNodeAttr
#DESC    : Get xml node attribute
#INPUT   : 1-XmlFilePath 2-AttrName
#OUTPUT  : node attribute
function GetNodeAttr
{
    if [ $# -ne 2 ]; then
        echo " error: expected exactly 2 arguments"
        echo " USAGE: $0 XmlFilePath AttrName"
        echo "   XmlFilePath  xmlfile path    type[${HOME}/config/DiamBaseConfig.xml]"
        echo "   AttrName     attribute name  type[PeerIp]"
        echo " e.g.: $0 ${HOME}/config/DiamBaseConfig.xml PeerIp"
        return 1
    fi
    FilePath=$1
    AttrName=" $2="
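    # NOTE: this line is cut off in the source after the start of the second sed.
    # Everything from 's/</\n</g' onward is an assumed completion: keep the tags
    # that carry the attribute and print its double-quoted value.
    sed 's/>/>\n/g' "$FilePath" | sed 's/</\n</g' | grep -- "$AttrName" | sed "s/.*$AttrName\"\([^\"]*\)\".*/\1/"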
}
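The attribute lookup can likewise be sketched in Python with the standard library parser, which handles either quote style and attributes in any order. `get_attr_values` is a hypothetical name for this example; it collects the value of the named attribute from every element that carries it.

```python
import xml.etree.ElementTree as ET


def get_attr_values(xml_path, attr_name):
    """Return the value of attr_name from every element that has it."""
    tree = ET.parse(xml_path)
    return [
        elem.attrib[attr_name]
        for elem in tree.iter()          # walk all elements in document order
        if attr_name in elem.attrib
    ]
```

For a configuration file in which some elements have a `PeerIp` attribute, `get_attr_values(path, "PeerIp")` would return those values in document order.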