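Create a CSV file in which each line contains two URLs that specify a link. That is, the first and second values on each line give the source and the destination of the link. For example: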
url1,url2
url1,url3
url2,url3
url4,url5
url2,url4
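Given such a CSV file, write a MapReduce program that finds all paths of length 2 in the corresponding URL link graph. That is, it finds all triples of URLs (u, v, w) such that there is a link from u to v and a link from v to w. For example, the sample CSV file above contains the following paths of length 2: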
url2, url4, url5
url1, url2, url3
url1, url2, url4
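For checking a MapReduce solution against small inputs, a plain-Python reference can help. The following is only a minimal sketch, not part of the original question: the function name `length2_paths` and the in-memory edge list are assumptions made for illustration, and it ignores the distributed setting entirely.

```python
from collections import defaultdict


def length2_paths(edges):
    """Return every triple (u, v, w) with a link u -> v and a link v -> w."""
    # Index the outgoing links of each URL: source -> list of destinations.
    successors = defaultdict(list)
    for src, dst in edges:
        successors[src].append(dst)

    paths = []
    for u, v in edges:
        # Extend the link u -> v by every link that starts at v.
        for w in successors.get(v, []):
            paths.append((u, v, w))
    return paths


# The sample links from the CSV file above.
edges = [
    ("url1", "url2"),
    ("url1", "url3"),
    ("url2", "url3"),
    ("url4", "url5"),
    ("url2", "url4"),
]

for path in sorted(length2_paths(edges)):
    print(", ".join(path))
```

On the sample links this prints the same three paths as listed above, in sorted order.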
My preliminary code so far is as follows:
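from mrjob.job import MRJob
from mrjob.step import MRStep

class part(MRJob):
    def steps(self):
        return [MRStep(mapper=self.mapper, reducer=self.reducer)]
        #return [MRStep(mapper=self.mapper)]
    def mapper(self, key, document):
        # Emit every URL on the line with a count of 1
        for word in document.split(','):
            yield word, 1
    def reducer(self, word, line):
        # `line` is a generator; turn it into a list so it can be serialized
        yield word, list(line)
if __name__ == '__main__':
    part.run()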
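from mrjob.job import MRJob

class part(MRJob):
    def mapper(self, _, line):
        # Each input line is "source,destination"
        words = line.split(',')
        # Send the link to both of its endpoints, so that each URL
        # receives its incoming and its outgoing links in one reducer call
        yield words[1], words
        yield words[0], words

    def reducer(self, key, values):
        # key is the candidate middle node of a length-2 path
        starts, ends = [], []
        for x in values:
            if x[0] == key:
                # Link key -> x[1]: x[1] can end a path through key
                ends.append(x[1])
            elif x[1] == key:
                # Link x[0] -> key: x[0] can start a path through key
                starts.append(x[0])
        # Pair every incoming link with every outgoing link
        for start in starts:
            for end in ends:
                yield key, ','.join([start,key,end])

if __name__ == '__main__':
    part.run()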
The output looks like this:
C:\Coding\Python\CSDN\ByName\godsavedme>python firstMrjob.py urls.csv
No configs found; falling back on auto-configuration
No configs specified for inline runner
Creating temp directory C:\Users\zhang\AppData\Local\Temp\firstMrjob.zhang.20210409.091651.884561
Running step 1 of 1...
job output is in C:\Users\zhang\AppData\Local\Temp\firstMrjob.zhang.20210409.091651.884561\output
Streaming final output from C:\Users\zhang\AppData\Local\Temp\firstMrjob.zhang.20210409.091651.884561\output...
"url2" "url1,url2,url3"
"url2" "url1,url2,url4"
"url4" "url2,url4,url5"
Removing temp directory C:\Users\zhang\AppData\Local\Temp\firstMrjob.zhang.20210409.091651.884561...
The first field of each output line is the middle node of the path.
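To see why the key ends up being the middle node, the map, shuffle, and reduce phases can be imitated in plain Python without mrjob. This is a rough sketch under stated assumptions: `map_link`, `reduce_node`, and `run_local` are hypothetical helper names, the shuffle is just a dictionary, and keys are processed in sorted order for readability (a real runner makes no such promise about how output is ordered across keys).

```python
from collections import defaultdict


def map_link(line):
    """Mapper logic: send the link to both of its endpoints."""
    src, dst = line.strip().split(",")
    yield dst, [src, dst]
    yield src, [src, dst]


def reduce_node(key, values):
    """Reducer logic: join incoming and outgoing links of `key`."""
    starts, ends = [], []
    for src, dst in values:
        if src == key:
            ends.append(dst)      # outgoing link key -> dst
        elif dst == key:
            starts.append(src)    # incoming link src -> key
    for start in starts:
        for end in ends:
            yield key, ",".join([start, key, end])


def run_local(lines):
    """Imitate the shuffle: group mapper output by key, then reduce."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_link(line):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reduce_node(key, groups[key]))
    return results


lines = ["url1,url2", "url1,url3", "url2,url3", "url4,url5", "url2,url4"]
for key, path in run_local(lines):
    print(key, path)
```

Because every link is grouped under both of its endpoints, the reducer for a given URL sees all links into and out of it, so only that URL can be the middle of the paths it emits. If only the path is wanted, one option is to yield `None` as the key instead.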
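It passes my tests here~
Remember to accept the answer~ If you run into problems, feel free to message me and we can keep discussing~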
See https://mrjob.readthedocs.io/en/latest/guides/writing-mrjobs.html#writing-basics; the first step is to read the file into the mapper.