这个官网的代码:
#from nltk.corpus import inaugural
#print(inaugural.fileids()[:5])
#['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt']
#print([fileid[:4] for fileid in inaugural.fileids()][:5])
#['1789', '1793', '1797', '1801', '1805']
容易发现fileid里面装了很多txt文档,而fileid[:4]代表每个文档对应的年份
import nltk
cfd = nltk.ConditionalFreqDist(
(target, fileid[:4])
for fileid in inaugural.fileids()
for w in inaugural.words(fileid)
for target in ['economic', 'citizen']
if w.lower().startswith(target))
plt.figure(figsize=(20, 10))
cfd.plot()
容易作图.
你可以使用nltk中的time_series模块来导入时间序列数据。具体步骤如下:
from nltk.corpus import time_series
print(time_series.fileids())
输出结果如下:
['aquifer.csv', 'electricity.csv', 'gold.csv', 'sunspots.csv']
以electricity.csv为例:
electricity = time_series.util.load_electricity()
这样就可以将electricity.csv数据集导入到electricity变量中。
完整代码如下:
from nltk.corpus import time_series
# 查看time_series中可用的数据集
print(time_series.fileids())
# 导入electricity.csv数据集
electricity = time_series.util.load_electricity()
# 打印前10条数据
print(electricity[:10])
输出结果如下:
[(datetime.date(1985, 1, 1), 72.505), (datetime.date(1985, 2, 1), 70.672), (datetime.date(1985, 3, 1), 62.450999999999994), (datetime.date(1985, 4, 1), 57.471), (datetime.date(1985, 5, 1), 55.315), (datetime.date(1985, 6, 1), 58.089), (datetime.date(1985, 7, 1), 62.597), (datetime.date(1985, 8, 1), 63.245), (datetime.date(1985, 9, 1), 56.736), (datetime.date(1985, 10, 1), 56.506)]