Study Notes (15): The Insurance-Industry Q&A Corpus insuranceqa_data

xuehuiping 2020-02-05 13:10

Data overview


'''
pool data are translated Chinese data with Google API from original English data
'''
POOL_TEST_DATA = os.path.join(curdir, 'pool', 'test.json.gz')
POOL_TRAIN_DATA = os.path.join(curdir, 'pool', 'train.json.gz')
POOL_VALID_DATA = os.path.join(curdir, 'pool', 'valid.json.gz')
POOL_ANS_DATA = os.path.join(curdir, 'pool', 'answers.json.gz')

'''
pair data are segmented and labeled after pool data
'''
PAIR_TEST_DATA = os.path.join(curdir, 'pairs','iqa.test.json.gz')
PAIR_VALID_DATA = os.path.join(curdir, 'pairs','iqa.valid.json.gz')
PAIR_TRAIN_DATA = os.path.join(curdir, 'pairs','iqa.train.json.gz')
PAIR_VOCAB_DATA = os.path.join(curdir, 'pairs', 'iqa.vocab.json.gz')
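
The constants above point to gzip-compressed JSON files. As a minimal sketch of reading one of them directly with the standard library, assuming the files contain UTF-8 encoded JSON (`load_json_gz` is a hypothetical helper name, not part of insuranceqa_data):

```python
import gzip
import json


def load_json_gz(path):
    """Read a gzip-compressed JSON file and return the parsed object."""
    # "rt" opens the gzip stream in text mode, so json.load receives str
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

With the package's constants in scope, a call such as `load_json_gz(POOL_TRAIN_DATA)` would then avoid decompressing the file by hand.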

Note: the logic in the author's download code is a little inconsistent. I copied the data files once here, so the data looks redundant.

Downloading the corpus

pip install insuranceqa_data

After installation, the package directory on my machine is: ~/anaconda3/lib/python3.7/site-packages/insuranceqa_data/
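
The install directory differs between machines. A small sketch for locating it programmatically, using only the standard library (`package_dir` is a hypothetical helper name introduced here for illustration):

```python
import importlib.util
import os


def package_dir(name):
    """Return the directory a top-level package is installed in, or None if it is not found."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    # For a regular package, spec.origin is the path to its __init__.py
    return os.path.dirname(spec.origin)


# Prints None if insuranceqa_data is not installed in this environment
print(package_dir("insuranceqa_data"))
```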

Loading the data

(1) You can load it in code

import insuranceqa_data as insuranceqa
train_data = insuranceqa.load_pairs_train()
test_data = insuranceqa.load_pairs_test()
valid_data = insuranceqa.load_pairs_valid()

(2) You can also inspect the files manually

Inspect the vocabulary file:
vocab_data = insuranceqa.load_pairs_vocab()
or

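import json
import os

# open() does not expand '~', so expand it explicitly
with open(os.path.expanduser('~/anaconda3/lib/python3.7/site-packages/insuranceqa_data/iqa.vocab.json'), encoding='utf-8') as f:
    data = json.load(f)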
print(data.keys())

# term frequencies
tf = data['tf']
print(tf)

id2word = data['id2word']
print(id2word)

word2id = data['word2id']
print(word2id)

# total number of words
total = data['total']
print(total)

# Out-of-vocabulary words are marked as UNKNOWN, and their id is 0.
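
To illustrate how the vocabulary might be used, here is a minimal sketch that maps tokens to ids and falls back to id 0 for out-of-vocabulary words, as noted above. The `encode` helper and the toy vocabulary are assumptions for illustration; the real mapping would come from `data['word2id']`.

```python
UNKNOWN_ID = 0  # per the note above: out-of-vocabulary words get id 0


def encode(tokens, word2id):
    """Map a list of tokens to ids, using UNKNOWN_ID for unseen words."""
    return [word2id.get(tok, UNKNOWN_ID) for tok in tokens]


# Toy vocabulary for illustration only, not taken from the real file
toy_word2id = {"UNKNOWN": 0, "汽车": 1, "保险": 2}
print(encode(["汽车", "保险", "预付"], toy_word2id))  # [1, 2, 0]
```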

Inspect the training data

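import json
import os

# open() does not expand '~', so expand it explicitly
with open(os.path.expanduser('~/anaconda3/lib/python3.7/site-packages/insuranceqa_data/pairs/train.json'), encoding='utf-8') as f:
    data = json.load(f)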
print(data.keys())
# dict_keys(['0', '1', '2', '3', '4', '5'...

ele = data['9']
print(ele)
#{'zh': '汽车保险是否预付?', 'en': 'Is  Car  Insurance  Prepaid?', 'domain': 'auto-insurance', 'answers': ['20900'], 'negatives': ['9205', '8237', '25854', '22830', '12148', '997', '501', '20044', '2314', '22527', '7128', '1601', '21267', '16601', '9571', '19628', '14469', '23956', '9427', '22387', '738', '1', '5190', '8195', '14318', '11879', '21030', '10957', '22231', '24492', '12153', '21880', '23859', '19981', '10646', '9140', '20189', '4191', '6647', '18815', '6274', '20874', '7107', '9746', '11822', '13733', '19645', '15981', '24842', '8913', '10691', '25538', '5279', '19014', '26418', '8214', '23728', '25211', '18892', '17753', '25460', '17614', '1667', '26374', '24488', '3627', '13523', '900', '13183', '17585', '18986', '22756', '4270', '11475', '26948', '13960', '18940', '6367', '7431', '14788', '18019', '21438', '22612', '5852', '24435', '14610', '27254', '2211', '3299', '3845', '4016', '4764', '5995', '6310', '9049', '12617', '13287', '14288', '14869', '20064', '25295', '26138', '4380', '21594', '26283', '208', '3789', '3934', '6125', '9520', '9766', '16968', '22882', '12698', '20543', '20391', '5974', '5475', '6077', '8949', '11547', '15002', '15071', '19286', '20301', '23292', '25685', '3176', '13885', '20913', '10883', '8649', '24349', '11324', '12507', '12514', '14284', '14410', '25670', '5260', '6264', '9125', '9596', '20590', '22729', '17815', '25618', '4318', '8153', '9967', '15544', '27256', '9088', '5614', '11911', '12307', '25467', '5119', '6399', '8606', '11722', '17244', '17664', '21659', '23644', '27354', '11302', '12141', '17939', '18431', '19187', '1982', '3810', '6486', '9294', '10393', '17006', '936', '3252', '5756', '12657', '13413', '18435', '21526', '25068', '2352', '2306', '3691', '4868', '4896', '5347', '6396', '7035', '7642', '8263', '8500', '8719', '8974', '9539', '11243']}


answers_id = ele['answers']
print(answers_id)
# ['20900']

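import os  # needed to expand '~' in the path below
with open(os.path.expanduser('~/anaconda3/lib/python3.7/site-packages/insuranceqa_data/pool/answers.json'), encoding='utf-8') as f:
    answers = json.load(f)
print(answers.keys())
print(answers[answers_id[0]])  # each question has one correct answer and many wrong answers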
#{'zh': '\xa0是的,汽车保险通常是提前支付的。一般不少于三十天。每个承运人对新覆盖的初始支付金额设定自己的要求。大多数运营商允许客户每月,每季度,半年或每年支付一次。如果您全额支付半年或每年的保险费,您还可能会收到您的房价的折扣(这仅由承运人自行决定)。', 'en': ' Yes, automobile insurance is typically paid in advance. Normally no less than thirty days at a time. Each carrier sets their own requirements as to the initial payment amount for new coverage. Most carriers allow clients to pay monthly, quarterly, semi-annually, or annually. If you pay your premium in full for semi-annual or annual you may also receive a discount on your rate ( this is solely at the discretion of the carrier ).'}

print(answers['9205'])
print(answers['8237'])
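
The lookups above can be combined into (question, correct answer, wrong answer) text triples. This is a rough sketch that assumes entries have the fields shown in the printed example ('zh', 'en', 'answers', 'negatives'); `build_triples` and the toy data are hypothetical and only illustrate the idea.

```python
def build_triples(entry, answers, lang="zh"):
    """Yield (question, positive_answer, negative_answer) text triples for one entry."""
    question = entry[lang]
    for pos_id in entry["answers"]:
        for neg_id in entry["negatives"]:
            # Skip ids that are missing from the answer pool
            if pos_id in answers and neg_id in answers:
                yield question, answers[pos_id][lang], answers[neg_id][lang]


# Toy data shaped like the entries printed above, for illustration only
toy_entry = {"zh": "问题", "en": "question", "answers": ["1"], "negatives": ["2", "3", "9"]}
toy_answers = {
    "1": {"zh": "正确", "en": "right"},
    "2": {"zh": "错误一", "en": "wrong one"},
    "3": {"zh": "错误二", "en": "wrong two"},
}
for triple in build_triples(toy_entry, toy_answers, lang="en"):
    print(triple)
```

Negative id "9" is absent from the toy answer pool, so the loop silently skips it; whether to skip or raise in that case is a design choice.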
