TensorFlow experimental dataset: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 30: invalid continuation byte

Problem description

My dataset has two columns containing Spanish and English sentences. I created the training dataset with the Dataset API using the following code:

import tensorflow as tf
import tensorflow_datasets as tfds

train_examples = tf.data.experimental.CsvDataset("./Data/train.csv", [tf.string, tf.string])
val_examples = tf.data.experimental.CsvDataset("./Data/validation.csv", [tf.string, tf.string])

## Create a custom subword tokenizer from the training dataset.

tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)

tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

I get the following error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 30: invalid continuation byte

Traceback:

   ---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-27-c90f5c60daf2> in <module>
      1 tokenizer_en = tfds.features.text.SubwordTextEncoder.build_from_corpus(
----> 2     (en.numpy() for pt, en in train_examples), target_vocab_size=2**13)
      3 
      4 tokenizer_pt = tfds.features.text.SubwordTextEncoder.build_from_corpus(
      5     (pt.numpy() for pt, en in train_examples), target_vocab_size=2**13)

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_datasets/core/features/text/subword_text_encoder.py in build_from_corpus(cls, corpus_generator, target_vocab_size, max_subword_length, max_corpus_chars, reserved_tokens)
    291         generator=corpus_generator,
    292         max_chars=max_corpus_chars,
--> 293         reserved_tokens=reserved_tokens)
    294 
    295     # Binary search on the minimum token count to build a vocabulary with

/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/site-packages/tensorflow_datasets/core/features/text/subword_text_encoder.py in _token_counts_from_generator(generator, max_chars, reserved_tokens)
    394   token_counts = collections.defaultdict(int)
    395   for s in generator:
--> 396     s = tf.compat.as_text(s)
    397     if max_chars and (num_chars + len(s)) >= max_chars:
    398       s = s[:(max_chars - num_chars)]

~/venv/lib/python3.7/site-packages/tensorflow/python/util/compat.py in as_text(bytes_or_text, encoding)
     85     return bytes_or_text
     86   elif isinstance(bytes_or_text, bytes):
---> 87     return bytes_or_text.decode(encoding)
     88   else:
     89     raise TypeError('Expected binary or unicode string, got %r' % bytes_or_text)

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd5 in position 30: invalid continuation byte
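
The last frame of the traceback is an ordinary bytes.decode call, so the same failure can be reproduced without TensorFlow. The following is a minimal sketch, assuming a made-up byte string (not taken from the actual dataset) in which byte 0xd5 is followed by a plain ASCII letter, as a legacy single-byte encoding might write it:

```python
# Hypothetical sample bytes: 0xd5 followed by an ASCII letter. In UTF-8,
# 0xd5 starts a two-byte sequence, so the next byte must be a continuation
# byte (0x80-0xbf); an ASCII letter is not one.
raw = b"don\xd5t"

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # Keep a reference, because `exc` is unbound once the except block ends.
    error = exc

print(error)
```

The exception's start attribute gives the byte offset of the offending byte, which can help locate the affected row in the CSV file.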

Tags: python-3.x, tensorflow, tensorflow2.0, machine-translation

Solution


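My mistake. Saving the CSV in the "CSV UTF-8 (Comma delimited)" format solved the problem. The original file had been saved in a non-UTF-8 encoding, while the traceback shows that build_from_corpus decodes every string as UTF-8 through tf.compat.as_text.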
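
If re-exporting from the spreadsheet application is inconvenient, the same conversion can be sketched in Python. This is only an illustration under assumptions: the helper name reencode_to_utf8 and the file names are hypothetical, and cp1252 is a guess at the source encoding, so substitute whichever encoding the file was really saved in.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding):
    """Rewrite the text file `src` as UTF-8 at `dst`.

    `source_encoding` must match how the file was actually saved;
    a wrong guess silently maps characters to the wrong ones.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))


# Self-contained demo: write a small legacy-encoded CSV, convert it,
# then confirm the result decodes cleanly as UTF-8.
with tempfile.TemporaryDirectory() as tmp:
    legacy = Path(tmp) / "train_legacy.csv"   # hypothetical file name
    fixed = Path(tmp) / "train_utf8.csv"      # hypothetical file name
    legacy.write_bytes("Ol\u00e1 mundo,Hello world\n".encode("cp1252"))
    reencode_to_utf8(legacy, fixed, source_encoding="cp1252")
    converted = fixed.read_bytes().decode("utf-8")

print(converted)
```

Pointing tf.data.experimental.CsvDataset at the converted file should then let the tokenizer decode every row, provided the source encoding was guessed correctly.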

