python - TensorFlow preprocessing: split string into chars
Problem description
I want to use the TextVectorization preprocessing layer, but have it split strings into characters.
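import tensorflow as tf
# Assumed import: this experimental path is from older TF 2.x releases;
# newer ones expose the layer as tf.keras.layers.TextVectorization
from tensorflow.keras.layers.experimental import preprocessing

data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
# Instantiate TextVectorization with "int" output_mode
text_vectorizer = preprocessing.TextVectorization(output_mode="int")
# Index the vocabulary via `adapt()`
text_vectorizer.adapt(data)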
The TextVectorization class has a split parameter, which can be a function.
In pure Python I would write something like this:
text_vectorizer = preprocessing.TextVectorization(output_mode="int", split=lambda x: list(x))
But how should I write it in the TensorFlow world?
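Solution
Try using tf.strings.regex_replace first to remove the spaces and collapse each sequence into a single string, then apply tf.strings.regex_replace again to separate that string into characters. Next, use tf.strings.strip to remove the leading and trailing whitespace from each string. Finally, split the strings and return them: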
import tensorflow as tf

def split_chars(input_data):
    # Remove spaces so each sequence becomes one unbroken string
    s = tf.strings.regex_replace(input_data, ' ', '')
    tf.print('Single string --> ', s)
    # The empty pattern matches at every position, so this puts a space around each character
    s = tf.strings.regex_replace(s, '', ' ')
    tf.print('Characters --> ', s)
    # Drop the leading and trailing spaces added by the previous step
    s = tf.strings.strip(s)
    tf.print('Stripped --> ', s)
    # Split on the remaining spaces to get one token per character
    s = tf.strings.split(s, sep = ' ')
    tf.print('Split --> ', s)
    return s

data = tf.constant(
    [
        "The Brain is wider than the Sky",
        "For put them side by side",
        "The one the other will contain",
        "With ease and You beside",
    ]
)
input_text_processor = tf.keras.layers.TextVectorization(split = split_chars)
input_text_processor.adapt(data)
tf.print(f"Vocabulary --> {input_text_processor.get_vocabulary()}")
Output:
Single string --> ["thebrainiswiderthanthesky" "forputthemsidebyside" "theonetheotherwillcontain" "witheaseandyoubeside"]
Characters --> [" t h e b r a i n i s w i d e r t h a n t h e s k y " " f o r p u t t h e m s i d e b y s i d e " " t h e o n e t h e o t h e r w i l l c o n t a i n " " w i t h e a s e a n d y o u b e s i d e "]
Stripped --> ["t h e b r a i n i s w i d e r t h a n t h e s k y" "f o r p u t t h e m s i d e b y s i d e" "t h e o n e t h e o t h e r w i l l c o n t a i n" "w i t h e a s e a n d y o u b e s i d e"]
Split --> [['t', 'h', 'e', ..., 's', 'k', 'y'], ['f', 'o', 'r', ..., 'i', 'd', 'e'], ['t', 'h', 'e', ..., 'a', 'i', 'n'], ['w', 'i', 't', ..., 'i', 'd', 'e']]
Vocabulary --> ['', '[UNK]', 'e', 't', 'i', 'h', 's', 'n', 'o', 'd', 'a', 'r', 'y', 'w', 'b', 'u', 'l', 'p', 'm', 'k', 'f', 'c']
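To see why replacing the empty pattern spreads a string out into characters, the same four steps can be mirrored in plain Python. This is only an illustrative sketch: split_chars_py is a hypothetical helper, and Python's re.sub stands in for tf.strings.regex_replace here, so it cannot be passed to TextVectorization as the split function.

```python
import re

def split_chars_py(text):
    # Step 1: remove spaces so the text becomes one unbroken string
    s = text.replace(" ", "")
    # Step 2: the empty pattern matches at every position,
    # so a space is inserted before, between, and after all characters
    s = re.sub("", " ", s)
    # Step 3: drop the leading and trailing spaces added in step 2
    s = s.strip()
    # Step 4: split on the remaining single spaces, one token per character
    return s.split(" ")

chars = split_chars_py("for put them")
print(chars)
```

The lowercase letters in the TensorFlow output above do not come from the split function. TextVectorization applies its default standardization (lowercasing and stripping punctuation) before calling split, which is why "The" arrives as "the".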