首页 > 解决方案 > 如何在(普通)SQL 中生成真实的、任意长度的文本

问题描述

我有一个测试要求,要在 SQL 数据库中填充任意长度(例如 100 到 32k 个字符之间)的文本字段到任意数量的行(最多几百万行)。文本应该是模糊的现实(英文),所以一个字符重复N次是不好的。下面的示例是为 IBM Db2 编写的,但想法是它应该在任何主要的 RDBMS 中工作,而无需使用特定于单个数据库的功能(如果您的答案仅适用于 Sql Server 或 Postgres,那很好,但更多通用的更好)。到目前为止,我提出了以下方法:

  1. 创建一个带有 id 和文本字段的表:
CREATE TABLE STAGING.DATA_TEST (
    id      BIGINT          NOT NULL    GENERATED BY DEFAULT AS IDENTITY,
    text    VARCHAR(32000)  NULL
);
  1. 插入一些任意长度的预制文本(例如来自流行书籍的引用):
INSERT INTO STAGING.DATA_TEST (text) 
VALUES 
    ('It was all very well to say “Drink me,” but the wise little Alice was not going to do that in a hurry. “No, I’ll look first,” she said, “and see whether it’s marked ‘poison’ or not."'),
    ('“Curiouser and curiouser!” cried Alice (she was so much surprised, that for the moment she quite forgot how to speak good English).'),
    ('“I wish I hadn’t cried so much!” said Alice, as she swam about, trying to find her way out. “I shall be punished for it now, I suppose, by being drowned in my own tears! That will be a queer thing, to be sure! However, everything is queer to-day.”'),
    ('“Well! I’ve often seen a cat without a grin,” thought Alice; “but a grin without a cat! It’s the most curious thing I ever saw in all my life!” '),
    ('“Who are you?” said the Caterpillar. This was not an encouraging opening for a conversation. Alice replied, rather shyly, “I—I hardly know, Sir, just at present—at least I know who I was when I got up this morning, but I think I must have been changed several times since then.”')
;
  1. 将上表连接到自身(伪随机顺序)并连接文本字段。将结果插入同一个表中:
INSERT INTO STAGING.DATA_TEST (text)
    SELECT CONCAT(t1.text, t2.text)
        FROM STAGING.DATA_TEST as t1
        JOIN (SELECT MAX(id) AS MAXID, MIN(id) AS MINID FROM STAGING.DATA_TEST) AS C ON 1 = 1
        JOIN STAGING.DATA_TEST AS t2 ON t1.id = (MOD(t2.id * 7, MAXID - MINID) + MINID)
;

然后我重复第 3 步,这使我将行加倍,并且每次将这些行中的文本长度加倍。在某些时候,我的列宽用完了,此时我不能再重复第 3 步,如果我仍然需要更多行,我可以根据需要一次又一次地复制整个表格。我用这种方法遇到的问题是:

  1. 我对生成的文本长度的分布没有细粒度的控制。
  2. 我没有对生成的行数进行细粒度控制(在上面的随机连接方法用完列宽之后)。
  3. 我不知道我的伪随机性的天真来源(参见上面的连接语句)是否在我生成的数据中给了我很好的熵。

要开始回答 3,我用起始值 [A,B,C,D,E] 测试了上面的内容,它会产生如下字符串:

这对我来说似乎很随机,但没有统计学背景,我不确定它是否有任何好处。非常感谢您对上述内容的反馈和任何建议的改进。

标签: sqldatabaserandomtextdb2

解决方案


推荐阅读