Apache ORC: skipping stripes when reading data

Problem description

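I am trying to read an ORC file that resides in S3 storage. However, I do not want to scan/read the whole file, only a specific part of it. For example, if an ORC file is 1 GB in size, I want to process only the scan range 400 to 400000.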

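To do this, I read the footer of the ORC file and obtain the offsets of all stripes.

So I know which stripes are needed. For example, I only want to access stripes 4 and 5. Is there a way to skip some of the stripes of an ORC file?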

My code is as follows:


    // Needs <orc/OrcFile.hh>, <cstdint>, <iostream>, <memory>, and <string>.
    // read_cols is defined elsewhere (the columns to include).
    std::string file_path = "./tt.orc";

    orc::RowReaderOptions row_reader_opts;
    row_reader_opts.include(read_cols);

    orc::ReaderOptions reader_opts;
    reader_opts.getSerializedFileTail();  // no effect: the getter's result is discarded
    std::unique_ptr<orc::Reader> reader = orc::createReader(orc::readFile(file_path), reader_opts);
    std::cout << reader->getFileFooterLength() << std::endl;

    std::unique_ptr<orc::RowReader> row_reader = reader->createRowReader(row_reader_opts);

    // Batch capacity of 4 rows per call to next()
    std::unique_ptr<orc::ColumnVectorBatch> batch = row_reader->createRowBatch(4);

    // The top-level batch is a struct whose fields are the selected columns
    auto *fields = dynamic_cast<orc::StructVectorBatch *>(batch.get());

    std::cout << fields->fields.size() << std::endl;
    std::cout << fields->fields[0]->toString() << std::endl;
    std::cout << fields->fields[1]->toString() << std::endl;

    // double field
    auto *col0 = dynamic_cast<orc::DoubleVectorBatch *>(fields->fields[0]);
    const double *buffer1 = col0->data.data();

    // string field: one start pointer and one length per row
    auto *col4 = dynamic_cast<orc::StringVectorBatch *>(fields->fields[4]);
    char *const *buffer2 = col4->data.data();
    const int64_t *lengths = col4->length.data();  // int64_t is not long long on every platform

    // Read batch by batch until the reader is exhausted
    while (row_reader->next(*batch)) {
        for (uint64_t r = 0; r < batch->numElements; ++r) {
            std::cout << "line " << buffer1[r] << "," << std::string(buffer2[r], lengths[r]) << "\n";
        }
        //std::cout << "this batch nums" << " " << batch->numElements << " " << "lines\n";
    }

Tags: c++, parquet, orc

Solution
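
One approach that may fit, sketched under assumptions rather than confirmed against this exact setup: as far as I know, the ORC C++ library offers `orc::RowReaderOptions::range(offset, length)`, and a row reader created with it reads only the stripes whose starting byte offset falls inside that range. Stripe offsets and lengths should be available through `reader->getStripe(i)` (`getOffset()`, `getLength()`). The block below is plain C++ with no ORC dependency; `StripeSpan`, `stripesStartingIn`, and `coveringRange` are hypothetical helper names made up for illustration. It shows how "stripes 4 and 5" could be turned into a byte range, and how to check which stripes a given byte range would select.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical plain record. With the ORC C++ library the two values would
// come from orc::StripeInformation::getOffset() and getLength().
struct StripeSpan {
    uint64_t offset;  // byte offset of the stripe within the file
    uint64_t length;  // total byte length of the stripe
};

// Indices of the stripes whose starting offset lies in [start, start + length).
// Assumption: this mirrors how a byte range selects stripes in the reader.
std::vector<std::size_t> stripesStartingIn(const std::vector<StripeSpan>& stripes,
                                           uint64_t start, uint64_t length) {
    std::vector<std::size_t> selected;
    for (std::size_t i = 0; i < stripes.size(); ++i) {
        if (stripes[i].offset >= start && stripes[i].offset < start + length) {
            selected.push_back(i);
        }
    }
    return selected;
}

// Byte range (offset, length) that covers stripes first..last, inclusive.
std::pair<uint64_t, uint64_t> coveringRange(const std::vector<StripeSpan>& stripes,
                                            std::size_t first, std::size_t last) {
    if (first > last || last >= stripes.size()) {
        throw std::out_of_range("invalid stripe indices");
    }
    const uint64_t begin = stripes[first].offset;
    const uint64_t end = stripes[last].offset + stripes[last].length;
    return {begin, end - begin};
}
```

Assuming zero-based stripe indices, the pair returned by `coveringRange(stripes, 4, 5)` would be passed as `row_reader_opts.range(offset, length)` before calling `createRowReader`, so the loop over `row_reader->next(*batch)` should then return rows from those two stripes only. `stripesStartingIn` answers the opposite question: for an arbitrary scan range such as 400 to 400000, it lists the stripes that start inside it.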

