multipart/form-data loses bytes on large files

Question

I am writing a multipart/form-data parser in C++, because the available options seem to be very scarce.

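My initial approach was to use istream::getline to buffer one line (or part of a line) at a time in order to detect boundaries. This works for smaller files, but not for larger ones. For large (>50MB) files, cin sometimes has its badbit set, and after clearing the istream I notice that I have lost bytes. I don't know why, and that is the purpose of this question.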

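However, if I increase the buffer size to 4MB and use istream::read to dump the entire multipart/form-data request to a file, I don't lose any bytes and cin never has its badbit set. I can then reopen the dump file with an ifstream instead of using cin, and my original small-buffer getline approach works just fine.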

Any insight into what is going on here? Could it be some side effect of FastCGI or Lighttpd?

Edit:

Here are the relevant code snippets:

#include <fcgio.h>
//...

int main()
{
    //...
    FCGX_Request request;

    FCGX_Init();
    FCGX_InitRequest(&request, 0, 0);

    const size_t LEN = 1024;
    vector<char> v(LEN); // Workaround for getting duplicates of every byte?
    while (FCGX_Accept_r(&request) == 0) {
        fcgi_streambuf cin_fcgi_streambuf(request.in, &v[0], v.size());
        //... (eventually calls _parseMultipartFormFieldFile)
    }

    //...
}

/*
    Extract a file from a multipart form section

    istream should already have boundary and headers removed up through the final "\r\n"

    Note that there are a lot of potential off-by-one errors here. Need to pay special attention
    to gcount() and what is present in the buffer in each given scenario. Hence why you see:

    gcount
    gcount-1
    gcount-2

    These offsets are due to null terminator sometimes being appended, sometimes not, and/or '\r' being present or not.

    It is possible for a few rare things to happen that will break this function:

    1. Malicious content length

    Client could lie about content length and send much more than we have room for. Should count bytes eventually, but easy enough to configure webserver to protect us.
*/
bool _parseMultipartFormFieldFile(
    Request & req,
    istream & input,
    const string & name,
    const string & upload_dir,
    const string & boundary,
    const string & end_boundary
)
{
    static unsigned int file_id = 0; //used to generate unique file names

    //Need fixed buffer size to prevent running out of RAM (malicious or not)
    char buf[4096];

    string file_name = upload_dir + ECPP_TMP_FILE + to_string(file_id++);

    ofstream f(file_name, std::ofstream::out | std::ofstream::binary);
    if (!f.is_open())
        return false;

    bool eof = false;
    while (!eof) {
        //Out of space in flash?
        if (!f.good())
            return false;

        f.flush();

        input.getline(buf, sizeof(buf));
        unsigned int gcount = input.gcount();

        if (input.bad()) {
            //Crap! If we're here, we have most likely lost a few bytes...
            input.clear();
            continue;
        }
        else if (input.eof()) {
            //If we are here, the multipart/form-data request was malformed
            f.close();
            remove(file_name.c_str()); //Delete malformed file
            return false;
        }
        else if (input.fail()) {
            //If we are in this condition, it means we encountered a line longer than our buffer
            //There is no null terminator in this case, so write out what we have
            f.write(buf, gcount);
            input.clear(); //clear fail flag
            continue;
        }

        if (gcount >= 2 && buf[gcount-2] == '\r') {
            string peek = peekLine(input); //uses putback - modifies gcount()
            if (peek == boundary || peek == end_boundary) {
                //If we are in here, it means we encountered the last line in the section
                //That means there is a trailing '\r' which we need to remove in addition to the null terminator
                f.write(buf, gcount-2); // Remove null terminator and \r before writing
                req.file[name] = file_name;
                eof = true;
                continue;
            }
        }

        //If we are here it means we read in the entire line.
        //Write out everything (minus the null terminator), and also add in the newline that was stripped by getline()
        f.write(buf, gcount-1);
        f.write("\n", 1);
    }

    return true;
}

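So, in short, the problem is that if I pass cin_fcgi_streambuf to _parseMultipartFormFieldFile, I lose bytes (badbit is triggered), but if I indiscriminately dump cin_fcgi_streambuf to a file using char buf[4000000] and input.read(), and then pass an ifstream of that file to _parseMultipartFormFieldFile, it works fine.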

Tags: c++, multipartform-data, fastcgi, lighttpd

Answer


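No. input.getline treats the input as lines terminated by CRLF, so what happens if you post a binary file? Besides, your example source code cannot handle a request that posts multiple files, because you open only a single file stream. That is why you need to change the pattern of your source code.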

You can upload data or files of unlimited size. Try this solution:

const char* ctype = "multipart/form-data; boundary=----WebKitFormBoundaryfm9qwXVLSbFKKR88";
size_t content_length = 1459606;
http_payload* hp = new http_payload(ctype, content_length);
if (hp->is_multipart()) {
    int ret = hp->read_all("C:\\temp\\");
    if (ret < 0) {
        std::cout << hp->get_last_error() << std::endl;
        hp->clear();
    }
    else {
        std::string dir_str("C:\\upload_dir\\");
        ret = hp->read_files([&dir_str](http_posted_file* file) {
            std::string path(dir_str.c_str());
            path.append(file->get_file_name());
            file->save_as(path.c_str());
            file->clear(); path.clear();
            std::string().swap(path);
        });
        hp->clear();
        std::cout << "Total file uploaded :" << ret << std::endl;
    }
}
else {
    int ret = hp->read_all();
    if (ret < 0) {
        std::cout << hp->get_last_error() << std::endl;
        hp->clear();
    }
    else {
        std::cout << "Posted data :" << hp->get_body() << std::endl;
        hp->clear();

    }
}

https://github.com/safeonlineworld/web_jsx/blob/0d08773c95f4ae8a9799dbd29e0a4cd84413d108/src/web_jsx/core/http_payload.cpp#L402

