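xgboost - H2O XGBoost crashes with the local server dying or hanging (?)
Problem description
I still need to do some work to build a smaller test case, and I have to get permission to publish the data (after I anonymize it), but H2O crashes consistently with this data and these parameters. (It usually succeeds with other combinations of input features and parameters, but it seems to always fail with the features and parameters below.)
The data has 12,847,393 rows (could that be the problem?).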
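For a rough sense of scale, the sketch below estimates the size of a dense numeric matrix with this row count. It is a back-of-envelope calculation only: the 14 columns are the input column count given later in this post, the 4- and 8-byte value widths are assumptions about how values might be stored, and `dense_matrix_bytes` is a hypothetical helper, not part of H2O. It also ignores any expansion of the 5 categorical columns (for example through one-hot encoding), which could make the real footprint considerably larger.

```python
def dense_matrix_bytes(rows: int, cols: int, bytes_per_value: int) -> int:
    """Bytes needed to hold rows x cols fixed-width values, with no overhead."""
    return rows * cols * bytes_per_value


ROWS = 12_847_393  # row count from this post
COLS = 14          # input columns from this post (each categorical counted once)

for width in (4, 8):  # assumed widths: 32-bit floats or 64-bit doubles
    gib = dense_matrix_bytes(ROWS, COLS, width) / 2**30
    print(f"{width}-byte values: {gib:.2f} GiB")
```

Comparing such an estimate with the free memory reported at startup is one way to judge whether memory pressure is plausible here; it does not prove that memory is the cause.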
Here is the ugly stack trace I get. (It seems to be reproducible.)
Filed as a bug: https://0xdata.atlassian.net/projects/PUBDEV/issues/PUBDEV-7321
---------------------------------------------------------------------------
ConnectionResetError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
599 body=body, headers=headers,
--> 600 chunked=chunked)
601
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
383 # otherwise it looks like a programming error was the cause.
--> 384 six.raise_from(e, None)
385 except (SocketTimeout, BaseSSLError, SocketError) as e:
/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
379 try:
--> 380 httplib_response = conn.getresponse()
381 except Exception as e:
/usr/lib/python3.6/http/client.py in getresponse(self)
1345 try:
-> 1346 response.begin()
1347 except ConnectionError:
/usr/lib/python3.6/http/client.py in begin(self)
306 while True:
--> 307 version, status, reason = self._read_status()
308 if status != CONTINUE:
/usr/lib/python3.6/http/client.py in _read_status(self)
267 def _read_status(self):
--> 268 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
269 if len(line) > _MAXLINE:
/usr/lib/python3.6/socket.py in readinto(self, b)
585 try:
--> 586 return self._sock.recv_into(b)
587 except timeout:
ConnectionResetError: [Errno 104] Connection reset by peer
During handling of the above exception, another exception occurred:
ProtocolError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
448 retries=self.max_retries,
--> 449 timeout=timeout
450 )
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
637 retries = retries.increment(method, url, error=e, _pool=self,
--> 638 _stacktrace=sys.exc_info()[2])
639 retries.sleep()
/usr/local/lib/python3.6/dist-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
366 if read is False or not self._is_method_retryable(method):
--> 367 raise six.reraise(type(error), error, _stacktrace)
368 elif read is not None:
/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in reraise(tp, value, tb)
684 if value.__traceback__ is not tb:
--> 685 raise value.with_traceback(tb)
686 raise value
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
599 body=body, headers=headers,
--> 600 chunked=chunked)
601
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
383 # otherwise it looks like a programming error was the cause.
--> 384 six.raise_from(e, None)
385 except (SocketTimeout, BaseSSLError, SocketError) as e:
/usr/local/lib/python3.6/dist-packages/urllib3/packages/six.py in raise_from(value, from_value)
/usr/local/lib/python3.6/dist-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
379 try:
--> 380 httplib_response = conn.getresponse()
381 except Exception as e:
/usr/lib/python3.6/http/client.py in getresponse(self)
1345 try:
-> 1346 response.begin()
1347 except ConnectionError:
/usr/lib/python3.6/http/client.py in begin(self)
306 while True:
--> 307 version, status, reason = self._read_status()
308 if status != CONTINUE:
/usr/lib/python3.6/http/client.py in _read_status(self)
267 def _read_status(self):
--> 268 line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
269 if len(line) > _MAXLINE:
/usr/lib/python3.6/socket.py in readinto(self, b)
585 try:
--> 586 return self._sock.recv_into(b)
587 except timeout:
ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
ConnectionError Traceback (most recent call last)
/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
473 headers=headers, timeout=self._timeout, stream=stream,
--> 474 auth=self._auth, verify=verify, proxies=self._proxies)
475 if isinstance(save_to, types.FunctionType):
/usr/local/lib/python3.6/dist-packages/requests/api.py in request(method, url, **kwargs)
59 with sessions.Session() as session:
---> 60 return session.request(method=method, url=url, **kwargs)
61
/usr/local/lib/python3.6/dist-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
532 send_kwargs.update(settings)
--> 533 resp = self.send(prep, **send_kwargs)
534
/usr/local/lib/python3.6/dist-packages/requests/sessions.py in send(self, request, **kwargs)
645 # Send the request
--> 646 r = adapter.send(request, **kwargs)
647
/usr/local/lib/python3.6/dist-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
497 except (ProtocolError, socket.error) as err:
--> 498 raise ConnectionError(err, request=request)
499
ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))
During handling of the above exception, another exception occurred:
H2OConnectionError Traceback (most recent call last)
<ipython-input-1-56b68eefa416> in <module>
87 start_time = time.time()
88 model = H2OXGBoostEstimator(**param)
---> 89 model.train(x=x, y="y", training_frame=hdf6)
90 elapsed_time = time.time() - start_time
/usr/local/lib/python3.6/dist-packages/h2o/estimators/estimator_base.py in train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose)
110 self._train(x=x, y=y, training_frame=training_frame, offset_column=offset_column, fold_column=fold_column,
111 weights_column=weights_column, validation_frame=validation_frame, max_runtime_secs=max_runtime_secs,
--> 112 ignored_columns=ignored_columns, model_id=model_id, verbose=verbose)
113
114
/usr/local/lib/python3.6/dist-packages/h2o/estimators/estimator_base.py in _train(self, x, y, training_frame, offset_column, fold_column, weights_column, validation_frame, max_runtime_secs, ignored_columns, model_id, verbose, extend_parms_fn)
263 return
264
--> 265 model.poll(poll_updates=self._print_model_scoring_history if verbose else None)
266 model_json = h2o.api("GET /%d/Models/%s" % (rest_ver, model.dest_key))["models"][0]
267 self._resolve_model(model.dest_key, model_json)
/usr/local/lib/python3.6/dist-packages/h2o/job.py in poll(self, poll_updates)
58 pb.execute(self._refresh_job_status, print_verbose_info=ft.partial(poll_updates, self))
59 else:
---> 60 pb.execute(self._refresh_job_status)
61 except StopIteration as e:
62 if str(e) == "cancelled":
/usr/local/lib/python3.6/dist-packages/h2o/utils/progressbar.py in execute(self, progress_fn, print_verbose_info)
169 # Query the progress level, but only if it's time already
170 if self._next_poll_time <= now:
--> 171 res = progress_fn() # may raise StopIteration
172 assert_is_type(res, (numeric, numeric), numeric)
173 if not isinstance(res, tuple):
/usr/local/lib/python3.6/dist-packages/h2o/job.py in _refresh_job_status(self)
96 def _refresh_job_status(self):
97 if self._poll_count <= 0: raise StopIteration("")
---> 98 jobs = h2o.api("GET /3/Jobs/%s" % self.job_key)
99 self.job = jobs["jobs"][0] if "jobs" in jobs else jobs["job"][0]
100 self.status = self.job["status"]
/usr/local/lib/python3.6/dist-packages/h2o/h2o.py in api(endpoint, data, json, filename, save_to)
121 # type checks are performed in H2OConnection class
122 _check_connection()
--> 123 return h2oconn.request(endpoint, data=data, json=json, filename=filename, save_to=save_to)
124
125
/usr/local/lib/python3.6/dist-packages/h2o/backend/connection.py in request(self, endpoint, data, json, filename, save_to)
481 if self._local_server and not self._local_server.is_running():
482 self._log_end_exception("Local server has died.")
--> 483 raise H2OConnectionError("Local server has died unexpectedly. RIP.")
484 else:
485 self._log_end_exception(e)
H2OConnectionError: Local server has died unexpectedly. RIP.
Parameters passed:
param = {
    "ntrees": 15,
    "min_rows": 5,
    "max_depth": 5,
    "learn_rate": 0.02,
    "sample_rate": 0.7,
    "col_sample_rate_per_tree": 0.9,
    "seed": 42,
    "score_tree_interval": 100,
}
There are 14 input columns, 5 of which are categorical features.
I initialize H2O like this:
import h2o

h2o.init(
    strict_version_check=False,
    # nthreads=1,  # Crashes with either 1 or 4 threads.
    log_dir="/tmp/clem-h2o/",
    log_level="TRACE",
)
Checking whether there is an H2O instance running at http://localhost:54321 ..... not found.
Attempting to start a local H2O server...
Java Version: openjdk version "1.8.0_242"; OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08); OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)
Starting server from /usr/local/lib/python3.6/dist-packages/h2o/backend/bin/h2o.jar
Ice root: /tmp/tmpkeo9aau1
JVM stdout: /tmp/tmpkeo9aau1/h2o_unknownUser_started_from_python.out
JVM stderr: /tmp/tmpkeo9aau1/h2o_unknownUser_started_from_python.err
Server is running at http://127.0.0.1:54321
Connecting to H2O server at http://127.0.0.1:54321 ... successful.
H2O cluster uptime: 01 secs
H2O cluster timezone: Etc/UTC
H2O data parsing timezone: UTC
H2O cluster version: 3.28.0.3
H2O cluster version age: 14 days, 3 hours and 57 minutes
H2O cluster name: H2O_from_python_unknownUser_xuimzh
H2O cluster total nodes: 1
H2O cluster free memory: 4.445 Gb
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster status: accepting new members, healthy
H2O connection url: http://127.0.0.1:54321
H2O connection proxy: {'http': None, 'https': None}
H2O internal security: False
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
Python version: 3.6.9 final
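Unfortunately, the WARN, ERROR, and FATAL log files are empty. I can't post the other log files until I have anonymized the feature names...
I don't know whether there are other debug switches that could help diagnose the problem.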
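Since the WARN, ERROR, and FATAL logs are empty, one place that may still hold a clue is the pair of JVM stdout/stderr files whose paths `h2o.init()` prints at startup. Below is a minimal sketch for dumping their last lines after a crash. `tail` is a hypothetical helper written for this example, and the paths are the ones from the startup output in this post, so they would need to be replaced with those of the failing run. Whether these files contain anything useful in this case is not known.

```python
from collections import deque
from pathlib import Path


def tail(path, n=40):
    """Return the last n lines of a text file, or [] if the file is missing."""
    p = Path(path)
    if not p.is_file():
        return []
    with p.open("r", errors="replace") as fh:
        # A deque with maxlen keeps only the final n lines while streaming the file.
        return [line.rstrip("\n") for line in deque(fh, maxlen=n)]


# Paths as printed by h2o.init() in this post; they change on every run.
JVM_LOGS = [
    "/tmp/tmpkeo9aau1/h2o_unknownUser_started_from_python.out",
    "/tmp/tmpkeo9aau1/h2o_unknownUser_started_from_python.err",
]

for log in JVM_LOGS:
    print(f"==== {log} ====")
    for line in tail(log):
        print(line)
```

If the JVM process was killed from outside (for example by the Linux out-of-memory killer), these files may be empty as well; in that case the kernel log (`dmesg`) would be the next place to look.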
Here is the version information:
H2O Version: 3.28.0.3
Python 3.6.9
Ubuntu 18.04.3 LTS