java - Tika - Out of memory exception
Problem description
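I have been using Tika to extract only the text content from various files. I ran into an odd problem while parsing a doc file that contains images: the image fetcher was invoked and threw java.lang.OutOfMemoryError: Java heap space.
I also tried the same file in the tika-app 1.22 GUI and got the following exceptions: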
Exception in thread "Image Fetcher 2" java.lang.OutOfMemoryError: Java heap space
at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
at java.awt.image.Raster.createPackedRaster(Raster.java:467)
at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
at sun.awt.image.ImageRepresentation.createBufferedImage(ImageRepresentation.java:253)
at sun.awt.image.ImageRepresentation.setPixels(ImageRepresentation.java:559)
at sun.awt.image.ImageDecoder.setPixels(ImageDecoder.java:138)
at sun.awt.image.PNGImageDecoder.sendPixels(PNGImageDecoder.java:549)
at sun.awt.image.PNGImageDecoder.produceImage(PNGImageDecoder.java:470)
at sun.awt.image.InputStreamImageSource.doFetch(InputStreamImageSource.java:269)
at sun.awt.image.ImageFetcher.fetchloop(ImageFetcher.java:205)
at sun.awt.image.ImageFetcher.run(ImageFetcher.java:169)
Exception in thread "Image Fetcher 0" java.lang.OutOfMemoryError: Java heap space
at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
at java.awt.image.Raster.createPackedRaster(Raster.java:467)
at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
at sun.awt.image.ImageRepresentation.createBufferedImage(ImageRepresentation.java:253)
at sun.awt.image.ImageRepresentation.setPixels(ImageRepresentation.java:559)
at sun.awt.image.ImageDecoder.setPixels(ImageDecoder.java:138)
at sun.awt.image.PNGImageDecoder.sendPixels(PNGImageDecoder.java:549)
at sun.awt.image.PNGImageDecoder.produceImage(PNGImageDecoder.java:470)
at sun.awt.image.InputStreamImageSource.doFetch(InputStreamImageSource.java:269)
at sun.awt.image.ImageFetcher.fetchloop(ImageFetcher.java:205)
at sun.awt.image.ImageFetcher.run(ImageFetcher.java:169)
Exception in thread "Image Fetcher 1" java.lang.OutOfMemoryError: Java heap space
at java.awt.image.DataBufferInt.<init>(DataBufferInt.java:75)
at java.awt.image.Raster.createPackedRaster(Raster.java:467)
at java.awt.image.DirectColorModel.createCompatibleWritableRaster(DirectColorModel.java:1032)
at sun.awt.image.ImageRepresentation.createBufferedImage(ImageRepresentation.java:253)
at sun.awt.image.ImageRepresentation.setPixels(ImageRepresentation.java:559)
at sun.awt.image.ImageDecoder.setPixels(ImageDecoder.java:138)
at sun.awt.image.PNGImageDecoder.sendPixels(PNGImageDecoder.java:549)
at sun.awt.image.PNGImageDecoder.produceImage(PNGImageDecoder.java:470)
at sun.awt.image.InputStreamImageSource.doFetch(InputStreamImageSource.java:269)
at sun.awt.image.ImageFetcher.fetchloop(ImageFetcher.java:205)
at sun.awt.image.ImageFetcher.run(ImageFetcher.java:169)
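A note on the traces above: each one fails in DataBufferInt.<init>, reached through Raster.createPackedRaster, so the decoder was allocating a buffer with one 4-byte int per pixel. The heap needed therefore depends on the pixel dimensions of the image, not on its compressed size. The sketch below estimates that figure from the image header alone, without decoding any pixels. The class and method names (DecodedImageSize, packedIntBytes, estimateDecodedBytes) and the 640 x 480 sample are illustrative assumptions, not part of Tika or of the original post.

```java
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;
import javax.imageio.stream.MemoryCacheImageInputStream;

public class DecodedImageSize {

    /** Bytes needed for a packed-int raster: one 4-byte int per pixel. */
    static long packedIntBytes(int width, int height) {
        return (long) width * height * Integer.BYTES;
    }

    /** Reads only the image header; returns -1 if no ImageIO reader recognises the data. */
    static long estimateDecodedBytes(InputStream in) throws IOException {
        try (ImageInputStream iis = new MemoryCacheImageInputStream(in)) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(iis);
            if (!readers.hasNext()) {
                return -1;
            }
            ImageReader reader = readers.next();
            try {
                reader.setInput(iis, true, true);
                // getWidth/getHeight parse the header without decoding pixel data
                return packedIntBytes(reader.getWidth(0), reader.getHeight(0));
            } finally {
                reader.dispose();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        ImageIO.setUseCache(false); // keep everything in memory, no temp files
        // Build a small in-memory PNG so the example needs no external file
        BufferedImage img = new BufferedImage(640, 480, BufferedImage.TYPE_INT_ARGB);
        ByteArrayOutputStream png = new ByteArrayOutputStream();
        ImageIO.write(img, "png", png);

        long decoded = estimateDecodedBytes(new ByteArrayInputStream(png.toByteArray()));
        System.out.println("compressed bytes: " + png.size());
        System.out.println("decoded bytes:    " + decoded);
    }
}
```

For a 640 x 480 image the decoded figure is 1,228,800 bytes, however small the PNG is on disk. With three Image Fetcher threads each decoding an image at the same time, a few large embedded pictures can plausibly exhaust a default-sized heap.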
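My questions are:
- Why do images need to be fetched when I only want to extract text from the document?
- How can I configure Tika to skip fetching images in this case? I don't want to increase the heap size to work around this; I would rather skip the images gracefully.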
Edit: I read the file as a stream, wrap it in a TikaInputStream, and then open an OutputStream to write the result to another file.
outputWriter = new OutputStreamWriter(outputStream);
// Stop writing once writeLimit characters have been emitted
WriteOutContentHandler writeOutContentHandler = new WriteOutContentHandler(outputWriter, writeLimit);
AutoDetectParser parser = new AutoDetectParser();
Metadata metadata = new Metadata();
// Detect the file type and stream the extracted text to the handler
parser.parse(inputStream, writeOutContentHandler, metadata);
I have attached the file I used for testing; parsing it produced the following exception:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@4f209819
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at tika.Tikaimpl.main(Tikaimpl.java:49)
Caused by: java.lang.IndexOutOfBoundsException: Block 96991 not found
at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:434)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.readBAT(POIFSFileSystem.java:406)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.readCoreContents(POIFSFileSystem.java:359)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:239)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:172)
at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:121)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 3 more
Caused by: java.lang.IndexOutOfBoundsException: Position 49659904 past the end of the file
at org.apache.poi.poifs.nio.FileBackedDataSource.read(FileBackedDataSource.java:88)
at org.apache.poi.poifs.filesystem.POIFSFileSystem.getBlockAt(POIFSFileSystem.java:432)
... 9 more
The file used for testing.
Solution
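One option to try, offered as a sketch rather than a verified fix: pass a ParseContext to the four-argument parse method and register a DocumentSelector that rejects every embedded resource, so Tika's default embedded-document extractor should not hand pictures or attachments to any parser. ParseContext, DocumentSelector, TikaInputStream and WriteOutContentHandler are Tika classes; the class name TextOnlyExtraction, the method extractText and the two command-line paths are hypothetical. Whether this removes the OutOfMemoryError for the attached file has not been tested here.

```java
import java.io.InputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.extractor.DocumentSelector;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.WriteOutContentHandler;

public class TextOnlyExtraction {

    /** Parses the container document and declines every embedded resource. */
    static void extractText(InputStream in, Writer out, int writeLimit) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();

        // Assumption: skipping ALL embedded resources is acceptable, which
        // drops attachments as well as pictures.
        DocumentSelector skipAll = metadata -> false;
        context.set(DocumentSelector.class, skipAll);

        parser.parse(in, new WriteOutContentHandler(out, writeLimit), new Metadata(), context);
    }

    public static void main(String[] args) throws Exception {
        Path input = Paths.get(args[0]);   // hypothetical input path
        Path output = Paths.get(args[1]);  // hypothetical output path
        try (InputStream in = TikaInputStream.get(input);
             Writer out = new OutputStreamWriter(
                     Files.newOutputStream(output), StandardCharsets.UTF_8)) {
            extractText(in, out, -1); // -1 disables the write limit
        }
    }
}
```

This would not affect the IndexOutOfBoundsException in the second trace. That one is thrown from the POIFSFileSystem constructor while OfficeParser opens the container file itself, before any embedded resource is visited, and the message "Position 49659904 past the end of the file" suggests the test file may be truncated or corrupt.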