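Parsing huge XML files in PHP
I'm trying to parse the DMOZ content/structure XML files into MySQL, but all the existing scripts for this are very old and don't work well. How can I open a large (1 GB+) XML file in PHP for parsing?
Answers
There are only two PHP APIs that are really suited to processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read a continuous stream rather than loading the entire tree into memory (which is what SimpleXML and DOM do).
For an example, you might want to look at this partial parser for the DMOZ catalog: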
<?php

class SimpleDMOZParser
{
    protected $_stack = array();
    protected $_file = "";
    protected $_parser = null;

    protected $_currentId = "";
    protected $_current = "";

    public function __construct($file)
    {
        $this->_file = $file;

        $this->_parser = xml_parser_create("UTF-8");
        // Callable arrays bind the handlers to this object
        // (xml_set_object() is deprecated as of PHP 8.4)
        xml_set_element_handler($this->_parser, array($this, "startTag"), array($this, "endTag"));
    }

    // Element and attribute names arrive upper-cased because
    // expat case folding is enabled by default
    public function startTag($parser, $name, $attribs)
    {
        array_push($this->_stack, $this->_current);

        if ($name == "TOPIC" && isset($attribs["R:ID"])) {
            $this->_currentId = $attribs["R:ID"];
        }

        if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) {
            echo $attribs["R:RESOURCE"] . "\n";
        }

        $this->_current = $name;
    }

    public function endTag($parser, $name)
    {
        $this->_current = array_pop($this->_stack);
    }

    public function parse()
    {
        $fh = fopen($this->_file, "r");
        if (!$fh) {
            die("Epic fail!\n");
        }

        while (!feof($fh)) {
            $data = fread($fh, 4096);
            // The third argument tells expat whether this is the final chunk
            if (!xml_parse($this->_parser, $data, feof($fh))) {
                die(xml_error_string(xml_get_error_code($this->_parser)) . "\n");
            }
        }
        fclose($fh);
    }
}

$parser = new SimpleDMOZParser("content.rdf.u8");
$parser->parse();
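This is most definitely the best answer for processing large XML. – Evert 2009-05-26 21:27:27
This is a great answer, but it took me a long time to figure out that you need to use [xml_set_default_handler()](http://php.net/manual/en/function.xml-set-default-handler.php) to access the XML node data; with the code above you can only see the names of the nodes and their attributes. – DirtyBirdNJ 2012-01-18 17:53:56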
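To illustrate the point about node data: the comment above mentions xml_set_default_handler(), and expat also has a dedicated hook for text, xml_set_character_data_handler(), which is what this sketch uses. The helper name collectText() and the element name passed to it are made up for the example; it works on a string only to keep the sketch short, and the same handlers apply when feeding the parser chunk by chunk.

```php
<?php
// Hypothetical helper (not part of the answer above): collects the text
// content of every element with the given name.
function collectText(string $xml, string $element): array
{
    // expat upper-cases element names by default (case folding)
    $wanted = strtoupper($element);
    $results = [];
    $inside = false;
    $text = '';

    $parser = xml_parser_create('UTF-8');

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attribs) use ($wanted, &$inside, &$text) {
            if ($name === $wanted) {
                $inside = true;
                $text = '';
            }
        },
        function ($p, $name) use ($wanted, &$inside, &$text, &$results) {
            if ($name === $wanted) {
                $results[] = trim($text);
                $inside = false;
            }
        }
    );

    // The parser may deliver one text node in several pieces,
    // so append instead of assigning
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$inside, &$text) {
            if ($inside) {
                $text .= $data;
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }

    return $results;
}
```

Something like collectText($xml, 'Title') would then return the text of each Title element in document order.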
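I would suggest using a SAX-based parser rather than DOM-based parsing.
Information on using SAX in PHP: http://www.brainbell.com/tutorials/php/Parsing_XML_With_SAX.htm
This isn't a great solution, but just to throw another option out there:
You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with is).
For example, if your document looks like this: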
<dmoz>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  <listing>....</listing>
  ...
</dmoz>
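You can read it in a meg or two at a time, artificially wrap the few complete <listing> tags you loaded in your root-level tag, and then load them via simplexml/domxml (I used domxml when taking this approach).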
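A rough sketch of that wrapping idea, under some loud assumptions: the file is a flat list of <listing> elements that are never nested and never contain the closing tag inside CDATA or comments, and no other element name starts with "listing". The function name readListingBatches(), the <batch> wrapper, and the default block size are made up for the example, and it uses SimpleXML rather than domxml.

```php
<?php
// Hypothetical helper: yields one SimpleXMLElement per batch of complete
// <$tag> elements read from an open stream.
function readListingBatches($handle, string $tag = 'listing', int $blockSize = 1048576): Generator
{
    $close = "</$tag>";
    $buffer = '';
    $first = true;

    while (!feof($handle)) {
        $buffer .= fread($handle, $blockSize);

        if ($first) {
            // Drop the XML declaration and the real root start tag
            $start = strpos($buffer, "<$tag");
            if ($start === false) {
                continue;
            }
            $buffer = substr($buffer, $start);
            $first = false;
        }

        // Cut after the last complete element; keep the tail for the next round
        $end = strrpos($buffer, $close);
        if ($end === false) {
            continue;
        }
        $end += strlen($close);
        $complete = substr($buffer, 0, $end);
        $buffer = substr($buffer, $end);

        // Wrap the fragment in an artificial root so it is well-formed
        yield simplexml_load_string("<batch>$complete</batch>");
    }
}
```

Each yielded batch can be walked with foreach ($batch->listing as $listing), so only one block's worth of elements is held in memory at a time.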
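Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that you're stuck with either the chunking strategy above or the old SAX/expat library. I don't know about the rest of you, but I hate writing and maintaining SAX/expat parsers.
Note, however, that this approach is NOT really practical when your document doesn't consist of many identical bottom-level elements (for example, it works great for any sort of list of files, URLs, etc., but wouldn't make sense for parsing a large HTML document).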
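For the XMLReader route mentioned above, a minimal sketch might look like the following. The helper name streamElements() is made up, and it assumes the wanted elements are siblings of one another; each one is handed to SimpleXML individually, so only a single element's subtree is expanded in memory at a time.

```php
<?php
// Hypothetical helper: yields each <$element> in the file as a SimpleXMLElement.
function streamElements(string $uri, string $element): Generator
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }

    // Advance to the first matching element
    while ($reader->read() && $reader->name !== $element) {
        // keep reading
    }

    while ($reader->name === $element) {
        // readOuterXml() returns the current element including its own tags
        yield simplexml_load_string($reader->readOuterXml());

        // next() skips the current subtree and moves to the next element with this name
        $reader->next($element);
    }

    $reader->close();
}
```

Usage would be along the lines of foreach (streamElements('content.rdf.u8', 'ExternalPage') as $page) { ... }, where the element name depends on the file being parsed.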
I recently had to parse some pretty large XML documents, and needed a method to read one element at a time.
If you have the following file complex-test.xml:
<?xml version="1.0" encoding="UTF-8"?>
<Complex>
  <Object>
    <Title>Title 1</Title>
    <Name>It's name goes here</Name>
    <ObjectData>
      <Info1></Info1>
      <Info2></Info2>
      <Info3></Info3>
      <Info4></Info4>
    </ObjectData>
    <Date></Date>
  </Object>
  <Object></Object>
  <Object>
    <AnotherObject></AnotherObject>
    <Data></Data>
  </Object>
  <Object></Object>
  <Object></Object>
</Complex>
and wanted to return the <Object/>s:
PHP:
require_once('class.chunk.php');

$file = new Chunk('complex-test.xml', array('element' => 'Object'));

while ($xml = $file->read()) {
  $obj = simplexml_load_string($xml);
  // do some parsing, insert to DB whatever
}
###########
Class File
###########
<?php
/**
 * Chunk
 *
 * Reads a large file in as chunks for easier parsing.
 *
 * The chunks returned are whole <$this->options['element']/>s found within file.
 *
 * Each call to read() returns the whole element including start and end tags.
 *
 * Tested with a 1.8MB file, extracted 500 elements in 0.11s
 * (with no work done, just extracting the elements)
 *
 * Usage:
 * <code>
 *   // initialize the object
 *   $file = new Chunk('chunk-test.xml', array('element' => 'Chunk'));
 *
 *   // loop through the file until all lines are read
 *   while ($xml = $file->read()) {
 *     // do whatever you want with the string
 *     $o = simplexml_load_string($xml);
 *   }
 * </code>
 *
 * @package default
 * @author Dom Hastings
 */
class Chunk {
  /**
   * options
   *
   * @var array Contains all major options
   * @access public
   */
  public $options = array(
    'path' => './',       // string The path to check for $file in
    'element' => '',      // string The XML element to return
    'chunkSize' => 512    // integer The amount of bytes to retrieve in each chunk
  );

  /**
   * file
   *
   * @var string The filename being read
   * @access public
   */
  public $file = '';

  /**
   * pointer
   *
   * @var integer The current position the file is being read from
   * @access public
   */
  public $pointer = 0;

  /**
   * handle
   *
   * @var resource The fopen() resource
   * @access private
   */
  private $handle = null;

  /**
   * reading
   *
   * @var boolean Whether the script is currently reading the file
   * @access private
   */
  private $reading = false;

  /**
   * readBuffer
   *
   * @var string Used to make sure start tags aren't missed
   * @access private
   */
  private $readBuffer = '';

  /**
   * __construct
   *
   * Builds the Chunk object
   *
   * @param string $file The filename to work with
   * @param array $options The options with which to parse the file
   * @author Dom Hastings
   * @access public
   */
  public function __construct($file, $options = array()) {
    // merge the options together
    $this->options = array_merge($this->options, (is_array($options) ? $options : array()));

    // check that the path ends with a /
    if (substr($this->options['path'], -1) != '/') {
      $this->options['path'] .= '/';
    }

    // normalize the filename
    $file = basename($file);

    // make sure chunkSize is an int
    $this->options['chunkSize'] = intval($this->options['chunkSize']);

    // check it's valid
    if ($this->options['chunkSize'] < 64) {
      $this->options['chunkSize'] = 512;
    }

    // set the filename
    $this->file = realpath($this->options['path'].$file);

    // check the file exists
    if (!file_exists($this->file)) {
      throw new Exception('Cannot load file: '.$this->file);
    }

    // open the file
    $this->handle = fopen($this->file, 'r');

    // check the file opened successfully
    if (!$this->handle) {
      throw new Exception('Error opening file for reading');
    }
  }

  /**
   * __destruct
   *
   * Cleans up
   *
   * @return void
   * @author Dom Hastings
   * @access public
   */
  public function __destruct() {
    // close the file resource
    fclose($this->handle);
  }

  /**
   * read
   *
   * Reads the first available occurrence of the XML element $this->options['element']
   *
   * @return string The XML string from $this->file
   * @author Dom Hastings
   * @access public
   */
  public function read() {
    // check we have an element specified
    if (!empty($this->options['element'])) {
      // trim it
      $element = trim($this->options['element']);
    } else {
      $element = '';
    }

    // initialize the buffer
    $buffer = false;

    // if the element is empty
    if (empty($element)) {
      // let the script know we're reading
      $this->reading = true;

      // read in the whole doc, cos we don't know what's wanted
      while ($this->reading) {
        $buffer .= fread($this->handle, $this->options['chunkSize']);
        $this->reading = (!feof($this->handle));
      }

      // return it all
      return $buffer;

    // we must be looking for a specific element
    } else {
      // set up the strings to find
      $open = '<'.$element.'>';
      $close = '</'.$element.'>';

      // let the script know we're reading
      $this->reading = true;

      // reset the global buffer
      $this->readBuffer = '';

      // this is used to ensure all data is read, and to make sure we don't send the start data again by mistake
      $store = false;

      // seek to the position we need in the file
      fseek($this->handle, $this->pointer);

      // start reading
      while ($this->reading && !feof($this->handle)) {
        // store the chunk in a temporary variable
        $tmp = fread($this->handle, $this->options['chunkSize']);

        // update the global buffer
        $this->readBuffer .= $tmp;

        // check for the open string
        $checkOpen = strpos($tmp, $open);

        // if it wasn't in the new buffer
        if (!$checkOpen && !($store)) {
          // check the full buffer (in case it was only half in this buffer)
          $checkOpen = strpos($this->readBuffer, $open);

          // if it was in there
          if ($checkOpen) {
            // set it to the remainder
            $checkOpen = $checkOpen % $this->options['chunkSize'];
          }
        }

        // check for the close string
        $checkClose = strpos($tmp, $close);

        // if it wasn't in the new buffer
        if (!$checkClose && ($store)) {
          // check the full buffer (in case it was only half in this buffer)
          $checkClose = strpos($this->readBuffer, $close);

          // if it was in there
          if ($checkClose) {
            // set it to the remainder plus the length of the close string itself
            $checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize'];
          }

        // if it was
        } elseif ($checkClose) {
          // add the length of the close string itself
          $checkClose += strlen($close);
        }

        // if we've found the opening string and we're not already reading another element
        if ($checkOpen !== false && !($store)) {
          // if we've found the end element too
          if ($checkClose !== false) {
            // append the string only between the start and end element
            $buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen));

            // update the pointer
            $this->pointer += $checkClose;

            // let the script know we're done
            $this->reading = false;

          } else {
            // append the data we know to be part of this element
            $buffer .= substr($tmp, $checkOpen);

            // update the pointer
            $this->pointer += $this->options['chunkSize'];

            // let the script know we're gonna be storing all the data until we find the close element
            $store = true;
          }

        // if we've found the closing element
        } elseif ($checkClose !== false) {
          // update the buffer with the data up to and including the close tag
          $buffer .= substr($tmp, 0, $checkClose);

          // update the pointer
          $this->pointer += $checkClose;

          // let the script know we're done
          $this->reading = false;

        // if we've found the closing element, but half in the previous chunk
        } elseif ($store) {
          // update the buffer
          $buffer .= $tmp;

          // and the pointer
          $this->pointer += $this->options['chunkSize'];
        }
      }
    }

    // return the element (or the whole file if we're not looking for elements)
    return $buffer;
  }
}
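Thanks. This was really helpful. – 2014-11-11 16:58:47
This is a very similar question to "Best way to process large XML in PHP", but it has a very good, specific, upvoted answer addressing the specific problem of DMOZ catalogue parsing. However, since this is a good Google hit for large XML files in general, I will repost my answer from the other question as well:
My take on it: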
https://github.com/prewk/XmlStreamer
A simple class that will extract all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.
class SimpleXmlStreamer extends XmlStreamer {
    public function processNode($xmlString, $elementName, $nodeIndex) {
        $xml = simplexml_load_string($xmlString);

        // Do something with your SimpleXML object

        return true;
    }
}

$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml");
$streamer->parse();
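This is great! Thanks. One question: how does one get the root node's attributes using this? – 2013-10-15 10:35:31
@gyaani_guy I don't think it's currently possible, unfortunately. – oskarth 2013-12-22 21:53:09
This just loads the entire file into memory! – 2014-03-07 16:14:34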
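On the root-attribute question in the comments: this is not something taken from XmlStreamer itself, but a separate sketch using PHP's built-in XMLReader, which stops at the root start tag without reading the rest of the file. The helper name rootAttributes() is made up for the example; note that namespace declarations (xmlns:...) are reported as attributes too.

```php
<?php
// Hypothetical helper: returns the attributes of the document's root element.
function rootAttributes(string $uri): array
{
    $reader = new XMLReader();
    if (!$reader->open($uri)) {
        throw new RuntimeException("Cannot open $uri");
    }

    $attributes = [];
    while ($reader->read()) {
        // The first element node in document order is the root
        if ($reader->nodeType === XMLReader::ELEMENT) {
            // From an element, moveToNextAttribute() moves to its first attribute
            while ($reader->moveToNextAttribute()) {
                $attributes[$reader->name] = $reader->value;
            }
            break;
        }
    }
    $reader->close();

    return $attributes;
}
```

The result is a plain name => value array, which could be read once before starting the streaming pass over the children.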
http://amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb/ It's so simple in Ruby. – 2014-02-19 21:07:00