2009-05-26 89 views
48

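Parsing huge XML files in PHP

I'm trying to parse the DMOZ content/structure XML files into MySQL, but all the existing scripts for this are very old and don't work well. How can I open a large (1 GB+) XML file in PHP for parsing?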

+0

http://amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb/ it's that simple in Ruby – 2014-02-19 21:07:00

Answers

74

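There are only two PHP APIs that are really suited to processing large files. The first is the old expat API, and the second is the newer XMLReader functions. These APIs read a continuous stream rather than loading the entire tree into memory (which is what SimpleXML and DOM do).

As an example, you may want to look at this partial parser for the DMOZ catalog: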

<?php 

class SimpleDMOZParser 
{ 
    protected $_stack = array(); 
    protected $_file = ""; 
    protected $_parser = null; 

    protected $_currentId = ""; 
    protected $_current = ""; 

    public function __construct($file) 
    { 
     $this->_file = $file; 

     $this->_parser = xml_parser_create("UTF-8"); 
     xml_set_object($this->_parser, $this); 
     xml_set_element_handler($this->_parser, "startTag", "endTag"); 
    } 

    public function startTag($parser, $name, $attribs) 
    { 
     array_push($this->_stack, $this->_current); 

     if ($name == "TOPIC" && count($attribs)) { 
      $this->_currentId = $attribs["R:ID"]; 
     } 

     if ($name == "LINK" && strpos($this->_currentId, "Top/Home/Consumer_Information/Electronics/") === 0) { 
      echo $attribs["R:RESOURCE"] . "\n"; 
     } 

     $this->_current = $name; 
    } 

    public function endTag($parser, $name) 
    { 
     $this->_current = array_pop($this->_stack); 
    } 

    public function parse() 
    { 
     $fh = fopen($this->_file, "r"); 
     if (!$fh) { 
      die("Epic fail!\n"); 
     } 

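     while (!feof($fh)) { 
      $data = fread($fh, 4096); 
      // stop on the first parse error instead of silently continuing 
      if (!xml_parse($this->_parser, $data, feof($fh))) { 
       die(xml_error_string(xml_get_error_code($this->_parser)) . "\n"); 
      } 
     } 
     fclose($fh); 
    } 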
} 

$parser = new SimpleDMOZParser("content.rdf.u8"); 
$parser->parse(); 
+0

For processing large XML, this is most definitely the best answer – Evert 2009-05-26 21:27:27

+9

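This is a great answer, but it took me a long time to figure out that you need to use [xml_set_default_handler()](http://php.net/manual/en/function.xml-set-default-handler.php) to access the XML node data; with the code above you can only see the names of the nodes and their attributes. – DirtyBirdNJ 2012-01-18 17:53:56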
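
To illustrate the point in the comment above, here is a minimal sketch of collecting node text with expat. It uses xml_set_character_data_handler(), which expat also provides for the text between tags, rather than the default handler the comment mentions. The function name collectText and its parameters are made up for this example and are not part of the answer's class.

```php
<?php
// Hypothetical helper (not from the answer): returns the text content of
// every element with the given name, using expat's streaming callbacks.
function collectText($xml, $elementName)
{
    $texts = array();
    $inside = false;
    $current = '';

    $parser = xml_parser_create('UTF-8');

    // Case folding is on by default, so element names arrive upper-cased.
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attribs) use (&$inside, &$current, $elementName) {
            if ($name === $elementName) {
                $inside = true;
                $current = '';
            }
        },
        function ($p, $name) use (&$inside, &$current, &$texts, $elementName) {
            if ($name === $elementName) {
                $texts[] = $current;
                $inside = false;
            }
        }
    );

    // Text may be delivered in several pieces, so append instead of assigning.
    xml_set_character_data_handler(
        $parser,
        function ($p, $data) use (&$inside, &$current) {
            if ($inside) {
                $current .= $data;
            }
        }
    );

    if (!xml_parse($parser, $xml, true)) {
        throw new RuntimeException(xml_error_string(xml_get_error_code($parser)));
    }

    return $texts;
}
```

For a real multi-gigabyte file you would feed xml_parse() in fread() chunks as in the answer's parse() method; the single-string call here only keeps the sketch short.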

4

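This isn't a great solution, but just to throw another option out there:

You can break many large XML files up into chunks, especially those that are really just lists of similar elements (as I suspect the file you're working with is).

For example, if your document looks like this: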

<dmoz> 
    <listing>....</listing> 
    <listing>....</listing> 
    <listing>....</listing> 
    <listing>....</listing> 
    <listing>....</listing> 
    <listing>....</listing> 
    ... 
</dmoz> 

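You can read it in a megabyte or two at a time, artificially wrap the few complete <listing> tags you loaded in your root-level tag, and then load them via simplexml/domxml (I used domxml when taking this approach).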
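
A rough sketch of that chunk-and-wrap idea, under some loud assumptions: the repeated element has no attributes, is never nested inside itself, and its tag text never appears inside CDATA or comments. The function name readListingsInChunks, the <root> wrapper, and the default chunk size are all invented for illustration, and it uses SimpleXML rather than the domxml mentioned above.

```php
<?php
// Hypothetical helper: reads the file in fixed-size chunks, cuts out the
// complete <element>...</element> runs seen so far, wraps them in an
// artificial root, and hands each parsed element to the callback.
function readListingsInChunks($path, $element, callable $handle, $chunkSize = 1048576)
{
    $fh = fopen($path, 'r');
    if (!$fh) {
        throw new RuntimeException("Cannot open $path");
    }

    $open = '<' . $element . '>';
    $close = '</' . $element . '>';
    $buffer = '';

    while (!feof($fh)) {
        $buffer .= fread($fh, $chunkSize);

        // Find the end of the last complete element in the buffer.
        $end = strrpos($buffer, $close);
        if ($end === false) {
            continue; // no complete element yet, read more
        }
        $end += strlen($close);

        // Skip anything before the first opening tag (e.g. the real root tag).
        $start = strpos($buffer, $open);
        if ($start === false || $start >= $end) {
            continue;
        }

        $complete = substr($buffer, $start, $end - $start);
        $buffer = substr($buffer, $end); // keep the unfinished tail

        $doc = simplexml_load_string('<root>' . $complete . '</root>');
        if ($doc === false) {
            fclose($fh);
            throw new RuntimeException('Chunk is not well-formed XML');
        }
        foreach ($doc->{$element} as $item) {
            $handle($item);
        }
    }

    fclose($fh);
}
```

Memory use stays around one chunk plus the largest single element, since only the unfinished tail is carried over between reads.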

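Frankly, I prefer this approach if you're using PHP < 5.1.2. With 5.1.2 and higher, XMLReader is available, which is probably the best option, but before that you're stuck with either the chunking strategy above or the old SAX/expat library. I don't know about the rest of you, but I hate writing/maintaining SAX/expat parsers.

Note, however, that the chunking approach is not really practical when your document doesn't consist of many identical bottom-level elements (e.g., it works well for any sort of list of files or URLs, but wouldn't make sense for parsing a large HTML document).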
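
For comparison, a minimal XMLReader sketch, assuming the file is a flat list of same-named elements that are not nested inside each other. The function name streamElements and its parameters are hypothetical, not something from this answer.

```php
<?php
// Hypothetical helper: walks the file node by node and passes each matching
// element to the callback as a small SimpleXML object.
function streamElements($path, $elementName, callable $handle)
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    while ($reader->read()) {
        // Only react to opening tags with the requested name.
        if ($reader->nodeType === XMLReader::ELEMENT && $reader->name === $elementName) {
            // readOuterXml() returns just this element, tags included.
            $handle(simplexml_load_string($reader->readOuterXml()));
        }
    }

    $reader->close();
}
```

Only one element's subtree is materialized at a time, so memory use depends on the size of the largest element rather than the size of the file.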

9

I recently had to parse some pretty large XML documents, and needed a method to read one element at a time.

If you have the following file complex-test.xml:

<?xml version="1.0" encoding="UTF-8"?> 
<Complex> 
    <Object> 
    <Title>Title 1</Title> 
    <Name>It's name goes here</Name> 
    <ObjectData> 
     <Info1></Info1> 
     <Info2></Info2> 
     <Info3></Info3> 
     <Info4></Info4> 
    </ObjectData> 
    <Date></Date> 
    </Object> 
    <Object></Object> 
    <Object> 
    <AnotherObject></AnotherObject> 
    <Data></Data> 
    </Object> 
    <Object></Object> 
    <Object></Object> 
</Complex> 

and want to return the <Object/>s:

PHP:

require_once('class.chunk.php'); 

$file = new Chunk('complex-test.xml', array('element' => 'Object')); 

while ($xml = $file->read()) { 
    $obj = simplexml_load_string($xml); 
    // do some parsing, insert to DB whatever 
} 

########### 
Class File 
########### 

<?php 
/** 
* Chunk 
* 
* Reads a large file in as chunks for easier parsing. 
* 
* The chunks returned are whole <$this->options['element']/>s found within file. 
* 
* Each call to read() returns the whole element including start and end tags. 
* 
* Tested with a 1.8MB file, extracted 500 elements in 0.11s 
* (with no work done, just extracting the elements) 
* 
* Usage: 
* <code> 
* // initialize the object 
* $file = new Chunk('chunk-test.xml', array('element' => 'Chunk')); 
* 
* // loop through the file until all lines are read 
* while ($xml = $file->read()) { 
*  // do whatever you want with the string 
*  $o = simplexml_load_string($xml); 
* } 
* </code> 
* 
* @package default 
* @author Dom Hastings 
*/ 
class Chunk { 
    /** 
    * options 
    * 
    * @var array Contains all major options 
    * @access public 
    */ 
    public $options = array(
    'path' => './',  // string The path to check for $file in 
    'element' => '',  // string The XML element to return 
    'chunkSize' => 512 // integer The amount of bytes to retrieve in each chunk 
); 

    /** 
    * file 
    * 
    * @var string The filename being read 
    * @access public 
    */ 
    public $file = ''; 
    /** 
    * pointer 
    * 
    * @var integer The current position the file is being read from 
    * @access public 
    */ 
    public $pointer = 0; 

    /** 
    * handle 
    * 
    * @var resource The fopen() resource 
    * @access private 
    */ 
    private $handle = null; 
    /** 
    * reading 
    * 
    * @var boolean Whether the script is currently reading the file 
    * @access private 
    */ 
    private $reading = false; 
    /** 
    * readBuffer 
    * 
    * @var string Used to make sure start tags aren't missed 
    * @access private 
    */ 
    private $readBuffer = ''; 

    /** 
    * __construct 
    * 
    * Builds the Chunk object 
    * 
    * @param string $file The filename to work with 
    * @param array $options The options with which to parse the file 
    * @author Dom Hastings 
    * @access public 
    */ 
    public function __construct($file, $options = array()) { 
    // merge the options together 
    $this->options = array_merge($this->options, (is_array($options) ? $options : array())); 

    // check that the path ends with a /
    if (substr($this->options['path'], -1) != '/') { 
     $this->options['path'] .= '/'; 
    } 

    // normalize the filename 
    $file = basename($file); 

    // make sure chunkSize is an int 
    $this->options['chunkSize'] = intval($this->options['chunkSize']); 

    // check it's valid 
    if ($this->options['chunkSize'] < 64) { 
     $this->options['chunkSize'] = 512; 
    } 

    // set the filename 
    $this->file = realpath($this->options['path'].$file); 

    // check the file exists 
    if (!file_exists($this->file)) { 
     throw new Exception('Cannot load file: '.$this->file); 
    } 

    // open the file 
    $this->handle = fopen($this->file, 'r'); 

    // check the file opened successfully 
    if (!$this->handle) { 
     throw new Exception('Error opening file for reading'); 
    } 
    } 

    /** 
    * __destruct 
    * 
    * Cleans up 
    * 
    * @return void 
    * @author Dom Hastings 
    * @access public 
    */ 
    public function __destruct() { 
    // close the file resource, if the constructor got far enough to open it 
    if (is_resource($this->handle)) { 
     fclose($this->handle); 
    } 
    } 

    /** 
    * read 
    * 
    * Reads the first available occurrence of the XML element $this->options['element'] 
    * 
    * @return string The XML string from $this->file 
    * @author Dom Hastings 
    * @access public 
    */ 
    public function read() { 
    // check we have an element specified 
    if (!empty($this->options['element'])) { 
     // trim it 
     $element = trim($this->options['element']); 

    } else { 
     $element = ''; 
    } 

    // initialize the buffer 
    $buffer = false; 

    // if the element is empty 
    if (empty($element)) { 
     // let the script know we're reading 
     $this->reading = true; 

     // read in the whole doc, cos we don't know what's wanted 
     while ($this->reading) { 
     $buffer .= fread($this->handle, $this->options['chunkSize']); 

     $this->reading = (!feof($this->handle)); 
     } 

     // return it all 
     return $buffer; 

    // we must be looking for a specific element 
    } else { 
     // set up the strings to find 
     $open = '<'.$element.'>'; 
     $close = '</'.$element.'>'; 

     // let the script know we're reading 
     $this->reading = true; 

     // reset the global buffer 
     $this->readBuffer = ''; 

     // this is used to ensure all data is read, and to make sure we don't send the start data again by mistake 
     $store = false; 

     // seek to the position we need in the file 
     fseek($this->handle, $this->pointer); 

     // start reading 
     while ($this->reading && !feof($this->handle)) { 
     // store the chunk in a temporary variable 
     $tmp = fread($this->handle, $this->options['chunkSize']); 

     // update the global buffer 
     $this->readBuffer .= $tmp; 

     // check for the open string 
     $checkOpen = strpos($tmp, $open); 

     // if it wasn't in the new buffer 
     if (!$checkOpen && !($store)) { 
      // check the full buffer (in case it was only half in this buffer) 
      $checkOpen = strpos($this->readBuffer, $open); 

      // if it was in there 
      if ($checkOpen) { 
      // set it to the remainder 
      $checkOpen = $checkOpen % $this->options['chunkSize']; 
      } 
     } 

     // check for the close string 
     $checkClose = strpos($tmp, $close); 

     // if it wasn't in the new buffer 
     if (!$checkClose && ($store)) { 
      // check the full buffer (in case it was only half in this buffer) 
      $checkClose = strpos($this->readBuffer, $close); 

      // if it was in there 
      if ($checkClose) { 
      // set it to the remainder plus the length of the close string itself 
      $checkClose = ($checkClose + strlen($close)) % $this->options['chunkSize']; 
      } 

     // if it was 
     } elseif ($checkClose) { 
      // add the length of the close string itself 
      $checkClose += strlen($close); 
     } 

     // if we've found the opening string and we're not already reading another element 
     if ($checkOpen !== false && !($store)) { 
      // if we've found the end element too 
      if ($checkClose !== false) { 
      // append the string only between the start and end element 
      $buffer .= substr($tmp, $checkOpen, ($checkClose - $checkOpen)); 

      // update the pointer 
      $this->pointer += $checkClose; 

      // let the script know we're done 
      $this->reading = false; 

      } else { 
      // append the data we know to be part of this element 
      $buffer .= substr($tmp, $checkOpen); 

      // update the pointer 
      $this->pointer += $this->options['chunkSize']; 

      // let the script know we're gonna be storing all the data until we find the close element 
      $store = true; 
      } 

     // if we've found the closing element 
     } elseif ($checkClose !== false) { 
      // update the buffer with the data up to and including the close tag 
      $buffer .= substr($tmp, 0, $checkClose); 

      // update the pointer 
      $this->pointer += $checkClose; 

      // let the script know we're done 
      $this->reading = false; 

     // if we've found the closing element, but half in the previous chunk 
     } elseif ($store) { 
      // update the buffer 
      $buffer .= $tmp; 

      // and the pointer 
      $this->pointer += $this->options['chunkSize']; 
     } 
     } 
    } 

    // return the element (or the whole file if we're not looking for elements) 
    return $buffer; 
    } 
} 
+0

Thanks. This was really helpful. – 2014-11-11 16:58:47

12

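This is a very similar question to Best way to process large XML in PHP, but here a very good, specific answer addressing the particular problem of parsing the DMOZ catalog has been upvoted. However, since this is a good Google hit for large XML files in general, I will repost my answer from the other question as well:

My take on it: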

https://github.com/prewk/XmlStreamer

A simple class that extracts all children of the XML root element while streaming the file. Tested on a 108 MB XML file from pubmed.com.

class SimpleXmlStreamer extends XmlStreamer { 
    public function processNode($xmlString, $elementName, $nodeIndex) { 
     $xml = simplexml_load_string($xmlString); 

     // Do something with your SimpleXML object 

     return true; 
    } 
} 

$streamer = new SimpleXmlStreamer("myLargeXmlFile.xml"); 
$streamer->parse(); 
+0

This is great! Thanks. One question: how can I get the root node's attributes using this? – 2013-10-15 10:35:31

+0

@gyaani_guy I don't think that's possible at the moment, unfortunately. – oskarth 2013-12-22 21:53:09
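
As a possible workaround for the root-attribute question, here is a sketch that reads the root element's attributes with PHP's built-in XMLReader instead of XmlStreamer. It stops at the first element, so only the start of the file is read. The function name rootAttributes is made up for this example.

```php
<?php
// Hypothetical helper: returns the attributes of the document's root
// element as a name => value array, without reading the rest of the file.
function rootAttributes($path)
{
    $reader = new XMLReader();
    if (!$reader->open($path)) {
        throw new RuntimeException("Cannot open $path");
    }

    $attrs = array();
    while ($reader->read()) {
        // The first ELEMENT node encountered is the root element.
        if ($reader->nodeType === XMLReader::ELEMENT) {
            while ($reader->moveToNextAttribute()) {
                $attrs[$reader->name] = $reader->value;
            }
            break;
        }
    }

    $reader->close();
    return $attrs;
}
```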

+4

This just loads the entire file into memory! – 2014-03-07 16:14:34