2014-01-13 63 views
0

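Grabbing only entries with a specific attribute value on a specific node

I'm using an Xml::Parser built on top of Nokogiri::XML::Reader to extract entries from an XML file. I want to grab only the tags matching "Property/PropertyID/Identification ['OrganizationName'=='northsteppe']", but I can't figure out the correct syntax to do this. Below is the entire rake task I've been building, followed by a sample node that contains all of the information and tags. Any guidance would be greatly appreciated.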

================ UPDATE ===============

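The file I'm parsing is pulled in with open-uri because it comes from an external source; on my local machine I'm just using a hard copy of an older version to speed things up during development, since the file is 300MB+. I tried using a SAX parser, but the logic seemed too complicated for me to really grasp what was going on, and I ran into the same problem: limiting the results to only the properties that have 'northsteppe' as the OrganizationName in the Identification tag. That said, I chose the current approach to attempt the same task, and I'm able to grab almost all of the information I need. I'm only missing the last piece mentioned above.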

=============== BEING AS SPECIFIC AS POSSIBLE =============

So, I feel that describing the exact task I'm trying to perform will help fill in any missing gaps. The task is as follows.

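Grab every Property in the XML file that has OrganizationName = 'northsteppe' in its <Identification> tag, then gather all of the corresponding information for each property individually and insert it into a hash. Once all of the information for a single property has been collected into that hash, it needs to be uploaded as a single entry to the database, which is already structured the way it needs to be. Once that property has been inserted into the database, the rake task moves on to the next Property entry that has OrganizationName = 'northsteppe' in its <Identification> tag and repeats the process, until every property that meets this specification has been inserted into the database. The purpose is to let me quickly search the data for the Northsteppe properties without bogging down the system by going through every property in the XML file.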

Eventually I will use open-uri to pull the file from its external source, and run a cron job that executes this rake task every 6 hours and replaces the database.

================= CODE =================

namespace :db do 

# RAKE TASK DESCRIPTION 
desc "Fetch property information and insert it into the database" 

# RAKE TASK NAME  
task :print_properties => :environment do 

    require 'rubygems' 
    require 'nokogiri' 

    module Xml 
     class Parser 
     def initialize(node, &block) 
      @node = node 
      @node.each do 
      self.instance_eval &block 
      end 
     end 

     def name 
      @node.name 
     end 

     def inner_xml 
      @node.inner_xml.strip 
     end 

     def is_start? 
      @node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT 
     end 

     def is_end? 
      @node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT 
     end 

     def attribute(attribute) 
      @node.attribute(attribute) 
     end 

     def for_element(name, &block) 
      return unless self.name == name and is_start? 
      self.instance_eval &block 
     end 

     def inside_element(name=nil, &block) 
      return if @node.self_closing? 
      return unless name.nil? or (self.name == name and is_start?) 

      name = @node.name 
      depth = @node.depth 

      @node.each do 
      return if self.name == name and is_end? and @node.depth == depth 
      self.instance_eval &block 
      end 
     end 
     end 
    end 


    Xml::Parser.new(Nokogiri::XML::Reader(open("app/assets/xml/mits.xml"))) do 
     inside_element 'Property' do 

      # OPEN AND PARSE THE <PropertyID> TAG 
      inside_element 'PropertyID' do 

       inside_element 'Identification' do 
        puts attribute_nodes() 
       end 

       # OPEN AND PARSE THE <Address> TAG 
       inside_element 'Address' do 
        for_element 'AddressLine1' do puts "Street Address: #{inner_xml}" end 
        for_element 'City' do puts "City: #{inner_xml}" end 
        for_element 'PostalCode' do puts "Zipcode: #{inner_xml}" end 
       end 

      for_element 'MarketingName' do puts "Short Description: #{inner_xml}" end 
      end 

      # OPEN AND PARSE THE <Information> TAG 
      inside_element 'Information' do 
       for_element 'LongDescription' do puts "Long Description: #{inner_xml}" end 
       inside_element 'Rents' do 
        for_element 'StandardRent' do puts "Rent: #{inner_xml}" end 
       end 
      end 

      inside_element 'Fee' do 
       for_element 'ApplicationFee' do puts "Application Fee: #{inner_xml}" end 
      end 

      inside_element 'ILS_Identification' do 
       for_element 'Latitude' do puts "Latitude: #{inner_xml}" end 
       for_element 'Longitude' do puts "Longitude: #{inner_xml}" end 
      end 

     end 
    end 

end #END INSERT_PROPERTIES TASK 

end #END NAMESPACE 

And a sample of the XML:

<Property IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8"> 
<PropertyID> 
    <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/> 
    <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/> 
    <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName> 
    <WebSite>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</WebSite> 
    <Address AddressType="property"> 
    <Description>Address of Available Listing</Description> 
    <AddressLine1>1689 N 4th St </AddressLine1> 
    <City>Columbus</City> 
    <State>OH</State> 
    <PostalCode>43201</PostalCode> 
    <Country>US</Country> 
    </Address> 
    <Phone PhoneType="office"> 
    <PhoneNumber>(614) 299-4110</PhoneNumber> 
    </Phone> 
    <Email>[email protected]</Email> 
</PropertyID> 
<ILS_Identification ILS_IdentificationType="Apartment" RentalType="Market Rate"> 
    <Latitude>39.997694</Latitude> 
    <Longitude>-82.99903</Longitude> 
    <LastUpdate Month="11" Day="11" Year="2013"/> 
</ILS_Identification> 
<Information> 
    <StructureType>Standard</StructureType> 
    <UnitCount>1</UnitCount> 
    <ShortDescription>Spacious House Central Campus OSU, available fall</ShortDescription> 
    <LongDescription>One of our favorites! This great house is perfect for students or a single family. With huge living and sleeping rooms, there is plenty of space. The kitchen is totally modernized with new appliances, and the bathroom has been updated. Natural woodwork and brick accents are seen within the house, and the decorative mantles. Ceiling fans and mini-blinds are included, as well as a FREE stack washer and dryer. The front and side deck. On site parking available.</LongDescription> 
    <Rents> 
    <StandardRent>2000.00</StandardRent> 
    </Rents> 
    <PropertyAvailabilityURL>http://northsteppe.appfolio.com/listings/listings/642da00e-9be3-4a7c-bd50-66a4f0d70af8</PropertyAvailabilityURL> 
</Information> 
<Fee> 
    <ProrateType>Standard</ProrateType> 
    <LateType>Standard</LateType> 
    <LatePercent>0</LatePercent> 
    <LateMinFee>0</LateMinFee> 
    <LateFeePerDay>0</LateFeePerDay> 
    <NonRefundableHoldFee>0</NonRefundableHoldFee> 
    <AdminFee>0</AdminFee> 
    <ApplicationFee>30.00</ApplicationFee> 
    <BrokerFee>0</BrokerFee> 
</Fee> 
<Deposit DepositType="Security Deposit"> 
    <Amount AmountType="Actual"> 
    <ValueRange Exact="2000.00" Currency="USD"/> 
    </Amount> 
</Deposit> 
<Policy> 
    <Pet Allowed="false"/> 
</Policy> 
<Phase IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8"> 
    <Name/> 
    <Description/> 
    <UnitCount>1</UnitCount> 
    <RentableUnits>1</RentableUnits> 
    <TotalSquareFeet>0</TotalSquareFeet> 
    <RentableSquareFeet>0</RentableSquareFeet> 
</Phase> 
<Building IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8"> 
    <Name/> 
    <Description/> 
    <UnitCount>1</UnitCount> 
    <SquareFeet>0</SquareFeet> 
</Building> 
<Floorplan IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8"> 
    <Name/> 
    <UnitCount>1</UnitCount> 
    <Room RoomType="Bedroom"> 
    <Count>4</Count> 
    <Comment/> 
    </Room> 
    <Room RoomType="Bathroom"> 
    <Count>1</Count> 
    <Comment/> 
    </Room> 
    <SquareFeet Min="0" Max="0"/> 
    <MarketRent Min="2000" Max="2000"/> 
    <EffectiveRent Min="2000" Max="2000"/> 
</Floorplan> 
<ILS_Unit IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8"> 
    <Units> 
    <Unit> 
     <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="UL Portfolio"/> 
     <MarketingName>Spacious House Central Campus OSU, available fall</MarketingName> 
     <UnitBedrooms>4</UnitBedrooms> 
     <UnitBathrooms>1.0</UnitBathrooms> 
     <MinSquareFeet>0</MinSquareFeet> 
     <MaxSquareFeet>0</MaxSquareFeet> 
     <SquareFootType>internal</SquareFootType> 
     <UnitRent>2000.00</UnitRent> 
     <MarketRent>2000.00</MarketRent> 
     <Address AddressType="property"> 
     <AddressLine1>1689 N 4th St </AddressLine1> 
     <City>Columbus</City> 
     <PostalCode>43201</PostalCode> 
     <Country>US</Country> 
     </Address> 
    </Unit> 
    </Units> 
    <Availability> 
    <VacateDate Month="7" Day="23" Year="2014"/> 
    <VacancyClass>Unoccupied</VacancyClass> 
    <MadeReadyDate Month="7" Day="23" Year="2014"/> 
    </Availability> 
    <Amenity AmenityType="Other"> 
    <Description>All new stainless steel appliances! Refinished hardwood floors</Description> 
    </Amenity> 
    <Amenity AmenityType="Other"> 
    <Description>Ceramic tile</Description> 
    </Amenity> 
    <Amenity AmenityType="Other"> 
    <Description>Ceiling fans</Description> 
    </Amenity> 
    <Amenity AmenityType="Other"> 
    <Description>Wrap-around porch</Description> 
    </Amenity> 
    <Amenity AmenityType="Dryer"> 
    <Description>Free Washer and Dryer</Description> 
    </Amenity> 
    <Amenity AmenityType="Washer"> 
    <Description>Free Washer and Dryer</Description> 
    </Amenity> 
    <Amenity AmenityType="Other"> 
    <Description>off-street parking available</Description> 
    </Amenity> 
</ILS_Unit> 
<File Active="true" FileID="820982141"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/31077069-6e81-4373-8a89-508c57585543/medium.jpg</Src> 
    <Width>360</Width> 
    <Height>300</Height> 
    <Rank>1</Rank> 
</File> 
<File Active="true" FileID="820982145"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/84e1be40-96fd-4717-b75d-09b39231a762/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>2</Rank> 
</File> 
<File Active="true" FileID="820982149"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/cd419635-c37f-4676-a43e-c72671a2a748/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>3</Rank> 
</File> 
<File Active="true" FileID="820982152"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/6b68dbd5-2cde-477c-99d7-3ca33f03cce8/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>4</Rank> 
</File> 
<File Active="true" FileID="820982155"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/17b6c7c0-686c-4e46-865b-11d80744354a/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>5</Rank> 
</File> 
<File Active="true" FileID="820982157"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/3545ac8b-471f-404a-94b2-fcd00dd16e25/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>6</Rank> 
</File> 
<File Active="true" FileID="820982160"> 
    <FileType>Photo</FileType> 
    <Description>Unit Photo</Description> 
    <Name/> 
    <Caption/> 
    <Format>image/jpeg</Format> 
    <Src>http://pa.cdn.appfolio.com/northsteppe/images/02471172-2183-4bf1-a3d7-33415f902c1c/medium.jpg</Src> 
    <Width>350</Width> 
    <Height>265</Height> 
    <Rank>7</Rank> 
</File> 
    </Property> 
+0

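http://amolnpujari.wordpress.com/2012/03/31/reading_huge_xml-rb/ I also found that ox is about 5 times faster than Nokogiri when reading large XML. I also have a wrapper that lets you search through large XML using ox and iterate over the elements you specify: https://gist.github.com/amolpujari/5966431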

Answers

0

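So I found the solution in a small gem called Saxerator (https://github.com/soulcutter/saxerator). It spares you from working with Nokogiri directly (thankfully), does SAX parsing, has excellent documentation, and runs very fast. I encourage anyone who needs a SAX parser in the future to look into this little gem (pun intended) and save themselves the burden of dealing with all the horrible Nokogiri documentation. The solution to my problem is below, located in my seeds.rb file.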

require 'saxerator' 

parser = Saxerator.parser(File.new("app/assets/xml/mits_snip.xml")) do |config| 
    config.put_attributes_in_hash! 
    config.symbolize_keys! 
end 


parser.for_tag(:Property).each do |property| 
    if property[:PropertyID][:Identification][1][:OrganizationName] == 'northsteppe' 
     property_attributes = { 
      street_address:  property[:PropertyID][:Address][:AddressLine1], 
      city:    property[:PropertyID][:Address][:City], 
      zipcode:   property[:PropertyID][:Address][:PostalCode], 
      short_description: property[:PropertyID][:MarketingName], 
      long_description: property[:Information][:LongDescription], 
      rent:    property[:Information][:Rents][:StandardRent], 
      application_fee: property[:Fee][:ApplicationFee], 
      vacancy_status:  property[:ILS_Unit][:Availability][:VacancyClass], 
      month_available: property[:ILS_Unit][:Availability][:MadeReadyDate][:Month], 
      latitude:   property[:ILS_Identification][:Latitude], 
      longitude:   property[:ILS_Identification][:Longitude] 

     } 

     if Property.create! property_attributes 
      puts "wahoo" 
     else 
      puts "nope" 
     end 
    end 
end 
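The `[:Identification][1]` lookup above depends on every property having at least two Identification tags in a fixed order. A slightly more defensive check might look like the sketch below. It is written against plain Ruby hashes and arrays standing in for what the parser yields; that the parser returns a single hash for one tag and an Array subclass for several is an assumption here, and `owned_by?` is a hypothetical helper name.

```ruby
# Returns true when any <Identification> under <PropertyID> carries the
# given OrganizationName, whether the parser produced one hash or a list.
# (Hypothetical helper; plain hashes stand in for the parsed property.)
def owned_by?(property, organization)
  ids = property.dig(:PropertyID, :Identification)
  ids = [ids] unless ids.is_a?(Array)   # wrap a single tag in a list
  ids.compact.any? { |id| id[:OrganizationName] == organization }
end
```

With this in place, the condition in the loop would read `if owned_by?(property, 'northsteppe')`.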

============== UPDATE =================

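So I actually rewrote this task to do the job much better, and just wanted to share it in case anyone else runs into this problem. This is my seeds.rb file: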

require 'saxerator' 
require 'open-uri' 
require 'nokogiri' # used further down for Nokogiri::XML 
@company_name = 'northsteppe' 
parser = Saxerator.parser(File.new("../../shared/assets/xml/mits.xml")) do |config| 
    config.put_attributes_in_hash! 
    config.symbolize_keys! 
end 
puts "DELETED ALL EXISTING PROPERTIES" if Property.delete_all 
puts "PULLING RELEVANT XML ENTRIES" 
count = 0 # a local replaces @@count: class variables at the top level raise in Ruby 3.0+ 
file = File.new("../../shared/assets/xml/nsr_properties.xml",'w') 
properties = [] 
parser.for_tag(:Property).each do |property| 
    print '*' 
    if property[:PropertyID][:Identification][1][:OrganizationName] == @company_name 
     properties << property 
     count += 1 
    end 
    # break if count == 417 
end 
file.write(properties.to_xml) 
file.close 
puts "ADDING PROPERTIES TO THE DATABASE" 
nsr_properties = File.open("../../shared/assets/xml/nsr_properties.xml") 
doc = Nokogiri::XML(nsr_properties) 
doc.xpath("//saxerator-builder-hash-elements/saxerator-builder-hash-element").each do |property| 
    print '.' 
    @images =[] 
    property.xpath("File/File").each do |image| 
     @images << image.at_xpath("Src/text()").to_s 
    end 
    @amenities = [] 
    property.xpath("ILS-Unit/Amenity/Amenity").each do |amenity| 
     @amenities << amenity.at_xpath("Description/text()").to_s 
    end 
    information = { 
     "street_address" => property.at_xpath("PropertyID/Address/AddressLine1/text()").to_s, 
     "city" => property.at_xpath("PropertyID/Address/City/text()").to_s, 
     "zipcode" => property.at_xpath("PropertyID/Address/PostalCode/text()").to_s, 
     "short_description" => property.at_xpath("PropertyID/MarketingName/text()").to_s, 
     "long_description" => property.at_xpath("Information/LongDescription/text()").to_s, 
     "rent" => property.at_xpath("Information/Rents/StandardRent/text()").to_s, 
     "application_fee" => property.at_xpath("Fee/ApplicationFee/text()").to_s, 
     "bedrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBedrooms/text()").to_s, 
     "bathrooms" => property.at_xpath("ILS-Unit/Units/Unit/UnitBathrooms/text()").to_s, 
     "vacancy_status" => property.at_xpath("ILS-Unit/Availability/VacancyClass/text()").to_s, 
     "month_available" => property.at_xpath("ILS-Unit/Availability/MadeReadyDate/@Month").to_s, 
     "latitude" => property.at_xpath("ILS-Identification/Latitude/text()").to_s, 
     "longitude" => property.at_xpath("ILS-Identification/Longitude/text()").to_s, 
     "images" => @images, 
     "amenities" => @amenities 
    } 
    Property.create!(information) 
end 
puts "DONE, WAHOO" 
1

Try this for a start:

require 'nokogiri' 

doc = Nokogiri::XML(File.read('test.xml')) 
doc.search('*[OrganizationName="northsteppe"]') 
# => [#<Nokogiri::XML::Element:0x3fd82499131c name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd8249912b8 name="IDValue" value="642da00e-9be3-4a7c-bd50-66a4f0d70af8">, #<Nokogiri::XML::Attr:0x3fd8249912a4 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd824991290 name="IDType" value="property">]>, #<Nokogiri::XML::Element:0x3fd824990a70 name="Identification" attributes=[#<Nokogiri::XML::Attr:0x3fd824990a0c name="IDValue" value="6e1e61523972d5f0e260e3d38eb488337424f21e">, #<Nokogiri::XML::Attr:0x3fd8249909f8 name="OrganizationName" value="northsteppe">, #<Nokogiri::XML::Attr:0x3fd8249909e4 name="IDType" value="Company">]>] 

To make what Nokogiri found a bit more readable:

puts doc.search('*[OrganizationName="northsteppe"]').map{ |n| n.to_xml } 
# >> <Identification IDValue="642da00e-9be3-4a7c-bd50-66a4f0d70af8" OrganizationName="northsteppe" IDType="property"/> 
# >> <Identification IDValue="6e1e61523972d5f0e260e3d38eb488337424f21e" OrganizationName="northsteppe" IDType="Company"/> 

I find that CSS is usually more readable than XPath. In this case, it's a toss-up.


...the actual file is 300MB and loading it into the DOM crashes the server.

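If your server can't handle the file size, then your best bet is a SAX parser, which is about as memory-efficient as you can get. Here's a simple example using the sample XML: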

require 'nokogiri' 

class MyDocument < Nokogiri::XML::SAX::Document 
    @@tags = [] 

    def start_element name, attributes = [] 

    attribute_hash = Hash[attributes] 
    if (name == 'Identification') && (attribute_hash['OrganizationName'] == 'northsteppe') 
     @@tags << { 
     name: name, 
     attributes: attribute_hash 
     } 
    end 
    end 

    def tags 
    @@tags 
    end 
end 

doc = MyDocument.new 

# Create a new parser 
parser = Nokogiri::XML::SAX::Parser.new(doc) 

# Feed the parser some XML 
parser.parse(File.open('test.xml')) 

doc.tags 
# => [{:name=>"Identification", 
#  :attributes=> 
#  {"IDValue"=>"642da00e-9be3-4a7c-bd50-66a4f0d70af8", 
#  "OrganizationName"=>"northsteppe", 
#  "IDType"=>"property"}}, 
#  {:name=>"Identification", 
#  :attributes=> 
#  {"IDValue"=>"6e1e61523972d5f0e260e3d38eb488337424f21e", 
#  "OrganizationName"=>"northsteppe", 
#  "IDType"=>"Company"}}] 
+0

Unfortunately this approach won't work, because the actual file is 300MB and loading it into the DOM crashes the server. :/

+0

You left out *very* important information. You need to put all of the constraints in your question. Don't make us figure it out piece by piece.

+0

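I do apologize, I didn't mean to leave that information out. I've added two updates to the question above to be as specific as possible. I ran your code and it does pull all of the Identification tags where OrganizationName = 'northsteppe', which is something I had already been able to do with SAX. :) Hopefully the updates above clarify the exact process I'm trying to accomplish, rather than me just asking for pieces of the puzzle and trying to figure out the rest (which has proven unsuccessful for this particular task).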
