2016-09-02 119 views

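I have a folder containing many XML files, each about 100 MB. I want to parse them by tag and store the data in a SQLite database.
Here is my sample XML. It starts with a <Conversation> tag, and one file contains 75-80 such conversation tags. I need to extract the values of ConversationID, LoginName, StartTime, CompanyName, EmailAddress, DateTime, AccountNumber, FirmNumber, the message Content and EndTime, and insert them into table rows.
How many tables do I need? I was thinking of creating a single table with many columns, so that all the data is filled in row by row, keyed by ConversationID. My processing afterwards includes counting how many users took part in a conversation, what messages they sent, what their email IDs are, and so on.
Would XPath be easier for this, or StAX element processing? I do not want SAX or DOM, because I always get out-of-memory errors: the data is huge, since I am parsing large 100 MB XML files and storing them in a SQLite DB.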

Sample XML input file

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<!-- Data provided by xyz LP. --> 
<FileDump> 
<Version>IBXML 1.3</Version> 
<Conversation Perspective=" " RoomType="P"> 
<RoomID>PCHAT-0x3000001CA8361</RoomID> 
<StartTime>03/31/2016 13:39:01</StartTime> 
<StartTimeUTC>1459431541</StartTimeUTC> 
<ParticipantEntered InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>SWONG00</LoginName> 
<FirstName>STEPHEN</FirstName> 
<LastName>WONG</LastName> 
<UUID>4397109</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>231115</AccountNumber> 
<CompanyName>ABC BANK LIMITED HON</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>03/31/2016 13:39:01</DateTime> 
<DateTimeUTC>1459431541</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantEntered> 
<ParticipantLeft InteractionType="H"> 
<User> 
<LoginName>JAU31</LoginName> 
<FirstName>JIMMY</FirstName> 
<LastName>AU</LastName> 
<UUID>8724958</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>91189</AccountNumber> 
<CompanyName>ABC BANK (HONG KONG)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>03/29/2016 10:45:47</DateTime> 
<DateTimeUTC>1459248347</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantLeft> 
<ParticipantEntered InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>G_LO</LoginName> 
<FirstName>GARY</FirstName> 
<LastName>LO</LastName> 
<UUID>7054548</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>91189</AccountNumber> 
<CompanyName>abc BANK (HONG KONG)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>03/31/2016 14:56:22</DateTime> 
<DateTimeUTC>1459436182</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantEntered> 
<ParticipantLeft InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>G_LO</LoginName> 
<FirstName>GARY</FirstName> 
<LastName>LO</LastName> 
<UUID>7054548</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>91189</AccountNumber> 
<CompanyName>abc BANK (HONG KONG)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>03/31/2016 19:30:01</DateTime> 
<DateTimeUTC>1459452601</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantLeft> 
<ParticipantLeft InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>SWONG00</LoginName> 
<FirstName>STEPHEN</FirstName> 
<LastName>WONG</LastName> 
<UUID>4397109</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>231115</AccountNumber> 
<CompanyName>abc BANK LIMITED HON</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>03/31/2016 19:33:56</DateTime> 
<DateTimeUTC>1459452836</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantLeft> 
<ParticipantEntered InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>SWONG00</LoginName> 
<FirstName>STEPHEN</FirstName> 
<LastName>WONG</LastName> 
<UUID>4397109</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>231115</AccountNumber> 
<CompanyName>abc BANK LIMITED HON</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>03/31/2016 19:45:16</DateTime> 
<DateTimeUTC>1459453516</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantEntered> 
<ParticipantLeft InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>SWONG00</LoginName> 
<FirstName>STEPHEN</FirstName> 
<LastName>WONG</LastName> 
<UUID>4397109</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>231115</AccountNumber> 
<CompanyName>abc BANK LIMITED HON</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>03/31/2016 23:08:09</DateTime> 
<DateTimeUTC>1459465689</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantLeft> 
<ParticipantEntered InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>G_LO</LoginName> 
<FirstName>GARY</FirstName> 
<LastName>LO</LastName> 
<UUID>7054548</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>91189</AccountNumber> 
<CompanyName>abc BANK (HONG KONG)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>03/31/2016 23:14:23</DateTime> 
<DateTimeUTC>1459466063</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantEntered> 
<Message InteractionType="N"> 
<User> 
<LoginName>G_LO</LoginName> 
<FirstName>GARY</FirstName> 
<LastName>LO</LastName> 
<UUID>7054548</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>91189</AccountNumber> 
<CompanyName>abc BANK (HONG KONG)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 00:10:57</DateTime> 
<DateTimeUTC>1459469457</DateTimeUTC> 
<Content> 
abcdefgghhhhhh 
</Content> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</Message> 
<ParticipantEntered InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>WVU</LoginName> 
<FirstName>WHEELOCK</FirstName> 
<LastName>VU</LastName> 
<UUID>8266852</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>91189</AccountNumber> 
<CompanyName>abc BANK (HONG KONG)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 00:14:05</DateTime> 
<DateTimeUTC>1459469645</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantEntered> 
<ParticipantEntered InteractionType="N"> 
<User> 
<LoginName>FCHAN95</LoginName> 
<FirstName>FLORENCE</FirstName> 
<LastName>CHAN</LastName> 
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress></CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 00:29:19</DateTime> 
<DateTimeUTC>1459470559</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantEntered> 
<Message InteractionType="N"> 
<User> 
<LoginName>FCHAN95</LoginName> 
<FirstName>FLORENCE</FirstName> 
<LastName>CHAN</LastName> 
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress></CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 00:29:19</DateTime> 
<DateTimeUTC>1459470559</DateTimeUTC> 
<Content> 
ajdakjgdljsgdsafhkafa 
</Content> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</Message> 
<Message InteractionType="N"> 
<User> 
<LoginName>FCHAN95</LoginName> 
<FirstName>FLORENCE</FirstName> 
<LastName>CHAN</LastName> 
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress></CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 00:29:19</DateTime> 
<DateTimeUTC>1459470559</DateTimeUTC> 
<Content> 
akjdgljsafdlshf;kdsjf 
</Content> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</Message> 
<Message InteractionType="N"> 
<User> 
<LoginName>WVU</LoginName> 
<FirstName>WHEELOCK</FirstName> 
<LastName>VU</LastName> 
<UUID>8266852</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>91189</AccountNumber> 
<CompanyName>abc BANK (HONG KONG)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 00:39:32</DateTime> 
<DateTimeUTC>1459471172</DateTimeUTC> 
<Content> 
sagdksajdlsahd 
</Content> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</Message> 
<ParticipantEntered InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>SWONG00</LoginName> 
<FirstName>STEPHEN</FirstName> 
<LastName>WONG</LastName> 
<UUID>4397109</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>231115</AccountNumber> 
<CompanyName>abc BANK LIMITED HON</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 01:01:27</DateTime> 
<DateTimeUTC>1459472487</DateTimeUTC> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantEntered> 
<Message InteractionType="N"> 
<User> 
<LoginName>SWONG00</LoginName> 
<FirstName>STEPHEN</FirstName> 
<LastName>WONG</LastName> 
<UUID>4397109</UUID> 
<FirmNumber>13133</FirmNumber> 
<AccountNumber>231115</AccountNumber> 
<CompanyName>abc BANK LIMITED HON</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 01:31:29</DateTime> 
<DateTimeUTC>1459474289</DateTimeUTC> 
<Content> 
ajdslsahdsj;a 
</Content> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</Message> 
<Message InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>FCHAN95</LoginName> 
<FirstName>FLORENCE</FirstName> 
<LastName>CHAN</LastName> 
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress></CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 02:49:46</DateTime> 
<DateTimeUTC>1459478986</DateTimeUTC> 
<Content> 
sagdkjsagdkjashdlasjd 
</Content> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</Message> 
<Message InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>FCHAN95</LoginName> 
<FirstName>FLORENCE</FirstName> 
<LastName>CHAN</LastName> 
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress></CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 02:49:46</DateTime> 
<DateTimeUTC>1459478986</DateTimeUTC> 
<Content> 
jsdhkshdksjdlsjdlks 
</Content> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</Message> 
<Message InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>FCHAN95</LoginName> 
<FirstName>FLORENCE</FirstName> 
<LastName>CHAN</LastName> 
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress></CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 03:47:37</DateTime> 
<DateTimeUTC>1459482457</DateTimeUTC> 
<Content> 
jshdkshdksjdlskld 
</Content> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</Message> 
<Message InteractionType="N" DeviceType="M"> 
<User> 
<LoginName>FCHAN95</LoginName> 
<FirstName>FLORENCE</FirstName> 
<LastName>CHAN</LastName> 
<CompanyName>GOLDMAN SACHS (ASIA)</CompanyName> 
<EmailAddress>[email protected]</EmailAddress> 
<CorporateEmailAddress></CorporateEmailAddress> 
</User> 
<DateTime>04/01/2016 03:47:37</DateTime> 
<DateTimeUTC>1459482457</DateTimeUTC> 
<Content> 
aasasasasas 
</Content> 
<ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</Message> 
<EndTime>04/01/2016 03:47:37</EndTime> 
<EndTimeUTC>1459482457</EndTimeUTC> 
</Conversation> 
</FileDump> 
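*"No SAX or DOM bcos I always get out of memory error"* First, this is not a chat site, so please spell out your words. Second, I can see how DOM could cause an OutOfMemory error, but SAX is definitely not the cause, because it is explicitly designed to avoid that. It is actually one of the main solutions to OutOfMemory problems, alongside the newer StAX. – Andreas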

How big are your files? I think vtd-xml would not cause memory problems if the file is 1 GB, and XPath can shorten the code significantly. –

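Hi, each file is at least 70-100 MB, and this XML folder contains about 75 such files. As I mentioned, when I use a SAX/DOM parser I get an OutOfMemory exception somewhere around the 30th-35th file in the loop. In the end I need to know how many users took part in a conversation and what they wrote. So how do I write a SQLite query that returns the number of users in each conversation ID, and their content? –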

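Answers

It looks like you should create 2 or 3 tables: conversation (conversationId, startTime, endTime), user (loginName, companyName, emailAddress, firmNumber) and message (dateTime, messageContent, accountNumber).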

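I once imported XML with PHP, and that was a 1 GB XML file. It is strange that you have problems with Java and 100 MB of XML. But if memory really is the problem, I can suggest what I did: open the file with plain Java classes and read it line by line (or char by char, if line by line is not possible in your case). While reading, detect the start and end tags (<User> and </User>) and read the data between them in a loop. You may end up processing each file three times: a first pass to get all users, a second to get all conversations, and a third to get all messages. This looks like a one-off process, though, so that should be fine for you.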

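Do not read the XML file line by line or char by char and parse it yourself. Java ships with several very capable XML parsers that do not use a lot of memory: SAX and StAX. I recommend StAX, because it is much easier to use than SAX for structured parsing. – Andreas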

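If I create 3 tables such as conversation/user/message, how do I get the number of users in each conversation ID with a SQLite query? Or do I have to count the users in each conversation myself and store the count in a table row? Is it better to count in Java or XPath and store the result, or to use a SQLite count query? –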

I have never used SQLite, but I assume it has aggregate functions, so something like 'select count(user_id) from conversation where id = {yourConversationId}' should work. – degr

This is how you should process the XML file when parsing it with StAX.

Read the initial part of the XML, validate it, then ignore it:

<?xml version="1.0" encoding="UTF-8" standalone="no"?> 
<!-- Data provided by xyz LP. --> 
<FileDump> 
    <Version>IBXML 1.3</Version> 

Read the start of a conversation:

<Conversation Perspective=" " RoomType="P"> 
    <RoomID>PCHAT-0x3000001CA8361</RoomID> 
    <StartTime>03/31/2016 13:39:01</StartTime> 
    <StartTimeUTC>1459431541</StartTimeUTC> 

Create a new record in the Conversation table of the database and get back the ID of the new record.
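The two reading steps above could be sketched with the JDK's built-in StAX API roughly as follows. This is only a minimal sketch: the class name ConversationStart and its fields are hypothetical, and it assumes the elements always appear in the order shown in the sample (RoomID, StartTime, StartTimeUTC). The database insert is left out.

```java
import java.io.Reader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical holder for the values read at the start of one conversation. */
final class ConversationStart {
    final String roomType;
    final String roomId;
    final String startTime;
    final long startTimeUtc;

    ConversationStart(String roomType, String roomId, String startTime, long startTimeUtc) {
        this.roomType = roomType;
        this.roomId = roomId;
        this.startTime = startTime;
        this.startTimeUtc = startTimeUtc;
    }

    /** Opens a streaming reader with DTDs and external entities switched off. */
    static XMLStreamReader open(Reader in) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
        factory.setProperty(XMLInputFactory.IS_SUPPORTING_EXTERNAL_ENTITIES, Boolean.FALSE);
        return factory.createXMLStreamReader(in);
    }

    /** Verifies <FileDump><Version>...</Version> and returns the version text. */
    static String readHeader(XMLStreamReader r) throws XMLStreamException {
        r.nextTag(); // skips the leading comment and whitespace
        r.require(XMLStreamConstants.START_ELEMENT, null, "FileDump");
        return text(r, "Version");
    }

    /** Expects <Conversation> as the next tag and reads up to </StartTimeUTC>. */
    static ConversationStart read(XMLStreamReader r) throws XMLStreamException {
        r.nextTag();
        r.require(XMLStreamConstants.START_ELEMENT, null, "Conversation");
        String roomType = r.getAttributeValue(null, "RoomType");
        String roomId = text(r, "RoomID");
        String startTime = text(r, "StartTime");
        long utc = Long.parseLong(text(r, "StartTimeUTC"));
        return new ConversationStart(roomType, roomId, startTime, utc);
    }

    /** Moves to the next tag, checks its name, and returns its trimmed text. */
    private static String text(XMLStreamReader r, String name) throws XMLStreamException {
        r.nextTag();
        r.require(XMLStreamConstants.START_ELEMENT, null, name);
        return r.getElementText().trim(); // leaves the reader on the end tag
    }
}
```

Only the current element is held in memory at any time, which is why this style of parsing should not run out of memory on a 100 MB file.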

Read a participant entry and save it in the Participant table (where entered vs. left is a column):

<ParticipantEntered InteractionType="N" DeviceType="M"> 
    <User> 
     <LoginName>SWONG00</LoginName> 
     <FirstName>STEPHEN</FirstName> 
     <LastName>WONG</LastName> 
     <UUID>4397109</UUID> 
     <FirmNumber>13133</FirmNumber> 
     <AccountNumber>231115</AccountNumber> 
     <CompanyName>ABC BANK LIMITED HON</CompanyName> 
     <EmailAddress>[email protected]</EmailAddress> 
     <CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
    </User> 
    <DateTime>03/31/2016 13:39:01</DateTime> 
    <DateTimeUTC>1459431541</DateTimeUTC> 
    <ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</ParticipantEntered> 
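One way to read such an entry is to flatten it into a map of column values. The sketch below is an assumption about how that could look, not part of the original answer: EntryReader and the "kind" key are hypothetical names. It expects the reader to be positioned on the entry's start tag, and it works for <ParticipantEntered>, <ParticipantLeft> and <Message> alike because it only collects leaf elements.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical helper that flattens one entry into name/value pairs. */
final class EntryReader {
    /**
     * The reader must be on the START_ELEMENT of an entry. Returns the entry's
     * element name under "kind", its attributes, and the trimmed text of every
     * leaf element. Leaves the reader on the entry's matching END_ELEMENT.
     */
    static Map<String, String> readEntry(XMLStreamReader r) throws XMLStreamException {
        r.require(XMLStreamConstants.START_ELEMENT, null, null);
        Map<String, String> row = new LinkedHashMap<>();
        row.put("kind", r.getLocalName());
        for (int i = 0; i < r.getAttributeCount(); i++) {
            row.put(r.getAttributeLocalName(i), r.getAttributeValue(i));
        }
        int depth = 1;          // 1 = inside the entry element itself
        String current = null;  // element whose text is being collected
        StringBuilder text = new StringBuilder();
        while (depth > 0) {
            switch (r.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    depth++;
                    current = r.getLocalName(); // a child start tag replaces its container
                    text.setLength(0);
                    break;
                case XMLStreamConstants.CHARACTERS:
                case XMLStreamConstants.CDATA:
                    if (current != null) {
                        text.append(r.getText());
                    }
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    depth--;
                    if (current != null) {      // only leaf elements reach this point
                        row.put(current, text.toString().trim());
                        current = null;
                    }
                    break;
                default:
                    break;
            }
        }
        return row;
    }
}
```

The returned map can then be bound to the parameters of an INSERT statement; the "kind" value tells entered from left.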

Read a message entry and save it in the Message table:

<Message InteractionType="N"> 
    <User> 
     <LoginName>G_LO</LoginName> 
     <FirstName>GARY</FirstName> 
     <LastName>LO</LastName> 
     <UUID>7054548</UUID> 
     <FirmNumber>13133</FirmNumber> 
     <AccountNumber>91189</AccountNumber> 
     <CompanyName>abc BANK (HONG KONG)</CompanyName> 
     <EmailAddress>[email protected]</EmailAddress> 
     <CorporateEmailAddress>[email protected]</CorporateEmailAddress> 
    </User> 
    <DateTime>04/01/2016 00:10:57</DateTime> 
    <DateTimeUTC>1459469457</DateTimeUTC> 
    <Content> 
abcdefgghhhhhh 
    </Content> 
    <ConversationID>PCHAT-0x3000001CA8361</ConversationID> 
</Message> 

Keep reading and saving entries: <ParticipantEntered>, <ParticipantLeft>, <Message>
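The loop that drives this could look something like the sketch below. To keep it self-contained it tallies distinct login names and message counts per conversation in memory instead of writing to a database, which is an assumption made purely for illustration; ConversationScanner and Summary are hypothetical names. In a real import, each case in the switch would be the place to issue the corresponding INSERT or UPDATE.

```java
import java.io.Reader;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical single-pass scanner over a whole FileDump document. */
final class ConversationScanner {
    /** Per-conversation tallies. */
    static final class Summary {
        final Set<String> loginNames = new LinkedHashSet<>();
        int messages;
        String endTime;
    }

    /** Streams through the document once and returns one Summary per RoomID. */
    static Map<String, Summary> scan(Reader in) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance().createXMLStreamReader(in);
        Map<String, Summary> byRoom = new LinkedHashMap<>();
        Summary current = null;
        try {
            while (r.hasNext()) {
                if (r.next() != XMLStreamConstants.START_ELEMENT) {
                    continue; // only start tags drive the dispatch
                }
                String name = r.getLocalName();
                if (name.equals("Conversation")) {
                    current = new Summary();
                    continue;
                }
                if (current == null) {
                    continue; // still in the file header
                }
                switch (name) {
                    case "RoomID":
                        byRoom.put(r.getElementText().trim(), current);
                        break;
                    case "LoginName":
                        current.loginNames.add(r.getElementText().trim());
                        break;
                    case "Message":
                        current.messages++;
                        break;
                    case "EndTime":
                        current.endTime = r.getElementText().trim();
                        break;
                    default:
                        break; // every other element is ignored
                }
            }
        } finally {
            r.close();
        }
        return byRoom;
    }
}
```

Calling scan(...) with a buffered file reader for each of the files in turn keeps only the summaries in memory, never a whole document.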

Read the end of the conversation:

    <EndTime>04/01/2016 03:47:37</EndTime> 
    <EndTimeUTC>1459482457</EndTimeUTC> 
</Conversation> 

Update the Conversation record created earlier.

Read and validate the end of the XML document:

</FileDump> 

Done, with a very low memory footprint.

Note: you may also want a 4th table, User.
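As for counting the users in a conversation, a table layout along these lines would let SQLite do the counting. Everything here is an assumption for illustration: the table and column names are invented, and the code needs a SQLite JDBC driver on the classpath to supply the Connection (the JDK does not ship one). Only standard java.sql calls are used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

/** Hypothetical schema and count query for the imported chat data. */
final class ChatSchema {
    static final String[] DDL = {
        "CREATE TABLE IF NOT EXISTS conversation ("
            + " id INTEGER PRIMARY KEY, room_id TEXT NOT NULL,"
            + " start_time TEXT, end_time TEXT)",
        "CREATE TABLE IF NOT EXISTS participant ("
            + " id INTEGER PRIMARY KEY,"
            + " conversation_id INTEGER NOT NULL REFERENCES conversation(id),"
            + " login_name TEXT NOT NULL, email_address TEXT,"
            + " entered INTEGER NOT NULL, date_time TEXT)",
        "CREATE TABLE IF NOT EXISTS message ("
            + " id INTEGER PRIMARY KEY,"
            + " conversation_id INTEGER NOT NULL REFERENCES conversation(id),"
            + " login_name TEXT NOT NULL, date_time TEXT, content TEXT)"
    };

    /** A user who enters and leaves several times is counted once. */
    static final String USERS_PER_CONVERSATION =
        "SELECT COUNT(DISTINCT login_name) FROM participant WHERE conversation_id = ?";

    /** Creates the three tables if they do not exist yet. */
    static void createTables(Connection c) throws SQLException {
        try (Statement st = c.createStatement()) {
            for (String ddl : DDL) {
                st.executeUpdate(ddl);
            }
        }
    }

    /** Returns the number of distinct users seen in one conversation. */
    static int countUsers(Connection c, long conversationId) throws SQLException {
        try (PreparedStatement ps = c.prepareStatement(USERS_PER_CONVERSATION)) {
            ps.setLong(1, conversationId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getInt(1) : 0;
            }
        }
    }
}
```

Because the count is computed from the participant rows on demand, there is no need to store a precomputed count in the conversation row. Note that this counts only users who have a participant row; a user who appears only as a message sender would need a query over the message table as well.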
