C# SqlBulkCopy and DataTables with Parent/Child Relation on Identity Column
Disclaimer: this page is a translated mirror of a popular StackOverFlow question, provided under the CC BY-SA 4.0 license. If you reuse it, you must follow the same CC BY-SA license, link the original URL and author information, and attribute it to the original authors (not me): StackOverFlow
原文地址: http://stackoverflow.com/questions/1197526/
Warning: these are provided under the CC BY-SA 4.0 license. You are free to use/share them, but you must attribute them to the original authors (not me): StackOverFlow
SqlBulkCopy and DataTables with Parent/Child Relation on Identity Column
Asked by James Hugard
We have a need to update several tables that have parent/child relationships based on an Identity primary-key in the parent table, which is referred to by one or more child tables as a foreign key.
- Due to the high volume of data, we would like to build these tables in memory, then use SqlBulkCopy from C# to update the database en mass from either the DataSet or the individual DataTables.
- We would further like to do this in parallel, from multiple threads, processes, and possibly clients.
Our prototype in F# shows a lot of promise, with a 34x performance increase, but this code forces known Identity values in the parent table. When not forced, the Identity column does get correctly generated in the database when SqlBulkCopy inserts the rows, but the Identity values do NOT get updated in the in-memory DataTable. Further, even if they were, it is not clear whether the DataSet would correctly fix up the parent/child relationships, so that the child tables could subsequently be written with correct foreign key values.
Can anyone explain how to have SqlBulkCopy update Identity values, and further how to configure a DataSet so as to retain and update parent/child relationships, if this is not done automatically when a DataAdapter's FillSchema is called on the individual DataTables?
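For reference, a minimal sketch of the kind of bulk write being described (not the poster's actual code; the table name, column names, and batch size here are made up). It also illustrates the crux of the problem: SqlBulkCopy only streams data *to* the server, so the server-generated Identity values never flow back into the in-memory DataTable.

```csharp
using System.Data;
using System.Data.SqlClient;

class BulkLoader
{
    // Bulk-copy an in-memory parent table into the database.
    public static void BulkWriteParents(string connectionString, DataTable parentTable)
    {
        using (var connection = new SqlConnection(connectionString))
        {
            connection.Open();
            using (var bulk = new SqlBulkCopy(connection))
            {
                bulk.DestinationTableName = "dbo.parent"; // hypothetical table
                bulk.BatchSize = 10000;
                // Map only the data column; the server generates the Identity.
                bulk.ColumnMappings.Add("data", "data");
                // One-way street: no Identity values come back to parentTable.
                bulk.WriteToServer(parentTable);
            }
        }
    }
}
```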
Answers that I'm not looking for:
- Read the database to find the current highest Identity value, then manually increment it when creating each parent row. This does not work for multiple processes/clients, and as I understand it, failed transactions may cause some Identity values to be skipped, so this method could corrupt the relationships.
- Write the parent rows one at a time and ask for the Identity value back. This defeats at least some of the gains had by using SqlBulkCopy (yes, there are a lot more child rows than parent ones, but there are still a lot of parent rows).
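For contrast, the second rejected approach would look something like this sketch (table and column names hypothetical): each parent row is inserted individually and SCOPE_IDENTITY() is read back, which keeps the keys correct but costs a server round trip per row.

```csharp
using System.Data.SqlClient;

class RowAtATime
{
    // Insert one parent row and return its server-generated Identity value.
    public static int InsertParent(SqlConnection connection, string data)
    {
        using (var cmd = new SqlCommand(
            "insert into dbo.parent (data) values (@data); " +
            "select cast(scope_identity() as int);", connection))
        {
            cmd.Parameters.AddWithValue("@data", data);
            return (int)cmd.ExecuteScalar();
        }
    }
}
```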
Similar to the following unanswered question:
Answered by Paul Farry
I guess the trade-off you face is the performance of the BulkInsert vs. the reliability of the Identity.
Can you put the database into SingleUserMode temporarily to perform your insert?
I faced a very similar issue with my conversion project, where I am adding an Identity column to very large tables that have children. Fortunately, I was able to set up the Identity in the parent and child sources (I used a TextDataReader) to perform the BulkInsert, and I generated the parent and child files at the same time.
I also saw the performance gains you are talking about: OleDbDataReader source -> StreamWriter ... and then TextDataReader -> SqlBulkCopy.
Answered by Achim
First of all: SqlBulkCopy cannot do what you want. As the name suggests, it's just a "one-way street". It moves data into SQL Server as quickly as possible. It's the .NET version of the old bulk copy command which imports raw text files into tables. So there is no way to get the Identity values back if you are using SqlBulkCopy.
I have done a lot of bulk data processing and have faced this problem several times. The solution depends on your architecture and data distribution. Here are some ideas:
- Create one set of target tables for each thread and import into those tables. At the end, join these tables. Most of this can be implemented in a quite generic way, where you generate tables called TABLENAME_THREAD_ID automatically from tables called TABLENAME.
- Move ID generation completely out of the database. For example, implement a central web service which generates the IDs. In that case you should not generate one ID per call but rather generate ID ranges; otherwise network overhead usually becomes a bottleneck.
- Try to generate IDs out of your data. If that's possible, your problem is gone. Don't say "it's not possible" too fast. Perhaps you can use string IDs which can be cleaned up in a post-processing step?
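The range idea in the second suggestion can be sketched as a trivial allocator. This is an illustrative, in-process version; in the proposal above, the Allocate call would be a web-service (or database) round trip, paid once per block instead of once per row.

```csharp
using System.Threading;

sealed class IdRangeAllocator
{
    private long _next = 1;

    // Reserve `count` consecutive IDs and return the first ID of the block.
    public long Allocate(int count)
    {
        return Interlocked.Add(ref _next, count) - count;
    }
}

// Usage: each worker draws a block, then assigns IDs locally with no contention.
//   long first = allocator.Allocate(10000);
//   for (int i = 0; i < rows.Count; i++) rows[i].Id = first + i;
```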
And one more remark: an increase of 34x when using BulkCopy sounds too small in my opinion. If you want to insert data fast, make sure that your database is configured correctly.
Answered by TheBoyan
The only way you could do what you want by using SqlBulkCopy is to first insert the data into a staging table, then use a stored procedure to distribute the data to the destination tables. Yes, this will cause a slowdown, but it will still be fast.
You might also consider redesigning your data, i.e. splitting it up, denormalizing it etc.
Answered by Nicholas Carey
set identity_insert <table> on
and dbcc checkident
are your friends here. This is something like what I've done in the past (see code sample). The only real caveat is that the update process is the only one that can be inserting data: everybody else has to get out of the pool while the update is going on. You could, of course, do this sort of mapping programmatically prior to loading the production tables. But the same restriction on the inserts applies: the update process is the only process that gets to play.
--
-- start with a source schema -- doesn't actually need to be SQL tables,
-- but from the standpoint of demonstration, it makes it easier
--
create table source.parent
(
    id   int         not null primary key ,
    data varchar(32) not null
)
create table source.child
(
    id        int         not null primary key ,
    data      varchar(32) not null ,
    parent_id int         not null foreign key references source.parent(id)
)
--
-- On the receiving end, you need to create staging tables.
-- You'll notice that while there are primary keys defined,
-- there are no foreign key constraints. Depending on the
-- cleanliness of your data, you might even get rid of the
-- primary key definitions (though you'll need to add
-- some sort of processing to clean the data one way or
-- another, obviously).
--
-- and, depending on context, these could even be temp tables
--
create table stage.parent
(
    id   int         not null primary key ,
    data varchar(32) not null
)
create table stage.child
(
    id        int         not null primary key ,
    data      varchar(32) not null ,
    parent_id int         not null
)
--
-- and of course, the final destination tables already exist,
-- complete with identity properties, etc.
--
create table dbo.parent
(
    id   int         not null identity(1,1) primary key ,
    data varchar(32) not null
)
create table dbo.child
(
    id        int         not null identity(1,1) primary key ,
    data      varchar(32) not null ,
    parent_id int         not null foreign key references dbo.parent(id)
)
-----------------------------------------------------------------------
-- So, you BCP or otherwise load your staging tables with the new data
-- from the source tables. How this happens is left as an exercise for
-- the reader. We'll just assume that some sort of magic happens to
-- make it so. Don't forget to truncate the staging tables prior to
-- loading them with data.
-----------------------------------------------------------------------
-------------------------------------------------------------------------
-- Now we get to work to populate the production tables with the new data
--
-- First we need a map to let us create the new identity values.
-------------------------------------------------------------------------
if object_id('tempdb..#parent_map') is not null drop table #parent_map
if object_id('tempdb..#child_map')  is not null drop table #child_map
create table #parent_map
(
    old_id int not null primary key nonclustered ,
    offset int not null identity(1,1) unique clustered ,
    new_id int null
)
create table #child_map
(
    old_id int not null primary key nonclustered ,
    offset int not null identity(1,1) unique clustered ,
    new_id int null
)
insert #parent_map ( old_id ) select id from stage.parent
insert #child_map  ( old_id ) select id from stage.child
-------------------------------------------------------------------------------
-- now that we've got the map, we can blast the data into the production tables
-------------------------------------------------------------------------------
--
-- compute the new ID values (coalesce covers the case of an initially empty table)
--
update #parent_map set new_id = offset + ( select coalesce(max(id),0) from dbo.parent )
--
-- blast it into the parent table, turning on identity_insert
--
set identity_insert dbo.parent on
insert dbo.parent ( id , data )
select id   = map.new_id ,
       data = staging.data
from stage.parent staging
join #parent_map map on map.old_id = staging.id
set identity_insert dbo.parent off
--
-- reseed the identity property's high-water mark
--
dbcc checkident ( 'dbo.parent' , reseed )
--
-- compute the new ID values
--
update #child_map set new_id = offset + ( select coalesce(max(id),0) from dbo.child )
--
-- blast it into the child table, turning on identity_insert.
-- the child's own id comes from #child_map, and its parent_id
-- is remapped through #parent_map.
--
set identity_insert dbo.child on
insert dbo.child ( id , data , parent_id )
select id        = map.new_id ,
       data      = staging.data ,
       parent_id = parent.new_id
from stage.child staging
join #child_map  map    on map.old_id    = staging.id
join #parent_map parent on parent.old_id = staging.parent_id
set identity_insert dbo.child off
--
-- reseed the identity property's high-water mark
--
dbcc checkident ( 'dbo.child' , reseed )
------------------------------------
-- That's about all there is to it.
------------------------------------
Answered by 100r
Read this article. I think this is exactly what you are looking for and more. Very nice and elegant solution.