Executing An SQL Update Statement From A Pandas Dataframe

Context: I am using MSSQL, pandas, and pyodbc. Steps: Obtain a dataframe from a query using pyodbc (no problemo). Process columns to generate the contents of a new (but already existing) …

Solution 1:

After I recommended .executemany() in a comment to the question, a subsequent comment from @Charlieface suggested that a table-valued parameter (TVP) would provide even better performance. I didn't think it would make that much difference, but I was wrong.

For an existing table named MillionRows

ID  TextField
--  ---------
 1  foo
 2  bar
 3  baz
…

and example data of the form

num_rows = 1_000_000
# one million (TextField, ID) tuples, ordered to match the ? placeholders below
rows = [(f"text{x:06}", x + 1) for x in range(num_rows)]
print(rows)
# [('text000000', 1), ('text000001', 2), ('text000002', 3), …]
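
Since the question starts from a pandas DataFrame, the same list of tuples can be built directly from one. A minimal sketch, assuming a hypothetical DataFrame df with TextField and ID columns; note that the column order must match the parameter order expected by the SQL:

import pandas as pd

# hypothetical DataFrame standing in for the question's query result
df = pd.DataFrame({"ID": range(1, num_rows + 1),
                   "TextField": [f"text{x:06}" for x in range(num_rows)]})

# select columns in (TextField, ID) order to match the ? placeholders
rows = list(df[["TextField", "ID"]].itertuples(index=False, name=None))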

My test using a standard executemany() call with cnxn.autocommit = False and crsr.fast_executemany = True (full setup sketched below)

crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)

took about 180 seconds (3 minutes).
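
For reference, here is a minimal sketch of the pyodbc setup around that call; the driver, server, and database names are placeholders for your environment:

import pyodbc

cnxn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=localhost;DATABASE=mydb;Trusted_Connection=yes;"
)
cnxn.autocommit = False        # batch all updates into one transaction
crsr = cnxn.cursor()
crsr.fast_executemany = True   # send parameters in bulk, not row by row

crsr.executemany("UPDATE MillionRows SET TextField = ? WHERE ID = ?", rows)
cnxn.commit()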

However, by creating a user-defined table type

CREATE TYPE dbo.TextField_ID AS TABLE 
(
    TextField nvarchar(255) NULL, 
    ID int NOT NULL, 
    PRIMARY KEY (ID)
)

and a stored procedure

CREATE PROCEDURE [dbo].[mr_update]
@tbl dbo.TextField_ID READONLY
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE mr SET TextField = t.TextField
    FROM MillionRows mr INNER JOIN @tbl t ON mr.ID = t.ID
END

when I used

crsr.execute("{CALL mr_update (?)}", (rows,))

it did the same update in approximately 80 seconds (less than half the time).
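
On the Python side, pyodbc sends the entire list of tuples as a single TVP parameter, so the update becomes one set-based statement on the server instead of a million individual statements. A sketch under the same connection setup as above:

# the full list of tuples is ONE parameter; pyodbc sends it as the TVP
crsr.execute("{CALL mr_update (?)}", (rows,))
cnxn.commit()   # still required, since autocommit is off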

