Memory efficient (constant) and speed optimized iteration over a large table in Django


Problem description

I have a very large table. It's currently in a MySQL database. I use django.

I need to iterate over each element of the table to pre-compute some particular data (maybe if I was better I could do otherwise but that's not the point).

I'd like to keep the iteration as fast as possible with a constant usage of memory.

As is already clearly explained in Limiting Memory Use in a *Large* Django QuerySet and Why is iterating through a large Django QuerySet consuming massive amounts of memory?, a simple iteration over all objects in Django will kill the machine, as it will retrieve ALL objects from the database.

Towards a solution

First of all, to reduce your memory consumption you should make sure DEBUG is False (or monkey patch the cursor: turn off SQL logging while keeping settings.DEBUG?) so that Django isn't storing queries in the connection for debugging.
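For reference, a minimal sketch of that check (reset_queries() is only worth calling if DEBUG has to stay on for some other reason):

from django.conf import settings
from django import db

assert not settings.DEBUG  # with DEBUG on, every executed query is kept in memory

# If DEBUG must stay on, the per-connection query log can be flushed manually
# from time to time during a long iteration:
db.reset_queries()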

But even with that,

for model in Model.objects.all()

is a no go.

Not even with the slightly improved form:

for model in Model.objects.all().iterator()

Using iterator() will save you some memory by not storing the results in the queryset cache internally (though not necessarily on PostgreSQL!); but it will apparently still retrieve the whole objects from the database.

A naive solution

The solution in the first question is to slice the results into chunks of chunk_size, based on a counter. There are several ways to write it, but basically they all come down to an OFFSET + LIMIT query in SQL.

Something like:

chunk_size = 1000  # rows per chunk
qs = Model.objects.all()
counter = 0
count = qs.count()
while counter < count:
    for model in qs[counter:counter + chunk_size].iterator():
        yield model
    counter += chunk_size

While this is memory efficient (constant memory usage, proportional to chunk_size), it's really poor in terms of speed: as OFFSET grows, both MySQL and PostgreSQL (and likely most DBs) will start choking and slowing down.
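To see why, it helps to print the SQL Django generates for a slice (a quick sketch; the exact SQL depends on the backend):

qs = Model.objects.all()
print(qs[100000:101000].query)
# roughly: SELECT ... FROM app_model LIMIT 1000 OFFSET 100000
# The database still has to walk past the first 100,000 rows for every such chunk,
# so each successive chunk costs more than the previous one.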

A better solution

A better solution is available in this post by Thierry Schellenbach. It filters on the PK, which is way faster than offsetting (how much faster probably depends on the DB).

import gc

chunksize = 1000
pk = 0
last_pk = qs.order_by('-pk')[0].pk
queryset = qs.order_by('pk')
while pk < last_pk:
    for row in queryset.filter(pk__gt=pk)[:chunksize]:
        pk = row.pk
        yield row
    gc.collect()

This is starting to get satisfactory. Now Memory = O(C), and Speed ~= O(N)

Issues with the "better" solution

The better solution only works when the PK is available in the QuerySet. Unfortunately, that's not always the case, in particular when the QuerySet contains combinations of distinct (group_by) and/or values() (ValuesQuerySet).

For that situation the "better solution" cannot be used.
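For example (using a hypothetical Book model purely for illustration), an aggregated values() queryset yields plain dicts with no per-row PK to filter on:

from django.db.models import Count

# Book is a hypothetical model with a `category` field
qs = Book.objects.values('category').annotate(total=Count('id'))
# each row is a dict like {'category': ..., 'total': ...}; there is no pk column
# to order and filter on, so the pk-chunking trick above does not apply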

Can we do better?

Now I'm wondering if we can go faster and avoid the issue regarding QuerySets without PK. Maybe using something that I found in other answers, but only in pure SQL: using cursors.

Since I'm quite bad with raw SQL, in particular in Django, here comes the real question:

how can we build a better Django QuerySet Iterator for large tables

My take from what I've read is that we should use server-side cursors (apparently (see references) using a standard Django Cursor would not achieve the same result, because by default both python-MySQL and psycopg connectors cache the results).

Would this really be a faster (and/or more efficient) solution?

Can this be done using raw SQL in django? Or should we write specific python code depending on the database connector?

Server Side cursors in PostgreSQL and in MySQL
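For PostgreSQL, psycopg2 exposes server-side cursors by giving the cursor a name; here is a minimal sketch outside the ORM (connection parameters and table name are placeholders):

import psycopg2

conn = psycopg2.connect(dbname='mydb', user='me')  # placeholder credentials
cur = conn.cursor(name='big_table_iter')  # a named cursor is server-side
cur.itersize = 2000  # rows fetched per network round trip
cur.execute('SELECT * FROM million_record_table')
for row in cur:
    pass  # process the row; memory stays bounded by itersize
cur.close()
conn.close()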

That's as far as I could get for the moment...

a Django chunked_iterator()

Now, of course the best would be to have this method work as queryset.iterator(), rather than iterate(queryset), and be part of Django core or at least a pluggable app.

Update: Thanks to "T" in the comments for finding a django ticket that carries some additional information. Differences in connector behaviors make it so that probably the best solution would be to create a specific chunked method rather than transparently extending iterator (sounds like a good approach to me). An implementation stub exists, but there hasn't been any work on it in a year, and it does not look like the author is ready to jump on that yet.
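To make the idea concrete, such a chunked() helper could look something like this (a hypothetical sketch built on the PK-filtering trick above, not the implementation stub from the ticket):

def chunked(queryset, chunk_size=1000):
    """Yield every object of `queryset`, fetching at most `chunk_size` rows per query."""
    last_pk = None
    qs = queryset.order_by('pk')
    while True:
        page = qs if last_pk is None else qs.filter(pk__gt=last_pk)
        page = list(page[:chunk_size])
        if not page:
            return
        for obj in page:
            yield obj
        last_pk = page[-1].pk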

Additional Refs:

  1. Why does MYSQL higher LIMIT offset slow the query down?
  2. How can I speed up a MySQL query with a large offset in the LIMIT clause?
  3. http://explainextended.com/2009/10/23/mysql-order-by-limit-performance-late-row-lookups/
  4. postgresql: offset + limit gets to be very slow
  5. Improving OFFSET performance in PostgreSQL
  6. http://www.depesz.com/2011/05/20/pagination-with-fixed-order/
  7. How to get a row-by-row MySQL ResultSet in python / Server Side Cursor in MySQL

Edits:

Django 1.6 is adding persistent database connections

Django Database Persistent Connections

This should facilitate, under some conditions, using cursors. It's still beyond my current skills (and time to learn) to implement such a solution.
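For reference, persistent connections are enabled through the CONN_MAX_AGE database setting (a minimal sketch; the database name and timeout are placeholders):

# settings.py (Django 1.6+)
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.mysql',
        'NAME': 'mydb',        # placeholder
        'CONN_MAX_AGE': 600,   # keep the connection open for up to 10 minutes
    }
}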

Also, the "better solution" definitely does not work in all situations and cannot be used as a generic approach, only a stub to be adapted case by case...

Recommended Answer

The essential answer: use raw SQL with server-side cursors.

Sadly, until Django 1.5.2 there is no formal way to create a server-side MySQL cursor (not sure about other database engines). So I wrote some magic code to solve this problem.

For Django 1.5.2 and MySQLdb 1.2.4, the following code will work. Also, it's well commented.

Caution: This is not based on public APIs, so it will probably break in future Django versions.

# This script should be tested under a Django shell, e.g., ./manage.py shell

from types import MethodType

import MySQLdb.cursors
import MySQLdb.connections
from django.db import connection
from django.db.backends.util import CursorDebugWrapper


def close_sscursor(self):
    """An instance method which replace close() method of the old cursor.

    Closing the server-side cursor with the original close() method will be
    quite slow and memory-intensive if the large result set was not exhausted,
    because fetchall() will be called internally to get the remaining records.
    Notice that the close() method is also called when the cursor is garbage 
    collected.

    This method is more efficient at closing the cursor, but if the result set
    is not fully iterated, the next cursor created from the same connection
    won't work properly. You can avoid this by either (1) closing the connection
    before creating a new cursor, or (2) iterating the result set before closing
    the server-side cursor.
    """
    if isinstance(self, CursorDebugWrapper):
        self.cursor.cursor.connection = None
    else:
        # This is for CursorWrapper object
        self.cursor.connection = None


def get_sscursor(connection, cursorclass=MySQLdb.cursors.SSCursor):
    """Get a server-side MySQL cursor."""
    if connection.settings_dict['ENGINE'] != 'django.db.backends.mysql':
        raise NotImplementedError('Only MySQL engine is supported')
    cursor = connection.cursor()
    if isinstance(cursor, CursorDebugWrapper):
        # Get the real MySQLdb.connections.Connection object
        conn = cursor.cursor.cursor.connection
        # Replace the internal client-side cursor with a server-side cursor
        cursor.cursor.cursor = conn.cursor(cursorclass=cursorclass)
    else:
        # This is for CursorWrapper object
        conn = cursor.cursor.connection
        cursor.cursor = conn.cursor(cursorclass=cursorclass)
    # Replace the old close() method
    cursor.close = MethodType(close_sscursor, cursor)
    return cursor


# Get the server-side cursor
cursor = get_sscursor(connection)

# Run a query with a large result set. Notice that the memory consumption is low.
cursor.execute('SELECT * FROM million_record_table')

# Fetch a single row, fetchmany() rows or iterate it via "for row in cursor:"
cursor.fetchone()

# You can interrupt the iteration at any time. This calls the new close() method,
# so no warning is shown.
cursor.close()

# The connection must be closed to let new cursors work properly. See the comments
# of close_sscursor().
connection.close()

Other Answer

Simple Answer

If you just need to iterate over the table itself without doing anything fancy, Django comes with a builtin iterator:

queryset.iterator()

This causes Django to clean up its own cache to reduce memory use. Note that for truly large tables, this may not be enough.
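(For what it's worth, Django 2.0 and later also let you tune the fetch size via iterator(chunk_size=...), and on PostgreSQL iterator() streams the results through a server-side cursor.)

for obj in MyModel.objects.all().iterator(chunk_size=2000):
    handle(obj)  # handle() is a placeholder for your own processing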


Complex Answer

If you are doing something more complex with each object or have a lot of data, you have to write your own. The following is a queryset iterator that splits the queryset into chunks and is not much slower than the basic iterator (it will issue a linear number of database queries, as opposed to 1, but only one query per 1,000 rows). This function pages by primary key, which is necessary for an efficient implementation, since offset is a linear-time operation in most SQL databases.

def queryset_iterator(queryset, page_size=1000):
    if not queryset:
        return
    max_pk = queryset.order_by("-pk")[0].pk
    # Scale the page size up by the average density of primary keys in the queryset
    adjusted_page_size = int(page_size * max_pk / queryset.count())

    pages = int(max_pk / adjusted_page_size) + 1
    for page_num in range(pages):
        lower = page_num * adjusted_page_size
        # The upper bound must use the adjusted page size; otherwise rows with pk in
        # [lower + page_size, lower + adjusted_page_size) would be silently skipped
        page = queryset.filter(pk__gte=lower, pk__lt=lower + adjusted_page_size)
        for obj in page:
            yield obj

Usage looks like:

for obj in queryset_iterator(Model.objects.all()):
    # do stuff

This code makes two assumptions:

  1. Your primary keys are integers (this will not work for UUID primary keys).
  2. The primary keys of the queryset are at least somewhat uniformly distributed. If this is not true, the adjusted_page_size can end up too large and you may get one or several massive pages as part of your iteration.

To give a sense of the overhead, I tested this on a Postgres table with 40,000 entries. The queryset_iterator adds about 80% to the iteration time vs raw iteration (2.2 seconds vs 1.2 seconds). That overhead does not vary substantially for page sizes between 200 and 10,000, though it starts going up below 200.
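A rough way to reproduce that kind of measurement (just a sketch; the absolute numbers will of course depend on the table and the hardware):

import time

start = time.time()
for obj in queryset_iterator(Model.objects.all(), page_size=1000):
    pass
print('chunked iteration took %.1f seconds' % (time.time() - start))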

Other Answer

There is another option available. It wouldn't make the iteration faster (in fact it would probably slow it down), but it would make it use far less memory. Depending on your needs, this may be appropriate.

large_qs = MyModel.objects.all().values_list("id", flat=True)
for model_id in large_qs:
    model_object = MyModel.objects.get(id=model_id)
    # do whatever you need to do with the model here

Only the ids are loaded into memory, and the objects are retrieved and discarded as needed. Note the increased database load and slower runtime, both tradeoffs for the reduction in memory usage.

I've used this when running async scheduled tasks on worker instances, for which it doesn't really matter if they are slow, but if they try to use way too much memory they may crash the instance and therefore abort the process.
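If the one-query-per-object cost becomes the bottleneck, a middle ground is to fetch the ids in batches with in_bulk(); a minimal sketch, with an arbitrary batch size of 500:

ids = list(MyModel.objects.values_list('id', flat=True))
for i in range(0, len(ids), 500):
    batch = MyModel.objects.in_bulk(ids[i:i + 500])  # {id: object} for this batch
    for model_object in batch.values():
        pass  # do whatever you need to do with the model here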