Estimating the Memory Usage of a Pandas DataFrame

2024-07-27

There are several ways to estimate how much memory a Pandas DataFrame uses.

Using memory_usage()

The DataFrame.memory_usage() method returns the memory used by each column, and by the index, in bytes. It is used as follows.

import pandas as pd

# Create a DataFrame
df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': ['a', 'b', 'c'],
    'column3': [1.1, 2.2, 3.3]
})

# Print the memory usage
print(df.memory_usage())

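The code above prints a Series with one entry per column plus one for the index. The output looks roughly like this (the Index figure depends on the pandas version, and newer versions may store column2 with a dedicated string dtype and report a different value for it):

Index      128
column1     24
column2     24
column3     24
dtype: int64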

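By default, memory_usage() is exact for fixed-width types such as integers, floats, datetimes, and categorical codes, because it simply reads the size of the underlying buffers. For object columns (for example, Python strings) the default result counts only the 8-byte pointers, not the objects they point to, so it can badly underestimate the real footprint. Pass deep=True to include the referenced objects.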
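The gap between the shallow and deep figures is easiest to see on a column of repeated strings. The following is a minimal sketch, assuming a 64-bit Python build; the column name `city` and the data are made up for illustration.

```python
import pandas as pd

# A column of repeated Python strings, stored with the generic object dtype.
df = pd.DataFrame({
    "city": pd.Series(["Seoul", "Busan", "Incheon"] * 1000, dtype="object"),
})

# Shallow: counts only the array of pointers.
shallow = df["city"].memory_usage(index=False)

# Deep: also counts the string objects the pointers refer to.
deep = df["city"].memory_usage(index=False, deep=True)

# A categorical column stores each distinct string once plus small integer codes.
as_category = df["city"].astype("category")
category_deep = as_category.memory_usage(index=False, deep=True)

print(f"object, shallow: {shallow} bytes")
print(f"object, deep:    {deep} bytes")
print(f"category, deep:  {category_deep} bytes")
```

The deep figure should come out several times larger than the shallow one, and the categorical version far smaller than either, which is why converting low-cardinality string columns to category is a common way to save memory.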

Using info()

DataFrame.info() prints a summary of the DataFrame, including its memory usage. It is used as follows.

# Create a DataFrame
df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': ['a', 'b', 'c'],
    'column3': [1.1, 2.2, 3.3]
})

# Print the DataFrame summary
df.info()

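The output looks roughly like this (the exact byte count varies by pandas version):

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   column1  3 non-null      int64  
 1   column2  3 non-null      object 
 2   column3  3 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 200.0+ bytes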

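info() is more compact than memory_usage() and, by default, reports the same shallow total. A trailing + on the memory line means the frame contains object columns whose contents were not counted; pass memory_usage='deep' to get the full figure.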
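To read the deep figure from info() programmatically, the report can be redirected into a string buffer. This is a small sketch using the documented `buf` and `memory_usage` parameters; the variable names are illustrative.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": pd.Series(["a", "b", "c"], dtype="object"),
    "column3": [1.1, 2.2, 3.3],
})

# info() writes to stdout by default; buf= redirects the text so it can be inspected.
buffer = io.StringIO()
df.info(memory_usage="deep", buf=buffer)
report = buffer.getvalue()

# The last line of the report carries the memory figure.
memory_line = report.strip().splitlines()[-1]
print(memory_line)
```

With memory_usage="deep" the memory line has no trailing +, because the string objects have been counted.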

Using sys.getsizeof()

sys.getsizeof() returns the size of a Python object in bytes. It can be applied to a DataFrame as follows.

import sys

# Create a DataFrame
df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': ['a', 'b', 'c'],
    'column3': [1.1, 2.2, 3.3]
})

# Print the memory usage
print(sys.getsizeof(df))

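For most Python objects, sys.getsizeof() counts only the object itself and not the objects it refers to, so it is a poor measure for containers such as lists. A DataFrame is an exception: in current pandas versions, its __sizeof__ is implemented on top of memory_usage(deep=True), so the result is close to the deep total plus a small garbage-collector overhead.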
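The relationship between sys.getsizeof() and the deep total can be checked directly. The sketch below assumes a pandas version in which DataFrame.__sizeof__ delegates to memory_usage(deep=True), which is how recent releases behave.

```python
import sys

import pandas as pd

df = pd.DataFrame({
    "column1": [1, 2, 3],
    "column2": ["a", "b", "c"],
    "column3": [1.1, 2.2, 3.3],
})

# Deep total over all columns and the index.
deep_total = int(df.memory_usage(deep=True).sum())

# sys.getsizeof() calls df.__sizeof__() and adds the garbage-collector header.
reported = sys.getsizeof(df)

print(f"memory_usage(deep=True).sum(): {deep_total} bytes")
print(f"sys.getsizeof(df):             {reported} bytes")
print(f"difference:                    {reported - deep_total} bytes")
```

The difference should be a small constant, which suggests the two methods measure the same thing and sys.getsizeof() adds no new information for a DataFrame.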

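Using a NumPy array

A Pandas DataFrame can be converted to a NumPy array with to_numpy(), and the array's nbytes attribute reports the size of its data buffer. For a frame with mixed column types the result is an object array, so nbytes counts only the pointers and not the values they refer to.

import pandas as pd
import numpy as np

# Create a DataFrame
df = pd.DataFrame({
    'column1': [1, 2, 3],
    'column2': ['a', 'b', 'c'],
    'column3': [1.1, 2.2, 3.3]
})

# Assumed completion of the truncated listing: convert, then read nbytes
arr = df.to_numpy()   # mixed dtypes produce an object array
print(arr.nbytes)     # size of the array buffer in bytes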

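Note:

The code above is illustrative; actual memory usage depends on the column dtypes, the size of the DataFrame, and the Python and pandas versions.
For a reliable estimate, compare several of these methods, for example memory_usage(deep=True) against sys.getsizeof().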

Alternative ways to estimate the memory usage of a Pandas DataFrame

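Using sample()

sample() draws a random sample of rows from a DataFrame. Measuring the sample and dividing by the sampling fraction gives a rough estimate for the whole frame. The sample must contain at least one row: with frac=0.1, a five-row frame rounds to an empty sample, so the example below uses a larger frame.

import pandas as pd

# Create a DataFrame with 1,000 rows
n = 1000
df = pd.DataFrame({
    'column1': range(n),
    'column2': [str(i) for i in range(n)],
    'column3': [i * 1.1 for i in range(n)]
})

# Draw a 10% sample
frac = 0.1
sample = df.sample(frac=frac)

# Measure the sample, then scale up to estimate the full frame
sample_bytes = sample.memory_usage(index=False, deep=True).sum()
print(sample_bytes / frac)

The printed value varies slightly from run to run because the sample is random, and the estimate is only as good as the sample is representative of the full frame (for example, of its string lengths).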
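Sampling is most useful when the full data set has not been loaded yet. The sketch below reads only the first rows of a CSV, measures the bytes per row, and extrapolates to the full row count. The in-memory CSV stands in for a large file on disk, and the column names and row counts are made up for illustration; it also assumes the first rows are representative of the rest.

```python
import io

import pandas as pd

# Build a CSV in memory as a stand-in for a large file on disk (hypothetical data).
total_rows = 10_000
csv_text = "column1,column2,column3\n" + "".join(
    f"{i},label{i % 5},{i * 1.1}\n" for i in range(total_rows)
)

# Load only the first rows and measure their deep memory footprint.
sample_rows = 500
sample = pd.read_csv(io.StringIO(csv_text), nrows=sample_rows)
bytes_per_row = sample.memory_usage(index=False, deep=True).sum() / sample_rows

# Extrapolate to the full row count.
estimate = bytes_per_row * total_rows

# For comparison only: load everything and measure the real figure.
full = pd.read_csv(io.StringIO(csv_text))
actual = full.memory_usage(index=False, deep=True).sum()

print(f"estimated: {estimate:,.0f} bytes")
print(f"actual:    {actual:,.0f} bytes")
```

In practice the final comparison step would be skipped, since avoiding the full load is the point of the estimate.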

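Using a memory_mapper module

File mapping can reduce memory usage by keeping data on disk and loading only the parts that are actually accessed. The listing below uses a memory_mapper module with a MappedDataFrame class. Neither is part of pandas or the Python standard library, so treat both names as hypothetical and substitute whatever memory-mapping tool is available in your environment.

import pandas as pd
import memory_mapper  # hypothetical module, not part of pandas or the standard library

# Create a DataFrame
df = pd.DataFrame({
    'column1': [1, 2, 3, 4, 5],
    'column2': ['a', 'b', 'c', 'd', 'e'],
    'column3': [1.1, 2.2, 3.3, 4.4, 5.5]
})

# Map the DataFrame (hypothetical API)
with memory_mapper.MappedDataFrame(df) as mm:
    # Load only the part that is needed and work with it
    print(mm.loc[0:2, 'column1':'column2'])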
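A file-mapping approach that does exist is numpy.memmap, which maps a binary file of fixed-width values into an array without reading it all into memory. The following is a minimal sketch of that swapped-in technique, not of the module named above; it handles one numeric column only, and the file name `column1.dat` is made up for illustration.

```python
import os
import tempfile

import numpy as np
import pandas as pd

rows = 1_000
path = os.path.join(tempfile.mkdtemp(), "column1.dat")  # hypothetical file name

# Write one numeric column to disk as raw int64 values.
writer = np.memmap(path, dtype="int64", mode="w+", shape=(rows,))
writer[:] = np.arange(rows)
writer.flush()
del writer

# Reopen read-only: the operating system pages data in as slices are touched.
mapped = np.memmap(path, dtype="int64", mode="r", shape=(rows,))

# Only this slice is copied into an ordinary in-memory DataFrame.
head = pd.DataFrame({"column1": np.array(mapped[:3])})

print(head)
print(head.memory_usage(index=False))
```

This works for fixed-width numeric data; string columns have no fixed width and would need a different storage format.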

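Using the Dask library

Dask supports parallel and distributed computation, which helps when a data set is too large to process comfortably in one piece. A Dask DataFrame offers an interface similar to a Pandas DataFrame but splits the data into partitions and evaluates lazily, so only the partitions needed for a result have to be materialized. This lowers peak memory use rather than measuring it.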

import pandas as pd
from dask.dataframe import from_pandas

# Create a DataFrame
df = pd.DataFrame({
    'column1': [1, 2, 3, 4, 5],
    'column2': ['a', 'b', 'c', 'd', 'e'],
    'column3': [1.1, 2.2, 3.3, 4.4, 5.5]
})

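# Create a Dask DataFrame, requesting 4 partitions
# (named ddf to avoid clashing with the common "import dask.dataframe as dd" alias)
ddf = from_pandas(df, npartitions=4)

# Select only the part that is needed, then compute it into a Pandas DataFrame
print(ddf.loc[0:2, 'column1':'column2'].compute())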