
Can you please have a crack at loading Fannie Mae data? #2

@xiaodaigh

Description

I still can't load the Fannie Mae performance data.

I have written some example code, but it runs for more than 5 hours and then fails.

The data requires a login to download and is 135 GB in size.

using Distributed, Statistics
addprocs(6)  # 6 worker processes

@time @everywhere using JuliaDB, Dagger

datapath = "c:/data/Performance_All/"
ifiles = joinpath.(datapath, readdir(datapath))

colnames = ["loan_id", "monthly_rpt_prd", "servicer_name", "last_rt", "last_upb", "loan_age",
    "months_to_legal_mat" , "adj_month_to_mat", "maturity_date", "msa", "delq_status",
    "mod_flag", "zero_bal_code", "zb_dte", "lpi_dte", "fcc_dte","disp_dt", "fcc_cost",
    "pp_cost", "ar_cost", "ie_cost", "tax_cost", "ns_procs", "ce_procs", "rmw_procs",
    "o_procs", "non_int_upb", "prin_forg_upb_fhfa", "repch_flag", "prin_forg_upb_oth",
    "transfer_flg"];


# file sizes in bytes
fsz = (x -> stat(x).size).(ifiles)

# number of ~250 MB chunks to split each file into
nchunks = ceil.(Int, fsz ./ (250 * 1024 * 1024))

# build a GNU `split` command: split `infile` into `n` pieces by lines,
# writing numeric-suffixed output files starting at `outprefix`
mkcmd(infile, outprefix, n) = begin
    "split $infile -n l/$(Int(n)) -d $outprefix"
end

# write one split command per file that needs at least two chunks
open("c:/data/script", "w") do f
    for m in mkcmd.(ifiles, "/c/data/Performance_All_split/".*readdir(datapath), nchunks)[nchunks .>= 2]
        write(f, m*"\n")
    end
end
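For reference, here is a minimal self-contained sketch of the chunk arithmetic and the command strings the loop above writes out (the file name and size below are made up for illustration; the real paths come from `ifiles`):

```julia
# Hypothetical illustration of the splitting logic above:
# a 600 MB file at a 250 MB chunk size needs ceil(600/250) = 3 chunks.
chunksize = 250 * 1024 * 1024
fsz = 600 * 1024 * 1024
n = ceil(Int, fsz / chunksize)  # 3

# same command builder as above
mkcmd(infile, outprefix, nchunks) = "split $infile -n l/$(Int(nchunks)) -d $outprefix"

cmd = mkcmd("c:/data/Performance_All/example.txt",
            "/c/data/Performance_All_split/example.txt", n)
# cmd == "split c:/data/Performance_All/example.txt -n l/3 -d /c/data/Performance_All_split/example.txt"
```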

#####################################################################
############## execute the above script (e.g. in Git Bash) to split the CSVs into smaller CSVs
#####################################################################

const fmtypes = [
    String,                     Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},     Union{String, Missing},
    Union{String, Missing},     Union{String, Missing},     Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},
    Union{Float64, Missing},    Union{Float64, Missing},    Union{Float64, Missing},    Union{String, Missing},     Union{Float64, Missing},
    Union{String, Missing}]
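Each `Union{T, Missing}` entry tells the reader that blank fields in that column should become `missing`. Conceptually (this is a sketch of the behaviour, not JuliaDB's actual parser):

```julia
# Conceptual sketch only: a blank pipe-delimited field becomes `missing`,
# a non-blank field is parsed as the target type.
parsefield(::Type{Union{Float64,Missing}}, s) = isempty(s) ? missing : parse(Float64, s)

fields = split("100.5||0.25", '|')
vals = parsefield.(Union{Float64,Missing}, fields)
# vals[1] == 100.5, vals[2] is missing, vals[3] == 0.25
```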


#datapath = "C:/data/Performance_All_split"
datapath = "C:/data/ok"
ifiles = joinpath.(datapath, readdir(datapath))

# takes more than 5 hours and then fails.
@time jll = loadtable(
    ifiles,
    output = "c:/data/fm.jldb/",
    delim='|',
    header_exists=false,
    filenamecol = "filename",
    chunks = length(ifiles),
    #type_detect_rows = 20000,
    colnames = colnames,
    colparsers = fmtypes,
    indexcols=["loan_id", "monthly_rpt_prd"])
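One way to narrow down a failure like this is to load the files one at a time and record which ones error. A hypothetical helper (`find_bad_files` is not part of JuliaDB; the suggested `loader` closure assumes the same `colnames`/`fmtypes` as above):

```julia
# Hypothetical helper: run `loader` over each file, collecting the ones that throw.
# For this issue, `loader` could be
#   f -> loadtable([f], delim='|', header_exists=false,
#                  colnames=colnames, colparsers=fmtypes)
function find_bad_files(loader, files)
    bad = String[]
    for f in files
        try
            loader(f)
        catch err
            @warn "failed to load $f" exception = err
            push!(bad, f)
        end
    end
    return bad
end
```

Running this over `ifiles` would at least isolate a malformed CSV from a resource problem (a crash that only shows up with all files loaded at once points at memory, not data).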

# in a fresh session, verify the saved table loads back
using JuliaDB

@time a = load("c:/data/fm.jldb/")
