Menggunakan Pandas
Pandas adalah library atau paket yang kita gunakan untuk mengutik-utik data. Beberapa hal sederhana yang anda dapat lakukan di pandas, dapat juga dilakukan dengan menggunakan Google Sheet atau Microsoft Excel. Jika anda lebih familiar dengan tools seperti tersebut, gunakan sheets juga nda papa. Lebih lagi, nge-print seluruh isi dataset di Pandas bisa makan memori jika data anda sangat besar. Anda dapat gunakan sheets untuk melihat datanya, tapi dimanipulasi di python juga bisa. Whatever works for you. Di sini, kita akan coba memanipulasi dengan pandas.
Zaman sekarang sih fitur-fitur di Microsoft Excel dan Google Sheet sudah sangat canggih, terlebih dalam hal menyedot data secara real-time dari internet. Tentu saja kedua aplikasi di atas kurang canggih kalau dipakai untuk programming atau statistik yang agak lebih canggih daripada OLS. Karena itu, menggunakan Pandas lebih disarankan.
Memang memanipulasi data di Sheet lebih intuitif dan bisa jadi lebih cepet, terutama kalau datanya tidak terlalu besar. Namun kelebihan Pandas diantaranya adalah ketika anda harus berurusan dengan time-series atau panel data, melakukan sorting / grouping dan semacamnya juga lebih cepat dengan Pandas. Anda juga nanti akan berurusan dengan mengecek dan memanipulasi missing values, yang lebih cepat jika menggunakan Pandas. Kelebihan lainnya, semua manipulasi yang anda lakukan akan terdokumentasi di Notebook anda. Anda tidak perlu (bahkan sebaiknya tidak) mengubah data aslinya. Tinggal re-run saja kode di Notebook jika ingin melihat kembali hasil analisis anda
Untuk menggunakan Pandas, Kita memanggilnya dengan:
import pandas as pd
setelah itu, semua fungsi yang terkait dengan pandas dapat kita gunakan dengan awalan pd. Dokumentasi lengkap soal fungsi apa saja yang ada di pandas dapat dibaca di dokumentasi resmi di sini. Tentu saja anda juga bisa menggunakan gugel.
Data Frame
Data Frame bentuknya seperti ini:
# menulis data dari awal.
data = {'Nama':['imed', 'krisna', 'wisnu', 'gupta'], 'pekerjaan':['Ekonom', 'kontraktor', 'pelajar', 'tukang baso'], 'usia':[30, 27, 19, 25]}
# print
data
{'Nama': ['imed', 'krisna', 'wisnu', 'gupta'],
'pekerjaan': ['Ekonom', 'kontraktor', 'pelajar', 'tukang baso'],
'usia': [30, 27, 19, 25]}
Seperti anda lihat, data di atas bentuknya kurang enak dilihat. Datanya sih jadi, tapi kurang intuitif. Jika kita gunakan panda data frame, hasilnya lebih enak dilihat dan lebih ada artinya.
# jadikan data ke dalam dataframe
data = pd.DataFrame(data)
# Print df.
data
| Nama | pekerjaan | usia | |
|---|---|---|---|
| 0 | imed | Ekonom | 30 |
| 1 | krisna | kontraktor | 27 |
| 2 | wisnu | pelajar | 19 |
| 3 | gupta | tukang baso | 25 |
Tabel di atas memiliki 1 header (baris paling atas), 4 baris (rows), dan 3 kolom (columns). Angka [0,1,2,3] disebut juga dengan index. Index ini juga ada di aplikasi Microsoft Excel maupun Google Sheet, yaitu nomor rows. Ingat kalau di excel kolom diberi index huruf sementara baris diberi index angka. Di Pandas, kolom tidak diberi index spesifik.
Kolom merepresentasikan nama variabel (atau disebut juga dengan series), sementara baris merepresentasikan observasi. Ada 3 variabel di tabel di atas, yaitu Nama, Pekerjaan, dan Usia. Sementara itu, ada 4 observasi. Pokoknya Pandas tuh persis banget lah sama Microsoft Excel maupun Google Sheet.
Membuat series baru
Sekarang kita punya sebuah jenis data, yaitu data yang sudah berbentuk DataFrame. Mari kita coba buat series baru yang berisi angka acak (random). Untuk menghasilkan angka acak, kita butuh paket Numpy
import numpy as np
series=pd.Series(np.random.randn(4), name='gaji')
series
0 0.434327
1 -0.355109
2 -0.229159
3 0.865556
Name: gaji, dtype: float64
data di series bisa kita manipulasi seperti di excel dengan mengalikan atau membagi dengan angka lain. Jika kita kali dengan sebuah angka, maka seluruh anggota series tersebut akan dikali dengan angka tersebut. (catatan: bedakan series dengan matrix!)
series+10 # setiap anggota series kita tambahkan 10
0 10.434327
1 9.644891
2 9.770841
3 10.865556
Name: gaji, dtype: float64
series/10 # setiap anggota series kita bagi 10
0 0.043433
1 -0.035511
2 -0.022916
3 0.086556
Name: gaji, dtype: float64
np.log(series+10) # setiap anggota series kita transformasi natural log 10
0 2.345101
1 2.266428
2 2.279403
3 2.385598
Name: gaji, dtype: float64
kita juga bisa menempelkan series menjadi anggota baru dari df
data['gaji']=series
data
| Nama | pekerjaan | usia | gaji | |
|---|---|---|---|---|
| 0 | imed | Ekonom | 30 | 0.434327 |
| 1 | krisna | kontraktor | 27 | -0.355109 |
| 2 | wisnu | pelajar | 19 | -0.229159 |
| 3 | gupta | tukang baso | 25 | 0.865556 |
dapat anda lihat di atas bahwa saya membuat sebuah kolom baru dengan menggunakan kurung siku. kode data['gaji'] berarti saya membuat sebuah kolom baru bernama gaji, yang anggotanya saya ambil dari series. Dengan kata lain, kode <data.frame>[nama.kolom] kita gunakan untuk memanggil kolom tersebut.
Contoh penggunaan:
data['usia'] # memanggil hanya kolom usia
0 30
1 27
2 19
3 25
Name: usia, dtype: int64
Gajinya kok kecil banget ya. ada yang negatif lagi. mari kita ganti angka-angka di gaji tersebut dengan angka yang lebih masuk akal
data['gaji']=(data['gaji']+1)*10000 # apakah anda bisa menebak apa yang terjadi?
data
| Nama | pekerjaan | usia | gaji | |
|---|---|---|---|---|
| 0 | imed | Ekonom | 30 | 14343.266315 |
| 1 | krisna | kontraktor | 27 | 6448.905246 |
| 2 | wisnu | pelajar | 19 | 7708.407117 |
| 3 | gupta | tukang baso | 25 | 18655.559599 |
Masih kecil?? oh gampang. Yang perlu saya ganti hanyalah nama kolom saja
data=data.rename(columns={'gaji':'gaji (USD)'})
data
| Nama | pekerjaan | usia | gaji (USD) | |
|---|---|---|---|---|
| 0 | imed | Ekonom | 30 | 14343.266315 |
| 1 | krisna | kontraktor | 27 | 6448.905246 |
| 2 | wisnu | pelajar | 19 | 7708.407117 |
| 3 | gupta | tukang baso | 25 | 18655.559599 |
gimana? Jadi gede ya gajinya? Wkwkw. begitulah cara mengganti nama kolom.
Anda juga bisa melihat deskripsi statistik dari data anda dengan menggunakan nama.describe
data.describe()
| usia | gaji (USD) | |
|---|---|---|
| count | 4.000000 | 4.000000 |
| mean | 25.250000 | 11789.034569 |
| std | 4.645787 | 5739.970017 |
| min | 19.000000 | 6448.905246 |
| 25% | 23.500000 | 7393.531649 |
| 50% | 26.000000 | 11025.836716 |
| 75% | 27.750000 | 15421.339636 |
| max | 30.000000 | 18655.559599 |
describe hanya memunculkan informasi float maupun integer tapi tidak string. Count berarti jumlah observasi (atau jumlah baris), sisanya sudah pada tau ya?
Latihan lanjutan
Untuk latihan berikutnya, mari kita gunakan data yang sudah saya siapkan untuk latihan. Datanya bisa ditarik dari website saya.
# menarik data dari internet dengan menggunakan read_csv dan diassign ke `lat`
lat=pd.read_csv('https://krisna.netlify.app/courses/pystat/latihan.csv')
lat
| Country Name | ARI treatment (% of children under 5 taken to a health provider) | Access to clean fuels and technologies for cooking (% of population) | Access to electricity (% of population) | Access to electricity, rural (% of rural population) | Access to electricity, urban (% of urban population) | Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+) | ... | Women participating in the three decisions (own health care, major household purchases, and visiting family) (% of women age 15-49) | Women who believe a husband is justified in beating his wife (any of five reasons) (%) | Women who believe a husband is justified in beating his wife when she argues with him (%) | Women who believe a husband is justified in beating his wife when she burns the food (%) | Women who believe a husband is justified in beating his wife when she goes out without telling him (%) | Women who believe a husband is justified in beating his wife when she neglects the children (%) | Women who believe a husband is justified in beating his wife when she refuses sex with him (%) | Women who were first married by age 15 (% of women ages 20-24) | Women who were first married by age 18 (% of women ages 20-24) | Women's share of population ages 15+ living with HIV (%) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Afghanistan | NaN | NaN | 97.700000 | 97.091973 | 99.500000 | 14.893312 | 7.160685 | 22.536495 | 18.016499 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 29.3 |
| 1 | Albania | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | 40.015171 | 38.102959 | 42.043503 | 42.513248 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 2 | Algeria | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | 42.776627 | 29.268730 | 56.254936 | 48.747078 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 45.9 |
| 3 | American Samoa | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | Andorra | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 212 | Virgin Islands (U.S.) | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 213 | West Bank and Gaza | NaN | NaN | 99.700000 | 99.070331 | 99.900000 | 25.022177 | 15.910773 | 34.436001 | 33.076920 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 214 | Yemen, Rep. | NaN | NaN | 79.200000 | 68.766799 | 97.735039 | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 22.8 |
| 215 | Zambia | NaN | NaN | 40.300000 | 13.997699 | 75.200000 | 45.863258 | 40.278877 | 51.680923 | 48.338139 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 59.3 |
| 216 | Zimbabwe | NaN | NaN | 40.482048 | 19.152600 | 85.317101 | 55.285137 | 51.664925 | 59.227501 | 58.877064 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 59.7 |
217 rows × 1432 columns
Seperti anda lihat pada tabel di atas, data ini jauh lebih besar daripada data yang cuma 4 baris dan 4 kolom. Kali ini, ada total 217 baris dan 1432 kolom! Data diatas merupakan basis data World Bank World Development Indicators yang diunduh secara bulk. Data tersebut terdiri dari 217 negara yang masing-masing terdapat angka untuk 1432 indikator.
Dengan tabel sebesar itu, anda hanya dapat melihat beberapa saja. Jupyter Notebook hanya akan menampilkan 5 data teratas, 5 terbawah, 10 paling kiri dan 10 paling kanan. Jika dilihat semua, maka akan kegedean dan tidak cukup Jupyter Notebooknya. Jika anda ingin melihat semuanya, lebih baik didownload dalam bentuk csv dan dibuka pakai google sheet atau microsoft excel.
Untuk download data ini, gunakan fungsi:
lat.to_csv("latihan.csv")
Data tersebut akan tersimpan di tempan anda menginstall Anaconda. Biasanya akan ada di C:\Users\nama\.
Untuk lanjut sih kita tidak perlu download datanya. Untuk sekarang, kita akan bermain-main dengan lat
Pertama, Coba kita pilih beberapa negara saja, misalnya negara-negara ASEAN dan RRT. kita akan menggunakan fungsi loc. Langkah pertama untuk menggunakan loc adalah ngeset nama negara menjadi index, lalu menggunakan nama negara untuk membuat dataframe baru sesuai dengan nama-nama negara yang kita pilih.
lat=lat.set_index('Country Name') # menjadikan nama negara sebagai index
# memilih negara-negara yang mau disimpan
countries=['Indonesia','China','Malaysia','Singapore','Thailand','Vietnam','Brunei Darussalam','Lao PDR','Cambodia','Myanmar','Philippines']
lat2=lat.loc[countries] # menggunakan fungsi loc[index] dan menempelkannya ke dataframe baru bernama lat2
lat2 # ngeprint data baru lat2
| ARI treatment (% of children under 5 taken to a health provider) | Access to clean fuels and technologies for cooking (% of population) | Access to electricity (% of population) | Access to electricity, rural (% of rural population) | Access to electricity, urban (% of urban population) | Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+) | Account ownership at a financial institution or with a mobile-money-service provider, poorest 40% (% of population ages 15+) | ... | Women participating in the three decisions (own health care, major household purchases, and visiting family) (% of women age 15-49) | Women who believe a husband is justified in beating his wife (any of five reasons) (%) | Women who believe a husband is justified in beating his wife when she argues with him (%) | Women who believe a husband is justified in beating his wife when she burns the food (%) | Women who believe a husband is justified in beating his wife when she goes out without telling him (%) | Women who believe a husband is justified in beating his wife when she neglects the children (%) | Women who believe a husband is justified in beating his wife when she refuses sex with him (%) | Women who were first married by age 15 (% of women ages 20-24) | Women who were first married by age 18 (% of women ages 20-24) | Women's share of population ages 15+ living with HIV (%) | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Country Name | |||||||||||||||||||||
| Indonesia | 92.1 | NaN | 98.140000 | 96.030359 | 99.890000 | 48.857452 | 51.354919 | 46.245716 | 49.497471 | 36.595760 | ... | 68.2 | 32.0 | 4.4 | 1.7 | 19.0 | 27.2 | 6.0 | NaN | NaN | 35.6 |
| China | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | 80.229118 | 76.364731 | 83.966637 | 79.261734 | 68.350258 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Malaysia | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | 85.343750 | 82.478539 | 87.916290 | 85.817039 | 80.504547 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 17.8 |
| Singapore | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | 97.931168 | 96.312927 | 99.682663 | 97.947220 | 96.378250 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 9.4 |
| Thailand | NaN | NaN | 99.900000 | 99.900000 | 99.900000 | 81.594002 | 79.780106 | 83.691376 | 83.174179 | 77.546707 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 44.1 |
| Vietnam | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | 30.795792 | 30.419701 | 31.205099 | 29.921171 | 20.277523 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 32.7 |
| Brunei Darussalam | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Lao PDR | 39.8 | NaN | 93.600000 | 90.510483 | 99.500000 | 29.056417 | 31.874203 | 26.115883 | 31.726784 | 17.370375 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 44.7 |
| Cambodia | NaN | NaN | 89.070000 | 86.089343 | 99.060000 | 21.671783 | 21.526394 | 21.845457 | 22.586687 | 14.304063 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 52.9 |
| Myanmar | NaN | NaN | 69.814837 | 59.919135 | 92.554521 | 25.992298 | 25.978712 | 26.007711 | 30.841911 | 22.543983 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 38.2 |
| Philippines | 67.3 | NaN | 93.000000 | 90.023167 | 96.400000 | 34.498409 | 38.854164 | 30.019701 | 38.828045 | 18.026352 | ... | 85.0 | 10.9 | 2.2 | 1.3 | 3.4 | 8.6 | 1.3 | 2.2 | 16.5 | 6.1 |
11 rows × 1431 columns
Setelah kita sukses punya 11 row saja langkah berikutnya adalah memilih variabel yang mau kita simpan. Tabel di atas punya 1431 kolom (atau 1431 indikator). Tentu tidak semuanya kita perlukan. Kita bisa pilih beberapa indikator pilihan. Misalnya mari kita pilih GDP (current US$), Official exchange rate (LCU per US$, period average), dan Merchandise trade (% of GDP).
var=['GDP (current US$)','Official exchange rate (LCU per US$, period average)','Merchandise trade (% of GDP)']
lat2=lat2[var]
lat2
| GDP (current US$) | Official exchange rate (LCU per US$, period average) | Merchandise trade (% of GDP) | |
|---|---|---|---|
| Country Name | |||
| Indonesia | 1.015619e+12 | 13380.833879 | 32.072665 |
| China | 1.231041e+13 | 6.758755 | 33.363133 |
| Malaysia | 3.189582e+11 | 4.300441 | 129.655533 |
| Singapore | 3.418633e+11 | 1.380925 | 205.160629 |
| Thailand | 4.562947e+11 | 33.939811 | 100.407477 |
| Vietnam | 2.237799e+11 | 22370.086667 | 191.229447 |
| Brunei Darussalam | 1.212810e+10 | 1.380891 | 71.371415 |
| Lao PDR | 1.685309e+10 | 8244.843189 | 62.540469 |
| Cambodia | 2.217720e+10 | 4050.579986 | 118.653389 |
| Myanmar | 6.894587e+10 | 1360.358707 | 48.055092 |
| Philippines | 3.284807e+11 | 50.403720 | 51.940336 |
Nah sekarang kita punya 11 negara
Menggunakan Pandas Datareader
Pertama, install dulu di anaconda prompt dengan menggunakan kode:
conda install -c anaconda pandas-datareader
Setelah selesai, kita bisa panggil dengan menggunakan impor
import pandas_datareader as pdr
Mari kita coba
Contoh: Mempersiapkan data untuk latihan
Kita akan menggunakan contoh bagaimana saya mempersiapkan data untuk latihan. Data yang saya gunakan adalah World Development Indicators (bulk download) bikinan The World Bank. Tentu saja anda tidak perlu mendownloadnya sendiri karena saya sudah mendownloadnya. muehehe.
Datanya gede banget, sehingga tidak saya upload di website. Namun anda dapat dengan bebas mendownloadnya. Jika anda ingin mempraktikkan kode-kode yang digunakan di contoh ini, anda perlu download sendiri datanya dari link bulk download di atas.
Data utamanya memiliki ukuran 193,5 MB. Kalau dibuka pakai excel bisa memakan waktu lumayan lama. Datanya memiliki ekstensi .csv sehingga kita import dengan pd.read_csv.
df=pd.read_csv('C:/data WDI/WDIData.csv') # mengimpor data dari lokasi saya menyimpan data tersebut
df # print
| Country Name | Country Code | Indicator Name | Indicator Code | 1960 | 1961 | 1962 | 1963 | 1964 | 1965 | ... | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | Unnamed: 64 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Arab World | ARB | Access to clean fuels and technologies for coo... | EG.CFT.ACCS.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 82.783289 | 83.120303 | 83.533457 | 83.897596 | 84.171599 | 84.510171 | NaN | NaN | NaN | NaN |
| 1 | Arab World | ARB | Access to electricity (% of population) | EG.ELC.ACCS.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 87.199474 | 87.512260 | 88.129881 | 87.275323 | 88.720097 | 89.308602 | 90.283638 | 89.286856 | NaN | NaN |
| 2 | Arab World | ARB | Access to electricity, rural (% of rural popul... | EG.ELC.ACCS.RU.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 75.958878 | 77.251714 | 78.165706 | 75.512153 | 78.211000 | 79.065508 | 81.102134 | 79.248100 | NaN | NaN |
| 3 | Arab World | ARB | Access to electricity, urban (% of urban popul... | EG.ELC.ACCS.UR.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 96.466418 | 96.435957 | 96.772853 | 96.466705 | 96.936319 | 97.290083 | 97.467915 | 97.063959 | NaN | NaN |
| 4 | Arab World | ARB | Account ownership at a financial institution o... | FX.OWN.TOTL.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 22.260538 | NaN | NaN | 30.277130 | NaN | NaN | 37.165211 | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 377779 | Zimbabwe | ZWE | Women who believe a husband is justified in be... | SG.VAW.NEGL.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 21.400000 | NaN | NaN | NaN | 21.400000 | NaN | NaN | NaN | NaN | NaN |
| 377780 | Zimbabwe | ZWE | Women who believe a husband is justified in be... | SG.VAW.REFU.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 16.900000 | NaN | NaN | NaN | 14.500000 | NaN | NaN | NaN | NaN | NaN |
| 377781 | Zimbabwe | ZWE | Women who were first married by age 15 (% of w... | SP.M15.2024.FE.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 3.900000 | NaN | NaN | NaN | 3.700000 | NaN | NaN | NaN | NaN | NaN |
| 377782 | Zimbabwe | ZWE | Women who were first married by age 18 (% of w... | SP.M18.2024.FE.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 30.500000 | NaN | NaN | 33.500000 | 32.400000 | NaN | NaN | NaN | NaN | NaN |
| 377783 | Zimbabwe | ZWE | Women's share of population ages 15+ living wi... | SH.DYN.AIDS.FE.ZS | NaN | NaN | NaN | NaN | NaN | NaN | ... | 59.100000 | 59.300000 | 59.500000 | 59.600000 | 59.600000 | 59.700000 | 59.700000 | 59.800000 | NaN | NaN |
377784 rows × 65 columns
Seperti anda lihat, data di atas aslinya besar sekali, mengandung lebih dari 370ribu baris dan 65 kolom. Setiap baris terdiri dari nama negara dan nama indikator, sementara kolomnya menandakan waktu. Bentuk seperti ini disebut juga dengan Panel Data, yaitu sebuah data yang memiliki struktur cross section (negara dan indikator) dan time series (tahun) sekaligus.
Untuk latihan, saya menggunakan data yang sifatnya cross section, yaitu terdiri atas nama negara dan nama indikator saja. Saya akan mengambil tahun 2017. Karena hanya menangkap indikator di satu tahun saja, makanya disebut cross section.
Gunakan pd.info untuk melihat tipe elemen dan ukuran (memori) data.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 377784 entries, 0 to 377783
Data columns (total 65 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 377784 non-null object
1 Country Code 377784 non-null object
2 Indicator Name 377784 non-null object
3 Indicator Code 377784 non-null object
4 1960 37085 non-null float64
5 1961 41471 non-null float64
6 1962 43673 non-null float64
7 1963 43553 non-null float64
8 1964 44098 non-null float64
9 1965 46608 non-null float64
10 1966 46314 non-null float64
11 1967 48032 non-null float64
12 1968 47594 non-null float64
13 1969 49477 non-null float64
14 1970 70456 non-null float64
15 1971 76771 non-null float64
16 1972 79536 non-null float64
17 1973 79135 non-null float64
18 1974 80551 non-null float64
19 1975 84584 non-null float64
20 1976 86656 non-null float64
21 1977 90100 non-null float64
22 1978 89719 non-null float64
23 1979 90588 non-null float64
24 1980 95296 non-null float64
25 1981 97177 non-null float64
26 1982 98454 non-null float64
27 1983 98272 non-null float64
28 1984 99067 non-null float64
29 1985 100337 non-null float64
30 1986 101118 non-null float64
31 1987 102475 non-null float64
32 1988 102052 non-null float64
33 1989 103850 non-null float64
34 1990 127202 non-null float64
35 1991 132640 non-null float64
36 1992 136741 non-null float64
37 1993 137957 non-null float64
38 1994 139967 non-null float64
39 1995 148014 non-null float64
40 1996 147733 non-null float64
41 1997 148909 non-null float64
42 1998 150831 non-null float64
43 1999 156211 non-null float64
44 2000 181781 non-null float64
45 2001 172155 non-null float64
46 2002 176998 non-null float64
47 2003 177612 non-null float64
48 2004 182912 non-null float64
49 2005 195795 non-null float64
50 2006 194394 non-null float64
51 2007 198393 non-null float64
52 2008 197756 non-null float64
53 2009 197733 non-null float64
54 2010 213972 non-null float64
55 2011 204254 non-null float64
56 2012 207054 non-null float64
57 2013 202080 non-null float64
58 2014 208672 non-null float64
59 2015 205706 non-null float64
60 2016 204739 non-null float64
61 2017 187824 non-null float64
62 2018 166196 non-null float64
63 2019 84963 non-null float64
64 Unnamed: 64 0 non-null float64
dtypes: float64(61), object(4)
memory usage: 187.3+ MB
kolom memiliki tipe object sementara nilainya memiliki tipe float64 seperti yang telah diduga. Dengan info, kita juga dapat mengetahui index data kita, jikalau kita butuh memanggil-manggil mereka dengan menggunakan index.
Untuk saat ini, saya akan membuang kolom yang tidak perlu, antara lain 'Country Code', 'Indicator Code', dan semua tahun selain 2017. Sebuah catatan, jika anda sedang melakukan penelitian serius, kode negara dan kode indikator sangatlah penting. Selalu simpan data master yang tidak disentuh sama sekali sebagai back-up!!
Kembali ke laptop, kita drop kolom yang tidak perlu untuk latihan:
df=df.drop(columns=['Country Code','Indicator Code']) # menghapus variabel country code dan indicator code
# menghapus kolom 1960 - 2016
for i in range(1960,2017):
a=str(i)
df=df.drop(columns=a)
# menghapus kolom 2018 dan 2019
df=df.drop(columns=['2018','2019','Unnamed: 64'])
df
| Country Name | Indicator Name | 2017 | |
|---|---|---|---|
| 0 | Arab World | Access to clean fuels and technologies for coo... | NaN |
| 1 | Arab World | Access to electricity (% of population) | 90.283638 |
| 2 | Arab World | Access to electricity, rural (% of rural popul... | 81.102134 |
| 3 | Arab World | Access to electricity, urban (% of urban popul... | 97.467915 |
| 4 | Arab World | Account ownership at a financial institution o... | 37.165211 |
| ... | ... | ... | ... |
| 377779 | Zimbabwe | Women who believe a husband is justified in be... | NaN |
| 377780 | Zimbabwe | Women who believe a husband is justified in be... | NaN |
| 377781 | Zimbabwe | Women who were first married by age 15 (% of w... | NaN |
| 377782 | Zimbabwe | Women who were first married by age 18 (% of w... | NaN |
| 377783 | Zimbabwe | Women's share of population ages 15+ living wi... | 59.700000 |
377784 rows × 3 columns
Kita telah sukses menghapus kolom-kolom yang tidak diperlukan. Data diatas seharusnya memiliki ukuran yang jauh lebih kecil dibandingkan sebelumnya.
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 377784 entries, 0 to 377783
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Country Name 377784 non-null object
1 Indicator Name 377784 non-null object
2 2017 187824 non-null float64
dtypes: float64(1), object(2)
memory usage: 8.6+ MB
Dapat anda lihat di atas, sekarang ukuran data kita tinggal sekitar 8,6 MB saja. Jauh dibandingkan sebelumnya ya?
Nah, sekarang kita telah memiliki data yang sifatnya cross section, tapi cross sectionnya masih menurun ke bawah. Yang saya inginkan adalah data yang bentuknya kira-kira seperti ini:
| nama_negara | indikator_1 | indikator_2 | …. | indikator_n |
|---|---|---|---|---|
| Afghanistan | angka_1 | angka_2 | …. | angka_n |
| Australia | angka_1 | angka_2 | …. | angka_n |
| …. | …. | …. | …. | …. |
| Zimbabwe | angka_1 | angka_2 | …. | angka_n |
Beberapa langkah harus kita lakukan
Pertama, anda dapat lihat bahwa observasi paling atas memiliki nama negara “Arab World”. Tentu saja ini bukan nama sebuah negara, tapi nama sebuah regional. Jika kita panggil semua nama unik dari nama negara, anda akan mendapati bahwa nama-nama teratas adalah nama regional, bukan nama negara.
df.set_index(['Country Name','Indicator Name']) # set index
| 2017 | ||
|---|---|---|
| Country Name | Indicator Name | |
| Arab World | Access to clean fuels and technologies for cooking (% of population) | NaN |
| Access to electricity (% of population) | 90.283638 | |
| Access to electricity, rural (% of rural population) | 81.102134 | |
| Access to electricity, urban (% of urban population) | 97.467915 | |
| Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+) | 37.165211 | |
| ... | ... | ... |
| Zimbabwe | Women who believe a husband is justified in beating his wife when she neglects the children (%) | NaN |
| Women who believe a husband is justified in beating his wife when she refuses sex with him (%) | NaN | |
| Women who were first married by age 15 (% of women ages 20-24) | NaN | |
| Women who were first married by age 18 (% of women ages 20-24) | NaN | |
| Women's share of population ages 15+ living with HIV (%) | 59.700000 |
377784 rows × 1 columns
Sekarang, data kita sudah dibuat sedemikian sehingga index-nya adalah Country Name dan Indicator Name. Namun
df['Country Name'].unique()
array(['Arab World', 'Caribbean small states',
'Central Europe and the Baltics', 'Early-demographic dividend',
'East Asia & Pacific',
'East Asia & Pacific (excluding high income)',
'East Asia & Pacific (IDA & IBRD countries)', 'Euro area',
'Europe & Central Asia',
'Europe & Central Asia (excluding high income)',
'Europe & Central Asia (IDA & IBRD countries)', 'European Union',
'Fragile and conflict affected situations',
'Heavily indebted poor countries (HIPC)', 'High income',
'IBRD only', 'IDA & IBRD total', 'IDA blend', 'IDA only',
'IDA total', 'Late-demographic dividend',
'Latin America & Caribbean',
'Latin America & Caribbean (excluding high income)',
'Latin America & the Caribbean (IDA & IBRD countries)',
'Least developed countries: UN classification',
'Low & middle income', 'Low income', 'Lower middle income',
'Middle East & North Africa',
'Middle East & North Africa (excluding high income)',
'Middle East & North Africa (IDA & IBRD countries)',
'Middle income', 'North America', 'Not classified', 'OECD members',
'Other small states', 'Pacific island small states',
'Post-demographic dividend', 'Pre-demographic dividend',
'Small states', 'South Asia', 'South Asia (IDA & IBRD)',
'Sub-Saharan Africa', 'Sub-Saharan Africa (excluding high income)',
'Sub-Saharan Africa (IDA & IBRD countries)', 'Upper middle income',
'World', 'Afghanistan', 'Albania', 'Algeria', 'American Samoa',
'Andorra', 'Angola', 'Antigua and Barbuda', 'Argentina', 'Armenia',
'Aruba', 'Australia', 'Austria', 'Azerbaijan', 'Bahamas, The',
'Bahrain', 'Bangladesh', 'Barbados', 'Belarus', 'Belgium',
'Belize', 'Benin', 'Bermuda', 'Bhutan', 'Bolivia',
'Bosnia and Herzegovina', 'Botswana', 'Brazil',
'British Virgin Islands', 'Brunei Darussalam', 'Bulgaria',
'Burkina Faso', 'Burundi', 'Cabo Verde', 'Cambodia', 'Cameroon',
'Canada', 'Cayman Islands', 'Central African Republic', 'Chad',
'Channel Islands', 'Chile', 'China', 'Colombia', 'Comoros',
'Congo, Dem. Rep.', 'Congo, Rep.', 'Costa Rica', "Cote d'Ivoire",
'Croatia', 'Cuba', 'Curacao', 'Cyprus', 'Czech Republic',
'Denmark', 'Djibouti', 'Dominica', 'Dominican Republic', 'Ecuador',
'Egypt, Arab Rep.', 'El Salvador', 'Equatorial Guinea', 'Eritrea',
'Estonia', 'Eswatini', 'Ethiopia', 'Faroe Islands', 'Fiji',
'Finland', 'France', 'French Polynesia', 'Gabon', 'Gambia, The',
'Georgia', 'Germany', 'Ghana', 'Gibraltar', 'Greece', 'Greenland',
'Grenada', 'Guam', 'Guatemala', 'Guinea', 'Guinea-Bissau',
'Guyana', 'Haiti', 'Honduras', 'Hong Kong SAR, China', 'Hungary',
'Iceland', 'India', 'Indonesia', 'Iran, Islamic Rep.', 'Iraq',
'Ireland', 'Isle of Man', 'Israel', 'Italy', 'Jamaica', 'Japan',
'Jordan', 'Kazakhstan', 'Kenya', 'Kiribati',
'Korea, Dem. People’s Rep.', 'Korea, Rep.', 'Kosovo', 'Kuwait',
'Kyrgyz Republic', 'Lao PDR', 'Latvia', 'Lebanon', 'Lesotho',
'Liberia', 'Libya', 'Liechtenstein', 'Lithuania', 'Luxembourg',
'Macao SAR, China', 'Madagascar', 'Malawi', 'Malaysia', 'Maldives',
'Mali', 'Malta', 'Marshall Islands', 'Mauritania', 'Mauritius',
'Mexico', 'Micronesia, Fed. Sts.', 'Moldova', 'Monaco', 'Mongolia',
'Montenegro', 'Morocco', 'Mozambique', 'Myanmar', 'Namibia',
'Nauru', 'Nepal', 'Netherlands', 'New Caledonia', 'New Zealand',
'Nicaragua', 'Niger', 'Nigeria', 'North Macedonia',
'Northern Mariana Islands', 'Norway', 'Oman', 'Pakistan', 'Palau',
'Panama', 'Papua New Guinea', 'Paraguay', 'Peru', 'Philippines',
'Poland', 'Portugal', 'Puerto Rico', 'Qatar', 'Romania',
'Russian Federation', 'Rwanda', 'Samoa', 'San Marino',
'Sao Tome and Principe', 'Saudi Arabia', 'Senegal', 'Serbia',
'Seychelles', 'Sierra Leone', 'Singapore',
'Sint Maarten (Dutch part)', 'Slovak Republic', 'Slovenia',
'Solomon Islands', 'Somalia', 'South Africa', 'South Sudan',
'Spain', 'Sri Lanka', 'St. Kitts and Nevis', 'St. Lucia',
'St. Martin (French part)', 'St. Vincent and the Grenadines',
'Sudan', 'Suriname', 'Sweden', 'Switzerland',
'Syrian Arab Republic', 'Tajikistan', 'Tanzania', 'Thailand',
'Timor-Leste', 'Togo', 'Tonga', 'Trinidad and Tobago', 'Tunisia',
'Turkey', 'Turkmenistan', 'Turks and Caicos Islands', 'Tuvalu',
'Uganda', 'Ukraine', 'United Arab Emirates', 'United Kingdom',
'United States', 'Uruguay', 'Uzbekistan', 'Vanuatu',
'Venezuela, RB', 'Vietnam', 'Virgin Islands (U.S.)',
'West Bank and Gaza', 'Yemen, Rep.', 'Zambia', 'Zimbabwe'],
dtype=object)
Nama individu negara baru dimulai setelah “world”, tepat di Afghanistan ke bawah. Untuk itu, semua observasi di atas Afghanistan harus kita buang. Pertama cari tau dulu Afghanistan dimulai di observasi nomor berapa.
df['Country Name'][0:67260] # Dapat dimulai dari 0:1000 lalu angka di belakang ditambahkan jika belum ketemu Afghanistan
# dikurangi jika sudah kelewatan.
0 Arab World
1 Arab World
2 Arab World
3 Arab World
4 Arab World
...
67255 World
67256 World
67257 Afghanistan
67258 Afghanistan
67259 Afghanistan
Name: Country Name, Length: 67260, dtype: object
Dapat dilihat bahwa Afghanistan dimulai di baris ke-67257. Mari kita drop semua baris sebelum 67257.
df=df[67258:377784] # jangan lupa bahwa indexing di python dimulai di 0.
df # ngecek data kita setelah dibuang semua non-negara
| Country Name | Indicator Name | 2017 | |
|---|---|---|---|
| 67258 | Afghanistan | Access to electricity (% of population) | 97.700000 |
| 67259 | Afghanistan | Access to electricity, rural (% of rural popul... | 97.091973 |
| 67260 | Afghanistan | Access to electricity, urban (% of urban popul... | 99.500000 |
| 67261 | Afghanistan | Account ownership at a financial institution o... | 14.893312 |
| 67262 | Afghanistan | Account ownership at a financial institution o... | 7.160685 |
| ... | ... | ... | ... |
| 377779 | Zimbabwe | Women who believe a husband is justified in be... | NaN |
| 377780 | Zimbabwe | Women who believe a husband is justified in be... | NaN |
| 377781 | Zimbabwe | Women who were first married by age 15 (% of w... | NaN |
| 377782 | Zimbabwe | Women who were first married by age 18 (% of w... | NaN |
| 377783 | Zimbabwe | Women's share of population ages 15+ living wi... | 59.700000 |
310526 rows × 3 columns
Sudah tampak cantik yah. Setelah itu, yang perlu kita lakukan adalah tinggal melakukan unstack data berdasarkan indicator name. kita harus set index terlebih dahulu terhadap dua object kita untuk menjadikannya cross section.
df=df.set_index(['Country Name','Indicator Name']) # set index
df
| 2017 | ||
|---|---|---|
| Country Name | Indicator Name | |
| Afghanistan | Access to electricity (% of population) | 97.700000 |
| Access to electricity, rural (% of rural population) | 97.091973 | |
| Access to electricity, urban (% of urban population) | 99.500000 | |
| Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+) | 14.893312 | |
| Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+) | 7.160685 | |
| ... | ... | ... |
| Zimbabwe | Women who believe a husband is justified in beating his wife when she neglects the children (%) | NaN |
| Women who believe a husband is justified in beating his wife when she refuses sex with him (%) | NaN | |
| Women who were first married by age 15 (% of women ages 20-24) | NaN | |
| Women who were first married by age 18 (% of women ages 20-24) | NaN | |
| Women's share of population ages 15+ living with HIV (%) | 59.700000 |
310526 rows × 1 columns
Nah, sekarang susunannya sudah sesuai dengan index yang kita inginkan. Terakhir, yang kita lakukan adalah unstack data berdasarkan nama indikator. Inilah yang agak susah dilakukan pakai sheets. (susah buat saya soalnya saya kurang jago microsoft excel).
latihan=df.unstack(['Indicator Name']) # mindahin nama indikator sebagai kolom
latihan
| 2017 | |||||||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Indicator Name | ARI treatment (% of children under 5 taken to a health provider) | Access to clean fuels and technologies for cooking (% of population) | Access to electricity (% of population) | Access to electricity, rural (% of rural population) | Access to electricity, urban (% of urban population) | Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+) | Account ownership at a financial institution or with a mobile-money-service provider, poorest 40% (% of population ages 15+) | ... | Women participating in the three decisions (own health care, major household purchases, and visiting family) (% of women age 15-49) | Women who believe a husband is justified in beating his wife (any of five reasons) (%) | Women who believe a husband is justified in beating his wife when she argues with him (%) | Women who believe a husband is justified in beating his wife when she burns the food (%) | Women who believe a husband is justified in beating his wife when she goes out without telling him (%) | Women who believe a husband is justified in beating his wife when she neglects the children (%) | Women who believe a husband is justified in beating his wife when she refuses sex with him (%) | Women who were first married by age 15 (% of women ages 20-24) | Women who were first married by age 18 (% of women ages 20-24) | Women's share of population ages 15+ living with HIV (%) |
| Country Name | |||||||||||||||||||||
| Afghanistan | NaN | NaN | 97.700000 | 97.091973 | 99.500000 | 14.893312 | 7.160685 | 22.536495 | 18.016499 | 13.802503 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 29.3 |
| Albania | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | 40.015171 | 38.102959 | 42.043503 | 42.513248 | 22.751028 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Algeria | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | 42.776627 | 29.268730 | 56.254936 | 48.747078 | 34.970802 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 45.9 |
| American Samoa | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Andorra | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Virgin Islands (U.S.) | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| West Bank and Gaza | NaN | NaN | 99.700000 | 99.070331 | 99.900000 | 25.022177 | 15.910773 | 34.436001 | 33.076920 | 11.981574 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Yemen, Rep. | NaN | NaN | 79.200000 | 68.766799 | 97.735039 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 22.8 |
| Zambia | NaN | NaN | 40.300000 | 13.997699 | 75.200000 | 45.863258 | 40.278877 | 51.680923 | 48.338139 | 31.244963 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 59.3 |
| Zimbabwe | NaN | NaN | 40.482048 | 19.152600 | 85.317101 | 55.285137 | 51.664925 | 59.227501 | 58.877064 | 43.601181 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 59.7 |
217 rows × 1431 columns
Masih ada sedikit masalah di data ini, yaitu kita punya 2 headers, yaitu 2017 dan Indicator Name. Masalahnya adalah, kita tidak bisa langsung memanggil nama variabelnya. memanggil:
latihan['Access to electricity (% of population)']
akan menghasilkan eror. Sementara itu, kita bisa panggil 2017, tapi nama kolom lainnya ikutan terpanggil. Coba kode ini:
latihan['2017']
Cara yang terbaik adalah memanggil keduanya, di mana 2017 adalah nama kolom pertama, diikuti nama kolom yang kita inginkan.
latihan['2017']['Access to electricity (% of population)']
Country Name
Afghanistan 97.700000
Albania 100.000000
Algeria 100.000000
American Samoa NaN
Andorra 100.000000
...
Virgin Islands (U.S.) 100.000000
West Bank and Gaza 99.700000
Yemen, Rep. 79.200000
Zambia 40.300000
Zimbabwe 40.482048
Name: Access to electricity (% of population), Length: 217, dtype: float64
Jadi kita perlu membuang row paling atas yang berisi 2017. kita gunakan fungsi droplevel
latihan.columns=latihan.columns.droplevel(0)
latihan
| Indicator Name | ARI treatment (% of children under 5 taken to a health provider) | Access to clean fuels and technologies for cooking (% of population) | Access to electricity (% of population) | Access to electricity, rural (% of rural population) | Access to electricity, urban (% of urban population) | Account ownership at a financial institution or with a mobile-money-service provider (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, female (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, male (% of population ages 15+) | Account ownership at a financial institution or with a mobile-money-service provider, older adults (% of population ages 25+) | Account ownership at a financial institution or with a mobile-money-service provider, poorest 40% (% of population ages 15+) | ... | Women participating in the three decisions (own health care, major household purchases, and visiting family) (% of women age 15-49) | Women who believe a husband is justified in beating his wife (any of five reasons) (%) | Women who believe a husband is justified in beating his wife when she argues with him (%) | Women who believe a husband is justified in beating his wife when she burns the food (%) | Women who believe a husband is justified in beating his wife when she goes out without telling him (%) | Women who believe a husband is justified in beating his wife when she neglects the children (%) | Women who believe a husband is justified in beating his wife when she refuses sex with him (%) | Women who were first married by age 15 (% of women ages 20-24) | Women who were first married by age 18 (% of women ages 20-24) | Women's share of population ages 15+ living with HIV (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Country Name | |||||||||||||||||||||
| Afghanistan | NaN | NaN | 97.700000 | 97.091973 | 99.500000 | 14.893312 | 7.160685 | 22.536495 | 18.016499 | 13.802503 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 29.3 |
| Albania | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | 40.015171 | 38.102959 | 42.043503 | 42.513248 | 22.751028 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Algeria | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | 42.776627 | 29.268730 | 56.254936 | 48.747078 | 34.970802 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 45.9 |
| American Samoa | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Andorra | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| Virgin Islands (U.S.) | NaN | NaN | 100.000000 | 100.000000 | 100.000000 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| West Bank and Gaza | NaN | NaN | 99.700000 | 99.070331 | 99.900000 | 25.022177 | 15.910773 | 34.436001 | 33.076920 | 11.981574 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Yemen, Rep. | NaN | NaN | 79.200000 | 68.766799 | 97.735039 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 22.8 |
| Zambia | NaN | NaN | 40.300000 | 13.997699 | 75.200000 | 45.863258 | 40.278877 | 51.680923 | 48.338139 | 31.244963 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 59.3 |
| Zimbabwe | NaN | NaN | 40.482048 | 19.152600 | 85.317101 | 55.285137 | 51.664925 | 59.227501 | 58.877064 | 43.601181 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 59.7 |
217 rows × 1431 columns
mari kita coba panggil kolom latihan['Access to electricity (% of population)']
latihan['Access to electricity (% of population)']
Country Name
Afghanistan 97.700000
Albania 100.000000
Algeria 100.000000
American Samoa NaN
Andorra 100.000000
...
Virgin Islands (U.S.) 100.000000
West Bank and Gaza 99.700000
Yemen, Rep. 79.200000
Zambia 40.300000
Zimbabwe 40.482048
Name: Access to electricity (% of population), Length: 217, dtype: float64
Yup! Sudah bisa. Tidak eror lagi. Nanti, di proyek kalian, tentu akan males menulis berkali-kali nulis latihan['Access to electricity (% of population)'] karena kepanjangan. Nanti anda bisa menamakan kembali kolom tersebut dengan nama yang lebih pendek dan gampang diambil. Tentu saja untuk proyek anda nanti, anda nggak harus memakai semua kolom yang ada di dataset ini. Itu dipikirkan nanti saja yah!
Terakhir, yang perlu kita lakukan adalah mengexport si latihan ke bentuk sheets supaya enak dilihat. Selalu gunakan ekstensi .csv yah!
latihan.to_csv("latihan.csv")
---------------------------------------------------------------------------
PermissionError Traceback (most recent call last)
<ipython-input-31-50df1f6bddd1> in <module>
----> 1 latihan.to_csv("latihan.csv")
C:\Users\DELL\Anaconda3\lib\site-packages\pandas\core\generic.py in to_csv(self, path_or_buf, sep, na_rep, float_format, columns, header, index, index_label, mode, encoding, compression, quoting, quotechar, line_terminator, chunksize, date_format, doublequote, escapechar, decimal, errors)
3165 decimal=decimal,
3166 )
-> 3167 formatter.save()
3168
3169 if path_or_buf is None:
C:\Users\DELL\Anaconda3\lib\site-packages\pandas\io\formats\csvs.py in save(self)
188 encoding=self.encoding,
189 errors=self.errors,
--> 190 compression=dict(self.compression_args, method=self.compression),
191 )
192 close = True
C:\Users\DELL\Anaconda3\lib\site-packages\pandas\io\common.py in get_handle(path_or_buf, mode, encoding, compression, memory_map, is_text, errors)
491 if encoding:
492 # Encoding
--> 493 f = open(path_or_buf, mode, encoding=encoding, errors=errors, newline="")
494 elif is_text:
495 # No explicit encoding
PermissionError: [Errno 13] Permission denied: 'latihan.csv'
Seperti yang anda lihat, sebenarnya agak ribet ngutik-ngutik data dengan Pandas, apalagi jika anda paham menggunakan pivot table di excel atau google sheet. Salah satu cara terbaik ya tentu saja mengutik-utik datanya di excel atau google sheet, lalu di-save as dengan extensi csv sebelum dibaca di python. Tidak ada masalah. Whatever works for you!