path: root/net/netfilter/ipvs/ip_vs_ctl.c
Commit message (Author, Age)
* ipvs: fix rtnl_lock lockups caused by start_sync_thread (Julian Anastasov, 2018-04-09)
syzkaller reports for wrong rtnl_lock usage in sync code [1] and [2]

We have 2 problems in start_sync_thread if error path is
taken, eg. on memory allocation error or failure to configure
sockets for mcast group or addr/port binding:

1. recursive locking: holding rtnl_lock while calling sock_release
which in turn calls again rtnl_lock in ip_mc_drop_socket to leave
the mcast group, as noticed by Florian Westphal. Additionally,
sock_release can not be called while holding sync_mutex (ABBA
deadlock).

2. task hung: holding rtnl_lock while calling kthread_stop to
stop the running kthreads. As the kthreads do the same to leave
the mcast group (sock_release -> ip_mc_drop_socket -> rtnl_lock)
they hang.

Fix the problems by calling rtnl_unlock early in the error path,
now sock_release is called after unlocking both mutexes.

Problem 3 (task hung reported by syzkaller [2]) is variant of
problem 2: use _trylock to prevent one user to call rtnl_lock and
then while waiting for sync_mutex to block kthreads that execute
sock_release when they are stopped by stop_sync_thread.

[1]
IPVS: stopping backup sync thread 4500 ...
WARNING: possible recursive locking detected
4.16.0-rc7+ #3 Not tainted
--------------------------------------------
syzkaller688027/4497 is trying to acquire lock:
(rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74

but task is already holding lock:
IPVS: stopping backup sync thread 4495 ...
(rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 other info that might help us debug this: Possible unsafe locking scenario: CPU0 ---- lock(rtnl_mutex); lock(rtnl_mutex); *** DEADLOCK *** May be due to missing lock nesting notation 2 locks held by syzkaller688027/4497: #0: (rtnl_mutex){+.+.}, at: [<00000000bb14d7fb>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 #1: (ipvs->sync_mutex){+.+.}, at: [<00000000703f78e3>] do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388 stack backtrace: CPU: 1 PID: 4497 Comm: syzkaller688027 Not tainted 4.16.0-rc7+ #3 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:17 [inline] dump_stack+0x194/0x24d lib/dump_stack.c:53 print_deadlock_bug kernel/locking/lockdep.c:1761 [inline] check_deadlock kernel/locking/lockdep.c:1805 [inline] validate_chain kernel/locking/lockdep.c:2401 [inline] __lock_acquire+0xe8f/0x3e00 kernel/locking/lockdep.c:3431 lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920 __mutex_lock_common kernel/locking/mutex.c:756 [inline] __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893 mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908 rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 ip_mc_drop_socket+0x88/0x230 net/ipv4/igmp.c:2643 inet_release+0x4e/0x1c0 net/ipv4/af_inet.c:413 sock_release+0x8d/0x1e0 net/socket.c:595 start_sync_thread+0x2213/0x2b70 net/netfilter/ipvs/ip_vs_sync.c:1924 do_ip_vs_set_ctl+0x1139/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2389 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1261 udp_setsockopt+0x45/0x80 net/ipv4/udp.c:2406 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:2975 SYSC_setsockopt net/socket.c:1849 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1828 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 
RIP: 0033:0x446a69 RSP: 002b:00007fa1c3a64da8 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000446a69 RDX: 000000000000048b RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00000000006e29fc R08: 0000000000000018 R09: 0000000000000000 R10: 00000000200000c0 R11: 0000000000000246 R12: 00000000006e29f8 R13: 00676e697279656b R14: 00007fa1c3a659c0 R15: 00000000006e2b60 [2] IPVS: sync thread started: state = BACKUP, mcast_ifn = syz_tun, syncid = 4, id = 0 IPVS: stopping backup sync thread 25415 ... INFO: task syz-executor7:25421 blocked for more than 120 seconds. Not tainted 4.16.0-rc6+ #284 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. syz-executor7 D23688 25421 4408 0x00000004 Call Trace: context_switch kernel/sched/core.c:2862 [inline] __schedule+0x8fb/0x1ec0 kernel/sched/core.c:3440 schedule+0xf5/0x430 kernel/sched/core.c:3499 schedule_timeout+0x1a3/0x230 kernel/time/timer.c:1777 do_wait_for_common kernel/sched/completion.c:86 [inline] __wait_for_common kernel/sched/completion.c:107 [inline] wait_for_common kernel/sched/completion.c:118 [inline] wait_for_completion+0x415/0x770 kernel/sched/completion.c:139 kthread_stop+0x14a/0x7a0 kernel/kthread.c:530 stop_sync_thread+0x3d9/0x740 net/netfilter/ipvs/ip_vs_sync.c:1996 do_ip_vs_set_ctl+0x2b1/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2394 nf_sockopt net/netfilter/nf_sockopt.c:106 [inline] nf_setsockopt+0x67/0xc0 net/netfilter/nf_sockopt.c:115 ip_setsockopt+0x97/0xa0 net/ipv4/ip_sockglue.c:1253 sctp_setsockopt+0x2ca/0x63e0 net/sctp/socket.c:4154 sock_common_setsockopt+0x95/0xd0 net/core/sock.c:3039 SYSC_setsockopt net/socket.c:1850 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1829 do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x42/0xb7 RIP: 0033:0x454889 RSP: 002b:00007fc927626c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00007fc9276276d4 RCX: 0000000000454889 RDX: 
000000000000048c RSI: 0000000000000000 RDI: 0000000000000017 RBP: 000000000072bf58 R08: 0000000000000018 R09: 0000000000000000 R10: 0000000020000000 R11: 0000000000000246 R12: 00000000ffffffff R13: 000000000000051c R14: 00000000006f9b40 R15: 0000000000000001 Showing all locks held in the system: 2 locks held by khungtaskd/868: #0: (rcu_read_lock){....}, at: [<00000000a1a8f002>] check_hung_uninterruptible_tasks kernel/hung_task.c:175 [inline] #0: (rcu_read_lock){....}, at: [<00000000a1a8f002>] watchdog+0x1c5/0xd60 kernel/hung_task.c:249 #1: (tasklist_lock){.+.+}, at: [<0000000037c2f8f9>] debug_show_all_locks+0xd3/0x3d0 kernel/locking/lockdep.c:4470 1 lock held by rsyslogd/4247: #0: (&f->f_pos_lock){+.+.}, at: [<000000000d8d6983>] __fdget_pos+0x12b/0x190 fs/file.c:765 2 locks held by getty/4338: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4339: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4340: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4341: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4342: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 
drivers/tty/n_tty.c:2131 2 locks held by getty/4343: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 2 locks held by getty/4344: #0: (&tty->ldisc_sem){++++}, at: [<00000000bee98654>] ldsem_down_read+0x37/0x40 drivers/tty/tty_ldsem.c:365 #1: (&ldata->atomic_read_lock){+.+.}, at: [<00000000c1d180aa>] n_tty_read+0x2ef/0x1a40 drivers/tty/n_tty.c:2131 3 locks held by kworker/0:5/6494: #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] work_static include/linux/workqueue.h:198 [inline] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] set_work_data kernel/workqueue.c:619 [inline] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] set_work_pool_and_clear_pending kernel/workqueue.c:646 [inline] #0: ((wq_completion)"%s"("ipv6_addrconf")){+.+.}, at: [<00000000a062b18e>] process_one_work+0xb12/0x1bb0 kernel/workqueue.c:2084 #1: ((addr_chk_work).work){+.+.}, at: [<00000000278427d5>] process_one_work+0xb89/0x1bb0 kernel/workqueue.c:2088 #2: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 1 lock held by syz-executor7/25421: #0: (ipvs->sync_mutex){+.+.}, at: [<00000000d414a689>] do_ip_vs_set_ctl+0x277/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2393 2 locks held by syz-executor7/25427: #0: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 #1: (ipvs->sync_mutex){+.+.}, at: [<00000000e6d48489>] do_ip_vs_set_ctl+0x10f8/0x1cc0 net/netfilter/ipvs/ip_vs_ctl.c:2388 1 lock held by syz-executor7/25435: #0: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 1 lock held by ipvs-b:2:0/25415: #0: (rtnl_mutex){+.+.}, at: [<00000000066e35ac>] rtnl_lock+0x17/0x20 net/core/rtnetlink.c:74 Reported-and-tested-by: 
syzbot+a46d6abf9d56b1365a72@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+5fe074c01b2032ce9618@syzkaller.appspotmail.com
Fixes: e0b26cc997d5 ("ipvs: call rtnl_lock early")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
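The locking pattern described in the commit message above can be sketched in userspace C. This is only an analogy, not the kernel code: two pthread mutexes stand in for rtnl_mutex and ipvs->sync_mutex, and the helper names (`release_socket`, `start_sync`) and return codes are hypothetical. It shows the two ideas the message names: take the second lock with a trylock so nobody sleeps on it while holding the first, and run the cleanup that needs the first lock only after both locks are dropped.

```c
#include <pthread.h>

/* Stand-ins for rtnl_mutex and ipvs->sync_mutex (userspace analogy only). */
static pthread_mutex_t rtnl = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t sync_mutex = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical cleanup that needs rtnl itself, mirroring the chain
 * sock_release -> ip_mc_drop_socket -> rtnl_lock from the commit message.
 * Returns -1 if rtnl is already held, i.e. where a real lock would hang. */
static int release_socket(void)
{
	if (pthread_mutex_trylock(&rtnl) != 0)
		return -1;
	pthread_mutex_unlock(&rtnl);
	return 0;
}

/* Hypothetical start routine. On failure it drops both locks first and
 * only then runs the cleanup that takes rtnl again. */
static int start_sync(int fail)
{
	int ret = 0;

	pthread_mutex_lock(&rtnl);
	/* Back off instead of sleeping on sync_mutex while holding rtnl. */
	if (pthread_mutex_trylock(&sync_mutex) != 0) {
		pthread_mutex_unlock(&rtnl);
		return -2;
	}
	if (fail)
		ret = -1;	/* simulated setup error */
	pthread_mutex_unlock(&sync_mutex);
	pthread_mutex_unlock(&rtnl);

	/* Error path: no locks are held here, so the cleanup cannot recurse. */
	if (ret < 0 && release_socket() != 0)
		ret = -3;
	return ret;
}
```

Calling `release_socket()` while still holding `rtnl` returns -1 in this sketch; with a blocking lock that is the point where the task would hang.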
* netfilter: delete /proc THIS_MODULE references (Alexey Dobriyan, 2018-01-19)
/proc has been ignoring struct file_operations::owner field for 10 years.
Specifically, it started with commit 786d7e1612f0b0adb6046f19b906609e4fe8b1ba
("Fix rmmod/read/write races in /proc entries"). Notice the chunk where
inode->i_fop is initialized with proxy struct file_operations for
regular files:

    -	if (de->proc_fops)
    -		inode->i_fop = de->proc_fops;
    +	if (de->proc_fops) {
    +		if (S_ISREG(inode->i_mode))
    +			inode->i_fop = &proc_reg_file_ops;
    +		else
    +			inode->i_fop = de->proc_fops;
    +	}

VFS stopped pinning module at this point.

# ipvs
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next (Linus Torvalds, 2017-11-15)
|\
Pull networking updates from David Miller:
"Highlights:

1) Maintain the TCP retransmit queue using an rbtree, with 1GB
   windows at 100Gb this really has become necessary. From Eric
   Dumazet.
2) Multi-program support for cgroup+bpf, from Alexei Starovoitov.
3) Perform broadcast flooding in hardware in mv88e6xxx, from Andrew
   Lunn.
4) Add meter action support to openvswitch, from Andy Zhou.
5) Add a data meta pointer for BPF accessible packets, from Daniel
   Borkmann.
6) Namespace-ify almost all TCP sysctl knobs, from Eric Dumazet.
7) Turn on Broadcom Tags in b53 driver, from Florian Fainelli.
8) More work to move the RTNL mutex down, from Florian Westphal.
9) Add 'bpftool' utility, to help with bpf program introspection.
   From Jakub Kicinski.
10) Add new 'cpumap' type for XDP_REDIRECT action, from Jesper
    Dangaard Brouer.
11) Support 'blocks' of transformations in the packet scheduler which
    can span multiple network devices, from Jiri Pirko.
12) TC flower offload support in cxgb4, from Kumar Sanghvi.
13) Priority based stream scheduler for SCTP, from Marcelo Ricardo
    Leitner.
14) Thunderbolt networking driver, from Amir Levy and Mika Westerberg.
15) Add RED qdisc offloadability, and use it in mlxsw driver. From
    Nogah Frankel.
16) eBPF based device controller for cgroup v2, from Roman Gushchin.
17) Add some fundamental tracepoints for TCP, from Song Liu.
18) Remove garbage collection from ipv6 route layer, this is a
    significant accomplishment. From Wei Wang.
19) Add multicast route offload support to mlxsw, from Yotam Gigi"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2177 commits)
  tcp: highest_sack fix
  geneve: fix fill_info when link down
  bpf: fix lockdep splat
  net: cdc_ncm: GetNtbFormat endian fix
  openvswitch: meter: fix NULL pointer dereference in ovs_meter_cmd_reply_start
  netem: remove unnecessary 64 bit modulus
  netem: use 64 bit divide by rate
  tcp: Namespace-ify sysctl_tcp_default_congestion_control
  net: Protect iterations over net::fib_notifier_ops in fib_seq_sum()
  ipv6: set all.accept_dad to 0 by default
  uapi: fix linux/tls.h userspace compilation error
  usbnet: ipheth: prevent TX queue timeouts when device not ready
  vhost_net: conditionally enable tx polling
  uapi: fix linux/rxrpc.h userspace compilation errors
  net: stmmac: fix LPI transitioning for dwmac4
  atm: horizon: Fix irq release error
  net-sysfs: trigger netlink notification on ifalias change via sysfs
  openvswitch: Using kfree_rcu() to simplify the code
  openvswitch: Make local function ovs_nsh_key_attr_size() static
  openvswitch: Fix return value check in ovs_meter_cmd_features()
  ...
| * netfilter: ipvs: Fix inappropriate output of procfs (KUWAZAWA Takuya, 2017-11-06)
Information about ipvs in different network namespace can be seen via procfs.

How to reproduce:

  # ip netns add ns01
  # ip netns add ns02
  # ip netns exec ns01 ip a add dev lo 127.0.0.1/8
  # ip netns exec ns02 ip a add dev lo 127.0.0.1/8
  # ip netns exec ns01 ipvsadm -A -t 10.1.1.1:80
  # ip netns exec ns02 ipvsadm -A -t 10.1.1.2:80

The ipvsadm displays information about its own network namespace only.

  # ip netns exec ns01 ipvsadm -Ln
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
  TCP 10.1.1.1:80 wlc

  # ip netns exec ns02 ipvsadm -Ln
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
  TCP 10.1.1.2:80 wlc

But I can see information about other network namespace via procfs.

  # ip netns exec ns01 cat /proc/net/ip_vs
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
  TCP 0A010101:0050 wlc
  TCP 0A010102:0050 wlc

  # ip netns exec ns02 cat /proc/net/ip_vs
  IP Virtual Server version 1.2.1 (size=4096)
  Prot LocalAddress:Port Scheduler Flags
    -> RemoteAddress:Port Forward Weight ActiveConn InActConn
  TCP 0A010102:0050 wlc

Signed-off-by: KUWAZAWA Takuya <albatross0@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
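The commit message above shows the symptom but not the fix itself. One plausible shape for such a fix, assuming the service table is shared across namespaces and each entry records its owner, is to skip entries that belong to a different namespace when producing the listing. The sketch below is userspace C with hypothetical stand-in types (`struct net`, `struct svc`) and a hypothetical helper `count_visible`; it is not the kernel's data model.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel's struct net and an
 * IPVS service entry; these are not the real definitions. */
struct net { int id; };
struct svc {
	const struct net *owner;	/* namespace that created the entry */
	unsigned int addr;
	unsigned short port;
};

/* Count the services in a shared table that belong to namespace `ns`.
 * Entries owned by other namespaces are skipped, which is the kind of
 * check a per-namespace /proc listing needs. */
static size_t count_visible(const struct svc *tbl, size_t n,
			    const struct net *ns)
{
	size_t i, seen = 0;

	for (i = 0; i < n; i++) {
		if (tbl[i].owner != ns)
			continue;	/* belongs to another namespace */
		seen++;
	}
	return seen;
}
```

With two entries owned by two different namespaces, each namespace sees exactly one entry, matching what ipvsadm already reports in the reproduction steps.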
| * netfilter: ipvs: Use %pS printk format for direct addresses (Helge Deller, 2017-11-06)
The debug and error printk functions in ipvs wrongly use %pF instead of
the %pS printk format specifier for printing symbols for the address
returned by __builtin_return_address(0). Fix it for the ia64, ppc64 and
parisc64 architectures.

Signed-off-by: Helge Deller <deller@gmx.de>
Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: netdev@vger.kernel.org
Cc: lvs-devel@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | netfilter: ipvs: Convert timers to use timer_setup() (Kees Cook, 2017-11-08)
|/
In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Wensong Zhang <wensong@linux-vs.org>
Cc: Simon Horman <horms@verge.net.au>
Cc: Julian Anastasov <ja@ssi.bg>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: lvs-devel@vger.kernel.org
Cc: netfilter-devel@vger.kernel.org
Cc: coreteam@netfilter.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
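The idea behind passing the timer pointer to the callback can be illustrated in userspace C. This is a sketch under assumptions: `struct timer`, `timer_setup_sketch`, and `struct est_state` are hypothetical stand-ins, and the kernel's struct timer_list, timer_setup() and from_timer() are richer than this. The point it shows is that a callback receiving a pointer to an embedded timer can recover its enclosing object with a container_of-style computation, so no separate opaque data argument is needed.

```c
#include <stddef.h>

/* Userspace stand-in for a timer object that carries its own callback. */
struct timer {
	void (*fn)(struct timer *);
};

/* Recover the enclosing object from a pointer to its embedded member. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical owner of an embedded timer. */
struct est_state {
	int ticks;
	struct timer t;
};

/* The callback gets the timer pointer and derives its owner from it. */
static void est_tick(struct timer *t)
{
	struct est_state *st = container_of(t, struct est_state, t);

	st->ticks++;
}

/* Analog of a setup call: bind the callback to the timer. */
static void timer_setup_sketch(struct timer *t, void (*fn)(struct timer *))
{
	t->fn = fn;
}
```

Firing the timer is then just `st.t.fn(&st.t)`; the callback finds `st` on its own.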
* netfilter: Remove duplicated rcu_read_lock. (Taehee Yoo, 2017-07-24)
This patch removes duplicate rcu_read_lock().

1. IPVS part:

According to Julian Anastasov's mention, contexts of ipvs are described
at: http://marc.info/?l=netfilter-devel&m=149562884514072&w=2, in summary:

- packet RX/TX: does not need locks because packets come from hooks.
- sync msg RX: backup server uses RCU locks while registering new
  connections.
- ip_vs_ctl.c: configuration get/set, RCU locks needed.
- xt_ipvs.c: It is a netfilter match, running from hook context.

As result, rcu_read_lock and rcu_read_unlock can be removed from:

- ip_vs_core.c: all
- ip_vs_ctl.c:
  - only from ip_vs_has_real_service
- ip_vs_ftp.c: all
- ip_vs_proto_sctp.c: all
- ip_vs_proto_tcp.c: all
- ip_vs_proto_udp.c: all
- ip_vs_xmit.c: all (contains only packet processing)

2. Netfilter part:

There are three types of functions that are guaranteed the rcu_read_lock().

First, as result, functions are only called by nf_hook():

- nf_conntrack_broadcast_help(), pptp_expectfn(), set_expected_rtp_rtcp().
- tcpmss_reverse_mtu(), tproxy_laddr4(), tproxy_laddr6().
- match_lookup_rt6(), check_hlist(), hashlimit_mt_common().
- xt_osf_match_packet().

Second, functions that caller already held the rcu_read_lock().

- destroy_conntrack(), ctnetlink_conntrack_event().
- ctnl_timeout_find_get(), nfqnl_nf_hook_drop().

Third, functions that are mixed with type1 and type2.

These functions are called by nf_hook() also these are called by
ordinary functions that already held the rcu_read_lock():

- __ctnetlink_glue_build(), ctnetlink_expect_event().
- ctnetlink_proto_size().

Applied files are below:

- nf_conntrack_broadcast.c, nf_conntrack_core.c, nf_conntrack_netlink.c.
- nf_conntrack_pptp.c, nf_conntrack_sip.c, nfnetlink_cttimeout.c.
- nfnetlink_queue.c, xt_TCPMSS.c, xt_TPROXY.c, xt_addrtype.c.
- xt_connlimit.c, xt_hashlimit.c, xt_osf.c

Detailed calltrace can be found at:
http://marc.info/?l=netfilter-devel&m=149667610710350&w=2

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf (David S. Miller, 2017-05-03)
|\
Pablo Neira Ayuso says:

====================
Netfilter/IPVS/OVS fixes for net

The following patchset contains a rather large batch of Netfilter, IPVS
and OVS fixes for your net tree. This includes fixes for ctnetlink, the
userspace conntrack helper infrastructure, conntrack OVS support,
ebtables DNAT target, several leaks in error path among other. More
specifically, they are:

1) Fix reference count leak in the CT target error path, from Gao Feng.
2) Remove conntrack entry clashing with a matching expectation, patch
   from Jarno Rajahalme.
3) Fix bogus EEXIST when registering two different userspace helpers,
   from Liping Zhang.
4) Don't leak dummy elements in the new bitmap set type in nf_tables,
   from Liping Zhang.
5) Get rid of module autoload from conntrack update path in ctnetlink,
   we don't need autoload at this late stage and it is happening with
   rcu read lock held which is not good. From Liping Zhang.
6) Fix deadlock due to double-acquire of the expect_lock from conntrack
   update path, this fixes a bug that was introduced when the central
   spinlock got removed. Again from Liping Zhang.
7) Safe ct->status update from ctnetlink path, from Liping. The
   expect_lock protection that was selected when the central spinlock
   was removed was not really protecting anything at all.
8) Protect sequence adjustment under ct->lock.
9) Missing socket match with IPv6, from Peter Tirsek.
10) Adjust skb->pkt_type of DNAT'ed frames from ebtables, from
    Linus Luessing.
11) Don't give up on evaluating the expression on new entries added via
    dynset expression in nf_tables, from Liping Zhang.
12) Use skb_checksum() when mangling icmpv6 in IPv6 NAT as this deals
    with non-linear skbuffs.
13) Don't allow IPv6 service in IPVS if no IPv6 support is available,
    from Paolo Abeni.
14) Missing mutex release in error path of xt_find_table_lock(), from
    Dan Carpenter.
15) Update maintainers files, Netfilter section. Add Florian to the
    file, refer to nftables.org and change project status from Supported
    to Maintained.
16) Bail out on mismatching extensions in element updates in nf_tables.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipvs: explicitly forbid ipv6 service/dest creation if ipv6 mod is disabled (Paolo Abeni, 2017-04-28)
When creating a new ipvs service, ipv6 addresses are always accepted
if CONFIG_IP_VS_IPV6 is enabled. On dest creation the address family
is not explicitly checked.

This allows the user-space to configure ipvs services even if the
system is booted with ipv6.disable=1. On specific configuration, ipvs
can try to call ipv6 routing code at setup time, causing the kernel to
oops due to fib6_rules_ops being NULL.

This change addresses the issue adding a check for the ipv6
module being enabled while validating ipv6 service operations and
adding the same validation for dest operations.

According to git history, this issue is apparently present since
the introduction of ipv6 support, and the oops can be triggered
since commit 09571c7ae30865ad ("IPVS: Add function to determine
if IPv6 address is local")

Fixes: 09571c7ae30865ad ("IPVS: Add function to determine if IPv6 address is local")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
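The validation described above can be sketched as a small address-family check. This is a userspace illustration under assumptions: the function name `check_family`, the boolean flag standing in for the kernel's "is the ipv6 module enabled" query, and the choice of EAFNOSUPPORT as the error code are all hypothetical and are not taken from the patch.

```c
#include <errno.h>
#include <stdbool.h>
#include <sys/socket.h>

/* Validate the address family of a service or dest request.
 * `ipv6_enabled` is a hypothetical stand-in for the kernel's check of
 * whether the ipv6 module is enabled (e.g. not booted with
 * ipv6.disable=1). Returns 0 if accepted, a negative errno otherwise. */
static int check_family(int af, bool ipv6_enabled)
{
	switch (af) {
	case AF_INET:
		return 0;
	case AF_INET6:
		/* Reject IPv6 services and dests when IPv6 is disabled,
		 * so no IPv6 routing code is reached later. */
		return ipv6_enabled ? 0 : -EAFNOSUPPORT;
	default:
		return -EAFNOSUPPORT;
	}
}
```

Applying the same check on both the service and the dest paths is what closes the gap the message describes, since the dest path previously had no explicit family check.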
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next (David S. Miller, 2017-05-01)
|\ \
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for your net-next
tree. A large bunch of code cleanups, simplify the conntrack extension
codebase, get rid of the fake conntrack object, speed up netns by
selective synchronize_net() calls. More specifically, they are:

1) Check for ct->status bit instead of using nfct_nat() from IPVS and
   Netfilter codebase, patch from Florian Westphal.
2) Use kcalloc() wherever possible in the IPVS code, from Varsha Rao.
3) Simplify FTP IPVS helper module registration path, from Arushi Singhal.
4) Introduce nft_is_base_chain() helper function.
5) Enforce expectation limit from userspace conntrack helper,
   from Gao Feng.
6) Add nf_ct_remove_expect() helper function, from Gao Feng.
7) NAT mangle helper function return boolean, from Gao Feng.
8) ctnetlink_alloc_expect() should only work for conntrack with
   helpers, from Gao Feng.
9) Add nfnl_msg_type() helper function to nfnetlink to build the
   netlink message type.
10) Get rid of unnecessary cast on void, from simran singhal.
11) Use seq_puts()/seq_putc() instead of seq_printf() where possible,
    also from simran singhal.
12) Use list_prev_entry() from nf_tables, from simran singhal.
13) Remove unnecessary & on pointer function in the Netfilter and IPVS
    code.
14) Remove obsolete comment on set of rules per CPU in ip6_tables,
    no longer true. From Arushi Singhal.
15) Remove duplicated nf_conntrack_l4proto_udplite4, from Gao Feng.
16) Remove unnecessary nested rcu_read_lock() in
    __nf_nat_decode_session(). Code running from hooks are already
    guaranteed to run under RCU read side.
17) Remove deadcode in nf_tables_getobj(), from Aaron Conole.
18) Remove double assignment in nf_ct_l4proto_pernet_unregister_one(),
    also from Aaron.
19) Get rid of unused __ip_set_get_netlink(), from Aaron Conole.
20) Don't propagate NF_DROP error to userspace via ctnetlink in
    __nf_nat_alloc_null_binding() function, from Gao Feng.
21) Revisit nf_ct_deliver_cached_events() to remove unnecessary checks,
    from Gao Feng.
22) Kill the fake untracked conntrack objects, use ctinfo instead to
    annotate a conntrack object is untracked, from Florian Westphal.
23) Remove nf_ct_is_untracked(), now obsolete since we have no
    conntrack template anymore, from Florian.
24) Add event mask support to nft_ct, also from Florian.
25) Move nf_conn_help structure to
    include/net/netfilter/nf_conntrack_helper.h.
26) Add a fixed 32 bytes scratchpad area for conntrack helpers.
    Thus, we don't deal with variable conntrack extensions anymore.
    Make sure userspace conntrack helper doesn't go over that size.
    Remove variable size ct extension infrastructure now this code
    got no more clients. From Florian Westphal.
27) Restore offset and length of nf_ct_ext structure to 8 bytes now
    that wraparound is not possible any longer, also from Florian.
28) Allow to get rid of unassured flows under stress in conntrack,
    this applies to DCCP, SCTP and TCP protocols, from Florian.
29) Shrink size of nf_conntrack_ecache structure, from Florian.
30) Use TCP_MAX_WSCALE instead of hardcoded 14 in TCP tracker,
    from Gao Feng.
31) Register SYNPROXY hooks on demand, from Florian Westphal.
32) Use pernet hook whenever possible, instead of global hook
    registration, from Florian Westphal.
33) Pass hook structure to ebt_register_table() to consolidate some
    infrastructure code, from Florian Westphal.
34) Use consume_skb() and return NF_STOLEN, instead of NF_DROP in the
    SYNPROXY code, to make sure device stats are not fooled, patch
    from Gao Feng.
35) Remove NF_CT_EXT_F_PREALLOC this kills quite some code that we
    don't need anymore if we just select a fixed size instead of
    expensive runtime time calculation of this. From Florian.
36) Constify nf_ct_extend_register() and nf_ct_extend_unregister(),
    from Florian.
37) Simplify nf_ct_ext_add(), this kills nf_ct_ext_create(), from
    Florian.
38) Attach NAT extension on-demand from masquerade and pptp helper
    path, from Florian.
39) Get rid of useless ip_vs_set_state_timeout(), from Aaron Conole.
40) Speed up netns by selective calls of synchronize_net(), from
    Florian Westphal.
41) Silence stack size warning gcc in 32-bit arch in snmp helper,
    from Florian.
42) Unconditionally call nf_ct_ext_destroy(), even if we have no
    extensions, to deal with the NF_NAT_MANIP_SRC case. Patch from
    Liping Zhang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: Remove exceptional & on function name (Arushi Singhal, 2017-04-07)
Remove & from function pointers to conform to the style found elsewhere
in the file. Done using the following semantic patch

// <smpl>
@r@
identifier f;
@@

f(...) { ... }

@@
identifier r.f;
@@

- &f
+ f
// </smpl>

Signed-off-by: Arushi Singhal <arushisinghal19971997@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: Use seq_puts()/seq_putc() where possible (simran singhal, 2017-04-07)
For string without format specifiers, use seq_puts().
For seq_printf("\n"), use seq_putc('\n').

Signed-off-by: simran singhal <singhalsimran0@gmail.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | netlink: pass extended ACK struct where available (Johannes Berg, 2017-04-13)
This is an add-on to the previous patch that passes the extended ACK
structure where it's already available by existing genl_info or extack
function arguments.

This was done with this spatch (with some manual adjustment of
indentation):

@@
expression A, B, C, D, E;
identifier fn, info;
@@
fn(..., struct genl_info *info, ...) {
...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, info->extack)
...
}

@@
expression A, B, C, D, E;
identifier fn, info;
@@
fn(..., struct genl_info *info, ...) {
<...
-nla_parse_nested(A, B, C, D, NULL)
+nla_parse_nested(A, B, C, D, info->extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_parse(A, B, C, D, E, NULL)
+nla_parse(A, B, C, D, E, extack)
...>
}

@@
expression A, B, C, D, E;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
...
-nlmsg_parse(A, B, C, D, E, NULL)
+nlmsg_parse(A, B, C, D, E, extack)
...
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_parse_nested(A, B, C, D, NULL)
+nla_parse_nested(A, B, C, D, extack)
...>
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nlmsg_validate(A, B, C, D, NULL)
+nlmsg_validate(A, B, C, D, extack)
...>
}

@@
expression A, B, C, D;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_validate(A, B, C, D, NULL)
+nla_validate(A, B, C, D, extack)
...>
}

@@
expression A, B, C;
identifier fn, extack;
@@
fn(..., struct netlink_ext_ack *extack, ...) {
<...
-nla_validate_nested(A, B, C, NULL)
+nla_validate_nested(A, B, C, extack)
...>
}

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | | netlink: pass extended ACK struct to parsing functionsJohannes Berg2017-04-13
|/ / | | | | | | | | | | | | | | | | Pass the new extended ACK reporting struct to all of the generic netlink parsing functions. For now, pass NULL in almost all callers (except for some in the core.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netfilter: refcounter conversionsReshetova, Elena2017-03-17
|/ | | | | | | | | | | | | | refcount_t type and corresponding API (see include/linux/refcount.h) should be used instead of atomic_t when the variable is used as a reference counter. This allows to avoid accidental refcounter overflows that might lead to use-after-free situations. Signed-off-by: Elena Reshetova <elena.reshetova@intel.com> Signed-off-by: Hans Liljestrand <ishkamiel@gmail.com> Signed-off-by: Kees Cook <keescook@chromium.org> Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
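The overflow protection mentioned in this commit message can be illustrated with a small userspace sketch. This is not the kernel's refcount_t implementation; the `sketch_ref` type and its helpers are hypothetical names, and the policy shown (refuse to increment from zero, pin the counter once it reaches `UINT_MAX`) is an assumption chosen to show why a saturating counter leaks an object instead of freeing it while it is still in use.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for a reference counter.
 * This is NOT the kernel's refcount_t, only an illustration. */
struct sketch_ref {
	atomic_uint count;
};

static void sketch_ref_set(struct sketch_ref *r, unsigned int n)
{
	atomic_store(&r->count, n);
}

static unsigned int sketch_ref_read(struct sketch_ref *r)
{
	return atomic_load(&r->count);
}

/* Take a reference. Fails if the object is already dead (count == 0)
 * or if the counter is saturated, so it can never wrap around to 0. */
static bool sketch_ref_inc(struct sketch_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == UINT_MAX)
			return false;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));

	return true;
}

/* Drop a reference. Returns true only for the caller that dropped the
 * last one; a saturated or already-zero counter is left untouched. */
static bool sketch_ref_dec_and_test(struct sketch_ref *r)
{
	unsigned int old = atomic_load(&r->count);

	do {
		if (old == 0 || old == UINT_MAX)
			return false;
	} while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));

	return old == 1;
}
```

With a plain wrapping counter, an overflow would bring the count back to a small value and a later put could free an object that still has users. Here the counter stays pinned at its maximum, trading a possible leak for the absence of a use-after-free.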
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2017-02-03
|\ Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next tree, they are:

1) Stash ctinfo 3-bit field into pointer to nf_conntrack object from sk_buff so we only access one single cacheline in the conntrack hotpath. Patchset from Florian Westphal.

2) Don't leak pointer to internal structures when exporting x_tables ruleset back to userspace, from Willem DeBruijn. This includes new helper functions to copy data to userspace such as xt_data_to_user() as well as conversions of our ip_tables, ip6_tables and arp_tables clients to use it. Not surprisingly, ebtables requires an ad-hoc update. There is also a new field in x_tables extensions to indicate the amount of bytes that we copy to userspace.

3) Add nf_log_all_netns sysctl: This new knob allows you to enable logging via nf_log infrastructure for all existing netnamespaces. Given the effort to provide pernet syslog has been discontinued, let's provide a way to restore logging using netfilter kernel logging facilities in trusted environments. Patch from Michal Kubecek.

4) Validate SCTP checksum from conntrack helper, from Davide Caratti.

5) Merge UDPlite conntrack and NAT helpers into UDP, this was mostly a copy&paste from the original helper, from Florian Westphal.

6) Reset netfilter state when duplicating packets, also from Florian.

7) Remove unnecessary check for broadcast in IPv6 in pkttype match and nft_meta, from Liping Zhang.

8) Add missing code to deal with loopback packets from nft_meta when used by the netdev family, also from Liping.

9) Several cleanups on nf_tables, one to remove unnecessary check from the netlink control plane path to add table, set and stateful objects and code consolidation when unregister chain hooks, from Gao Feng.

10) Fix harmless reference counter underflow in IPVS that, however, results in problems with the introduction of the new refcount_t type, from David Windsor.

11) Enable LIBCRC32C from nf_ct_sctp instead of nf_nat_sctp, from Davide Caratti.

12) Missing documentation on nf_tables uapi header, from Liping Zhang.

13) Use rb_entry() helper in xt_connlimit, from Geliang Tang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipvs: free ip_vs_dest structs when refcnt=0David Windsor2017-02-02
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the ip_vs_dest cache frees ip_vs_dest objects when their reference count becomes < 0. Aside from not being semantically sound, this is problematic for the new type refcount_t, which will be introduced shortly in a separate patch. refcount_t is the new kernel type for holding reference counts, and provides overflow protection and a constrained interface relative to atomic_t (the type currently being used for kernel reference counts). Per Julian Anastasov: "The problem is that dest_trash currently holds deleted dests (unlinked from RCU lists) with refcnt=0." Changing dest_trash to hold dest with refcnt=1 will allow us to free ip_vs_dest structs when their refcnt=0, in ip_vs_dest_put_and_free(). Signed-off-by: David Windsor <dwindsor@gmail.com> Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | Introduce a sysctl that modifies the value of PROT_SOCK.Krister Johansen2017-01-24
|/ Add net.ipv4.ip_unprivileged_port_start, which is a per namespace sysctl that denotes the first unprivileged inet port in the namespace. To disable all privileged ports set this to zero. It also checks for overlap with the local port range. The privileged and local range may not overlap.

The use case for this change is to allow containerized processes to bind to privileged ports, but prevent them from ever being allowed to modify their container's network configuration. The latter is accomplished by ensuring that the network namespace is not a child of the user namespace. This modification was needed to allow the container manager to disable a namespace's privileged port restrictions without exposing control of the network namespace to processes in the user namespace.

Signed-off-by: Krister Johansen <kjlx@templeofstupid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
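The overlap rule described in this commit message can be sketched in a few lines of userspace C. This is an illustration under stated assumptions, not the kernel's sysctl handler: the function names are hypothetical, the privileged range is taken to be ports 0 through `unpriv_start - 1`, and the local port range is taken to be the inclusive interval `[local_lo, local_hi]`.

```c
#include <stdbool.h>

/* Hypothetical check: may unpriv_start be accepted as the first
 * unprivileged port, given the local (ephemeral) port range?
 *
 * Assumptions for this sketch:
 *   - privileged ports are 0 .. unpriv_start - 1
 *   - the local port range is local_lo .. local_hi, inclusive
 *   - the two ranges must not share any port
 */
static bool sketch_unpriv_start_valid(int unpriv_start,
				      int local_lo, int local_hi)
{
	if (unpriv_start < 0 || unpriv_start > 65535)
		return false;
	if (local_lo > local_hi)
		return false;

	/* The ranges are disjoint exactly when the lowest local port is
	 * not below the first unprivileged port. */
	return local_lo >= unpriv_start;
}

/* With unpriv_start == 0 no port is privileged at all. */
static bool sketch_port_is_privileged(int port, int unpriv_start)
{
	return port < unpriv_start;
}
```

For example, with a local range of 32768 to 60999, a start value of 1024 passes the check while 40000 is rejected, and a start value of 0 makes port 80 unprivileged.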
* Replace <asm/uaccess.h> with <linux/uaccess.h> globallyLinus Torvalds2016-12-24
| This was entirely automated, using the script by Al:

  PATT='^[[:blank:]]*#[[:blank:]]*include[[:blank:]]*<asm/uaccess.h>'
  sed -i -e "s!$PATT!#include <linux/uaccess.h>!" \
        $(git grep -l "$PATT"|grep -v ^include/linux/uaccess.h)

to do the replacement at the end of the merge window.

Requested-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Merge tag 'ipvs-for-v4.10' of ↵Pablo Neira Ayuso2016-12-04
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next Simon Horman says: ==================== IPVS Updates for v4.10 please consider these enhancements to the IPVS for v4.10. * Decrement the IP ttl in all the modes in order to prevent infinite route loops. Thanks to Dwip Banerjee. * Use IS_ERR_OR_NULL macro. Clean-up from Gao Feng. ==================== Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * ipvs: Use IS_ERR_OR_NULL(svc) instead of IS_ERR(svc) || svc == NULLGao Feng2016-11-15
| | | | | | | | | | | | | | | | | | This minor refactoring does not change the logic of function ip_vs_genl_dump_dests. Signed-off-by: Gao Feng <fgao@ikuai8.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
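The combined check that this cleanup switches to can be illustrated in userspace. The sketch below is not the kernel header; the `sketch_` names are hypothetical, and the encoding (small negative error codes stored in the topmost 4095 pointer values) is an assumption that mirrors the general error-pointer idea, used here only to show that "is an error pointer or is NULL" is a single predicate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed upper bound on error codes for this sketch. */
#define SKETCH_MAX_ERRNO 4095

/* Encode a negative error code as a pointer value. */
static inline void *sketch_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

/* True when p lies in the range reserved for encoded errors. */
static inline bool sketch_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-SKETCH_MAX_ERRNO;
}

/* Equivalent to: sketch_is_err(p) || p == NULL */
static inline bool sketch_is_err_or_null(const void *p)
{
	return !p || sketch_is_err(p);
}
```

A caller that previously wrote the two conditions separately gets the same result from the single helper, which is why the commit can describe the change as not altering the logic.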
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-11-15
|\ \ | |/ |/| | | | | | | | | Several cases of bug fixes in 'net' overlapping other changes in 'net-next'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipvs: use IPVS_CMD_ATTR_MAX for family.maxattrWANG Cong2016-11-08
| | | | | | | | | | | | | | | | | | | | | | | | family.maxattr is the max index for policy[], the size of ops[] is determined with ARRAY_SIZE(). Reported-by: Andrey Konovalov <andreyknvl@google.com> Tested-by: Andrey Konovalov <andreyknvl@google.com> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Simon Horman <horms@verge.net.au> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
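The distinction this commit message draws, a maximum index for the policy table versus an element count for the ops table, can be shown with a small userspace sketch. All names below are hypothetical and the arrays hold placeholder integers rather than real netlink policy or operation entries.

```c
#include <stddef.h>

/* Element count of a true array (not a pointer). */
#define SKETCH_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Hypothetical attribute numbering: the last enumerator is a sentinel,
 * so the highest valid attribute index is one less than it. */
enum {
	SKETCH_ATTR_UNSPEC,
	SKETCH_ATTR_SERVICE,
	SKETCH_ATTR_DEST,
	SKETCH_ATTR_SENTINEL
};
#define SKETCH_ATTR_MAX (SKETCH_ATTR_SENTINEL - 1)

/* Indexed by attribute number, so it needs MAX + 1 slots and "maxattr"
 * is the highest usable index into it. */
static const int sketch_policy[SKETCH_ATTR_MAX + 1] = { 0, 1, 1 };

/* Unrelated in size to the attribute enum: its length is whatever
 * SKETCH_ARRAY_SIZE() reports. */
static const int sketch_ops[] = { 10, 11, 12, 13, 14 };

static size_t sketch_n_ops(void)
{
	return SKETCH_ARRAY_SIZE(sketch_ops);
}
```

Using the number of operations as the maximum attribute index would let a parser index past the end of the policy table whenever there are more operations than attributes, which is the kind of mismatch the two separate quantities avoid.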
* | genetlink: mark families as __ro_after_initJohannes Berg2016-10-27
| | Now genl_register_family() is the only thing (other than the users themselves, perhaps, but I didn't find any doing that) writing to the family struct.

In all families that I found, genl_register_family() is only called from __init functions (some indirectly, in which case I've added __init annotations to clarify things), so all can actually be marked __ro_after_init.

This protects the data structure from accidental corruption.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
* | genetlink: statically initialize familiesJohannes Berg2016-10-27
| | | | | | | | | | | | | | | | | | | | | | | | Instead of providing macros/inline functions to initialize the families, make all users initialize them statically and get rid of the macros. This reduces the kernel code size by about 1.6k on x86-64 (with allyesconfig). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | genetlink: no longer support using static family IDsJohannes Berg2016-10-27
|/ | | | | | | | | | | | | | | | | | | | | | | Static family IDs have never really been used, the only use case was the workaround I introduced for those users that assumed their family ID was also their multicast group ID. Additionally, because static family IDs would never be reserved by the generic netlink code, using a relatively low ID would only work for built-in families that can be registered immediately after generic netlink is started, which is basically only the control family (apart from the workaround code, which I also had to add code for so it would reserve those IDs) Thus, anything other than GENL_ID_GENERATE is flawed and luckily not used except in the cases I mentioned. Move those workarounds into a few lines of code, and then get rid of GENL_ID_GENERATE entirely, making it more robust. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2016-05-09
|\ Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following large patchset contains Netfilter updates for your net-next tree. My initial intention was to send you this in two goes but when I looked back twice I already had this burden on top of me.

Several updates for IPVS from Marco Angaroni:

1) Allow SIP connections originating from real-servers to be load balanced by the SIP persistence engine as is already implemented in the other direction.

2) Release connections immediately for One-packet-scheduling (OPS) in IPVS, instead of making it via timer and rcu callback.

3) Skip deleting conntracks for each one packet in OPS, and don't call nf_conntrack_alter_reply() since no reply is expected.

4) Enable drop on exhaustion for OPS + SIP persistence.

Miscellaneous conntrack updates from Florian Westphal, including fix for hash resize:

5) Move conntrack generation counter out of conntrack pernet structure since this is only used by the init_ns to allow hash resizing.

6) Use get_random_once() from packet path to collect hash random seed instead of our compound.

7) Don't disable BH from ____nf_conntrack_find() for statistics, use NF_CT_STAT_INC_ATOMIC() instead.

8) Fix lookup race during conntrack hash resizing.

9) Introduce clash resolution on conntrack insertion for connectionless protocol.

Then, Florian's netns rework to get rid of per-netns conntrack table, thus we use one single table for them all. There was consensus on this change during the NFWS 2015 and, on top of that, it has recently been pointed out as a source of multiple problems from unprivileged netns:

11) Use a single conntrack hashtable for all namespaces. Include netns in object comparisons and make it part of the hash calculation. Adapt early_drop() to consider netns.

12) Use single expectation and NAT hashtable for all namespaces.

13) Use a single slab cache for all namespaces for conntrack objects.

14) Skip full table scanning from nf_ct_iterate_cleanup() if the pernet conntrack counter tells us the table is empty (ie. equals zero).

Fixes for nf_tables interval set element handling, support to set conntrack connlabels and allow set names up to 32 bytes.

15) Parse element flags from element deletion path and pass it up to the backend set implementation.

16) Allow adjacent intervals in the rbtree set type for dynamic interval updates.

17) Add support to set connlabel from nf_tables, from Florian Westphal.

18) Allow set names up to 32 bytes in nf_tables.

Several x_tables fixes and updates:

19) Fix incorrect use of IS_ERR_VALUE() in x_tables, original patch from Andrzej Hajda.

And finally, miscellaneous netfilter updates such as:

20) Disable automatic helper assignment by default. Note this proc knob was introduced by a9006892643a ("netfilter: nf_ct_helper: allow to disable automatic helper assignment") 4 years ago to start moving towards explicit conntrack helper configuration via iptables CT target.

21) Get rid of obsolete and inconsistent debugging instrumentation in x_tables.

22) Remove unnecessary check for null after ip6_route_output().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipvs: handle connections started by real-serversMarco Angaroni2016-04-20
| | When using LVS-NAT and the SIP persistence engine over UDP, the following limitations are present with the current implementation:

1) To actually have load-balancing based on Call-ID header, you need to use one-packet-scheduling mode. But with one-packet-scheduling the connection is deleted just after the packet is forwarded, so SIP responses coming from real-servers do not match any connection and SNAT is not applied.

2) If you do not use the "-o" option, IPVS behaves as a normal UDP load balancer, so different SIP calls (each one identified by a different Call-ID) coming from the same ip-address/port go to the same real-server. So basically you don't have load-balancing based on Call-ID as intended.

3) Call-ID is not learned when a new SIP call is started by a real-server (inside-to-outside direction), but only in the outside-to-inside direction. This would be a general problem for all SIP servers acting as Back2BackUserAgent.

This patch aims to solve problems 1) and 3) while keeping OPS mode mandatory for SIP-UDP, so that 2) is not a problem anymore.

The basic mechanism implemented is to make packets that do not match any existing connection but come from real-servers create new connections instead of letting them pass without any effect. When such packets pass through ip_vs_out(), if their source ip address and source port match a configured real-server, a new connection is automatically created in the same way as it would have happened if the packet had come from the outside-to-inside direction. A new connection template is created too if the virtual-service is persistent and there is no matching connection template found. The new connection automatically created, if the service had the "-o" option, is an OPS connection that lasts only the time to forward the packet, just like it happens on the ingress side.

The main part of this mechanism is implemented inside a persistent-engine specific callback (at the moment only the SIP persistent engine exists) and is triggered only for UDP packets, since connection oriented protocols, by using a different set of ports (typically ephemeral ports) to open new outgoing connections, should not need this feature.

The following requisites are needed for automatic connection creation; if any is missing the packet simply goes the same way as before.

a) virtual-service is not fwmark based (this is because fwmark services do not store address and port of the virtual-service, required to build the connection data).

b) virtual-service and real-servers must not have been configured with omitted port (this is again to have all data to create the connection).

Signed-off-by: Marco Angaroni <marcoangaroni@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
* | netfilter/ipvs: use nla_put_u64_64bit()Nicolas Dichtel2016-04-25
|/ | | | | Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netfilter: ipvs: avoid unused variable warningsArnd Bergmann2016-02-18
| | | | | | | | | | | | | | | | | | The proc_create() and remove_proc_entry() functions do not reference their arguments when CONFIG_PROC_FS is disabled, so we get a couple of warnings about unused variables in IPVS: ipvs/ip_vs_app.c:608:14: warning: unused variable 'net' [-Wunused-variable] ipvs/ip_vs_ctl.c:3950:14: warning: unused variable 'net' [-Wunused-variable] ipvs/ip_vs_ctl.c:3994:14: warning: unused variable 'net' [-Wunused-variable] This removes the local variables and instead looks them up separately for each use, which obviously avoids the warning. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Fixes: 4c50a8ce2b63 ("netfilter: ipvs: avoid unused variable warning") Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* netfilter: ipvs: Remove noisy debug print from ip_vs_del_serviceYannick Brosseau2016-02-18
| This has been there for a long time, but does not seem to add value.

Signed-off-by: Yannick Brosseau <scientist@fb.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Remove skb_sknetEric W. Biederman2015-09-24
| | | | | | | | This function adds no real value and it obscures what the code is doing. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net into ip_vs_control_net_(init|cleanup)Eric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_control_net_(init|cleanup)_sysctlEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_random_drop_entryEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_start_estimator aned ip_vs_stop_estimatorEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_genl_set_configEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to stop_sync_threadEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to start_sync_threadEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_genl_del_daemonEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_genl_new_daemonEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_genl_find_serviceEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_genl_parse_serviceEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to __ip_vs_get_timeoutsEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to __ip_vs_get_dest_entriesEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to __ip_vs_get_service_entriesEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_set_timeoutEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_proto_data_getEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_zero_allEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>
* ipvs: Pass ipvs not net to ip_vs_service_net_cleanupEric W. Biederman2015-09-24
| | | | | | Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Acked-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Simon Horman <horms@verge.net.au>